Many posters have the "HIGHLIGHTED THEME (OPTIONAL)" column filled in (111 posters; the breakdown is shown below), while 59 posters leave it blank. We use LLM-generated embeddings of each poster (title + abstract) as features, train a k-nearest-neighbors classifier on the posters that have a theme, and predict a theme for the posters whose theme column is blank.

In [1]:
import pandas as pd
import numpy as np
import ast
from tqdm import tqdm



In [2]:
df = pd.read_csv('accepted_poster.csv')

In [5]:
df['HIGHLIGHTED THEME (OPTIONAL)'].value_counts()

HIGHLIGHTED THEME (OPTIONAL)
Harnessing the Power of Large Language Models in Health Data Science                                            29
Real-World Evidence in Informatics: Bridging the Gap between Research and Practice                              26
Implementation Science and Deployment in Informatics: From Theory to Practice                                   15
Telehealth, Wearable Devices, and Patient-Generated Health Data: The New Frontiers of Informatics               14
Integrating Multi-Modal health Data to Enhance the Power of Informatics                                         13
Fairness and Disparity in Health and Biomedical Informatics: Addressing Inequities through Innovation            8
Proactive Machine Learning in Biomedical Applications: The Power of Generative AI and Reinforcement Learning     4
Citizen Science and Democratizing AI and Informatics for Healthcare                                              2
Name: count, dtype: int64

In [9]:
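# Copy the slices so that adding columns later does not trigger SettingWithCopyWarning
themed = df[df['HIGHLIGHTED THEME (OPTIONAL)'].notna()].copy()
unthemed = df[df['HIGHLIGHTED THEME (OPTIONAL)'].isna()].copy()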

In [11]:
# Put a newline before 'ABSTRACT: ' so the title does not run straight into the label
themed_text = ('TITLE: ' + themed.submission_TITLE + '\nABSTRACT: ' + themed.submission_ABSTRACT).tolist()
unthemed_text = ('TITLE: ' + unthemed.submission_TITLE + '\nABSTRACT: ' + unthemed.submission_ABSTRACT).tolist()
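
If a title or abstract is missing, pandas string concatenation yields NaN for that row, and a non-string value would then reach the embedding step. The following is a minimal sketch of a guard, assuming missing fields should be treated as empty strings. `build_poster_text` is a hypothetical helper, not part of the original notebook.

```python
import math


def build_poster_text(title, abstract):
    """Join title and abstract into one string, treating None/NaN as empty.

    Hypothetical helper: the empty-string fallback is an assumption.
    """
    def clean(value):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return ""
        return str(value).strip()

    return f"TITLE: {clean(title)}\nABSTRACT: {clean(abstract)}"


# Toy example with a missing abstract
demo_text = build_poster_text("A poster", float("nan"))
```

It could be applied row by row, for example `[build_poster_text(t, a) for t, a in zip(themed.submission_TITLE, themed.submission_ABSTRACT)]`.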

In [14]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [18]:
themed.loc[:,'label'] = label_encoder.fit_transform(themed['HIGHLIGHTED THEME (OPTIONAL)'].tolist())

# GPT

In [10]:
import requests

url = "https://api.openai.com/v1/embeddings"
access_token = "sk-xxx"  # replace with your own OpenAI API key
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {access_token}'
}


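def get_GPT_embedding(txt):
    """Return the embedding vector for one text from the OpenAI embeddings endpoint."""
    response = requests.post(url, headers=headers, json={
        "input": txt,
        "model": "text-embedding-ada-002"}, timeout=60)
    response.raise_for_status()  # fail loudly on HTTP errors (bad key, rate limit)
    return response.json()['data'][0]['embedding']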

In [12]:
themed_gpt_embed = [get_GPT_embedding(t) for t in tqdm(themed_text)]
unthemed_gpt_embed = [get_GPT_embedding(t) for t in tqdm(unthemed_text)]

100%|█████████████████████████████████████████| 111/111 [00:32<00:00,  3.38it/s]
100%|███████████████████████████████████████████| 59/59 [00:18<00:00,  3.17it/s]
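
The loop above issues one HTTP request per poster. The embeddings endpoint also accepts a list of strings as `input`, so the same vectors could be fetched in fewer round trips. The sketch below shows one way to do that. The function name `embed_in_batches`, the `batch_size` of 64, and the injectable `post` argument (used only so the sketch can be exercised without network access) are all assumptions, not part of the original notebook.

```python
def embed_in_batches(texts, api_key, batch_size=64, post=None):
    """Embed texts with one HTTP request per batch instead of one per text.

    Hypothetical helper; batch_size=64 is an arbitrary illustrative choice.
    """
    if post is None:
        import requests  # imported lazily so a stub can replace it in tests
        post = requests.post

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = post(
            "https://api.openai.com/v1/embeddings",
            headers=headers,
            json={"input": batch, "model": "text-embedding-ada-002"},
            timeout=60,
        )
        response.raise_for_status()
        # Each returned item carries the index of its input; sort to keep order.
        items = sorted(response.json()["data"], key=lambda item: item["index"])
        embeddings.extend(item["embedding"] for item in items)
    return embeddings
```

With this helper, `themed_gpt_embed = embed_in_batches(themed_text, access_token)` would replace the list comprehension, at the cost of losing the per-item progress bar.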


In [30]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(themed_gpt_embed, themed.label.tolist())
unthemed.loc[:,'label'] = knn.predict(unthemed_gpt_embed)
unthemed.loc[:,'GPT_theme'] = label_encoder.inverse_transform(unthemed.loc[:,'label'])
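
The notebook predicts themes for unlabeled posters but does not measure how well the classifier recovers known themes. One possible sanity check is leave-one-out accuracy on the labeled posters, which also copes with themes that have only a handful of examples. The sketch below assumes scikit-learn's `cross_val_predict` with `LeaveOneOut`. `loo_accuracy` is a hypothetical helper, and the data are synthetic, not the poster embeddings.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def loo_accuracy(embeddings, labels, n_neighbors=5):
    """Leave-one-out accuracy of a k-NN classifier on labeled embeddings.

    Hypothetical helper; n_neighbors=5 matches KNeighborsClassifier's default.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    # Each sample is predicted by a model fit on all the other samples.
    preds = cross_val_predict(knn, X, y, cv=LeaveOneOut())
    return float((preds == y).mean())


# Synthetic stand-in for embeddings: two well-separated clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
y_demo = np.array([0] * 20 + [1] * 20)
demo_acc = loo_accuracy(X_demo, y_demo)
```

On the real data the call would be `loo_accuracy(themed_gpt_embed, themed.label.tolist())`. With only 2 to 4 examples for the rarest themes, a 5-neighbor vote is unlikely to predict those themes, so the score mostly reflects the larger ones.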

# Llama

In [32]:
import os

# Set these before importing transformers so that they take effect
os.environ['TRANSFORMERS_CACHE'] = '/data/HF/'
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

peft_model_id = 'SeanLee97/angle-llama-7b-nli-v2'
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in bfloat16, then attach the PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, cache_dir="/data/HF/")
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, cache_dir="/data/HF/").bfloat16().cuda()
model = PeftModel.from_pretrained(model, peft_model_id, cache_dir="/data/HF/").cuda()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [35]:
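import torch

def get_llama_embedding(txt):
    """Embed one text as the final-layer hidden state of its last token."""
    tok = tokenizer([txt], return_tensors='pt')
    tok = {k: v.cuda() for k, v in tok.items()}
    with torch.no_grad():  # inference only, so skip building the autograd graph
        out = model(output_hidden_states=True, **tok)
    return out.hidden_states[-1][:, -1].float().cpu().numpy()[0]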
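
`get_llama_embedding` takes the final-layer hidden state at the last position. That position is the last real token because each text is tokenized on its own, without padding. If texts were batched with padding, the last position could be a pad token instead. The NumPy sketch below picks the last real token per row, assuming right padding and a 0/1 attention mask. `last_token_pool` is a hypothetical helper, and the arrays are toy values.

```python
import numpy as np


def last_token_pool(hidden, attention_mask):
    """Select each row's hidden state at its last non-padded position.

    hidden: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens,
    0 for padding. Assumes right padding (an assumption of this sketch).
    """
    hidden = np.asarray(hidden)
    mask = np.asarray(attention_mask)
    last_idx = mask.sum(axis=1) - 1  # index of the last real token per row
    return hidden[np.arange(hidden.shape[0]), last_idx]


# Toy batch: 2 sequences, 3 positions, 2 hidden dims; second row has one pad
toy_hidden = np.arange(12).reshape(2, 3, 2)
toy_mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = last_token_pool(toy_hidden, toy_mask)
```

For a single unpadded sequence this reduces to the `[:, -1]` indexing used in the notebook.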


In [37]:
themed_llama_embed = [get_llama_embedding(t) for t in tqdm(themed_text)]
unthemed_llama_embed = [get_llama_embedding(t) for t in tqdm(unthemed_text)]

100%|█████████████████████████████████████████| 111/111 [00:41<00:00,  2.71it/s]
100%|███████████████████████████████████████████| 59/59 [00:21<00:00,  2.73it/s]


In [39]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(themed_llama_embed, themed.label.tolist())
unthemed.loc[:,'label'] = knn.predict(unthemed_llama_embed)
unthemed.loc[:,'llama_theme'] = label_encoder.inverse_transform(unthemed.loc[:,'label'])

In [43]:
# Join on row_id explicitly; posters that already had a theme get NaN in both columns
df = df.merge(unthemed[['row_id', 'GPT_theme', 'llama_theme']], on='row_id', how='left')
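
After the merge, posters that already had a theme carry NaN in `GPT_theme` and `llama_theme`, while the rest carry one predicted theme per embedding model. One simple way to compare the two models is the fraction of posters on which their predictions match. The sketch below uses a hypothetical helper, `theme_agreement`, on toy data rather than the real predictions.

```python
import pandas as pd


def theme_agreement(frame, col_a="GPT_theme", col_b="llama_theme"):
    """Fraction of rows with both predictions present where the themes match.

    Hypothetical helper; returns NaN when no row has both predictions.
    """
    both = frame.dropna(subset=[col_a, col_b])
    if both.empty:
        return float("nan")
    return float((both[col_a] == both[col_b]).mean())


# Toy frame: row 3 mimics a poster that already had a theme (no predictions)
demo = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "GPT_theme": ["A", "B", None, "C"],
    "llama_theme": ["A", "C", None, "C"],
})
demo_rate = theme_agreement(demo)
```

On the notebook's data the call would be `theme_agreement(df)`. Rows where the two models disagree are natural candidates for manual review.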