[This notebook shows an example using toy GO terms]

Rank all GO biological process terms by their semantic similarity to the LLM-generated name.

* Percentile: % of other GO names that have a smaller semantic similarity to the GPT-4 name than the assigned GO name has
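
As a minimal sketch of this percentile, assuming it is derived from the 1-based rank of the assigned GO name among all GO names and that there are no ties. The helper name `sim_percentile` is hypothetical, and the ranking script's own implementation is not shown in this notebook.

```python
def sim_percentile(rank: int, n_terms: int) -> float:
    """Fraction of GO names less similar to the LLM name than the assigned one.

    Assumes `rank` is 1-based (1 = most similar) and that there are no ties.
    """
    if not 1 <= rank <= n_terms:
        raise ValueError("rank must be between 1 and n_terms")
    return (n_terms - rank) / n_terms


# Rank 1673 out of 12214 GO terms, as in the first row of the toy output
print(round(sim_percentile(1673, 12214), 6))  # 0.863026
```

This relation appears to match the `sim_rank` and `true_GO_term_sim_percentile` pairs in the toy output further down (for example, rank 12 gives 0.999018).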



In [2]:
import pandas as pd
all_go = pd.read_csv('data/go_terms.csv', index_col=0)
len(all_go)

12214

## Step 1: get the embeddings for all the GO terms (only needs to be run once)

In [3]:
## Create embeddings for all GO terms and save them
import pickle

import pandas as pd
from transformers import AutoTokenizer, AutoModel

from semanticSimFunctions import getSentenceEmbedding

SapBERT_tokenizer = AutoTokenizer.from_pretrained('cambridgeltl/SapBERT-from-PubMedBERT-fulltext')
SapBERT_model = AutoModel.from_pretrained('cambridgeltl/SapBERT-from-PubMedBERT-fulltext')

all_go = pd.read_csv('data/go_terms.csv', index_col=0)
all_go_terms = all_go['Term_Description'].tolist()

# Map each GO term description to its SapBERT sentence embedding
all_go_terms_embeddings_dict = {}
for go_term in all_go_terms:
    tensor = getSentenceEmbedding(go_term, SapBERT_tokenizer, SapBERT_model)
    all_go_terms_embeddings_dict[go_term] = tensor.numpy()  # Convert to numpy array

# Save the embeddings so that this step only has to run once
with open('data/all_go_terms_embeddings_dict.pkl', 'wb') as handle:
    pickle.dump(all_go_terms_embeddings_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [None]:
# check that the embeddings were saved correctly
import pickle
with open('data/all_go_terms_embeddings_dict.pkl', 'rb') as handle:
    all_go_terms_embeddings_dict = pickle.load(handle)
print(len(all_go_terms_embeddings_dict))
# all_go_terms_embeddings_dict['cellular response to DNA damage stimulus']

12214


## Step 2: iterate through each GO term and its corresponding LLM term, rank the similarity of the LLM term against all GO terms, and find where the true GO term falls in the ranked list
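
The ranking idea can be sketched as follows. This is an illustration under stated assumptions, not the code of the ranking script: it assumes cosine similarity as the similarity measure, uses made-up 3-dimensional vectors in place of real SapBERT embeddings, and the helper names `cosine` and `rank_true_term` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_true_term(llm_emb, go_embs, true_term):
    """Return (1-based rank of true_term, GO names sorted by decreasing similarity)."""
    sims = {name: cosine(llm_emb, emb) for name, emb in go_embs.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)
    return ranked.index(true_term) + 1, ranked


# Toy vectors standing in for real embeddings (assumed values, for illustration only)
toy_go_embs = {
    "DNA repair": [1.0, 0.0, 0.0],
    "lipid transport": [0.0, 1.0, 0.0],
    "ion transport": [0.0, 0.8, 0.6],
}
toy_llm_emb = [0.1, 0.9, 0.2]

rank, ranked = rank_true_term(toy_llm_emb, toy_go_embs, "ion transport")
print(rank, ranked)
```

Here the assigned term "ion transport" is the second most similar GO name to the toy LLM embedding, so its rank is 2.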


For the 1000 gene sets, the Python script rank_GOterm_LLM_sim_rand.py was run in the background:

 ```
 python rank_GOterm_LLM_sim_rand.py --input_file data/GO_term_analysis/LLM_processed_selected_1000_go_terms.tsv --emb_file data/all_go_terms_embeddings_dict.pkl --topn 50 --output_file data/GO_term_analysis/simrank_LLM_processed_selected_1000_go_terms.tsv --background_file data/GO_term_analysis/all_go_sim_scores.txt
 ```

The cell below is just an example that runs the same script on the toy data.

In [4]:
%run rank_GOterm_LLM_sim_rand.py --input_file data/GO_term_analysis/LLM_processed_toy_example.tsv --emb_file data/all_go_terms_embeddings_dict.pkl --topn 50 --output_file data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv --background_file data/GO_term_analysis/toy_all_go_sim_scores.txt

  0%|          | 0/10 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


1673


 10%|█         | 1/10 [00:05<00:46,  5.19s/it]

2832
12


 20%|██        | 2/10 [00:10<00:41,  5.18s/it]

3034
26


 30%|███       | 3/10 [00:15<00:36,  5.17s/it]

3393
639


 40%|████      | 4/10 [00:20<00:31,  5.18s/it]

7315
29


 50%|█████     | 5/10 [00:25<00:25,  5.17s/it]

9349
9425


 60%|██████    | 6/10 [00:31<00:20,  5.18s/it]

10753
218


 70%|███████   | 7/10 [00:36<00:15,  5.17s/it]

8667
19


 80%|████████  | 8/10 [00:41<00:10,  5.20s/it]

9837
5107


 90%|█████████ | 9/10 [00:46<00:05,  5.19s/it]

8577
343


100%|██████████| 10/10 [00:51<00:00,  5.19s/it]

11989
Saved progress after 10 rows.
DONE





In [6]:
# sanity check
df = pd.read_csv('data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv', sep='\t', index_col=0)
df.head()

GO,Genes,Gene_Count,Term_Description,LLM Name,LLM Analysis,LLM_name_GO_term_sim,sim_rank,true_GO_term_sim_percentile,random_GO_name,random_go_llm_sim,random_sim_rank,random_sim_percentile,top_50_hits,top_50_sim
GO:0032385,LDLRAP1 SCP2D1 ANXA2 SCP2,4,positive regulation of intracellular cholester...,Lipid Transport and Metabolism,"Proteins: LDLRAP1, SCP2D1, ANXA2, SCP2\n\n1. L...",0.382285,1673,0.863026,Tie signaling pathway,0.344493,2832,0.768135,lipid metabolic process|lipid transport|cellul...,0.8934768|0.8777427|0.8366494|0.82344246|0.800...
GO:0002468,NOD1 HLA-DRA CLEC4A HLA-DRB1 CCL21 NOD2 CCL19 ...,15,dendritic cell antigen processing and presenta...,Antigen Presentation and Immune Response Modul...,The primary biological process performed by th...,0.725125,12,0.999018,positive regulation of reciprocal meiotic reco...,0.352192,3034,0.751597,regulation of antigen processing and presentat...,0.8205818|0.788555|0.77849185|0.7683631|0.7679...
GO:0033683,OGG1 ERCC5 XPA ERCC4 NTHL1,5,"nucleotide-excision repair, DNA incision",DNA Repair,"The system of interacting proteins OGG1, ERCC5...",0.68892,26,0.997871,Tie signaling pathway,0.342632,3393,0.722204,DNA repair|DNA synthesis involved in DNA repai...,0.9999999|0.8868111|0.80208004|0.7669432|0.763...
GO:0035672,SLC7A11 SLC25A39 SLC26A6 ABCB9 SLC15A4 ABCC5 C...,15,oligopeptide transmembrane transport,Ion and Nutrient Transport Regulation,The primary biological process performed by th...,0.495159,639,0.947683,hypomethylation of CpG island,0.294337,7315,0.401097,regulation of ion transport|regulation of ion ...,0.84507704|0.797101|0.77553225|0.7660217|0.749...
GO:0048023,OPN3 CDH3 ATP7A APPL1 ASIP RAB38 ZEB2 TYRP1 GIPC1,9,positive regulation of melanin biosynthetic pr...,Melanogenesis Regulation,"Proteins: OPN3, CDH3, ATP7A, APPL1, ASIP, RAB3...",0.637935,29,0.997626,RNA (guanine-N7)-methylation,0.255608,9349,0.234567,regulation of melanocyte differentiation|regul...,0.8627623|0.8499877|0.82896507|0.80347717|0.79...


### Check the rank similarity result of the 1000 gene sets 

In [11]:
import pandas as pd

rank_sim_df = pd.read_csv('data/GO_term_analysis/simrank_LLM_processed_selected_1000_go_terms.tsv', sep='\t')
## check for duplicated GO IDs and duplicated LLM analyses
print(sum(rank_sim_df.duplicated(subset=['GO'])))
print(sum(rank_sim_df.duplicated(subset=['LLM Analysis'])))

## midpoint of the percentile distribution (500th of the 1000 values, sorted in descending order)
rank_sim_sorted = rank_sim_df.sort_values(by='true_GO_term_sim_percentile', ascending=False)
print('half of the sample have the percentile score higher than: ',rank_sim_sorted.iloc[500-1]['true_GO_term_sim_percentile'])

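## number of GO terms with a percentile <= 0.1
## NOTE: a higher percentile means a better match, so this condition counts the bottom 10%; use >= 0.9 to count the top 10%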
print('number of GO terms in top 10%: ', sum(rank_sim_df['true_GO_term_sim_percentile'] <= 0.1))

## number of GO terms ranked top 10 of similarities

print('number of GO terms ranked top 10: ', sum(rank_sim_df['sim_rank'] <= 10))

0
0
half of the sample have the percentile score higher than:  0.9799410512526608
number of GO terms in top 10%:  10
number of GO terms ranked top 10:  151
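
A small sketch of how these summary statistics relate to the percentile convention, assuming that a higher `true_GO_term_sim_percentile` means a better match (as the definition at the top of the notebook implies). It uses only the five toy values shown by `df.head()` in the sanity check, not the 1000 gene sets.

```python
import statistics

# true_GO_term_sim_percentile values of the five toy rows shown by df.head()
percentiles = [0.863026, 0.999018, 0.997871, 0.947683, 0.997626]

# Half of the values are at or above the median
median_percentile = statistics.median(percentiles)

# Assigned GO name is more similar than at least 90% of GO names (top 10%)
n_top_10_percent = sum(p >= 0.9 for p in percentiles)

# Assigned GO name is more similar than at most 10% of GO names (bottom 10%)
n_bottom_10_percent = sum(p <= 0.1 for p in percentiles)

print(median_percentile, n_top_10_percent, n_bottom_10_percent)
```

On these five values the median is 0.997626, four terms fall in the top 10%, and none fall in the bottom 10%.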


In [None]:
# rank the GO terms by the similarity of LLM name and GO term and pick top 25 and bottom 25 for manual evaluation
rank_sim_df.sort_values(by=['LLM_name_GO_term_sim'], ascending=False, inplace=True)
top = rank_sim_df.head(25)
bottom = rank_sim_df.tail(25)
combine_df = pd.concat([top,bottom], ignore_index=True)


# # add a column to randomly assign number from 1-5, each has the same number of GO terms
# team = [1,2,3,4,5]*10
# import random
# random.seed(2023)
# random.shuffle(team)
# combine_df['team'] = team



combine_df.to_csv('data/GO_term_analysis/best_25_worst_25_similarity_among1000GO.tsv', sep='\t', index=False)
