# Tutorial 4

## Ordering candidate terms by information accrued

In this tutorial, we prioritize the terms we want to build models for. We use a very simple score: the information accrued (IA) by the term, multiplied by the number of proteins in the test set for which the term is a candidate (based on the distance criteria from previous tutorials).
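
As a minimal sketch of how this score behaves, assume a toy table with made-up values (the term IDs, IA values and protein counts below are illustrative and are not taken from the real files):

```python
import pandas as pd

# Toy data: hypothetical terms with made-up IA values and candidate counts
toy = pd.DataFrame({
    'go_term': ['GO:A', 'GO:B', 'GO:C'],
    'ia': [0.5, 4.0, 2.0],
    'num_proteins': [1000, 10, 300],
})

# score = IA of the term * number of test proteins where the term is a candidate
toy['score'] = toy['ia'] * toy['num_proteins']

# Highest score first
toy_ranked = toy.sort_values(by='score', ascending=False).reset_index(drop=True)
print(toy_ranked)
```

In this toy case the term with the highest IA (`GO:B`) ends up last, because it is a candidate for very few proteins; the score favours terms that are both informative and widely applicable.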

In [1]:
# Ignore warnings 
import warnings
warnings.filterwarnings('ignore')

In [2]:
# We're going to play with dataframes
import pandas as pd

In [3]:
# Path for the GO terms, candidate GO terms and information accrued by term files
GO_TERMS_TRAIN_SET = '../data/go_terms_train_set_maxlen500_minmembers100.tsv'
CANDIDATE_TERMS_TEST_SET = '../data/go_terms_test_parent_candidates_maxlen500_minmembers100.tsv'
TERMS_IA = '../cafa-5-protein-function-prediction/IA.txt'

We load the GO terms for the training set:


In [4]:
go_terms_train_set = pd.read_csv(GO_TERMS_TRAIN_SET, sep='\t', header=None)
go_terms_train_set.columns=['go_term', 'proteins']
go_terms_train_set

Unnamed: 0,go_term,proteins
0,GO:0003677,"P20536,P04911,Q9Z288,Q12415,Q8BUN5,B4FK49,Q9C6..."
1,GO:0006281,"P20536,P04911,O75771,Q6Q783,Q9CAM7,Q9V3K3,Q9Y7..."
2,GO:0005737,"O73864,A0A0B4J1F4,Q9VSA3,O94652,Q7ZT11,P05179,..."
3,GO:0005615,"O73864,Q6DUW9,Q9NZH8,Q8NF86,O76745,P13236,Q9DB..."
4,GO:0005886,"O73864,A0A0B4J1F4,P33681,Q96S79,P04632,Q41931,..."
...,...,...
1158,GO:0050808,"O76070,Q22125,Q71U36,P37377,Q9VHG4,P97887,P683..."
1159,GO:0070328,"P40025,P58004,Q13133,Q8R1L8,Q60612,Q9VT51,Q9BY..."
1160,GO:0030017,"Q5JVS0,P0DP27,P23693,P60901,P0DP26,P0DP30,Q8WU..."
1161,GO:0020023,"P86927,P86924,Q57VB1,Q07053,O61016,Q383Q5,Q381..."


We load the candidate GO terms for the test set:

In [5]:
# Load the candidate GO terms for the test set
candidate_terms_test_set = pd.read_csv(CANDIDATE_TERMS_TEST_SET, sep='\t', header=None)
candidate_terms_test_set.columns=['go_term', 'proteins']
candidate_terms_test_set

Unnamed: 0,go_term,proteins
0,GO:0000344,"Q9CQV8,P62259,P68510,P61982,O70456,P68254,P631..."
1,GO:1904841,"Q9CQV8,P62259,P63101,Q6P1F6,P35363,Q60612,Q606..."
2,GO:0045273,"Q9CQV8,P68254,P63101,Q6PD03,Q60876,Q02152,P972..."
3,GO:0002766,"Q9CQV8,P62259,P68510,P61982,O70456,P68254,P631..."
4,GO:0042565,"Q9CQV8,P68254,P63101,Q60876,Q9R078,O54950,P052..."
...,...,...
37805,GO:0044277,"Q88QT3,Q88M11,Q88QT2,P63957,Q97RU8,P63640,P966..."
37806,GO:0031505,"Q88QT3,Q88M11,Q88QT2,P63957,Q97RU8,P63640,P966..."
37807,GO:0009827,"Q88QT3,Q88M11,Q88QT2,P63957,Q97RU8,P63640,P966..."
37808,GO:0009664,"Q88QT3,Q88M11,Q88QT2,P63957,Q97RU8,P63640,P966..."


And the IA file:

In [6]:
terms_ia = pd.read_csv(TERMS_IA, sep='\t', header=None)
terms_ia.columns=['go_term', 'ia']
terms_ia

Unnamed: 0,go_term,ia
0,GO:0000001,0.000000
1,GO:0000002,3.103836
2,GO:0000003,3.439404
3,GO:0000011,0.056584
4,GO:0000012,6.400377
...,...,...
43243,GO:2001083,7.159871
43244,GO:2001084,7.592457
43245,GO:2001085,7.159871
43246,GO:2001147,5.554589


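Let's create a new dataframe with only the useful terms (those present in both the training set and the test-set candidates, with a non-negligible IA), their IA, the number of proteins in the test set for which each term is a candidate (the protein lists are comma-separated), and the resulting score: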


In [7]:
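# Useful terms: present in the training set and candidates in the test set
useful_terms = set(go_terms_train_set['go_term']) & set(candidate_terms_test_set['go_term'])
# Keep useful terms with a non-negligible IA; .copy() avoids assigning into a slice of terms_ia
useful_terms_ia = terms_ia[terms_ia['go_term'].isin(useful_terms) & terms_ia['ia'].ge(0.0001)].copy()
# Number of test proteins per term, counted from the comma-separated list
useful_terms_ia['num_proteins'] = useful_terms_ia['go_term'].map(candidate_terms_test_set.set_index('go_term')['proteins']).map(lambda x: len(x.split(',')))
useful_terms_ia['score'] = useful_terms_ia['ia'] * useful_terms_ia['num_proteins']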
useful_terms_ia

Unnamed: 0,go_term,ia,num_proteins,score
13,GO:0000028,0.018859,541,10.202734
17,GO:0000045,0.010268,348,3.573381
24,GO:0000070,0.012405,1038,12.876412
29,GO:0000079,0.009848,2451,24.136923
31,GO:0000082,0.007089,899,6.373393
...,...,...,...,...
40834,GO:0070888,3.634080,5287,19213.382373
40912,GO:0071949,1.049328,524,549.847721
42423,GO:0106310,6.979187,353,2463.653007
43202,GO:1990837,0.023805,3174,75.555747
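
The same kind of table can also be built with joins instead of `map` and a lambda. The following is only a sketch on toy stand-ins for the three dataframes (all IDs and values are made up), and it assumes that each `go_term` appears at most once per table and that no protein list is empty:

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up IDs and values)
train = pd.DataFrame({'go_term': ['GO:A', 'GO:B'],
                      'proteins': ['P1,P2', 'P3']})
candidates = pd.DataFrame({'go_term': ['GO:A', 'GO:B', 'GO:C'],
                           'proteins': ['P1,P2,P3', 'P4', 'P5,P6']})
ia = pd.DataFrame({'go_term': ['GO:A', 'GO:B', 'GO:C'],
                   'ia': [2.0, 0.0, 5.0]})

# Number of proteins = number of commas + 1 (assumes non-empty lists)
counts = candidates.assign(num_proteins=candidates['proteins'].str.count(',') + 1)

scored = (
    ia[ia['ia'].ge(0.0001)]                       # drop terms with negligible IA
    .merge(train[['go_term']], on='go_term')      # inner join: term must be in the training set
    .merge(counts[['go_term', 'num_proteins']], on='go_term')  # and be a test-set candidate
)
scored['score'] = scored['ia'] * scored['num_proteins']
print(scored)
```

Here `GO:B` is dropped because its IA is zero and `GO:C` because it is not in the training set, so only `GO:A` remains. Since `merge` returns a new dataframe, adding the `score` column does not touch the original tables.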


Now we just sort them by score, highest first:

In [8]:
useful_terms_ia_ordered = useful_terms_ia.sort_values(by=['score'], ascending=False)
useful_terms_ia_ordered

Unnamed: 0,go_term,ia,num_proteins,score
28010,GO:0000407,9.175414,11921,109380.109699
28298,GO:0005793,8.166037,12609,102965.554769
28792,GO:0020023,5.231151,16184,84660.945708
28438,GO:0008180,5.615446,14814,83187.212975
28297,GO:0005791,5.236855,14798,77494.973859
...,...,...,...,...
12458,GO:0045893,0.000343,2097,0.718521
13509,GO:0048364,0.002516,229,0.576072
2192,GO:0006355,0.000499,922,0.460238
2601,GO:0006874,0.003343,113,0.377809
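
Once the terms are ordered, one possible way to decide how many of them to model is to look at the cumulative share of the total score. This tutorial does not fix such a cutoff; the sketch below uses a toy ranking with made-up scores and a hypothetical cutoff `k`:

```python
import pandas as pd

# Toy ranking, already sorted by score in descending order (made-up values)
ranked = pd.DataFrame({
    'go_term': ['GO:A', 'GO:B', 'GO:C', 'GO:D'],
    'score': [600.0, 300.0, 90.0, 10.0],
})

# Fraction of the total score covered by the terms up to and including each row
ranked['cum_share'] = ranked['score'].cumsum() / ranked['score'].sum()

# Hypothetical cutoff: keep only the first k terms
k = 2
top_k = ranked.head(k)
print(top_k)
```

In this toy ranking the first two terms already cover 90% of the total score, which illustrates why a short prefix of the ordered list may be a reasonable starting point.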


And we save it to a file for use in later tutorials:

In [None]:
# Save the ordered terms
useful_terms_ia_ordered.to_csv('../data/go_terms_test_candidates_maxlen500_minmembers100_ordered.tsv', sep='\t', index=False)