# Tutorial 3

## Generating a list of candidate terms 

In this tutorial we will dive into the GO graph in order to get candidate GO terms that we will use afterwards to construct our predictor models. 

Which terms are we interested in for this tutorial? Those at `distance==3` from any term that is present in at least 75 proteins of our test set (so that the models can be trained on sound data). These terms serve only as an example: because they are close to terms that are already assigned, they are more likely to be valid terms for a protein that simply have not been assigned to it yet. A complete training run would cover more terms, but this is just a tutorial.
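To make "terms at distance 3" concrete, here is a minimal sketch on a made-up toy ontology. It does not use `manas_cafa`: the graph, the term names and the helpers `terms_within` and `candidates_at_distance` are all hypothetical, and the sketch assumes that edges are followed in a single direction (here, from a term to its parents). It only illustrates the set logic: terms reachable within 3 steps, minus those reachable within 2 steps, minus the terms already assigned.

```python
from collections import deque

# Hypothetical toy ontology (not real GO terms): each term maps to its direct parents.
TOY_PARENTS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": ["G"],
    "G": [],
}

def terms_within(parents, start_terms, max_distance):
    """Return every term reachable from start_terms in at most max_distance steps."""
    seen = set(start_terms)
    queue = deque((term, 0) for term in start_terms)
    while queue:
        term, dist = queue.popleft()
        if dist == max_distance:
            continue  # do not expand beyond the requested distance
        for neighbour in parents.get(term, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return seen

def candidates_at_distance(parents, assigned_terms, distance):
    """Terms whose shortest distance to any assigned term is exactly `distance`."""
    within = terms_within(parents, assigned_terms, distance)
    closer = terms_within(parents, assigned_terms, distance - 1)
    return within - closer - set(assigned_terms)

single = candidates_at_distance(TOY_PARENTS, ["A"], 3)
several = candidates_at_distance(TOY_PARENTS, ["A", "D"], 3)
print(single)
print(several)
```

With only `A` assigned, the single candidate is `F`. Once `D` is assigned as well, `F` and `G` are closer than three steps to an assigned term, so no candidate remains.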

In [1]:
# Imports
from manas_cafa.bio.protein import Protein
from Bio import SeqIO
import random

In [2]:
# Load the proteins (max length 500)
proteins = []
for record in SeqIO.parse('../cafa-5-protein-function-prediction/Test (Targets)/testsuperset.fasta', 'fasta'):
    if len(record.seq) <= 500:
        proteins.append(record.id)

Let's instantiate the GO graph

In [3]:
g = Protein.build_graph('../cafa-5-protein-function-prediction/Train/go-basic.obo')

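Let's compute the terms at a distance of exactly three. The cell below processes only the first ten proteins, one after another; the parallel call for the whole protein set is left commented out at the end of the cell.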

In [4]:
# Parallel loading of go_terms into a dictionary (faster)
from joblib import Parallel, delayed
import multiprocessing

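def processInput(protein_id):
    """Return {protein_id: [candidate GO terms]}; the list is empty if anything fails."""
    try:
        protein = Protein(protein_id)
        try:
            protein.load_uniprot()
        except Exception:
            # The UniProt record could not be loaded: skip this protein
            print('Continued with protein {}'.format(protein_id))
            return {protein_id : []}
        # Keep the terms reached at distance 3 but not at distance 2, minus the terms already assigned
        candidate_terms = (set(protein.go_terms_children(g, 3)) - set(protein.go_terms_children(g, 2))) - set([x['id'] for x in protein.go_terms()])
        return {protein_id : list(candidate_terms)}
    except Exception:
        print('Error with protein {}'.format(protein_id))
        return {protein_id : []}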

num_cores = multiprocessing.cpu_count()*2
# Compute the candidate GO terms for the first ten proteins only, one after another
go_terms = {}
for i in proteins[0:10]:
    go_terms.update(processInput(i))


print(f'Proteins computed: {go_terms.keys()}')

print(f'Go terms at distance 3 for {list(go_terms.keys())[0]}: {go_terms[list(go_terms.keys())[0]]}')

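# To do this for the whole protein set, run the calls in parallel. Parallel returns a list of
# single-protein dicts, so merge them into go_terms afterwards:
#results = Parallel(n_jobs=num_cores, verbose=100)(delayed(processInput)(i) for i in proteins)
#for result in results:
#    go_terms.update(result)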


Proteins computed: dict_keys(['Q9CQV8', 'P62259', 'P68510', 'P61982', 'O70456', 'P68254', 'P63101', 'Q6PD03', 'Q6PD28', 'Q61151'])
Go terms at distance 3 for Q9CQV8: ['GO:0032774', 'GO:0019220', 'GO:2001141', 'GO:0023051', 'GO:0031327', 'GO:0043226', 'GO:0065007', 'GO:0051641', 'GO:0097708', 'GO:0030234', 'GO:0048523', 'GO:0019538', 'GO:0045184', 'GO:0010563', 'GO:0010646', 'GO:0010467', 'GO:0051649', 'GO:0051246', 'GO:0140678', 'GO:0048583', 'GO:0006796', 'GO:0010558', 'GO:0060255', 'GO:0071702', 'GO:0051253', 'GO:0006810', 'GO:0048519', 'GO:0043412', 'GO:0010605', 'GO:0070727', 'GO:0051172', 'GO:0051179', 'GO:0071705']


Let's build a dictionary with the GO terms as keys and the proteins as values.


In [5]:
go_terms_dict = {}
for protein, terms in go_terms.items():
    for go_term in terms:
        # Create the list the first time a term is seen, then add the protein to it
        go_terms_dict.setdefault(go_term, []).append(protein)

go_terms_dict

{'GO:0032774': ['Q9CQV8', 'P68510', 'P68254'],
 'GO:0019220': ['Q9CQV8', 'P62259', 'Q6PD03'],
 'GO:2001141': ['Q9CQV8', 'P68510', 'P68254'],
 'GO:0023051': ['Q9CQV8', 'P68510', 'P61982'],
 'GO:0031327': ['Q9CQV8', 'P68254'],
 'GO:0043226': ['Q9CQV8', 'P62259', 'O70456', 'Q6PD03', 'Q6PD28', 'Q61151'],
 'GO:0065007': ['Q9CQV8',
  'P68510',
  'O70456',
  'P68254',
  'P63101',
  'Q6PD03',
  'Q61151'],
 'GO:0051641': ['Q9CQV8', 'P62259', 'P61982', 'O70456', 'P68254'],
 'GO:0097708': ['Q9CQV8', 'P62259', 'P63101'],
 'GO:0030234': ['Q9CQV8', 'Q6PD03', 'Q6PD28', 'Q61151'],
 'GO:0048523': ['Q9CQV8', 'P68510', 'Q6PD03', 'Q6PD28'],
 'GO:0019538': ['Q9CQV8', 'P62259'],
 'GO:0045184': ['Q9CQV8', 'P62259', 'P61982', 'O70456', 'P68254', 'P63101'],
 'GO:0010563': ['Q9CQV8', 'P62259'],
 'GO:0010646': ['Q9CQV8', 'P68510', 'P61982'],
 'GO:0010467': ['Q9CQV8', 'P68510', 'P68254'],
 'GO:0051649': ['Q9CQV8', 'P62259', 'P61982', 'O70456', 'P68254'],
 'GO:0051246': ['Q9CQV8', 'P62259', 'Q6PD03'],
 'GO:0140678': ...

The following two cells are not executed in this tutorial. They were run once on the whole set of proteins to generate the candidate files used in the following tutorials.

In [None]:
# Save the dict to a file in the data folder
with open('../data/go_terms_test_parent_d3_candidates_maxlen500.tsv', 'w') as f:
    for go_term in go_terms_dict.keys():
        f.write('{}\t{}\n'.format(go_term, ','.join(go_terms_dict[go_term])))

In [None]:
# Save the dict to a file in the data folder, keeping only the terms that have at least 100 proteins
with open('../data/go_terms_test_parent_d3_candidates_maxlen500_minmembers100.tsv', 'w') as f:
    for go_term in go_terms_dict.keys():
        if len(go_terms_dict[go_term]) >= 100:
            f.write('{}\t{}\n'.format(go_term, ','.join(go_terms_dict[go_term])))
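Later steps need to load these files back. The sketch below shows one possible way to do it, assuming the exact layout written above (one line per term: the GO term, a tab, then the protein IDs separated by commas). The helper `read_candidates` and its `min_members` parameter are hypothetical names introduced here for illustration, and the sample data is a small excerpt of the output shown earlier, written to a temporary file rather than to `../data`.

```python
import os
import tempfile

def read_candidates(path, min_members=1):
    """Read 'term<TAB>protein1,protein2,...' lines into a {term: [proteins]} dict."""
    candidates = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            term, members = line.split("\t")
            proteins = members.split(",")
            # Optionally drop terms shared by too few proteins
            if len(proteins) >= min_members:
                candidates[term] = proteins
    return candidates

# Two entries taken from the dictionary printed earlier in this tutorial
sample = {
    "GO:0032774": ["Q9CQV8", "P68510", "P68254"],
    "GO:0031327": ["Q9CQV8", "P68254"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "candidates.tsv")
    # Write the sample in the same format as the cells above
    with open(path, "w") as handle:
        for term, proteins in sample.items():
            handle.write("{}\t{}\n".format(term, ",".join(proteins)))
    loaded = read_candidates(path)
    filtered = read_candidates(path, min_members=3)

print(loaded)
print(filtered)
```

Filtering while reading means a single unfiltered file could serve several membership thresholds, although the cells above write the filtered file separately.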