## Orange Team CQ1.2

### Query:  
What genes express proteins that are functionally similar to the 11 Fanconi Anemia core complex genes (set FA-core), based on GO annotations?

### Goal:
This query aims to expand the FA-core gene set based on GO functional similarity, in service of Task 1 in the St. Jude Life Demonstrator described [here](https://github.com/NCATS-Tangerine/cq-notebooks/wiki/St.-Judes-FA-Demonstrator).
  
### Data Types, Sources, and Routes:
1. **Gene-ortholog associations** - from Panther, via SciGraph (BioLink [/bioentity/gene/{id}/homologs/](https://api.monarchinitiative.org/api/#!/bioentity/get_gene_homolog_associations))
2. **Gene-protein associations** -  from Ensembl, via SciGraph ([BioLink API](https://api.monarchinitiative.org/api/) call or [Monarch API cypher query](https://scigraph-data-dev.monarchinitiative.org/scigraph/docs/#!/cypher/execute)?)
3. **Functional annotations** - from GO, via SciGraph ([BioLink API](https://api.monarchinitiative.org/api/)) or via Wikidata (SPARQL API)
  
### Sub-Queries/Tasks:
   
**Input:** NCBIGene identifiers for 11 FA-core genes
  1. Retrieve orthologs of FA-core genes and add to human FA-core set  
  2. Retrieve proteins encoded by genes in this cross-species set  
  3. Retrieve GO terms annotated to these proteins  
  4. Execute analysis to return ranked list of functionally similar proteins  
  5. Select subset of proteins meeting some defined threshold  
  6. Retrieve genes that encode these proteins  
  7. Retrieve human orthologs of all non-human genes in this set  

**Output:** GeneSetQ2 (functionally similar human genes based on cross-species GO-similarity analysis)

*Note that the subqueries above can be parameterized in many ways (e.g. selecting specific taxa for cross-species expansion, faceting on the MF, BP, or CC subset of GO annotations, using different inclusion thresholds for GO-based similarity, or selecting different knowledge sources or routes to retrieve a particular data type). Different combinations of parameters can be explored using different notebooks in this directory.*
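
One way to make these parameters explicit is a small configuration object that each step reads from. The sketch below is illustrative only: `ExpansionParams` and its field names are hypothetical and do not appear in the notebooks in this directory. The default taxon and threshold simply mirror the mouse analysis in this notebook.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical container for the tunable choices listed in the note above.
@dataclass(frozen=True)
class ExpansionParams:
    ortholog_taxon: str = "NCBITaxon:10090"  # taxon for cross-species expansion (mouse)
    go_aspect: Optional[str] = None          # "MF", "BP", "CC", or None for all aspects
    min_similarity: float = 0.7              # exclusive lower bound on the similarity score
    exclude_identical: bool = True           # drop hits that score exactly 1.0

    def accepts(self, score: float) -> bool:
        """Return True if a similarity score passes the inclusion threshold."""
        if self.exclude_identical and score >= 1.0:
            return False
        return score > self.min_similarity


params = ExpansionParams(go_aspect="BP")
print(params.accepts(0.83), params.accepts(0.70), params.accepts(1.0))
```

Keeping the threshold logic in one place makes it easier to compare runs that differ only in their parameters.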


In [2]:
import pandas as pd
import requests
from pprint import pprint
import ontobio

### Get the 11 FA core genes from [here](https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/)

In [6]:
base_url = "https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/"
FA_1_core_complex = "FA_1_core_complex.txt"
columns = ['gene_curie', 'gene_symbol']
fa_genes = pd.read_csv(base_url + FA_1_core_complex, sep='\t', names=columns)
fa_genes

| | gene_curie | gene_symbol |
|---|---|---|
| 0 | NCBIGene:2175 | FANCA |
| 1 | NCBIGene:2187 | FANCB |
| 2 | NCBIGene:2176 | FANCC |
| 3 | NCBIGene:2178 | FANCE |
| 4 | NCBIGene:2188 | FANCF |
| 5 | NCBIGene:2189 | FANCG |
| 6 | NCBIGene:55120 | FANCL |
| 7 | NCBIGene:57697 | FANCM |
| 8 | NCBIGene:2177 | FANCD2 |
| 9 | NCBIGene:55215 | FANCI |


### Utility wrapper for [BioLink API](https://api.monarchinitiative.org/api/)

In [44]:
class BioLinkWrapper(object):
    def __init__(self):
        self.endpoint = 'https://api.monarchinitiative.org/api/'
        
    def get_gene(self, gene_curie):
        params = {}
        url = '{}bioentity/gene/{}'.format(self.endpoint, gene_curie)
        response = requests.get(url, params)
        return response.json()

    def get_orthologs(self, gene_curie, orth_taxon=None):
        params = {}
        if orth_taxon:
            params['homolog_taxon'] = orth_taxon
        url = '{}bioentity/gene/{}/homologs/'.format(self.endpoint, gene_curie)
        response = requests.get(url, params)
        return response.json()

    def get_phenotypes(self,gene_curie):
        url = '{}bioentity/gene/{}/phenotypes/'.format(self.endpoint, gene_curie)
        response = requests.get(url)
        return response.json()
    
    def get_diseases(self, gene_curie):
        url = '{}bioentity/gene/{}/diseases/'.format(self.endpoint, gene_curie)
        response = requests.get(url)
        return response.json()
    
    def get_interactions(self, gene_curie):
        url = '{}bioentity/gene/{}/interactions/'.format(self.endpoint, gene_curie)
        response = requests.get(url)
        return response.json()
    
    def get_functions(self, gene_curie):
        url = '{}bioentity/gene/{}/functions/'.format(self.endpoint, gene_curie)
        response = requests.get(url)
        return response.json()
    
    def get_disease_models(self, disease_curie):
        url = '{}bioentity/disease/{}/models/'.format(self.endpoint, disease_curie)
        response = requests.get(url)
        return response.json()
    
    def get_all_phenotypes_for_taxon(self, taxon_curie):
        # get phenotypes associated with taxid
        url = "{}mart/gene/phenotype/{}".format(self.endpoint, taxon_curie)
        response = requests.get(url)
        return response.json()
    
    def get_gene_function(self, gene_curie):
        # get function associated with gene
        url = "{}bioentity/gene/{}/function/".format(self.endpoint, gene_curie)
        response = requests.get(url)
        return response.json()
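
The methods above all repeat the same URL-building and request pattern. A minimal sketch of how that pattern could be factored out follows. `build_url` and `get_json` are hypothetical helper names, and the `timeout` value is an arbitrary assumption; only `requests.get`, `raise_for_status`, and `json` are actual `requests` calls.

```python
def build_url(endpoint, *parts):
    """Join an endpoint and path parts with exactly one slash between them."""
    pieces = [endpoint.rstrip("/")] + [str(p).strip("/") for p in parts]
    return "/".join(pieces) + "/"


def get_json(url, params=None, timeout=30):
    """GET a URL and return the decoded JSON, raising on HTTP errors."""
    import requests  # imported here so build_url works without requests installed
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.json()


endpoint = "https://api.monarchinitiative.org/api/"
print(build_url(endpoint, "bioentity/gene", "NCBIGene:2175", "homologs"))
```

Normalising slashes in one helper avoids the doubled or missing separators that are easy to introduce when each method formats its own URL.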
    

### Get orthologs from Biolink API

In [9]:
fa_genes_orthologs = list()
BLW = BioLinkWrapper()
for index, gene in fa_genes.iterrows():
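    # Access columns by label; positional access on a labelled Series is deprecated in recent pandas
    orthologs = BLW.get_orthologs(gene['gene_curie'])
    for orth in orthologs['associations']:
        orth_dict = {
            'gene_name': gene['gene_symbol'],
            'gene_curie': gene['gene_curie'],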
            'ortholog_name': orth['object']['label'],
            'ortholog_curie': orth['object']['id'],
            'orth_tax_label': orth['object']['taxon']['label'],
            'orth_tax_id': orth['object']['taxon']['id']
        }
        fa_genes_orthologs.append(orth_dict)
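
The loop above assumes every response contains an `associations` list and that every object carries a `taxon`. The following sketch shows the same flattening step written defensively and run against a hand-written payload rather than a live API call. `flatten_orthologs` is a hypothetical helper, and the payload shape is inferred from the keys the loop above reads, not from the API documentation.

```python
def flatten_orthologs(gene_curie, gene_name, payload):
    """Turn one homolog response into a list of flat row dictionaries."""
    rows = []
    for assoc in payload.get("associations", []):
        obj = assoc.get("object", {})
        taxon = obj.get("taxon") or {}
        rows.append({
            "gene_name": gene_name,
            "gene_curie": gene_curie,
            "ortholog_name": obj.get("label"),
            "ortholog_curie": obj.get("id"),
            "orth_tax_label": taxon.get("label"),
            "orth_tax_id": taxon.get("id"),
        })
    return rows


# Hand-written stand-in for an API response (values from the ortholog table in this notebook).
sample = {"associations": [{"object": {
    "id": "RGD:1311380", "label": "Fanca",
    "taxon": {"id": "NCBITaxon:10116", "label": "Rattus norvegicus"}}}]}

rows = flatten_orthologs("NCBIGene:2175", "FANCA", sample)
print(rows[0]["ortholog_curie"], rows[0]["orth_tax_id"])
```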

### Load into data frame

In [17]:
orth_columns  = [ 'gene_name', 'gene_curie', 'ortholog_name', 'ortholog_curie', 'orth_tax_label', 'orth_tax_id']
orth_df = pd.DataFrame(data=fa_genes_orthologs, columns=orth_columns)
orth_df.head()

| | gene_name | gene_curie | ortholog_name | ortholog_curie | orth_tax_label | orth_tax_id |
|---|---|---|---|---|---|---|
| 0 | FANCA | NCBIGene:2175 | FANCA | NCBIGene:100080126 | Ornithorhynchus anatinus | NCBITaxon:9258 |
| 1 | FANCA | NCBIGene:2175 | Fanca | RGD:1311380 | Rattus norvegicus | NCBITaxon:10116 |
| 2 | FANCA | NCBIGene:2175 | FANCA | NCBIGene:100621453 | Sus scrofa | NCBITaxon:9823 |
| 3 | FANCA | NCBIGene:2175 | FANCA | NCBIGene:618375 | Bos taurus | NCBITaxon:9913 |
| 4 | FANCA | NCBIGene:2175 | FANCA | NCBIGene:454393 | Pan troglodytes | NCBITaxon:9598 |


In [18]:
'{} Orthologs found'.format(len(orth_df.index))

'111 Orthologs found'

### Query wrapper for retrieving UniProt IDs from [MyGene.info](https://mygene.info)

In [19]:
def query_mygene(gene_curie):
    gene_curie = gene_curie.replace('NCBIGene:', '')
    url = 'https://mygene.info/v3/query?q={}&fields=all'.format(gene_curie)
    hit = requests.get(url)
    hit = hit.json()
    uniprot = hit['hits'][0]['uniprot']
    if 'Swiss-Prot' in uniprot.keys():
        if isinstance(uniprot['Swiss-Prot'], list):
            return uniprot['Swiss-Prot']
        elif isinstance(uniprot['Swiss-Prot'], str):
            return [uniprot['Swiss-Prot']]
        else:
            return None
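
`query_mygene` assumes the first hit always carries a `uniprot` field and returns `None` in some branches and a list in others. A defensive variant of the extraction step might look like the sketch below. `extract_swissprot` is a hypothetical helper, and the sample hits are mock data shaped like the fields the function above reads, not real MyGene.info responses.

```python
def extract_swissprot(hit):
    """Return Swiss-Prot accessions from one hit as a list (possibly empty)."""
    uniprot = hit.get("uniprot") or {}
    accession = uniprot.get("Swiss-Prot")
    if isinstance(accession, str):
        return [accession]
    if isinstance(accession, list):
        return list(accession)
    return []


# Mock hits; the accessions are placeholders, not real query results.
print(extract_swissprot({"uniprot": {"Swiss-Prot": "P00001"}}))
print(extract_swissprot({"uniprot": {"Swiss-Prot": ["P00001", "P00002"]}}))
print(extract_swissprot({"symbol": "no-uniprot-field"}))
```

Always returning a list means callers can iterate over the result without first checking for `None`.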

### Use [Ontobio](https://github.com/biolink/ontobio) for computing functional similarity between sets of GO terms annotated to genes

In [20]:
from ontobio.ontol_factory import OntologyFactory
# Create ontology object, for GO
# Transparently uses remote SPARQL service.
# (May take a few seconds to run first time, Jupyter will show '*'. BE PATIENT, do
# not re-execute cell)
ofactory = OntologyFactory()
ont = ofactory.create('go')

### Get gene-function associations from the [GeneOntology](http://geneontology.org/)

In [21]:
from ontobio.io.gafparser import GafParser
from ontobio.assoc_factory import AssociationSetFactory

p = GafParser()
afactory = AssociationSetFactory()

def make_assocs(group, parse=False):
    url = "http://geneontology.org/gene-associations/gene_association.{}.gz".format(group)
    if group == 'human':
        url = "http://geneontology.org/gene-associations/goa_human.gaf.gz"
    assocs = p.parse(url)
    if parse:
        return assocs
    else:
        return afactory.create_from_assocs(assocs, ontology=ont)

### Begin with finding genes in the mouse genome that are functionally similar to the mouse orthologs of the FA Core genes in humans

Get mouse gene function associations

In [26]:
asoc_mouse = make_assocs(group='mgi')

### Compute Jaccard Similarity and assign scores

Use Ontobio's Jaccard similarity to compare the annotation set of each mouse FA ortholog against every other annotated gene in the same genome, and keep the hits whose similarity score is greater than 0.7 (70%) and less than 1.

In [27]:
mouse_orthologs = orth_df.loc[orth_df['orth_tax_id'] == 'NCBITaxon:10090']
mouse_ortho_sims = list()
for index, row in mouse_orthologs.iterrows():
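    gene_name = row['gene_name']
    gene_curie = row['gene_curie']
    ortholog_name = row['ortholog_name']
    ortholog_curie = row['ortholog_curie']
    ortholog_taxon_curie = row['orth_tax_id']
    ortholog_taxon_name = row['orth_tax_label']
    # The mouse association set keys genes as 'MGI:MGI:<number>', so add the extra prefix
    mo = 'MGI:{}'.format(ortholog_curie)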
    for mgene in list(asoc_mouse.subject_label_map.keys()):
        amScore = asoc_mouse.jaccard_similarity(mo, mgene)     
        if amScore > .7 and amScore < 1:
            mouse_ortho_sims.append({
                    'gene_name': gene_name,
                    'gene_curie': gene_curie,
                    'ortholog_name': ortholog_name,
                    'ortholog_curie': ortholog_curie,
                    'ortholog_taxon_curie': ortholog_taxon_curie,
                    'ortholog_taxon_name': ortholog_taxon_name,
                    'non_fa_hit_name': asoc_mouse.label(mgene),
                    'non_fa_hit_curie': mgene.replace('MGI:MGI:', 'MGI:'),
                    'sim_score' : amScore
                })
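
The scoring and filtering in the cell above can be illustrated on plain Python sets. This is a sketch of the Jaccard index and the `0.7 < score < 1` filter only. It is an assumption that Ontobio's `jaccard_similarity` compares inferred (ancestor-inclusive) annotation sets, which this toy version does not do. The names `jaccard` and `similar_hits` and the toy annotations are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_hits(query_terms, annotations, lower=0.7, upper=1.0):
    """Return (gene, score) pairs with lower < score < upper, best first."""
    scored = ((gene, jaccard(query_terms, terms)) for gene, terms in annotations.items())
    hits = [(gene, score) for gene, score in scored if lower < score < upper]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy annotation sets; the GO IDs are opaque labels, not real terms.
query = {"GO:1", "GO:2", "GO:3", "GO:4", "GO:5"}
annotations = {
    "geneA": {"GO:1", "GO:2", "GO:3", "GO:4", "GO:5", "GO:6"},  # 5 shared of 6 total
    "geneB": {"GO:1", "GO:2"},                                  # 2 shared of 5 total
    "geneC": set(query),                                        # identical, so excluded
}
print(similar_hits(query, annotations))
```

Excluding a score of exactly 1 removes the query gene itself, but it also removes any other gene with an identical annotation set.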

### Load annotations into a mouse-specific dataframe

In [83]:
mouse_sims_columns = [ 'gene_name','gene_curie', 'ortholog_name', 
                      'ortholog_curie', 'ortholog_taxon_curie', 
                      'ortholog_taxon_name', 'non_fa_hit_name', 'non_fa_hit_curie', 'sim_score' ]
mouse_sims_df = pd.DataFrame(data=mouse_ortho_sims, columns=mouse_sims_columns )

In [63]:
mouse_sims_df.sort_values(by=['sim_score'], ascending=False).head()

| | gene_name | gene_curie | ortholog_name | ortholog_curie | ortholog_taxon_curie | ortholog_taxon_name | non_fa_hit_name | non_fa_hit_curie | sim_score | human_orth_hit |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | Cfap45 | MGI:1919120 | 0.833333 | [HGNC:17229] |
| 1 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 1110032A03Rik | MGI:1915971 | 0.833333 | [HGNC:1163] |
| 62 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | Zbtb34 | MGI:2685195 | 0.833333 | [] |
| 61 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | Zbtb11 | MGI:2443876 | 0.833333 | [] |
| 4 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 2310022A10Rik | MGI:1913617 | 0.833333 | [HGNC:26723] |


### Find human orthologs of mouse genes with functional similarity over 70% to mouse FA core gene orthologs

In [59]:
def find_human_orthologs(curie):
    orth_list = list()
    BioL = BioLinkWrapper()
    orths = BioL.get_orthologs(gene_curie=curie, orth_taxon='NCBITaxon:9606')
    if 'associations' in orths.keys():
        for orth in orths['associations']:
            orth_list.append(orth['object']['id'])
    return orth_list

In [61]:
mouse_sims_df['human_orth_hit'] = mouse_sims_df['non_fa_hit_curie'].apply(find_human_orthologs)

### The 'human_orth_hit' column lists human genes that are orthologs of mouse genes functionally similar to the mouse orthologs of the FA core genes

Example (row 0):  
FANCE (NCBIGene:2178) has the mouse ortholog Fance (MGI:1920025).  
Fance is functionally similar to the mouse gene 0610040J01Rik (MGI:1923511).  
0610040J01Rik is an ortholog of the human gene HGNC:25618 (NCBIGene:55286).

In [65]:
mouse_sims_df.head()

| | gene_name | gene_curie | ortholog_name | ortholog_curie | ortholog_taxon_curie | ortholog_taxon_name | non_fa_hit_name | non_fa_hit_curie | sim_score | human_orth_hit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 0610040J01Rik | MGI:1923511 | 0.8 | [HGNC:25618] |
| 1 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 1110032A03Rik | MGI:1915971 | 0.833333 | [HGNC:1163] |
| 2 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 1700030J22Rik | MGI:1916778 | 0.740741 | [HGNC:26525] |
| 3 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 2200002D01Rik | MGI:1919525 | 0.703704 | [] |
| 4 | FANCE | NCBIGene:2178 | Fance | MGI:1920025 | NCBITaxon:10090 | Mus musculus | 2310022A10Rik | MGI:1913617 | 0.833333 | [HGNC:26723] |


In [67]:
human_functionally_similar_genes = mouse_sims_df['human_orth_hit'].tolist()


#### Final list of human genes that are orthologs of mouse genes that are functionally similar to mouse orthologs of FA core human genes.

In [95]:
final_set = set([j for i in human_functionally_similar_genes for j in i])
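
The cell above flattens a column of lists into a `set`, which removes duplicates but does not keep a stable order. If a reproducible ordering is wanted (for example, for the CSV written later), an order-preserving variant could look like this sketch. `unique_curies` is a hypothetical helper name.

```python
from itertools import chain


def unique_curies(lists_of_curies):
    """Flatten a list of CURIE lists and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(chain.from_iterable(lists_of_curies)))


# Values copied from the 'human_orth_hit' column in this notebook, plus a repeat and an empty list.
column = [["HGNC:25618"], ["HGNC:1163"], [], ["HGNC:26723"], ["HGNC:1163"]]
print(unique_curies(column))
```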

### Display gene id, name and functional annotations

In [None]:
final_hits = list()
for index, gene in enumerate(final_set):
    function_list = list()
    gene_info = BLW.get_gene(gene_curie=gene)
    gene_function = BLW.get_gene_function(gene_curie=gene)
    if 'associations' in gene_function.keys():
        for assoc in gene_function['associations']:
            function_list.append(assoc['object']['label'])
    function_set = set(function_list)
    if len(function_set) > 0: 
        try:
            final_hits.append([gene, gene_info['label'], ", ".join(function_set)])
            print('yes')
        except Exception as e:
            print(e)
hits_frame = pd.DataFrame(data=final_hits, columns=['HGNC', 'Symbol', 'Function(s)' ])
hits_frame.to_csv('MouseFunctionallySimilarGenes.csv', sep='\t')
# None means "no truncation"; the older value -1 is deprecated in recent pandas releases
pd.set_option('display.max_colwidth', None)

yes
yes
yes
yes
yes
yes
yes
yes


In [146]:
hits_frame

| | HGNC | Symbol | Function(s) |
|---|---|---|---|
| 0 | HGNC:24661 | REXO5 | nucleic acid phosphodiester bond hydrolysis, nucleolus, extracellular exosome, RNA binding, exonuclease activity |
| 1 | HGNC:6658 | VWA5A | nucleus, nucleoplasm |
| 2 | HGNC:23484 | FAM208B | cytosol, nucleus, protein binding, nucleoplasm |
| 3 | HGNC:25563 | FAM222B | nucleoplasm |
| 4 | HGNC:28189 | NUDT22 | hydrolase activity, protein binding, nucleoplasm |
| 5 | HGNC:17229 | CFAP45 | cilium, nucleus, nucleoplasm |
| 6 | HGNC:16933 | FEM1C | protein ubiquitination, nucleoplasm, cytosol, protein binding, post-translational protein modification |
| 7 | HGNC:29135 | ANKRD12 | cytosol, nucleoplasm |
| 8 | HGNC:32070 | LYSMD1 | nucleus, protein binding, nucleoplasm |
| 9 | HGNC:29039 | RPRD2 | DNA-directed RNA polymerase II, holoenzyme, nucleoplasm, snRNA transcription by RNA polymerase II |


TODO: This analysis covers mouse only. It needs to be repeated species by species to find more candidates.
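
A possible starting point for the species-by-species extension is to group the ortholog rows by the annotation group that `make_assocs` would need for each taxon. The sketch below shows only that grouping step on hand-written rows. `TAXON_TO_GROUP` and `orthologs_by_group` are hypothetical, and the group names are assumptions that should be checked against the GO annotation downloads before use.

```python
from collections import defaultdict

# Assumed mapping from taxon CURIE to the group name passed to make_assocs().
TAXON_TO_GROUP = {
    "NCBITaxon:10090": "mgi",   # Mus musculus
    "NCBITaxon:10116": "rgd",   # Rattus norvegicus
    "NCBITaxon:7955": "zfin",   # Danio rerio
}


def orthologs_by_group(ortholog_rows, taxon_to_group=TAXON_TO_GROUP):
    """Group ortholog CURIEs by annotation group, skipping unmapped taxa."""
    grouped = defaultdict(list)
    for row in ortholog_rows:
        group = taxon_to_group.get(row["orth_tax_id"])
        if group is not None:
            grouped[group].append(row["ortholog_curie"])
    return dict(grouped)


rows = [
    {"ortholog_curie": "MGI:1920025", "orth_tax_id": "NCBITaxon:10090"},
    {"ortholog_curie": "RGD:1311380", "orth_tax_id": "NCBITaxon:10116"},
    {"ortholog_curie": "NCBIGene:618375", "orth_tax_id": "NCBITaxon:9913"},
]
for group, curies in orthologs_by_group(rows).items():
    # In the notebook, make_assocs(group) and the similarity loop would run here.
    print(group, curies)
```

Taxa without a mapped annotation group (cow, in this example) are skipped rather than raising an error, so the loop can run on whatever subset of species has annotations available.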

In [None]:
dict_keys(['disease_associations', 'transcripts', 'phenotype_associations', 'pathway_associations', 'description', 'full_name', 'literature_associations', 'sequence_features', 'genotype_associations', 'chromosome', 'systematic_name', 'families', 'interaction_associations', 'summaries', 'function_associations', 'homology_associations', 'taxon', 'xrefs', 'deprecated', 'id', 'replaced_by', 'consider', 'label', 'categories', 'synonyms', 'types'])