## Orange QB1.4

### Query:
Return sequences of FA genes associated with functional post-translational modifications.

In [1]:
# import BioThings Explorer (currently a local module; it will become an independent Python package)
from visual import pathViewer

In [2]:
# pathViewer is a Python class for graphically displaying API connection maps and exploring bio-entity relationships
k = pathViewer()

#### Show All Available IDs
Call the **available_ids** function to retrieve all IDs used in BioThings Explorer, along with their descriptions.

In [3]:
# uncomment the code below to show all available ids in BioThings Explorer
# k.available_ids()

#### Show how APIs/Endpoints/Bio-Entities can be connected together
The graph is **interactive**.

In [4]:
# uncomment the code below to show the API road map
# k.show_api_road_map()

#### Find Path
The CQ above asks for the path connecting **NCBI Gene IDs** to **PTM info**.
The **find_path** function shows how two biological entities or concepts can be connected through API endpoints.

In [11]:
# set display_graph to True to display the path between 'NCBI Gene' and 'PTM Object' as a graph
k.find_path('NCBI Gene', 'PTM Object', display_graph=False)

Path 0: [{'endpoint': 'http://mygene.info/v3/gene/geneid', 'input': 'NCBI Gene', 'output': 'UniProt Knowledgebase'}, {'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM', 'input': 'UniProt Knowledgebase', 'output': 'PTM Object'}]


[[{'endpoint': 'http://mygene.info/v3/gene/geneid',
   'input': 'NCBI Gene',
   'output': 'UniProt Knowledgebase'},
  {'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM',
   'input': 'UniProt Knowledgebase',
   'output': 'PTM Object'}]]
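
Each path returned by find_path is a list of steps, and each step is a dictionary with 'endpoint', 'input', and 'output' keys. The sketch below copies the path literal from the output above and renders it as a readable chain. The helper name `describe_path` is hypothetical and not part of pathViewer, and the code makes no API calls.

```python
# Path literal copied from the find_path output above.
path = [
    {'endpoint': 'http://mygene.info/v3/gene/geneid',
     'input': 'NCBI Gene',
     'output': 'UniProt Knowledgebase'},
    {'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM',
     'input': 'UniProt Knowledgebase',
     'output': 'PTM Object'},
]


def describe_path(steps):
    """Hypothetical helper: render a path as 'A -> B -> C' after checking that the steps chain."""
    for prev, nxt in zip(steps, steps[1:]):
        # The output type of one step must be the input type of the next.
        if prev['output'] != nxt['input']:
            raise ValueError('steps do not chain: %s -> %s' % (prev['output'], nxt['input']))
    return ' -> '.join([steps[0]['input']] + [step['output'] for step in steps])


chain = describe_path(path)
print(chain)
```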

#### Find output for single input
Use the **find_output** function to retrieve the output for a single input along the selected path.

In [12]:
# 675 is an NCBI Gene ID from the FA gene list
# find_output takes two main parameters:
# path: the path to follow, selected from the results of 'find_path' (here the first path, k.paths[0])
# value: the input value to start from
k.find_output(path=k.paths[0], value='675', display_graph=False)
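
Conceptually, find_output feeds the input value into the first endpoint of the path and passes each step's output on as the next step's input. The sketch below illustrates that chaining with stub functions in place of real API calls. `run_path`, `stub_gene_to_uniprot`, and `stub_uniprot_to_ptm` are hypothetical names, not part of pathViewer. The stub data is limited to one PTM record shown in this notebook plus the UniProt accession P51587 for human BRCA2, which is an added assumption.

```python
def stub_gene_to_uniprot(gene_id):
    """Stand-in for the mygene.info step: NCBI Gene ID -> UniProt accession."""
    return {'675': 'P51587'}[gene_id]


def stub_uniprot_to_ptm(accession):
    """Stand-in for the EBI Proteins step: UniProt accession -> PTM features."""
    return {'P51587': [{'begin': '70', 'end': '70', 'category': 'PTM',
                        'description': 'Phosphoserine', 'type': 'MOD_RES'}]}[accession]


def run_path(steps, value):
    """Apply each step in order, feeding one step's output into the next."""
    for step in steps:
        value = step(value)
    return value


ptms = run_path([stub_gene_to_uniprot, stub_uniprot_to_ptm], '675')
print(ptms[0]['description'], ptms[0]['begin'])
```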

#### Print result summary

In [13]:
k.result_summary()

Your exploration starts from NCBI Gene: 675. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



#### Explore results
The final results are stored in a dictionary whose keys are the inputs and whose values are the corresponding final outputs.

In [14]:
k.final_results

{'675': [{'begin': '70',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '70',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '445',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '445',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '492',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '492',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih

#### Explore a list of inputs
The FA gene list contains 27 gene IDs (see the printed list below); all of them can be passed to **find_output** together as a list.

In [15]:
import pandas as pd

def get_fa_gene_list():
    """Return a dict mapping NCBI Gene IDs to gene symbols, read from the FA gene list on GitHub."""
    gene_list_url = 'https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_4_all_genes.txt'
    # the file has no header row: column 0 holds 'NCBIGene:<id>', column 1 holds the gene symbol
    gl = pd.read_table(gene_list_url, header=None)
    gene_ids = [_gene.replace('NCBIGene:', '') for _gene in gl[0].values.tolist()]
    gene_symbols = gl[1].values.tolist()
    return dict(zip(gene_ids, gene_symbols))

gene_dict = get_fa_gene_list()
gene_list = list(gene_dict.keys())

print(gene_list)

['2187', '2178', '55215', '5888', '675', '378708', '199990', '5889', '2177', '2175', '55120', '84464', '201254', '2189', '7516', '2072', '91442', '79728', '29089', '80233', '55159', '672', '2176', '83990', '57697', '2188', '10459']
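
The helper above assumes the gene list is a tab-separated file whose first column holds prefixed IDs such as NCBIGene:675 and whose second column holds gene symbols. The sketch below shows the same parsing on a small inline sample using only the standard library. The three sample rows use IDs and symbols that appear in this notebook. The exact layout of the real file is an assumption, and `parse_gene_list` is a hypothetical name.

```python
import csv
import io

# Inline sample standing in for the remote file (assumed layout: ID <tab> symbol).
sample = "NCBIGene:2178\tFANCE\nNCBIGene:5888\tRAD51\nNCBIGene:675\tBRCA2\n"


def parse_gene_list(text):
    """Map bare NCBI Gene IDs (prefix stripped) to gene symbols."""
    reader = csv.reader(io.StringIO(text), delimiter='\t')
    return {row[0].replace('NCBIGene:', ''): row[1] for row in reader if row}


sample_dict = parse_gene_list(sample)
print(sample_dict)
```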


In [25]:
k.find_output(path=k.paths[0], value=gene_list, display_graph=False)

In [26]:
k.result_summary()

Your exploration starts from NCBI Gene: ['2187', '2178', '55215', '5888', '675', '378708', '199990', '5889', '2177', '2175', '55120', '84464', '201254', '2189', '7516', '2072', '91442', '79728', '29089', '80233', '55159', '672', '2176', '83990', '57697', '2188', '10459']. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



Each key in final_results represents one input (e.g. an NCBI Gene ID).

In [27]:
results = k.final_results

#### Retrieve output for a specific NCBI Gene ID in the FA Gene List

In [28]:
results['199990']

[{'begin': '113',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '113',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'},
 {'begin': '137',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '137',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'}]
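
Each record carries the residue position ('begin' and 'end'), the modification name ('description'), and its supporting evidence. A minimal sketch of pulling out just the fields needed for a summary follows, using a trimmed copy of the two records above. The helper name `summarize_ptms` is hypothetical.

```python
# Trimmed copy of the two records returned for gene 199990 above.
records = [
    {'begin': '113', 'end': '113', 'category': 'PTM', 'description': 'Phosphoserine',
     'evidences': [{'code': 'ECO:0000244',
                    'source': {'id': '23186163', 'name': 'PubMed'}}],
     'type': 'MOD_RES'},
    {'begin': '137', 'end': '137', 'category': 'PTM', 'description': 'Phosphoserine',
     'evidences': [{'code': 'ECO:0000244',
                    'source': {'id': '23186163', 'name': 'PubMed'}}],
     'type': 'MOD_RES'},
]


def summarize_ptms(ptm_records):
    """Return one (position, description, [PubMed IDs]) tuple per PTM record."""
    summary = []
    for rec in ptm_records:
        pmids = [ev['source']['id']
                 for ev in rec.get('evidences', [])
                 if ev.get('source', {}).get('name') == 'PubMed']
        summary.append((int(rec['begin']), rec['description'], pmids))
    return summary


summary = summarize_ptms(records)
for position, description, pmids in summary:
    print(position, description, pmids)
```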

#### Organize data

In [29]:
import requests

def get_label_for_eco_code(eco):
    """Return the preferred label for an ECO code (e.g. 'ECO:0000244') from BioPortal, or None if the lookup fails."""
    url_template = 'http://data.bioontology.org/ontologies/ECO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F{{input}}'
    request_url = url_template.replace('{{input}}', eco.replace(':', '_'))
    r = requests.get(request_url, headers={"Content-Type": "application/json"}, params={'apikey': 'c97ec130-f6ee-45e2-91fd-46e34135051f'})
    if r.status_code == 200:
        return r.json().get('prefLabel')
    return None


def rearrange_dict(ptm_dict):
    """Flatten {gene_id: [PTM records]} into {row_index: record} for loading into a DataFrame.

    Note: the PTM records in ptm_dict are modified in place, so call this only once per result set.
    """
    new_dict = {}
    i = 0
    eco_label_dict = {}  # cache so that each ECO code is looked up only once
    for gene_id, ptm_info in ptm_dict.items():
        if ptm_info:
            for _info in ptm_info:
                _info.update({'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]})
                new_evidence = ''
                if 'evidences' in _info:
                    for _evidence in _info['evidences']:
                        code = _evidence['code']
                        if code not in eco_label_dict:
                            eco_label_dict[code] = get_label_for_eco_code(code)
                        # fall back to a placeholder when the label lookup failed
                        eco_label = eco_label_dict[code] or 'label unavailable'
                        new_evidence += code + ' - ' + eco_label + '\n'
                        if 'source' in _evidence and 'id' in _evidence['source']:
                            new_evidence += 'PMID:' + _evidence['source']['id'] + '\n\n'
                _info.update({'evidences': new_evidence})
                if 'description' in _info:
                    _info['PTM_Type'] = _info.pop('description')
                new_dict.update({i: _info})
                i += 1
        else:
            # genes without PTM records still get a row holding only their identifiers
            new_dict.update({i: {'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]}})
            i += 1
    return new_dict
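
rearrange_dict keeps a dictionary of ECO labels so that each evidence code is looked up only once. The same effect can be obtained with functools.lru_cache from the standard library. The sketch below shows the idea with a stub in place of the network call. `fetch_label_stub` is a hypothetical name, and its return values are placeholders rather than real ECO labels.

```python
from functools import lru_cache

calls = []  # records every uncached lookup, for demonstration only


@lru_cache(maxsize=None)
def fetch_label_stub(eco_code):
    """Stand-in for a remote ECO label lookup; results are cached per code."""
    calls.append(eco_code)
    return 'label for ' + eco_code


codes = ['ECO:0000244', 'ECO:0000269', 'ECO:0000244', 'ECO:0000244']
labels = [fetch_label_stub(code) for code in codes]
print(len(codes), 'lookups requested,', len(calls), 'actually performed')
```

With a real lookup function, the decorator removes the need to thread a cache dictionary through the loop, at the cost of caching failed lookups as well unless they raise an exception.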

In [30]:
new_dict = rearrange_dict(k.final_results)

In [31]:
df = pd.DataFrame.from_dict(new_dict, orient="index")

In [32]:
df

|  | evidences | begin | end | PTM_Type | category | NCBI Gene ID | Gene Symbol | type |
|---|---|---|---|---|---|---|---|---|
| 0 | ECO:0000244 - combinatorial evidence used in m... | 2 | 2 | N-acetylthreonine | PTM | 2187 | FANCB | MOD_RES |
| 1 | ECO:0000244 - combinatorial evidence used in m... | 249 | 249 | Phosphoserine | PTM | 2178 | FANCE | MOD_RES |
| 2 | ECO:0000269 - experimental evidence used in ma... | 346 | 346 | Phosphothreonine; by CHEK1 | PTM | 2178 | FANCE | MOD_RES |
| 3 | ECO:0000269 - experimental evidence used in ma... | 374 | 374 | Phosphoserine; by CHEK1 | PTM | 2178 | FANCE | MOD_RES |
| 4 | ECO:0000244 - combinatorial evidence used in m... | 2 | 2 | N-acetylalanine | PTM | 5888 | RAD51 | MOD_RES |
| 5 | ECO:0000269 - experimental evidence used in ma... | 54 | 54 | Phosphotyrosine; by ABL1 | PTM | 5888 | RAD51 | MOD_RES |
| 6 | ECO:0000269 - experimental evidence used in ma... | 309 | 309 | Phosphothreonine; by CHEK1 | PTM | 5888 | RAD51 | MOD_RES |
| 7 | ECO:0000269 - experimental evidence used in ma... | 58 | 58 | Glycyl lysine isopeptide (Lys-Gly) (interchain... | PTM | 5888 | RAD51 | CROSSLNK |
| 8 | ECO:0000269 - experimental evidence used in ma... | 64 | 64 | Glycyl lysine isopeptide (Lys-Gly) (interchain... | PTM | 5888 | RAD51 | CROSSLNK |
| 9 | ECO:0000244 - combinatorial evidence used in m... | 70 | 70 | Phosphoserine | PTM | 675 | BRCA2 | MOD_RES |


In [34]:
# output final results to csv file
df.to_csv('ptm_final_results.csv')
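
Once the table is written to CSV it can be summarized without pandas. The sketch below counts PTM rows per gene symbol. To stay self-contained it reads an inline excerpt rather than ptm_final_results.csv: only two columns are kept, and only the eight displayed rows whose PTM names are not truncated are included.

```python
import csv
import io
from collections import Counter

# Reduced excerpt of the rows displayed above (Gene Symbol and PTM_Type only).
excerpt = """Gene Symbol,PTM_Type
FANCB,N-acetylthreonine
FANCE,Phosphoserine
FANCE,Phosphothreonine; by CHEK1
FANCE,Phosphoserine; by CHEK1
RAD51,N-acetylalanine
RAD51,Phosphotyrosine; by ABL1
RAD51,Phosphothreonine; by CHEK1
BRCA2,Phosphoserine
"""

# DictReader uses the first line as column names, so rows can be read by header.
rows = csv.DictReader(io.StringIO(excerpt))
ptm_counts = Counter(row['Gene Symbol'] for row in rows)

for symbol, count in ptm_counts.most_common():
    print(symbol, count)
```

To run the same count on the real file, replace the `io.StringIO(excerpt)` argument with an open file handle for ptm_final_results.csv.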

The final output file can be viewed or downloaded at https://github.com/NCATS-Tangerine/cq-notebooks/blob/master/Orange_QB1_Benchmark_CQs/QB1.4_Post_Trans_Mod/final_results/ptm_final_results.csv