## Orange QB1.4

### Query:
Return sequences of FA genes associated with functional post-translational modifications.

In [1]:
# Import BioThings Explorer (currently a local module; it will become an independent Python package)
from visual import pathViewer

In [2]:
# pathViewer is a Python class for graphically displaying API connection maps and exploring bio-entity relationships
k = pathViewer()

#### Show All Available IDs
Call the **available_ids** function to retrieve all IDs used in BioThings Explorer, along with their descriptions.

In [3]:
k.available_ids()

| Preferred Name | URI | Description | Identifier pattern | Type |
| --- | --- | --- | --- | --- |
| Molecular Processing Object | http://biothings.io/concepts/molecular_processing/ | | | Object |
| ClinicalTrials.gov | http://identifiers.org/clinicaltrials/ | ClinicalTrials.gov provides free access to information on clinical studies for a wide range of diseases and conditions. Studies listed in the database are conducted in 175 countries | `^NCT\d{8}$` | Entity |
| KEGG Pathway | http://identifiers.org/kegg.pathway/ | KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks. | `^\w{2,4}\d{5}$` | Entity |
| ENSEMBL TRANSLATION ID | http://identifiers.org/ensembl.translation/ | | | Entity |
| PubChem-compound | http://identifiers.org/pubchem.compound/ | PubChem provides information on the biological activities of small molecules. It is a component of NIH's Molecular Libraries Roadmap Initiative. PubChem Compound archives chemical structures and records. | `^\d+$` | Entity |
| ClinVar Variant | http://identifiers.org/clinvar/ | ClinVar archives reports of relationships among medically important variants and phenotypes. It records human variation, interpretations of the relationship specific variations to human health, and supporting evidence for each interpretation. Each ClinVar record (RCV identifier) represents an aggregated view of interpretations of the same variation and condition from one or more submitters. Submissions for individual variation/phenotype combinations (SCV identifier) are also collected and made available separately. This collection references the Variant identifier. | `^\d+$` | Entity |
| ENSEMBL TRANSCRIPT ID | http://identifiers.org/ensembl.transcript/ | | | Entity |
| TOPLOGY Object | http://biothings.io/concepts/topology/ | | | Object |
| KEGG Drug | http://identifiers.org/kegg.drug/ | KEGG DRUG contains chemical structures of drugs and additional information such as therapeutic categories and target molecules. | `^D\d+$` | Entity |
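
The `Identifier pattern` column holds regular expressions, so a candidate ID can be checked before it is sent to an endpoint. The following is a minimal sketch and not part of BioThings Explorer: the `PATTERNS` dictionary and the `matches_pattern` helper are hypothetical names, the patterns are copied from the listing above, and the sample IDs are only illustrative strings chosen to fit (or not fit) those patterns.

```python
import re

# Patterns copied from the "Identifier pattern" column above.
PATTERNS = {
    "ClinicalTrials.gov": r"^NCT\d{8}$",
    "KEGG Pathway": r"^\w{2,4}\d{5}$",
    "KEGG Drug": r"^D\d+$",
}


def matches_pattern(id_type, value):
    """Return True if value fits the pattern registered for id_type (hypothetical helper)."""
    return re.match(PATTERNS[id_type], value) is not None


print(matches_pattern("ClinicalTrials.gov", "NCT00000102"))
print(matches_pattern("KEGG Drug", "675"))
```

A check like this would catch, for example, an NCBI Gene ID passed where a KEGG Drug ID is expected.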


#### Show How APIs/Endpoints/Bio-Entities can be connected together
The graph is **interactive**.

In [4]:
k.show_api_road_map()

#### Find Path
The CQ above asks for the path connecting **NCBI Gene IDs** and **PTM info**.
The **find_path** function shows how two different biological entities/concepts can be connected through API endpoints.

In [5]:
k.find_path('NCBI Gene', 'PTM Object')

Path 0: [{'input': 'NCBI Gene', 'output': 'UniProt Knowledgebase', 'endpoint': 'http://mygene.info/v3/gene/geneid'}, {'input': 'UniProt Knowledgebase', 'output': 'PTM Object', 'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM'}]
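
One way to picture what `find_path` does is a breadth-first search over a graph whose nodes are ID types and whose edges are API endpoints. The sketch below only illustrates that idea; the source does not show how `find_path` is actually implemented. `ENDPOINTS` and `find_path_bfs` are hypothetical names, and the two edges are taken from Path 0 above.

```python
from collections import deque

# Each edge is (input type, output type, endpoint), taken from Path 0 above.
ENDPOINTS = [
    ("NCBI Gene", "UniProt Knowledgebase", "http://mygene.info/v3/gene/geneid"),
    ("UniProt Knowledgebase", "PTM Object",
     "http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM"),
]


def find_path_bfs(start, goal, endpoints=ENDPOINTS):
    """Return the shortest list of endpoint steps from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for src, dst, url in endpoints:
            if src == node and dst not in seen:
                seen.add(dst)
                step = {"input": src, "output": dst, "endpoint": url}
                queue.append((dst, steps + [step]))
    return None


path = find_path_bfs("NCBI Gene", "PTM Object")
print([step["output"] for step in path])
```

Because the edges are directed, searching from `'PTM Object'` back to `'NCBI Gene'` finds nothing in this toy graph.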


#### Find output for single input
Use the **find_output** function to find the output for a single input along the selected path.

In [6]:
# 675 is an NCBI Gene ID in the FA Gene List
# find_output takes two parameters:
# path: the path to follow, selected from the results of 'find_path'
# value: the input value to send through that path
k.find_output(path=k.paths[0], value='675')
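
Conceptually, following a path means feeding the output of each step into the next step. The sketch below shows that chaining with a stand-in `fetch` function so that it runs offline. `follow_path`, `fake_fetch`, and the placeholder accession `ACC_FOR_675` are hypothetical and do not reflect how `find_output` is implemented.

```python
def follow_path(path, value, fetch):
    """Pass value through each step of path; fetch(step, v) returns a list of outputs."""
    current = [value]
    for step in path:
        next_values = []
        for v in current:
            next_values.extend(fetch(step, v))
        current = next_values
    return {value: current}


# Stand-in for the HTTP calls: canned answers keyed by (output type, input value).
CANNED = {
    ("UniProt Knowledgebase", "675"): ["ACC_FOR_675"],  # placeholder, not a real accession
    ("PTM Object", "ACC_FOR_675"): [{"description": "Phosphoserine", "begin": "70"}],
}


def fake_fetch(step, v):
    """Look up a canned response instead of calling a real endpoint."""
    return CANNED.get((step["output"], v), [])


demo_path = [
    {"input": "NCBI Gene", "output": "UniProt Knowledgebase"},
    {"input": "UniProt Knowledgebase", "output": "PTM Object"},
]
result = follow_path(demo_path, "675", fake_fetch)
print(result)
```

Keying the result by the original input mirrors the shape of `final_results`, where each input maps to its final output.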

#### Print result summary

In [8]:
k.result_summary()

Your exploration starts from NCBI Gene: 675. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



#### Explore results
The final results are stored in a dictionary whose keys are the inputs and whose values are the final outputs.

In [9]:
k.final_results

{'675': [{'begin': '70',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '70',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '445',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '445',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '492',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '492',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih

#### Explore a list of inputs
The FA Gene List contains 27 genes (see the printed list below). All of them can be passed to the **find_output** function at once.

In [10]:
import pandas as pd


def get_fa_gene_list():
    """Return a dict mapping NCBI Gene ID to gene symbol, read from the FA gene list on GitHub."""
    gene_list_url = 'https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_4_all_genes.txt'
    # No header row: column 0 holds 'NCBIGene:<id>', column 1 holds the gene symbol
    gl = pd.read_table(gene_list_url, header=None)
    gene_ids = [_gene.replace('NCBIGene:', '') for _gene in gl[0].values.tolist()]
    gene_symbols = gl[1].values.tolist()
    return dict(zip(gene_ids, gene_symbols))


gene_dict = get_fa_gene_list()
gene_list = list(gene_dict.keys())

print(gene_list)

['201254', '2188', '5888', '80233', '2176', '2072', '672', '2187', '2175', '199990', '10459', '7516', '2177', '55215', '2189', '29089', '57697', '675', '79728', '378708', '5889', '55120', '55159', '83990', '84464', '91442', '2178']
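
Because `get_fa_gene_list` depends on a network download, it can help to check the parsing logic against an in-memory sample first. The sketch below assumes the file has two tab-separated columns (an `NCBIGene:`-prefixed ID and a gene symbol), which is what the `pd.read_table` call implies. `parse_gene_lines` and `SAMPLE` are hypothetical, and the ID/symbol pairs are taken from outputs elsewhere in this notebook.

```python
import csv
import io

# Hypothetical three-line excerpt in the assumed tab-separated layout.
SAMPLE = "NCBIGene:201254\tCENPX\nNCBIGene:2188\tFANCF\nNCBIGene:199990\tFAAP20\n"


def parse_gene_lines(text):
    """Map NCBI Gene ID (prefix stripped) to gene symbol."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0].replace("NCBIGene:", ""): row[1] for row in reader if row}


parsed = parse_gene_lines(SAMPLE)
print(parsed)
```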


In [11]:
k.find_output(path=k.paths[0], value=gene_list)

In [12]:
k.result_summary()

Your exploration starts from NCBI Gene: ['201254', '2188', '5888', '80233', '2176', '2072', '672', '2187', '2175', '199990', '10459', '7516', '2177', '55215', '2189', '29089', '57697', '675', '79728', '378708', '5889', '55120', '55159', '83990', '84464', '91442', '2178']. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



Each key in final_results represents one input (e.g. an NCBI Gene ID).

In [13]:
k.final_results

{'10459': [],
 '199990': [{'begin': '113',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '113',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '137',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '137',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'}],
 '201254': [{'begin': '1',
   'category': 'PTM',
   'description': 'N-acetylmethionine',
   'end': '1',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/22814378',
      'id': '22814378',
      'name': 'PubMed',
      

#### Retrieve output for a specific NCBI Gene ID in the FA Gene List

In [14]:
k.final_results['199990']

[{'begin': '113',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '113',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'},
 {'begin': '137',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '137',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'}]
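
Since each value is a list of feature records, a short comprehension is enough to pull out the modification sites per gene. The sketch below uses a trimmed copy of the output above (evidence fields omitted) together with the empty entry for `'10459'` shown earlier. `summarize_ptms` and `sample_results` are hypothetical names.

```python
# Trimmed copy of two entries from final_results (evidence fields omitted).
sample_results = {
    "10459": [],
    "199990": [
        {"begin": "113", "end": "113", "description": "Phosphoserine", "type": "MOD_RES"},
        {"begin": "137", "end": "137", "description": "Phosphoserine", "type": "MOD_RES"},
    ],
}


def summarize_ptms(results):
    """Return {gene_id: [(position, description), ...]} with positions as integers."""
    return {
        gene_id: [(int(rec["begin"]), rec["description"]) for rec in records]
        for gene_id, records in results.items()
    }


summary = summarize_ptms(sample_results)
genes_without_ptms = [gene_id for gene_id, sites in summary.items() if not sites]
print(summary["199990"])
print(genes_without_ptms)
```

Converting `begin` to an integer makes it possible to sort sites by position, which the string values returned by the API would not do correctly.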

#### Organize data

In [15]:
import requests


def get_label_for_eco_code(eco):
    """Return the preferred label for an ECO code (e.g. 'ECO:0000244') from BioPortal, or None on failure."""
    url_template = 'http://data.bioontology.org/ontologies/ECO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F{{input}}'
    request_url = url_template.replace('{{input}}', eco.replace(':', '_'))
    r = requests.get(request_url, headers={"Content-Type": "application/json"}, params={'apikey': 'c97ec130-f6ee-45e2-91fd-46e34135051f'})
    if r.status_code == 200:
        return r.json()['prefLabel']
    return None


def rearrange_dict(ptm_dict):
    """Flatten {gene_id: [ptm_record, ...]} into {row_index: record}, one row per PTM record."""
    new_dict = {}
    i = 0
    eco_label_dict = {}  # cache so that each ECO code is looked up only once
    for gene_id, ptm_info in ptm_dict.items():
        if ptm_info:
            for _info in ptm_info:
                # Work on a copy so the original results are not modified and the cell can be re-run
                _info = dict(_info)
                _info.update({'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]})
                new_evidence = ''
                for _evidence in _info.get('evidences', []):
                    code = _evidence['code']
                    if code not in eco_label_dict:
                        print(code)
                        # Fall back to a fixed string if the lookup fails, so the concatenation below cannot raise
                        eco_label_dict[code] = get_label_for_eco_code(code) or 'label unavailable'
                    new_evidence += code + ' - ' + eco_label_dict[code] + '\n'
                    if 'source' in _evidence and 'id' in _evidence['source']:
                        new_evidence += 'PMID:' + _evidence['source']['id'] + '\n\n'
                _info.update({'evidences': new_evidence})
                new_dict[i] = _info
                i += 1
        else:
            # Keep genes without PTM records as rows that carry only the gene identifiers
            new_dict[i] = {'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]}
            i += 1
    return new_dict

In [16]:
%pdb
new_dict = rearrange_dict(k.final_results)

Automatic pdb calling has been turned ON
ECO:0000244
ECO:0000250
ECO:0000269
ECO:0000305
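
Each ECO code is printed only once because `rearrange_dict` stores labels in `eco_label_dict` after the first lookup. The same memoization can be expressed with `functools.lru_cache`. The sketch below is an alternative, not the notebook's method: `lookup_label` is a hypothetical stand-in that reads from a hard-coded table instead of calling BioPortal, so it runs offline, and the single label in `_LABELS` is copied from the output in this notebook.

```python
from functools import lru_cache

CALLS = []  # records which codes actually triggered a lookup

# Stand-in for the remote ontology service.
_LABELS = {"ECO:0000244": "combinatorial evidence used in manual assertion"}


@lru_cache(maxsize=None)
def lookup_label(eco_code):
    """Return the label for eco_code, or 'unknown'; cached after the first call."""
    CALLS.append(eco_code)
    return _LABELS.get(eco_code, "unknown")


for code in ["ECO:0000244", "ECO:0000244", "ECO:0000244"]:
    label = lookup_label(code)

print(label)
print(len(CALLS))
```

With many PTM records sharing a handful of evidence codes, caching keeps the number of remote requests equal to the number of distinct codes rather than the number of records.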


In [17]:
new_dict

{0: {'Gene Symbol': 'CENPX',
  'NCBI Gene ID': '201254',
  'begin': '1',
  'category': 'PTM',
  'description': 'N-acetylmethionine',
  'end': '1',
  'evidences': 'ECO:0000244 - combinatorial evidence used in manual assertion\nPMID:22814378\n\n',
  'type': 'MOD_RES'},
 1: {'Gene Symbol': 'FANCF', 'NCBI Gene ID': '2188'},
 2: {'Gene Symbol': 'FAAP20',
  'NCBI Gene ID': '199990',
  'begin': '113',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '113',
  'evidences': 'ECO:0000244 - combinatorial evidence used in manual assertion\nPMID:23186163\n\n',
  'type': 'MOD_RES'},
 3: {'Gene Symbol': 'FAAP20',
  'NCBI Gene ID': '199990',
  'begin': '137',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '137',
  'evidences': 'ECO:0000244 - combinatorial evidence used in manual assertion\nPMID:23186163\n\n',
  'type': 'MOD_RES'},
 4: {'Gene Symbol': 'FAAP100',
  'NCBI Gene ID': '80233',
  'begin': '667',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '6

In [18]:
df = pd.DataFrame.from_dict(new_dict, orient="index")

In [19]:
df

| | category | begin | type | evidences | Gene Symbol | NCBI Gene ID | end | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | PTM | 1 | MOD_RES | ECO:0000244 - combinatorial evidence used in m... | CENPX | 201254 | 1 | N-acetylmethionine |
| 1 | | | | | FANCF | 2188 | | |
| 2 | PTM | 113 | MOD_RES | ECO:0000244 - combinatorial evidence used in m... | FAAP20 | 199990 | 113 | Phosphoserine |
| 3 | PTM | 137 | MOD_RES | ECO:0000244 - combinatorial evidence used in m... | FAAP20 | 199990 | 137 | Phosphoserine |
| 4 | PTM | 667 | MOD_RES | ECO:0000244 - combinatorial evidence used in m... | FAAP100 | 80233 | 667 | Phosphoserine |
| 5 | | | | | FANCC | 2176 | | |
| 6 | PTM | 289 | MOD_RES | ECO:0000244 - combinatorial evidence used in m... | ERCC4 | 2072 | 289 | N6-acetyllysine |
| 7 | PTM | 521 | MOD_RES | ECO:0000250 - sequence similarity evidence use... | ERCC4 | 2072 | 521 | Phosphoserine |
| 8 | PTM | 764 | MOD_RES | ECO:0000250 - sequence similarity evidence use... | ERCC4 | 2072 | 764 | Phosphoserine |
| 9 | PTM | 500 | CROSSLNK | ECO:0000244 - combinatorial evidence used in m... | ERCC4 | 2072 | 500 | Glycyl lysine isopeptide (Lys-Gly) (interchain... |


In [20]:
df.to_csv('ptm_results1.csv')
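
Genes without any PTM record end up as rows whose feature columns are empty. The following is a minimal sketch of filtering those rows out when reading the export back, using only the standard library. The in-memory `CSV_TEXT` is a hypothetical stand-in for `ptm_results1.csv`: it holds three rows shaped like the table above, with the evidences column left out for brevity.

```python
import csv
import io

# Stand-in for the exported file; the first (unnamed) column is the row index.
CSV_TEXT = (
    ",category,begin,type,Gene Symbol,NCBI Gene ID,end,description\n"
    "0,PTM,1,MOD_RES,CENPX,201254,1,N-acetylmethionine\n"
    "1,,,,FANCF,2188,,\n"
    "2,PTM,113,MOD_RES,FAAP20,199990,113,Phosphoserine\n"
)

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
# Keep only rows that describe an actual PTM feature.
with_ptm = [row for row in rows if row["category"] == "PTM"]
symbols = sorted({row["Gene Symbol"] for row in with_ptm})
print(symbols)
```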