## Orange QB1.4

### Query:
Return sequences of FA genes associated with functional post-translational modifications.

In [1]:
# Import BioThings Explorer (currently a local module; it will be released as an independent Python package)
from visual import pathViewer

In [2]:
# pathViewer is a Python class for graphically displaying API connection maps and exploring bio-entity relationships
k = pathViewer()

#### Show All Available IDs
Users can call the **available_ids** function to retrieve all IDs used in BioThings Explorer, together with their descriptions.

In [3]:
k.available_ids()

Preferred Name,URI,Description,Identifier pattern,Type
Experimental Factor Ontology,http://identifiers.org/efo/,"The Experimental Factor Ontology (EFO) provides a systematic description of many experimental variables available in EBI databases. It combines parts of several biological ontologies, such as anatomy, disease and chemical compounds. The scope of EFO is to support the annotation, analysis and visualization of data handled by the EBI Functional Genomics Team.",^\d{7}$,Entity
CLINICAL SIGNIFICANCE,http://identifiers.org/clinicalsignificance/,,,Entity
TOPLOGY Object,http://biothings.io/concepts/topology/,,,Object
HGVS ID,http://identifiers.org/hgvs/,,,Entity
UNII,http://identifiers.org/unii/,"The purpose of the joint FDA/USP Substance Registration System (SRS) is to support health information technology initiatives by generating unique ingredient identifiers (UNIIs) for substances in drugs, biologics, foods, and devices. The UNII is a non-proprietary, free, unique, unambiguous, non semantic, alphanumeric identifier based on a substance's molecular structure and/or descriptive information.",^[A-Z0-9]+$,Entity
SNOMED CT,http://identifiers.org/snomedct/,"SNOMED CT (Systematized Nomenclature of Medicine -- Clinical Terms), is a systematically organized computer processable collection of medical terminology covering most areas of clinical information such as diseases, findings, procedures, microorganisms, pharmaceuticals, etc.",^(\w+)?\d+$,Entity
PharmGKB Pathways,http://identifiers.org/pharmgkb.pathways/,"The PharmGKB database is a central repository for genetic, genomic, molecular and cellular phenotype data and clinical information about people who have participated in pharmacogenomics research studies. The data includes, but is not limited to, clinical and basic pharmacokinetic and pharmacogenomic research in the cardiovascular, pulmonary, cancer, pathways, metabolic and transporter domains. PharmGKB Pathways are drug centric, gene based, interactive pathways which focus on candidate genes and gene groups and associated genotype and phenotype data of relevance for pharmacogenetic and pharmacogenomic studies.",^PA\d+$,Entity
DrugBank,http://identifiers.org/drugbank/,"The DrugBank database is a bioinformatics and chemoinformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. This collection references drug information.",^DB\d{5}$,Entity
PubMed,http://identifiers.org/pubmed/,PubMed is a service of the U.S. National Library of Medicine that includes citations from MEDLINE and other life science journals for biomedical articles back to the 1950s.,^\d+$,Entity


#### Show How APIs/Endpoints/Bio-Entities can be connected together
The graph is **interactive**.

In [4]:
k.show_api_road_map()

#### Find Path
The CQ above asks for a path connecting **NCBI Gene IDs** to **PTM info**.
The **find_path** function shows how two different biological entities/concepts can be connected through API endpoints.

In [5]:
k.find_path('NCBI Gene', 'PTM Object')

Path 0: [{'output': 'UniProt Knowledgebase', 'endpoint': 'http://mygene.info/v3/gene/geneid', 'input': 'NCBI Gene'}, {'output': 'PTM Object', 'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM', 'input': 'UniProt Knowledgebase'}]
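
The path search can be pictured as a walk over a graph whose edges are API endpoints. The following is a minimal sketch of that idea, not the actual pathViewer implementation: `EDGES` and `find_paths` are hypothetical names, and only the two edges shown come from the output above.

```python
from collections import deque

# Hypothetical edge list: each edge is one API endpoint that maps an input
# ID type to an output type. The two edges are copied from the find_path
# output above; the data structure itself is an assumption.
EDGES = [
    {'input': 'NCBI Gene',
     'endpoint': 'http://mygene.info/v3/gene/geneid',
     'output': 'UniProt Knowledgebase'},
    {'input': 'UniProt Knowledgebase',
     'endpoint': 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM',
     'output': 'PTM Object'},
]

def find_paths(edges, start, goal):
    """Return every loop-free chain of edges leading from start to goal."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            paths.append(path)
            continue
        # Types already on this path must not be revisited (avoids cycles)
        visited = {start} | {edge['output'] for edge in path}
        for edge in edges:
            if edge['input'] == node and edge['output'] not in visited:
                queue.append((edge['output'], path + [edge]))
    return paths

paths = find_paths(EDGES, 'NCBI Gene', 'PTM Object')
for i, path in enumerate(paths):
    print('Path {}: {}'.format(i, path))
```

Because the search is breadth-first, shorter paths are listed first, which is one plausible reason to treat path 0 as the default choice.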


#### Find output for single input
Users can call the **find_output** function to retrieve the output for a single input along the selected path.

In [6]:
# 675 is an NCBI Gene ID in the FA gene list
# find_output takes two parameters:
# path: the path to follow, selected by its index in the results of 'find_path'
# value: the input value to send through the path
k.find_output(path=k.paths[0], value='675')
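
Conceptually, following a path means feeding the outputs of each endpoint into the next one. The sketch below illustrates this under stated assumptions: `run_path`, `run_path_for_inputs`, and the `fetch` callback are hypothetical and do not describe how pathViewer is implemented, and the canned responses (including the accession `ACC1`) are stand-ins rather than real API results.

```python
def run_path(path, value, fetch):
    """Feed each step's outputs into the next step and return the final list.

    fetch(endpoint, value) is a caller-supplied (hypothetical) function that
    performs the request and returns a list of output values.
    """
    values = [value]
    for step in path:
        next_values = []
        for v in values:
            next_values.extend(fetch(step['endpoint'], v))
        values = next_values
    return values

def run_path_for_inputs(path, inputs, fetch):
    """Return one key per input value, the same shape as final_results."""
    if isinstance(inputs, str):
        inputs = [inputs]
    return {value: run_path(path, value, fetch) for value in inputs}

GENE_ENDPOINT = 'http://mygene.info/v3/gene/geneid'
PTM_ENDPOINT = 'http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM'
path = [
    {'input': 'NCBI Gene', 'endpoint': GENE_ENDPOINT, 'output': 'UniProt Knowledgebase'},
    {'input': 'UniProt Knowledgebase', 'endpoint': PTM_ENDPOINT, 'output': 'PTM Object'},
]

# Canned stand-in responses so the sketch runs offline; 'ACC1' is a placeholder
CANNED = {
    (GENE_ENDPOINT, '675'): ['ACC1'],
    (PTM_ENDPOINT, 'ACC1'): [{'begin': '70', 'description': 'Phosphoserine'}],
}

def fake_fetch(endpoint, value):
    return CANNED.get((endpoint, value), [])

results = run_path_for_inputs(path, ['675', '10459'], fake_fetch)
print(results)
```

An input that yields nothing at any step ends up with an empty list, which matches how genes without PTM annotations appear in the results.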

#### Print result summary

In [7]:
k.result_summary()

Your exploration starts from NCBI Gene: 675. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



#### Explore results
The final results are stored in a dictionary whose keys are the inputs and whose values are the corresponding final outputs.

In [8]:
k.final_results

{'675': [{'begin': '70',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '70',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '445',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '445',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '492',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '492',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih

#### Explore a list of inputs
The FA gene list loaded below contains 27 NCBI Gene IDs; users can pass all of them to the **find_output** function at once.

In [11]:
import pandas as pd
def get_fa_gene_list():
    """Return a dict mapping NCBI Gene ID to gene symbol, read from a GitHub text file."""
    gene_list_url = 'https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_4_all_genes.txt'
    # No header row: column 0 holds 'NCBIGene:<id>', column 1 holds the gene symbol
    gl = pd.read_table(gene_list_url, header=None)
    gene_ids = [_gene.replace('NCBIGene:', '') for _gene in gl[0].values.tolist()]
    gene_symbols = gl[1].values.tolist()
    return dict(zip(gene_ids, gene_symbols))
gene_dict = get_fa_gene_list()
gene_list = list(gene_dict.keys())

print(gene_list)

['2175', '5888', '7516', '55120', '57697', '83990', '84464', '2188', '378708', '2187', '2189', '2072', '199990', '10459', '79728', '91442', '2176', '201254', '675', '55159', '5889', '2177', '29089', '2178', '672', '80233', '55215']


In [12]:
k.find_output(path=k.paths[0], value=gene_list)

In [13]:
k.result_summary()

Your exploration starts from NCBI Gene: ['2175', '5888', '7516', '55120', '57697', '83990', '84464', '2188', '378708', '2187', '2189', '2072', '199990', '10459', '79728', '91442', '2176', '201254', '675', '55159', '5889', '2177', '29089', '2178', '672', '80233', '55215']. 
 It goes through 2 API Endpoints. 
 The final output comes from API Endpoint http://www.ebi.ac.uk/proteins/api/features/accession?categories=PTM. 
 You can access the final output by calling the 'final_results' object in pathViewer Class.



Each key in final_results represents one input (e.g. an NCBI Gene ID).

In [14]:
k.final_results

{'10459': [],
 '199990': [{'begin': '113',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '113',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'},
  {'begin': '137',
   'category': 'PTM',
   'description': 'Phosphoserine',
   'end': '137',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
      'id': '23186163',
      'name': 'PubMed',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
   'type': 'MOD_RES'}],
 '201254': [{'begin': '1',
   'category': 'PTM',
   'description': 'N-acetylmethionine',
   'end': '1',
   'evidences': [{'code': 'ECO:0000244',
     'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/22814378',
      'id': '22814378',
      'name': 'PubMed',
      

#### Retrieve output for a specific NCBI Gene ID in the FA Gene List

In [15]:
k.final_results['199990']

[{'begin': '113',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '113',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'},
 {'begin': '137',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '137',
  'evidences': [{'code': 'ECO:0000244',
    'source': {'alternativeUrl': 'http://europepmc.org/abstract/MED/23186163',
     'id': '23186163',
     'name': 'PubMed',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23186163'}}],
  'type': 'MOD_RES'}]
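
The nested feature records can be condensed into one line per modification. This is a minimal sketch with a hypothetical helper, `summarize_ptms`; it assumes every feature has a numeric `begin` and a `description`, as in the output above, and the sample data is copied from that output.

```python
def summarize_ptms(features):
    """Return (position, description, sorted PMIDs) for each PTM feature."""
    rows = []
    for feature in features:
        # Collect the distinct PubMed IDs cited as evidence for this feature
        pmids = sorted({
            ev['source']['id']
            for ev in feature.get('evidences', [])
            if ev.get('source', {}).get('name') == 'PubMed'
        })
        rows.append((int(feature['begin']), feature['description'], pmids))
    return sorted(rows, key=lambda row: row[0])

# Sample copied from the output for NCBI Gene ID 199990 shown above
sample = [
    {'begin': '137', 'category': 'PTM', 'description': 'Phosphoserine', 'end': '137',
     'evidences': [{'code': 'ECO:0000244',
                    'source': {'id': '23186163', 'name': 'PubMed'}}],
     'type': 'MOD_RES'},
    {'begin': '113', 'category': 'PTM', 'description': 'Phosphoserine', 'end': '113',
     'evidences': [{'code': 'ECO:0000244',
                    'source': {'id': '23186163', 'name': 'PubMed'}}],
     'type': 'MOD_RES'},
]

summary = summarize_ptms(sample)
for position, description, pmids in summary:
    print(position, description, ', '.join(pmids))
```

With the real results, the same helper could be called as `summarize_ptms(k.final_results['199990'])`.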

#### Organize data

In [16]:
import requests

def get_label_for_eco_code(eco):
    """Return the preferred label for an ECO code (e.g. 'ECO:0000244') from BioPortal."""
    url_template = 'http://data.bioontology.org/ontologies/ECO/classes/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2F{{input}}'
    # The class URL uses an underscore in place of the colon (ECO_0000244)
    request_url = url_template.replace('{{input}}', eco.replace(':', '_'))
    r = requests.get(request_url, headers={"Content-Type": "application/json"}, params={'apikey': 'c97ec130-f6ee-45e2-91fd-46e34135051f'})
    if r.status_code == 200:
        return r.json()['prefLabel']
    # Return a placeholder string so callers can still concatenate the result
    return 'label unavailable'


def rearrange_dict(ptm_dict):
    new_dict = {}
    i = 0
    eco_label_dict = {}
    for gene_id, ptm_info in ptm_dict.items():
        if ptm_info:
            for _info in ptm_info:
                _info.update({'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]})
                new_evidence = ''
                if 'evidences' in _info:
                    for _evidence in _info['evidences']:
                        if _evidence['code'] in eco_label_dict.keys():
                            eco_label = eco_label_dict[_evidence['code']]
                        else:
                            print(_evidence['code'])
                            eco_label = get_label_for_eco_code(_evidence['code'])
                            eco_label_dict.update({_evidence['code']: eco_label})
                        new_evidence += (_evidence['code'] + ' - ' + eco_label + '\n')
                        if 'source' in _evidence and 'id' in _evidence['source']:
                            new_evidence += ('PMID:' + _evidence['source']['id'] + '\n')
                _info.update({'evidences': new_evidence})
                new_dict.update({i: _info})
                i+=1
        else:
            new_dict.update({i: {'NCBI Gene ID': gene_id, 'Gene Symbol': gene_dict[gene_id]}})
            i+=1
    return new_dict
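
The `eco_label_dict` above acts as a hand-written cache so that each ECO code is looked up only once. As an alternative sketch, the same effect can be had with `functools.lru_cache` from the standard library. The names `make_cached_lookup` and `fake_lookup` are hypothetical; `fake_lookup` stands in for the network call so the example runs offline.

```python
from functools import lru_cache

def make_cached_lookup(lookup):
    """Wrap a label lookup so each ECO code is resolved at most once."""
    @lru_cache(maxsize=None)
    def cached(code):
        return lookup(code)
    return cached

calls = []

def fake_lookup(code):
    # Stand-in for get_label_for_eco_code; records each real lookup
    calls.append(code)
    return 'label for ' + code

cached_lookup = make_cached_lookup(fake_lookup)
labels = [cached_lookup(code)
          for code in ['ECO:0000244', 'ECO:0000269', 'ECO:0000244']]
print(labels)
print(calls)  # the repeated code triggers no second lookup
```

In the notebook, `make_cached_lookup(get_label_for_eco_code)` would give a drop-in replacement for the dictionary bookkeeping inside `rearrange_dict`.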

In [17]:
%pdb
new_dict = rearrange_dict(k.final_results)

Automatic pdb calling has been turned ON
ECO:0000244
ECO:0000269
ECO:0000250
ECO:0000305


In [18]:
new_dict

{0: {'Gene Symbol': 'FANCA',
  'NCBI Gene ID': '2175',
  'begin': '1449',
  'category': 'PTM',
  'description': 'Phosphoserine',
  'end': '1449',
  'evidences': 'ECO:0000244 - combinatorial evidence used in manual assertion\nPMID:17525332\nECO:0000244 - combinatorial evidence used in manual assertion\nPMID:18691976\nECO:0000244 - combinatorial evidence used in manual assertion\nPMID:23186163\n',
  'type': 'MOD_RES'},
 1: {'Gene Symbol': 'RAD51',
  'NCBI Gene ID': '5888',
  'begin': '2',
  'category': 'PTM',
  'description': 'N-acetylalanine',
  'end': '2',
  'evidences': 'ECO:0000244 - combinatorial evidence used in manual assertion\nPMID:22814378\n',
  'type': 'MOD_RES'},
 2: {'Gene Symbol': 'RAD51',
  'NCBI Gene ID': '5888',
  'begin': '54',
  'category': 'PTM',
  'description': 'Phosphotyrosine; by ABL1',
  'end': '54',
  'evidences': 'ECO:0000269 - experimental evidence used in manual assertion\nPMID:9461559\n',
  'type': 'MOD_RES'},
 3: {'Gene Symbol': 'RAD51',
  'NCBI Gene ID': '

In [19]:
df = pd.DataFrame.from_dict(new_dict, orient="index")

In [20]:
df

Unnamed: 0,begin,type,evidences,category,Gene Symbol,description,end,NCBI Gene ID
0,1449,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,FANCA,Phosphoserine,1449,2175
1,2,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,RAD51,N-acetylalanine,2,5888
2,54,MOD_RES,ECO:0000269 - experimental evidence used in ma...,PTM,RAD51,Phosphotyrosine; by ABL1,54,5888
3,309,MOD_RES,ECO:0000269 - experimental evidence used in ma...,PTM,RAD51,Phosphothreonine; by CHEK1,309,5888
4,58,CROSSLNK,ECO:0000269 - experimental evidence used in ma...,PTM,RAD51,Glycyl lysine isopeptide (Lys-Gly) (interchain...,58,5888
5,64,CROSSLNK,ECO:0000269 - experimental evidence used in ma...,PTM,RAD51,Glycyl lysine isopeptide (Lys-Gly) (interchain...,64,5888
6,10,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,XRCC2,Phosphoserine,10,7516
7,2,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,FANCL,N-acetylalanine,2,55120
8,34,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,FANCM,Phosphoserine,34,57697
9,1673,MOD_RES,ECO:0000244 - combinatorial evidence used in m...,PTM,FANCM,Phosphoserine,1673,57697


In [21]:
df.to_csv('ptm_results1.csv')
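
A quick tally per gene can help check the export. The sketch below uses only the standard library and a hypothetical helper, `count_ptms_by_gene`, operating on a dictionary shaped like `new_dict`. The first three sample rows are abridged from the output above; the symbol in the last row is a placeholder, since the symbol for gene 10459 is not shown in this notebook.

```python
def count_ptms_by_gene(rows):
    """Count PTM rows per gene symbol; genes without PTMs get a count of 0."""
    counts = {}
    for row in rows.values():
        symbol = row['Gene Symbol']
        counts.setdefault(symbol, 0)
        # Rows for genes without PTMs carry only the gene ID and symbol
        if 'description' in row:
            counts[symbol] += 1
    return counts

sample_rows = {
    0: {'Gene Symbol': 'FANCA', 'NCBI Gene ID': '2175',
        'description': 'Phosphoserine'},
    1: {'Gene Symbol': 'RAD51', 'NCBI Gene ID': '5888',
        'description': 'N-acetylalanine'},
    2: {'Gene Symbol': 'RAD51', 'NCBI Gene ID': '5888',
        'description': 'Phosphotyrosine; by ABL1'},
    # Placeholder symbol: gene 10459 returned an empty PTM list above
    3: {'Gene Symbol': 'SYMBOL_PLACEHOLDER', 'NCBI Gene ID': '10459'},
}

ptm_counts = count_ptms_by_gene(sample_rows)
print(ptm_counts)
```

On the real data this would be called as `count_ptms_by_gene(new_dict)`.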