# Workflow 7 - Module 2 (Ordered Pathway)
This notebook will focus on module 2 of workflow 7.  The objective is to take unordered lists of genes and metabolites and structure them into a pathway.  The first prototype use case for this workflow will focus on Codeine.

## Step 1. Query SMPDB API
For this one we're going to use the fully automated workflow.  

In [1]:
from __future__ import print_function  # __future__ imports must be the first statement in a cell
from pprint import pprint
from tkbeacon import build, KnowledgeSource
from tkbeacon.rest import ApiException

def get_concepts(query):
    b = build(KnowledgeSource.SMPDB)
    terms = [i.lower() for i in query]
    concepts = b.get_concepts(keywords=terms)
    return [i for i in concepts if i.name.lower() in terms]

def all_neighbors(query):
    b = build(KnowledgeSource.SMPDB)
    concepts = get_concepts(query)
    query_ids = [i.id for i in concepts]
    related_concepts = b.get_statements(s=query_ids)
    related_names = [i.object.name for i in related_concepts]
    return concepts + get_concepts(query=related_names)

def all_associations(stuff):
    output = {}
    b = build(KnowledgeSource.SMPDB)
    ids = [i.id for i in stuff]
    for i in b.get_predicates():
        predicate = i.relation
        output[predicate] = b.get_statements(s=ids, relation=predicate, t=ids)
    return output

### Module Input
First we query SMPDB for unorganized lists of chemicals and genes in a pathway.

In [2]:
drug = ['Codeine']
related = all_neighbors(drug)
genes = [i for i in related if 'protein' in i.categories]
chems = [i for i in related if 'chemical substance' in i.categories or 'metabolite' in i.categories]

In [3]:
pprint(genes)

[{'categories': ['protein'],
 'description': None,
 'id': 'UNIPROT:P08684',
 'name': 'Cytochrome P450 3A4'},
 {'categories': ['protein'],
 'description': None,
 'id': 'UNIPROT:P16662',
 'name': 'UDP-glucuronosyltransferase 2B7'},
 {'categories': ['protein'],
 'description': None,
 'id': 'UNIPROT:P10635',
 'name': 'Cytochrome P450 2D6'}]


In [4]:
pprint(chems)

[{'categories': ['metabolite'],
 'description': None,
 'id': 'CHEBI:16714',
 'name': 'Codeine'},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'CHEBI:16842',
 'name': 'Formaldehyde'},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'CHEBI:17200',
 'name': 'Uridine diphosphate glucuronic acid'},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'CHEBI:17659',
 'name': "Uridine 5'-diphosphate"},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'CHEBI:17303',
 'name': 'Morphine'},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'KEGG:C16577',
 'name': 'Codeine-6-glucuronide'},
 {'categories': ['metabolite'],
 'description': None,
 'id': 'KEGG:C16576',
 'name': 'Norcodeine'},
 {'categories': ['chemical substance'],
 'description': None,
 'id': 'HMDB:HMDB0060464',
 'name': 'Codeine-6-glucuronide'},
 {'categories': ['chemical substance'],
 'description': None,
 'id': 'HMDB:HMDB0060657',
 'name': 'Norcodeine'}]


### Module Output
Then we query SMPDB for the rest of the pathway associations. 

In [5]:
associations = all_associations(related)
substrate_product_associations = associations['used_to_produce']
substrate_gene_associations = associations['consumption_controlled_by']
gene_product_associations = associations['controls_production_of']

In [None]:
# pprint(substrate_product_associations)

In [None]:
# pprint(substrate_gene_associations)

In [None]:
# pprint(gene_product_associations)
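
One possible way to turn substrate-to-product associations into an ordered pathway is a topological sort. The sketch below assumes the associations have already been reduced to `(substrate, product)` name pairs. The `edges` list is hand-written for illustration and is not output from SMPDB, and `order_pathway` is a hypothetical helper, not part of `tkbeacon`.

```python
from collections import defaultdict, deque

def order_pathway(edges):
    """Return node names in topological order using Kahn's algorithm."""
    successors = defaultdict(list)
    indegree = {}
    nodes = []  # preserves first-seen order for stable output
    for substrate, product in edges:
        for node in (substrate, product):
            if node not in indegree:
                indegree[node] = 0
                nodes.append(node)
        successors[substrate].append(product)
        indegree[product] += 1

    # Start from nodes that nothing produces (pathway inputs).
    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = []
    while queue:
        node = queue.popleft()
        ordered.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if len(ordered) != len(nodes):
        raise ValueError('cycle detected; no linear ordering exists')
    return ordered

# Hypothetical (substrate, product) pairs, for illustration only.
edges = [
    ('Codeine', 'Morphine'),
    ('Codeine', 'Norcodeine'),
    ('Codeine', 'Codeine-6-glucuronide'),
]
print(order_pathway(edges))
```

A topological sort only works when the associations form a directed acyclic graph, so the sketch raises an error on cycles rather than guessing an order.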

## GNBR Queries
Neither of the GNBR APIs currently has endpoints that provide the functionality we need, so for now we use custom Cypher queries over the Neo4j Bolt interface, with the intention of promoting them to the live API if they work.

In [6]:
import math
import gnbr_beacon
from neo4j import GraphDatabase

gnbr_concepts = gnbr_beacon.ConceptsApi()
gnbr_statements = gnbr_beacon.StatementsApi()

driver = GraphDatabase.driver("bolt://localhost:7687", auth=('',''))

def pk_motif(tx, source, target):
    query = """
    MATCH p=(:Chemical {uri: $source})-[s:STATEMENT]->(:Gene)<-[t:STATEMENT]-(:Chemical {uri: $target})
    RETURN nodes(p) as n, relationships(p) as r
    """
    result = []
    for record in tx.run(query, source=source, target=target):
        result.append(record)
    return result

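def geometric_mean(path):
    # Score a path by the geometric mean of each edge's largest weight.
    # Assumes every edge has at least one strictly positive numeric property.
    logs = []
    for edge in path['r']:
        weight = max(edge.values())
        logs.append(math.log(weight))
    if not logs:
        return 0.0  # no edges, nothing to score
    return math.exp(sum(logs) / len(logs))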
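
To make the path score concrete, here is a small worked example that uses plain dicts in place of Neo4j records. The toy weights are invented for illustration, and the function is restated in compact form so the snippet runs on its own. The property keys `'O'`, `'X'` and `'Z'` are borrowed from `pk_score` at the end of this notebook.

```python
import math

def geometric_mean(path):
    # Take the largest weight on each edge, then average in log space.
    logs = [math.log(max(edge.values())) for edge in path['r']]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

# Two edges whose strongest weights are 4 and 9 (made-up numbers).
toy_path = {'r': [{'O': 4.0, 'X': 1.0}, {'O': 2.0, 'Z': 9.0}]}
score = geometric_mean(toy_path)
print(round(score, 6))  # sqrt(4 * 9), i.e. about 6
```

Averaging in log space means that one weak edge pulls the whole path score down more than it would under an arithmetic mean.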

### Harmonize Concepts
First we need to map concepts from SMPDB into GNBR.  We might imagine doing the reverse mapping, but for now we are keeping it as simple as possible.  We map concepts using simple keyword lookup.

In [7]:
import requests
import pandas as pd

robokop_server = 'robokop.renci.org'

def synonymize(nodetype,identifier):
    url=f'http://{robokop_server}:6010/api/synonymize/{identifier}/{nodetype}/'
    response = requests.post(url)
    print( f'Return Status: {response.status_code}' )
    if response.status_code == 200:
        return response.json()
    # Return an empty dict so callers can still look up 'synonyms' safely.
    return {}


def map_to_gnbr(nodetype, identifier):
    response = synonymize(nodetype,identifier)
    # CURIE prefixes that GNBR uses for each node type.
    namespaces = {
        'chemical_substance': ['CHEBI', 'MESH'],
        'disease' : ['MESH'],
        'gene' : ['NCBIGENE'],
    }
    namespace = namespaces[nodetype]
    # Keep only synonyms whose prefix is one GNBR recognizes.
    return [i[0] for i in response.get('synonyms', []) if i[0].split(':')[0] in namespace]

def smpdb_to_gnbr(concepts):
    results = []
    for concept in concepts:
        identifier = concept.id
        nodetype = concept.categories[0]
        
        if nodetype == 'protein':
            identifier = identifier.replace('UNIPROT:', 'UniProtKB:')
            nodetype = 'gene'
        elif nodetype == 'metabolite' or nodetype == 'chemical substance':
            nodetype = 'chemical_substance'
            
        mapped = map_to_gnbr(nodetype, identifier)
        results.extend(mapped)
        
    return results
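
The namespace filter in `map_to_gnbr` can be tried offline. The sketch below assumes the synonymizer response is a dict with a `'synonyms'` list whose entries start with a CURIE, which is the shape the code above relies on. The `sample_response` is hand-built from identifiers that appear elsewhere in this notebook, not a real ROBOKOP response, and `filter_by_namespace` is a hypothetical helper name.

```python
def filter_by_namespace(synonyms, allowed_prefixes):
    # A CURIE looks like 'PREFIX:local_id'; keep entries with an allowed prefix.
    return [s[0] for s in synonyms if s[0].split(':')[0] in allowed_prefixes]

# Hand-built stand-in for a synonymizer response (assumed shape).
# The second field of each entry is a placeholder; only the CURIE is used.
sample_response = {
    'synonyms': [
        ['CHEBI:17303', 'placeholder'],
        ['KEGG:C16576', 'placeholder'],
        ['MESH:D003061', 'placeholder'],
        ['HMDB:HMDB0060657', 'placeholder'],
    ]
}

kept = filter_by_namespace(sample_response['synonyms'], ['CHEBI', 'MESH'])
print(kept)  # ['CHEBI:17303', 'MESH:D003061']
```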

In [8]:
chem_ids = smpdb_to_gnbr(chems)

Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200
Return Status: 200


In [9]:
# mapped_chems = gnbr_concepts.get_concepts(keywords=[i.lower() for i in metabolites])
# pprint(mapped_chems)
details = [gnbr_concepts.get_concept_details(i) for i in chem_ids]
pprint([i.name for i in details if i.name])

['formalin',
 'CH2O',
 'UDP glucuronate',
 'UDPGA',
 'UDP',
 'morphine',
 'morphin',
 'norcodeine']


In [10]:
chem_ids = chem_ids + ['MESH:D003061']

In [11]:
# mapped_genes = gnbr_concepts.get_concepts(keywords=[i for i in genes])
# pprint(mapped_genes)
gene_ids = smpdb_to_gnbr(genes)

Return Status: 200
Return Status: 200
Return Status: 200


In [12]:
details = [gnbr_concepts.get_concept_details(i) for i in gene_ids]
pprint([i.name for i in details if i.name])

['CYP3A4', 'UGT2B7', 'CYP2D6']


In [13]:
import itertools
# chem_ids = [i.id for i in mapped_chems]
# gene_ids = [i.id for i in mapped_genes]
all_motifs = []
with driver.session() as neo4j:
    for source, target in itertools.combinations(chem_ids, 2):
        motifs = neo4j.read_transaction(pk_motif, source=source, target=target)
        all_motifs.extend(motifs)

In [14]:
all_motifs = sorted(all_motifs, key=geometric_mean, reverse=True)
for motif in all_motifs:
    nodes = motif['n']
    node_ids = [n['uri'] for n in nodes]
    if 'MESH:D005557' in node_ids:
        continue
    if node_ids[1] not in gene_ids:
        continue
    pprint([i['name'] for i in nodes])

['UDPGA', 'UGT2B7', 'UDP']
['UDPGA', 'UGT2B7', 'morphine']
['UDP', 'UGT2B7', 'morphine']
['UDPGA', 'UGT2B7', 'codeine']
['UDP', 'UGT2B7', 'codeine']
['morphine', 'UGT2B7', 'codeine']
['norcodeine', 'CYP3A4', 'codeine']
['morphine', 'CYP2D6', 'codeine']
['morphine', 'CYP3A4', 'codeine']
['morphine', 'CYP3A4', 'norcodeine']


We can see a problem with the mapping here: we end up pulling in synonyms (orthologs). Species tags are not currently supported on the GNBR API, though they are present in the underlying Neo4j database. Future versions of the API will include species information for genes.

### Metrics
The metrics we will compute for this module are Jaccard similarity and Average Precision.  Jaccard similarity doesn't explicitly take the ordering of the answers into account, but it is affected by the number of returned answers.

Average precision explicitly considers rankings and is less sensitive to the total number of results returned. Thus it is generally preferred for search algorithms that return long lists of results, where the top results should be the most relevant.

In [None]:
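def jaccard_index(query_results, ground_truths):
    # |intersection| / |union|; accepts any iterables of hashable items.
    results, truths = set(query_results), set(ground_truths)
    denominator = len(results | truths)
    if denominator == 0:
        return 0.0
    return 1.0 * len(results & truths) / denominator

def avg_prec(query_results, ground_truths):
    # Sum precision@k at every rank k that holds a relevant result,
    # then normalize by the number of ground-truth items.
    if not ground_truths:
        return 0.0
    hits, precision = 0, 0.0
    for n, result in enumerate(query_results):
        if result in ground_truths:
            hits += 1
            precision += 1.0 * hits / (n + 1)
    return precision / len(ground_truths)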
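
A small worked example shows how the two metrics respond to ranking. The ranked list and ground truth below are toy data chosen for illustration, not results from this workflow, and the two functions are restated in compact form so the snippet runs on its own.

```python
def jaccard_index(query_results, ground_truths):
    results, truths = set(query_results), set(ground_truths)
    union = results | truths
    return len(results & truths) / len(union) if union else 0.0

def avg_prec(query_results, ground_truths):
    hits, precision = 0, 0.0
    for n, result in enumerate(query_results):
        if result in ground_truths:
            hits += 1
            precision += hits / (n + 1)
    return precision / len(ground_truths) if ground_truths else 0.0

# Toy ranking and ground truth (illustrative only).
truth = ['CYP2D6', 'UGT2B7', 'CYP3A4']
ranked = ['CYP2D6', 'CYP1A2', 'UGT2B7', 'CYP2C9']
reversed_ranked = list(reversed(ranked))

print(jaccard_index(ranked, truth))           # 2 shared / 5 total = 0.4
print(jaccard_index(reversed_ranked, truth))  # same sets, so still 0.4
print(round(avg_prec(ranked, truth), 4))           # (1/1 + 2/3) / 3 = 0.5556
print(round(avg_prec(reversed_ranked, truth), 4))  # (1/2 + 2/4) / 3 = 0.3333
```

Reversing the list leaves the Jaccard index unchanged but lowers average precision, because the relevant results now appear further down the ranking.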

### Further Investigation

#### Right Answers

#### Wrong Answers

## Conclusion

In [None]:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=('',''))

def pk_gene(tx, source):
    query = """
    MATCH p=(:Chemical {uri: $source})-[:STATEMENT]->(:Gene)
    RETURN nodes(p) as n, relationships(p) as r
    """
    result = []
    for record in tx.run(query, source=source):
        result.extend(record['r'])
    return result



def pk_score(relationship):
    score = max( [relationship[i] for i in ['O','X','Z']] )
    return score

def max_score(relationship):
    score = max(relationship.values())
    return score