# Workflow 7 Validator
This workbook contains code stubs for validating the outputs of modules 1-3 of workflow 7. Both the validator schemes and the set of knowledge beacons they rely on are under development and subject to change.

## SMPDB
This section contains code for querying SMPDB for ground truth answers for the pathway building modules of workflow 7.  To recap, the respective objectives of the modules are:
1. Retrieve or infer metabolites and genes in the metabolic or mechanism of action pathway for chemical X
2. Retrieve or infer substrate-product, substrate-gene, and gene-product associations.

The SMPDB API and the scripts for querying are a work in progress, and thus subject to change as we identify shortcomings and fix them.  The ultimate goal is to get something where we can "turn the crank" and spit out a fully connected module answer and pathway graph for any compound in SMPDB.

#### Package Imports
Import packages for querying the SMPDB beacon (among others).

In [1]:
from __future__ import print_function
from tkbeacon import build, KnowledgeSource
from tkbeacon.rest import ApiException
from pprint import pprint

## Module 1
These functions are the latest iteration of SMPDB queries that satisfy the objective of module 1: generate an exhaustive list of all the molecules and genes in a particular pathway. I tried several other query strategies, but they either didn't work at all or were not generalizable. This one seems to work fine, but could run into problems when run recursively. Note that some type of recursive structure will be necessary with the way the knowledge beacon is set up.

In [2]:
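def get_concepts(query):
    """Return concept details for beacon concepts whose name matches a query term."""
    b = build(KnowledgeSource.SMPDB)
    # Lowercase the query so names are compared case-insensitively.
    terms = [i.lower() for i in query]
    concepts = b.get_concepts(keywords=terms)
    # Keep only concepts whose name exactly matches one of the query terms.
    return [b.get_concept_details(i.id) for i in concepts if i.name.lower() in terms]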

def get_genes(query):
    b = build(KnowledgeSource.SMPDB)
    concepts = get_concepts(query)
    query_ids = [i.id for i in concepts]
    statements = b.get_statements(s=query_ids, edge_label='related_to', t_categories=['protein'])
    return statements

def get_chemicals(query):
    b = build(KnowledgeSource.SMPDB)
    concepts = get_concepts(query)
    query_ids = [i.id for i in concepts]
    statements = b.get_statements(s=query_ids, edge_label='related_to', 
                                  t_categories=['metabolite', 'chemical substance'])
    return statements
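
The statement lists returned by `get_genes` and `get_chemicals` are verbose, while the module 1 answer is really just a list of names per category. The sketch below shows one way to boil statements down to that form. It assumes each statement exposes `object.name` and `object.categories` as attributes, mirroring the keys in the printed output; `summarize_module1` and the `_stmt` stand-in are hypothetical helpers, not part of `tkbeacon`.

```python
from types import SimpleNamespace


def summarize_module1(statements):
    """Group unique object names by category, preserving first-seen order."""
    answer = {}
    for st in statements:
        for category in st.object.categories:
            names = answer.setdefault(category, [])
            if st.object.name not in names:
                names.append(st.object.name)
    return answer


def _stmt(name, categories):
    # Stand-in for a beacon statement; only the fields used above are modeled.
    return SimpleNamespace(object=SimpleNamespace(name=name, categories=categories))


demo = [
    _stmt('Formaldehyde', ['metabolite']),
    _stmt('Cytochrome P450 3A4', ['protein']),
    _stmt('Formaldehyde', ['metabolite']),  # repeated on purpose
]
summary = summarize_module1(demo)
print(summary)
```

With real data, the same function could be applied to `get_genes(query) + get_chemicals(query)`, provided the statement objects behave as assumed.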

In [3]:
chemical = ['Codeine']
concepts = get_concepts(chemical)
pprint(concepts)

[{'categories': ['metabolite'],
 'description': None,
 'details': None,
 'exact_matches': [],
 'id': 'CHEBI:16714',
 'name': 'Codeine',
 'symbol': None,
 'synonyms': None,
 'uri': 'https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16714'}]


In [4]:
smpdb_chemicals = get_chemicals(query=chemical)
pprint(smpdb_chemicals)

[{'id': 'CHEBI:16714|related_to|used_to_produce|CHEBI:16842',
 'object': {'categories': ['metabolite'],
            'id': 'CHEBI:16842',
            'name': 'Formaldehyde'},
 'predicate': {'edge_label': 'related_to',
               'negated': False,
               'relation': 'used_to_produce'},
 'subject': {'categories': ['metabolite'],
             'id': 'CHEBI:16714',
             'name': 'Codeine'}},
 {'id': 'CHEBI:16714|related_to|reacts_with|CHEBI:17200',
 'object': {'categories': ['metabolite'],
            'id': 'CHEBI:17200',
            'name': 'Uridine diphosphate glucuronic acid'},
 'predicate': {'edge_label': 'related_to',
               'negated': False,
               'relation': 'reacts_with'},
 'subject': {'categories': ['metabolite'],
             'id': 'CHEBI:16714',
             'name': 'Codeine'}},
 {'id': 'CHEBI:16714|related_to|used_to_produce|CHEBI:17659',
 'object': {'categories': ['metabolite'],
            'id': 'CHEBI:17659',
            'name': "Uridine 5'-

In [5]:
smpdb_genes = get_genes(query=chemical)
pprint(smpdb_genes)

[{'id': 'CHEBI:16714|related_to|consumption_controlled_by|UNIPROT:P08684',
 'object': {'categories': ['protein'],
            'id': 'UNIPROT:P08684',
            'name': 'Cytochrome P450 3A4'},
 'predicate': {'edge_label': 'related_to',
               'negated': False,
               'relation': 'consumption_controlled_by'},
 'subject': {'categories': ['metabolite'],
             'id': 'CHEBI:16714',
             'name': 'Codeine'}},
 {'id': 'CHEBI:16714|related_to|consumption_controlled_by|UNIPROT:P10635',
 'object': {'categories': ['protein'],
            'id': 'UNIPROT:P10635',
            'name': 'Cytochrome P450 2D6'},
 'predicate': {'edge_label': 'related_to',
               'negated': False,
               'relation': 'consumption_controlled_by'},
 'subject': {'categories': ['metabolite'],
             'id': 'CHEBI:16714',
             'name': 'Codeine'}},
 {'id': 'CHEBI:16714|related_to|consumption_controlled_by|UNIPROT:P16662',
 'object': {'categories': ['protein'],
          

## Automated Workflow
So far this is the closest thing we have to a fully generalizable and automated workflow that can find all the associations in an SMPDB pathway in one fell swoop.  It has a few kinks that need to be worked out, but still looks promising.

#### Concept Lookup
Simply a wrapper for the concepts endpoint.

In [6]:
def get_concepts(query):
    b = build(KnowledgeSource.SMPDB)
    terms = [i.lower() for i in query]
    concepts = b.get_concepts(keywords=terms)
    return [b.get_concept_details(i.id) for i in concepts if i.name.lower() in terms]

#### Get Related Concepts
Here we get all concepts related to our query using a generic predicate.

In [7]:
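def all_neighbors(query):
    """Return the query concepts plus the objects of every statement they are the subject of."""
    b = build(KnowledgeSource.SMPDB)
    concepts = get_concepts(query)
    query_ids = [i.id for i in concepts]
    # No edge_label or category filter: fetch every statement for the query concepts.
    related_concepts = b.get_statements(s=query_ids)
    # Look the neighbors up again by name to get their full concept details.
    related = [i.object.name for i in related_concepts]
    return concepts + get_concepts(query=related)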
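
Because `all_neighbors` concatenates two lists, the same concept could in principle show up twice, for example if a query compound is also the object of one of the returned statements. Whether that actually happens with the SMPDB beacon has not been checked here. As a precaution, a small order-preserving de-duplication by `id` might look like the sketch below; `unique_by_id` is a hypothetical helper, and the demo objects are stand-ins rather than real beacon concepts.

```python
from types import SimpleNamespace


def unique_by_id(concepts):
    """Drop repeated concepts, keeping the first occurrence of each id."""
    seen = set()
    unique = []
    for concept in concepts:
        if concept.id not in seen:
            seen.add(concept.id)
            unique.append(concept)
    return unique


# Stand-in concepts; only `id` and `name` are modeled.
demo_concepts = [
    SimpleNamespace(id='CHEBI:16714', name='Codeine'),
    SimpleNamespace(id='UNIPROT:P08684', name='Cytochrome P450 3A4'),
    SimpleNamespace(id='CHEBI:16714', name='Codeine'),
]
deduped = unique_by_id(demo_concepts)
print([c.id for c in deduped])
```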

In [8]:
related = all_neighbors(['Codeine'])

In [9]:
pprint(related)

[{'categories': ['metabolite'],
 'description': None,
 'details': None,
 'exact_matches': [],
 'id': 'CHEBI:16714',
 'name': 'Codeine',
 'symbol': None,
 'synonyms': None,
 'uri': 'https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:16714'},
 {'categories': ['protein'],
 'description': None,
 'details': None,
 'exact_matches': [],
 'id': 'UNIPROT:P08684',
 'name': 'Cytochrome P450 3A4',
 'symbol': None,
 'synonyms': None,
 'uri': 'https://www.uniprot.org/uniprot/P08684'},
 {'categories': ['protein'],
 'description': None,
 'details': None,
 'exact_matches': [],
 'id': 'UNIPROT:P16662',
 'name': 'UDP-glucuronosyltransferase 2B7',
 'symbol': None,
 'synonyms': None,
 'uri': 'https://www.uniprot.org/uniprot/P16662'},
 {'categories': ['protein'],
 'description': None,
 'details': None,
 'exact_matches': [],
 'id': 'UNIPROT:P10635',
 'name': 'Cytochrome P450 2D6',
 'symbol': None,
 'synonyms': None,
 'uri': 'https://www.uniprot.org/uniprot/P10635'},
 {'categories': ['metabolite'],
 'descr

#### Get All Associations
Now we find all the associations between all of the concepts we found in the last step.

In [10]:
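def all_associations(concepts):
    """Map each beacon relation to the statements linking any two of the given concepts."""
    output = {}
    b = build(KnowledgeSource.SMPDB)
    ids = [i.id for i in concepts]
    # Query once per relation, restricting both subject and object to the known concepts.
    for i in b.get_predicates():
        predicate = i.relation
        output[predicate] = b.get_statements(s=ids, relation=predicate, t=ids)
    return output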
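
`all_associations` returns a dictionary keyed by relation, whereas the module 2 objective is phrased in terms of substrate-product, substrate-gene, and gene-product associations. The sketch below shows one rough way to regroup the statements along those lines using only the subject and object categories. It assumes statements expose `subject` and `object` with `name` and `categories` attributes (mirroring the printed output) and that `'protein'` is the only gene-like category. `group_for_module2` is a hypothetical helper. The grouping ignores the relation's own semantics, so treat it as a first pass only.

```python
from types import SimpleNamespace

GENE_CATEGORIES = {'protein'}  # assumption: the only gene-like category in use


def _is_gene(concept):
    return bool(GENE_CATEGORIES & set(concept.categories))


def group_for_module2(associations):
    """Regroup {relation: [statement, ...]} into module 2 association types."""
    groups = {'substrate-product': [], 'substrate-gene': [],
              'gene-product': [], 'other': []}
    for relation, statements in associations.items():
        for st in statements:
            subj_gene, obj_gene = _is_gene(st.subject), _is_gene(st.object)
            if not subj_gene and not obj_gene:
                key = 'substrate-product'
            elif not subj_gene and obj_gene:
                key = 'substrate-gene'
            elif subj_gene and not obj_gene:
                key = 'gene-product'
            else:
                key = 'other'  # e.g. protein-protein statements
            groups[key].append((st.subject.name, relation, st.object.name))
    return groups


def _node(name, category):
    # Stand-in for a statement's subject or object.
    return SimpleNamespace(name=name, categories=[category])


# Toy statements with placeholder names; not real SMPDB results.
demo_associations = {
    'rel_a': [SimpleNamespace(subject=_node('M1', 'metabolite'),
                              object=_node('M2', 'metabolite'))],
    'rel_b': [SimpleNamespace(subject=_node('M1', 'metabolite'),
                              object=_node('G1', 'protein')),
              SimpleNamespace(subject=_node('G1', 'protein'),
                              object=_node('G2', 'protein'))],
}
grouped = group_for_module2(demo_associations)
print(grouped)
```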

In [11]:
associations = all_associations(related)

In [12]:
associations

{'catalysis_precedes': [{'id': 'UNIPROT:P10635|related_to|catalysis_precedes|UNIPROT:P08684',
   'object': {'categories': ['protein'],
              'id': 'UNIPROT:P08684',
              'name': 'Cytochrome P450 3A4'},
   'predicate': {'edge_label': 'related_to',
                 'negated': False,
                 'relation': 'catalysis_precedes'},
   'subject': {'categories': ['protein'],
               'id': 'UNIPROT:P10635',
               'name': 'Cytochrome P450 2D6'}},
  {'id': 'UNIPROT:P08684|related_to|catalysis_precedes|UNIPROT:P10635',
   'object': {'categories': ['protein'],
              'id': 'UNIPROT:P10635',
              'name': 'Cytochrome P450 2D6'},
   'predicate': {'edge_label': 'related_to',
                 'negated': False,
                 'relation': 'catalysis_precedes'},
   'subject': {'categories': ['protein'],
               'id': 'UNIPROT:P08684',
               'name': 'Cytochrome P450 3A4'}},
  {'id': 'UNIPROT:P08684|related_to|catalysis_precedes|UNIPROT

And voilà!  We have a fully built pathway with all of the relations filled in.  The bugaboo here is that we will need this to run recursively, and the "stop word" molecules will cause problems in that endeavor.  But that is a problem for another day.
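
As a starting point for that future work, here is a minimal sketch of a bounded expansion that refuses to expand through names on a stop list. It is written as an iterative breadth-first search rather than literal recursion, and it assumes "stop word" molecules means compounds we want to record but not expand through. The `neighbors` callable is a hypothetical stand-in for a beacon lookup such as the one in `all_neighbors`. The toy graph uses placeholder names, not SMPDB data.

```python
from collections import deque


def expand(seeds, neighbors, stop_names=(), max_depth=2):
    """Breadth-first expansion from `seeds`, not expanding stop-listed names.

    neighbors: callable mapping a name to an iterable of neighbor names
               (hypothetical; wrap the beacon query of your choice).
    """
    stop = {name.lower() for name in stop_names}
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        name, depth = queue.popleft()
        # Stop-listed names are kept in the result but never expanded.
        if depth >= max_depth or name.lower() in stop:
            continue
        for nxt in neighbors(name):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy graph: 'hub' plays the role of a stop-word molecule.
toy = {'X': ['M1', 'hub'], 'M1': ['G1'], 'hub': ['Y1', 'Y2'], 'G1': []}
found = expand(['X'], lambda n: toy.get(n, []), stop_names=['hub'], max_depth=3)
print(found)
```

The `seen` set keeps the search from revisiting concepts, and `max_depth` bounds how far it can wander from the seed compound even when the stop list is incomplete.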