## WF2 Module 2

We want to find molecular phenotypes that are integral to a disease, then chemicals that activate those phenotypes, and then genes that clean up those harmful chemicals. The tricky part is that, with current data, it is difficult to make the jump from phenotypes to chemicals.

For instance, if we take the molecular phenotype to be a GO process, we might find that, e.g., various DNA repair processes are involved in Fanconi Anemia. But the only chemicals linked to these processes are participants in the process, like DNA.

From the chemical side, CHEBI annotates chemicals with properties, such as being genotoxic. However, these properties carry little semantic value because they are just strings with no relation to, e.g., GO terms. CHIRO is meant to fill this gap, but that work is still ongoing.

To leapfrog this, we'll (1) grab chemicals that look bad for the disease, (2) run enrichments on their properties to find other, similar chemicals, and (3) look for genes that degrade those chemicals.

In [1]:
#To make nicer looking outputs
from IPython.display import display, HTML
import requests
import pandas as pd
import os
import sys

#Load some functions for parsing quick output
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from gg_functions import parse_answer, get_view_url, expand, quick

robokop='robokop.renci.org' 

In [2]:
disease = "MONDO:0019391"  #Fanconi Anemia

In [3]:
bad_chemicals = expand('disease',disease,'chemical_substance',predicate='contributes_to',direction='in')

In [5]:
df = parse_answer(bad_chemicals,node_list=['n1'])

In [6]:
df

Unnamed: 0,n1 - id,n1 - name,score
0,CHEBI:9288,streptozocin,1.390503
1,CHEBI:76451,alloxan,1.092735
2,CHEBI:22977,cadmium atom,0.963028
3,CHEBI:8382,prednisone,0.919533
4,CHEBI:7735,olanzapine,0.852368
5,CHEBI:18248,iron atom,0.840737
6,CHEBI:17158,methylglyoxal,0.780709
7,CHEBI:3766,clozapine,0.756415
8,CHEBI:28694,copper atom,0.751212
9,CHEBI:16598,DDE,0.751212


What properties do these things have in common?

In [7]:
from collections import defaultdict

# Keep only the chemical nodes from the returned knowledge graph
chems = [x for x in bad_chemicals['knowledge_graph']['nodes'] if 'chemical_substance' in x['type']]

# Count how many chemicals carry each property (iterating a node dict yields its keys)
prop_counts = defaultdict(int)
for n in chems:
    for p in n:
        prop_counts[p] += 1
items = list(prop_counts.items())
count_frame = pd.DataFrame.from_records(items, columns=['property', 'count_in_subset'])
count_frame.head()

Unnamed: 0,property,count_in_subset
0,id,250
1,name,250
2,equivalent_identifiers,250
3,type,250
4,omnicorp_article_count,250
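
Iterating over a node dictionary yields its keys, so the loop above counts how many chemicals carry each property key and ignores the values. A minimal sketch of the same idea on made-up nodes (the node contents are invented for illustration and do not come from ROBOKOP), using `collections.Counter` as an equivalent alternative to the `defaultdict`:

```python
from collections import Counter

# Hypothetical nodes, shaped like knowledge-graph nodes but with invented contents.
toy_nodes = [
    {"id": "CHEM:1", "name": "a", "mutagen": True},
    {"id": "CHEM:2", "name": "b", "drug": True},
    {"id": "CHEM:3", "name": "c", "mutagen": True, "drug": True},
]

def count_property_keys(nodes):
    """Count, for each property key, how many nodes carry it."""
    counts = Counter()
    for node in nodes:
        # Each node contributes at most once per key.
        counts.update(node.keys())
    return counts

toy_counts = count_property_keys(toy_nodes)
print(toy_counts.most_common())
```

Keys such as `id` and `name` appear on every node, which is why they top the real table without being informative.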


Many of these properties are assigned to many chemicals (like SMILES or name). We want to find the ones that are overrepresented in our subset. To do that, we need to know how often each property is used across the whole database. At the moment, ROBOKOP does not expose functions for this, but we have pre-calculated such a list:

In [9]:
property_counts = pd.read_csv('../../examples/chemprops.txt', sep='\t')
property_counts.head()

Unnamed: 0,property,count
0,appetite_enhancer,2
1,Wiskott_Aldrich_syndrome_protein_inhibitor,1
2,peptide_coupling_reagent,2
3,EC_3.4.22.52_(calpain_1)_inhibitor,3
4,EC_2.3.1.5_(arylamine_N_acetyltransferase)_inh...,1


In [11]:
df = pd.merge(count_frame, property_counts,on='property',how='inner')

df.head()

Unnamed: 0,property,count_in_subset,count
0,id,250,355412
1,name,250,311887
2,equivalent_identifiers,250,355412
3,therapeutic_flag,208,244803
4,topical,208,244803


In [13]:
from scipy.stats import hypergeom
# In scipy, the hypergeometric distribution models drawing objects from a bin:
#   M is the total number of objects,
#   n is the total number of Type I objects,
#   N is the number of objects drawn without replacement.
# The random variate is the number of Type I objects among the N drawn.

total_chemical_count = 355413  # M above
subset_count = len(chems)      # N above

def calc_enrich_p(x, n):
    # sf(x-1) gives P(X >= x)
    return hypergeom.sf(x-1, total_chemical_count, n, subset_count)
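
`hypergeom.sf(k, M, n, N)` returns P(X > k), so passing `x-1` yields the probability of seeing at least `x` chemicals with a property in a random subset of size N. As a sanity check, the same tail can be written out with binomial coefficients. This is a sketch using only the standard library; the name `hypergeom_tail` and the toy numbers are illustrative, not part of the notebook.

```python
from fractions import Fraction
from math import comb

def hypergeom_tail(x, M, n, N):
    """P(X >= x) when drawing N items without replacement from M items,
    n of which are 'Type I'. Returned as an exact Fraction."""
    upper = min(n, N)
    favourable = sum(
        comb(n, k) * comb(M - n, N - k)  # ways to draw exactly k Type I items
        for k in range(max(x, 0), upper + 1)
    )
    return Fraction(favourable, comb(M, N))

# Toy case: 20 objects, 5 of Type I, draw 4, ask for at least 2 of Type I.
tail = hypergeom_tail(2, 20, 5, 4)
print(float(tail))
```

Exact fractions are fine for toy sizes; for database-scale counts the scipy call is the practical choice.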

In [14]:
df['enrichment_p'] = df.apply(lambda x: calc_enrich_p(x['count_in_subset'], x['count']), axis=1)
df.sort_values(by='enrichment_p')

Unnamed: 0,property,count_in_subset,count,enrichment_p
22,application,162,8162,1.247381e-198
52,drug,141,5527,4.179293e-184
45,pharmaceutical,141,5605,3.018204e-183
57,biological_role,184,27273,5.710007e-147
33,pharmacological_role,90,2313,4.849076e-129
5,mass,219,97320,1.072944e-88
17,charge,219,98675,1.891211e-87
6,monoisotopic_mass,218,98074,9.602293e-87
71,chemical_role,106,16524,2.391485e-72
42,neurotransmitter_agent,46,978,1.495992e-68


Now we can see that the most enriched property is "mutagen", which occurs in 2 of the 3 chemicals that contribute to FA, but in only 195 of the roughly 355k chemicals in the database. Let's look for genes that degrade mutagens. In doing so we lose the connection to FA, so it won't be reflected in our score, which is unfortunate.

In [15]:
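def create_question(disease_id, prop):
    # Chemicals (n0) flagged with `prop`, and genes (n1) that increase their degradation.
    # The disease node (n2) is declared, but no edge connects it to the rest of the query.
    return {
        "machine_question": {
            "nodes": [
                {
                    "id": "n0",
                    "type": "chemical_substance",
                    prop: True
                },
                {
                    "id": "n1",
                    "type": "gene"
                },
                {
                    "id": "n2",
                    "type": "disease",
                    "curie": disease_id
                }
            ],
            "edges": [
                {
                    "id": "e0",
                    "source_id": "n1",
                    "target_id": "n0",
                    "type": "increases_degradation_of"
                }
            ]
        }
    }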
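
In this question the disease node `n2` is declared but no edge touches it, which appears to be why the FA connection does not feed into the score. A small check can make that visible. The sketch below rebuilds a compact copy of the question so that it runs standalone; `build_question` and `unconnected_nodes` are hypothetical helpers, not part of `gg_functions`.

```python
def build_question(disease_id, prop):
    """Compact copy of the question structure above, for a standalone example."""
    return {
        "machine_question": {
            "nodes": [
                {"id": "n0", "type": "chemical_substance", prop: True},
                {"id": "n1", "type": "gene"},
                {"id": "n2", "type": "disease", "curie": disease_id},
            ],
            "edges": [
                {"id": "e0", "source_id": "n1", "target_id": "n0",
                 "type": "increases_degradation_of"},
            ],
        }
    }

def unconnected_nodes(question):
    """Return ids of nodes that are not the source or target of any edge."""
    mq = question["machine_question"]
    linked = set()
    for edge in mq["edges"]:
        linked.add(edge["source_id"])
        linked.add(edge["target_id"])
    return [node["id"] for node in mq["nodes"] if node["id"] not in linked]

question = build_question("MONDO:0019391", "mutagen")
print(unconnected_nodes(question))
```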

In [16]:
q = create_question('MONDO:0019391','mutagen')
a = quick(q)

Return Status: 200


In [17]:
aframe = parse_answer(a,node_list=['n0','n1'])

In [18]:
aframe

Unnamed: 0,n0 - id,n0 - name,n1 - id,n1 - name,score
0,CHEBI:17754,glycerol,HGNC:4289,GK,0.283601
1,CHEBI:28262,dimethyl sulfoxide,HGNC:4289,GK,0.274309
2,CHEBI:27732,caffeine,HGNC:2596,CYP1A2,0.266359
3,CHEBI:17754,glycerol,HGNC:381,AKR1B1,0.265992
4,CHEBI:17754,glycerol,HGNC:382,AKR1B10,0.264636
5,CHEBI:17754,glycerol,HGNC:28635,GK5,0.264351
6,CHEBI:15343,acetaldehyde,HGNC:404,ALDH2,0.263787
7,CHEBI:4027,cyclophosphamide,HGNC:2637,CYP3A4,0.257238
8,CHEBI:5864,ifosfamide,HGNC:2637,CYP3A4,0.257073
9,CHEBI:15343,acetaldehyde,HGNC:407,ALDH1B1,0.256739


This finds the expected result that ALDH2 is a potential modifier for FA. Potentially, by looking for things that degrade into acetaldehyde or the other chemicals in n0, we could build a list of environmental exposures to avoid. Note that, as ever, we are at the mercy of the data: caffeine is mutagenic at high doses, but mostly in non-mammalian cells.
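
To see at a glance which genes act on which chemicals, the answer rows can be grouped by gene. A minimal sketch using a few (chemical, gene) pairs copied from the output above; with the real DataFrame, `aframe.groupby('n1 - name')` would be the natural route, and plain Python is used here only to keep the example standalone.

```python
from collections import defaultdict

# (chemical name, gene name) pairs copied from the answer table above
rows = [
    ("glycerol", "GK"),
    ("dimethyl sulfoxide", "GK"),
    ("caffeine", "CYP1A2"),
    ("acetaldehyde", "ALDH2"),
    ("cyclophosphamide", "CYP3A4"),
    ("ifosfamide", "CYP3A4"),
    ("acetaldehyde", "ALDH1B1"),
]

# Collect the chemicals each gene is reported to degrade, preserving row order
chemicals_by_gene = defaultdict(list)
for chemical, gene in rows:
    chemicals_by_gene[gene].append(chemical)

for gene, chemicals in chemicals_by_gene.items():
    print(f"{gene}: {', '.join(chemicals)}")
```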