# Test querying ClassyFire

Example from Justin: http://classyfire.wishartlab.com/entities/HNDVDQJCIGZPNO-YFKPBYRVSA-N (histidine)

Fetch the result of the query above in JSON format.

In [2]:
import urllib2
import json
import jsonpickle

def get_json(url):
    response = urllib2.urlopen(url)
    data = json.load(response)   
    return data
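The helper above lets any network or parsing error propagate to the caller. A minimal sketch of a more defensive variant follows. The name `get_json_safe`, the `opener` parameter, and the 30-second timeout are assumptions made for illustration and are not part of the original notebook. The import fallback is there so the same code runs on Python 2 (used in this notebook) and Python 3.

```python
import io
import json

try:                                   # Python 3
    from urllib.request import urlopen
except ImportError:                    # Python 2, as used in this notebook
    from urllib2 import urlopen


def get_json_safe(url, opener=urlopen, timeout=30):
    """Fetch `url` and parse the body as JSON; return None on failure.

    `opener` is a hypothetical hook so that the function can be
    exercised without network access.
    """
    try:
        response = opener(url, timeout=timeout)
        return json.load(response)
    except (IOError, ValueError) as err:
        # IOError covers URL and socket errors, ValueError covers bad JSON.
        print('Could not fetch %s: %s' % (url, err))
        return None


# Offline demonstration with a stand-in opener (no request is sent).
def fake_opener(url, timeout=None):
    return io.BytesIO(b'{"kingdom": {"name": "Organic compounds"}}')

record = get_json_safe('http://example.invalid/demo.json', opener=fake_opener)
print(record['kingdom']['name'])
```

Returning `None` instead of raising is one possible design; callers then need to check the result before indexing into it.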

Let's see what information we get back from ClassyFire.

In [3]:
url = 'http://classyfire.wishartlab.com/entities/HNDVDQJCIGZPNO-YFKPBYRVSA-N.json'
data = get_json(url)
for key in data:
    print key

kingdom
smiles
inchikey
classification_version
description
predicted_lipidmaps_terms
molecular_framework
alternative_parents
subclass
intermediate_nodes
superclass
substituents
external_descriptors
direct_parent
class
predicted_chebi_terms


We need kingdom, superclass, class, subclass, intermediate_nodes and direct_parent to construct the taxonomy path of this document (InChIKey).

Wrap this up as a function: pass in an InChIKey and get back the taxonomy path.

In [30]:
def get_taxa_path(inchikey):

    url = 'http://classyfire.wishartlab.com/entities/%s.json' % inchikey
    data = get_json(url)
    
    # store the taxonomy path for this inchikey here
    taxa_path = []

    # add the top-4 taxa
    keys = ['kingdom', 'superclass', 'class', 'subclass']
    for key in keys:
        if data[key] is not None:
            taxa_path.append(data[key]['name'])

    # add all the intermediate taxa >level 4 but above the direct parent
    for entry in data['intermediate_nodes']:
        taxa_path.append(entry['name'])

    # add the direct parent
    taxa_path.append(data['direct_parent']['name'])

    return taxa_path

inchikey = 'HNDVDQJCIGZPNO-YFKPBYRVSA-N'
tp = get_taxa_path(inchikey)
print '\n'.join(tp)

Chemical entities
Organic compounds
Organic acids and derivatives
Carboxylic acids and derivatives
Amino acids, peptides, and analogues
Amino acids and derivatives
Alpha amino acids and derivatives
Histidine and derivatives
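`get_taxa_path` assumes that `direct_parent` is always present and appends it even when it repeats the last taxon already on the path (some outputs further down list the same name twice in a row). The sketch below shows one way to handle both cases on an already-parsed record. The function name `taxa_path_from_record` and the sample record are hypothetical: the record is written by hand and is not a real ClassyFire response.

```python
def taxa_path_from_record(data):
    """Build a taxonomy path from a parsed ClassyFire-style record."""
    path = []

    # The four top-level taxa; any of them may be missing or None.
    for key in ('kingdom', 'superclass', 'class', 'subclass'):
        node = data.get(key)
        if node is not None:
            path.append(node['name'])

    # Intermediate taxa between the subclass and the direct parent.
    for node in data.get('intermediate_nodes') or []:
        path.append(node['name'])

    # Append the direct parent unless it repeats the last taxon.
    parent = data.get('direct_parent')
    if parent is not None and (not path or path[-1] != parent['name']):
        path.append(parent['name'])

    return path


# Hand-written sample record (an assumption, not fetched from the server).
sample = {
    'kingdom': {'name': 'Organic compounds'},
    'superclass': {'name': 'Nucleosides, nucleotides, and analogues'},
    'class': {'name': 'Purine nucleosides'},
    'subclass': None,
    'intermediate_nodes': [],
    'direct_parent': {'name': 'Purine nucleosides'},
}

path = taxa_path_from_record(sample)
print('\n'.join(path))
```

Separating the path construction from the HTTP request also makes the logic easy to check without contacting the server.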


A function that returns the substituents ClassyFire reports for an InChIKey:

In [17]:
def get_substituents(inchikey):
    url = 'http://classyfire.wishartlab.com/entities/%s.json' % inchikey
    data = get_json(url)
    return data['substituents']
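Both `get_taxa_path` and `get_substituents` request the same URL for a given InChIKey, and several documents can share one InChIKey, so the same record may be fetched repeatedly. A minimal caching sketch follows. The wrapper `make_cached_fetcher` and the stand-in `fake_fetch` (with its placeholder substituent list) are hypothetical names introduced here for illustration.

```python
def make_cached_fetcher(fetch):
    """Wrap `fetch(inchikey)` so each InChIKey is requested at most once."""
    cache = {}

    def cached(inchikey):
        if inchikey not in cache:
            cache[inchikey] = fetch(inchikey)
        return cache[inchikey]

    return cached


# Offline demonstration: a stand-in fetcher that records each call.
calls = []

def fake_fetch(inchikey):
    calls.append(inchikey)
    # Placeholder content, not a real server response.
    return {'inchikey': inchikey, 'substituents': ['Imidazole', 'Carboxylic acid']}

fetch_record = make_cached_fetcher(fake_fetch)
first = fetch_record('HNDVDQJCIGZPNO-YFKPBYRVSA-N')
second = fetch_record('HNDVDQJCIGZPNO-YFKPBYRVSA-N')
print(len(calls))  # prints 1: the second lookup is served from the cache
```

In the notebook, the wrapped function would be a real fetcher that takes an InChIKey and returns the parsed JSON record.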

Now try this with some Mass2Motifs from MassBank. First get all the documents above the default doc-topic threshold (0.05), then retrieve each document's metadata (InChIKey) and pass it to ClassyFire.

In [31]:
def print_m2m_taxonomy(m2m_id):
    
    server = 'www.ms2lda.org'
    url = 'http://%s/basicviz/get_parents_metadata/%d' % (server, m2m_id)
    data = get_json(url)

    for metadata_str in data:
        doc = jsonpickle.decode(metadata_str)
        inchikey = doc['InChIKey']
        print doc['annotation'], inchikey
        for taxon in get_taxa_path(inchikey):
            print '-', taxon
        print
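Several documents in a Mass2Motif can carry the same InChIKey, in which case the loop above classifies that compound once per document. The sketch below groups annotations by InChIKey first, so that each compound only needs one lookup. It assumes each metadata string decodes to a plain dictionary with `annotation` and `InChIKey` keys (as the function above uses) and that `json.loads` is enough for such strings; the original uses `jsonpickle.decode`. The function name and the sample strings are written by hand for illustration, using identifiers that appear in the outputs of this notebook.

```python
import json
from collections import OrderedDict


def group_annotations_by_inchikey(metadata_strings):
    """Map each InChIKey to the annotations of the documents that carry it."""
    groups = OrderedDict()  # keeps the order in which keys are first seen
    for metadata_str in metadata_strings:
        doc = json.loads(metadata_str)
        groups.setdefault(doc['InChIKey'], []).append(doc['annotation'])
    return groups


# Hand-written sample input (an assumption about the server's format).
metadata_strings = [
    '{"annotation": "washington_0873", "InChIKey": "KRBMQYPTDYSENE-UHFFFAOYSA-N"}',
    '{"annotation": "washington_0886", "InChIKey": "KRBMQYPTDYSENE-UHFFFAOYSA-N"}',
    '{"annotation": "riken_0373", "InChIKey": "DOUMFZQKYFQNTF-UHFFFAOYSA-N"}',
]

groups = group_annotations_by_inchikey(metadata_strings)
for inchikey, annotations in groups.items():
    print('%s %s' % (inchikey, ', '.join(annotations)))
```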

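Print a list of substituents from all of the molecules, ranked by how often they appear. Counts are per unique InChIKey, because documents that share an InChIKey are stored under the same dictionary key.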

In [24]:
def get_all_substituents(m2m_id):
    server = 'www.ms2lda.org'
    url = 'http://%s/basicviz/get_parents_metadata/%d' % (server, m2m_id)
    data = get_json(url)
    substituents = {}
    for metadata_str in data:
        doc = jsonpickle.decode(metadata_str)
        inchikey = doc['InChIKey']
        substituents[inchikey] = get_substituents(inchikey)
        
    substituent_counts = {}
    for inchikey in substituents:
        for ss in substituents[inchikey]:
            substituent_counts[ss] = substituent_counts.get(ss, 0) + 1
    # rank substituents by count, most frequent first
    ss_c = sorted(substituent_counts.items(), key=lambda x: x[1], reverse=True)
    for ss, count in ss_c:
        print "{},{} (/{})".format(ss, count, len(substituents))
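The counting and ranking step can also be written with `collections.Counter` from the standard library. The sketch below works on an already-built dictionary of substituent lists; the function name `rank_substituents` and the sample dictionary (with placeholder keys) are hypothetical. The order of substituents with equal counts may differ from that of the function above.

```python
from collections import Counter


def rank_substituents(substituents_by_key):
    """Return (substituent, count) pairs, most frequent first."""
    counts = Counter()
    for names in substituents_by_key.values():
        counts.update(names)  # add one per listed substituent
    return counts.most_common()


# Hand-written sample input with placeholder keys (not real InChIKeys).
example = {
    'KEY-A': ['Imidazole', 'Carboxylic acid'],
    'KEY-B': ['Carboxylic acid'],
}

ranked = rank_substituents(example)
for name, count in ranked:
    print('{},{} (/{})'.format(name, count, len(example)))
```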
    

### 1. Get the Taxonomy of Documents in the Histidine Mass2Motif (MassBank)

In [5]:
print_m2m_taxonomy(1083)

washington_0873 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

washington_0886 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

washington_0859 KRBMQYPTDYSENE-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Carboxylic acids and derivatives
- Amino acids, peptides, and analogues
- Peptides
- Dipeptides

riken_0373 DOUMFZQKYFQNTF-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Cinnamic acids and derivatives
- Hydroxycinnamic acids and derivatives
- Coumaric acids and derivatives

riken_0684 CQOVPNPJLQNMDC-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organic acids and derivatives
- Peptidomimetics
- Hyb

In [27]:
get_all_substituents(1083)


Organic oxide,5 (/5)
Carbonyl group,5 (/5)
Organooxygen compound,5 (/5)
Hydrocarbon derivative,5 (/5)
Carboxylic acid,5 (/5)
Organic oxygen compound,5 (/5)
Azole,4 (/5)
Organoheterocyclic compound,4 (/5)
Monocarboxylic acid or derivatives,4 (/5)
Imidazole,4 (/5)
Histidine or derivatives,4 (/5)
Organonitrogen compound,4 (/5)
Organic nitrogen compound,4 (/5)
Heteroaromatic compound,4 (/5)
Azacycle,4 (/5)
Organopnictogen compound,4 (/5)
Aromatic heteromonocyclic compound,4 (/5)
Imidazolyl carboxylic acid derivative,4 (/5)
Primary amine,3 (/5)
Primary aliphatic amine,3 (/5)
Amino acid,3 (/5)
Amine,3 (/5)
Fatty acyl,2 (/5)
Secondary carboxylic acid amide,2 (/5)
Alpha-amino acid or derivatives,2 (/5)
Aralkylamine,2 (/5)
Carboxamide group,2 (/5)
Carboxylic acid derivative,2 (/5)
N-acyl-alpha amino acid or derivatives,2 (/5)
Amino acid or derivatives,2 (/5)
N-acyl-alpha-amino acid,2 (/5)
Aromatic homomonocyclic compound,1 (/5)
Alcohol,1 (/5)
Primary alcohol,1 (/5)
Enoate ester,1 (/5)
Alpha-ami

### 2. Get the Taxonomy of Documents in the Adenine Mass2Motif (MassBank)

In [6]:
print_m2m_taxonomy(1367)

washington_0559 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

washington_0601 GOSWTRUMMSCNCW-UHFFFAOYSA-N
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Purine nucleosides
- Purine nucleosides

metabolights_0044 UZKQTCBAMSWPJD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines
- 6-alkylaminopurines

metabolights_0006 LNQVTSROQXJCDD-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Nucleosides, nucleotides, and analogues
- Ribonucleoside 3'-phosphates
- Ribonucleoside 3'-phosphates

ufz_0051 GFFGJBXGBJISGV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Organoheterocyclic compounds
- Imidazopyrimidines
- Purines and purine derivatives
- 6-aminopurines

ufz_0109 GFFGJBXGBJISGV-UHFFFAOYSA-N
- Chemical entities
- O

In [26]:
get_all_substituents(1367)

Organonitrogen compound,10 (/10)
Organic nitrogen compound,10 (/10)
Heteroaromatic compound,10 (/10)
Azacycle,10 (/10)
Hydrocarbon derivative,10 (/10)
Aromatic heteropolycyclic compound,10 (/10)
Azole,9 (/10)
Imidazole,9 (/10)
Organopnictogen compound,9 (/10)
Organooxygen compound,9 (/10)
Pyrimidine,9 (/10)
Organic oxygen compound,9 (/10)
Alcohol,8 (/10)
Aminopyrimidine,8 (/10)
Imidolactam,8 (/10)
Oxacycle,8 (/10)
6-aminopurine,7 (/10)
Organoheterocyclic compound,7 (/10)
N-substituted imidazole,7 (/10)
Primary alcohol,7 (/10)
Imidazopyrimidine,7 (/10)
Purine,7 (/10)
Glycosyl compound,6 (/10)
Secondary alcohol,6 (/10)
Pentose monosaccharide,6 (/10)
N-glycosyl compound,6 (/10)
Amine,6 (/10)
Monosaccharide,6 (/10)
Oxolane,5 (/10)
Primary aromatic amine,5 (/10)
Primary amine,5 (/10)
Purine nucleoside,4 (/10)
6-alkylaminopurine,3 (/10)
Organic oxide,2 (/10)
Secondary aliphatic/aromatic amine,2 (/10)
Tetrahydrofuran,2 (/10)
5'-deoxy-5'-thionucleoside,1 (/10)
1,2-diol,1 (/10)
Organic phosphor

### 3. Get the Taxonomy of Documents in the Ferulic Acid Mass2Motif (MassBank)

In [32]:
print_m2m_taxonomy(1430)

washington_0806 IRUHWRSITUYICV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins

washington_0627 ARQXEQLMMNGFDU-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Coumarin glycosides

washington_0222 ZKMLQDNHMSFULN-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Flavonoids
- Flavones

ipb_0145 ARQXEQLMMNGFDU-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Coumarin glycosides

riken_0389 HSHNITRMYYLLCV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins
- 7-hydroxycoumarins

riken_0404 HSHNITRMYYLLCV-UHFFFAOYSA-N
- Chemical entities
- Organic compounds
- Phenylpropanoids and polyketides
- Coumarins and derivatives
- Hydroxycoumarins
- 7-hydroxyc

In [33]:
get_all_substituents(1430)

Hydrocarbon derivative,10 (/10)
Organic oxide,10 (/10)
Organic oxygen compound,10 (/10)
Organooxygen compound,10 (/10)
Benzenoid,8 (/10)
Oxacycle,8 (/10)
Aromatic heteropolycyclic compound,8 (/10)
Organoheterocyclic compound,7 (/10)
1-benzopyran,6 (/10)
Benzopyran,6 (/10)
Carboxylic acid derivative,5 (/10)
Pyran,5 (/10)
Heteroaromatic compound,5 (/10)
Lactone,5 (/10)
Pyranone,5 (/10)
1-hydroxy-2-unsubstituted benzenoid,4 (/10)
Carbonyl group,3 (/10)
Carboxylic acid ester,3 (/10)
Chromone,2 (/10)
Phenol,2 (/10)
Benzoyl,2 (/10)
Aromatic homomonocyclic compound,2 (/10)
Benzofuran,2 (/10)
Acetal,2 (/10)
Monocarboxylic acid or derivatives,2 (/10)
1-hydroxy-4-unsubstituted benzenoid,2 (/10)
Carboxylic acid,2 (/10)
Benzoate ester,2 (/10)
Dicarboxylic acid or derivatives,2 (/10)
Monocyclic benzene moiety,2 (/10)
Glycosyl compound,1 (/10)
Phenol ether,1 (/10)
Styrene,1 (/10)
5-hydroxyflavonoid,1 (/10)
Flavan,1 (/10)
N-acyl-piperidine,1 (/10)
O-glucuronide,1 (/10)
Benzoic acid,1 (/10)
Naphthopyr