# Grounding a list of metabolites

In [1]:
import re
from gilda import ground



We define some basic functions to load the strings and run Gilda on them. We also define a function to print grounding stats and print out any ungrounded strings.

In [2]:
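def load_texts():
    # Take the first comma-separated column of each row, skipping the header
    with open('plasmax_name_to_kegg.txt') as fh:
        texts = [line.strip().split(',')[0] for line in fh][1:]
    # Deduplicate and sort so that the output is reproducible
    return sorted(set(texts))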

def ground_texts(texts, grounding_fun):
    return {text: grounding_fun(text) for text in texts}

def print_grounding_stats(groundings):
    grounded = [t for t, g in groundings.items() if g]
    ungrounded = [t for t, g in groundings.items() if not g]
    num_texts = len(groundings)
    print('Grounded: %d/%d (%.2f%%)' % (len(grounded), num_texts, 100*len(grounded)/num_texts))
    print(ungrounded)
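
Since `ground_texts` accepts any callable, the helpers can be tried out without Gilda. The sketch below is only an illustration: `fake_ground` and `KNOWN` are hypothetical stand-ins (not part of Gilda) that return an empty list for unknown strings, and `grounding_stats` is a variant of `print_grounding_stats` that returns its values instead of printing them.

```python
def ground_texts(texts, grounding_fun):
    # Same helper as above: map each text to whatever grounding_fun returns
    return {text: grounding_fun(text) for text in texts}

def grounding_stats(groundings):
    # Variant of print_grounding_stats that returns values instead of printing
    grounded = [t for t, g in groundings.items() if g]
    ungrounded = [t for t, g in groundings.items() if not g]
    return len(grounded), len(groundings), ungrounded

# Hypothetical stand-in for gilda.ground: it "grounds" only two known names
# and returns an empty (falsy) list for everything else
KNOWN = {'lactate', 'glutamine'}

def fake_ground(text):
    return [text] if text in KNOWN else []

demo = ground_texts(['akg', 'glutamine', 'lactate'], fake_ground)
num_grounded, num_texts, ungrounded = grounding_stats(demo)
print('Grounded: %d/%d (%.2f%%)' % (num_grounded, num_texts,
                                    100 * num_grounded / num_texts))
print(ungrounded)
```

Any text whose result is falsy counts as ungrounded, which is why an empty list of matches is enough to flag a string.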

First, we run Gilda without any modifications and see what happens.

In [3]:
texts = load_texts()
results = ground_texts(texts, ground)
print_grounding_stats(results)

Grounded: 163/230 (70.87%)
['2-aminomuconicacid', '2-hg', '2/3-phosphoglycerate', '4-pyridoxicacid', '5-phosphoribosyl-1-pyrophosphate', '6-phosphogluconate', 'aconitate', 'akg', 'argininosuccinate', 'ascorbicacid', 'carbamoyl_phosphate', 'carbamoylaspartate', 'carbamoylphosphate', 'cmp-acetylneuraminicacid', 'cysteicacid', 'dihydroacetonephosphate', 'dihydroxyacetonephosphate', 'fructose-16-bisphosphate', 'fructose1-6-bisphosphate', 'fructose1_6-biphosphate', 'glucosamine-6-phosphate', 'glucosamine6-phosphate', 'glutathioneoxidized', 'hydroxyphenyllacticacid', 'indole-3-lacticacid', 'isethionicacid', 'kiv', 'kmv+kic', 'kynurenic_acid', 'kynurenicacid', 'lactoylgsh', 'linoleicacid', 'mannitol/sorbitol', 'methioninesulfoxide', 'methyltryptophan', 'mevalonicacid', 'mevalonicacid5-pyrophosphate', 'myristic_acid', 'myristicacid', 'n-acetylglutamate', 'n-acetylneuramicacid', 'n-methylnicotinamide(nmnm)', 'nicotinamide/picolinamide', 'nicotinamidemononucleotide(nmn)', 'oleic_acid', 'oleicaci

About 71% of the texts (163 of 230) were grounded. Here is one example. Each result is a list of ScoredMatch objects, each of which contains a Term and some metadata. The grounding is included in the Term. Matches are sorted by decreasing score, with the highest-scoring match first.

In [4]:
print(results['lactate'])
print(results['lactate'][0].term.db, results['lactate'][0].term.id)

[ScoredMatch(Term(lactate,lactate,CHEBI,CHEBI:24996,lactate,assertion,famplex,None),1.0,Match(query=lactate,ref=lactate,exact=True,space_mismatch=False,dash_mismatches={},cap_combos=[]))]
CHEBI CHEBI:24996
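
Because matches are sorted by decreasing score, the top grounding for every text can be collected in one pass. This is a minimal sketch: `top_grounding` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `.term.db` and `.term.id` attributes used above, so that the example runs without Gilda.

```python
from types import SimpleNamespace

def top_grounding(matches):
    """Return (db, id) of the first (highest-scoring) match, or None."""
    if not matches:
        return None
    best = matches[0]  # matches are sorted by decreasing score
    return best.term.db, best.term.id

# Stand-in objects exposing only the attributes accessed above; real results
# would come from ground_texts(texts, ground)
fake_match = SimpleNamespace(term=SimpleNamespace(db='CHEBI', id='CHEBI:24996'))
fake_results = {'lactate': [fake_match], 'akg': []}

top = {text: top_grounding(matches) for text, matches in fake_results.items()}
print(top)
```

Ungrounded texts map to `None`, which keeps them visible in the output rather than silently dropping them.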


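On closer inspection, many of the ungrounded entries share a few recurring problems: typos, a trailing abbreviation in parentheses, underscores in place of spaces, a missing space before a suffix such as "acid", and two alternatives joined by a slash. These can be fixed with some preprocessing in the `preprocess_text` function. We can then define and use `ground_preprocess`, which preprocesses each text before grounding it with Gilda.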

In [5]:
typos = {
    'stereamide': 'stearamide',
    'staericacid': 'stearicacid',
}

def preprocess_text(text):
    # Fix known typos first
    if text in typos:
        text = typos[text]
    # Strip a trailing abbreviation in parentheses
    # Example: nicotinamidemononucleotide(nmn)
    text = re.sub(r'(\([a-zA-Z]+\))$', '', text)
    # Example: palmitic_acid
    text = text.replace('_', ' ')
    # Insert a space before a known suffix
    # Example: pipecolicacid
    suffixes = ['acid', 'mononucleotide']
    for suffix in suffixes:
        text = re.sub(r'([^ ])(%s)$' % suffix, r'\1 %s' % suffix, text)
    # Keep only the first of several alternatives
    # Example: nicotinamide/picolinamide
    if '/' in text:
        text = text.split('/')[0]
    return text

def ground_preprocess(text):
    text = preprocess_text(text)
    return ground(text)

results = ground_texts(texts, ground_preprocess)
print_grounding_stats(results)

Grounded: 195/230 (84.78%)
['2-hg', '2/3-phosphoglycerate', '5-phosphoribosyl-1-pyrophosphate', '6-phosphogluconate', 'aconitate', 'akg', 'argininosuccinate', 'carbamoylaspartate', 'carbamoylphosphate', 'dihydroacetonephosphate', 'dihydroxyacetonephosphate', 'fructose-16-bisphosphate', 'fructose1-6-bisphosphate', 'fructose1_6-biphosphate', 'glucosamine-6-phosphate', 'glucosamine6-phosphate', 'glutathioneoxidized', 'hydroxyphenyllacticacid', 'kiv', 'kmv+kic', 'lactoylgsh', 'methioninesulfoxide', 'methyltryptophan', 'mevalonicacid5-pyrophosphate', 'n-acetylglutamate', 'n-acetylneuramicacid', 'oleoamide', 'palmitoylcarnitinec16', 'pentose5-phosphates', 'phenolsulphate', 'pyridoxide', 'sedoheptulose7-phosphate', 'seduheptulose7-phosphate', 'succinicglutathione', 'succinylglutathione']
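
To see what each preprocessing rule does in isolation, the function can be run on a handful of the input strings without calling Gilda. The sketch below repeats `preprocess_text` (using raw-string regular expressions) so that it is self-contained; the choice of example strings is ours.

```python
import re

typos = {
    'stereamide': 'stearamide',
    'staericacid': 'stearicacid',
}

def preprocess_text(text):
    text = typos.get(text, text)
    # Strip a trailing abbreviation in parentheses
    text = re.sub(r'(\([a-zA-Z]+\))$', '', text)
    text = text.replace('_', ' ')
    # Insert a space before a known suffix
    for suffix in ['acid', 'mononucleotide']:
        text = re.sub(r'([^ ])(%s)$' % suffix, r'\1 %s' % suffix, text)
    # Keep only the part before the first slash
    if '/' in text:
        text = text.split('/')[0]
    return text

examples = [
    'nicotinamidemononucleotide(nmn)',  # -> 'nicotinamide mononucleotide'
    'palmitic_acid',                    # -> 'palmitic acid'
    'pipecolicacid',                    # -> 'pipecolic acid'
    'staericacid',                      # -> 'stearic acid'
    'nicotinamide/picolinamide',        # -> 'nicotinamide'
    '2/3-phosphoglycerate',             # -> '2'
]
processed = {text: preprocess_text(text) for text in examples}
for text, new_text in processed.items():
    print('%s -> %s' % (text, new_text))
```

The last example shows a limitation of the slash rule: `2/3-phosphoglycerate` is reduced to `2`, so entries where the slash separates locants rather than whole names would need separate handling.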


The remaining 35 texts are still ungrounded: Gilda doesn't have synonyms that match these strings as written.
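
One possible way to handle such leftovers is a small, manually curated table that expands abbreviations before grounding, in the same spirit as the `typos` dictionary. The sketch below is an assumption-laden illustration: `ABBREVIATIONS`, `expand_abbreviation` and `make_grounder` are hypothetical names, the two expansions reflect the common meanings of these abbreviations rather than anything stated in the input file, and we have not verified that Gilda grounds the expanded names.

```python
# Hypothetical, manually curated expansions (assumed meanings of the
# abbreviations; not taken from the input file)
ABBREVIATIONS = {
    'akg': 'alpha-ketoglutarate',
    '2-hg': '2-hydroxyglutarate',
}

def expand_abbreviation(text, abbreviations=ABBREVIATIONS):
    # Return the expansion if one is curated, otherwise the text unchanged
    return abbreviations.get(text, text)

def make_grounder(grounding_fun, abbreviations=ABBREVIATIONS):
    # Wrap any grounding function so that abbreviations are expanded first
    def grounder(text):
        return grounding_fun(expand_abbreviation(text, abbreviations))
    return grounder

# Demonstration with a stand-in that just records the query it receives;
# in the notebook one would pass ground_preprocess instead
seen = []
def recording_ground(text):
    seen.append(text)
    return []

grounder = make_grounder(recording_ground)
for text in ['akg', 'lactate']:
    grounder(text)
print(seen)
```

In the notebook, this would be used as `ground_texts(texts, make_grounder(ground_preprocess))`, leaving the existing preprocessing untouched.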

### Standardizing the results

INDRA offers utilities to map identifiers and standardize names, which can be useful in this setting. See https://indra.readthedocs.io/en/latest/modules/ontology/standardize.html.

In [6]:
from indra.ontology.standardize import standardize_name_db_refs

In [7]:
standardize_name_db_refs({results['lactate'][0].term.db: results['lactate'][0].term.id})



('lactate', {'CHEBI': 'CHEBI:24996', 'CAS': '113-21-3', 'PUBCHEM': '91435'})

We see that the standard name for this entry from ChEBI is `lactate`, and we were able to get CAS and PUBCHEM mappings for it.
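
The same call can be applied to the top match of every grounded text. The following is a sketch under stated assumptions: `standardize_results` is a hypothetical helper that takes the standardization function as an argument, and `fake_standardize` is a stand-in that only imitates the `(name, db_refs)` return shape seen above, so that the example runs without INDRA.

```python
from types import SimpleNamespace

def standardize_results(results, standardize):
    """Map each grounded text to the (name, db_refs) of its top match."""
    standardized = {}
    for text, matches in results.items():
        if not matches:
            continue  # skip ungrounded texts
        term = matches[0].term
        name, db_refs = standardize({term.db: term.id})
        standardized[text] = (name, db_refs)
    return standardized

# Hypothetical stand-in with the same return shape as
# standardize_name_db_refs; it adds the CAS mapping shown in the output above
def fake_standardize(db_refs):
    return 'lactate', dict(db_refs, CAS='113-21-3')

fake_match = SimpleNamespace(term=SimpleNamespace(db='CHEBI', id='CHEBI:24996'))
fake_results = {'lactate': [fake_match], 'akg': []}

standardized = standardize_results(fake_results, fake_standardize)
print(standardized)
```

In the notebook, this would be called as `standardize_results(results, standardize_name_db_refs)`.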

We can also look at ontological information for the grounded entries via INDRA as follows, with the example of `glutamine`. It looks like `glutamine` has a lot of children in the ChEBI ontology.

In [12]:
from indra.ontology.bio import bio_ontology
glutamine_term = ground('glutamine')[0].term
children = bio_ontology.get_children(glutamine_term.db, glutamine_term.id)
for child in children:
    print(bio_ontology.get_name(*child), child[1])

N(2)-acetyl-D-glutamine CHEBI:144430
N(2)-acylglutamine CHEBI:83985
alpha-chrysopine CHEBI:83080
poly-L-glutamic acid CHEBI:26173
Gln-Cys-Cys CHEBI:144458
Ala-Met-Gln-Gln CHEBI:137239
alpha-N-peptidyl-L-glutamine CHEBI:16376
Cys-Met-Gln CHEBI:144427
N-(gamma-L-glutamyl)-2-naphthylamine CHEBI:90444
Asn-Met-Gln-Pro CHEBI:138505
gamma-glutamylputrescine CHEBI:48006
N(5)-phenyl-L-glutamine CHEBI:79289
Dnp-Gln CHEBI:72487
Glu-Phe-Gln-Gln CHEBI:73488
Gln-Trp CHEBI:141431
(4-{4-[2-(gamma-L-glutamylamino)ethyl]phenoxymethyl}furan-2-yl)methanamine CHEBI:88248
Tnp-Gln CHEBI:72495
N(2)-[(2E)-3-methylhex-2-enoyl]-L-glutamine CHEBI:145321
coprine CHEBI:3875
N(5)-ethyl-L-glutamine CHEBI:17394
N(2)-phenylacetylglutamine CHEBI:8087
L-glutamine derivative CHEBI:24317
Arg-Asn-Gln-Arg CHEBI:73397
ophthalmic acid CHEBI:84058
5,6,7,8-tetrahydrofolyl-L-glutamic acid CHEBI:27650
Asp-Gln-Arg CHEBI:73447
10-formyltetrahydrofolyl glutamate CHEBI:19111
gamma-glutamyltyramine CHEBI:84215
N(2)-[4-(2,4-dichloroph