## Load the grounding reference JSON
We choose one of the test JSONs from the grounding-search repository and load it locally.

In [1]:
import json
import copy
import requests
from collections import Counter
json_url = ('https://raw.githubusercontent.com/PathwayCommons/'
            'grounding-search/8e3b1d7060dca3ca61325e03dadb33abb529caeb/'
            'test/util/data/molecular-cell.json')
def load_json(url):
    # Download the file and fail early on HTTP errors (e.g., 404)
    res = requests.get(url)
    res.raise_for_status()
    return json.loads(res.text)
test_json = load_json(json_url)

In [3]:
json_url

'https://raw.githubusercontent.com/PathwayCommons/grounding-search/8e3b1d7060dca3ca61325e03dadb33abb529caeb/test/util/data/molecular-cell.json'

## Set up the call to the Gilda service
Here we specify the URL for the Gilda service and define a simple function to send a request to the service and return the top grounding result.

In [4]:
#grounding_url = 'http://34.201.164.108:8001/ground'
grounding_url = 'http://localhost:8001/ground'

def ground(text, context):
    res = requests.post(grounding_url, json={'text': text, 'context': context})
    rj  = res.json()
    if not rj:
        return None
    else:
        return rj[0]
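
To make the shape of the response concrete, the sketch below applies the same "take the first result" logic to a hand-written list, so it runs without the service. The sample payload is hypothetical: only the `term`, `db` and `id` keys are taken from the code in this notebook, while the `score` values and the identifiers are made up for illustration.

```python
# Hypothetical response: a list of matches, assumed to be ordered best-first.
sample_response = [
    {'term': {'db': 'HGNC', 'id': '6840'}, 'score': 0.9},
    {'term': {'db': 'FPLX', 'id': 'MEK'}, 'score': 0.7},
]


def top_grounding(matches):
    """Return the (db, id) pair of the first match, or None if there is none."""
    if not matches:
        return None
    term = matches[0]['term']
    return term['db'], term['id']


print(top_grounding(sample_response))
print(top_grounding([]))
```

An empty list is treated as "no grounding", which mirrors the `if not rj` check in `ground`.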

## Ground the reference text strings with Gilda
We can now iterate over all the papers and entities in the JSON and ground each entity text, using its sentence as context. The results are stored in a copy of the original JSON, so both have the same structure and are easy to compare. We also use INDRA's HGNC client to map HGNC groundings to NCBI gene IDs whenever available, again for easier comparison with the reference.

In [5]:
from indra.databases import hgnc_client
gilda_json = copy.deepcopy(test_json)
for paper in gilda_json:
    for entity in paper['entities']:
        text = entity['text']
        sentence = entity['sentence']
        grounding = ground(text, sentence)
        # If there is no grounding, we enter Nones
        if not grounding:
            entity['namespace'] = None
            entity['xref_id'] = None
        else:
            term = grounding['term']
            db, db_id = term['db'].lower(), term['id']
            # Map HGNC groundings to NCBI gene IDs for consistency with
            # the reference
            if db == 'hgnc':
                db = 'ncbi'
                db_id = hgnc_client.get_entrez_id(db_id)
            # Strip the 'CHEBI:' prefix from ChEBI IDs for consistency
            # with the reference
            elif db == 'chebi':
                db_id = db_id[len('CHEBI:'):]
            entity['namespace'] = db
            entity['xref_id'] = db_id
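
The normalization step can also be read in isolation. The following is a minimal sketch that factors it into a pure function, assuming a plain dictionary (`hgnc_to_entrez`, a hypothetical stand-in) in place of INDRA's `hgnc_client` so that it runs without INDRA installed. The IDs in the toy mapping are invented.

```python
def normalize_grounding(db, db_id, hgnc_to_entrez):
    """Map a raw (db, id) pair onto the conventions used by the reference."""
    db = db.lower()
    if db == 'hgnc':
        # Genes are reported as NCBI gene IDs; None if no mapping is known
        return 'ncbi', hgnc_to_entrez.get(db_id)
    if db == 'chebi' and db_id.startswith('CHEBI:'):
        # The reference stores ChEBI IDs without the 'CHEBI:' prefix
        return db, db_id[len('CHEBI:'):]
    return db, db_id


toy_map = {'100': '200'}  # hypothetical HGNC -> NCBI mapping
print(normalize_grounding('HGNC', '100', toy_map))
print(normalize_grounding('CHEBI', 'CHEBI:17234', toy_map))
print(normalize_grounding('FPLX', 'AMPK', toy_map))
```

Note that an HGNC ID without an NCBI mapping yields `('ncbi', None)`, which the statistics below would count as ungrounded.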
            

## Analyze the results
Let's calculate some statistics, first for the original (reference) JSON and then for the JSON grounded by Gilda.

In [6]:
def get_grounding_stats(jd):
    grounded = 0
    all_entities = 0
    ungrounded_texts = []
    namespaces = []
    for paper in jd:
        for entity in paper['entities']:
            all_entities += 1
            if entity['namespace'] and entity['xref_id']:
                grounded += 1
                namespaces.append(entity['namespace'])
            else:
                ungrounded_texts.append(entity['text'])
    print('Number of entity mentions: %s\nNumber Grounded: %s' %
          (all_entities, grounded))
    print('Name spaces grounded to:')
    for ns, count in sorted(Counter(namespaces).items(),
                           key=lambda x: x[1], reverse=True):
        print('- %s: %s' % (ns, count))
    print('Ungrounded texts:')
    for text, count in sorted(Counter(ungrounded_texts).items(),
                              key=lambda x: x[1], reverse=True):
        print('- %s: %s' % (text, count))
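
If the numbers are needed for further processing rather than for display, the same counting can return a dictionary instead of printing. This is a sketch, not part of the original analysis: the function name `grounding_summary` and the `coverage` field are additions for illustration, and the toy input is made up.

```python
from collections import Counter


def grounding_summary(jd):
    """Count entity mentions and grounded mentions per namespace."""
    total = 0
    namespaces = Counter()
    for paper in jd:
        for entity in paper['entities']:
            total += 1
            # Same criterion as above: both namespace and ID must be set
            if entity['namespace'] and entity['xref_id']:
                namespaces[entity['namespace']] += 1
    grounded = sum(namespaces.values())
    return {
        'total': total,
        'grounded': grounded,
        'coverage': grounded / total if total else 0.0,
        'namespaces': dict(namespaces),
    }


toy = [{'entities': [
    {'text': 'A', 'namespace': 'ncbi', 'xref_id': '1'},
    {'text': 'B', 'namespace': None, 'xref_id': None},
]}]
print(grounding_summary(toy))
```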

In [7]:
get_grounding_stats(test_json)

Number of entity mentions: 573
Number Grounded: 521
Name spaces grounded to:
- ncbi: 382
- chebi: 139
Ungrounded texts:
- RPA: 4
- APC: 2
- AMPK: 2
- PI3K: 2
- glutathione S-transferase: 2
- pocket proteins: 2
- replication protein A: 1
- PP2A: 1
- 5-Ethynyl-2'-deoxyuridine: 1
- EdU: 1
- AktVIII: 1
- BI-D1870: 1
- SB216763: 1
- Sin3: 1
- H4: 1
- H2: 1
- PRMTs: 1
- IkappaB kinase: 1
- RelB NF-kappaB: 1
- protein phosphatase 2b: 1
- mTORC1: 1
- HIF-1: 1
- NF-kappaB: 1
- CBF: 1
- RLCK: 1
- receptor-like cytoplasmic kinase: 1
- PP1: 1
- ER: 1
- GSK2606414: 1
- ISRIB: 1
- AMPK1: 1
- CaM: 1
- DNA polymerase delta: 1
- atezolizumab: 1
- sirtuins: 1
- topoisomerase IV: 1
- anaphase-promoting complex: 1
- SCF: 1
- Pdhk: 1
- GFAT: 1
- glutamine-fructose-6-phosphate aminotransferase: 1
- CIA: 1
- Ubiquitin: 1
- alpha-tubulin: 1


In [8]:
get_grounding_stats(gilda_json)

Number of entity mentions: 573
Number Grounded: 501
Name spaces grounded to:
- ncbi: 330
- chebi: 129
- fplx: 27
- mesh: 14
- go: 1
Ungrounded texts:
- Pontin: 2
- Swe1: 2
- Clb2: 2
- Clb5: 2
- pocket proteins: 2
- CYCD: 2
- CYCE: 2
- U6 snRNA: 2
- Ndd1: 1
- Sli15: 1
- Cdc55: 1
- Pph21: 1
- 5-Ethynyl-2'-deoxyuridine: 1
- EdU: 1
- AktVIII: 1
- BI-D1870: 1
- SB216763: 1
- Ldh1: 1
- Sin3: 1
- PRMTs: 1
- miR-196b-3p: 1
- RelB NF-kappaB: 1
- alpha-KG: 1
- GDH1: 1
- GDH2: 1
- Snail3: 1
- CRPK1: 1
- CAMTA3: 1
- MYB15: 1
- open stomata 1: 1
- MKK2: 1
- AtCIPK3: 1
- brassinosteroid-insensitive 1-EMS suppressor 1: 1
- BES1: 1
- COR15B: 1
- RD29A: 1
- KIN1: 1
- RLCK: 1
- receptor-like cytoplasmic kinase: 1
- 14-3-3lambda: 1
- IFNlambda: 1
- miR-130: 1
- IHO1: 1
- gammaH2AX: 1
- LC3B: 1
- Cytb5: 1
- GSK2606414: 1
- ISRIB: 1
- 14-3-3eta: 1
- AMPK1: 1
- atezolizumab: 1
- RPA32: 1
- YgdH: 1
- PpnN: 1
- pppGpp: 1
- lacZ: 1
- Clb3: 1
- Bni1: 1
- Bud3: 1
- Bud6: 1
- sterol regulatory element-binding pro

In [9]:
gilda_json

[{'id': 'http://dx.doi.org/10.1016/j.molcel.2017.01.019',
  'organismOrdering': [9606],
  'entities': [{'text': 'RUVBL1',
    'xref_id': '8607',
    'namespace': 'ncbi',
    'sentence': 'PRMT5 Dependent Methylation of the TIP60 Coactivator RUVBL1 Is a Key Regulator of Homologous Recombination  Volume 65 , Issue 5 , 2 March 2017 , Pages 900-916  Thomas L. Clarke , Maria Pilar Sanchez-Bailon , Kelly Chiang , John J. Reynolds , Joaquin Herrero-Ruiz , Tiago M. Bandeiras , Pedro M. Matias , Sarah L. Maslen , J. Mark Skehel , Grant S.Stewart , Clare C. Davies  Summary  Protein post-translation modification plays an important role in regulating DNA repair ; however , the role of arginine methylation in this process is poorly understood .'},
   {'text': 'TIP60',
    'xref_id': '10524',
    'namespace': 'ncbi',
    'sentence': 'PRMT5 Dependent Methylation of the TIP60 Coactivator RUVBL1 Is a Key Regulator of Homologous Recombination  Volume 65 , Issue 5 , 2 March 2017 , Pages 900-916  Thomas L.

In [10]:
def compare_groundings(g1, g2):
    if g1 == (None, None):
        if g2 == (None, None):
            return 'ungrounded_ungrounded'
        else:
            return 'ungrounded_grounded'
    elif g2 == (None, None):
        return 'grounded_ungrounded'
    elif g1 == g2:
        return 'grounded_matching'
    else:
        return 'grounded_different'

comparison = {}
for ref_block, gilda_block in zip(test_json, gilda_json):
    for ref_entry, gilda_entry in zip(ref_block['entities'], gilda_block['entities']):
        ref_grounding = (ref_entry['namespace'], ref_entry['xref_id'])
        gilda_grounding = (gilda_entry['namespace'], gilda_entry['xref_id'])
        comp = compare_groundings(ref_grounding, gilda_grounding)
        if comp in comparison:
            comparison[comp].append((ref_entry['text'], ref_grounding, gilda_grounding))
        else:
            comparison[comp] = [(ref_entry['text'], ref_grounding, gilda_grounding)]
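
Before looking at individual entries, it can help to see how many mentions fall into each category. The sketch below does this for a small, hypothetical stand-in for the `comparison` dictionary; the helper name `category_counts` is an addition for illustration.

```python
# Hypothetical stand-in for the `comparison` dict built above
toy_comparison = {
    'grounded_matching': [('x', ('ncbi', '1'), ('ncbi', '1'))],
    'grounded_different': [('y', ('ncbi', '2'), ('ncbi', '3')),
                           ('z', ('chebi', '4'), ('mesh', 'D5'))],
}


def category_counts(comparison):
    """Return (category, count) pairs, largest category first."""
    return sorted(((cat, len(rows)) for cat, rows in comparison.items()),
                  key=lambda x: x[1], reverse=True)


for cat, count in category_counts(toy_comparison):
    print('%s: %d' % (cat, count))
```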

### Compare Gilda with the reference
Let's now look at the comparison in detail and see if we can identify any patterns.

1. Both the reference and Gilda provided grounding but the groundings are different

In [11]:
comparison['grounded_different']

[('nitrogen', ('chebi', '25555'), ('mesh', 'D009584')),
 ('hydrogen', ('chebi', '49637'), ('mesh', 'D006859')),
 ('Ask1', ('ncbi', '853814'), ('ncbi', '4217')),
 ('Stb1', ('ncbi', '855407'), ('ncbi', '57216')),
 ('Net1', ('ncbi', '853369'), ('ncbi', '10276')),
 ('Sic1', ('ncbi', '850768'), ('ncbi', '2313')),
 ('Orc6', ('ncbi', '856518'), ('ncbi', '23594')),
 ('Foxk1', ('ncbi', '17425'), ('ncbi', '221937')),
 ('Tfeb', ('ncbi', '21425'), ('ncbi', '7942')),
 ('Atf4', ('ncbi', '11911'), ('ncbi', '468')),
 ('glucose', ('chebi', '4167'), ('chebi', '17234')),
 ('Foxk2', ('ncbi', '68837'), ('ncbi', '3607')),
 ('Shmt2', ('ncbi', '108037'), ('ncbi', '6472')),
 ('Pgm2', ('ncbi', '72157'), ('ncbi', '55276')),
 ('Mthfd1l', ('ncbi', '270685'), ('ncbi', '25902')),
 ('Eno1', ('ncbi', '13806'), ('ncbi', '2023')),
 ('Tpi1', ('ncbi', '21991'), ('ncbi', '7167')),
 ('Aldoa', ('ncbi', '11674'), ('ncbi', '226')),
 ('glucose-6-phosphate', ('chebi', '4170'), ('mesh', 'D019298')),
 ('Psat1', ('ncbi', '107272'),

Some key differences are:
- The reference grounds some genes to organism-specific (non-human) identifiers, whereas Gilda is running in human-only mode.
- Some chemicals that the reference grounds to ChEBI are grounded to MeSH by Gilda, and both groundings look correct. Gilda also returns ChEBI groundings in these cases, but the MeSH groundings have a higher score.
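
Since the ChEBI grounding is present further down the result list in these cases, one possible way to align with the reference is to pick the highest-ranked match from a preferred set of namespaces. The sketch below assumes the full list of matches is available, ordered best-first (the `ground` function above would have to return `rj` rather than `rj[0]`). The helper name and the scores in the toy list are hypothetical.

```python
def first_match_in(matches, preferred_dbs):
    """Return the highest-ranked match whose namespace is preferred.

    Falls back to the top match overall, or None for an empty list.
    """
    for match in matches:
        if match['term']['db'].lower() in preferred_dbs:
            return match
    return matches[0] if matches else None


# Hypothetical result list; the scores are made up
toy_matches = [
    {'term': {'db': 'MESH', 'id': 'D009584'}, 'score': 0.8},
    {'term': {'db': 'CHEBI', 'id': 'CHEBI:25555'}, 'score': 0.7},
]
best = first_match_in(toy_matches, {'chebi'})
print(best['term']['db'], best['term']['id'])
```

This trades a slightly lower score for a namespace that matches the reference, so it is only appropriate when the goal is comparison rather than finding the single best grounding.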


2. Gilda provides grounding for something that is ungrounded in the reference

In [12]:
comparison['ungrounded_grounded']

[('replication protein A', (None, None), ('fplx', 'RPA')),
 ('PP2A', (None, None), ('fplx', 'PPP2')),
 ('APC', (None, None), ('ncbi', '324')),
 ('RPA', (None, None), ('fplx', 'RPA')),
 ('H4', (None, None), ('ncbi', '8030')),
 ('H2', (None, None), ('chebi', '18276')),
 ('IkappaB kinase', (None, None), ('fplx', 'IKK_family')),
 ('protein phosphatase 2b', (None, None), ('fplx', 'PPP3')),
 ('AMPK', (None, None), ('fplx', 'AMPK')),
 ('mTORC1', (None, None), ('fplx', 'mTORC1')),
 ('HIF-1', (None, None), ('fplx', 'HIF1')),
 ('NF-kappaB', (None, None), ('fplx', 'NFkappaB')),
 ('PI3K', (None, None), ('fplx', 'PI3K')),
 ('glutathione S-transferase', (None, None), ('fplx', 'GST')),
 ('CBF', (None, None), ('ncbi', '10153')),
 ('PI3K', (None, None), ('fplx', 'PI3K')),
 ('glutathione S-transferase', (None, None), ('fplx', 'GST')),
 ('PP1', (None, None), ('fplx', 'PPP1')),
 ('RPA', (None, None), ('fplx', 'RPA')),
 ('ER', (None, None), ('fplx', 'ESR')),
 ('AMPK', (None, None), ('fplx', 'AMPK')),
 ('RP

The vast majority of these cases are protein families or complexes. These are covered by FamPlex, which Gilda integrates.

3. Groundings provided by the reference that are ungrounded by Gilda

In [13]:
comparison['grounded_ungrounded']

[('Pontin', ('ncbi', '8607'), (None, None)),
 ('Ndd1', ('ncbi', '854554'), (None, None)),
 ('Swe1', ('ncbi', '853252'), (None, None)),
 ('Sli15', ('ncbi', '852453'), (None, None)),
 ('Cdc55', ('ncbi', '852685'), (None, None)),
 ('Clb2', ('ncbi', '856236'), (None, None)),
 ('Clb5', ('ncbi', '856237'), (None, None)),
 ('Pph21', ('ncbi', '851421'), (None, None)),
 ('Ldh1', ('ncbi', '16828'), (None, None)),
 ('Pontin', ('ncbi', '8607'), (None, None)),
 ('miR-196b-3p', ('ncbi', '442920'), (None, None)),
 ('alpha-KG', ('chebi', '30915'), (None, None)),
 ('GDH1', ('ncbi', '2746'), (None, None)),
 ('GDH2', ('ncbi', '2747'), (None, None)),
 ('Snail3', ('ncbi', '333929'), (None, None)),
 ('CRPK1', ('ncbi', '838236'), (None, None)),
 ('CAMTA3', ('ncbi', '816762'), (None, None)),
 ('MYB15', ('ncbi', '821904'), (None, None)),
 ('open stomata 1', ('ncbi', '829541'), (None, None)),
 ('MKK2', ('ncbi', '829103'), (None, None)),
 ('AtCIPK3', ('ncbi', '817240'), (None, None)),
 ('brassinosteroid-insensit

We can identify the following patterns above:
- The reference grounds yeast-specific genes (Bni1, Bud3, Bud6, etc.), whereas Gilda is running in human-only mode.
- Some synonyms recognized by the reference don't appear in a close enough form in HGNC, UniProt, or FamPlex, so Gilda leaves them ungrounded (CYCD, CYCE, 14-3-3eta, etc.).
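
To separate the organism effect from genuine misses, the comparison could be restricted to papers that are primarily about human. The sketch below uses the `organismOrdering` field visible in the JSON output above and assumes, without confirmation from the source, that its first entry is the most relevant organism. 9606 is the NCBI Taxonomy ID for human; the toy papers and the helper name are made up for illustration.

```python
HUMAN_TAXON_ID = 9606  # NCBI Taxonomy ID for Homo sapiens


def human_first_papers(papers):
    """Keep papers whose first listed organism is human.

    Assumption: 'organismOrdering' lists organisms from most to least
    relevant. Papers without the field are dropped.
    """
    return [paper for paper in papers
            if paper.get('organismOrdering', [])[:1] == [HUMAN_TAXON_ID]]


toy_papers = [
    {'id': 'paper-1', 'organismOrdering': [9606], 'entities': []},
    {'id': 'paper-2', 'organismOrdering': [4932, 9606], 'entities': []},  # yeast first
    {'id': 'paper-3', 'entities': []},
]
print([paper['id'] for paper in human_first_papers(toy_papers)])
```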