# Making a Custom Grounder for Gilda

This tutorial presents several ways of generating custom groundings that can be used with Gilda.

In [1]:
import gilda
import gilda.term
from tabulate import tabulate
from gilda.process import normalize
import pandas as pd
from tqdm.auto import tqdm
import time
import sys

In [2]:
print(sys.version)

3.10.2 (main, Feb  2 2022, 06:19:27) [Clang 13.0.0 (clang-1300.0.29.3)]


In [3]:
print(time.asctime())

Tue Apr 26 14:21:31 2022


In [4]:
def matches_df(scored_matches) -> pd.DataFrame:
    return pd.DataFrame([
        { 
            **m.term.to_json(),
            'score': m.score,
            'url': m.url,
            # **m.match.to_json()
        } 
        for m in scored_matches
    ])

# Custom Terms from OBO Graph JSON

Many ontologies are pre-parsed into the [OBO Graph JSON](https://github.com/geneontology/obographs) format that are readily usable without ontology-specific software. In this example, we get the URL for an OBO Graph JSON from the [Bioregistry](https://github.com/biopragmatics/bioregistry) for the [Monarch Disease Ontolgy (MONDO)](https://obofoundry.org/ontology/mondo) then generate Gilda terms for its entries based on their names and synonyms.

In [5]:
import requests
import bioregistry

url = bioregistry.get_json_download("MONDO")
res = requests.get(url).json()

In [6]:
custom_terms = []

prefix = "MONDO"
uri_prefix = "http://purl.obolibrary.org/obo/MONDO_"

missing_label = 0

for node in tqdm(res['graphs'][0]['nodes']):
    uri = node['id']
    if not uri.startswith(uri_prefix):
        continue  # skip imported terms
    
    identifier = uri[len(uri_prefix):]
    
    name = node.get('lbl')
    if name is None:
        missing_label += 1
        continue
    
    custom_terms.append(gilda.term.Term(
        norm_text=normalize(name),
        text=name,
        db=prefix,
        id=identifier,
        entry_name=name,
        status="name",
        source=prefix,
    ))
    for synonym_data in node.get('meta', {}).get('synonyms', []):
        synonym = synonym_data['val']
        custom_terms.append(gilda.term.Term(
            norm_text=normalize(synonym),
            text=synonym,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="synonym",
            source=prefix,
        ))

print(f"{missing_label:,} nodes were missing labels")
custom_mondo_grounder = gilda.make_grounder(custom_terms)
custom_mondo_grounder.print_summary()

  0%|          | 0/44291 [00:00<?, ?it/s]

74 nodes were missing labels
Lookups: 112,921
Terms: 128,873
Term Namespaces: {'MONDO'}
Term Statuses: {'name': 24907, 'synonym': 103966}
Adeft Disambiguators: 0
Gilda Disambiguators: 1,008



In [7]:
matches_df(custom_mondo_grounder.ground("alzheimer disease"))

Unnamed: 0,norm_text,text,db,id,entry_name,status,source,score,url
0,alzheimer disease,Alzheimer disease,MONDO,4975,Alzheimer disease,name,MONDO,0.771593,https://identifiers.org/mondo:0004975
1,alzheimer disease,Alzheimer disease,MONDO,7088,Alzheimer disease type 1,synonym,MONDO,0.549371,https://identifiers.org/mondo:0007088


In [8]:
matches_df(custom_mondo_grounder.ground("alzheimer's disease"))

Unnamed: 0,norm_text,text,db,id,entry_name,status,source,score,url
0,alzheimer's disease,Alzheimer's disease,MONDO,4975,Alzheimer disease,synonym,MONDO,0.511647,https://identifiers.org/mondo:0004975


# Custom Terms from an Ontology via `obonet`

The [`obonet`](https://github.com/dhimmel/obonet) package is a lightweight tool for parsing ontologies in the OBO text format into NetworkX graph objects. This example shows loading a custom grounder with terms from the [Cell Ontology (CL)](https://obofoundry.org/ontology/cl).

In [9]:
import obonet

prefix = "CL"
g = obonet.read_obo(
    "https://raw.githubusercontent.com/obophenotype/cell-ontology/master/cl-basic.obo"
)

custom_terms = []
for node, data in g.nodes(data=True):
    # Skip entries imported from other ontologies
    if not node.startswith("CL:"):
        continue
        
    identifier = node.removeprefix("CL:")

    name = data["name"]
    custom_terms.append(gilda.term.Term(
        norm_text=normalize(name),
        text=name,
        db=prefix,
        id=identifier,
        entry_name=name,
        status="name",
        source=prefix,
    ))
    
    # Add terms for all synonyms
    for synonym_raw in data.get("synonym", []):
        try:
            # Try to parse out of the quoted OBO Field
            synonym = synonym.split('"')[1]
        except IndexError:
            continue  # the synonym was malformed

        custom_terms.append(gilda.term.Term(
            norm_text=normalize(synonym),
            text=synonym,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="synonym",
            source=prefix,
        ))
        
custom_cl_grounder = gilda.make_grounder(custom_terms)
custom_cl_grounder.print_summary()

Lookups: 2,454
Terms: 2,454
Term Namespaces: {'CL'}
Term Statuses: {'name': 2454}
Adeft Disambiguators: 0
Gilda Disambiguators: 1,008



In [10]:
matches_df(custom_cl_grounder.ground("Mast cells"))

Unnamed: 0,norm_text,text,db,id,entry_name,status,source,score,url
0,mast cell,mast cell,CL,97,mast cell,name,CL,0.771593,https://identifiers.org/cl:0000097


# Custom Terms from PyOBO

[PyOBO](https://github.com/pyobo/pyobo) is a general tool for converting semantic spaces into ontologies and can be used to access additional vocabularies in an ontology-like way. In this example, several pathway databases are loaded for grounding including Reactome, WikiPathways, PathBank, and the Pathway Ontology (which itself actually is an ontology).

In [11]:
import pyobo
import pyobo.api.utils

print(pyobo.get_version())

0.7.0


In [12]:
custom_pathway_terms = []

prefixes = [
    "reactome", 
    "wikipathways", 
    "pw",  # Pathway ontology
    "pathbank",
]

# Repeat the steps for several pathway resources
for prefix in prefixes:
    version = pyobo.api.utils.get_version(prefix)
    names = pyobo.get_id_name_mapping(prefix)
    synonyms = pyobo.get_id_synonyms_mapping(prefix)
    print(
        f"{prefix} v{version}, {len(names):,} names, {sum(len(v) for v in synonyms.values()):,} synonyms"
    )

    for identifier, name in names.items():
        # Create a Gilda term for the standard label
        custom_pathway_terms.append(gilda.Term(
            norm_text=normalize(name),
            text=name,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="name",
            source=prefix,
        ))
        
        # Create a Gilda term for each synonym
        for synonym in synonyms.get(identifier, []):
            custom_pathway_terms.append(gilda.Term(
                norm_text=normalize(synonym),
                text=synonym,
                db=prefix,
                id=identifier,
                entry_name=name,
                status="synonym",
                source=prefix,
            ))



reactome v80, 21,423 names, 0 synonyms
wikipathways v20220410, 1,718 names, 0 synonyms
pw v2019-10-23, 2,600 names, 1,957 synonyms
pathbank v2.0, 110,242 names, 0 synonyms


In [13]:
# Generate a grounder using a list of Gilda terms
custom_pathway_grounder = gilda.make_grounder(custom_pathway_terms)
custom_pathway_grounder.print_summary()

Lookups: 76,499
Terms: 137,940
Term Namespaces: {'wikipathways', 'pathbank', 'reactome', 'pw'}
Term Statuses: {'name': 135983, 'synonym': 1957}
Adeft Disambiguators: 0
Gilda Disambiguators: 1,008



In [14]:
scored_matches = custom_pathway_grounder.ground("apoptosis")
pd.DataFrame([
    { 
        **m.term.to_json(),
        'score': m.score,
        'url': m.url,
        **m.match.to_json()
    } 
    for m in scored_matches
])

Unnamed: 0,norm_text,text,db,id,entry_name,status,source,score,url,query,ref,exact,space_mismatch,dash_mismatches,cap_combos
0,apoptosis,Apoptosis,reactome,R-BTA-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-BTA-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
1,apoptosis,Apoptosis,reactome,R-CEL-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-CEL-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
2,apoptosis,Apoptosis,reactome,R-CFA-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-CFA-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
3,apoptosis,Apoptosis,reactome,R-DDI-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-DDI-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
4,apoptosis,Apoptosis,reactome,R-DME-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-DME-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
5,apoptosis,Apoptosis,reactome,R-DRE-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-DRE-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
6,apoptosis,Apoptosis,reactome,R-GGA-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-GGA-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
7,apoptosis,Apoptosis,reactome,R-HSA-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-HSA-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
8,apoptosis,Apoptosis,reactome,R-MMU-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-MMU-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
9,apoptosis,Apoptosis,reactome,R-PFA-109581,Apoptosis,name,reactome,0.762317,https://identifiers.org/reactome:R-PFA-109581,apoptosis,Apoptosis,False,False,[],"[(all_lower, initial_cap)]"
