# Rapid Construction of DKGs in ASKEM

This notebook explains what goes into creating a domain knowledge graph (DKG) and how new ones can be rapidly constructed for new use cases.


## Configuration

MIRA builds on a combination of pre-ASKEM and ASKEM-specific infrastructure for acquiring, parsing, and processing ontologies, knowledge graphs, and databases. In order to determine which parts of this infrastructure are used and how, each DKG is configured with the following pieces of information:

1. What ontologies should be pulled in? For example, the epidemiology use case brings in the Infectious Disease Ontology (IDO), the National Cancer Institute Thesaurus (NCIT), and several others from the biomedical domain.
2. Is there a custom ASKEM domain ontology? For both epidemiology and space weather, we have constructed custom domain ontologies to help align terms from existing ontologies and to curate novel terms that are necessary for grounding the concepts appearing in models in the domain.
3. Are there custom knowledge resources? These are highly specific to each use case and not easily translated into a declarative configuration.

There are several pre-defined configurations for DKGs built into MIRA, including:

- `epi` for epidemiology
- `space` for space weather
- `eco` for ecology

## Running the Constructor

After installing MIRA, the pre-defined DKGs can be constructed using the following shell command:

```shell
python -m mira.dkg.construct --use-case epi
```

This does the following:

1. Creates nodes and edges files that can be loaded into a Neo4j graph database. These are orchestrated along with the MIRA web application in Docker (more information [here](https://github.com/indralab/mira/tree/main/docker)).
2. Creates an RDF dump that can be loaded into any triple store and queried with SPARQL.
3. Uses graph machine learning to create dense node embeddings based on the topology of the knowledge graph.
4. Creates a "metaregistry" configuration file. Note that we actually combine all of the metaregistry configurations together into a single service running at http://mira-metaregistry-lb-3d0089f8b56257ad.elb.us-east-1.amazonaws.com/.

When these have been adequately tested locally, the command can be run with the `--do-upload` flag to automate uploading these artifacts to S3. The uploads are versioned by the date on which the build was performed. Inside the DKG, various nodes, edges, and metadata are tagged with the version of the resources (e.g., ontologies) from which they came.
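As a rough illustration of date-based versioning, the sketch below derives a version string from the build date and uses it as part of an artifact key. The `versioned_key` helper, the ISO date format, and the `use_case/version/filename` layout are all assumptions made for illustration; they are not taken from MIRA's upload code.

```python
from datetime import date


def versioned_key(use_case: str, filename: str, build_date: date) -> str:
    """Build a hypothetical storage key of the form use_case/YYYY-MM-DD/filename."""
    version = build_date.isoformat()  # assumed version format, e.g. "2023-05-11"
    return f"{use_case}/{version}/{filename}"


# Example: key for the nodes file of a build performed on 11 May 2023
example_key = versioned_key("genereg", "nodes.tsv.gz", date(2023, 5, 11))
print(example_key)
```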

## Arbitrary Construction

In addition to the pre-configured DKGs, new DKGs can be rapidly constructed using the Python API to MIRA.

In [1]:
from mira.dkg.construct import construct, DKGConfig

While the epidemiology use case had a close relationship between the DKG and the modeling framework, the new regnet framework being proposed in the early May 2023 hackathon can appear in several domains including:

- gene transcription regulatory networks
- ecological networks (i.e., predator-prey networks *a la* Lotka-Volterra)
- cell signalling networks

In the following example, we construct an *ad hoc* domain knowledge graph by specifying several resources useful for describing genes and phenomena that may appear in these networks including:

- HGNC - an ontology of human genes that contains relationships to gene families and proteins. Useful for annotating components of models.
- Gene Ontology (GO) - an ontology of molecular functions and biological processes used in annotating genes. Useful for both model-level annotation and annotating relationships between components of models (e.g., edges).
- WikiPathways - a resource categorizing genes in functional pathways. Useful for model-level annotation.
- ProbOnto - an ontology of probability distributions

In [2]:
genereg_config = DKGConfig(
    use_case="genereg",
    prefixes=[
        "hgnc",
        "go",
        "wikipathways",
        "probonto",
    ],
)

use_case_paths = construct(
    config=genereg_config,
    # do_upload=True,  # uncomment when you're ready to send to S3!
)

standardizing nodes:   0%|          | 0.00/1.45k [00:00<?, ?it/s]
standardizing edges:   0%|          | 0.00/2.85k [00:00<?, ?it/s]
Loading probonto:   0%|          | 0/142 [00:00<?, ?term/s]

Units
  0%|          | 0/1764 [00:00<?, ?unit/s]

HUGO Gene Nomenclature Committee (1 graphs)
hgnc:   0%|          | 0/1 [00:00<?, ?graph/s]
  0%|          | 0.00/150k [00:00<?, ?edge/s]
output edges to /Users/cthoyt/.data/mira/genereg/sources/edges_hgnc.tsv

Gene Ontology (1 graphs)
go:   0%|          | 0/1 [00:00<?, ?graph/s]
  0%|          | 0.00/87.0k [00:00<?, ?edge/s]
output edges to /Users/cthoyt/.data/mira/genereg/sources/edges_go.tsv

WikiPathways (1 graphs)
wikipathways:   0%|          | 0/1 [00:00<?, ?graph/s]
  0%|          | 0.00/70.4k [00:00<?, ?edge/s]
output edges to /Users/cthoyt/.data/mira/genereg/sources/edges_wikipathways.tsv

  0%|          | 0.00/90.8k [00:00<?, ?node/s]
output edges to /Users/cthoyt/.data/mira/genereg/nodes.tsv.gz

cat edges:   0%|          | 0/4 [00:00<?, ?it/s]
0.00node [00:00, ?node/s]
0.00edge [00:00, ?edge/s]
serializing to turtle
done

109knode [00:00, 169knode/s]
191kedge [00:00, 442kedge/s]

INFO: [2023-05-11 18:05:34] root - Ensmallen is using Haswell


Alternatively, if you would rather use the command line, you can declare this configuration as the following JSON (let's say it's called `config.json`):

```json
{
  "use_case": "genereg",
  "prefixes": ["hgnc", "go", "wikipathways", "probonto"]
}
```

Then run the following command:

```shell
python -m mira.dkg.construct --use-case config.json
```

The `--use-case` flag is smart enough to figure out if it's pointing to a file or a pre-configured name for a use case.
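One way such a flag could distinguish the two cases is to check whether its value names an existing file and otherwise treat it as a pre-configured name. The following is a minimal sketch of that idea, not MIRA's actual implementation; the `resolve_use_case` function and the `PREDEFINED` dictionary (including its empty prefix lists) are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry of pre-configured use cases; the real prefix lists
# live inside MIRA and are deliberately left empty here.
PREDEFINED = {
    "epi": {"use_case": "epi", "prefixes": []},
    "space": {"use_case": "space", "prefixes": []},
    "eco": {"use_case": "eco", "prefixes": []},
}


def resolve_use_case(value: str) -> dict:
    """Return a configuration dict for a JSON file path or a pre-configured name."""
    path = Path(value)
    if path.is_file():
        # The value points to a file, so parse it as a JSON configuration
        return json.loads(path.read_text())
    if value in PREDEFINED:
        return PREDEFINED[value]
    raise ValueError(f"unknown use case or missing file: {value}")


# Usage: write the genereg configuration to a temporary file and resolve it
with tempfile.TemporaryDirectory() as directory:
    config_path = Path(directory) / "config.json"
    config_path.write_text(
        json.dumps(
            {"use_case": "genereg", "prefixes": ["hgnc", "go", "wikipathways", "probonto"]}
        )
    )
    from_file = resolve_use_case(str(config_path))

from_name = resolve_use_case("epi")
print(from_file["use_case"], from_name["use_case"])
```

Checking for a file first means a local file that happens to share a name with a pre-configured use case would take precedence; whether that ordering is desirable is a design choice this sketch does not settle.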

## Exploring the Content

It is not the job of this pipeline to fully automate building the Docker image and deploying the service, so here we directly explore the nodes and edges files created for the Neo4j graph database. The nodes file contains a combination of domain nodes and upper-level ontology nodes, and the edges file contains a combination of instance-level relationships and higher-level ontological relationships.

In [3]:
import pandas as pd

In [6]:
nodes_df = pd.read_csv(use_case_paths.NODES_PATH, sep='\t', dtype=str)
nodes_df

Unnamed: 0,id:ID,:LABEL,name:string,synonyms:string[],obsolete:boolean,type:string,description:string,xrefs:string[],alts:string[],version:string,property_predicates:string[],property_values:string[],xref_types:string[],synonym_types:string[],sources:string[]
0,bfo:0000050,bfo,part of,,false,property,,,,2023-04-01,oboinowl:shorthand,bfo:0000050,,,go
1,bfo:0000051,bfo,has part,,false,property,,,,2023-04-01,oboinowl:shorthand,bfo:0000051,,,go
2,bfo:0000062,bfo,preceded by,,false,unknown,,,,2023-04-01,oboinowl:shorthand,bfo:0000062,,,go
3,bfo:0000063,bfo,precedes,,false,unknown,,,,2023-04-01,oboinowl:shorthand,bfo:0000063,,,go
4,bfo:0000066,bfo,occurs in,,false,property,,,,2023-04-01,oboinowl:shorthand,bfo:0000066,,,go
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90802,wikipathways:WP995,wikipathways,Prostaglandin synthesis and regulation,,false,class,,,,,,,,,wikipathways
90803,wikipathways:WP996,wikipathways,EPO receptor signaling,,false,class,,,,,,,,,wikipathways
90804,wikipathways:WP997,wikipathways,Cytokines and inflammatory response,,false,class,,,,,,,,,wikipathways
90805,wikipathways:WP998,wikipathways,MAPK signaling pathway,,false,class,,,,,,,,,wikipathways
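The column headers in the nodes file follow the header convention of Neo4j's bulk importer: a field is written as `name:type`, `id:ID` marks the node identifier, `:LABEL` carries the node label, and a `[]` suffix marks an array-valued column. The sketch below splits a few of these headers into their name and type parts using only the standard library; the `parse_header` helper is a hypothetical name introduced here.

```python
from typing import Tuple


def parse_header(field: str) -> Tuple[str, str]:
    """Split a Neo4j import header field into (name, type).

    Fields without a colon are treated as plain strings, which is an
    assumption made for this sketch.
    """
    name, sep, field_type = field.partition(":")
    return name, (field_type if sep else "string")


# A few header fields copied from the nodes file shown above
header = ["id:ID", ":LABEL", "name:string", "synonyms:string[]", "obsolete:boolean"]
parsed = [parse_header(field) for field in header]

# Columns whose type ends in "[]" hold multiple values per cell
array_columns = [name for name, field_type in parsed if field_type.endswith("[]")]
print(parsed)
print(array_columns)
```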


In [7]:
edges_df = pd.read_csv(use_case_paths.EDGES_PATH, sep='\t', dtype=str)
edges_df

Unnamed: 0,:START_ID,:END_ID,:TYPE,pred:string,source:string,graph:string,version:string
0,go:0106110,doi:10.1007/BF03356188,has_citation,debio:0000029,go,http://purl.obolibrary.org/obo/go.owl,2023-04-01
1,bfo:0000050,bfo:0000051,inverseof,owl:inverseOf,go,http://purl.obolibrary.org/obo/go.owl,2023-04-01
2,go:0000001,go:0048308,subclassof,rdfs:subClassOf,go,http://purl.obolibrary.org/obo/go.owl,2023-04-01
3,go:0000001,go:0048311,subclassof,rdfs:subClassOf,go,http://purl.obolibrary.org/obo/go.owl,2023-04-01
4,go:0000002,go:0007005,subclassof,rdfs:subClassOf,go,http://purl.obolibrary.org/obo/go.owl,2023-04-01
...,...,...,...,...,...,...,...
307490,wikipathways:WP999,ncbigene:534599,ro:0000057,RO:0000057,wikipathways,https://bioregistry.io/bioregistry:wikipathways,
307491,wikipathways:WP999,ncbigene:537713,ro:0000057,RO:0000057,wikipathways,https://bioregistry.io/bioregistry:wikipathways,
307492,wikipathways:WP999,ncbigene:613338,ro:0000057,RO:0000057,wikipathways,https://bioregistry.io/bioregistry:wikipathways,
307493,wikipathways:WP999,ncbigene:614145,ro:0000057,RO:0000057,wikipathways,https://bioregistry.io/bioregistry:wikipathways,
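The edges file mixes ontological relationships (such as `subclassof`) with instance-level ones (such as `ro:0000057`, linking a pathway to a gene). A quick way to inspect that mix is to tally the `:TYPE` column. The sketch below does so with the standard library on a handful of rows copied from the table above, with the index and some columns omitted for brevity; on the full file, `edges_df[":TYPE"].value_counts()` would give the corresponding tally in pandas.

```python
import csv
import io
from collections import Counter

# A few rows copied from the edges table above (tab-separated, subset of columns)
SAMPLE = "\n".join(
    [
        ":START_ID\t:END_ID\t:TYPE\tsource:string",
        "bfo:0000050\tbfo:0000051\tinverseof\tgo",
        "go:0000001\tgo:0048308\tsubclassof\tgo",
        "go:0000001\tgo:0048311\tsubclassof\tgo",
        "wikipathways:WP999\tncbigene:534599\tro:0000057\twikipathways",
    ]
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# Tally relationship types and the CURIE prefixes of the edge targets
type_counts = Counter(row[":TYPE"] for row in rows)
target_prefixes = Counter(row[":END_ID"].split(":", 1)[0] for row in rows)

print(type_counts.most_common())
print(target_prefixes.most_common())
```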


## Next Steps

We've built on two existing software packages, `pyobo` and `bioontologies`, for the discovery and parsing of ontologies, but we don't yet have abstract tooling for ingesting knowledge graphs.

For example, in the ecological domain (where we might have predator-prey models like Lotka-Volterra), it would be useful to get and ingest the Global Biotic Interactions (GloBI) knowledge graph, which contains interactions between agents in the environment and can answer questions such as:

- What do sea otters (Enhydra lutris) eat?
- What do honey bees (Apis) pollinate?

These could also provide useful constraints on model construction, such as making sure that there is no relationship where a rabbit eats a wolf.
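A minimal sketch of how such a constraint check might look is given below. The `KNOWN_EATS` set is a tiny hand-written stand-in for interactions that would come from a resource like GloBI (it is not GloBI data), and the `is_plausible` helper is hypothetical.

```python
# Hand-written toy interactions as (consumer, food) pairs; a real check would
# draw these from an ingested interaction knowledge graph instead.
KNOWN_EATS = {
    ("wolf", "rabbit"),
    ("sea otter", "sea urchin"),
}


def is_plausible(predator: str, prey: str) -> bool:
    """Return True if the proposed predation edge matches a known interaction."""
    return (predator, prey) in KNOWN_EATS


# Proposed edges from a hypothetical predator-prey model
proposed = [("wolf", "rabbit"), ("rabbit", "wolf")]
rejected = [edge for edge in proposed if not is_plausible(*edge)]
print(rejected)
```

Treating every unlisted interaction as implausible is a strict choice; with an incomplete knowledge graph, flagging such edges for review may be preferable to rejecting them outright.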