# Pipeline to transform the set of nmdc-schema-compliant mongodb collections to an RDF dataset amenable to SPARQL queries.

## Setup

Before running this notebook, make sure you have done the following:
- `make up-dev` has been run and mongo is mapped to `localhost:27018`
- a recent dump of the production mongo database has been loaded to `localhost:27018` (see `make mongorestore-nmdc-dev` for an example)
- `.env` sets `MONGO_HOST` to `mongodb://localhost:27018`
- `export $(grep -v '^#' .env | xargs)` has been run in the shell before running `jupyter notebook`
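
The `export $(grep -v '^#' .env | xargs)` step can be approximated in Python when it is handier to inspect the settings from inside the notebook. The following is a minimal sketch, not part of the original workflow: `parse_env` is a hypothetical helper, it ignores quoting and multi-line values, and the sample text stands in for a real `.env` file.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and lines that start with '#'."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


# Sample content standing in for the real .env file.
sample = """# local dev settings
MONGO_HOST=mongodb://localhost:27018
"""

settings = parse_env(sample)
print(settings["MONGO_HOST"])
```

To apply the values to the running kernel, one option is `os.environ.update(parse_env(open(".env").read()))`, assuming the file lives in the working directory.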

In [1]:
# Ensure code changes in this notebook will be import-able without needing to restart the kernel and lose state
%load_ext autoreload
%autoreload 2

Connect to local dockerized dev environment.

In [2]:
!env | grep MONGO_HOST

MONGO_HOST=mongodb://localhost:27018


Initialize a db connection.

In [7]:
from nmdc_runtime.api.db.mongo import get_mongo_db
mdb = get_mongo_db()
print("success")

success


Get all populated nmdc-schema collections with entity `id`s.

In [8]:
from nmdc_runtime.util import schema_collection_names_with_id_field

populated_collections = sorted([
    name for name in set(schema_collection_names_with_id_field()) & set(mdb.list_collection_names())
    if mdb[name].estimated_document_count() > 0
])

## Get a JSON-LD context for the NMDC Schema, to serialize documents to RDF

In [9]:
import json
from pprint import pprint

from linkml.generators.jsonldcontextgen import ContextGenerator
from nmdc_schema.nmdc_data import get_nmdc_schema_definition

context = ContextGenerator(get_nmdc_schema_definition())
context = json.loads(context.serialize())["@context"]

for k, v in list(context.items()):
    if isinstance(v, dict): #and v.get("@type") == "@id":
        v.pop("@id", None) # use nmdc uri, not e.g. MIXS uri

Ensure `nmdc:type` has a `URIRef` range, i.e. `nmdc:type a owl:ObjectProperty`.

In [10]:
context['type'] = {'@type': '@id'}
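
The effect of the two context adjustments above can be seen on a toy context. This is a sketch under assumptions: the terms and the `MIXS:0000000` identifier are illustrative, and it assumes the generated context sets `@vocab` to the NMDC namespace. Once a term definition has no explicit `@id`, a JSON-LD processor expands the term against `@vocab`, so `depth` would resolve to `https://w3id.org/nmdc/depth` rather than to a MIXS URI.

```python
# Toy context; the term names and the MIXS identifier are illustrative only.
context = {
    "@vocab": "https://w3id.org/nmdc/",
    "MIXS": "https://w3id.org/mixs/",
    "depth": {"@id": "MIXS:0000000", "@type": "@id"},
    "part_of": {"@type": "@id"},
}

# Drop explicit term URIs so every term falls back to the default vocabulary.
for term, definition in context.items():
    if isinstance(definition, dict):
        definition.pop("@id", None)

# Make values of `type` parse as URI references rather than string literals.
context["type"] = {"@type": "@id"}

print(context["depth"])
```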

## Initialize an in-memory graph to store triples, prior to serializing to disk

In [11]:
from rdflib import Graph

g = Graph()

Define a helper function to speed up triplification process.

In [17]:
def split_chunk(seq, n: int):
    """
    Split sequence into chunks of length n. Do not pad last chunk.
    
    >>> list(split_chunk(list(range(10)), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    for i in range(0, len(seq), n):
        yield seq[i : i + n]

Define a helper function to ensure each doc has exactly one type.

In [13]:
from toolz import assoc

from nmdc_runtime.util import collection_name_to_class_names

def ensure_type(doc, collection_name):
    if "type" in doc:
        return doc

    class_names = collection_name_to_class_names[collection_name]
    
    if len(class_names) > 1:
        raise Exception("cannot unambiguously infer class of document")
        
    return assoc(doc, "type", class_names[0])
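
The three cases the helper distinguishes can be exercised without a database. The sketch below is a stand-in, not the runtime's code: `class_names_by_collection` is a hypothetical mapping with made-up collection and class names, and `ensure_type_demo` mirrors the logic above using plain dict unpacking instead of `toolz.assoc`.

```python
# Hypothetical mapping; the real one comes from nmdc_runtime.util.
class_names_by_collection = {
    "biosample_set": ["Biosample"],
    "mixed_set": ["ClassA", "ClassB"],
}


def ensure_type_demo(doc, collection_name):
    """Return `doc` with a `type`, inferring it only when the collection has one class."""
    if "type" in doc:
        return doc
    class_names = class_names_by_collection[collection_name]
    if len(class_names) > 1:
        raise ValueError("cannot unambiguously infer class of document")
    # Build a new dict so the input document is left untouched.
    return {**doc, "type": class_names[0]}


# Case 1: untyped doc in a single-class collection gets the inferred type.
typed = ensure_type_demo({"id": "nmdc:bsm-1"}, "biosample_set")

# Case 2: an already-typed doc is returned as-is, even in a multi-class collection.
unchanged = ensure_type_demo({"id": "nmdc:x-1", "type": "nmdc:ClassB"}, "mixed_set")

# Case 3: an untyped doc in a multi-class collection is rejected.
try:
    ensure_type_demo({"id": "nmdc:x-2"}, "mixed_set")
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```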

## Ingest mongo docs to in-memory graph 
Uses `rdflib` JSON-LD parsing

In [None]:
from toolz import assoc, dissoc
from tqdm.notebook import tqdm

chunk_size = 2_000

# setup for progress bar
total = sum((1 + mdb[name].estimated_document_count() // chunk_size) for name in populated_collections)
pbar = tqdm(total=total)

for collection_name in populated_collections:
    print(f"loading {collection_name} collection")
    # dissociate mongo-generated `_id` field
    docs = [dissoc(doc, "_id") for doc in mdb[collection_name].find()]
    # split collection docs into chunks
    chunks = list(split_chunk(docs, chunk_size))
    
    for chunk in chunks:
        # ensure each doc in chunk is typed
        typed_chunk = [ensure_type(doc, collection_name) for doc in chunk]
        # wrap the typed docs in a single JSON-LD document
        doc_jsonld = {"@context": context, "@graph": typed_chunk}
        # parse the JSON-LD document into Graph `g`
        g.parse(data=json.dumps(doc_jsonld), format='json-ld')
        pbar.update(1)
print(f"{len(g):,} triples loaded")

  0%|          | 0/124 [00:00<?, ?it/s]

loading biosample_set collection
loading data_object_set collection
loading extraction_set collection
loading field_research_site_set collection
loading library_preparation_set collection
loading mags_activity_set collection
loading metabolomics_analysis_activity_set collection
loading metagenome_annotation_activity_set collection
loading metagenome_assembly_set collection
loading metagenome_sequencing_activity_set collection
loading metaproteomics_analysis_activity_set collection
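
The shape of what gets handed to `g.parse` may be easier to see without Mongo or `rdflib` in the picture. The following sketch uses only the standard library; the context and the documents are hypothetical stand-ins, and the chunk size is shrunk to 2 so the chunking is visible.

```python
import json


def split_chunk(seq, n):
    """Split a sequence into chunks of length n without padding the last chunk."""
    for i in range(0, len(seq), n):
        yield seq[i : i + n]


# Hypothetical stand-ins for the generated context and the Mongo documents.
context = {
    "@vocab": "https://w3id.org/nmdc/",
    "nmdc": "https://w3id.org/nmdc/",
    "id": "@id",
    "type": {"@type": "@id"},
}
docs = [{"_id": i, "id": f"nmdc:bsm-{i}", "type": "nmdc:Biosample"} for i in range(5)]

payloads = []
for chunk in split_chunk(docs, 2):
    # Drop the Mongo-generated `_id`, which has no meaning in the RDF graph.
    cleaned = [{k: v for k, v in doc.items() if k != "_id"} for doc in chunk]
    # One JSON-LD document per chunk: shared context plus a list of nodes.
    payloads.append(json.dumps({"@context": context, "@graph": cleaned}))

print(len(payloads), "JSON-LD documents")
```

Each string in `payloads` is what a call such as `g.parse(data=payload, format='json-ld')` would receive, so the number of parse calls grows with the number of chunks rather than the number of documents.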


Correct URIs that end with newlines, which break graph serialization.

In [15]:
from rdflib import Namespace, RDF, Literal, URIRef

NMDC = Namespace("https://w3id.org/nmdc/")

for s, p, o in tqdm(g, total=len(g)):
    # Strip trailing whitespace (including the newline) from the subject and from URI objects.
    s_fixed = URIRef(str(s).rstrip()) if str(s).endswith("\n") else s
    o_fixed = URIRef(str(o).rstrip()) if isinstance(o, URIRef) and str(o).endswith("\n") else o
    # Replace the triple once, so a fix to one end never reintroduces the other end's newline.
    if (s_fixed, o_fixed) != (s, o):
        g.remove((s, p, o))
        g.add((s_fixed, p, o_fixed))

  0%|          | 0/6348584 [00:00<?, ?it/s]

## Connect Schema-Collection Entities
Given a schema-collection entity (i.e. one with an `id` and its own mongo document), we want to easily find all other schema-collection entities to which it connects, via any slot.

To do this, we first gather all schema classes that are the type of a schema-collection entity, as well as these classes' ancestors.

In [21]:
from linkml_runtime.utils.schemaview import SchemaView

from nmdc_runtime.util import nmdc_schema_view, nmdc_database_collection_instance_class_names

schema_view = nmdc_schema_view()
toplevel_classes = set()
for name in nmdc_database_collection_instance_class_names():
    toplevel_classes |= set(schema_view.class_ancestors(name))

Next, we determine which slots have such a "top-level" class as their range.

In [22]:
slots = schema_view.all_slots()

toplevel_entity_connectors = set()
for k, v in context.items():
    if isinstance(v, dict) and "@type" in v and v["@type"] == "@id":
        if slots[k].range in toplevel_classes and slots[k].domain != "Database":
            toplevel_entity_connectors.add(k)
print(toplevel_entity_connectors)

{'was_generated_by', 'was_informed_by', 'metagenome_annotation_id', 'has_output', 'part_of', 'collected_from', 'has_input'}


Let's construct an entity-relationship diagram to visualize relationships.

In [23]:
print("classDiagram\n")
for slot_name in toplevel_entity_connectors:
    slot = slots[slot_name]
    domain = slot.domain or "NamedThing"
    range_ = slot.range  # trailing underscore avoids shadowing the built-in `range`
    print(f"{domain} --> {range_} : {slot_name}")

print()

inheritance_links = set()
for cls in toplevel_classes:
    ancestors = schema_view.class_ancestors(cls)
    for a in ancestors:
        if a != cls:
            inheritance_links.add(f"{a} <|-- {cls}")

for link in inheritance_links:
    print(link)

classDiagram

NamedThing --> Activity : was_generated_by
Activity --> Activity : was_informed_by
FunctionalAnnotationAggMember --> WorkflowExecutionActivity : metagenome_annotation_id
NamedThing --> NamedThing : has_output
NamedThing --> NamedThing : part_of
Biosample --> FieldResearchSite : collected_from
NamedThing --> NamedThing : has_input

MaterialEntity <|-- FieldResearchSite
Activity <|-- MetaproteomicsAnalysisActivity
NamedThing <|-- Site
NamedThing <|-- DataObject
NamedThing <|-- FieldResearchSite
MaterialEntity <|-- Site
Activity <|-- MetatranscriptomeActivity
NamedThing <|-- LibraryPreparation
WorkflowExecutionActivity <|-- MagsAnalysisActivity
NamedThing <|-- PlannedProcess
WorkflowExecutionActivity <|-- ReadBasedTaxonomyAnalysisActivity
Activity <|-- MetagenomeAssembly
WorkflowExecutionActivity <|-- NomAnalysisActivity
PlannedProcess <|-- Extraction
PlannedProcess <|-- LibraryPreparation
PlannedProcess <|-- Pooling
MaterialEntity <|-- ProcessedSample
BiosampleProcessing <|

### Assert a common `depends_on` relation for all entities connected by `toplevel_entity_connectors`
This allows us to traverse the graph of top-level entities without needing to specify any specific slot names.

In [24]:
from rdflib import PROV

for s, p, o in tqdm(g, total=len(g)):
    if (connector := p.removeprefix(str(NMDC))) in toplevel_entity_connectors:
        if connector == "has_output":
            g.add((o, NMDC.depends_on, s))
        else:
            g.add((s, NMDC.depends_on, o))

print(f"{len(g):,} triples in total")

  0%|          | 0/15851994 [00:00<?, ?it/s]

16,125,596 triples in total
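
The direction flip for `has_output` is the subtle part: an activity *has* an output, but it is the output that depends on the activity. A small sketch with plain tuples shows the rule and the slot-agnostic traversal it enables. The identifiers are hypothetical, and `upstream` is an illustrative helper rather than anything in the runtime.

```python
NMDC = "https://w3id.org/nmdc/"
connectors = {"has_input", "has_output", "part_of"}

# Hypothetical identifiers: an activity consumes a biosample and produces a data object.
triples = {
    ("nmdc:act-1", NMDC + "has_input", "nmdc:bsm-1"),
    ("nmdc:act-1", NMDC + "has_output", "nmdc:dobj-1"),
    ("nmdc:bsm-1", NMDC + "name", "soil sample"),
}

depends_on = set()
for s, p, o in triples:
    connector = p.removeprefix(NMDC)
    if connector not in connectors:
        continue
    # An output depends on the process that produced it, so the direction flips.
    edge = (o, s) if connector == "has_output" else (s, o)
    depends_on.add(edge)


def upstream(node):
    """Collect everything reachable from `node` by following depends_on edges."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for source, target in depends_on:
            if source == current and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(sorted(upstream("nmdc:dobj-1")))
```

Starting from the data object, the traversal reaches the activity and then the biosample without naming `has_input` or `has_output` anywhere in the lookup.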


### Materialize superclass relations
We want each entity to be associated with its own class and all the classes that its class inherits from. For example an entity of type `Biosample` should also be of type `NamedThing`.

In [31]:
schema_view = nmdc_schema_view()
toplevel_classes = set()

# get top level class names
for name in nmdc_database_collection_instance_class_names():
    toplevel_classes |= set(getattr(NMDC, a) for a in schema_view.class_ancestors(name))

# for each triple (s, nmdc:type, o) in the graph, add (s, nmdc:type, o') for every class ancestor o' of o
for s, p, o in tqdm(g, total=len(g)):
    # get the local predicate name (e.g. the slot name) for this triple
    p_localname = p.removeprefix(str(NMDC))
    # skip triples whose predicate is not `type`; only type assertions are expanded
    if p_localname != "type":
        continue
    # skip the triple if its object is not a top-level class or an ancestor of one
    if o not in toplevel_classes:
        continue
    # add (s, `NMDC.type`, ancestor) for each ancestor of the object's class
    for a in schema_view.class_ancestors(o.removeprefix(str(NMDC))):
        g.add((s, NMDC.type, getattr(NMDC, a)))

  0%|          | 0/16349744 [00:00<?, ?it/s]
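
The same materialization can be traced by hand on a tiny example. This is a sketch with assumed data: the `ancestors` table is a hypothetical stand-in for `schema_view.class_ancestors` (which, by default, includes the class itself), and the entity identifiers are invented.

```python
# Hypothetical ancestor table standing in for SchemaView.class_ancestors.
ancestors = {
    "Biosample": ["Biosample", "MaterialEntity", "NamedThing"],
    "DataObject": ["DataObject", "NamedThing"],
}

# (entity, class) pairs as loaded from the documents' `type` fields.
types = {("nmdc:bsm-1", "Biosample"), ("nmdc:dobj-1", "DataObject")}

# Add one (entity, ancestor) pair for every ancestor of the entity's class.
materialized = set(types)
for entity, cls in types:
    for ancestor in ancestors.get(cls, []):
        materialized.add((entity, ancestor))

# A query for the most general class now finds every entity directly.
named_things = sorted(e for e, c in materialized if c == "NamedThing")
print(named_things)
```

The trade-off is more stored triples in exchange for queries that need no subclass reasoning at query time.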

Sanity check that we have the right number of ActivitySet records.

In [32]:
len([t for t in g.subjects(NMDC.type, NMDC.Activity)])

14889

## Serialize and store as gzipped N-Triples file.
This can take a few minutes...

In [35]:
import gzip

with gzip.open('data/nmdc-db.nt.gz', 'wb') as f:
    print("Serializing graph and writing to file...") 
    f.write(g.serialize(format='nt').encode())
    print("Success!")

serializing Graph and writing to file...
success!
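
Because N-Triples puts exactly one triple on each line, a gzipped file can be checked by streaming it and counting non-empty lines, without loading it back into a graph. The sketch below writes two illustrative triples to a temporary file and reads them back; the subject URIs and the file name `sample.nt.gz` are made up for the example.

```python
import gzip
import os
import tempfile

# Two illustrative N-Triples lines; the subject identifiers are hypothetical.
nt_lines = [
    "<https://w3id.org/nmdc/bsm-1> <https://w3id.org/nmdc/type> <https://w3id.org/nmdc/Biosample> .",
    "<https://w3id.org/nmdc/bsm-1> <https://w3id.org/nmdc/type> <https://w3id.org/nmdc/NamedThing> .",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.nt.gz")

    # Text mode ("wt") lets gzip handle the encoding, so no manual .encode() is needed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(nt_lines) + "\n")

    # Stream the file back and count triples line by line.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        triple_count = sum(1 for line in f if line.strip())

print(f"{triple_count:,} triples on disk")
```

Pointing the read step at `data/nmdc-db.nt.gz` should give a count comparable to `len(g)`.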


## Load data into a dockerized fuseki server

1. Add the following to `/nmdc-runtime/docker-compose.yaml`.

```yml
  fuseki:
    container_name: fuseki
    build:
      dockerfile: nmdc_runtime/fuseki.Dockerfile
      context: .
    ports:
      - "3030:3030"
    volumes:
      - ./nmdc_runtime/site/fuseki/fuseki-config.ttl:/configuration/fuseki-config.ttl
      - ./nmdc_runtime/site/fuseki/shiro.ini:/fuseki/run/shiro.ini
      - nmdc_runtime_fuseki_data:/fuseki-base
```



2. Add the following to `/nmdc-runtime/nmdc_runtime/fuseki.Dockerfile`.

```Dockerfile
# Use an appropriate base image that includes Java and wget
FROM openjdk:11-jre-slim

# Set environment variables
ENV FUSEKI_VERSION 4.9.0
ENV FUSEKI_HOME /fuseki

# Install wget
RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

# Download and extract Fuseki
RUN wget -qO- https://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-$FUSEKI_VERSION.tar.gz | tar xvz -C / && \
    mv /apache-jena-fuseki-$FUSEKI_VERSION $FUSEKI_HOME

# Expose the default port
EXPOSE 3030

# Download and extract Jena Commands
RUN wget -qO- https://archive.apache.org/dist/jena/binaries/apache-jena-$FUSEKI_VERSION.tar.gz | tar xvz -C / && \
    mv /apache-jena-$FUSEKI_VERSION $FUSEKI_HOME

# Copy the Fuseki configuration file to the container
COPY ./nmdc_runtime/site/fuseki/fuseki-config.ttl $FUSEKI_HOME/configuration/
COPY ./nmdc_runtime/site/fuseki/shiro.ini $FUSEKI_HOME/run/

# Set working directory
WORKDIR $FUSEKI_HOME

# Command to start Fuseki server with preloaded data
CMD ["./fuseki-server", "--config", "configuration/fuseki-config.ttl"]
```

3. Add the following to `/nmdc-runtime/nmdc_runtime/site/fuseki/shiro.ini`.
```ini
[main]
localhost=org.apache.jena.fuseki.authz.LocalhostFilter

[urls]
## Control functions open to anyone
/$/server = anon
/$/ping   = anon
/$/stats = anon
/$/stats/* = anon
## and the rest are also open to anyone in this configuration
/$/** = anon
/**=anon
```

4. Add the following to `/nmdc-runtime/nmdc_runtime/site/fuseki/fuseki-config.ttl`.
```ttl
@prefix afn: <http://jena.apache.org/ARQ/function#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix nmdc: <https://w3id.org/nmdc/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix xs: <http://www.w3.org/2001/XMLSchema#> .

<https://api.microbiomedata.org/fuseki/#baseModel>
	a tdb:GraphTDB ;
	tdb:dataset <https://api.microbiomedata.org/fuseki/#tdbDataset> ;
	.

<https://api.microbiomedata.org/fuseki/#dataset>
	a ja:RDFDataset ;
	ja:defaultGraph <https://api.microbiomedata.org/fuseki/#inferenceModel> ;
	.

<https://api.microbiomedata.org/fuseki/#inferenceModel>
	a ja:InfModel ;
	ja:baseModel <https://api.microbiomedata.org/fuseki/#baseModel> ;
	ja:reasoner [
		ja:reasonerURL <http://jena.hpl.hp.com/2003/TransitiveReasoner> ;
	] ;
	.

<https://api.microbiomedata.org/fuseki/#nmdc>
	a fuseki:Service ;
	fuseki:dataset <https://api.microbiomedata.org/fuseki/#dataset> ;
	fuseki:name "nmdc" ;
	fuseki:serviceQuery
		"query" ,
		"sparql"
		;
	fuseki:serviceReadWriteGraphStore "data" ;
	fuseki:serviceUpdate "update" ;
	fuseki:serviceUpload "upload" ;
	.

<https://api.microbiomedata.org/fuseki/#tdbDataset>
	a tdb:DatasetTDB ;
	ja:context [
		rdfs:comment "Query timeout on this dataset: 10s." ;
		ja:cxtName "arq:queryTimeout" ;
		ja:cxtValue "10000" ;
	] ;
	tdb:location "/fuseki-base/nmdc-db.tdb" ;
	.

[]
	a fuseki:Server ;
	fuseki:services (
		<https://api.microbiomedata.org/fuseki/#nmdc>
	) ;
	.
```

5. Spin up a `fuseki` container.

In [43]:
!docker compose up fuseki -d

[1A[1B[0G[?25l[+] Running 1/0
 [32m✔[0m Container fuseki  [32mRunning[0m                                               [34m0.0s [0m
[?25h

Wipe any existing persisted data, and copy new RDF data into the `fuseki` container.


In [42]:
!docker exec fuseki rm -rf /fuseki-base/nmdc-db.tdb
!docker cp data/nmdc-db.nt.gz fuseki:/fuseki-base/

Error response from daemon: No such container: fuseki
no such directory


Take server down in order to bulk-load data.

In [None]:
!docker compose down fuseki

Bulk-load data.

In [None]:
!docker compose run fuseki ./apache-jena-4.9.0/bin/tdbloader --loc /fuseki-base/nmdc-db.tdb /fuseki-base/nmdc-db.nt.gz

Start up server.

In [None]:
!docker compose up fuseki -d

Now go to <http://localhost:3030/#/dataset/nmdc/query> and SPARQL it up.
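
The same endpoint can be queried from Python. The sketch below only builds the request, so it runs without a server. It assumes the `nmdc` service and its `sparql` endpoint from the configuration above are reachable at `http://localhost:3030/nmdc/sparql`; the entity URI is a placeholder, and `build_depends_on_query` and `build_request` are hypothetical helpers. The query uses a SPARQL property path (`+`) to follow `nmdc:depends_on` transitively.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: service name "nmdc", query endpoint "sparql".
ENDPOINT = "http://localhost:3030/nmdc/sparql"


def build_depends_on_query(entity_uri: str) -> str:
    """Build a query for everything the entity depends on, directly or indirectly."""
    return (
        "PREFIX nmdc: <https://w3id.org/nmdc/>\n"
        "SELECT DISTINCT ?upstream WHERE {\n"
        f"  <{entity_uri}> nmdc:depends_on+ ?upstream .\n"
        "}\n"
        "LIMIT 25"
    )


def build_request(query: str) -> Request:
    """Wrap the query in a form-encoded POST, as the SPARQL protocol allows."""
    body = urlencode({"query": query}).encode()
    headers = {"Accept": "application/sparql-results+json"}
    return Request(ENDPOINT, data=body, headers=headers)


# Placeholder URI; substitute the URI of a real entity from the dataset.
request = build_request(build_depends_on_query("https://w3id.org/nmdc/example-id"))
print(request.get_method(), request.full_url)

# To execute against a running server:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       rows = json.load(response)["results"]["bindings"]
```

The dataset is configured with a 10-second query timeout, so long property-path traversals over many entities may need a tighter pattern or a smaller `LIMIT`.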

In [None]:
# 2024-03-14T09:40 : took <4min to run all the above.