# Pipeline to transform the set of nmdc-schema-compliant MongoDB collections into an RDF dataset amenable to SPARQL queries

Ensure that code changes can be imported into this notebook without restarting the kernel and losing state.

In [None]:
%load_ext autoreload
%autoreload 2

Connect to local dockerized dev environment.

In [None]:
from dotenv import load_dotenv

load_dotenv(".env.localhost")
!env | grep MONGO_HOST

Initialize a db connection.

In [None]:
from nmdc_runtime.api.db.mongo import get_mongo_db

mdb = get_mongo_db()

Get all populated nmdc-schema collections with entity `id`s.

In [None]:
from nmdc_runtime.util import schema_collection_names_with_id_field

populated_collections = sorted([
    name for name in set(schema_collection_names_with_id_field()) & set(mdb.list_collection_names())
    if mdb[name].estimated_document_count() > 0
])
populated_collections

Get a JSON-LD context for the NMDC Schema, to serialize documents to RDF.

In [None]:
import json
from pprint import pprint

from linkml.generators.jsonldcontextgen import ContextGenerator
from nmdc_schema.nmdc_data import get_nmdc_schema_definition

context = ContextGenerator(get_nmdc_schema_definition())
context = json.loads(context.serialize())["@context"]

for k, v in list(context.items()):
    if isinstance(v, dict): #and v.get("@type") == "@id":
        v.pop("@id", None) # use nmdc uri, not e.g. MIXS uri
pprint(context)
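
To make the effect of the loop above concrete, here is a minimal sketch on a toy context. The term names and the `MIXS:0000000` identifier are hypothetical placeholders, not taken from the real NMDC context. Once `@id` is removed from a term definition, the term is resolved against the context's `@vocab` instead, assuming the generated context sets one.

```python
import copy

# Toy context; the term names and the MIXS identifier are hypothetical.
toy_context = {
    "@vocab": "https://w3id.org/nmdc/",
    "depth": {"@id": "MIXS:0000000", "@type": "xsd:string"},
    "part_of": {"@id": "dcterms:isPartOf", "@type": "@id"},
    "name": "schema:name",  # plain string mappings are left untouched
}


def strip_explicit_ids(context: dict) -> dict:
    """Return a copy of `context` with "@id" removed from every dict-valued term."""
    cleaned = copy.deepcopy(context)
    for value in cleaned.values():
        if isinstance(value, dict):
            value.pop("@id", None)
    return cleaned


cleaned = strip_explicit_ids(toy_context)
print(cleaned["depth"])    # {'@type': 'xsd:string'}
print(cleaned["part_of"])  # {'@type': '@id'}
```

Note that only dict-valued entries are changed; a term mapped directly to a string keeps its original mapping.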

Ensure `nmdc:type` has a `URIRef` range, i.e. `nmdc:type a owl:ObjectProperty`.

In [None]:
context['type'] = {'@type': '@id'}

Initialize an in-memory graph to store triples, prior to serializing to disk.

In [None]:
from rdflib import Graph

g = Graph()

Define a helper function that splits documents into chunks, to speed up the triplification process.

In [None]:
def split_chunk(seq, n: int):
    """
    Split sequence into chunks of length n. Do not pad last chunk.
    
    >>> list(split_chunk(list(range(10)), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    for i in range(0, len(seq), n):
        yield seq[i : i + n]

Use `rdflib` JSON-LD parsing to ingest MongoDB documents into the in-memory graph.

In [None]:
from toolz import assoc, dissoc
from tqdm.notebook import tqdm

chunk_size = 2_000
total = sum((1 + mdb[name].estimated_document_count() // chunk_size) for name in populated_collections)

pbar = tqdm(total=total)

for name in populated_collections:
    print(name)
    docs = [dissoc(doc, "_id") for doc in mdb[name].find()]
    chunks = list(split_chunk(docs, chunk_size))
    for chunk in chunks:
        doc_jsonld = {"@context": context, "@graph": chunk}
        g.parse(data=json.dumps(doc_jsonld), format='json-ld')
        pbar.update(1)
print(f"{len(g):,} triples loaded")
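
The chunking and wrapping step can be tried without a database. The sketch below uses made-up documents in place of `mdb[name].find()` and a toy context. It wraps each chunk in a JSON-LD envelope and checks that the number of payloads equals the ceiling of the document count divided by the chunk size.

```python
import json
import math


def split_chunk(seq, n: int):
    """Yield successive slices of `seq` of length n; the last may be shorter."""
    for i in range(0, len(seq), n):
        yield seq[i : i + n]


# Hypothetical documents standing in for the result of `mdb[name].find()`.
docs = [{"_id": i, "id": f"nmdc:example-{i}", "type": "nmdc:Biosample"} for i in range(5)]
toy_context = {"@vocab": "https://w3id.org/nmdc/"}

# Drop MongoDB's internal `_id` field before serializing.
docs = [{k: v for k, v in doc.items() if k != "_id"} for doc in docs]

chunk_size = 2
payloads = [
    json.dumps({"@context": toy_context, "@graph": chunk})
    for chunk in split_chunk(docs, chunk_size)
]

# Ceiling division gives the exact number of chunks for a known document count.
assert len(payloads) == math.ceil(len(docs) / chunk_size)
print(len(payloads))  # 3
```

Each payload is a complete JSON-LD document, so it can be parsed on its own and the resulting triples accumulate in the same graph.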

Correct malformed URIs that end with newlines, which break graph serialization.

In [None]:
from rdflib import Namespace, RDF, Literal, URIRef

NMDC = Namespace("https://w3id.org/nmdc/")

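def strip_newlines(node):
    """Return `node` without trailing newline characters if it is a URIRef."""
    if isinstance(node, URIRef) and str(node).endswith("\n"):
        return URIRef(str(node).rstrip("\r\n"))
    return node

# Iterate over a snapshot, since the graph is modified inside the loop.
for s, p, o in list(g):
    s_fixed, o_fixed = strip_newlines(s), strip_newlines(o)
    if (s_fixed, o_fixed) != (s, o):
        g.remove((s, p, o))
        g.add((s_fixed, p, o_fixed))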
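
This clean-up pattern can be tried outside the notebook. The sketch below uses a plain set of string tuples as a stand-in for the `rdflib` graph, with made-up URIs. It collects the affected triples before changing anything, so the container is never modified while it is being iterated. In this stand-in every object is a string, so the URI-versus-literal distinction is not modelled.

```python
def fix_trailing_newlines(triples: set) -> int:
    """Rewrite, in place, triples whose subject or object ends with a newline.

    Returns the number of triples that were rewritten.
    """
    # Snapshot the affected triples first; the set is only mutated afterwards.
    dirty = [t for t in triples if t[0].endswith("\n") or t[2].endswith("\n")]
    for s, p, o in dirty:
        triples.remove((s, p, o))
        triples.add((s.rstrip("\r\n"), p, o.rstrip("\r\n")))
    return len(dirty)


# Hypothetical triples; two of them carry a stray trailing newline.
triples = {
    ("https://example.org/a\n", "https://example.org/p", "https://example.org/b"),
    ("https://example.org/b", "https://example.org/p", "https://example.org/c\n"),
    ("https://example.org/c", "https://example.org/p", "https://example.org/d"),
}
n_fixed = fix_trailing_newlines(triples)
print(n_fixed)  # 2
```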

Now, we want to add OWL axioms to support fetching all "top-level" schema collection objects connected to a given schema collection object. 

In [None]:
from linkml_runtime.utils.schemaview import SchemaView

from nmdc_runtime.util import nmdc_schema_view, nmdc_database_collection_instance_class_names

schema_view = nmdc_schema_view()
slots = schema_view.all_slots()

collection_instance_class_names = nmdc_database_collection_instance_class_names()

toplevel_object_connectors = set()
for k, v in context.items():
    if isinstance(v, dict) and "@type" in v and v["@type"] == "@id":
        if slots[k].range in collection_instance_class_names and slots[k].domain != "Database":
            toplevel_object_connectors.add(k)
print(toplevel_object_connectors)

Let's invent a symmetric, transitive property so that an OWL reasoner connected to our triplestore can help us traverse the graph without needing to know any specific property names.

In [None]:
from rdflib import PROV, RDFS, RDF, OWL

superprop = URIRef("https://api.microbiomedata.org/fuseki/#connected")
g.add((superprop, RDF.type, OWL.SymmetricProperty))
g.add((superprop, RDF.type, OWL.TransitiveProperty))


for suffix in toplevel_object_connectors:
    prop = URIRef("https://w3id.org/nmdc/" + suffix)
    g.add((prop, RDFS.subPropertyOf, superprop))

print(f"{len(g):,} triples in total")
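
To see what a reasoner is expected to infer here: a symmetric, transitive property relates every pair of nodes that lie in the same connected component of the graph formed by the connector properties. The sketch below computes those components with a small union-find in plain Python. The edge list and identifiers are hypothetical, and an actual triplestore reasoner may compute this quite differently.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes linked by undirected edges (symmetric + transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical subject/object pairs joined by some connector property.
edges = [
    ("nmdc:biosample-1", "nmdc:study-1"),
    ("nmdc:omics-1", "nmdc:biosample-1"),
    ("nmdc:biosample-2", "nmdc:study-2"),
]
components = connected_components(edges)
print(sorted(len(c) for c in components))  # [2, 3]
```

In this toy graph, `nmdc:omics-1` and `nmdc:study-1` end up in the same component even though no single edge links them directly, which is the traversal the property is meant to provide.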

Serialize and store as gzipped N-Triples file.

In [None]:
import gzip

with gzip.open('data/nmdc-db.nt.gz', 'wb') as f:
    f.write(g.serialize(format='nt').encode())
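
N-Triples is line-oriented, with one triple per line, so a quick sanity check on the written file is to count its lines and compare the result with `len(g)`. The sketch below does this for a toy file in a temporary directory. The triples are made up, and the helper name `count_ntriples` is an assumption for illustration.

```python
import gzip
import os
import tempfile


def count_ntriples(path: str) -> int:
    """Count non-empty, non-comment lines in a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip() and not line.startswith("#"))


# Two hypothetical triples in N-Triples syntax.
toy_nt = (
    "<https://example.org/a> <https://example.org/p> <https://example.org/b> .\n"
    '<https://example.org/a> <https://example.org/name> "A" .\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.nt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(toy_nt)
    n_triples = count_ntriples(path)

print(n_triples)  # 2
```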

Wipe any existing persisted data.

In [None]:
!docker compose up fuseki -d
!docker exec fuseki rm -rf /fuseki-base/nmdc-db.tdb

Copy the serialized data into the container so that it is present to load.

In [None]:
!docker cp data/nmdc-db.nt.gz fuseki:/fuseki-base/

Take server down in order to bulk-load data.

In [None]:
!docker compose down fuseki

Bulk-load data.

In [None]:
!docker compose run fuseki ./apache-jena-4.9.0/bin/tdbloader --loc /fuseki-base/nmdc-db.tdb /fuseki-base/nmdc-db.nt.gz

Start up server.

In [None]:
!docker compose up fuseki -d

Now go to <http://localhost:3030/#/dataset/nmdc/query> and SPARQL it up.
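
As a starting point, here is a sketch of a query that uses the `connected` property defined earlier. The function names, the example identifier, and the endpoint URL are assumptions. The URL follows Fuseki's usual `/<dataset>/sparql` pattern for a dataset named `nmdc`. Results also depend on the dataset being configured with a reasoner that honors the symmetric and transitive axioms.

```python
import json
import urllib.parse
import urllib.request

CONNECTED = "https://api.microbiomedata.org/fuseki/#connected"


def build_connected_query(entity_uri: str, limit: int = 25) -> str:
    """Build a SPARQL query listing nodes related to `entity_uri` via `connected`."""
    return (
        "SELECT DISTINCT ?other WHERE {\n"
        f"  <{entity_uri}> <{CONNECTED}> ?other .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def run_query(query: str, endpoint: str = "http://localhost:3030/nmdc/sparql") -> dict:
    """POST the query to a SPARQL endpoint (assumed URL) and return parsed JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    request = urllib.request.Request(
        endpoint, data=data, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical identifier; replace it with an `id` that exists in your database.
query = build_connected_query("https://w3id.org/nmdc/example-id")
print(query)
```

Calling `run_query(query)` requires the Fuseki server from the previous step to be running. The printed query can also be pasted directly into the query page linked above.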