# Pipeline to transform the set of nmdc-schema-compliant MongoDB collections into an RDF dataset amenable to SPARQL queries

Enable autoreload so that code changes can be imported into this notebook without restarting the kernel and losing state.

In [1]:
%load_ext autoreload
%autoreload 2

Connect to local dockerized dev environment.

In [2]:
!env | grep MONGO_HOST

MONGO_HOST=localhost:27018


Initialize a db connection.

In [3]:
from nmdc_runtime.api.db.mongo import get_mongo_db

mdb = get_mongo_db()

Get all populated nmdc-schema collections with entity `id`s.

In [4]:
from nmdc_runtime.util import schema_collection_names_with_id_field

populated_collections = sorted([
    name for name in set(schema_collection_names_with_id_field()) & set(mdb.list_collection_names())
    if mdb[name].estimated_document_count() > 0
])

Get a JSON-LD context for the NMDC Schema, to serialize documents to RDF.

In [5]:
import json
from pprint import pprint

from linkml.generators.jsonldcontextgen import ContextGenerator
from nmdc_schema.nmdc_data import get_nmdc_schema_definition

context = ContextGenerator(get_nmdc_schema_definition())
context = json.loads(context.serialize())["@context"]

for k, v in list(context.items()):
    if isinstance(v, dict): #and v.get("@type") == "@id":
        v.pop("@id", None) # use nmdc uri, not e.g. MIXS uri
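To illustrate what dropping `@id` does, here is a minimal sketch on a toy context. The contents of `toy_context` (including the `MIXS:0000012` mapping) and the helper name `drop_term_ids` are illustrative assumptions, not the actual generated context. With no `@id`, a JSON-LD processor resolves the term against the context's `@vocab`, if one is set, rather than against the external vocabulary.

```python
import copy

# Toy stand-in for the generated context (assumed shape, for illustration only).
toy_context = {
    "@vocab": "https://w3id.org/nmdc/",
    "MIXS": "https://w3id.org/mixs/",
    "env_broad_scale": {"@id": "MIXS:0000012", "@type": "@id"},
    "name": {"@type": "xsd:string"},
}


def drop_term_ids(context: dict) -> dict:
    """Return a copy of `context` in which no term definition pins an explicit @id."""
    result = copy.deepcopy(context)
    for term_definition in result.values():
        # Only expanded (dict-valued) term definitions can carry an "@id" key.
        if isinstance(term_definition, dict):
            term_definition.pop("@id", None)
    return result


cleaned = drop_term_ids(toy_context)
print(cleaned["env_broad_scale"])  # the "@type" coercion is kept, the "@id" is gone
```

Unlike the cell above, this sketch works on a copy, which makes it easy to compare the context before and after.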

Ensure `nmdc:type` has a `URIRef` range, i.e. `nmdc:type a owl:ObjectProperty`.

In [6]:
context['type'] = {'@type': '@id'}

Initialize an in-memory graph to store triples, prior to serializing to disk.

In [7]:
from rdflib import Graph

g = Graph()

Define a helper function that splits documents into chunks, so that they can be triplified in batches.

In [8]:
def split_chunk(seq, n: int):
    """
    Split sequence into chunks of length n. Do not pad last chunk.
    
    >>> list(split_chunk(list(range(10)), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    for i in range(0, len(seq), n):
        yield seq[i : i + n]
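`split_chunk` needs a sequence that supports `len` and slicing. If materializing a whole collection in memory ever becomes a concern, a lazy variant could chunk any iterable (for example, a database cursor) instead. The following is a sketch of that idea; `iter_chunks` is a hypothetical name and is not used elsewhere in this notebook.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], n: int) -> Iterator[List[T]]:
    """Lazily yield lists of up to n items from any iterable. The last chunk is not padded."""
    if n < 1:
        raise ValueError("n must be at least 1")
    iterator = iter(items)
    # islice pulls at most n items per pass; an empty list means the iterator is exhausted.
    while chunk := list(islice(iterator, n)):
        yield chunk


print(list(iter_chunks(range(10), 3)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```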

Use `rdflib` JSON-LD parsing to ingest mongo docs to in-memory graph.

In [9]:
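from toolz import assoc

from nmdc_runtime.util import collection_name_to_class_names

def ensure_type(doc, collection_name):
    # Documents that already declare a type are returned unchanged.
    if "type" in doc:
        return doc

    # Otherwise, infer the type from the collection, if it holds exactly one class.
    class_names = collection_name_to_class_names[collection_name]
    if len(class_names) > 1:
        raise Exception("cannot unambiguously infer class of document")
    return assoc(doc, "type", class_names[0])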
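The behavior of `ensure_type` can be seen on toy data. This is a standalone sketch: the mapping `toy_class_names`, the collection name `example_mixed_set`, and the function name `ensure_type_sketch` are hypothetical, and a plain dict merge stands in for `toolz.assoc`.

```python
# Hypothetical collection-to-classes mapping, for illustration only.
toy_class_names = {
    "study_set": ["Study"],
    "example_mixed_set": ["ClassA", "ClassB"],
}


def ensure_type_sketch(doc: dict, collection_name: str, class_names_by_collection: dict) -> dict:
    """Return doc with a "type" key, inferring it from the collection when unambiguous."""
    if "type" in doc:
        return doc
    class_names = class_names_by_collection[collection_name]
    if len(class_names) != 1:
        raise ValueError(f"cannot unambiguously infer class for {collection_name!r}")
    # Build a new dict rather than mutating the input document.
    return {**doc, "type": class_names[0]}


print(ensure_type_sketch({"id": "x:1"}, "study_set", toy_class_names))
# {'id': 'x:1', 'type': 'Study'}
```

A document from a collection that maps to several classes must carry its own `type`; otherwise the sketch raises, as the notebook's function does.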

In [10]:
from toolz import assoc, dissoc
from tqdm.notebook import tqdm

chunk_size = 2_000
total = sum((1 + mdb[name].estimated_document_count() // chunk_size) for name in populated_collections)

pbar = tqdm(total=total)

for collection_name in populated_collections:
    print(collection_name)
    docs = [dissoc(doc, "_id") for doc in mdb[collection_name].find()]
    chunks = list(split_chunk(docs, chunk_size))
    for chunk in chunks:
        typed_chunk = [ensure_type(doc, collection_name) for doc in chunk]
        # Parse the typed documents, so that every entity carries a `type`.
        doc_jsonld = {"@context": context, "@graph": typed_chunk}
        g.parse(data=json.dumps(doc_jsonld), format='json-ld')
        pbar.update(1)
print(f"{len(g):,} triples loaded")

  0%|          | 0/114 [00:00<?, ?it/s]

biosample_set
data_object_set
extraction_set
field_research_site_set
library_preparation_set
mags_activity_set
metabolomics_analysis_activity_set
metagenome_annotation_activity_set
metagenome_assembly_set
metagenome_sequencing_activity_set
metaproteomics_analysis_activity_set
metatranscriptome_activity_set
nom_analysis_activity_set
omics_processing_set
pooling_set
processed_sample_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
study_set
6,555,800 triples loaded


Correct malformed URIs that end with newlines, which break graph serialization.

In [16]:
from rdflib import Namespace, RDF, Literal, URIRef

NMDC = Namespace("https://w3id.org/nmdc/")

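# Iterate over a snapshot, because the graph is modified inside the loop.
for s, p, o in tqdm(list(g), total=len(g)):
    # Strip trailing whitespace (including the newline) from URI terms only.
    s_fixed = URIRef(str(s).rstrip()) if isinstance(s, URIRef) else s
    o_fixed = URIRef(str(o).rstrip()) if isinstance(o, URIRef) else o
    if (s_fixed, o_fixed) != (s, o):
        # Swap the triple once, so a repaired subject and a repaired object land together.
        g.remove((s, p, o))
        g.add((s_fixed, p, o_fixed))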

  0%|          | 0/6795354 [00:00<?, ?it/s]
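One way to reason about this cleanup is as a pure function on a single triple. The sketch below uses plain strings instead of `rdflib` terms; the `Triple` tuple, its `o_is_uri` flag, and the example identifiers are hypothetical stand-ins. It covers the case where both the subject and the object are malformed, and it leaves literal values untouched.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str
    p: str
    o: str
    o_is_uri: bool  # stand-in for isinstance(o, URIRef)


def clean_triple(t: Triple) -> Triple:
    """Strip trailing whitespace from the subject, and from the object if it is a URI."""
    s_fixed = t.s.rstrip()
    o_fixed = t.o.rstrip() if t.o_is_uri else t.o
    return Triple(s_fixed, t.p, o_fixed, t.o_is_uri)


both_bad = Triple("https://example.org/a\n", "https://example.org/p", "https://example.org/b\n", True)
literal = Triple("https://example.org/a", "https://example.org/q", "free text\n", False)

print(clean_triple(both_bad))
print(clean_triple(literal) == literal)  # True: literal objects are left alone
```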

Given a schema-collection entity (i.e. one with an `id` and its own mongo document), we want to easily find all other schema-collection entities to which it connects, via any slot.

To do this, we first gather all schema classes that are the type of a schema-collection entity, as well as these classes' ancestors.

In [17]:
from linkml_runtime.utils.schemaview import SchemaView

from nmdc_runtime.util import nmdc_schema_view, nmdc_database_collection_instance_class_names

schema_view = nmdc_schema_view()
toplevel_classes = set()
for name in nmdc_database_collection_instance_class_names():
    toplevel_classes |= set(schema_view.class_ancestors(name))

Next, we determine which slots have such a "top-level" class as their range.

In [18]:
slots = schema_view.all_slots()

toplevel_entity_connectors = set()
for k, v in context.items():
    if isinstance(v, dict) and "@type" in v and v["@type"] == "@id":
        if slots[k].range in toplevel_classes and slots[k].domain != "Database":
            toplevel_entity_connectors.add(k)
print(toplevel_entity_connectors)

{'was_informed_by', 'was_generated_by', 'metagenome_annotation_id', 'has_input', 'has_output', 'collected_from', 'part_of'}


Let's construct an entity-relationship diagram to visualize relationships.

In [19]:
# print("classDiagram\n")
# for slot_name in toplevel_entity_connectors:
#     slot = slots[slot_name]
#     domain = slot.domain or "NamedThing"
#     range = slot.range
#     print(f"{domain} --> {range} : {slot_name}")

# print()

# inheritance_links = set()
# for cls in toplevel_classes:
#     ancestors = schema_view.class_ancestors(cls)
#     for a in ancestors:
#         if a != cls:
#             inheritance_links.add(f"{a} <|-- {cls}")

# for link in inheritance_links:
#     print(link)

Now, let's assert a common `depends_on` relation for all entities connected by these slots, so that we can traverse the graph of top-level entities without naming any specific slot. Note that `has_output` is inverted: an output depends on the entity that produced it.

In [20]:
from rdflib import PROV

for s, p, o in tqdm(g, total=len(g)):
    if (connector := p.removeprefix(str(NMDC))) in toplevel_entity_connectors:
        if connector == "has_output":
            g.add((o, NMDC.depends_on, s))
        else:
            g.add((s, NMDC.depends_on, o))

print(f"{len(g):,} triples in total")

  0%|          | 0/6795354 [00:00<?, ?it/s]

6,795,354 triples in total
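The effect of the common `depends_on` relation can be sketched on toy data without `rdflib`. All identifiers below (`sample:1`, `activity:1`, and so on) are hypothetical, and `depends_on_edges` and `upstream` are illustrative helper names. The sketch derives the edges with the same `has_output` inversion, then walks them transitively.

```python
from collections import defaultdict, deque

# Hypothetical triples with local slot names as predicates, for illustration only.
toy_triples = [
    ("sample:1", "part_of", "study:1"),
    ("activity:1", "has_input", "sample:1"),
    ("activity:1", "has_output", "data:1"),
    ("activity:2", "has_input", "data:1"),
    ("activity:2", "has_output", "data:2"),
    ("data:2", "name", "assembly"),  # not a connector slot, so it is ignored
]
toy_connectors = {"part_of", "has_input", "has_output"}


def depends_on_edges(triples, connectors):
    """Map each entity to the set of entities it directly depends on."""
    edges = defaultdict(set)
    for s, p, o in triples:
        if p not in connectors:
            continue
        if p == "has_output":
            edges[o].add(s)  # an output depends on its producer
        else:
            edges[s].add(o)
    return edges


def upstream(start, edges):
    """Collect everything reachable from start via depends_on (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependency in edges.get(node, ()):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


edges = depends_on_edges(toy_triples, toy_connectors)
print(sorted(upstream("data:2", edges)))
# ['activity:1', 'activity:2', 'data:1', 'sample:1', 'study:1']
```

The traversal never needs to know which slot produced an edge, which is the point of asserting a single relation.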


Materialize superclass relations.

In [22]:
schema_view = nmdc_schema_view()
toplevel_classes = set()
for name in nmdc_database_collection_instance_class_names():
    toplevel_classes |= set(getattr(NMDC, a) for a in schema_view.class_ancestors(name))

for s, p, o in tqdm(g, total=len(g)):
    p_localname = p.removeprefix(str(NMDC))
    if p_localname != "type":
        continue
    if o not in toplevel_classes:
        continue
    for a in schema_view.class_ancestors(o.removeprefix(str(NMDC))):
        g.add((s, NMDC.type, getattr(NMDC,a)))

  0%|          | 0/6991623 [00:00<?, ?it/s]
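As a toy illustration of this materialization step, the sketch below expands each type assertion to include every ancestor class. The child-to-parent map `toy_parents` and the entity identifiers are assumptions made for the example; in the notebook, the hierarchy comes from the schema view. Here, the ancestor list of a class includes the class itself.

```python
# Hypothetical single-inheritance hierarchy (child -> parent), for illustration only.
toy_parents = {
    "MetagenomeAssembly": "WorkflowExecutionActivity",
    "WorkflowExecutionActivity": "Activity",
    "Activity": None,
    "Study": None,
}


def ancestors(cls, parents):
    """Return cls followed by its chain of ancestors, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = parents.get(cls)
    return chain


def materialize_types(type_assertions, parents):
    """Expand (entity, class) pairs so each entity is also typed by every ancestor class."""
    return {
        (entity, ancestor)
        for entity, cls in type_assertions
        for ancestor in ancestors(cls, parents)
    }


asserted = [("activity:1", "MetagenomeAssembly"), ("study:1", "Study")]
materialized = materialize_types(asserted, toy_parents)
print(sorted(e for e, c in materialized if c == "Activity"))
# ['activity:1']
```

After materialization, a query for all instances of a superclass needs no subclass reasoning, which is what the count in the next cell relies on.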

In [31]:
len([t for t in g.subjects(NMDC.type, NMDC.Activity)])

13183

Serialize and store as gzipped N-Triples file.

In [37]:
import gzip

with gzip.open('data/nmdc-db.nt.gz', 'wb') as f:
    f.write(g.serialize(format='nt').encode())
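Because N-Triples is a line-based format, a quick sanity check on the written file is to count its statement lines and compare the result with `len(g)`. The following is a minimal sketch under that assumption; `count_ntriples` is a hypothetical helper, and the demo writes a throwaway file with two made-up triples rather than reading `data/nmdc-db.nt.gz`.

```python
import gzip
import os
import tempfile


def count_ntriples(path: str) -> int:
    """Count non-blank, non-comment lines in a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip() and not line.lstrip().startswith("#"))


# Small self-contained demo on a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.nt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('<https://example.org/a> <https://example.org/p> "x" .\n')
    f.write("<https://example.org/a> <https://example.org/q> <https://example.org/b> .\n")
    f.write("\n")

print(count_ntriples(demo_path))  # 2
```

In this notebook, the equivalent check would be `count_ntriples('data/nmdc-db.nt.gz') == len(g)`.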

Wipe any existing persisted data.

In [38]:
!docker compose up fuseki -d
!docker exec fuseki rm -rf /fuseki-base/nmdc-db.tdb

[1A[1B[0G[?25l[+] Running 1/0
 [32m✔[0m Container fuseki  [32mRunning[0m                                               [34m0.0s [0m
[?25h

Ensure data is present to load.

In [39]:
!docker cp data/nmdc-db.nt.gz fuseki:/fuseki-base/

Successfully copied 212MB to fuseki:/fuseki-base/


Take server down in order to bulk-load data.

In [40]:
!docker compose down fuseki

[1A[1B[0G[?25l[+] Running 0/0
 ⠋ Container fuseki  Stopping                                              [34m0.1s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠙ Container fuseki  Stopping                                              [34m0.2s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠹ Container fuseki  Stopping                                              [34m0.3s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠸ Container fuseki  Stopping                                              [34m0.4s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠼ Container fuseki  Stopping                                              [34m0.5s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠴ Container fuseki  Stopping                                              [34m0.6s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 ⠦ Container fuseki  Stopping                                              [34m0.7s [0m
[?25h[1A[1A[0G[?25l[+] Running 2/1
 [32m✔[0m Container fuseki              [32mRemoved[

Bulk-load data.

In [41]:
!docker compose run fuseki ./apache-jena-4.9.0/bin/tdbloader --loc /fuseki-base/nmdc-db.tdb /fuseki-base/nmdc-db.nt.gz

14:00:20 INFO  loader          :: -- Start triples data phase
14:00:20 INFO  loader          :: ** Load empty triples table
14:00:20 INFO  loader          :: -- Start quads data phase
14:00:20 INFO  loader          :: ** Load empty quads table
14:00:20 INFO  loader          :: Load: /fuseki-base/nmdc-db.nt.gz -- 2024/03/14 14:00:20 UTC
14:00:21 WARN  riot            :: [line: 29434, col: 92] Bad IRI: Not a valid UUID string: uuid:DELA-CB-T-13ba6115-12fc-47cc-8cb0-ebf65e1d23d1
14:00:22 WARN  riot            :: [line: 86977, col: 92] Bad IRI: Not a valid UUID string: uuid:TEAK-CB-T-0d2245d4-c6da-4723-95be-ca5aefe607de
14:00:22 INFO  loader          :: Add: 100,000 triples (Batch: 76,219 / Avg: 76,219)
14:00:22 WARN  riot            :: [line: 130251, col: 92] Bad IRI: Not a valid UUID string: uuid:CSF2-CB-T-6d29d97e-b8d7-4844-a8c3-cc181f4c9909
14:00:22 INFO  loader          :: Add: 200,000 triples (Batch: 130,718 / Avg: 96,292)
14:00:23 INFO  loader          :: Add: 300,000 triples (Batch

Start up server.

In [42]:
!docker compose up fuseki -d

[1A[1B[0G[?25l[+] Running 1/0
 [32m✔[0m Container fuseki  [32mCreated[0m                                               [34m0.0s [0m
[?25h[1A[1A[0G[?25l[34m[+] Running 1/1[0m
 [32m✔[0m Container fuseki  [32mStarted[0m                                               [34m0.0s [0m
[?25h

Now go to <http://localhost:3030/#/dataset/nmdc/query> and SPARQL it up.

In [43]:
# 2024-03-14T09:40 : took <4min to run all the above.
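The dataset can also be queried from Python over the SPARQL protocol. The sketch below uses only the standard library. The endpoint URL `http://localhost:3030/nmdc/sparql` is an assumption based on the dataset name in the UI link above and may differ with the Fuseki configuration; the helper names and the entity IRI in the usage comment are hypothetical. The query uses a property path (`+`) to follow `nmdc:depends_on` transitively.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Fuseki serves the `nmdc` dataset's query service at this URL.
ENDPOINT = "http://localhost:3030/nmdc/sparql"


def upstream_query(entity_iri: str) -> str:
    """Build a SPARQL query for everything the entity transitively depends on."""
    return (
        "PREFIX nmdc: <https://w3id.org/nmdc/>\n"
        "SELECT DISTINCT ?upstream WHERE {\n"
        f"  <{entity_iri}> nmdc:depends_on+ ?upstream .\n"
        "}"
    )


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare a form-encoded POST request that asks for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )


def fetch_upstream(entity_iri: str, endpoint: str = ENDPOINT) -> list:
    """Run the query against a live server and return the ?upstream IRIs."""
    request = build_request(upstream_query(entity_iri), endpoint)
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [binding["upstream"]["value"] for binding in results["results"]["bindings"]]


# Usage, with the server running and a real entity IRI substituted in:
# fetch_upstream("https://w3id.org/nmdc/EXAMPLE-ID")
```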