# Referential integrity checker (prototype)

## Prerequisites

Before running this notebook, make sure you have done the following:

1. Run `$ make up-dev`
2. Map `localhost:27018` to the Mongo server you want to use
3. Load a recent dump of the production Mongo database into that Mongo server (see `$ make mongorestore-nmdc-db` for an example)
4. In the `.env` file, set `MONGO_HOST` to `mongodb://localhost:27018`
5. Run `$ export $(grep -v '^#' .env | xargs)` to load the environment variables defined in `.env` into your shell environment
6. Run `$ make init` to ensure a consistent Python kernel for this notebook

Once you've done all of those things, you can run this notebook (e.g. via `$ jupyter notebook`).
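
Step 5 exports the variables into the shell that launches Jupyter. If the kernel was started from a different shell, the same values can be loaded from within Python instead. The following is a minimal sketch, assuming `.env` contains only simple `KEY=VALUE` lines (no quoting, no `export` prefixes, no multi-line values); `parse_env_lines` is a hypothetical helper, not part of `nmdc_runtime`.

```python
import os


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments.

    Hypothetical helper for illustration only; it does not handle quoting,
    `export` prefixes, or multi-line values.
    """
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


# Example content standing in for a real `.env` file.
sample = [
    "# Mongo connection",
    "MONGO_HOST=mongodb://localhost:27018",
    "",
]
parsed = parse_env_lines(sample)
os.environ.update(parsed)  # make the values visible to this process
print(os.environ["MONGO_HOST"])
```

With a real file, the list could be replaced by the lines of `.env` read from disk.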


In [1]:
!echo $MONGO_HOST

localhost:27018


## Enable automatic reloading of modules

Reference: https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html#autoreload

In [2]:
# Reload imported modules automatically, so that edits to their source
# take effect without needing to restart the kernel and lose state
%load_ext autoreload
%autoreload 2

## Import Python modules

Be sure you're using the version of `nmdc-schema` you think you are!

In [3]:
from importlib.metadata import version

version("nmdc-schema")

'11.1.0'

In [4]:
from collections import defaultdict
import concurrent.futures
from itertools import chain
import os
import re

from linkml_runtime.utils.schemaview import SchemaView
from pymongo import InsertOne
from toolz import dissoc, assoc
from tqdm.notebook import tqdm

from nmdc_runtime.api.core.util import pick
from nmdc_runtime.api.db.mongo import get_mongo_db, get_nonempty_nmdc_schema_collection_names, get_collection_names_from_schema
from nmdc_runtime.util import collection_name_to_class_names, populated_schema_collection_names_with_id_field, nmdc_schema_view, nmdc_database_collection_instance_class_names, get_nmdc_jsonschema_dict
from nmdc_schema.nmdc import Database as NMDCDatabase 
from nmdc_schema.get_nmdc_view import ViewGetter

mdb = get_mongo_db()
schema_view = nmdc_schema_view()

## Create slot mappings

In [5]:
collection_names = populated_schema_collection_names_with_id_field(mdb)  # use `get_nonempty_nmdc_schema_collection_names` instead to include "functional_annotation_agg"

Collect all possible classes of documents across all schema collections. `collection_name_to_class_names` is a mapping from collection name to a list of class names allowable for that collection's documents.

In [6]:
document_class_names = set(chain.from_iterable(collection_name_to_class_names.values()))

Map each document-class name to a map of slot name to slot definition. Class slots here are (to quote the LinkML SchemaView documentation) "all slots that are asserted or inferred for [the] class, with their inferred semantics."

In [7]:
cls_slot_map = {
    cls_name : {slot.name: slot
                for slot in schema_view.class_induced_slots(cls_name)
               }
    for cls_name in document_class_names
}
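
To make "induced" concrete: a class's induced slots include those inherited from its ancestors, with any per-class overrides applied. The sketch below is a toy model in plain Python, not LinkML's implementation; the class names, slot names, and the `induced_slots` helper are all hypothetical.

```python
# Toy schema: each class names its parent, its own slots, and any
# per-class overrides of inherited slots (akin to LinkML `slot_usage`).
TOY_CLASSES = {
    "NamedThing": {
        "parent": None,
        "slots": {"id": {"range": "string"}},
        "slot_usage": {},
    },
    "Process": {
        "parent": "NamedThing",
        "slots": {"has_input": {"range": "NamedThing"}},
        "slot_usage": {},
    },
    "Extraction": {
        "parent": "Process",
        "slots": {},
        "slot_usage": {"has_input": {"range": "Sample"}},
    },
}


def induced_slots(cls_name):
    """Return slot name -> definition, merging ancestors first, then overrides."""
    cls = TOY_CLASSES[cls_name]
    inherited = induced_slots(cls["parent"]) if cls["parent"] else {}
    merged = {name: dict(defn) for name, defn in inherited.items()}
    merged.update({name: dict(defn) for name, defn in cls["slots"].items()})
    for name, override in cls["slot_usage"].items():
        merged[name] = {**merged.get(name, {}), **override}
    return merged


toy_cls_slot_map = {name: induced_slots(name) for name in TOY_CLASSES}
print(toy_cls_slot_map["Extraction"]["has_input"]["range"])
```

Here `Extraction` inherits `has_input` from `Process` but narrows its range, which is why a per-class slot map is needed rather than a single global one.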

## Materialize single-collection view of database

The `alldocs` collection associates each database document's `id` with not only its class (via that document's `type` field) but also with all ancestors of the document's class.

The set-of-classes association is done by storing the list of class names in the `_type_and_ancestors` field of each `alldocs` document, which facilitates filtering by type using the same structured query forms as for upstream schema collections. The `type` field retains the source document's asserted class; this is so that validation code can determine the expected range of document slots, as slot ranges may be specialized by a class (via LinkML `slot_usage`).

To keep the `alldocs` collection focused on supporting referential-integrity checking, only document-reference-ranged slots from source documents are copied to an entity's corresponding `alldocs` materialization. 
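
As an illustration, the sketch below builds an `alldocs`-style record from a source document using a toy class hierarchy, mirroring the `_type_and_ancestors` field written by the materialization code in this notebook. The parent map, the `toy_ancestors` and `materialize` helpers, and the sample document (including its ids) are hypothetical; the notebook itself obtains ancestors from `schema_view.class_ancestors`.

```python
# Hypothetical child -> parent map standing in for the real schema.
TOY_PARENTS = {
    "Biosample": "MaterialEntity",
    "MaterialEntity": "NamedThing",
    "NamedThing": None,
}


def toy_ancestors(cls_name):
    """Return the class itself followed by its ancestors, nearest first."""
    lineage = []
    while cls_name is not None:
        lineage.append(cls_name)
        cls_name = TOY_PARENTS[cls_name]
    return lineage


def materialize(doc, reference_slots):
    """Keep only `id`, `type`, and reference-ranged slots; add the ancestor list."""
    cls_name = doc["type"].removeprefix("nmdc:")  # requires Python 3.9+
    keep = ["id", "type"] + reference_slots.get(cls_name, [])
    new_doc = {key: doc[key] for key in keep if key in doc}
    new_doc["_type_and_ancestors"] = toy_ancestors(cls_name)
    return new_doc


# Made-up source document; `name` is not a reference slot and is dropped.
source_doc = {
    "id": "nmdc:bsm-00-000001",
    "type": "nmdc:Biosample",
    "name": "example sample",
    "associated_studies": ["nmdc:sty-00-000001"],
}
record = materialize(source_doc, {"Biosample": ["associated_studies"]})
print(record)
```

Because MongoDB equality matching on an array field matches any element, a filter such as `{"_type_and_ancestors": "MaterialEntity"}` would select a record like this one.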

In [8]:
# From https://github.com/microbiomedata/refscan/blob/af092b0e068b671849fe0f323fac2ed54b81d574/refscan/lib/helpers.py#L141-L176

from typing import List
from linkml_runtime import linkml_model

def get_names_of_classes_in_effective_range_of_slot(
    schema_view: SchemaView, slot_definition: linkml_model.SlotDefinition
) -> List[str]:
    r"""
    Determine the slot's "effective" range, by taking into account its `any_of` constraints (if defined).

    Note: The `any_of` constraints constrain the slot's "effective" range beyond that described by the
          induced slot definition's `range` attribute. `SchemaView` does not seem to provide the result
          of applying those additional constraints, so we do it manually here (if any are defined).
          Reference: https://github.com/orgs/linkml/discussions/2101#discussion-6625646

    Reference: https://linkml.io/linkml-model/latest/docs/any_of/
    """

    # Initialize the list to be empty.
    names_of_eligible_target_classes = []

    # If the `any_of` constraint is defined on this slot, use that instead of the `range`.
    if "any_of" in slot_definition and len(slot_definition.any_of) > 0:
        for slot_expression in slot_definition.any_of:
            # Use the slot expression's `range` to get the specified eligible class name
            # and the names of all classes that inherit from that eligible class.
            if slot_expression.range in schema_view.all_classes():
                own_and_descendant_class_names = schema_view.class_descendants(slot_expression.range)
                names_of_eligible_target_classes.extend(own_and_descendant_class_names)
    else:
        # Use the slot's `range` to get the specified eligible class name
        # and the names of all classes that inherit from that eligible class.
        if slot_definition.range in schema_view.all_classes():
            own_and_descendant_class_names = schema_view.class_descendants(slot_definition.range)
            names_of_eligible_target_classes.extend(own_and_descendant_class_names)

    # Remove duplicate class names.
    names_of_eligible_target_classes = list(set(names_of_eligible_target_classes))

    return names_of_eligible_target_classes
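
The effect of preferring `any_of` over `range` can be seen on a toy hierarchy. This sketch uses plain dicts and hypothetical class names rather than LinkML objects, so it only approximates the logic of the function above.

```python
# Hypothetical parent -> children map standing in for the schema.
TOY_CHILDREN = {
    "Sample": ["Biosample", "ProcessedSample"],
    "Biosample": [],
    "ProcessedSample": [],
    "DataObject": [],
}


def toy_descendants(cls_name):
    """Return the class itself plus all of its descendants."""
    names = [cls_name]
    for child in TOY_CHILDREN[cls_name]:
        names.extend(toy_descendants(child))
    return names


def toy_effective_range(slot):
    """Prefer `any_of` ranges when present; otherwise fall back to `range`."""
    ranges = [expr["range"] for expr in slot.get("any_of", [])] or [slot["range"]]
    eligible = set()
    for name in ranges:
        if name in TOY_CHILDREN:  # ignore non-class ranges such as "string"
            eligible.update(toy_descendants(name))
    return sorted(eligible)


slot_with_any_of = {
    "range": "NamedThing",
    "any_of": [{"range": "Biosample"}, {"range": "DataObject"}],
}
slot_plain = {"range": "Sample"}
slot_scalar = {"range": "string"}

print(toy_effective_range(slot_with_any_of))  # ['Biosample', 'DataObject']
print(toy_effective_range(slot_plain))  # ['Biosample', 'ProcessedSample', 'Sample']
print(toy_effective_range(slot_scalar))  # []
```

An empty result, as for `slot_scalar`, indicates a slot that cannot reference another document.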

In [9]:
# Any ancestor of a document class is a document-referenceable range, i.e., a valid range of a document-reference-ranged slot.
document_referenceable_ranges = set(chain.from_iterable(schema_view.class_ancestors(cls_name) for cls_name in document_class_names))

document_reference_ranged_slots = defaultdict(list)
for cls_name, slot_map in cls_slot_map.items():
    for slot_name, slot in slot_map.items():
        if set(get_names_of_classes_in_effective_range_of_slot(schema_view, slot)) & document_referenceable_ranges:
            document_reference_ranged_slots[cls_name].append(slot_name)

In [10]:
# Drop any existing `alldocs` collection (e.g. from previous use of this notebook).
mdb.alldocs.drop()

BULK_WRITE_BATCH_SIZE = 2_000 # ensure bulk-write batches aren't too huge

# Set up progress bar
n_docs_total = sum(mdb[name].estimated_document_count() for name in collection_names)
pbar = tqdm(total=n_docs_total)

for coll_name in collection_names:
    pbar.set_description(f"processing {coll_name}...")
    requests = []
    for doc in mdb[coll_name].find():
        doc_type = doc['type'][5:] # lop off "nmdc:" prefix
        slots_to_include = ["id", "type"] + document_reference_ranged_slots[doc_type]
        new_doc = pick(slots_to_include, doc)
        new_doc["_type_and_ancestors"] = schema_view.class_ancestors(doc_type)
        requests.append(InsertOne(new_doc))
        if len(requests) == BULK_WRITE_BATCH_SIZE: 
            result = mdb.alldocs.bulk_write(requests, ordered=False)
            pbar.update(result.inserted_count)
            requests.clear()
    if len(requests) > 0:
        result = mdb.alldocs.bulk_write(requests, ordered=False)
        pbar.update(result.inserted_count)
pbar.close()

# Prior to re-ID-ing, some IDs are not unique across Mongo collections (e.g. nmdc:0078a0f981ad3f92693c2bc3b6470791)

# Ensure unique id index for `alldocs` collection.
# The index is sparse because e.g. nmdc:FunctionalAnnotationAggMember documents don't have an "id".
mdb.alldocs.create_index("id", unique=True, sparse=True)

print("refreshed `alldocs` collection")

  0%|          | 0/163175 [00:00<?, ?it/s]

refreshed `alldocs` collection


The resulting `alldocs` collection contains a copy of every document from every Mongo collection identified earlier. Each copy has only a subset of the original document's key-value pairs (`id`, `type`, and any document-reference-ranged slots), plus a `_type_and_ancestors` field that lists the names of the document's own class and all of its ancestor classes.
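
The materialization loop accumulates write requests and flushes them in fixed-size batches, with one final flush for the remainder. The same pattern can be factored into a small generator. `in_batches` is a hypothetical helper (Python 3.12+ offers `itertools.batched` for a similar purpose), and the last lines only simulate the flush rather than calling MongoDB.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Simulate flushing 5 requests in batches of 2.
batch_sizes = [len(batch) for batch in in_batches(range(5), 2)]
print(batch_sizes)  # [2, 2, 1]
```

Each yielded batch is a fresh list, so a consumer may keep or modify it without affecting later batches.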

## Validate

### Check referential integrity

In the following cells, we populate two lists:

- `errors["not_found"]`: a list of "naive" errors (the referenced `id` was not found in `alldocs`)
- `errors["invalid_type"]`: a list of (hierarchy-aware) type errors (document was found, but is of an invalid type)

Reference: https://linkml.io/linkml/developers/schemaview.html#linkml_runtime.utils.schemaview.SchemaView.class_induced_slots
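
The distinction between the two categories can be shown with a toy example. The ids, class names, and the `classify_reference` helper below are hypothetical; the notebook's actual check runs against the `alldocs` collection.

```python
# Hypothetical index of known documents: id -> class name.
KNOWN_DOCS = {
    "nmdc:sty-00-000001": "Study",
    "nmdc:bsm-00-000001": "Biosample",
}


def classify_reference(referenced_id, acceptable_classes):
    """Return "not_found", "invalid_type", or None when the reference is valid."""
    if referenced_id not in KNOWN_DOCS:
        return "not_found"
    if KNOWN_DOCS[referenced_id] not in acceptable_classes:
        return "invalid_type"
    return None


print(classify_reference("nmdc:sty-00-000001", ["Study"]))  # None
print(classify_reference("nmdc:sty-00-999999", ["Study"]))  # not_found
print(classify_reference("nmdc:bsm-00-000001", ["Study"]))  # invalid_type
```

The type check is "hierarchy-aware" in the sense that the acceptable classes passed in would already include descendants of the slot's declared range.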

In [11]:
def doc_assertions(limit=0, batch_size=2_000):
    """Yields batches of assertions to greatly speed up processing."""
    # Initialize progress bar.
    pbar = tqdm(total=(mdb.alldocs.estimated_document_count() if limit == 0 else limit))
    rv = []
    for doc in mdb.alldocs.find(limit=limit):
        # Iterate over each key/value pair in the dictionary (document).
        for field, value in doc.items():
            if field.startswith("_") or field in ("id", "type"):
                continue
            acceptable_slot_classes = get_names_of_classes_in_effective_range_of_slot(
                schema_view,
                cls_slot_map[doc["type"][5:]][field],
            )
            if not isinstance(value, list):
                value = [value]
            for v in value:
                rv.append({
                    "id": doc.get("id", doc["_id"]),
                    "id_is_nmdc_id": "id" in doc,
                    "field": field,
                    "value": v,
                    "acceptable_slot_classes": acceptable_slot_classes,
                })
                if len(rv) == batch_size:
                    yield rv
                    rv = []  # start a new list so the yielded batch is never mutated
        pbar.update(1)
    yield rv
    pbar.close()

In [12]:
from pprint import pprint

alldocs_ids = set(mdb.alldocs.distinct("id"))

def doc_field_value_errors(assertions):
    errors = {"not_found": [], "invalid_type": []}
    # group assertions by referenced "id" value.
    assertions_by_referenced_id_value = defaultdict(list)
    for a in assertions:
        assertions_by_referenced_id_value[a["value"]].append(a)
    # associate each referenced document id with its type.
    doc_id_types = {}
    for d in list(mdb.alldocs.find({"id": {"$in": list(assertions_by_referenced_id_value.keys())}}, {"_id": 0, "id": 1, "type": 1})):
        doc_id_types[d["id"]] = d["type"]

    for id_value, id_value_assertions in assertions_by_referenced_id_value.items():
        if id_value not in alldocs_ids:
            errors["not_found"].extend(id_value_assertions)
        else:
            for a in id_value_assertions:
                # check that the document-reported type for this id reference is kosher as per the referring slot's schema definition.
                if doc_id_types[a["value"]][5:] not in a["acceptable_slot_classes"]:
                    errors["invalid_type"].append(a)

    return errors


# Initialize "global" error lists.
errors = {"not_found": [], "invalid_type": []}

for das in doc_assertions(batch_size=2_000):
    rv = doc_field_value_errors(das)
    errors["not_found"].extend(rv["not_found"])
    errors["invalid_type"].extend(rv["invalid_type"])

  0%|          | 0/163175 [00:00<?, ?it/s]

## Results

Display the number of errors in each list.

In [13]:
len(errors["not_found"]), len(errors["invalid_type"])
# results with v11.1.0 on `/global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-11-25_20-12-02/nmdc`: (33, 0)

(33, 0)
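
A count alone does not show where the problems are. One possible next step, sketched here with made-up error records that carry the same `id`, `field`, and `value` keys as the assertion dicts built earlier, is to tally errors by referring field.

```python
from collections import Counter

# Made-up records; real ones would come from errors["not_found"].
sample_not_found = [
    {"id": "nmdc:a", "field": "was_informed_by", "value": "nmdc:x"},
    {"id": "nmdc:b", "field": "was_informed_by", "value": "nmdc:y"},
    {"id": "nmdc:c", "field": "has_input", "value": "nmdc:z"},
]

errors_by_field = Counter(error["field"] for error in sample_not_found)
for field, count in errors_by_field.most_common():
    print(f"{field}: {count}")
```

Grouping by field tends to reveal whether the dangling references cluster in one slot or are spread across the schema.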