# Referential integrity checker (prototype)

## Prerequisites

Before running this notebook, make sure you have done the following:

1. Run `$ make up-dev`
2. Map `localhost:27018` to the Mongo server you want to use
3. Load a recent dump of the production Mongo database into that Mongo server (see `$ make mongorestore-nmdc-dev` for an example)
4. In the `.env` file, set `MONGO_HOST` to `mongodb://localhost:27018`
5. Run `$ export $(grep -v '^#' .env | xargs)` to load the environment variables defined in `.env` into your shell environment

Once you have completed these steps, you can run this notebook (e.g. via `$ jupyter notebook`).
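As a quick sanity check before running the cells below, the following sketch inspects the `MONGO_HOST` setting from steps 4 and 5. The helper `mongo_host_problems` is hypothetical (it is not part of `nmdc_runtime`), and it only checks the value against the prerequisites listed above; it does not attempt to connect to Mongo.

```python
import os
from urllib.parse import urlparse


def mongo_host_problems(env: dict) -> list[str]:
    """Return human-readable problems with the MONGO_HOST setting (empty if none)."""
    host = env.get("MONGO_HOST")
    if not host:
        return ["MONGO_HOST is not set; did you run the `export` command from step 5?"]
    parsed = urlparse(host)
    problems = []
    if parsed.scheme != "mongodb":
        problems.append(f"unexpected scheme {parsed.scheme!r} in {host!r}")
    if parsed.port != 27018:
        problems.append(f"expected port 27018, found {parsed.port!r}")
    return problems


# Check the current shell environment (the output depends on your setup).
print(mongo_host_problems(dict(os.environ)) or "MONGO_HOST looks as expected")
```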


## Enable automatic reloading of modules

Reference: https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html#autoreload

In [1]:
# Automatically reload imported modules before running each cell, so that edits to
# their source take effect without restarting the kernel (and losing state).
%load_ext autoreload
%autoreload 2

## Import Python modules

In [2]:
import os

from linkml_runtime.utils.schemaview import SchemaView
from toolz import dissoc, assoc
from tqdm.notebook import tqdm

from nmdc_runtime.api.db.mongo import get_mongo_db, nmdc_schema_collection_names
from nmdc_runtime.util import collection_name_to_class_names, nmdc_schema_view, nmdc_database_collection_instance_class_names
from nmdc_schema.nmdc_schema_accepting_legacy_ids import Database as NMDCDatabase
from nmdc_schema.get_nmdc_view import ViewGetter

mdb = get_mongo_db()

## "Pre-clean" the data

Determine the name of each Mongo collection in which at least one document has a field named `id`.

> **TODO:** Documents in the [`functional_annotation_agg` collection](https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/) do not have a field named `id`, and so will not be included here. Document the author's rationale for omitting it.

> **TODO:** The `nmdc_schema_collection_names` function combines the collection names in Mongo with the Database slots in the schema, and then omits some collection names. Document why the author took that approach.

In [3]:
collection_names = sorted(nmdc_schema_collection_names(mdb))
collection_names = [n for n in collection_names if mdb[n].find_one({"id": {"$exists": True}})]
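To illustrate what the `{"id": {"$exists": True}}` filter keeps and drops, here is a minimal in-memory sketch that needs no Mongo connection. The toy collections, document contents, and the helper `has_doc_with_id` are made up for illustration; only the collection name `functional_annotation_agg` comes from the note above.

```python
def has_doc_with_id(docs: list[dict]) -> bool:
    """Emulate `find_one({"id": {"$exists": True}})`: True if any document has an `id` key."""
    return any("id" in doc for doc in docs)


# Hypothetical stand-in for a database: collection name -> list of documents.
toy_db = {
    "study_set": [{"id": "nmdc:example-study"}],
    "functional_annotation_agg": [{"metagenome_annotation_id": "nmdc:example", "count": 3}],
    "empty_set": [],
}

kept = sorted(name for name, docs in toy_db.items() if has_doc_with_id(docs))
print(kept)  # ['study_set']
```

Collections that are empty, or whose documents all lack an `id` field, are excluded from the rest of the notebook.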

### Remove fields that contain null

For each document in the collections identified above, remove any field whose name appears in the hard-coded list in the cell below and whose value in that document is null.

> **TODO:** Document how the author obtained this list and whether the list would require maintenance over time.
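The intended effect on a single document can be sketched in plain Python, without a Mongo connection. The helper name `unset_null_fields` and the sample document are hypothetical; the actual cell performs the equivalent change in the database with `$unset`.

```python
def unset_null_fields(doc: dict, props: list[str]) -> dict:
    """Return a copy of `doc` without any listed field whose value is None."""
    return {k: v for k, v in doc.items() if not (k in props and v is None)}


# `git_url` is listed and null, so it is removed; `name` is null but not listed, so it stays.
doc = {"id": "nmdc:example", "git_url": None, "used": "some tool", "name": None}
print(unset_null_fields(doc, ["used", "git_url"]))
# {'id': 'nmdc:example', 'used': 'some tool', 'name': None}
```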

In [1]:
# check these slots for null values for all docs in collection_names
props = ["used", "git_url", "was_associated_with", "was_generated_by", "compression_type", 
         "metagenome_annotation_id", "metaproteomic_analysis_id"] 

# The loops below call `pbar.update(1)` once per (slot, collection) pair.
pbar = tqdm(total=len(props) * len(collection_names))
for p in props:
    for coll_name in collection_names:
        pbar.set_description(f"checking {coll_name}...")
        # The `{"$type": 10}` query matches only fields that are present with a BSON Null value;
        # unlike `{p: None}`, it does not also match documents that lack the field.
        docs_broken = list(mdb[coll_name].find({p: {"$type": 10}}, ["id"]))
        if docs_broken:
            print(f"removing {len(docs_broken)} null-valued {p} values for {coll_name}...")
            mdb[coll_name].update_many(
                {"id": {"$in": [d["id"] for d in docs_broken]}},
                {"$unset": {p: None}}
            )
        pbar.update(1)
pbar.close()

NameError: name 'tqdm' is not defined

## Materialize single-collection view of database

Check the assumption that every populated collection currently holds documents of only one type.

> **TODO:** The "class_names" part of the `collection_name_to_class_names` dictionary does not list _descendant_ classes, even though the schema will allow instances of descendant classes to reside in those collections. Document why disregarding descendant classes here is OK.

In [5]:
for name in collection_names:
    assert len(collection_name_to_class_names[name]) == 1

Define a helper function that takes a class instance and returns a list of the names of its own class and its ancestor classes.

In [6]:
def class_hierarchy_as_list(obj) -> list[str]:
    r"""
    Returns a list consisting of the name of the class of the instance passed in,
    and the names of all of its ancestor classes.

    TODO: Consider renaming function to be a verb; e.g. `get_class_hierarchy_as_list`.

    TODO: Document the purpose of the `rv` list (does not seem to be used anywhere).
    """

    rv = []
    current_class = obj.__class__
    
    def recurse_through_bases(cls):
        name = cls.__name__
        if name == "YAMLRoot":  # base case
            return rv
        rv.append(name)
        for base in cls.__bases__:
            recurse_through_bases(base)  # recursive invocation
        return rv
    
    return recurse_through_bases(current_class)  # initial invocation
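To show what the helper above is expected to return, here is a self-contained sketch using toy classes. The classes below are stand-ins that only mimic the names seen later in this notebook; they are not the real `nmdc_schema` classes. The function `class_names_via_mro` is a hypothetical alternative that walks Python's method resolution order instead of recursing through `__bases__`; for single-inheritance chains like this one it should give the same result.

```python
# Toy stand-ins for the generated schema classes (not the real ones).
class YAMLRoot: ...
class NamedThing(YAMLRoot): ...
class PlannedProcess(NamedThing): ...
class OmicsProcessing(PlannedProcess): ...


def class_names_via_mro(obj) -> list[str]:
    """List the class of `obj` and its ancestors, stopping before `YAMLRoot`."""
    names = []
    for cls in type(obj).__mro__:
        if cls.__name__ == "YAMLRoot":
            break
        names.append(cls.__name__)
    return names


print(class_names_via_mro(OmicsProcessing()))
# ['OmicsProcessing', 'PlannedProcess', 'NamedThing']
```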

Materialize the `alldocs` collection: copy every document from the collections identified above into it, replacing each copy's `type` field with a list of the names of the document's class and all of its ancestor classes.

In [None]:
# Drop any existing `alldocs` collection (e.g. from previous use of this notebook).
mdb.alldocs.drop()

# Set up progress bar
n_docs_total = sum(mdb[name].estimated_document_count() for name in collection_names)
pbar = tqdm(total=n_docs_total)

# for each collection name
for coll_name in collection_names:
    pbar.set_description(f"processing {coll_name}...")
    # Instantiate the schema class for one document of this collection, after removing
    # the Mongo-generated '_id' field (which the schema does not define).
    try:
        nmdcdb = NMDCDatabase(**{coll_name: [dissoc(mdb[coll_name].find_one(), '_id')]})
    except ValueError as e:
        print(f"no {coll_name}!")
        raise e

    # Calculate class_hierarchy_as_list once per collection.
    #
    # Note: This seems to assume that the class hierarchy is identical for each document
    #       in a given collection, which may not be the case since a collection whose
    #       range is a "parent" class can store instances of descendant classes (and the
    #       class hierarchy of the latter would differ from that of the former).
    #
    exemplar = getattr(nmdcdb, coll_name)[0]  # get first instance (i.e. document) in list
    newdoc_type: list[str] = class_hierarchy_as_list(exemplar)
    
    # For each document in this collection, replace the value of the `type` field with
    # a _list_ of the document's own class and ancestor classes, remove the `_id` field,
    # and insert the resulting document into the `alldocs` collection. Note that we are not
    # relying on the original value of the `type` field, since it's unreliable (see below).
    
    # NOTE: `type` is currently a string, does not exist for all classes, and can have typos.
    # These issues are fixed in the Berkeley schema, but that schema is risky to use at this time.

    # TODO: Consider omitting fields that neither (a) are the `id` field, nor (b) have the potential
    #       to reference a document. Those fields aren't related to referential integrity.
    
    mdb.alldocs.insert_many([assoc(dissoc(doc, 'type', '_id'), 'type', newdoc_type) for doc in mdb[coll_name].find()])
    pbar.update(mdb[coll_name].estimated_document_count())

pbar.close()

# Prior to re-ID-ing, some IDs are not unique across Mongo collections (e.g. nmdc:0078a0f981ad3f92693c2bc3b6470791),
# so this index is not declared unique. Re-create the `id` index on the `alldocs` collection.
mdb.alldocs.create_index("id")
print("refreshed `alldocs` collection")

refreshed `alldocs` collection


The resulting `alldocs` collection contains a copy of every document from every Mongo collection identified earlier. Each copy is the same as the original document, except that it has a fresh `_id` and its `type` field contains a list of the names of its own class and all of its ancestor classes (whereas the original document's `type` field contains an unreliable string, if it exists at all).
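The per-document transformation can be sketched without `toolz` or Mongo as follows. The helper name `retype` and the typo in the sample `type` string are hypothetical; the `id` and the class list are taken from the example document shown at the end of this notebook.

```python
def retype(doc: dict, class_names: list[str]) -> dict:
    """Copy `doc` without `_id`, replacing `type` with a list of class names."""
    copy = {k: v for k, v in doc.items() if k not in ("_id", "type")}
    copy["type"] = list(class_names)
    return copy


# The misspelled `type` value is made up, to mimic an unreliable original value.
original = {"_id": 1, "id": "emsl:570856", "type": "nmdc:OmicsProcesing"}
print(retype(original, ["OmicsProcessing", "PlannedProcess", "NamedThing"]))
# {'id': 'emsl:570856', 'type': ['OmicsProcessing', 'PlannedProcess', 'NamedThing']}
```

Storing `type` as a list is what allows the later type check to work: in MongoDB, a filter such as `{"type": "PlannedProcess"}` matches a document whose `type` array contains that value.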

## Validate

Collect the "top-level" classes, i.e. the classes that are the range of a slot on `nmdc:Database`, together with their ancestor classes.

Reference: https://linkml.io/linkml/developers/schemaview.html#linkml_runtime.utils.schemaview.SchemaView.class_ancestors

In [14]:
nmdc_view = nmdc_schema_view()
toplevel_classes = set()
for name in nmdc_database_collection_instance_class_names():
    # TODO: Document why class _ancestors_ are being included here.
    #       A (hypothetical) collection whose range is "Chihuahua" wouldn't
    #       be allowed to store non-"Chihuahua" instances of "Dog" or "Animal".
    #
    # Note: `a |= b` is same as `a = a | b` (union two sets and store the result).
    #
    toplevel_classes |= set(nmdc_view.class_ancestors(name))

toplevel_classes

{'Activity',
 'Biosample',
 'BiosampleProcessing',
 'CollectingBiosamplesFromSite',
 'DataObject',
 'Extraction',
 'FieldResearchSite',
 'FunctionalAnnotation',
 'FunctionalAnnotationAggMember',
 'GenomeFeature',
 'LibraryPreparation',
 'MagsAnalysisActivity',
 'MaterialEntity',
 'MetabolomicsAnalysisActivity',
 'MetagenomeAnnotationActivity',
 'MetagenomeAssembly',
 'MetagenomeSequencingActivity',
 'MetaproteomicsAnalysisActivity',
 'MetatranscriptomeActivity',
 'NamedThing',
 'NomAnalysisActivity',
 'OmicsProcessing',
 'PlannedProcess',
 'Pooling',
 'ProcessedSample',
 'ReadBasedTaxonomyAnalysisActivity',
 'ReadQcAnalysisActivity',
 'Site',
 'Study',
 'WorkflowExecutionActivity'}
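The accumulation pattern used above (union the ancestors of each class into one set) can be sketched with a toy parent map. The map below is a small excerpt assumed for illustration; only the `OmicsProcessing` chain is confirmed by output elsewhere in this notebook. The sketch also assumes that a class counts as its own ancestor, which is what the output above suggests.

```python
# child -> parent; an assumed excerpt, not the full schema.
PARENT = {
    "OmicsProcessing": "PlannedProcess",
    "PlannedProcess": "NamedThing",
    "Biosample": "MaterialEntity",
    "MaterialEntity": "NamedThing",
    "NamedThing": None,
}


def ancestors(name: str) -> list[str]:
    """Return `name` followed by its ancestors, nearest first."""
    chain = []
    while name is not None:
        chain.append(name)
        name = PARENT[name]
    return chain


toplevel = set()
for cls in ["OmicsProcessing", "Biosample"]:
    toplevel |= set(ancestors(cls))  # union in place, as in the cell above

print(sorted(toplevel))
# ['Biosample', 'MaterialEntity', 'NamedThing', 'OmicsProcessing', 'PlannedProcess']
```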

### Check referential integrity

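The cell below populates two lists in the `errors` dictionary:

- `errors["not_found"]`: "naive" errors, where no document with the referenced `id` exists in `alldocs`
- `errors["invalid_type"]`: hierarchy-aware type errors, where the referenced document was found but is not an instance of the slot's range class (or of one of its descendants)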

Reference: https://linkml.io/linkml/developers/schemaview.html#linkml_runtime.utils.schemaview.SchemaView.class_induced_slots
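The two checks can be sketched on in-memory data as follows. This is a simplified illustration, not the notebook's implementation: the function `check_references`, the `slot_ranges` mapping, and most of the sample documents are hypothetical (the `part_of` range and the `DataObject` and `MagsAnalysisActivity` class lists are assumed), and a plain dictionary lookup stands in for the queries against `alldocs`.

```python
def check_references(docs: list[dict], slot_ranges: dict[str, str]) -> dict[str, list]:
    """Classify each reference as fine, not found, or found with an invalid type."""
    by_id = {d["id"]: d for d in docs}
    errors = {"not_found": [], "invalid_type": []}
    for doc in docs:
        for field, value in doc.items():
            expected = slot_ranges.get(field)
            if expected is None:
                continue  # this field does not reference another document
            for ref in value if isinstance(value, list) else [value]:
                target = by_id.get(ref)
                if target is None:
                    errors["not_found"].append((doc["id"], field, ref))
                elif expected not in target["type"]:
                    errors["invalid_type"].append((doc["id"], field, ref, expected))
    return errors


docs = [
    {"id": "emsl:570856", "type": ["OmicsProcessing", "PlannedProcess", "NamedThing"]},
    {"id": "emsl:output_570856", "type": ["DataObject", "NamedThing"],
     "was_generated_by": "emsl:570856"},
    {"id": "nmdc:example-mags", "type": ["MagsAnalysisActivity", "Activity"],
     "part_of": ["nmdc:mga0vx38"]},
]
slot_ranges = {"was_generated_by": "Activity", "part_of": "NamedThing"}  # assumed ranges

result = check_references(docs, slot_ranges)
print(result["not_found"])
print(result["invalid_type"])
```

Because each document's `type` is a list that includes ancestor classes, the membership test accepts instances of descendant classes of the slot's range.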

In [None]:
# Initialize error lists.
errors = {"not_found": [], "invalid_type": []}

# Initialize progress bar.
#
# TODO: Explain why the author has opted to count the documents in the original collections,
#       even though the `alldocs` collection exists now.
#
n_docs_total = sum(mdb[name].estimated_document_count() for name in collection_names)
pbar = tqdm(total=n_docs_total)

# Iterate over each collection.
for name in sorted(collection_names):
    # Note: We already confirmed (in a different cell of this notebook)
    #       that each `class_names` list has exactly one item.
    cls_name = collection_name_to_class_names[name][0]
    # Make a dictionary of slot names to slot definitions. The set of slots here is (to quote the
    # LinkML SchemaView documentation) "all slots that are asserted or inferred for [the] class,
    # with their inferred semantics."
    slot_map = {
        slot.name: slot
        for slot in nmdc_view.class_induced_slots(cls_name)
    }
    pbar.set_description(f"processing {name}...")
    for doc in mdb[name].find():
        doc = dissoc(doc, "_id")
        for field, value in doc.items():
            assert field in slot_map, f"{name} doc {doc['id']}: field {field} not a valid slot"
            slot_range = str(slot_map[field].range)
            assert slot_range, type(slot_range)
            if slot_range not in toplevel_classes:
                continue
            if not isinstance(value, list):
                value = [value]
            for v in value:
                if mdb.alldocs.find_one({"id": v}, ["_id"]) is None:
                    errors["not_found"].append(f"{name} doc {doc['id']}: field {field} referenced doc {v} not found")
                elif mdb.alldocs.find_one({"id": v, "type": slot_range}, ["_id"]) is None:
                    errors["invalid_type"].append(f"{name} doc {doc['id']}: field {field} referenced doc {v} not of type {slot_range}")
        pbar.update(1)
pbar.close()

## Results

In [15]:
len(errors["not_found"]), len(errors["invalid_type"])

(4857, 23503)

In [16]:
errors["not_found"][:5]

['mags_activity_set doc nmdc:fdefb3fa15098906cf788f5cadf17bb3: field part_of referenced doc nmdc:mga0vx38 not found',
 'mags_activity_set doc nmdc:78f8bf24916f01d053378b1bd464cd8a: field has_input referenced doc nmdc:9003278a200d1e7921e978d4c59233c3 not found',
 'mags_activity_set doc nmdc:a57ecfc4dee4e6938a5517ad0961dcd8: field part_of referenced doc nmdc:mga08x19 not found',
 'mags_activity_set doc nmdc:3e0d8aae3b16d5bba2b3faec04391929: field part_of referenced doc nmdc:mga06z11 not found',
 'mags_activity_set doc nmdc:4417090e8ce0e96ff2867b85823d4b26: field part_of referenced doc nmdc:mga07m45 not found']

In [17]:
mdb.alldocs.find_one({"id": "nmdc:mga0vx38"}) is None

True

In [18]:
errors["invalid_type"][:5]

['data_object_set doc emsl:output_570856: field was_generated_by referenced doc emsl:570856 not of type Activity',
 'data_object_set doc emsl:output_570991: field was_generated_by referenced doc emsl:570991 not of type Activity',
 'data_object_set doc emsl:output_570998: field was_generated_by referenced doc emsl:570998 not of type Activity',
 'data_object_set doc emsl:output_570855: field was_generated_by referenced doc emsl:570855 not of type Activity',
 'data_object_set doc emsl:output_570823: field was_generated_by referenced doc emsl:570823 not of type Activity']
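With thousands of messages in each list, it may help to group them by collection and field before inspecting individual cases. The sketch below parses the message format used above with a regular expression; the helper `summarize` is hypothetical, and the sample messages are copied from the outputs shown in this section.

```python
import re
from collections import Counter

# Matches messages of the form "<collection> doc <id>: field <field> referenced doc <ref> not ..."
PATTERN = re.compile(
    r"^(?P<collection>\S+) doc (?P<doc_id>\S+): field (?P<field>\S+) referenced doc (?P<ref>\S+) not "
)


def summarize(messages: list[str]) -> Counter:
    """Count error messages per (collection, field) pair; unparseable messages are skipped."""
    counts = Counter()
    for msg in messages:
        m = PATTERN.match(msg)
        if m:
            counts[(m["collection"], m["field"])] += 1
    return counts


sample = [
    "mags_activity_set doc nmdc:fdefb3fa15098906cf788f5cadf17bb3: field part_of referenced doc nmdc:mga0vx38 not found",
    "mags_activity_set doc nmdc:78f8bf24916f01d053378b1bd464cd8a: field has_input referenced doc nmdc:9003278a200d1e7921e978d4c59233c3 not found",
    "mags_activity_set doc nmdc:a57ecfc4dee4e6938a5517ad0961dcd8: field part_of referenced doc nmdc:mga08x19 not found",
    "data_object_set doc emsl:output_570856: field was_generated_by referenced doc emsl:570856 not of type Activity",
]
for key, n in summarize(sample).most_common():
    print(key, n)
```

In the notebook itself, the same function could be applied to `errors["not_found"]` and `errors["invalid_type"]` separately.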

In [19]:
# OmicsProcessing is not a subclass of Activity (!)
mdb.alldocs.find_one({"id": "emsl:570856"})

{'_id': ObjectId('663fbef9ba64633177320f59'),
 'id': 'emsl:570856',
 'name': 'Rachael_21T_04-15A_M_14Mar17_leopard_Infuse',
 'instrument_name': '21T Agilent',
 'has_input': ['emsl:2f71038a-5dd1-11ec-bf63-0242ac130002'],
 'has_output': ['emsl:output_570856'],
 'omics_type': {'has_raw_value': 'Organic Matter Characterization'},
 'part_of': ['gold:Gs0110138'],
 'description': 'High resolution MS spectra only',
 'processing_institution': 'EMSL',
 'gold_sequencing_project_identifiers': [],
 'type': ['OmicsProcessing', 'PlannedProcess', 'NamedThing']}