# Migrate MongoDB database from `nmdc-schema` `v11.3.0` to `v11.4.0`

## Introduction

This notebook will be used to migrate the database from `nmdc-schema` `v11.3.0` ([released](https://github.com/microbiomedata/nmdc-schema/releases/tag/v11.3.0) January 17, 2025) to `v11.4.0` ([released](https://github.com/microbiomedata/nmdc-schema/releases/tag/v11.4.0) February 12, 2025).

### Notice

In each migration notebook between schema `v10.9.1` and `v11.3.0`, we dumped **all collections** from the Mongo database. We started doing that once migrations involved collection-level operations (i.e., creating, renaming, and deleting them), as opposed to only document-level operations.

In _this_ migration notebook (from schema `v11.3.0` to `v11.4.0`), we dump only **one collection** from the Mongo database. We opted to do this after understanding the scope of the `Migrator` class ([here](https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migrators/migrator_from_11_3_0_to_11_4_0.py)) imported by this notebook. This eliminates some overhead from the migration process.

## Prerequisites

### 1. Coordinate with stakeholders.

We will be enacting full Runtime and Database downtime for this migration. Ensure stakeholders are aware of that.

### 2. Set up notebook environment.

Here, you'll prepare an environment for running this notebook.

1. Start a **MongoDB server** on your local machine (and ensure it does **not** already contain a database having the name specified in the notebook configuration file).
    1. You can start a [Docker](https://hub.docker.com/_/mongo)-based MongoDB server at `localhost:27055` by running the following command. A MongoDB server started this way will be accessible without a username or password.


In [None]:
!docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo:8.0.4

2. Create and populate a **notebook configuration file** named `.notebook.env`.
   > You can use `.notebook.env.example` as a template.
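For readers unfamiliar with `.env`-style files, the following is a minimal sketch of how such a file can be parsed into key/value pairs. The key names below are hypothetical placeholders, and `parse_env_text` is an illustrative function, not the notebook's `Config` class; the real keys are the ones listed in `.notebook.env.example`.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and `#` comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical contents (these key names are assumptions made for illustration).
example_text = """
# Transformer server
EXAMPLE_TRANSFORMER_HOST = "localhost"
EXAMPLE_TRANSFORMER_PORT = "27055"
"""
parsed = parse_env_text(example_text)
print(parsed)
```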

## Procedure

### Install Python packages

In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends.

> Note: If the output of this cell says "Note: you may need to restart the kernel to use updated packages", restart the kernel (not the notebook cells), then proceed to the next cell.

#### References

| Description                                               | Link                                 |
|-----------------------------------------------------------|--------------------------------------|
| NMDC Schema PyPI package                                  | https://pypi.org/project/nmdc-schema |
| How to `pip install` from a Git branch<br>instead of PyPI | https://stackoverflow.com/a/20101940 |

In [None]:
%pip install --upgrade pip
%pip install -r requirements.txt
%pip install nmdc-schema==11.4.0

### Import Python dependencies

Import the Python objects upon which this notebook depends.

#### References

| Description                            | Link                                                                                                  |
|----------------------------------------|-------------------------------------------------------------------------------------------------------|
| Dynamically importing a Python module  | [`importlib.import_module`](https://docs.python.org/3/library/importlib.html#importlib.import_module) |
| Confirming something is a Python class | [`inspect.isclass`](https://docs.python.org/3/library/inspect.html#inspect.isclass)                   |

In [None]:
MIGRATOR_MODULE_NAME = "migrator_from_11_3_0_to_11_4_0"

In [None]:
# Standard library packages:
import subprocess
from typing import List
import importlib
from inspect import isclass

# Third-party packages:
import pymongo
from linkml.validator import Validator, ValidationReport
from linkml.validator.plugins import JsonschemaValidationPlugin
from nmdc_schema.nmdc_data import get_nmdc_schema_definition
from nmdc_schema.migrators.adapters.mongo_adapter import MongoAdapter
from linkml_runtime import SchemaView

# First-party packages:
from helpers import Config, setup_logger, get_collection_names_from_schema, derive_schema_class_name_from_document
from bookkeeper import Bookkeeper, MigrationEvent

# Dynamic imports:
migrator_module = importlib.import_module(f".{MIGRATOR_MODULE_NAME}", package="nmdc_schema.migrators")
Migrator = getattr(migrator_module, "Migrator")  # gets the class
assert isclass(Migrator), "Failed to import Migrator class."
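
The dynamic import pattern used above can be tried in isolation. The sketch below applies the same `importlib.import_module` and `inspect.isclass` calls to a standard library package (`json`), which stands in for `nmdc_schema.migrators` so the example runs without the `nmdc-schema` package installed.

```python
import importlib
from inspect import isclass

# Stand-in names: the notebook imports `Migrator` from a module in `nmdc_schema.migrators`;
# here, we import `JSONDecoder` from the `decoder` module in the `json` package.
MODULE_NAME = "decoder"

# The leading dot makes the import relative to the package given by `package`.
module = importlib.import_module(f".{MODULE_NAME}", package="json")
cls = getattr(module, "JSONDecoder")  # gets the class
assert isclass(cls), "Failed to import class."
print(module.__name__, cls.__name__)
```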

### Parse configuration files

Parse the notebook and Mongo configuration files.

In [None]:
cfg = Config()

# Define some aliases we can use to make the shell commands in this notebook easier to read.
mongodump = cfg.mongodump_path
mongorestore = cfg.mongorestore_path
mongosh = cfg.mongosh_path

# Make the base CLI options for Mongo shell commands.
origin_mongo_cli_base_options = Config.make_mongo_cli_base_options(
    mongo_host=cfg.origin_mongo_host,
    mongo_port=cfg.origin_mongo_port,
    mongo_username=cfg.origin_mongo_username,
    mongo_password=cfg.origin_mongo_password,
)
transformer_mongo_cli_base_options = Config.make_mongo_cli_base_options(
    mongo_host=cfg.transformer_mongo_host,
    mongo_port=cfg.transformer_mongo_port,
    mongo_username=cfg.transformer_mongo_username,
    mongo_password=cfg.transformer_mongo_password,
)

# Perform a sanity test of the application paths.
!{mongodump} --version
!{mongorestore} --version
!{mongosh} --version
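
The implementation of `Config.make_mongo_cli_base_options` is not shown in this notebook. As a rough illustration of what such a helper might produce, here is a hypothetical stand-in that builds a string of connection options shared by `mongodump`, `mongorestore`, and `mongosh`. The function name, its behavior when credentials are omitted, and the `admin` authentication database are all assumptions.

```python
import shlex
from typing import Optional


def make_cli_base_options(host: str, port: str,
                          username: Optional[str] = None,
                          password: Optional[str] = None) -> str:
    """Hypothetical helper; NOT the notebook's `Config.make_mongo_cli_base_options`."""
    options = ["--host", host, "--port", str(port)]
    if username and password:
        # Assumption: credentials are verified against the `admin` database.
        options += ["--username", username, "--password", password,
                    "--authenticationDatabase", "admin"]
    # Quote each option so that special characters survive the shell.
    return " ".join(shlex.quote(option) for option in options)


print(make_cli_base_options("localhost", "27055"))
```

Quoting each value with `shlex.quote` matters because the notebook interpolates these options into shell command strings, where a password containing spaces or shell metacharacters would otherwise be misinterpreted.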

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" and "transformer" MongoDB servers.

In [None]:
# Mongo client for "origin" MongoDB server.
origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_host,
                                          port=int(cfg.origin_mongo_port),
                                          username=cfg.origin_mongo_username,
                                          password=cfg.origin_mongo_password,
                                          directConnection=True)

# Mongo client for "transformer" MongoDB server.
transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_host,
                                               port=int(cfg.transformer_mongo_port),
                                               username=cfg.transformer_mongo_username,
                                               password=cfg.transformer_mongo_password,
                                               directConnection=True)

# Perform sanity tests of those MongoDB clients' abilities to access their respective MongoDB servers.
with pymongo.timeout(3):
    # Display the MongoDB server version (running on the "origin" Mongo server).
    print("Origin Mongo server version:      " + origin_mongo_client.server_info()["version"])

    # Sanity test: Ensure the origin database exists.
    assert cfg.origin_mongo_database_name in origin_mongo_client.list_database_names(), "Origin database does not exist."

    # Display the MongoDB server version (running on the "transformer" Mongo server).
    print("Transformer Mongo server version: " + transformer_mongo_client.server_info()["version"])

    # Sanity test: Ensure the transformation database does not exist.
    assert cfg.transformer_mongo_database_name not in transformer_mongo_client.list_database_names(), "Transformation database already exists."

Delete the transformer database from the transformer MongoDB server if it already exists there (e.g., if it was left over from an earlier experiment).

#### References

| Description                  | Link                                                          |
|------------------------------|---------------------------------------------------------------|
| Python's `subprocess` module | https://docs.python.org/3/library/subprocess.html             |
| `mongosh` CLI options        | https://www.mongodb.com/docs/mongodb-shell/reference/options/ |

In [None]:
# Note: I run this command via Python's `subprocess` module instead of via an IPython magic `!` command
#       because I expect to eventually use regular Python scripts—not Python notebooks—for migrations.
shell_command = f"""
  {cfg.mongosh_path} {transformer_mongo_cli_base_options} \
      --eval 'use {cfg.transformer_mongo_database_name}' \
      --eval 'db.dropDatabase()' \
      --quiet
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")
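
The cell above prints the return code but does not act on it, so a failed command would not stop you from running later cells. One possible way to make failures harder to miss is sketched below. The `run_shell_command` helper is hypothetical (it is not part of this notebook's `helpers` module), and the command it runs here is a stand-in that invokes the current Python interpreter, so the example works without MongoDB.

```python
import subprocess
import sys


def run_shell_command(shell_command: str) -> subprocess.CompletedProcess:
    """Run a shell command and raise if it exits with a non-zero return code."""
    completed_process = subprocess.run(shell_command, shell=True)
    print(f"\nReturn code: {completed_process.returncode}")
    if completed_process.returncode != 0:
        raise RuntimeError(f"Command failed with return code {completed_process.returncode}")
    return completed_process


# Stand-in command so the example runs without any MongoDB tools installed.
result = run_shell_command(f'"{sys.executable}" -c "print(1)"')
```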

### Create validator

In this step, you'll create a validator that can be used to check whether data conforms to the NMDC Schema. You'll use it later in this notebook to validate the transformed documents.

#### References

| Description                  | Link                                                                         |
|------------------------------|------------------------------------------------------------------------------|
| LinkML's `Validator` class   | https://linkml.io/linkml/code/validator.html#linkml.validator.Validator      |
| Validating data using LinkML | https://linkml.io/linkml/data/validating-data.html#validation-in-python-code |

In [None]:
schema_definition = get_nmdc_schema_definition()
validator = Validator(
    schema=schema_definition,
    validation_plugins=[JsonschemaValidationPlugin(closed=True)],
)

# Perform a sanity test of the validator.
assert callable(validator.validate), "Failed to instantiate a validator"

### Create SchemaView

In this step, you'll instantiate a `SchemaView` that is bound to the destination schema.

#### References

| Description                 | Link                                                |
|-----------------------------|-----------------------------------------------------|
| LinkML's `SchemaView` class | https://linkml.io/linkml/developers/schemaview.html |

In [None]:
schema_view = SchemaView(get_nmdc_schema_definition())

# As a sanity test, confirm we can use the `SchemaView` instance to access a schema class.
schema_view.get_class(class_name="Database")["name"]

### Revoke access from the "origin" MongoDB server

We revoke both "write" and "read" access to the server.

#### Rationale

We revoke "write" access so people don't make changes to the original data while the migration is happening, given that the migration ends with an overwriting of the original data (which would wipe out any changes made in the meantime).

We also revoke "read" access. The revocation of "read" access is technically optional, but (a) the JavaScript mongosh script will be easier for me to maintain if it revokes everything and (b) this prevents people from reading data during the restore step, during which the database may not be self-consistent.

#### References

| Description                    | Link                                                      |
|--------------------------------|-----------------------------------------------------------|
| Running a script via `mongosh` | https://www.mongodb.com/docs/mongodb-shell/write-scripts/ |

In [None]:
shell_command = f"""
  {cfg.mongosh_path} {origin_mongo_cli_base_options} \
      --file='mongosh-scripts/revoke-privileges.mongo.js' \
      --quiet
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")

### Delete obsolete dumps from previous migrations

Delete any existing dumps before we create new ones in this notebook. This is so the dumps you generate with this notebook do not get merged with any unrelated ones.

In [None]:
!rm -rf {cfg.origin_dump_folder_path}
!rm -rf {cfg.transformer_dump_folder_path}
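
If this step is ever ported from IPython `!` commands to a regular Python script, `shutil.rmtree` offers similar behavior to `rm -rf`. The following sketch demonstrates it on a throwaway temporary folder rather than a real dump folder; `delete_folder` is a hypothetical name.

```python
import shutil
import tempfile
from pathlib import Path


def delete_folder(folder_path: str) -> None:
    """Delete a folder and its contents; do nothing if it does not exist."""
    # Note: `ignore_errors=True` also suppresses other errors (e.g. permission errors).
    shutil.rmtree(folder_path, ignore_errors=True)


# Demonstrate with a throwaway folder.
demo_folder = Path(tempfile.mkdtemp()) / "dump"
(demo_folder / "nmdc").mkdir(parents=True)
delete_folder(str(demo_folder))
delete_folder(str(demo_folder))  # the second call is a no-op, like `rm -rf`
print(demo_folder.exists())
```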

### Dump collection(s) from the "origin" MongoDB server

Use `mongodump` to dump specific collection(s) **from** the "origin" MongoDB server **into** a local directory.


In [None]:
# Dump the specified collection from the "origin" database.
shell_command = f"""
  {mongodump} {origin_mongo_cli_base_options} \
      --db='{cfg.origin_mongo_database_name}' \
      --out='{cfg.origin_dump_folder_path}' \
      --gzip \
      --collection='workflow_execution_set'
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")
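
After the dump finishes, it can be useful to know which files to expect. To my understanding, `mongodump` writes one subfolder per database inside the `--out` folder and, when `--gzip` is used, a compressed data file and a compressed metadata file per collection. The sketch below computes those expected paths; the folder name and database name are hypothetical, and the layout should be confirmed against the `mongodump` documentation for your version.

```python
from pathlib import Path
from typing import List


def expected_dump_files(dump_folder: str, database_name: str, collection_name: str) -> List[Path]:
    """Return the file paths we expect `mongodump --gzip` to create for one collection."""
    folder = Path(dump_folder) / database_name
    return [
        folder / f"{collection_name}.bson.gz",           # the documents
        folder / f"{collection_name}.metadata.json.gz",  # options and index definitions
    ]


# Hypothetical folder and database names.
files = expected_dump_files("dump_origin", "nmdc", "workflow_execution_set")
for file in files:
    print(file)
```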

### Load the dumped collection(s) into the "transformer" MongoDB server

Use `mongorestore` to load the dumped collection(s) **from** the local directory **into** the "transformer" MongoDB server.

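#### References

| Description                                    | Link                                                                                                  |
|------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| `mongorestore` CLI options                     | https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore                     |
| Copying/cloning a database with `mongorestore` | https://www.mongodb.com/docs/database-tools/mongorestore/mongorestore-examples/#copy-clone-a-database |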

In [None]:
# Restore the dumped collections to the "transformer" MongoDB server.
shell_command = f"""
  {mongorestore} {transformer_mongo_cli_base_options} \
      --nsFrom='{cfg.origin_mongo_database_name}.*' \
      --nsTo='{cfg.transformer_mongo_database_name}.*' \
      --dir='{cfg.origin_dump_folder_path}' \
      --stopOnError \
      --drop \
      --gzip
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")
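
The `--nsFrom` and `--nsTo` options rename namespaces (i.e. `database.collection` names) as they are restored, which is how the collection dumped from the "origin" database ends up in the differently named "transformer" database. The sketch below is a simplified emulation of that renaming for patterns that end in a single `*`; it is not how `mongorestore` is implemented, and the database names are hypothetical.

```python
def rename_namespace(namespace: str, ns_from: str, ns_to: str) -> str:
    """Simplified emulation of one `--nsFrom`/`--nsTo` pair with a trailing `*` wildcard."""
    if not (ns_from.endswith("*") and ns_to.endswith("*")):
        raise ValueError("This sketch handles only patterns ending in '*'.")
    from_prefix, to_prefix = ns_from[:-1], ns_to[:-1]
    if not namespace.startswith(from_prefix):
        return namespace  # the pattern does not match, so the namespace is left unchanged
    # Keep whatever the wildcard matched and swap the prefix.
    return to_prefix + namespace[len(from_prefix):]


# Hypothetical database names.
print(rename_namespace("nmdc.workflow_execution_set", "nmdc.*", "transformer.*"))
```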

### Transform the collections within the "transformer" MongoDB server

Use the migrator to transform the collections in the "transformer" database.

> Reminder: The database transformation functions are defined in the `nmdc-schema` Python package installed earlier.

> Reminder: The "origin" database is **not** affected by this step.

In [None]:
# Instantiate a MongoAdapter bound to the "transformer" database.
adapter = MongoAdapter(
    database=transformer_mongo_client[cfg.transformer_mongo_database_name],
    on_collection_created=lambda name: print(f'Created collection "{name}"'),
    on_collection_renamed=lambda old_name, name: print(f'Renamed collection "{old_name}" to "{name}"'),
    on_collection_deleted=lambda name: print(f'Deleted collection "{name}"'),
)

# Instantiate a Migrator bound to that adapter.
logger = setup_logger()
migrator = Migrator(adapter=adapter, logger=logger)

# Execute the Migrator's `upgrade` method to perform the migration.
migrator.upgrade()

### Validate the transformed documents

Now that we have transformed the database, validate each document in each collection in the "transformer" MongoDB server.

In [None]:
# Get the names of all collections.
collection_names: List[str] = get_collection_names_from_schema(schema_view)

# Ensure that, if the (large) "functional_annotation_agg" collection is present in `collection_names`,
# it goes at the end of the list we process. That way, we can find out about validation errors in
# other collections without having to wait for that (large) collection to be validated.
ordered_collection_names = sorted(collection_names)  # `sorted` returns a new list
large_collection_name = "functional_annotation_agg"
if large_collection_name in ordered_collection_names:
    ordered_collection_names = list(filter(lambda n: n != large_collection_name, ordered_collection_names))
    ordered_collection_names.append(large_collection_name)  # puts it last

# Process each collection.
for collection_name in ordered_collection_names:
    collection = transformer_mongo_client[cfg.transformer_mongo_database_name][collection_name]
    num_documents_in_collection = collection.count_documents({})
    print(f"Validating collection {collection_name} ({num_documents_in_collection} documents) [", end="")  # no newline

    # Calculate how often we'll display a tick mark (i.e. a sign of life).
    num_documents_per_tick = num_documents_in_collection * 0.10  # one tenth of the total
    num_documents_since_last_tick = 0

    for document in collection.find():
        # Validate the transformed document.
        #
        # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md
        #
        # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,
        #       the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries
        #       containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a
        #       copy (i.e. a shallow copy) of the document that lacks that specific key.
        #
        # Note: The reason we don't use a progress bar library such as `rich[jupyter]`, `tqdm`, or `ipywidgets`
        #       is that _PyCharm's_ Jupyter Notebook integration doesn't fully work with any of them. :(
        #
        schema_class_name = derive_schema_class_name_from_document(schema_view=schema_view, document=document)
        document_without_underscore_id_key = {key: value for key, value in document.items() if key != "_id"}
        validation_report: ValidationReport = validator.validate(document_without_underscore_id_key, schema_class_name)
        if len(validation_report.results) > 0:
            result_messages = [result.message for result in validation_report.results]
            raise TypeError(f"Document is invalid.\n{result_messages=}\n{document_without_underscore_id_key=}")

        # Display a tick mark if we have validated enough documents since we last displayed one.
        num_documents_since_last_tick += 1
        if num_documents_since_last_tick >= num_documents_per_tick:
            num_documents_since_last_tick = 0
            print(".", end="")  # no newline

    print("]")
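
Two small pieces of the cell above can be exercised without a database: moving the large collection to the end of the processing order, and removing the `_id` key before validation. The sketch below isolates them as functions. The function names are hypothetical and the example document is made up for illustration.

```python
from typing import Any, Dict, List


def put_last(names: List[str], name_to_put_last: str) -> List[str]:
    """Sort the names alphabetically, then move the specified name (if present) to the end."""
    ordered = sorted(names)
    if name_to_put_last in ordered:
        ordered = [name for name in ordered if name != name_to_put_last]
        ordered.append(name_to_put_last)
    return ordered


def without_id(document: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of the document that lacks the Mongo-generated `_id` key."""
    return {key: value for key, value in document.items() if key != "_id"}


names = put_last(["study_set", "functional_annotation_agg", "biosample_set"],
                 "functional_annotation_agg")
print(names)

# Made-up document; real documents come from the "transformer" database.
document = {"_id": 123, "id": "nmdc:example-1", "type": "nmdc:Example"}
print(without_id(document))
```

Because `without_id` builds a new dictionary, the original document (including its `_id`) is left untouched.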

### Dump the collections from the "transformer" MongoDB server

Now that the collections have been transformed and validated, dump them **from** the "transformer" MongoDB server **into** a local directory.

In [None]:
# Dump the database from the "transformer" MongoDB server.
shell_command = f"""
  {mongodump} {transformer_mongo_cli_base_options} \
      --db='{cfg.transformer_mongo_database_name}' \
      --out='{cfg.transformer_dump_folder_path}' \
      --gzip
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")

### Create a bookkeeper

Create a `Bookkeeper` that can be used to document migration events in the "origin" server.

In [None]:
bookkeeper = Bookkeeper(mongo_client=origin_mongo_client)

### Indicate — on the "origin" server — that the migration is underway

Add an entry to the migration log collection to indicate that this migration has started.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_STARTED)

### Skipped: Drop the original collections from the "origin" MongoDB server

Note: This step is necessary for migrations where collections are being renamed or deleted. (The `--drop` option of `mongorestore` would only drop collections that exist in the dump being restored, which would not include renamed or deleted collections.)

In the case of _this_ migration, no collections are being renamed or deleted, so we can skip this step. The `workflow_execution_set` collection, which the migrator _did_ transform, will still be dropped when we run `mongorestore` with the `--drop` option later in this notebook.


In [None]:
print("skipped")

# shell_command = f"""
#   {cfg.mongosh_path} {origin_mongo_cli_base_options} \
#       --eval 'use {cfg.origin_mongo_database_name}' \
#       --eval 'db.dropDatabase()'
# """
# completed_process = subprocess.run(shell_command, shell=True)
# print(f"\nReturn code: {completed_process.returncode}")
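
To see why this step matters for migrations that rename or delete collections, consider the toy model below. It treats a database as a dictionary of collections and mimics the effect described above: with `--drop`, each collection in the dump replaces its same-named counterpart, while collections absent from the dump are left alone. The collection names `old_name_set` and `new_name_set` are hypothetical.

```python
from typing import Dict


def restore_with_drop(target: Dict[str, str], dump: Dict[str, str]) -> Dict[str, str]:
    """Toy model: each dumped collection replaces its same-named target collection."""
    result = dict(target)
    for name, contents in dump.items():
        result[name] = contents  # dropped (if present), then restored from the dump
    return result


# Suppose a migration renamed `old_name_set` to `new_name_set`.
origin = {"old_name_set": "v1 docs", "workflow_execution_set": "v1 docs"}
dump = {"new_name_set": "v2 docs", "workflow_execution_set": "v2 docs"}
restored = restore_with_drop(origin, dump)
print(restored)
```

In this model, the stale `old_name_set` collection survives the restore, which is the situation an explicit drop of the original collections would prevent.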

### Load the collections into the "origin" MongoDB server

Load the transformed collections into the "origin" MongoDB server.

In [None]:
# Load the transformed collections into the origin server, replacing any same-named ones that are there.
shell_command = f"""
  {mongorestore} {origin_mongo_cli_base_options} \
      --nsFrom='{cfg.transformer_mongo_database_name}.*' \
      --nsTo='{cfg.origin_mongo_database_name}.*' \
      --dir='{cfg.transformer_dump_folder_path}' \
      --stopOnError \
      --verbose \
      --drop \
      --gzip
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")

### Indicate that the migration is complete

Add an entry to the migration log collection to indicate that this migration is complete.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_COMPLETED)

### Restore access to the "origin" MongoDB server

This effectively un-does the access revocation that we did earlier.

In [None]:
shell_command = f"""
  {cfg.mongosh_path} {origin_mongo_cli_base_options} \
      --file='mongosh-scripts/restore-privileges.mongo.js' \
      --quiet
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")