# Migrate MongoDB database from `nmdc-schema` `v9.3.2` to `v10.0.0`

## Prerequisites

### 1. Determine MongoDB collections involved.

In this step, you'll determine which MongoDB collections are involved in this migration.

1. In the [`nmdc-schema` repo](https://github.com/microbiomedata/nmdc-schema/tree/main/nmdc_schema/migrators), go to the `nmdc_schema/migrators` directory and open the Python module whose name reflects the initial and final schema versions.
2. In the `Migrator` class, note the collection names that are mentioned within the `upgrade` method.
3. Add their names to the `COLLECTION_NAMES` Python list below.
    - > TODO: Distinguish between collections being read vs. collections being transformed; and consider what happens when a collection is created, deleted, or renamed.

In [None]:
COLLECTION_NAMES: list[str] = [
    "extraction_set",
]

### 2. Coordinate with stakeholders.

In this step, you'll identify the people who read from or write to those collections, and reach out to them to agree on a migration schedule that works for you and them.

Here's a table of MongoDB collections and the NMDC system components that write to them (according to [a conversation that occurred on September 11, 2023](https://nmdc-group.slack.com/archives/C01SVTKM8GK/p1694465755802979?thread_ts=1694216327.234519&cid=C01SVTKM8GK)).

| Mongo collection                            | NMDC system components that write to it                  |
|---------------------------------------------|----------------------------------------------------------|
| `biosample_set`                             | Workflows (via manual entry via `nmdc-runtime` HTTP API) |
| `data_object_set`                           | Workflows (via `nmdc-runtime` HTTP API)                  |
| `mags_activity_set`                         | Workflows (via `nmdc-runtime` HTTP API)                  |
| `metagenome_annotation_activity_set`        | Workflows (via `nmdc-runtime` HTTP API)                  |
| `metagenome_assembly_set`                   | Workflows (via `nmdc-runtime` HTTP API)                  |
| `read_based_taxonomy_analysis_activity_set` | Workflows (via `nmdc-runtime` HTTP API)                  |
| `read_qc_analysis_activity_set`             | Workflows (via `nmdc-runtime` HTTP API)                  |
| `jobs`                                      | Scheduler (via MongoDB directly; e.g. `pymongo`)         |
| `*`                                         | `nmdc-runtime` (via MongoDB directly; e.g. `pymongo`)    |

You can use that table to help determine which people read/write to those collections. You can then coordinate a migration time slot with them via Slack, email, etc.
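As a rough aid for that lookup, here is a minimal sketch that transcribes the table into a Python dictionary and reports the known writers of each collection on the agenda. The names `COLLECTION_WRITERS` and `get_writers` are hypothetical (they are not part of `helpers` or `nmdc-schema`), and the dictionary is only as current as the table above.

```python
# Writers per collection, transcribed from the table above.
# The "*" entry applies to every collection.
COLLECTION_WRITERS: dict[str, list[str]] = {
    "biosample_set": ["Workflows"],
    "data_object_set": ["Workflows"],
    "mags_activity_set": ["Workflows"],
    "metagenome_annotation_activity_set": ["Workflows"],
    "metagenome_assembly_set": ["Workflows"],
    "read_based_taxonomy_analysis_activity_set": ["Workflows"],
    "read_qc_analysis_activity_set": ["Workflows"],
    "jobs": ["Scheduler"],
    "*": ["nmdc-runtime"],
}


def get_writers(collection_names: list[str]) -> dict[str, list[str]]:
    """Return the known writers of each collection, including the wildcard writers."""
    wildcard_writers = COLLECTION_WRITERS["*"]
    return {
        name: COLLECTION_WRITERS.get(name, []) + wildcard_writers
        for name in collection_names
    }


print(get_writers(["extraction_set", "jobs"]))
# → {'extraction_set': ['nmdc-runtime'], 'jobs': ['Scheduler', 'nmdc-runtime']}
```

A collection that is absent from the table (such as `extraction_set`) is reported with the wildcard writer only, so treat the result as a starting point for the conversation rather than a complete list.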

### 3. Set up environment.

In this step, you'll set up an environment in which you can run this notebook.

1. Start a **MongoDB server** on your local machine (and ensure it does **not** already contain a database named `nmdc`).
    1. You can start a [Docker](https://hub.docker.com/_/mongo)-based MongoDB server at `localhost:27055` by running this command:
       ```shell
       # Run in any directory:
       docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo:6.0.4
       ```
       > Note: A MongoDB server started via that command will have no access control (i.e. you will be able to access it without a username or password).
2. Create and populate a **notebook configuration file** named `.notebook.env`.
    1. You can use the `.notebook.env.example` file as a template:
       ```shell
       # Run in the same directory as this notebook:
       cp .notebook.env.example .notebook.env
       ```
3. Create and populate **MongoDB configuration files** for connecting to the origin (typically, remote) and transformer (typically, local) MongoDB servers.
    1. You can use the `.mongo.yaml.example` file as a template:
       ```shell
       # Run in the same directory as this notebook:
       cp .mongo.yaml.example .mongo.origin.yaml
       cp .mongo.yaml.example .mongo.transformer.yaml
       ```
       > When populating the file for the origin MongoDB server, use credentials that have write access to the `nmdc` database.
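Before moving on, you may want to confirm that the configuration files exist. The following is a minimal sketch, assuming the files sit in the current working directory under the names used in the `cp` commands above. The helper `find_missing_files` is hypothetical and is not part of this notebook's `helpers` module.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILE_NAMES = [".notebook.env", ".mongo.origin.yaml", ".mongo.transformer.yaml"]


def find_missing_files(directory: Path, file_names: list[str]) -> list[str]:
    """Return the names of the files that do not exist in the given directory."""
    return [name for name in file_names if not (directory / name).is_file()]


missing = find_missing_files(Path.cwd(), REQUIRED_FILE_NAMES)
if missing:
    print("Missing configuration files: " + ", ".join(missing))
else:
    print("All configuration files are present.")
```

This only checks that the files exist. It does not check that their contents are valid.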

## Procedure

### Install Python dependencies

In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends. You can do that by running this cell.

> Note: If the output of this cell says "Note: you may need to restart the kernel to use updated packages", restart the kernel (not the notebook) now.

In [None]:
%pip install -r requirements.txt
%pip install nmdc-schema==10.0.0

### Import Python dependencies

Import the Python objects upon which this notebook depends.

> Note: One of the Python objects is a Python class that is specific to this migration.

In [None]:
# Third-party packages:
import pymongo
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict
from nmdc_schema.migrators.adapters.mongo_adapter import MongoAdapter
from nmdc_schema.migrators.migrator_from_9_3_to_10_0 import Migrator
from jsonschema import Draft7Validator

# First-party packages:
from helpers import Config

### Parse configuration files

Parse the notebook and Mongo configuration files.

In [None]:
cfg = Config()

# Define some aliases we can use to make the shell commands in this notebook easier to read.
mongodump = cfg.mongodump_path
mongorestore = cfg.mongorestore_path

# Perform a sanity test of the application paths.
!{mongodump} --version
!{mongorestore} --version

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" MongoDB server (i.e. the one containing the database you want to migrate) and the "transformer" MongoDB server (i.e. the one you want to use to perform the data transformations).

In [None]:
# Mongo client for origin MongoDB server.
origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_server_uri, directConnection=True)

# Mongo client for transformer MongoDB server.
transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_server_uri)

# Perform sanity tests of those MongoDB clients' abilities to access their respective MongoDB servers.
with pymongo.timeout(3):
    # Display the MongoDB server version (running on the "origin" Mongo server).
    print("Origin Mongo server version:      " + origin_mongo_client.server_info()["version"])

    # Sanity test: Ensure the origin database exists.
    assert "nmdc" in origin_mongo_client.list_database_names(), "Origin database does not exist."

    # Display the MongoDB server version (running on the "transformer" Mongo server).
    print("Transformer Mongo server version: " + transformer_mongo_client.server_info()["version"])

    # Sanity test: Ensure the transformation database does not exist.
    assert "nmdc" not in transformer_mongo_client.list_database_names(), "Transformation database already exists."

### Create JSON Schema validator

In this step, you'll create a JSON Schema validator for the NMDC Schema.

In [None]:
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
nmdc_jsonschema_validator = Draft7Validator(nmdc_jsonschema)

# Perform sanity tests of the NMDC Schema dictionary and the JSON Schema validator.
# Reference: https://python-jsonschema.readthedocs.io/en/latest/api/jsonschema/protocols/#jsonschema.protocols.Validator.check_schema
print("NMDC Schema title:   " + nmdc_jsonschema["title"])
print("NMDC Schema version: " + nmdc_jsonschema["version"])

nmdc_jsonschema_validator.check_schema(nmdc_jsonschema)  # raises exception if schema is invalid

### Dump collections from the "origin" MongoDB server

In this step, you'll use `mongodump` to dump the collections that will be impacted by this migration from the "origin" MongoDB server.

Since `mongodump` doesn't provide a CLI option that you can use to specify the collections you _want_ it to dump (unless you want it to dump only one collection), you can use a different CLI option to tell it all the collections you do _not_ want it to dump.

The end result will be the same; there's just an extra step involved. That extra step is to generate an `--excludeCollection="{name}"` CLI option for each collection that you do _not_ want it to dump, and then pass all those CLI options to the `mongodump` command.
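The exclusion logic can be sketched as a standalone function that needs no database connection. This is an illustrative variant, not the notebook's own cell: the function name `build_exclusion_options` is hypothetical, and it uses `shlex.quote` from the standard library so that an unusual collection name would still be passed to the shell safely.

```python
import shlex


def build_exclusion_options(all_names: list[str], names_to_dump: list[str]) -> str:
    """Build one `--excludeCollection` option per collection that should not be dumped."""
    wanted = set(names_to_dump)
    excluded = [name for name in all_names if name not in wanted]
    return " ".join(f"--excludeCollection={shlex.quote(name)}" for name in excluded)


# Example list of collection names, for illustration only.
print(build_exclusion_options(["biosample_set", "extraction_set", "jobs"], ["extraction_set"]))
# → --excludeCollection=biosample_set --excludeCollection=jobs
```

If every collection is on the agenda, the function returns an empty string, and `mongodump` dumps the whole database.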

In [None]:
# Build a string containing zero or more `--excludeCollection="..."` options, 
# which can be included in a `mongodump` command.
all_collection_names: list[str] = origin_mongo_client["nmdc"].list_collection_names()
non_agenda_collection_names = [name for name in all_collection_names if name not in COLLECTION_NAMES]
exclusion_options = [f"--excludeCollection='{name}'" for name in non_agenda_collection_names]
exclusion_options_str = " ".join(exclusion_options)  # separates each option with a space
print(exclusion_options_str)

# Dump the not-excluded collections from the origin database.
!{mongodump} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.origin_dump_folder_path}" \
  {exclusion_options_str}

### Load the collections into the "transformer" MongoDB server

In this step, you'll load the collections dumped from the "origin" MongoDB server into the "transformer" MongoDB server.

Since it's possible that the dump includes more collections than are on the agenda (due to someone creating a collection between the time you generated the exclusion list and the time you ran `mongodump`), you will use one or more of `mongorestore`'s `--nsInclude` CLI options to indicate which collections you want to load.

So, we'll first generate the `--nsInclude="nmdc.{name}"` CLI options; then include them in the `mongorestore` command that follows.
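As an alternative illustration, the same command can be assembled as a list of arguments instead of a single shell string, which avoids shell quoting altogether. This is a sketch under stated assumptions: the function name `build_mongorestore_args` and the example paths are hypothetical, and the flags mirror the ones this notebook passes to `mongorestore`.

```python
def build_mongorestore_args(
    config_path: str,
    dump_dir: str,
    db_name: str,
    collection_names: list[str],
) -> list[str]:
    """Build a `mongorestore` argument list that restores only the named collections."""
    args = [
        "mongorestore",
        f"--config={config_path}",
        "--gzip",
        "--drop",
        "--preserveUUID",
        f"--dir={dump_dir}",
    ]
    # Add one `--nsInclude` option per collection on the agenda.
    args += [f"--nsInclude={db_name}.{name}" for name in collection_names]
    return args


# Hypothetical paths, for illustration only.
for arg in build_mongorestore_args(".mongo.transformer.yaml", "./dump", "nmdc", ["extraction_set"]):
    print(arg)
```

A list like this could be passed to `subprocess.run(args, check=True)` from the standard library instead of using the `!` shell syntax.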

In [None]:
# Build a string containing zero or more `--nsInclude="..."` options, 
# which can be included in a `mongorestore` command.
inclusion_options = [f"--nsInclude='nmdc.{name}'" for name in COLLECTION_NAMES]
inclusion_options_str = " ".join(inclusion_options)  # separates each option with a space
print(inclusion_options_str)

# Restore the dumped collections to the transformer MongoDB server.
!{mongorestore} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --gzip \
  --drop \
  --preserveUUID \
  --dir="{cfg.origin_dump_folder_path}" \
  {inclusion_options_str}

### Transform the collections within the "transformer" MongoDB server

Now that the transformer database contains a copy of each collection on the agenda, you can freely transform those copies. **The "origin" database is not involved with this step.**

The database transformation functions are defined in the `nmdc-schema` Python package installed earlier.

> Note: This step also includes validation. Reference: https://github.com/microbiomedata/nmdc-runtime/blob/main/metadata-translation/src/bin/validate_json.py

In [None]:
# Instantiate a MongoAdapter bound to the transformer database.
adapter = MongoAdapter(database=transformer_mongo_client["nmdc"])

# Instantiate a Migrator bound to that adapter.
migrator = Migrator(adapter=adapter)

# Execute the Migrator's `upgrade` method to perform the migration.
migrator.upgrade()

### Validate all documents in all collections involved

> TODO: We could delegate this responsibility to the `Migrator` class; or have some `Migrator` methods accept a callback function to run on each document before and after transformation.
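The `_id` handling used in this step can be isolated in a small helper, which makes it easier to test without a database. This is a minimal sketch: the function name `build_validation_root` and the example document are hypothetical.

```python
from typing import Any


def build_validation_root(collection_name: str, document: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Wrap a shallow copy of the document, minus its `_id` key, in a one-collection root."""
    document_without_id = {key: value for key, value in document.items() if key != "_id"}
    return {collection_name: [document_without_id]}


# Hypothetical document, for illustration only.
document = {"_id": "abc123", "id": "nmdc:extrp-00-000001"}
print(build_validation_root("extraction_set", document))
# → {'extraction_set': [{'id': 'nmdc:extrp-00-000001'}]}
```

The returned dictionary has the shape `{ "some_collection_name": [ some_document ] }`, so it could be passed to `nmdc_jsonschema_validator.validate(...)`. The original document is left unmodified.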

In [None]:
for collection_name in COLLECTION_NAMES:
    collection = transformer_mongo_client["nmdc"][collection_name]
    for document in collection.find():
        # Validate the transformed document.
        #
        # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md
        #
        # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,
        #       the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries
        #       containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a
        #       copy (i.e. a shallow copy) of the document that lacks that specific key.
        #
        # Note: `root_to_validate` is a dictionary having the shape: { "some_collection_name": [ some_document ] }
        #       Reference: https://docs.python.org/3/library/stdtypes.html#dict (see the "type constructor" section)
        #
        document_without_underscore_id_key = {key: value for key, value in document.items() if key != "_id"}
        root_to_validate = dict([(collection_name, [document_without_underscore_id_key])])
        nmdc_jsonschema_validator.validate(root_to_validate)  # raises exception if invalid

### Dump the collections from the "transformer" MongoDB server

In [None]:
# Dump the database from the transformer MongoDB server.
!{mongodump} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.transformer_dump_folder_path}" \
  {exclusion_options_str}

### Load the collections into the "origin" MongoDB server

In this step, you'll put the referenced collection(s) into the origin MongoDB server, replacing the original collection(s) that have the same name(s).

In [None]:
# Replace the same-named collection(s) on the origin server, with the transformed one(s).
!{mongorestore} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --gzip \
  --verbose \
  --dir="{cfg.transformer_dump_folder_path}" \
  --drop \
  --preserveUUID \
  {inclusion_options_str}