# Migrate MongoDB database from `nmdc-schema` `v10.5.6` to `v11.0.0`

## Introduction

This notebook will be used to migrate the database from `nmdc-schema` `v10.5.6` ([released](https://github.com/microbiomedata/nmdc-schema/releases/tag/v10.5.6) June 25, 2024) to `v11.0.0` (i.e. the initial version of the Berkeley schema).

Unlike previous migrators, this one does not pick and choose which collections it will dump. There are two reasons for this: (1) migrators no longer have a dedicated `self.agenda` dictionary that indicates all the collections involved in the migration; and (2) this migration is the first one that involves creating, renaming, and dropping collections, none of which the old `self.agenda`-based system was designed to handle. So, instead of picking and choosing collections, this migrator **dumps them all.**

## Prerequisites

### 1. Coordinate with stakeholders.

We will be enacting full Runtime and Database downtime for this migration. Ensure stakeholders are aware of that.

### 2. Set up environment.

Here, you'll prepare an environment for running this notebook.

1. Start a **MongoDB server** on your local machine (and ensure it does **not** already contain a database named `nmdc`).
    1. You can start a [Docker](https://hub.docker.com/_/mongo)-based MongoDB server at `localhost:27055` by running this command (this MongoDB server will be accessible without a username or password).


In [None]:
!docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo:6.0.4

2. Delete **obsolete dumps** from previous notebook runs.
    1. This is so the dumps you generate below will not be mixed in with any unrelated ones.

In [None]:
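# Note: `cfg` is created in the "Parse configuration files" step below, so run that step before this cell.
!rm -rf {cfg.origin_dump_folder_path}
!rm -rf {cfg.transformer_dump_folder_path}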

3. Create and populate a **notebook configuration file** named `.notebook.env`.
    1. You can use `.notebook.env.example` as a template.
4. Create and populate the two **MongoDB configuration files**, `.mongo.origin.yaml` and `.mongo.transformer.yaml`, that this notebook will use to connect to the "origin" and "transformer" MongoDB servers, respectively. The "origin" MongoDB server is the one that contains the database you want to migrate, and the "transformer" MongoDB server is the one you want to use to perform the data transformations. In practice, the "origin" MongoDB server is typically a remote server, and the "transformer" MongoDB server is typically a local server.
    1. You can use `.mongo.yaml.example` as a template.

- TODO: Consolidate config files!

## Procedure

### Install Python dependencies

In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends.

> Note: If the output of this cell says "Note: you may need to restart the kernel to use updated packages", restart the kernel (not the notebook cells) now.

References: 
- Berkeley Schema PyPI package (it's version 11+ of the `nmdc-schema` package): https://pypi.org/project/nmdc-schema/
- Berkeley Schema GitHub repo: https://github.com/microbiomedata/berkeley-schema-fy24
- How to `pip install` a Git branch: https://stackoverflow.com/a/20101940

In [None]:
%pip install --upgrade pip
%pip install -r requirements.txt
%pip install nmdc-schema==11.0.0rc16

### Import Python dependencies

Import the Python objects upon which this notebook depends.

In [None]:
# Standard library packages:
import subprocess
import logging
from typing import List

# Third-party packages:
import pymongo
from jsonschema import Draft7Validator
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict, SchemaVariantIdentifier
from nmdc_schema.migrators.adapters.mongo_adapter import MongoAdapter
from nmdc_schema.migrators.migrator_from_10_2_0_to_11_0_0 import Migrator

# First-party packages:
from helpers import Config
from bookkeeper import Bookkeeper, MigrationEvent

### Parse configuration files

Parse the notebook and Mongo configuration files.

In [None]:
cfg = Config()

# Define some aliases we can use to make the shell commands in this notebook easier to read.
mongodump    = cfg.mongodump_path
mongorestore = cfg.mongorestore_path
mongosh      = cfg.mongosh_path

# Perform a sanity test of the application paths.
!{mongodump}    --version
!{mongorestore} --version
!{mongosh}      --version

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" and "transformer" MongoDB servers.

In [None]:
# Mongo client for "origin" MongoDB server.
origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_server_uri, directConnection=True)

# Mongo client for "transformer" MongoDB server.
transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_server_uri)

# Perform sanity tests of those MongoDB clients' abilities to access their respective MongoDB servers.
with pymongo.timeout(3):
    # Display the MongoDB server version (running on the "origin" Mongo server).
    print("Origin Mongo server version:      " + origin_mongo_client.server_info()["version"])

    # Sanity test: Ensure the origin database exists.
    assert "nmdc" in origin_mongo_client.list_database_names(), "Origin database does not exist."

    # Display the MongoDB server version (running on the "transformer" Mongo server).
    print("Transformer Mongo server version: " + transformer_mongo_client.server_info()["version"])

    # Sanity test: Ensure the transformation database does not exist.
    assert "nmdc" not in transformer_mongo_client.list_database_names(), "Transformation database already exists."

### Create JSON Schema validator

In this step, you'll create a JSON Schema validator for the NMDC Schema.

- TODO: Consider whether the JSON Schema validator version is consistent with the JSON Schema version (e.g. draft 7 versus draft 2019).

In [None]:
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict(variant=SchemaVariantIdentifier.nmdc_materialized_patterns)
nmdc_jsonschema_validator = Draft7Validator(nmdc_jsonschema)

# Perform sanity tests of the NMDC Schema dictionary and the JSON Schema validator.
# Reference: https://python-jsonschema.readthedocs.io/en/latest/api/jsonschema/protocols/#jsonschema.protocols.Validator.check_schema
print("NMDC Schema title:   " + nmdc_jsonschema["title"])
print("NMDC Schema version: " + nmdc_jsonschema["version"])

nmdc_jsonschema_validator.check_schema(nmdc_jsonschema)  # raises exception if schema is invalid
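
Regarding the TODO above, one way to check consistency is to compare the `$schema` URI declared by the schema dictionary with the meta-schema URI of the validator's draft. The following is a minimal sketch, not part of the original notebook. The helper name `declares_draft` is hypothetical, and the sketch assumes the schema dictionary may (or may not) have a top-level `$schema` key.

```python
from typing import Optional

# Meta-schema URI for JSON Schema Draft 7.
DRAFT_7_URI = "http://json-schema.org/draft-07/schema#"


def declares_draft(schema: dict, draft_uri: str) -> Optional[bool]:
    """Return whether `schema` declares `draft_uri` via its `$schema` key.

    Returns None when the schema has no `$schema` key (hypothetical helper).
    """
    declared = schema.get("$schema")
    if declared is None:
        return None
    # Compare while ignoring a trailing "#", which some schemas omit.
    return declared.rstrip("#") == draft_uri.rstrip("#")


# Toy schema dictionaries for illustration (not the NMDC Schema).
print(declares_draft({"$schema": "http://json-schema.org/draft-07/schema"}, DRAFT_7_URI))
print(declares_draft({"$schema": "https://json-schema.org/draft/2019-09/schema"}, DRAFT_7_URI))
print(declares_draft({}, DRAFT_7_URI))
```

In this notebook, the call would presumably be `declares_draft(nmdc_jsonschema, DRAFT_7_URI)`; a `False` or `None` result would suggest revisiting the choice of `Draft7Validator`.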

### Revoke access from the "origin" MongoDB server

We revoke "write" access so people don't make changes to the original data while the migration is happening, because the migration ends by overwriting the original data.

We also revoke "read" access. The revocation of "read" access is technically optional, but (a) the JavaScript mongosh script will be easier for me to maintain if it revokes everything and (b) this prevents people from reading data during the restore step, during which the database may not be self-consistent.

References:

- https://docs.python.org/3/library/subprocess.html
- https://www.mongodb.com/docs/mongodb-shell/reference/options/
- https://www.mongodb.com/docs/mongodb-shell/write-scripts/

In [None]:
# Note: I run this command via Python's `subprocess` module instead of via an IPython magic `!` command
#       because one of the CLI options contains the Mongo password (since `mongosh` does not support the
#       use of config files located anywhere except in the user's home directory) and my gut tells me
#       this approach makes it less likely that the password will appear in some shell history compared to
#       if the command were run via a `!` command (since, to me, the latter more closely resembles
#       regular shell usage).
#
#       TODO: Revisit this; and consider switching all the other `!` commands to use `subprocess`
#             so that this notebook is closer to becoming a regular Python script.
#
shell_command = f"""
  {cfg.mongosh_path} \
      --host='{cfg.origin_mongo_host}' \
      --port='{cfg.origin_mongo_port}' \
      --username='{cfg.origin_mongo_username}' \
      --password='{cfg.origin_mongo_password}' \
      --quiet \
      --file='mongosh-scripts/revoke-privileges.mongo.js'
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")
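
Regarding the TODO in the cell above, one option is to pass the command to `subprocess.run` as an argument list instead of a single shell string. That avoids shell quoting problems (for example, a password that contains a single quote). The sketch below is only an illustration under stated assumptions, not the notebook's method: `build_mongosh_args` is a hypothetical helper, and the values passed to it are placeholders. The password still ends up in the process's argument list either way.

```python
from typing import List


def build_mongosh_args(mongosh_path: str, host: str, port: str,
                       username: str, password: str, script_path: str) -> List[str]:
    """Build an argument list for `subprocess.run` (hypothetical helper).

    Each option is its own list element, so no shell quoting is needed.
    """
    return [
        mongosh_path,
        f"--host={host}",
        f"--port={port}",
        f"--username={username}",
        f"--password={password}",
        "--quiet",
        f"--file={script_path}",
    ]


# Placeholder values for illustration only.
args = build_mongosh_args("mongosh", "localhost", "27017", "admin", "it's-a-secret",
                          "mongosh-scripts/revoke-privileges.mongo.js")
print(len(args))
```

Such a list could then be run with `subprocess.run(args, check=True)`, which raises `subprocess.CalledProcessError` on a non-zero return code instead of only printing it.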

### Dump collections from the "origin" MongoDB server

Use `mongodump` to dump all the collections **from** the "origin" MongoDB server **into** a local directory.

In [None]:
# Dump all collections from the "origin" database.
!{mongodump} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.origin_dump_folder_path}"

### Load the dumped collections into the "transformer" MongoDB server

Use `mongorestore` to load the dumped collections **from** the local directory **into** the "transformer" MongoDB server.

In [None]:
# Restore the dumped collections to the "transformer" MongoDB server.
!{mongorestore} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --gzip \
  --drop \
  --preserveUUID \
  --stopOnError \
  --dir="{cfg.origin_dump_folder_path}"

### Transform the collections within the "transformer" MongoDB server

Use the migrator to transform the collections in the "transformer" database.

> Reminder: The database transformation functions are defined in the `nmdc-schema` Python package installed earlier.

> Reminder: The "origin" database is **not** affected by this step.

- TODO: Consider deleting the existing log or appending a timestamp to the log filename.

In [None]:
# Set up a logger that writes to a file.
# TODO: Move this logger stuff to `helpers.py`.
LOG_FILE_PATH = "./tmp.log"
logger = logging.getLogger(name="migrator_logger")
logger.setLevel(logging.DEBUG)
file_handler = logging.FileHandler(LOG_FILE_PATH)
formatter = logging.Formatter(fmt="%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s",
                              datefmt="%Y-%m-%d %H:%M:%S")
file_handler.setFormatter(formatter)
if logger.hasHandlers():
    logger.handlers.clear()  # avoid duplicate log entries
logger.addHandler(file_handler)
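
Regarding the TODO about the log file, one possibility is to put a timestamp in the log filename so that each run writes to its own file. This is a minimal sketch under that assumption; `make_log_file_path` is a hypothetical helper, and the timestamp format is an arbitrary choice.

```python
from datetime import datetime
from pathlib import Path


def make_log_file_path(base_path: str, now: datetime) -> str:
    """Insert a timestamp before the file extension (hypothetical helper)."""
    path = Path(base_path)
    timestamp = now.strftime("%Y%m%dT%H%M%S")
    return str(path.with_name(f"{path.stem}_{timestamp}{path.suffix}"))


# A fixed datetime is used here so the result is reproducible.
print(make_log_file_path("./tmp.log", datetime(2024, 6, 25, 13, 5, 9)))
```

In the notebook, the result of `make_log_file_path("./tmp.log", datetime.now())` could be assigned to `LOG_FILE_PATH` before the `FileHandler` is created.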

In [None]:
# Instantiate a MongoAdapter bound to the "transformer" database.
adapter = MongoAdapter(
    database=transformer_mongo_client["nmdc"],
    on_collection_created=lambda name: print(f'Created collection "{name}"'),
    on_collection_renamed=lambda old_name, name: print(f'Renamed collection "{old_name}" to "{name}"'),
    on_collection_deleted=lambda name: print(f'Deleted collection "{name}"'),
)

# Instantiate a Migrator bound to that adapter.
migrator = Migrator(adapter=adapter, logger=logger)

# Execute the Migrator's `upgrade` method to perform the migration.
migrator.upgrade()

### Validate the transformed documents

Now that we have transformed the database, validate each document in each collection in the "transformer" MongoDB server.

In [None]:
# Make a list of all slots of the `Database` class in the schema.
#
# TODO: Use a SchemaView for this instead of directly accessing the JSON Schema dictionary.
#
database_slot_names: List[str] = list(nmdc_jsonschema["$defs"]["Database"]["properties"].keys())

# Ensure that, if the (large) "functional_annotation_agg" collection is present in `database_slot_names`,
# it goes at the end of the list we process. That way, we can find out about validation errors in
# other collections without having to wait for that (large) collection to be validated before them.
ordered_collection_names = sorted(database_slot_names.copy())
large_collection_name = "functional_annotation_agg"
if large_collection_name in ordered_collection_names:
    ordered_collection_names = list(filter(lambda n: n != large_collection_name, ordered_collection_names))
    ordered_collection_names.append(large_collection_name)

for collection_name in ordered_collection_names:
    collection = transformer_mongo_client["nmdc"][collection_name]
    num_documents_in_collection = collection.count_documents({})
    print(f"Validating collection {collection_name} ({num_documents_in_collection} documents)")

    for document in collection.find():
        # Validate the transformed document.
        #
        # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md
        #
        # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,
        #       the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries
        #       containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a
        #       copy (i.e. a shallow copy) of the document that lacks that specific key.
        #
        # Note: `root_to_validate` is a dictionary having the shape: { "some_collection_name": [ some_document ] }
        #       Reference: https://docs.python.org/3/library/stdtypes.html#dict (see the "type constructor" section)
        #
        document_without_underscore_id_key = {key: value for key, value in document.items() if key != "_id"}
        root_to_validate = dict([(collection_name, [document_without_underscore_id_key])])
        nmdc_jsonschema_validator.validate(root_to_validate)  # raises exception if invalid
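
The two data-shaping steps in the cell above (putting the large collection last, and wrapping each document, minus `_id`, under its collection name) can be isolated into small functions that are easy to test without a database. This is a sketch only; the function names are hypothetical, and the sample document is made up.

```python
from typing import Any, Dict, List


def move_to_end(names: List[str], name_to_move: str) -> List[str]:
    """Return the names sorted, with `name_to_move` last if it is present."""
    ordered = sorted(n for n in names if n != name_to_move)
    if name_to_move in names:
        ordered.append(name_to_move)
    return ordered


def make_root_to_validate(collection_name: str, document: Dict[str, Any]) -> Dict[str, list]:
    """Wrap a shallow copy of the document, minus `_id`, under its collection name."""
    document_without_id = {k: v for k, v in document.items() if k != "_id"}
    return {collection_name: [document_without_id]}


# Made-up inputs for illustration only.
names = ["study_set", "functional_annotation_agg", "biosample_set"]
print(move_to_end(names, "functional_annotation_agg"))

document = {"_id": "abc123", "id": "example:1", "name": "Example"}
print(make_root_to_validate("study_set", document))
```

Neither function mutates its input, so the original list and document remain available afterwards (for example, to report the `_id` of a document that fails validation).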

### Dump the collections from the "transformer" MongoDB server

Now that the collections have been transformed and validated, dump them **from** the "transformer" MongoDB server **into** a local directory.

In [None]:
# Dump the database from the "transformer" MongoDB server.
!{mongodump} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.transformer_dump_folder_path}"

### Create a bookkeeper

Create a `Bookkeeper` that can be used to document migration events in the "origin" server.

In [None]:
bookkeeper = Bookkeeper(mongo_client=origin_mongo_client)

### Indicate — on the "origin" server — that the migration is underway

Add an entry to the migration log collection to indicate that this migration has started.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_STARTED)

### TODO: Drop the original collections from the "origin" MongoDB server

This is necessary for situations where collections were renamed or deleted. The `--drop` option of `mongorestore` only drops collections that exist in the dump. We may need `mongosh` for this.

- TODO: Now that the notebook does depend upon `mongosh`, revisit filling in this step.
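
As one possible starting point for this step, the collections that need dropping could be identified by comparing the collection names in the "origin" database with those in the transformed database. The sketch below shows only that comparison. `find_stale_collection_names` is a hypothetical helper, and the collection names are toy values.

```python
from typing import Iterable, List


def find_stale_collection_names(origin_names: Iterable[str],
                                transformed_names: Iterable[str]) -> List[str]:
    """Return names present in the origin database but absent from the transformed one."""
    return sorted(set(origin_names) - set(transformed_names))


# Toy collection names for illustration only.
origin = ["old_name_set", "study_set", "to_delete_set"]
transformed = ["new_name_set", "study_set"]
print(find_stale_collection_names(origin, transformed))
```

In this notebook, the inputs might come from `origin_mongo_client["nmdc"].list_collection_names()` and `transformer_mongo_client["nmdc"].list_collection_names()`, and each resulting name could be passed to the origin database's `drop_collection` method. Review the list before dropping anything: a collection created on the "origin" server after the dump was taken (for example, by the bookkeeper) would also appear in it.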

### Load the collections into the "origin" MongoDB server

Load the transformed collections into the "origin" MongoDB server, **replacing** the collections there that have the same names.

- TODO: If the migration involved renaming or deleting a collection, the collection having the original name will continue to exist in the "origin" database until someone deletes it manually.

In [None]:
# Load the transformed collections into the origin server, replacing any same-named ones that are there.
!{mongorestore} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --gzip \
  --verbose \
  --dir="{cfg.transformer_dump_folder_path}" \
  --drop --preserveUUID \
  --stopOnError

### Indicate that the migration is complete

Add an entry to the migration log collection to indicate that this migration is complete.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_COMPLETED)

### Restore access to the "origin" MongoDB server

This effectively undoes the access revocation that we did earlier.

In [None]:
shell_command = f"""
  {cfg.mongosh_path} \
      --host='{cfg.origin_mongo_host}' \
      --port='{cfg.origin_mongo_port}' \
      --username='{cfg.origin_mongo_username}' \
      --password='{cfg.origin_mongo_password}' \
      --quiet \
      --file='mongosh-scripts/restore-privileges.mongo.js'
"""
completed_process = subprocess.run(shell_command, shell=True)
print(f"\nReturn code: {completed_process.returncode}")