# Migrate MongoDB database from `nmdc-schema` `v10.0.0` to `v10.1.4`

- TODO: Disable read/write access to the origin database during the migration process.

## Prerequisites

### 1. Determine MongoDB collections involved.

Here, you'll determine which MongoDB collections will be used as part of this migration.

1. In the [`nmdc-schema` repo](https://github.com/microbiomedata/nmdc-schema/tree/main/nmdc_schema/migrators), go to the `nmdc_schema/migrators` directory and open the Python module whose name contains the two schema versions involved in this migration. For example, if migrating from schema version `A.B.C` to `X.Y.Z`, open the module named `migrator_from_A_B_C_to_X_Y_Z.py`.
2. Determine the collections that are accessed—whether for reading or for writing—by that module. **This is currently a manual process.**
3. Add their names to the `COLLECTION_NAMES` Python list below.

In [None]:
COLLECTION_NAMES: list[str] = [
    "data_object_set"
]
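
Because step 2 is manual, a quick text scan of the migrator module can help you double-check your list. The following is a minimal sketch, not part of the migration tooling. It assumes collection names appear in the module as quoted strings ending in `_set`, which is only a heuristic and will miss collections named differently. The function name `find_collection_name_candidates` and the example source text are hypothetical.

```python
import re


def find_collection_name_candidates(source_code: str) -> list[str]:
    """Returns the distinct quoted strings ending in `_set` that appear in the source code."""
    pattern = re.compile(r"""["']([A-Za-z][A-Za-z0-9_]*_set)["']""")
    return sorted(set(pattern.findall(source_code)))


# Hypothetical migrator source text, used only to illustrate the scan.
example_source = '''
def upgrade(self):
    collection_name = "data_object_set"
'''
print(find_collection_name_candidates(example_source))
```

To scan a real module, you could pass the text of the migrator file to the function instead of `example_source`. Treat the result as a list of candidates to review by hand, not as a definitive answer.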

### 2. Coordinate with stakeholders.

Identify the people who read/write to those collections, or who maintain software that reads/writes to those collections. You can view a list of stakeholders in `./stakeholders.md`.

Once you have identified those people, coordinate with them to agree on a time window for the migration. You can contact them via Slack, for example.

### 3. Set up environment.

Here, you'll prepare an environment for running this notebook.

1. Start a **MongoDB server** on your local machine (and ensure it does **not** already contain a database named `nmdc`).
    1. You can start a [Docker](https://hub.docker.com/_/mongo)-based MongoDB server at `localhost:27055` by running this command (this MongoDB server will be accessible without a username or password).
       ```shell
       docker run --rm --detach --name mongo-migration-transformer -p 27055:27017 mongo:6.0.4
       ```
2. Create and populate a **notebook configuration file** named `.notebook.env`.
    1. You can use `.notebook.env.example` as a template:
       ```shell
       $ cp .notebook.env.example .notebook.env
       ```
3. Create and populate the two **MongoDB configuration files** that this notebook will use to connect to the "origin" and "transformer" MongoDB servers. The "origin" MongoDB server is the one that contains the database you want to migrate, and the "transformer" MongoDB server is the one you want to use to perform the data transformations. In practice, the "origin" MongoDB server is typically a remote server, and the "transformer" MongoDB server is typically a local server.
    1. You can use `.mongo.yaml.example` as a template:
       ```shell
       $ cp .mongo.yaml.example .mongo.origin.yaml
       $ cp .mongo.yaml.example .mongo.transformer.yaml
       ```
       > When populating the file for the origin MongoDB server, use credentials that have **both read and write access** to the `nmdc` database.
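
Before moving on, you may want to confirm that the three configuration files from the steps above are in place. The following is a minimal sketch that assumes the files live in the directory you run it from. The function name `find_missing_files` is hypothetical and is not provided by this notebook's `helpers` module.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILE_NAMES = [".notebook.env", ".mongo.origin.yaml", ".mongo.transformer.yaml"]


def find_missing_files(directory: str, file_names: list[str]) -> list[str]:
    """Returns the names of the files that do not exist in the specified directory."""
    base = Path(directory)
    return [name for name in file_names if not (base / name).is_file()]


missing = find_missing_files(".", REQUIRED_FILE_NAMES)
print("Missing files:", missing if missing else "none")
```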

## Procedure

### Install Python dependencies

In this step, you'll [install](https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/) the Python packages upon which this notebook depends.

> Note: If the output of this cell says "Note: you may need to restart the kernel to use updated packages", restart the kernel (not the notebook) now.

In [None]:
%pip install -r requirements.txt
%pip install nmdc-schema==10.1.4

### Import Python dependencies

Import the Python objects upon which this notebook depends.

> Note: One of the `import` statements is specific to this migration.

In [None]:
# Stdlib packages:
from copy import deepcopy

# Third-party packages:
import pymongo
from jsonschema import Draft7Validator, ValidationError
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict
from nmdc_schema.migrators.adapters.mongo_adapter import MongoAdapter

from nmdc_schema.migrators.migrator_from_10_0_0_to_10_1_2 import Migrator  # note: the migrator to 10.1.2 was introduced in schema version 10.1.4

# First-party packages:
from helpers import Config
from bookkeeper import Bookkeeper, MigrationEvent

### Parse configuration files

Parse the notebook and Mongo configuration files.

In [None]:
cfg = Config()

# Define some aliases we can use to make the shell commands in this notebook easier to read.
mongodump = cfg.mongodump_path
mongorestore = cfg.mongorestore_path

# Perform a sanity test of the application paths.
!{mongodump} --version
!{mongorestore} --version

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" and "transformer" MongoDB servers.

In [None]:
# Mongo client for "origin" MongoDB server.
origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_server_uri, directConnection=True)

# Mongo client for "transformer" MongoDB server.
transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_server_uri)

# Perform sanity tests of those MongoDB clients' abilities to access their respective MongoDB servers.
with pymongo.timeout(3):
    # Display the MongoDB server version (running on the "origin" Mongo server).
    print("Origin Mongo server version:      " + origin_mongo_client.server_info()["version"])

    # Sanity test: Ensure the origin database exists.
    assert "nmdc" in origin_mongo_client.list_database_names(), "Origin database does not exist."

    # Display the MongoDB server version (running on the "transformer" Mongo server).
    print("Transformer Mongo server version: " + transformer_mongo_client.server_info()["version"])

    # Sanity test: Ensure the transformation database does not exist.
    assert "nmdc" not in transformer_mongo_client.list_database_names(), "Transformation database already exists."

### Create a bookkeeper

Create a `Bookkeeper` that can be used to document migration events in the "origin" server.

In [None]:
bookkeeper = Bookkeeper(mongo_client=origin_mongo_client)

### Create JSON Schema validator

In this step, you'll create a JSON Schema validator for the NMDC Schema.

In [None]:
def remove_id_pattern_constraints(nmdc_schema: dict) -> dict:
    r"""
    Returns a variant of the schema having no `$defs[*].properties.id.pattern` properties.

    Note: This algorithm was copied from the `without_id_patterns` function in `nmdc_runtime/util.py`.
    """
    custom_schema = deepcopy(nmdc_schema)
    for spec in custom_schema["$defs"].values():
        if "properties" in spec and "id" in spec["properties"] and "pattern" in spec["properties"]["id"]:
            del spec["properties"]["id"]["pattern"]
    return custom_schema


# Make a version of the NMDC Schema that accepts so-called "legacy IDs".
nmdc_jsonschema: dict = remove_id_pattern_constraints(get_nmdc_jsonschema_dict())
nmdc_jsonschema_validator = Draft7Validator(nmdc_jsonschema)

# Perform sanity tests of the NMDC Schema dictionary and the JSON Schema validator.
# Reference: https://python-jsonschema.readthedocs.io/en/latest/api/jsonschema/protocols/#jsonschema.protocols.Validator.check_schema
print("NMDC Schema title:   " + nmdc_jsonschema["title"])
print("NMDC Schema version: " + nmdc_jsonschema["version"])

nmdc_jsonschema_validator.check_schema(nmdc_jsonschema)  # raises exception if schema is invalid
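
To see which constraints `remove_id_pattern_constraints` strips, it can help to list them first. The following is a minimal sketch run against a toy schema. The function name `find_id_patterns` and the contents of `toy_schema` are hypothetical, and the toy schema is far smaller than the real NMDC Schema.

```python
def find_id_patterns(schema: dict) -> dict[str, str]:
    """Returns a mapping from each `$defs` entry name to its `properties.id.pattern` value, if any."""
    patterns = {}
    for name, spec in schema.get("$defs", {}).items():
        pattern = spec.get("properties", {}).get("id", {}).get("pattern")
        if pattern is not None:
            patterns[name] = pattern
    return patterns


# Toy schema (hypothetical), used only for illustration.
toy_schema = {
    "$defs": {
        "Biosample": {"properties": {"id": {"type": "string", "pattern": "^nmdc:bsm-.*$"}}},
        "Study": {"properties": {"id": {"type": "string"}}},
        "SomeEnum": {"enum": ["a", "b"]},
    }
}
print(find_id_patterns(toy_schema))
```

In this toy schema, only `Biosample` carries an `id` pattern, so it is the only entry the removal step would change.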

### Dump collections from the "origin" MongoDB server

Use `mongodump` to dump the collections involved in this migration **from** the "origin" MongoDB server **into** a local directory.

> `mongodump` can dump either a whole database or a single collection (via `--collection`); it has no CLI option for listing several collections we _want_ the dump to include. So, we use multiple occurrences of the `--excludeCollection` CLI option to exclude each collection we do _not_ want the dump to include. The end result is the same; it just takes an extra step.

In [None]:
# Build a string containing zero or more `--excludeCollection="..."` options, which can be included in a `mongodump` command.
all_collection_names: list[str] = origin_mongo_client["nmdc"].list_collection_names()
non_agenda_collection_names = [name for name in all_collection_names if name not in COLLECTION_NAMES]
exclusion_options = [f"--excludeCollection='{name}'" for name in non_agenda_collection_names]
exclusion_options_str = " ".join(exclusion_options)  # separates each option with a space
print(exclusion_options_str)

# Dump the not-excluded collections from the "origin" database.
!{mongodump} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.origin_dump_folder_path}" \
  {exclusion_options_str}
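
The option-building logic in the cell above can be tried in isolation, without a MongoDB server. The following is a minimal sketch that uses hypothetical collection names. Unlike the cell above, which wraps each name in hard-coded single quotes, this variant uses the standard library's `shlex.quote`, which adds quotes only when a name contains characters the shell would otherwise interpret. The function name `build_exclusion_options` is hypothetical.

```python
import shlex


def build_exclusion_options(all_names: list[str], names_to_keep: list[str]) -> str:
    """Returns one `--excludeCollection=...` option for each name not in `names_to_keep`."""
    names_to_exclude = [name for name in all_names if name not in names_to_keep]
    return " ".join(f"--excludeCollection={shlex.quote(name)}" for name in names_to_exclude)


# Hypothetical collection names, used only for illustration.
all_names = ["biosample_set", "data_object_set", "study_set"]
print(build_exclusion_options(all_names, ["data_object_set"]))
```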

### Load the dumped collections into the "transformer" MongoDB server

Use `mongorestore` to load the dumped collections **from** the local directory **into** the "transformer" MongoDB server.

> Since it's possible that the dump included extra collections (due to someone having created a collection between the time you generated the `--excludeCollection` CLI options and the time you ran `mongodump` above), we will use the `--nsInclude` CLI option to indicate which specific collections—from the dump—we want to load into the "transformer" database.

In [None]:
# Build a string containing zero or more `--nsInclude="..."` options, which can be included in a `mongorestore` command.
inclusion_options = [f"--nsInclude='nmdc.{name}'" for name in COLLECTION_NAMES]
inclusion_options_str = " ".join(inclusion_options)  # separates each option with a space
print(inclusion_options_str)

# Restore the dumped collections to the "transformer" MongoDB server.
!{mongorestore} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --gzip \
  --drop \
  --preserveUUID \
  --dir="{cfg.origin_dump_folder_path}" \
  {inclusion_options_str}
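
After the restore, one optional sanity check is to compare per-collection document counts between the two servers. The following is a minimal sketch of the comparison only, using hypothetical numbers. The function name `find_count_mismatches` is hypothetical. The counts will only be comparable if nothing has written to the "origin" collections since the dump was taken.

```python
def find_count_mismatches(
    origin_counts: dict[str, int], transformer_counts: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Returns `{collection_name: (origin_count, transformer_count)}` for each collection whose counts differ."""
    mismatches = {}
    for name, origin_count in origin_counts.items():
        transformer_count = transformer_counts.get(name, 0)
        if transformer_count != origin_count:
            mismatches[name] = (origin_count, transformer_count)
    return mismatches


# In this notebook, the counts could be gathered with pymongo, for example:
#   {name: origin_mongo_client["nmdc"][name].count_documents({}) for name in COLLECTION_NAMES}
origin_counts = {"data_object_set": 100}      # hypothetical number
transformer_counts = {"data_object_set": 98}  # hypothetical number
print(find_count_mismatches(origin_counts, transformer_counts))
```

An empty result means every listed collection has the same number of documents on both servers.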

### Transform the collections within the "transformer" MongoDB server

Use the migrator to transform the collections in the "transformer" database.

> Reminder: The database transformation functions are defined in the `nmdc-schema` Python package installed earlier.

> Reminder: The "origin" database is **not** affected by this step.

In [None]:
# Instantiate a MongoAdapter bound to the "transformer" database.
adapter = MongoAdapter(
    database=transformer_mongo_client["nmdc"],
    # Note: These callbacks aren't supported yet, as of nmdc-schema 10.1.4.
    # on_collection_created=lambda name: print(f'Created collection "{name}"'),
    # on_collection_renamed=lambda old_name, name: print(f'Renamed collection "{old_name}" to "{name}"'),
    # on_collection_deleted=lambda name: print(f'Deleted collection "{name}"'),
)

# Instantiate a Migrator bound to that adapter.
migrator = Migrator(adapter=adapter)

# Execute the Migrator's `upgrade` method to perform the migration.
migrator.upgrade()

### Validate the transformed documents

Now that we have transformed the database, validate each document in each collection in the "transformer" MongoDB server.

> Reference: https://github.com/microbiomedata/nmdc-runtime/blob/main/metadata-translation/src/bin/validate_json.py

In [None]:
for collection_name in COLLECTION_NAMES:
    collection = transformer_mongo_client["nmdc"][collection_name]
    for document in collection.find():
        # Validate the transformed document.
        #
        # Reference: https://github.com/microbiomedata/nmdc-schema/blob/main/src/docs/schema-validation.md
        #
        # Note: Dictionaries originating as Mongo documents include a Mongo-generated key named `_id`. However,
        #       the NMDC Schema does not describe that key and, indeed, data validators consider dictionaries
        #       containing that key to be invalid with respect to the NMDC Schema. So, here, we validate a
        #       copy (i.e. a shallow copy) of the document that lacks that specific key.
        #
        # Note: `root_to_validate` is a dictionary having the shape: { "some_collection_name": [ some_document ] }
        #       Reference: https://docs.python.org/3/library/stdtypes.html#dict (see the "type constructor" section)
        #
        document_without_underscore_id_key = {key: value for key, value in document.items() if key != "_id"}
        root_to_validate = dict([(collection_name, [document_without_underscore_id_key])])
        try:
            nmdc_jsonschema_validator.validate(root_to_validate)  # raises exception if invalid
        except ValidationError as err:
            # Print the offending document (to facilitate debugging) before propagating the exception.
            print(document)
            raise
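
The shape of `root_to_validate` can be inspected without a database or a validator. The following is a minimal sketch that uses a hypothetical document. The function name `make_root_to_validate` is hypothetical, and the dictionary literal it returns is equivalent to the `dict([(collection_name, [...])])` call in the cell above.

```python
def make_root_to_validate(collection_name: str, document: dict) -> dict:
    """Wraps a copy of the document, minus its `_id` key, as `{collection_name: [document]}`."""
    document_without_id = {key: value for key, value in document.items() if key != "_id"}
    return {collection_name: [document_without_id]}


# Hypothetical document, used only for illustration.
document = {"_id": "abc123", "id": "nmdc:dobj-1", "name": "example"}
root = make_root_to_validate("data_object_set", document)
print(root)
print("_id" in document)  # the original document still has its `_id` key
```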

### Indicate that the migration is underway

Add an entry to the migration log collection to indicate that this migration has started.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_STARTED)

### Dump the collections from the "transformer" MongoDB server

Now that the collections have been transformed and validated, dump them **from** the "transformer" MongoDB server **into** a local directory.

In [None]:
# Dump the database from the "transformer" MongoDB server.
!{mongodump} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.transformer_dump_folder_path}" \
  {exclusion_options_str}

### Load the collections into the "origin" MongoDB server

Load the transformed collections into the "origin" MongoDB server, **replacing** the collections there that have the same names.

> Note: If the migration involved renaming or deleting a collection, the collection having the original name will continue to exist in the "origin" database until someone deletes it manually.

In [None]:
# Replace the same-named collection(s) on the origin server, with the transformed one(s).
!{mongorestore} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --gzip \
  --verbose \
  --dir="{cfg.transformer_dump_folder_path}" \
  --drop \
  --preserveUUID \
  {inclusion_options_str}

### Indicate that the migration is complete

Add an entry to the migration log collection to indicate that this migration is complete.

In [None]:
bookkeeper.record_migration_event(migrator=migrator, event=MigrationEvent.MIGRATION_COMPLETED)