# Migrate from `v7.8.0` to `v8.0.0`

## Prerequisites

### 1. Determine impacted Mongo collections

Determine which Mongo collections will be transformed during this migration.

That involves reading each transformation function written specifically for this migration ([in the `nmdc-schema` repository](https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migration_recursion.py)), checking the object it returns, and mapping that object to a Mongo collection (e.g. a dictionary named `transformed_study` → the Mongo collection named `study_set`).

In this case, those Mongo collections are:

- `extraction_set`
- `omics_processing_set`
- `biosample_set`
- `study_set`

> Note: The list above is specific to this migration.

### 2. Determine impacted system components

Determine which components of the NMDC system will be impacted by the transformations.

That involves mapping each Mongo collection listed in the previous step to the component(s) of the NMDC system that write to it.

For reference, here's a table of Mongo collections and the components of the NMDC system that write to them (according to [a conversation that occurred on September 11, 2023](https://nmdc-group.slack.com/archives/C01SVTKM8GK/p1694465755802979?thread_ts=1694216327.234519&cid=C01SVTKM8GK)):

| Mongo collection                            | Components that write to it                              |
|---------------------------------------------|----------------------------------------------------------|
| `biosample_set`                             | Workflows (via manual entry via `nmdc-runtime` HTTP API) |
| `data_object_set`                           | Workflows (via `nmdc-runtime` HTTP API)                  |
| `mags_activity_set`                         | Workflows (via `nmdc-runtime` HTTP API)                  |
| `metagenome_annotation_activity_set`        | Workflows (via `nmdc-runtime` HTTP API)                  |
| `metagenome_assembly_set`                   | Workflows (via `nmdc-runtime` HTTP API)                  |
| `read_based_taxonomy_analysis_activity_set` | Workflows (via `nmdc-runtime` HTTP API)                  |
| `read_qc_analysis_activity_set`             | Workflows (via `nmdc-runtime` HTTP API)                  |
| `jobs`                                      | Scheduler (via Mongo directly)                           |
| `*`                                         | `nmdc-runtime` (via Mongo directly)                      |

> Note: The table above is not specific to any given migration. It may still change over time, though.

In this case, those parts of the NMDC system are:

- Workflows (due to `biosample_set`)
- `nmdc-runtime` (due to `*`)
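
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the dictionary is a hand-copied version of the table (it is not provided by `nmdc-runtime` or `nmdc-schema`), and the function name `get_impacted_components` is made up for this example.

```python
from fnmatch import fnmatchcase

# Hand-copied from the table above (an assumption; it may drift from reality over time).
COMPONENT_BY_COLLECTION_PATTERN = {
    "biosample_set": "Workflows",
    "data_object_set": "Workflows",
    "mags_activity_set": "Workflows",
    "metagenome_annotation_activity_set": "Workflows",
    "metagenome_assembly_set": "Workflows",
    "read_based_taxonomy_analysis_activity_set": "Workflows",
    "read_qc_analysis_activity_set": "Workflows",
    "jobs": "Scheduler",
    "*": "nmdc-runtime",
}


def get_impacted_components(collection_names):
    """Map each impacted component to the table rows (patterns) that implicate it."""
    impacted = {}
    for collection_name in collection_names:
        for pattern, component in COMPONENT_BY_COLLECTION_PATTERN.items():
            if fnmatchcase(collection_name, pattern):
                impacted.setdefault(component, set()).add(pattern)
    return impacted


impacted_components = get_impacted_components(
    ["extraction_set", "omics_processing_set", "biosample_set", "study_set"]
)
for component, patterns in sorted(impacted_components.items()):
    print(f"{component} (due to {', '.join(sorted(patterns))})")
```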

### 3. Coordinate with component owners

Coordinate with the owners of the impacted components of the NMDC system.

TODO: Elaborate on this.

### 4. Set up the environment

1. Start a MongoDB server on your local machine (or in a Docker container) and ensure it does **not** contain a database named `nmdc`.
2. Create and populate a **notebook configuration file** named `.notebook.env`.
    1. You can use the `.notebook.env.example` file as a template:
       ```shell
       $ cp .notebook.env.example .notebook.env
       ```
3. Create and populate **MongoDB configuration files** for connecting to the origin and transformer MongoDB servers.
    1. You can use the `.mongo.yaml.example` file as a template:
       ```shell
       $ cp .mongo.yaml.example .mongo.origin.yaml
       $ cp .mongo.yaml.example .mongo.transformer.yaml
       ```
       > When populating the file for the origin MongoDB server, use root credentials since this notebook will be manipulating user roles on that server. You can get those root credentials from Rancher.

## Procedure

### Install dependencies

Install the third-party Python packages upon which this notebook depends.

Reference (on the difference between `!` and `%` in Jupyter notebooks): https://saturncloud.io/blog/what-is-the-difference-between-and-in-jupyter-notebooks/

> Note: You may need to restart the notebook kernel to use updated packages.

In [None]:
%pip install -r requirements.txt
%pip install nmdc-schema==8.0.0

Import the Python objects upon which this notebook depends.

> Note: You may need to restart the notebook kernel to use updated packages.

In [None]:
# Standard library packages:
from pathlib import Path
from pprint import pformat
from shutil import rmtree
from tempfile import NamedTemporaryFile

# Third-party packages:
import pymongo
from nmdc_schema.migration_recursion import Migrator

# First-party packages:
from helpers import Config

### Parse configuration files

Parse the notebook and MongoDB configuration files.

In [None]:
cfg = Config()

# Define some aliases we can use to make the shell commands in this notebook easier to read.
mongodump = cfg.mongodump_path
mongorestore = cfg.mongorestore_path

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" MongoDB server (i.e. the one containing the database you want to migrate) and the "transformer" MongoDB server (i.e. the one you want to use to perform the data transformations).

In [None]:
# MongoDB client for origin MongoDB server.
origin_mongo_client = pymongo.MongoClient(host=cfg.origin_mongo_server_uri, directConnection=True)

# MongoDB client for transformer MongoDB server.
transformer_mongo_client = pymongo.MongoClient(host=cfg.transformer_mongo_server_uri)
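
Before restoring anything into the transformer server, it may be worth confirming that it does not already contain a database named `nmdc` (see the prerequisites). Here is a minimal sketch, assuming only that the client exposes `list_database_names()` as `pymongo.MongoClient` does. The helper name `assert_database_is_absent` is hypothetical, and the stub class exists only so the example runs without a server.

```python
def assert_database_is_absent(mongo_client, database_name="nmdc"):
    """Raise an error if the server already has a database with the given name."""
    existing_names = mongo_client.list_database_names()
    if database_name in existing_names:
        raise RuntimeError(f"Database '{database_name}' already exists on this server.")


class StubMongoClient:
    """Stand-in for `pymongo.MongoClient`, so this sketch runs without a server."""

    def __init__(self, database_names):
        self._database_names = list(database_names)

    def list_database_names(self):
        return self._database_names


# A server holding only built-in databases passes the check silently.
assert_database_is_absent(StubMongoClient(["admin", "config", "local"]))
```

In the notebook, you would pass `transformer_mongo_client` instead of the stub.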

### Back up user data from origin MongoDB server

Before temporarily disabling write access to the `nmdc` database on the origin MongoDB server, we will back up the current user data.

> That way, in case something goes wrong later, we can refer to this backup (e.g. to manually restore the original access levels).

In [None]:
result: dict = origin_mongo_client["admin"].command("usersInfo")
users_initial = result["users"]

# Create temporary file in the notebook's folder, containing the initial users.
users_file = NamedTemporaryFile(delete=False, dir=str(Path.cwd()), prefix="tmp.origin_users_initial.")
users_file.write(bytes(pformat(users_initial), "utf-8"))
users_file.close()
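
The file above is written with `pformat`, which is meant for human reading. If you would also like a copy that a script can load back, one option is to write JSON alongside it. This sketch uses only the standard library. The helper names are hypothetical, and `default=str` is an assumption that turns any value `json` cannot encode natively into a string, so such values do not round-trip exactly.

```python
import json
import uuid
from pathlib import Path
from tempfile import TemporaryDirectory


def write_users_backup(users, file_path):
    """Write the users to a JSON file, stringifying values JSON cannot represent."""
    Path(file_path).write_text(json.dumps(users, indent=2, default=str), encoding="utf-8")


def read_users_backup(file_path):
    """Load the users back from the JSON file."""
    return json.loads(Path(file_path).read_text(encoding="utf-8"))


# Made-up user record, loosely shaped like an entry in `users_initial`.
sample_users = [
    {
        "user": "alice",
        "db": "admin",
        "userId": uuid.uuid4(),
        "roles": [{"role": "readWrite", "db": "nmdc"}],
    }
]

with TemporaryDirectory() as folder_path:
    backup_path = Path(folder_path) / "origin_users_initial.json"
    write_users_backup(sample_users, backup_path)
    restored_users = read_users_backup(backup_path)

print(restored_users[0]["roles"])
```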

### Disable non-admin writing to the origin database

To disable non-admin writing to the `nmdc` database on the origin MongoDB server, we will set every user's role (except the root user's) to `read` (i.e. read-only) with respect to that database.

In [None]:
for user in users_initial:

    break  # Abort! TODO: Remove me when I'm ready to run this notebook for real.

    if any((role["db"] == "nmdc") for role in user["roles"]):
        origin_mongo_client["admin"].command("grantRolesToUser", user["user"], roles=[{ "role": "read", "db": "nmdc" }])
        origin_mongo_client["admin"].command("revokeRolesFromUser", user["user"], roles=[{ "role": "readWrite", "db": "nmdc" }])
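
Before removing the `break` statement, it can help to preview which users the loop would touch. Here is a sketch of such a dry run, operating on plain dictionaries shaped like the entries of `users_initial`. The function name and the sample users are made up for illustration.

```python
def get_users_to_make_read_only(users, database_name="nmdc"):
    """List the users having any role on the database, mirroring the condition in the loop above."""
    return [
        user["user"]
        for user in users
        if any(role["db"] == database_name for role in user["roles"])
    ]


# Made-up users, for illustration only.
sample_users = [
    {"user": "root", "roles": [{"role": "root", "db": "admin"}]},
    {"user": "runtime", "roles": [{"role": "readWrite", "db": "nmdc"}]},
    {"user": "viewer", "roles": [{"role": "read", "db": "nmdc"}]},
]

users_to_change = get_users_to_make_read_only(sample_users)
print(users_to_change)
```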

### Dump those collections from the database on the origin MongoDB server

In a previous step, you determined which collections would be transformed during this migration.

Here, you'll dump those collections from the database on the origin MongoDB server.

Since `mongodump` doesn't offer an option to specify more than one collection to dump (it's either one, via the `--collection` option, or all), we use its `--excludeCollection` option multiple times to specify all the collections we _don't_ want to dump.

> If we accidentally dump more collections than necessary, the dump process will take longer than necessary and the dump files will, together, take up more space than necessary. You may be OK with that (I am).

You can get the full list of collections by running the following commands in a MongoDB shell connected to the origin MongoDB server:
```shell
use nmdc;
db.getCollectionNames();
```

In [None]:
!date

# Dump collections from the origin database.
#
# I removed these options from the command because I want to dump these collections.
#   --excludeCollection='biosample_set' \
#   --excludeCollection='study_set' \
#   --excludeCollection='omics_processing_set' \
#   --excludeCollection='extraction_set' \
#
!{mongodump} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --excludeCollection='notes' \
  --excludeCollection='ids_nmdc_fk0' \
  --excludeCollection='fs.files' \
  --excludeCollection='schema_classes' \
  --excludeCollection='capabilities' \
  --excludeCollection='_runtime.api.allow' \
  --excludeCollection='minter.id_records' \
  --excludeCollection='run_events' \
  --excludeCollection='query_runs' \
  --excludeCollection='metagenome_assembly_set' \
  --excludeCollection='object_types' \
  --excludeCollection='metabolomics_analysis_activity_set' \
  --excludeCollection='ids_nmdc_mga0' \
  --excludeCollection='read_qc_analysis_activity_set' \
  --excludeCollection='date_created' \
  --excludeCollection='pooling_set' \
  --excludeCollection='field_research_site_set' \
  --excludeCollection='typecodes' \
  --excludeCollection='read_QC_analysis_activity_set' \
  --excludeCollection='ids_nmdc_sys0' \
  --excludeCollection='material_sample_set' \
  --excludeCollection='triggers' \
  --excludeCollection='nom_analysis_activity_set' \
  --excludeCollection='system.views' \
  --excludeCollection='sites' \
  --excludeCollection='fs.chunks' \
  --excludeCollection='jobs' \
  --excludeCollection='functional_annotation_agg' \
  --excludeCollection='ids_nmdc_gfs0' \
  --excludeCollection='minter.requesters' \
  --excludeCollection='ids' \
  --excludeCollection='workflows' \
  --excludeCollection='requesters' \
  --excludeCollection='operations' \
  --excludeCollection='etl_software_version' \
  --excludeCollection='read_based_analysis_activity_set' \
  --excludeCollection='processed_sample_set' \
  --excludeCollection='minter.schema_classes' \
  --excludeCollection='metap_gene_function_aggregation' \
  --excludeCollection='read_based_taxonomy_analysis_activity_set' \
  --excludeCollection='_tmp__get_file_size_bytes' \
  --excludeCollection='users' \
  --excludeCollection='ids_nmdc_fk4' \
  --excludeCollection='queries' \
  --excludeCollection='metaproteomics_analysis_activity_set' \
  --excludeCollection='txn_log' \
  --excludeCollection='shoulders' \
  --excludeCollection='activity_set' \
  --excludeCollection='library_preparation_set' \
  --excludeCollection='page_tokens' \
  --excludeCollection='data_object_set' \
  --excludeCollection='mags_activity_set' \
  --excludeCollection='metagenome_annotation_activity_set' \
  --excludeCollection='file_type_enum' \
  --excludeCollection='id_records' \
  --excludeCollection='metatranscriptome_activity_set' \
  --excludeCollection='metagenome_sequencing_activity_set' \
  --excludeCollection='collecting_biosamples_from_site_set' \
  --excludeCollection='services' \
  --excludeCollection='nmdc_schema_version' \
  --excludeCollection='objects' \
  --excludeCollection='ids_nmdc_mta0' \
  --excludeCollection='minter.services' \
  --excludeCollection='_runtime.healthcheck' \
  --excludeCollection='minter.typecodes' \
  --excludeCollection='minter.shoulders' \
  --excludeCollection='EMP_soil_project_run_counts' \
  --out="{cfg.origin_dump_folder_path}"

!date
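
Maintaining that list of `--excludeCollection` options by hand is error-prone. As an alternative, the options could be derived from the full list of collection names. Here is a sketch under that assumption. `build_exclude_options` is a hypothetical helper (not part of `helpers.py`), and the collection names below are a small made-up subset.

```python
import shlex


def build_exclude_options(all_collection_names, collection_names_to_dump):
    """Build one `--excludeCollection` option per collection we do not want to dump."""
    names_to_exclude = sorted(set(all_collection_names) - set(collection_names_to_dump))
    return [f"--excludeCollection={shlex.quote(name)}" for name in names_to_exclude]


all_collection_names = ["biosample_set", "study_set", "jobs", "fs.files", "notes"]
collection_names_to_dump = ["extraction_set", "omics_processing_set", "biosample_set", "study_set"]

exclude_options = build_exclude_options(all_collection_names, collection_names_to_dump)
print(" ".join(exclude_options))
```

In the notebook, `all_collection_names` could come from `origin_mongo_client["nmdc"].list_collection_names()`.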

### Restore the dump into the transformer MongoDB server

Load the collections contained in the dump (dumped from the origin MongoDB server) into the transformer MongoDB server, so we can start transforming them.

In [None]:
# Restore the dumped collections to the transformer MongoDB server.
!{mongorestore} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --gzip \
  --nsInclude="nmdc.extraction_set" \
  --nsInclude="nmdc.omics_processing_set" \
  --nsInclude="nmdc.biosample_set" \
  --nsInclude="nmdc.study_set" \
  --drop --preserveUUID \
  --dir="{cfg.origin_dump_folder_path}"

### Transform the database

Now that the transformer database contains a copy of each relevant collection from the origin database, we can transform those copies.

The transformation functions are provided by the `nmdc-schema` Python package.
> You can examine the transformation functions at: https://github.com/microbiomedata/nmdc-schema/blob/main/nmdc_schema/migration_recursion.py

In this step, we will retrieve each document from each collection, pass it through the transformation function(s) associated with that collection, then store the transformed document in place of the original one, all within the transformer database only. **The origin database is not involved in this step.**

In [None]:
migrator = Migrator()

# Define a mapping from collection name to transformation function(s).
# TODO: Consider defining this mapping in the `nmdc-schema` repository/package instead.
transformation_pipelines = dict(
    extraction_set=[migrator.migrate_extractions_7_8_0_to_8_0_0],
    omics_processing_set=[migrator.migrate_uc_gold_sequencing_project_identifiers_7_8_0_to_8_0_0],
    biosample_set=[migrator.migrate_uc_gold_biosample_identifiers_7_8_0_to_8_0_0],
    study_set=[migrator.migrate_uc_gold_study_identifiers_7_8_0_to_8_0_0],
)

# Apply the transformations.
for collection_name, transformation_pipeline in transformation_pipelines.items():
    print(f"Transforming documents in collection: {collection_name}")
    transformed_documents = []

    # Get each document from this collection.
    collection = transformer_mongo_client["nmdc"][collection_name]
    for original_document in collection.find():
        
        # Put the document through the transformation pipeline associated with this collection.
        print(original_document)
        transformed_document = original_document  # initializes the variable
        for transformation_function in transformation_pipeline:
            transformed_document = transformation_function(transformed_document)

        # Store the transformed document.
        print(transformed_document)
        print("")
        transformed_documents.append(transformed_document)

    # Replace the original documents with the transformed versions of themselves (in the transformer database).
    for transformed_document in transformed_documents:
        collection.replace_one({"id": {"$eq": transformed_document["id"]}}, transformed_document)
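
The inner loop above amounts to function composition: each function in a pipeline receives the output of the previous one. Here is a sketch of that idea on an in-memory dictionary, with no database involved. The two transformation functions are invented for illustration and are unrelated to the real `Migrator` methods.

```python
from functools import reduce


def apply_pipeline(document, pipeline):
    """Pass the document through each function in the pipeline, in order."""
    return reduce(lambda current, transform: transform(current), pipeline, document)


# Invented transformation functions (not from `nmdc-schema`).
def rename_name_to_title(document):
    transformed = dict(document)  # copy, so the original stays untouched
    transformed["title"] = transformed.pop("name")
    return transformed


def add_schema_version(document):
    return {**document, "schema_version": "8.0.0"}


original_document = {"id": "example:1", "name": "Example study"}
transformed_document = apply_pipeline(original_document, [rename_name_to_title, add_schema_version])
print(transformed_document)
```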


### Validate the transformed database

In [None]:
# TODO
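
Until this step is filled in, one lightweight sanity check is to confirm that the transformation neither dropped nor introduced any `id` values. This is not schema validation, only a sketch of a cheap consistency check. The function name and sample documents are made up.

```python
def compare_ids(original_documents, transformed_documents):
    """Return the ids that went missing and the ids that appeared unexpectedly."""
    original_ids = {document["id"] for document in original_documents}
    transformed_ids = {document["id"] for document in transformed_documents}
    return {
        "missing": sorted(original_ids - transformed_ids),
        "unexpected": sorted(transformed_ids - original_ids),
    }


# Made-up documents, for illustration only.
original_documents = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
transformed_documents = [{"id": "a"}, {"id": "b"}, {"id": "d"}]

id_report = compare_ids(original_documents, transformed_documents)
print(id_report)
```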

### Dump the transformed database

In [None]:
# Dump the database from the transformer MongoDB server.
!{mongodump} \
  --config="{cfg.transformer_mongo_config_file_path}" \
  --db="nmdc" \
  --gzip \
  --out="{cfg.transformer_dump_folder_path}"

### Put the transformed data into the origin MongoDB server

Ensure the command below includes an [`--nsInclude` option](https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--nsInclude) for each transformed collection.

In this step, you'll put the transformed collection(s) into the origin MongoDB server, replacing the original collection(s) that has/have the same name(s).

In [None]:
!date

# Replace the same-named collection(s) on the origin server, with the transformed one(s).
!{mongorestore} \
  --config="{cfg.origin_mongo_config_file_path}" \
  --gzip \
  --verbose \
  --dir="{cfg.transformer_dump_folder_path}" \
  --nsInclude="nmdc.extraction_set" \
  --nsInclude="nmdc.omics_processing_set" \
  --nsInclude="nmdc.biosample_set" \
  --nsInclude="nmdc.study_set" \
  --drop --preserveUUID

!date

Now that we've restored the database, we'll restore the original user roles (with respect to the `nmdc` database).

In [None]:
for user in users_initial:

    break  # Abort! TODO: Remove me when I'm ready to run this notebook for real.

    if any((role["db"] == "nmdc" and role["role"] == "readWrite") for role in user["roles"]):
        origin_mongo_client["admin"].command("grantRolesToUser", user["user"], roles=[{ "role": "readWrite", "db": "nmdc" }])
        origin_mongo_client["admin"].command("revokeRolesFromUser", user["user"], roles=[{ "role": "read", "db": "nmdc" }])

### (Optional) Clean up

Delete the temporary files and MongoDB dumps created by this notebook.

> Note: You can skip this step, in case you want to delete them manually later (e.g. to examine them before deleting them).

In [None]:
paths_to_files_to_delete = [
    users_file.name,
]

paths_to_folders_to_delete = [
    cfg.origin_dump_folder_path,
    cfg.transformer_dump_folder_path,
]

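# Delete files.
for path in [Path(string) for string in paths_to_files_to_delete]:
    try:
        path.unlink()
        print(f"Deleted: {path}")
    except Exception as error:  # avoid a bare `except`, which would also swallow KeyboardInterrupt
        print(f"Failed to delete: {path} ({error})")

# Delete folders.
for path in [Path(string) for string in paths_to_folders_to_delete]:
    try:
        rmtree(path)
        print(f"Deleted: {path}")
    except Exception as error:  # avoid a bare `except`, which would also swallow KeyboardInterrupt
        print(f"Failed to delete: {path} ({error})")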