# Migrate from v7.7.2 to v7.8.0

## Prerequisites

1. Start a MongoDB server on your local machine (or in a Docker container) and ensure it does **not** contain a database named `nmdc`.
2. Create a file named `.notebook.env` in the same folder as this notebook.
    - You can copy the `.notebook.env.example` file as a starting point.
3. Customize the values in the `.notebook.env` file to reflect your situation.
    - For the origin MongoDB server, use root credentials, since this notebook will be manipulating user roles.
4. Run the cells in this notebook in order.

## Procedure

Install the third-party Python packages upon which this notebook depends.

In [None]:
!python -m pip install pymongo python-dotenv

Import the standard and third-party Python packages upon which this notebook depends.

In [None]:
from pprint import pformat
from pathlib import Path
from tempfile import NamedTemporaryFile
import re

from dotenv import dotenv_values
import pymongo

Load the notebook configuration parameters from the `.notebook.env` file.

In [None]:
cfg_file_path = "./.notebook.env"
if not Path(cfg_file_path).is_file():
    raise FileNotFoundError("Config file not found.")

cfg = dotenv_values(cfg_file_path)

origin_mongo_username: str = cfg["ORIGIN_MONGO_USER"]
origin_mongo_password: str = cfg["ORIGIN_MONGO_PASS"]
origin_mongo_host: str = cfg["ORIGIN_MONGO_HOST"]
origin_mongo_port: int = int(cfg["ORIGIN_MONGO_PORT"])

transformer_mongo_username: str = cfg["TRANSFORMER_MONGO_USER"]
transformer_mongo_password: str = cfg["TRANSFORMER_MONGO_PASS"]
transformer_mongo_host: str = cfg["TRANSFORMER_MONGO_HOST"]
transformer_mongo_port: int = int(cfg["TRANSFORMER_MONGO_PORT"])

mongodump: str = cfg["PATH_TO_MONGODUMP_BINARY"]
mongorestore: str = cfg["PATH_TO_MONGORESTORE_BINARY"]
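
With the cell above, a key that is missing from `.notebook.env` surfaces as a bare `KeyError`, one key at a time. As a minimal sketch, assuming all ten keys read above are required, a helper could report every missing or empty key at once. The name `find_missing_keys` and the sample values are hypothetical and not part of this notebook.

```python
from typing import Mapping, Optional

# Keys this notebook reads from `.notebook.env` (taken from the cell above).
REQUIRED_KEYS = (
    "ORIGIN_MONGO_USER",
    "ORIGIN_MONGO_PASS",
    "ORIGIN_MONGO_HOST",
    "ORIGIN_MONGO_PORT",
    "TRANSFORMER_MONGO_USER",
    "TRANSFORMER_MONGO_PASS",
    "TRANSFORMER_MONGO_HOST",
    "TRANSFORMER_MONGO_PORT",
    "PATH_TO_MONGODUMP_BINARY",
    "PATH_TO_MONGORESTORE_BINARY",
)


def find_missing_keys(cfg: Mapping[str, Optional[str]]) -> list:
    """Return the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not cfg.get(key)]


# Example with a deliberately incomplete configuration (placeholder values).
example_cfg = {key: "placeholder" for key in REQUIRED_KEYS}
del example_cfg["ORIGIN_MONGO_PORT"]
example_cfg["TRANSFORMER_MONGO_PASS"] = ""

missing = find_missing_keys(example_cfg)
print(missing)
```

In the notebook, the helper could be called as `find_missing_keys(cfg)` right after `dotenv_values(...)`, raising an error if the returned list is not empty.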

Generate MongoDB configuration files.

You'll use these files with `mongodump` and `mongorestore` so that the associated CLI commands don't contain the passwords in plain text.

In [None]:
# Create temporary file in the notebook's folder, containing the origin MongoDB password.
origin_mongo_config_file = NamedTemporaryFile(delete=False, dir=str(Path.cwd()), prefix="tmp.origin_mongo_config.")
origin_mongo_config_file.write(bytes(f"password: {origin_mongo_password}", "utf-8"))
origin_mongo_config_file.close()
origin_mongo_config_file_path: str = origin_mongo_config_file.name

# Create temporary file in the notebook's folder, containing the transformer MongoDB password.
transformer_mongo_config_file = NamedTemporaryFile(delete=False, dir=str(Path.cwd()), prefix="tmp.transformer_mongo_config.")
transformer_mongo_config_file.write(bytes(f"password: {transformer_mongo_password}", "utf-8"))
transformer_mongo_config_file.close()
transformer_mongo_config_file_path: str = transformer_mongo_config_file.name
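
The cell above writes each password verbatim as a YAML value, so a password containing characters that are special in YAML (for example `#` or a leading quote) might not be parsed as intended. The following is a minimal sketch of one way to quote the value, assuming the tools' YAML parser accepts a JSON-style double-quoted string (YAML 1.2 is designed as a superset of JSON). The helper name `write_password_config` and the sample password are hypothetical.

```python
import json
from pathlib import Path
from tempfile import NamedTemporaryFile, TemporaryDirectory


def write_password_config(password: str, folder: str, prefix: str) -> str:
    """Write a YAML `password` entry to a new temp file and return the file's path."""
    # `json.dumps` emits a double-quoted, escaped string. Assumption: the
    # parser that reads the config file treats it as a YAML double-quoted scalar.
    line = f"password: {json.dumps(password)}\n"
    with NamedTemporaryFile(
        mode="w", encoding="utf-8", delete=False, dir=folder, prefix=prefix
    ) as f:
        f.write(line)
        return f.name


# Example: a made-up password containing characters that are special in YAML.
# A throwaway folder is used so that nothing is left behind.
with TemporaryDirectory() as demo_folder:
    path = write_password_config('p#ss: "word"', demo_folder, "tmp.origin_mongo_config.")
    written = Path(path).read_text(encoding="utf-8")

print(written)
```

The same helper would serve both the origin and the transformer configuration files, differing only in the `prefix` argument.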

### Create MongoDB clients

Create MongoDB clients you can use to access the "origin" MongoDB server (i.e. the one containing the database you want to migrate) and the "transformer" MongoDB server (i.e. the one you want to use to perform the data transformations).

In [None]:
# MongoDB client for origin MongoDB server.
origin_mongo_client = pymongo.MongoClient(
    username=origin_mongo_username,
    password=origin_mongo_password,
    host=origin_mongo_host,
    port=origin_mongo_port,
    directConnection=True,
)

# MongoDB client for transformer MongoDB server.
transformer_mongo_client = pymongo.MongoClient(
    username=transformer_mongo_username,
    password=transformer_mongo_password,
    host=transformer_mongo_host,
    port=transformer_mongo_port,
)

### Disable writing to the origin MongoDB database

To disable writing to the database, I will eventually set all users' roles (except the admin user) to `read` (i.e. read-only) with respect to the database. Before I carry out that plan, though, I will store the original users for future reference (so I can restore their original roles later).

Note: `pymongo` does not offer [`db.getUsers()`](https://www.mongodb.com/docs/manual/reference/method/db.getUsers/).

In [None]:
result: dict = origin_mongo_client["admin"].command("usersInfo")
users_initial = result["users"]

# Create temporary file in the notebook's folder, containing the initial users.
users_file = NamedTemporaryFile(delete=False, dir=str(Path.cwd()), prefix="tmp.origin_users_initial.")
users_file.write(bytes(pformat(users_initial), "utf-8"))
users_file.close()

Now that I've stored their original roles, I'll convert every `readWrite` role (with respect to the `nmdc` database) into just plain `read`.

In [None]:
for user in users_initial:

    break  # Abort! TODO: Remove me when I'm ready to run this notebook for real.

    if any((role["db"] == "nmdc" and role["role"] == "readWrite") for role in user["roles"]):
        origin_mongo_client["admin"].command("grantRolesToUser", user["user"], roles=[{ "role": "read", "db": "nmdc" }])
        origin_mongo_client["admin"].command("revokeRolesFromUser", user["user"], roles=[{ "role": "readWrite", "db": "nmdc" }])
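
The selection logic can be tried out without touching a server. Here is a minimal sketch, assuming each entry of the `users` list returned by `usersInfo` has a `user` name and a `roles` list of `{"role": ..., "db": ...}` dictionaries, as the cell above relies on. The function name `users_with_role` and the sample users are hypothetical.

```python
def users_with_role(users: list, role_name: str, db_name: str) -> list:
    """Return the names of users holding `role_name` on `db_name`."""
    return [
        user["user"]
        for user in users
        if any(r["role"] == role_name and r["db"] == db_name for r in user["roles"])
    ]


# Hypothetical sample, shaped like the `users` list returned by `usersInfo`.
sample_users = [
    {"user": "alice", "roles": [{"role": "readWrite", "db": "nmdc"}]},
    {"user": "bob", "roles": [{"role": "read", "db": "nmdc"}]},
    {"user": "carol", "roles": [{"role": "readWrite", "db": "other"}]},
    {"user": "root", "roles": [{"role": "root", "db": "admin"}]},
]

to_downgrade = users_with_role(sample_users, "readWrite", "nmdc")
print(to_downgrade)
```

Printing the selected names before issuing any `grantRolesToUser` or `revokeRolesFromUser` command gives a chance to review which users would be affected.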

### Dump the necessary collections from the origin database

In this case, I'll dump the `study_set` collection only.

References:
- https://www.mongodb.com/docs/database-tools/mongodump/
- https://www.mongodb.com/docs/database-tools/mongodump/#std-option-mongodump.--config (`--config` option)

In [None]:
origin_dump_folder_path = "./mongodump.origin.out"

# Dump the `study_set` collection of the `nmdc` database from the origin MongoDB server.
!{mongodump} \
  --config="{origin_mongo_config_file_path}" \
  --host="{origin_mongo_host}" \
  --port="{origin_mongo_port}" \
  --authenticationDatabase="admin" \
  --username="{origin_mongo_username}" \
  --db="nmdc" \
  --gzip \
  --collection="study_set" \
  --out="{origin_dump_folder_path}"
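
The cell above interpolates values into a shell command. As an alternative sketch, the same invocation can be expressed as an argument list for `subprocess.run`, so that values containing spaces or shell metacharacters are passed through unchanged. The flags are the ones used above. The function name `build_mongodump_args` and the sample values are hypothetical, and the example only builds and prints the command without running it.

```python
import shlex


def build_mongodump_args(binary, config_path, host, port, username, db, collection, out_dir):
    """Build the argument list for the `mongodump` invocation used in this notebook."""
    return [
        binary,
        f"--config={config_path}",
        f"--host={host}",
        f"--port={port}",
        "--authenticationDatabase=admin",
        f"--username={username}",
        f"--db={db}",
        "--gzip",
        f"--collection={collection}",
        f"--out={out_dir}",
    ]


# Placeholder values, for illustration only.
args = build_mongodump_args(
    binary="mongodump",
    config_path="./tmp.origin_mongo_config.example",
    host="localhost",
    port=27017,
    username="admin",
    db="nmdc",
    collection="study_set",
    out_dir="./mongodump.origin.out",
)

# Show the command in a copy-pasteable, shell-quoted form.
print(shlex.join(args))

# To execute it for real: import subprocess; subprocess.run(args, check=True)
```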

### Restore the database into the transformer MongoDB server

References:
- https://www.mongodb.com/docs/database-tools/mongorestore/
- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--config (`--config` option)
- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--drop (`--drop` to drop the existing collection)
- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--preserveUUID (`--preserveUUID` to use the existing UUIDs from the dump)

In [None]:
# Restore the database to the transformer MongoDB server.
!{mongorestore} \
  --config="{transformer_mongo_config_file_path}" \
  --host="{transformer_mongo_host}" \
  --port="{transformer_mongo_port}" \
  --username="{transformer_mongo_username}" \
  --gzip \
  --drop --preserveUUID \
  --dir="{origin_dump_folder_path}"

### Transform the database

Now that the transformer database contains a copy of the data dumped from the origin database, I can transform it there.

Source: https://github.com/microbiomedata/nmdc-schema/blob/13acf18c9e3b92b39bf67db9d17c66f190575c9d/nmdc_schema/migration_recursion.py#L21C1-L36C27
- Replaced `logger` calls with `print` calls
- Removed unused CURIE regex pattern
- Removed commented-out line
- Added import for `re`

References:
- https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.replace_one

In [None]:

# <copy_pasted_snippet from="https://github.com/microbiomedata/nmdc-schema/blob/13acf18c9e3b92b39bf67db9d17c66f190575c9d/nmdc_schema/migration_recursion.py#L21C1-L36C27">
doi_url_pattern = r'^https?:\/\/[a-zA-Z\.]+\/10\.'

def migrate_studies_7_7_2_to_7_8(retrieved_study):
    print(f"Starting migration of {retrieved_study['id']}")
    if "doi" in retrieved_study:
        match = re.search(doi_url_pattern, retrieved_study["doi"]['has_raw_value'])
        if match:
            start_index = match.end()
            as_curie = f"doi:10.{retrieved_study['doi']['has_raw_value'][start_index:]}"
            retrieved_study["award_dois"] = [as_curie]
        del retrieved_study["doi"]
    return retrieved_study
# </copy_pasted_snippet>
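
To make the effect of the pattern concrete, here is a minimal sketch that applies the same regular expression to a single value. The helper `doi_url_to_curie` is a condensed, hypothetical restatement for illustration only, and the DOI values are made up.

```python
import re

# Same pattern as in the snippet above.
doi_url_pattern = r'^https?:\/\/[a-zA-Z\.]+\/10\.'


def doi_url_to_curie(raw_value: str):
    """Return the `doi:10.` CURIE for a DOI URL, or None if the value doesn't match."""
    match = re.search(doi_url_pattern, raw_value)
    if match is None:
        return None
    # Keep everything after the matched "<scheme>://<host>/10." prefix.
    return f"doi:10.{raw_value[match.end():]}"


# Made-up DOI values, for illustration only.
print(doi_url_to_curie("https://doi.org/10.1234/example"))  # doi:10.1234/example
print(doi_url_to_curie("10.1234/example"))  # None
```

Note that in `migrate_studies_7_7_2_to_7_8`, the `doi` field is deleted whether or not its raw value matches the pattern, so a study whose value does not match ends up with neither `doi` nor `award_dois`.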



In [None]:
# Make a transformed version of each study in the transformer database.
transformed_studies = []
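for study in transformer_mongo_client["nmdc"]["study_set"].find():
    # Print the original first. The migration function modifies `study` in place,
    # so printing it afterwards would show the transformed version twice.
    print(study)
    transformed_study = migrate_studies_7_7_2_to_7_8(study)
    transformed_studies.append(transformed_study)
    print(transformed_study)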

# Replace the original versions with the transformed versions of themselves (in the transformer database).
for transformed_study in transformed_studies:
    transformer_mongo_client["nmdc"]["study_set"].replace_one({"id": {"$eq": transformed_study["id"]}}, transformed_study)


### Validate the transformed database

In [None]:
# TODO
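
The validation step is left as a TODO above. As a stand-in sketch (not a substitute for validating against the schema), the two post-conditions implied by the transformation can be checked directly: no study still has a `doi` field, and every `award_dois` entry begins with `doi:10.`. The function name `find_invalid_studies` and the sample documents are hypothetical.

```python
def find_invalid_studies(studies) -> list:
    """Return (study id, reason) pairs for studies that violate the post-conditions."""
    problems = []
    for study in studies:
        if "doi" in study:
            problems.append((study["id"], "still has a `doi` field"))
        for value in study.get("award_dois", []):
            if not value.startswith("doi:10."):
                problems.append((study["id"], f"unexpected award DOI: {value}"))
    return problems


# Made-up documents, for illustration only.
sample_studies = [
    {"id": "study-1", "award_dois": ["doi:10.1234/example"]},
    {"id": "study-2", "doi": {"has_raw_value": "https://doi.org/10.1234/example"}},
    {"id": "study-3", "award_dois": ["https://doi.org/10.1234/example"]},
]

problems = find_invalid_studies(sample_studies)
print(problems)
```

In the notebook, the function could be given `transformer_mongo_client["nmdc"]["study_set"].find()` as its argument, with an empty result indicating that both checks passed.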

### Dump the transformed database

In [None]:
transformer_dump_folder_path = "./mongodump.transformer.out"

# Dump the database from the transformer MongoDB server.
!{mongodump} \
  --config="{transformer_mongo_config_file_path}" \
  --host="{transformer_mongo_host}" \
  --port="{transformer_mongo_port}" \
  --authenticationDatabase="admin" \
  --username="{transformer_mongo_username}" \
  --db="nmdc" \
  --gzip \
  --out="{transformer_dump_folder_path}"

### Put the transformed data into the origin MongoDB server

In the case of this migration, given how focused the transformation was (i.e. only the `study_set` collection was affected), I will restore **only** the `study_set` collection to the origin server.

References:
- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--nsInclude (`--nsInclude` to specify which collections to affect)
- https://www.mongodb.com/docs/database-tools/mongorestore/#std-option-mongorestore.--dryRun (`--dryRun` can be used to preview the outcome)

In [None]:
# Drop the original `study_set` collection from the origin server,
# and restore the transformed `study_set` collection into its place.
!{mongorestore} \
  --config="{origin_mongo_config_file_path}" \
  --host="{origin_mongo_host}" \
  --port="{origin_mongo_port}" \
  --username="{origin_mongo_username}" \
  --gzip \
  --verbose \
  --dir="{transformer_dump_folder_path}" \
  --nsInclude="nmdc.study_set" \
  --drop --preserveUUID

Now that I've restored the database, I'll restore the original user roles (with respect to the `nmdc` database).

In [None]:
for user in users_initial:

    break  # Abort! TODO: Remove me when I'm ready to run this notebook for real.

    if any((role["db"] == "nmdc" and role["role"] == "readWrite") for role in user["roles"]):
        origin_mongo_client["admin"].command("grantRolesToUser", user["user"], roles=[{ "role": "readWrite", "db": "nmdc" }])
        origin_mongo_client["admin"].command("revokeRolesFromUser", user["user"], roles=[{ "role": "read", "db": "nmdc" }])
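
After restoring the roles, it may be worth confirming that every user ended up with the roles they started with. This is a minimal sketch, assuming a second `usersInfo` result is fetched after the restore and compared against `users_initial`. The function names and the sample users are hypothetical.

```python
def role_set(user: dict) -> set:
    """Return a user's roles as a set of (role, db) pairs."""
    return {(r["role"], r["db"]) for r in user["roles"]}


def find_role_differences(users_before: list, users_after: list) -> dict:
    """Map each user name to (roles before, roles after) where the two differ."""
    after_by_name = {u["user"]: role_set(u) for u in users_after}
    differences = {}
    for user in users_before:
        before = role_set(user)
        after = after_by_name.get(user["user"], set())
        if before != after:
            differences[user["user"]] = (before, after)
    return differences


# Hypothetical samples: alice's role was not restored, bob's roles are unchanged.
before = [
    {"user": "alice", "roles": [{"role": "readWrite", "db": "nmdc"}]},
    {"user": "bob", "roles": [{"role": "read", "db": "nmdc"}]},
]
after = [
    {"user": "alice", "roles": [{"role": "read", "db": "nmdc"}]},
    {"user": "bob", "roles": [{"role": "read", "db": "nmdc"}]},
]

differences = find_role_differences(before, after)
print(differences)
```

An empty dictionary would indicate that the roles match the ones recorded at the start of the procedure.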

### About db.fsyncLock() and db.fsyncUnlock()

I chose not to use `db.fsyncLock()`/`db.fsyncUnlock()` to disable and re-enable write access, because I want to be able to `mongorestore` a database while write access is still disabled. `db.fsyncLock()` would have disabled write access at the `mongod` level, preventing database-level write operations (while still allowing a system administrator to back up database **files** via `cp`, `scp`, `tar`, etc.).

Reference: https://www.mongodb.com/docs/manual/reference/method/db.fsyncLock/#mongodb-method-db.fsyncLock

## Clean up

You may want to manually delete the `tmp.*` files that this notebook created in its folder. Some of them contain MongoDB passwords.
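
The temporary files were created with the prefixes `tmp.origin_mongo_config.`, `tmp.transformer_mongo_config.`, and `tmp.origin_users_initial.`. As a minimal sketch, the deletion could be scripted by matching only those prefixes. The function name `delete_temp_files` is hypothetical, and the demonstration runs in a throwaway folder so that no real files are touched.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# Prefixes passed to `NamedTemporaryFile` earlier in this notebook.
TEMP_FILE_PREFIXES = (
    "tmp.origin_mongo_config.",
    "tmp.transformer_mongo_config.",
    "tmp.origin_users_initial.",
)


def delete_temp_files(folder: Path) -> list:
    """Delete files in `folder` whose names start with a known prefix; return their names."""
    deleted = []
    for prefix in TEMP_FILE_PREFIXES:
        for path in sorted(folder.glob(prefix + "*")):
            if path.is_file():
                path.unlink()
                deleted.append(path.name)
    return deleted


# Demonstration in a throwaway folder with one matching and one unrelated file.
with TemporaryDirectory() as demo:
    demo_folder = Path(demo)
    (demo_folder / "tmp.origin_mongo_config.abc123").write_text("password: example")
    (demo_folder / "keep.txt").write_text("unrelated")
    deleted = delete_temp_files(demo_folder)
    remaining = sorted(p.name for p in demo_folder.iterdir())

print(deleted, remaining)
```

In the notebook, the call would be `delete_temp_files(Path.cwd())`, run only after the final `mongorestore` has completed, since the configuration files are still needed until then.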