# TOC <a id="TOC" />
* [setup notebook](#setup)
* [update values](#update-values)
* [insert items](#insert-items)
* [remove items](#remove-items)
* [remove](#remove-property)

# Setup notebook <a id="setup" />
[TOC](#TOC)

In [8]:
# autoreload modules (useful for testing)
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
# load dependencies
import json
import pandas as pd
from dotenv import dotenv_values
from pymongo import MongoClient
from pymongo.database import Database as MongoDatabase
from nmdc_runtime.api.core.metadata import load_changesheet, update_mongo_db, mongo_update_command_for, copy_docs_in_update_cmd

In [35]:
# load mongodb via env info
config = dotenv_values("../../.env.localhost")
config["MONGO_HOST"]

# create mongo client
client = MongoClient(host=config["MONGO_HOST"], username=config["MONGO_USERNAME"], password=config["MONGO_PASSWORD"])
mongodb = client["nmdc"]

In [36]:
# create a fresh temp database for testing (any existing temp_db is dropped first)
if "temp_db" in client.list_database_names():
    client.drop_database("temp_db")
temp_db = client["temp_db"]

In [38]:
# helper functions

# wraps the mongo_update_command_for and update_mongo_db into
# a single function to process the change sheet
def process_changesheet(changeDf, mdb: MongoDatabase, temp_db: MongoDatabase, print_update_cmd=False):
    update_cmd = mongo_update_command_for(changeDf)
    
    # used for debugging 
    if print_update_cmd:
        for id_, cmd in update_cmd.items():
            print('id:', id_)
            print(cmd)
            print('\n')
            
    copy_docs_in_update_cmd(update_cmd, mdb, temp_db)
    return update_mongo_db(temp_db, update_cmd)

# prints the before/after documents and validation errors for each change sheet result
def print_results(results, print_before=True, print_after=True, print_errors=True):
    for result in results:
        print(f"\n============== {result['id']} ==============")
        if print_before:
            print("------------------ BEFORE ------------------")
            print(json.dumps(result["doc_before"], indent=2))
        if print_after:
            print("------------------ AFTER ------------------")
            print(json.dumps(result["doc_after"], indent=2))
        if print_errors:
            print("------------------ ERRORS ------------------")
            print("\n".join(result["validation_errors"]))

In [39]:
# set dataframe display options
pd.set_option("display.max_columns", None)
pd.set_option('display.width', 1000)

# Update values <a id="update-values" />
Replaces the existing value of an attribute with the specified new value or values.  
A list of values is specified using the pipe ("|") as a delimiter between the values.  
If the attribute does not already exist, it will be created. However, the attribute must exist in the schema.  
The following actions are synonyms:
* update
* replace
* replace items
* replace item
* set  
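
The replace behavior described above can be sketched on a plain dictionary. This is only an illustration, not the `nmdc_runtime` implementation: the helper names `split_values` and `set_attribute` are hypothetical, schema validation is left out, and treating a cell without a pipe as a single scalar value is an assumption made for the sketch.

```python
from typing import Any


def split_values(raw: str) -> Any:
    """Split a pipe-delimited cell into a list; return a lone value unchanged.

    Assumption: a cell without a pipe is treated as a scalar here. The real
    loader may decide this from the schema instead.
    """
    parts = [part.strip() for part in raw.split("|")]
    return parts if len(parts) > 1 else parts[0]


def set_attribute(doc: dict, attribute: str, raw_value: str) -> dict:
    """Replace (or create) the value at a dotted attribute path, in place."""
    *parents, leaf = attribute.split(".")
    target = doc
    for key in parents:
        # create intermediate objects when the path does not exist yet
        target = target.setdefault(key, {})
    target[leaf] = split_values(raw_value)
    return doc


study = {"id": "gold:Gs0135149", "description": "old description"}
set_attribute(study, "description", "NEW DESCRIPTION")
set_attribute(study, "ess_dive_datasets", "doi:ESS_DIVE_1|doi:ESS_DIVE_2")
set_attribute(study, "principal_investigator.orcid", "orcid:0000-0000-0000-0000")
print(study)
```

The dotted path mirrors attributes such as `principal_investigator.orcid` in the change sheet, where the part before the dot names a nested object.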


[TOC](#TOC)

In [25]:
pd.read_csv("data/changesheet-replace.tsv", sep="\t", dtype="string").fillna('')

Unnamed: 0,id,action,attribute,value
0,gold:Gs0135149,update,description,NEW DESCRIPTION
1,,update,ess_dive_datasets,doi:ESS_DIVE_1|doi:ESS_DIVE_2
2,,update,principal_investigator.orcid,orcid:0000-0000-0000-0000
3,,update,principal_investigator.websites,https://www.ornl.gov/staff-profile/WEBSITE
4,gold:Gs0103573,update,principal_investigator.has_raw_value,NEW PI NAME
5,,update,principal_investigator.name,NEW PI NAME
6,,update,principal_investigator.profile_image_url,https://portal.nersc.gov/NEW-PI-NAME.jpg
7,,update,principal_investigator.websites,https://WEBSITE_1|WEBSITE_2
8,gold:Gs0114675,update,has_credit_associations,ca1
9,ca1,update,applied_role,Conceptualization


In [28]:
sheetDf = load_changesheet("data/changesheet-replace.tsv", mongodb)
# sheetDf

In [31]:
print_results(process_changesheet(sheetDf, mongodb, temp_db), print_before=True)


------------------ BEFORE ------------------
{
  "id": "gold:Gs0103573",
  "name": "Populus root and rhizosphere microbial communities from Tennessee, USA",
  "description": "This study is part of the Plant-Microbe Interfaces Science Focus Area, which aims to gain a deeper understanding of the diversity and functioning of mutually beneficial interactions between plants and microbes in the rhizosphere. Ongoing efforts focus on characterizing and interpreting such interfaces using systems comprising plants and microbes, in particular the poplar tree (Populus) and its microbial community in the context of favorable plant microbe interactions.",
  "ecosystem": "Host-associated",
  "ecosystem_category": "Plants",
  "ecosystem_type": "Unclassified",
  "ecosystem_subtype": "Unclassified",
  "specific_ecosystem": "Unclassified",
  "principal_investigator": {
    "has_raw_value": "Mitchel J. Doktycz",
    "name": "Mitchel J. Doktycz",
    "profile_image_url": "https://portal.nersc.gov/project/

# Insert items <a id="insert-items" />
Inserts one or more new items into a list.  
Use the pipe ("|") delimiter to insert multiple values.  
If the property does not exist, it will be created.  
The following actions are synonyms:
* insert
* insert item
* insert items
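
As a rough sketch of the insert behavior on a plain dictionary (not the `nmdc_runtime` code): the helper `insert_items` is a hypothetical name, and simply appending the new items without checking for duplicates is an assumption, since the text above does not say how duplicates are handled.

```python
def insert_items(doc: dict, attribute: str, raw_value: str) -> dict:
    """Append pipe-delimited items to a list attribute, creating it if missing."""
    new_items = [item.strip() for item in raw_value.split("|")]
    # setdefault creates the list when the property does not exist yet
    current = doc.setdefault(attribute, [])
    # assumption: items are appended as-is, with no duplicate check
    current.extend(new_items)
    return doc


study = {"id": "gold:Gs0135149", "websites": ["http://EXISTING_SITE"]}
insert_items(study, "websites", "http://WEBSITE_1|http://WEBSITE_2")
insert_items(study, "publications", "PUBLICATION 1")
print(study)
```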

[TOC](#TOC)

In [32]:
pd.read_csv("data/changesheet-insert.tsv", sep="\t", dtype="string").fillna('')

Unnamed: 0,id,action,attribute,value
0,gold:Gs0135149,insert items,ess_dive_datasets,doi:ESS_DIVE_1|doi:ESS_DIVE_2
1,,insert items,websites,http://WEBSITE_1|http://WEBSITE_2
2,,insert items,publications,PUBLICATION 1
3,gold:Gs0114675,insert items,has_credit_associations,ca1
4,ca1,,applied_role,Conceptualization
5,ca1,,applies_to_person.name,CREDIT NAME 1
6,ca1,,applies_to_person.email,CREDIT_NAME_1@foo.edu
7,ca1,,applies_to_person.orcid,orcid:0000-0000-0000-0001


In [33]:
sheetDf = load_changesheet("data/changesheet-insert.tsv", mongodb)
# sheetDf

In [34]:
print_results(process_changesheet(sheetDf, mongodb, temp_db), print_before=True, print_after=True)


------------------ BEFORE ------------------
{
  "id": "gold:Gs0114675",
  "name": "Deep subsurface shale carbon reservoir microbial communities from Ohio and West Virginia, USA",
  "description": "This project aims to improve the understanding of microbial diversity and metabolism in deep shale, with implications for novel enzyme discovery and energy development. This project was conducted along two Appalachian basin shales, the Marcellus and Utica/Point Pleasant formations in Pennsylvania and Ohio, respectively. Samples were collected from input and produced fluids up to a year after hydraulic fracturing at varying depths and locations (4 wells, 2 basin shales).\n",
  "ecosystem": "Environmental",
  "ecosystem_category": "Terrestrial",
  "ecosystem_type": "Deep subsurface",
  "ecosystem_subtype": "Unclassified",
  "specific_ecosystem": "Unclassified",
  "principal_investigator": {
    "has_raw_value": "Kelly Wrighton",
    "profile_image_url": "https://portal.nersc.gov/project/m3408

# Remove items <a id="remove-items" />
Removes one or more items from a list.  
Use the pipe ("|") delimiter to remove multiple items.  
The following actions are synonyms:
* remove item
* remove items

**Note:** Removing all the items will result in an empty list (i.e., `[]`) being stored.
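
The note above can be illustrated with a small sketch on a plain dictionary. The helper `remove_items` is a hypothetical name and this is not the `nmdc_runtime` implementation. It only shows that the attribute stays in the document as an empty list once every item has been removed.

```python
def remove_items(doc: dict, attribute: str, raw_value: str) -> dict:
    """Remove pipe-delimited items from a list attribute, in place."""
    if attribute not in doc:
        # nothing to remove; leave the document untouched
        return doc
    to_remove = {item.strip() for item in raw_value.split("|")}
    doc[attribute] = [item for item in doc[attribute] if item not in to_remove]
    return doc


study = {"id": "gold:Gs0135149", "websites": ["http://A", "http://B", "http://C"]}
remove_items(study, "websites", "http://A")
print(study["websites"])

# removing the remaining items leaves an empty list, not a missing key
remove_items(study, "websites", "http://B|http://C")
print(study)
```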

[TOC](#TOC)

pd.read_csv("data/changesheet-remove-item.tsv", sep="\t", dtype="string").fillna('')

In [44]:
sheetDf = load_changesheet("data/changesheet-remove-item.tsv", mongodb)
# sheetDf

In [45]:
print_results(process_changesheet(sheetDf, mongodb, temp_db), print_before=True)


------------------ BEFORE ------------------
{
  "id": "gold:Gs0135149",
  "name": "Bulk soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
  "description": "This research project aimed to understand how snow accumulation and snowmelt influences the mobilization of nitrogen through the soil microbiome in a mountainous catchment at the East River Watershed in Colorado. This project sought to identify bacteria, archaea, and fungi that were associated with the microbial biomass bloom that occurs during winter and the biomass crash following snowmelt. This project also sought to understand whether the traits that govern microbial community assembly during and after snowmelt were phylogenetically conserved. Samples were collected during winter, the snowmelt period, and after snowmelt in spring, from an area that transitioned from an upland hillslope to a riparian floodplain.\n\nThis project is part of the Watershed Function Science Focus 

# Remove <a id="remove-property" />
Completely removes the attribute/property from the document.  
Anything can be specified as the value. In the example below, the individual credit associations are not listed.  
However, for provenance, it is helpful to specify the value being removed (when appropriate).  
The following actions are synonyms:
* remove
* delete
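
A minimal sketch of removing a property, again on a plain dictionary rather than through `nmdc_runtime`. The helper `remove_attribute` is a hypothetical name. Its `value` parameter is accepted but never read, reflecting the statement above that anything can be given as the value.

```python
def remove_attribute(doc: dict, attribute: str, value: str = "") -> dict:
    """Drop an attribute from the document; `value` is kept only for provenance."""
    # pop with a default so a missing attribute is not an error
    doc.pop(attribute, None)
    return doc


study = {
    "id": "gold:Gs0135149",
    "ess_dive_datasets": ["doi:10.21952/WTR/1573029"],
    "has_credit_associations": [{"applied_role": "Conceptualization"}],
}
remove_attribute(study, "ess_dive_datasets", "doi:10.21952/WTR/1573029")
remove_attribute(study, "has_credit_associations", "removing credit associations")
print(study)
```

Unlike removing every item from a list, which leaves `[]` behind, this drops the key itself.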

[TOC](#TOC)

In [49]:
pd.read_csv("data/changesheet-remove-property.tsv", sep="\t", dtype="string").fillna('')

Unnamed: 0,id,action,attribute,value
0,gold:Gs0135149,remove,ess_dive_datasets,doi:10.21952/WTR/1573029
1,,remove,has_credit_associations,removing credit associations


In [50]:
sheetDf = load_changesheet("data/changesheet-remove-property.tsv", mongodb)
# sheetDf

In [51]:
print_results(process_changesheet(sheetDf, mongodb, temp_db), print_before=True)


------------------ BEFORE ------------------
{
  "id": "gold:Gs0135149",
  "name": "Bulk soil microbial communities from the East River watershed near Crested Butte, Colorado, United States",
  "description": "This research project aimed to understand how snow accumulation and snowmelt influences the mobilization of nitrogen through the soil microbiome in a mountainous catchment at the East River Watershed in Colorado. This project sought to identify bacteria, archaea, and fungi that were associated with the microbial biomass bloom that occurs during winter and the biomass crash following snowmelt. This project also sought to understand whether the traits that govern microbial community assembly during and after snowmelt were phylogenetically conserved. Samples were collected during winter, the snowmelt period, and after snowmelt in spring, from an area that transitioned from an upland hillslope to a riparian floodplain.\n\nThis project is part of the Watershed Function Science Focus 