## Translate EMSL data.
This notebook demonstrates how to translate the EMSL spreadsheets [EMSL_FICUS_project_process_data_export.xlsx](https://drive.google.com/drive/u/1/folders/1frzGlz8EB8inpVokNTSwD6Ia94eVUlsZ) and [FICUS - JGI-EMSL Proposal - Gold Study - ID mapping and PI](https://docs.google.com/spreadsheets/d/1BX35JZsRkA5cZ-3Y6x217T3Aif30Ptxe_SjIC7JqPx4/edit#gid=0) into JSON that conforms to the [NMDC schema](https://github.com/microbiomedata/nmdc-metadata/blob/schema-draft/README.md).  
Before running the translation, make sure you have an up-to-date `nmdc.py` file in the `lib` directory.  

The Python modules needed to run the notebook are listed in the `requirements.txt` file.  

In [3]:
import os, sys
sys.path.append(os.path.abspath('../src/bin/lib/')) # add path to lib

In [4]:
import json
import pandas as pds
import jsonasobj
import nmdc
import data_operations as dop
from pandasql import sqldf

def pysqldf(q):
    return sqldf(q, globals())

## Load GOLD study table from nmdc zip file
The NMDC data is currently stored in a zip file. Instead of unzipping the file, simply use the `zipfile` library to load the `study` table (stored as tab-delimited files). 

The code for reading the zip file and creating the dataframe is found in the `make_dataframe` function. As part of the dataframe creation process, the column names are lowercased and spaces are replaced with underscores. Standardized column names make data wrangling easier. This behavior can be overridden if you wish.

In [7]:
study = dop.make_dataframe("export.sql/STUDY_DATA_TABLE.dsv", file_archive_name="../src/data/nmdc-version2.zip")
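
The internals of `make_dataframe` are not shown in this notebook. As a rough illustration of the two ideas described above (reading a tab-delimited member straight from a zip archive, and normalizing column names), here is a minimal standard-library sketch. The helper names `clean_column_name` and `read_tsv_from_zip` and the sample row are hypothetical and are not part of `data_operations`.

```python
import csv
import io
import zipfile


def clean_column_name(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def read_tsv_from_zip(archive, member):
    """Read one tab-delimited member of a zip archive into a list of dicts.

    `archive` may be a path or a file-like object; nothing is extracted to disk.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t")
            header = [clean_column_name(col) for col in next(reader)]
            return [dict(zip(header, row)) for row in reader]


# Build a tiny in-memory archive (hypothetical data) so the sketch runs
# without the real NMDC zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "STUDY_DATA_TABLE.dsv",
        "GOLD ID\tStudy Name\tActive\nGs0000001\tDemo study\tYes\n",
    )
buffer.seek(0)

rows = read_tsv_from_zip(buffer, "STUDY_DATA_TABLE.dsv")
print(rows)
```

The real function returns a pandas dataframe rather than a list of dictionaries; the sketch only shows the archive access and the renaming rule.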

## Subset GOLD tables to active records that are joined to valid study IDs

In [8]:
q = """
select 
    *
from
    study
where
    active = 'Yes'
"""
study = sqldf(q)

## Load EMSL spreadsheets into dataframes

In [9]:
## load emsl instrument run data
## the spreadsheet contains multiple tabs, so load every sheet with pandas and then clean the column names
emsl = pds.concat(pds.read_excel("../src/data/EMSL_FICUS_project_process_data_export.xlsx", 
                                     sheet_name=None), ignore_index=True)
emsl = dop.clean_dataframe_column_names(emsl)

## load mapping spreadsheet
jgi_emsl = dop.make_dataframe("../src/data/FICUS - JGI-EMSL Proposal - Gold Study - ID mapping and PI.xlsx", file_type="excel")


## Subset EMSL data to only those that have a valid FICUS study ID

In [10]:
## subset the mapping spreadsheet
q = """
select 
    *
from
    jgi_emsl
inner join
    study
on
    jgi_emsl.gold_study_id = study.gold_id
"""
jgi_emsl = sqldf(q)

In [11]:
# jgi_emsl.head() # peek at data

In [12]:
## subset instrument run data
q = """
select 
    emsl.*, jgi_emsl.gold_study_id
from
    emsl
inner join
    jgi_emsl
on
    emsl.emsl_proposal_id = jgi_emsl.emsl_proposal_id
"""
emsl = sqldf(q)
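
To see what this inner join keeps and drops, the same query can be tried on a few made-up rows. `pandasql` runs queries on an in-memory SQLite database by default, so the sketch below uses the standard-library `sqlite3` module directly. The table contents are hypothetical, and an `order by` clause is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emsl (dataset_id integer, emsl_proposal_id integer)")
conn.execute("create table jgi_emsl (emsl_proposal_id integer, gold_study_id text)")

# Hypothetical rows: proposal 2 has no GOLD study mapping.
conn.executemany("insert into emsl values (?, ?)", [(101, 1), (102, 1), (103, 2)])
conn.executemany("insert into jgi_emsl values (?, ?)", [(1, "Gs0000001")])

q = """
select
    emsl.*, jgi_emsl.gold_study_id
from
    emsl
inner join
    jgi_emsl
on
    emsl.emsl_proposal_id = jgi_emsl.emsl_proposal_id
order by
    emsl.dataset_id
"""
joined = conn.execute(q).fetchall()
conn.close()
print(joined)
```

Run 103 is dropped because its proposal has no row in the mapping table. Note also that if the mapping table listed a proposal more than once, each matching run would appear once per mapping row.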

## Update/prep instrument run data
* Change column experimental_data_type to omics_type
* Change column dataset_file_size_bytes to file_size
* Add processing_institution = "Environmental Molecular Sciences Lab" column
* Add column data_object_id to identify data objects. Currently, this is just "output_" + value of dataset_id
* Add column data_object_name associated with data object ids. Currently, this is just "output: " + value of dataset_name

In [13]:
emsl.rename(columns={"experimental_data_type":"omics_type"}, inplace=True) # rename column

In [14]:
emsl.rename(columns={"dataset_file_size_bytes":"file_size"}, inplace=True) # rename column

In [15]:
emsl["processing_institution"] = "Environmental Molecular Sciences Lab" # add processing institution

In [16]:
emsl["data_object_id"] = "output_" + emsl["dataset_id"].map(str) # build data object id

In [17]:
emsl["data_object_name"] = "output: " + emsl["dataset_name"].map(str) # build data object name

In [18]:
# emsl[["data_object_id", "dataset_id", "omics_type", "processing_institution", "gold_study_id"]].head() # peek at data

## Build omics processing json

In [19]:
emsl_dictdf = emsl.to_dict(orient="records")

In [20]:
## specify characteristics
characteristics = \
    ['omics_type', 'instrument_name', 'processing_institution']

## create list of json string objects
omics_processing_dict_list = dop.make_nmdc_dict_list\
    (emsl_dictdf, nmdc.OmicsProcessing, id_key='dataset_id', name_key='dataset_name', description_key="dataset_type_description",
     part_of_key="gold_study_id", has_output_key="data_object_id", characteristic_fields=characteristics)

TypeError: make_nmdc_dict_list() got an unexpected keyword argument 'id_key'
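
The signature of `make_nmdc_dict_list` is not shown in this notebook, so the sketch below is not a reimplementation of it. It only illustrates, under stated assumptions, how one flat record could be mapped onto the layout that `omics_processing.json` records have in this notebook (`id`, `name`, `annotations`, `part_of`). The helper `record_to_nmdc_dict` and the sample record are hypothetical.

```python
def record_to_nmdc_dict(record, id_key, name_key, part_of_key=None,
                        characteristic_fields=()):
    """Hypothetical helper: map one flat record onto a dictionary with
    id, name, annotations, and part_of entries."""
    result = {"id": str(record[id_key]), "name": record[name_key]}

    annotations = []
    for field in characteristic_fields:
        value = record.get(field)
        if value is None:
            continue  # skip characteristics with no value in this record
        annotations.append({
            "has_characteristic": {"name": field},
            "has_raw_value": str(value),
        })
    if annotations:
        result["annotations"] = annotations

    if part_of_key is not None and record.get(part_of_key) is not None:
        result["part_of"] = [record[part_of_key]]
    return result


# Hypothetical instrument-run record using the column names prepared earlier.
run = {
    "dataset_id": 12345,
    "dataset_name": "demo_run",
    "omics_type": "Metabolomics",
    "instrument_name": None,
    "processing_institution": "Environmental Molecular Sciences Lab",
    "gold_study_id": "Gs0000001",
}
doc = record_to_nmdc_dict(
    run,
    id_key="dataset_id",
    name_key="dataset_name",
    part_of_key="gold_study_id",
    characteristic_fields=["omics_type", "instrument_name", "processing_institution"],
)
print(doc)
```

Skipping missing values and storing raw values as strings are design choices of this sketch; the real function may treat them differently.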

In [None]:
omics_processing_dict_list[0] # peek at data

## Build data objects json

In [None]:
## specify characteristics
characteristics = ['file_size']

## create list of dictionaries
data_objects_dict_list = dop.make_nmdc_dict_list\
    (emsl_dictdf, nmdc.DataObject, id_key='data_object_id', 
     name_key='data_object_name', characteristic_fields=characteristics)

In [None]:
# data_objects_dict_list[-1] # peek at data

## Update the omics_processing.json file

In [5]:
## load omics processing json into dict list
omics_processing_file_data = dop.load_dict_from_json_file("output/nmdc-json/omics_processing.json")

In [6]:
omics_processing_file_data[0] # peek at data

{'id': 'Gp0095972',
 'name': 'Cyanobacterial communities from the Joint Genome Institute, California, USA from Joint Genome Institute, California, USA - FECB-24 metaG',
 'annotations': [{'has_characteristic': {'name': 'add_date'},
   'has_raw_value': '19-JUN-14 12.00.00.000000000 AM'},
  {'has_characteristic': {'name': 'mod_date'},
   'has_raw_value': '04-DEC-19 01.50.18.267000000 PM'},
  {'has_characteristic': {'name': 'ncbi_project_name'},
   'has_raw_value': 'Cyanobacterial communities from the Joint Genome Institute, California, USA from Joint Genome Institute, California, USA - FECB-24 metaG'},
  {'has_characteristic': {'name': 'omics_type'},
   'has_raw_value': 'Metagenome'},
  {'has_characteristic': {'name': 'principal_investigator_name'},
   'has_raw_value': 'Matthias Hess'},
  {'has_characteristic': {'name': 'processing_institution'},
   'has_raw_value': 'Joint Genome Institute'}],
 'part_of': ['Gs0110132']}

In [None]:
updated_omics_processing = [*omics_processing_file_data, *omics_processing_dict_list]
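
The unpacking above simply concatenates the two lists. Because the notebook loads `output/nmdc-json/omics_processing.json` and later saves to the same path, running it a second time would append the EMSL records again. One possible guard, assuming every record carries a unique `id`, is to merge by that key. The helper `merge_by_id` and the sample records are hypothetical.

```python
def merge_by_id(existing, new):
    """Combine two lists of dicts, letting later records replace earlier
    ones that share the same 'id'. Order of first appearance is preserved."""
    merged = {}
    for record in [*existing, *new]:
        merged[record["id"]] = record
    return list(merged.values())


# Hypothetical records: "12345" is already in the file from an earlier run.
existing = [{"id": "Gp0000001", "name": "old"}, {"id": "12345", "name": "stale run"}]
new = [{"id": "12345", "name": "demo_run"}]

combined = merge_by_id(existing, new)
print(combined)
```

The newer record wins here; keeping the older one instead would only require checking for the key before assigning.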

In [None]:
# updated_omics_processing[-1] ## peek at data

## Save updated omics processing data as json

In [None]:
updated_omics_processing_json_list = dop.convert_dict_list_to_json_list(updated_omics_processing)

In [None]:
dop.save_json_string_list("output/nmdc-json/omics_processing.json", updated_omics_processing_json_list) # save json string list to file
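
How `save_json_string_list` writes the file is not shown here. As a plain standard-library sketch of the same step, a list of dictionaries can be written with `json.dump` and read back to confirm that the file round-trips. The record and the temporary path below are hypothetical.

```python
import json
import os
import tempfile

records = [{"id": "12345", "name": "demo_run", "part_of": ["Gs0000001"]}]

# Write to a temporary directory so the sketch never touches real output files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "omics_processing.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    with open(path, encoding="utf-8") as handle:
        reloaded = json.load(handle)

print(reloaded == records)
```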

## Update the data_objects.json file

In [None]:
## load data objects json into dict list
data_objects_file_data = dop.load_dict_from_json_file("output/nmdc-json/data_objects.json")

In [None]:
# data_objects_file_data[0] # peek at data

In [None]:
updated_data_objects = [*data_objects_file_data, *data_objects_dict_list]

In [None]:
# updated_data_objects[-1] # peek at data

## Save updated data objects data as json

In [None]:
updated_data_objects_json_list = dop.convert_dict_list_to_json_list(updated_data_objects)

In [None]:
dop.save_json_string_list("output/nmdc-json/data_objects.json", updated_data_objects_json_list) # save json string list to file
