## Translate GOLD FAA (amino acid assembly), FNA (nucleotide assembly), and FASTQ data into data objects.
This notebook demonstrates how to translate GOLD FAA, FNA, and FASTQ data in the JGI Archive and Metadata Organizer (**JAMO**) into JSON that conforms to the [NMDC schema](https://github.com/microbiomedata/nmdc-metadata/blob/schema-draft/README.md).  
Before doing the translation, make sure you have an up-to-date `nmdc.py` file in the `lib` directory.  

The Python modules needed to run the notebook are listed in the `requirements.txt` file.  

In [1]:
import os, sys
sys.path.append(os.path.abspath('../src/bin/lib/')) # add path to lib

In [2]:
import json
import pandas as pds
import jsonasobj
import nmdc
import data_operations as dop
from pandasql import sqldf

def pysqldf(q):
    return sqldf(q, globals())

## Load GOLD study and project tables from nmdc zip file
The NMDC data is currently stored in a zip file. Instead of unzipping the file, simply use the `zipfile` library to load the `study` and `project` tables (stored as tab-delimited files). 

The code for unzipping and creating the dataframe is found in the `make_dataframe` function. As part of the dataframe creation process, the column names are lowercased and spaces are replaced with underscores. I find it helpful to have some standardization of column names when doing data wrangling. This behavior can be overridden if you wish.
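
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual `dop.make_dataframe`: the function name `make_dataframe_sketch` and the `lowercase` and `replace_spaces` parameters are assumptions made for the example, and the archive is a small in-memory stand-in for the NMDC zip file.

```python
import io
import zipfile

import pandas as pds


def make_dataframe_sketch(file_name, file_archive_name=None, sep="\t",
                          lowercase=True, replace_spaces=True):
    """Hypothetical stand-in for dop.make_dataframe; the real signature may differ."""
    if file_archive_name is not None:
        # Read the member straight out of the archive without extracting it to disk.
        with zipfile.ZipFile(file_archive_name) as archive:
            with archive.open(file_name) as handle:
                df = pds.read_csv(handle, sep=sep, dtype=str)
    else:
        df = pds.read_csv(file_name, sep=sep, dtype=str)

    # Standardize column names; both steps can be switched off.
    if lowercase:
        df.columns = [c.strip().lower() for c in df.columns]
    if replace_spaces:
        df.columns = [c.replace(" ", "_") for c in df.columns]
    return df


# Build a tiny in-memory archive so the sketch runs without the real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("export.sql/STUDY_DATA_TABLE.dsv",
                     "Study ID\tActive\n1\tYes\n2\tNo\n")
buffer.seek(0)

demo = make_dataframe_sketch("export.sql/STUDY_DATA_TABLE.dsv",
                             file_archive_name=buffer)
print(list(demo.columns))
```

Reading with `dtype=str` is a design choice in this sketch: it keeps identifiers from being coerced to numbers during loading.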

In [3]:
study = dop.make_dataframe("export.sql/STUDY_DATA_TABLE.dsv", file_archive_name="../src/data/nmdc-version2.zip")
project = dop.make_dataframe("export.sql/PROJECT_DATA_TABLE.dsv", file_archive_name="../src/data/nmdc-version2.zip")

## Subset GOLD tables to active records that are joined to valid study IDs

In [4]:
q = """
select 
    *
from
    study
where
    active = 'Yes'
"""
study = sqldf(q)

In [5]:
q = """
select 
    project.*
from
    project
inner join 
    study
on 
    study.study_id = project.master_study_id    
where
    project.active = 'Yes'
"""
project = sqldf(q)
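
For readers who prefer plain pandas, the two queries above can be approximated without SQL. The sketch below uses toy tables (`study_demo` and `project_demo` are made up for the example) and assumes `study_id` is unique in the study table; if it were not, the SQL inner join would duplicate project rows while `isin` would not.

```python
import pandas as pds

# Toy tables with the column names used in the queries above.
study_demo = pds.DataFrame({
    "study_id": [1, 2, 3],
    "active": ["Yes", "No", "Yes"],
})
project_demo = pds.DataFrame({
    "gold_id": ["Gp1", "Gp2", "Gp3", "Gp4"],
    "master_study_id": [1, 2, 3, 9],
    "active": ["Yes", "Yes", "No", "Yes"],
})

# Keep only active studies.
active_study = study_demo[study_demo["active"] == "Yes"]

# Keep active projects whose master study is one of the active studies.
active_project = project_demo[
    (project_demo["active"] == "Yes")
    & project_demo["master_study_id"].isin(active_study["study_id"])
]
print(active_project["gold_id"].tolist())
```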

## Load GOLD FAA, FNA, and FASTQ files into data frames

In [6]:
faa = dop.make_dataframe("../src/data/ficus_project_faa.tsv")
fna = dop.make_dataframe("../src/data/ficus_project_fna.tsv")
fastq = dop.make_dataframe("../src/data/ficus_project_fastq.tsv")

In [7]:
# faa.head() # peek at data

## Combine FAA, FNA, and FASTQ dataframes into a single dataframe 
* Since all the files have the same headers, I can concatenate the data into a single dataframe for processing.

In [8]:
data_objects = pds.concat([faa, fna, fastq], axis=0)
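
The concatenation relies on the three files sharing the same headers. A minimal sketch of checking that assumption before concatenating is shown below; the `*_demo` frames are toy data invented for the example, not the real FAA, FNA, and FASTQ tables.

```python
import pandas as pds

faa_demo = pds.DataFrame({"gold_project_id": ["Gp1"], "file_id": ["f1"], "file_name": ["a.faa"]})
fna_demo = pds.DataFrame({"gold_project_id": ["Gp1"], "file_id": ["f2"], "file_name": ["a.fna"]})
fastq_demo = pds.DataFrame({"gold_project_id": ["Gp2"], "file_id": ["f3"], "file_name": ["b.fastq"]})

frames = [faa_demo, fna_demo, fastq_demo]

# Fail early if any frame's columns differ from those of the first frame.
expected = list(frames[0].columns)
for frame in frames[1:]:
    if list(frame.columns) != expected:
        raise ValueError(f"column mismatch: {list(frame.columns)} != {expected}")

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pds.concat(frames, axis=0, ignore_index=True)
print(combined.shape)
```

Without the check, `concat` would still succeed on mismatched headers and silently fill the missing columns with `NaN`.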

In [9]:
# data_objects.head() # peek at data

## Build json data files
The JSON data files are built using a general approach:
1. Create a pandas dataframe (often using SQL syntax) to be translated.
2. Transform the dataframe into a list of dictionaries, one per row (these variables end with '_dictdf').
3. Define a list of field names whose names and values will be translated into characteristics within an annotation object.
4. Pass the dataframe dictionary and characteristics list to the `make_json_string_list` function. This function returns a list of JSON objects, each of which has been converted to a string.
5. Save the JSON string list to file using `save_json_string_list`.

**Note:** Currently, I am using the GOLD IDs as identifiers. These need to be changed to IRIs.
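
To make steps 2 through 4 concrete, here is a rough sketch using plain dictionaries. It is not the actual `dop.make_json_string_list`: the helper name `make_json_strings_sketch`, its parameters, and the layout of the `annotations` entries are all assumptions chosen for illustration, and the real function builds `nmdc` schema objects instead.

```python
import json

import pandas as pds


def make_json_strings_sketch(records, id_key, name_key, characteristic_fields):
    """Hypothetical helper; the real dop.make_json_string_list may differ."""
    json_strings = []
    for record in records:
        obj = {"id": record[id_key], "name": record[name_key]}
        # Turn each characteristic field into an annotation entry
        # (this layout is assumed, not taken from the NMDC schema).
        annotations = [
            {"has_characteristic": {"name": field}, "has_raw_value": str(record[field])}
            for field in characteristic_fields
            if record.get(field) is not None
        ]
        if annotations:
            obj["annotations"] = annotations
        json_strings.append(json.dumps(obj))
    return json_strings


demo_df = pds.DataFrame({
    "file_id": ["f1", "f2"],
    "file_name": ["a.faa", "a.fna"],
    "file_size": [1024, 2048],
})
demo_records = demo_df.to_dict(orient="records")  # step 2: one dict per row
demo_json = make_json_strings_sketch(
    demo_records, "file_id", "file_name", ["file_size"]  # steps 3 and 4
)
print(demo_json[0])
```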

In [10]:
q = """
select 
    data_objects.*
from
    data_objects
inner join
    project
on
    data_objects.gold_project_id = project.gold_id
"""
data_objects = sqldf(q)

In [11]:
# data_objects.head() # peek at data

In [12]:
data_objects_dictdf = data_objects.to_dict(orient="records") # transform dataframe to a list of dictionaries (one per row)

In [13]:
## print out a single record for viewing
# for record in data_objects_dictdf:
#     print(json.dumps(record, indent=4)); break

In [14]:
## specify characteristics
characteristics = ['file_size']

## create list of json string objects
data_objects_json_list = dop.make_json_string_list(
    data_objects_dictdf,
    nmdc.DataObject,
    id_key='file_id',
    name_key='file_name',
    description_key="file_type_description",
    characteristic_fields=characteristics,
)

TypeError: make_json_string_list() got an unexpected keyword argument 'id_key'

In [None]:
# print(data_objects_json_list[0]) ## peek at data

In [None]:
dop.save_json_string_list("output/nmdc-json/data_objects.json", data_objects_json_list) # save json string list to file

## Update omics processing json to relate omics processing to data objects
### Schema pattern: omics processing -- has output --> data object
Steps:
* load project JSON data into a list of dictionaries
* create a dataframe linking each project id to the list of file ids associated with it
* iterate over the dataframe and add a "has_output" key to the dictionaries with matching project ids

In [None]:
## load omics processing json into dict list
omics_dict_list = dop.load_dict_from_json_file("output/nmdc-json/omics_processing.json")

In [None]:
omics_dict_list[0] ## peek at data

In [None]:
## build a dataframe with each project id and the file ids associated with it.
q = """
select
    d1.gold_project_id, group_concat(d2.file_id, ' ') as file_ids
from
    data_objects d1
inner join
    data_objects d2
on
    d1.file_id = d2.file_id
group by 
    d1.gold_project_id
"""
files_df = sqldf(q)

In [None]:
## iterate over dataframe and create a has_output key for dictionary items with matching project ids
for (ix, gold_project_id, file_ids) in files_df.itertuples():
    for omics_dict in omics_dict_list:
        if gold_project_id == omics_dict["id"]: # compare project id to id of current dict object
            omics_dict["has_output"] = file_ids.split() # create list of file ids associated with project id
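
The same linking can be sketched without SQL or the nested loop, by grouping file ids per project and looking each project up in a dictionary. The data below is invented for the example, and the records are assumed to carry the GOLD project id in their `id` field, as in the loop above.

```python
import pandas as pds

data_objects_demo = pds.DataFrame({
    "gold_project_id": ["Gp1", "Gp1", "Gp2"],
    "file_id": ["f1", "f2", "f3"],
})
omics_demo = [{"id": "Gp1"}, {"id": "Gp2"}, {"id": "Gp3"}]

# Map each project id to the list of its file ids.
files_by_project = (
    data_objects_demo.groupby("gold_project_id")["file_id"].apply(list).to_dict()
)

# A single pass over the records; projects without files are left unchanged.
for omics in omics_demo:
    file_ids = files_by_project.get(omics["id"])
    if file_ids:
        omics["has_output"] = file_ids

print(omics_demo)
```

The dictionary lookup avoids rescanning the whole record list for every project, which matters once the number of projects grows.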

In [None]:
# omics_dict_list[0] ## peek at data

## Save updated omics processing data as json

In [None]:
# convert each omics processing dict into its own json string
project_json_list = [json.dumps(omics_dict) for omics_dict in omics_dict_list]

In [None]:
dop.save_json_string_list("output/nmdc-json/omics_processing.json", project_json_list) # save json string list to file