## Translate GOLD study, project, and biosample data into JSON
This notebook demonstrates how to translate study, project, and biosample data from the GOLD database into JSON that conforms to the [NMDC schema](https://github.com/microbiomedata/nmdc-metadata/blob/schema-draft/README.md).  
Before running the translation, make sure you have an up-to-date `nmdc.py` file in the `lib` directory.  

The Python modules needed to run the notebook are listed in the `requirements.txt` file.  

In [1]:
import json
import pandas as pds
import jsonasobj
import lib.nmdc as nmdc
import lib.data_operations as dop
from pandasql import sqldf

def pysqldf(q):
    """Run SQL query `q` against the dataframes in the notebook's global namespace."""
    return sqldf(q, globals())

## Load tables (i.e., tab delimited files) from nmdc zip file
The NMDC data is currently stored in a zip file. Instead of unzipping the file, simply use the `zipfile` library to load the `study`, `project`, `contact`, `project_biosample`, and `biosample` tables (stored as tab-delimited files). The `project_biosample` table is needed as a cross-linking table between `project` and `biosample`. The `contact` table contains information about principal investigators.

The code for reading a table from the archive and creating the dataframe is in the `make_dataframe` function. As part of the dataframe creation process, the column names are lower-cased and spaces are replaced with underscores. I find it helpful to have some standardization of column names when doing data wrangling. This behavior can be overridden if you wish.
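
As a rough illustration of the idea (not the actual `make_dataframe` code, which is not shown in this notebook), the sketch below reads a tab-delimited file directly from a zip archive and normalizes the column names. It uses only the standard library, so rows come back as dictionaries rather than as a dataframe. The names `normalize_column_name` and `read_table_from_zip`, the archive member name, and the sample row are all hypothetical.

```python
import csv
import io
import zipfile


def normalize_column_name(name):
    """Lower-case a column name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def read_table_from_zip(archive, member, delimiter="\t", normalize=True):
    """Read one delimited file from a zip archive without extracting it.

    Hypothetical helper: returns a list of dicts, one per data row.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter=delimiter)
            header = next(reader)
            if normalize:
                header = [normalize_column_name(h) for h in header]
            return [dict(zip(header, row)) for row in reader]


# Build a tiny in-memory archive so the example runs without real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "export/STUDY.dsv",
        "GOLD ID\tStudy Name\tActive\nGs0000001\tDemo study\tYes\n",
    )
buffer.seek(0)

rows = read_table_from_zip(buffer, "export/STUDY.dsv")
print(rows)
```

Passing `normalize=False` in this sketch keeps the original column names, which mirrors the idea that the standardization can be switched off.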

In [2]:
study = dop.make_dataframe("export.sql/STUDY_DATA_TABLE.dsv", file_archive_name="data/nmdc-version2.zip")
contact = dop.make_dataframe("export.sql/CONTACT_DATA_TABLE.dsv", file_archive_name="data/nmdc-version2.zip")
project = dop.make_dataframe("export.sql/PROJECT_DATA_TABLE.dsv", file_archive_name="data/nmdc-version2.zip")
project_biosample = dop.make_dataframe("export.sql/PROJECT_BIOSAMPLE_DATA_TABLE.dsv", file_archive_name="data/nmdc-version2.zip")
biosample = dop.make_dataframe("export.sql/BIOSAMPLE_DATA_TABLE.dsv", file_archive_name="data/nmdc-version2.zip")
proposals = dop.make_dataframe("data/JGI-EMSL-FICUS-proposals.fnl.tsv")

## Subset tables to records where active = 'Yes'

In [3]:
q = """
select 
    *
from
    study
where
    active = 'Yes'
"""
study = sqldf(q)

In [4]:
q = """
select 
    *
from
    project
where
    active = 'Yes'
"""
project = sqldf(q)

In [5]:
q = """
select 
    *
from
    biosample
where
    active = 'Yes'
"""
biosample = sqldf(q)

In [6]:
# biosample.head() # peek at data

## Build json data files
The JSON data files are built using a general approach:
1. Create a pandas dataframe (often using SQL syntax) to be translated.
2. Transform the dataframe into a list of row dictionaries (these variables end with `_dictdf`).
3. Define a list of field names whose names and values will be translated into characteristics within an annotation object.
4. Pass the dataframe dictionary and the characteristics list to the `make_json_string_list` method. This method returns a list of JSON objects, each of which has been converted to a string.
5. Save the list of JSON strings to a file using `save_json_string_list`.
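
The steps above can be sketched in plain Python as follows. This is only an illustration of the data flow, under stated assumptions: `record_to_json_string` is a hypothetical stand-in for what `make_json_string_list` does per record, and the output layout (`annotations` holding `characteristic`/`value` pairs) is invented for the example and does not claim to match the NMDC schema.

```python
import json


def record_to_json_string(record, id_key, name_key, description_key, characteristic_fields):
    """Turn one row dictionary into a JSON string (illustrative layout only)."""
    obj = {
        "id": record[id_key],
        "name": record.get(name_key),
        "description": record.get(description_key),
        # Each listed field becomes a characteristic name/value pair.
        "annotations": [
            {"characteristic": field, "value": record[field]}
            for field in characteristic_fields
            if record.get(field) is not None  # skip missing values
        ],
    }
    return json.dumps(obj)


# Step 2: rows as a list of dictionaries, like DataFrame.to_dict(orient="records").
study_dictdf = [
    {"gold_id": "Gs0000001", "study_name": "Demo study", "description": "Example",
     "ecosystem": "Environmental", "doi": None},
]

# Step 3: fields to translate into characteristics.
characteristics = ["ecosystem", "doi"]

# Step 4: one JSON string per record.
json_strings = [
    record_to_json_string(r, "gold_id", "study_name", "description", characteristics)
    for r in study_dictdf
]

# Step 5 (sketch): join the strings into one JSON array; this text could then be written to a file.
json_array_text = "[\n" + ",\n".join(json_strings) + "\n]"
print(json_array_text)
```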

**Note:** Currently, I am using the GOLD IDs as identifiers. These need to be changed to IRIs.

## Build study json
* Create a subset of the study table using the FICUS gold_ids specified in [JGI-EMSL-FICUS-proposals.fnl.xlxs](https://docs.google.com/spreadsheets/d/1sowTCYooDrOMq0ErD4s3xtgH3PLoxwa7/edit#gid=1363834365).
* Follow approach for building json data files.

In [7]:
q = """
select 
    study.*, contact.name as principal_investigator_name, proposals.doi
from
    study
left join
    contact
on
    study.contact_id = contact.contact_id
left join
    proposals
on
    study.gold_id = proposals.gold_study
where
    study.gold_id in 
      ('Gs0110115', 'Gs0110132', 'Gs0112340', 'Gs0114675', 'Gs0128849', 'Gs0130354', 
       'Gs0114298', 'Gs0114663', 'Gs0120351', 'Gs0134277', 'Gs0133461', 'Gs0135152', 'Gs0135149')
"""
study = sqldf(q)

In [9]:
# study.head() # peek at data

In [10]:
study_dictdf = study.to_dict(orient="records") # transform dataframe to dictionary

In [11]:
## print out a single record for viewing
# for record in study_dictdf:
#     print(json.dumps(record, indent=4)); break

In [12]:
## specify characteristics
characteristics = \
    ['gold_study_name', 'principal_investigator_name', 'add_date', 'mod_date', 'doi',
      'ecosystem', 'ecosystem_category', 'ecosystem_type', 'ecosystem_subtype', 'specific_ecosystem', 'ecosystem_path_id']

## create list of json string objects
study_json_list = dop.make_json_string_list\
    (study_dictdf, nmdc.Study, id_key='gold_id', name_key='study_name', description_key="description", characteristic_fields=characteristics)

In [13]:
# print(study_json_list[0]) ## peek at data

In [14]:
dop.save_json_string_list("output/nmdc-json/study.json", study_json_list) # save json string list to file

## Build project json
* Create dataframe for projects that are part of the FICUS studies.
* Follow approach for building json data files.
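
Because the `study` table was already reduced to the FICUS studies in the previous section, an inner join against it keeps only the projects whose `master_study_id` matches one of those studies, while a left join to `contact` keeps a project even when no principal investigator is found. The sketch below shows this behavior with the standard `sqlite3` module (pandasql runs its queries on SQLite). All rows and IDs are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table study (study_id integer, gold_id text);
create table project (project_id integer, master_study_id integer, pi_id integer);
create table contact (contact_id integer, name text);

insert into study values (1, 'Gs0000001');            -- the only retained study
insert into project values (10, 1, 100), (11, 1, null), (12, 2, 100);
insert into contact values (100, 'Example PI');
""")

rows = conn.execute("""
select
    project.project_id, study.gold_id as study_gold_id, contact.name as principal_investigator_name
from
    project
inner join
    study
on
    study.study_id = project.master_study_id
left join
    contact
on
    contact.contact_id = project.pi_id
order by
    project.project_id
""").fetchall()
conn.close()

# Project 12 is dropped (its study was filtered out); project 11 is kept with no PI name.
print(rows)
```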

In [17]:
q = """
select
    project.*, study.gold_id as study_gold_id, contact.name as principal_investigator_name
from 
    project
inner join 
    study
on 
    study.study_id = project.master_study_id
left join
    contact
on
    contact.contact_id = project.pi_id
"""
project = sqldf(q)

In [18]:
project.head() # peek at data

| | project_id | project_name | ncbi_bioproject_id | add_date | mod_date | mod_by | description | specimen_id | organism_type | nucleic_acid | ... | its_sample_name | project_subtype | sequencing_strategy_full | is_test | is_approved | is_locked | gpts_sample_id | gpts_disambiguator | study_gold_id | principal_investigator_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95972 | Cyanobacterial communities from the Joint Geno... | 410293.0 | 19-JUN-14 12.00.00.000000000 AM | 04-DEC-19 01.50.18.267000000 PM | 101072.0 | | 1 | | | ... | | | Metagenome | No | Yes | Yes | | | Gs0110132 | Matthias Hess |
| 1 | 95970 | Cyanobacterial communities from the Joint Geno... | 410291.0 | 19-JUN-14 12.00.00.000000000 AM | 04-DEC-19 01.50.27.161000000 PM | 101072.0 | | 1 | | | ... | | | Metagenome | No | Yes | Yes | | | Gs0110132 | Matthias Hess |
| 2 | 95966 | Cyanobacterial communities from the Joint Geno... | 410405.0 | 19-JUN-14 12.00.00.000000000 AM | 19-NOV-19 01.19.44.394000000 AM | 101072.0 | | 1 | | | ... | | | Metatranscriptome | No | Yes | Yes | | | Gs0110132 | Matthias Hess |
| 3 | 95968 | Cyanobacterial communities from the Joint Geno... | 410289.0 | 19-JUN-14 12.00.00.000000000 AM | 19-NOV-19 01.19.47.021000000 AM | 101072.0 | | 1 | | | ... | | | Metatranscriptome | No | Yes | Yes | | | Gs0110132 | Matthias Hess |
| 4 | 95969 | Cyanobacterial communities from the Joint Geno... | 410406.0 | 19-JUN-14 12.00.00.000000000 AM | 19-NOV-19 01.19.38.495000000 AM | 101072.0 | | 1 | | | ... | | | Metatranscriptome | No | Yes | Yes | | | Gs0110132 | Matthias Hess |


In [20]:
project_dictdf = project.to_dict(orient="records") # transform dataframe to dictionary

In [21]:
## specify characteristics
characteristics = \
    ['add_date', 'mod_date', 'completion_date', 'ncbi_project_name', 'principal_investigator_name']

## create list of json string objects
project_json_list = dop.make_json_string_list\
    (project_dictdf, nmdc.SequencingProject, id_key='gold_id', name_key='project_name', 
     part_of_key="study_gold_id", description_key="description", characteristic_fields=characteristics)

In [23]:
# print(project_json_list[0]) ## peek at data

In [24]:
dop.save_json_string_list("output/nmdc-json/project.json", project_json_list) # save json string list to file

## Build biosample json
* Create dataframe for biosamples that are part of the FICUS studies. Note the use of `group_concat` in the query string. This is needed because a biosample may belong to more than one project.
* Follow approach for building json data files.
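
To see what `group_concat` does here, the sketch below runs a reduced version of the query with the standard `sqlite3` module (pandasql runs its queries on SQLite). The tables contain only made-up rows and IDs. One biosample is linked to two projects, and the aggregate collapses them into a single comma-separated string. SQLite does not guarantee the order of the concatenated values, so the example sorts them before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table biosample (biosample_id integer, gold_id text);
create table project (project_id integer, gold_id text);
create table project_biosample (project_id integer, biosample_id integer);

insert into biosample values (1, 'Gb0000001'), (2, 'Gb0000002');
insert into project values (10, 'Gp0000010'), (11, 'Gp0000011');
insert into project_biosample values (10, 1), (11, 1), (10, 2);
""")

rows = conn.execute("""
select
    biosample.gold_id,
    group_concat(project.gold_id) as project_gold_ids
from
    biosample
inner join project_biosample
    on biosample.biosample_id = project_biosample.biosample_id
inner join project
    on project.project_id = project_biosample.project_id
group by
    biosample.biosample_id, biosample.gold_id
order by
    biosample.gold_id
""").fetchall()
conn.close()

# One row per biosample; split the comma-separated project IDs back into sorted lists.
projects_by_biosample = {gold_id: sorted(ids.split(",")) for gold_id, ids in rows}
print(projects_by_biosample)
```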

**Note:** The list of characteristics is quite long. I may need to rethink this and find a more elegant way to do it.
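
One possible way to shorten the list, offered only as a sketch: in this notebook the biosample characteristics are the selected columns minus the four that are passed separately as the id, name, description, and part-of keys. The list could therefore be derived by exclusion instead of being typed out. The record below is hypothetical; with a dataframe, the column names (for example `biosampledf.columns`) would play the role of the dictionary keys.

```python
# Hypothetical record shaped like one row of the biosample query result.
record = {
    "gold_id": "Gb0000001",
    "biosample_name": "Demo biosample",
    "description": "Example",
    "add_date": "01-JAN-20",
    "ecosystem": "Environmental",
    "ph": "7",
    "project_gold_ids": "Gp0000010,Gp0000011",
}

# Columns passed through dedicated keyword arguments rather than as characteristics.
non_characteristic_fields = {"gold_id", "biosample_name", "description", "project_gold_ids"}

# Keep the original column order (dicts preserve insertion order in Python 3.7+).
characteristics = [field for field in record if field not in non_characteristic_fields]
print(characteristics)
```

The trade-off is that any column added to the query later becomes a characteristic automatically, which may or may not be desirable.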

In [25]:
q = """
select
    biosample.gold_id,
    biosample.biosample_name,
    biosample.description,
    biosample.add_date,
    biosample.mod_date,
    biosample.ecosystem_path_id,
    biosample.ecosystem,
    biosample.ecosystem_category,
    biosample.ecosystem_type,
    biosample.ecosystem_subtype,
    biosample.specific_ecosystem,
    biosample.habitat,
    biosample.location,
    biosample.community,
    biosample.ncbi_taxonomy_name,
    biosample.geographic_location,
    biosample.latitude,
    biosample.longitude,
    biosample.sample_collection_site,
    biosample.identifier,
    biosample.sample_collection_year,
    biosample.sample_collection_month,
    biosample.sample_collection_day,
    biosample.sample_collection_hour,
    biosample.sample_collection_minute,
    biosample.host_name,
    biosample.depth,
    biosample.subsurface_depth,
    biosample.altitude,
    biosample.temperature_range,
    biosample.proport_woa_temperature,
    biosample.biogas_temperature,
    biosample.growth_temperature,
    biosample.soil_annual_season_temp,
    biosample.water_samp_store_temp,
    biosample.biogas_retention_time,
    biosample.salinity,
    biosample.pressure,
    biosample.ph,
    biosample.chlorophyll_concentration,
    biosample.nitrate_concentration,
    biosample.oxygen_concentration,
    biosample.salinity_concentration,
    group_concat(project.gold_id) as project_gold_ids
from
    biosample
inner join project_biosample
    on biosample.biosample_id = project_biosample.biosample_id
inner join project
    on project.project_id = project_biosample.project_id
group by
    biosample.biosample_id,
    biosample.biosample_name,
    biosample.description,
    biosample.add_date,
    biosample.mod_date,
    biosample.ecosystem_path_id,
    biosample.ecosystem,
    biosample.ecosystem_category,
    biosample.ecosystem_type,
    biosample.ecosystem_subtype,
    biosample.specific_ecosystem,
    biosample.habitat,
    biosample.location,
    biosample.community,
    biosample.ncbi_taxonomy_name,
    biosample.geographic_location,
    biosample.latitude,
    biosample.longitude,
    biosample.sample_collection_site,
    biosample.identifier,
    biosample.sample_collection_year,
    biosample.sample_collection_month,
    biosample.sample_collection_day,
    biosample.sample_collection_hour,
    biosample.sample_collection_minute,
    biosample.host_name,
    biosample.depth,
    biosample.subsurface_depth,
    biosample.altitude,
    biosample.temperature_range,
    biosample.proport_woa_temperature,
    biosample.biogas_temperature,
    biosample.growth_temperature,
    biosample.soil_annual_season_temp,
    biosample.water_samp_store_temp,
    biosample.biogas_retention_time,
    biosample.salinity,
    biosample.pressure,
    biosample.ph,
    biosample.chlorophyll_concentration,
    biosample.nitrate_concentration,
    biosample.oxygen_concentration,
    biosample.salinity_concentration
"""
biosampledf = sqldf(q)

In [26]:
biosample_dictdf = biosampledf.to_dict(orient="records") # transform dataframe to dictionary

In [27]:
## specify characteristics
characteristics = \
    ['add_date',
     'mod_date',
     'ecosystem_path_id',
     'ecosystem',
     'ecosystem_category',
     'ecosystem_type',
     'ecosystem_subtype',
     'specific_ecosystem',
     'habitat',
     'location',
     'community',
     'ncbi_taxonomy_name',
     'geographic_location',
     'latitude',
     'longitude',
     'sample_collection_site',
     'identifier',
     'sample_collection_year',
     'sample_collection_month',
     'sample_collection_day',
     'sample_collection_hour',
     'sample_collection_minute',
     'host_name',
     'depth',
     'subsurface_depth',
     'altitude',
     'temperature_range',
     'proport_woa_temperature',
     'biogas_temperature',
     'growth_temperature',
     'soil_annual_season_temp',
     'water_samp_store_temp',
     'biogas_retention_time',
     'salinity',
     'pressure',
     'ph',
     'chlorophyll_concentration',
     'nitrate_concentration',
     'oxygen_concentration',
     'salinity_concentration'
    ]

In [28]:
## create list of json string objects
biosample_json_list = dop.make_json_string_list\
    (biosample_dictdf, nmdc.Biosample, id_key='gold_id', name_key='biosample_name', 
     part_of_key="project_gold_ids", description_key="description", characteristic_fields=characteristics)

In [29]:
# print(biosample_json_list[0]) # peek at data

In [30]:
dop.save_json_string_list("output/nmdc-json/biosample.json", biosample_json_list) # save json string list to file