# Identifying relationships between annotated omics data in NMDC

In this notebook, we’ll look at how different types of omics data can be connected using popular annotation vocabularies. We’ll focus on exploring biomolecules and KEGG pathways from a group of samples that have both metagenomic and metatranscriptomic data, all available in the NMDC Data Portal. By working through these examples, you’ll see how to bring these data types together for integrated analysis.

In [6]:
%%capture
## Install the packages used in this notebook
%pip install nmdc_api_utilities
%pip install biopython
%pip install pycirclize
%pip install python-dotenv
%pip install plotly
%pip install seaborn

## Set up environment variables
**You can disregard this section of code unless you have interest in testing this on the development API.**

Using Python's python-dotenv package, we load environment variables from the system. This code is used in the GitHub CI/CD pipelines to test our development API. The variable can be passed when creating `nmdc_api_utilities` objects. If the environment variable `ENV` is not set (for example, in a .env file), this code will default to "prod", which tells nmdc_api_utilities to use the production API URL.

In [4]:
# set up environment variables
from dotenv import load_dotenv
import os
load_dotenv()
# load the environment variable ENV. If it does not exist, default to "prod"
ENV = os.environ.get("ENV", "prod")
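The fallback behaviour described above can be checked in isolation. The following is a minimal sketch: `resolve_env` is a hypothetical helper written for illustration, and the value "dev" is an assumed example of a non-default setting, not a value confirmed by this notebook.

```python
import os


def resolve_env(environ=None, default="prod"):
    """Return the value of ENV from a mapping, or the default if it is unset.

    Hypothetical helper for illustration; the notebook itself calls
    os.environ.get("ENV", "prod") directly.
    """
    environ = os.environ if environ is None else environ
    return environ.get("ENV", default)


# With no ENV entry, the production default is used.
print(resolve_env({}))
# "dev" is an assumed example value for a development setting.
print(resolve_env({"ENV": "dev"}))
```

Passing a plain dictionary instead of `os.environ` keeps the check independent of whatever is set on the machine running the notebook.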

In [7]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import numpy as np
import sys
from io import StringIO
from Bio.KEGG import REST
from pycirclize import Circos
import time
import importlib.util
import nmdc_api_utilities

This notebook is an example of how different omics data types may be linked via commonly used annotation vocabularies and investigated together. In this notebook we explore KEGG pathways and taxonomy lineage identified in a set of samples that have processed metagenomics and metatranscriptomics data available in the NMDC Data Portal.

## 1. Retrieve data from the NMDC database using API endpoints

### Choose data to retrieve

The [NMDC Data Portal](https://data.microbiomedata.org/) is a resource where you can search for samples by many criteria. For this example, we’ll focus on finding samples that have both metagenomics and metatranscriptomics data.
To do this, use the Data Type filters (the “upset plot” below the interactive map) to find samples with both data types. In our case, these filters lead us to samples from the study [“Jeff Blanchard’s Harvard forest soil project”](https://data.microbiomedata.org/details/study/nmdc:sty-11-8ws97026).

### Retrieve and filter data for Harvard forest soil study

The study page linked above has the NMDC study identifier in the URL: `nmdc:sty-11-8ws97026`. We will use the [nmdc_api_utilities](https://microbiomedata.github.io/nmdc_api_utilities/) package to access the data_objects/study endpoint from the [NMDC Runtime API](https://api.microbiomedata.org/docs) to retrieve all records that represent data. This includes URLs for downloading raw data files (e.g. FASTQ) as well as processed data results output by the NMDC workflows.

In [33]:
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch(env=ENV)
#get all data objects associated with this study id
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-8ws97026')
data = pd.DataFrame(data)

#reformat data into dataframe
data_objects=[]
for index, row in data.iterrows():
    bio_id = row['biosample_id']
    row_out = pd.json_normalize(row['data_objects'])
    row_out['biosample_id'] = bio_id
    data_objects.append(row_out)

data_objects = pd.concat(data_objects).reset_index(drop=True)
display(data_objects)

del data, index, row, row_out, bio_id

,id,type,name,file_size_bytes,md5_checksum,data_object_type,was_generated_by,url,description,data_category,biosample_id
0,nmdc:dobj-11-0ebr0z48,nmdc:DataObject,11862.7.224581.ACGGAAC-TGTTCCG.fastq.gz,34934525960,952d3a6780ac34a03b112884ec10cdda,Metagenome Raw Reads,nmdc:dgns-11-0g2bvk46,https://data.microbiomedata.org/data/nmdc:dgns...,Metagenome Raw Reads for nmdc:dgns-11-0g2bvk46,instrument_data,nmdc:bsm-11-622k6044
1,nmdc:dobj-11-s5chx118,nmdc:DataObject,nmdc_wfrqc-11-840m6332.1_filtered.fastq.gz,31299212828,440a9bfb671ab0fea36359c715bb5f4e,Filtered Sequencing Reads,nmdc:wfrqc-11-840m6332.1,https://data.microbiomedata.org/data/nmdc:dgns...,Reads QC for nmdc:dgns-11-0g2bvk46,processed_data,nmdc:bsm-11-622k6044
2,nmdc:dobj-11-p7q8bb28,nmdc:DataObject,nmdc_wfrqc-11-840m6332.1_filterStats.txt,801,b82b37fb53861fa34a06ca201068d727,QC Statistics,nmdc:wfrqc-11-840m6332.1,https://data.microbiomedata.org/data/nmdc:dgns...,Reads QC summary for nmdc:dgns-11-0g2bvk46,processed_data,nmdc:bsm-11-622k6044
...,...,...,...,...,...,...,...,...,...,...,...
3346,nmdc:dobj-11-8pbtm711,nmdc:DataObject,Blanch_Nat_Lip_H_4_AB_O_06_NEG_25Jan18_Brandi-...,12668,da7f3a98506322255816e3c2507a11e9,Configuration toml,nmdc:wfmb-11-g1vtvp02.1,https://nmdcdemo.emsl.pnnl.gov/lipidomics/blan...,CoreMS parameters used for Lipidomics workflow.,workflow_parameter_data,nmdc:bsm-11-y52p4f86
3347,nmdc:dobj-11-869nnh02,nmdc:DataObject,Blanch_Nat_Lip_H_4_AB_O_06_NEG_25Jan18_Brandi-...,853274930,946f895d2a641e9dac6cd927715df3b8,LC-MS Lipidomics Processed Data,nmdc:wfmb-11-g1vtvp02.1,https://nmdcdemo.emsl.pnnl.gov/lipidomics/blan...,CoreMS hdf5 file representing a lipidomics dat...,processed_data,nmdc:bsm-11-y52p4f86
3348,nmdc:dobj-11-zrzcp186,nmdc:DataObject,Blanch_Nat_Lip_H_4_AB_O_06_NEG_25Jan18_Brandi-...,31914586,5ec14469476f651fceaf2ec5068ad140,LC-MS Lipidomics Results,nmdc:wfmb-11-g1vtvp02.1,https://nmdcdemo.emsl.pnnl.gov/lipidomics/blan...,Lipid annotations as a result of a lipidomics ...,processed_data,nmdc:bsm-11-y52p4f86
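The reshaping loop above turns one row per biosample, each holding a nested list of data objects, into one row per data object. As a minimal sketch of the same idea on made-up records (the identifiers below are hypothetical and abbreviated), `pd.json_normalize` can do this in a single call using its `record_path` and `meta` parameters. This is an alternative shown for illustration, not the approach the notebook uses.

```python
import pandas as pd

# Hypothetical records in the shape the loop assumes:
# one entry per biosample with a nested list of data objects.
records = [
    {
        "biosample_id": "bsm-A",
        "data_objects": [
            {"id": "dobj-1", "data_object_type": "Metagenome Raw Reads"},
            {"id": "dobj-2", "data_object_type": "QC Statistics"},
        ],
    },
    {
        "biosample_id": "bsm-B",
        "data_objects": [
            {"id": "dobj-3", "data_object_type": "Annotation KEGG Orthology"},
        ],
    },
]

# Expand each nested data object into its own row and carry the
# parent biosample_id along as an extra column.
flat = pd.json_normalize(records, record_path="data_objects", meta=["biosample_id"])
print(flat)
```

Each data object keeps a reference to its biosample, which is what later steps rely on when grouping by `biosample_id`.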


One way of further identifying an NMDC `DataObject` record is by looking at its slot [`data_object_type`](https://microbiomedata.github.io/nmdc-schema/data_object_type/), which contains a value from [`FileTypeEnum`](https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/).

We want to look at the processed results for the data types of interest in this notebook: the files containing KEGG Orthology annotations, taxonomy lineage, and expression data. Expression data are available only for metatranscriptomics (in the "Metatranscriptome Expression" data object), while annotation data are found in both the metagenomics and metatranscriptomics results (in the "Annotation KEGG Orthology" and "Gene Phylogeny tsv" files).

We filter for results files with the following `data_object_type` values:

| Value | Description |
|:-----:|:-----------:|
|Gene Phylogeny tsv|Tab-delimited file of gene phylogeny|
|Annotation KEGG Orthology|Tab delimited file for KO annotation|
|Metatranscriptome Expression|Read count table output|


In [43]:
# Filter to biosamples that have KO annotations, Lineage, and expression results
data_objects = data_objects.groupby('biosample_id').filter(lambda x: all(data_type in x['data_object_type'].values for data_type in [
        "Annotation KEGG Orthology",
        "Metatranscriptome Expression",
        "Gene Phylogeny tsv"
    ])).reset_index(drop=True)

# Filter to the desired results file types
results_by_biosample = data_objects[data_objects['data_object_type'].isin([
    "Annotation KEGG Orthology",
    "Metatranscriptome Expression", "Gene Phylogeny tsv"
])][['biosample_id', 'id','data_object_type', 'url']].rename(columns={'id':'output_id'}).reset_index(drop=True)

# Is there one output for each data object type for each biosample?
results_by_biosample['data_object_type'].value_counts().to_frame()

data_object_type,count
Annotation KEGG Orthology,50
Gene Phylogeny tsv,50
Metatranscriptome Expression,31


### Select biosamples that have both metaT and metaG analysis results
A biosample with both metagenome and metatranscriptome results is expected to have two lineage TSV files and two KO files, one from each workflow. If every biosample with expression data were complete, each of those counts would be double the expression count (31 × 2 = 62). We observe only 50, which indicates that some biosamples have only one type of data.

To address this, we need to filter out those incomplete biosamples and retain only those with both metaG and metaT data for the analysis.
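One way to see which biosamples are incomplete is to count files per biosample and per data object type. The sketch below uses a small made-up table (biosample names "A" and "B" are hypothetical) and `pd.crosstab`. It illustrates the counting argument above and is not the method used in the next cell.

```python
import pandas as pd

# Made-up example: biosample A has KO and lineage files from both workflows,
# biosample B has them from only one workflow.
toy = pd.DataFrame({
    "biosample_id": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "data_object_type": [
        "Annotation KEGG Orthology", "Annotation KEGG Orthology",
        "Gene Phylogeny tsv", "Gene Phylogeny tsv",
        "Metatranscriptome Expression",
        "Annotation KEGG Orthology", "Gene Phylogeny tsv",
        "Metatranscriptome Expression",
    ],
})

# Rows are biosamples, columns are data object types, values are file counts.
counts_per_biosample = pd.crosstab(toy["biosample_id"], toy["data_object_type"])
print(counts_per_biosample)

# Biosamples with fewer than two KO files are missing one of the two workflows.
incomplete = counts_per_biosample[
    counts_per_biosample["Annotation KEGG Orthology"] < 2
].index.tolist()
print(incomplete)
```

A count table like this shows only that a file is missing, not which workflow it came from, which is why the notebook goes on to inspect the URLs.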

In [62]:
results_by_biosample["_has_metaT"] = results_by_biosample["url"].str.contains(r"nmdc:wfmt", case=False, na=False)
results_by_biosample["_has_metaG"] = results_by_biosample["url"].str.contains(r"nmdc:wfmg", case=False, na=False)

mask_annotation = results_by_biosample["data_object_type"] == "Annotation KEGG Orthology"
mask_lineage = results_by_biosample["data_object_type"] == "Gene Phylogeny tsv"

results_by_biosample.loc[results_by_biosample["_has_metaT"] & mask_annotation, "data_object_type"] = "MetaT Annotation KEGG Orthology"
results_by_biosample.loc[results_by_biosample["_has_metaG"] & mask_annotation, "data_object_type"] = "MetaG Annotation KEGG Orthology"
results_by_biosample.loc[results_by_biosample["_has_metaT"] & mask_lineage, "data_object_type"] = "MetaT Gene Phylogeny tsv"
results_by_biosample.loc[results_by_biosample["_has_metaG"] & mask_lineage, "data_object_type"] = "MetaG Gene Phylogeny tsv"

# Per-biosample aggregates
agg = results_by_biosample.groupby(results_by_biosample["biosample_id"]).agg(
    rows_per_biosample=("biosample_id", "size"),
    has_metaT=("_has_metaT", "any"),
    has_metaG=("_has_metaG", "any"),
    metaT_row_count=("_has_metaT", "sum"),
    metaG_row_count=("_has_metaG", "sum"),
).reset_index()
# Classify
def classify(row):
    if row["has_metaT"] and row["has_metaG"]:
        return "both_metaT_and_metaG"
    elif row["has_metaT"] and not row["has_metaG"]:
        return "metaT_only"
    elif row["has_metaG"] and not row["has_metaT"]:
        return "metaG_only"
    else:
        return "neither"
agg["category"] = agg.apply(classify, axis=1)



# Count biosamples per category; rename_axis/reset_index(name=...) gives
# "category" and "count" columns regardless of the pandas version
counts = (
    agg["category"]
    .value_counts()
    .rename_axis("category")
    .reset_index(name="count")
)

print(counts)


# Filter to keep only biosamples with both
keep_ids = set(agg.loc[agg["category"] == "both_metaT_and_metaG", "biosample_id"])
drop_ids = set(agg.loc[agg["category"] != "both_metaT_and_metaG", "biosample_id"])
filtered_results_by_biosample = results_by_biosample[results_by_biosample["biosample_id"].isin(keep_ids)].drop(columns=["_has_metaT", "_has_metaG"])

               category  count
0  both_metaT_and_metaG     19
1            metaT_only     12


### Download selected results files
 
Now we can use the `url` slot from the filtered `DataObject` records to read in all of the files containing the annotations of interest.
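Before downloading, it helps to have one row per biosample with one URL column per file type. A minimal sketch of that long-to-wide reshaping with `DataFrame.pivot` follows. The biosample names and `example.org` URLs are placeholders invented for illustration.

```python
import pandas as pd

# Long format: one row per (biosample, file type) pair. All values are placeholders.
long_urls = pd.DataFrame({
    "biosample_id": ["A", "A", "B", "B"],
    "data_object_type": [
        "MetaG Annotation KEGG Orthology", "Metatranscriptome Expression",
        "MetaG Annotation KEGG Orthology", "Metatranscriptome Expression",
    ],
    "url": [
        "https://example.org/A_ko.tsv", "https://example.org/A_expr.tsv",
        "https://example.org/B_ko.tsv", "https://example.org/B_expr.tsv",
    ],
})

# Wide format: one row per biosample, one column of URLs per file type.
wide_urls = long_urls.pivot(
    index="biosample_id",
    columns="data_object_type",
    values="url",
).reset_index()
print(wide_urls)
```

Note that `pivot` raises a `ValueError` if any biosample has two files of the same type, so it also acts as a check that each biosample and file type pair is unique.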

In [None]:
# Create one URL column per data type
filtered_results_by_biosample = filtered_results_by_biosample[['biosample_id', 'data_object_type', 'url']].pivot(
    index='biosample_id', 
    columns='data_object_type', 
    values='url'
).reset_index()

# Function to read TSV files
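def read_tsv_from_url(url: str, column_names=None):
    """Download a tab-separated file and return it as a DataFrame.

    If column_names is given, the file is treated as headerless and
    those names are used as the column labels.
    """
    response = requests.get(url)
    # Fail early on HTTP errors instead of parsing an error page as data
    response.raise_for_status()
    if column_names is None:
        data = pd.read_csv(StringIO(response.text), sep='\t')
    else:
        data = pd.read_csv(StringIO(response.text), sep='\t', header=None, names=column_names)
    return data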

# Read KO results
ko_columns = ["gene_id", "img_ko_flag", "ko_term", "percent_identity",
              "query_start", "query_end", "subj_start", "subj_end",
              "evalue", "bit_score", "align_length"]
filtered_results_by_biosample['metag_ko_results'] = filtered_results_by_biosample['MetaG Annotation KEGG Orthology'].apply(
    lambda x: read_tsv_from_url(x, ko_columns) if pd.notnull(x) else None
)
filtered_results_by_biosample['metat_ko_results'] = filtered_results_by_biosample['MetaT Annotation KEGG Orthology'].apply(
    lambda x: read_tsv_from_url(x, ko_columns) if pd.notnull(x) else None
)

# Read lineage results
lineage_columns = ["gene_id", "homolog_gene_oid", "homolog_taxon_oid", "percent_identity","lineage"]
filtered_results_by_biosample['metag_lineage_results'] = filtered_results_by_biosample['MetaG Gene Phylogeny tsv'].apply(
    lambda x: read_tsv_from_url(x, lineage_columns) if pd.notnull(x) else None
)
filtered_results_by_biosample['metat_lineage_results'] = filtered_results_by_biosample['MetaT Gene Phylogeny tsv'].apply(
    lambda x: read_tsv_from_url(x, lineage_columns) if pd.notnull(x) else None
)

# Read expression results
expression_columns= ["img_gene_oid", "img_scaffold_oid", "locus_tag", "scaffold_accession", "strand", "locus_type", "length", "reads_cnt", "mean", "median", "stdev", "reads_cntA", "meanA", "medianA", "stdevA"]
filtered_results_by_biosample['metat_expression_results'] = filtered_results_by_biosample['Metatranscriptome Expression'].apply(
    lambda x: read_tsv_from_url(x, expression_columns) if pd.notnull(x) else None
)


filtered_results_by_biosample
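The headerless-TSV handling in `read_tsv_from_url` can be tried offline, without any download. The sketch below parses two made-up KO annotation rows from a string using the same `ko_columns` names as the cell above. The gene identifiers and values are invented for illustration and do not come from NMDC files.

```python
from io import StringIO

import pandas as pd

ko_columns = ["gene_id", "img_ko_flag", "ko_term", "percent_identity",
              "query_start", "query_end", "subj_start", "subj_end",
              "evalue", "bit_score", "align_length"]

# Two made-up rows with no header line, tab-separated (values are invented).
sample_text = (
    "gene_1\tNo\tKO:K00001\t85.5\t1\t100\t1\t100\t1e-30\t150.0\t100\n"
    "gene_2\tNo\tKO:K00002\t90.0\t5\t120\t3\t118\t1e-40\t200.0\t116\n"
)

# header=None tells pandas the first line is data; names supplies the labels.
ko = pd.read_csv(StringIO(sample_text), sep="\t", header=None, names=ko_columns)
print(ko[["gene_id", "ko_term", "percent_identity"]])
```

Without `header=None`, pandas would consume the first data row as column labels and silently drop one annotation per file.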

In [56]:
filtered_results_by_biosample

,biosample_id,output_id,data_object_type,url
0,nmdc:bsm-11-8rn9bm20,nmdc:dobj-11-mp887905,MetaT Annotation KEGG Orthology,https://data.microbiomedata.org/data/nmdc:dgns...
1,nmdc:bsm-11-8rn9bm20,nmdc:dobj-11-hgv4z074,MetaT Gene Phylogeny tsv,https://data.microbiomedata.org/data/nmdc:dgns...
2,nmdc:bsm-11-8rn9bm20,nmdc:dobj-11-f282ne45,Metatranscriptome Expression,https://data.microbiomedata.org/data/nmdc:dgns...
...,...,...,...,...
125,nmdc:bsm-11-6x067s27,nmdc:dobj-11-r46w3k75,Metatranscriptome Expression,https://data.microbiomedata.org/data/nmdc:dgns...
126,nmdc:bsm-11-6x067s27,nmdc:dobj-11-mxe2b333,MetaG Annotation KEGG Orthology,https://data.microbiomedata.org/data/nmdc:dgns...
127,nmdc:bsm-11-6x067s27,nmdc:dobj-11-myqzb759,MetaG Gene Phylogeny tsv,https://data.microbiomedata.org/data/nmdc:dgns...
