# Identifying relationships between annotated omics data in NMDC

In this notebook, we explore how two types of omics data, metagenomic and metatranscriptomic, can be connected through commonly used annotation vocabularies such as biomolecules, taxonomy, and KEGG pathways. As an example, we use the **Harvard Forest dataset**, which includes paired, processed metagenomic and metatranscriptomic data available in the NMDC Data Portal, to demonstrate how to link and analyze these data types for integrated interpretation.

In [1]:
# Setup 
# Add the renv project library to R's library search path (.libPaths())
.libPaths(c(.libPaths(), "../../renv/library/*/R-*/*"))

# Load required packages
suppressPackageStartupMessages({
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(stringr, warn.conflicts = FALSE)
library(readr, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(purrr)
library(tibble)
library(jsonlite)
library(KEGGREST)
})

# Load NMDC API functions from this repo: use the local copy by default,
# or fetch the file from GitHub when running in Google Colab
if (Sys.getenv("COLAB_BACKEND_VERSION") == "") {
  source("../../utility_functions.R")
} else {
  source("https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/utility_functions.R")
}

## 1. Retrieve data from the NMDC database using API endpoints

### Choose the data you want to explore

The [NMDC Data Portal](https://data.microbiomedata.org/) is a powerful resource where you can search for samples by all kinds of criteria. For this example, we’ll focus on finding samples that have both metagenomics and metatranscriptomics data.  
To do this, just use the Data Type filters (the “upset plot” below the interactive map) to quickly spot the right samples. In our case, these filters lead us to samples from the study [“Jeff Blanchard’s Harvard forest soil project”](https://data.microbiomedata.org/details/study/nmdc:sty-11-8ws97026).

### Get and filter data for Jeff Blanchard’s Harvard forest soil project

Every study in the portal has a unique ID in its URL; for this one, it’s `nmdc:sty-11-8ws97026`.  
We’ll use the function `get_data_objects_for_study` (found in `utility_functions.R`) to pull up all related data records for this project. This includes download links for raw files (like FASTQs) as well as processed outputs from NMDC’s workflows.

> Tip: You can use these data objects to download files directly or to browse the processed results right in the portal!


In [3]:
# Retrieve all data objects associated with this study
dobj <- get_data_objects_for_study("nmdc:sty-11-8ws97026")

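
Requests to the NMDC API can occasionally fail with a transient server error such as HTTP 503. One way to make the retrieval step more robust is to retry it a few times before giving up. The helper below is a minimal sketch under that assumption: `retry_fetch` and its arguments (`max_tries`, `wait_seconds`) are hypothetical names introduced here for illustration and are not part of `utility_functions.R`.

```r
# Hypothetical helper (not from utility_functions.R): call `fetch(...)` and
# retry on error, pausing a little longer after each failed attempt.
retry_fetch <- function(fetch, ..., max_tries = 3, wait_seconds = 2) {
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a condition object instead of stopping
    result <- tryCatch(fetch(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds * attempt)
    }
  }
  stop("All ", max_tries, " attempts failed", call. = FALSE)
}

# Possible usage with the function from this notebook (not run here):
# dobj <- retry_fetch(get_data_objects_for_study, "nmdc:sty-11-8ws97026")
```

Passing the fetching function as an argument keeps the sketch independent of how `get_data_objects_for_study` is implemented. If the service stays unavailable, the final `stop()` still surfaces the failure rather than hiding it.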


One way to further identify an NMDC `DataObject` record is through its [`data_object_type`](https://microbiomedata.github.io/nmdc-schema/data_object_type/) slot, which holds a value from [`FileTypeEnum`](https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/).
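
Before filtering, it can help to see which `data_object_type` values are present and how many files carry each one. The sketch below does this on a small toy table rather than on the real `dobj`. The biosample IDs are made up for illustration, and only the `biosample_id` and `data_object_type` columns are assumed.

```r
library(dplyr, warn.conflicts = FALSE)
library(tibble)

# Toy stand-in for `dobj`; the biosample IDs are placeholders
toy_dobj <- tribble(
  ~biosample_id, ~data_object_type,
  "bsm-A",       "Annotation KEGG Orthology",
  "bsm-A",       "Gene Phylogeny tsv",
  "bsm-A",       "Metatranscriptome Expression",
  "bsm-B",       "Annotation KEGG Orthology",
  "bsm-B",       "Gene Phylogeny tsv"
)

# Count how many data objects exist for each file type
type_counts <- toy_dobj %>%
  count(data_object_type, name = "n_files")

print(type_counts)
```

Running the same `count()` call on the real `dobj` would list every file type returned for the study, which makes it easier to pick the exact strings to filter on.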

We want to look at the processed results for the two omics types of interest in this notebook. Specifically, we want the files containing KEGG Orthology annotations, taxonomic lineage, and expression data. Expression data are available for metatranscriptomics (in the "Metatranscriptome Expression" data object), while annotation data are available for both metagenomics and metatranscriptomics (in the "Annotation KEGG Orthology" and "Gene Phylogeny tsv" results files).

We filter for results files with the following `data_object_type` values:

| Value | Description |
|:-----:|:-----------:|
|Gene Phylogeny tsv|Tab-delimited file of gene phylogeny|
|Annotation KEGG Orthology|Tab-delimited file of KO annotations|
|Metatranscriptome Expression|Read count table output|


In [None]:
dobj <- dobj %>%
  # Keep only biosamples that have all three file types: metatranscriptome
  # expression, KEGG Orthology annotation, and gene phylogeny
  group_by(biosample_id) %>%
  filter("Metatranscriptome Expression" %in% data_object_type &
           "Annotation KEGG Orthology" %in% data_object_type &
           "Gene Phylogeny tsv" %in% data_object_type ) %>%
  ungroup() %>%
  
  # Remove uninformative columns for simpler dataframe
  select(-c(alternative_identifiers, in_manifest, was_generated_by)) %>% 

  # Filter to only desired file types
  filter(data_object_type %in% c("Metatranscriptome Expression", "Annotation KEGG Orthology",
                                "Gene Phylogeny tsv"))

head(dobj)
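
The grouped filter above relies on `%in%` being evaluated once per biosample, so a biosample is kept only if every required file type appears somewhere in its group. The toy example below is a minimal sketch of that idiom. The biosample IDs and the "Other file type (placeholder)" value are invented for illustration, and `all(wanted_types %in% data_object_type)` is used as an equivalent shorthand for the three conditions joined by `&`.

```r
library(dplyr, warn.conflicts = FALSE)
library(tibble)

wanted_types <- c(
  "Metatranscriptome Expression",
  "Annotation KEGG Orthology",
  "Gene Phylogeny tsv"
)

# Toy stand-in for `dobj`: bsm-A has all three types, bsm-B lacks expression data
toy_dobj <- tribble(
  ~biosample_id, ~data_object_type,
  "bsm-A",       "Metatranscriptome Expression",
  "bsm-A",       "Annotation KEGG Orthology",
  "bsm-A",       "Gene Phylogeny tsv",
  "bsm-A",       "Other file type (placeholder)",
  "bsm-B",       "Annotation KEGG Orthology",
  "bsm-B",       "Gene Phylogeny tsv"
)

complete <- toy_dobj %>%
  # Within each biosample, require every wanted type to be present
  group_by(biosample_id) %>%
  filter(all(wanted_types %in% data_object_type)) %>%
  ungroup() %>%
  # Then drop rows for file types we do not need
  filter(data_object_type %in% wanted_types)

print(complete)
```

In this toy table only `bsm-A` survives the first filter, and the second filter removes its placeholder row, leaving one row per wanted file type.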