# Identifying relationships between annotated omics data in NMDC

In this notebook, we’ll look at how different types of omics data can be connected using popular annotation vocabularies. We’ll focus on exploring biomolecules and KEGG pathways from a group of samples that have both metagenomic and metatranscriptomic data, all available in the NMDC Data Portal. By working through these examples, you’ll see how to bring these data types together for integrated analysis.

In [1]:
%%capture
# Install nmdc_api_utilities and the other packages this notebook imports
%pip install nmdc_api_utilities
%pip install biopython
%pip install pycirclize
%pip install python-dotenv

## Set up environment variables
**You can disregard this section of code unless you have interest in testing this on the development API.**

Using Python's python-dotenv package, load the environment variables from the system. This chunk of code is used in the GitHub CI/CD pipelines to test our development API. The variable can be used when creating `nmdc_api_utilities` objects. If you do not have the environment variable `ENV` set in a .env file, this code will default to the value "prod", which tells nmdc_api_utilities to use the production API URL.

In [None]:
# set up environment variables
from dotenv import load_dotenv
import os
load_dotenv()
# load the environment variable ENV. If it does not exist, default to "prod"
ENV = os.environ.get("ENV", "prod")

In [2]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import numpy as np
import sys
from io import StringIO
from Bio.KEGG import REST
from pycirclize import Circos
import time
import importlib.util
import nmdc_api_utilities

This notebook is an example of how different omics data types can be linked via commonly used annotation vocabularies and investigated together. We explore biomolecules and KEGG pathways identified in a set of samples that have processed metagenomics and metatranscriptomics data available in the NMDC Data Portal.

## 1. Retrieve data from the NMDC database using API endpoints

### Choose data to retrieve

The [NMDC Data Portal](https://data.microbiomedata.org/) is a powerful resource where you can search for samples by all kinds of criteria. For this example, we’ll focus on finding samples that have both metagenomics and metatranscriptomics data.  
To do this, just use the Data Type filters (the “upset plot” below the interactive map) to quickly spot the right samples. In our case, these filters lead us to samples from the study [“Jeff Blanchard’s Harvard forest soil project”](https://data.microbiomedata.org/details/study/nmdc:sty-11-8ws97026).

### Retrieve and filter data for the Harvard forest soil study

The study page linked above has the NMDC study identifier in the URL: `nmdc:sty-11-8ws97026`. We will use the [nmdc_api_utilities](https://microbiomedata.github.io/nmdc_api_utilities/) package to query the `data_objects/study` endpoint of the [NMDC Runtime API](https://api.microbiomedata.org/docs) and retrieve all data object records for this study. These records include URLs for downloading raw data files (e.g. FASTQ) as well as processed results output by the NMDC workflows.
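As a small illustration of where the identifier comes from, the sketch below pulls the last path segment out of the study page URL using only the standard library. The helper name `study_id_from_portal_url` is hypothetical and is not part of `nmdc_api_utilities`; it simply assumes that Data Portal study URLs end with the study identifier, as the link above does.

```python
from urllib.parse import urlparse, unquote


def study_id_from_portal_url(url: str) -> str:
    """Return the last path segment of a Data Portal study URL (hypothetical helper)."""
    # Decode any percent-encoding (e.g. %3A for ':') before splitting the path
    path = unquote(urlparse(url).path)
    # Drop a trailing slash, if present, and keep the final segment
    return path.rstrip("/").rsplit("/", 1)[-1]


portal_url = "https://data.microbiomedata.org/details/study/nmdc:sty-11-8ws97026"
study_id = study_id_from_portal_url(portal_url)
print(study_id)
```

The resulting string is the value passed as `study_id` in the retrieval step.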

In [None]:
# Show at most 6 rows when displaying large dataframes
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch(env=ENV)
# Get all data objects associated with this study id
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-8ws97026')
data = pd.DataFrame(data)

# Flatten the nested data_objects of each biosample into one dataframe,
# keeping the biosample id alongside each data object
data_objects=[]
for index, row in data.iterrows():
    bio_id = row['biosample_id']
    row_out = pd.json_normalize(row['data_objects'])
    row_out['biosample_id'] = bio_id
    data_objects.append(row_out)

data_objects = pd.concat(data_objects).reset_index(drop=True)
display(data_objects)

del data, index, row, row_out, bio_id
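To make the reshaping step easier to follow, here is a minimal sketch on made-up records. It assumes the API response has the shape the loop above relies on: one record per biosample, each with a `biosample_id` and a nested `data_objects` list. The identifiers, file names, and sizes are invented for illustration. It uses the `record_path` and `meta` arguments of `pd.json_normalize`, which should give the same flat layout as the loop for records of this shape.

```python
import pandas as pd

# Hypothetical records: one per biosample, each with a nested list of data objects
toy_records = [
    {
        "biosample_id": "nmdc:bsm-example-1",
        "data_objects": [
            {"id": "nmdc:dobj-example-a", "name": "reads.fastq.gz", "file_size_bytes": 100},
            {"id": "nmdc:dobj-example-b", "name": "annotation.gff", "file_size_bytes": 50},
        ],
    },
    {
        "biosample_id": "nmdc:bsm-example-2",
        "data_objects": [
            {"id": "nmdc:dobj-example-c", "name": "reads.fastq.gz", "file_size_bytes": 70},
        ],
    },
]

# One row per data object, with the parent biosample id repeated on each row
toy_flat = pd.json_normalize(toy_records, record_path="data_objects", meta=["biosample_id"])
print(toy_flat)
```

Each nested data object becomes its own row, so downstream filtering can work on ordinary columns while still tracing every file back to its biosample.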