# Meta'omics for Ocean Science

Ocean Hack Week 2023 Tutorial by Julia M Brown

![scales](./tutorial_images/rainbow_satellite_to_microbe.png)  

With thanks to the following for content and inspiration:  
[Greg Gavelis](https://github.com/ggavelis), [Joe Brown](https://github.com/brwnj), [Maria Pachiadaki](https://github.com/microbiaki), [Ramunas Stepanauskas](https://www.bigelow.org/about/people/rstepanauskas.html), [MerenLab](https://merenlab.org/), [Kaiju Team](https://bioinformatics-centre.github.io/kaiju/), [Cath Mitchell](https://github.com/MarineOpticsLab)



### What is 'omics data?

* Data on biological molecules  
* 'Meta' refers to collecting and processing samples in bulk 
* Data often focused on specific size fractions  

### How is it generated?

* Collection of sample in bulk.  
* For planktonic microbes, samples are collected based on a specific size fraction that targets different microbial groups
    * e.g. bacteria and archaea, protists, phytoplankton, viruses  
* Nucleic acids, proteins or other target molecules extracted and sequenced 
* For nucleotide data (DNA + RNA), samples often sequenced via Illumina sequencing
    * generates short __paired end__ reads 
    * reads can be characterized directly or used to assemble larger __contiguous sequences__

![omics](./tutorial_images/metaXomics_diagram.png)  

### It can tell us the who and what of microbial communities

**Metagenome (DNA)** : Presence and potential  
**Metatranscriptome (RNA)**: Activity  

**Taxonomy:**  

* What microbes are present -- DNA
* Which microbes are active -- RNA

**Function:**

* What is the metabolic potential? -- DNA
* What processes are being carried out? -- RNA

### What does it look like?  

* fastq - raw sequence read data with quality information included
* fasta - sequence data, can be contiguous sequences, open reading frames (i.e. coding sequences) or protein sequences
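
To make the fastq description above concrete, here is a minimal sketch of reading fastq records in plain Python. It assumes unwrapped four-line records and Phred+33 quality encoding (the usual encoding for current Illumina data). The function name `read_fastq` and the example read are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) for each 4-line fastq record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no information here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# A made-up record: '@' + read id, bases, '+', one quality character per base
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

A fasta record is simpler: a `>` header line followed by the sequence, with no quality line.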



### How can we use raw reads?

**Read profiling** is one of the most commonly used processes in 'omics analysis. It is applied to assess the relative abundance of taxonomic groups within metagenomic datasets (when using DNA metagenomes) or to estimate the expression of different microbial taxa (when RNA metatranscriptomes are used).

In a nutshell, short reads are aligned to genomic reference sequences. Each reference sequence carries taxonomic information, which can then be transferred to the reads that align to it.

![recruitment](./tutorial_images/01-metagenomic-read-recruitment-simple.gif)  
(Thank you [MerenLab](https://merenlab.org/) for the animation)
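
Once reads carry a taxonomic label, relative abundance is just counting. The sketch below assumes the alignment step has already produced a read-to-taxon mapping. The function name `relative_abundance` and the toy assignments are invented for illustration and are not output from any real sample.

```python
from collections import Counter

def relative_abundance(read_to_taxon):
    """Return the fraction of assigned reads that belong to each taxon."""
    counts = Counter(read_to_taxon.values())
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Toy assignments (hypothetical): read id -> taxon of its best-matching reference
assignments = {
    "read1": "Prochlorococcus",
    "read2": "Prochlorococcus",
    "read3": "SAR11",
    "read4": "Synechococcus",
}
profile = relative_abundance(assignments)
print(profile)  # {'Prochlorococcus': 0.5, 'SAR11': 0.25, 'Synechococcus': 0.25}
```

Note that these fractions describe reads, not cells: genome size and how well each group is represented in the reference database both influence the counts.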

# Read classification tools

# [Kaiju](https://kaiju.binf.ku.dk/server)
![Kaiju](https://kaiju.binf.ku.dk/images/kaiju3_header.gif)

**Also: [Kraken2](https://github.com/DerrickWood/kraken2/)**

These workflows are wicked fast!

### How do they work?

**Database**  
The database consists of a collection of translated proteins mapped to microbial genomes.

<img src="./tutorial_images/proteins_to_genomes.png" alt="p2g" width="400"/>

### Short read alignment

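Reads are translated into protein sequences and aligned to reference protein sequences. Each read's best protein match is then used to assign a taxonomy, based on which microbial genomes contain that protein. This describes Kaiju; Kraken2 by default matches nucleotide k-mers rather than translated proteins.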

![kaiju_diagram](./tutorial_images/short_read_align.png)
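
The translation step can be sketched in a few lines of Python. This is only an illustration of six-frame translation with the standard genetic code, not Kaiju's implementation; the function names are hypothetical, and codons containing ambiguous bases are translated as `X`.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translations(read):
    """Translate a read in three forward and three reverse-complement frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames

frames = six_frame_translations("ATGGCCTAA")
print(frames)  # ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

All six frames are needed because a read carries no information about which strand or codon position it starts on.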

## Tools are as good as your reference database

Kaiju and other classifiers rely on genome databases that primarily contain genomes from isolated microbes and genomes assembled from metagenomes ('MAGs').

**Available Standard Kaiju Databases**  
<img src="https://upload.wikimedia.org/wikipedia/commons/0/07/US-NLM-NCBI-Logo.svg" alt="ncbi" width="100"/>

**nr**: Non-redundant proteins from bacteria, archaea and viruses  
**RefSeq**: Curated bacterial, archaeal and viral genomes from NCBI

<img src="https://progenomes.embl.de/img/progene_head21.png" alt="progenomes" width="300"/>  

**ProGenomes**: Database of microbial genomes including MAGs from diverse environments

**Note:** Kaiju has other available databases that could be useful for your environment or organisms of interest. See their website for more options.  

**Database Limitations**: Despite the depth of these collections, they leave stones unturned. Microbial diversity is high and genomes assembled from short reads represent only a fraction of the microbial diversity present in the ocean!

## SAGs

Single Cell Genomics is another type of 'omics data. It consists of DNA sequence data generated from the DNA present in single cells. The data from each cell is referred to as a __Single Amplified Genome (i.e. SAG)__. SAGs represent real biological units recovered from samples, and contain genomic information specific to individuals.

<img src="./tutorial_images/scg_diagram.png" alt="sag" width="600"/>

## GORG-Tropics: A collection of reference genomes from individual cells from the Tropical and Sub-tropical Epipelagic Ocean

GORG-Tropics is more representative of global ocean microbes than MAGs or currently available reference genomes*.

*I am not sure if GORG-Tropics has been integrated into ProGenomes or not

![GORG-Figure2](https://ars.els-cdn.com/content/image/1-s2.0-S0092867419312735-gr2.jpg)  


GORG-Tropics is more accurate and sensitive than the default databases used for read classification by Kaiju when analyzing marine epipelagic samples. When GORG-Tropics is used as the database for reads from similar environments, many more reads are correctly classified.

![GORG-Figure](https://ars.els-cdn.com/content/image/1-s2.0-S0092867419312735-gr6.jpg)  
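
One simple way to compare databases on your own samples is the fraction of reads that get classified at all. The sketch below assumes Kaiju's tab-separated output, where the first column is `C` (classified) or `U` (unclassified), followed by the read name and a taxon ID. The function name and the toy lines, including the placeholder taxon IDs, are invented for illustration.

```python
def classified_fraction(kaiju_lines):
    """Return the fraction of reads flagged 'C' in Kaiju-style output lines."""
    classified = 0
    total = 0
    for line in kaiju_lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        total += 1
        if fields[0] == "C":
            classified += 1
    return classified / total if total else 0.0

# Toy output (hypothetical reads and placeholder taxon IDs)
toy_output = [
    "C\tread1\t1111",
    "U\tread2\t0",
    "C\tread3\t1111",
    "C\tread4\t2222",
]
fraction = classified_fraction(toy_output)
print(fraction)  # 0.75
```

This measures sensitivity only. Judging accuracy needs reads of known origin, which a count of classified reads cannot provide.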

The other advantage of using a tailored database is that it takes up less storage space :)

## TODAY: We will be running Kaiju on a collection of epipelagic metagenomes from the Bermuda Atlantic Time Series using GORG-Tropics_v1 database
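
As a rough preview, a Kaiju run on one paired-end sample might be assembled as sketched below. The flags (`-t`, `-f`, `-i`, `-j`, `-o`, `-z`) come from Kaiju's own usage message; everything else is an assumption made for illustration: the helper name `kaiju_command`, the database and `nodes.dmp` file names, the output file name, and the `_1`/`_2` gzipped read file naming. Whether your Kaiju version reads gzipped files directly is also an assumption to check. The sketch only builds the command and does not run it.

```python
import shlex

def kaiju_command(run, nodes, database, read_dir, threads=4):
    """Build (but do not execute) a Kaiju command for one paired-end run."""
    # Assumption: reads are stored as <run>_1.fastq.gz and <run>_2.fastq.gz
    reads_1 = f"{read_dir}/{run}_1.fastq.gz"
    reads_2 = f"{read_dir}/{run}_2.fastq.gz"
    return [
        "kaiju",
        "-t", nodes,                # taxonomy nodes.dmp file
        "-f", database,             # database .fmi file
        "-i", reads_1,              # first read file
        "-j", reads_2,              # second read file of the pair
        "-o", f"{run}.kaiju.out",   # hypothetical output file name
        "-z", str(threads),         # number of parallel threads
    ]

# Hypothetical paths; substitute the real database location
cmd = kaiju_command(
    "SRR5720233", "nodes.dmp", "gorg_tropics.fmi", "data/subsampled_metagenomes"
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it once the files exist.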



In [4]:
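import pandas as pd

# load the run metadata table for PRJNA385855
df = pd.read_csv("./data/PRJNA385855_sra_metadata.csv")

# keep BATS cruise samples from 1 m and 10 m depth, ordered by collection date
is_bats = df['cruise_id'].str.contains('BATS')
is_shallow = df['Depth'].isin(['10m', '1m'])
columns = ['Run', 'Collection_date', 'cruise_id', 'BioSample', 'Depth']
mgoi = df[is_bats & is_shallow][columns].sort_values(by='Collection_date')

# save the table of metagenomes of interest to file
mgoi.to_csv("./data/bats_metagenomes_of_interest.csv", index=False)

# write one run accession per line for the download step
with open('./data/metagenomes_to_download.txt', 'w') as oh:
    for run in mgoi['Run']:
        print(run, file=oh)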

In [11]:
mgoi.head()

| | Run | Collection_date | cruise_id | BioSample | Depth |
|---|---|---|---|---|---|
| 74 | SRR5720233 | 2003-02-21 | BATS173 | SAMN07137079 | 1m |
| 14 | SRR5720238 | 2003-03-22 | BATS174 | SAMN07137082 | 1m |
| 119 | SRR5720327 | 2003-04-22 | BATS175 | SAMN07137064 | 10m |
| 99 | SRR5720283 | 2003-05-20 | BATS176 | SAMN07137103 | 1m |
| 75 | SRR5720235 | 2003-07-15 | BATS178 | SAMN07137085 | 10m |




In [16]:
%%bash

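# work inside the download directory (must already exist)
cd data/subsampled_metagenomes

# download each run accession listed in the file, one per line
while read -r p; do
    fastq-dl -a "$p" --provider sra
done < ../metagenomes_to_download.txt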

2023-08-02T15:58:46 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2023-08-02T15:58:46 fastq-dump.2.9.6 sys: mbedtls_ssl_get_verify_result returned 0x4008 (  !! The certificate is not correctly signed by the trusted CA  !! The certificate is signed with an unacceptable hash.  )
2023-08-02T15:58:46 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - ktls_handshake failed while accessing '130.14.29.110' from '172.18.0.2'
2023-08-02T15:58:46 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - Failed to create TLS stream for 'www.ncbi.nlm.nih.gov' (130.14.29.110) from '172.18.0.2'
2023-08-02T15:58:46 fastq-dump.2.9.6 err: item not found while constructing within virtual database module - the path 'SRR5720233' cannot be opened as database or table


In [10]:
%%bash
mkdir data/subsampled_metagenomes/

cd data/subsampled_metagenomes

while read p; do
  fastq-dump --split-files --skip-technical -N 0 -X 1000000 --gzip --readids "$p"
done < ../metagenomes_to_download.txt

2023-08-02T15:50:35 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -9984 ( X509 - Certificate verification failed, e.g. CRL, CA or signature check failed )
2023-08-02T15:50:35 fastq-dump.2.9.6 sys: mbedtls_ssl_get_verify_result returned 0x4008 (  !! The certificate is not correctly signed by the trusted CA  !! The certificate is signed with an unacceptable hash.  )
2023-08-02T15:50:35 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - ktls_handshake failed while accessing '130.14.29.110' from '172.18.0.2'
2023-08-02T15:50:35 fastq-dump.2.9.6 sys: connection failed while opening file within cryptographic module - Failed to create TLS stream for 'www.ncbi.nlm.nih.gov' (130.14.29.110) from '172.18.0.2'
2023-08-02T15:50:35 fastq-dump.2.9.6 err: item not found while constructing within virtual database module - the path 'SRR5720233' cannot be opened as database or table
2023-08-02T

CalledProcessError: Command 'b'mkdir data/subsampled_metagenomes/\n\ncd data/subsampled_metagenomes\n\nwhile read p; do\n  fastq-dump --split-files --skip-technical -N 0 -X 1000000 --gzip --readids "$p"\ndone < ../metagenomes_to_download.txt\n'' returned non-zero exit status 3.

/home/jovyan/metagenomics_tutorial


In [1]:
!kaiju

Error: Please specify the location of the nodes.dmp file, using the -t option.

Kaiju 1.9.2
Copyright 2015-2022 Peter Menzel, Anders Krogh
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

Usage:
   kaiju -t nodes.dmp -f kaiju_db.fmi -i reads.fastq [-j reads2.fastq]

Mandatory arguments:
   -t FILENAME   Name of nodes.dmp file
   -f FILENAME   Name of database (.fmi) file
   -i FILENAME   Name of input file containing reads in FASTA or FASTQ format

Optional arguments:
   -j FILENAME   Name of second input file for paired-end reads
   -o FILENAME   Name of output file. If not specified, output will be printed to STDOUT
   -z INT        Number of parallel threads for classification (default: 1)
   -a STRING     Run mode, either "mem"  or "greedy" (default: greedy)
   -e INT        Number of mismatches allowed in Greedy mode (default: 3)
   -m INT        Minimum match length (default: 11)
   -s INT        Minimum match score in Greedy mode (default: 65)
   -E