# Tutorial 03 – Metagenomic data processing: from reads to protein functions

## Overview

In this tutorial, we will download a small metagenome from the NCBI SRA database and take it through a complete analysis workflow: read quality control and trimming, assembly, binning, protein prediction, quantification, and functional annotation.

The work is organized into the following directory structure, with one folder per processing step:

```
- /SRR2239652/
  - /01_raw_reads/
  - /02_clean_reads/
  - /03_assembly_msp/
  - /04_binning_mw/
  - /05_binref_mw/
  - /06_MAGs_refinement1/
  - /07_metaprodigal/
  - /08_quantify/
  - /09_metabolic/
```
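
If you prefer to create the whole layout up front, the following is a minimal Python sketch. The helper name `make_layout` is hypothetical, and the folder names use the lower-case `mw` spelling that the binning command in Part III writes to. The tutorial cells below still create each folder as they go; if the folder already exists, `mkdir` will simply report it.

```python
from pathlib import Path

# Step folders of this tutorial (hypothetical helper, not part of any tool)
STEP_DIRS = [
    "01_raw_reads",
    "02_clean_reads",
    "03_assembly_msp",
    "04_binning_mw",
    "05_binref_mw",
    "06_MAGs_refinement1",
    "07_metaprodigal",
    "08_quantify",
    "09_metabolic",
]


def make_layout(root, sra="SRR2239652"):
    """Create <root>/<sra>/<step> for every step and return the folder paths."""
    base = Path(root) / sra
    created = []
    for name in STEP_DIRS:
        path = base / name
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

On Google Colab this would be called as `make_layout("/content")`.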

# Part 0. Downloading and installing the required software

Before we start, you must first **remember to start the hosted runtime in Google Colab**.

Then, we must install several pieces of software to perform this tutorial. Namely:
- **SRA Toolkit** for manipulating SRA accession IDs from the SRA database.
- **mambaforge**, a free minimal installer and re-implementation of **conda** for software package and environment management.
- **fastqc**, a tool to assess the quality of the sequencing data.
- **cutadapt**, a tool for trimming the adapter sequences and other types of unwanted sequences from the data.
- **SPAdes**, a genome and metagenome assembly toolkit containing various assembly pipelines.
- **prodigal**, a protein-coding gene prediction tool for prokaryotic genomes.
- **CoverM**, a DNA read coverage and relative abundance calculator focused on metagenomics applications.
- **MetaWRAP**, an easy-to-use metagenomic wrapper suite that accomplishes the core tasks of metagenomic analysis from start to finish.
- **metabolic**, a tool for high-throughput profiling of genomes from isolates, metagenome-assembled genomes, or single-cell genomes.



### ⚠️ **WARNING**: This installation process is particularly long and can take up to 30 min. Please start the installation well before the actual tutorial begins.

After several tests, we found the following installation steps to be the most reliable way of setting up **Google Colab** for this laboratory session.

1. Install **SRA toolkit** from pre-compiled Ubuntu libraries and test that it is correctly installed.

In [None]:
%%bash
#Downloading the latest pre-compiled version of SRA Toolkit (v3.0.0 when this tutorial was written)
wget --quiet --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -xzf sratoolkit.tar.gz

In [None]:
%%bash
#Setting up SRA Toolkit and testing that it is correctly installed
#On Google Colab, this PATH export will be required in every code cell to execute SRA Toolkit
#If a different version was downloaded, adjust the folder name to match the extracted directory
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin
#Setting up
echo "Aexyo" | vdb-config -i
#Test, which should print sequence data
fastq-dump --stdout -X 2 SRR390728
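
Because the download URL points to the current release while the PATH export names a fixed version folder, the two can drift apart. The following is a minimal Python sketch, assuming the archive was extracted under `/content` into a folder matching `sratoolkit.*-ubuntu64`. The helper names are hypothetical. Changing `os.environ` in the notebook kernel should also be visible to later `!` and `%%bash` cells, since they inherit the kernel's environment.

```python
import glob
import os


def find_sratoolkit_bin(base="/content"):
    """Return the bin/ folder of an extracted SRA Toolkit under base, or None."""
    pattern = os.path.join(base, "sratoolkit.*-ubuntu64", "bin")
    candidates = sorted(glob.glob(pattern))
    # If several versions are present, take the last one in sorted order
    return candidates[-1] if candidates else None


def add_to_path(directory):
    """Append directory to PATH for this process and its child processes."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
    os.environ["PATH"] = os.pathsep.join(parts)


sra_bin = find_sratoolkit_bin()
if sra_bin is not None:
    add_to_path(sra_bin)
```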

2. Install **mamba** using **condacolab**, which will enable the installation of **fastqc, cutadapt, SPAdes, prodigal, and CoverM**

In [None]:
#Installing Conda using Condacolab
!pip install -q condacolab
import condacolab
condacolab.install_mambaforge()

In [None]:
%%bash
#Installing fastqc, cutadapt, prodigal, coverm, and spades using mamba
mamba install -y -c bioconda fastqc cutadapt prodigal coverm spades

3. Install **MetaWRAP** on a new conda environment using **mamba**

In [None]:
%%bash
#Installing MetaWRAP in a separate conda environment that we will need to
#activate every time we want to use it
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky
mamba create -y --name mwrap-env --channel ursky metawrap-mg=1.3.2 spades

4. Configure the **CheckM** database for properly running metaWRAP

In [None]:
#Downloading the CheckM database on a user-specified folder
%cd /content/
!mkdir checkm_folder
# Now download using wget
%cd checkm_folder
!wget --quiet https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
!tar -xzf *.tar.gz
!rm *.gz
%cd ../

In [None]:
%%bash
#Now we will activate our conda environment for metaWRAP
source activate mwrap-env
#Configuring the CheckM database
echo "/content/checkm_folder/" > option
echo " " >> option
# On newer versions of CheckM, you would run:
checkm data setRoot < option
checkm data setRoot < option

5. Install **metabolic** and its metabolic profiles in a new conda environment using **mamba**

In [None]:
%%bash
#Installing metabolic in a separate conda environment that we will need to
#activate every time we want to use it
mamba create -y --name metabolic-env -c hcc metabolic

In [None]:
%%time
%%bash
#Now we will activate our conda environment for metabolic
source activate metabolic-env
#Downloading the metabolic profiles
download-metabolic-profiles.sh

# Part I - Quality control and trimming of the reads in a SRA file

1. The metagenome is downloaded directly from the NCBI SRA database and then converted with the **fastq-dump** program from **SRA toolkit**, which splits the reads into forward (\*_1.fastq) and reverse (\*_2.fastq) files.

In [None]:
#We will work on metagenomic sequencing data from SRA accession ID
#We set up this accession ID as a python variable here
SRA='SRR2239652'
#This is how we create a folder with mkdir: mkdir NAME
!mkdir {SRA}
#We now enter the new folder permanently (with magic command %) and download our SRA file using wget
%cd {SRA}
!mkdir 01_raw_reads
%cd 01_raw_reads
!wget --quiet https://sra-pub-run-odp.s3.amazonaws.com/sra/{SRA}/{SRA}

In [None]:
%%time
%%bash
#Now we will process our downloaded SRA file with fastq-dump
#On Google Colab, this PATH export will be required in every code cell to execute SRA Toolkit
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin
#We set up our SRA accession ID as a variable
SRA=SRR2239652
#Now we will process our downloaded SRA file with fastq-dump
fastq-dump --split-files $SRA

2. Read quality is assessed with the **fastqc** software before and after the trimming process.

In [None]:
%%time
%%bash
#Now we will process our FW and RV fastq files with fastqc
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin
fastqc SRR2239652_1.fastq SRR2239652_2.fastq
#Now we will check the quality control (QC) curves
#Download the fastqc-generated html files and manually inspect them

In [None]:
#@markdown ### 1. Load HTML file from fastqc
html_filename = "/content/SRR2239652/01_raw_reads/SRR2239652_1_fastqc.html" #@param {type:"string"}
#@markdown - This script assumes the HTML file is located in the current folder

print('\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')
import IPython
IPython.display.HTML(filename=html_filename)
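
Besides the HTML report, fastqc writes a zip archive next to it that contains a tab-separated `summary.txt` with a PASS/WARN/FAIL flag per module. The following is a minimal Python sketch for reading those flags without opening the report. The function name is hypothetical, and the archive name in the usage note is assumed to follow the HTML file name shown above.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: PASS/WARN/FAIL} from summary.txt inside a fastqc zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt sits in a sub-folder named after the sample
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(member).decode("utf-8")
    summary = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            summary[module] = status
    return summary
```

For example, `read_fastqc_summary("/content/SRR2239652/01_raw_reads/SRR2239652_1_fastqc.zip")` would list the modules that need attention before trimming.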

3. From the fastqc curves, we can decide which parts of the reads should be trimmed. The **cutadapt** software will trim and filter the reads with the following parameters:
  - trim low-quality bases from the 3' end of the reads using a phred quality cutoff of 28 (`-q 28`)
  - remove the first 20 and last 15 bp of the forward reads (`-u 20 -u -15`)
  - remove the first and last 15 bp of the reverse reads (`-U 15 -U -15`)
  - trim undetermined bases (N) from the read ends (`--trim-n`), then discard the reads shorter than 50 bp (`-m 50`) or with more than one N (`--max-n 1`).
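
To make the effect of the fixed-length trimming and the read filters concrete, here is a toy Python sketch applied to plain sequence strings. It is only an illustration under simplifying assumptions: it is not how cutadapt is implemented, it ignores quality trimming, and all function names are hypothetical.

```python
def hard_trim(seq, head=0, tail=0):
    """Remove `head` bases from the start and `tail` bases from the end."""
    end = len(seq) - tail
    return seq[head:end] if end > head else ""


def trim_flanking_n(seq):
    """Remove N bases from both ends of the read."""
    return seq.strip("Nn")


def keep_read(seq, min_len=50, max_n=1):
    """Keep reads with at least min_len bases and at most max_n N bases."""
    return len(seq) >= min_len and seq.upper().count("N") <= max_n


# A made-up 100 bp read, trimmed by 20 bp at the start and 15 bp at the end
read = "ACGT" * 25
trimmed = trim_flanking_n(hard_trim(read, head=20, tail=15))
print(len(trimmed), keep_read(trimmed))
```

A 100 bp read loses 35 bp in this example, so reads that start out much shorter than 85 bp would fall below the 50 bp minimum and be discarded.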

In [None]:
#We will create a new directory in which we will run the next step of the tutorial
%cd ..
%mkdir 02_clean_reads
%cd 02_clean_reads

In [None]:
%%time
%%bash
#Now, we will run cutadapt to trim the reads based on parameters derived
#from the QC curves and then run again our QC using fastqc
export PATH=$PATH:/content/sratoolkit.3.0.0-ubuntu64/bin
#We set up this accession ID as a variable
SRA=SRR2239652
#Here is an example of our cutadapt run and parameters
cutadapt -q 28 -U 15 -U -15 -u 20 -u -15 -m 50 --trim-n --max-n 1 -o t$SRA\_1.fastq -p t$SRA\_2.fastq ../01_raw_reads/$SRA\_1.fastq  ../01_raw_reads/$SRA\_2.fastq
#Running fastqc on our trimmed fastq files
fastqc t$SRA\_1.fastq t$SRA\_2.fastq

# Part II - Assembling our metagenomes using metaWRAP

1. After checking the clean reads, we will use the conda environment of the metaWRAP software package and its bundled tools.

  A metagenome assembly will be done with the software **SPAdes** in metagenomics (`--meta`) mode. Other options exist, for example `--careful` for genomes from isolates.
  It is recommended to use as many threads (here 2 CPUs, `-t 2`) and as much RAM (here 8 GB, `-m 8`) as are available for this process. If the process needs more memory than this limit, it will **stop**.
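
As a minimal Python sketch of how the thread and memory values could be chosen from the machine at hand instead of being hard-coded: the helper names are hypothetical, and the memory query relies on `os.sysconf`, which is assumed to be available (it is on Linux, including Colab).

```python
import os


def suggest_resources():
    """Return (threads, memory in whole GB) detected on this machine (Linux)."""
    threads = os.cpu_count() or 1
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return threads, int(total_bytes / 1024**3)


def build_spades_command(r1, r2, outdir, threads, mem_gb):
    """Build the argument list for a SPAdes run in metagenomics mode."""
    return [
        "spades.py", "--meta",
        "-t", str(threads),
        "-m", str(mem_gb),
        "-1", r1,
        "-2", r2,
        "-o", outdir,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a cell where the metaWRAP environment is active. Leaving some memory free for the rest of the system is usually safer than passing the full detected amount.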

### ⚠️ **CAUTION**: The following script takes about 1 h 30 min on Google Colab due to its low number of CPUs (only 2). To continue with the tutorial, we will download a pre-assembled file instead.

In [1]:
#We will now go back to the folder of the SRA and create a new one for our next steps
%cd /content/{SRA}
%mkdir 03_assembly_msp
%cd 03_assembly_msp


In [None]:
#Downloading the pre-assembled files from a previous SPAdes run
!wget https://raw.githubusercontent.com/pb3lab/workshops/main/backups/saocarlos2022/contigs.fasta
!wget https://raw.githubusercontent.com/pb3lab/workshops/main/backups/saocarlos2022/spades.log



```
#COPY THIS SCRIPT INTO A CODE CELL ONLY IF YOU HAVE 2 H OF FREE TIME TO PATIENTLY WORK ON THIS
%%time
%%bash
#Now we will activate our conda environment for metaWRAP
source activate mwrap-env
#We set up this accession ID as a variable
SRA=SRR2239652
#We will assemble the trimmed reads using SPAdes with 2 threads and 8 GB RAM
#This is sub-optimal, due to Google Colab limitations
spades.py --meta -t 2 -m 8 -1 ../02_clean_reads/t$SRA\_1.fastq -2 ../02_clean_reads/t$SRA\_2.fastq -o .
```



2. The assembly statistics can be inspected with the **quast** software, which reports the main metrics of the obtained contigs, such as their number and sizes.
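
To illustrate what one of these metrics means, here is a minimal Python sketch that reads contig lengths from a FASTA file and computes N50, the length of the contig at which the cumulative size of the contigs, sorted from longest to shortest, reaches half of the total assembly size. The function names are hypothetical. Values may differ from the quast report, which by default only counts contigs above a minimum length.

```python
def contig_lengths(fasta_path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total size is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, `n50(contig_lengths("contigs.fasta"))` gives a quick check on the downloaded assembly.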

In [None]:
%%time
%%bash
#Now we will activate our conda environment for metaWRAP
source activate mwrap-env
#We set up this accession ID as a variable
SRA=SRR2239652
#Analyze using quast
quast contigs.fasta

In [None]:
#Enter quast folder
%cd quast_results
!cp results*/report.html .

3. Now we will look at the report file from our quast run

In [None]:
#@markdown ### Load HTML file from quast
html_filename = "report.html" #@param {type:"string"}
#@markdown - This script assumes the HTML file is located in the current folder

print('\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')
import IPython
IPython.display.HTML(filename=html_filename)

# Part III - Metagenome binning using metaWRAP

1. The binning process is done in two steps. First, binning is performed independently with three different tools (`--metabat2 --maxbin2 --concoct`), using the assembly (`-a`) and the clean reads. The second step consolidates the results from the three binning tools and retains only the MAGs with > 50 % completeness and < 10 % contamination (`-c 50 -x 10`).
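
The completeness and contamination cutoffs act as a simple filter over the bins. The following Python sketch shows that filter on a toy table. The bin names and percentages are made up for illustration, and the function name is hypothetical; this is not metaWRAP's own code.

```python
def passes_thresholds(completeness, contamination,
                      min_completeness=50.0, max_contamination=10.0):
    """True if a bin is complete enough and clean enough to be retained."""
    return completeness > min_completeness and contamination < max_contamination


# Made-up example values (percent), not results from this dataset
bins = [
    {"bin": "bin.1", "completeness": 92.3, "contamination": 1.8},
    {"bin": "bin.2", "completeness": 48.0, "contamination": 0.5},
    {"bin": "bin.3", "completeness": 75.1, "contamination": 14.2},
]

retained = [b["bin"] for b in bins
            if passes_thresholds(b["completeness"], b["contamination"])]
print(retained)
```

In this toy table only `bin.1` is retained: `bin.2` is too incomplete and `bin.3` is too contaminated.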

In [None]:
#Go back to original folder of the SRA
%cd /content/{SRA}

In [None]:
%%time
%%bash
#Now we will activate our conda environment for metaWRAP
source activate mwrap-env
#We set up this accession ID as a variable
SRA=SRR2239652
#We will perform binning using metaWRAP with 2 threads
#This is sub-optimal, due to Google Colab limitations
metawrap binning -o 04_binning_mw -t 2 -a ./03_assembly_msp/contigs.fasta --metabat2 --maxbin2 --concoct ./02_clean_reads/t$SRA\_1.fastq ./02_clean_reads/t$SRA\_2.fastq

2. Then, the final MAGs are copied to a dedicated folder, where the files are also renamed automatically by the commands in the next cells. We recommend further refining the MAGs with other software such as **RefineM** or **MAGpurify** to remove contigs with inconsistent sequence or taxonomic properties.

In [None]:
#Go back to original folder of the SRA
%cd /content/{SRA}
!mkdir 06_MAGs_refinement1

In [None]:
!cp 04_binning_mw/metabat2_bins/* 06_MAGs_refinement1/
%cd 06_MAGs_refinement1/
!for f in bin*; do mv "$f" "${f/bin/MAG_{SRA}_}";done
!for f in *.fa; do mv "$f" "${f/.fa/.fasta}";done
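
The copy and the two renaming loops mix shell substitutions with notebook variable expansion, which can be fragile. As an alternative, here is a minimal pure-Python sketch that aims for the same file names (first `bin` replaced by `MAG_<SRA>_`, and `.fa` replaced by `.fasta`). The function name is hypothetical, and unlike the `cp` above it only copies files matching `bin*.fa`.

```python
import shutil
from pathlib import Path


def copy_bins_as_mags(bins_dir, mags_dir, sra):
    """Copy bin*.fa files into mags_dir under their MAG_<sra>_ names."""
    out_dir = Path(mags_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(bins_dir).glob("bin*.fa")):
        # Same substitutions as the shell loops above
        new_name = src.name.replace("bin", f"MAG_{sra}_", 1)
        new_name = new_name[: -len(".fa")] + ".fasta"
        shutil.copy(src, out_dir / new_name)
        copied.append(new_name)
    return copied
```

From `/content/SRR2239652`, this would be called as `copy_bins_as_mags("04_binning_mw/metabat2_bins", "06_MAGs_refinement1", SRA)`.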

# Part IV - Predicting and quantifying proteins from metagenomes

1. From the contigs (or from the MAGs), protein-coding genes are predicted with the **Prodigal** software. Its outputs are a file in GenBank-like format (\*.gbk), the amino acid sequences with the `-a` option (\*.faa), and the nucleotide sequences of the coding regions with the `-d` option (\*.ffn).
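
Once the \*.faa or \*.ffn file exists, each FASTA header written by Prodigal carries the gene coordinates separated by ` # `, followed by a list of `key=value` attributes. The following is a minimal Python sketch for extracting them; the function name is hypothetical and the example header is made up to follow that layout.

```python
def parse_prodigal_header(header):
    """Split a Prodigal FASTA header into gene ID, coordinates and attributes."""
    fields = [f.strip() for f in header.lstrip(">").split(" # ")]
    attributes = dict(
        item.split("=", 1) for item in fields[4].split(";") if "=" in item
    )
    return {
        "gene": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "strand": int(fields[3]),  # 1 = forward, -1 = reverse
        "partial": attributes.get("partial"),
    }


# Made-up header in the Prodigal layout
example = ">NODE_1_length_1000_cov_5.0_1 # 2 # 181 # -1 # ID=1_1;partial=10;start_type=Edge"
gene = parse_prodigal_header(example)
print(gene["gene"], gene["end"] - gene["start"] + 1)
```

Gene lengths obtained this way are what a length-normalized metric such as TPM needs for each coding sequence.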

In [None]:
#Go back to original folder of the SRA and create new folder
%cd /content/{SRA}
!mkdir 07_metaprodigal
%cd 07_metaprodigal

In [None]:
%%time
!prodigal -i ../03_assembly_msp/contigs.fasta -p meta -o {SRA}.gbk -a {SRA}.faa -d {SRA}.ffn

2. To quantify the number of reads that map to each MAG, the software **CoverM** will be used. In genome mode, the MAGs are provided as a folder (`-d`), the reads are mapped with **bwa-mem** (`-p bwa-mem`), and the alignments are filtered to keep those with at least 95 % identity and at least 80 % of the read aligned (`--min-read-percent-identity 95 --min-read-aligned-percent 80`). The software can calculate several metrics, such as **relative abundance** for MAGs (`-m relative_abundance`) or **transcripts per million** (TPM; `-m tpm`) for coding sequences.
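
As a worked example of the TPM metric, here is a minimal Python sketch of the standard definition: read counts are first divided by the sequence length, and the resulting rates are scaled so that they sum to one million. The function name and the toy numbers are illustrative assumptions, and CoverM's own calculation may differ in details such as how reads are counted.

```python
def tpm(read_counts, lengths):
    """Transcripts per million from per-sequence read counts and lengths (bp)."""
    rates = [count / length for count, length in zip(read_counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


# Two made-up genes with the same read count but different lengths
values = tpm([10, 10], [1000, 2000])
print(values)
```

With equal read counts, the gene that is half as long receives twice the TPM, which is the point of the length normalization.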

In [None]:
#Go back to original folder of the SRA and create new folder
%cd /content/{SRA}
!mkdir 08_quantify
%cd 08_quantify

In [None]:
%%time
!coverm genome -1 ../02_clean_reads/t{SRA}_1.fastq -2 ../02_clean_reads/t{SRA}_2.fastq -d ../06_MAGs_refinement1/ -x fasta -p bwa-mem --min-read-percent-identity 95 --min-read-aligned-percent 80 -m relative_abundance --output-file {SRA}_mags_rel_abund --bam-file-cache-directory bam_95_80 -t 2
!coverm contig -1 ../02_clean_reads/t{SRA}_1.fastq -2 ../02_clean_reads/t{SRA}_2.fastq -r  ../07_metaprodigal/{SRA}.ffn -p bwa-mem -m tpm --min-read-percent-identity 95 --min-read-aligned-percent 50 -o {SRA}_all_prot_tpm

3. The METABOLIC-G software can identify 160 KOFAM profiles related to the main metabolic pathways and biogeochemical cycles, as well as CAZymes and other functions. This can be done on the predicted proteins (`-in`) or directly on the assembly or the MAGs (`-in-gn`).

  The METABOLIC-C version can additionally determine the taxonomy, abundance, and biogeochemical cycles of the metagenomes when the reads are also provided and the GTDB-Tk software is used.

### ⚠️ **CAUTION**: Beware that running metabolic might take a loooong time.

In [None]:
#Go back to original folder of the SRA and create new folder
%cd /content/{SRA}
!mkdir 09_metabolic
%cd 09_metabolic

In [None]:
%%time
%%bash
#Now we will activate our conda environment for metabolic
source activate metabolic-env 
#Running metabolic
METABOLIC-G.pl -in ../07_metaprodigal/ -o metabolic_all_proteins
METABOLIC-G.pl -in-gn ../06_MAGs_refinement1/ -o metabolic_mags

4. With all of this, the MAGs, the proteins, and their functions, as well as their abundances, can be characterized from a metagenome. This information is useful for determining the possible roles of microorganisms and their proteins in the environment. Moreover, these insights are the basis for studying these proteins in silico and in vitro in the laboratory to determine and optimize their biotechnological potential.

# Part V - Backing up your files

1. This tutorial generates very large files, so you should transfer them directly to your Google Drive as shown below.

In [None]:
#Compressing all files into a .tar.gz file
%cd /content/
!tar -czf D1-tutorial-02.tar.gz *

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os
import shutil
from pathlib import Path 
backup = Path("/content/drive/MyDrive/saocarlos2022/")
if os.path.exists(backup):
  print("Sao Carlos Workshop 2022 - Backup folder already exists")
if not os.path.exists(backup):
  os.mkdir(backup)
  print("Sao Carlos Workshop 2022 - Backup folder did not exist and was successfully created")

#Backing up
shutil.copy(str('/content/D1-tutorial-02.tar.gz'), str(backup/'D1-tutorial-02.tar.gz'))
print("Day 1 - Tutorial 2 files successfully backed up!")
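
Since the archive is large, it can be worth confirming that the copy on Google Drive is complete. The following is a minimal Python sketch that compares SHA-256 checksums of the two files; the function names are hypothetical.

```python
import hashlib


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical content."""
    return sha256sum(path_a) == sha256sum(path_b)
```

For example, `same_content('/content/D1-tutorial-02.tar.gz', str(backup/'D1-tutorial-02.tar.gz'))` should return `True` after a successful backup.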