# Quality Check: Assemblies

Now that we have assembled and binned all of the samples using both megahit and metaspades, we are ready to check our work and compare the assemblies. This notebook works through two different quality reports: we'll use Quast and CheckM2 to compare and contrast our assemblies.

Step 0: Checking to make sure you have your assemblies

Step 1: Running Quast on the megahit and metaspades assemblies

Step 2: Running CheckM2 on the megahit and metaspades bins

## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "YOUR_NETID"
xfile = "YOUR_XFILE"

In [None]:
# set directories
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/10_assembly_qc"
megahit_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_metag_binning/out_megahit"
metaspades_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_metag_binning/out_spades"

In [None]:
# Go into the working directory
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs the variables, it will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
# notice that the assembly directories point to the output of the previous step (09_metag_binning)
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export MEGAHIT_DIR=$megahit_dir" >> config.sh
!echo "export METASPADES_DIR=$metaspades_dir" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
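
If you would rather check the config file programmatically than by eye, the sketch below parses the `export NAME=value` lines and reports any variable that is missing or empty. The helper names `parse_config` and `missing_settings` are hypothetical (they are not part of the course scripts), and the parser assumes the simple one-variable-per-line layout written by the cell above.

```python
# Hypothetical helpers for sanity-checking config.sh; not part of the course scripts.
REQUIRED = ["NETID", "XFILE", "WORK_DIR", "XFILE_DIR", "MEGAHIT_DIR", "METASPADES_DIR"]


def parse_config(text):
    """Return {NAME: value} for every `export NAME=value` line in text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines and anything that is not an export
        name, _, value = line[len("export "):].partition("=")
        settings[name] = value
    return settings


def missing_settings(settings, required):
    """List the required names that are absent or have an empty value."""
    return [name for name in required if not settings.get(name)]
```

For example, `missing_settings(parse_config(open("config.sh").read()), REQUIRED)` should return an empty list when every variable has been set.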

## Step 0:  Checking the Metagenome Assembled Genomes (MAGs)

All of your metagenomes should have a combined contigs file from the previous step (09_metag_binning), where all of the contigs in the file are named based on the bins that they were put into. If you see that these files are missing, this is a clue that you need to go back and check the last step. Let's see if we have the right files for megahit and metaspades.

In [None]:
# Check that we have megahit contigs after the binning step
import os
xlist = xfile_dir + '/' + xfile
with open(xlist) as handle:
    samples = handle.read().splitlines()
for sample in samples:
    command = 'ls ' + megahit_dir + '/' + sample + '.all_contigs.fna'
    os.system(command)

In [None]:
# Check that we have metaspades contigs after the binning step
import os
xlist = xfile_dir + '/' + xfile
with open(xlist) as handle:
    samples = handle.read().splitlines()
for sample in samples:
    command = 'ls ' + metaspades_dir + '/' + sample + '.all_contigs.fna'
    os.system(command)

Great! It looks like we have all of our contig files for megahit and metaspades. If any are missing, check the Slurm logs from the previous step to find and fix the problem.
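
With many samples, a missing file is easy to overlook in the `ls` output. As a minimal sketch, the hypothetical helper `missing_contigs` below returns only the contig files that are absent or empty, assuming the same `<sample>.all_contigs.fna` naming used in the cells above.

```python
import os


def missing_contigs(samples, contig_dir, suffix=".all_contigs.fna"):
    """Return the expected contig paths that are missing or have zero size."""
    missing = []
    for sample in samples:
        path = os.path.join(contig_dir, sample + suffix)
        # A file that exists but is empty usually means the earlier job failed.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(path)
    return missing
```

Calling it once with `megahit_dir` and once with `metaspades_dir` should give two empty lists when everything from the binning step is in place.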

## Step 1: Quast

How good are our assemblies? We can check the quality by running tools that look at the contigs produced by our assembly algorithms. 

Let's look at the quality of our assemblies for both megahit and metaspades using a bioinformatics tool called Quast. We can run this tool on multiple assemblies at once, which makes it easy to compare them side by side.
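
One of the statistics Quast reports is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below is an illustration of that definition, not Quast's own code. The function name `n50` and the `min_length` parameter are assumptions made for this example; `min_length` mirrors the idea of the `-m 500` threshold in the script, which excludes shorter contigs.

```python
def n50(contig_lengths, min_length=0):
    """Return the N50 of the contigs that are at least min_length long."""
    lengths = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0  # no contigs passed the length filter
```

For contigs of length 2, 3, 4, 5 and 6, the total is 20; the two longest contigs (6 and 5) already cover 11 bases, so `n50([2, 3, 4, 5, 6])` is 5.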

In [None]:
# Create a script to run Quast on each of our contig files
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. Quast runs on the contigs files in the MEGAHIT_DIR and METASPADES_DIR
# 3. The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=12:00:00   
#SBATCH --partition=standard
#SBATCH --account=bhurwitz
#SBATCH --array=0-4                         
#SBATCH --output=Job-quast-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create output directories for the reports
### note that we are going to compare both assemblies at once
OUTDIR=${WORK_DIR}/out_quast

### create the outdir if it does not exist
if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Directory created"
  mkdir -p "$OUTDIR"
fi

### Contigs to use post-binning
MEGAHIT_CONTIGS=${MEGAHIT_DIR}/${SAMPLE_ID}.all_contigs.fna
METASPADES_CONTIGS=${METASPADES_DIR}/${SAMPLE_ID}.all_contigs.fna

### Run Quast
apptainer run /contrib/singularity/shared/bhurwitz/quast:5.2.0--py39pl5321h4e691d4_3.sif quast -t 24 \
        -o $OUTDIR/${SAMPLE_ID} \
        -m 500 \
        $MEGAHIT_CONTIGS $METASPADES_CONTIGS
'''

with open('quast_parallel.sh', mode='w') as file:
    file.write(my_code)
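
The script above hard-codes `--array=0-4`, which only matches a sample list with five entries. As a hedged sketch, the hypothetical helpers below read the sample list the same way the script does (splitting on whitespace) and compute the matching array range, so you can confirm the range before submitting.

```python
def read_sample_ids(path):
    """Read whitespace-separated sample IDs, like names=($(cat file)) in bash."""
    with open(path) as handle:
        return handle.read().split()


def slurm_array_range(sample_ids):
    """Return the zero-based value to use for `#SBATCH --array=`."""
    count = len(sample_ids)
    if count == 0:
        raise ValueError("no sample IDs found")
    return f"0-{count - 1}"
```

For example, `slurm_array_range(read_sample_ids(xfile_dir + "/" + xfile))` returns `0-4` for a list of five samples, matching the range in the script.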

In [None]:
# Let's run the sbatch script, this should take 10-15 minutes to run once the job starts
# You can go on to Step 2 in the meantime.
# The quality reports for quast and checkm can run at the same time
!sbatch ./quast_parallel.sh

In [None]:
# Welcome back, let's see if the job is still running
!squeue --user=$netid

#### Let's check out the assembly stats from QUAST

You should see assembly statistics for both the megahit and metaspades assemblies. Do the assemblies look similar?

In [None]:
%cd $work_dir/out_quast
!cat */report.txt #cats all reports
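
Quast also writes a tab-separated `report.tsv` next to each `report.txt`, which is easier to work with in code. The sketch below assumes that layout (a header row starting with `Assembly`, then one metric per row with one column per assembly); `parse_quast_tsv` is a hypothetical helper name.

```python
import csv
import io


def parse_quast_tsv(text):
    """Turn the text of a Quast report.tsv into {assembly: {metric: value}}."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return {}
    assemblies = rows[0][1:]  # the first header cell is the label "Assembly"
    table = {name: {} for name in assemblies}
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            table[name][metric] = value  # values are kept as strings
    return table
```

After reading one sample's `report.tsv` into a string, `parse_quast_tsv(text)` lets you look up a single metric for each assembly instead of scanning the full report.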

## Step 2: Checkm2

CheckM2 is another quality tool. Rather than summarizing the contigs as a whole, it reports on each genome bin, estimating how complete and how contaminated the bin is.

The documentation can be found [here](https://github.com/chklovski/CheckM2).

### Checkm2 database file

This tool requires a database file to run. More information on downloading the database can be found in the documentation. The current database has been downloaded and saved in the following location:

/groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd

In [None]:
# Create a script to run CheckM2 on each of our bins
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. CheckM2 runs on the bin files in the MEGAHIT_DIR and METASPADES_DIR
# 3. The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bhurwitz
#SBATCH --array=0-4                         
#SBATCH --output=Job-checkm-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create output directories for each of the reports
MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit_checkm
METASPADES_OUTDIR=${WORK_DIR}/out_spades_checkm

### create the outdirs if they do not exist
if [[ ! -d "$MEGAHIT_OUTDIR" ]]; then
  echo "$MEGAHIT_OUTDIR does not exist. Directory created"
  mkdir -p "$MEGAHIT_OUTDIR"
fi

if [[ ! -d "$METASPADES_OUTDIR" ]]; then
  echo "$METASPADES_OUTDIR does not exist. Directory created"
  mkdir -p "$METASPADES_OUTDIR"
fi

MEGAHIT_CONTIGS="${MEGAHIT_DIR}/${SAMPLE_ID}/out_concoct/fasta_bins"
METASPADES_CONTIGS="${METASPADES_DIR}/${SAMPLE_ID}/out_concoct/fasta_bins"

### Run CheckM2 on the megahit bins
apptainer run /contrib/singularity/shared/bhurwitz/checkm2:1.0.1--pyh7cba7a3_0.sif checkm2 \
        predict --threads 24 \
        --input $MEGAHIT_CONTIGS \
        -x fa \
        --output-directory $MEGAHIT_OUTDIR/${SAMPLE_ID} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd
        
### Run CheckM2 on the metaspades bins
apptainer run /contrib/singularity/shared/bhurwitz/checkm2:1.0.1--pyh7cba7a3_0.sif checkm2 \
        predict --threads 24 \
        --input $METASPADES_CONTIGS \
        -x fa \
        --output-directory $METASPADES_OUTDIR/${SAMPLE_ID} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd    
'''

with open('checkm_parallel.sh', mode='w') as file:
    file.write(my_code)
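
The script tells CheckM2 to look for files ending in `.fa` (the `-x fa` option) inside each sample's `out_concoct/fasta_bins` directory. Before submitting, it can help to confirm that those directories actually contain bins. The sketch below uses a hypothetical helper, `bins_per_sample`, and assumes the directory layout used in the script.

```python
import glob
import os


def bins_per_sample(samples, assembly_dir, extension="fa"):
    """Count the bin FASTA files under <assembly_dir>/<sample>/out_concoct/fasta_bins."""
    counts = {}
    for sample in samples:
        pattern = os.path.join(
            assembly_dir, sample, "out_concoct", "fasta_bins", "*." + extension
        )
        counts[sample] = len(glob.glob(pattern))
    return counts
```

A sample with a count of zero suggests that binning did not finish for that sample, or that the bins use a different file extension.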

In [None]:
# Let's run the sbatch script, this should take ~1 hour to run
!sbatch ./checkm_parallel.sh

In [None]:
# Welcome back, let's see if the job is still running
!squeue --user=$netid

#### Let's check out the assembly stats from Checkm2

In [None]:
%cd $work_dir/out_megahit_checkm
!cat */quality_report.tsv #cats all reports

In [None]:
%cd $work_dir/out_spades_checkm
!cat */quality_report.tsv #cats all reports
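
To pick out the better bins without reading every row, you can load a `quality_report.tsv` and filter it. This is a minimal sketch that assumes the report has `Name`, `Completeness` and `Contamination` columns. The helper names and the default cutoffs (at least 90% complete, at most 5% contaminated) are illustrative choices for this example, not values set by this notebook.

```python
import csv


def load_quality_report(path):
    """Read a tab-separated quality report into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_bins(rows, min_completeness=90.0, max_contamination=5.0):
    """Return the names of bins that pass both cutoffs (illustrative defaults)."""
    kept = []
    for row in rows:
        complete = float(row["Completeness"])
        contaminated = float(row["Contamination"])
        if complete >= min_completeness and contaminated <= max_contamination:
            kept.append(row["Name"])
    return kept
```

For one sample, `filter_bins(load_quality_report(path))` with `path` set to that sample's `quality_report.tsv` lists the bins that pass. Running it on the megahit and metaspades reports gives a quick way to compare how many bins each assembler produced at a given quality level.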

## Final Step
Copy your notebook to your working directory (`$work_dir`).

In [None]:
cp ~/10_assembly_qc.ipynb $work_dir