# Quality Check: Assemblies


Step 0: Checking to make sure you have your assemblies

Step 1: Running Quast on the assembly

Step 2: Running CheckM2 on the assembly

## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "YOUR_NETID"
xfile = "list"

In [None]:
# set directories
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/01_getting_data"
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/04_assembly_qc"
unicycler_dir = work_dir + "/out_unicycler"

In [None]:
# Go into the working directory
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs these variables, it will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
# notice that UNICYCLER_DIR points at the assemblies we are going to check
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export UNICYCLER_DIR=$unicycler_dir" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
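As an optional extra check, the sketch below parses `export NAME=value` lines and reports any variable that is missing or empty. The helper names `parse_config` and `missing_variables` are hypothetical (they are not part of the notebook), and the example text only stands in for the real `config.sh`.

```python
def parse_config(text):
    """Return {NAME: value} for each 'export NAME=value' line in text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines, comments, and anything else
        name, _, value = line[len("export "):].partition("=")
        settings[name] = value
    return settings


def missing_variables(settings, required):
    """List the required variable names that are absent or empty."""
    return [name for name in required if not settings.get(name)]


# Hypothetical example content standing in for the real config.sh
example = "export NETID=YOUR_NETID\nexport XFILE=list\nexport WORK_DIR=\n"
settings = parse_config(example)
print(missing_variables(settings, ["NETID", "XFILE", "WORK_DIR", "XFILE_DIR"]))
```

To check the real file, something like `parse_config(open("config.sh").read())` could be passed to `missing_variables` along with the five names written above.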

## Step 0: Checking the contig count


In [None]:
# Check that we have contigs for each Unicycler assembly
import os
xlist = xfile_dir + '/' + xfile
with open(xlist) as handle:
    lines = handle.read().splitlines()
for sample in lines:
    # count the FASTA header lines (one per contig) in this sample's assembly
    command = 'grep -c "^>" ' + unicycler_dir + '/' + sample + '/assembly.fasta'
    os.system(command)
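The same check can be sketched in pure Python, which makes a missing assembly easy to spot because the count comes back as `None` instead of a shell error. The function names `count_contigs` and `report_contig_counts` are hypothetical helpers, and the layout `<assembly_dir>/<sample>/assembly.fasta` is taken from the cell above.

```python
import os


def count_contigs(fasta_path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def report_contig_counts(assembly_dir, sample_ids):
    """Map each sample ID to its contig count, or None if assembly.fasta is missing."""
    counts = {}
    for sample_id in sample_ids:
        fasta = os.path.join(assembly_dir, sample_id, "assembly.fasta")
        counts[sample_id] = count_contigs(fasta) if os.path.isfile(fasta) else None
    return counts
```

In this notebook, a call such as `report_contig_counts(unicycler_dir, lines)` would return one entry per sample in the list file.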

Great! Looks like we have contigs for each of our Unicycler assemblies (if not, check the slurm logs to fix).

## Step 1: Quast

How good are our assemblies? We can check the quality by running tools that look at the contigs produced by our assembly algorithms. 

Let's check the quality of our assemblies using a bioinformatics tool called Quast. We will run it as an array job, with one task per assembly.

In [None]:
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=12:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-15                         
#SBATCH --output=04A_quast-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create output directory for the reports
### each array task writes its report to its own subdirectory
OUTDIR=${WORK_DIR}/out_quast

### create the outdir if it does not exist
### (-p avoids an error if another array task creates it first)
if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Directory created"
  mkdir -p $OUTDIR
fi

### Contigs to use
CONTIGS=$UNICYCLER_DIR/${SAMPLE_ID}/assembly.fasta

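### Run Quast
### NOTE: ${QUAST} must already be set in your environment to the Quast container image; config.sh does not define it
apptainer run ${QUAST} quast -t 24 \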
        -o $OUTDIR/${SAMPLE_ID} \
        -m 500 \
        $CONTIGS
'''

with open('04A_quast.sh', mode='w') as file:
    file.write(my_code)
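The `--array=0-15` line in the script assumes the list file holds exactly 16 sample IDs, because each task uses its zero-based index to pick one ID. A minimal sketch for deriving the range from the list itself is shown here; `read_sample_ids` and `array_spec` are hypothetical helper names, and the sketch assumes one sample ID per line.

```python
def read_sample_ids(list_path):
    """Read sample IDs, one per line, skipping blank lines."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def array_spec(sample_ids):
    """Return the zero-based Slurm --array range that covers every sample ID."""
    if not sample_ids:
        raise ValueError("the sample list is empty")
    return "0-" + str(len(sample_ids) - 1)
```

For example, `array_spec(read_sample_ids(xfile_dir + '/' + xfile))` gives the value to compare against the `#SBATCH --array` line before submitting.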

In [None]:
# Let's run the sbatch script, this should take 10-15 minutes to run once the job starts
# You can go on to Step 2 in the meantime.
# The quality reports for quast and checkm can run at the same time
!sbatch ./04A_quast.sh

In [None]:
# Check if quast is running
!squeue --user=$netid
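Once a Quast task has finished, its results can be read back into the notebook. The sketch below assumes Quast wrote its usual tab-separated `report.tsv` into the sample's output directory, with the metric name in the first column and the value in the second; `read_quast_report` is a hypothetical helper name.

```python
import csv


def read_quast_report(report_path):
    """Return {metric: value} from a tab-separated Quast report with one assembly column."""
    metrics = {}
    with open(report_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:  # ignore empty or malformed lines
                metrics[row[0]] = row[1]
    return metrics
```

A call such as `read_quast_report(work_dir + '/out_quast/' + sample + '/report.tsv')` would return the metrics for one sample as strings, ready to print or compare across samples.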

## Step 2: CheckM2

CheckM2 is another tool that allows you to produce a quality report on the assembled contigs.

The documentation can be found [here](https://github.com/chklovski/CheckM2).

### CheckM2 database file

This tool requires a database file to run. More information on downloading the database can be found in the documentation. The current database has been downloaded and saved in the following location:

/groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd

In [None]:
# Create a script to run CheckM2 on the contigs
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-15                      
#SBATCH --output=04B_checkm-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create output directory for the report
OUTDIR=${WORK_DIR}/out_checkm

### create the outdir if it does not exist
### (-p avoids an error if another array task creates it first)
if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Directory created"
  mkdir -p $OUTDIR
fi

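### Run checkm2 on this sample's assembly
### NOTE: ${CHECKM} must already be set in your environment to the CheckM2 container image; config.sh does not define it
apptainer run ${CHECKM} checkm2 \
        predict --threads 24 \
        --input $UNICYCLER_DIR/${SAMPLE_ID} \
        -x fasta \
        --output-directory $OUTDIR/${SAMPLE_ID} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd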
'''

with open('04B_checkm.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Let's run the sbatch script, this should take ~1 hour to run
!sbatch ./04B_checkm.sh

In [None]:
# Welcome back, let's see if the job is still running
!squeue --user=$netid

#### Let's check out the assembly stats from CheckM2

In [None]:
%cd $work_dir/out_checkm
!cat */quality_report.tsv #cats all reports
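Because `cat` repeats the header line once per report, a combined table can be easier to read. The sketch below merges every `<sample>/quality_report.tsv` into a list of row dictionaries; `combine_reports` is a hypothetical helper, the `sample_id` key is added by this sketch rather than by CheckM2, and the column names are simply whatever appears in each report's header line.

```python
import csv
import glob
import os


def combine_reports(checkm_dir):
    """Merge every <sample>/quality_report.tsv under checkm_dir into a list of row dicts."""
    rows = []
    pattern = os.path.join(checkm_dir, "*", "quality_report.tsv")
    for path in sorted(glob.glob(pattern)):
        sample_id = os.path.basename(os.path.dirname(path))
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["sample_id"] = sample_id  # added here, not a CheckM2 column
                rows.append(row)
    return rows
```

For instance, `combine_reports(work_dir + '/out_checkm')` returns one dictionary per assembly, which can be printed in a loop or handed to a table library.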
