# Quality Check: Assemblies


Step 0: Checking to make sure you have your assemblies

Step 1: Running Quast on the assembly

Step 2: Running CheckM2 on the assembly

## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [6]:
# set the variables for your netid and xfile
id = "MY_ID"
accessions = "MY_ACCESSIONS"

In [18]:
# set directories
xfile_dir = "/my_dir_path/" + id + "/00_getting_data"
work_dir = "/my_dir_path/" + id + "/04_assembly_qc"
data_dir = "/my_dir_path/" + id + "/00_getting_data"
unicycler_dir = "/my_dir_path/" + id + "/03_assembly" + "/out_unicycler"

In [None]:
# Go into the working directory
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs these variables, it will "source" the config file to set them.

In [7]:
# create a config file with all of the variables you need
# notice that we point to the Unicycler assemblies and to the Quast and CheckM2 containers
!echo "export ID=$id" > config.sh
!echo "export ACCESSIONS=$accessions" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export DATA_DIR=$data_dir" >> config.sh
!echo "export UNICYCLER_DIR=$unicycler_dir" >> config.sh
!echo "export QUAST=/path_to_containers/quast:5.2.0--py39pl5321h4e691d4_3.sif" >> config.sh
!echo "export CHECKM=/path_to_containers/checkm2:1.0.1--pyh7cba7a3_0.sif" >> config.sh

In [None]:
# check the config file to be sure it is correct
!cat config.sh
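
Reading the file by eye works, but a misspelled variable name is easy to miss. The sketch below is one possible way to check that every name the scripts expect is actually exported. The helper names `parse_config` and `missing_variables` and the `REQUIRED` list are illustrative assumptions, not part of the original notebook.

```python
import re

def parse_config(text):
    """Return {NAME: value} for every 'export NAME=value' line in text."""
    pattern = re.compile(r'^export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$')
    found = {}
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found

def missing_variables(text, required):
    """List the required variable names that are absent or empty."""
    found = parse_config(text)
    return [name for name in required if not found.get(name)]

# Variable names written to config.sh in this notebook (assumed list)
REQUIRED = ["ID", "ACCESSIONS", "WORK_DIR", "DATA_DIR",
            "UNICYCLER_DIR", "QUAST", "CHECKM"]

# Example with an in-memory config: a misspelled name is reported as missing
example = "export ID=MY_ID\nexport ACESSIONS=MY_ACCESSIONS\n"
print(missing_variables(example, ["ID", "ACCESSIONS"]))
```

In the notebook, something like `missing_variables(open("config.sh").read(), REQUIRED)` should return an empty list when the config file is complete.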

## Step 0: Checking the contig count


In [None]:
# Check that we have contigs after the assembly
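import os
accession_file = data_dir + '/' + accessions
print(accession_file)
with open(accession_file) as handle:
    samples = handle.read().splitlines()
for sample in samples:
    # grep -c prints the number of header lines (one ">" per contig)
    command = 'grep -c ">" ' + unicycler_dir + '/' + sample + '/assembly.fasta'
    os.system(command)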

Great! Looks like we have all of our contig files from Unicycler (if not, check the job logs from the assembly step to find out what went wrong).
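
If you would rather keep the counts in a Python variable and flag missing assemblies explicitly, a pure-Python version is one option. This is a minimal sketch that assumes the same layout as above (one `assembly.fasta` per sample directory); the function names are hypothetical.

```python
import os

def count_contigs(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def contig_counts(assembly_dir, sample_ids):
    """Map each sample ID to its contig count, or None if assembly.fasta is missing."""
    counts = {}
    for sample in sample_ids:
        fasta = os.path.join(assembly_dir, sample, "assembly.fasta")
        counts[sample] = count_contigs(fasta) if os.path.isfile(fasta) else None
    return counts
```

Calling `contig_counts(unicycler_dir, samples)` with the list of accession IDs returns a dictionary in which any sample mapped to `None` has no assembly file and needs a closer look.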

## Step 1: Quast

How good are our assemblies? We can check the quality by running tools that look at the contigs produced by our assembly algorithms. 

Let's look at the quality of our assemblies from Unicycler using a bioinformatics tool called Quast. We can run this tool on multiple assemblies at once.

In [3]:
my_code = '''#!/bin/bash
#BSUB -J 04A_quast-lsf[1-15]%15     # job name, with array number to run in parallel
#BSUB -n 24                         # number of CPUs required per task
#BSUB -q shared_memory              # the queue to run on
#BSUB -R "span[hosts=1]"            # number of hosts to spread the jobs across, 1 host used here
#BSUB -R "rusage[mem=140GB]"        # required total memory for the job 
#BSUB -o "./output.%J_%I.log"       # standard output file (%J is job ID, %I is the array number)
#BSUB -e "./error.%J_%I.log"        # standard error file (%J is job ID, %I is the array number)
#BSUB -W 12:00                      # time to run

pwd; hostname; date

source ./config.sh
names=($(cat $DATA_DIR/$ACCESSIONS))
JOBINDEX=$(($LSB_JOBINDEX - 1))
SAMPLE_ID=${names[${JOBINDEX}]}

### create output directories for the reports
### note that each array task writes one sample's report into its own subdirectory
OUTDIR=${WORK_DIR}/out_quast

### create the outdir if it does not exist
### (-p avoids an error when several array tasks create it at the same time)
if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Creating it."
  mkdir -p "$OUTDIR"
fi

### Contigs to use
CONTIGS=$UNICYCLER_DIR/${SAMPLE_ID}/assembly.fasta

### Run Quast
apptainer run ${QUAST} quast -t ${LSB_DJOB_NUMPROC} \
        -o $OUTDIR/${SAMPLE_ID} \
        -m 500 \
        $CONTIGS
'''

with open('04A_quast-lsf.sh', mode='w') as file:
    file.write(my_code)
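
The script above hard-codes the array range `[1-15]`, and each task picks its sample by subtracting 1 from the 1-based `LSB_JOBINDEX` to index the 0-based bash array. The sketch below mirrors that mapping in Python and builds an array specification from the length of a sample list, so the range can be compared against the accession file. The function names and sample IDs are hypothetical.

```python
def sample_for_index(sample_ids, lsb_jobindex):
    """Return the sample for a 1-based LSF array index, as the bash script does."""
    if not 1 <= lsb_jobindex <= len(sample_ids):
        raise IndexError(f"array index {lsb_jobindex} has no sample "
                         f"(list has {len(sample_ids)} entries)")
    return sample_ids[lsb_jobindex - 1]

def array_spec(job_name, sample_ids):
    """Build a '#BSUB -J' value whose array range covers every sample once."""
    n = len(sample_ids)
    return f"{job_name}[1-{n}]%{n}"

samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical accession IDs
print(sample_for_index(samples, 1))          # first array task -> first sample
print(array_spec("04A_quast-lsf", samples))
```

If the accession file does not contain exactly 15 entries, the `#BSUB -J` line would need a matching range; otherwise some samples are skipped or some tasks have nothing to process.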

## Step 2: CheckM2

CheckM2 is another tool that produces a quality report on the assembled contigs, including estimates of completeness and contamination.

The documentation can be found [here](https://github.com/chklovski/CheckM2).

### CheckM2 database file

This tool requires a database file to run. More information on downloading the database can be found in the documentation. The current database has been downloaded and saved in the following location:

/path-to_databases/checkm2_database/uniref100.KO.1.dmnd

In [4]:
# Create a script to run CheckM2 on the contigs of each sample
my_code = '''#!/bin/bash
#BSUB -J 04B_checkm-lsf[1-15]%15    # job name, with array number to run in parallel
#BSUB -n 24                         # number of CPUs required per task
#BSUB -q shared_memory              # the queue to run on
#BSUB -R "span[hosts=1]"            # number of hosts to spread the jobs across, 1 host used here
#BSUB -R "rusage[mem=140GB]"        # required total memory for the job 
#BSUB -o "./output.%J_%I.log"       # standard output file (%J is job ID, %I is the array number)
#BSUB -e "./error.%J_%I.log"        # standard error file (%J is job ID, %I is the array number)
#BSUB -W 12:00                      # time to run
                                  
pwd; hostname; date

source ./config.sh
names=($(cat $DATA_DIR/$ACCESSIONS))
JOBINDEX=$(($LSB_JOBINDEX - 1))
SAMPLE_ID=${names[${JOBINDEX}]}

### create output directory for the report
OUTDIR=${WORK_DIR}/out_checkm

### create the outdir if it does not exist
if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Creating it."
  mkdir -p "$OUTDIR"
fi

### Run checkm
apptainer run ${CHECKM} checkm2 \
        predict --threads ${LSB_DJOB_NUMPROC} \
        --input $UNICYCLER_DIR/${SAMPLE_ID} \
        -x fasta \
        --output-directory $OUTDIR/${SAMPLE_ID} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd  
'''

with open('04B_checkm-lsf.sh', mode='w') as file:
    file.write(my_code)

In [5]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

export JOB1="04A_quast-lsf"
export JOB2="04B_checkm-lsf"

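# JOB1: first job - no dependencies
# (the job name and array range come from the #BSUB -J line inside the script;
#  passing -J here would override it and drop the array)
bsub < ${JOB1}.sh

# JOB2 starts only after JOB1 has finished successfully
# (double quotes so that $JOB1 is expanded by the shell)
bsub -w "done($JOB1)" < ${JOB2}.sh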

'''

with open('04_launch_pipeline-lsf.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Make the pipeline script executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./04_launch_pipeline-lsf.sh

In [None]:
# Check whether the jobs are running
!bjobs -u $id

#### Let's check out the assembly stats from CheckM2

In [None]:
%cd $work_dir/out_checkm
!cat */quality_report.tsv #cats all reports
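
Concatenating the reports repeats the header line for every sample. As a rough alternative, the sketch below reads each report into a list of dictionaries and lists the samples that pass a quality filter. It assumes the reports contain `Completeness` and `Contamination` columns; the 90% and 5% thresholds are illustrative defaults rather than values from this notebook, and the function names are hypothetical.

```python
import csv
import glob
import os

def load_reports(checkm_dir):
    """Read every <sample>/quality_report.tsv under checkm_dir into a list of dicts."""
    rows = []
    pattern = os.path.join(checkm_dir, "*", "quality_report.tsv")
    for path in sorted(glob.glob(pattern)):
        sample = os.path.basename(os.path.dirname(path))
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                row["Sample"] = sample  # remember which directory the row came from
                rows.append(row)
    return rows

def passing(rows, min_completeness=90.0, max_contamination=5.0):
    """Return sample names whose report meets both (assumed) thresholds."""
    return [row["Sample"] for row in rows
            if float(row["Completeness"]) >= min_completeness
            and float(row["Contamination"]) <= max_contamination]
```

From inside `out_checkm`, `passing(load_reports("."))` would give the samples that meet both cutoffs; adjust the thresholds to whatever your project requires.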