# Metagenomic Binning & Quality

This notebook walks through the workflow for binning the contigs from a metagenomic assembly into species-level bins, also called metagenome-assembled genomes (MAGs). Once the contigs are binned, we will assess the quality and completeness of the genomes in the bins.

-----------

Sections:

1. Bin your assembled contigs into species-level bins (MAGs).
2. Rename your bins for further processing.
3. Use Quast to get stats on the MAGs.
4. Use CheckM to assess the quality of your species-level bins and MAGs.
5. Launch the pipeline to run each of the run scripts.

-----------



## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "YOUR_NETID"
xfile = "YOUR_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_binning_quality"
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/07_contam_removal"
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in a script, we will "source" the config file to set them.
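
To make the idea concrete, here is a minimal Python sketch of the same pattern: turn a dictionary of settings into `export NAME=value` lines that a bash script can `source`. The function name `write_config` and the demo values are hypothetical and are not part of this assignment; the cells below build the real `config.sh` with `echo`.

```python
import shlex
import tempfile
from pathlib import Path


def write_config(path, variables):
    """Write one `export NAME=value` line per variable, shell-quoting each value."""
    lines = [
        f"export {name}={shlex.quote(str(value))}"
        for name, value in variables.items()
    ]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical values, for illustration only
demo_vars = {"NETID": "demo_user", "XFILE": "demo_list.txt"}

# Write into a temporary directory so we do not touch the real config.sh
demo_path = Path(tempfile.mkdtemp()) / "config.sh"
config_lines = write_config(demo_path, demo_vars)
print(demo_path.read_text())
```

Any script that runs `source config.sh` on a file like this sees `$NETID` and `$XFILE` as ordinary environment variables, which is why the run scripts themselves never need editing.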

In [None]:
# create a config file with all of the variables you need
# notice that we are using the reads post-trimming, and post-human removal
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=$work_dir" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
!echo "export MEGAHIT_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_megahit" >> config.sh
!echo "export BWA=/contrib/singularity/shared/bhurwitz/bwa:0.7.8--he4a0461_9.sif" >> config.sh
!echo "export SAMTOOLS=/contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif" >> config.sh
!echo "export CONCOCT=/contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif" >> config.sh
!echo "export QUAST=/contrib/singularity/shared/bhurwitz/quast:5.2.0--py39pl5321h4e691d4_3.sif" >> config.sh
!echo "export CHECKM=/contrib/singularity/shared/bhurwitz/checkm2\:1.0.1--pyh7cba7a3_0.sif" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh

## Step 1: 09A_binning

Aligning reads to your megahit contigs with bwa, and binning with concoct

In this step, we will align the reads from your "screened and cleaned" fastq files back to the contigs you assembled with megahit. The alignments tell us the "coverage" of each contig, that is, how many reads map to it. The binning step then uses the coverage and the sequence composition of the contigs to place contigs into the same species-level (hopefully single-organism) bins.
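
To get a feel for the two signals that binning relies on, here is a toy Python sketch. It is not CONCOCT's implementation; it only illustrates what "sequence composition" (k-mer frequencies) and "coverage" (aligned bases per contig base) mean. The function names and the toy numbers are made up for illustration.

```python
from collections import Counter


def kmer_composition(sequence, k=4):
    """Return the relative frequency of each k-mer found in a sequence."""
    sequence = sequence.upper()
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def mean_coverage(total_aligned_bases, contig_length):
    """Average number of aligned read bases covering each contig position."""
    return total_aligned_bases / contig_length


# Toy contig and toy alignment totals, for illustration only
toy_contig = "ATGCATGCATGC"
composition = kmer_composition(toy_contig, k=4)
coverage = mean_coverage(total_aligned_bases=1500, contig_length=100)

print(composition)
print(coverage)
```

Contigs from the same genome tend to have similar k-mer profiles and similar coverage, so a binner can group them even without a reference genome.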

In [None]:
# Create a script to align your reads to each of your assemblies, and then bin
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. bwa aligns each of the fastq files in the trimmed and human filtered $FASTQ_DIR
# 3. We then run concoct to bin the contigs
# The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                         
#SBATCH --output=09A_binning-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=8G                                  

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

CONCOCT_OUTDIR=${WORK_DIR}/out_concoct
OUTDIR=${CONCOCT_OUTDIR}/${SAMPLE_ID}

### create the output directories if they do not exist
### (-p also creates $CONCOCT_OUTDIR and does not fail when
### several array tasks try to create it at the same time)
mkdir -p "$OUTDIR"

### final contigs (our input file for binning)
CONTIGS="${MEGAHIT_DIR}/${SAMPLE_ID}/final.contigs.fa"

### create the index from the contigs
apptainer run ${BWA} bwa index ${CONTIGS}

### align reads to the index
apptainer run ${BWA} bwa mem \
-t $SLURM_CPUS_PER_TASK \
${CONTIGS} \
${PAIR1} \
${PAIR2} \
> $OUTDIR/result.sam

### convert sam to bam
apptainer run ${SAMTOOLS} samtools view \
-b -F 4 ${OUTDIR}/result.sam > ${OUTDIR}/result.bam

apptainer run ${SAMTOOLS} samtools \
sort ${OUTDIR}/result.bam > ${OUTDIR}/my_sorted.bam

apptainer run ${SAMTOOLS} samtools \
index ${OUTDIR}/my_sorted.bam

### run the binning step
### following the protocol here: https://concoct.readthedocs.io/en/latest/usage.html
echo "Starting cut_up_fasta.py"
MIN_CONTIG_LENGTH=10000
apptainer run ${CONCOCT} cut_up_fasta.py \
${CONTIGS} \
--chunk_size ${MIN_CONTIG_LENGTH} \
--overlap_size 0 \
--bedfile ${OUTDIR}/contigs_10k.bed \
--merge_last \
> ${OUTDIR}/contigs_10k.fa

echo "Starting concoct_coverage_table.py"
apptainer run ${CONCOCT} concoct_coverage_table.py \
${OUTDIR}/contigs_10k.bed ${OUTDIR}/my_sorted.bam > ${OUTDIR}/coverage_table.tsv

echo "Starting concoct"
apptainer run ${CONCOCT} concoct --threads 24 \
--composition_file ${OUTDIR}/contigs_10k.fa --coverage_file ${OUTDIR}/coverage_table.tsv -b ${OUTDIR}

echo "Starting merge_cutup_clustering.py"
apptainer run ${CONCOCT} \
merge_cutup_clustering.py ${OUTDIR}/clustering_gt1000.csv > ${OUTDIR}/clustering_merged.csv

echo "Starting extract_fasta_bins.py"
mkdir -p ${OUTDIR}/fasta_bins
apptainer run ${CONCOCT} \
extract_fasta_bins.py ${CONTIGS} ${OUTDIR}/clustering_merged.csv --output_path ${OUTDIR}/fasta_bins

'''

with open('09A_binning.sh', mode='w') as file:
    file.write(my_code)

## Step 2: 09B_add_bin_nums

Let's rename the bins and combine them into a single file per sample.

Rock on! The last run script creates bins for your megahit contigs. These bins should represent the species (and individual organisms) present in your samples.

For each sample, the binning script writes a series of `.fa` files named by bin number (`0.fa`, `1.fa`, and so on) into the `fasta_bins` directory. These are the different genome bins predicted by binning. The next run script adds the bin number to the front of each contig name, and then combines the bins for each sample into a single file.
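
The renaming itself is simple. Here is a minimal Python sketch of the same idea, assuming each bin is a FASTA file whose name is the bin number. The contig names and sequences below are made up, and the actual run script does this with `sed`.

```python
def add_bin_prefix(fasta_text, bin_name):
    """Prefix every FASTA header line with '<bin_name>_'."""
    renamed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            renamed.append(f">{bin_name}_{line[1:]}")
        else:
            renamed.append(line)
    return "\n".join(renamed) + "\n"


# Hypothetical bins: bin number -> FASTA content
toy_bins = {
    "0": ">k141_1\nACGT\n",
    "1": ">k141_7\nGGCC\n>k141_9\nTTAA\n",
}

# Combine all renamed bins into one FASTA string
combined = "".join(add_bin_prefix(text, name) for name, text in toy_bins.items())
print(combined)
```

Keeping the bin number in each contig name means we can still tell which bin a contig came from after everything is merged into one file.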

In [None]:
# Create a script to rename genome bins
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command
# 2. The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=01:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class                        
#SBATCH --output=09B_add_bin_nums-%j.out
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=5G 
source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

CONCOCT_OUTDIR=${WORK_DIR}/out_concoct

cd $CONCOCT_OUTDIR

for i in {0..7}; do
    SAMPLE_ID=${names[$i]}
    BIN_DIR=${CONCOCT_OUTDIR}/${SAMPLE_ID}/fasta_bins
    echo ${SAMPLE_ID}
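    ### start from an empty file so reruns do not append duplicate contigs
    : > ${SAMPLE_ID}.all_contigs.fna
    cd ${BIN_DIR}
    for file in *.fa; do
        ### strip the .fa extension to get the bin number
        num=${file%.fa}
        ### add the bin number to the front of each contig name
        sed -e "s/^>/>${num}_/" "$file" >> $CONCOCT_OUTDIR/${SAMPLE_ID}.all_contigs.fna
    done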
    cd $CONCOCT_OUTDIR
done

cd $WORK_DIR
'''

with open('09B_add_bin_nums.sh', mode='w') as file:
    file.write(my_code)

## Step 3: 09C_quast

Quast (checking the quality of our assembly)

How good are our assemblies? We can check the quality by running tools that look at the contigs produced by our assembly algorithms. 
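
One of the most common summary statistics in an assembly report is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. Here is a short Python sketch of that calculation, using made-up contig lengths. It illustrates the definition only and is not how any particular tool computes it internally.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first contig that pushes us to half of the assembly is the N50
        if running_total >= half_total:
            return length
    return 0


# Made-up contig lengths, for illustration only
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))
```

A larger N50 generally means the assembly is less fragmented, but it says nothing about whether the contigs are correct, which is why we also look at other measures.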

Let's check the quality of our binned megahit contigs using a bioinformatics tool called quast.

In [None]:
# Create a script to run Quast on each of our contig files
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command
# 2. Quast runs on the combined, binned contig file for each sample in out_concoct
# 3. The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=12:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                         
#SBATCH --output=09C_quast-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create the output directory for the reports
OUTDIR=${WORK_DIR}/out_quast

### create the outdir if it does not exist
### (-p does not fail when several array tasks create it at the same time)
mkdir -p "$OUTDIR"

### Contigs to use post-binning
CONCOCT_OUTDIR=${WORK_DIR}/out_concoct
CONCOCT_CONTIGS=$CONCOCT_OUTDIR/${SAMPLE_ID}.all_contigs.fna

### Run Quast
apptainer run ${QUAST} quast -t 24 \
        -o $OUTDIR/${SAMPLE_ID} \
        -m 500 \
        $CONCOCT_CONTIGS
'''

with open('09C_quast.sh', mode='w') as file:
    file.write(my_code)

## Step 4: 09D_checkm

CheckM2 is another tool that produces a quality report, in this case estimating the completeness and contamination of each genome bin.

The documentation can be found [here](https://github.com/chklovski/CheckM2).

This tool requires a database file to run. More information on downloading the database can be found in the documentation. The current database has been downloaded and saved in the following location:

/groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd
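
Once a report exists, a common next step is to sort the bins into quality tiers. The Python sketch below assumes the report is a tab-separated file with `Name`, `Completeness`, and `Contamination` columns (check your own output file, since the column names here are an assumption). The thresholds are a simplified version of the commonly used MIMAG-style cutoffs and ignore the rRNA and tRNA criteria. The report rows are made up.

```python
import csv
import io

# Made-up report content, for illustration only
toy_report = (
    "Name\tCompleteness\tContamination\n"
    "0\t95.2\t1.3\n"
    "1\t62.0\t8.5\n"
    "2\t30.1\t12.0\n"
)


def classify_bin(completeness, contamination):
    """Assign a simplified quality tier from completeness and contamination (%)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


reader = csv.DictReader(io.StringIO(toy_report), delimiter="\t")
quality = {
    row["Name"]: classify_bin(float(row["Completeness"]), float(row["Contamination"]))
    for row in reader
}
print(quality)
```

To use this on a real report, you would open the file from your output directory instead of reading from the toy string.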

Let's create a run script to run CheckM2.

In [None]:
# Create a script to run CheckM2 on each of our bins
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command
# 2. CheckM2 runs on the bin files for each sample in out_concoct/<sample>/fasta_bins
# 3. The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-7                       
#SBATCH --output=09D_checkm-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### create output directory for the report
OUTDIR=${WORK_DIR}/out_checkm

### create the outdir if it does not exist
### (-p does not fail when several array tasks create it at the same time)
mkdir -p "$OUTDIR"

### Contigs to use post-binning
CONCOCT_OUTDIR=${WORK_DIR}/out_concoct
CONCOCT_CONTIGS=${CONCOCT_OUTDIR}/${SAMPLE_ID}/fasta_bins

### Run CheckM2
apptainer run ${CHECKM} checkm2 \
        predict --threads 24 \
        --input $CONCOCT_CONTIGS \
        -x fa \
        --output-directory $OUTDIR/${SAMPLE_ID} \
        --database_path /groups/bhurwitz/databases/checkm2_database/uniref100.KO.1.dmnd  
'''

with open('09D_checkm.sh', mode='w') as file:
    file.write(my_code)

## Step 5: Putting it all together

Once you have created the run scripts, you are ready to put them together in a pipeline that runs the steps one by one. Notice which steps are dependent on the others.
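
The chaining works by reading the job ID that `sbatch` prints ("Submitted batch job <id>") and passing it to the next submission with `--dependency=afterok:<id>`, so the next job only starts if the previous one finished successfully. Here is a small Python sketch of that logic. It only builds the command and does not submit anything; the helper names and the job ID are made up.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Could not find a job ID in: {sbatch_output!r}")
    return match.group(1)


def dependent_command(script, parent_job_id=None):
    """Build an sbatch command that waits for parent_job_id to finish successfully."""
    command = ["sbatch"]
    if parent_job_id is not None:
        command.append(f"--dependency=afterok:{parent_job_id}")
    command.append(script)
    return command


# Made-up sbatch output, for illustration only
toy_output = "Submitted batch job 123456"
job_id = parse_job_id(toy_output)
next_command = dependent_command("09B_add_bin_nums.sh", job_id)
print(next_command)
```

When the parent is an array job, an `afterok` dependency on its job ID waits for every task in the array to succeed before the next step starts.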

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 09A_binning: first job - no dependencies
job1=$(sbatch 09A_binning.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

# 09B_add_bin_nums: jid2 depends on jid1
job2=$(sbatch --dependency=afterok:$jid1 09B_add_bin_nums.sh)
jid2=$(echo $job2 | sed 's/^Submitted batch job //')
echo $jid2

# 09C_quast: jid3 depends on jid2
job3=$(sbatch --dependency=afterok:$jid2 09C_quast.sh)
jid3=$(echo $job3 | sed 's/^Submitted batch job //')
echo $jid3

# 09D_checkm: jid4 depends on jid3
job4=$(sbatch --dependency=afterok:$jid3 09D_checkm.sh)
jid4=$(echo $job4 | sed 's/^Submitted batch job //')
echo $jid4

'''

with open('09_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Make all of the scripts executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./09_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Note that this will take some time to run, so go get a coffee!
!squeue --user=$netid

## Final Step
Copy your notebook to the current working directory

In [None]:
!cp ~/be487-fall-2024/assignments/09_binning_quality/hw09_binning_quality.ipynb $work_dir