# Metagenomic Binning

This notebook walks through the workflow for binning assembled contigs into species-level bins, which are candidate metagenome-assembled genomes (MAGs).

1. Create species-level bins for your megahit contigs
2. Create species-level bins for your metaspades contigs


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "YOUR_NETID"
xfile = "YOUR_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_metag_binning"
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in a script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
# notice that we are using the reads post-trimming, and post-human removal
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/09_metag_binning" >> config.sh
!echo "export XFILE_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/07_contam_removal" >> config.sh
!echo "export MEGAHIT_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_megahit" >> config.sh
!echo "export METASPADES_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_spades" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh
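
As an optional sanity check, the exported variables can also be read back into Python and each directory tested for existence. The sketch below is only an illustration: `parse_config` and `missing_dirs` are hypothetical helper names, not part of the course scripts, and it assumes every line in `config.sh` has the simple `export NAME=value` form written above.

```python
from pathlib import Path


def parse_config(text):
    """Return a dict of NAME -> value for lines of the form 'export NAME=value'."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export ") or "=" not in line:
            continue  # skip blank lines, comments, and anything unexpected
        name, value = line[len("export "):].split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def missing_dirs(settings):
    """List the *_DIR entries whose paths do not exist on this machine."""
    return [
        name for name, value in settings.items()
        if name.endswith("_DIR") and not Path(value).is_dir()
    ]


if __name__ == "__main__":
    config_path = Path("config.sh")
    if config_path.exists():
        settings = parse_config(config_path.read_text())
        for name, value in settings.items():
            print(f"{name} = {value}")
        print("Missing directories:", missing_dirs(settings) or "none")
    else:
        print("config.sh not found; run the cells above first")
```

Note that `WORK_DIR/out_megahit` and `WORK_DIR/out_spades` are created later by the job scripts, so only the directories named in `config.sh` itself are checked here.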

## Step 1: Aligning reads to your megahit contigs with bwa and binning with concoct

In this step, we will align the reads from your "screened and cleaned" fastq files back to the contigs you assembled with megahit (Step 2 repeats the process for metaspades). The alignments are then used to determine the "coverage" of each contig. The binning step uses coverage together with the sequence composition of the contigs to place them into species-level (hopefully single-organism) bins.
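
To make "coverage and sequence composition" concrete, the sketch below computes both kinds of features for a toy contig: the relative frequency of each 4-mer (composition) and the mean per-base read depth (coverage). This is a simplified illustration, not CONCOCT's implementation; the function names, the toy sequence, and the depth values are made up for the example.

```python
from collections import Counter


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of each k-mer observed in seq."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")  # skip k-mers containing N, etc.
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def mean_coverage(depths):
    """Mean read depth across the positions of a contig."""
    return sum(depths) / len(depths) if depths else 0.0


# Toy contig and made-up per-base depths, for illustration only
contig = "ACGTACGTAC"
depths = [10, 12, 11, 9, 10, 10, 12, 11, 9, 10]

freqs = kmer_frequencies(contig)
print(freqs)
print(mean_coverage(depths))
```

Contigs from the same genome tend to share a similar k-mer profile and a similar read depth, which is the signal a binner clusters on.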

In [None]:
# Create a script to align your reads to each of your assemblies, and then bin
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. bwa aligns each of the fastq files in the trimmed and human filtered $FASTQ_DIR
# 3. We then run concoct to bin the contigs
# The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-mega-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=8G                                  

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit
OUTDIR=${MEGAHIT_OUTDIR}/${SAMPLE_ID}

### create the outdir if it does not exist
if [[ ! -d "$MEGAHIT_OUTDIR" ]]; then
  echo "$MEGAHIT_OUTDIR does not exist. Directory created"
  mkdir $MEGAHIT_OUTDIR
fi

if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Directory created"
  mkdir $OUTDIR
fi

### final contigs
CONTIGS="${MEGAHIT_DIR}/${SAMPLE_ID}/final.contigs.fa"

### create the index from the contigs
apptainer run /contrib/singularity/shared/bhurwitz/bwa:0.7.8--he4a0461_9.sif bwa index ${CONTIGS}

### align reads to the index
apptainer run /contrib/singularity/shared/bhurwitz/bwa:0.7.8--he4a0461_9.sif bwa mem \
-t $SLURM_CPUS_PER_TASK \
${CONTIGS} \
${PAIR1} \
${PAIR2} \
> $OUTDIR/result.sam

### convert sam to bam
apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools view \
-b -F 4 ${OUTDIR}/result.sam > ${OUTDIR}/result.bam

apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools \
sort ${OUTDIR}/result.bam > ${OUTDIR}/my_sorted.bam

apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools \
index ${OUTDIR}/my_sorted.bam

### run the binning step
### following the protocol here: https://concoct.readthedocs.io/en/latest/usage.html
echo "Starting cut_up_fasta.py"
MIN_CONTIG_LENGTH=10000
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif cut_up_fasta.py \
${CONTIGS} \
--chunk_size ${MIN_CONTIG_LENGTH} \
--overlap_size 0 \
--bedfile ${OUTDIR}/contigs_10k.bed \
--merge_last \
> ${OUTDIR}/contigs_10k.fa

echo "Starting concoct_coverage_table.py"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif concoct_coverage_table.py \
${OUTDIR}/contigs_10k.bed ${OUTDIR}/my_sorted.bam > ${OUTDIR}/coverage_table.tsv

echo "Starting concoct"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif concoct --threads $SLURM_CPUS_PER_TASK \
--composition_file ${OUTDIR}/contigs_10k.fa --coverage_file ${OUTDIR}/coverage_table.tsv -b ${OUTDIR}/out_concoct/

echo "Starting merge_cutup_clustering.py"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif \
merge_cutup_clustering.py ${OUTDIR}/out_concoct/clustering_gt1000.csv > ${OUTDIR}/out_concoct/clustering_merged.csv

echo "Starting extract_fasta_bins.py"
mkdir -p ${OUTDIR}/out_concoct/fasta_bins
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif \
extract_fasta_bins.py ${CONTIGS} ${OUTDIR}/out_concoct/clustering_merged.csv --output_path ${OUTDIR}/out_concoct/fasta_bins

'''

with open('megahit_bin_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Check the code and make sure your script above was created.
!cat megahit_bin_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the megahit_bin_parallel.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run the megahit contig binning
# Remember that this may take a while to run, so take a break, and get a coffee.
!sbatch ./megahit_bin_parallel.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or are running) you can check the progress
# and also look for errors in the *out files
# Note that this step will take 1-2 hours per file
# For example, you can look at Job-mega-bins-0.out
!head Job-mega-bins-0.out
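
With five array tasks there are five log files to read. As a convenience, a short Python sketch like the one below can flag lines that look like failures. The keyword list is an assumption (each tool words its errors differently), so treat a clean result as a hint rather than proof that the job succeeded.

```python
from pathlib import Path

# Assumed keywords; adjust to match the messages your tools actually print
KEYWORDS = ("error", "traceback", "killed", "no such file")


def suspicious_lines(log_text, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text contains any keyword."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.strip()))
    return hits


# Scan every megahit binning log in the current directory
for log_file in sorted(Path(".").glob("Job-mega-bins-*.out")):
    hits = suspicious_lines(log_file.read_text(errors="replace"))
    print(f"{log_file.name}: {len(hits)} suspicious line(s)")
    for number, line in hits[:5]:  # show at most five per file
        print(f"  line {number}: {line}")
```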

Rock on! You have created bins for your megahit contigs. These bins should represent the species (and individual organisms) present in your samples.

This step will generate a series of files for each of your samples. Take a look at the files generated. In particular, you should see a series of *.fa files named by number (for example, 1.fa). These are the different genome bins predicted by binning.

In [None]:
# Double check that you have bins for your contigs from megahit.
# These bins are in files named like this: "1.fa"
!ls $work_dir/out_megahit/ERR*/out_concoct/fasta_bins

That is correct! Each of the files 1.fa, 2.fa, 3.fa, ... represents one bin. Ideally, a bin contains contigs from a single species (and hopefully a single organism). Later, we can assess how complete each bin is, meaning the percentage of that species' genome that the bin represents.

In [None]:
# Let's check one. For example, mine is ERR2198611/out_concoct/fasta_bins/1.fa
# and there are 6 contigs in that file. How about yours?
# Replace YOUR_SAMPLE with one of your sample ids.
!grep -c '>' $work_dir/out_megahit/YOUR_SAMPLE/out_concoct/fasta_bins/1.fa
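
Counting one bin at a time gets tedious, so here is a hedged Python sketch that summarizes every bin in a `fasta_bins` directory by number of contigs and total bases. The function names are hypothetical, and `YOUR_SAMPLE` is a placeholder to replace with one of your sample ids, as in the cell above.

```python
from pathlib import Path


def fasta_stats(path):
    """Return (number_of_contigs, total_bases) for one FASTA file."""
    contigs = 0
    bases = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                contigs += 1
            else:
                bases += len(line.strip())
    return contigs, bases


def summarize_bins(bin_dir):
    """Map each bin file name (e.g. '1.fa') to its (contigs, bases) tuple."""
    return {fa.name: fasta_stats(fa) for fa in sorted(Path(bin_dir).glob("*.fa"))}


# work_dir is defined at the top of this notebook; fall back to "." if it is not
base = Path(globals().get("work_dir", "."))
bin_dir = base / "out_megahit" / "YOUR_SAMPLE" / "out_concoct" / "fasta_bins"
for name, (contigs, bases) in summarize_bins(bin_dir).items():
    print(f"{name}\t{contigs} contigs\t{bases} bp")
```

Bins with very few contigs or very few bases are worth a second look before interpreting them as genomes.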

Now, we are going to generate a concatenated file that contains all of our genome bins put together. We will change the fasta header name to include the bin number so that we can tell them apart later.

Let's write a script to do this. Note that this script will just run locally on this machine, so no coffee break required!

In [None]:
my_code = '''#!/bin/bash

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit

cd $MEGAHIT_OUTDIR

for i in {0..4}; do
    SAMPLE_ID=${names[$i]}
    BIN_DIR=${MEGAHIT_OUTDIR}/${SAMPLE_ID}/out_concoct/fasta_bins
    echo ${SAMPLE_ID}
    touch ${SAMPLE_ID}.all_contigs.fna
    cd ${BIN_DIR} 
    for file in *.fa; do
        num=${file%.fa}
        cat $num.fa | sed -e "s/^>/>${num}_/" >> $MEGAHIT_OUTDIR/${SAMPLE_ID}.all_contigs.fna
    done
    cd $MEGAHIT_OUTDIR
done

cd $WORK_DIR

'''

with open('megahit_add_bin_nums.sh', mode='w') as file:
    file.write(my_code)

In [None]:
!chmod +x ./megahit_add_bin_nums.sh
!ls -l megahit_add_bin_nums.sh

In [None]:
!./megahit_add_bin_nums.sh

In [None]:
# Let's check to see if the re-naming worked, where all ids are 
# named according to their bin id "_" name.
# My concatenated bin file is called ERR2198611.all_contigs.fna
# Change this to one of your samples
# You should see that the ids all start with their bin_id now
!grep '>' $work_dir/out_megahit/ERR2198611.all_contigs.fna | head

Looks great! Now we have all of our bins assigned, and we have all of our contigs in a single file.
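
Because every header now starts with its bin number, the concatenated file can be summarized per bin straight from the headers. The sketch below is one way to do that in Python; `contigs_per_bin` is a hypothetical helper, the toy headers are invented for illustration, and it assumes headers have the `>BIN_contigname` form produced by the renaming script.

```python
from collections import Counter


def contigs_per_bin(lines):
    """Count FASTA headers per bin, assuming headers look like '>BIN_name'."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            bin_id = line[1:].split("_", 1)[0]  # text before the first underscore
            counts[bin_id] += 1
    return counts


# Toy lines for illustration; with real data, pass an open file handle instead
example = [">1_k141_10", "ACGT", ">1_k141_22", "GGCC", ">2_k141_3", "TTAA"]
for bin_id, n in sorted(contigs_per_bin(example).items()):
    print(f"bin {bin_id}: {n} contig(s)")
```

With a real file, something like `with open(path) as fh: counts = contigs_per_bin(fh)` would give the same kind of tally for one sample.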

## Step 2: Binning contigs from your Metaspades Assembly

Rinse and repeat!

In this step, we will create species-level bins for the contigs that were created from your metaspades assembly.

In [None]:
# Create a script to align your reads to each of your assemblies, and then bin
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. bwa aligns each of the fastq files in the trimmed and human filtered $FASTQ_DIR
# 3. concoct is used to bin the contigs into "species-level" bins 
# The results will be written into our $WORK_DIR
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-spades-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=8G                                  

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

METASPADES_OUTDIR=${WORK_DIR}/out_spades
OUTDIR=${METASPADES_OUTDIR}/${SAMPLE_ID}

### create the outdir if it does not exist
if [[ ! -d "$METASPADES_OUTDIR" ]]; then
  echo "$METASPADES_OUTDIR does not exist. Directory created"
  mkdir $METASPADES_OUTDIR
fi

if [[ ! -d "$OUTDIR" ]]; then
  echo "$OUTDIR does not exist. Directory created"
  mkdir $OUTDIR
fi

### final contigs
CONTIGS="${METASPADES_DIR}/${SAMPLE_ID}/contigs.fasta"

### create the index from the contigs
apptainer run /contrib/singularity/shared/bhurwitz/bwa:0.7.8--he4a0461_9.sif bwa index ${CONTIGS}

### align reads to the index
apptainer run /contrib/singularity/shared/bhurwitz/bwa:0.7.8--he4a0461_9.sif bwa mem \
-t $SLURM_CPUS_PER_TASK \
${CONTIGS} \
${PAIR1} \
${PAIR2} \
> $OUTDIR/result.sam

### convert sam to bam
apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools view \
-b -F 4 ${OUTDIR}/result.sam > ${OUTDIR}/result.bam

apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools \
sort ${OUTDIR}/result.bam > ${OUTDIR}/my_sorted.bam

apptainer run /contrib/singularity/shared/bhurwitz/samtools:1.17--hd87286a_1.sif samtools \
index ${OUTDIR}/my_sorted.bam

### run the binning step
### following the protocol here: https://concoct.readthedocs.io/en/latest/usage.html
echo "Starting cut_up_fasta.py"
MIN_CONTIG_LENGTH=10000
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif cut_up_fasta.py \
${CONTIGS} \
--chunk_size ${MIN_CONTIG_LENGTH} \
--overlap_size 0 \
--bedfile ${OUTDIR}/contigs_10k.bed \
--merge_last \
> ${OUTDIR}/contigs_10k.fa

echo "Starting concoct_coverage_table.py"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif concoct_coverage_table.py \
${OUTDIR}/contigs_10k.bed ${OUTDIR}/my_sorted.bam > ${OUTDIR}/coverage_table.tsv

echo "Starting concoct"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif concoct --threads $SLURM_CPUS_PER_TASK \
--composition_file ${OUTDIR}/contigs_10k.fa --coverage_file ${OUTDIR}/coverage_table.tsv -b ${OUTDIR}/out_concoct/

echo "Starting merge_cutup_clustering.py"
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif \
merge_cutup_clustering.py ${OUTDIR}/out_concoct/clustering_gt1000.csv > ${OUTDIR}/out_concoct/clustering_merged.csv

echo "Starting extract_fasta_bins.py"
mkdir -p ${OUTDIR}/out_concoct/fasta_bins
apptainer run /contrib/singularity/shared/bhurwitz/concoct:1.1.0--py311h245ed52_4.sif \
extract_fasta_bins.py ${CONTIGS} ${OUTDIR}/out_concoct/clustering_merged.csv --output_path ${OUTDIR}/out_concoct/fasta_bins

'''

with open('metaspades_bin_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# check the code was created
!cat metaspades_bin_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the metaspades_bin_parallel.sh script?
!ls

In [None]:
# Let's run the sbatch script, this should take ~1 hour to run
# Time for some coffee..
!sbatch ./metaspades_bin_parallel.sh

In [None]:
# Welcome back, let's see if the job is still running
!squeue --user=$netid

In [None]:
# Double check that you have bins for your contigs from metaspades.
# These bins are in files named like this: "1.fa"
!ls $work_dir/out_spades/ERR*/out_concoct/fasta_bins
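
Once the metaspades jobs have finished, it can be informative to compare how many bins each assembler produced per sample. The following is a rough sketch that assumes the directory layout used in this notebook (`out_megahit/SAMPLE/out_concoct/fasta_bins` and the `out_spades` equivalent); `count_bins` is a hypothetical helper.

```python
from pathlib import Path


def count_bins(assembler_outdir):
    """Map sample name -> number of *.fa bin files under out_concoct/fasta_bins."""
    counts = {}
    for bin_dir in sorted(Path(assembler_outdir).glob("*/out_concoct/fasta_bins")):
        sample = bin_dir.parent.parent.name  # SAMPLE/out_concoct/fasta_bins
        counts[sample] = len(list(bin_dir.glob("*.fa")))
    return counts


# work_dir is defined at the top of this notebook; fall back to "." if it is not
base = Path(globals().get("work_dir", "."))
megahit = count_bins(base / "out_megahit")
spades = count_bins(base / "out_spades")
for sample in sorted(set(megahit) | set(spades)):
    print(f"{sample}\tmegahit={megahit.get(sample, 0)}\tmetaspades={spades.get(sample, 0)}")
```

A difference in bin counts between assemblers is not an error in itself; it simply reflects that the two assemblies produced different contigs.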

In [None]:
my_code = '''#!/bin/bash

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

METASPADES_OUTDIR=${WORK_DIR}/out_spades

cd $METASPADES_OUTDIR

for i in {0..4}; do
    SAMPLE_ID=${names[$i]}
    BIN_DIR=${METASPADES_OUTDIR}/${SAMPLE_ID}/out_concoct/fasta_bins
    echo ${SAMPLE_ID}
    touch ${SAMPLE_ID}.all_contigs.fna
    cd ${BIN_DIR} 
    for file in *.fa; do
        num=${file%.fa}
        cat $num.fa | sed -e "s/^>/>${num}_/" >> $METASPADES_OUTDIR/${SAMPLE_ID}.all_contigs.fna
    done
    cd $METASPADES_OUTDIR
done

cd $WORK_DIR

'''

with open('metaspades_add_bin_nums.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Change permissions and check to see you have the script
!chmod +x ./metaspades_add_bin_nums.sh
!ls -l metaspades_add_bin_nums.sh

In [None]:
# Run the script to add bin ids and create a single fasta
!./metaspades_add_bin_nums.sh

In [None]:
# Let's check to see if the re-naming worked, where all ids are 
# named according to their bin id "_" name.
# My concatenated bin file is called ERR2198611.all_contigs.fna
# Change YOUR_FILE to one of your samples
# You should see that the ids all start with their bin_id now
!grep '>' $work_dir/out_spades/YOUR_FILE.all_contigs.fna | head

You did it! We have now created bins for all of our contigs, and we have a single fasta file per sample for each assembler. Next, we will run these files through the assembly quality control process. But that is for next time!

## Final Step
Copy your notebook to your working directory

In [None]:
!cp ~/09_concoct_binning.ipynb $work_dir