# Metagenomic Binning

This notebook walks through the workflow for binning the contigs from a metagenome assembly into species-level bins, also known as metagenome-assembled genomes (MAGs).

1. Create species-level bins for your megahit assembly
2. Create species-level bins for your metaspades assembly


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "YOUR_NETID"
xfile = "YOUR_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_metag_binning"
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we are going to use a config file that defines all of these variables for us. When a script needs the variables, it will "source" the config file to set them.
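To make the idea concrete, here is a minimal Python sketch of the text that the config file ends up containing: one `export NAME=value` line per variable. The helper name `render_config` and the example netid and xfile values are hypothetical; the notebook itself writes the file with `echo` in the next cell.

```python
def render_config(netid, xfile, base="/xdisk/bhurwitz/bh_class"):
    """Return the text of a config.sh-style file as one string."""
    root = f"{base}/{netid}/assignments"
    variables = {
        "NETID": netid,
        "XFILE": xfile,
        "WORK_DIR": f"{root}/09_metag_binning",
        "XFILE_DIR": f"{root}/05_getting_data",
        "FASTQ_DIR": f"{root}/07_contam_removal",
        "MEGAHIT_DIR": f"{root}/08_assembly/out_megahit",
        "METASPADES_DIR": f"{root}/08_assembly/out_spades",
    }
    # one "export NAME=value" line per variable, in the order listed above
    return "".join(f"export {name}={value}\n" for name, value in variables.items())

# hypothetical values, for illustration only
config_text = render_config("example_netid", "example_xfile.txt")
print(config_text)
```

Because every path is derived from the netid, a typo in `netid` shows up in every line, which is why it is worth reading the file back with `cat` before submitting any jobs.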

In [None]:
# create a config file with all of the variables you need
# notice that we are using the reads post-trimming, and post-human removal
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/09_metag_binning" >> config.sh
!echo "export XFILE_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/07_contam_removal" >> config.sh
!echo "export MEGAHIT_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_megahit" >> config.sh
!echo "export METASPADES_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_spades" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh

## Step 1: Binning contigs from your Megahit Assembly

In this step, we will create species-level bins for the contigs that were created from your megahit assembly. Note that this step will take about 1 hour to run. Once you submit the script using sbatch, you can go on to step 2 to kick off the binning for the metaspades assembly at the same time (up to the sbatch step).
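The script in the next cell is submitted as a SLURM array job (`--array=0-4`), so five copies run, each with its own `SLURM_ARRAY_TASK_ID`, and each copy uses that id to pick one sample from the xfile. The following Python sketch mimics that lookup. The function name `pick_sample` and the sample ids are made up for illustration; your ids come from `$XFILE_DIR/$XFILE`.

```python
def pick_sample(xfile_text, task_id):
    """Mimic names=($(cat file)) followed by ${names[task_id]} in the bash script."""
    names = xfile_text.split()  # bash also splits the file contents on whitespace
    return names[task_id]

# hypothetical accession list with five samples, one per line
example_xfile = "SAMPLE_A\nSAMPLE_B\nSAMPLE_C\nSAMPLE_D\nSAMPLE_E\n"

# --array=0-4 produces the task ids 0, 1, 2, 3 and 4
for task_id in range(5):
    print(task_id, pick_sample(example_xfile, task_id))
```

If your xfile has a different number of samples, the array range in the script would need to change to match, since an index past the end of the list gives an empty sample id in bash.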

In [None]:
# Create a script to run maxbin to bin megahit contigs by species
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. maxbin runs on each of the fastq files in the trimmed and human filtered $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 50G of memory); we are also asking for 2 hours to run.
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=02:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-mega-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=50G                                  

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit
OUTDIR=${MEGAHIT_OUTDIR}/${SAMPLE_ID}

### create the output directories if they do not exist
### (-p also creates $MEGAHIT_OUTDIR and is safe when several array tasks start together)
mkdir -p "$OUTDIR"

### final contigs
CONTIGS="${MEGAHIT_DIR}/${SAMPLE_ID}/final.contigs.fa"

apptainer run /contrib/singularity/shared/bhurwitz/maxbin2:2.2.7--hdbdd923_5.sif run_MaxBin.pl \
-thread 24 -contig ${CONTIGS} \
-reads ${PAIR1} \
-reads2 ${PAIR2} \
-out ${OUTDIR}/${SAMPLE_ID} # OUTDIR is the per-sample ERR* directory; SAMPLE_ID is the prefix of each output file (e.g. ERR*.001.fasta)

'''

with open('megahit_bin_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Check the code and make sure your script above was created.
!cat megahit_bin_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the megahit_bin_parallel.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run the megahit contig binning
# Remember that this may take a while to run, so take a break, and get a coffee.
!sbatch ./megahit_bin_parallel.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or are running) you can check the progress
# and also look for errors in the *out files
# Note that this step will take ~55 minutes per file
# For example, you can look at Job-mega-bins-0.out
!cat Job-mega-bins-0.out

Rock on! You have created bins for your megahit contigs. These bins should represent the species present in your samples.

This step will generate a series of files for each of your samples. Take a look at the files generated. In particular, you should see a series of numbered *.fasta files (*.001.fasta, *.002.fasta, ...). These are the different genome bins predicted by MaxBin.
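If you prefer to collect the bin files from Python rather than with `ls`, a sketch along these lines could work. The function name `list_bins` and the file names in the demo are hypothetical; the only assumption taken from this notebook is the `<sample>.<number>.fasta` naming pattern.

```python
import tempfile
from pathlib import Path

def list_bins(sample_dir, sample_id):
    """Return the bin FASTA files for one sample, sorted by file name."""
    # zero-padded bin numbers (001, 002, ...) sort correctly as plain text
    return sorted(Path(sample_dir).glob(f"{sample_id}.*.fasta"))

# demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["S1.002.fasta", "S1.001.fasta", "S1.summary", "S1.log"]:
        (Path(tmp) / name).touch()
    bin_names = [p.name for p in list_bins(tmp, "S1")]

print(bin_names)
```

The number of files returned is the number of bins predicted for that sample.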

In [None]:
# Double check that you have bins for your contigs from megahit.
# These bins are in files named like this: "ERR2198611.001.fasta"
!ls $work_dir/out_megahit/ERR*

Let's see if we have the ERR*.summary files

In [None]:
# the megahit binning script wrote each sample's results into its own ERR* directory
!ls $work_dir/out_megahit/ERR*/ERR*summary

In [None]:
# Choose one of the summary files from above and look at it in detail
# Replace YOUR_FILE with one of your sample ids (for example ERR2198611)
# What is shown?
!head $work_dir/out_megahit/YOUR_FILE/YOUR_FILE.summary

That is correct! Each of the files *001.fasta, *002.fasta ... represents one bin, and each bin should contain one species. The summary also shows how complete each bin is (meaning the % of the genome of that species that is represented).
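A summary table like this can also be read into Python for filtering. The sketch below assumes the file is tab-separated with a header row; the column names and values in the example are made up for illustration, so check them against your own `.summary` file before relying on them.

```python
import csv
import io

def read_summary(handle):
    """Read a tab-separated table into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))

# made-up table for illustration only; column names and values are assumptions
example = io.StringIO(
    "Bin name\tCompleteness\tGenome size\tGC content\n"
    "S1.001.fasta\t95.3%\t2100000\t41.2\n"
    "S1.002.fasta\t48.6%\t900000\t55.0\n"
)
rows = read_summary(example)

# keep the bins whose completeness is at least 50%
mostly_complete = [
    row["Bin name"]
    for row in rows
    if float(row["Completeness"].rstrip("%")) >= 50.0
]
print(mostly_complete)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` example, and adjust the column names to whatever the header row of your file contains.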

In [None]:
# Let's check one, for example mine is called ERR2198611.001.fasta
# and there are 22 contigs in that file. How about yours?
# Replace YOUR_FILE with one of your sample ids; the bins are inside that sample's directory
!grep -c '^>' $work_dir/out_megahit/YOUR_FILE/YOUR_FILE.001.fasta

Now, we are going to generate a concatenated file that contains all of our genome bins put together. We will change the fasta header name to include the bin number so that we can tell them apart later.

Let's write a script to do this. Note that this script will just run locally on this machine, so no coffee break required!
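Before looking at the bash version, here is the same renaming idea as a small Python sketch: take the bin number from the file name and put it in front of every FASTA header. The function names and the contig headers in the example are hypothetical; the notebook's script does the equivalent with `sed`.

```python
def bin_number(file_name, sample_id):
    """Extract the bin number, e.g. '001' from '<sample_id>.001.fasta'."""
    return file_name[len(sample_id) + 1 : -len(".fasta")]

def add_bin_prefix(fasta_text, bin_num):
    """Prefix every FASTA header with '<bin_num>_'; sequence lines are unchanged."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = f">{bin_num}_{line[1:]}"
        out.append(line)
    return "\n".join(out) + "\n"

# made-up contig records for illustration
example = ">contig_12 len=4\nACGT\n>contig_40 len=4\nGGCC\n"
renamed = add_bin_prefix(example, bin_number("S1.001.fasta", "S1"))
print(renamed)
```

Only lines that start with `>` are touched, so the sequences themselves are copied through exactly as they were.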

In [None]:
my_code = '''#!/bin/bash

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit

cd $MEGAHIT_OUTDIR

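for i in {0..4}; do
    SAMPLE_ID=${names[$i]}
    echo ${SAMPLE_ID}
    # start from an empty file so that re-running the script does not append duplicates
    : > ${SAMPLE_ID}.all_contigs.fna
    # the bins for each sample are in the per-sample directory ${SAMPLE_ID}/
    for file in ${SAMPLE_ID}/${SAMPLE_ID}.*.fasta; do
        # pull the bin number (e.g. 001) out of the file name
        num=$(basename $file .fasta | sed "s/^${SAMPLE_ID}[.]//")
        sed -e "s/^>/>${num}_/" $file >> ${SAMPLE_ID}.all_contigs.fna
    done
done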

cd $WORK_DIR

'''

with open('megahit_add_bin_nums.sh', mode='w') as file:
    file.write(my_code)

In [None]:
!chmod +x ./megahit_add_bin_nums.sh
!ls -l megahit_add_bin_nums.sh

In [None]:
!./megahit_add_bin_nums.sh

In [None]:
# Let's check to see if the re-naming worked, where all ids are 
# named according to their bin id "_" name.
# My concatenated bin file is called ERR2198611.all_contigs.fna
# Change this to one of your samples
# You should see that the ids all start with their bin_id now
!egrep '>' $work_dir/out_megahit/YOUR_FILE.all_contigs.fna | head

Looks great! Now we have all of our bins assigned, and we have all of our contigs in a single file.
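One way to sanity-check the concatenated file is to count how many contigs carry each bin prefix and compare that with the per-bin counts you saw earlier. This is a minimal sketch that assumes headers of the form `>001_contigname`; the function name `contigs_per_bin` and the example records are hypothetical.

```python
from collections import Counter

def contigs_per_bin(fasta_lines):
    """Count FASTA headers per bin, assuming headers look like '>001_contigname'."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # everything between '>' and the first '_' is the bin id
            bin_id = line[1:].split("_", 1)[0]
            counts[bin_id] += 1
    return counts

# made-up records for illustration
example = [">001_a", "ACGT", ">001_b", "GG", ">002_c", "TT"]
counts = contigs_per_bin(example)
print(dict(counts))
```

For a real file you could pass an open file handle, since iterating over a handle yields one line at a time.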

## Step 2: Binning contigs from your Metaspades Assembly

Rinse and repeat!

In this step, we will create species-level bins for the contigs that were created from your metaspades assembly.

In [None]:
# Create a script to run maxbin to bin metaspades contigs by species
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. maxbin runs on each of the fastq files in the trimmed and human filtered $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 75G of memory); we are also asking for 2 hours to run.
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=02:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-metaspades-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=75G                                   

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1="${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz"
PAIR2="${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz"

METASPADES_OUTDIR=${WORK_DIR}/out_spades
OUTDIR=${METASPADES_OUTDIR}

### create the outdir if it does not exist
### (-p is safe when several array tasks start at the same time)
mkdir -p "$OUTDIR"

### final contigs
CONTIGS="${METASPADES_DIR}/${SAMPLE_ID}/contigs.fasta"

apptainer run /contrib/singularity/shared/bhurwitz/maxbin2:2.2.7--hdbdd923_5.sif run_MaxBin.pl \
-thread 24 -contig ${CONTIGS} \
-reads ${PAIR1} \
-reads2 ${PAIR2} \
-out ${OUTDIR}/${SAMPLE_ID}

'''

with open('metaspades_bin_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# check the code was created
!cat metaspades_bin_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the metaspades_bin_parallel.sh script?
!ls

In [None]:
# Let's run the sbatch script, this should take ~1 hour to run
# Time for some coffee..
!sbatch ./metaspades_bin_parallel.sh

In [None]:
# Welcome back, let's see if the job is still running
!squeue --user=$netid

In [None]:
# Double check that you have bins for your contigs from metaspades.
# These bins are in files named like this: "ERR2198611.001.fasta"
!ls $work_dir/out_spades

In [None]:
# Check to see if you have your summary files
!ls $work_dir/out_spades/ERR*summary

In [None]:
# Now let's create the same script as above to add the bin ids
# to the contig names, and put them into a single fasta file per sample
my_code = '''#!/bin/bash

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

METASPADES_OUTDIR=${WORK_DIR}/out_spades

cd $METASPADES_OUTDIR

for i in {0..4}; do
    SAMPLE_ID=${names[$i]}
    echo ${SAMPLE_ID}
    # start from an empty file so that re-running the script does not append duplicates
    : > ${SAMPLE_ID}.all_contigs.fna
    for file in ${SAMPLE_ID}.*.fasta; do
        # pull the bin number (e.g. 001) out of the file name
        num=$(basename $file .fasta | sed "s/^${SAMPLE_ID}[.]//")
        sed -e "s/^>/>${num}_/" $file >> ${SAMPLE_ID}.all_contigs.fna
    done
done

cd $WORK_DIR

'''

with open('metaspades_add_bin_nums.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Change permissions and check to see you have the script
!chmod +x ./metaspades_add_bin_nums.sh
!ls -l metaspades_add_bin_nums.sh

In [None]:
# Run the script to add bin ids and create a single fasta
!./metaspades_add_bin_nums.sh

In [None]:
# Let's check to see if the re-naming worked, where all ids are 
# named according to their bin id "_" name.
# My concatenated bin file is called ERR2198611.all_contigs.fna
# Change this to one of your samples
# You should see that the ids all start with their bin_id now
!egrep '>' $work_dir/out_spades/YOUR_FILE.all_contigs.fna | head

You did it! We have now created bins for all of our contigs, and we have a single fasta file for each sample that we will run through the assembly quality control process. But that is for next time!

## Final Step
Copy your notebook to the current working directory

In [None]:
!cp ~/09_metag_binning.ipynb $work_dir