# Assembling reads into contigs

This notebook walks through the workflow for two assembly tools, metaspades and megahit. In this section we will assemble our reads into contigs: contiguous stretches of DNA sequence that each represent part of a genome. If you are lucky, you might even assemble an entire genome in a single contig! Most of the time, though, each contig covers only part of a genome, and the gaps between contigs prevent you from reconstructing the entire genome.

1. An introduction to [Metaspades](https://cab.spbu.ru/files/release3.12.0/manual.html)
2. An introduction to [Megahit](https://github.com/voutcn/megahit)


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/08_assembly"
%cd $work_dir

## Creating a config file
The scripts below need certain variables to be set. So that we don't have to edit the code in each script, we will use a config file that defines all of these variables. Whenever a script needs them, it will "source" the config file to set them.

### Data Management

We'll be creating two assemblies from the trimmed, human-removed reads. Let's set up the output directories ahead of time.

In [None]:
# -p avoids an error if the directories already exist from an earlier run
!mkdir -p $work_dir/out_spades
!mkdir -p $work_dir/out_megahit

In [None]:
# Create a config file with all of the variables you need.
# Notice that we will assemble the reads that have been trimmed and had human sequences removed.
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export XFILE_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/07_contam_removal" >> config.sh
!echo "export OUT_SPADES=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_spades" >> config.sh
!echo "export OUT_MEGA=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_megahit" >> config.sh

In [None]:
!cat config.sh
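
Because every script below sources `config.sh`, it can be worth checking the file before submitting any jobs. The following is a minimal sketch, not part of the original workflow: `parse_config` and `missing_variables` are hypothetical helpers that read `export NAME=value` lines and report which of the six variables written above are absent or empty.

```python
# Names of the variables that the cell above writes into config.sh
EXPECTED = ["NETID", "XFILE", "XFILE_DIR", "FASTQ_DIR", "OUT_SPADES", "OUT_MEGA"]


def parse_config(text):
    """Parse lines of the form `export NAME=value` into a dictionary."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines, comments, and anything else
        name, sep, value = line[len("export "):].partition("=")
        if sep:
            variables[name.strip()] = value.strip()
    return variables


def missing_variables(text, expected=EXPECTED):
    """Return the expected variable names that are absent or empty."""
    found = parse_config(text)
    return [name for name in expected if not found.get(name)]


# Hypothetical, deliberately incomplete config used only for illustration
example = "export NETID=MY_NETID\nexport XFILE=MY_XFILE\n"
print(missing_variables(example))
```

In the notebook you could run `missing_variables(open("config.sh").read())`; an empty list would suggest that all six variables are defined.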

## Step 1: Running Metaspades to create contigs

You will assemble your reads with a program called spades, which includes the metaspades.py program for assembling metagenomes made up of multiple organisms.

It's important to note that this assembler is memory intensive, and large files take a lot of resources and time. A common error with large files is running out of memory before the job completes on the HPC. If that happens, we can modify our script to request more memory.

Puma can have 94 CPUs @ 5gb/CPU <br>
Ocelote can have 28 CPUs @ 6gb/CPU

This [HPC documentation](https://public.confluence.arizona.edu/display/UAHPC/Running+Jobs+with+SLURM) is handy to have as you edit your scripts and use different HPCs within UA.
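
The total memory a job receives is the number of CPUs multiplied by the memory per CPU, so it helps to do that arithmetic before choosing thread and memory limits for an assembler. Here is a small sketch of the calculation; the helper names `total_memory_gb` and `fits_allocation` are made up for illustration, and the numbers are the Puma and Ocelote figures listed above plus the 28 CPUs at 5 GB used in this notebook's scripts.

```python
def total_memory_gb(cpus, gb_per_cpu):
    """Total memory for a job that requests `cpus` CPUs at `gb_per_cpu` GB each."""
    return cpus * gb_per_cpu


def fits_allocation(tool_threads, tool_memory_gb, cpus, gb_per_cpu):
    """True if a tool's thread and memory limits stay within the request."""
    return (tool_threads <= cpus
            and tool_memory_gb <= total_memory_gb(cpus, gb_per_cpu))


print(total_memory_gb(28, 5))   # the request used in this notebook's scripts
print(total_memory_gb(94, 5))   # Puma figures from the list above
print(total_memory_gb(28, 6))   # Ocelote figures from the list above

# A tool told to use 47 threads and 256 GB would not fit in 28 CPUs x 5 GB
print(fits_allocation(47, 256, cpus=28, gb_per_cpu=5))
print(fits_allocation(28, 140, cpus=28, gb_per_cpu=5))
```

If the tool's limits exceed the request, either raise `--cpus-per-task` or `--mem-per-cpu` in the SLURM header or lower the tool's own settings.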

In [None]:
# Create a script to run metaspades
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. metaspades runs on each pair of fastq files in $FASTQ_DIR (trimmed, human removed)
# 3. The results will be written into our $OUT_SPADES directory
# 4. Notice that we are asking for a lot more resources (28 cores and 5 GB of memory per core) and more time (48 hours)
my_code = '''#!/bin/bash
#SBATCH --output=Job-spades-%a.out
#SBATCH --account=bh_class
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-4

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq*
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq*

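# Set --threads and --memory explicitly so spades stays within the SLURM request:
# 28 CPUs x 5 GB per CPU = 140 GB. Asking spades for more than was allocated
# oversubscribes the CPUs and risks the job being killed for exceeding memory.
apptainer run /contrib/singularity/shared/bhurwitz/spades:3.15.5--h95f258a_1.sif metaspades.py \
   -o ${OUT_SPADES}/${SAMPLE_ID} \
   --pe1-1 ${PAIR1} \
   --pe1-2 ${PAIR2} \
   --threads ${SLURM_CPUS_PER_TASK} \
   --memory 140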
'''

with open('run_metaspades.sh', mode='w') as file:
    file.write(my_code)
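
The script relies on `--array=0-4` matching the number of sample IDs in your xfile: each array task uses its `SLURM_ARRAY_TASK_ID` as a zero-based index into the `names` array. The sketch below mirrors that indexing in Python so you can work out the right range for your own sample list. The function names and sample IDs are hypothetical and only illustrate the mapping.

```python
def read_sample_ids(text):
    """Split on whitespace, as bash does for names=($(cat file))."""
    return text.split()


def array_spec(sample_ids):
    """Return the zero-based --array range that covers every sample."""
    if not sample_ids:
        raise ValueError("the sample list is empty")
    return f"0-{len(sample_ids) - 1}"


def sample_for_task(sample_ids, task_id):
    """Return the sample that a given array task would process."""
    return sample_ids[task_id]


# Hypothetical sample IDs standing in for the contents of your xfile
xfile_text = "SAMPLE_A\nSAMPLE_B\nSAMPLE_C\nSAMPLE_D\nSAMPLE_E\n"
samples = read_sample_ids(xfile_text)

print(array_spec(samples))
print(sample_for_task(samples, 0))
print(sample_for_task(samples, 4))
```

If your xfile holds a different number of samples, the `#SBATCH --array` line would need to change accordingly; otherwise some tasks would receive an empty sample ID or some samples would never be assembled.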

In [None]:
# You should be in your working directory when you run this script.
# Do you see your config.sh file and the run_metaspades.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run metaspades on each of the FASTQ files
# Remember that this may take a while to run, so take a break, 
# and get a coffee.
!sbatch ./run_metaspades.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or while they are running) you can check the progress
# and also look for errors in the *.out files
# For example, you can look at Job-spades-0.out
!ls
!cat Job-spades-0.out

In [None]:
# Do you see a contigs.fasta file in each sample directory?
!ls $work_dir/out_spades/*
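
Rather than scanning the `ls` output by eye, you could check each sample directory for its contigs file. This is a rough sketch under the assumption that each sample has its own subdirectory named after its sample ID, as in the script above; `samples_missing_output` is a hypothetical helper, and the demo uses a temporary directory with made-up sample IDs instead of real results.

```python
import tempfile
from pathlib import Path


def samples_missing_output(out_dir, sample_ids, filename):
    """Return the sample IDs whose output directory lacks `filename`."""
    out_dir = Path(out_dir)
    return [s for s in sample_ids if not (out_dir / s / filename).is_file()]


# Demo with a throwaway directory and hypothetical sample IDs
with tempfile.TemporaryDirectory() as tmp:
    finished = Path(tmp) / "SAMPLE_A"
    finished.mkdir()
    (finished / "contigs.fasta").write_text(">contig_1\nACGT\n")
    print(samples_missing_output(tmp, ["SAMPLE_A", "SAMPLE_B"], "contigs.fasta"))
```

In the notebook, the first argument would be `work_dir + "/out_spades"` and the sample list would come from your xfile. Any sample it reports is worth investigating in the matching `Job-spades-*.out` file.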

Great job! You should now have assemblies from metaspades. You can kick off the assembly below with megahit without disrupting your work from above. Go for it!

## Step 2: Running Megahit to create contigs

We'll now repeat the process using megahit, a different algorithm for assembling your contigs. This assembler runs a lot faster and uses fewer resources, but it isn't as accurate as spades. If you find spades crashing with out-of-memory errors, megahit should still be able to assemble the larger read files.

Another option, which we'll explore in later notebooks, is removing reads based on a reference genome database.

In [None]:
# Create a script to run megahit
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. megahit runs on each pair of fastq files in $FASTQ_DIR (trimmed, human removed)
# 3. The results will be written into our $OUT_MEGA directory
# 4. Notice that we are asking for a lot more resources (28 cores and 5 GB of memory per core) and more time (24 hours)
my_code = '''#!/bin/bash
#SBATCH --output=Job-megahit-%a.out
#SBATCH --account=bh_class
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-4

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq*
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq*

# -t keeps megahit to the CPUs that SLURM allocated to this task
apptainer run /contrib/singularity/shared/bhurwitz/megahit:1.2.9--h5b5514e_3.sif megahit \
   -1 ${PAIR1} \
   -2 ${PAIR2} \
   -t ${SLURM_CPUS_PER_TASK} \
   -o ${OUT_MEGA}/${SAMPLE_ID}
'''

with open('run_megahit.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# You should be in your working directory when you run this script.
# Do you see your config.sh file and the run_megahit.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run megahit on each of the FASTQ files
# Remember that this may take a while to run, so take a break, 
# and get a coffee.
!sbatch ./run_megahit.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or while they are running) you can check the progress
# and also look for errors in the *.out files
# For example, you can look at Job-megahit-0.out
!ls
!cat Job-megahit-0.out

In [None]:
# Do you see a final.contigs.fa file in each sample directory?
!ls $work_dir/out_megahit/*
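
With both assemblies finished, a quick way to compare them is to count the contigs and bases in each assembler's contigs FASTA file. The following is a simple sketch rather than a full assembly evaluation: `contig_lengths` and `summarize` are hypothetical helpers, and the example sequences are invented.

```python
def contig_lengths(fasta_text):
    """Return the sequence length of every record in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, or None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(fasta_text):
    """Count contigs, total bases, and the longest contig."""
    lengths = contig_lengths(fasta_text)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
    }


# Invented two-contig example; the first sequence wraps over two lines
example = ">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n"
print(summarize(example))  # {'contigs': 2, 'total_bases': 15, 'longest': 12}
```

To use it on real output, read a contigs file with `open(path).read()` and pass the text to `summarize`. Running it on the spades and megahit results for the same sample gives a rough side-by-side view.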

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/08_assembly.ipynb $work_dir