# Assembling reads into contigs

This notebook walks through the workflow for the Megahit assembly tool. In this section we are going to assemble our reads into contigs. Contigs are contiguous stretches of DNA sequence that represent parts of a genome. If you are lucky, you might even be able to assemble an entire genome in a single contig! Most of the time, though, each contig covers only part of a genome, and the gaps between contigs prevent you from assembling the entire genome.

Check out this introduction to [Megahit](https://github.com/voutcn/megahit).

Assembling the reads into contigs is the first step toward metagenome-assembled genomes (MAGs): the contigs are later grouped (binned) into draft genomes. Note that MAGs are different from an assembled genome that is created from the reads of a single organism grown in culture (an isolate).

-----------

Sections:

1. Run Megahit to assemble the reads into contigs, the starting point for metagenome-assembled genomes (MAGs).

-----------



## Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/08_assembly"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our fastq files with human contamination removed.
fastq_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/07_contam_removal"
xfile_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/05_getting_data"
out_dir = work_dir + "/out_megahit"
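The directory variables above are built by string concatenation. As a minimal sketch of the same layout, the helper below uses `pathlib` and reports which input directories are missing before any job is submitted. The function names `build_class_paths` and `missing_inputs` are hypothetical, and the base path is simply the one used in this notebook.

```python
from pathlib import Path


def build_class_paths(netid, base="/xdisk/bhurwitz/bh_class"):
    """Return the directories used in this notebook as Path objects."""
    assignments = Path(base) / netid / "assignments"
    work = assignments / "08_assembly"
    return {
        "work_dir": work,
        "fastq_dir": assignments / "07_contam_removal",
        "xfile_dir": assignments / "05_getting_data",
        "out_dir": work / "out_megahit",
    }


def missing_inputs(paths):
    """List the input directories that do not exist on this machine."""
    return [name for name in ("fastq_dir", "xfile_dir") if not paths[name].is_dir()]


paths = build_class_paths("MY_NETID")  # placeholder netid, as in the cell above
for name, path in paths.items():
    print(f"{name}: {path}")
print("missing inputs:", missing_inputs(paths))
```

If `missing_inputs` returns a non-empty list, the earlier assignments that produce those directories probably have not been run under this netid.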

## Creating a config file
The script below executes code that requires certain variables to be set. So that we don't need to edit the code in the script, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in the script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
# notice that we will assemble the reads that are both trimmed and have had human reads removed.
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export XFILE_DIR=$xfile_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
!echo "export OUT_MEGA=$out_dir" >> config.sh
!echo "export MEGAHIT=/contrib/singularity/shared/bhurwitz/megahit:1.2.9--h5b5514e_3.sif" >> config.sh
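The `echo` lines above write one `export` statement per variable. As a hedged alternative, the sketch below renders the same kind of file from a Python dictionary and quotes each value with `shlex.quote`, so that a value containing spaces would still be read correctly when the file is sourced. The helper name `render_config` and the path with a space are made up for illustration.

```python
import shlex


def render_config(variables):
    """Render a dict as 'export NAME=value' lines, quoting each value for bash."""
    lines = [f"export {name}={shlex.quote(str(value))}" for name, value in variables.items()]
    return "\n".join(lines) + "\n"


example = {
    "NETID": "MY_NETID",
    "XFILE": "MY_XFILE",
    "OUT_MEGA": "/path/with space/out_megahit",  # hypothetical path, only to show quoting
}
print(render_config(example), end="")
```

To use it, you could write the returned text to `config.sh` with `open('config.sh', mode='w')`, in the same way the job scripts are written later in this notebook.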

### Data Management

We'll be creating an assembly based on the trimmed, human-removed reads. Let's set up the output directory ahead of time.

In [None]:
!mkdir -p $work_dir/out_megahit

## Step 1: Running Megahit to create contigs

Let's create an assembly of all of the genomes in your microbiomes using Megahit. This assembler is fast and uses fewer resources than other metagenome assemblers.

In [None]:
# Create a script to run megahit
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command
# 2. megahit runs on each of the fastq files in the trimmed $FASTQ_DIR
# 3. The results will be written into our $OUT_MEGA directory
# 4. Notice that we are asking for a lot more resources (28 cores and 5 GB of memory per core) and more time (24 hours)
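# The r prefix makes this a raw string, so the backslash line continuations are written to the script unchanged
my_code = r'''#!/bin/bash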
#SBATCH --output=08A_assembly-%a.out
#SBATCH --account=bh_class
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-7

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq.gz
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq.gz

# -t limits megahit to the cores requested with --cpus-per-task
apptainer run "${MEGAHIT}" megahit \
   -1 "${PAIR1}" \
   -2 "${PAIR2}" \
   -t "${SLURM_CPUS_PER_TASK}" \
   -o "${OUT_MEGA}/${SAMPLE_ID}"
'''

with open('08A_assembly.sh', mode='w') as file:
    file.write(my_code)
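The job script relies on `SLURM_ARRAY_TASK_ID` to pick one sample ID from the list in the xfile. A small Python sketch of that lookup is shown below, with made-up sample IDs and a hypothetical directory. It also derives the `--array` range from the number of IDs. This matters because bash returns an empty string for an index past the end of an array instead of raising an error, so `--array=0-7` only makes sense if the xfile lists at least eight samples.

```python
from pathlib import Path


def sample_for_task(names, task_id, fastq_dir):
    """Mirror the bash lookup: pick one sample ID and build its paired FASTQ paths."""
    if not 0 <= task_id < len(names):
        raise IndexError(f"task {task_id} has no sample; only {len(names)} IDs are listed")
    sample_id = names[task_id]
    pair1 = Path(fastq_dir) / f"{sample_id}_1.fastq.gz"
    pair2 = Path(fastq_dir) / f"{sample_id}_2.fastq.gz"
    return sample_id, pair1, pair2


def array_range(names):
    """Return the --array value that covers every listed sample exactly once."""
    if not names:
        raise ValueError("the sample list is empty")
    return f"0-{len(names) - 1}"


names = ["sampleA", "sampleB", "sampleC"]  # made-up IDs standing in for the xfile contents
print("--array=" + array_range(names))
for task_id in range(len(names)):
    sample_id, pair1, pair2 = sample_for_task(names, task_id, "/hypothetical/fastq_dir")
    print(task_id, sample_id, pair1.name, pair2.name)
```

Unlike the bash version, this sketch fails loudly when a task index has no matching sample, which is the situation to watch for if your xfile has fewer IDs than the array range.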

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the 08A_assembly.sh script?
!pwd
!ls

In [None]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 08A_assembly: first job - no dependencies
job1=$(sbatch 08A_assembly.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

'''

with open('08_launch_pipeline.sh', mode='w') as file:
    file.write(my_code)
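The launcher script uses `sed` to strip the text `Submitted batch job ` from the `sbatch` output and keep only the job ID. The sketch below does the same extraction in Python with a regular expression. The function name `parse_job_id` and the example job number are made up. A job ID captured this way is what a later pipeline step would typically pass to `sbatch --dependency=afterok:<jobid>`, although this notebook submits only one job.

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


print(parse_job_id("Submitted batch job 1234567"))  # example message with a made-up ID
```

Raising an error when the pattern is missing makes a failed submission visible instead of passing an empty job ID along.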

In [None]:
# Make the pipeline script executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./08_launch_pipeline.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Note that this will take some time to run, so go get a coffee!
!squeue --user=$netid
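Once the jobs have finished, each sample's output directory should contain a `final.contigs.fa` file, which is the default name Megahit gives its contig FASTA. As a rough way to summarize an assembly, the sketch below counts contigs, totals their length, and computes the N50 from FASTA-formatted lines. The helper names and the toy sequences are made up for illustration.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Length of the contig at which the running total, longest first, reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


toy_fasta = """>contig_1
ACGTACGTAC
>contig_2
ACGTAC
>contig_3
ACGT
""".splitlines()

lengths = list(read_fasta_lengths(toy_fasta))
print("contigs:", len(lengths), "total bp:", sum(lengths), "N50:", n50(lengths))
```

For a real assembly, you could pass an open file handle for one sample's `final.contigs.fa` to `read_fasta_lengths` in place of the toy lines.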

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/be487-fall-2024/assignments/08_assembly/hw08_assembly.ipynb $work_dir