# Assembling reads into contigs (fragments of a genome)

This notebook walks through the workflow for the Unicycler assembly tool. In this section we are going to assemble our reads into contigs. Contigs are contiguous stretches of DNA sequence that each represent part of a genome. If you are lucky, you might even be able to assemble an entire genome into a single contig! Most of the time, though, each contig covers only part of a genome, and the gaps between contigs prevent you from reconstructing the entire genome.

-----------

Sections:

1. Run Unicycler to create an assembled genome.

-----------



## Getting Started

Before we get started you will need to set several variables that we will use throughout this notebook. 

In [None]:
# set the variables for your user id (netid) and the name of the file that lists your accessions
netid = "MY_ID"
accessions = "MY_ACCESSIONS"

In [None]:
# Go into the working directory
work_dir = "/my_dir_path/" + netid + "/03_assembly"
%cd $work_dir

In [None]:
# Set the fastq directory. This is where we have our fastq files with human contamination removed.
fastq_dir = "/my_dir_path/" + netid + "/02_taxonomy"
data_dir = "/my_dir_path/" + netid + "/00_getting_data"
out_dir = work_dir + "/out_unicycler"
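Since every later step depends on these paths, it can help to confirm that the input directories exist before going further. The following is a minimal sketch, not part of the original workflow; `missing_dirs` is a hypothetical helper name and the paths in the example are the placeholders used above.

```python
from pathlib import Path

def missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [p for p in paths if not Path(p).is_dir()]

# Placeholder paths for illustration; in the notebook you would pass
# [fastq_dir, data_dir] instead.
print(missing_dirs(["/my_dir_path/MY_ID/02_taxonomy",
                    "/my_dir_path/MY_ID/00_getting_data"]))
```

An empty list means every directory was found; anything printed is a path to fix before submitting jobs.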

## Creating a config file
The script below runs code that requires certain variables to be set. So that we don't need to edit the code in the script, we are going to use a config file that defines all of these variables for us. Then, when we want to use these variables in the script, we will "source" the config file to set them.

In [None]:
# create a config file with all of the variables you need
# notice that we will assemble the reads that are both trimmed and have human removed.
!echo "export ID=$netid" > config.sh
!echo "export ACCESSIONS=$accessions" >> config.sh
!echo "export DATA_DIR=$data_dir" >> config.sh
!echo "export FASTQ_DIR=$fastq_dir" >> config.sh
!echo "export OUT_UNI=$out_dir" >> config.sh
!echo "export UNICYCLER=/path_to_containers/unicycler:0.5.0--py39h4e691d4_3.sif" >> config.sh
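The `echo` lines above write each value unquoted, which works as long as no value contains spaces or shell metacharacters. As a hedged alternative, the same file contents could be built in Python with each value quoted for the shell. This is only a sketch: `render_config` is a hypothetical helper, and the dictionary below uses the placeholder values from this notebook.

```python
import shlex

def render_config(variables):
    """Build the text of a config.sh file from a dict of NAME -> value."""
    lines = [
        f"export {name}={shlex.quote(str(value))}"
        for name, value in variables.items()
    ]
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only.
example = {
    "ID": "MY_ID",
    "OUT_UNI": "/my_dir_path/MY_ID/03_assembly/out_unicycler",
}
print(render_config(example), end="")
```

Writing the returned text to `config.sh` would produce a file that can be sourced in the same way as the one created with `echo`.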

### Data Management

We'll be creating an assembly based on the trimmed, human-removed reads. Let's set up the output directory ahead of time.

In [None]:
!mkdir -p $out_dir

## Step 1: Running Unicycler to create contigs

Let's assemble the reads from each of your samples into contigs using Unicycler. Given only short paired-end reads, as here, Unicycler acts as an optimiser around the SPAdes assembler.

In [3]:
# Create a script to run Unicycler
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command
# 2. Unicycler runs on each pair of fastq files in the trimmed/human-screened $FASTQ_DIR
# 3. The results will be written into our $OUT_UNI directory
# 4. Notice that we are asking for a lot more resources (28 cores and 5G of memory per core)

my_code = '''#!/bin/bash
#SBATCH --output=03_assembly-%a.out
#SBATCH --account=your_account
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=28
#SBATCH --mem-per-cpu=5gb
#SBATCH --array=0-15

pwd; hostname; date

source ./config.sh
names=($(cat $DATA_DIR/$ACCESSIONS))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

NO_HUMAN=${FASTQ_DIR}/out_reads_taxonomy/${SAMPLE_ID}/nonhuman_reads
PAIR1=${NO_HUMAN}/r1.fq.gz
PAIR2=${NO_HUMAN}/r2.fq.gz

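# Pass --threads so Unicycler uses all of the cores requested with --cpus-per-task;
# without it, some of the cores we reserved may sit idle and the job runs inefficiently.
apptainer run ${UNICYCLER} unicycler -1 ${PAIR1} -2 ${PAIR2} -o ${OUT_UNI}/${SAMPLE_ID} --threads ${SLURM_CPUS_PER_TASK}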

'''

with open('03_assembly.sh', mode='w') as file:
    file.write(my_code)
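The `--array=0-15` line in the script assumes that the accessions file lists exactly 16 samples, one array task per sample. A small sketch of how the range could be derived from the file contents is shown below. `array_spec` is a hypothetical helper and the accession IDs are made up for illustration; the text is split on whitespace because the bash array `names=($(cat ...))` splits the file the same way.

```python
def array_spec(accession_text):
    """Return a SLURM --array range covering every accession in the text."""
    names = accession_text.split()  # whitespace-separated, like the bash array
    if not names:
        raise ValueError("no accessions found")
    return f"0-{len(names) - 1}"

# Made-up accession IDs for illustration only.
print(array_spec("SRR0000001\nSRR0000002\nSRR0000003\n"))  # prints 0-2
```

In the notebook, the text would come from reading the accessions file in `data_dir`, and the result would be compared against the `--array` value before submitting.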

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the 03_assembly.sh script?
!pwd
!ls

In [5]:
# Let's create the launcher script to kick off our pipeline.

my_code = '''#! /bin/bash

# 03_assembly: first job - no dependencies
job1=$(sbatch 03_assembly.sh)
jid1=$(echo $job1 | sed 's/^Submitted batch job //')
echo $jid1

'''

with open('03_launch_pipeline-lsf.sh', mode='w') as file:
    file.write(my_code)
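The launcher extracts the job ID by stripping the `Submitted batch job ` prefix from the `sbatch` output with `sed`. For readers who prefer to see that parsing step in Python, here is a minimal sketch. `parse_job_id` is a hypothetical helper and the job number in the example is invented.

```python
import re

def parse_job_id(sbatch_output):
    """Extract the numeric job ID from a line of sbatch output."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)

# Invented job number for illustration only.
print(parse_job_id("Submitted batch job 123456"))  # prints 123456
```

Unlike the `sed` version, this sketch raises an error when the output does not look like a successful submission, rather than echoing the unexpected text back.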

In [None]:
# Make the pipeline script executable
!chmod +x *.sh

In [None]:
# now let's run it!
!./03_launch_pipeline-lsf.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
# Note that this will take some time to run, so go get a coffee!
!squeue --user=$netid
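Once the jobs finish, Unicycler writes the assembled contigs for each sample to an `assembly.fasta` file in that sample's output directory. A quick way to get a feel for a result is to count the contigs and the total number of bases. The sketch below assumes standard FASTA formatting; `summarize_fasta` is a hypothetical helper and the sequences are toy data.

```python
def summarize_fasta(fasta_text):
    """Return (number of contigs, total bases) for FASTA-formatted text."""
    n_contigs = 0
    total_bases = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            n_contigs += 1            # each header line starts a new contig
        elif line:
            total_bases += len(line)  # sequence lines contribute their length
    return n_contigs, total_bases

# Toy data for illustration only.
example = ">1\nACGTACGT\n>2\nACGT\n"
print(summarize_fasta(example))  # prints (2, 12)
```

In the notebook, the text would come from reading `assembly.fasta` inside a sample's folder under `out_dir`. Fewer, longer contigs generally indicate a more complete assembly.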

## Done