# Assembly

***

## Intro

The general process for assembling trimmed and filtered Illumina HiSeq data is detailed below. You will need to make some decisions of your own along the way, depending on your study question and on time or computational constraints.

Assembler

- Current leading options include **SPAdes** and **IDBA-UD**
- The examples that follow use **metaSPAdes** (i.e. **SPAdes** with the `--meta` flag enabled).
- However, **SPAdes** can have very large computational requirements, depending on the data set. For example, the Handley group's assemblies of estuary samples required several hundred GB of RAM and multiple days of NeSI run time *per sample*, although this varies considerably with sample type. If you need a faster option that also uses less RAM, **IDBA-UD** is a good alternative.
- Note: if running **IDBA-UD**, it requires a single interleaved input file per sample (rather than separate R1 and R2 read files).
- From Vollmers *et al*. 2017 (https://doi.org/10.1371/journal.pone.0169662): 
  - "*If micro diversity is not a major issue, and the primary research goal is to bin and reconstruct representative bacterial genomes from a given environment*, ***metaSPAdes*** *should clearly be the assembler of choice... If micro diversity is however an issue, or if the degree of captured diversity is far more important than contig lengths, then* ***IDBA-UD*** *or* ***Megahit*** *should be preferred.*"
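
As a rough illustration of the interleaving note above, the sketch below merges paired R1/R2 FASTQ files into a single interleaved FASTA file using only `paste` and `awk`. The function name `interleave_fastq_to_fasta` and the file names in the usage comment are hypothetical. It assumes uncompressed four-line FASTQ records, the same read order in both files, and that **IDBA-UD** is given FASTA input. The IDBA package also ships its own `fq2fa` helper for this conversion, so check its documentation before relying on either approach.

```shell
# Interleave paired FASTQ files into one FASTA file (each R1 read followed by its R2 mate).
# Assumes: uncompressed input, 4 lines per record, identical read order in both files.
interleave_fastq_to_fasta() {
    # $1 = R1 FASTQ, $2 = R2 FASTQ, $3 = output FASTA
    paste "$1" "$2" | awk -F'\t' '
        NR % 4 == 1 { h1 = substr($1, 2); h2 = substr($2, 2) }        # headers without the leading "@"
        NR % 4 == 2 { printf(">%s\n%s\n>%s\n%s\n", h1, $1, h2, $2) }  # sequence lines
    ' > "$3"
}

# Hypothetical usage:
# interleave_fastq_to_fasta S1_R1_hostFilt.fastq S1_R2_hostFilt.fastq S1_interleaved.fasta
```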




## Assembly via metaSPAdes

k-mer settings (`-k` parameter)

- If you have the time and resources, this parameter is worth experimenting with. You can run several assemblies with different k-mer settings, assess the quality of each, and choose the best option for your data set.
  - Note: when experimenting with alternative k-mer sizes, each k must be an odd number, should not be *too* short, and cannot be longer than the read length (after trimming etc.).
- Alternatively, you can omit this parameter altogether, in which case **SPAdes** automatically selects what it deems to be appropriate k-mer sizes (the auto setting).
  - Note: in our experience, the auto setting can require roughly double the RAM.
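
The k-mer rules above can be checked before a job is submitted. The following is a minimal sketch: `check_kmers` is a hypothetical helper, and the upper bound of 128 reflects our reading of the **SPAdes** manual, so treat it as an assumption and confirm it for your version.

```shell
# Check a comma-separated k-mer list before passing it to spades.py via -k.
# Rules applied: each k must be odd, not longer than the read length,
# and below 128 (assumed SPAdes upper limit; confirm for your version).
check_kmers() {
    # $1 = k-mer list such as 21,33,55 ; $2 = read length after trimming
    for k in $(printf '%s' "$1" | tr ',' ' '); do
        if [ $((k % 2)) -eq 0 ]; then
            echo "k=$k is even" >&2; return 1
        fi
        if [ "$k" -gt "$2" ] || [ "$k" -ge 128 ]; then
            echo "k=$k is too long" >&2; return 1
        fi
    done
    return 0
}

# Hypothetical usage inside the assembly script (other options unchanged):
# check_kmers 21,33,55,77 150 && echo "k-mer list OK, pass it as: -k 21,33,55,77"
```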

Single- versus co-assemblies

- Depending on your study question and available time and computational resources, you may wish to do single assemblies (i.e. a separate assembly *per sample*) or some form of co-assembly: either a full co-assembly (all samples together) or mini co-assemblies (e.g. one assembly of the samples from group A and a separate assembly of the samples from group B).
- Note:
  - Individual assemblies per sample may result in better assemblies overall
  - Alternatively, co-assemblies may be better at assembling rarer taxa that occur in > 1 sample
  - If following up with read mapping (e.g. mapping WTS reads back to assembled WGS contigs), anything other than a single co-assembly at this stage will require a subsequent step to dereplicate the assembled contigs (or binned genomes) across the multiple assemblies. (N.b. you can also combine individual assemblies and co-assemblies, and dereplicate across all of them.)
- For co-assemblies, input files can be concatenated together via `cat`, e.g: 
  - `cat sample1_R1.fastq.gz sample2_R1.fastq.gz sample3_R1.fastq.gz sample4_R1.fastq.gz > for_assembly_A_R1.fastq.gz`
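
For mini co-assemblies, the same `cat` approach can be applied per group. The sketch below is one way to do this. The helper `concat_group`, the group names and the sample-to-group assignments are all hypothetical, and the file naming follows the `S1_R1_hostFilt.fastq` pattern used in the examples in this document. Listing the samples once and reusing that order for R1 and R2 keeps the paired files in sync.

```shell
# Concatenate the R1, R2 and single-read files of selected samples into one set per group.
concat_group() {
    # $1 = group name, $2 = input directory, $3 = output directory, remaining args = sample names
    group="$1"; in_dir="$2"; out_dir="$3"
    shift 3
    mkdir -p "$out_dir"
    for r in R1 R2 single; do
        : > "$out_dir/${group}_${r}.fastq"          # start from an empty output file
        for s in "$@"; do                           # same sample order for every read type
            cat "$in_dir/${s}_${r}_hostFilt.fastq" >> "$out_dir/${group}_${r}.fastq"
        done
    done
}

# Hypothetical usage (group membership is illustrative only):
# concat_group groupA 1b.QC_Filtered_host 2.assembly/0.spades_coassembly_infiles S1 S2 S3
# concat_group groupB 1b.QC_Filtered_host 2.assembly/0.spades_coassembly_infiles S4 S5 S6
```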
  
Runtimes and RAM requirements

- Unfortunately, these are difficult to predict for new sample types and/or data sets, so you may need a few attempts, gradually increasing the allocated time or RAM until the job completes.
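
One way to make this trial and error less blind is to check what a finished (or failed) job actually used. Slurm's `sacct` command can report the elapsed time and peak memory (`MaxRSS`) of a job. The sketch below adds a small hypothetical helper, `maxrss_to_gb`, that converts a `MaxRSS` value to GB so it can be compared with the `--mem` request; the job ID in the usage comment is a placeholder.

```shell
# Convert a Slurm MaxRSS value (e.g. 83886080K, 81920M or 80G) to GB,
# so it can be compared directly with the --mem request and the spades.py -m value.
maxrss_to_gb() {
    printf '%s\n' "$1" | awk '
        /K$/ { printf("%.1f\n", $0 / 1024 / 1024); next }
        /M$/ { printf("%.1f\n", $0 / 1024); next }
        /G$/ { printf("%.1f\n", $0 + 0); next }
    '
}

# Hypothetical usage after a job has finished (1234567 is a placeholder job ID):
# sacct -j 1234567 --format=JobID,JobName,Elapsed,MaxRSS,State
# maxrss_to_gb 83886080K
```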

*N.b. The examples below use the auto k-mer setting (i.e. they omit the `-k` option)*

## Example: Co-assembly

Concatenate reads for assembly:

In [None]:
cd /working/dir

mkdir -p 2.assembly/0.spades_coassembly_infiles/

cat 1b.QC_Filtered_host/*_R1_hostFilt.fastq > 2.assembly/0.spades_coassembly_infiles/filtered_reads_R1.fastq
cat 1b.QC_Filtered_host/*_R2_hostFilt.fastq > 2.assembly/0.spades_coassembly_infiles/filtered_reads_R2.fastq
cat 1b.QC_Filtered_host/*_single_hostFilt.fastq > 2.assembly/0.spades_coassembly_infiles/filtered_reads_single.fastq


Run co-assembly

*NOTE: when changing the memory allocation, make sure to change it in both the SBATCH header and the actual spades call (the `-m` flag)*

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_2.co-assembly_spades
#SBATCH --time 12:00:00
#SBATCH --mem=80GB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e wgs_2.co-assembly_spades.err
#SBATCH -o wgs_2.co-assembly_spades.out

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Change to working directory
cd /working/dir

# Make output directory 
mkdir -p 2.assembly/1.spades_assembly_coassembly/

# Run metaSPAdes
srun spades.py --meta -t 16 -m 80 \
-1 2.assembly/0.spades_coassembly_infiles/filtered_reads_R1.fastq \
-2 2.assembly/0.spades_coassembly_infiles/filtered_reads_R2.fastq \
-s 2.assembly/0.spades_coassembly_infiles/filtered_reads_single.fastq \
-o 2.assembly/1.spades_assembly_coassembly/
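
To avoid having to edit the memory value in two places, the `-m` and `-t` values can be derived from the Slurm allocation itself. This is a sketch only: `spades_mem_gb` and `spades_threads` are hypothetical helpers, and the fallback values (80 GB, 16 threads) are placeholders for running outside Slurm. It relies on Slurm setting `SLURM_MEM_PER_NODE` (in MB) when `--mem` is given and `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is given.

```shell
# Derive the spades.py -m (GB) and -t values from the Slurm allocation,
# so that only the #SBATCH header needs editing.
spades_mem_gb() {
    if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
        echo $((SLURM_MEM_PER_NODE / 1024))   # MB -> GB, rounded down
    else
        echo 80                               # placeholder fallback outside Slurm
    fi
}

spades_threads() {
    echo "${SLURM_CPUS_PER_TASK:-16}"         # placeholder fallback outside Slurm
}

# Hypothetical usage in the batch script (input/output options unchanged):
# srun spades.py --meta -t "$(spades_threads)" -m "$(spades_mem_gb)" \
#   -1 2.assembly/0.spades_coassembly_infiles/filtered_reads_R1.fastq \
#   -2 2.assembly/0.spades_coassembly_infiles/filtered_reads_R2.fastq \
#   -o 2.assembly/1.spades_assembly_coassembly/
```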


## Example: Individual assemblies 

Running individual assemblies as slurm array

*NOTE: all slurm array examples in this doc are based on nine samples numbered from 1 to 9. Modify the `#SBATCH --array=` header for the appropriate number of samples in your dataset* 

*NOTE: when changing the memory allocation, make sure to change it in both the SBATCH header and the actual spades call (the `-m` flag)*

In [None]:
#!/bin/bash -e
#SBATCH -A your_project_account
#SBATCH -J wgs_2.assembly_spades
#SBATCH --time 12:00:00
#SBATCH --mem=80GB
#SBATCH --array=1-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -e wgs_2.assembly_spades_%a.err
#SBATCH -o wgs_2.assembly_spades_%a.out

# Load module(s)
module purge
module load SPAdes/3.13.1-gimkl-2018b

# Change to working directory
cd /working/dir

# Make output directory 
mkdir -p 2.assembly/

# Run metaSPAdes
srun spades.py --meta -t 16 -m 80 \
-1 1b.QC_Filtered_host/S${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
-2 1b.QC_Filtered_host/S${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq \
-s 1b.QC_Filtered_host/S${SLURM_ARRAY_TASK_ID}_single_hostFilt.fastq \
-o 2.assembly/1.spades_assembly_S${SLURM_ARRAY_TASK_ID}/
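
The array example above relies on samples being named `S1` to `S9`. If your sample names do not follow a simple numeric pattern, a common approach is to look the name up from a list file using the array task ID. The sketch below assumes a hypothetical file `samples.txt` with one sample name per line; `sample_for_task` is likewise a hypothetical helper.

```shell
# Look up the sample name for an array task from a plain-text list
# (one sample name per line; line N corresponds to array task N).
sample_for_task() {
    # $1 = sample list file, $2 = array task ID (1-based)
    sed -n "${2}p" "$1"
}

# Hypothetical usage in the array script (samples.txt is an assumed file name):
# sample=$(sample_for_task samples.txt "${SLURM_ARRAY_TASK_ID}")
# srun spades.py --meta -t 16 -m 80 \
#   -1 1b.QC_Filtered_host/${sample}_R1_hostFilt.fastq \
#   -2 1b.QC_Filtered_host/${sample}_R2_hostFilt.fastq \
#   -s 1b.QC_Filtered_host/${sample}_single_hostFilt.fastq \
#   -o 2.assembly/1.spades_assembly_${sample}/
```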


## Optional: Filtering out short contigs

For downstream processing, it is generally a good idea to filter out short contigs (for example, those less than 1000 or 2000 bp).

If you wish to filter out short contigs, you can do so via `seqmagick`:

In [None]:
## Filter out contigs < 1000 bp using seqmagick

# Set up working directories
cd /working/dir

# Load seqmagick
module purge
module load seqmagick/0.7.0-gimkl-2017a-Python-3.6.3

# Individual sample assemblies
for i in {1..9}; do
    seqmagick convert --min-length 1000 2.assembly/1.spades_assembly_S${i}/scaffolds.fasta 2.assembly/1.spades_assembly_S${i}/scaffolds.m1000.fasta
done

# Co-assembly
seqmagick convert --min-length 1000 2.assembly/1.spades_assembly_coassembly/scaffolds.fasta 2.assembly/1.spades_assembly_coassembly/scaffolds.m1000.fasta
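
If `seqmagick` is not available on your system, the same length filter can be approximated with `awk`. This is a swapped-in alternative rather than part of the workflow above, and `filter_fasta_min_len` is a hypothetical helper name. Like `--min-length`, it keeps sequences of at least the given length.

```shell
# Keep only FASTA records whose sequence is at least min_len bp long.
# Handles sequences wrapped over several lines; output sequences are written on one line.
filter_fasta_min_len() {
    # $1 = minimum length, $2 = input FASTA, $3 = output FASTA
    awk -v min_len="$1" '
        function emit() { if (header != "" && length(seq) >= min_len) print header "\n" seq }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record: write out the previous one
             { seq = seq $0 }                          # accumulate wrapped sequence lines
        END  { emit() }                                # write out the final record
    ' "$2" > "$3"
}

# Hypothetical usage:
# filter_fasta_min_len 1000 2.assembly/1.spades_assembly_coassembly/scaffolds.fasta \
#     2.assembly/1.spades_assembly_coassembly/scaffolds.m1000.fasta
```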


## Assessing assemblies

The following are some of the ways in which you can examine how well the assembly went. This is also useful if you have tested different assembly parameters (e.g. different k-mer sizes) or different assemblers on a subset of samples and are deciding which settings to use for the full assemblies.

Here we will be looking at:

- **counts of contigs** output by each assembly (including the filtered vs. unfiltered assemblies)
- **relative length of contigs** output by each assembly, via the contig N/L50 values

You can use these metrics (among others) to select the assembly parameters (and/or assembler) you wish to proceed with for the actual assemblies of all samples (or co-assembly, or multiple mini co-assemblies, if that is the option you go for). 

- NOTE: *more* contigs does not necessarily mean a better assembly. An assembly with fewer contigs, but with contigs of greater length on average, may be preferred. Ultimately, this is a bit of a trial and error process, and what counts as the "best" assembly may depend on both your data and the question you're asking of it downstream.
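
To make the N/L50 metric concrete: sort the contigs from longest to shortest and add up their lengths until at least half of the total assembled bases is reached. The length of the contig at that point, and the number of contigs needed to get there, are the two values reported. The sketch below computes both from a FASTA file. `fasta_n50` is a hypothetical helper, and its output deliberately uses descriptive labels because tools differ in which of the two values they call N50 and which L50.

```shell
# Report the contig length at which half of the assembled bases is reached,
# the number of contigs needed to reach it, and the total assembly size.
fasta_n50() {
    # $1 = FASTA file
    awk '/^>/ { if (len) print len; len = 0; next }
              { len += length($0) }
         END  { if (len) print len }' "$1" \
      | sort -rn \
      | awk '{ lens[NR] = $1; total += $1 }
             END {
                 half = total / 2
                 for (i = 1; i <= NR; i++) {
                     cum += lens[i]
                     if (cum >= half) {
                         printf("length_at_half=%d contigs_to_half=%d total_bp=%d\n", lens[i], i, total)
                         exit
                     }
                 }
             }'
}

# Hypothetical usage:
# fasta_n50 2.assembly/1.spades_assembly_coassembly/scaffolds.fasta
```

For example, contigs of 10, 8, 6, 4 and 2 bp total 30 bp; the running sum reaches the halfway point (15 bp) at the second contig, so the length at the halfway point is 8 bp and two contigs are needed to reach it.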


##### Counting the number of contigs in each of the assemblies (including the filtered vs. non-filtered files)

In [None]:
cd /working/dir

## Individual sample assemblies
for i in {1..9}; do
    # All contigs
    grep -c '>' 2.assembly/1.spades_assembly_S${i}/scaffolds.fasta
    # Contigs >= 1000 bp
    grep -c '>' 2.assembly/1.spades_assembly_S${i}/scaffolds.m1000.fasta
done

## Co-assembly
# All contigs
grep -c '>' 2.assembly/1.spades_assembly_coassembly/scaffolds.fasta
# Contigs >= 1000 bp
grep -c '>' 2.assembly/1.spades_assembly_coassembly/scaffolds.m1000.fasta
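
The loop above prints bare numbers, which can be hard to match back to samples. As an optional convenience, the sketch below prints one labelled, tab-separated line per assembly. `contig_counts` is a hypothetical helper; it counts header lines (those starting with `>`) in each file.

```shell
# Print a tab-separated line of: label, total contigs, contigs kept after length filtering.
contig_counts() {
    # $1 = label, $2 = unfiltered FASTA, $3 = filtered FASTA
    all=$(awk '/^>/ { n++ } END { print n + 0 }' "$2")
    kept=$(awk '/^>/ { n++ } END { print n + 0 }' "$3")
    printf '%s\t%s\t%s\n' "$1" "$all" "$kept"
}

# Hypothetical usage:
# for i in 1 2 3 4 5 6 7 8 9; do
#     contig_counts "S${i}" "2.assembly/1.spades_assembly_S${i}/scaffolds.fasta" \
#                           "2.assembly/1.spades_assembly_S${i}/scaffolds.m1000.fasta"
# done
```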


##### Assembly statistics via BBMap's stats.sh script

A key value to note in the output of this script is the `contig N/L50`.

In [None]:
module purge
module load BBMap/38.73-gimkl-2018b

cd /working/dir

## Run stats.sh

# Individual assemblies
for i in {1..9}; do
    # All contigs
    stats.sh in=2.assembly/1.spades_assembly_S${i}/scaffolds.fasta
    # Contigs >= 1000 bp
    stats.sh in=2.assembly/1.spades_assembly_S${i}/scaffolds.m1000.fasta
done

# Co-assembly
stats.sh in=2.assembly/1.spades_assembly_coassembly/scaffolds.fasta
stats.sh in=2.assembly/1.spades_assembly_coassembly/scaffolds.m1000.fasta

***