A bash pipeline for the de novo assembly of viral genomes from Illumina NGS data. It currently handles the following viruses: HIV-1, RSV, RRV and HMPV. The pipeline includes quality checks and reports via FastQC and MultiQC, trimming and mapping of reads via BBTools, and de novo assembly using MEGAHIT.
Current version: V1
The pipeline can be run on either a standard computer or an HPC server. We tested the pipeline on a standard desktop computer with the following specifications:
- RAM: 16+ GB
- CPU: 4+ cores @ 1.90 GHz
The pipeline has been tested on the following Linux operating systems:
Linux: Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04
A conda package manager such as Miniconda3 is required. To install it:
- Download the latest miniconda installation script
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Make the miniconda installation script executable
chmod +x Miniconda3-latest-Linux-x86_64.sh
- Run miniconda installation script
bash ./Miniconda3-latest-Linux-x86_64.sh
Installing the pipeline took less than 15 minutes on a standard desktop computer.
- Download the initial environment installation file
wget https://raw.githubusercontent.com/laulambr/virus_assembly/main/scripts/install_env.sh
- Run the script in the terminal
bash ./install_env.sh
- Check if installation worked
conda activate virus_assembly
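Activating the environment only confirms that it exists. As a further check, a sketch like the following reports whether the expected commands are on the PATH. `check_tools` is a hypothetical helper and not part of the pipeline, and the executable names in the example call (`fastqc`, `multiqc`, `megahit`, `bbduk.sh`) are assumptions based on the names these packages usually install.

```shell
# check_tools: hypothetical helper, not part of the pipeline.
# Prints the status of each named command and returns non-zero
# if any of them cannot be found on PATH.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, to be run after `conda activate virus_assembly`
# (the tool names are assumptions):
# check_tools virus_assembly fastqc multiqc megahit bbduk.sh
```

Any line starting with `missing:` suggests the environment installation did not complete.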
Pipeline: NGS pipeline for viral assembly.
usage: virus_assembly [-h -v -p -q] (-i dir -m value -t value )
(-s string)
with:
-h Show help text
-v Version of the pipeline
-n Name of RUN.
-i Input directory
-s Viral species [HIV, RSV, RRV, HMPV]
-c Perform clipping of primers
-q Perform quality check using fastQC
-m Memory
-t Number of threads
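As an illustration of how the options combine, the sketch below prints (rather than runs) an invocation that enables the FastQC check and uses every available core. The input path is a placeholder, and the `-m` option is left out because the help text does not state its unit.

```shell
# Number of CPU cores currently online.
threads="$(getconf _NPROCESSORS_ONLN)"

# Print the command for inspection; remove `echo` to actually run it.
echo virus_assembly -i path/to/main/directory -s RSV -q -t "$threads"
```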
- Activate environment.
conda activate virus_assembly
- Head to the directory where you will perform the analysis.
- Place the raw fastq.gz files in a directory called source.
- Create a list holding the sample names from your sequencing files, called IDs.list, and place it in the main directory.
- Start the pipeline using the following command
virus_assembly -i path/to/main/directory -s VIRUS
- When the pipeline has finished, 5 additional folders will have been created:
- 1_reads: Includes the trimmed and normalised reads.
- 2_ref_map: Includes a bam file of the trimmed reads against the viral reference genome and a pdf with the qualimap results.
- 3_contigs: Includes the de novo assembled contigs by megahit for each sample.
- 4_filter: Includes the high-coverage contigs generated by MEGAHIT (files ending in hicov.fasta), filtered and reoriented against the viral reference genome (files ending in reoriented.fa).
- 5_remap: Includes both a fasta file holding the viral contigs for that sample and a bam file of the trimmed reads against those contigs.
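The IDs.list file from the steps above can be generated from the contents of the source directory instead of being typed by hand. The following is a minimal sketch under two assumptions that are not confirmed by this document: the reads are named `<sample>_R1...fastq.gz` and `<sample>_R2...fastq.gz`, and the pipeline expects one sample name per line. `make_id_list` is a hypothetical helper; adjust the pattern to your own file names.

```shell
# make_id_list: hypothetical helper, not part of the pipeline.
# Writes the unique sample names found in a directory of fastq.gz files,
# one per line, to the given output file.
make_id_list() {
    local src_dir="$1" out_file="$2" f name
    for f in "$src_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue          # skip if the directory has no matches
        name="$(basename "$f")"
        name="${name%.fastq.gz}"         # drop the extension
        name="${name%_R[12]*}"           # drop the read marker and anything after it
        echo "$name"
    done | LC_ALL=C sort -u > "$out_file"
}

# Example call from the main directory:
# make_id_list source IDs.list
```

Check the resulting file before starting the pipeline, since a naming scheme different from the assumed one will produce wrong sample names.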
We provide an example analysis using pre-installed test data, which took less than 1 minute on a standard desktop computer.
- Activate environment.
conda activate virus_assembly
- Go to the test data folder inside the GitHub clone of this repository
cd $CONDA_PREFIX/virus_assembly/test_data
- Run the following command
virus_assembly -i $CONDA_PREFIX/virus_assembly/test_data -s HIV
- Once the pipeline has finished, check the newly created 5_remap directory for a fasta file containing the de novo assembled HIV-1 provirus.
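A quick way to inspect the result is to count the sequences in the output fasta file. The sketch below uses a hypothetical helper, `count_fasta_records`, that counts header lines (lines starting with `>`). The file name in the example call is a placeholder, since the exact output name is not given here.

```shell
# count_fasta_records: hypothetical helper, not part of the pipeline.
# Prints the number of sequences (header lines starting with ">") in a fasta file.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c '^>' "$1" || true
}

# Example call (placeholder file name):
# count_fasta_records 5_remap/your_sample.fasta
```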
- HIV: K03455.1
- RSV: MH760627; MH760652
- RRV: RRV_ref (Accession pending)
- HMPV: HMPV205; HMPV218 (Accessions pending)