Variant calling in deeply sequenced viral populations

This set of scripts provides an automated pipeline for identifying mutations in viral populations using Illumina deep sequencing. Coupled with our deep sequencing variant calling tool, it provides a full workflow to move from Illumina fastq sequences to a VCF file of population mutations and amino acid changes.


YAML configuration files define all inputs to the process; the example configuration file is a useful starting point. With a configuration file in place, the entire run consists of a single command:

 python scripts/variant_identify.py <your_config.yaml>

This creates a variation directory containing files named raw_your-run-name-sort-realign.tsv, each with detailed statistics about every position with aligned reads. These values feed directly into the variant calling framework.
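To inspect the output before handing it to the variant caller, the files can be read with Python's standard library. This is a minimal sketch, not part of the pipeline: the function name `load_position_stats` is hypothetical, the directory is assumed to be named `variation` relative to the working directory, and no column names are assumed because the file layout is not documented here.

```python
import csv
import glob
import os


def load_position_stats(variation_dir="variation"):
    """Return {filename: list of rows} for each raw_*-sort-realign.tsv file.

    Hypothetical helper: the directory name and the absence of assumptions
    about columns are choices made for this illustration.
    """
    pattern = os.path.join(variation_dir, "raw_*-sort-realign.tsv")
    stats = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            # Each row is a list of tab-separated fields.
            rows = list(csv.reader(handle, delimiter="\t"))
        stats[os.path.basename(path)] = rows
    return stats
```

Calling `load_position_stats()` after a run returns one entry per run name, which makes it easy to check that every expected sample produced a statistics file.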

What does it do?

The build script performs the following steps to prepare for variant calling:

  • Collapses the input fastq reads into unique reads. At high sequencing depth, we expect extensive read duplication, and this step avoids the unnecessary overhead of aligning identical reads multiple times.

  • Aligns collapsed fastq reads to the reference genome. This step handles ambiguous reference genomes with IUPAC characters, which is useful for error matching in viral populations with known variant regions.

  • Re-aligns reads, avoiding inconsistent and incorrect alignments due to indels.

  • Summarizes unique reads at each position with read quality score, alignment quality score and percent representation of the k-mer surrounding region. These metrics feed directly into variant calling.
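The first step above, collapsing duplicate reads, can be sketched in a few lines. This is an illustration of the idea rather than the pipeline's own code: the function name `collapse_fastq` is hypothetical, the input is assumed to use exactly four lines per record, and quality scores are ignored here even though a real implementation would need to carry them forward.

```python
from collections import Counter
import io


def collapse_fastq(handle):
    """Count identical read sequences in a 4-line-per-record fastq stream."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            counts[line.strip()] += 1
    return counts


# Three reads, two of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
unique_reads = collapse_fastq(example)
print(unique_reads.most_common())  # [('ACGT', 2), ('TTGA', 1)]
```

Only the two unique sequences would then need to be aligned, with the counts retained so that per-position depth can be restored afterwards.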


The pipeline leverages these freely available tools:

  • novoalign -- alignment to the reference genome
  • Picard -- manipulation of BAM alignment files
  • GATK -- re-alignment of reads around indels
  • khmer -- counting of k-mer regions surrounding each variant
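To make the k-mer metric concrete, the sketch below computes how often one k-mer occurs as a percentage of all k-mers of the same length in a set of reads. It is written in plain Python and does not use khmer's API. The function name `kmer_percent` is hypothetical, and this definition of "percent representation" is an assumption, since the exact formula the pipeline uses is not stated here.

```python
from collections import Counter


def kmer_percent(reads, kmer):
    """Percent of all k-mers (with k = len(kmer)) in reads that equal kmer."""
    k = len(kmer)
    # Slide a window of width k across every read and count each k-mer.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    return 100.0 * counts[kmer] / total if total else 0.0


reads = ["ACGTAC", "ACGTTC"]
print(kmer_percent(reads, "ACG"))  # 25.0
```

An exact in-memory counter like this is only practical for small inputs, which is one reason a dedicated k-mer counting library is a better fit at deep sequencing scale.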

The CloudBioLinux project provides automated installation scripts with all of these dependencies.

After installing these dependencies, run:

$ python setup.py build
$ sudo python setup.py install

to install required Python libraries.