Variant calling in deeply sequenced viral populations

This set of scripts provides an automated pipeline for identifying mutations in viral populations using Illumina deep sequencing. Coupled with our deep sequencing variant calling tool, it provides a full workflow to move from Illumina fastq sequences to a VCF file of population mutations and amino acid changes.

Usage

Configuration files in YAML format define all inputs to the process. An example configuration file is a useful starting point. With this file, the entire run consists of a single command:

 python scripts/variant_identify.py <your_config.yaml>

This creates a variation directory containing files named raw_your-run-name-sort-realign.tsv, which hold detailed statistics about each position with aligned reads. These values feed directly into the variant calling framework.
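For a quick look at one of these files from Python, a minimal sketch follows. It assumes only that the file is tab-separated, as the .tsv extension suggests. The column layout is not described here, so the hypothetical helper read_position_rows returns raw string fields and makes no assumption about column names or a header row.

```python
import csv
from typing import List


def read_position_rows(path: str) -> List[List[str]]:
    """Return every row of a tab-separated file as a list of string fields.

    Hypothetical helper for illustration; not part of the pipeline itself.
    """
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t")]
```

Pass it the path to a file in the variation directory and inspect the first few rows to see which fields are present before building anything on top of them.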

What does it do?

The build script performs the following steps to prepare for variant calling:

  • Collapses the input fastq reads into unique reads. At high sequencing depth, we expect extensive read duplication, and this step avoids the unnecessary overhead of aligning identical reads multiple times.

  • Aligns collapsed fastq reads to the reference genome. This handles ambiguous reference genomes with IUPAC characters, which is useful for error matching in viral populations with known variant regions.

  • Re-aligns reads, avoiding inconsistent and incorrect alignments due to indels.

  • Summarizes unique reads at each position with read quality score, alignment quality score and percent representation of the k-mer surrounding region. These metrics feed directly into variant calling.
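The first step above, collapsing duplicate reads, can be illustrated with a short sketch. This is not the pipeline's own implementation: the function names read_fastq and collapse_reads are hypothetical, the input is assumed to be well-formed four-line fastq records, and quality scores are ignored for simplicity, whereas the pipeline carries read quality through to the final summary.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from well-formed 4-line fastq records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Group the stream into consecutive chunks of four lines per record.
    for header, seq, _plus, qual in zip(*[iter(stripped)] * 4):
        yield header[1:], seq, qual  # drop the leading '@' from the name


def collapse_reads(lines: Iterable[str]) -> Counter:
    """Count how many times each distinct read sequence occurs."""
    return Counter(seq for _name, seq, _qual in read_fastq(lines))


if __name__ == "__main__":
    example = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "ACGT", "+", "IIII",
        "@r3", "ACGA", "+", "IIII",
    ]
    for seq, count in collapse_reads(example).most_common():
        print(seq, count)
```

Each unique sequence then needs to be aligned only once, with its count retained so per-position statistics can still reflect the original read depth.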

Installation

The pipeline leverages these freely available tools:

  • novoalign -- alignment to the reference genome
  • Picard -- manipulation of BAM alignment files
  • GATK -- re-alignment of reads around indels
  • khmer -- count k-mer regions surrounding each variant

The CloudBioLinux project provides automated installation scripts with all of these dependencies.

Following installation of these, run:

$ python setup.py build
$ sudo python setup.py install

to install required Python libraries.
