variant discovery and annotation using GATK and Ensembl
Ruby Perl R
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.textile

Pipette: A framework for Bioinformatic pipelines

Pipette is a framework to quickly and easily build a pipeline for multi-step processing of input. It is especially suited for Bioinformatic applications.

Pipette also contains a collection of pipelines created in Pipette. Current Pipette Pipelines:

  • variant: a variant calling and annotation pipeline.

How it works

Pipette pipelines are made of steps. Each step is independent of other steps. Each step defines 2 attributes for itself:

  • input: The input parameters the step needs to run.
  • output: The parameters the step will generate after running.

Pipette chains these steps together, passing in input parameters and collecting output parameters automatically.

To create a new Pipeline in Pipette, simply sub-class the Pipeline class.

Tools

Pipette also includes a number of tools that perform the bulk of the processing for each step. These tools are simple wrappers for existing bioinformatic programs and also custom processing.

Requirements

Pipette is written in Ruby. Also, various tools are expected to be installed if used. See below for required tools for each pipeline.

Pipette Pipelines:

Variant calling and annotation pipeline

Pipeline to facilitate the automation of finding variants (SNPs/Indels) in an aligned BAM file.

Prerequisites

Currently, this pipeline does not perform the initial alignment step. This pipeline starts with an indexed BAM file and a reference genome.

Requirements

This pipeline wraps the calling of a number of external tools to perform the variant calling and annotation.

Current Applications Expected by vp:

Configuration Options

./variant_pipeline.rb -h
Usage: variant_pipeline [options]
    -i, --input BAM_FILE             REQUIRED - Input BAM file to call SNPs on
    -r, --reference FA_FILE          REQUIRED - Reference Fasta file for genome
    -o, --output PREFIX              Output prefix to use for generated files
    -j, --cores NUM                  Specify number of cores to run GATK on. Default: 4
    -c, --recalibrate COVARIATE_FILE If provided, recalibration will occur using input covariate file. Default: recalibration not performed
    -a, --annotate GENOME            Annotate the SNPs and Indels using Ensembl based on input GENOME. Example Genome: FruitFly
        --gatk JAR_FILE              Specify GATK installation
        --snpeff JAR_FILE            Specify snppEff Jar location
        --snpeff_config CONFIG_FILE  Specify snppEff config file location
        --samtools BIN_PATH          Specify location of samtools
    -q, --quiet                      Turn off some output
    -s realign,recalibrate,call,filter,annotate,
        --steps                      Specify only which steps of the pipeline should be executed
    -y, --yaml YAML_FILE             Yaml configuration file that can be used to load options. Command line options will trump yaml options
    -h, --help                       Displays help screen, then exits

Run Example

The easiest way to run the variant pipeline is to create a config yaml file to store most of the required configurations and then run using the -y flag.

sample_config.yml

reference: "~/genomes/Drosophila_melanogaster.BDGP5.4.54.dna.fasta"
input: "./aligned.fly.bam"
output: aligned.fly 
annotate: dm5.34
gatk: "~/tools/GATK/GenomeAnalysisTK.jar"
snpeff: "~/tools/snpEff/snpEff.jar"
snpeff_config: "~/tools/snpEff/snpEff.config"

Then simply run variant_pipeline with the -y flag:

./variant_pipeline/pipette.rb variant -y ./sample_config.yml