Pipette: A framework for Bioinformatic pipelines
Pipette is a framework to quickly and easily build a pipeline for multi-step processing of input. It is especially suited for Bioinformatic applications.
Pipette also contains a collection of pipelines created in Pipette. Current Pipette Pipelines:
- variant: a variant calling and annotation pipeline.
How it works
Pipette pipelines are made of steps. Each step is independent of other steps. Each step defines 2 attributes for itself:
- input: The input parameters the step needs to run.
- output: The parameters the step will generate after running.
Pipette chains these steps together, passing in input parameters and collecting output parameters automatically.
To create a new Pipeline in Pipette, simply sub-class the Pipeline class.
Pipette also includes a number of tools that perform the bulk of the processing for each step. These tools are simple wrappers for existing bioinformatic programs and also custom processing.
Pipette is written in Ruby. Also, various tools are expected to be installed if used. See below for required tools for each pipeline.
Variant calling and annotation pipeline
Pipeline to facilitate the automation of finding variants (SNPs/Indels) in an aligned BAM file.
Currently, this pipeline does not perform the initial alignment step. This pipeline starts with an indexed BAM file and a reference genome.
This pipeline wraps the calling of a number of external tools to perform the variant calling and annotation.
Current Applications Expected by vp:
- GATK – performs most of the SNP / Indel calling steps using the GATK’s Unified Genotyper
- samtools – for indexing output from GATK
- snpEff – for annotation
./variant_pipeline.rb -h Usage: variant_pipeline [options] -i, --input BAM_FILE REQUIRED - Input BAM file to call SNPs on -r, --reference FA_FILE REQUIRED - Reference Fasta file for genome -o, --output PREFIX Output prefix to use for generated files -j, --cores NUM Specify number of cores to run GATK on. Default: 4 -c, --recalibrate COVARIATE_FILE If provided, recalibration will occur using input covariate file. Default: recalibration not performed -a, --annotate GENOME Annotate the SNPs and Indels using Ensembl based on input GENOME. Example Genome: FruitFly --gatk JAR_FILE Specify GATK installation --snpeff JAR_FILE Specify snppEff Jar location --snpeff_config CONFIG_FILE Specify snppEff config file location --samtools BIN_PATH Specify location of samtools -q, --quiet Turn off some output -s realign,recalibrate,call,filter,annotate, --steps Specify only which steps of the pipeline should be executed -y, --yaml YAML_FILE Yaml configuration file that can be used to load options. Command line options will trump yaml options -h, --help Displays help screen, then exits
The easiest way to run the variant pipeline is to create a config yaml file to store most of the required configurations and then run using the -y flag.
reference: "~/genomes/Drosophila_melanogaster.BDGP5.4.54.dna.fasta" input: "./aligned.fly.bam" output: aligned.fly annotate: dm5.34 gatk: "~/tools/GATK/GenomeAnalysisTK.jar" snpeff: "~/tools/snpEff/snpEff.jar" snpeff_config: "~/tools/snpEff/snpEff.config"
Then simply run variant_pipeline with the -y flag:
./variant_pipeline/pipette.rb variant -y ./sample_config.yml