Simple FASTQ quality assessment using Python
Latest commit 9f52496 Sep 28, 2016 @mdshw5 committed on GitHub Merge pull request #25 from trsherborne/master
Update all usages of mpl.rc to mpl.cycler

fastqp

Simple FASTQ, SAM and BAM read quality assessment and plotting using Python.

Features

  • Requires only Python with the NumPy, SciPy, and Matplotlib libraries
  • Works with (gzipped) FASTQ, SAM, and BAM formatted reads
  • Tidy, tabular output statistics so you can create your own graphs
  • A useful set of default graphics rivaling comparable QC packages
  • Counts all IUPAC ambiguous nucleotide codes (NMWSKRY) if present in sequences
  • Downsamples input files to around 2,000,000 reads (user adjustable)
  • Allows a 5' and 3' (left and right) cycle limit for graphics generation
  • Tracks kmers and sequence duplication for the entire input file
  • Plots base call mismatches against the reference for aligned reads
  • Optional sequence duplication calculation using Bloom filters (beta)
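
As a rough illustration of two of the features above (per-cycle nucleotide counts that include ambiguity codes, and k-mer tracking), here is a minimal sketch in plain Python. It is not fastqp's implementation: the function names `count_bases_by_cycle` and `count_kmers` are hypothetical, and the three reads are invented for the example.

```python
from collections import Counter, defaultdict

# Codes tallied here: the four bases plus the ambiguity codes named in the
# feature list (N, M, W, S, K, R, Y).
CODES = set("ACGTNMWSKRY")


def count_bases_by_cycle(reads):
    """Return {cycle: Counter(code -> count)} using 1-based cycle numbers."""
    counts = defaultdict(Counter)
    for read in reads:
        for cycle, base in enumerate(read.upper(), start=1):
            if base in CODES:
                counts[cycle][base] += 1
    return counts


def count_kmers(reads, k=5):
    """Count every overlapping k-mer across all reads."""
    kmers = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmers[seq[i:i + k]] += 1
    return kmers


# Toy reads, invented for illustration only.
reads = ["ACGTN", "ACGTA", "RCGTA"]
by_cycle = count_bases_by_cycle(reads)
kmers = count_kmers(reads, k=3)

print(dict(by_cycle[1]))      # code counts at the first cycle
print(kmers.most_common(2))   # the two most frequent 3-mers
```

Keeping the counts in plain `Counter` objects makes it easy to write them out as tidy, tabular rows (one row per cycle and code) for custom plotting.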

Requirements

Tested on Python 2.7 and 3.4

Tested on Mac OS 10.10 and Linux 2.6.18

Installation

pip install [--user] fastqp

Note: BAM file support requires samtools
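
If you want to confirm that samtools is reachable before processing BAM files, a small check like the one below may help. This is only a sketch: `samtools_available` is a hypothetical helper, it assumes the executable is named `samtools` and is looked up on `PATH`, and it relies on `shutil.which`, which exists only in Python 3.3 and later.

```python
import shutil


def samtools_available(executable="samtools"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if not samtools_available():
    print("samtools not found on PATH; BAM input may not work")
```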

Usage

usage: fastqp [-h] [-q] [-s BINSIZE] [-a NAME] [-n NREADS] [-p BASE_PROBS] [-k {2,3,4,5,6,7}] [-o OUTPUT]
              [-ll LEFTLIMIT] [-rl RIGHTLIMIT] [-mq MEDIAN_QUAL] [--aligned-only | --unaligned-only] [-d]
              input

simple NGS read quality assessment using Python

positional arguments:
  input                 input file (one of .sam, .bam, .fq, or .fastq(.gz) or stdin (-))

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           do not print any messages (default: False)
  -s BINSIZE, --binsize BINSIZE
                        number of reads to bin for sampling (default: auto)
  -a NAME, --name NAME  sample name identifier for text and graphics output (default: input file name)
  -n NREADS, --nreads NREADS
                        number of reads to sample from input (default: 2000000)
  -p BASE_PROBS, --base-probs BASE_PROBS
                        probabilities for observing A,T,C,G,N in reads (default: 0.25,0.25,0.25,0.25,0.1)
  -k {2,3,4,5,6,7}, --kmer {2,3,4,5,6,7}
                        length of kmer for over-represented kmer counts (default: 5)
  -o OUTPUT, --output OUTPUT
                        base name for output files (default: fastqp_figures)
  -ll LEFTLIMIT, --leftlimit LEFTLIMIT
                        leftmost cycle limit (default: 1)
  -rl RIGHTLIMIT, --rightlimit RIGHTLIMIT
                        rightmost cycle limit (-1 for none) (default: -1)
  -mq MEDIAN_QUAL, --median-qual MEDIAN_QUAL
                        median quality threshold for failing QC (default: 30)
  --aligned-only        only aligned reads (default: False)
  --unaligned-only      only unaligned reads (default: False)
  -d, --count-duplicates
                        calculate sequence duplication rate (default: False)
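
For scripted runs, the options above can be assembled into an argument list and passed to `subprocess`. The sketch below uses only flags shown in the usage text; `build_fastqp_command` and `run_fastqp` are hypothetical helpers, and the file name `sample.fastq.gz` is a placeholder.

```python
import subprocess


def build_fastqp_command(input_path, output="fastqp_figures", nreads=2000000,
                         kmer=5, median_qual=30, count_duplicates=False):
    """Assemble a fastqp argument list from options in the usage text."""
    if kmer not in range(2, 8):
        raise ValueError("kmer must be between 2 and 7")
    cmd = ["fastqp",
           "-o", output,
           "-n", str(nreads),
           "-k", str(kmer),
           "-mq", str(median_qual)]
    if count_duplicates:
        cmd.append("-d")
    cmd.append(input_path)  # the positional input goes last
    return cmd


def run_fastqp(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(cmd)


cmd = build_fastqp_command("sample.fastq.gz", output="sample_qc", kmer=4)
print(" ".join(cmd))
# run_fastqp(cmd)  # uncomment once fastqp is installed
```

Building the list first and running it separately keeps the command easy to log or inspect before any files are read.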

Changes

See releases page for details.

Examples

  • quality heatmap
  • gc plot
  • gc distribution
  • nucleotide plot
  • nucleotide mismatch plot
  • kmer distribution
  • depth plot
  • quality percentiles
  • quality distribution
  • adapter kmer distribution

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.