Quaff is a pair HMM for aligning FASTQ reads to FASTA references, with the following features:
-
it pre-filters and bands the DP algorithms by looking for diagonals with lots of k-mer matches
-
you can train the pair HMM on sequences using the Forward-Backward algorithm
-
you can align reads to references using the Viterbi algorithm
-
you can also align reads to reads (i.e. look for overlaps), using Viterbi
-
it attempts to model the FASTQ quality scores (using a negative binomial distribution), and also uses k-mer context for modeling substitution and/or gap-opening probabilities
-
it's highly parallel: multithreaded, can run remote servers, or launch its own temporary Amazon EC2 cluster
A full manual can be found in the repository at doc/manual.pdf
Usage: quaff {help,train,align,overlap} [options]
Commands:
TRAINING
quaff train refs.fasta reads.fastq >params.json
(to fit a model to unaligned sequences, using EM/Forward-Backward)
-maxiter <n> Max number of EM iterations (default is 100)
-mininc <n> EM convergence threshold as relative log-likelihood increase
(default is .01)
-maxreadmb <n> Use only the first n megabases of the read training set
-force Force each read to match a refseq, i.e. disallow null model
-suborder <k> Allow substitutions to depend on k-mer contexts
-gaporder <k> Allow gap open probabilities to depend on k-mer contexts
-order <k> Shorthand for '-suborder <k> -gaporder <k>'
-prior <file>, -saveprior <file>
Respectively: load/save prior pseudocounts from/to file
-saveparams <file>, -savecounts <file>, -savecountswithprior <file>
Save parameters (or E-step counts) to file, not stdout
(saved counts can subsequently be used as a prior)
ALIGNMENT
quaff align refs.fasta reads.fastq
(to align FASTQ reads to FASTA reference sequences, using Viterbi)
-printall Print all pairwise alignments, not just best for each read
quaff overlap reads.fastq
(to find overlaps between FASTQ reads, using Viterbi)
Alignment options (for align/overlap commands):
-threshold <n>, -nothreshold
Log-odds ratio score threshold for alignment reporting
-noquals Ignore read quality scores during alignment
-savealign <file>
Stream alignments to file, instead of stdout
-format {fasta,stockholm,sam,refseq}
Alignment output format
GENERAL
General options (for all commands, except where indicated):
-verbose, -vv, -vvv, -v4, -v5, etc.
Various levels of logging (-nocolor for monochrome)
-V, --version Print GNU-style version info
-h, --help Print help message
-params <file> Load model parameters from file
-ref <file> Load additional FASTA reference sequences
-read <file> Load additional FASTQ read sequences
-fwdstrand Do not include reverse-complemented sequences
-global Force all of refseq to be aligned (align/train only)
-null <file>, -savenull <file>
Respectively: load/save null model from/to file
-kmatch <k> Length of kmers for pre-filtering heuristic (default 6)
-kmatchn <n> Threshold# of kmer matches to seed a diagonal
(default is 14 for overlap, 20 for align/train)
-kmatchband <n> Size of DP band around kmer-matching diagonals (default 64)
-kmatchmb <M> Set kmer threshold to use M megabytes of memory
-kmatchmax Set kmer threshold to use all available memory (slow)
-kmatchoff No kmer threshold, do full DP (typically very slow)
PARALLEL PROCESSING
Parallelization is via a thread pool, which can be extended over a cluster.
If IP addresses (or AWS credentials) are given, then jobs are run over sockets,
and (if NFS is unavailable) files may be synchronized using S3 or rsync.
Alternatively, jobs can be run using a queueing system (such as PBS or SGE),
in which case NFS is required for both job synchronization and file sharing.
Options for parallel processing over sockets:
-threads <N>, -maxthreads
Use N threads, or use all cores available
-remote [user@]host[:port[-maxport]]
Start a (multithreaded) remote quaff server via SSH
-sshkey <file> SSH private key file
-sshpath <p> Path to ssh
-rsyncpath <p> Path to rsync
-remotepath <p> Path to remote binary (default /usr/local/bin/quaff)
-rsync Client will rsync data to server dir (/tmp/quaff)
-s3bucket <B> Client/server will sync data files to/from bucket B
-ec2instances <N>
Launch N temporary EC2 instances as servers
-ec2ami <AMI>, -ec2type <type>, -ec2cores <numberOfCores>,
-ec2user <user>, -ec2key <keypair>, -ec2group <group>, -ec2port <port>
Control various aspects of the launched instances
(defaults: ami-1ecae776, m3.medium, 1,
ec2-user, quaff, quaff, 8000)
Since quaff opens one socket per remote thread (plus one ssh job per server),
you may need to raise the system limits on # of files/sockets per process
(e.g. OSX 10.10 limits you to 128 sockets/process by default).
By default, quaff assumes all data files are in the same place on the server.
You can copy them across using -rsync, or -s3bucket, or other means (eg NFS).
For AWS, ensure aws CLI tools are installed and credentials are set
(i.e. AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY environment variables).
You must use an AMI consistent with your AWS_DEFAULT_REGION
(the default AMI is a standard Amazon EC2 Linux for us-east-1).
A standard AMI should be fine: quaff downloads prereqs and builds itself.
Also ensure that security group <group> allows incoming connections
on ports 22 (ssh) and the range from <port> .. <port> + <numberOfCores> - 1.
Any problems can often be diagnosed by turning up the logging to -v5 or so.
Quaff makes every effort to clean up rogue EC2 instances, but please check!
Options for parallel processing over queueing system:
-qsubjobs <N> Submit up to N simultaneous jobs to queueing system
-qsub <path>, -qsubopts <options>
Path to, and options for, job submission program (qsub)
-qsubdir <path> Temp directory to use for job scripts (must be on NFS)
-qsubheader <header file>
Specify header for job scripts (e.g. for PBS directives)
To use queueing, specify a nonzero value for -qsubjobs.