ihh/quaff

Quaff is a pair HMM for aligning FASTQ reads to FASTA references, with the following features:

  • it pre-filters and bands the DP algorithms by looking for diagonals with lots of k-mer matches

  • you can train the pair HMM on sequences using the Forward-Backward algorithm

  • you can align reads to references using the Viterbi algorithm

  • you can also align reads to reads (i.e. look for overlaps), using Viterbi

  • it attempts to model the FASTQ quality scores (using a negative binomial distribution), and also uses k-mer context for modeling substitution and/or gap-opening probabilities

  • it's highly parallel: it is multithreaded, can run jobs on remote servers, or can launch its own temporary Amazon EC2 cluster

A full manual can be found in the repository at doc/manual.pdf
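
For a quick start, a typical workflow looks like the following (file names here are placeholders; the commands and options are described in detail below): train a model on your reads, then align with the fitted parameters.

 quaff train refs.fasta reads.fastq  >params.json
 quaff align refs.fasta reads.fastq -params params.json -format sam  >hits.sam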


Usage: quaff {help,train,align,overlap} [options]

Commands:

TRAINING

 quaff train refs.fasta reads.fastq  >params.json
  (to fit a model to unaligned sequences, using EM/Forward-Backward)

   -maxiter <n>    Max number of EM iterations (default is 100)
   -mininc <n>     EM convergence threshold as relative log-likelihood increase
                    (default is .01)
   -maxreadmb <n>  Use only the first n megabases of the read training set
   -force          Force each read to match a refseq, i.e. disallow null model
   -suborder <k>   Allow substitutions to depend on k-mer contexts
   -gaporder <k>   Allow gap open probabilities to depend on k-mer contexts
   -order <k>      Shorthand for '-suborder <k> -gaporder <k>'
   -prior <file>, -saveprior <file>
                   Respectively: load/save prior pseudocounts from/to file
   -saveparams <file>, -savecounts <file>, -savecountswithprior <file>
                   Save parameters (or E-step counts) to file, not stdout
                    (saved counts can subsequently be used as a prior)
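
 As an illustration, a training run that allows order-2 k-mer contexts, caps the number of EM iterations, and writes the fitted parameters to a file (file names are placeholders) might look like this:

 quaff train refs.fasta reads.fastq -order 2 -maxiter 50 -saveparams params.json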


ALIGNMENT

 quaff align refs.fasta reads.fastq
  (to align FASTQ reads to FASTA reference sequences, using Viterbi)

   -printall       Print all pairwise alignments, not just best for each read


 quaff overlap reads.fastq
  (to find overlaps between FASTQ reads, using Viterbi)


Alignment options (for align/overlap commands):
   -threshold <n>, -nothreshold
                   Log-odds ratio score threshold for alignment reporting
   -noquals        Ignore read quality scores during alignment
   -savealign <file>
                   Stream alignments to file, instead of stdout
   -format {fasta,stockholm,sam,refseq}
                   Alignment output format
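
 For example, one possible invocation aligns reads to a reference using previously trained parameters, ignores quality scores, and streams SAM-format output to a file (file names are placeholders):

 quaff align refs.fasta reads.fastq -params params.json -noquals -format sam -savealign hits.sam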

GENERAL

General options (for all commands, except where indicated):
   -verbose, -vv, -vvv, -v4, -v5, etc.
                   Various levels of logging (-nocolor for monochrome)
   -V, --version   Print GNU-style version info
   -h, --help      Print help message

   -params <file>  Load model parameters from file
   -ref <file>     Load additional FASTA reference sequences
   -read <file>    Load additional FASTQ read sequences
   -fwdstrand      Do not include reverse-complemented sequences
   -global         Force all of refseq to be aligned (align/train only)
   -null <file>, -savenull <file>
                   Respectively: load/save null model from/to file

   -kmatch <k>     Length of kmers for pre-filtering heuristic (default 6)
   -kmatchn <n>    Threshold number of kmer matches to seed a diagonal
                    (default is 14 for overlap, 20 for align/train)
   -kmatchband <n> Size of DP band around kmer-matching diagonals (default 64)
   -kmatchmb <M>   Set kmer threshold to use M megabytes of memory
   -kmatchmax      Set kmer threshold to use all available memory (slow)
   -kmatchoff      No kmer threshold, do full DP (typically very slow)
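
 The pre-filtering heuristic can be tightened or relaxed with these options; for instance, the following (values chosen purely for illustration) uses longer seed k-mers, requires more matches per diagonal, and widens the DP band:

 quaff align refs.fasta reads.fastq -kmatch 8 -kmatchn 30 -kmatchband 128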


PARALLEL PROCESSING

Parallelization is via a thread pool, which can be extended over a cluster.
If IP addresses (or AWS credentials) are given, then jobs are run over sockets,
and (if NFS is unavailable) files may be synchronized using S3 or rsync.
Alternatively, jobs can be run using a queueing system (such as PBS or SGE),
in which case NFS is required for both job synchronization and file sharing.


Options for parallel processing over sockets:
   -threads <N>, -maxthreads
                   Use N threads, or use all cores available
   -remote [user@]host[:port[-maxport]]
                   Start a (multithreaded) remote quaff server via SSH
   -sshkey <file>  SSH private key file
   -sshpath <p>    Path to ssh
   -rsyncpath <p>  Path to rsync
   -remotepath <p> Path to remote binary (default /usr/local/bin/quaff)
   -rsync          Client will rsync data to server dir (/tmp/quaff)
   -s3bucket <B>   Client/server will sync data files to/from bucket B
   -ec2instances <N>
                   Launch N temporary EC2 instances as servers
   -ec2ami <AMI>, -ec2type <type>, -ec2cores <numberOfCores>,
   -ec2user <user>, -ec2key <keypair>, -ec2group <group>, -ec2port <port>
                   Control various aspects of the launched instances
                    (defaults: ami-1ecae776, m3.medium, 1,
                               ec2-user, quaff, quaff, 8000)

Since quaff opens one socket per remote thread (plus one ssh job per server),
you may need to raise the system limits on # of files/sockets per process
(e.g. OSX 10.10 limits you to 128 sockets/process by default).

By default, quaff assumes all data files are in the same place on the server.
You can copy them across using -rsync, or -s3bucket, or other means (e.g. NFS).
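
A possible invocation, running an alignment across the local thread pool plus a remote quaff server started over SSH and rsyncing data files to the server (the host name and key file are placeholders):

 quaff align refs.fasta reads.fastq -maxthreads -remote user@server.example.com:8000 -sshkey ~/.ssh/id_rsa -rsync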

For AWS, ensure aws CLI tools are installed and credentials are set
(i.e. AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY environment variables).
You must use an AMI consistent with your AWS_DEFAULT_REGION
(the default AMI is a standard Amazon EC2 Linux for us-east-1).
A standard AMI should be fine: quaff downloads prereqs and builds itself.
Also ensure that security group <group> allows incoming connections
on port 22 (ssh) and on the port range <port> .. <port> + <numberOfCores> - 1.
Any problems can often be diagnosed by turning up the logging to -v5 or so.

Quaff makes every effort to clean up rogue EC2 instances, but please check!
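
For example, one possible invocation launches a temporary cluster of four EC2 instances and syncs data through an S3 bucket (the bucket name is a placeholder; AWS credentials must be set as described above):

 quaff align refs.fasta reads.fastq -ec2instances 4 -ec2type m3.medium -s3bucket my-quaff-bucket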


Options for parallel processing over queueing system:
   -qsubjobs <N>   Submit up to N simultaneous jobs to queueing system
   -qsub <path>, -qsubopts <options>
                   Path to, and options for, job submission program (qsub)
   -qsubdir <path> Temp directory to use for job scripts (must be on NFS)
   -qsubheader <header file>
                   Specify header for job scripts (e.g. for PBS directives)

To use queueing, specify a nonzero value for -qsubjobs.
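
For example, a training run submitting up to ten simultaneous jobs through qsub, with job scripts kept in an NFS-mounted temp directory (paths are placeholders), might look like this:

 quaff train refs.fasta reads.fastq -qsubjobs 10 -qsubdir /nfs/tmp/quaff-jobs  >params.json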