We recommend that you obtain GenomicConsensus by installing the most recent version of SMRTanalysis. Other means of installation are not officially supported.
Basic usage---using 8 CPUs to compute consensus of mapped reads and variants relative to a reference---is as follows:
% quiver -j8 aligned_reads{.cmp.h5, .bam, .fofn, or .xml} \ > -r reference{.fasta or .xml} -o variants.gff \ > -o consensus.fasta -o consensus.fastq
quiver
is a shortcut for variantCaller --algorithm=quiver
.
Naturally, to use arrow you could use the arrow
shortcut or
variantCaller --algorithm=arrow
.
in this example we perform haploid consensus and variant calling on
the mapped reads in the aligned_reads.bam
which was aligned to
reference.fasta
. The reference.fasta
is only used for
designating variant calls, not for computing the consensus. The
consensus quality score for every position can be found in the output
FASTQ file.
Note that 2.3 SMRTanalysis does not support "dataset" input (FOFN or XML files); those who need this feature should wait for the forthcoming release of SMRTanalysis 3.0 or build from GitHub sources.
We do not recommend attempting to construct a single giant cmp.h5 file and then processing it on a single node. This is inefficient and users attempting to do this have run into many problems with the instability of the HDF5 library (which PacBio is moving away from, in favor of BAM.)
To run a large-scale resequencing job (>50 megabase genome @ 50x coverage,nominally), you want to spread the computation load across multiple nodes in your computing cluster.
The smrtpipe workflow engine in SMRTanalysis 2.3 provides a convenient workflow automating this---it will automatically spread the load for both mapping and quiver jobs among your available cluster nodes. This is accessible via the SMRTportal UI; the simplest way to set up and run thse workflows is via tha UI. Nonetheless, we include command-line instructions for completeness.
If you have to run the smrtpipe workflow manually from the command line, a recipe is as folows:
Make sure the reference you will align and compare against is present in a SMRTportal "reference repository". Even if you don't want to use SMRTportal, you need to build/import the reference appropriately, and the simplest way to do that is via SMRTportal. If you don't have a SMRTportal instance, you can use the
referenceUploader
command to prepare your reference repository.Prepare an "input.fofn" file listing, one-per-line, each "bax.h5" file in your input data set.
Convert the "input.fofn" to an "input.xml" file that SMRTpipe can understand:
$ fofnToSmrtpipeInput.py input.fofn > input.xml
Prepare your "params.xml" file. Here is a params.xml template you can use; you should just need to edit the reference path.
Activate your SMRTanalysis environment, and invoke smrtpipe:
$ source <SMRT Analysis>/etc/setup.sh $ smrtpipe.py --distribute --params=params.xml xml:input.xml
After successful execution is complete, the results should be available as data/consensus.fast[aq].gz and data/variants.gff.gz, etc.
Please consult the SMRTpipe reference manual for further information.
Note that resequencing (mapping reads against a reference genome and then calling consensus and identifying variants) and polishing (mapping reads against a draft assembly and then taking the consensus output as the final, polished, assembly) are the same algorithmic operation, the only effective difference is that the "variants.gff" output is not biologically meaningful in the polishing case---it just records the edits that were made to the draft to produce the polished assembly.
(Forthcoming)
If you need to access the the latest code for some reason, a convenient way to build it is to use PacBio's pitchfork build system, which will take care of all third party dependencies for you. Here's a recipe:
git clone git@github.com:PacificBiosciences/pitchfork.git cd pitchfork make GenomicConsensus # may take some time, as it builds dependencies...
Now, with GenomicConsensus built, you can use it via:
bash --init-file deployment/setup-env.sh # Puts you in a subshell where your build is available quiver --help # now you have quiver, arrow, etc. available
If you encounter build issues using pitchfork, please report the issues there. Note that you can deploy PacBio software to a location of your choice using pitchfork.
Please consult the FAQ document.