This release contains a number of small bug fixes and changes to defaults that generally yield slight improvements. However, the reason for the minor (rather than patch) version bump is that this release introduces one major new feature for indexing, which requires bumping the index version (meaning that existing indices will need to be rebuilt for the new version).
This version introduces the option to use an [external memory perfect hash construction algorithm](https://github.com/ot/emphf) to construct the hash function (as opposed to the default Google dense hash), and so requires less memory. This behavior is invoked by passing the `--perfectHash` flag to the `sailfish index` command. Because the perfect hash function is built in external memory, construction of the hash using this data structure is slower. We don't have longitudinal benchmarks, but it is somewhere between 2 and 5x slower to populate the perfect hash than the dense hash. However, constructing the hash itself requires less memory (less RAM, anyway) and, once constructed, the perfect hash is considerably smaller. Typically, quantification on an index built using a perfect hash will require only ~50% of the memory required when using a dense hash.
The difference in mapping speed between the two indices is minimal. Usually, since the perfect hash is smaller, it can be loaded more quickly from disk (this benefit is most noticeable if the index itself is built on a very large set of transcripts). The choice of index should have no effect on downstream quantification results. The primary motivation for this feature is to allow the construction of indices on large de novo transcriptomes in less RAM. So, the default recommendation (and behavior) is to use the dense hash unless you run into memory problems building the index; in that case, you can use the `--perfectHash` flag to try to limit memory usage.
This release contains only fairly minor changes from v0.9.1, mainly related to improving interoperation with RapClust. It adds the `--discardOrphans` flag, which will disallow orphaned quasi-mappings of paired-end reads.
This release provides some performance improvements and bug-fixes over v0.9.0. It also provides some new features:
- The ability to write out the equivalence classes and their counts. The `--dumpEq` flag will write out a file in the auxiliary directory called `eq_classes.txt`. This file has the following format:
```
N (num transcripts)
M (num equiv classes)
tn_1
tn_2
...
tn_N
eq_1_size t_11 t_12 ... count
eq_2_size t_21 t_22 ... count
...
```
That is, the file begins with a line that contains the number of transcripts (say N), then a line that contains the number of equivalence classes (say M). It is then followed by N lines that list the transcript names --- the order here is important, because the labels of the equivalence classes are given in terms of the IDs of the transcripts. The rank of a transcript in this list is the ID with which it will be labeled when it appears in the label of an equivalence class. Finally, the file contains M lines, each of which describes an equivalence class of fragments. The first entry in this line is the number of transcripts in the label of this equivalence class (the number of different transcripts to which fragments in this class map --- call this k). The line then contains the k transcript IDs. Finally, the line contains the count of fragments in this equivalence class (how many fragments mapped to these transcripts). The values in each such line are tab-separated.
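As a concrete illustration, the format described above can be parsed with a short script. This is a minimal sketch (not part of Sailfish itself), assuming the fields on each equivalence-class line are tab-separated as described:

```python
def parse_eq_classes(text):
    """Parse the contents of an eq_classes.txt file written by --dumpEq.

    Returns the list of transcript names and a list of
    (transcript_names_in_label, fragment_count) pairs, one per class.
    """
    lines = text.splitlines()
    n = int(lines[0])            # number of transcripts
    m = int(lines[1])            # number of equivalence classes
    names = lines[2:2 + n]       # transcript names; rank in this list is the ID
    classes = []
    for line in lines[2 + n:2 + n + m]:
        fields = line.split("\t")
        k = int(fields[0])                        # size of the class label
        ids = [int(x) for x in fields[1:1 + k]]   # transcript IDs in the label
        count = int(fields[1 + k])                # fragment count for the class
        classes.append(([names[i] for i in ids], count))
    return names, classes
```

For example, `parse_eq_classes(open("aux/eq_classes.txt").read())` would return the names and classes for a quantification run that used `--dumpEq`.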
- The ability to select the "auxiliary" directory where information such as bootstrap estimates, Gibbs samples, and auxiliary parameters is stored. The default sub-directory of the quantification directory is `aux`, but it can be changed with `--auxDir`. The original name apparently caused a conflict on Cygwin-based systems, where `aux` is a reserved directory name. Even if you're not using this feature, upgrading is recommended for the performance improvements and bug fixes.
This is a fairly major new release of Sailfish (thus the major version bump). It includes some new features and makes minor but backward-incompatible changes to the output format.
- Sequence-specific bias correction --- The old bias correction methodology has been removed from Sailfish and replaced with a new sequence-specific bias correction model. Bias correction is enabled with the `--biasCorrect` flag. The new model has numerous benefits over the old one. First, it should more accurately correct for sequence-specific biases, leading to better estimates in biased samples. Second, it should not suffer from the same pathological "over-correction" failure cases as the old model --- if there is no substantial bias in the sample, it should have only a minimal effect on quantification results.
- New output format --- The new output format (which will also be adopted by Salmon from v0.6.0 onward) adds another column, `EffectiveLength`, which records the effective length of each transcript. This is the third column, so the `TPM` and `NumReads` columns have both been shifted by 1. Also, the `quant.sf` output file has been simplified and now contains no comment lines. The first row in the file is an (un-commented) header that lists the column names, and the subsequent rows are the quantification estimates.
- Information about the command used --- Since the comment lines have been removed from the `quant.sf` file, this information (and more), which can sometimes be useful, is now written to other locations. There is a JSON-formatted file in the top-level output directory called `cmd_info.json`. This contains a JSON structure with the relevant command line parameters (which used to appear in the comments of `quant.sf`).
- Meta-information about the run --- Quite a bit of useful information appears in the file `aux/meta_info.json` under the main quantification directory. This records information such as the number of reads processed, the number mapped, the percentage mapped, and which type of posterior sampling (e.g. Gibbs or bootstrap), if any, was performed.
- Auxiliary parameters from the run --- In addition to the above, the `aux/` sub-directory of the main quantification directory contains other useful files. Specifically, it contains gzipped binary data for any bootstrap or Gibbs samples that were generated, and gzipped binary data about the fragment length distribution and bias parameters (the latter is only meaningful if bias correction was performed).
- This release fixes a bug where the mapping location of a fragment may have been miscalculated by a small number of bases in certain cases. This in turn could lead to a small shift in the fragment length distribution and in the resulting quantification estimates.
- Special thanks go to Ayush Sengupta for helping out with the implementation of sequence-specific bias correction.
- Special thanks go to Mike Love for testing the effectiveness of the sequence-specific bias correction implementation on some experimental (GEUVADIS) data!
This release brings with it minor bug-fixes and two significant new features.
- Fixed a bug where the computed mapping rate (output in a comment at the top of `quant.sf`) could slightly over-estimate the true mapping rate (i.e. the sum of the estimated counts divided by the number of observed fragments).
- Fixed a bug that prevented some messages from being written to the log prior to exit (when errors were encountered in processing).
- Support for flexible handling of stranded libraries. This includes two new options, `--enforceLibCompat` and `--ignoreLibCompat`. The default behavior --- when neither of these flags is specified --- is the following. When a fragment is mapped, all multi-mappings are checked for compatibility with the specified library format type. If any mappings are compatible, then all incompatible mappings are discarded. However, if no compatible mappings are found, then the incompatible mappings will be counted.
- When the `--enforceLibCompat` flag is passed, only compatible fragments will ever be considered. Thus, if there are no compatible mappings for a fragment but incompatible mappings exist, the fragment will be treated as if it has no mappings.
- When the `--ignoreLibCompat` flag is passed, all mappings are considered compatible. This effectively disables testing the compatibility of mappings with the specified format.
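The three behaviors above amount to a simple per-fragment filtering rule. A rough sketch (a hypothetical helper for illustration, not Sailfish's actual code), where each mapping carries a flag indicating whether it is compatible with the specified library format:

```python
def filter_mappings(mappings, enforce_lib_compat=False, ignore_lib_compat=False):
    """mappings: list of (location, is_compatible) pairs for one fragment."""
    if ignore_lib_compat:
        return mappings                    # --ignoreLibCompat: keep everything
    compatible = [m for m in mappings if m[1]]
    if compatible:
        return compatible                  # discard incompatible mappings
    if enforce_lib_compat:
        return []                          # --enforceLibCompat: treat as unmapped
    return mappings                        # default: count the incompatible mappings
```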
- Large quasi-index support has been added. Now, when building the index, Sailfish will determine if a 32-bit suffix array is sufficient or if a 64-bit suffix array is required. It will build and use the appropriate suffix array (and report the result to the log). Note: The indexing code is generic, but the 64-bit index has been tested much less than the 32-bit index.
- Improvement to the manner in which gene lengths are calculated when aggregating transcript-level results to the gene level. If at least one transcript of a gene is expressed, the gene length is computed as the (expression) weighted sum of the lengths of the expressed transcripts. If no transcript of a gene is expressed (i.e. the TPM of all of its transcripts is 0), then the length is reported as the average transcript length. This improves upon the prior rule of simply reporting the gene length as the length of the gene's longest transcript.
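The rule above can be sketched as follows (an illustrative re-implementation of the described behavior, not Sailfish's code):

```python
def gene_length(transcripts):
    """transcripts: list of (length, tpm) pairs for one gene's transcripts."""
    total_tpm = sum(tpm for _, tpm in transcripts)
    if total_tpm > 0:
        # expression-weighted mean of the lengths of the expressed transcripts
        return sum(length * tpm for length, tpm in transcripts) / total_tpm
    # no transcript expressed: fall back to the average transcript length
    return sum(length for length, _ in transcripts) / len(transcripts)
```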
This release includes some important improvements and bug-fixes which include:
- A very rare bug in which `boost::hash_combine()` would exhibit pathologically bad behavior. This caused the concurrent hash map to keep doubling in size, which could consume massive amounts of memory. [Thanks to Nick Schurch for finding this bug and for sharing data that reproduces it.]
This release also includes improvements in quasi-mapping which reduce mapping ambiguity and improve results for very similar expressed transcripts that reside on opposite strands.
This release addresses a few issues and adds a significant new feature:
- Fixed an issue where argument validation could cause Sailfish to fail to process certain single-end libraries.
New features and changes:
- New: Added the ability to compute bootstrap samples using either the Variational Bayesian or standard EM algorithm. Providing the option `--numBootstraps k`, where `k` is some integer greater than `0`, will cause Sailfish to compute `k` bootstrap samples. The bootstraps are computed in parallel using the number of threads normally provided to Sailfish via the `--threads` option, with each bootstrap making use of a single thread. The results are saved in the quantification directory to a file called `quant_bootstraps.sf`. The format of the file is comment lines (starting with a `#`), followed by a header line listing the transcript ids, followed by `k` lines, each of which gives the bootstrap values for each transcript under that bootstrap. The order of the values in a line is the same as the order of the transcripts in the header.
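Given that layout, the file can be read with a few lines of code. A minimal sketch, assuming the header and value lines are tab-separated (the delimiter is an assumption; the release notes don't state it):

```python
def read_bootstraps(text):
    """Parse quant_bootstraps.sf contents: '#' comment lines, a header of
    transcript ids, then one line of abundance values per bootstrap."""
    rows = [line for line in text.splitlines()
            if line and not line.startswith("#")]
    transcripts = rows[0].split("\t")
    samples = [[float(v) for v in row.split("\t")] for row in rows[1:]]
    return transcripts, samples
```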
- New: Added a new method for adjusting for effective length that is less "aggressive" with non-standard fragment length distributions. The previous behavior can be enabled with
- Removed the `--useGSOpt` flag since it was somewhat redundant. Now, simply providing a number greater than `0` to `--numGibbsSamples` will enable the posterior Gibbs sampler.
This release addresses a few issues and adds some new features:
- Build from source now works with Apple Clang 7.0 (thanks for reporting this, Rory Kirchner).
- Ensure that lengths are calculated correctly if empirical fragment length distribution contains only a single value.
- Added the ability to generate samples from the posterior distribution of abundances via Gibbs sampling. This feature is enabled with `--useGSOpt`, and the number of samples to generate is provided via `--numGibbsSamples`.
- When aggregating estimates to the gene level, one can now provide `--txpAggregationKey` to decide which key from the GTF file to use for aggregation.
This is a minor bugfix release; the bug affected only v0.7.3. When there were too few reads to estimate an empirical fragment length distribution (< 50,000 uniquely mapping paired-end reads), and there was a transcript of exactly the same length as the fragment length distribution prior, a division by 0 could occur, leading to a NaN TPM estimate. This release resolves that bug.