Releases: jltsiren/gcsa2

GCSA2 v1.3

23 Jan 03:52
  • Uses C++14 and the vgteam fork of SDSL.
  • Support for Clang.
  • Deterministic shuffle in locate(range, max_positions, results) to avoid platform-specific behavior.
  • Construction, serialization, and statistics fixes for empty indexes.
  • Installation script.
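
One common source of platform-specific behavior in C++ sampling code is that std::shuffle and std::uniform_int_distribution are not required to produce the same results across standard library implementations, whereas the output sequence of std::mt19937_64 is fixed by the standard. The sketch below shows one way to get a reproducible random subset under that assumption. It is not the GCSA2 implementation: sample_positions and its seed parameter are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the GCSA2 API.
// Returns at most max_positions elements of `positions`, chosen with a
// partial Fisher-Yates shuffle driven directly by std::mt19937_64.
// The same input and seed give the same subset on every platform.
std::vector<std::uint64_t>
sample_positions(std::vector<std::uint64_t> positions, std::size_t max_positions, std::uint64_t seed)
{
  std::mt19937_64 rng(seed);
  const std::size_t n = positions.size();
  const std::size_t limit = std::min(n, max_positions);
  for (std::size_t i = 0; i < limit; i++) {
    // Pick j from [i, n). The modulo introduces a negligible bias but
    // avoids std::uniform_int_distribution, whose algorithm is unspecified.
    const std::size_t j = i + static_cast<std::size_t>(rng() % (n - i));
    assert(j < n);
    std::swap(positions[i], positions[j]);
  }
  positions.resize(limit);
  return positions;
}
```

Calling sample_positions twice with the same input and seed returns identical subsets, which is the property the release note describes.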

GCSA2 v1.2

11 May 03:59

Various improvements to index construction. This release addresses some bottlenecks that appear when the temporary files are on a fast SSD.

  • New functionality: locate(range, max_positions, results) returns a random subset of the matching positions in the query range.
  • Read and write data in smaller blocks to avoid the issue with >2 GB reads in GCC on macOS.
  • Delete temporary files when std::exit() is called.
  • Size limit is now the total for all temporary files. The default was increased to 2048 GB.
  • Faster index construction: faster preprocessing, better scheduling in PathGraph::extend().
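
The block-wise I/O mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not GCSA2's actual code: write_in_blocks, read_in_blocks, round_trip, and the block_size parameter are hypothetical names, and the block size is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helpers for illustration; these are not GCSA2 functions.

// Writes `bytes` bytes from `buffer`, passing at most `block_size` bytes
// to each write() call. Returns false if the stream fails.
bool write_in_blocks(std::ostream& out, const char* buffer, std::size_t bytes, std::size_t block_size)
{
  assert(block_size > 0);
  for (std::size_t offset = 0; offset < bytes; ) {
    const std::size_t chunk = std::min(block_size, bytes - offset);
    out.write(buffer + offset, static_cast<std::streamsize>(chunk));
    if (!out) { return false; }
    offset += chunk;
  }
  return true;
}

// Reads exactly `bytes` bytes into `buffer`, requesting at most
// `block_size` bytes per read() call. Returns false on a short read.
bool read_in_blocks(std::istream& in, char* buffer, std::size_t bytes, std::size_t block_size)
{
  assert(block_size > 0);
  for (std::size_t offset = 0; offset < bytes; ) {
    const std::size_t chunk = std::min(block_size, bytes - offset);
    in.read(buffer + offset, static_cast<std::streamsize>(chunk));
    if (static_cast<std::size_t>(in.gcount()) != chunk) { return false; }
    offset += chunk;
  }
  return true;
}

// In-memory demonstration: write `data` in blocks, then read it back.
// Returns an empty string if either step fails.
std::string round_trip(const std::string& data, std::size_t block_size)
{
  std::stringstream stream;
  if (!write_in_blocks(stream, data.data(), data.size(), block_size)) { return std::string(); }
  std::string result(data.size(), '\0');
  if (!read_in_blocks(stream, &result[0], result.size(), block_size)) { return std::string(); }
  return result;
}
```

With a file stream, a block size well below 2 GB keeps every individual read and write under the problematic limit while transferring the same total amount of data.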

GCSA2 v1.1

23 Feb 22:46

Support for haplotype-aware indexing and higher-order indexes in VG.

  • Node mappings for using separate sets of node identifiers for graph transformations and locate() queries.
    • Intended for graphs pruned with vg prune --unfold-paths, which replaces complex subgraphs with subgraphs that only contain specific paths.
    • Build a path graph for the unfolded graph but create an index that maps to the original graph.
  • Support for 4 doubling steps (paths of length up to 256).
  • Optionally specify a sample period instead of always sampling the initial offsets of the original nodes.
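
The relation between doubling steps and path length can be checked with a little arithmetic. The sketch below assumes that each doubling step doubles the path length and that the initial path length is 16, which is what the figure of 256 for 4 steps implies. max_path_length is a hypothetical helper, not a GCSA2 function.

```cpp
#include <cstddef>

// Hypothetical helper: each doubling step doubles the path length.
constexpr std::size_t max_path_length(std::size_t initial_length, unsigned doubling_steps)
{
  return initial_length << doubling_steps;
}

// Assumed initial path length of 16: 16 * 2^4 = 256.
static_assert(max_path_length(16, 4) == 256, "4 doubling steps from length 16");
```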

GCSA2 v1.0

22 May 13:49

This version contains bug fixes and other minor improvements. As there have been no major changes since September 2016, it seems appropriate to call this version 1.0.

  • The prefix of temporary files is now gcsa_ instead of .gcsa_. This avoids leaving large hidden files behind if construction fails.
  • Construction is aborted if reading/writing temporary files fails.
  • Tools now display the version of GCSA2.
  • Fixed a buffer overflow in LCP construction.

GCSA2 v0.8

01 Oct 18:46

This is a performance update.

  • New encoding for the FM-index that makes it as fast as the FM-index in BWA.
  • Kmer comparison: report the size of the symmetric difference between two indexes, and optionally output the kmers specific to one of the indexes.
  • The LCP array file now has a proper header.
  • Due to changes in file formats, old indexes must be rebuilt.
  • Headers are now located under include/gcsa.

GCSA2 v0.7

16 Aug 13:40

This is a major construction update.

  • Faster index construction due to simplified disk I/O.
  • The index is now based on maximally pruned de Bruijn graphs, which are more intuitive and slightly smaller than the non-maximally pruned graphs in the earlier versions.
  • Overlapping subgraphs (e.g. a pruned variation graph and the reference path) can be indexed in separate files without excessive memory usage.
  • The Alphabet object is now a property of InputGraph, not a GCSA construction parameter.
  • Verbosity level can be changed at runtime with Verbosity::set().
  • GCSA2 now compiles with an OpenMP-enabled Clang compiler. Index construction is slower than with g++ due to the lack of multi-threaded sorting.

GCSA2 v0.6.1

16 Mar 13:40

This is a quick bug fix.

  • STNode::lcp() now returns the string depth of the node itself, not of its parent.
  • String depths can also be determined by using LCPArray::depth().

GCSA2 v0.6

14 Mar 15:02

This is a major functionality update. It adds support for the following operations:

  • Counting queries determining the number of distinct start nodes in a lexicographic range of path nodes. The solution is based on a generalization of Sadakane's document counting structure.
  • Parent queries in the suffix tree in order to find maximal exact matches quickly. The solution is based on a range minimum tree over the LCP array, which can also be used to add support for other suffix tree operations. (The lack of inverse suffix array functionality in GCSA prevents us from making the index fully equivalent to a suffix tree.)
  • Counting the number of distinct kmers in the index. This is primarily useful for determining how much space is saved by pruning the de Bruijn graph. The same approach can be used for e.g. comparing two indexes based on the kmers they contain.
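
The idea behind parent queries can be illustrated on an ordinary suffix array, which is simpler than the path graph GCSA2 actually indexes. In the sketch below, a suffix tree node is a closed lexicographic range [sp, ep], lcp[i] is the length of the longest common prefix of the suffixes at ranks i - 1 and i, and lcp[0] = 0. Linear scans stand in for the range minimum tree, and all names (Range, lcp_at, string_depth, parent) are hypothetical rather than taken from the GCSA2 interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Closed lexicographic range [sp, ep] of suffix array ranks.
using Range = std::pair<std::size_t, std::size_t>;

// LCP value at rank i, with 0 past the end of the array.
std::size_t lcp_at(const std::vector<std::size_t>& lcp, std::size_t i)
{
  return (i < lcp.size() ? lcp[i] : 0);
}

// String depth of an internal node: the minimum of lcp[sp + 1..ep].
std::size_t string_depth(const std::vector<std::size_t>& lcp, Range node)
{
  assert(node.first < node.second && node.second < lcp.size());
  return *std::min_element(lcp.begin() + node.first + 1, lcp.begin() + node.second + 1);
}

// Parent of a node. The parent's string depth is the larger of the LCP
// values at the two boundaries of the range; the range is then extended
// in both directions while the LCP values stay at least that large.
// The root is returned unchanged.
Range parent(const std::vector<std::size_t>& lcp, Range node)
{
  assert(node.first <= node.second && node.second < lcp.size());
  const std::size_t depth = std::max(lcp_at(lcp, node.first), lcp_at(lcp, node.second + 1));
  std::size_t sp = node.first, ep = node.second;
  while (sp > 0 && lcp[sp] >= depth) { sp--; }
  while (ep + 1 < lcp.size() && lcp[ep + 1] >= depth) { ep++; }
  return Range(sp, ep);
}
```

For the text banana$, the LCP array is 0, 0, 1, 3, 0, 0, 2. The node for "ana" is the range [2, 3] with string depth 3, and its parent is the range [1, 3] for "a" with string depth 1. A range minimum structure answers the same queries without scanning the array.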

Other things to consider:

  • Index construction uses somewhat more time and memory due to the new structures.
  • Index size has increased by 10-15% (without the LCP array) or by about 50% (with the LCP array).
  • Index file format has changed. Old indexes can no longer be used and must be rebuilt, as a conversion tool would not be much more efficient than rebuilding them.

GCSA2 v0.5

14 Dec 10:55

This is the first actual release of GCSA2. The indexes are smaller than in the earlier releases, and the interfaces have been frozen and documented.

  • The final index is typically 25% to 30% smaller than before. This is caused by more aggressive pruning of the de Bruijn graph.
  • Index construction is slightly faster than before due to asynchronous reading of temporary files and a faster de Bruijn graph implementation.
  • The construction interface, the high-level query interface, and the low-level query interface have been frozen and documented.
  • The index file format has changed. Old indexes can be converted to the new format by using the convert_gcsa tool.

GCSA2 v0.4

22 Nov 14:36
Pre-release

This is the fourth pre-release of GCSA2. The memory usage of index construction is now significantly lower, at the price of 2x longer construction time.

  • The construction algorithm is now disk-based. It keeps the graphs on disk and loads at most one chromosome at a time into memory.
  • Graph order is no longer a hard limit for query length. Longer queries may still result in false positives, however.
  • Headers have been reorganized into public headers (gcsa.h, files.h, support.h, and utils.h) and internal headers (path_graph.h, dbg.h, internal.h).