Skip to content

suffix array construction and searching algorithms for in-memory binary data.

License

Notifications You must be signed in to change notification settings

hucsmn/suffix_array

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Suffix array

Suffix array construction and searching algorithms for in-memory binary data.

To index plain texts, burntsushi's suffix featuring utf-8 support is a better choice.

This crate uses the Amos Wenger's C bindings to Yuta Mori's dissufsort to construct suffix array, which is the fastest known suffix array construction algorithm (SACA) running in single thread that uses merely O(1) additional workspace.

In addition, I have implemented the space-efficient parallel SACA pSACAK in a separate crate. For now, it has not been thoroughly benchmarked as well as optimized.

TODO

  • Benchmark using Pizza&Chili Corpus.

  • Rewrite the benchmarks.

  • Speed up searching by LCP array, a.k.a. construct enhanced suffix array (ESA). There are two major classes of LCP array construction algorithms (LACA): The first class produces LCP array from suffix array. AFAIK, these LACAs have to allocate additional auxiliary arrays (such as ISA, PLCP, and BWT) to construct the LCP array. The second class constructs the suffix array together with its LCP array. They are fast and space-efficient but require sophisticated modifications on kind of SACAs based on induce copying (such as sais-lite, saca-k, and divsufsort). This crate would not provide ESA support until I figure out a proper way to implement it.

  • Speed up searching by bucket pointers, inspired by this paper (which uses a trie as index, interleaved arrays and text fingerprints to improve locality for ESA searching on disk). See SuffixArray::enable_buckets().

  • Add compressed suffix array (CSA) support. CSA safcrificed the speed quite a lot to gain some space efficiency, which is not that necessary.

  • Serialization/deserialization. Enable the optional pack feature to use those APIs. This feature is based on Paul Masurel's bitpacking.

  • Rewrite suffix array construction algorithm. Currently, this crate uses dissufsort to construct all the suffix arrays.

  • Look into other one's efforts on improving libdivsufsort and multikey quick sort: 1, 2, and 3.

About

suffix array construction and searching algorithms for in-memory binary data.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages