FMDummy text indexes library

What is it?

The FMDummy text indexes are fast variants of the FM-index [1], a well-known compressed full-text index by Ferragina and Manzini [2]. We focus more on search speed than space use. One of the novelties is a rank solution with 1 cache miss in the worst case, which (to our knowledge) was not used earlier elsewhere.

The current version handles only the count query (i.e., returns the number of occurrences of the given pattern).

Requirements

The FMDummy text indexes require:

C++11 ready compiler such as g++ version 4.7 or higher
a 64-bit UNIX operating system
text size:
- < 2GB for FMDummy1, FMDummy3 and FMDummyHWT
- < 0.8GB for FMDummy2 with FMD2_SCHEMA_SCBO and FMD2_BPC_3
- < 1.2GB for FMDummy2 with FMD2_SCHEMA_SCBO and FMD2_BPC_4
- < 1.1GB for FMDummy2 with FMD2_SCHEMA_CB and FMD2_BPC_3
- < 1.5GB for FMDummy2 with FMD2_SCHEMA_CB and FMD2_BPC_4
FMDummy2 text limitations are the worst cases for incompressible text, usually they are not so strong

Installation

To download and build the library use the following commands:

git clone https://github.com/mranisz/fmdummy.git
cd fmdummy
make

Usage

To use the FMDummy library:

include "fmdummy/fmdummy.hpp" to your project
compile it with "-faligned-new -std=c++11 -O3 -mpopcnt" options and link it with libraries:
- fmdummy/libfmdummy.a
- fmdummy/libs/libaelf64.a
use "fmdummy" and "shared" namespaces

API

There are several functions you can call on each of the FMDummy text index:

build the index using text file called textFileName:

void build(const char *textFileName);

save the index to file called fileName:

void save(const char *fileName);

load the index from file called fileName:

void load(const char *fileName);

free memory occupied by index:

void free();

get the index size in bytes (size in memory):

unsigned long long getIndexSize();

get the size in bytes of the text used to build the index:

unsigned int getTextSize();

get the result of count query:

unsigned int count(unsigned char *pattern, unsigned int patternLen);

FMDummy1<FMDummy1Type T>

FMDummy1 can be built for up to 16 selected characters from text.

Parameters:

T:
- FMD1_256 - using 256b blocks: 64b of rank data and 192b of text data
- FMD1_512 - using 512b blocks: 64b of rank data and 448b of text data
selectedChars:
- up to 16 ordinal character values, e.g. {'A','C','G','T'}
- {} (default) - all characters from the text

Constructors:

FMDummy1<FMDummy1Type T>();
FMDummy1<FMDummy1Type T>(vector<unsigned char> selectedChars);

FMDummy1Hash<FMDummy1Type T>

FMDummy1Hash is FMDummy1 with hashed k-symbol prefixes of suffixes from suffix array to speed up searches (k ≥ 2). This variant is particularly efficient in speed for short patterns (not much longer than k).

Parameters:

T:
- FMD1_256 - using 256b blocks: 64b of rank data and 192b of text data
- FMD1_512 - using 512b blocks: 64b of rank data and 448b of text data
selectedChars:
- up to 16 ordinal character values, e.g. {'A','C','G','T'}
- {} (default) - all characters from the text
k - length of prefixes of suffixes from suffix array
loadFactor - hash table load factor

Limitations:

pattern length ≥ k (patterns shorter than k are handled by standard variant of FMDummy1 index)
k ≥ 2
0.0 < loadFactor < 1.0

Constructors:

FMDummy1Hash<FMDummy1Type T>(unsigned int k, double loadFactor);
FMDummy1Hash<FMDummy1Type T>(vector<unsigned char> selectedChars, unsigned int k, double loadFactor);

FMDummy2<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

Parameters:

T:
- FMD2_256 - using 256b blocks: 64b of rank data and 192b of encoded text data
- FMD2_512 - using 512b blocks: 64b of rank data and 448b of encoded text data
S:
- FMD2_SCHEMA_SCBO - using SCBO encoding (A. Fariña, G. Navarro, J. Paramá. Boosting text compression with word-based statistical encoding. The Computer Journal, 55(1):111–131, 2012)
- FMD2_SCHEMA_CB - using CB encoding (Sz. Grabowski, M. Raniszewski, S. Deorowicz. FM-index for dummies. arXiv:1506.04896, 2015)
BPC:
- FMD2_BPC_4 - using 4 bits to store the encoded symbol
- FMD2_BPC_3 - using 3 bits to store the encoded symbol

Constructors:

FMDummy2<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>();

FMDummy2Hash<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

FMDummy2Hash is FMDummy2 with hashed k-symbol prefixes of suffixes from suffix array to speed up searches (k ≥ 2). This variant is particularly efficient in speed for short patterns (not much longer than k).

Parameters:

T:
- FMD2_256 - using 256b blocks: 64b of rank data and 192b of encoded text data
- FMD2_512 - using 512b blocks: 64b of rank data and 448b of encoded text data
S:
- FMD2_SCHEMA_SCBO - using SCBO encoding (A. Fariña, G. Navarro, J. Paramá. Boosting text compression with word-based statistical encoding. The Computer Journal, 55(1):111–131, 2012)
- FMD2_SCHEMA_CB - using CB encoding (Sz. Grabowski, M. Raniszewski, S. Deorowicz. FM-index for dummies. arXiv:1506.04896, 2015)
BPC:
- FMD2_BPC_4 - using 4 bits to store the encoded symbol
- FMD2_BPC_3 - using 3 bits to store the encoded symbol
k - length of prefixes of suffixes from suffix array
loadFactor - hash table load factor

Limitations:

pattern length ≥ k (patterns shorter than k are handled by standard variant of FMDummy2 index)
k ≥ 2
0.0 < loadFactor < 1.0

Constructors:

FMDummy2Hash<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>(unsigned int k, double loadFactor);

FMDummy3<FMDummy3Type T>

FMDummy3 is intended for DNA sequences (it searches only for patterns consisting of the symbols A, C, G, T).

Parameters:

T:
- FMD3_512 - using 512b blocks: 128b of rank data and 384b of text data
- FMD3_1024 - using 1024b blocks: 128b of rank data and 896b of text data

Constructors:

FMDummy3<FMDummy3Type T>();

FMDummy3Hash<FMDummy3Type T>

FMDummy3Hash is FMDummy3 with hashed k-symbol prefixes of suffixes from suffix array to speed up searches (k ≥ 2). This variant is particularly efficient in speed for short patterns (not much longer than k).

Parameters:

T:
- FMD3_512 - using 512b blocks: 128b of rank data and 384b of text data
- FMD3_1024 - using 1024b blocks: 128b of rank data and 896b of text data
k - length of prefixes of suffixes from suffix array
loadFactor - hash table load factor

Limitations:

pattern length ≥ k (patterns shorter than k are handled by standard variant of FMDummy3 index)
k ≥ 2
0.0 < loadFactor < 1.0

Constructors:

FMDummy3Hash<FMDummy3Type T>(unsigned int k, double loadFactor);

FMDummyHWT<FMDummyHWTType T, WTDummyType W>

Parameters:

T:
- FMDHWT_512 - using 512b blocks: 64b of rank data and 448b of encoded text data
- FMDHWT_1024 - using 1024b blocks: 64b of rank data and 960b of encoded text data
W:
- WTDummy_2 - using binary Huffman-shaped wavelet tree
- WTDummy_4 - using 4-ary Huffman-shaped wavelet tree
- WTDummy_8 - using 8-ary Huffman-shaped wavelet tree

Constructors:

FMDummyHWT<FMDummyHWTType T, WTDummyType W>();

FMDummyHWTHash<FMDummyHWTType T, WTDummyType W>

FMDummyHWTHash is FMDummyHWT with hashed k-symbol prefixes of suffixes from suffix array to speed up searches (k ≥ 2). This variant is particularly efficient in speed for short patterns (not much longer than k).

Parameters:

T:
- FMDHWT_512 - using 512b blocks: 64b of rank data and 448b of encoded text data
- FMDHWT_1024 - using 1024b blocks: 64b of rank data and 960b of encoded text data
W:
- WTDummy_2 - using binary Huffman-shaped wavelet tree
- WTDummy_4 - using 4-ary Huffman-shaped wavelet tree
- WTDummy_8 - using 8-ary Huffman-shaped wavelet tree
k - length of prefixes of suffixes from suffix array
loadFactor - hash table load factor

Limitations:

pattern length ≥ k (patterns shorter than k are handled by standard variant of FMDummyHWT index)
k ≥ 2
0.0 < loadFactor < 1.0

Constructors:

FMDummyHWTHash<FMDummyHWTType T, WTDummyType W>(unsigned int k, double loadFactor);

FMDummy1 usage example

#include <iostream>
#include <stdlib.h>
#include "fmdummy/shared/patterns.hpp"
#include "fmdummy/fmdummy.hpp"

using namespace std;
using namespace shared;
using namespace fmdummy;

int main(int argc, char *argv[]) {

	unsigned int queriesNum = 1000000;
	unsigned int patternLen = 20;
	vector<unsigned char> selectedChars = {'A', 'C', 'G', 'T'};
	FMDummy1<FMD1_256> *FMD1 = new FMDummy1<FMD1_256>(selectedChars);
	const char *textFileName = "dna";
	const char *indexFileName = "dna-fm1.idx";

	if (fileExists(indexFileName)) {
		FMD1->load(indexFileName);
	} else {
		FMD1->build(textFileName);
		FMD1->save(indexFileName);
	}

	double indexSize = (double)FMD1->getIndexSize();
	cout << "Index size: " << indexSize << "B (" << (indexSize / (double)FMD1->getTextSize()) << "n)" << endl << endl;

	Patterns32 *P = new Patterns32(textFileName, queriesNum, patternLen, selectedChars);
	unsigned char **patterns = P->getPatterns();

	for (unsigned int i = 0; i < queriesNum; ++i) {
		cout << "Pattern |" << patterns[i] << "| occurs " << FMD1->count(patterns[i], patternLen) << " times." << endl;
	}

	delete FMD1;
	delete P;
}

Using other FMDummy indexes is analogous.

Benchmark

Benchmark script countFMDummy.sh should be run with superuser privileges (as the memory cache is cleared before each test). Benchmark results are saved in the "results" folder.

Before launching a benchmark please download the data files from http://ranisz.iis.p.lodz.pl/indexes/data/text_files/200/ and extract them into the main project folder.

External resources used in FMDummy project

Suffix array building by Yuta Mori (sais)
A multi-platform library of highly optimized functions for C and C++ by Agner Fog (asmlib)
A very fast hash function by Yann Collet (xxHash)

References

Sz. Grabowski, M. Raniszewski. FM-index for Dummies. Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation - 13th International Conference, BDAS 2017, Ustroń, Poland, May 30 - June 2, 2017, Proceedings, pp. 189-201.
P. Ferragina, G. Manzini. Opportunistic data structures with applications. In Proc. FOCS, pp. 390–398, IEEE, 2000.

Authors

Szymon Grabowski
Marcin Raniszewski
Sebastian Deorowicz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FMDummy text indexes library

What is it?

Requirements

Installation

Usage

API

FMDummy1<FMDummy1Type T>

FMDummy1Hash<FMDummy1Type T>

FMDummy2<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

FMDummy2Hash<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

FMDummy3<FMDummy3Type T>

FMDummy3Hash<FMDummy3Type T>

FMDummyHWT<FMDummyHWTType T, WTDummyType W>

FMDummyHWTHash<FMDummyHWTType T, WTDummyType W>

FMDummy1 usage example

Benchmark

External resources used in FMDummy project

References

Authors

About

Releases 5

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 98 Commits
libs		libs
results		results
shared		shared
test		test
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
countFMDummy.sh		countFMDummy.sh
fmdummy.hpp		fmdummy.hpp

License

mranisz/fmdummy

Folders and files

Latest commit

History

Repository files navigation

FMDummy text indexes library

What is it?

Requirements

Installation

Usage

API

FMDummy1<FMDummy1Type T>

FMDummy1Hash<FMDummy1Type T>

FMDummy2<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

FMDummy2Hash<FMDummy2Type T, FMDummy2Schema S, FMDummy2BPC BPC>

FMDummy3<FMDummy3Type T>

FMDummy3Hash<FMDummy3Type T>

FMDummyHWT<FMDummyHWTType T, WTDummyType W>

FMDummyHWTHash<FMDummyHWTType T, WTDummyType W>

FMDummy1 usage example

Benchmark

External resources used in FMDummy project

References

Authors

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Languages

Packages