Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Home

hyan edited this page · 23 revisions
Clone this wiki locally

Kamikaze is a utility package wrapping set implementations on document lists. It also implements P4Delta compression algorithm for sorted integer segments to enable Inverted List compression for search engines like Lucene.

PForDelta is a compression method that was originally proposed by S. Heman 1. It was not originally designed to compress inverted indexes of search engines. It was first used in 2008 by WestLab (Web Exploration and Search Technology Lab at Polytechnic Institute of New York University) to compress inverted indexes of search engines 2. Recently, WestLab 3 proposed several other optimizations on PForDelta and claimed that optimized PForDelta can achieve a better tradeoff between the compressed size and the decompression speed than other inverted index compression methods, for example, Rice coding, Variable-byte coding, Gamma coding, Interpolative coding, etc.

The beauty of PForDelta is that it supports extremely fast decompression while also achieving a small compressed size. The basic idea is as follows 3: in order to compress a block of k numbers, say, 128 numbers, it first determines a value b such that most of the 128 values to be encoded (say, 90%) are less than 2^b and thus fit into a fixed bit field of b bits each. The remaining values, called exceptions, are coded separately. If we apply PForDelta to blocks containing some multiple of 32 values, then decompression involves extracting groups of 32 b-bit values, and finally patching the result by decoding a smaller number of exceptions. This process can be implemented extremely efficiently by providing, for each value of b, an optimized method for extracting 32 b-bit values from b memory words. PForDelta can be modified and tuned in various ways by choosing different thresholds for the number of exceptions allowed, and by encoding the exceptions in different ways.

For details about PForDelta and its several optimized variations, please refer to:

1 S. Heman. Super-scalar database compression between RAM and CPU-cache. MS Thesis, Centrum voor Wiskunde en Informatica, Amsterdam, Netherlands, July 2005, (The original PForDelta)

2 J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. of the 17th Int. World Wide Web Conference, April 2008., (A variation of the original PForDelta, which is implemented by the WestLab, and used to compress inverted indexes of search engines).

3 Inverted Index Compression and Query Processing with Optimized Document Ordering, (Several other further optimized PForDelta, for example, NewPFD and OptPFD, which are also implemented by WestLab, and used to compress inverted indexes of search engines).

Something went wrong with that request. Please try again.