A general purpose counting filter: making every bit count by Pandey et al., SIGMOD’17
A Rank and Seek Quotient Filter (RSQF) is an Approximate Membership Query data structure. It is similar to the popular Bloom Filter (BF) however where a BF only provides insert, and lookup.
An RSQF provides:
- insert
- delete
- lookup
- resize1
- merge
1 - resize does not permit an increase in p (e.g. unique identity capacity). It does however increase the total number of available slots. This provides two benefits:
- Faster queries as the remainders can be aligned closer to their home slot with shorter runs.
- Increase in the number of entries allowing for deletion.
This implementation of RSQF has a fixed error rate of 1/512 or ~0.2%. This was done so that each block is ensured to be contiguous in memory.
r is therefore fixed at 9 (r = log2(1/0.001953)
). q will vary relative to p
(e.g. for 1m entries p = log2(1,000,000/0.001953)
).
A filter that accepts 1 million entries will require ~11.67 bits per entry and require approximately 1.46MB (Si) of memory.
- Rank
- Select
- Hash (fnv, might consider Murmur3, CityHash, or xxHash)
- FirstAvailableSlot
- Insert
- MayContain
- Resize
- Merge
- Count (CQF)
The following table shows the approximate sizing for this RSQF implementation at various values of n.
n | δ | p | r | q | Q size |
---|---|---|---|---|---|
100,000 | 1/512 | 26 | 9 | 17 | 184.32 KB |
1,000,000 | 1/512 | 29 | 9 | 20 | 1.47 MB |
10,000,000 | 1/512 | 33 | 9 | 24 | 23.59 MB |
100,000,000 | 1/512 | 36 | 9 | 27 | 188.74 MB |
1,000,000,000 | 1/512 | 39 | 9 | 30 | 1.51 GB |
This glossary summarises the variables specified in the stonybrook paper.
Note: Where integers are specified, fractional values are rounded up to the nearest integer when calculated from floating point values.
- n - (integer)
- Maximum number of insertions (e.g. 1,000,000).
- δ - (fraction)
- Error rate or false-positive rate (e.g. 1/512 or 1/100).
- p - (integer)
- Number of bits required from the hashed input to achieve the target error for the given number of insertions (n). The p-bit hash is split into high bits (quotient) and low bits (remainder).
- r / remainder - (integer)
- The number of remainder bits which are written to `Q.remainders`.
- q / quotient - (integer)
- The number of quotient bits used to indicate the expected `home slot` in the filter.
- run
- A run is a consecutive group of remainders where the quotient is equal.
- occupied - (bit)
- A bit that indicates the position of a home slot for a given `run`.
- runend - (bit)
- A bit that indicates the end of a `run`.
- Q - (struct)
- The RSQF data structure which contains 2q r-bits of available space allocated by a `block` array. The memory in bits required by the struct can be calculated as follows:
- Q.occupieds - (bit vector)
- Q.runends - (bit vector)
- Q.remainders - (bit vector)
- block
- A block is `64(r + 2) + 8` bit structure. It is composed of the following fields:
- offset - 1 x 8-bit.
- runends - 1 x 64-bit.
- occupieds - 1 x 64-bit.
- remainders - r x 64-bit.
- home slot - (array index)
- The home slot is the location where a remainder would be placed if h0(x) is unoccupied by another value.
- slot i - (array index)
- h(x) - (integer)
- A universal hashing function. For this library FNV-1a (64-bit) was employed as it is available in the standard library.
- h0(x) / i - (integer)
- The masked upper bits of the hash shifted right `r` times.
- h1(x) - (integer)
- The masked lower half of the hash.
- B - (bit vector)
- Variable representing a bit-vector. Typically one of `Q.occupieds`, `Q.runends`, or `Q.remainders`.
- RANK(B, i) - (integer)
- Rank returns the number of 1s in B up to position i.
- SELECT(B, i) - (integer)
- Select returns the index of the ith 1 in B.
- Oi - (integer)
- Is every 64th slot which is stored in `Q[i].offset` to save space. The offset is calculated with the algorithm that follows.
- Oj - (integer)
- Oj is a derived intermediate slot value which is discovered using the algorithm that follows.
p = log2(n/δ)
r = log2(1/δ)
q = p - r
h0(a) = h0(b) = h0(c)
2^q/64 * (8+64(r+2))
i = h0(x); Q[i].remainders == h1(x)
i = h0(x)
h0 = (h(x) >> r) & (2^q - 1)
h1 = h(x) & (2^r - 1)
RANK(B, i) =
SELECT(B, i) =
Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i
i = h0(x) Oi = SELECT(Q.runends, RANK(Q.occupieds, i)) - i d = RANK(q.occupieds[i + 1, ..., j], j - i - 1) t = SELECT(Q.runends[i + Oi + 1,...,2^q - 1], d) Oj = i + Oi + t - j