You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
There is a flaw in the double hashing scheme: if hash_2 mod l = 0, then only one bit will be set in the bloom filter irrespective of the value k (the probability of this happening is 1/l = 1/1024 !!!)
The text was updated successfully, but these errors were encountered:
In the double hash encoding scheme, the hash values for k hash functions H_j are derived from
a linear combination of the hash values of two hash functions g and h.
H_j(b) := (g(b) + j·h(b)) mod l, j ∈ {0, . . . , k −1}
where l denotes the length of the Bloom filter.
As pointed out above, if h(b) mod l = 0, then H_j(b) = H_i(b), i,j ∈ {0, . . . , k −1}. Or in words, the n-gram b will only be encoded in one bit, irrespective of the value of k.
Discussion
The probability of h(b) mod l = 0 is 1/l. What are the consequences?
Security
The difference between setting one or k bits will most likely have a detectable influence on the generated Bloom filter.
If this artifact can be exploited is, to the best of my knowledge, so far unexplored.
The set of all realistically possible n-grams for an identifier is often not very big. Thus, in those cases, one can expect that only very few n-grams will hash to zero (mod l). Let's assume that there is exactly one special n-gram. Then, knowing which Bloom filters contain this n-gram, combined with knowledge of the underlying distribution, will give the attacker an advantage over the case where all n-grams set the same amount of bits in the Bloom filters.
(I want to stress that this is not, by any means, a fully thought trough attack, but rather a sketch of an idea.)
Matching accuracy
If one n-gram of an input value gets encoded by just one bit whereas all other n-grams are represented by k bits in the Bloom filter, then we end up with an undesirable distortion of the individual importance of each n-gram. This surely cannot be beneficial for good matching accuracy, but if it has a measurable impact has to be investigated.
Conclusion
Both, the effects on security and on accuracy are so far under-explored. However, there are indications that there might be negative effects, which leads to the conclusion that we should err on the side of caution and either show that these concerns are unfounded, or modify the double hashing scheme.
Possible remedies
The problem here is that h(b) mod l can be zero in some cases.
One obvious approach is to make sure that this does not happen.
Something like this:
x = h(b) mod l
while x == 0:
x = h(b + i) mod l
i = i + 1
if h is a cryptographic hash function, then we can expect every output bit to change with equal probability for even the slightest input change. Thus, the probability that we loop k times is 1/(l^k).
From @wilko77:
The text was updated successfully, but these errors were encountered: