Flaw in double hash encoding of n-grams #33

hardbyte · 2018-01-09T03:16:53Z

There is a flaw in the double hashing scheme: if hash_2 mod l = 0, then only one bit will be set in the bloom filter irrespective of the value k (the probability of this happening is 1/l = 1/1024 !!!)

wilko77 · 2018-02-07T05:58:07Z

In the double hash encoding scheme, the hash values for k hash functions H_j are derived from
a linear combination of the hash values of two hash functions g and h.

H_j(b) := (g(b) + j·h(b)) mod l,   j ∈ {0, . . . , k −1}

where l denotes the length of the Bloom filter.

As pointed out above, if h(b) mod l = 0, then H_j(b) = H_i(b), i,j ∈ {0, . . . , k −1}. Or in words, the n-gram b will only be encoded in one bit, irrespective of the value of k.

Discussion

The probability of h(b) mod l = 0 is 1/l. What are the consequences?

Security

The difference between setting one or k bits will most likely have a detectable influence on the generated Bloom filter.

If this artifact can be exploited is, to the best of my knowledge, so far unexplored.

The set of all realistically possible n-grams for an identifier is often not very big. Thus, in those cases, one can expect that only very few n-grams will hash to zero (mod l). Let's assume that there is exactly one special n-gram. Then, knowing which Bloom filters contain this n-gram, combined with knowledge of the underlying distribution, will give the attacker an advantage over the case where all n-grams set the same amount of bits in the Bloom filters.
(I want to stress that this is not, by any means, a fully thought trough attack, but rather a sketch of an idea.)

Matching accuracy

If one n-gram of an input value gets encoded by just one bit whereas all other n-grams are represented by k bits in the Bloom filter, then we end up with an undesirable distortion of the individual importance of each n-gram. This surely cannot be beneficial for good matching accuracy, but if it has a measurable impact has to be investigated.

Conclusion

Both, the effects on security and on accuracy are so far under-explored. However, there are indications that there might be negative effects, which leads to the conclusion that we should err on the side of caution and either show that these concerns are unfounded, or modify the double hashing scheme.

Possible remedies

The problem here is that h(b) mod l can be zero in some cases.
One obvious approach is to make sure that this does not happen.
Something like this:

  x = h(b) mod l
  while x == 0:
    x = h(b + i) mod l
    i = i + 1

if h is a cryptographic hash function, then we can expect every output bit to change with equal probability for even the slightest input change. Thus, the probability that we loop k times is 1/(l^k).

wilko77 · 2018-02-21T04:32:13Z

closed via #52

hardbyte added bug security labels Jan 9, 2018

hardbyte mentioned this issue Jan 9, 2018

Attack on Bloom filters built from multiple identifiers data61/anonlink#40

Closed

hardbyte added this to the Sprint 2018-01-29 milestone Jan 24, 2018

hardbyte assigned wilko77 Jan 30, 2018

hardbyte modified the milestones: Sprint 2018-01-29, Sprint 2018-02-12 Feb 9, 2018

wilko77 mentioned this issue Feb 16, 2018

Feature always double encoding #52

Merged

wilko77 closed this as completed Feb 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaw in double hash encoding of n-grams #33

Flaw in double hash encoding of n-grams #33

hardbyte commented Jan 9, 2018

wilko77 commented Feb 7, 2018

wilko77 commented Feb 21, 2018

Flaw in double hash encoding of n-grams #33

Flaw in double hash encoding of n-grams #33

Comments

hardbyte commented Jan 9, 2018

wilko77 commented Feb 7, 2018

Discussion

Security

Matching accuracy

Conclusion

Possible remedies

wilko77 commented Feb 21, 2018