Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaw in double hash encoding of n-grams #33

Closed
hardbyte opened this issue Jan 9, 2018 · 2 comments
Closed

Flaw in double hash encoding of n-grams #33

hardbyte opened this issue Jan 9, 2018 · 2 comments
Assignees

Comments

@hardbyte
Copy link
Collaborator

hardbyte commented Jan 9, 2018

From @wilko77:

There is a flaw in the double hashing scheme: if hash_2 mod l = 0, then only one bit will be set in the bloom filter irrespective of the value k (the probability of this happening is 1/l = 1/1024 !!!)

@wilko77
Copy link
Collaborator

wilko77 commented Feb 7, 2018

In the double hash encoding scheme, the hash values for k hash functions H_j are derived from
a linear combination of the hash values of two hash functions g and h.

H_j(b) := (g(b) + j·h(b)) mod l,   j ∈ {0, . . . , k −1}

where l denotes the length of the Bloom filter.

As pointed out above, if h(b) mod l = 0, then H_j(b) = H_i(b), i,j ∈ {0, . . . , k −1}. Or in words, the n-gram b will only be encoded in one bit, irrespective of the value of k.

Discussion

The probability of h(b) mod l = 0 is 1/l. What are the consequences?

Security

The difference between setting one or k bits will most likely have a detectable influence on the generated Bloom filter.

If this artifact can be exploited is, to the best of my knowledge, so far unexplored.

The set of all realistically possible n-grams for an identifier is often not very big. Thus, in those cases, one can expect that only very few n-grams will hash to zero (mod l). Let's assume that there is exactly one special n-gram. Then, knowing which Bloom filters contain this n-gram, combined with knowledge of the underlying distribution, will give the attacker an advantage over the case where all n-grams set the same amount of bits in the Bloom filters.
(I want to stress that this is not, by any means, a fully thought trough attack, but rather a sketch of an idea.)

Matching accuracy

If one n-gram of an input value gets encoded by just one bit whereas all other n-grams are represented by k bits in the Bloom filter, then we end up with an undesirable distortion of the individual importance of each n-gram. This surely cannot be beneficial for good matching accuracy, but if it has a measurable impact has to be investigated.

Conclusion

Both, the effects on security and on accuracy are so far under-explored. However, there are indications that there might be negative effects, which leads to the conclusion that we should err on the side of caution and either show that these concerns are unfounded, or modify the double hashing scheme.

Possible remedies

The problem here is that h(b) mod l can be zero in some cases.
One obvious approach is to make sure that this does not happen.
Something like this:

  x = h(b) mod l
  while x == 0:
    x = h(b + i) mod l
    i = i + 1

if h is a cryptographic hash function, then we can expect every output bit to change with equal probability for even the slightest input change. Thus, the probability that we loop k times is 1/(l^k).

@wilko77
Copy link
Collaborator

wilko77 commented Feb 21, 2018

closed via #52

@wilko77 wilko77 closed this as completed Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants