reduce FP, less memory on marshalling, correct error-rate formula #4
This PR does quite a few things. First of all, it reduces the false-positive rate quite dramatically. It turned out that when several filters were used, they were correlated.
Reduction of FP rate
Originally, using a 500Mb bloom filter, and filling it with 100M items, the values were:

Chance of a given bit being set: 9.303647 %
Expected false-positive rate: 0.007492 % -- this means that if there is a 9% chance of a bit being set, then there should be a 0.0075% chance of hitting all four of them on a random false positive.
Measured false-positive rate: 2.658000 % (2658 out of 100000)

Additionally, it was the case that the chance of a single probe hitting a set bit was 9.3%. However, once one probe had hit, the chance of the next one also hitting was >50%. That's quite a correlation.

So the correlation caused the error rate to not be 0.0075% at all, but rather 2.6% -- 346 times larger!

This PR introduces some rotation to make better use of the full width of the input hash during hashing, and also reuses the vector on each iteration. With these changes:

Chance of a given bit being set: 9.303936 % -- same as before, obviously
Expected false-positive rate: 0.007493 % -- same as before, obviously
Measured false-positive rate: 0.009000 % (9 out of 100000)

Additionally, it can now be seen that the correlation is (at least mostly) gone:

First probe hits a set bit: 9.373000 % (9373 out of 100000)
Next probe also hits, given that the first did: 9.474021 % (888 out of 9373)

Speed-wise, this does incur a little penalty.
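The rotation idea described above can be illustrated with a minimal sketch. This is not the code from this PR: the names `rotl64` and `probe_positions`, the per-probe `keys`, and the choice of rotation step are all assumptions made for illustration. The point is only that rotating the 64-bit input hash by a different amount for each probe lets each probe draw on a different part of the hash, instead of every probe differing from the others by a fixed XOR constant.

```python
MASK64 = (1 << 64) - 1


def rotl64(x, r):
    """Rotate the 64-bit value x left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def probe_positions(raw_hash, keys, m_bits):
    """Derive one bit position per key from a single 64-bit hash.

    Hypothetical scheme: probe i rotates the hash by i * step bits
    before mixing in its key, so the probes do not all share the
    same low-order bits of the hash.
    """
    step = 64 // max(len(keys), 1)
    positions = []
    for i, key in enumerate(keys):
        mixed = rotl64(raw_hash, i * step) ^ key
        positions.append(mixed % m_bits)
    return positions
```

A call such as `probe_positions(h, keys, m_bits)` with four keys returns four bit positions in `range(m_bits)`. Whether a given scheme really removes the correlation has to be measured, as was done for the figures above.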
Reduction of memory allocs
Previously, the amount of memory needed to read a bloom filter from disk was at least triple the size of the filter. This has now been reduced to more or less only the size of the bloom filter.
Fix reporting of false-positive rate
The calculation was wrong because of a misplaced parenthesis.
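For reference, here is a small sketch of the textbook approximation for a bloom filter's false-positive rate. It is not necessarily the exact expression used in the code, and the function names are made up for the example. It uses the standard symbols: `m_bits` for the filter size in bits, `n_items` for the number of inserted items, and `k` for the number of probes per item.

```python
import math


def expected_fill_ratio(m_bits, n_items, k):
    """Approximate fraction of bits set after inserting n_items."""
    return 1.0 - math.exp(-k * n_items / m_bits)


def false_positive_rate(fill_ratio, k):
    """Chance that k independent probes all land on set bits."""
    return fill_ratio ** k
```

With the measured fill ratio of 9.303647 % and k = 4, `false_positive_rate(0.09303647, 4)` is about 7.49e-5, which is the 0.007492 % quoted in the description. The formula assumes the probes are independent, which is exactly the assumption that the correlation violated.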