
reduce FP, less memory on marshalling, correct error-rate formula #4

Merged
merged 1 commit into from
Dec 18, 2020

Conversation

@holiman (Owner) commented Dec 18, 2020

This PR does quite a few things. First of all, it reduces the false-positive rate quite dramatically: it turned out that when several filters were used, they were correlated.

Reduction of FP rate

Originally, using a 500Mb bloom filter, and filling it with 100M items, the values were:

  • Fill ratio: 9.303647 %
  • Theoretical hitrate : 0.007492 % -- this means that if each bit has a ~9% chance of being set, then a random key should have a 0.0075% chance of hitting all four filters (a false positive).
  • Actual false positive rate (on 100K random tests): 2.658000 % (2658 out of 100000)

Additionally, it was the case that

  • The hit rate of filter 1 (out of four) (in 100K random tests): 9.456000 % (9456 out of 100000) -- which matches the expected 9.3%. However:
  • 1-filter Hit rate: 53.489848 % (5058 out of 9456)
    • If the first filter hit, then the chance that the second filter would also hit went up to >50%. That's quite a correlation.

So the correlation caused the error rate to not be 0.0075% at all, but rather 2.6% -- 346 times larger!

This PR introduces some rotation to make better use of the full width of the input hash during hashing, and also reuses the vector on each iteration. With these changes:
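The rotation idea can be sketched as follows. This is a hypothetical illustration, not the PR's exact code: each probe rotates the 64-bit hash before reducing it modulo the filter size, so different bits of the input decide each index, instead of every probe reusing the same low bits (which is what correlates the filters).

```go
package main

import (
	"fmt"
	"math/bits"
)

// indices derives k bit positions in [0, m) from one 64-bit hash.
// Hypothetical sketch of the rotation scheme described above: rotating
// between probes spreads the full width of the hash across the indices.
func indices(hash uint64, k int, m uint64) []uint64 {
	out := make([]uint64, 0, k)
	rot := 64 / k // e.g. rotate by 16 bits when k == 4
	for i := 0; i < k; i++ {
		out = append(out, hash%m)
		hash = bits.RotateLeft64(hash, rot)
	}
	return out
}

func main() {
	// Example hash and a 2^29-bit filter size (both arbitrary).
	fmt.Println(indices(0xdeadbeefcafef00d, 4, 1<<29))
}
```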

  • Fill ratio: 9.303936 % -- same as before, obviously
  • Theoretical hitrate : 0.007493 % -- same as before, obviously
  • Hit rate (100K random tests): 0.009000 % (9 out of 100000)
    • Which is well in line with the expected rate

Additionally, it can now be seen that the correlation is (at least mostly) gone:

  • Zero-filter Hit rate (100K random tests): 9.373000 % (9373 out of 100000)
  • 1-filter Hit rate: 9.474021 % (888 out of 9373)

Speed-wise, this does incur a little penalty:

name                                    old time/op    new time/op    delta
AddX10kX5/add-10kx5-6                      101ns ±19%      99ns ± 9%     ~     (p=0.643 n=5+5)
AddX10kX5/add-10kx5-hash-6                69.8ns ± 3%    70.1ns ± 6%     ~     (p=0.730 n=5+5)
Contains1kX10kX5/contains-6               71.4ns ±15%    76.0ns ± 7%     ~     (p=0.222 n=5+5)
Contains1kX10kX5/containsHash-6           44.3ns ± 5%    48.4ns ± 1%   +9.30%  (p=0.008 n=5+5)
Contains100kX10BX20/contains-6             166ns ± 2%     167ns ± 6%     ~     (p=0.683 n=5+5)
Contains100kX10BX20/containshash-6         128ns ± 4%     118ns ± 1%   -7.36%  (p=0.008 n=5+5)
UnionInPlace/union-8-6                    88.6µs ± 9%    88.1µs ± 4%     ~     (p=1.000 n=5+5)
Contains94percentMisses/contains-6        86.3ns ±26%    99.9ns ±32%     ~     (p=0.175 n=5+5)
Contains94percentMisses/containsHash-6    42.4ns ± 3%    56.8ns ± 2%  +34.04%  (p=0.008 n=5+5)
Write1Mb-6                                6.57ms ± 5%    6.38ms ± 6%     ~     (p=0.310 n=5+5)

name                                    old alloc/op   new alloc/op   delta
AddX10kX5/add-10kx5-6                      8.00B ± 0%     8.00B ± 0%     ~     (all equal)
AddX10kX5/add-10kx5-hash-6                 0.00B          0.00B          ~     (all equal)
Write1Mb-6                                 877kB ± 0%     877kB ± 0%     ~     (p=0.310 n=5+5)

name                                    old allocs/op  new allocs/op  delta
AddX10kX5/add-10kx5-6                       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
AddX10kX5/add-10kx5-hash-6                  0.00           0.00          ~     (all equal)
Write1Mb-6                                  32.0 ± 0%      32.0 ± 0%     ~     (all equal)

Reduction of memory allocs

Previously, the amount of memory needed to read a bloom filter from disk was at least triple the size of the filter. This has now been reduced to more or less only the size of the bloom filter.

Fix reporting of false-positive rate

The calculation was wrong due to a misplaced parenthesis.
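For reference, the textbook estimate being computed is (1 - e^(-k*n/m))^k for a filter with m bits, n inserted items and k hash functions; misplacing a parenthesis (e.g. exponentiating the wrong subexpression) skews it wildly. The sketch below shows the correctly parenthesized formula; the exact broken variant from the original code is not reproduced here, and the 500 MB size is an assumption for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate is the standard bloom-filter estimate
// (1 - e^(-k*n/m))^k: the inner term is the expected fill ratio,
// raised to the number of hash functions.
func falsePositiveRate(k, n, m float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	// Roughly the setup from the PR description: a ~500 MB filter
	// (in bits), 100M items, four hash functions.
	m := 500.0 * 1024 * 1024 * 8
	fmt.Printf("estimated FP rate: %f %%\n", falsePositiveRate(4, 100e6, m)*100)
}
```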

@holiman holiman merged commit 242f565 into master Dec 18, 2020
@holiman (Owner, Author) commented Dec 18, 2020

I tested this against the upstream (archived) steakknife implementation -- the results were:

  • Error rate: 99.821256 % (well, wrong formula)
  • Fill ratio: 9.303879 %
  • Theoretical hitrate : 0.007493 %
  • Hit rate (100K random tests): 2.633000 % (2633 out of 100000)

So, 2633 false positives instead of ~9.

@rjl493456442

The result looks pretty cool. However, I don't understand why it makes such a big difference.
All the keys are generated randomly, so why would that introduce the correlation?
