
reduce FP, less memory on marshalling, correct error-rate formula #4

Merged
merged 1 commit into from
Dec 18, 2020

Conversation

@holiman (Owner) commented Dec 18, 2020

This PR does quite a few things. First of all, it reduces the false-positive rate quite dramatically: it turned out that when several filters were used, they were correlated.

Reduction of FP rate

Originally, using a 500Mb bloom filter, and filling it with 100M items, the values were:

  • Fill ratio: 9.303647 %
  • Theoretical hitrate : 0.007492 % -- this means that if each bit has a ~9% chance of being set, then a random key should have a 0.0075% chance of hitting all four filters (a false positive).
  • Actual false positive rate (on 100K random tests): 2.658000 % (2658 out of 100000)

Additionally, it was the case that

  • The hit rate of filter 1 (out of four) (in 100K random tests): 9.456000 % (9456 out of 100000) -- which matches the expected 9.3%. However:
  • 1-filter Hit rate: 53.489848 % (5058 out of 9456)
    • If the first filter hit, then the chance that the second filter would also hit went up to >50%. That's quite a correlation.

So the correlation caused the error rate to not be 0.0075% at all, but rather 2.6% -- 346 times larger!

This PR introduces some rotation to make better use of the full width of the input hash during hashing, and also reuses the vector on each iteration. With these changes:
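The rotation idea can be sketched as follows. This is a hypothetical illustration, not the PR's exact code: each probe rotates the 64-bit hash before reducing it modulo the filter size, so different bits of the input decide each index, instead of every probe reusing the same low bits (which is what correlates the filters).

```go
package main

import (
	"fmt"
	"math/bits"
)

// indices derives k bit positions in [0, m) from one 64-bit hash.
// Hypothetical sketch of the rotation scheme described above: rotating
// between probes spreads the full width of the hash across the indices.
func indices(hash uint64, k int, m uint64) []uint64 {
	out := make([]uint64, 0, k)
	rot := 64 / k // e.g. rotate by 16 bits when k == 4
	for i := 0; i < k; i++ {
		out = append(out, hash%m)
		hash = bits.RotateLeft64(hash, rot)
	}
	return out
}

func main() {
	// Example hash and a 2^29-bit filter size (both arbitrary).
	fmt.Println(indices(0xdeadbeefcafef00d, 4, 1<<29))
}
```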

  • Fill ratio: 9.303936 % -- same as before, obviously
  • Theoretical hitrate : 0.007493 % -- same as before, obviously
  • Hit rate (100K random tests): 0.009000 % (9 out of 100000)
    • Which is well in line with the expected rate

Additionally, it can now be seen that the correlation is (at least mostly) gone:

  • Zero-filter Hit rate (100K random tests): 9.373000 % (9373 out of 100000)
  • 1-filter Hit rate: 9.474021 % (888 out of 9373)

Speed-wise, this does incur a little penalty:

name                                    old time/op    new time/op    delta
AddX10kX5/add-10kx5-6                      101ns ±19%      99ns ± 9%     ~     (p=0.643 n=5+5)
AddX10kX5/add-10kx5-hash-6                69.8ns ± 3%    70.1ns ± 6%     ~     (p=0.730 n=5+5)
Contains1kX10kX5/contains-6               71.4ns ±15%    76.0ns ± 7%     ~     (p=0.222 n=5+5)
Contains1kX10kX5/containsHash-6           44.3ns ± 5%    48.4ns ± 1%   +9.30%  (p=0.008 n=5+5)
Contains100kX10BX20/contains-6             166ns ± 2%     167ns ± 6%     ~     (p=0.683 n=5+5)
Contains100kX10BX20/containshash-6         128ns ± 4%     118ns ± 1%   -7.36%  (p=0.008 n=5+5)
UnionInPlace/union-8-6                    88.6µs ± 9%    88.1µs ± 4%     ~     (p=1.000 n=5+5)
Contains94percentMisses/contains-6        86.3ns ±26%    99.9ns ±32%     ~     (p=0.175 n=5+5)
Contains94percentMisses/containsHash-6    42.4ns ± 3%    56.8ns ± 2%  +34.04%  (p=0.008 n=5+5)
Write1Mb-6                                6.57ms ± 5%    6.38ms ± 6%     ~     (p=0.310 n=5+5)

name                                    old alloc/op   new alloc/op   delta
AddX10kX5/add-10kx5-6                      8.00B ± 0%     8.00B ± 0%     ~     (all equal)
AddX10kX5/add-10kx5-hash-6                 0.00B          0.00B          ~     (all equal)
Write1Mb-6                                 877kB ± 0%     877kB ± 0%     ~     (p=0.310 n=5+5)

name                                    old allocs/op  new allocs/op  delta
AddX10kX5/add-10kx5-6                       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
AddX10kX5/add-10kx5-hash-6                  0.00           0.00          ~     (all equal)
Write1Mb-6                                  32.0 ± 0%      32.0 ± 0%     ~     (all equal)

Reduction of memory allocs

Previously, the amount of memory needed to read a bloom filter from disk was at least triple the size of the filter. This has now been reduced to more or less only the size of the bloom filter.

Fix reporting of false-positive rate

The calculation was wrong due to a misplaced parenthesis.
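For reference, the textbook estimate being computed is (1 - e^(-k*n/m))^k for a filter with m bits, n inserted items and k hash functions; misplacing a parenthesis (e.g. exponentiating the wrong subexpression) skews it wildly. The sketch below shows the correctly parenthesized formula; the exact broken variant from the original code is not reproduced here, and the 500 MB size is an assumption for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate is the standard bloom-filter estimate
// (1 - e^(-k*n/m))^k: the inner term is the expected fill ratio,
// raised to the number of hash functions.
func falsePositiveRate(k, n, m float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	// Roughly the setup from the PR description: a ~500 MB filter
	// (in bits), 100M items, four hash functions.
	m := 500.0 * 1024 * 1024 * 8
	fmt.Printf("estimated FP rate: %f %%\n", falsePositiveRate(4, 100e6, m)*100)
}
```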

@holiman holiman merged commit 242f565 into master Dec 18, 2020
@holiman (Owner, Author) commented Dec 18, 2020

I tested this against the upstream (archived) steakknife implementation -- the results were:

  • Error rate: 99.821256 % (well, wrong formula)
  • Fill ratio: 9.303879 %
  • Theoretical hitrate : 0.007493 %
  • Hit rate (100K random tests): 2.633000 % (2633 out of 100000)

So, 2633 false positives instead of ~9.

@rjl493456442

The result looks pretty cool. However, I don't understand why it makes such a big difference.
All the keys are generated randomly, so why would that introduce the correlation?
