Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Fix ScalableBloomFilter and tests on top of #5 #6

Merged
merged 3 commits into from Dec 27, 2012

Conversation

Projects
None yet
3 participants
Contributor

etrepum commented Dec 26, 2012

This takes the branch from pr #5 that fixes BF implementation, fixes SBF, and makes the tests pass again.

glangford and others added some commits Dec 13, 2012

@glangford glangford Suboptimal allocation of BloomFilter bitarray size
Empirical measurement of the false positive rate of BloomFilter is
substantially lower that the requested error rates. For example, a
BloomFilter with capacity 20,000 and an error rate of .01, gives actual
error rate .0003. This is because the bit size allocated to the filter
is unnecessarily large, and does not match the theoretically calculated
required size.
808bce0
@glangford glangford Add performance benchmarks for BloomFilter; test code to demonstrate …
…size bug
dcb8e49
@etrepum etrepum Bump to 2.0. Fix ScalableBloomFilter and make the tests resilient to …
…FP error.
64e42c8
Contributor

etrepum commented Dec 26, 2012

The bad news is that ScalableBloomFilter is definitely implemented incorrectly, if you do the math it converges to a much higher error rate than it's supposed to, because the first filter starts off with the target error rate so it can only get worse from there since checking each filter is an independent event. It appears that this bug has also propagated to other implementations inspired by this one [1]. Should probably tell them once it's fixed.

Per the paper and other implementations [2] the error rate for each bloom filter in the SBF has to be calculated much differently. Specifically the sum([err * (1-ratio) * pow(ratio, n) for n in [0..inf]]) should converge to err. The implementation in [2] calculates the error rate of the next filter by just doing a multiply of the previous error rate, which is probably what we should do too.

(err is 0.001 in this case, with ratio of 0.9)

>>> sum([0.001*(1-0.9)*pow(0.9, n) for n in xrange(0,328)])
0.0009999999999999994

[1] http://code.google.com/p/scalable-bloom-filter/source/browse/trunk/trunk/scalable-bloom-filter/src/main/java/com/elaunira/sbf/ScalableBloomFilter.java
[2] https://github.com/basho/riak_core/blob/master/src/bloom.erl

@ghost ghost assigned jaybaird Dec 27, 2012

@jaybaird jaybaird added a commit that referenced this pull request Dec 27, 2012

@jaybaird jaybaird Merge pull request #6 from etrepum/master
Fix ScalableBloomFilter and tests on top of #5
abfccef

@jaybaird jaybaird merged commit abfccef into jaybaird:master Dec 27, 2012

Contributor

glangford commented on pybloom/benchmarks.py in 64e42c8 Dec 31, 2012

The original code was correct:

for i in xrange(f.capacity, f.capacity + trials):

The +1 here is incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment