Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False positives #31

Open
dholms opened this issue Aug 11, 2020 · 3 comments
Open

False positives #31

dholms opened this issue Aug 11, 2020 · 3 comments

Comments

@dholms
Copy link

dholms commented Aug 11, 2020

I'm getting a much higher false positive rate than I would expect from a bloom filter of the size that I'm using

I'm using a 1024-bit bloomfilter with 16 hashes and 20 elements in each filter.

I'm running a test which adds 20 elements to a filter, checking before adding each that the item isn't already in the filter.

After running the test 500 times, there are ~4 collisions.

Given a bloom filter with those parameters, there should only be about a 1/1.3 billion chance of collision (https://hur.st/bloomfilter/?n=20&p=&m=1024&k=16)

Here's the short script:

    let collisions = 0
    for(let i=0; i< 500; i++) {
      const filter = new BloomFilter(1024, 16)
      const dict = {}
      for(let j=0; j< 20; j++) {
        const str = Math.floor(Math.random() * 1000000000).toString(16)
        if(filter.test(str) && dict[str] !== true){
          console.log("COLLISION: ", str)
          collisions++
        }
        filter.add(str)
        dict[str] = true
      }
      console.log(i)
    }
    console.log('done: ', collisions)
@jasondavies
Copy link
Owner

Interesting. The only way I was able to fix this was by using k independent hash functions, instead of the current version which uses 2 independent hash functions a and b internally, and combines them to generate k hash functions using:

h_i(d) = a(d) + i * b(d)

As far as I know this is common practice, and the idea is mentioned in Building a Better Bloom Filter. I'm guessing that this optimisation has some effect on the false positive rate as the k hash functions are not completely independent.

@jasondavies
Copy link
Owner

Perhaps @Gopiandcode would be interested in this. He is the author of Bloom filters debunked: Dispelling 30 Years of bad math with Coq!

@nponeccop
Copy link

I'd suggest you try the benchmarking using a CSRNG from crypto instead of Math.random(). With such complex bit patterns, the randomness provided by simple RNGs is often not enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants