Java implementation of a probabilistic set data structure
Pull request Compare This branch is 28 commits behind Cue:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
AUTHORS
LICENSE
README.md
pom.xml

README.md

greplin-bloom-filter

Greplin Bloom Filter

An (optionally persistent) counting Bloom Filter implementation in Java.

Status:

This is a very early stage project. It works for our needs. We haven't verified it works beyond that. Issue reports and patches are very much appreciated!

For example, some obvious improvements include

  • Support for a variable number of count bits (including 1)

  • More efficient thread safety

  • More hash functions to choose between

  • More efficient disk persistence

Pre-requisites:

Maven

Installation

git clone https://github.com/Greplin/greplin-bloom-filter.git

cd greplin-bloom-filter

mvn install

Implementation details

  • This is a counting bloom filter that uses 4-bits per bucket. Once the count exceeds 15 items in a bucket, decrements are no longer possible.

  • Instead of using N distinct hashes, we use two linear combinations of two runs of a repeated murmur hash per Kirch and Mitzenmacher.

Usage

// create a new bloom filter with the desired set of properties
final int expectedNumberOfItems = 10000;
final double desiredFalsePositiveRate = 0.00001;
BloomFilter bf = BloomFilter.createOptimal("/tmp/bloom.dat", expectedNumberOfItems, desiredFalsePositiveRate, true);

// test out the bloom filter
bf.add("Hello World".getBytes());
System.out.println(bf.contains("Hello World".getBytes()));
System.out.println(bf.contains("Foo Bar".getBytes()));

// persist it to disk (note that it is only persisted to disk when you call flush)
bf.flush()

// try removing an item
bf.remove("Hello World".getBytes());
System.out.println(bf.contains("Hello World".getBytes()));

// close the bloom filter (which also persists any unflushed changes)
bf.close();

Authors

Greplin, Inc.