BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters
Ruby C Objective-C
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
benchmark for redis lookups, first query the first bit, if not set then missing, Mar 31, 2011
examples Change BloomFilter class to use Bitset instead of BitSet. Oct 8, 2012
ext/cbloomfilter fix bit padding bug in bucket_set/unset Aug 15, 2013
lib bump version (bugfix) Jan 5, 2015
spec disable deletes on non-counting redis filter Jan 5, 2015
.gitignore changed marshal #dump #load format to be size of in-memory bitmap Dec 5, 2012
.rspec bundlerize + rspec2 cleanup Jan 5, 2011
Gemfile
README.md use -rb require in readme Feb 23, 2013
Rakefile
bloomfilter-rb.gemspec Update to rspec 3 and modern "expect" syntax. Aug 1, 2014

README.md

BloomFilter(s) in Ruby

  • Native (MRI/C) counting bloom filter
  • Redis-backed getbit/setbit non-counting bloom filter
  • Redis-backed set-based counting (+TTL) bloom filter

Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail, check the wikipedia article. Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, ..., k-1). This may or may not give you a good distribution, it all depends on the data.

Performance of the Bloom filter depends on a number of variables:

  • size of the bit array
  • size of the counter bucket
  • number of hash functions

Resources


MRI/C API Example

MRI/C implementation which creates an in-memory filter which can be saved and reloaded from disk.

require 'bloomfilter-rb'

bf = BloomFilter::Native.new(:size => 100, :hashes => 2, :seed => 1, :bucket => 3, :raise => false)
bf.insert("test")
bf.include?("test")     # => true
bf.include?("blah")     # => false

bf.delete("test")
bf.include?("test")     # => false

# Hash with a bloom filter!
bf["test2"] = "bar"
bf["test2"]             # => true
bf["test3"]             # => false

bf.stats
# => Number of filter bits (m): 10
# => Number of filter elements (n): 2
# => Number of filter hashes (k) : 2
# => Predicted false positive rate = 10.87%

Redis-backed setbit/getbit bloom filter

Uses getbit/setbit on Redis strings - efficient, fast, can be shared by multiple/concurrent processes.

bf = BloomFilter::Redis.new

bf.insert('test')
bf.include?('test')     # => true
bf.include?('blah')     # => false

bf.delete('test')
bf.include?('test')     # => false

Memory footprint

  • 1.0% error rate for 1M items, 10 bits/item: 2.5 mb
  • 1.0% error rate for 150M items, 10 bits per item: 358.52 mb
  • 0.1% error rate for 150M items, 15 bits per item: 537.33 mb

Redis-backed counting bloom filter with TTL's

Uses regular Redis get/set counters to implement a counting filter with optional TTL expiry. Because each "bit" requires its own key in Redis, you do incur a much larger memory overhead.

bf = BloomFilter::CountingRedis.new(:ttl => 2)

bf.insert('test')
bf.include?('test')     # => true

sleep(2)
bf.include?('test')     # => false

Credits

Tatsuya Mori valdzone@gmail.com (Original C implementation: http://vald.x0.com/sb/)

License

MIT License - Copyright (c) 2011 Ilya Grigorik