BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters
Ruby C Objective-C
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
benchmark
examples
ext/cbloomfilter
lib bump version (bugfix) Jan 5, 2015
spec disable deletes on non-counting redis filter Jan 5, 2015
.gitignore changed marshal #dump #load format to be size of in-memory bitmap Dec 5, 2012
.rspec bundlerize + rspec2 cleanup Jan 5, 2011
Gemfile
README.md use -rb require in readme Feb 23, 2013
Rakefile Reorder requires and add spec prerequisites Mar 6, 2012
bloomfilter-rb.gemspec Update to rspec 3 and modern "expect" syntax. Aug 1, 2014

README.md

BloomFilter(s) in Ruby

  • Native (MRI/C) counting bloom filter
  • Redis-backed getbit/setbit non-counting bloom filter
  • Redis-backed set-based counting (+TTL) bloom filter

Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail, check the wikipedia article. Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, ..., k-1). This may or may not give you a good distribution, it all depends on the data.

Performance of the Bloom filter depends on a number of variables:

  • size of the bit array
  • size of the counter bucket
  • number of hash functions

Resources


MRI/C API Example

MRI/C implementation which creates an in-memory filter which can be saved and reloaded from disk.

require 'bloomfilter-rb'

bf = BloomFilter::Native.new(:size => 100, :hashes => 2, :seed => 1, :bucket => 3, :raise => false)
bf.insert("test")
bf.include?("test")     # => true
bf.include?("blah")     # => false

bf.delete("test")
bf.include?("test")     # => false

# Hash with a bloom filter!
bf["test2"] = "bar"
bf["test2"]             # => true
bf["test3"]             # => false

bf.stats
# => Number of filter bits (m): 10
# => Number of filter elements (n): 2
# => Number of filter hashes (k) : 2
# => Predicted false positive rate = 10.87%

Redis-backed setbit/getbit bloom filter

Uses getbit/setbit on Redis strings - efficient, fast, can be shared by multiple/concurrent processes.

bf = BloomFilter::Redis.new

bf.insert('test')
bf.include?('test')     # => true
bf.include?('blah')     # => false

bf.delete('test')
bf.include?('test')     # => false

Memory footprint

  • 1.0% error rate for 1M items, 10 bits/item: 2.5 mb
  • 1.0% error rate for 150M items, 10 bits per item: 358.52 mb
  • 0.1% error rate for 150M items, 15 bits per item: 537.33 mb

Redis-backed counting bloom filter with TTL's

Uses regular Redis get/set counters to implement a counting filter with optional TTL expiry. Because each "bit" requires its own key in Redis, you do incur a much larger memory overhead.

bf = BloomFilter::CountingRedis.new(:ttl => 2)

bf.insert('test')
bf.include?('test')     # => true

sleep(2)
bf.include?('test')     # => false

Credits

Tatsuya Mori valdzone@gmail.com (Original C implementation: http://vald.x0.com/sb/)

License

MIT License - Copyright (c) 2011 Ilya Grigorik