Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
Newer
Older
100644 93 lines (62 sloc) 2.987 kb
0a9ea6c @igrigorik update readme
authored
1 # BloomFilter(s) in Ruby
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
2
0a9ea6c @igrigorik update readme
authored
3 - Native (MRI/C) counting bloom filter
4 - Redis-backed getbit/setbit non-counting bloom filter
5 - Redis-backed set-based counting (+TTL) bloom filter
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
6
7 Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail, check the [wikipedia article](http://en.wikipedia.org/wiki/Bloom_filter). Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, ..., k-1). This may or may not give you a good distribution, it all depends on the data.
8
9 Performance of the Bloom filter depends on a number of variables:
10
11 - size of the bit array
12 - size of the counter bucket
13 - number of hash functions
14
15 ## Resources
16
17 - Determining parameters: [Scalable Datasets: Bloom Filters in Ruby](http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/)
18 - Applications & reasons behind bloom filter: [Flow analysis: Time based bloom filter](http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/)
19
6147e7c @igrigorik add memory footprint info for redis backend
authored
20 ***
21
0a9ea6c @igrigorik update readme
authored
22 ## MRI/C API Example
23
24 MRI/C implementation which creates an in-memory filter which can be saved and reloaded from disk.
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
25
1ddbe7d @igrigorik syntax highlighting in readme
authored
26 ```ruby
27 require 'bloomfilter'
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
28
1ddbe7d @igrigorik syntax highlighting in readme
authored
29 bf = BloomFilter::Native.new(:size => 100, :hashes => 2, :seed => 1, :bucket => 3, :raise => false)
30 bf.insert("test")
31 bf.include?("test") # => true
32 bf.include?("blah") # => false
0a9ea6c @igrigorik update readme
authored
33
1ddbe7d @igrigorik syntax highlighting in readme
authored
34 bf.delete("test")
35 bf.include?("test") # => false
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
36
1ddbe7d @igrigorik syntax highlighting in readme
authored
37 # Hash with a bloom filter!
38 bf["test2"] = "bar"
39 bf["test2"] # => true
40 bf["test3"] # => false
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
41
1ddbe7d @igrigorik syntax highlighting in readme
authored
42 bf.stats
43 # => Number of filter bits (m): 10
44 # => Number of filter elements (n): 2
45 # => Number of filter hashes (k) : 2
46 # => Predicted false positive rate = 10.87%
47 ```
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
48
6147e7c @igrigorik add memory footprint info for redis backend
authored
49 ***
50
0a9ea6c @igrigorik update readme
authored
51 ## Redis-backed setbit/getbit bloom filter
52
53 Uses [getbit](http://redis.io/commands/getbit)/[setbit](http://redis.io/commands/setbit) on Redis strings - efficient, fast, can be shared by multiple/concurrent processes.
54
1ddbe7d @igrigorik syntax highlighting in readme
authored
55 ```ruby
56 bf = BloomFilter::Redis.new
0a9ea6c @igrigorik update readme
authored
57
1ddbe7d @igrigorik syntax highlighting in readme
authored
58 bf.insert('test')
59 bf.include?('test') # => true
60 bf.include?('blah') # => false
0a9ea6c @igrigorik update readme
authored
61
1ddbe7d @igrigorik syntax highlighting in readme
authored
62 bf.delete('test')
63 bf.include?('test') # => false
64 ```
0a9ea6c @igrigorik update readme
authored
65
6147e7c @igrigorik add memory footprint info for redis backend
authored
66 ### Memory footprint
67
68 - 1.0% error rate for 1M items, 10 bits/item: *2.5 mb*
69 - 1.0% error rate for 150M items, 10 bits per item: *358.52 mb*
70 - 0.1% error rate for 150M items, 15 bits per item: *537.33 mb*
71
72 ***
73
0a9ea6c @igrigorik update readme
authored
74 ## Redis-backed counting bloom filter with TTL's
75 Uses regular Redis get/set counters to implement a counting filter with optional TTL expiry. Because each "bit" requires its own key in Redis, you do incur a much larger memory overhead.
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
76
1ddbe7d @igrigorik syntax highlighting in readme
authored
77 ```ruby
78 bf = BloomFilter::CountingRedis.new(:ttl => 2)
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
79
1ddbe7d @igrigorik syntax highlighting in readme
authored
80 bf.insert('test')
81 bf.include?('test') # => true
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
82
1ddbe7d @igrigorik syntax highlighting in readme
authored
83 sleep(2)
84 bf.include?('test') # => false
85 ```
7d3532c @igrigorik bundlerize + rspec2 cleanup
authored
86
87 ## Credits
88
0a9ea6c @igrigorik update readme
authored
89 Tatsuya Mori <valdzone@gmail.com> (Original C implementation: http://vald.x0.com/sb/)
90
91 ## License
92
1ddbe7d @igrigorik syntax highlighting in readme
authored
93 MIT License - Copyright (c) 2011 Ilya Grigorik
Something went wrong with that request. Please try again.