Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Bloom Filter implementation in Clojure

branch: master

Fetching latest commit…

Octocat-spinner-32-eaf2f5

Cannot retrieve the latest commit at this time

Octocat-spinner-32 examples
Octocat-spinner-32 jbit
Octocat-spinner-32 src
Octocat-spinner-32 test
Octocat-spinner-32 .gitignore
Octocat-spinner-32 README.textile
Octocat-spinner-32 project.clj
Octocat-spinner-32 push.sh
README.textile

clj-bloom

Bloom Filter implementation in Clojure.

A bloom filter is a probabilistic data structure for testing set membership. A deterministic data structure for testing set membership is a java.util.Set, some other languages would use a hash (or map) data structure with a boolean as the value and the target strings as the key. As the number of entries in your set grows, so will the memory usage of these data structures. Bloom filters sacrifice determinism in favor of significantly lower memory usage. Bloom filters do not suffer from false negatives (indicating that a given string is not in the set when it is in fact in the set). They do suffer from false positives (indicating that the string is in the set when it is in fact not), though the false positive rate can be estimated fairly accurately, more importantly, the size of the filter and the number of hashes can be chosen to give you a target false positive rate.

The examples/words.clj example demonstrates that calculating these values based on an estimated set size and a target false positive rate gets you within a factor of about three of the desired rate.

A use for such a filter might be to quickly test if a given string is not a dictionary word, helping to ensure a user’s password is not a dictionary word.

This implementation uses a Java java.util.BitSet as the bit array implementation and provides several helpers for different hashing functions.

Usage

This example code is contained in the file examples/words.clj it initializes
a bloom filter with the dictionary from /usr/local/dict/words. You can run
this example by executing lein run -m words.

(ns words
  (:require
   [clojure.java.io                 :as io]
   [com.github.kyleburton.clj-bloom :as bf])
  (:import
   [java.util BitSet]))

(def ^{:dynamic true} words-file           "/usr/share/dict/words")
(def ^{:dynamic true} num-expected-entries 2875518)

(def all-words (memoize (fn [words-file]
                          (with-open [rdr (io/reader words-file)]
                            (doall (line-seq rdr))))))

(defn add-words-to-filter! [filter]
  (dorun
   (doseq [word (all-words words-file)]
     (bf/add! filter (.toLowerCase ^String word)))))

(defn run [hash-fn]
  (let [filter (bf/bloom-filter (* 10 1024 1024) hash-fn)]
    (add-words-to-filter! filter)
    (dorun
     (doseq [w (.split "The quick brown ornithopter hyper-jumped over the lazy trollusk" "\\s+")]
       (if (bf/include? filter (.toLowerCase ^String w))
         (prn (format "HIT:  '%s' in the filter" w))
         (prn (format "MISS: '%s' not in the filter" w)))))))

(defn make-words-filter [m-expected-entries k-hashes hash-fn]
  (let [flt (bf/bloom-filter
             m-expected-entries
             (bf/make-permuted-hash-fn
              (or hash-fn bf/make-hash-fn-hash-code)
              (map str (range 0 k-hashes))))]
    (add-words-to-filter! flt)
    flt))

(defn words-fp-rate [count flt]
  (loop [times count
         fps   0]
    (cond (= 0 times)
          fps
          (bf/include? flt (str times))
          (recur (dec times)
                 (inc fps))

          :else
          (recur (dec times)
                 fps))))

(defn report-fp-rate [count flt]
  (let [rate (words-fp-rate count flt)]
    (/ rate count 1.0)))

(defn eprintf [& args]
  (.println System/err (apply format args)))

(def word-flt-1pct          (make-words-filter num-expected-entries 7 bf/make-hash-fn-hash-code))
(def word-flt-crc32-1pct    (make-words-filter num-expected-entries 7 bf/make-hash-fn-crc32))
(def word-flt-adler32-1pct  (make-words-filter num-expected-entries 7 bf/make-hash-fn-adler32))
(def word-flt-md5-1pct      (make-words-filter num-expected-entries 7 bf/make-hash-fn-md5))
(def word-flt-sha1-1pct     (make-words-filter num-expected-entries 7 bf/make-hash-fn-sha1))

(defn -main [& args]
  (let [words-file (or (first args)
                            words-file)]
   (binding [words-file (or (first args)
                            words-file)
             num-expected-entries (count (all-words words-file))]
     (eprintf "words-file=%s, num-expected-entries=%d" words-file num-expected-entries)
     (eprintf "bf/make-hash-fn-hash-code")
     (eprintf "n=%s, k=10, p=0.01: Java's hashCode: fp=%f  cardinality=%d\n"
              num-expected-entries
              (report-fp-rate 100000 word-flt-1pct)
              (.cardinality ^BitSet (:bitarray word-flt-1pct)))
     (eprintf "bf/make-hash-fn-crc32")
     (eprintf "n=%s, k=10, p=0.01: CRC32:           fp=%f  cardinality=%d\n"
              num-expected-entries
              (report-fp-rate 100000 word-flt-crc32-1pct)
              (.cardinality ^BitSet (:bitarray word-flt-crc32-1pct)))
     (eprintf "bf/make-hash-fn-adler32")
     (eprintf "n=%s, k=10, p=0.01: Adler32:         fp=%f  cardinality=%d\n"
              num-expected-entries
              (report-fp-rate 100000 word-flt-adler32-1pct)
              (.cardinality ^BitSet (:bitarray word-flt-adler32-1pct)))
     (eprintf "bf/make-hash-fn-md5")
     (eprintf "n=%s, k=10, p=0.01: MD5:             fp=%f  cardinality=%d\n"
              num-expected-entries
              (report-fp-rate 100000 word-flt-md5-1pct)
              (.cardinality ^BitSet (:bitarray word-flt-md5-1pct)))
     (eprintf "bf/make-hash-fn-sha1")
     (eprintf "n=%s, k=10, p=0.01: SHA1:            fp=%f  cardinality=%d\n"
              num-expected-entries
              (report-fp-rate 100000 word-flt-sha1-1pct)
              (.cardinality ^BitSet (:bitarray word-flt-sha1-1pct))))))

(comment

  (-main)

  )
<pre>

@/usr/share/dict/words@ on my system is 24,86,813 bytes while the recommended size of this filter for a 1% FP rate is 2,875,518 bytes - a nearly ten fold reduction in required memory.  If the entries were larger the memory savings would be larger as well.

h2. Parameters

The "Wikipedia article on Bloom Filters":http://en.wikipedia.org/wiki/Bloom_filter discusses methods for determining optimal values for the number of hashes, @k@, the size of the bit array, @m@, given an expected number of entries @n@ and a target false positive probability, @p@.

This library provides a functions to help you determine those parameters for your data set.  Following along with the @words.clj@ example, the @/usr/share/dict/words@ file on my system contains about 234,936 words.  This gives us @n = 234936@ entries, if we can tolerate a 1% error rate, the size of the bit array should be:

<pre>
(num-bits-for-entries-and-fp-probability 234936 0.01) => 2251875

or about 2.2 million bits. Given this bit array size, we can now compute an optimal number of hashes:

(num-hash-fns-for-entries-and-bits 234936 2251875)  => 6.643855378585771

rounding to 7. You can obtain both of these values in one call with:

(optimal-n-and-k 300000 0.01) => [2875518  7]

Based on the benchmarks in the examples/words.clj file, I do not recommend using Java’s hashCode function for use with the bloom filter. It results in a wildly higher false positive rate then the other hashes. My recommendation is to use the values reported by optimal-n-and-k as a starting point and measure your actual false positive rate, keeping in mind the ‘4.8 bits per key’ rule as a way to decrease the false-positive rate by an order of magnitude.

There is also a helper function make-optimal-filter that takes an estimated number of entries, a desired false-positive probability and an optional hash builder function.

Hash Functions

The words.clj example shows how to create your own hash function. As part of the definition of a bloom filter, a number of hashes k are executed to determine the bits to set. The interface for this hash function should take a java.lang.String and an int. The string being the data to compute the hash for, and the int being the size in bits (n) of the bit array. The hash function must return the sequence of bit locations to be set in the filter for the given input string.

Limitations

java.util.BitSet

The current implementation uses a java.util.BitSet, which is limited to 2^31 - 1 bits. This puts some boundaries on the number of elements and the false-positive probability.

Both java.util.BitSet and java.math.BigInteger have this limitation. I will look into implementing a version of the filter that uses byte arrays under the hood so that this limitation can be lifted.

Installation

If you’re using Leiningen, add the following to your project.clj file’s :dependencies:

  [com.github.kyleburton/clj-bloom "1.0.1"]

For maven:

  <dependencies>
    <dependency>
      <groupId>com.github.kyleburton</groupId>
      <artifactId>clj-bloom</artifactId>
      <version>1.0.1</version>
    </dependency>
    ...
  </dependencies>

Building

To build using Leiningen:

  clj-bloom$ lein test
  clj-bloom$ lein jar
  clj-bloom$ lein uberjar

References

License

Same as Clojure

Something went wrong with that request. Please try again.