Implementations of the classic and covering locality-sensitive hashing schemes for Hamming space
C++
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bench
include/hemingway
src
test
vendor
.editorconfig
.gitignore
LICENSE.md
README.md
Tupfile.ini
Tuprules.tup
paper.pdf

README.md

Hemingway

Implementations of the classic and covering locality-sensitive hashing schemes for Hamming space

Build Status

Hemingway is a library for performing nearest-neighbour searches in datasets consisting of high-dimensional bit vectors using locality-sensitive hashing. Two flavors of LSH are currently supported: Classic LSH, originally described by Piotr Indyk and Rajeev Motwani, and Covering LSH, a newer construction by Rasmus Pagh. You can read our paper on the library here.

As for the name? Say the words "locality-sensitive hashing" and "Hamming space" fast enough a bunch of times and you'll eventually arrive at "Hemingway". If not, you probably didn't pick appropriate parameters.

Contents

Installation

Hemingway can be built using any compiler that supports C++11. A tup-based pipeline is included and is the recommended way to build Hemingway. Start by downloading and installing tup and then run the following from the root of the library:

$ CXX=<compiler> tup

This will compile the implementation files to src/lsh.a. Grab this along with the header files and you'll be all set! You can check out the next section for a guide on how to actually use the library.

Usage

The design of Hemingway is fairly simple in that it exposes only two classes for all your LSH needs: lsh::vector and lsh::table. The lsh::vector class is used for representing bit vectors of arbitrary dimensionality and is constructed by supplying a std::vector<bool> containing the components of the vector:

lsh::vector v({1, 0, 0, 0, 1, 1, 0, 1});

The lsh::table class is used for representing a lookup table containing partitions of vector buckets. Two LSH schemes are currently supported in lookup tables:

Classic: In this scheme, vectors are hashed into buckets using random bit masks associated with each partition. When constructing this table, 3 parameters are specified: The dimensionality of input vectors, the number of bits to sample from vectors, and the number of partitions to use:

lsh::table t({.dimensions = 8, .samples = 3, .partitions = 4);

Covering: In this scheme, vectors are hashed into buckets using carefully constructed bit masks that ensure that the hashes of vectors within a given radius from each other will collide. Only two parameters are specified when constructing this table: The dimensionality of input vectors, and the radius that should be covered:

lsh::table t({.dimensions = 8, .radius = 2);

Once you've constructed your table, go ahead and add your vectors:

t.insert(v);

Once you've added all your vectors you can perform queries against the lookup table:

lsh::vector r = t.query(lsh::vector({1, 0, 1, 0, 1, 1, 0, 0}));

API

To be written once the API has stabilized a bit.

Authors

This library came about as a result of the Advanced Algorithms seminar held at the IT University of Copenhagen. We would like to give thanks to our supervisors for not only their help but also their immense patience during the seminar.

Kasper Kronborg Isager Radosław Niemczyk

License

Copyright © 2016 Kasper Kronborg Isager and Radosław Niemczyk. Licensed under the terms of the MIT license.