Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents


pylsh is a Python implementation of locality-sensitive hashing (LSH) with MinHash. It is useful for detecting near-duplicate documents in a collection without comparing every pair of documents directly.
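To make "near duplicate" concrete, here is a minimal sketch in plain Python of the quantity that MinHash approximates: the Jaccard similarity between the shingle sets of two documents. The helper names `shingles` and `jaccard` and the shingle length `k=5` are assumptions chosen for illustration; they are not part of the pylsh API.

```python
def shingles(text, k=5):
    """Return the set of overlapping k-character shingles in text."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
doc3 = "an entirely unrelated sentence about hashing"

# Near duplicates share most of their shingles; unrelated texts share few.
print(jaccard(shingles(doc1), shingles(doc2)))
print(jaccard(shingles(doc1), shingles(doc3)))
```

Computing this exactly for every pair of documents scales quadratically with the size of the collection, which is the cost that MinHash and LSH are meant to avoid.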

The implementation uses the MurmurHash v3 library to create document fingerprints.
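The idea behind a MinHash fingerprint can be sketched as follows. This is an illustration only, not the code used by pylsh: it substitutes the standard library's `hashlib.blake2b` (seeded through its `salt` argument) for MurmurHash3, and the names `seeded_hash`, `minhash_fingerprint`, `estimate_similarity` and `num_seeds` are hypothetical.

```python
import hashlib


def seeded_hash(token, seed):
    """Hash a string to a 64-bit integer; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_fingerprint(tokens, num_seeds=200):
    """For every seed, keep the smallest hash value seen over all tokens."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot fingerprint an empty token set")
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_seeds)]


def estimate_similarity(fp_a, fp_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    matches = sum(1 for x, y in zip(fp_a, fp_b) if x == y)
    return matches / len(fp_a)


# Two token sets that overlap in 50 of 150 distinct tokens (Jaccard = 1/3).
set_a = {"token%d" % i for i in range(0, 100)}
set_b = {"token%d" % i for i in range(50, 150)}
print(estimate_similarity(minhash_fingerprint(set_a), minhash_fingerprint(set_b)))
```

More seeds give a more accurate estimate at the cost of a longer fingerprint and more hashing work per document.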

Cython is needed only if you want to regenerate the .cpp files for the hashing and shingling code. By default the setup script uses the pregenerated .cpp sources; you can change this with the USE_CYTHON flag in

NumPy is needed to run the code.

The MurmurHash3 library is distributed under the MIT license. More information


For an overview of how LSH works and how to set the parameters see this notebook. The notebook is also available in the examples directory.
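As a rough guide to what the parameters control, the sketch below uses the standard banding analysis of LSH: a fingerprint is split into `bands` groups of `rows` values each, and two documents become a candidate pair if they agree on every value in at least one band. The function and parameter names are assumptions for illustration and may differ from those used in the notebook or in pylsh.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity
    agree on all rows of at least one band: 1 - (1 - s**rows)**bands."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands, rows):
    """Similarity around which the probability curve rises most steeply,
    commonly approximated as (1 / bands) ** (1 / rows)."""
    return (1.0 / bands) ** (1.0 / rows)


# Example: a 100-value fingerprint split into 20 bands of 5 rows.
bands, rows = 20, 5
print("approximate threshold:", approximate_threshold(bands, rows))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, candidate_probability(s, bands, rows))
```

For a fixed fingerprint length, using more bands with fewer rows lowers the threshold (more candidate pairs and more false positives), while fewer bands with more rows raises it.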


> git clone
> cd LSH
> python install