Skip to content

jakobnissen/MinHash.jl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MinHash

Dev CI Codecov

Efficient minhashing in Julia

MinHash.jl offers generic, efficient MinHash sketching, and functions to efficiently compute the number of shared minhashes between sketches. This package is envisioned to be used as a dependency for other Julia packages that needs minhashing.

Interface

Types

MinHasher{F}(s::Integer)

A MinHasher object performs the minhashing, using function F as a hash function, and storing the s smallest hashes only. F defaults to Base.hash.

MinHashSketch(::MinHasher)

Stores the information of a MinHasher (namely, the hash function, maximal number of hashes, and the hashes themselves) in a more efficient type. This type should be used to store the actual hashes.

Methods

update!(::MinHasher, it)

Iterate over it, adding each element to the minhasher.

sketch(F, it, s::Integer) Hash all elements of it using function F, storing at most the s smallest hashes. Equivalent to:

hasher = MinHasher{F}(s)
update!(hasher, it)
return MinHashSketch(hasher)

sketch(it, s::Integer)

Same as sketch(Base.hash, it, s)

intersectionlength(a::MinHashSketch, b::MinHashSketch)

Efficiently compute the number of hashes both in a and b. Does not check that the hash functions for the two sketches are the same, result will be meaningless if they are not.

intersectionlength(::AbstractVector{MinHashSketch})

Efficiently compute a lower triangular matrix (of type Matrix{Int}) of shared hashes for all pairs in the input vector. For long vectors, this is much more efficient than calculating the distances pairwise.

About

Fast and generic implementation of the minhash algorithm

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages