Scala library for sketching, locality sensitive hashing, approximate similarity search and other things
Scala
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
README.md
build.sbt
license.txt

README.md

Sketches

Sketches is a library for sketching, locality sensitive hashing, approximate similarity search and other things.

Usage

Basic use case is search for all similar items in a dataset.

import atrox.sketch._

val sets: IndexedSeq[Set[Int]] = loadMyData()

val (bands, hashes) = LSH.pickHashesAndBands(threshold = 0.5, maxHashes = 64)
val lsh = LSH(sets, MinHash[Set[Int]](hashes), LSHBuildCfg(bands = bands))

val cfg = LSHCfg(maxResults = 50)

for (Sim(idx1, idx2, estimate, similarity) <- lsh.withConfig(cfg).allSimilarItems) {
  println(s"similarity between item $idx1 and $idx2 is estimated to $estimate")
}

There's more configuration options available.

lsh.withConfig(LSHCfg(
  // Return only the 100 most relevant results.
  // It's strongly recommended to use this option.
  maxResults = 100,

  // Perform similarity search in parallel.
  parallel = true,

  // Use as much memory as needed. This leads to faster bulk queries but
  // might need to store the complete result set in memory.
  compact = false,

  // Skip anomalously large buckets. This speeds things up quite a bit.
  maxBucketSize = sets.size / 10
))

And more query methods.

lsh.similarItems(q)
lsh.similarIndexes(q)
lsh.allSimilarItems
lsh.allSimilarIndexes

And more sketching methods.

  • MinHash and SingleBitMinHash for estimating Jaccard index
  • WeightedMinHash for estimating weighted Jaccard index
  • RandomHyperplanes for estimation cosine similarity
  • RandomProjections for LSH based on euclidean distance
  • HammingDistance
  • SimHash