
MinHash: benchmark memory, speed and accuracy with varying r and b #23

Open
vienneraphael opened this issue Sep 29, 2023 · 1 comment
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@vienneraphael
Collaborator

This issue concerns fuzzy deduplication of text pairs.

Find the trade-off between memory, speed, and accuracy when varying r and b.
We need to find a way to use far less than 9k, because that requires too much CPU.
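
For reference, here is a minimal sketch of how r and b map onto an LSH index, assuming the datasketch library (the tokenisation and the (b, r) values are purely illustrative, not a recommendation):

```python
# Sketch only: assumes the `datasketch` library; tokens and (b, r) are illustrative.
from datasketch import MinHash, MinHashLSH

def make_minhash(tokens, num_perm):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

# Pin b (bands) and r (rows per band) explicitly instead of deriving them from a
# threshold; num_perm must be at least b * r.
b, r = 32, 4                      # 128 permutations total, far below 9k
num_perm = b * r
lsh = MinHashLSH(num_perm=num_perm, params=(b, r))

doc_a = make_minhash("the quick brown fox".split(), num_perm)
doc_b = make_minhash("the quick brown dog".split(), num_perm)
lsh.insert("a", doc_a)
print(lsh.query(doc_b))           # keys whose banded signatures collide with doc_b
```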

@vienneraphael vienneraphael added enhancement New feature or request good first issue Good for newcomers labels Sep 29, 2023
@CreativeSelf0

Fine-tuning the parameters r (rows per band) and b (number of bands) in MinHash LSH depends heavily on the data and the specific use case, such as fuzzy deduplication of text pairs in your scenario. These parameters control the trade-off between accuracy and performance (speed and memory usage), as sketched after the list below:

  1. Accuracy: Increasing r and b generally increases the accuracy of the MinHash similarity estimates, as more hash functions are utilized, capturing more aspects of the data.
  2. Speed: However, higher values of r and b can slow down the computation, as more hash functions need to be evaluated.
  3. Memory Usage: Likewise, more memory is required to store the additional hash values.
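
The standard LSH banding analysis makes this concrete: a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b, so r mainly sharpens the similarity threshold while b shifts it. A quick sketch (plain Python, no assumptions about this repo):

```python
# LSH banding S-curve: P(candidate) = 1 - (1 - s**r)**b for Jaccard similarity s.
def candidate_probability(s: float, r: int, b: int) -> float:
    return 1.0 - (1.0 - s ** r) ** b

for r, b in [(4, 32), (8, 16), (16, 8)]:          # all three use 128 permutations
    curve = {s / 10: round(candidate_probability(s / 10, r, b), 3) for s in range(1, 10)}
    print(f"r={r:>2}, b={b:>2}: {curve}")
```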

The ideal settings for r and b could vary depending on:

  • The size and nature of your dataset.
  • The level of similarity/dissimilarity among text pairs in your dataset.
  • The computational resources available to you.
  • Your tolerance for error versus your need for speed and lower memory usage.

An empirical approach, where you run experiments with different values of r and b on a representative subset of your data, can be very informative. By analyzing how the performance metrics (speed, memory usage, and accuracy) change with different settings, you can better understand the trade-offs and find an optimal configuration for your specific use case.
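
A hedged sketch of such an experiment, again assuming datasketch; `docs`, the character shingling, and the (r, b) grid are placeholders for a representative sample, and precision/recall are measured against brute-force Jaccard on the same sample:

```python
# Benchmark sketch: time, peak memory, and precision/recall for a grid of (r, b).
import itertools, time, tracemalloc
from datasketch import MinHash, MinHashLSH

def shingles(text, k=3):
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def benchmark(docs, r, b, threshold=0.8):
    num_perm = r * b
    sets = [shingles(d) for d in docs]

    tracemalloc.start()
    t0 = time.perf_counter()
    minhashes = []
    for s in sets:
        m = MinHash(num_perm=num_perm)
        for sh in s:
            m.update(sh.encode("utf8"))
        minhashes.append(m)
    lsh = MinHashLSH(num_perm=num_perm, params=(b, r))
    for i, m in enumerate(minhashes):
        lsh.insert(str(i), m)
    elapsed = time.perf_counter() - t0
    _, peak_mem = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Accuracy: compare LSH candidate pairs against brute-force Jaccard >= threshold.
    truth = {(i, j) for i, j in itertools.combinations(range(len(docs)), 2)
             if jaccard(sets[i], sets[j]) >= threshold}
    found = set()
    for i, m in enumerate(minhashes):
        for key in lsh.query(m):
            j = int(key)
            if i < j:
                found.add((i, j))
    precision = len(found & truth) / len(found) if found else 1.0
    recall = len(found & truth) / len(truth) if truth else 1.0
    return elapsed, peak_mem, precision, recall

# Example grid; replace `docs` with a sample of your corpus.
docs = ["the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumped over the lazy dog",
        "lorem ipsum dolor sit amet"] * 10
for r, b in [(4, 32), (8, 16), (16, 8)]:
    t, mem, p, rec = benchmark(docs, r, b)
    print(f"r={r:>2} b={b:>2}  time={t:.3f}s  peak_mem={mem / 1e6:.1f}MB  "
          f"precision={p:.2f}  recall={rec:.2f}")
```

Comparing the timing, memory, and precision/recall numbers across the grid should show how far below 9k you can go before recall degrades.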
