jaccard() on hash vectors is inaccurate #37

Open
inkybutton opened this issue Dec 7, 2022 · 0 comments

inkybutton commented Dec 7, 2022

Describe the bug
Hey there, thank you for creating this library! I like the work you've put into the documentation. I'm a newbie when it comes to Julia and LSH, so apologies if I get things wrong.
I tried running jaccard() on the hash vectors produced by MinHash with the default Base.hash hashing method, and received puzzling results: the similarity value is sometimes above 1, which should be impossible. I think this is due to summing large hash values, which causes an overflow.
This is probably not the intended use of the function; collision_probability seems like the right one. I was drawn to do this because the explanations of MinHash I've seen say that the Jaccard index of the hashes should approximate the value for the original shingles. I suspect other learners may be led to do the same.
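
To spell out what I understand those explanations to mean, here is a toy sketch. It does not use LSHFunctions at all: toy_minhash and estimate are names I made up, and seeding Base.hash with 1:n is just my stand-in for a family of hash functions. The estimate is the fraction of positions where the two signatures agree.

```julia
# Toy MinHash for illustration only; not the LSHFunctions implementation.
# Each "hash function" is Base.hash called with a different seed.
toy_minhash(set, n) = [minimum(hash(s, UInt(seed)) for s in set) for seed in 1:n]

# MinHash estimate of Jaccard similarity:
# the fraction of signature positions where both sets have the same minimum hash.
estimate(sig_a, sig_b) = count(sig_a .== sig_b) / length(sig_a)

A = Set(["ab", "bc", "cd"])
B = Set(["xy", "yz", "za"])        # disjoint from A
C = Set(["ab", "bc", "cd", "de"])  # true Jaccard index with A is 3/4

est_disjoint = estimate(toy_minhash(A, 200), toy_minhash(B, 200))
est_overlap  = estimate(toy_minhash(A, 200), toy_minhash(C, 200))
est_same     = estimate(toy_minhash(A, 200), toy_minhash(A, 200))

println(est_disjoint)
println(est_overlap)
println(est_same)
```

With this position-wise comparison the value can never leave the range 0 to 1, because it only counts matches and never adds hash values together.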

To Reproduce

using LSHFunctions

hashfn = LSHFunctions.MinHash(15)
A = Set(["ab", "bc", "cd"])
B = Set(["xy", "yz", "za"])  # no elements in common with A
hashes_A = hashfn(A)
hashes_B = hashfn(B)
jaccard(hashes_A, hashes_B)

This results in similarity >0, sometimes >1.
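
To illustrate the overflow I suspect, here is a small sketch. It assumes the vector method computes a sum of element-wise minima divided by a sum of element-wise maxima, and that the hashes are UInt64 values; I have not confirmed either against the library source. weighted_jaccard is my own stand-in, not the library function.

```julia
# Hypothetical stand-in for a min/max ("weighted") Jaccard on numeric vectors.
# Not copied from LSHFunctions; it only illustrates the suspected failure mode.
weighted_jaccard(x, y) = sum(min.(x, y)) / sum(max.(x, y))

big_val = UInt64(2)^63
x = UInt64[big_val, 3]
y = UInt64[3, big_val + 5]

# sum(min.(x, y)) is 3 + 3 = 6.
# sum(max.(x, y)) is 2^63 + (2^63 + 5), which wraps around to 5 in UInt64 arithmetic.
wrapped = weighted_jaccard(x, y)

# Same computation with arbitrary-precision integers, so nothing wraps.
exact = weighted_jaccard(big.(x), big.(y))

println(wrapped)  # 1.2, i.e. above 1
println(exact)    # a tiny positive number, far below 1
```

If the real implementation works along these lines, hash values near the top of the UInt64 range would be enough to explain similarities above 1.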

Expected behavior
Because there is no overlap between the sets, the Jaccard index for their minhashes should always be 0. A naïve implementation that does not use arithmetic operations seems to get it right, e.g.:

function jaccard(x::AbstractVector, y::AbstractVector)::Float64
    # Treat both vectors as sets: shared elements over all distinct elements
    length(x ∩ y) / length(x ∪ y)
end

Or perhaps the documentation could clarify which function to use!

Screenshots
image

@inkybutton inkybutton changed the title jaccard() on vectors is inaccurate, yielding similarity >1 sometimes jaccard() on hash vectors is inaccurate Dec 7, 2022