Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZeroDivisionError with sparse input and metric='jaccard' #33

Closed
jay-reynolds opened this issue Dec 5, 2017 · 12 comments
Closed

ZeroDivisionError with sparse input and metric='jaccard' #33

jay-reynolds opened this issue Dec 5, 2017 · 12 comments

Comments

@jay-reynolds
Copy link

Example to reproduce error:

import numpy as np
from sklearn import manifold
import umap

X = np.random.choice([0, 1], size=(1000, 50), p=[90./100, 10./100])

tsne = manifold.TSNE(metric='jaccard')
y_tsne = tsne.fit_transform(X)

um = umap.UMAP(metric='jaccard')
y_umap = um.fit_transform(X)

p=[85./100, 15./100] works.

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

I suspect my Jaccard implementation is not liking all zero rows. That should not be too hard to fix, but I'll have to take a look and see what's actually going on there. Thanks for the report!

@jay-reynolds
Copy link
Author

jay-reynolds commented Dec 5, 2017

That was my thought, initially -- I forgot to mention that I'm pretty certain I correctly did a test to clean out zero rows (not shown above)
Also added a column of 1s just as a sanity check while testing.

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

Well I did certainly have a bug in handling of all zero rows, which is now fixed. It seems that isn't enough however, so now it's a matter of tracking down where the error actually is.

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

Ah, it looks like it is in the random projection trees using cosine splitting. Let me look into that further ...

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

It seems that the problem is when we try to split two identical points and the hyperplane is thus zero and we can't take the norm of it (although all zero vectors would also break this code, so I should fix that too). I will have to think a little about the "right" way to handle this.

@jay-reynolds
Copy link
Author

Ah, makes sense in light of my original data.
I'll try and familiarize myself more with umap.py in the meanwhile -- this is really interesting!
Looking forward to your paper.

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

After some thought standard defensive programming will, in fact, do the right thing. I should have been more careful and caught these the first time, but hopefully this will work better now.

Edit: I tested this with your code above and it now works for me; I'll let you pull from master and reinstall and see it fixes the problems on your end as well.

@jay-reynolds
Copy link
Author

I suspect you intended for:

if right_norm == 0.0: right_norm = 0.0

to instead be:

if right_norm == 0.0: right_norm = 1.0

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 5, 2017

Yes, sorry. I was working on a separate copy that "worked for me". All the more reason I appreciate the double check.

@jay-reynolds
Copy link
Author

No worries, happy to be able to help. Now to try the real data again!

@jay-reynolds
Copy link
Author

Not seeing any errors, closing ticket.

@lmcinnes
Copy link
Owner

lmcinnes commented Dec 6, 2017

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants