
Performance regression in 0.2 #38

Open · ahirner opened this issue Jan 28, 2018 · 3 comments

Comments

ahirner commented Jan 28, 2018

I pip-updated from 0.1.3 to 0.2. Two of our sample workloads took a significant performance hit: reducing 480x13500 to 80x13500 ran in 2:24 instead of 1:14, and reducing 480x6700 to 80x6700 took 1:49 instead of 0:28.

Alongside updating umap-learn, other libraries got a bump (llvmlite 0.2 to 0.21, numba 0.35.0 to 0.36.2). Neither of those affected running times. After downgrading to 0.1.3, I got the former numbers.

I saw that this commit disabled jitting for fuzzy_simplicial_set. Could this or anything else cause this regression?

@lmcinnes (Owner) commented
I suspect small dataset sizes are the issue here. The changes made for 0.2 were largely targeted at large dataset sizes, and at correcting some issues in the resulting embedding. These changes are fundamentally necessary, but they may result in less performant (though more accurate!) results for small numbers of points, particularly when reducing to larger embedding dimensions as you are doing here.

Long story short: I think this may simply be a necessary performance regression for the kinds of data you have here. Sorry.

ahirner commented Jan 28, 2018

That's interesting. It turns out that processing many small chunks of data works pretty well for our domain. I'll have a look at the qualitative difference. So should this issue be closed?

@lmcinnes (Owner) commented
Leave it open for now -- I would like to be able to resolve such issues if I can, and perhaps with more time I might come up with an approach that could make this better. In the meantime you can use the new n_epochs parameter to speed up training time (at some loss of accuracy). For the dataset sizes you have I believe the effective default is 500; you could try dropping it to 200 and see if that helps.
