[Question] What's the scaling complexity? #8

Closed
vmarkovtsev opened this issue Nov 11, 2017 · 6 comments

Comments

@vmarkovtsev

Looks like a great alternative to t-SNE! The readme mentions how fast it is, but I wonder what the big-O complexity is in terms of the number of samples and dimensions. Impatiently waiting for the paper to read!

@lmcinnes
Owner

The complexity is essentially bounded by the nearest neighbor search phase. By making use of algorithms for approximate nearest neighbor graph construction I can get the scaling with input dimension down to essentially linear (effectively scaling with the complexity of the distance computation). With regard to the number of samples the scaling is a little harder to pin down (the algorithm I'm using, NN-descent, doesn't have a published theoretical complexity) and is, of course, dataset dependent (i.e. it will depend on properties of the distribution of the samples). Empirically I believe it should be something like O(d*n^1.14) or so. A better theoretical analysis of the complexity is definitely on my list of things to do, but it's currently a long list.
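
One way to sanity-check an empirical estimate like this is to time approximate nearest neighbor graph construction on growing subsamples and fit the slope on a log-log plot. Below is a minimal sketch using pynndescent (one NN-descent implementation); the data, sizes, dimension, and `n_neighbors` are arbitrary illustrative choices, not anything prescribed in this thread.

```python
# Rough empirical check: time NN-descent graph construction on growing sample
# sizes and fit the slope of log(time) against log(n). All numbers below are
# arbitrary illustrative choices.
import time

import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(42)
dim = 50
sizes = [10_000, 20_000, 40_000, 80_000]

# Warm up once so numba JIT compilation doesn't distort the first timing.
NNDescent(rng.normal(size=(1_000, dim)).astype(np.float32), n_neighbors=15)

times = []
for n in sizes:
    data = rng.normal(size=(n, dim)).astype(np.float32)
    start = time.perf_counter()
    NNDescent(data, n_neighbors=15, random_state=42)
    times.append(time.perf_counter() - start)

# Fit log(time) = a * log(n) + b; the slope `a` is the empirical exponent.
a, b = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"empirical scaling exponent: n^{a:.2f}")
```

The exponent such a fit reports will, of course, depend on the data distribution and the sizes sampled; it is only a local estimate, not a proof of asymptotic behaviour.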

@vmarkovtsev
Author

Thanks a lot!

@matanox

matanox commented Jan 17, 2021

O(d*n^1.14) sounds almost too good to be true, doesn't it?

@lmcinnes
Owner

It certainly does, but it is an empirical estimate of an average-case complexity. I have not seen any complete estimates of NN-descent complexity. Some work by Richard Darling and Jacob Baron demonstrates both cases where NN-descent will fail (with complexity greater than O(N^2)) and classes of problems for which a complexity of O(N log N) is provable. The catch is that both are relatively theoretical cases: the first involves metrics that fail friend-of-a-friend principles, which is not the case for most of the metrics UMAP would be used with; the second is a fairly constrained class of problems that one is unlikely to see in practice (but for which the complexity is at least provable).

So, given that worst cases do exist, I guess the real answer (though it is a highly deceptive one, and would equally apply to any algorithm that requires nearest neighbor computations) is O(N^2). For practical purposes it certainly empirically scales at O(N log N) and there are certainly problem classes for which that is the provable complexity, so perhaps that is a reasonable estimate?
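
For what it's worth, n^1.14 and N log N grow similarly over practical ranges, so telling them apart from timings takes some care. A small sketch of fitting both forms to measured runtimes; the timing values below are hypothetical placeholders purely for illustration, and in practice would come from a benchmark like the one sketched above.

```python
# Sketch: compare how well a power law (c * n^a) and c * n * log(n) explain
# measured runtimes. The timings are hypothetical placeholders.
import numpy as np

sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
times = np.array([1.8, 4.1, 9.3, 20.8])  # hypothetical seconds

# Power-law fit in log space: log t = a * log n + log c.
a, log_c = np.polyfit(np.log(sizes), np.log(times), 1)
power_pred = np.exp(log_c) * sizes ** a

# n log n fit: t = c * n * log(n), with the constant c solved by least squares.
nlogn = sizes * np.log(sizes)
c = np.linalg.lstsq(nlogn[:, None], times, rcond=None)[0][0]
nlogn_pred = c * nlogn

print(f"power law fit: ~n^{a:.2f}, total abs error {np.abs(times - power_pred).sum():.2f}")
print(f"n log n fit:   total abs error {np.abs(times - nlogn_pred).sum():.2f}")
```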

@jc-healy
Contributor

jc-healy commented Jan 19, 2021 via email

@matanox

matanox commented Jan 19, 2021

The max statistic is useful because it matches the human tendency to build assurance and trust by committing to a bound one can treat as certain, which helps with planning. Of course, the finer-grained view of the full distribution of complexities can usefully augment that approach.

If there were a clear way to frame the factors that affect this algorithm's runtime beyond just the size of the input, or even a way to test or subsample the data to predict the runtime, that would be pragmatically helpful.
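
On the testing/sampling idea, one pragmatic approach is to time the fit on a few small subsamples of the actual dataset, fit a growth curve, and extrapolate to the full size. A minimal sketch assuming a `data` array and using umap-learn directly; the subsample sizes and the `predict_runtime` helper are hypothetical illustrations, not part of the library, and treating the whole fit as one power law is only a rough approximation.

```python
# Sketch: predict full-dataset runtime by timing UMAP on small subsamples of
# the real data and extrapolating a fitted power law. The helper and the
# subsample sizes are hypothetical illustrations, not part of umap-learn.
import time

import numpy as np
import umap

def predict_runtime(data, sample_sizes=(2_000, 4_000, 8_000), seed=42):
    rng = np.random.default_rng(seed)
    times = []
    for n in sample_sizes:
        idx = rng.choice(data.shape[0], size=n, replace=False)
        start = time.perf_counter()
        umap.UMAP(n_neighbors=15).fit(data[idx])
        times.append(time.perf_counter() - start)
    # Fit log(t) = a * log(n) + b, then extrapolate to the full dataset size.
    a, b = np.polyfit(np.log(sample_sizes), np.log(times), 1)
    return float(np.exp(b) * data.shape[0] ** a)

# Example usage (assumes `data` has many more rows than the largest subsample):
# estimated_seconds = predict_runtime(data)
```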
