[Question] What's the scaling complexity? #8
The complexity is essentially bounded by the nearest neighbor search phase. By making use of algorithms for approximate nearest neighbor graph construction I can get the scaling with input dimension down to essentially linear (effectively scaling with the complexity of the distance computation). With regard to the number of samples the scaling is a little harder to pin down (the algorithm I'm using, NN-descent, doesn't have a published theoretical complexity) and is, of course, dataset dependent (i.e. it will depend on properties of the distribution of the samples). Empirically I believe it should be something like O(d*n^1.14) or so. A better theoretical analysis of the complexity is definitely on my list of things to do, but it's currently a long list.
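For anyone who wants to sanity-check that estimate on their own machine, a rough sketch along these lines could fit the empirical exponent. The sample sizes, dimension, Gaussian test data and n_neighbors setting here are arbitrary choices of mine, and it times the whole UMAP pipeline rather than isolating the nearest neighbor search, so treat the result as indicative only:

```python
# Sketch: time UMAP at several sample sizes and fit the slope of
# log(time) against log(n) to estimate the scaling exponent.
import time
import numpy as np
import umap

rng = np.random.default_rng(42)
sizes = [2_000, 4_000, 8_000, 16_000, 32_000]
d = 50
times = []
for n in sizes:
    X = rng.normal(size=(n, d))
    start = time.perf_counter()
    umap.UMAP(n_neighbors=15, random_state=42).fit_transform(X)
    times.append(time.perf_counter() - start)

# Slope of the log-log fit approximates alpha in time ~ n^alpha.
alpha, _ = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"empirical scaling exponent ~ n^{alpha:.2f}")
```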
Thanks a lot!
O(d*n^1.14) sounds almost too good to be true, doesn't it?
It certainly does, but it is an empirical estimate of an average case complexity. I have not seen any complete estimates of NNDescent complexity. Some work by Richard Darling and Jacob Baron demonstrates both some cases where NNDescent will fail (with complexity greater than O(N^2)), and classes of problem for which a complexity of O(N log N) is provable. The catch is that both are relatively theoretical cases: the first involves metrics that fail friend-of-a-friend principles, which is not the case for most of the metrics UMAP would be used with; the second is a fairly constrained class of problem that one is unlikely to see in practice (but for which complexity is at least provable).
So, given that worst cases do exist, I guess the real answer (though it is a highly deceptive one, and would equally apply to *any* algorithm that requires nearest neighbor computations) is O(N^2). For practical purposes it certainly empirically scales at O(N log N), and there are certainly problem classes for which that is the provable complexity, so perhaps that is a reasonable estimate?
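For intuition about why the friend-of-a-friend principle matters, here is a heavily simplified sketch of the NN-descent idea (my own illustrative code, not the implementation UMAP uses; it omits reverse neighbors, candidate sampling and all the real optimizations). Each round proposes neighbors-of-neighbors as candidates, which only pays off if a neighbor of a neighbor is in fact likely to be a neighbor:

```python
import numpy as np

def nn_descent_sketch(X, k=10, n_iters=10, seed=None):
    """Crude NN-descent sketch: refine a random k-NN guess via neighbors of neighbors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Start from a random guess at each point's k nearest neighbors (excluding itself).
    neighbors = np.stack([
        rng.choice(np.delete(np.arange(n), i), size=k, replace=False)
        for i in range(n)
    ])
    dists = np.array([[np.linalg.norm(X[i] - X[j]) for j in neighbors[i]]
                      for i in range(n)])
    for _ in range(n_iters):
        updated = False
        for i in range(n):
            # Friend-of-a-friend step: candidates are neighbors of i's current neighbors.
            candidates = np.unique(neighbors[neighbors[i]].ravel())
            for c in candidates:
                if c == i or c in neighbors[i]:
                    continue
                d = np.linalg.norm(X[i] - X[c])
                worst = dists[i].argmax()
                if d < dists[i, worst]:
                    neighbors[i, worst] = c
                    dists[i, worst] = d
                    updated = True
        if not updated:  # no neighbor list improved this round; converged
            break
    return neighbors, dists

# Example usage on toy data.
X = np.random.default_rng(0).normal(size=(500, 10))
approx_nbrs, approx_dists = nn_descent_sketch(X, k=10, seed=0)
```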
Great response Leland. I think this highlights a fundamental problem with O() notation. It's a worst case complexity which is potentially the tail end of a distribution of complexities and may not represent real world scaling in practice (even ignoring those pesky constants). As such it can be quite deceptive for many classes of algorithms. I've noticed this in particular with graph algorithms. The worst case graph distribution for many algorithms is a single long connected chain, which *may* be incredibly unlikely for any particular family of graphs that users care about. While this is definitely an awful corner case for many graph algorithms, it doesn't tell a user much about how the algorithm will perform in practice on their data. For that reason I'm a big fan of the less popular average case complexity. Of course, to do anything useful with average case complexity you need to come up with a distribution that your data is being sampled from, which can be hard. The advantage in the case of nearest neighbour algorithms is that it forces you to explicitly think about things like the metric space your algorithm will be running over. While such notions certainly raise the complexity of the complexity analysis, they do provide significantly more insight into the question of how well this algorithm will perform on *my* data which, I think, is the question we all want to answer.
Now that sounds quite down on O-notation, which I really don't mean to be. It's an upper limit on the average case complexity, and if an algorithm has great O complexity then you know it will also have great average case complexity on your data. Instead I mean it as a warning that not all O(n^2) algorithms are born equal. I too often see folks making statements like "all those algorithms are O(n^2), so use any of them." The max is a useful summary statistic of any distribution, but it isn't necessarily the most useful descriptive statistic for that distribution.
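As a toy illustration of the chain-graph point (a made-up mini-experiment of mine, nothing from UMAP itself): synchronous min-label propagation for connected components takes roughly diameter-many rounds, which is about n on a path graph but only a handful on a typical sparse random graph of the same size, even though it is the same algorithm in both cases:

```python
import math
import random

def propagation_rounds(adjacency):
    """Synchronous min-label propagation; count rounds until labels stop changing."""
    labels = list(range(len(adjacency)))
    rounds = 0
    while True:
        new = [min([labels[v]] + [labels[u] for u in adjacency[v]])
               for v in range(len(adjacency))]
        if new == labels:
            return rounds
        labels, rounds = new, rounds + 1

n = 500

# "Long chain" worst case: a path graph with diameter n - 1.
chain = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

# A typical sparse random graph: small diameter, so very few rounds.
rng = random.Random(0)
p = 2 * math.log(n) / n
random_graph = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < p:
            random_graph[i].append(j)
            random_graph[j].append(i)

print("path graph rounds:  ", propagation_rounds(chain))         # ~ n - 1
print("random graph rounds:", propagation_rounds(random_graph))  # a handful
```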
The max statistic is useful because it matches the human tendency to seek assurance and build trust by committing to a bound with what feels like certainty; it's helpful for planning. Of course, the finer view of considering the distribution of the complexity can be very useful in augmenting that approach. If we had a way to frame the factors affecting this algorithm's runtime complexity beyond just the size of the input, or even a way to test/sample the data to predict the runtime, that might be pragmatically helpful.
Looks like a great alternative to t-SNE! The readme mentions how fast it is, but I wonder what the complexity is in big-O terms, as a function of the number of samples and dimensions? Impatiently waiting for the paper to read!