# LMNN: optimization of kNN computation by bound relaxation #1446

Open
opened this Issue Jun 27, 2018 · 15 comments

Member

### rcurtin commented Jun 27, 2018

 Here's another nice way that I think we can get a lot of speedup in the kNN search (not the tree building). The kNN search stores bounds in the query tree during search. The algorithm itself is quite complex, but in essence, when we start the search we have a bound set at `DBL_MAX` that says, "the k-nearest-neighbor of any descendant of this query node is no further than `DBL_MAX` away". But we have a better bound on the k-nearest-neighbor---we know that the k-nearest-neighbor distance at iteration (t + 1) is going to be bounded by `d_{L_t}(x_i, x_j) + f(|| L_t - L_{t + 1} ||)`, where `f(.)` is some quantity I still have to derive. If we start the kNN search with a much tighter bound than `DBL_MAX`, search can be significantly faster (think an order of magnitude). So to do this, we will have to:

- Derive the correct bound to use.
- Use cached query trees (#1445), or single-tree mode for small-batch computation.
- Set the bound to the bounded kNN distance before calling `knn.Search()`.
- If we are not caching query trees, store each query point's k'th NN distance for use in the next iteration. (I think the code already does this.)

Member

### manish7294 commented Jun 28, 2018

 This sounds like another pretty good idea. I believe the bounds here will be somewhat similar to the ones we will be using in #1449.
Member

### manish7294 commented Jun 28, 2018

 I think this may be correct: `f(|| L_t - L_{t + 1} ||) = || L_{t + 1} - L_t ||_F^2 * max( ||x_a||^2 + ||x_i||^2)`, where `x_a` is a point among the k'th nearest neighbors of all data points.
Member Author

### rcurtin commented Jul 2, 2018

 Right, I think this is close to correct. Here's a derivation, supposing `x_i` is the query point and `x_a` is the k-nearest neighbor (although the notation doesn't actually matter):

```
d(L_{t + 1} x_i, L_{t + 1} x_a)
    <= d(L_t x_i, L_t x_a) + d(L_t x_i, L_{t + 1} x_i) + d(L_t x_a, L_{t + 1} x_a)
    <= d(L_t x_i, L_t x_a) + || L_t - L_{t + 1} ||_F ( || x_i || + || x_a || ).
```

(The second step uses `d(L_t x, L_{t + 1} x) = || (L_t - L_{t + 1}) x || <= || L_t - L_{t + 1} ||_F || x ||`.) Maybe I overlooked something? I'm not sure where the `max()` came from in your solution, and it seems like maybe it is missing a second argument?
Member

### manish7294 commented Jul 2, 2018

 Ah! That was just an intuitive expression I had written earlier :)
Member

### manish7294 commented Jul 5, 2018

 Let me explain the reasoning behind `max()`:

```
d(L_{t + 1} x_i, L_{t + 1} x_a) <= d(L_t x_i, L_t x_a) + || L_t - L_{t + 1} ||_F ( || x_i || + || x_a || )
```

The above bound is specific to individual points, but when we are dealing with a whole dataset (or a subset of one), we need a single value, in place of `DBL_MAX`, that bounds all points. So here `max()` is the maximum value of `|| x_i || + || x_a ||` over all data points; that maximum is what bounds every point at once. Though there is no point in doing this if we can set an individual bound, in place of `DBL_MAX`, for each query point, which I am not sure we can.
Member Author

### rcurtin commented Jul 5, 2018

 Right, this is a bound for each individual query point, so I think the max() is not needed here.
Member Author

### rcurtin commented Jul 24, 2018

 I'll handle the tree-related part of this one, unless you are already working on it. I'm thinking about single-tree search first. Basically, when we do k-nearest-neighbor search with a single-tree algorithm, we initialize all of the resulting distances to `[DBL_MAX, DBL_MAX, ...]`. Then, during the algorithm, we update that distance list with the "best nearest neighbors so far" (so each value only decreases), and we choose to prune parts of the tree based on that.

But the thing is, we already know at the outset of our search that the k'th nearest neighbor cannot be any further away than

```
bound := d(L_t x_i, L_t x_a) + || L_t - L_{t + 1} ||_F ( || x_i || + || x_a || )
```

where `x_a` is the k'th nearest neighbor of `x_i` at iteration `t`. So we can instead initialize that entire set of resulting distances to `[bound, bound, ...]`, and this will allow much tighter pruning.

The catch is that this set of resulting distances is set in `neighbor_search_rules_impl.hpp:51`, and I think it will need some amount of refactoring in order to set these the way we want. If you like, you can take a look through the code and see where you think we could nicely refactor to allow this support. I'll sleep on it and see what I think in the morning.
Member

### manish7294 commented Jul 26, 2018

 Did you come up with anything regarding this? As far as I can see, there is no easy way to make this change without a good amount of refactoring, since we can't directly access either the `NeighborSearchRules` or the sort policy (and the sort policy is not a good place to make the change anyway). Maybe the change needs to be made in `NeighborSearch` itself.
Member Author

### rcurtin commented Jul 26, 2018

 Yes, I have been thinking about it. I think that I can implement this one so you can focus on the other optimizations and BoostMetric. I'll open a PR when it is ready.
Member

### manish7294 commented Jul 26, 2018

 Thanks! Shall I open a PR for BoostMetric as well, so that it doesn't get left behind?
Member Author

### rcurtin commented Jul 26, 2018

 Right, I think that is not a bad idea. When you have a chance, be sure to go through and apply as many of the optimizations we've already applied to LMNN as you can. That will help save some review time, because if you don't, I will probably point them out as things to be improved. :)
Member Author

### rcurtin commented Jul 28, 2018

 An update on this one: I worked on it yesterday and today, and I have a branch `rcurtin/lmnn-tree-rules` that you can take a look at. It does not incorporate the bounds from this issue yet, but instead of running `num_classes` kNN searches, I built custom rules so that we only run one search, and we can do the targets and impostors searches simultaneously. My preliminary results show a ~20-40% speedup on the covertype dataset (at various sizes) with L-BFGS, ~40-60% with AMSGrad, and ~10-30% on MNIST. I have a lot more simulations I want to run, but keep in mind the actual optimization from this issue isn't done yet. (Technically it will require tree caching to work, so I'll adapt your work from that PR into the branch I am working on and send an update when it's done.) So, that said, I think that right now any remaining LMNN work is just timing the `impBounds` branch, and then I suppose working on the BoostMetric PR would be the next thing to do.
Member

### manish7294 commented Jul 28, 2018

 Wow, that's awesome! I couldn't have imagined doing something like that; you've really done a great amount of work, and I can't stop admiring it. Thanks a lot for this. I will have a detailed look later tonight. :) Just a small comment from a quick look: we are going to use triplets in BoostMetric, so maybe we should avoid removing 9994c52. If you like, we can add it back during the BoostMetric implementation.
Member Author

### rcurtin commented Jul 31, 2018

 Ah, ok, we can re-add that; it's no problem. I didn't realize that it was going to be used somewhere. :) And it is really not that much work for me; keep in mind that this is basically the exact topic my PhD thesis was on (accelerating things with trees), and I wrote the `NeighborSearchRules` code, so I already have a pretty good idea of exactly what needs to be changed. :)


### mlpack-bot bot commented Feb 18, 2019

 This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍