
LMNN: optimization of kNN computation by bound relaxation #1446

Open
rcurtin opened this Issue Jun 27, 2018 · 15 comments

@rcurtin

rcurtin commented Jun 27, 2018

Here's another way that I think we can get a large speedup in the kNN search (not the tree building).

The kNN search stores bounds in the query tree during the search. The algorithm itself is quite complex, but in essence, when we start the search we have a bound set at DBL_MAX that says, "the k-nearest-neighbor of any descendant of this query node is no further than DBL_MAX away". But we have a better bound on the k-nearest-neighbor distance: we know that at iteration (t + 1) it is bounded by d_{L_t}(x_i, x_j) + f(|| L_t - L_{t + 1} ||), where f(.) is some quantity I still have to derive.

If we start the kNN search with a much tighter bound than DBL_MAX, the search can be significantly faster (think an order of magnitude). So to do this, we will have to:

  1. Derive the correct bound to use.
  2. Use cached query trees (#1445), or single-tree mode for small-batch computation.
  3. Set the bound to the bounded kNN distance before calling knn.Search() (see the sketch after this list).
  4. If we are not caching query trees, store each query point's k'th NN distance for use next iteration. (I think the code already does this.)
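
To make the plan concrete, here is a rough sketch of what an iteration might look like once all four steps are done. This is hypothetical: f(), oldDistances, transformationChange, and SearchWithBounds() are placeholder names for pieces that don't exist yet (a bound-seeded Search() overload is exactly the refactoring being proposed):

```c++
// Hypothetical sketch of seeding the kNN search at LMNN iteration t + 1.
// oldDistances(k - 1, i) holds the k'th NN distance of query i under L_t.
arma::vec bounds(dataset.n_cols);
for (size_t i = 0; i < dataset.n_cols; ++i)
{
  // f(|| L_t - L_{t + 1} ||) is the correction term from step 1.
  bounds[i] = oldDistances(k - 1, i) + f(transformationChange, i);
}

// Step 3: start each query's bound here instead of at DBL_MAX.
// (SearchWithBounds() is a placeholder for the refactored interface.)
knn.SearchWithBounds(k, neighbors, distances, bounds);
```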
@manish7294


manish7294 commented Jun 28, 2018

This sounds like another pretty good idea. I believe the bounds here will be somewhat similar to the ones we will be using in #1449.

@manish7294


manish7294 commented Jun 28, 2018

I think this may be correct:
f(|| L_t - L_{t + 1} ||) = || L_{t + 1} - L_t ||_F^2 * max( ||x_a||^2 + ||x_i||^2)

x_a : a point among the k'th nearest neighbors of all the data points

@rcurtin


rcurtin commented Jul 2, 2018

Right, I think this is close to correct. Here's a derivation, supposing x_i is the query point and x_a is the k-nearest neighbor (although the notation doesn't actually matter):

d(L_{t + 1} x_i, L_{t + 1} x_a) <= d(L_t x_i, L_t x_a) + d(L_t x_i, L_{t + 1} x_i) + d(L_t x_a, L_{t + 1} x_a)
                                <= d(L_t x_i, L_t x_a) + || L_t - L_{t + 1} ||_F^2 (|| x_i ||^2 + || x_a ||^2).
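
As a cross-check, the step bounding the middle terms follows from the Frobenius norm dominating the operator norm; a minimal sketch in LaTeX, written assuming d is the plain (unsquared) Euclidean distance:

```latex
d(L_t x_i, L_{t+1} x_i) = \| (L_t - L_{t+1}) x_i \|_2
                        \le \| L_t - L_{t+1} \|_2 \, \| x_i \|_2
                        \le \| L_t - L_{t+1} \|_F \, \| x_i \|_2
```

and likewise for x_a; whether the squared exponents in the line above are needed depends on the distance convention in use.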

Maybe I overlooked something? I'm not sure where the max() came from in your solution, and it seems like maybe it is missing a second argument?

@manish7294


manish7294 commented Jul 2, 2018

Ah! That was just an intuition-based expression I had written earlier :)

@manish7294


manish7294 commented Jul 5, 2018

Let me explain the reasoning behind max().

d(L_{t + 1} x_i, L_{t + 1} x_a) <= d(L_t x_i, L_t x_a) + || L_t - L_{t + 1} ||_F^2 (|| x_i ||^2 + || x_a ||^2)

The above bound is specific to an individual point, but when we are dealing with a dataset (or a subset of one) we need a single value, in place of DBL_MAX, that bounds all points. So here max() is the maximum value of || x_i ||^2 + || x_a ||^2 over all data points: taking the maximum guarantees the bound holds for every point.
Though there is no point in doing this if we can set an individual bound in place of DBL_MAX for each query point, which I am not sure is possible.
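
To make that concrete, here is a minimal sketch of computing the dataset-wide term (assuming mlpack's points-as-columns layout and a `neighbors` matrix whose last row holds each point's k'th nearest neighbor from iteration t; the function name is illustrative):

```c++
#include <algorithm>
#include <mlpack/core.hpp>

// max over i of (|| x_i ||^2 + || x_a ||^2), where x_a is the k'th nearest
// neighbor of x_i at the previous iteration.
double MaxNormTerm(const arma::mat& dataset,
                   const arma::Mat<size_t>& neighbors)
{
  // Squared L2 norm of each point (points are columns of `dataset`).
  const arma::rowvec sqNorms = arma::sum(dataset % dataset, 0);

  double maxTerm = 0.0;
  for (size_t i = 0; i < dataset.n_cols; ++i)
  {
    // The last row of `neighbors` holds the k'th (furthest) of the k neighbors.
    const size_t a = neighbors(neighbors.n_rows - 1, i);
    maxTerm = std::max(maxTerm, sqNorms[i] + sqNorms[a]);
  }

  return maxTerm;
}
```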

@rcurtin


rcurtin commented Jul 5, 2018

Right, this is a bound for each individual query point, so I think the max() is not needed here.

@rcurtin


rcurtin commented Jul 24, 2018

I'll handle the tree-related part of this one, unless you are already working on it. I'm thinking about single-tree search first: when we do k-nearest-neighbor search with a single-tree algorithm, we initialize all of the resulting distances to [DBL_MAX, DBL_MAX, ...]. Then, during the search, we update that distance list with the "best nearest neighbors so far" (so each value only decreases), and we choose to prune parts of the tree based on those values.

But the thing is, we already know at the outset of our search that the k'th nearest neighbor cannot be any further away than bound := d(L_t * x_i, L_t * x_a) + || L_t - L_{t + 1} ||_F^2 ( || x_i ||^2 + || x_a ||^2) where x_a is the k'th nearest neighbor of x_i at iteration t. So, we can instead initialize that entire set of resulting distances to [bound, bound, ...], and this will allow much tighter pruning.

However, this set of resulting distances is initialized in neighbor_search_rules_impl.hpp:51, and I think some amount of refactoring will be needed to set these the way we want. If you like, you can take a look through the code and see where you think we could refactor nicely to support this. I'll sleep on it and see what I think in the morning.
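
For context, the initialization at that line builds, for every query point, a list of k candidates at SortPolicy::WorstDistance() (DBL_MAX for nearest neighbor search). One way the refactoring could look, sketched with a hypothetical initialBounds argument (empty means "keep the old behavior"):

```c++
// Sketch of a possible refactoring of the candidate-list initialization in
// NeighborSearchRules; `initialBounds` is a hypothetical new parameter
// holding one relaxed starting bound per query point (empty = old behavior).
candidates.reserve(querySet.n_cols);
for (size_t i = 0; i < querySet.n_cols; ++i)
{
  const double init = initialBounds.empty() ?
      SortPolicy::WorstDistance() : initialBounds[i];

  // Each query point starts with k dummy candidates at distance `init`
  // instead of DBL_MAX, allowing much tighter pruning from the start.
  std::vector<Candidate> vect(k, std::make_pair(init, size_t() - 1));
  candidates.push_back(CandidateList(CandidateCmp(), std::move(vect)));
}
```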

@manish7294


manish7294 commented Jul 26, 2018

Did you come up with anything regarding this? As far as I can see, there is no easy way to make this change without a good amount of refactoring, since we can't directly access either neighbor_search_rules or the sort policy (and the sort policy isn't a good place to make the change anyway). Maybe the change needs to be made in neighbor_search itself.

@rcurtin


rcurtin commented Jul 26, 2018

Yes, I have been thinking about it. I think that I can implement this one so you can focus on the other optimizations and BoostMetric. I'll open a PR when it is ready.

@manish7294


manish7294 commented Jul 26, 2018

Thanks! Shall I open a PR for BoostMetric as well, so that it doesn't get left behind?

@rcurtin


rcurtin commented Jul 26, 2018

Right, I think that is not a bad idea. When you have a chance, be sure to go through and apply as many of the optimizations we've already applied to LMNN as you can. That will save some review time, because otherwise I will probably point them out as things to be improved. :)

@rcurtin


rcurtin commented Jul 28, 2018

An update on this one: I worked on it yesterday and today, and I have a branch rcurtin/lmnn-tree-rules that you can take a look at. It does not incorporate the bounds from this issue yet, but instead of running num_classes KNN searches, I built custom rules that run the search only once. We can also do the targets and impostors searches simultaneously.

My preliminary results show a ~20-40% speedup on the covertype dataset (at various sizes) with L-BFGS, ~40-60% with AMSGrad, and ~10-30% on MNIST. I have a lot more simulations I want to run, but keep in mind the actual optimization from this issue isn't done yet. (Technically it will require tree caching to work, so I'll adapt your work from that PR into the branch I am working on and send an update when it's done.)

So, that said, I think the only remaining LMNN work right now is timing the impBounds branch, and then I suppose working on the BoostMetric PR would be the next thing to do.

@manish7294


manish7294 commented Jul 28, 2018

Wow, that's awesome! I couldn't have imagined doing something like that; you really have done a great amount of work, and I can't stop admiring it. Thanks a lot. I will have a detailed look later tonight. :)
Just a small comment from a quick look: we are going to use triplets in BoostMetric, so maybe we should avoid removing 9994c52; if you like, we can re-add it during the BoostMetric implementation.

@rcurtin


rcurtin commented Jul 31, 2018

Ah, ok, we can re-add that; it's no problem. I didn't realize that it was going to be used somewhere. :)

And it is really not that much work for me; keep in mind that this is basically the exact topic my PhD thesis was on (accelerating things with trees), and I wrote the NeighborSearchRules code, so I already have a pretty good idea of exactly what needs to be changed. :)

@mlpack-bot


mlpack-bot bot commented Feb 18, 2019

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Feb 18, 2019

@rcurtin rcurtin removed the s: stale label Feb 18, 2019
