
LMNN: optimization by avoiding impostor recalculation using bounds #1448

Closed
rcurtin opened this Issue Jun 27, 2018 · 4 comments

@rcurtin
Member

rcurtin commented Jun 27, 2018

This is related to #1407 and is an optimization that we could do. It focuses on the recalculation of impostors, which is performed every range iterations during the optimization. Using the bounds in this paper, we can determine when we do not need to recalculate impostors:

http://www.ratml.org/misc/lmnn_bounds.pdf

Essentially to implement this, we will have to do the following:

  • Perform norm-centering before the entire LMNN computation to reduce the average norm of the points (and make the bound more effective). That's likely a good step to take anyway, since it can help the resulting kNN accuracy.
  • Cache the transformation matrix each time Impostors() is called.
  • At each iteration's call to Impostors(), compute the Frobenius norm of the difference between the current transformation matrix and the cached transformation matrix.
  • If the Frobenius norm is below the bound, don't calculate impostors!

Note that the bound in the paper can apply for each point individually, so it is possible to do the following:

  • In the small-batch setting (Evaluate(..., begin, batchSize)), use single-tree search to only search for query points where the bound requires recomputation of impostors.
  • In the full-batch setting, if query trees are being cached, the computation of a new and smaller query tree would take time. Therefore, only build a custom smaller query tree if, e.g., 30% or less of query points need impostors recalculated.
@manish7294

Member

manish7294 commented Jul 8, 2018

I tried these bounds by creating an overload of Impostors() that calculates impostors only for the specific points of the dataset that violate the bounds (there is no tree caching as of now).
The results were pretty good but dataset-dependent: on iris, vc2, and balance_scale the new runtimes are around 65–70% of the previous runtimes, whereas on letters and covertype-5k the runtimes are almost the same.

@rcurtin

Member Author

rcurtin commented Jul 9, 2018

Right, I would suspect that the speedup here would be dataset-dependent. Would you like to open a PR with what you've got implemented?

@manish7294

Member

manish7294 commented Jul 9, 2018

Sure, I will open one shortly.

@rcurtin

Member Author

rcurtin commented Aug 21, 2018

This is done now that #1466 is merged. @manish7294: great work! 👍
