
LMNN: optimization of kNN computation by tree caching #1445

Open
rcurtin opened this Issue Jun 27, 2018 · 19 comments

@rcurtin

rcurtin commented Jun 27, 2018

This is a follow-up to the LMNN support added in #1407. One of the biggest parts of the computation we are performing is the repeated calculation of the impostors with a kNN search. Looking at current benchmarks, the algorithm spends half or more of its time performing these searches. Depending on the dataset, a lot of this time can be the tree-building process (which currently has no timer running when the reference tree is built). But the points in the tree don't change between iterations; they are only stretched. So if we can cache the tree structure between iterations, we can avoid that tree-building cost entirely.

As a side note, for the SGD-like optimizers we run kNN with small batches as the query set. It's not worthwhile to cache the trees built on these batches, so let's not worry about that. But it is worthwhile to cache the reference trees, which for an L-BFGS-like optimizer are also used as the query sets.

However, what makes caching these trees tricky is that each iteration changes the transformation matrix L, which means that all the bounds and quantities in our tree become wrong. But if we update all the quantities inside the tree according to the new transformation matrix L, then we can use it for kNN search in the new space correctly. Over time, if L changes too much, the kNN search will get slower and slower (the trees will be able to prune less and less), but I am not sure whether this will be enough of a problem that we'll need to care about it. We can figure that out during simulations.
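(For reference, the distance being computed in these searches is just the Euclidean distance between L-stretched points,

d_L(x_i, x_j) = || L x_i - L x_j ||_2 = || L (x_i - x_j) ||_2,

which is why stretching the cached tree's points and bounds by L and then searching with the plain Euclidean metric gives the right answers.)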

It would probably be worthwhile to look through the KD-tree code (which is built with the BinarySpaceTree class) in src/mlpack/core/tree/binary_space_tree/ to see how these trees are built. I will give a short description here that I hope will be helpful. Given a dataset, we repeatedly split it on the dimension of maximum variance, in order to organize our points. Each node in the tree, then, contains points that are localized in space. For the BinarySpaceTree structure, there are a few important quantities:

  • Bound() (of type HRectBound): this specifies the bounding hyper-rectangle of each node. Any point held in a node (and any point held in any descendant of that node) will be contained within the bound.

  • FurthestDescendantDistance(): this is the distance between the center of the bound and the furthest descendant point from the center. (A descendant point is any point held in the node or any descendant of the node. In our BinarySpaceTree class, only leaf nodes in the tree hold any points, so the set of descendant points of a given node N turns out to be the points held in all leaf nodes that are descendants of the node N.)

  • MinimumBoundDistance(): this is half the minimum width of the bound.

  • ParentDistance(): this is the distance between the center of the bound and the center of the parent's bound.

  • The dataset itself (Dataset()).

For the kNN search to work correctly with a different L matrix, we will have to adjust each of these quantities. Off the top of my head, we should be able to do this by first multiplying the tree's dataset by the transformation matrix (tree.Dataset() = transformation * tree.Dataset() or something like this). Then, we could do a depth-first or breadth-first traversal of the tree, update each bound by stretching its dimensions according to the transformation matrix, and then recalculate the furthest descendant distance, minimum bound distance, and parent distance.
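As a very rough sketch of that flow (UpdateTree() here is a hypothetical helper implementing the traversal just described):

// Stretch the cached tree's dataset in place...
tree.Dataset() = transformation * tree.Dataset();
// ...then walk the tree, refreshing the bounds, furthest descendant distances,
// minimum bound distances, and parent distances.
UpdateTree(tree, transformation);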

@manish7294: let me know what you think of the idea, and then we can decide how we'll do it. I don't think there is any need for a MahalanobisHRectBound class or anything like that.

@manish7294


manish7294 commented Jun 28, 2018

The idea seems pretty decent, and it definitely seems possible. It would remove the overhead of building the trees again and again if we can just adapt the tree to the new space. Let's do this :)
I think the first step will be to generate the trees and cache them efficiently (I think the neighbor search code already provides a ReferenceTree() function to help us out, and then maybe a vector of trees with size equal to the number of unique labels can be used). Then maybe we can call individual tree updates (details to be decided) using our reference tree without rebuilding anything at all.

@manish7294


manish7294 commented Jun 28, 2018

I think we will need to perform the above procedure for both the reference and query trees, right?

@rcurtin


rcurtin commented Jun 28, 2018

That's correct, we'll need to cache the query trees and modify them also. There's no need to extract the reference trees from the KNN objects; you can just store the KNN objects themselves in one vector and the query trees in another.
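Roughly, the caching could look something like this (just a sketch; the container names are illustrative, and the exact tree typedef exposed by KNN may be named differently):

// One cached KNN object (holding its reference tree) per impostor search we
// repeat, plus the cached query trees; both get stretched each iteration.
std::vector<KNN> knnObjects;
std::vector<KNN::Tree> queryTrees;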

@manish7294


manish7294 commented Jul 7, 2018

I tried working on this, but I can't seem to figure out how we are going to update the tree. Please help me out.

Given a dataset, we repeatedly split it on the dimension of maximum variance, in order to organize our points.

With this optimization we are basically bypassing the SplitNode step, right? I was wondering: won't the order of the points change as we modify the dataset?

@rcurtin


rcurtin commented Jul 9, 2018

Right, imagine that we have some KDTree queryTree. Then we'll need to write something a little like:

// First we need to stretch the dataset.
queryTree.Dataset() = transformationMatrix * queryTree.Dataset();
// Now call the function below...
UpdateTree(queryTree, transformationMatrix);

void UpdateTree(KDTree& node, const arma::mat& transformationMatrix)
{
  // Update each of the HRectBound dimensions...
  for (size_t d = 0; d < node.Bound().Dim(); ++d)
  {
    // Basically we had a previous bound and we want to "stretch" it for each dimension...
    node.Bound()[d].Hi() = some_operation(node.Bound()[d].Hi(), transformationMatrix);
    node.Bound()[d].Lo() = some_operation(node.Bound()[d].Lo(), transformationMatrix);
    // I haven't derived the exact expression above yet, you could give it a shot---I think it should
    // not be too difficult.
  }
  // Also recalculate node.Bound().MinWidth() and set it with the new bounds...

  // We also have to update the properties of the tree.  Some are easier after recursion.
  node.MinimumBoundDistance() = node.Bound().MinWidth() / 2.0;
  // Technically this is loose but it is what the BinarySpaceTree already does.
  node.FurthestDescendantDistance() = 0.5 * node.Bound().Diameter();

  // Recurse into the children.
  if (node.NumChildren() > 0)
  {
    UpdateTree(node.Child(0), transformationMatrix);
    UpdateTree(node.Child(1), transformationMatrix);

    // Recompute the parent distance for the left and right child.
    arma::vec center, leftCenter, rightCenter;
    node.Center(center);
    node.Child(0).Center(leftCenter);
    node.Child(1).Center(rightCenter);

    const double leftParentDistance = node.Metric().Evaluate(center, leftCenter);
    const double rightParentDistance = node.Metric().Evaluate(center,
        rightCenter);

    node.Child(0).ParentDistance() = leftParentDistance;
    node.Child(1).ParentDistance() = rightParentDistance;
  }
}

I think something along those lines will work correctly. I haven't tested that, so double-check it, but it should be the right direction at least.

@manish7294


manish7294 commented Jul 10, 2018

I tried running a prototype of this, and here are some observations.

// Basically we had a previous bound and we want to "stretch" it for each dimension...
node.Bound()[d].Hi() = some_operation(node.Bound()[d].Hi(), transformationMatrix);
node.Bound()[d].Lo() = some_operation(node.Bound()[d].Lo(), transformationMatrix);

I did a quick derivation, the change in points is basically arma::mat change = ((Transformation - TransformationOld) * dataset)
I think node.Bound()[d].Hi() should be node.Bound()[d].Hi() + max(change.row(d)). Here max() is the maximum over all entries of that row.

// Also recalculate node.Bound().MinWidth() and set it with the new bounds...

new minWidth should be basically the max(node.Bound[d].Hi() - node.Bound[d].Lo())

// We also have to update the properties of the tree. Some are easier after recursion.
node.MinimumBoundDistance() = node.Bound().MinWidth() / 2.0;

I think the program will compute it by itself. Just MinWidth() needs to be updated first.

// Technically this is loose but it is what the BinarySpaceTree already does.
node.FurthestDescendantDistance() = 0.5 * node.Bound().Diameter();

There is no set function for doing this in the current code. Maybe adding a new function will do.

I'm not sure that what I wrote above is completely correct. Please correct me if I'm wrong anywhere.

@rcurtin


rcurtin commented Jul 12, 2018

I did a quick derivation, the change in points is basically arma::mat change = ((Transformation - TransformationOld) * dataset)

Right, exactly, and we can see our hyperrectangle bound as parameterized by two points: the Hi() for each dimension and the Lo() for each dimension. So if we made a point like

arma::vec p(d);
for (size_t i = 0; i < d; ++i)
  p[i] = node.Bound()[i].Hi();

Then we could just transform the same way:

p = (transformationMatrix - transformationOld) * p;

There is a more efficient way to write that, but in any case, the same idea should apply to both the low and high sides of the bound.

I think the program will compute it by itself. Just MinWidth() needs to be updated first.

Correct me if I'm wrong, but I believe MinimumBoundDistance() is only set when the tree is built, so when we updated the tree we would have to manually update it.

There is no set function for doing this in the current code. Maybe adding a new function will do.

I'm not sure I understand; the FurthestDescendantDistance() function should be sufficient here to set a new value.

@rcurtin


rcurtin commented Jul 12, 2018

Oops, forgot one:

new minWidth should be basically the max(node.Bound[d].Hi() - node.Bound[d].Lo())

Shouldn't that be min() not max()?

@manish7294


manish7294 commented Jul 13, 2018

Oops, forgot one:

new minWidth should be basically the max(node.Bound[d].Hi() - node.Bound[d].Lo())

Shouldn't that be min() not max()?

Ah! Sorry, I wonder what I was thinking when I wrote that. Yes, it should be min() :)

I think the program will compute it by itself. Just MinWidth() needs to be updated first.

Correct me if I'm wrong, but I believe MinimumBoundDistance() is only set when the tree is built, so when we updated the tree we would have to manually update it.

There is no set function for doing this in the current code. Maybe adding a new function will do.

I'm not sure I understand; the FurthestDescendantDistance() function should be sufficient here to set a new value.

I meant to say that there's no function to modify these values. We currently only have functions that return the value.

ElemType FurthestDescendantDistance() const;

 //! Return the minimum distance from the center of the node to any bound edge.
 ElemType MinimumBoundDistance() const;

So I think adding something like this will do

ElemType& FurthestDescendantDistance() { return furthestDescendantDistance; }

 //! Modify the minimum distance from the center of the node to any bound edge.
 ElemType& MinimumBoundDistance() { return minimumBoundDistance; }

Now for the implementation part, we will need the old transformation matrix as well. So maybe an extra parameter, oldTransformationMatrix, should also be passed, right?

node.Bound()[d].Hi() = node.Bound()[d].Hi() + max(change.row(d))
And for implementation purposes, this is correct, right?
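If so, the helper's signature could end up looking something like this (just a sketch; the name and parameter list are only illustrative):

template<typename TreeType>
void UpdateTree(TreeType& node,
                const arma::mat& transformation,
                const arma::mat& oldTransformation);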

@manish7294


manish7294 commented Jul 15, 2018

template<typename TreeType>
void UpdateTree(TreeType& node, const arma::mat& dataset)
{
  double minWidth = DBL_MAX;
  // Update each of the HRectBound dimensions...
  for (size_t d = 0; d < node.Bound().Dim(); ++d)
  {
    // Recompute the bound in this dimension from the (stretched) dataset.
    node.Bound()[d].Hi() = arma::max(dataset.row(d));
    node.Bound()[d].Lo() = arma::min(dataset.row(d));
    double width = node.Bound()[d].Hi() - node.Bound()[d].Lo();
    if (minWidth > width)
      minWidth = width;
  }
  // Also recalculate node.Bound().MinWidth() and set it with the new bounds...
  node.Bound().MinWidth() = minWidth;

  // We also have to update the properties of the tree.  Some are easier after recursion.
  node.MinimumBoundDistance() = node.Bound().MinWidth() / 2.0;
  // Technically this is loose but it is what the BinarySpaceTree already does.
  node.FurthestDescendantDistance() = 0.5 * node.Bound().Diameter();

  // Recurse into the children.
  if (node.NumChildren() > 0)
  {
    UpdateTree(node.Child(0), dataset);
    UpdateTree(node.Child(1), dataset);

    // Recompute the parent distance for the left and right child.
    arma::vec center, leftCenter, rightCenter;
    node.Center(center);
    node.Child(0).Center(leftCenter);
    node.Child(1).Center(rightCenter);

    const double leftParentDistance = node.Metric().Evaluate(center, leftCenter);
    const double rightParentDistance = node.Metric().Evaluate(center,
        rightCenter);

    node.Child(0).ParentDistance() = leftParentDistance;
    node.Child(1).ParentDistance() = rightParentDistance;
  }
}

I have been trying the above code, and training does seem to be happening, since minWidth keeps changing. But there seems to be some variation in the learned distance.

To my knowledge we are updating every property of both the BinarySpaceTree and the bound here, except the basic tree structure. So I wonder if the variation is because we keep the tree structure constant.

Given a dataset, we repeatedly split it on the dimension of maximum variance, in order to organize our points.

Is it not possible that the maximum variance dimension changes as the transformationMatrix changes?

@rcurtin


rcurtin commented Jul 15, 2018

So you are right that the dimension with maximum variance can indeed change. But that should not affect the correctness of the results; it may make the search slightly slower, though that slowdown is still better than the cost of rebuilding the tree.

I think we should write a unit test for UpdateTree() to ensure it is correct. To do this, we can take some dataset and some transformation matrix and call UpdateTree() and calculate kNN results. Then, create a stretched dataset with that transformation matrix and ensure its kNN results are the same.
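Something along these lines, perhaps (a rough sketch only, using the UpdateTree() prototype from this thread; exact constructors and accessors may differ a bit):

// Random data, a separate query set, and a random linear transformation.
arma::mat dataset = arma::randu<arma::mat>(5, 500);
arma::mat queries = arma::randu<arma::mat>(5, 100);
arma::mat transformation = arma::randu<arma::mat>(5, 5);

// Build a tree on the original data, stretch it in place, and search with it.
KNN::Tree tree(dataset);  // assuming the tree typedef exposed by KNN
tree.Dataset() = transformation * tree.Dataset();  // assumes a non-const Dataset()
UpdateTree(tree, transformation);
KNN knn;
knn.Train(tree);
arma::Mat<size_t> neighbors, expectedNeighbors;
arma::mat distances, expectedDistances;
knn.Search(queries, 3, neighbors, distances);

// Ground truth: search an explicitly stretched dataset with the same queries.
KNN expected;
expected.Train(transformation * dataset);
expected.Search(queries, 3, expectedNeighbors, expectedDistances);

// The returned distances should match for every query and rank; the neighbor
// indices may be permuted, since building a tree reorders the points.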

@manish7294


manish7294 commented Jul 15, 2018

I ran a unit test as you suggested, and the results were a bit different: out of 750 elements, 151 were misclassified.

@manish7294


manish7294 commented Jul 16, 2018

I think I have found quite an odd behavior in KNN: if we train a KNN instance on a dataset, and then extract the reference tree from that instance and pass it to another KNN instance, the search results from the two instances are different.

 arma::Mat<size_t> neighbors;
 arma::mat distances;
 KNN knn;
 knn.Train(dataset);
 knn.Search(k, neighbors, distances);
 arma::Mat<size_t> output = neighbors;

 KNN knn2;
 knn2.Train(knn.ReferenceTree());
 knn2.Search(k, neighbors, distances);
 // Comparing the earlier output matrix with the new neighbors matrix shows
 // that the two matrices are different.

Do you think this is to be expected? Maybe I am overlooking something.

@rcurtin


rcurtin commented Jul 16, 2018

So, definitely when you open the PR we should include the test that you wrote. A couple of notes, first about your test directly above: when you call knn.Train(dataset), the dataset will be copied and reordered for the tree building process. But since the tree is built inside the knn object, the indices of the nearest neighbors will be re-mapped to their original points before they are returned. In order to avoid that, you would need to either rebuild the tree each time, or pass the dataset both times, i.e. either

KNN knn2;
knn2.Train(dataset);

or

// Build a tree manually for the first KNN object.
KNN::TreeType tree(dataset);
KNN knn;
knn.Train(tree); // this will copy the tree, then you can copy it again or make a new one for knn2

That should give you the same results. I also took a look at your UpdateTree() function and have a few comments; I'll put them below.

node.Bound()[d].Hi() = arma::max(dataset.row(d));

This will set the upper side of the bound to the max of that dimension across the entire dataset. Shouldn't we calculate Hi() by transforming the existing Hi()? I wrote some prototype code in a previous comment that does that; maybe you should try it. Alternatively, you could loop over every descendant point (you can get the index of each point with node.Descendant(i), and node.NumDescendants() gives the number of descendant points of the node).
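For example, something like this (an untested sketch) could recompute the bound from only the node's own descendant points, assuming the tree's dataset has already been stretched:

for (size_t d = 0; d < node.Bound().Dim(); ++d)
{
  // Tightest bound: min/max over only this node's descendant points.
  double lo = DBL_MAX, hi = -DBL_MAX;
  for (size_t i = 0; i < node.NumDescendants(); ++i)
  {
    const double value = node.Dataset()(d, node.Descendant(i));
    lo = std::min(lo, value);
    hi = std::max(hi, value);
  }
  node.Bound()[d].Lo() = lo;
  node.Bound()[d].Hi() = hi;
}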

I think the other parts look correct; try fixing the Hi() and Lo() calculations, and let's see if it passes the test then. If it still doesn't work, just push it to GitHub and let me know which branch, and I can try to debug it. Digging into tree guts was not really a part of your proposal, so I can help out with it. :)

@rcurtin


rcurtin commented Jul 20, 2018

Any update on this one?

@manish7294


manish7294 commented Jul 20, 2018

Sorry for the slow response here; I got a bit sidetracked by BoostMetric and other things.
Thanks for explaining the issue with passing a tree to KNN. It solved the problem I was facing.

Now regarding UpdateTree(), I have tried doing

node.Bound()[d].Hi() += arma::max(change.row(d))
node.Bound()[d].Lo() += arma::min(change.row(d))

and alternatively,

  double minWidth = DBL_MAX;
  arma::Col<double> mins(min(dataset, 1));
  arma::Col<double> maxs(max(dataset, 1));
  // Update each of the HRectBound dimensions...
  for (size_t d = 0; d < node.Bound().Dim(); ++d)
  {
   // Expand bound.
    node.Bound()[d] |= math::RangeType<double>(mins[d], maxs[d]);

    double width = node.Bound()[d].Width();
    if (width < minWidth)
      minWidth = width;
  }
  node.Bound().MinWidth() = minWidth;

but the results were no different from the earlier ones. Though I have one doubt: neighbor search uses the neighbor search statistics inside the neighbor search rules, so don't we need to update those as well?

@rcurtin


rcurtin commented Jul 20, 2018

Can you push the code that you have? I'd like to try and debug this. :)

@manish7294


manish7294 commented Jul 20, 2018

Thanks! That would be a lot of help. :)
I will push it shortly.

@mlpack-bot


mlpack-bot bot commented Feb 18, 2019

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Feb 18, 2019

@rcurtin rcurtin removed the s: stale label Feb 18, 2019
