
Random projection trees #726

Merged
merged 8 commits into mlpack:master on Aug 18, 2016

Conversation

lozhnikov
Contributor

I added an implementation of random projection trees. The RPTreeMax variant is implemented slightly differently, since the technique for calculating the random deviation suggested in the paper is weird. I also use a number of random dataset points, instead of all points, to calculate the median value and the average distance between points.

Since the norm in the paper always denotes the Euclidean distance, I used arma::dot to calculate the scalar product of two vectors.

Right now, the tree uses HRectBound.

for (size_t k = 0; k < direction.n_rows; k++)
  direction[k] = math::Random(-1.0, 1.0);

ElemType length = metric::EuclideanDistance::Evaluate(origin, direction);
Member

I think you could simplify this:

direction.randu(); // Fill with [0, 1].
direction -= 0.5; // Shift to [-0.5, 0.5].
ElemType length = arma::norm(direction); // Get length for normalization.
if (length == 0)
{
  // Degenerate case: pick a random axis-aligned direction instead.
  const size_t k = math::RandInt(direction.n_rows);
  direction[k] = 1.0;
  length = 1.0;
}

// Normalize.
direction /= length;

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov @rcurtin Wow, wait a minute! This pull request seems to implement something very similar to what I have implemented for spill trees: 9588a6d
It splits the nodes according to the values of a given projection. I think we should merge both implementations.
In my commit I implement two abstractions, ProjectionVector and Hyperplane, and use the SpaceSplit class to split the nodes according to their projections.
SplitInfo is really similar to the class that I use, Hyperplane, which can represent axis-orthogonal hyperplanes or general hyperplanes (not necessarily axis-orthogonal).
The PerformSplit method is similar to the SplitPoints method.


direction[k] /= length;
}
}
Member

I think this can also be replaced with the shorter function idea I sketched in another comment. Another thought: since you are using it twice, maybe mark the function static in RPTreeMaxSplit and then call it from RPTreeMeanSplit instead of implementing it again here. I am not sure this is a great function to add to the core mlpack code, because of its odd behavior when a vector of length 0 is selected. Is that behavior the best thing to do here, or is it maybe better to simply draw a random vector again?

Contributor Author

Yes, I agree, we can move that function to the core, but in that case we cannot use arma::norm.

@lozhnikov
Contributor Author

@MarcosPividori I looked through your code; indeed, it is better to rewrite the RP tree split algorithm in terms of ProjectionVector and Hyperplane. I'll do that once your branch is merged.

/**
* Each binary space tree node has two children which represent
* non-overlapping subsets of the space which the node represents. Therefore,
* children are not overlapping.
Member

Same issue with the comment here.

@rcurtin
Member

rcurtin commented Aug 5, 2016

Ok, I think that this is a good start. I like that you were able to reuse BinarySpaceTree. In addition to the comments I made inline, I think there are three main things left to be done:

  • The HRectBound will not be a tight bound. Instead the true bound is a random polytope. But I am not sure it is feasible right now to figure out how to precisely bound that polytope, so we can open an issue and leave that for later.
  • I need to run some timing tests to ensure that performance is reasonable with the RP tree. I think that it should be the same order of magnitude as the kd-tree search time for mlpack_knn.
  • Can you think of any more tests? Right now the tests just ensure that the dataset is ordered reasonably. But I think it makes sense to test the RPTreeMaxSplit and RPTreeMeanSplit classes individually, and then maybe run a test where we ensure that there does exist some splitting hyperplane between each child. (I am not sure how easy that test would be to construct.)

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov @rcurtin
Continuing with my previous comment:

I think we have 3 independent decisions to make, when building a binary space tree:

  • Which kind of splitting Hyperplane: Axis-Orthogonal, or General (not necessarily Axis-Orthogonal).
  • How to choose the hyperplane: Random or Widest range (the one with widest range of projection values).
  • Which Splitting value we will consider: MidPoint value, Mean value, Median Value.

Every combination is valid, for example:

  • Axis-Orthogonal - Random - MidPoint Value.
  • General - Widest range - Mean Value.
  • Axis-Orthogonal - Widest range - Median Value.
  • etc.

Now, for spill trees, I consider all the combinations of hyperplane type and splitting value as 2 template parameters. We could add a third template parameter that represents the criterion for choosing the hyperplane: Random or Widest range (see the sketch below).

What do you think? :)
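
For illustration, here is a rough sketch of the kind of combination I have in mind (the names below are placeholders for this sketch, not the actual spill tree classes):

// Placeholder types for the three independent decisions (hypothetical names).
template<typename MetricType, typename MatType>
class AxisOrthogonalHyperplane; // Splits along a single coordinate axis.

template<typename MetricType, typename MatType>
class GeneralHyperplane;        // Splits along an arbitrary direction.

struct RandomDirection;         // How the splitting direction is chosen.
struct WidestRangeDirection;

struct MidpointSplitValue;      // Where the split value is placed.
struct MeanSplitValue;
struct MedianSplitValue;

// A splitter could then combine the three decisions as template parameters,
// e.g. SpaceSplit<GeneralHyperplane<MetricType, MatType>,
// WidestRangeDirection, MedianSplitValue> for "General - Widest range -
// Median Value".
template<typename HyperplaneType,
         typename DirectionPolicy,
         typename SplitValuePolicy>
class SpaceSplit;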

@lozhnikov
Contributor Author

I looked through a series of papers. The polytope distance problem requires a lot of arithmetic operations. All methods that I came across are based on Gilbert's algorithm (i.e., they are iterative).

I thought that in order to test the split algorithm I would have to remember the normal to the hyperplane. Maybe I am mistaken.

@lozhnikov
Contributor Author

@MarcosPividori The RP tree mean split algorithm sometimes splits the space according to the distance to a particular point. So, I am not sure that it is a good idea to add this functionality to the SpaceSplit class, since SpaceSplit always splits data by a hyperplane.

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov you are right. I supposed that binary space trees always consider a splitting hyperplane, but it looks like that is not true...
Well, it is a bit weird... when considering RPTree-Mean, points can be split based on their distance to a mean point, not using a splitting hyperplane... I don't think this case would be exactly a BinarySpaceTree; it would be something different... wouldn't it?
I have quickly read the paper and I can see RPTree-Mean is defined as you implemented it. But I think there is some kind of contradiction in the paper. In one paragraph it is said:

"On the other hand, an RPTree chooses a direction uniformly at random from the unit sphere ... and splits the data into two roughly equal-sized sets using a hyperplane orthogonal to this direction. We describe two variants, which we call RPTree-Max and RPTree-Mean. Both are adaptive to intrinsic dimension, although the proofs are in different models and use different techniques."

So, one would expect both variants to consider a hyperplane orthogonal to this direction...

When splitting based on the distance to a mean point, one child's sphere will contain the sphere of the other child, won't it?

@lozhnikov
Contributor Author

@MarcosPividori I reread the paper. I think there are no errors. I guess the bound for the mean-split random projection tree should contain a hollow, like the HollowBallBound used by the vantage point tree.

Moreover, I found another paper, "Random projection trees for vector quantization". That paper describes the same mean-split algorithm.

Why do you think that RPTreeMean differs from BinarySpaceTree? Like any BinarySpaceTree, RPTreeMean divides the space into two parts. In contrast to KDTree, the children of an RPTreeMean node may overlap. Thus, I updated the traits for random projection trees.
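
For reference, a minimal sketch of the mean-split decision as I read it in the paper (SplitByDistanceToMean is just an illustrative helper; c is the constant from the paper, and in this implementation both quantities are estimated from a random sample of points rather than computed exactly):

// Sketch only: decides which kind of split RPTreeMean performs, given
// precomputed estimates of the squared diameter and the average squared
// interpoint distance of the node's points.
inline bool SplitByDistanceToMean(const double diameterSquared,
                                  const double averageInterpointDistSquared,
                                  const double c)
{
  // If the set is spread out relative to its average interpoint distance,
  // split by the distance to the mean: the inner child is a ball around the
  // mean, and the outer child is the hollow shell around it.  Otherwise,
  // split by the projection onto a random direction at the median projection
  // value.
  return diameterSquared > c * averageInterpointDistSquared;
}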

@lozhnikov
Contributor Author

I've fixed the issues, and I've worked out tests to ensure that a splitting hyperplane exists.

Two points are situated on opposite sides of a hyperplane if the hyperplane expression takes different signs at these points. Thus, I can find the hyperplane as the solution of a system of inequalities. I use a simple iterative method to find a solution of the system. I thought it wouldn't work, but I tried a number of different seeds and it seems to work.
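
Roughly, the check looks like this (only a sketch of the idea using perceptron-style updates; FindSeparatingHyperplane is a hypothetical helper, not the actual test code):

#include <mlpack/core.hpp>

// Sketch: try to find a hyperplane dot(w, x) + b = 0 that takes different
// signs on the two children, using simple perceptron-style updates.  Returns
// true if every inequality is satisfied within maxIterations.
inline bool FindSeparatingHyperplane(const arma::mat& leftPoints,
                                     const arma::mat& rightPoints,
                                     const size_t maxIterations = 1000)
{
  arma::vec w(leftPoints.n_rows, arma::fill::zeros);
  double b = 0.0;

  for (size_t it = 0; it < maxIterations; ++it)
  {
    bool updated = false;

    // Points of the left child should satisfy dot(w, x) + b < 0.
    for (size_t i = 0; i < leftPoints.n_cols; ++i)
    {
      if (arma::dot(w, leftPoints.col(i)) + b >= 0)
      {
        w -= leftPoints.col(i);
        b -= 1.0;
        updated = true;
      }
    }

    // Points of the right child should satisfy dot(w, x) + b > 0.
    for (size_t i = 0; i < rightPoints.n_cols; ++i)
    {
      if (arma::dot(w, rightPoints.col(i)) + b <= 0)
      {
        w += rightPoints.col(i);
        b += 1.0;
        updated = true;
      }
    }

    if (!updated)
      return true; // Every inequality holds: the hyperplane separates.
  }

  return false;
}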

@MarcosPividori
Contributor

@lozhnikov yes! I agree that there is no error. I meant to say that it was a bit confusing not to use a hyperplane there, because random projection trees seem to be based on splitting hyperplanes. But I am sure they have their reasons!
It becomes a bit difficult to generalize some concepts when we implement many different flavours! :)
So, I think we can consider both PRs separately, and once they are merged into the master branch we can see if it is possible to unify both implementations of the splitting algorithms, to avoid duplicated code. Would you agree?
Thanks!

@lozhnikov
Contributor Author

@MarcosPividori yeah, sure. I think RPTreeMax should definitely be rewritten in terms of SpaceSplit, and RPTreeMean should use the Hyperplane notion.

@rcurtin rcurtin mentioned this pull request Aug 14, 2016
const VecType& point,
const SplitInfo& splitInfo)
{
  return (arma::dot(point, splitInfo.direction) <= splitInfo.splitVal);
Member

I spent a while thinking about this. The inner product is defined for L2 space but only for L2 space---in other spaces like L1, L3, L_inf, etc., it is not defined. I think this is why the paper restricts itself to considering L2. But I don't think that using the dot product like this in other spaces is a problem. Let me know if you think otherwise.

Contributor Author

I should reread the paper. I am not sure that other spaces preserve the theoretical guarantees.

Member

It's true, I think the theory is only valid for L2. I just wanted to point out the issue, that's all. I don't think there is anything we can easily do about it.


template<typename BoundType, typename MatType>
void RPTreeMeanSplit<BoundType, MatType>::GetRandomDirection(
    arma::Col<ElemType>& direction)
Member

I don't know if I have already written this somewhere else, but would it be better to move this to core/math/ as a standalone function?

Member

Ah actually, the function is already there! math::RandVector(). Maybe it would be useful to add an overload with a parameter that allows you to select the number of dimensions. Note that selecting a random vector by uniformly sampling each component and then normalizing, as you've done here, actually causes bias, because you want to sample uniformly from the surface of a sphere. I believe the implementation Nishant wrote many years ago uses the Box-Muller transform:

https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
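
For example, something along these lines (just an illustration of the unbiased approach; RandomUnitDirection is a hypothetical name, not the existing math::RandVector() implementation):

#include <mlpack/core.hpp>

// Draw each component from a standard normal distribution (which Box-Muller
// produces) and normalize; the result is uniformly distributed on the
// surface of the unit sphere.
inline arma::vec RandomUnitDirection(const size_t dimensions)
{
  arma::vec direction = arma::randn<arma::vec>(dimensions); // N(0, 1).
  const double norm = arma::norm(direction);

  // A zero draw is essentially impossible, but handle it anyway.
  if (norm == 0.0)
  {
    direction.zeros();
    direction[0] = 1.0;
    return direction;
  }

  return direction / norm;
}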

@rcurtin
Member

rcurtin commented Aug 16, 2016

Ok, things look pretty good to me. As for the refactoring, we can do that after spill trees are merged.

I ran timing simulations, and as expected, the RP tree is slower than the kd-tree (after all it has looser bounds in all cases). I only have one remaining question and then a few observations.

  • Should we use RPMax as the default? Nearly all of my simulations indicate that it performs better than RPMean.
  • The random projection tree appears to do almost as well as the kd-tree in higher dimensions; for instance, performance is almost the same on the MNIST dataset. I wonder if there exist datasets where it may outperform the kd-tree.
  • The results of one of Pari's papers (http://papers.nips.cc/paper/5121-which-space-partitioning-tree-to-use-for-search) compare random projection trees, kd-trees, and some other types of trees for defeatist search. Pari's results indicate that the rp-tree is generally the worst-performing of the tree types that he tried. I wonder if we will see the same results in our own surveys for defeatist search.

I am not sure, in the end, how effective the random projection tree will be, but I think that it is good to have it implemented. It seems there are not many studies of its empirical effectiveness (if you know of any, send some links! I would be interested to read them).

@lozhnikov
Contributor Author

I fixed the issues. Indeed, the number of base cases with the mean-split rp-tree is greater than the number of base cases with the max-split rp-tree. On the other hand, the total number of scores is lower. I tried a number of datasets and noticed that the total time with the mean-split rp-tree is lower. Indeed, the mean-split rp-tree is more balanced, since there is no shift from the median split value.

@rcurtin
Member

rcurtin commented Aug 18, 2016

The AppVeyor build failure seems to be because there is no /bigobj flag, but #747 solves that, so I'll just ignore the error for now; it will be fixed when that PR is merged (very soon). Based on what you've said, I think staying with the mean-split rp-tree as the default is fine. The numbers I saw were quite close anyway.

@rcurtin rcurtin merged commit 87776e5 into mlpack:master Aug 18, 2016
@rcurtin
Member

rcurtin commented Aug 18, 2016

I made a few simplifications and style fixes in 0f4b25a; let me know if I broke anything and I will fix it. :)
