
Random projection trees #726

Merged
merged 8 commits into mlpack:master on Aug 18, 2016

Conversation

lozhnikov
Contributor

I added an implementation of random projection trees. The RPTreeMax variant is implemented slightly differently, since the technique for calculating the random deviation suggested in the paper is weird. I also use a number of random dataset points, instead of all points, to calculate the median value and the average distance between points.

Since the norm in the paper always denotes the Euclidean distance, I used arma::dot to calculate the scalar product of two vectors.

Right now, the tree uses HRectBound.

for (size_t k = 0; k < direction.n_rows; k++)
  direction[k] = math::Random(-1.0, 1.0);

ElemType length = metric::EuclideanDistance::Evaluate(origin, direction);
Member

I think you could simplify this:

direction.randu(); // Fill with [0, 1].
direction -= 0.5; // Shift to [-0.5, 0.5].
ElemType length = arma::norm(direction); // Get length for normalization.
if (length == 0)
{
  // Degenerate case: pick a random axis-aligned direction instead.
  const size_t k = math::RandInt(direction.n_rows);
  direction[k] = 1.0;
  length = 1.0;
}

// Normalize.
direction /= length;

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov @rcurtin Wow, wait a minute! This pull request seems to implement something very similar to what I have implemented for spill trees: 9588a6d
It splits the nodes according to the values of a given projection. I think we should merge both implementations.
In my commit I implement two abstractions, ProjectionVector and Hyperplane, and use the SpaceSplit class to split the nodes according to their projections.
SplitInfo is really similar to the class that I use, Hyperplane, which can represent axis-orthogonal hyperplanes or general hyperplanes (not necessarily axis-orthogonal).
The PerformSplit method is similar to the SplitPoints method.


direction[k] /= length;
}
}
Member

I think this can also be replaced with the shorter function idea I sketched in another comment. Another thought: since you are using it twice, maybe mark the function static in RPTreeMaxSplit and then call it from RPTreeMeanSplit instead of implementing it again here. I am not sure this is a great function to add to the core mlpack code, because of its odd behavior when a vector of length 0 is selected. Is that behavior the best thing to do here, or is it maybe better to simply draw a random vector again?

Contributor Author

Yes, I agree, we can move that function to the core, but in that case we cannot use arma::norm.

@lozhnikov
Contributor Author

@MarcosPividori I looked through your code; indeed, it is better to rewrite the RP tree split algorithm in terms of ProjectionVector and Hyperplane. I'll do that once your branch is merged.

/**
* Each binary space tree node has two children which represent
* non-overlapping subsets of the space which the node represents. Therefore,
* children are not overlapping.
Member

Same issue with the comment here.

@rcurtin
Member

rcurtin commented Aug 5, 2016

Ok, I think that this is a good start. I like that you were able to reuse BinarySpaceTree. In addition to the comments I made inline, I think there are three main things left to be done:

  • The HRectBound will not be a tight bound. Instead the true bound is a random polytope. But I am not sure it is feasible right now to figure out how to precisely bound that polytope, so we can open an issue and leave that for later.
  • I need to run some timing tests to ensure that performance is reasonable with the RP tree. I think that it should be the same order of magnitude as the kd-tree search time for mlpack_knn.
  • Can you think of any more tests? Right now the tests just ensure that the dataset is ordered reasonably. But I think it makes sense to test the RPTreeMaxSplit and RPTreeMeanSplit classes individually, and then maybe run a test where we ensure that there does exist some splitting hyperplane between each child. (I am not sure how easy that test would be to construct.)

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov @rcurtin
Continuing with my previous comment:

I think we have 3 independent decisions to make, when building a binary space tree:

  • Which kind of splitting Hyperplane: Axis-Orthogonal, or General (not necessarily Axis-Orthogonal).
  • How to choose the hyperplane: Random or Widest range (the one with widest range of projection values).
  • Which Splitting value we will consider: MidPoint value, Mean value, Median Value.

Every combination is valid, for example:

  • Axis-Orthogonal - Random - MidPoint Value.
  • General - Widest range - Mean Value.
  • Axis-Orthogonal - Widest range - Median Value.
  • etc.

Now, for spill trees, I consider all the combinations of hyperplane type and splitting value as 2 template parameters. We could add a third template parameter that represents the criterion for choosing the hyperplane: Random or Widest range (see the sketch below).

What do you think? :)
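
For illustration, here is a rough sketch of the kind of combination I have in mind (the names below are placeholders for this sketch, not the actual spill tree classes):

// Placeholder types for the three independent decisions (hypothetical names).
template<typename MetricType, typename MatType>
class AxisOrthogonalHyperplane; // Splits along a single coordinate axis.

template<typename MetricType, typename MatType>
class GeneralHyperplane;        // Splits along an arbitrary direction.

struct RandomDirection;         // How the splitting direction is chosen.
struct WidestRangeDirection;

struct MidpointSplitValue;      // Where the split value is placed.
struct MeanSplitValue;
struct MedianSplitValue;

// A splitter could then combine the three decisions as template parameters,
// e.g. SpaceSplit<GeneralHyperplane<MetricType, MatType>,
// WidestRangeDirection, MedianSplitValue> for "General - Widest range -
// Median Value".
template<typename HyperplaneType,
         typename DirectionPolicy,
         typename SplitValuePolicy>
class SpaceSplit;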

@lozhnikov
Contributor Author

I looked through a series of papers. The polytope distance problem requires a lot of arithmetic operations. All methods that I came across are based on Gilbert's algorithm (i.e., they are iterative).

I thought that in order to test the split algorithm I would have to remember the normal to the hyperplane. Maybe I am mistaken.

@lozhnikov
Contributor Author

@MarcosPividori The RP tree mean split algorithm sometimes splits the space according to the distance to a particular point. So, I am not sure that it is a good idea to add this functionality to the SpaceSplit class, since SpaceSplit always splits data by a hyperplane.

@MarcosPividori
Contributor

MarcosPividori commented Aug 5, 2016

@lozhnikov you are right. I supposed that binary space trees always consider a splitting hyperplane, but it looks like that is not true...
Well, it is a bit weird... when considering RPTree-Mean, points can be split based on their distance to a mean point, not using a splitting hyperplane... I don't think this case would be exactly a BinarySpaceTree; it would be something different... wouldn't it?
I have quickly read the paper and I can see RPTree-Mean is defined as you implemented it. But I think there is some kind of contradiction in the paper. In one paragraph it is said:

"On the other hand, an RPTree chooses a direction uniformly at random from the unit sphere ... and splits the data into two roughly equal-sized sets using a hyperplane orthogonal to this direction. We describe two variants, which we call RPTree-Max and RPTree-Mean. Both are adaptive to intrinsic dimension, although the proofs are in different models and use different techniques."

So, one would expect both variants to consider a hyperplane orthogonal to this direction...

When splitting based on the distance to a mean point, one child's sphere will contain the sphere of the other child, won't it?

@lozhnikov
Contributor Author

@MarcosPividori I reread the paper. I think there are no errors. I guess the bound for the mean-split random projection tree should contain a hollow, like the HollowBallBound used by the vantage point tree.

Moreover, I found another paper, "Random projection trees for vector quantization". That paper describes the same mean-split algorithm.

Why do you think that RPTreeMean differs from BinarySpaceTree? Like any BinarySpaceTree, RPTreeMean divides the space into two parts. In contrast to KDTree, the children of an RPTreeMean node may overlap. Thus, I updated the traits for random projection trees.
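
For reference, a minimal sketch of the mean-split decision as I read it in the paper (SplitByDistanceToMean is just an illustrative helper; c is the constant from the paper, and in this implementation both quantities are estimated from a random sample of points rather than computed exactly):

// Sketch only: decides which kind of split RPTreeMean performs, given
// precomputed estimates of the squared diameter and the average squared
// interpoint distance of the node's points.
inline bool SplitByDistanceToMean(const double diameterSquared,
                                  const double averageInterpointDistSquared,
                                  const double c)
{
  // If the set is spread out relative to its average interpoint distance,
  // split by the distance to the mean: the inner child is a ball around the
  // mean, and the outer child is the hollow shell around it.  Otherwise,
  // split by the projection onto a random direction at the median projection
  // value.
  return diameterSquared > c * averageInterpointDistSquared;
}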

@lozhnikov
Contributor Author

I've fixed the issues, and I've worked out tests to ensure that a splitting hyperplane exists.

Two points are situated on opposite sides of a hyperplane if the hyperplane expression takes different signs at these points. Thus, I can find the hyperplane as the solution of a system of inequalities. I use a simple iterative method to find a solution of the system. I thought it wouldn't work, but I tried a number of different seeds and it seems to work.
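
Roughly, the check looks like this (only a sketch of the idea using perceptron-style updates; FindSeparatingHyperplane is a hypothetical helper, not the actual test code):

#include <mlpack/core.hpp>

// Sketch: try to find a hyperplane dot(w, x) + b = 0 that takes different
// signs on the two children, using simple perceptron-style updates.  Returns
// true if every inequality is satisfied within maxIterations.
inline bool FindSeparatingHyperplane(const arma::mat& leftPoints,
                                     const arma::mat& rightPoints,
                                     const size_t maxIterations = 1000)
{
  arma::vec w(leftPoints.n_rows, arma::fill::zeros);
  double b = 0.0;

  for (size_t it = 0; it < maxIterations; ++it)
  {
    bool updated = false;

    // Points of the left child should satisfy dot(w, x) + b < 0.
    for (size_t i = 0; i < leftPoints.n_cols; ++i)
    {
      if (arma::dot(w, leftPoints.col(i)) + b >= 0)
      {
        w -= leftPoints.col(i);
        b -= 1.0;
        updated = true;
      }
    }

    // Points of the right child should satisfy dot(w, x) + b > 0.
    for (size_t i = 0; i < rightPoints.n_cols; ++i)
    {
      if (arma::dot(w, rightPoints.col(i)) + b <= 0)
      {
        w += rightPoints.col(i);
        b += 1.0;
        updated = true;
      }
    }

    if (!updated)
      return true; // Every inequality holds: the hyperplane separates.
  }

  return false;
}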

@MarcosPividori
Contributor

@lozhnikov yes! I agree that there is no error. I meant to say that it was a bit confusing not to use a hyperplane there, because random projection trees seem to be based on splitting hyperplanes. But I am sure they have their reasons!
It becomes a bit difficult to generalize some concepts when we implement many different flavours! :)
So, I think we can consider both PRs separately, and once they are merged into the master branch we can see if it is possible to unify both implementations of the splitting algorithms, to avoid duplicated code. Would you agree?
Thanks!

@lozhnikov
Contributor Author

@MarcosPividori yeah, sure. I think RPTreeMax should definitely be rewritten in terms of SpaceSplit, and RPTreeMean should use the Hyperplane notion.

@rcurtin rcurtin mentioned this pull request Aug 14, 2016
const VecType& point,
const SplitInfo& splitInfo)
{
  return (arma::dot(point, splitInfo.direction) <= splitInfo.splitVal);
Member

I spent a while thinking about this. The inner product is defined for L2 space but only for L2 space---in other spaces like L1, L3, L_inf, etc., it is not defined. I think this is why the paper restricts itself to considering L2. But I don't think that using the dot product like this in other spaces is a problem. Let me know if you think otherwise.

Contributor Author

I should reread the paper. I am not sure that other spaces preserve the theoretical guarantees.

Member

It's true, I think the theory is only valid for L2. I just wanted to point out the issue, that's all. I don't think there is anything we can easily do about it.


template<typename BoundType, typename MatType>
void RPTreeMeanSplit<BoundType, MatType>::GetRandomDirection(
    arma::Col<ElemType>& direction)
Member

I don't know if I have already written this somewhere else, but would it be better to move this to core/math/ as a standalone function?

Member

Ah actually, the function is already there! math::RandVector(). Maybe it would be useful to add an overload with a parameter that allows you to select the number of dimensions. Note that selecting a random vector by uniformly sampling each component and then normalizing, as you've done here, actually causes bias, because you want to sample uniformly from the surface of a sphere. I believe the implementation Nishant wrote many years ago uses the Box-Muller transform:

https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
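
For example, something along these lines (just an illustration of the unbiased approach; RandomUnitDirection is a hypothetical name, not the existing math::RandVector() implementation):

#include <mlpack/core.hpp>

// Draw each component from a standard normal distribution (which Box-Muller
// produces) and normalize; the result is uniformly distributed on the
// surface of the unit sphere.
inline arma::vec RandomUnitDirection(const size_t dimensions)
{
  arma::vec direction = arma::randn<arma::vec>(dimensions); // N(0, 1).
  const double norm = arma::norm(direction);

  // A zero draw is essentially impossible, but handle it anyway.
  if (norm == 0.0)
  {
    direction.zeros();
    direction[0] = 1.0;
    return direction;
  }

  return direction / norm;
}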

@rcurtin
Member

rcurtin commented Aug 16, 2016

Ok, things look pretty good to me. As for the refactoring, we can do that after spill trees are merged.

I ran timing simulations, and as expected, the RP tree is slower than the kd-tree (after all it has looser bounds in all cases). I only have one remaining question and then a few observations.

  • Should we use RPMax as the default? Nearly all of my simulations indicate that it performs better than RPMean.
  • The random projection tree appears to do almost as well as the kd-tree in higher dimensions; for instance, performance is almost the same on the MNIST dataset. I wonder if there exist datasets where it may outperform the kd-tree.
  • The results of one of Pari's papers (http://papers.nips.cc/paper/5121-which-space-partitioning-tree-to-use-for-search) compare random projection trees, kd-trees, and some other types of trees for defeatist search. Pari's results indicate that the rp-tree is generally the worst-performing of the tree types that he tried. I wonder if we will see the same results in our own surveys for defeatist search.

I am not sure, in the end, how effective the random projection tree will be, but I think that it is good to have it implemented. It seems there are not many studies of its empirical effectiveness (if you know of any, send some links! I would be interested to read them).

@lozhnikov
Contributor Author

I fixed the issues. Indeed, the number of base cases with the mean-split rp-tree is greater than the number of base cases with the max-split rp-tree. On the other hand, the total number of scores is lower. I tried a number of datasets and noticed that the total time with the mean-split rp-tree is lower. Indeed, the mean-split rp-tree is more balanced, since there is no shift from the median split value.

@rcurtin
Member

rcurtin commented Aug 18, 2016

The AppVeyor build failure seems to be because there is no /bigobj flag, but #747 solves that, so I'll just ignore the error for now; it will be fixed when that PR is merged (very soon). Based on what you've said, I think staying with the mean-split rp-tree as the default is fine. The numbers I saw were quite close anyway.

@rcurtin rcurtin merged commit 87776e5 into mlpack:master Aug 18, 2016
@rcurtin
Member

rcurtin commented Aug 18, 2016

I made a few simplifications and style fixes in 0f4b25a; let me know if I broke anything and I will fix it. :)
