Random projection trees #726
Conversation
for (size_t k = 0; k < direction.n_rows; k++)
  direction[k] = math::Random(-1.0, 1.0);

ElemType length = metric::EuclideanDistance::Evaluate(origin, direction);
I think you could simplify this:
direction.randu(); // Fill with [0, 1].
direction -= 0.5; // Shift to [-0.5, 0.5].
double norm = arma::norm(direction); // Get length for normalization.
if (norm == 0)
{
  // Degenerate case: replace the zero vector with an axis-aligned unit vector.
  const size_t k = math::RandInt(direction.n_rows);
  direction[k] = 1.0;
  norm = 1.0;
}
// Normalize.
direction /= norm;
@lozhnikov @rcurtin Wow, wait a minute! This pull request seems to implement something very similar to what I have implemented for spill trees: 9588a6d
    direction[k] /= length;
  }
}
I think this can also be replaced with the shorter function idea I sketched in another comment. Another thought: since you are using it twice, maybe mark the function static in RPTreeMaxSplit and then call it from RPTreeMeanSplit instead of implementing it again here. I am not sure this is a great function to add to the core mlpack code, because of its odd behavior when a vector of length 0 is selected. Is that behavior the best thing to do here, or would it be better to simply draw a random vector again?
Yes, I agree, we can move that function to the core, but in that case we cannot use arma::norm.

@MarcosPividori I looked through your code; indeed, it is better to rewrite the RP tree split algorithm in terms of
/**
 * Each binary space tree node has two children which represent
 * non-overlapping subsets of the space which the node represents. Therefore,
 * children are not overlapping.
Same issue with the comment here.
Ok, I think that this is a good start. I like that you were able to reuse
@lozhnikov @rcurtin I think we have 3 independent decisions to make when building a binary space tree:
Every combination is valid; for example:
Now, for spill trees, I consider all the combinations of hyperplane type and splitting value as 2 template parameters. We could add a third template parameter that represents the criterion used to choose the hyperplane, between random and widest range. What do you think? :)
I looked through a series of papers. The polytope distance problem requires a lot of arithmetic operations. All the methods that I came across are based on Gilbert's algorithm (i.e. they are iterative). I thought that in order to test the split algorithm I have to remember the normal to the hyperplane. Maybe I am mistaken.
@MarcosPividori The RP tree mean split algorithm sometimes splits the space according to the distance to a particular point. So, I am not sure that it is a good idea to add this functionality to the
@lozhnikov you are right. I supposed that binary space trees always consider a splitting hyperplane, but it seems that is not true. "On the other hand, an RPTree chooses a direction uniformly at random from the unit sphere ... and splits the data into two roughly equal-sized sets using a hyperplane orthogonal to this direction. We describe two variants, which we call RPTree-Max and RPTree-Mean. Both are adaptive to intrinsic dimension, although the proofs are in different models and use different techniques." So, one would expect both variants to consider a hyperplane orthogonal to this direction... When splitting based on the distance to a mean point, one child's sphere will contain the sphere of the other child, won't it?
Various fixes of random projection trees.
@MarcosPividori I reread the paper. I think there are no errors. I guess the bound for the mean-split random projection tree should contain a hollow like

Moreover, I found another paper, "Random projection trees for vector quantization". The paper describes the same mean split algorithm. Why do you think that 'RPTreeMean' differs from
I've fixed the issues, and I've thought out tests in order to ensure that a splitting hyperplane exists. Two points are situated on opposite sides of a hyperplane if the hyperplane expression takes different signs on these points. Thus I can find the hyperplane as the solution of a system of inequalities. I use a simple iterative method in order to find the solution of the system. I thought it shouldn't work, but I tried a number of different seeds and it seems it works.
@lozhnikov yes! I agree that there is no error. I meant to say that it was a bit confusing not to use a hyperplane there, because random projection trees seem to be based on splitting hyperplanes. But I am sure they have their reasons!
@MarcosPividori yeah, sure. I think
    const VecType& point,
    const SplitInfo& splitInfo)
{
  return (arma::dot(point, splitInfo.direction) <= splitInfo.splitVal);
I spent a while thinking about this. The inner product is defined for L2 space, but only for L2 space; in other spaces like L1, L3, and L_inf it is not defined. I think this is why the paper restricts itself to considering L2. But I don't think that using the dot product like this in other spaces is a problem. Let me know if you think otherwise.
I should reread the paper. I am not sure that other spaces preserve the theoretical guarantees.
It's true, I think the theory is only valid for L2. I just wanted to point out the issue, that's all. I don't think there is anything we can easily do about it.
template<typename BoundType, typename MatType>
void RPTreeMeanSplit<BoundType, MatType>::GetRandomDirection(
    arma::Col<ElemType>& direction)
I don't know if I have already written this somewhere else, but would it be better to move this to core/math/ as a standalone function?
Ah actually, the function is already there: math::RandVector(). Maybe it would be useful to add an overload with a parameter that allows you to select the number of dimensions. Note that selecting a random vector by uniformly sampling and then normalizing, as you've done here, actually causes bias, because you want to sample uniformly from the surface of a sphere. I believe the implementation Nishant wrote many years ago uses the Box-Muller transform:
Ok, things look pretty good to me. As for the refactoring, we can do that after spill trees are merged. I ran timing simulations, and as expected, the RP tree is slower than the kd-tree (after all it has looser bounds in all cases). I only have one remaining question and then a few observations.
I am not sure, in the end, how effective the random projection tree will be, but I think that it is good to have it implemented. It seems there are not many studies of its empirical effectiveness (if you know of any, send some links! I would be interested to read).
I fixed the issues. Indeed, the number of base cases with the mean-split RP tree is greater than the number of base cases with the max-split RP tree. On the other hand, the total number of scores is less. I tried a number of datasets and noticed that the total time with the mean-split RP tree is less. Indeed, the mean-split RP tree is more balanced, since there is no shift from the median split value.
The AppVeyor build failure seems to be because there is no
I made a few simplifications and style fixes in 0f4b25a; let me know if I broke anything and I will fix it. :)
I added an implementation of random projection trees. The RPTreeMax variant is implemented slightly differently, since the technique for the calculation of a random deviation suggested in the paper is weird. And I use a number of random dataset points, instead of all points, in order to calculate the median value and the average distance between points.
Since the norm in the paper always denotes the Euclidean distance, I wrote arma::dot in order to calculate the scalar product of two vectors. Right now, the tree uses HRectBound.