Add random splitting for numeric features for decision trees. #2883
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
Hey @RishabhGarg108, thanks for implementing this! Sorry that it took so long to get to the review. I think that the implementation is basically sound; I left a few notes throughout.
P.S. I also figured out that if we can make bootstrap a template parameter for the RandomForest class which defaults to true, then creating an Extra-Tree will be as trivial as defining a typedef.
Ah, nice! Yeah, if we can do that, then we can provide an option with the random forest binding to train an Extra-Trees random forest instead.
Yes, we can do that also. 😃 Can you also suggest how I should write the tests? I am having a hard time coming up with deterministic tests in this random setting. I have mentioned a couple of ideas I have in mind in the description of this PR. Would love to hear your suggestions :-)
Well, I don't think we can really make our tests deterministic here, but we can at least make the failure probability very low. Personally, I think a decent test might be to build a random forest on numeric data using this split, and then ensure that the accuracy of that random forest isn't awful. If you want to test that splits are different between It's probably worth checking the stability of the tests by running them, say, 1000 times with different random seeds and making sure that they don't fail. You can do this by modifying
and if you like you can filter all extraneous output too with
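The idea of rerunning a randomized test under many seeds can be sketched in plain standard C++. This is only an illustration: `CountFailures` and `ToyRandomizedCheck` are hypothetical names, not part of mlpack or its test suite, and the sketch does not reproduce the command referred to above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>

// Hypothetical helper, not part of mlpack: run a randomized check once per
// seed (0, 1, ..., numSeeds - 1) and count how many runs fail.
inline std::size_t CountFailures(
    const std::size_t numSeeds,
    const std::function<bool(std::mt19937&)>& check)
{
  assert(check != nullptr);
  std::size_t failures = 0;
  for (std::size_t seed = 0; seed < numSeeds; ++seed)
  {
    // A fresh generator per seed keeps every run reproducible on its own.
    std::mt19937 rng(static_cast<std::mt19937::result_type>(seed));
    if (!check(rng))
      ++failures;
  }
  return failures;
}

// Toy stand-in for a real test body: it "fails" when a uniform draw from
// [0, 1) lands in the top 0.1%, i.e. with probability of roughly 0.001.
inline bool ToyRandomizedCheck(std::mt19937& rng)
{
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  return unit(rng) < 0.999;
}
```

Calling `CountFailures(1000, ToyRandomizedCheck)` gives an empirical failure count over 1000 seeds; a real test body (for example, an accuracy threshold on a trained forest) would take the place of the toy check.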
@@ -409,6 +410,38 @@ class RandomForest
  double avgGain;
};

/**
 * Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)
- * Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)
+ * Convenience typedef for Extra Trees. (Extremely Randomized Trees Forest).
Missing typedef and use commonly used name.
Sorry, I didn't exactly follow this comment. There is actually a using directive below this comment section where I am aliasing ExtraTrees. Also, can you please elaborate on "commonly used name"? Which name are you talking about here?
I think @zoq meant that we should use the more commonly used name ("Randomized" vs. "Randomised").
Thanks for the clarification! I have addressed this now.
Yes, that's what I meant.
@@ -0,0 +1,124 @@
/**
I forgot this until now, but I think we need to add this (and the implementation file) to the list of files in CMakeLists.txt. 👍
Thanks. I have added it 👍
Actually, I also reverted the parentheses back to the square brackets as they originally were. I felt they were totally unnecessary :)
Thanks @RishabhGarg108! Sorry it took a few days to get back to this. This is nice support to add. :)
Second approval provided automatically after 24 hours. 👍
Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>
Nice, no more comments from my side.
Thanks @RishabhGarg108!
An attempt to solve #884.
The changes made in this PR are:
- RandomBinaryNumericSplit
- [] to () for accessing elements of Armadillo matrices and vectors. According to the Armadillo documentation, square brackets do not work correctly for the case [i,j], while (i,j) works fine. So, maybe we can try to be consistent about indexing. (Open to suggestions about this.)
- RandomBinaryNumericSplit
Reference Paper - http://orbi.ulg.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf
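As a rough illustration of the splitting rule in the Extra-Trees paper referenced above, a cut-point for a numeric feature is drawn uniformly between the feature's minimum and maximum in the current node, rather than searched for. The sketch below uses only the standard library; `RandomSplitPoint` and `CountLeft` are hypothetical helpers and are not the code in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper, not mlpack's RandomBinaryNumericSplit: draw a
// cut-point uniformly between the smallest and largest value of one numeric
// feature. Returns false when no split is possible (empty or constant data).
inline bool RandomSplitPoint(const std::vector<double>& values,
                             std::mt19937& rng,
                             double& splitPoint)
{
  if (values.empty())
    return false;

  const auto range = std::minmax_element(values.begin(), values.end());
  const double lo = *range.first;
  const double hi = *range.second;
  if (lo == hi)
    return false;

  std::uniform_real_distribution<double> cut(lo, hi);
  splitPoint = cut(rng);
  assert(splitPoint >= lo && splitPoint <= hi);
  return true;
}

// Number of points that fall into the left child (value <= splitPoint).
inline std::size_t CountLeft(const std::vector<double>& values,
                             const double splitPoint)
{
  return static_cast<std::size_t>(
      std::count_if(values.begin(), values.end(),
          [splitPoint](const double v) { return v <= splitPoint; }));
}
```

Because the cut-point is never below the minimum, the left child always receives at least one point. A real splitter would additionally compute the gain of the resulting partition and respect a minimum leaf size.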
Here, I want to discuss what tests should be written for this.
- Since this method of splitting is randomised, we can't ensure that we will get a better split even when one exists. So, we can't test this.
- BestBinaryNumericSplit yields a split different from RandomBinaryNumericSplit. This will be a randomised test and might fail occasionally, but we can reduce how often that happens by using a dataset of size around 1000. This should reduce the probability of the test failing to about 0.1%.

@rcurtin, can you please take a look at the implementation and provide some feedback on what tests are necessary to cover all possibilities? Also, if I missed something, please add to it. Thanks :)
P.S. I also figured out that if we can make bootstrap a template parameter for the RandomForest class which defaults to true, then creating an Extra-Tree will be as trivial as defining a typedef.
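The template-parameter idea might look roughly like the following. This is a minimal sketch under stated assumptions: `ForestSketch`, `BestSplitSketch`, `RandomSplitSketch`, and `ExtraTreesSketch` are invented names for illustration and do not reflect mlpack's actual RandomForest signature.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for split policies; not mlpack's classes.
struct BestSplitSketch { static const char* Name() { return "best"; } };
struct RandomSplitSketch { static const char* Name() { return "random"; } };

// Toy forest whose sampling behaviour is fixed at compile time. The boolean
// defaults to true, so existing users of the class keep bootstrap sampling.
template<typename SplitType = BestSplitSketch, bool UseBootstrap = true>
class ForestSketch
{
 public:
  static constexpr bool UsesBootstrap() { return UseBootstrap; }

  static std::string Describe()
  {
    return std::string(SplitType::Name()) + " split, " +
        (UseBootstrap ? "bootstrap sample" : "full training set");
  }
};

// With the boolean exposed as a template parameter, the Extra-Trees variant
// reduces to an alias: random splits and no bootstrap sampling.
using ExtraTreesSketch = ForestSketch<RandomSplitSketch, false>;

static_assert(std::is_same<ExtraTreesSketch,
    ForestSketch<RandomSplitSketch, false>>::value,
    "the alias names exactly this instantiation");
static_assert(ForestSketch<>::UsesBootstrap(),
    "the default instantiation keeps bootstrap sampling");
```

Keeping the choice in a template parameter means the non-bootstrap path costs nothing at runtime, at the price of a separate instantiation for each combination.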