Add random splitting for numeric features for decision trees. #2883

RishabhGarg108 · 2021-03-20T13:50:04Z

An attempt to solve #884.

This PR makes the following changes:

- Adds an implementation of RandomBinaryNumericSplit.
- Changes [] to () for accessing elements of Armadillo matrices and vectors. According to the Armadillo documentation, square brackets do not work correctly for two-index access such as [i,j], while (i,j) works fine. So maybe we can try to be consistent about indexing. (Open to suggestions about this.)
- Adds tests for RandomBinaryNumericSplit.
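To illustrate the indexing concern, here is a minimal sketch using a hypothetical `TinyMatrix` class (not Armadillo's `arma::mat`; all names below are made up for illustration). Before C++23, an expression like `m[i, j]` applies the built-in comma operator, so only `j` reaches `operator[]`. The helper spells that comma expression out explicitly so the behaviour does not depend on the language standard in use.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal column-major matrix, for illustration only.
class TinyMatrix
{
 public:
  TinyMatrix(std::size_t nRows, std::size_t nCols) :
      rows(nRows), data(nRows * nCols, 0.0) { }

  // Two-index access: element at row i, column j.
  double& operator()(std::size_t i, std::size_t j) { return data[j * rows + i]; }

  // Single-index (linear) access into the column-major storage.
  double& operator[](std::size_t k) { return data[k]; }

 private:
  std::size_t rows;
  std::vector<double> data;
};

// Mimics what `m[i, j]` does before C++23: the comma operator evaluates and
// discards i, then yields j, so this is really the linear access m[j].
double& CommaIndexed(TinyMatrix& m, std::size_t i, std::size_t j)
{
  return m[(static_cast<void>(i), j)];
}
```

With a 2 x 3 matrix, `CommaIndexed(m, 1, 2)` refers to the same element as `m(0, 1)` rather than `m(1, 2)`.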

Reference Paper - http://orbi.ulg.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf
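For readers unfamiliar with the idea, the following is a rough standalone sketch of a random threshold split in the spirit of the reference paper. It is not the code in this PR: the function name, the `SplitResult` struct, the `minLeafSize` parameter, and the use of Gini impurity are all assumptions made for illustration. The threshold is drawn uniformly between the feature's minimum and maximum, and the resulting gain is simply reported.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Gini impurity computed from per-class counts.
double GiniImpurity(const std::vector<std::size_t>& counts, std::size_t total)
{
  if (total == 0)
    return 0.0;
  double impurity = 1.0;
  for (std::size_t c : counts)
  {
    const double p = static_cast<double>(c) / static_cast<double>(total);
    impurity -= p * p;
  }
  return impurity;
}

struct SplitResult
{
  bool split;       // False if the node is too small or the feature is constant.
  double threshold; // Points with value <= threshold go to the left child.
  double gain;      // Decrease in Gini impurity produced by the split.
};

// Hypothetical helper: draw one threshold uniformly in [min, max) and score it.
SplitResult RandomThresholdSplit(const std::vector<double>& values,
                                 const std::vector<std::size_t>& labels,
                                 std::size_t numClasses,
                                 std::size_t minLeafSize,
                                 std::mt19937& rng)
{
  const SplitResult none{false, 0.0, 0.0};
  const std::size_t n = values.size();
  if (n < 2 || n < 2 * minLeafSize)
    return none;

  const auto range = std::minmax_element(values.begin(), values.end());
  const double lo = *range.first;
  const double hi = *range.second;
  if (lo == hi)
    return none;

  std::uniform_real_distribution<double> dist(lo, hi);
  const double threshold = dist(rng);

  // Count classes in the parent and in each child.
  std::vector<std::size_t> all(numClasses, 0), left(numClasses, 0),
      right(numClasses, 0);
  std::size_t nLeft = 0;
  for (std::size_t i = 0; i < n; ++i)
  {
    ++all[labels[i]];
    if (values[i] <= threshold)
    {
      ++left[labels[i]];
      ++nLeft;
    }
    else
    {
      ++right[labels[i]];
    }
  }
  const std::size_t nRight = n - nLeft;
  if (nLeft < minLeafSize || nRight < minLeafSize)
    return none;

  const double total = static_cast<double>(n);
  const double gain = GiniImpurity(all, n)
      - (static_cast<double>(nLeft) / total) * GiniImpurity(left, nLeft)
      - (static_cast<double>(nRight) / total) * GiniImpurity(right, nRight);
  return SplitResult{true, threshold, gain};
}
```

Unlike an exhaustive search over all candidate thresholds, this evaluates a single threshold per call, which is what makes the resulting trees cheap to build and strongly randomized.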

Here, I want to discuss what tests should be written for this.

- ~~Since this method of splitting is randomised, we can't ensure that we will get a better split even when one exists. So, we can't test this.~~
- We can definitely test that if the number of elements is less than the minimumSplitSize, then it doesn't split.
- Make sure that if the gain decreases on splitting, then it doesn't split.
- Another thing that we can test is that, for a sufficiently large dataset, BestBinaryNumericSplit yields a split different from RandomBinaryNumericSplit. This will be a randomised test and might fail occasionally, but we can limit that by using a dataset of around 1000 points, which reduces the probability of the test failing to about 0.1%.
- Another test could be to split the same dataset multiple times (on the order of thousands). This would be somewhat like a simulation of an ensemble, and the average gain of this ensemble should approach the gain of the best split. (Some values need to be tried here to find the smallest number of repetitions that gives approximately the same gain.)
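One way to arrive at a figure like the 0.1% mentioned above is the following back-of-the-envelope estimate. It assumes (these are assumptions, not statements about the implementation) that the random threshold is drawn uniformly between the feature's minimum and maximum, and that the $n$ distinct feature values are roughly evenly spaced. The random split then reproduces the best split only if the threshold lands in the same gap between two adjacent sorted values $x_{(k)}$ and $x_{(k+1)}$ as the best threshold:

```latex
P(\text{same partition})
  = \frac{x_{(k+1)} - x_{(k)}}{x_{(n)} - x_{(1)}}
  \approx \frac{1}{n - 1},
\qquad
n = 1000 \;\Rightarrow\; P \approx \frac{1}{999} \approx 0.1\%.
```

For unevenly spaced data the probability depends on the width of the gap around the best threshold, so the estimate is only a rough guide.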

@rcurtin, can you please take a look at the implementation and provide some feedback on which tests we should write to cover all the possibilities? Also, if I missed something, please add to it. Thanks :)

P.S. I also figured out that if we make bootstrap a template parameter of the RandomForest class that defaults to true, then creating an Extra-Tree will be as trivial as defining a typedef.
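A simplified sketch of that idea follows. The class below is a hypothetical stand-in, not mlpack's RandomForest (which takes more template parameters); `ForestSketch`, the tag types, and `TrainingIndices` are invented for illustration. Only the `UseBootstrap` parameter name mirrors the commit later added in this PR.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-ins for split policies.
struct BestSplitTag { };
struct RandomSplitTag { };

template<typename SplitType, bool UseBootstrap = true>
class ForestSketch
{
 public:
  // Returns the indices of the points one tree would be trained on.
  std::vector<std::size_t> TrainingIndices(std::size_t n,
                                           std::mt19937& rng) const
  {
    std::vector<std::size_t> indices(n);
    if (n == 0)
      return indices;

    if (UseBootstrap)
    {
      // Bootstrap sample: draw n indices with replacement.
      std::uniform_int_distribution<std::size_t> dist(0, n - 1);
      for (std::size_t& index : indices)
        index = dist(rng);
    }
    else
    {
      // No bootstrap: every tree sees the full dataset in order.
      std::iota(indices.begin(), indices.end(), std::size_t(0));
    }
    return indices;
  }
};

// With the flag as a template parameter, an Extra-Trees-style forest is
// just an alias: random splits, no bootstrap sampling.
using ExtraTreesSketch = ForestSketch<RandomSplitTag, false>;
```

Because the flag is resolved at compile time, the default `ForestSketch<BestSplitTag>` keeps its existing bootstrap behaviour and existing callers would not need to change.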

mlpack-bot · 2021-04-19T17:11:13Z

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

rcurtin

Hey @RishabhGarg108, thanks for implementing this! Sorry that it took so long to get to the review. I think that the implementation is basically sound; I left a few notes throughout.

> P.S. I also figured out that if we can make bootstrap a template parameter for RandomForest class which default to true, then creating an Extra-Tree will be as trivial as defining a typedef.

Ah, nice! Yeah, if we can do that, then we can provide an option with the random forest binding to train an Extra-Trees random forest instead.

RishabhGarg108 · 2021-04-20T02:28:18Z

> Ah, nice! Yeah, if we can do that, then we can provide an option with the random forest binding to train an Extra-Trees random forest instead.

Yes, we can do that also. 😃

Can you also suggest how shall I write the tests? I am having a hard time coming up with deterministic tests in this random setting. I have mentioned a couple of ideas I have in my mind in the description of this PR. Would love to hear your suggestions :-)

rcurtin · 2021-04-21T20:42:13Z

> Can you also suggest how shall I write the tests? I am having a hard time coming up with deterministic tests in this random setting. I have mentioned a couple of ideas I have in my mind in the description of this PR. Would love to hear your suggestions :-)

Well, I don't think we can really make our tests deterministic here, but we can at least make the failure probability very low. Personally, I think a decent test might be to build a random forest on numeric data using this split, and then ensure that the accuracy of that random forest isn't awful.

If you want to test that splits are different between BestBinaryNumericSplit and the random split, I think that's fine too---the probability of failure should be very low anyway. You could reduce it further by allowing it to run multiple times in the case of failure.
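The "run it multiple times in the case of failure" idea can be sketched with a small helper like the one below. `PassesWithinTrials` is a hypothetical name, not an existing mlpack or test-framework utility. If a single attempt fails with probability p and attempts are independent, allowing up to k attempts lowers the failure probability to p^k.

```cpp
#include <functional>

// Runs `trial` up to `maxTrials` times and reports whether any attempt
// succeeded. Stops as soon as one attempt passes.
bool PassesWithinTrials(const std::function<bool()>& trial, int maxTrials)
{
  for (int t = 0; t < maxTrials; ++t)
  {
    if (trial())
      return true;
  }
  return false;
}
```

A test would wrap its randomized check in a lambda and assert on the return value, for example requiring that the random and best splits differ in at least one of three attempts.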

It's probably worth checking the stability of the tests by running them, say, 1000 times with different random seeds and making sure that they don't fail. You can do this by modifying src/mlpack/tests/main.cpp (there should already be commented code in there that uses a random seed), and then use a loop like this:

$ i=0; while(true); do echo $i; i=$(($i + 1)); bin/mlpack_test "TestSuiteName"; sleep 1; done

and if you like you can filter all extraneous output too with grep or similar, and just run it until you run out of patience and make sure there were no failures. :)

zoq · 2021-05-01T00:21:28Z

src/mlpack/methods/random_forest/random_forest.hpp

@@ -409,6 +410,38 @@ class RandomForest
  double avgGain;
 };

+/**
+ * Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)


Suggested change

- * Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)
+ * Convenience typedef for Extra Trees. (Extremely Randomized Trees Forest).

Missing typedef and use commonly used name.

RishabhGarg108 · 2021-05-01

Sorry, I didn't exactly follow this comment. There is actually a using directive below this comment section where I am aliasing ExtraTrees. Also, can you please elaborate on "commonly used name"? Which name are you talking about here?

rcurtin · 2021-05-03

I think @zoq meant that we should use the more commonly used name ("Randomized" vs. "Randomised").

RishabhGarg108 · 2021-05-03

Thanks for the clarification! I have addressed this now.

zoq · 2021-05-03

Yes, that's what I meant.

rcurtin · 2021-05-03T15:06:44Z

src/mlpack/methods/decision_tree/random_binary_numeric_split.hpp

@@ -0,0 +1,124 @@
+/**


I forgot this until now, but I think we need to add this (and the implementation file) to the list of files in CMakeLists.txt. 👍

RishabhGarg108 · 2021-05-03

Thanks. I have added it 👍

RishabhGarg108 · 2021-05-07T04:53:28Z

@zoq @rcurtin I hope this one is now ready?

RishabhGarg108 · 2021-05-07T05:06:35Z

Actually, I also reverted the parentheses back to the square brackets, as they originally were. I felt the change was totally unnecessary :)

rcurtin

Thanks @RishabhGarg108! Sorry it took a few days to get back to this. This is nice support to add. :)

mlpack-bot

Second approval provided automatically after 24 hours. 👍

zoq

Nice, no more comments from my side.

rcurtin · 2021-05-09T02:53:03Z

Thanks @RishabhGarg108!

RishabhGarg108 · 2021-05-09T04:07:30Z

Thanks, @rcurtin @zoq for reviewing and helping through this PR. It's great to have this merged :D

RishabhGarg108 added 2 commits March 19, 2021 23:41

Defined the random split class

6cd97a7

Added implementation of random split

0ad73bc

mlpack-bot bot added s: needs review s: unanswered s: unlabeled labels Mar 20, 2021

Added tests

a140aba

mlpack-bot bot added the s: stale label Apr 19, 2021

rcurtin reviewed Apr 19, 2021

mlpack-bot bot removed the s: stale label Apr 19, 2021

RishabhGarg108 added 5 commits April 20, 2021 07:08

Make new file for random split.

d8ca63f

Changed imports

ccebfcb

Use math::Random which essentially does the same thing

25ce215

Fixed typo

374cb2d

Add citation.

8d63ae3

rcurtin reviewed Apr 21, 2021

rcurtin reviewed Apr 21, 2021

RishabhGarg108 added 8 commits April 22, 2021 06:24

Move citation into class and added newline at EOF

f2bc627

Removed best found gain check, as discussed with @rcurtin

ab6ae18

Removed test where no split was made if there was no gain.

4704a25

Add test for different splits under best and random settings.

bf98ea5

Merge branch 'master' of https://github.com/mlpack/mlpack into random-split

3c3660f

Add UseBootstrap template parameter to random forest

43097c9

Changed train function to use UseBootstrap

1ff5e61

Removed ElemType from RandomBinaryNumericSplit

c049d49

rcurtin reviewed Apr 23, 2021

rcurtin reviewed Apr 23, 2021

zoq added c: methods t: added feature and removed s: unlabeled labels May 1, 2021

zoq reviewed May 1, 2021

RishabhGarg108 added 5 commits May 1, 2021 07:08

Add documentation for parameters of NumChildren

bff45dc

Add documentation for splitIfBetterGain

f4e5bf7

Fixed dataset name in test file

0eefde3

Merge branch 'random-split' of ssh://github.com/RishabhGarg108/mlpack-1 into random-split

347607d

Changed and to && to fix windows build

5aece9b

rcurtin reviewed May 3, 2021

RishabhGarg108 added 2 commits May 3, 2021 21:28

Changed Randomised -> Randomized

9453469

Add random_split to CMakeLists.txt

27227f5

Reverted unnecessary bracket changes

0cb6a13

rcurtin approved these changes May 7, 2021

zoq reviewed May 8, 2021

mlpack-bot bot approved these changes May 9, 2021

mlpack-bot bot removed the s: needs review label May 9, 2021

RishabhGarg108 and others added 3 commits May 9, 2021 06:51

Apply suggestions from code review

0475778

Co-authored-by: Marcus Edel <marcus.edel@fu-berlin.de>

Merge branch 'master' into random-split

b2a5fe6

Update HISTORY.md

2a6152d

zoq approved these changes May 9, 2021

rcurtin merged commit e39065c into mlpack:master May 9, 2021

RishabhGarg108 deleted the random-split branch May 9, 2021 04:07

This was referenced Oct 14, 2022

Release version 4.0.0 #3285

Closed

Release version 4.0.0 #3286

Closed

rcurtin mentioned this pull request Oct 23, 2022

Release version 4.0.0 #3293

Merged
