Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add random splitting for numeric features for decision trees. #2883

Merged
merged 36 commits into from May 9, 2021

Conversation

RishabhGarg108
Copy link
Member

@RishabhGarg108 RishabhGarg108 commented Mar 20, 2021

An attempt to solve #884.

The changes made in this PR are:-

  • Add implementation of RandomBinaryNumericSplit
  • Change [] to () for accessing elements of armadillo matrices and vectors. According to armadillo documentation, square brackets does not work correctly for the case [i,j] while (i,j) works fine. So, maybe we can try to be consistent about indexing. (Open to suggestions about this.)
  • Tests for the RandomBinaryNumericSplit

Reference Paper - http://orbi.ulg.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

Here, I want to discuss what tests should be write for this.

  • Since, this method of splitting is randomised, so we can't ensure that we will be getting a better split even in the case when it does exist. So, we can't test this.
  • We can definitely test if the number of elements are less than the minimumSplitSize, then it shouldn't split.
  • Make sure, if gain reduces on splitting, then it doesn't split.
  • Another thing that we can test for is that for a sufficiently large dataset, the BestNumericBinarySplit yields a split different from RandomBinaryNumericSplit. This will be a randomised test and might fail occasionally. But we can reduce the extent of it by using a dataset of size maybe around 1000. This will reduce the probability of failing the test to about 0.1%.
  • Another test could be that we can split the same dataset multiple times (in the order of thousands). This would kind of like a simulation of an ensemble and the average gain of this ensemble should approach the gain of best split. (Here need to try some values to find minimum such value that will give approximately same gain)

@rcurtin can you please take a look at the implementation and provide some feedback over what tests are necessary and we should write to cover all possibilities ? Also if I missed something then please add to it. Thanks :)

P.S. I also figured out that if we can make boostrap a template parameter for RandomForest class which default to true, then creating an Extra-Tree will be as trivial as defining a typedef.

@mlpack-bot
Copy link

mlpack-bot bot commented Apr 19, 2021

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Apr 19, 2021
Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @RishabhGarg108, thanks for implementing this! Sorry that it took so long to get to the review. I think that the implementation is basically sound; I left a few notes throughout.

P.S. I also figured out that if we can make boostrap a template parameter for RandomForest class which default to true, then creating an Extra-Tree will be as trivial as defining a typedef.

Ah, nice! Yeah, if we can do that, then we can provide an option with the random forest binding to train an Extra-Trees random forest instead.

@mlpack-bot mlpack-bot bot removed the s: stale label Apr 19, 2021
@RishabhGarg108
Copy link
Member Author

Ah, nice! Yeah, if we can do that, then we can provide an option with the random forest binding to train an Extra-Trees random forest instead.

Yes, we can do that also. 😃

Can you also suggest how shall I write the tests? I am having a hard time coming up with deterministic tests in this random setting. I have mentioned a couple of ideas I have in my mind in the description of this PR. Would love to hear your suggestions :-)

@rcurtin
Copy link
Member

rcurtin commented Apr 21, 2021

Can you also suggest how shall I write the tests? I am having a hard time coming up with deterministic tests in this random setting. I have mentioned a couple of ideas I have in my mind in the description of this PR. Would love to hear your suggestions :-)

Well, I don't think we can really make our tests deterministic here, but we can at least make the failure probability very low. Personally, I think a decent test might be to build a random forest on numeric data using this split, and then ensure that the accuracy of that random forest isn't awful.

If you want to test that splits are different between BestBinaryNumericSplit and the random split, I think that's fine too---the probability of failure should be very low anyway. You could reduce it further by allowing it to run multiple times in the case of failure.

It's probably worth checking the stability of the tests by running them, say, 1000 times with different random seeds and making sure that they don't fail. You can do this by modifying src/mlpack/tests/main.cpp (there should already be commented code in there that uses a random seed), and then use a loop like this:

$ i=0; while(true); do echo $i; i=$(($i + 1)); bin/mlpack_test "TestSuiteName"; sleep 1; done

and if you like you can filter all extraneous output too with grep or similar, and just run it until you run out of patience and make sure there were no failures. :)

@@ -409,6 +410,38 @@ class RandomForest
double avgGain;
};

/**
* Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Convenience typedef for Extra Trees. (Extremely Randomised Trees Forest)
* Convenience typedef for Extra Trees. (Extremely Randomized Trees Forest).

Missing typedef and use commonly used name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't exactly follow this comment. There is actually a using directive below this comment section where I am aliasing ExtraTrees. Also, can you please elaborate "commonly used name". Which name are you talking about here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @zoq meant that we should use the more commonly used name ("Randomized" vs. "Randomised").

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification! I have addressed this now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant.

src/mlpack/tests/random_forest_test.cpp Outdated Show resolved Hide resolved
@@ -0,0 +1,124 @@
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot this until now, but I think we need to add this (and the implementation file) to the list of files in CMakeLists.txt. 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have added it 👍

@RishabhGarg108
Copy link
Member Author

@zoq @rcurtin I hope this one is now ready?

@RishabhGarg108
Copy link
Member Author

Actually, I also reverted the parenthesis back to the square brackets as they originally were. I felt them totally unnecessary :)

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RishabhGarg108! Sorry it took a few days to get back to this. This is nice support to add. :)

Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, no more comments from my side.

@rcurtin
Copy link
Member

rcurtin commented May 9, 2021

Thanks @RishabhGarg108!

@rcurtin rcurtin merged commit e39065c into mlpack:master May 9, 2021
@RishabhGarg108
Copy link
Member Author

Thanks, @rcurtin @zoq for reviewing and helping through this PR. It's great to have this merged :D

@RishabhGarg108 RishabhGarg108 deleted the random-split branch May 9, 2021 04:07
This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants