-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binding test for hoeffding_tree #1211
Conversation
{ | ||
arma::mat inputData; | ||
DatasetInfo info; | ||
if (!data::Load("tae.csv", inputData, info)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could reuse an existing dataset here e.g. thyroid or reconstruct the iris dataset, the repo size is already quite high, so I think whenever we can we should avoid adding datasets. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, a dataset with last column as labels was needed, but modifying iris would do the trick too. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tests! This is a good start but I think there are still some other things we should test. You can add them here in this PR if you like, or we can merge this and then we can note that more testing is needed in #1152---just let me know what you'd like to do.
We should test these options:
- min_samples: maybe we can build two models, one with very small min_samples and one with very large min_samples, and ensure that the model with small min_samples is larger (i.e. has more nodes in it).
- max_samples: same idea as min_samples, I think that will work.
- confidence: a model with lower confidence should split more often, so we can have the same strategy as min_samples and max_samples.
- numeric_split_strategy: when using the 'domingos' strategy, we can use a numeric dataset and check that as we change 'bins', the number of children of the root node change. For the 'binary' strategy, we can change 'bins' and ensure that we always have 2 children.
- passes: we can check that a tree with more passes has more children (since it will split more).
- observations_before_binning: hmm, I think that maybe we can set min_samples to 10, use the 'domingos' split, set observations_before_binning to 100, set the confidence really low, and then feed it only 50 samples and ensure that it didn't split (since it shouldn't split until observations_before_binning samples are seen).
Maybe there are better ways to do those tests... those are just the ideas off the top of my head. Anyway, let me know how you want to proceed. If you want to merge this now and the other tests can be done by someone else (or later by you) that's just fine of course.
@rcurtin Thanks for the review. I can work on adding the tests, no problem. Let this PR remain open, I'll commit them in a few days time since I'm also trying to implement something in |
Sure, no huge hurry. Good luck with your exams! :) |
Just FYI, this probably needs to be updated to handle the changes in #1214. Here is an example of the type of changes that might need to be made: |
@rcurtin Regarding the test for numeric_split_strategy, I don't think there exists any method to access the number of children of any node in the HoeffdingTreeModel class. Shall I write a utility function similar to HoeffdingTreeModel::NumNodes(), which can count NumChildren() or is there some other way around? |
Hmm, I think that |
I think the tests are working fine although I can't understand the travis build problem. Please let me know if any changes are to be made. |
We are working on a fix for the travis build: travis-ci/travis-ci#9129 and #1221. |
|
||
// Check that both models have different number of nodes. | ||
BOOST_CHECK_PREDICATE(std::not_equal_to<int>(), | ||
((CLI::GetParam<HoeffdingTreeModel*>("output_model"))->NumNodes())(nodes)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use BOOST_REQUIRE_NE(...)
here? Seems to me like that would be simpler.
All of the tests look good to me, so if you want to handle the one comment I left I think we can get this merged. Thanks for the nice tests! :) |
I have made the edit (some silly confusion caused it). It is great that the travis issue is being solved, failing travis build was becoming a persistent thing. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for the contribution! I'll go ahead and merge this in 4 days to leave time in case anyone else has any other comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, no comments from my end.
Thanks for the nice contribution! :) |
#1152