ENH: port honest forests from neurodata/honest-forests #57
Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>
@sampan501 preferably we don't want to have a replica of the "honest tree" for every single type of tree, and instead just implement it in one place. In my mind, there are two options. Just wondering the pros and cons of designing this as a "meta-estimator" instead, which takes any instantiation of an sklearn supervised decision tree (e.g. DecisionTreeClassifier, ObliqueDecisionTreeClassifier, etc.) rather than being a subclass of DecisionTreeClassifier? This can then sit in scikit-tree. Alternatively, we add the honest capabilities to the fork of sklearn.
Ah, I see what you mean. I'm open to refactoring like this; I just added the initial code that @rflperry wrote as a first step. Here are the two options as I understand them:
Just some pros and cons that I thought of. Based on this, it seems like option 2 is better? Should I create a new module, or store everything in "ensemble"?
Codecov Report
Patch coverage:
Additional details and impacted files:
@@ Coverage Diff @@
## main #57 +/- ##
==========================================
+ Coverage 92.23% 93.39% +1.15%
==========================================
Files 12 16 +4
Lines 1146 1468 +322
==========================================
+ Hits 1057 1371 +314
- Misses 89 97 +8
This is what you currently have in the PR. The second option I was floating is actually to refactor this PR and add the code to both
If we do option 2, then I would create a new class in
where the estimator is pre-defined. When we do fit, we probably just check that …

Alternatively, we would use the design:
where we pass in kwargs. The con with this approach is that the user has to make sure they pass in all the parameters they want. E.g., an ObliqueDecisionTreeClassifier has more parameters than a normal DecisionTreeClassifier. To ensure stability w.r.t. the underlying class, there would need to be some error-checking of the tree instance. These aren't necessarily fatal issues, but they do make life a little bit less ideal, so I'm slightly in favor of the option to add the honesty functionality to the fork of sklearn. However, I can defer to whichever one you think is easiest to implement and maintain. Lmk if this doesn't make sense and we can have a quick chat.
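For concreteness, here is a rough sketch of the two constructor designs being weighed. The class names, parameter names, and attributes are illustrative assumptions, not code from this PR:

```python
class HonestTreeA:
    """Option A: wrap a pre-defined tree instance (works for any tree type)."""

    def __init__(self, estimator=None, honest_fraction=0.5):
        # any sklearn-style decision tree instance, e.g.
        # DecisionTreeClassifier(max_depth=3) or an oblique variant
        self.estimator = estimator
        self.honest_fraction = honest_fraction


class HonestTreeB:
    """Option B: forward kwargs to one fixed underlying tree class."""

    def __init__(self, honest_fraction=0.5, **tree_kwargs):
        # the user must remember every underlying-tree parameter they want;
        # e.g. ObliqueDecisionTreeClassifier has parameters that a plain
        # DecisionTreeClassifier does not
        self.honest_fraction = honest_fraction
        self.tree_kwargs = tree_kwargs  # turned into a tree inside fit()
```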
We can just ignore the Cython part. Just do everything in Python exactly like you're doing now, where we just modify the fit function in
Based on our talk, we decided on option 2.
And also on the following design option, with some checks on the type of tree (must be a classifier or regressor) and whether it's already fitted or not:
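A minimal sketch of what those fit-time checks could look like, assuming the meta-estimator is named HonestTreeClassifier; apart from sklearn's real helpers (is_classifier, is_regressor, check_is_fitted, clone), everything here is illustrative rather than the exact code in this PR:

```python
from sklearn.base import (
    BaseEstimator,
    ClassifierMixin,
    clone,
    is_classifier,
    is_regressor,
)
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class HonestTreeClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator=None, honest_fraction=0.5):
        self.estimator = estimator
        self.honest_fraction = honest_fraction

    def fit(self, X, y):
        # the wrapped tree must be a supervised classifier (or regressor,
        # for an honest regression variant)
        if not (is_classifier(self.estimator) or is_regressor(self.estimator)):
            raise ValueError(
                "estimator must be an sklearn classifier or regressor"
            )

        # refuse an estimator that was already fitted elsewhere
        try:
            check_is_fitted(self.estimator)
        except NotFittedError:
            pass  # unfitted is the expected state
        else:
            raise ValueError("estimator must not be fitted before wrapping")

        self.estimator_ = clone(self.estimator)
        # ... split samples, fit on the structure half, then overwrite the
        # leaf values using the honest half ...
        return self
```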
The honest tree meta-estimator LGTM. Wanna add a unit test, or port the unit test if it already exists?
Also, it'll be good to add an sklearn check_estimator test to the classes to check whether we missed any edge-case behavior.
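For example, something along these lines; the HonestTreeClassifier name and import path are assumptions about where the class will end up:

```python
# Runs sklearn's full battery of estimator-API checks via pytest.
from sklearn.utils.estimator_checks import parametrize_with_checks

from sktree.tree import HonestTreeClassifier  # assumed import path


@parametrize_with_checks([HonestTreeClassifier()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
```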
Left a few comments. Lmk what you think.
sktree/tree/_honest_tree.py
honest_leaves = self.tree_.apply(X[self.honest_indices_])

# zero out the counts learned from the structure samples, then refill the
# leaf values using only the held-out honest samples
self.tree_.value[:, :, :] = 0
for leaf_id, yval in zip(honest_leaves, y[self.honest_indices_, 0]):
    self.tree_.value[leaf_id][0, yval] += 1
This can be migrated to a private function, perhaps called _set_leaf_nodes(self, X, y), where (X, y) is a data pair in which X is used to traverse the already-built tree and y is used to set the leaf-node values.
This can then lead us to easily supporting PropensityTree.
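A sketch of that refactor, assuming the snippet above lives on a class with self.tree_ and self.honest_indices_, and that y is an integer-encoded 2D label array as in the reviewed code; the helper name is just the one proposed above:

```python
def _set_leaf_nodes(self, X, y):
    """Traverse the fitted tree with X and re-set its leaf values from y."""
    honest_leaves = self.tree_.apply(X)

    # wipe the counts learned from the structure samples, then refill the
    # leaves using only the samples passed in here
    self.tree_.value[:, :, :] = 0
    for leaf_id, yval in zip(honest_leaves, y[:, 0]):
        self.tree_.value[leaf_id][0, yval] += 1


# called from fit() as, e.g.:
#   self._set_leaf_nodes(X[self.honest_indices_], y[self.honest_indices_])
```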
Thinking of making some large-scale changes sometime in the next few weekends: scikit-learn-contrib/scikit-learn-contrib#61 (comment). Just checking in to see if you're able to get this in to consolidate the repo before then. Ideally aiming to merge this, #64, and #65 before then to make life easier.
@sampan501 just checking in. What else is there to do on this PR?
I have to fix the CI errors. I plan on finishing that after submitting the supplemental to NeurIPS.
Quick question: you're not planning on using hyppo as a dependency of scikit-tree, right?
No, I would say the other way around makes more sense. If we want to incorporate and move
Ok good. I was planning on adding it as an optional dependency to hyppo.
@sampan501 I'm guessing you might be busy at your internship, so I hope it's going well! If so, @YuxinB, would you be interested in finishing this PR that Sampan started? The ability to convert an existing tree class that we have into an honest estimator will be nice for testing the different trees for MI/CMI.
Sounds good!
@adam2392 your thoughts?
It mostly looks good. It would be good to tighten up the testing here while the PR is open.
Also, would it be easy/fast to include the figure that you regenerated from the paper as an example in examples/? Or is it a slow-running simulation.
Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
This reverts commit 88b4c68.
It seems that I was wrong about the impute function. It does seem to affect performance under a very low honest fraction, but I don't know why the coverage report shows it as not covered.
Signed-off-by: Adam Li <adam2392@gmail.com>
LGTM once CIs pass.
Thanks @sampan501 and @PSSF23!
Fixes #56
Changes proposed in this pull request:
Before submitting
- … section of the CONTRIBUTING docs.
- Writing docstrings section of the CONTRIBUTING docs.

After submitting