
ENH: port honest forests from neurodata/honest-forests #57

Merged: 44 commits into main, Jun 10, 2023

Conversation

sampan501
Member

@sampan501 sampan501 commented Mar 22, 2023

Fixes #56

Changes proposed in this pull request:

  • Add honest trees to scikit-tree

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>
@adam2392
Collaborator

adam2392 commented Mar 22, 2023

@sampan501 preferably we don't want a replica of the "honest tree" for every single type of tree; we should instead implement it in one place. In my mind, there are two options:

Just wondering about the pros and cons of designing this as a "meta-estimator" instead, which takes any instantiation of a sklearn supervised decision tree (e.g. DecisionTreeClassifier, ObliqueDecisionTreeClassifier, etc.), rather than as a subclass of DecisionTreeClassifier. This could then sit in scikit-tree.

Alternatively, we add the honest capabilities to the DecisionTreeClassifier and DecisionTreeRegressor classes so any supervised tree subclass can enable honesty. This would be a PR to the fork at scikit-learn:tree-featuresv2. Then add honest=default_value to every tree that inherits from the supervised tree. This is different from what you currently have here: the "honest" feature would be enabled as just a kwarg, rather than a separate class.

@sampan501
Member Author

sampan501 commented Mar 22, 2023

instantiation of a sklearn supervised decision tree

Ah, I see what you mean. I'm open to refactoring like this; I just added the initial code that @rflperry wrote as a first step. Here are the options as I understand them:

  1. Have HonestTreeClassifier subclass the respective decision tree type

    • Pros: Just one line of code for the user vs. two in the next case. Would have access to all parameters in the superclass.
    • Cons: Much more complex implementation. Would potentially need a different honest forest for each classifier. Also (potentially) difficult for the user, since they would have to parse the documentation to find which one they need.
  2. Have HonestTreeClassifier as a "meta-class" that takes in the respective decision tree as a parameter:

    • Pros: Much simpler implementation and much easier for the user. Only one additional line of code for the user to write (instantiating the object), and the user just has to pick the tree type.
    • Cons: What base class do we inherit from? We lose access to parameters the user passes when instantiating the object that are not stored as class attributes.
  3. Add an honesty flag to scikit-learn:tree-featuresv2

    • Pros: Easiest for the user (just an additional flag). Possibly the fastest method (whether it is easiest to maintain is a different matter).
    • Cons: Most difficult implementation (would have to parse the Cython and translate). Each implementation would depend on scikit-learn:tree-featuresv2, so if the fork is not updated regularly, new trees from scikit-learn main cannot be implemented. Also, if scikit-learn:tree-featuresv2 ever gets merged into scikit-learn main, we depend on them accepting our PRs for new tree types.

Just some pros and cons that I thought of. Based on these, it seems like option 2 is better? Should I create a new module or store everything in "ensemble"?

@codecov

codecov bot commented Mar 22, 2023

Codecov Report

Patch coverage: 96.31% and project coverage change: +1.15% 🎉

Comparison is base (9fa9129) 92.23% compared to head (efd295b) 93.39%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
+ Coverage   92.23%   93.39%   +1.15%     
==========================================
  Files          12       16       +4     
  Lines        1146     1468     +322     
==========================================
+ Hits         1057     1371     +314     
- Misses         89       97       +8     
Impacted Files Coverage Δ
sktree/__init__.py 77.27% <ø> (ø)
sktree/tree/_honest_tree.py 91.96% <91.96%> (ø)
sktree/ensemble/_honest_forest.py 95.77% <95.77%> (ø)
sktree/ensemble/__init__.py 100.00% <100.00%> (ø)
sktree/tests/test_honest_forest.py 100.00% <100.00%> (ø)
sktree/tests/test_neighbors.py 100.00% <100.00%> (ø)
sktree/tests/test_supervised_forest.py 99.40% <100.00%> (ø)
sktree/tests/test_unsupervised_forest.py 100.00% <100.00%> (ø)
sktree/tree/__init__.py 100.00% <100.00%> (ø)
sktree/tree/tests/test_honest_tree.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


@sampan501 sampan501 requested a review from adam2392 March 22, 2023 19:18
@sampan501 sampan501 self-assigned this Mar 22, 2023
@adam2392
Collaborator

adam2392 commented Mar 22, 2023

instantiation of a sklearn supervised decision tree

Ah, I see what you mean. I'm open to refactoring like this, I just added the initial code that @rflperry wrote as a first step. Here are the two options as I understand them:

  1. Have HonestTreeClassifier subclass the respective decision tree type

This is what you currently have in the PR. The second option I was floating is actually to refactor this PR and add the code to both the DecisionTreeClassifier and DecisionTreeRegressor classes in the fork of sklearn. I edited my comment in #57 (comment). This is the most general approach and imo is "cleaner", but might be a bit more work since we want to keep the fork well maintained wrt the sklearn internal design.

  1. Have HonestTreeClassifier as a "meta-class" that takes in the respective decision tree as a parameter:
    Just some that I thought of. Based on this, seems like option 2 is better? Should I create a new module or store everything in "ensemble"

If we do option 2, then I would create a new class in tree/_classes.py and also a new class in ensemble/_supervised_forest.py. There are two approaches to implementing this. One is to use the MetaEstimatorMixin from sklearn, which means we follow the design:

class HonestTree(MetaEstimatorMixin):
    def __init__(self, estimator, honest_ratio=0.5):
        ...

# example
clf = HonestTree(DecisionTreeClassifier())

where the estimator is pre-defined. When we do fit, we probably just check that the estimator is of type BaseDecisionTree. The con with this approach is that it is a completely different API than before.

Alternatively, we would use the design:

class HonestTree(BaseEstimator):
    def __init__(self, tree=DecisionTreeClassifier, honest_ratio=0.5, **tree_kwargs):
        ...

# example
clf = HonestTree(DecisionTreeClassifier, max_features=...)

where we pass in kwargs. The con with this approach is that the user has to make sure they pass in all the parameters they want; e.g. an ObliqueDecisionTreeClassifier has more parameters than a normal DecisionTreeClassifier.

To ensure stability wrt the underlying class, there would need to be some error-checking of the tree instance.

These aren't necessarily fatal issues, but they do make life a little less ideal, so I'm slightly in favor of the option to add the honesty functionality to the fork of sklearn. However, I defer to whichever one you think is easiest to implement and maintain.

Lmk if this doesn't make sense and we can have a quick chat.
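The meta-estimator option discussed above could be sketched end to end roughly as follows. This is a hedged, minimal illustration, not the final API: the names `HonestTree`, `honest_ratio`, and `random_state` are placeholders from this thread, and a real implementation would validate the wrapped estimator (e.g. that it is an unfitted `BaseDecisionTree`) and pass `check_estimator`.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, MetaEstimatorMixin, clone
from sklearn.tree import DecisionTreeClassifier


class HonestTree(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
    """Sketch of an honest tree: learn the tree structure on one half of the
    data, then repopulate the leaf values using only the held-out "honest" half."""

    def __init__(self, estimator=None, honest_ratio=0.5, random_state=None):
        self.estimator = estimator
        self.honest_ratio = honest_ratio
        self.random_state = random_state

    def fit(self, X, y):
        # A real implementation would check the estimator type here
        # (classifier vs. regressor, fitted vs. unfitted), as discussed above.
        base = self.estimator if self.estimator is not None else DecisionTreeClassifier()
        self.estimator_ = clone(base)

        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        idx = rng.permutation(len(X))
        n_honest = int(len(X) * self.honest_ratio)
        honest, structure = idx[:n_honest], idx[n_honest:]

        # 1) Build the tree structure on the "structure" half only.
        self.estimator_.fit(X[structure], y[structure])
        self.classes_ = self.estimator_.classes_

        # 2) Zero out leaf values and refill them with honest-half class counts.
        tree = self.estimator_.tree_
        tree.value[:, :, :] = 0
        for leaf_id, label in zip(self.estimator_.apply(X[honest]), y[honest]):
            tree.value[leaf_id, 0, np.searchsorted(self.classes_, label)] += 1
        return self

    def predict(self, X):
        leaves = self.estimator_.apply(np.asarray(X, dtype=np.float64))
        return self.classes_[np.argmax(self.estimator_.tree_.value[leaves, 0, :], axis=1)]
```

Usage would then be a single wrapping line, e.g. `HonestTree(DecisionTreeClassifier(max_depth=3), honest_ratio=0.5).fit(X, y)`, matching the "one additional line of code" trade-off described earlier in the thread.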

@adam2392
Collaborator

Add an honesty flag to scikit-learn:tree-featuresv2
Pros: Easiest for the user (just an additional flag). Possibly the fastest method (whether it is easiest to maintain is a different matter).
Cons: Most difficult implementation (would have to parse the Cython and translate). Each implementation would depend on scikit-learn:tree-featuresv2, so if the fork is not updated regularly, new trees from scikit-learn main cannot be implemented. Also, if scikit-learn:tree-featuresv2 ever gets merged into scikit-learn main, we depend on them accepting our PRs for new tree types.

We can just ignore the Cython part. Just do everything in Python exactly like you're doing now, where we just modify the fit function in DecisionTreeClassifier and DecisionTreeRegressor.

@sampan501
Member Author

Based on our talk, we decided on option 2.

@adam2392
Collaborator

Based on talk, we decided on option 2

And also on the following design option, with some checks on the type of tree (it must be a classifier or regressor) and on whether it is already fitted:

class HonestTree(MetaEstimatorMixin):
    def __init__(self, estimator, honest_ratio=0.5):
        ...

# example
clf = HonestTree(DecisionTreeClassifier())

Collaborator

@adam2392 adam2392 left a comment


The honest tree meta estimator LGTM. Wanna add a unit test, or port the unit test if it already exists?

Also it'll be good to add a sklearn check_estimator test to the classes to check if we missed any edge case behavior.

sktree/ensemble/_honest_forest.py (review thread resolved)
Collaborator

@adam2392 adam2392 left a comment


Left a few comments. Lmk what you think.

sktree/ensemble/_honest_forest.py (outdated, resolved)
sktree/ensemble/_honest_forest.py (outdated, resolved)
sktree/ensemble/_honest_forest.py (outdated, resolved)
sktree/ensemble/_honest_forest.py (resolved)
sktree/tree/_honest_tree.py (outdated, resolved)
sktree/tree/_honest_tree.py (resolved)
sktree/tree/_honest_tree.py (resolved)
Comment on lines 227 to 231
honest_leaves = self.tree_.apply(X[self.honest_indices_])

self.tree_.value[:, :, :] = 0
for leaf_id, yval in zip(honest_leaves, y[self.honest_indices_, 0]):
    self.tree_.value[leaf_id][0, yval] += 1
Collaborator


This can be migrated to a private function, perhaps called _set_leaf_nodes(self, X, y), where (X, y) are a data pair: X is used to traverse the already-built tree and y is used to set the leaf nodes.

This would then let us easily support a PropensityTree.

sktree/tree/_honest_tree.py (outdated, resolved)
sktree/tree/_honest_tree.py (outdated, resolved)
@adam2392
Collaborator

Thinking of making some large scale changes sometime in the next few weekends: scikit-learn-contrib/scikit-learn-contrib#61 (comment)

Just checking in on this to see if you're able to get this in to consolidate the repo before then. Ideally aiming to merge this and #64 and #65 before then to make life easier

@adam2392
Collaborator

@sampan501 just checking in. what else is there to do on this PR?

@sampan501
Member Author

I have to fix the CI errors. I plan on finishing that after submitting the supplemental to NeurIPS.

@sampan501
Member Author

Quick question: you're not planning on using hyppo as a dependency of scikit-tree right?

@adam2392
Collaborator

Quick question: you're not planning on using hyppo as a dependency of scikit-tree right?

No, I would say the other way around makes more sense. If we want to incorporate and move hyppo functionality into pywhy-stats, then scikit-tree would be an optional dependency when one uses trees to do statistical testing, otherwise not required.

@sampan501
Member Author

Ok good. I was planning on adding it as an optional dependency to hyppo.

@adam2392
Collaborator

adam2392 commented Jun 1, 2023

@sampan501 I'm guessing you might be busy at your internship, so hope it's going well!

If so, @YuxinB would you be interested in finishing this PR that Sampan started? The ability to convert an existing tree class that we have to an honest estimator will be nice to test the different trees for MI/CMI.

@sampan501
Member Author

Sounds good!


Member

@PSSF23 PSSF23 left a comment


As all tests in the old honest tree repo passed and the simulation figure is replicated:
[figure: overlapping_gaussians simulation replication]
I believe the current patch satisfies this PR's requirements. We can merge the honesty part first and handle other utilities, like propensity trees, in the next step.

@PSSF23 PSSF23 marked this pull request as ready for review June 8, 2023 18:00
@PSSF23
Member

PSSF23 commented Jun 8, 2023

@adam2392 your thoughts?

Collaborator

@adam2392 adam2392 left a comment


It mostly looks good. It would be good to tighten up the testing here while the PR is open.

Also, would it be easy/fast to include the figure you regenerated from the paper as an example in examples/? Or is it a slow-running simulation?

sktree/tests/test_honest_tree.py (outdated, resolved)
sktree/tests/test_honest_forest.py (outdated, resolved)
sktree/tree/_honest_tree.py (resolved)
sktree/tests/test_honest_tree.py (outdated, resolved)
Member

@PSSF23 PSSF23 left a comment


It seems I was wrong about the impute function: it does affect performance under a very low honest fraction, but I don't know why the coverage report shows it as not covered.

sktree/ensemble/_honest_forest.py (outdated, resolved)
sktree/ensemble/_honest_forest.py (outdated, resolved)
sktree/tests/test_honest_tree.py (outdated, resolved)
adam2392 and others added 10 commits June 9, 2023 14:44
Signed-off-by: Adam Li <adam2392@gmail.com>
Collaborator

@adam2392 adam2392 left a comment


LGTM once CIs pass.

@adam2392 adam2392 merged commit 41848be into main Jun 10, 2023
@adam2392
Collaborator

Thanks @sampan501 and @PSSF23 !

@adam2392 adam2392 deleted the honesty branch June 10, 2023 14:01

Successfully merging this pull request may close these issues.

[ENH] Add honest trees