ENH: port honest forests from neurodata/honest-forests #57

sampan501 · 2023-03-22T18:47:25Z

Fixes #56

Changes proposed in this pull request:

Add honest trees to scikit-tree

Before submitting

I've read and followed all steps in the Making a pull request
section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the
Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.

Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>

adam2392 · 2023-03-22T18:54:13Z

@sampan501 preferably we don't want to have a replica of the "honest tree" for every single type of tree and instead just implement it in one place. In my mind, there are two options:

Just wondering about the pros and cons of designing this as a "meta-estimator" instead, which takes any instantiation of a sklearn supervised decision tree (e.g. DecisionTreeClassifier, ObliqueDecisionTreeClassifier, etc.), rather than as a subclass of DecisionTreeClassifier? This can then sit in scikit-tree.

Alternatively, we add the honest capabilities to the DecisionTreeClassifier and DecisionTreeRegressor classes so any supervised tree subclass can enable honesty. This would be a PR to the fork of scikit-learn:tree-featuresv2. Then add the honest=default_value to every single tree that inherits from the supervised tree. This is different from what you currently have here. The "honest" feature would be enabled as just a kwarg, rather than a separate class.

sampan501 · 2023-03-22T19:10:11Z

instantiation of a sklearn supervised decision tree

Ah, I see what you mean. I'm open to refactoring like this, I just added the initial code that @rflperry wrote as a first step. Here are the two options as I understand them:

1. Have HonestTreeClassifier subclass the respective decision tree type
   - Pros: Just one line of code for the user vs. two in the next case. Would have access to all the parameters in the superclass.
   - Cons: Much more complex to implement. Would potentially need a different honest forest for each classifier. Also (potentially) difficult for the user, since they would have to parse the documentation to find which one they need.
2. Have HonestTreeClassifier as a "meta-class" that takes in the respective decision tree as a parameter
   - Pros: Much simpler to implement and much easier for the user. Only one additional line of code that the user has to write (instantiating the object), and the user just has to pick the tree type.
   - Cons: What is the base class we inherit from? We lose access to parameters the user passes when instantiating the object that are not stored as class attributes.
3. Add an honesty flag to scikit-learn:tree-featuresv2
   - Pros: Easiest for the user (just an additional flag). Possibly the fastest method (whether or not it is easiest to maintain is a different matter).
   - Cons: Most difficult implementation (would have to parse the Cython and translate). Each implementation would be dependent on scikit-learn:tree-featuresv2. So, if this fork is not updated regularly, then potential new trees from scikit-learn main cannot be implemented. Also, if scikit-learn:tree-featuresv2 ever ends up getting merged with scikit-learn main, we are dependent on them accepting our PRs for new tree types.

Just some that I thought of. Based on this, seems like option 2 is better? Should I create a new module or store everything in "ensemble"?

codecov · 2023-03-22T19:14:16Z

Codecov Report

Patch coverage: 96.31% and project coverage change: +1.15 🎉

Comparison is base (9fa9129) 92.23% compared to head (efd295b) 93.39%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #57      +/-   ##
==========================================
+ Coverage   92.23%   93.39%   +1.15%     
==========================================
  Files          12       16       +4     
  Lines        1146     1468     +322     
==========================================
+ Hits         1057     1371     +314     
- Misses         89       97       +8

| Impacted Files | Coverage Δ |
|---|---|
| sktree/__init__.py | `77.27% <ø> (ø)` |
| sktree/tree/_honest_tree.py | `91.96% <91.96%> (ø)` |
| sktree/ensemble/_honest_forest.py | `95.77% <95.77%> (ø)` |
| sktree/ensemble/__init__.py | `100.00% <100.00%> (ø)` |
| sktree/tests/test_honest_forest.py | `100.00% <100.00%> (ø)` |
| sktree/tests/test_neighbors.py | `100.00% <100.00%> (ø)` |
| sktree/tests/test_supervised_forest.py | `99.40% <100.00%> (ø)` |
| sktree/tests/test_unsupervised_forest.py | `100.00% <100.00%> (ø)` |
| sktree/tree/__init__.py | `100.00% <100.00%> (ø)` |
| sktree/tree/tests/test_honest_tree.py | `100.00% <100.00%> (ø)` |

... and 1 file with indirect coverage changes


adam2392 · 2023-03-22T19:20:27Z

instantiation of a sklearn supervised decision tree

Ah, I see what you mean. I'm open to refactoring like this, I just added the initial code that @rflperry wrote as a first step. Here are the two options as I understand them:

Have HonestTreeClassifier subclass the respective decision tree type

This is what you currently have in the PR. The second option I was floating is actually to refactor this PR and add the code to both the DecisionTreeClassifier and DecisionTreeRegressor classes in the fork of sklearn. I edited my comment in #57 (comment). This is the most general approach and imo is "cleaner", but might be a bit more work since we want to keep the fork very well maintained wrt the sklearn internal design.

Have HonestTreeClassifier as a "meta-class" that takes in the respective decision tree as a parameter:
Just some that I thought of. Based on this, seems like option 2 is better? Should I create a new module or store everything in "ensemble"

If we do option 2, then I would create a new class in tree/_classes.py and also a new class in ensemble/_supervised_forest.py. There are two approaches to implementing this. One is to use the MetaEstimatorMixin from sklearn, which means we follow the design:

from sklearn.base import MetaEstimatorMixin
from sklearn.tree import DecisionTreeClassifier

class HonestTree(MetaEstimatorMixin):
    def __init__(self, estimator, honest_ratio=0.5):
        ...

# example
clf = HonestTree(DecisionTreeClassifier())

where the estimator is pre-defined. When we do fit, we probably just check that estimator is of type BaseDecisionTree. The con with this approach is that this is a completely different API than before.

Alternatively, we would use the design:

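from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeClassifier

class HonestTree(BaseEstimator):
    # any extra keyword arguments are meant to be forwarded to the tree class
    def __init__(self, tree=DecisionTreeClassifier, honest_ratio=0.5, **tree_kwargs):
        ...

# example
clf = HonestTree(DecisionTreeClassifier, max_features=...)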

where we pass in kwargs. The con with this approach is that the user has to make sure they pass in all the parameters they want. E.g. an ObliqueDecisionTreeClassifier has more parameters than a normal DecisionTreeClassifier.

To ensure stability wrt the underlying class, there would need to be some error-checking of the tree instance.

These aren't necessarily fatal issues, but they do make life a little bit less ideal, so I'm slightly in favor of the option to add the honesty functionality to the fork of sklearn. However, I can defer to whatever one you think is easiest to implement and maintain.

Lmk if this doesn't make sense and we can have a quick chat.
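For readers following the thread, here is a minimal sketch of how the first design (wrapping an already-instantiated estimator) might behave at fit time. It uses only the standard library. `MajorityLabel`, `HonestWrapperSketch`, the `seed` argument, and the reading of `honest_ratio` as the fraction of samples held out for leaf estimation are all assumptions made for illustration, not code from this PR. A real implementation would likely use sklearn's `clone` rather than `copy.deepcopy`.

```python
import copy
import random


class MajorityLabel:
    """Hypothetical stand-in for a tree: predicts the most common training label."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class HonestWrapperSketch:
    """Toy meta-estimator that hides the honest subset from the wrapped fit."""

    def __init__(self, estimator, honest_ratio=0.5, seed=None):
        self.estimator = estimator
        self.honest_ratio = honest_ratio
        self.seed = seed

    def fit(self, X, y):
        # Shuffle the sample indices, then split them into two disjoint subsets.
        order = list(range(len(X)))
        random.Random(self.seed).shuffle(order)
        n_honest = round(self.honest_ratio * len(X))
        self.honest_indices_ = sorted(order[:n_honest])
        self.structure_indices_ = sorted(order[n_honest:])

        # Fit a copy so the estimator passed in by the user stays unfitted.
        self.estimator_ = copy.deepcopy(self.estimator)
        self.estimator_.fit(
            [X[i] for i in self.structure_indices_],
            [y[i] for i in self.structure_indices_],
        )
        return self


X = [[float(i)] for i in range(10)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
model = HonestWrapperSketch(MajorityLabel(), honest_ratio=0.5, seed=0).fit(X, y)
```

The sketch stops after the structure fit. Re-estimating the leaf values from the honest subset is a separate step and is not shown here.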

adam2392 · 2023-03-22T19:22:24Z

Add an honesty flag to scikit-learn:tree-featuresv2
Pros: Easiest for the user (just an additional flag). Possibly the fastest method (whether or not it is easiest to maintain is a different matter)
Cons: Most difficult implementation (would have to parse the Cython and translate). Each implementation would be dependent on scikit-learn:tree-featuresv2. So, if this fork is not updated regularly, then potential new trees from scikit-learn main cannot be implemented. Also, if scikit-learn:tree-featuresv2 ever ends up getting merged with scikit-learn main, we are dependent on them accepting our PRs for new tree types.

We can just ignore the Cython part. Just do everything in Python exactly like you're doing now, where we just modify the fit function in DecisionTreeClassifier and DecisionTreeRegressor.

sampan501 · 2023-03-22T20:14:25Z

Based on talk, we decided on option 2

adam2392 · 2023-03-22T20:24:35Z

Based on talk, we decided on option 2

And also on the following design option, with some checks on type of tree (must be classifier or regressor), and if it's already fitted or not:

from sklearn.base import MetaEstimatorMixin
from sklearn.tree import DecisionTreeClassifier

class HonestTree(MetaEstimatorMixin):
    def __init__(self, estimator, honest_ratio=0.5):
        ...

# example
clf = HonestTree(DecisionTreeClassifier())
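The checks mentioned above could be sketched roughly as follows. This is a standard-library approximation, not the PR's code: `validate_base_estimator` and `ToyTree` are hypothetical names, the "estimator-like" test is reduced to having `fit` and `predict`, and the "already fitted" test relies on the sklearn convention that fitted attributes end with an underscore. In real code, sklearn's `is_classifier`, `is_regressor`, and `check_is_fitted` helpers would probably be the better tools.

```python
def validate_base_estimator(estimator):
    """Reject objects that are not estimator-like or that are already fitted."""
    for method in ("fit", "predict"):
        if not callable(getattr(estimator, method, None)):
            raise TypeError(f"estimator must implement {method}()")

    # sklearn convention: attributes learned during fit end with "_".
    fitted_attrs = [
        name
        for name in vars(estimator)
        if name.endswith("_") and not name.startswith("__")
    ]
    if fitted_attrs:
        raise ValueError(f"estimator is already fitted: found {fitted_attrs}")
    return estimator


class ToyTree:
    """Hypothetical stand-in for a decision tree estimator."""

    def fit(self, X, y):
        self.tree_ = "built"
        return self

    def predict(self, X):
        return [0 for _ in X]


fresh = validate_base_estimator(ToyTree())
```

Passing `ToyTree().fit(...)` instead would raise `ValueError`, and an object without `fit` or `predict` would raise `TypeError`.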

adam2392

The honest tree meta estimator LGTM. Wanna add a unit test, or port the unit test if it already exists?

Also it'll be good to add a sklearn check_estimator test to the classes to check if we missed any edge case behavior.

sktree/ensemble/_honest_forest.py

adam2392

Left a few comments. Lmk what you think.

sktree/ensemble/_honest_forest.py

sktree/tree/_honest_tree.py

adam2392 · 2023-04-06T22:42:02Z

sktree/tree/_honest_tree.py

+        honest_leaves = self.tree_.apply(X[self.honest_indices_])
+
+        self.tree_.value[:, :, :] = 0
+        for leaf_id, yval in zip(honest_leaves, y[self.honest_indices_, 0]):
+            self.tree_.value[leaf_id][0, yval] += 1


This can be migrated to a private function perhaps called _set_leaf_nodes(self, X, y), where (X, y) are a data pair where X is used to traverse the already built tree and y is used to set the leaf nodes.

This can then lead us to easily supporting PropensityTree
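As an illustration of what such a helper computes, the loop in the diff above amounts to zeroing every node's class counts and then counting the honest labels that land in each leaf. The sketch below reproduces that logic on plain lists. `honest_leaf_counts` is a hypothetical name, and `leaf_ids` stands in for the result of `tree_.apply` on the honest samples.

```python
def honest_leaf_counts(leaf_ids, labels, n_nodes, n_classes):
    """Return per-node class counts built only from the honest samples."""
    # Start from all zeros, mirroring `self.tree_.value[:, :, :] = 0`.
    counts = [[0] * n_classes for _ in range(n_nodes)]
    # Each honest sample adds one count to the leaf it falls into.
    for leaf_id, label in zip(leaf_ids, labels):
        counts[leaf_id][label] += 1
    return counts


# Three honest samples: two fall into node 1, one into node 2.
counts = honest_leaf_counts(
    leaf_ids=[1, 1, 2], labels=[0, 1, 1], n_nodes=3, n_classes=2
)
# counts == [[0, 0], [1, 1], [0, 1]]
```

Node 0 keeps all-zero counts here because no honest sample reaches it, which is the situation that matters when the honest subset is small.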

sktree/tree/_honest_tree.py

adam2392 · 2023-04-18T16:20:25Z

Thinking of making some large scale changes sometime in the next few weekends: scikit-learn-contrib/scikit-learn-contrib#61 (comment)

Just checking in on this to see if you're able to get this in to consolidate the repo before then. Ideally aiming to merge this and #64 and #65 before then to make life easier

adam2392 · 2023-05-19T13:43:43Z

@sampan501 just checking in. what else is there to do on this PR?

sampan501 · 2023-05-19T13:45:08Z

I have to fix the CI errors. I plan on finishing that after submitting the supplemental to NeurIPS.

sampan501 · 2023-05-19T13:47:20Z

Quick question: you're not planning on using hyppo as a dependency of scikit-tree right?

adam2392 · 2023-05-19T13:54:16Z

Quick question: you're not planning on using hyppo as a dependency of scikit-tree right?

No, I would say the other way around makes more sense. If we want to incorporate and move hyppo stuff into pywhy-stats, then scikit-tree would be an optional dependency if one uses trees to do statistical testing; otherwise it's not required.

sampan501 · 2023-05-19T14:26:44Z

Ok good. I was planning on adding it as an optional dependency to hyppo.

adam2392 · 2023-06-01T19:21:28Z

@sampan501 I'm guessing you might be busy at your internship, so hope it's going well!

If so, @YuxinB would you be interested in finishing this PR that Sampan started? The ability to convert an existing tree class that we have to an honest estimator will be nice to test the different trees for MI/CMI.

sampan501 · 2023-06-02T11:21:58Z

Sounds good!

PSSF23

All tests from the old honest tree repo pass and the simulation figure is replicated (image not shown here), so I believe the current patch satisfies this PR's requirements. We can merge the honesty part first and handle other utilities like propensity trees in a next step.

PSSF23 · 2023-06-08T18:03:39Z

@adam2392 your thoughts?

adam2392

It mostly looks good. It would be good to tighten up the testing here while the PR is open.

Also, would it be easy and fast to include the figure that you regenerated from the paper as an example in examples/? Or is it a slow-running simulation?

sktree/tests/test_honest_tree.py

sktree/tests/test_honest_forest.py

sktree/tree/_honest_tree.py

sktree/tests/test_honest_tree.py

sktree/ensemble/_honest_forest.py

Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>

This reverts commit 88b4c68.

PSSF23

It seems that I was wrong about the impute function. It affects performance at very low honest fractions, but I don't know why the coverage report shows it as not covered.
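One plausible reason a fallback matters at low honest fractions: with few honest samples, some leaves receive none, so their class counts are all zero and cannot be normalized. The sketch below shows one generic way to handle that case by falling back to a prior. It is an assumption for illustration only and is not taken from the PR's impute function, whose code is not shown in this thread. `leaf_posterior` and `prior` are hypothetical names.

```python
def leaf_posterior(counts, prior):
    """Normalize leaf counts, falling back to `prior` for an empty leaf."""
    total = sum(counts)
    if total == 0:
        # No honest sample reached this leaf, so counts / total is undefined.
        return list(prior)
    return [c / total for c in counts]


prior = [0.5, 0.5]
empty_leaf = leaf_posterior([0, 0], prior)   # falls back to the prior
filled_leaf = leaf_posterior([1, 3], prior)  # ordinary normalization
```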

sktree/ensemble/_honest_forest.py

sktree/tests/test_honest_tree.py

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392

LGTM once CIs pass.

adam2392 · 2023-06-10T14:01:36Z

Thanks @sampan501 and @PSSF23 !

initial commit of honesty code from neurodata/honest-forests

53f915b

Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>

sampan501 requested a review from adam2392 March 22, 2023 19:18

sampan501 self-assigned this Mar 22, 2023

make honest tree a metaestimator

fc15b63

adam2392 reviewed Apr 6, 2023

sktree/ensemble/_honest_forest.py

sampan501 and others added 2 commits April 6, 2023 12:13

update honest forest to be a meta

a796790

Merge branch 'main' into honesty

e561cd2

adam2392 reviewed Apr 6, 2023

sampan501 and others added 4 commits April 19, 2023 14:00

Merge branch 'main' into honesty

f9a1dbc

Merge branch 'main' into honesty

a1761fa

Merge branch 'main' into honesty

f667121

Merge branch 'main' into honesty

acb4555

PSSF23 added 3 commits June 6, 2023 10:07

DOC modify docstrings

eb3c5df

Merge branch 'main' into honesty

53a0e5a

ENH initialize honest classes

14c354d

FIX optimize codecov upload with token

a148013

PSSF23 approved these changes Jun 8, 2023

PSSF23 marked this pull request as ready for review June 8, 2023 18:00

adam2392 reviewed Jun 9, 2023

sktree/tests/test_honest_tree.py Outdated

sktree/tests/test_honest_forest.py Outdated

sktree/tree/_honest_tree.py

sktree/tests/test_honest_tree.py Outdated

PSSF23 added 3 commits June 9, 2023 10:39

ENH add gaussian examples for honest forests

ab55ca4

ENH add tests for various estimators

cf3b576

STY remove unnecessary imports

5b483c5

adam2392 reviewed Jun 9, 2023

sktree/ensemble/_honest_forest.py Outdated

PSSF23 and others added 4 commits June 9, 2023 11:22

ENH add estimator checks for future proof

62a8dbb

DOC add sample weight note & FIX add check import

d84dfa1

ENH remove impute function

88b4c68

Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>

Revert "ENH remove impute function"

c23e6b4

This reverts commit 88b4c68.

PSSF23 reviewed Jun 9, 2023

adam2392 reviewed Jun 9, 2023

sktree/ensemble/_honest_forest.py Outdated

sktree/ensemble/_honest_forest.py Outdated

sktree/tests/test_honest_tree.py Outdated

adam2392 and others added 10 commits June 9, 2023 14:44

Merging main

459d449

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'main' into honesty

34cf0c4

Fix tests almost

0a77d83

Signed-off-by: Adam Li <adam2392@gmail.com>

Merged

7fb51f0

Signed-off-by: Adam Li <adam2392@gmail.com>

DOC update description

fdd67ab

FIX remove test file imports

35abad1

Fixed tests

0b0600f

Signed-off-by: Adam Li <adam2392@gmail.com>

Fixed honest forest

d2c05f9

Signed-off-by: Adam Li <adam2392@gmail.com>

Adding improved tests

18dfea6

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'main' into honesty

efd295b

adam2392 approved these changes Jun 10, 2023

adam2392 merged commit 41848be into main Jun 10, 2023
22 checks passed

adam2392 deleted the honesty branch June 10, 2023 14:01
