[ENH] add PermutationForests and FeatureImportanceForests to sktree #125

PSSF23 · 2023-09-11T19:11:49Z

Changes proposed in this pull request:

Related to Posterior Forests (or whatever we call them) #111

#112 , #120 to be addressed in a future PR

Before submitting

I've read and followed all steps in the Making a pull request
section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the
Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.

Co-Authored-By: Sambit Panda <36676569+sampan501@users.noreply.github.com> Co-Authored-By: Yuxin <99897042+YuxinB@users.noreply.github.com> Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>

sampan501 · 2023-09-11T19:13:56Z

Will this also have MIRF with Mutual Info as a stat?

codecov · 2023-09-11T19:15:19Z

Codecov Report

Attention: 225 lines in your changes are missing coverage. Please review.

Comparison is base (9b486bc) 87.71% compared to head (be16e5a) 44.51%.
Report is 1 commits behind head on main.

❗ Current head be16e5a differs from pull request most recent head 60d9c85. Consider uploading reports for the commit 60d9c85 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #125       +/-   ##
===========================================
- Coverage   87.71%   44.51%   -43.21%     
===========================================
  Files          30       36        +6     
  Lines        2426     3116      +690     
===========================================
- Hits         2128     1387      -741     
- Misses        298     1729     +1431

| Files | Coverage Δ | |
| --- | --- | --- |
| sktree/__init__.py | `80.76% <100.00%> (ø)` | |
| sktree/conftest.py | `100.00% <100.00%> (ø)` | |
| sktree/ensemble/_eiforest.py | `57.14% <ø> (-42.86%)` | ⬇️ |
| sktree/stats/__init__.py | `100.00% <100.00%> (ø)` | |
| sktree/tree/__init__.py | `100.00% <100.00%> (ø)` | |
| sktree/tests/test_honest_forest.py | `35.95% <0.00%> (-64.05%)` | ⬇️ |
| sktree/ensemble/_honest_forest.py | `51.19% <60.00%> (-40.28%)` | ⬇️ |
| sktree/tree/_classes.py | `48.72% <37.50%> (-26.28%)` | ⬇️ |
| sktree/tree/tests/test_honest_tree.py | `33.78% <25.00%> (-66.22%)` | ⬇️ |
| sktree/tree/_honest_tree.py | `20.80% <30.00%> (-78.60%)` | ⬇️ |
| ... and 4 more | | |

... and 18 files with indirect coverage changes


PSSF23

Referring to this file in hyppo. I removed everything related to hyppo for now, including the mutual information function, to avoid confusion. The _might.py file includes all other single-feature and multi-view methods we developed.

adam2392 · 2023-09-11T19:17:42Z

For now I will probably remove the multi-view stuff, and then look at how we can introduce arbitrary metrics here, e.g. MI, ROC AUC, etc.

I'll take a look tonight

PSSF23

Per @sampan501, added stat="MI" and stat="AUC" as parameters for statistic().
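A minimal sketch of how a string-valued `stat` argument could select the statistic. The registry, the `register_statistic` decorator, and `compute_statistic` are hypothetical names used for illustration, not the code in this PR. Only the AUC branch is filled in; MI could be registered the same way.

```python
from typing import Callable, Dict, Sequence

# Hypothetical registry mapping a `stat` name to a scoring function.
STATISTICS: Dict[str, Callable[[Sequence[int], Sequence[float]], float]] = {}


def register_statistic(name: str):
    """Register a statistic under an upper-cased name."""

    def decorator(func):
        STATISTICS[name.upper()] = func
        return func

    return decorator


@register_statistic("AUC")
def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve for binary labels via pairwise comparisons."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compute_statistic(y_true, scores, stat: str = "AUC") -> float:
    """Look up `stat` in the registry (case-insensitive) and evaluate it."""
    try:
        func = STATISTICS[stat.upper()]
    except KeyError:
        raise ValueError(
            f"Unknown stat {stat!r}; choose from {sorted(STATISTICS)}"
        ) from None
    return func(y_true, scores)


print(compute_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], stat="AUC"))  # 0.75
```

Keeping the lookup in one place means an unknown name fails early with a clear message instead of deep inside the test routine.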

sampan501 · 2023-09-12T11:25:00Z

@PSSF23 Can you also rename it to MIGHT? Might make all this name changing easier to follow - no pun intended

PSSF23

Renamed to MIGHT and MIGHT_MV
Added y-label permutation test to MIGHT (previously removed due to connection to hyppo)
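For readers unfamiliar with a y-label permutation test, here is a generic sketch of the idea. It is not the MIGHT implementation: `mean_difference` is a stand-in statistic, and `permutation_pvalue`, `n_repeats`, and `seed` are names assumed for this example.

```python
import random
from statistics import mean
from typing import Sequence


def mean_difference(y: Sequence[int], scores: Sequence[float]) -> float:
    """Stand-in statistic: mean score of class 1 minus mean score of class 0."""
    ones = [s for label, s in zip(y, scores) if label == 1]
    zeros = [s for label, s in zip(y, scores) if label == 0]
    return mean(ones) - mean(zeros)


def permutation_pvalue(y, scores, n_repeats: int = 200, seed: int = 0) -> float:
    """One-sided p-value from shuffling the labels `n_repeats` times."""
    rng = random.Random(seed)
    observed = mean_difference(y, scores)
    permuted = list(y)
    exceed = 0
    for _ in range(n_repeats):
        rng.shuffle(permuted)  # breaks any link between labels and scores
        if mean_difference(permuted, scores) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one draw from the null.
    return (1 + exceed) / (1 + n_repeats)


# Scores that separate the two classes perfectly should give a small p-value.
y = [0] * 10 + [1] * 10
scores = [i / 20 for i in range(20)]
print(permutation_pvalue(y, scores))
```

Shuffling preserves the class counts, so the null distribution reflects only the loss of the label-to-score pairing.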

PSSF23 · 2023-09-12T13:22:31Z

@sampan501 I might be misunderstanding the MI calculation, but why did the original method have the axis=1 param? If it's not needed I'll correct MIGHT_MV as well.

sampan501 · 2023-09-12T14:29:01Z

predict_proba returns an (n_samples, n_classes) array, so the previous MI calculation was averaging over the classes. We don't need to do that since you already do it in forest_pos. We can probably remove the mean for both calculations too.
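To make the two axes concrete, here is a sketch of one common plug-in estimate, I(X; Y) = H(Y) - H(Y | X), computed from class posteriors. It is written under assumptions and is not the function in this PR; `mutual_information` and its arguments are illustrative names. The class axis (axis=1) is reduced by the entropy sum, and the only average is over samples. Note that a plain mean over the class axis of a probability row carries no information, since each row sums to one and the mean is always 1 / n_classes.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(
    y_true: Sequence[int], posterior: Sequence[Sequence[float]]
) -> float:
    """Estimate I(X; Y) = H(Y) - H(Y | X) in nats from class posteriors.

    `posterior` has one row per sample and one column per class, like the
    (n_samples, n_classes) output of `predict_proba`.
    """
    n_samples = len(y_true)
    # H(Y): entropy of the empirical label distribution.
    h_y = -sum(
        (count / n_samples) * math.log(count / n_samples)
        for count in Counter(y_true).values()
    )
    # Per-sample entropy: reduce over the class axis (axis=1) by summing.
    per_sample = [-sum(p * math.log(p) for p in row if p > 0) for row in posterior]
    # H(Y | X): a single average over the sample axis (axis=0).
    h_y_given_x = sum(per_sample) / len(per_sample)
    return h_y - h_y_given_x


labels = [0, 1, 0, 1]
confident = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5]] * 4
print(mutual_information(labels, confident))      # close to log(2)
print(mutual_information(labels, uninformative))  # close to 0
```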

sampan501 · 2023-09-12T14:29:51Z

Before we make any serious changes like the one above, we really need unit tests for this method

PSSF23

I made an empty test file. Let's edit and add to this message about what we need to test:

test on iris for mutual info for accuracy
partial AUROC
multiview splitter (not this time)

adam2392 · 2023-09-12T14:50:31Z

Let's do multi-view in a separate PR.


PSSF23

According to the unit test results, MIGHT seems to perform worst when coupled with PatchObliqueDecisionTreeClassifier(). I remember similar situations with the honest tree tests. Should I lower the passing threshold or remove the estimator option?

adam2392 · 2023-09-12T19:58:32Z

> Changes proposed in this pull request:
>
> Fix Posterior Forests (or whatever we call them) #111
>
> Fix add multiview code/docs/tutorial #112
>
> Fix pAUC & pvalue #120

I would remove the last two bullets so that those issues don't get closed without actually being resolved.


PSSF23

In FI, I cleared all the duplicate variables like posteriors_final_ and observe_posteriors_, and saved the permuted statistic as the class variable permute_stat_ (all other permutation results are saved, so this one should be too). I also made MI the default statistic everywhere, but we might change it to pAUC later.


adam2392 · 2023-10-03T17:53:59Z

sktree/stats/tests/test_forestht.py

+@pytest.mark.parametrize("backend", ["loky", "threading"])
+@pytest.mark.parametrize("n_jobs", [1, -1])
+def test_parallelization(backend, n_jobs):
+    """Test parallelization of training forests."""
+    n_samples = 100
+    n_features = 5
+    X = rng.uniform(size=(n_samples, n_features))
+    y = rng.integers(0, 2, size=n_samples)  # Binary classification
+
+    def run_forest(covariate_index=None):
+        clf = FeatureImportanceForestClassifier(
+            estimator=HonestForestClassifier(
+                n_estimators=10, random_state=seed, n_jobs=n_jobs, honest_fraction=0.2
+            ),
+            test_size=0.5,
+        )
+        pvalue = clf.test(X, y, covariate_index=[covariate_index], metric="mi")
+        return pvalue
+
+    out = Parallel(n_jobs=1, backend=backend)(
+        delayed(run_forest)(covariate_index) for covariate_index in range(n_features)
+    )
+    assert len(out) == n_features


@sampan501 to my knowledge, any issues w/ joblib should be fixed or are the result of some other issues perhaps?

Lmk if this unit-test sufficiently captures what the usage looks like in the power simulations.

n_jobs = 1 should be n_jobs = n_jobs in line 411, but otherwise good

PSSF23 · 2023-10-03T21:10:09Z

@adam2392 can the print statements in FIClf be removed?

adam2392 · 2023-10-03T21:29:16Z

Yeah, feel free to push that. I'm making some other changes though, so for anything larger feel free to open a PR against this PR.


sampan501 · 2023-10-04T15:53:40Z

sktree/stats/tests/test_forestht.py

@@ -205,12 +209,12 @@ def test_linear_model(hypotester, model_kwargs, n_samples, n_repeats, test_size)
                    n_jobs=-1,
                ),
                "random_state": seed,
-                "permute_per_tree": True,
-                "sample_dataset_per_tree": True,
+                "permute_per_tree": False,


This is testing MI Sep and not MI/Tree right? Where are we testing MI/Tree?

Agreed. This is MI Sep. We should test MI / Tree as well.

Yeah I will add a pytest.mark.parametrize in a bit. I think last night the pvalue behavior was not converging as well as the MI Sep.


sampan501 · 2023-10-05T03:09:19Z

@adam2392 @PSSF23 I'm still having an error with the sample size when using the parameters from our meeting today (n = 32, test_size = 0.2). Here's the trace:

"""
Traceback (most recent call last):
  File "/data/sambit/miniconda3/envs/cancer/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/data/sambit/miniconda3/envs/cancer/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/miniconda3/envs/cancer/lib/python3.11/site-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/miniconda3/envs/cancer/lib/python3.11/site-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/mendseqs/high-d/high-d-sims.py", line 1342, in compute_null
    pval = _nonperm_pval(test, sim, n, p, noise=noise, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/mendseqs/high-d/high-d-sims.py", line 1277, in _nonperm_pval
    pvalue = test[0](**test[1]).test(u, v, **kwargs)[1]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/scikit-tree/sktree/stats/forestht.py", line 412, in test
    metric_star, metric_star_pi = _compute_null_distribution_coleman(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/scikit-tree/sktree/stats/utils.py", line 291, in _compute_null_distribution_coleman
    first_half_metric = metric_func(y_test[non_nan_samples, :], y_pred_first_half)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sambit/scikit-tree/sktree/stats/utils.py", line 34, in _mutual_information
    raise ValueError(f"y_true must be 1d, not {y_true.shape}")
ValueError: y_true must be 1d, not (1, 1)
"""

adam2392 · 2023-10-05T04:00:26Z

Any chance you can reproduce the error w/ a small code snippet?

The following code works for me:

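import numpy as np

# Imports and RNG setup are assumed to match the PR's test module.
from sktree import HonestForestClassifier
from sktree.stats import FeatureImportanceForestClassifier

seed = 12345  # assumed value; any fixed seed works
rng = np.random.default_rng(seed)


def test_small_dataset():
    # Same sample size and test_size as in the reported failure.
    n_samples = 32
    n_features = 5
    X = rng.uniform(size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)  # Binary classification

    clf = FeatureImportanceForestClassifier(
        estimator=HonestForestClassifier(
            n_estimators=10, random_state=seed, n_jobs=1, honest_fraction=0.5
        ),
        test_size=0.2,
        permute_per_tree=False,
        sample_dataset_per_tree=False,
    )
    stat, pvalue = clf.test(X, y, covariate_index=[1, 2], metric="mi")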
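For reference, the `(1, 1)` shape reported in the trace above can be reproduced with plain NumPy indexing. This sketch only illustrates one way such a shape can arise; the seven-row label array and the mask are made up for illustration, and the thread does not confirm that this is the actual cause.

```python
import numpy as np

# Assumed setup: labels stored as a column vector, as the trailing `:` in
# `y_test[non_nan_samples, :]` from the trace suggests.
y_test = np.array([[0], [1], [1], [0], [1], [0], [1]])

# Hypothetical mask in which only one test sample has a non-NaN prediction.
non_nan_samples = np.array([False, False, True, False, False, False, False])

subset = y_test[non_nan_samples, :]
print(subset.shape)          # (1, 1): still 2-D, which a strict 1-D check rejects
print(subset.ravel().shape)  # (1,): flattened to 1-D
```

Boolean-mask indexing with a trailing `:` keeps the column axis even when a single row survives, so a metric that requires 1-D labels needs an explicit `ravel()` (or similar) on its input.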

adam2392 · 2023-10-05T04:11:09Z

FYI, I added a short unit-test to test small sample-sizes.

sampan501 · 2023-10-05T11:26:12Z

I will once I find the simulation and sample size that's causing the issue

ENH initialize with MIRF_AUC and MIRF_MV

5a19fc6

Co-Authored-By: Sambit Panda <36676569+sampan501@users.noreply.github.com> Co-Authored-By: Yuxin <99897042+YuxinB@users.noreply.github.com> Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>

PSSF23 requested review from adam2392, sampan501 and YuxinB September 11, 2023 19:11

ENH add statistic alternatives

f1a8e49

sampan501 and others added 3 commits September 12, 2023 08:15

no axis=1 when taking posterior slice in MI

fd0b937

ENH add y-label permutation test to MIGHT

efc2587

FIX rename import

d1f7748

FIX, correct function param

d4abb4a

FIX remove axis param & TST initialize test file

a342482

adam2392 and others added 5 commits September 12, 2023 11:11

Adding modularity

69c76a8

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'might' of https://github.com/neurodata/scikit-tree into…

11f38c2

… might

TST experiment with unit test

b3ab11d

Merge branch 'main' into might

3ef1d7e

FIX correct variable name

4ed31f8

TST remove patch oblique tree tests

9114859

adam2392 and others added 3 commits October 2, 2023 21:40

Add clone to get estimators

80ada68

Signed-off-by: Adam Li <adam2392@gmail.com>

ENH mark all default tests as MI and correct posterior return parameter

ff37740

FIX unify all variable names so posteriors are not saved twice

aed9179

Add permute_stat to class variable

adam2392 added 3 commits October 3, 2023 13:22

Add additional testing

c716440

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix CI

f6cb04b

Signed-off-by: Adam Li <adam2392@gmail.com>

Adding parallelization test

8df008d

Signed-off-by: Adam Li <adam2392@gmail.com>

PSSF23 and others added 5 commits October 3, 2023 20:07

FIX remove extra print statememts

7964d99

Add fixes

3a2279a

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'might' of https://github.com/neurodata/scikit-tree into…

37e6643

… might

Add parallelization to the tree building and predicting posteriors

3a4a4b4

Signed-off-by: Adam Li <adam2392@gmail.com>

PSSF23 and others added 3 commits October 4, 2023 14:43

ENH add MIGHT example notebook on AUC

e91060f

Consolidate parallleization

efbd440

Signed-off-by: Adam Li <adam2392@gmail.com>

Merge branch 'might' of https://github.com/neurodata/scikit-tree into…

8718b0f

… might

set default for covariate_index in ForestHT test

be16e5a

Add unit-test for small sample sizes

26b5b5f

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 added 2 commits October 5, 2023 10:29

Final commit

80a4304

Signed-off-by: Adam Li <adam2392@gmail.com>

Release v0.2

60d9c85

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 merged commit f8a2ff7 into main Oct 5, 2023
21 of 22 checks passed

adam2392 deleted the might branch October 5, 2023 14:31
