[ENH] Add "extratrees" for oblique trees #46 #75

SUKI-O · 2023-05-20T00:31:43Z

Implement #46

Changes proposed in this pull request:

Add 'Random' threshold selection for ObliqueTrees

Before submitting

I've read and followed all steps in the Making a pull request
section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the
Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.

codecov · 2023-05-20T00:43:08Z

Codecov Report

Patch coverage: 98.48% and project coverage change: +0.25% 🎉

Comparison is base (cf8e911) 86.73% compared to head (99cfcf6) 86.99%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #75      +/-   ##
==========================================
+ Coverage   86.73%   86.99%   +0.25%     
==========================================
  Files          26       26              
  Lines        2111     2153      +42     
==========================================
+ Hits         1831     1873      +42     
  Misses        280      280

Files Changed	Coverage Δ
sktree/__init__.py	`80.00% <ø> (ø)`
sktree/ensemble/__init__.py	`100.00% <ø> (ø)`
sktree/tree/tests/test_all_trees.py	`100.00% <ø> (ø)`
sktree/tree/tests/test_tree.py	`99.51% <96.42%> (+0.02%)`	⬆️
sktree/ensemble/_supervised_forest.py	`100.00% <100.00%> (ø)`
sktree/tests/test_supervised_forest.py	`99.41% <100.00%> (ø)`
sktree/tree/__init__.py	`100.00% <100.00%> (ø)`
sktree/tree/_classes.py	`75.00% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

SUKI-O · 2023-05-20T00:47:01Z

In Scikit-Learn ExtraTree, node_split_random function selects a feature randomly and selects a threshold randomly.

        current_split.threshold = rand_uniform(
            min_feature_value,
            max_feature_value,
            random_state,
        )

It's picking rand_uniform between (min feature val, max feature val). But why can we sample p from range(p) for example? Wouldn't that be a better sampling strategy? The latter is more cognizant of the sample distribution at that node, no?

adam2392

Thanks! Good first try. This doesn't work the way intended as of now:

If you look at the design of the trees/splitter/criterion and Python wrapper, the Python wrapper instantiates each class and then passes them to some Tree Builder. The tree builder calls the API of each class. In https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_tree.pyx#L505, we see that the way it is called is in line w/ the abstract API defined by BaseSplitter.node_split. The current implementation would never be able to be used.
The only difference between extra trees and normal trees is that instead of sorting and choosing the best threshold at each split point, a random threshold is chosen. So: https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L714-L722 vs https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L409. Other lOC might be necessary due to the minor change in logic. I.e. we don't need to sort.
Re above point, rather than evaluate all splits, we don't need to do that anymore. A split point is just randomly chosen. I.e. we don't need to evaluate all split points. https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L427-L462.

Moreover, the logic is slightly different in the oblique splitter implementation. So you're basically taking the oblique splitter implementation and changing the analogous logic. Lmk if this makes sense. If it helps you when speccing out the high-level changes needed, then you can also write an outline of a code plan, and I can lyk if it's in the right direction.

This requires a bit of reading the code and perhaps writing down how things are moving and such to fully understand the tree implementation.

SUKI-O · 2023-05-21T02:47:30Z

At a high level, I was thinking of the following:

When ExtraObliqueTree is initialized, its splitter is set to random which instantiates RandomObliqueSplitter
It inherits BestObliqueSplitter but overwrites the node_split function.
The RandomObliqueSplitter.node_split method picks features via sample_proj_mat as in BestObliqueSplitter and iterates over each chosen feature except that it chooses threshold p randomly.
- Check if the min_samples_leaf criterion is met, else pick another random threshold
- Check if the min_weight_leaf criterion is met, else pick another random threshold
Reorganize into samples[start:p] + samples[p:end] --- same as in BaseObliqueSplitter

Am I totally off?
Don't we need to sort it anyway in order to split the samples into children nodes? I guess you can do it without sorting but computationally no less expensive, I don't think.
And to check for the min_weight_leaf criterion, we still need to evaluate at that particular choice of threshold, right?
Thanks again for the help :)

adam2392 · 2023-05-22T04:35:28Z

At a high level, I was thinking of the following:

When ExtraObliqueTree is initialized, its splitter is set to random which instantiates RandomObliqueSplitter

It inherits BestObliqueSplitter but overwrites the node_split function.

The RandomObliqueSplitter.node_split method picks features via sample_proj_mat as in BestObliqueSplitter and iterates over each chosen feature except that it chooses threshold p randomly.

Check if the min_samples_leaf criterion is met, else pick another random threshold

Check if the min_weight_leaf criterion is met, else pick another random threshold

Reorganize into samples[start:p] + samples[p:end] --- same as in BaseObliqueSplitter

Yeah more or less.

And then some FYI, in case it isn't clear in the code: start, p, end are pointers to the sample array. I.e. if you have 100 samples, these are pointers to indices to prevent data copying. start:p = left child usually and p:end = right child usually and start:end = current node.

Am I totally off? Don't we need to sort it anyway in order to split the samples into children nodes? I guess you can do it without sorting but computationally no less expensive, I don't think.

No this is more or less in the right direction. The whole point of choosing a threshold at random is so you can i) avoid sorting and ii) avoid computing the criterion for each possible split threshold.

And to check for the min_weight_leaf criterion, we still need to evaluate at that particular choice of threshold, right?

Well, more so that once we pick the threshold yes we have to evaluate it. But see comment above. The main point of random is to prevent the need to evaluate all particular choices of thresholds. This complements the randomness induced by: i) random subsetting of data is fed into each tree, ii) random features are picked for possible splitting on at each node in trees.

adam2392

Copying / pasting your comment in Slack here to leave comments (btw feel free to just comment on GH; it is a lot easier to respond):

I created RandomObliqueSplitter that inherits from BestObliqueSplitter which has sample_proj_mat function. It overwrites the node_split function with 'random' version that picks current_split.threshold at random.

Scikit-learn's random splitter has a slightly different design, but the high-level logic is more or less similar. I would recommend reviewing this function and writing out the logic for yourself on paper.

https://github.com/scikit-learn/scikit-learn/blob/4f89e717570e76650ca6f73e25c5151f8902dad5/sklearn/tree/_splitter.pyx#L625

It ensures min_weight_leaf and min_samples_leaf criteria are met but besides that it does not sort or evaluate other criterion, just once for the randomly chosen threshold. It builds but many test fails and I am clearly not understanding the root cause. Is there some specific consideration needed for testing 'random' version or should all the tests be applicable regardless?

Currently, you are sorting still, so I don't think this is accurate.

I would take a look at the sklearn implementation of best splitter and random splitter and write down the logic to compare. Next I would take a look at our oblique best splitter.

adam2392 · 2023-06-06T15:02:51Z

sktree/tree/_oblique_splitter.pyx

+            # Sort the samples
+            sort(&feature_values[start], &samples[start], end - start)


See here you are still sorting.

# Draw a random threshold current_split.threshold = rand_uniform( min_feature_value, max_feature_value, random_state, )

In node_split_random in Splitter, it samples a random (uniform) threshold between min_feature_value and max_feature_value as above and find and set .pos according to it by partitioner. partition_samples. To find min_feature_valueandmax_feature_value, it uses partitioner.find_min_max, which runs iteratively each feature value and is basically the same as sorting. I was going to use that to get min_feature_valueandmax_feature_valuethen follow the same logic asnode_split_random. But I had a question: can we just randomly sample current_split.posfromrand_int(start+min_samples_leaf, end, &self.rand_r_state)`? Is that a bad sampling strategy? Is either better than the other? If the latter is an acceptable sampling method, it'll be faster, (and no sorting).

I'm not sure I understand the question. You wouldn't randomly sample pos. Also finding min/max is not the same as sorting. Sorting is O(nlog(n)). Finding min/max is O(n).

The pseudocode for RandomSplitter.node_split is the following. The only difference is that compute_features() computes a projection of the data, rather than just choosing a column of the data. I think sklearn's random splitter does some extra stuff like keeping track of what columns contain constant values, whereas we don't do that.

# Some checks and stuff are missing, but this is the general gist, which is what's in sklearn's random splitting for idx in range(max_features): # for each projection, compute the features over all samples # (n_samples,) vector current_feature_values = compute_features_over_samples # find min and max of current feature values min, max = find_min_max(current_feature_values) # draw random thresohld current_split.threshodl = rand_uniform(min, max, randstate) # partition samples given the threshold by moving feature values and samples inplace current_split.pos = partition_samples() # evaluate split criterion.reset() criterion.update(current_split.pos) current_proxy_improvement = self.criterion.proxy_impurity_improvement() if current_proxy_improvement > best_proxy_improvement: best_proxy_improvement = current_proxy_improvement best = current # copy # reorganize samples inplace samples[start:best.pos] + samples[best.pos:end] # copy of the previous code

Possibly some good papers on this topic are the extra trees publication linked in the GH issue and the original random forest paper by Breiman.

adam2392 · 2023-07-13T14:48:28Z

Hi @SUKI-O lmk when you feel this is ready for review. It might help if you are able to write a short example using simulated data to show how this is faster than the normal oblique best splitter when building an Oblique forest.

It should also ideally be similar in performance.

Some example for inspiration:

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html#sphx-glr-auto-examples-ensemble-plot-forest-iris-py

SUKI-O · 2023-07-14T08:19:39Z

Hi @SUKI-O lmk when you feel this is ready for review. It might help if you are able to write a short example using simulated data to show how this is faster than the normal oblique best splitter when building an Oblique forest.

It should also ideally be similar in performance.

Some example for inspiration:

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html#sphx-glr-auto-examples-ensemble-plot-forest-iris-py

Do you think I can/should use the dataset used for ObliqueRF like this to run the test/example? If not, any suggestions on what may work best for this application?

adam2392 · 2023-07-14T12:38:50Z

Sure you can use that dataset.

I would also have an example comparing the decision boundaries on the iris dataset.

examples/plot_extra_oblique_random_forest.py

sktree/tree/_oblique_splitter.pyx

docs/whats_new/v0.1.rst

adam2392

The Cython code looks more correct to me now, but make sure you remove any extraneous code that is not being used.

In the examples, these are user-facing documentation that we want to hold to a pretty high standard because it should convey some "concept". In your case, it is comparing "extratrees*" vs "normal trees*" in some classification dataset.

We want to see how much time is saved vs how much performance changes.

Holding these examples to a high standard also helps you because if you ever wonder how to run something, you can refer back to the example and have high-quality sample code that you previously wrote.

examples/plot_extra_oblique_random_forest.py

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 · 2023-08-24T12:45:08Z

@SUKI-O

I merged in changes from main and also fixed some unnecessary LOC that were added (#75 (comment)). If you want to re-run the exps. I suspect they will improve.

I think I fixed most of the CI errors in 338428a. So feel free to just rebuild and run the exps and if we see an improvement that we hypothesize, we can proceed.

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 · 2023-09-03T15:40:50Z

https://output.circle-artifacts.com/output/job/80903942-02eb-4fcd-a7ca-0d25ec870888/artifacts/0/dev/auto_examples/plot_extra_oblique_random_forest.html#sphx-glr-auto-examples-plot-extra-oblique-random-forest-py

Shows EORF is slower. I think this could be due to the really low sample size you force in the cnae-9, har, lsvt datasets. Perhaps you can increase the sample size and run it.

You can build and run it locally to get something that makes sense.

SUKI-O · 2023-09-03T22:07:34Z

I've tweeked around to get better results but for wide datasets like cnae-9 and lsvt, EORF is always slower. Does this make sense to you?

[Edit]

If I set max_depth = 3, max_features = 30, I could get to the above result. Is this legit?

adam2392 · 2023-09-05T15:16:54Z

I've tweeked around to get better results but for wide datasets like cnae-9 and lsvt, EORF is always slower. Does this make sense to you?
[Edit]

If I set max_depth = 3, max_features = 30, I could get to the above result. Is this legit?

Yes these make sense. I would add a note explaining that you set the max_depth=3 purely for demonstration purposes. In general, the max_depth can be tuned. The speedup is relative to sample size. O(n) in extra trees vs O(n log(n)) in normal trees. So limiting the sample size, we might see other factors play into the speed, which explains the fact that normal trees were actually faster (e.g. normal trees might've trained to smaller depths).

I would read the extra trees paper if that helps to communicate the key points: bias-variance-tradeoff, computational runtime. This is a good SO post too on the topic: https://stats.stackexchange.com/questions/175523/difference-between-random-forest-and-extremely-randomized-trees

adam2392

Okay from the exps, the general code seems fine!

There are a few issues tho to be addressed in order for this to get merged:

unused code: if code is not used, it should not be committed. If I'm misunderstanding if certain pieces of code are actually used, lmk.
examples: The examples can be improved: https://output.circle-artifacts.com/output/job/80903942-02eb-4fcd-a7ca-0d25ec870888/artifacts/0/dev/auto_examples/plot_extra_oblique_random_forest.html#sphx-glr-auto-examples-plot-extra-oblique-random-forest-py.

the examples should clearly convey some concepts. Right now, they are reworded versions of other examples. The example should guide the user to learning, or understanding something.

The iris example should clearly convey the differences between normal oblique decision trees and extra oblique decision trees. I.e. the only difference is the randomly chosen thresholds vs picking best thresholds and the trees are still useful.

The cc-18 example should clearly convey the tradeoff between choosing extra vs normal oblique decision trees.

examples/plot_decision_surface_iris.py

examples/plot_extra_oblique_random_forest.py

sktree/tree/_utils.pyx

examples/plot_extra_oblique_random_forest.py

adam2392 · 2023-09-05T22:58:46Z

examples/plot_extra_oblique_random_forest.py

+params = {
+    "max_features": None,
+    "n_estimators": 50,
+    "max_depth": 5,
+    "random_state": random_state,
+    "n_cv": 10,
+    "n_repeats": 1,
+}


You should add a note here regarding max_depth and the n_estimators being set low to allow the example to run on the CI. You can add some intuition on how these should be set normally.

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 · 2023-09-10T04:36:28Z

@SUKI-O I cleaned up the PR to remove any unnecessary code or documentation: d47f718

Lmk if this looks good to you and I will merge?

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 · 2023-09-10T18:12:14Z

I will merge this in for now since this PR has the biggest diff and there are other PRs on the queue, but if there are questions, or comments you have, feel free to open up a GH issue.

adam2392 · 2023-09-10T18:13:22Z

Thanks @SUKI-O !

initial commit

fcbda1d

adam2392 requested changes May 20, 2023

View reviewed changes

SUKI-O added 2 commits May 25, 2023 18:52

WIP: modify node_split_random

d3adbd5

WIP: modify node_split func in splitter

2b41c97

adam2392 reviewed Jun 6, 2023

View reviewed changes

SUKI-OGIHARA added 2 commits July 3, 2023 13:53

merging main

1453a48

rework extra-trees, add tests

0c81ea0

SUKI-O marked this pull request as ready for review July 8, 2023 05:13

merge main

1540e6d

adam2392 mentioned this pull request Jul 13, 2023

[ENH] Extended isolation forest #101

Merged

5 tasks

SUKI-OGIHARA added 2 commits July 14, 2023 19:53

add example plot to compare extra oblique RF & oblique RF

1f851a5

Merge branch 'main' into extraobliquetree

56dd610