Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add "extratrees" for oblique trees #46 #75

Merged
merged 34 commits into from
Sep 10, 2023

Conversation

SUKI-O
Copy link
Member

@SUKI-O SUKI-O commented May 20, 2023

Implement #46

Changes proposed in this pull request:

  • Add 'Random' threshold selection for ObliqueTrees

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@codecov
Copy link

codecov bot commented May 20, 2023

Codecov Report

Patch coverage: 98.48% and project coverage change: +0.25% 🎉

Comparison is base (cf8e911) 86.73% compared to head (99cfcf6) 86.99%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #75      +/-   ##
==========================================
+ Coverage   86.73%   86.99%   +0.25%     
==========================================
  Files          26       26              
  Lines        2111     2153      +42     
==========================================
+ Hits         1831     1873      +42     
  Misses        280      280              
Files Changed Coverage Δ
sktree/__init__.py 80.00% <ø> (ø)
sktree/ensemble/__init__.py 100.00% <ø> (ø)
sktree/tree/tests/test_all_trees.py 100.00% <ø> (ø)
sktree/tree/tests/test_tree.py 99.51% <96.42%> (+0.02%) ⬆️
sktree/ensemble/_supervised_forest.py 100.00% <100.00%> (ø)
sktree/tests/test_supervised_forest.py 99.41% <100.00%> (ø)
sktree/tree/__init__.py 100.00% <100.00%> (ø)
sktree/tree/_classes.py 75.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SUKI-O
Copy link
Member Author

SUKI-O commented May 20, 2023

In Scikit-Learn ExtraTree, node_split_random function selects a feature randomly and selects a threshold randomly.

        current_split.threshold = rand_uniform(
            min_feature_value,
            max_feature_value,
            random_state,
        )

It's picking rand_uniform between (min feature val, max feature val). But why can we sample p from range(p) for example? Wouldn't that be a better sampling strategy? The latter is more cognizant of the sample distribution at that node, no?

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good first try. This doesn't work the way intended as of now:

Moreover, the logic is slightly different in the oblique splitter implementation. So you're basically taking the oblique splitter implementation and changing the analogous logic. Lmk if this makes sense. If it helps you when speccing out the high-level changes needed, then you can also write an outline of a code plan, and I can lyk if it's in the right direction.

This requires a bit of reading the code and perhaps writing down how things are moving and such to fully understand the tree implementation.

@SUKI-O
Copy link
Member Author

SUKI-O commented May 21, 2023

At a high level, I was thinking of the following:

  • When ExtraObliqueTree is initialized, its splitter is set to random which instantiates RandomObliqueSplitter
  • It inherits BestObliqueSplitter but overwrites the node_split function.
  • The RandomObliqueSplitter.node_split method picks features via sample_proj_mat as in BestObliqueSplitter and iterates over each chosen feature except that it chooses threshold p randomly.
    • Check if the min_samples_leaf criterion is met, else pick another random threshold
    • Check if the min_weight_leaf criterion is met, else pick another random threshold
  • Reorganize into samples[start:p] + samples[p:end] --- same as in BaseObliqueSplitter

Am I totally off?
Don't we need to sort it anyway in order to split the samples into children nodes? I guess you can do it without sorting but computationally no less expensive, I don't think.
And to check for the min_weight_leaf criterion, we still need to evaluate at that particular choice of threshold, right?
Thanks again for the help :)

@adam2392
Copy link
Collaborator

At a high level, I was thinking of the following:

  • When ExtraObliqueTree is initialized, its splitter is set to random which instantiates RandomObliqueSplitter

  • It inherits BestObliqueSplitter but overwrites the node_split function.

  • The RandomObliqueSplitter.node_split method picks features via sample_proj_mat as in BestObliqueSplitter and iterates over each chosen feature except that it chooses threshold p randomly.

    • Check if the min_samples_leaf criterion is met, else pick another random threshold
    • Check if the min_weight_leaf criterion is met, else pick another random threshold
  • Reorganize into samples[start:p] + samples[p:end] --- same as in BaseObliqueSplitter

Yeah more or less.

And then some FYI, in case it isn't clear in the code: start, p, end are pointers to the sample array. I.e. if you have 100 samples, these are pointers to indices to prevent data copying. start:p = left child usually and p:end = right child usually and start:end = current node.

Am I totally off? Don't we need to sort it anyway in order to split the samples into children nodes? I guess you can do it without sorting but computationally no less expensive, I don't think.

No this is more or less in the right direction. The whole point of choosing a threshold at random is so you can i) avoid sorting and ii) avoid computing the criterion for each possible split threshold.

And to check for the min_weight_leaf criterion, we still need to evaluate at that particular choice of threshold, right?

Well, more so that once we pick the threshold yes we have to evaluate it. But see comment above. The main point of random is to prevent the need to evaluate all particular choices of thresholds. This complements the randomness induced by: i) random subsetting of data is fed into each tree, ii) random features are picked for possible splitting on at each node in trees.

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying / pasting your comment in Slack here to leave comments (btw feel free to just comment on GH; it is a lot easier to respond):

I created RandomObliqueSplitter that inherits from BestObliqueSplitter which has sample_proj_mat function. It overwrites the node_split function with 'random' version that picks current_split.threshold at random.

Scikit-learn's random splitter has a slightly different design, but the high-level logic is more or less similar. I would recommend reviewing this function and writing out the logic for yourself on paper.

https://github.com/scikit-learn/scikit-learn/blob/4f89e717570e76650ca6f73e25c5151f8902dad5/sklearn/tree/_splitter.pyx#L625

It ensures min_weight_leaf and min_samples_leaf criteria are met but besides that it does not sort or evaluate other criterion, just once for the randomly chosen threshold. It builds but many test fails and I am clearly not understanding the root cause. Is there some specific consideration needed for testing 'random' version or should all the tests be applicable regardless?

Currently, you are sorting still, so I don't think this is accurate.

I would take a look at the sklearn implementation of best splitter and random splitter and write down the logic to compare. Next I would take a look at our oblique best splitter.

Comment on lines 615 to 616
# Sort the samples
sort(&feature_values[start], &samples[start], end - start)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here you are still sorting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    # Draw a random threshold
    current_split.threshold = rand_uniform(
        min_feature_value,
        max_feature_value,
        random_state,
    )

In node_split_random in Splitter, it samples a random (uniform) threshold between min_feature_value and max_feature_value as above and find and set .pos according to it by partitioner. partition_samples. To find min_feature_valueandmax_feature_value, it uses partitioner.find_min_max, which runs iteratively each feature value and is basically the same as sorting. I was going to use that to get min_feature_valueandmax_feature_valuethen follow the same logic asnode_split_random. But I had a question: can we just randomly sample current_split.posfromrand_int(start+min_samples_leaf, end, &self.rand_r_state)`? Is that a bad sampling strategy? Is either better than the other? If the latter is an acceptable sampling method, it'll be faster, (and no sorting).

Copy link
Collaborator

@adam2392 adam2392 Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the question. You wouldn't randomly sample pos. Also finding min/max is not the same as sorting. Sorting is O(nlog(n)). Finding min/max is O(n).

The pseudocode for RandomSplitter.node_split is the following. The only difference is that compute_features() computes a projection of the data, rather than just choosing a column of the data. I think sklearn's random splitter does some extra stuff like keeping track of what columns contain constant values, whereas we don't do that.

# Some checks and stuff are missing, but this is the general gist, which is what's in sklearn's random splitting
for idx in range(max_features):
    # for each projection, compute the features over all samples
    # (n_samples,) vector
    current_feature_values = compute_features_over_samples

    # find min and max of current feature values
    min, max = find_min_max(current_feature_values)

    # draw random thresohld
    current_split.threshodl = rand_uniform(min, max, randstate)

    # partition samples given the threshold by moving feature values and samples inplace
    current_split.pos = partition_samples()

    # evaluate split
    criterion.reset()
    criterion.update(current_split.pos)

    current_proxy_improvement = self.criterion.proxy_impurity_improvement()
    if current_proxy_improvement > best_proxy_improvement:
        best_proxy_improvement = current_proxy_improvement
        best = current  # copy
    
    # reorganize samples inplace samples[start:best.pos] + samples[best.pos:end]
    # copy of the previous code

Possibly some good papers on this topic are the extra trees publication linked in the GH issue and the original random forest paper by Breiman.

@SUKI-O SUKI-O marked this pull request as ready for review July 8, 2023 05:13
@adam2392
Copy link
Collaborator

Hi @SUKI-O lmk when you feel this is ready for review. It might help if you are able to write a short example using simulated data to show how this is faster than the normal oblique best splitter when building an Oblique forest.

It should also ideally be similar in performance.

Some example for inspiration:

@adam2392 adam2392 mentioned this pull request Jul 13, 2023
5 tasks
@SUKI-O
Copy link
Member Author

SUKI-O commented Jul 14, 2023

Hi @SUKI-O lmk when you feel this is ready for review. It might help if you are able to write a short example using simulated data to show how this is faster than the normal oblique best splitter when building an Oblique forest.

It should also ideally be similar in performance.

Some example for inspiration:

Do you think I can/should use the dataset used for ObliqueRF like this to run the test/example? If not, any suggestions on what may work best for this application?

@adam2392
Copy link
Collaborator

Sure you can use that dataset.

I would also have an example comparing the decision boundaries on the iris dataset.

docs/whats_new/v0.1.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Cython code looks more correct to me now, but make sure you remove any extraneous code that is not being used.

In the examples, these are user-facing documentation that we want to hold to a pretty high standard because it should convey some "concept". In your case, it is comparing "extratrees*" vs "normal trees*" in some classification dataset.

We want to see how much time is saved vs how much performance changes.

Holding these examples to a high standard also helps you because if you ever wonder how to run something, you can refer back to the example and have high-quality sample code that you previously wrote.

adam2392 and others added 6 commits August 16, 2023 13:37
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Collaborator

@SUKI-O

I merged in changes from main and also fixed some unnecessary LOC that were added (#75 (comment)). If you want to re-run the exps. I suspect they will improve.

I think I fixed most of the CI errors in 338428a. So feel free to just rebuild and run the exps and if we see an improvement that we hypothesize, we can proceed.

@SUKI-O SUKI-O changed the title [WIP] Add "extratrees" for oblique trees #46 [ENH] Add "extratrees" for oblique trees #46 Sep 2, 2023
@SUKI-O SUKI-O requested a review from adam2392 September 3, 2023 05:06
@adam2392
Copy link
Collaborator

adam2392 commented Sep 3, 2023

https://output.circle-artifacts.com/output/job/80903942-02eb-4fcd-a7ca-0d25ec870888/artifacts/0/dev/auto_examples/plot_extra_oblique_random_forest.html#sphx-glr-auto-examples-plot-extra-oblique-random-forest-py

Shows EORF is slower. I think this could be due to the really low sample size you force in the cnae-9, har, lsvt datasets. Perhaps you can increase the sample size and run it.

You can build and run it locally to get something that makes sense.

@SUKI-O
Copy link
Member Author

SUKI-O commented Sep 3, 2023

Screenshot 2023-09-04 at 7 04 29 AM

I've tweeked around to get better results but for wide datasets like cnae-9 and lsvt, EORF is always slower. Does this make sense to you?

[Edit]
Screenshot 2023-09-04 at 7 47 40 AM

If I set max_depth = 3, max_features = 30, I could get to the above result. Is this legit?

@adam2392
Copy link
Collaborator

adam2392 commented Sep 5, 2023

Screenshot 2023-09-04 at 7 04 29 AM I've tweeked around to get better results but for wide datasets like cnae-9 and lsvt, EORF is always slower. Does this make sense to you?

[Edit] Screenshot 2023-09-04 at 7 47 40 AM

If I set max_depth = 3, max_features = 30, I could get to the above result. Is this legit?

Yes these make sense. I would add a note explaining that you set the max_depth=3 purely for demonstration purposes. In general, the max_depth can be tuned. The speedup is relative to sample size. O(n) in extra trees vs O(n log(n)) in normal trees. So limiting the sample size, we might see other factors play into the speed, which explains the fact that normal trees were actually faster (e.g. normal trees might've trained to smaller depths).

I would read the extra trees paper if that helps to communicate the key points: bias-variance-tradeoff, computational runtime. This is a good SO post too on the topic: https://stats.stackexchange.com/questions/175523/difference-between-random-forest-and-extremely-randomized-trees

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay from the exps, the general code seems fine!

There are a few issues tho to be addressed in order for this to get merged:

  1. unused code: if code is not used, it should not be committed. If I'm misunderstanding if certain pieces of code are actually used, lmk.
  2. examples: The examples can be improved: https://output.circle-artifacts.com/output/job/80903942-02eb-4fcd-a7ca-0d25ec870888/artifacts/0/dev/auto_examples/plot_extra_oblique_random_forest.html#sphx-glr-auto-examples-plot-extra-oblique-random-forest-py.

the examples should clearly convey some concepts. Right now, they are reworded versions of other examples. The example should guide the user to learning, or understanding something.

The iris example should clearly convey the differences between normal oblique decision trees and extra oblique decision trees. I.e. the only difference is the randomly chosen thresholds vs picking best thresholds and the trees are still useful.

The cc-18 example should clearly convey the tradeoff between choosing extra vs normal oblique decision trees.

examples/plot_decision_surface_iris.py Outdated Show resolved Hide resolved
examples/plot_decision_surface_iris.py Outdated Show resolved Hide resolved
examples/plot_decision_surface_iris.py Outdated Show resolved Hide resolved
examples/plot_decision_surface_iris.py Outdated Show resolved Hide resolved
examples/plot_extra_oblique_random_forest.py Outdated Show resolved Hide resolved
sktree/tree/_utils.pyx Outdated Show resolved Hide resolved
examples/plot_extra_oblique_random_forest.py Outdated Show resolved Hide resolved
Comment on lines 117 to 124
params = {
"max_features": None,
"n_estimators": 50,
"max_depth": 5,
"random_state": random_state,
"n_cv": 10,
"n_repeats": 1,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a note here regarding max_depth and the n_estimators being set low to allow the example to run on the CI. You can add some intuition on how these should be set normally.

@adam2392
Copy link
Collaborator

@SUKI-O I cleaned up the PR to remove any unnecessary code or documentation: d47f718

Lmk if this looks good to you and I will merge?

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Collaborator

I will merge this in for now since this PR has the biggest diff and there are other PRs on the queue, but if there are questions, or comments you have, feel free to open up a GH issue.

@adam2392 adam2392 merged commit 8770f55 into neurodata:main Sep 10, 2023
25 checks passed
@adam2392
Copy link
Collaborator

Thanks @SUKI-O !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants