Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TEST PR] Adding oblique trees (i.e. Forest-RC) to cythonized tree module #11

Closed
wants to merge 26 commits into from

Conversation

adam2392
Copy link
Collaborator

@adam2392 adam2392 commented Sep 4, 2021

Reference Issues/PRs

Fixes:

What does this implement/fix? Explain your changes.

Adds cythonized oblique trees to the tree module. This is known as Forest-RC in the Breiman 2001 paper.

  • _oblique_tree.pxd/pyx: This file implements i) the ObliqueTree(Tree), which defines a few additional class members for storing the projection weight and indices and a new function for adding and oblique node and then ii) the ObliqueTreeBuilder(TreeBuilder), which defines how to build the oblique tree.
  • _oblique_splitter.pxd/pyx: This is the main change, which i) defines an ObliqueSplitRecord for keeping track of oblique splits, and ii) defines an ObliqueSplitter(Splitter) which gets oblique node splits and samples projection matrices, while also storing additional hyperparameters.
  • _classes.py: Defines new Python interfaces for the Oblique trees and forests

Any other comments?

I'm not an expert in cython and c++ interplay, but I suspect that if we can "generalize" the Node struct to carry projection vector and weight information (not used in Forest-RI, or axis-aligned Random Forest), then much of the tree, tree building code is not even necessary. The only thing that is different at a fundamental level is the idea of a sample_proj_mat at each node of the tree, which samples sparse combinations.

Another missing component currently is the implementation on sparse data, but this should be easily added in I presume.

Code That Can Be Shortened

New data structures and classes, ObliqueSplitRecord, ObliqueSplitter, ObliqueTree are defined. However, if the existing SplitRecord, Splitter, Tree can be generalized, then the existing functions can just be used to build Oblique Trees too.

However, I'm not sure if the scikit-learn devs would want that, rather then just replicating some code across these two cases?

Copy link

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prototype, I think I see what is required to make it easier for you to extend scikit-learn's trees.

In the redesign, I think methods such as tree._apply_dense and DenseObliqueSplitter.node_split would still need to exist to perform your custom logic + data structures.

Comment on lines 209 to 213
node_id = tree._add_oblique_node(parent, is_left, is_leaf, split.feature,
split.threshold, impurity, n_node_samples,
weighted_n_node_samples,
split.proj_vec_weights,
split.proj_vec_indices)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a different API than the scikit-learn. As you mentioned in office hours, to better support this use case, tree._add_node should accept a SpiltRecord and SplitRecord is a class that can be subclassed. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes if SplitRecord can become a class that can be subclassed, then this becomes easier. However, I wasn't sure if the struct was there for performance, or maintenance reasons.

(impurity <= min_impurity_decrease))

if not is_leaf:
splitter.oblique_node_split(impurity, &split, &n_constant_features)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same API as scikit-learn. As long as SplitRecord can be subclassed, I think this should just work with your custom splitter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what I specced out at least initially is that almost all of the Tree class currently in scikit-learn can be used for any "new tree" that only changes the splitter function as long as SplitRecord and the type of splitter can be subclassed.

Since the tree doesn't do anything fancy except call Splitter functions.

@adam2392
Copy link
Collaborator Author

Some notes as of 3/10/22

Have been working on a refactor that would make adding oblique trees to the existing codebase a lot simpler.

Some notes for further improvement:

@adam2392 adam2392 closed this Jun 13, 2023
@adam2392 adam2392 deleted the obliquetrees branch June 13, 2023 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants