
Add sklearn's HistGradientBoosting #64

Closed
interesaaat opened this issue May 11, 2020 · 13 comments · Fixed by #87
@interesaaat
Collaborator

No description provided.

@interesaaat added the good first issue label May 11, 2020
@ahmedkrmn
Contributor

Hi Matteo,
Can you give more context on how I can approach the implementation of this feature?

@interesaaat
Collaborator Author

Hi Ahmed,
thanks for showing interest in Hummingbird! @ksaur has created a branch with some tests. If you pull the branch and run the test file, you should get something like the following:

hummingbird.ml.exceptions.MissingConverter: Unable to find converter for model type <class 'sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingClassifier'>. It usually means the pipeline being converted contains a transformer or a predictor with no corresponding converter implemented. Please fill an issue at https://github.com/microsoft/hummingbird.

Next, to add a new operator in Hummingbird:

  • Go into hummingbird.ml.supported and add the HistGradientBoostingClassifier class to _build_sklearn_operator_list (and to the documentation at the beginning of the file, please!). This basically tells Hummingbird to recognize the new operator.
  • Now we need to actually add the operator converter. You can add it to hummingbird.ml.operator_converters.gbdt.py (just copy and paste the last line in the file, and change "SklearnGradientBoostingClassifier" to "SklearnHistGradientBoostingClassifier"). This tells Hummingbird that, to convert HistGradientBoostingClassifier, it can use the same function as GradientBoostingClassifier. This will probably not work :)
  • The final step is to make the converter work. This will require some work on your side. Basically, you can copy what we have for GradientBoostingClassifier in convert_sklearn_gbdt_classifier into a new convert_sklearn_hist_gbdt_classifier function and try to map the tree parameters into the format understood by convert_sklearn_gbdt_classifier. You don't need to do anything more than this: convert_sklearn_gbdt_classifier will already pick the PyTorch tree implementation for you, so there is no need to go deeper or to implement anything in PyTorch.
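The registration idea in the second bullet can be pictured with a tiny, self-contained sketch. The registry and function bodies here are illustrative stand-ins, not Hummingbird's actual API; the point is only that two operator names can map to the same converter function:

```python
# Illustrative sketch (not Hummingbird's real registration code): a registry
# maps operator names to converter functions, and the hist variant can
# initially reuse the existing GBDT converter, as suggested in step 2.
CONVERTER_REGISTRY = {}

def register_converter(name, converter):
    CONVERTER_REGISTRY[name] = converter

def convert_sklearn_gbdt_classifier(model):
    # Stand-in for the real converter in hummingbird.ml.operator_converters.gbdt
    return f"converted {model} via gbdt path"

register_converter("SklearnGradientBoostingClassifier",
                   convert_sklearn_gbdt_classifier)
# Step 2: point the new operator name at the same function.
register_converter("SklearnHistGradientBoostingClassifier",
                   convert_sklearn_gbdt_classifier)

print(CONVERTER_REGISTRY["SklearnHistGradientBoostingClassifier"]("hist-gbdt"))
```

Step 3 then replaces the shared entry with a dedicated function that remaps the tree parameters before delegating.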

Please share any doubts or questions you may have!

@NicolasHug

I'm one of the authors of the histgradientboosting estimators; feel free to ping me if you have any questions related to them!

@interesaaat
Collaborator Author

Glad to see you here, Nicolas :)

@ahmedkrmn
Contributor

Thanks @interesaaat for the detailed introduction and @NicolasHug for offering help!

I've installed the dependencies and built the library using python setup.py install.
Everything is working fine. I ran the test_sklearn_histgbdt_converters.py file you mentioned, and it indeed produces the error message you described.

Anyway, I started implementing the convert_sklearn_hist_gbdt_classifier() function, but there is a problem that I would like your thoughts on:

First, I believe that the equivalent of tree_infos = operator.raw_operator.estimators_ from the convert_sklearn_gbdt_classifier() function would be tree_infos = operator.raw_operator._predictors in the new convert_sklearn_hist_gbdt_classifier().

The problem is that estimators_ is an array of DecisionTreeRegressor objects, each of which has a tree_ property. This tree_ object has the following properties:
children_left, children_right, feature, threshold, and value.

On the other hand, _predictors is an array of TreePredictor objects, each of which has only one property, nodes (an array of nodes).

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

@NicolasHug

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

A tree object has many arrays of size n_nodes, i.e. one array per property, as you noticed (children_left, children_right, etc.).

However the predictor object of the hist-GBDT is different: it's a single structured numpy array, i.e. it's an array whose elements have a specific dtype with multiple entries. It's basically an array of structs, if we were in C.
The dtype is specified here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/common.pyx#L18

For example the threshold property of the root can be accessed via nodes[0]['threshold']. Its left child is in nodes[nodes[0]['left']], etc.
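The access pattern described above can be demonstrated with a small numpy structured array. The dtype below is a simplified, illustrative stand-in (field names chosen to match the discussion), not the full PREDICTOR_RECORD_DTYPE from the sklearn source linked above:

```python
import numpy as np

# Simplified stand-in for the hist-GBDT predictor record dtype: one struct
# per node, stored contiguously in a single structured array.
node_dtype = np.dtype([
    ('value', np.float64),
    ('threshold', np.float64),
    ('feature_idx', np.uint32),
    ('left', np.uint32),
    ('right', np.uint32),
    ('is_leaf', np.uint8),
])

# Three nodes: a root that splits on feature 2 at 0.5, plus two leaf children.
nodes = np.zeros(3, dtype=node_dtype)
nodes[0] = (0.0, 0.5, 2, 1, 2, 0)   # root
nodes[1] = (-1.2, 0.0, 0, 0, 0, 1)  # left leaf
nodes[2] = (0.8, 0.0, 0, 0, 0, 1)   # right leaf

root_threshold = nodes[0]['threshold']   # field access on one element: 0.5
left_child = nodes[nodes[0]['left']]     # follow the 'left' index to the child
print(root_threshold, left_child['value'], bool(left_child['is_leaf']))
```

So where the plain GBDT exposes one array per property, here each property is one field inside a single array of node records.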

@ahmedkrmn
Contributor

ahmedkrmn commented May 30, 2020

Thanks @NicolasHug for the clarification!

IIUC, the equivalent of:

tree_info = operator.raw_operator.estimators_[0][0]
lefts = tree_info.tree_.children_left

should be:

tree_info = operator.raw_operator._predictors[0][0]
lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]

when using hist-GBDT.


If that is the case, it seems like nodes which don't have left children are represented with 0 instead of -1:
lefts for GBDT: [1, 2, -1, -1, 5, -1, -1]
lefts for hist-GBDT: [1, 2, 0, 0, 5, 0, 0]

@NicolasHug

Yup I think you got it right

The array of nodes is initialized with all fields being 0. If a node doesn't have a left child that means it doesn't have a right child either so the left/right fields are 0, and the is_leaf field is True/1.

@interesaaat
Collaborator Author

I think that for left, right, and threshold we should have -1 instead of 0 in Hummingbird, because the implementation looks for -1 values. (I might be wrong, but I am on the phone and having a hard time checking the code.) Anyway, this is not hard :) Thanks, Nicolas, for the help!

@ahmedkrmn
Contributor

OK great,
but wouldn't that require changing all the conditions that compare against -1 to compare against 0 in

def _find_max_depth(tree_parameters):

and

def _find_depth(node, current_depth):

@ahmedkrmn
Contributor

ahmedkrmn commented May 30, 2020

@interesaaat would it be a good idea to compare against both -1 and 0 (instead of -1 only) in _find_max_depth() and leave the -1 base case of the recursion as is for _find_depth(), or is there a better approach you'd suggest?

@NicolasHug

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?

lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]
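This "convert upstream" idea can be sketched end to end with a small structured array. The dtype is again a simplified stand-in for sklearn's real predictor record dtype, and using is_leaf (rather than testing for 0 directly) follows the earlier observation that leaves have left/right of 0 and is_leaf set:

```python
import numpy as np

# Simplified stand-in for the hist-GBDT node record dtype.
node_dtype = np.dtype([('value', np.float64), ('threshold', np.float64),
                       ('left', np.int64), ('right', np.int64),
                       ('is_leaf', np.uint8)])

# The 7-node tree from the example above: leaves hold 0 in left/right.
nodes = np.zeros(7, dtype=node_dtype)
nodes['left'] = [1, 2, 0, 0, 5, 0, 0]
nodes['right'] = [4, 3, 0, 0, 6, 0, 0]
nodes['is_leaf'] = [0, 0, 1, 1, 0, 1, 1]

def with_minus_one_sentinel(children, is_leaf):
    # Replace the hist-GBDT 0 sentinel with the -1 sentinel the existing
    # GBDT code path expects, keyed on is_leaf rather than the value 0.
    return [-1 if leaf else int(c) for c, leaf in zip(children, is_leaf)]

lefts = with_minus_one_sentinel(nodes['left'], nodes['is_leaf'])
rights = with_minus_one_sentinel(nodes['right'], nodes['is_leaf'])
print(lefts)   # [1, 2, -1, -1, 5, -1, -1]
print(rights)  # [4, 3, -1, -1, 6, -1, -1]
```

After this normalization the arrays look exactly like the non-hist tree_.children_left / children_right, so the existing converter logic should apply unchanged.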

@ahmedkrmn
Contributor

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?

lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]

Yeah, that's better!
