
Add sklearn's HistGradientBoosting #64

Closed
interesaaat opened this issue May 11, 2020 · 13 comments · Fixed by #87
@interesaaat
Collaborator

No description provided.

@interesaaat added the good first issue label May 11, 2020
@ahmedkrmn
Contributor

Hi Matteo,
Can you give more context on how I can approach the implementation of this feature?

@interesaaat
Collaborator Author

Hi Ahmed,
thanks for showing interest in Hummingbird! @ksaur has created a branch with some tests. If you pull the branch and run the test file, you should get something like the following:

hummingbird.ml.exceptions.MissingConverter: Unable to find converter for model type <class 'sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingClassifier'>. It usually means the pipeline being converted contains a transformer or a predictor with no corresponding converter implemented. Please fill an issue at https://github.com/microsoft/hummingbird.

Next, to add a new operator in Hummingbird:

  • Go into hummingbird.ml.supported and add the HistGradientBoostingClassifier class to _build_sklearn_operator_list (and to the documentation at the beginning of the file, please!). This basically tells Hummingbird to recognize the new operator.
  • Now we need to actually add the operator converter. You can add it to hummingbird.ml.operator_converters.gbdt.py (just copy and paste the last line in the file, and change "SklearnGradientBoostingClassifier" to "SklearnHistGradientBoostingClassifier"). This tells Hummingbird that, to convert HistGradientBoostingClassifier, it can use the same function as GradientBoostingClassifier. This will probably not work :)
  • The final step is to make the converter work. This will require some work on your side. Basically, you can copy what we have for GradientBoostingClassifier in convert_sklearn_gbdt_classifier into a new convert_sklearn_hist_gbdt_classifier function and try to map the tree parameters into the format understood by convert_sklearn_gbdt_classifier. You don't need to do anything more than this: convert_sklearn_gbdt_classifier will already pick the PyTorch tree implementation for you, so there is no need to go deeper or to implement anything in PyTorch.
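The registration idea in the second bullet can be pictured with a tiny, self-contained sketch. The registry and function bodies here are illustrative stand-ins, not Hummingbird's actual API; the point is only that two operator names can map to the same converter function:

```python
# Illustrative sketch (not Hummingbird's real registration code): a registry
# maps operator names to converter functions, and the hist variant can
# initially reuse the existing GBDT converter, as suggested in step 2.
CONVERTER_REGISTRY = {}

def register_converter(name, converter):
    CONVERTER_REGISTRY[name] = converter

def convert_sklearn_gbdt_classifier(model):
    # Stand-in for the real converter in hummingbird.ml.operator_converters.gbdt
    return f"converted {model} via gbdt path"

register_converter("SklearnGradientBoostingClassifier",
                   convert_sklearn_gbdt_classifier)
# Step 2: point the new operator name at the same function.
register_converter("SklearnHistGradientBoostingClassifier",
                   convert_sklearn_gbdt_classifier)

print(CONVERTER_REGISTRY["SklearnHistGradientBoostingClassifier"]("hist-gbdt"))
```

Step 3 then replaces the shared entry with a dedicated function that remaps the tree parameters before delegating.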

Please share any doubts or questions you may have!

@NicolasHug

I'm one of the authors of the histgradientboosting estimators; feel free to ping me if you have any questions related to them!

@interesaaat
Collaborator Author

Glad to see you here, Nicolas :)

@ahmedkrmn
Contributor

Thanks @interesaaat for the detailed introduction and @NicolasHug for offering help!

I've installed the dependencies and built the library using python setup.py install.
Everything is working fine. I ran the test_sklearn_histgbdt_converters.py file you mentioned, and it indeed produces the error message you described.

Anyway, I started implementing the convert_sklearn_hist_gbdt_classifier() function, but there is a problem that I would like your thoughts on:

First, I believe that the equivalent of tree_infos = operator.raw_operator.estimators_ from the convert_sklearn_gbdt_classifier() function would be tree_infos = operator.raw_operator._predictors in the new convert_sklearn_hist_gbdt_classifier().

The problem is that estimators_ is an array of DecisionTreeRegressor objects, each of which has a tree_ property. This tree_ object has the following properties:
children_left, children_right, feature, threshold, and value.

On the other hand, _predictors is an array of TreePredictor objects, each of which has only one property, nodes (an array of nodes).

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

@NicolasHug

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

A tree object has many arrays of size n_nodes, i.e. one array per property, as you noticed (children_left, children_right, etc.).

However the predictor object of the hist-GBDT is different: it's a single structured numpy array, i.e. it's an array whose elements have a specific dtype with multiple entries. It's basically an array of structs, if we were in C.
The dtype is specified here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/common.pyx#L18

For example the threshold property of the root can be accessed via nodes[0]['threshold']. Its left child is in nodes[nodes[0]['left']], etc.
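The access pattern described above can be demonstrated with a small numpy structured array. The dtype below is a simplified, illustrative stand-in (field names chosen to match the discussion), not the full PREDICTOR_RECORD_DTYPE from the sklearn source linked above:

```python
import numpy as np

# Simplified stand-in for the hist-GBDT predictor record dtype: one struct
# per node, stored contiguously in a single structured array.
node_dtype = np.dtype([
    ('value', np.float64),
    ('threshold', np.float64),
    ('feature_idx', np.uint32),
    ('left', np.uint32),
    ('right', np.uint32),
    ('is_leaf', np.uint8),
])

# Three nodes: a root that splits on feature 2 at 0.5, plus two leaf children.
nodes = np.zeros(3, dtype=node_dtype)
nodes[0] = (0.0, 0.5, 2, 1, 2, 0)   # root
nodes[1] = (-1.2, 0.0, 0, 0, 0, 1)  # left leaf
nodes[2] = (0.8, 0.0, 0, 0, 0, 1)   # right leaf

root_threshold = nodes[0]['threshold']   # field access on one element: 0.5
left_child = nodes[nodes[0]['left']]     # follow the 'left' index to the child
print(root_threshold, left_child['value'], bool(left_child['is_leaf']))
```

So where the plain GBDT exposes one array per property, here each property is one field inside a single array of node records.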

@ahmedkrmn
Contributor

ahmedkrmn commented May 30, 2020

Thanks @NicolasHug for the clarification!

IIUC, the equivalent of:

tree_info = operator.raw_operator.estimators_[0][0]
lefts = tree_info.tree_.children_left

should be:

tree_info = operator.raw_operator._predictors[0][0]
lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]

when using hist-GBDT.


If that is the case, it seems like nodes which don't have left children are represented with 0 instead of -1:
lefts for GBDT: [1, 2, -1, -1, 5, -1, -1]
lefts for hist-GBDT: [1, 2, 0, 0, 5, 0, 0]

@NicolasHug

Yup I think you got it right

The array of nodes is initialized with all fields being 0. If a node doesn't have a left child that means it doesn't have a right child either so the left/right fields are 0, and the is_leaf field is True/1.

@interesaaat
Collaborator Author

I think that for left, right, and threshold we should have -1 instead of 0 in Hummingbird, because the implementation looks for -1 values. (I might be wrong, but I am on the phone and having a hard time checking the code.) Anyway, this is not hard :) Thanks, Nicolas, for the help!

@ahmedkrmn
Contributor

OK great,
but wouldn't that require changing all the conditions that compare against -1 to compare against 0 in

def _find_max_depth(tree_parameters):

and

def _find_depth(node, current_depth):

@ahmedkrmn
Contributor

ahmedkrmn commented May 30, 2020

@interesaaat would it be a good idea to compare against both -1 and 0 (instead of -1 only) in _find_max_depth() and leave the -1 base case of the recursion as is for _find_depth(), or is there a better approach you'd suggest?

@NicolasHug

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?

lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]
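This "convert upstream" idea can be sketched end to end with a small structured array. The dtype is again a simplified stand-in for sklearn's real predictor record dtype, and using is_leaf (rather than testing for 0 directly) follows the earlier observation that leaves have left/right of 0 and is_leaf set:

```python
import numpy as np

# Simplified stand-in for the hist-GBDT node record dtype.
node_dtype = np.dtype([('value', np.float64), ('threshold', np.float64),
                       ('left', np.int64), ('right', np.int64),
                       ('is_leaf', np.uint8)])

# The 7-node tree from the example above: leaves hold 0 in left/right.
nodes = np.zeros(7, dtype=node_dtype)
nodes['left'] = [1, 2, 0, 0, 5, 0, 0]
nodes['right'] = [4, 3, 0, 0, 6, 0, 0]
nodes['is_leaf'] = [0, 0, 1, 1, 0, 1, 1]

def with_minus_one_sentinel(children, is_leaf):
    # Replace the hist-GBDT 0 sentinel with the -1 sentinel the existing
    # GBDT code path expects, keyed on is_leaf rather than the value 0.
    return [-1 if leaf else int(c) for c, leaf in zip(children, is_leaf)]

lefts = with_minus_one_sentinel(nodes['left'], nodes['is_leaf'])
rights = with_minus_one_sentinel(nodes['right'], nodes['is_leaf'])
print(lefts)   # [1, 2, -1, -1, 5, -1, -1]
print(rights)  # [4, 3, -1, -1, 6, -1, -1]
```

After this normalization the arrays look exactly like the non-hist tree_.children_left / children_right, so the existing converter logic should apply unchanged.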

@ahmedkrmn
Contributor

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?

lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]

Yeah, that's better!
