ensemble tree shap #532

Merged: 1 commit merged into haifengl:shap on Apr 5, 2020

Conversation

@rayeaster (Contributor) commented Apr 5, 2020

implementation for #515

Similar (although not exactly the same) Python result for the Boston housing dataset:

CRIM      4.7090670e-01
ZN        1.1862384e-03
INDUS     3.6680367e-02
CHAS      0.0000000e+00
NOX       2.9865596e-01
RM        1.6684741e+00
AGE       3.4423586e-02
DIS       2.4549793e-01
RAD       1.1467161e-02
TAX       4.9007054e-02
PTRATIO   9.5596902e-02
B         8.1507929e-02
LSTAT     2.5720918e+00

[image: bostonShap, SHAP summary bar plot]

@haifengl merged commit 8a7f50c into haifengl:shap on Apr 5, 2020
@haifengl (Owner) commented Apr 6, 2020

Can you get the SHAP values from python for these settings?

GradientTreeBoost model = GradientTreeBoost.fit(BostonHousing.formula, BostonHousing.data, Loss.ls(), 100, 20, 10, 5, 0.05, 0.7);

Thanks
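
(Here the positional arguments presumably correspond to ntrees = 100, maxDepth = 20, maxNodes = 10, nodeSize = 5, shrinkage = 0.05, and subsample = 0.7, which is how the xgboost parameters below appear to have been chosen.)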

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Can you get the SHAP values from python for these settings?
>
> GradientTreeBoost model = GradientTreeBoost.fit(BostonHousing.formula, BostonHousing.data, Loss.ls(), 100, 20, 10, 5, 0.05, 0.7);
>
> Thanks

import xgboost
import shap

X, y = shap.datasets.boston()

params = {
    "eta": 0.05,
    "max_depth": 20,
    "subsample": 0.7,
    "tree_method": "hist",
    "grow_policy": "lossguide",
    "max_leaves ": 10
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)
CRIM      0.5188436
ZN        0.01572886
INDUS     0.14262857
CHAS      0.01477841
NOX       0.5418738
RM        2.651757
AGE       0.41095728
DIS       0.7134633
RAD       0.08578682
TAX       0.3313276
PTRATIO   0.52540994
B         0.2455673
LSTAT     3.7603402

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Thanks. But max_leaves should be bigger (at least 100); otherwise, we cannot go deep (reach max_depth).
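
(For reference, a balanced binary tree of depth d has up to 2^d leaves, so a 10-leaf cap limits a balanced tree to a depth of roughly 3 or 4, far short of max_depth = 20; with 100 leaves an unbalanced tree can grow much deeper.)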

@haifengl (Owner) commented Apr 8, 2020

tree_method should be "exact", which matches our implementation.

@rayeaster (Contributor, Author):

> tree_method should be "exact", which matches our implementation.

import xgboost
import shap

X, y = shap.datasets.boston()
params = {
    "eta": 0.05,
    "max_depth": 6,
    "subsample": 0.7,
    "tree_method": "exact",
    "max_leaves ": 100
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)
CRIM      0.56108105
ZN        0.01085366
INDUS     0.10574973
CHAS      0.0058878
NOX       0.58213335
RM        2.5767236
AGE       0.3808417
DIS       0.69005835
RAD       0.06147488
TAX       0.3422722
PTRATIO   0.5062163
B         0.20863737
LSTAT     3.7203836

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Thanks. Can you make max_depth = 20? Keep all other parameters as is.

@rayeaster (Contributor, Author):

> Thanks. Can you make max_depth = 20? Keep all other parameters as is.

No major change, as shown below:

CRIM      0.5673161
ZN        0.02324653
INDUS     0.14829211
CHAS      0.01077364
NOX       0.62497944
RM        2.586821
AGE       0.4203959
DIS       0.67440164
RAD       0.05955494
TAX       0.32137313
PTRATIO   0.5085526
B         0.24730903
LSTAT     3.7028806

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Our order roughly matches, but the values are different, especially LSTAT. How can their LSTAT and RM SHAP values be so big? Given eta = 0.05, the values cannot be that big.

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Our order roughly matches, but the values are different, especially LSTAT. How can their LSTAT and RM SHAP values be so big? Given eta = 0.05, the values cannot be that big.

Just in case I made a mistake somewhere, the full script is below (reproducible at https://repl.it/languages/python3).

Also, it seems max_leaves has no effect for tree_method=exact, per the xgboost docs (https://xgboost.readthedocs.io/en/latest/parameter.html):

max_leaves [default=0]
Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.

grow_policy [default= depthwise]
Controls a way new nodes are added to the tree.
Currently supported only if tree_method is set to hist.

import xgboost
import shap
import numpy as np

X,y = shap.datasets.boston()
params = {
    "eta": 0.05,
    "max_depth": 20, 
    "subsample": 0.7,
    "tree_method": "exact",
    "max_leaves ": 100
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# mean absolute SHAP value per feature (the statistic reported above)
shap_sum = np.sum(np.absolute(shap_values), axis=0)
shap_sum = shap_sum / y.size
print(shap_sum)    
        
shap.summary_plot(shap_values, X, plot_type="bar")
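
A side note on the xgboost snippets above: the "max_leaves " key is written with a trailing space, so xgboost most likely never sees that parameter at all, on top of it being ignored for tree_method=exact. A minimal sketch (values only illustrative) of a params dict under which max_leaves would actually take effect, per the doc excerpt above:

params = {
    "eta": 0.05,
    "max_depth": 20,
    "subsample": 0.7,
    "tree_method": "hist",       # max_leaves requires grow_policy=lossguide,
    "grow_policy": "lossguide",  # which is currently only supported with hist
    "max_leaves": 100            # note: no trailing space in the key
}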

@haifengl (Owner) commented Apr 8, 2020

I will ignore the exact values, as the XGBoost training algorithm may be very different from mine. Importantly, the order matches (only NOX and CRIM are switched). I checked my code many times yesterday; I am confident that it is correct. Thanks.

@haifengl (Owner) commented Apr 8, 2020

BTW, the settings here are not proper for GBM, especially the very deep and large trees (we set them large to give SHAP a complicated case). Can you get the SHAP values for a random forest? Random forests typically have large and deep trees. Thanks!

@rayeaster (Contributor, Author):

> BTW, the settings here are not proper for GBM, especially the very deep and large trees (we set them large to give SHAP a complicated case). Can you get the SHAP values for a random forest? Random forests typically have large and deep trees. Thanks!

you mean from sklearn or smile?

@haifengl (Owner) commented Apr 8, 2020

I have tried random forest. The SHAP values are around 2.2 for the top two features. This makes sense. For GBM, they cannot be that high. Can you get the SHAP values from sklearn for the settings below?

RandomForest model = RandomForest.fit(BostonHousing.formula, BostonHousing.data, 100, 3, 20, 100, 5, 1.0);

@haifengl (Owner) commented Apr 8, 2020

When you get the SHAP values in python, have you reset the RNG seed or restarted your python session every time? Just want to make sure that you use the same random number generator seed. Thanks.

@rayeaster (Contributor, Author):

> When you get the SHAP values in python, have you reset the RNG seed or restarted your python session every time? Just want to make sure that you use the same random number generator seed. Thanks.

For xgboost, the seed is set to 0 by default:

seed [default=0]
Random number seed. This parameter is ignored in R package, use set.seed() instead.
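
For what it is worth, a nonzero seed can also be pinned explicitly in the params dict (a sketch only; 42 is an arbitrary choice):

params["seed"] = 42
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)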

@haifengl (Owner) commented Apr 8, 2020

Can you please get the SHAP values for random forest? Thanks.

0 has no entropy; it is the worst seed for a random number generator. Anyway, we don't need to worry about it for this task.

@rayeaster (Contributor, Author):

> I have tried random forest. The SHAP values are around 2.2 for the top two features. This makes sense. For GBM, they cannot be that high. Can you get the SHAP values from sklearn for the settings below?
>
> RandomForest model = RandomForest.fit(BostonHousing.formula, BostonHousing.data, 100, 3, 20, 100, 5, 1.0);

You are right. RF with sklearn gives a similar result, around 2.2 for the top 2 features.

import shap
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data[:,:]
y = boston.target

regressor = RandomForestRegressor(n_estimators=100,       # 100 trees
                                  max_depth=20,            # maximum tree depth 20
                                  min_samples_split=5,     # node size 5
                                  max_features=0.33,       # ~1/3 of the 13 features per split (mtry = 3)
                                  max_leaf_nodes=100,      # at most 100 leaf nodes
                                  random_state=0, 
                                  n_jobs=-1)

model = regressor.fit(X, y)


# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap_sum = np.sum(np.absolute(shap_values),axis=0)
shap_sum = shap_sum / y.size
print(shap_sum)    
       
shap.summary_plot(shap_values, X, plot_type="bar")

[image: SHAP summary bar plot (sklearn random forest)]

@haifengl (Owner) commented Apr 8, 2020

Are the features in the same order as in the GBM case? Thanks.

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Are the features in the same order as in the GBM case? Thanks.

ordering for RF:

LSTAT
RM
NOX
INDUS
PTRATIO
CRIM
TAX
DIS
AGE
B
RAD
ZN
CHAS

@haifengl (Owner) commented Apr 8, 2020

RAD is missing
