ensemble tree shap #532

Merged: 1 commit merged into haifengl:shap on Apr 5, 2020

Conversation

@rayeaster (Contributor) commented Apr 5, 2020

implementation for #515

Similar (although not exactly the same) Python result for the Boston housing dataset:

CRIM      4.7090670e-01
ZN        1.1862384e-03
INDUS     3.6680367e-02
CHAS      0.0000000e+00
NOX       2.9865596e-01
RM        1.6684741e+00
AGE       3.4423586e-02
DIS       2.4549793e-01
RAD       1.1467161e-02
TAX       4.9007054e-02
PTRATIO   9.5596902e-02
B         8.1507929e-02
LSTAT     2.5720918e+00

[image: bostonShap, SHAP summary bar plot]

@haifengl merged commit 8a7f50c into haifengl:shap on Apr 5, 2020
@haifengl (Owner) commented Apr 6, 2020

Can you get the SHAP values from python for these settings?

GradientTreeBoost model = GradientTreeBoost.fit(BostonHousing.formula, BostonHousing.data, Loss.ls(), 100, 20, 10, 5, 0.05, 0.7);

Thanks
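
(Here the positional arguments presumably correspond to ntrees = 100, maxDepth = 20, maxNodes = 10, nodeSize = 5, shrinkage = 0.05, and subsample = 0.7, which is how the xgboost parameters below appear to have been chosen.)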

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Can you get the SHAP values from python for these settings?
>
> GradientTreeBoost model = GradientTreeBoost.fit(BostonHousing.formula, BostonHousing.data, Loss.ls(), 100, 20, 10, 5, 0.05, 0.7);
>
> Thanks

import xgboost
import shap

X, y = shap.datasets.boston()

params = {
    "eta": 0.05,
    "max_depth": 20,
    "subsample": 0.7,
    "tree_method": "hist",
    "grow_policy": "lossguide",
    "max_leaves ": 10
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)
CRIM      0.5188436
ZN        0.01572886
INDUS     0.14262857
CHAS      0.01477841
NOX       0.5418738
RM        2.651757
AGE       0.41095728
DIS       0.7134633
RAD       0.08578682
TAX       0.3313276
PTRATIO   0.52540994
B         0.2455673
LSTAT     3.7603402

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Thanks. But max_leaves should be bigger (at least 100); otherwise, we cannot go deep (reach max_depth).
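
(For reference, a balanced binary tree of depth d has up to 2^d leaves, so a 10-leaf cap limits a balanced tree to a depth of roughly 3 or 4, far short of max_depth = 20; with 100 leaves an unbalanced tree can grow much deeper.)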

@haifengl (Owner) commented Apr 8, 2020

tree_method should be "exact", which matches our implementation.

@rayeaster (Contributor, Author):

> tree_method should be "exact", which matches our implementation.

import xgboost
import shap

X, y = shap.datasets.boston()
params = {
    "eta": 0.05,
    "max_depth": 6,
    "subsample": 0.7,
    "tree_method": "exact",
    "max_leaves ": 100
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)
CRIM      0.56108105
ZN        0.01085366
INDUS     0.10574973
CHAS      0.0058878
NOX       0.58213335
RM        2.5767236
AGE       0.3808417
DIS       0.69005835
RAD       0.06147488
TAX       0.3422722
PTRATIO   0.5062163
B         0.20863737
LSTAT     3.7203836

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Thanks. Can you make max_depth = 20? Keep all other parameters as is.

@rayeaster (Contributor, Author):

> Thanks. Can you make max_depth = 20? Keep all other parameters as is.

No major change, as shown below:

CRIM      0.5673161
ZN        0.02324653
INDUS     0.14829211
CHAS      0.01077364
NOX       0.62497944
RM        2.586821
AGE       0.4203959
DIS       0.67440164
RAD       0.05955494
TAX       0.32137313
PTRATIO   0.5085526
B         0.24730903
LSTAT     3.7028806

[image: SHAP summary bar plot]

@haifengl (Owner) commented Apr 8, 2020

Our order roughly matches, but the values are different, especially LSTAT. How can their LSTAT and RM SHAP values be so big? Given eta = 0.05, the values cannot be that big.

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Our order roughly matches, but the values are different, especially LSTAT. How can their LSTAT and RM SHAP values be so big? Given eta = 0.05, the values cannot be that big.

Just in case I made a mistake somewhere, the full script is below (reproducible at https://repl.it/languages/python3).

Also, it seems max_leaves has no effect for tree_method=exact, per the xgboost docs (https://xgboost.readthedocs.io/en/latest/parameter.html):

max_leaves [default=0]
Maximum number of nodes to be added. Only relevant when grow_policy=lossguide is set.

grow_policy [default= depthwise]
Controls a way new nodes are added to the tree.
Currently supported only if tree_method is set to hist.

import xgboost
import shap
import numpy as np

X,y = shap.datasets.boston()
params = {
    "eta": 0.05,
    "max_depth": 20, 
    "subsample": 0.7,
    "tree_method": "exact",
    "max_leaves ": 100
}
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# mean absolute SHAP value per feature (the statistic reported above)
shap_sum = np.sum(np.absolute(shap_values), axis=0)
shap_sum = shap_sum / y.size
print(shap_sum)    
        
shap.summary_plot(shap_values, X, plot_type="bar")
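
A side note on the xgboost snippets above: the "max_leaves " key is written with a trailing space, so xgboost most likely never sees that parameter at all, on top of it being ignored for tree_method=exact. A minimal sketch (values only illustrative) of a params dict under which max_leaves would actually take effect, per the doc excerpt above:

params = {
    "eta": 0.05,
    "max_depth": 20,
    "subsample": 0.7,
    "tree_method": "hist",       # max_leaves requires grow_policy=lossguide,
    "grow_policy": "lossguide",  # which is currently only supported with hist
    "max_leaves": 100            # note: no trailing space in the key
}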

@haifengl (Owner) commented Apr 8, 2020

I will ignore the exact values, as the XGBoost training algorithm may be very different from mine. Importantly, the order matches (only NOX and CRIM are switched). I checked my code many times yesterday; I am confident that it is correct. Thanks.

@haifengl (Owner) commented Apr 8, 2020

BTW, the settings here are not proper for GBM, especially the very deep and large trees (we set them large to give SHAP a complicated case). Can you get the SHAP values for a random forest? Random forests typically have large and deep trees. Thanks!

@rayeaster (Contributor, Author):

> BTW, the settings here are not proper for GBM, especially the very deep and large trees (we set them large to give SHAP a complicated case). Can you get the SHAP values for a random forest? Random forests typically have large and deep trees. Thanks!

you mean from sklearn or smile?

@haifengl (Owner) commented Apr 8, 2020

I have tried random forest. The SHAP values are around 2.2 for the top two features. This makes sense. For GBM, they cannot be that high. Can you get the SHAP values from sklearn for the settings below?

RandomForest model = RandomForest.fit(BostonHousing.formula, BostonHousing.data, 100, 3, 20, 100, 5, 1.0);

@haifengl (Owner) commented Apr 8, 2020

When you get the SHAP values in python, have you reset the RNG seed or restarted your python session every time? Just want to make sure that you use the same random number generator seed. Thanks.

@rayeaster (Contributor, Author):

> When you get the SHAP values in python, have you reset the RNG seed or restarted your python session every time? Just want to make sure that you use the same random number generator seed. Thanks.

For xgboost, the seed is set to 0 by default:

seed [default=0]
Random number seed. This parameter is ignored in R package, use set.seed() instead.
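
For what it is worth, a nonzero seed can also be pinned explicitly in the params dict (a sketch only; 42 is an arbitrary choice):

params["seed"] = 42
model = xgboost.train(params, xgboost.DMatrix(X, label=y), 100)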

@haifengl (Owner) commented Apr 8, 2020

Can you please get the SHAP values for random forest? Thanks.

0 has no entropy; it is the worst seed for a random number generator. Anyway, we don't need to worry about it for this task.

@rayeaster (Contributor, Author):

> I have tried random forest. The SHAP values are around 2.2 for the top two features. This makes sense. For GBM, they cannot be that high. Can you get the SHAP values from sklearn for the settings below?
>
> RandomForest model = RandomForest.fit(BostonHousing.formula, BostonHousing.data, 100, 3, 20, 100, 5, 1.0);

You are right. RF with sklearn gives a similar result, around 2.2 for the top 2 features.

import shap
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets

boston = datasets.load_boston()
X = boston.data[:,:]
y = boston.target

regressor = RandomForestRegressor(n_estimators=100,       # 100 trees
                                  max_depth=20,            # maximum tree depth 20
                                  min_samples_split=5,     # node size 5
                                  max_features=0.33,       # ~1/3 of the 13 features per split (mtry = 3)
                                  max_leaf_nodes=100,      # at most 100 leaf nodes
                                  random_state=0, 
                                  n_jobs=-1)

model = regressor.fit(X, y)


# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap_sum = np.sum(np.absolute(shap_values),axis=0)
shap_sum = shap_sum / y.size
print(shap_sum)    
       
shap.summary_plot(shap_values, X, plot_type="bar")

[image: SHAP summary bar plot (sklearn random forest)]

@haifengl (Owner) commented Apr 8, 2020

Are the features in the same order as in the GBM case? Thanks.

@rayeaster (Contributor, Author) commented Apr 8, 2020

> Are the features in the same order as in the GBM case? Thanks.

ordering for RF:

LSTAT
RM
NOX
INDUS
PTRATIO
CRIM
TAX
DIS
AGE
B
RAD
ZN
CHAS

@haifengl (Owner) commented Apr 8, 2020

RAD is missing
