Revert XGBoost categorical encodings for contributions #7613

exalate-issue-sync · 2023-05-11T17:17:28Z

For example, the following needs to print True so that the contributions can be used to make SHAP plots, but it prints False because XGBoost one-hot encodes the categorical column:

{code:python}import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the factors
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
prostate["AGE"] = prostate["AGE"].asfactor()
predictors = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
response = "CAPSULE"

model = H2OXGBoostEstimator(ntrees=100, learn_rate=0.01, seed=2019)
model.train(training_frame=prostate, x=predictors, y=response)

# Calculate SHAP values using predict_contributions
contributions = model.predict_contributions(prostate)

# Convert the H2O Frame to use with shap's visualization functions
contributions_matrix = contributions.as_data_frame().values

# SHAP values are calculated for all features; the last column is dropped
shap_values = contributions_matrix[:, 0:-1]

# Should be True, but AGE is expanded into one column per level
X = prostate[predictors].as_data_frame()
print(X.values.shape == shap_values.shape){code}
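
To see where the shape mismatch comes from, it can help to compare the contribution column names against the predictor list. Here is a minimal sketch in plain Python. It assumes the one-hot encoded columns are named `<feature>.<level>` (which is what the `split(".")` in the workaround relies on) and that the bias column is called `BiasTerm`. The helper `expanded_features` and the example column names are hypothetical, not real output from the model.

```python
from collections import Counter


def expanded_features(contrib_columns, bias_column="BiasTerm"):
    """Return {feature: column_count} for features spread over several columns.

    Assumes one-hot columns are named "<feature>.<level>"; a predictor whose
    own name contains a "." would be misattributed by this simple split.
    """
    counts = Counter(
        name.split(".")[0] for name in contrib_columns if name != bias_column
    )
    return {feature: n for feature, n in counts.items() if n > 1}


# Hypothetical column names, for illustration only.
example_columns = ["ID", "AGE.43", "AGE.44", "AGE.missing(NA)", "RACE", "PSA", "BiasTerm"]
print(expanded_features(example_columns))
```

With a real model, `contributions.columns` could be passed in place of `example_columns`.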

Current workaround:

{code:python}import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the factors
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
prostate["AGE"] = prostate["AGE"].asfactor()
predictors = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
response = "CAPSULE"

model = H2OXGBoostEstimator(ntrees=100, learn_rate=0.01, seed=2019)
model.train(training_frame=prostate, x=predictors, y=response)

# Calculate SHAP values using predict_contributions
contributions = model.predict_contributions(prostate)

# Sum one-hot contribs: group columns by the name before the first "."
groups = {c: [] for c in [x.split(".")[0] for x in contributions.columns]}
for c in contributions.columns:
    groups[c.split(".")[0]].append(c)

# Sum each group row-wise and bind the results into one frame
rc = None
for k, v in groups.items():
    c = contributions[v].sum(axis=1, return_frame=True)
    c.columns = [k]
    if rc is None:
        rc = c
    else:
        rc = rc.cbind(c)

# Convert the H2O Frame to use with shap's visualization functions
contributions_matrix = rc.as_data_frame().values

# SHAP values are calculated for all features
shap_values = contributions_matrix[:, 0:-1]

# The expected value is the last returned column
expected_value = contributions_matrix[:, -1].min()

X = prostate[predictors].as_data_frame()
print(X.values.shape == shap_values.shape){code}
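
The same grouping step can also be done after converting to pandas, which avoids the repeated `cbind` calls. This is a rough sketch and not the fix that was shipped. It assumes the same `<feature>.<level>` naming and a bias column called `BiasTerm`. The helper `sum_one_hot_contributions` and the toy frame are made up for illustration.

```python
import pandas as pd


def sum_one_hot_contributions(contribs, bias_column="BiasTerm"):
    """Sum contribution columns that share the text before the first '.'.

    Column order follows the first occurrence of each base feature name.
    """
    groups = {}
    for col in contribs.columns:
        base = col if col == bias_column else col.split(".")[0]
        groups.setdefault(base, []).append(col)
    # Row-wise sum within each group gives one column per original feature
    return pd.DataFrame(
        {base: contribs[cols].sum(axis=1) for base, cols in groups.items()}
    )


# Toy contributions with AGE split over two one-hot columns (made-up numbers).
toy = pd.DataFrame({
    "ID": [0.1, 0.2],
    "AGE.43": [0.5, 0.0],
    "AGE.44": [0.0, -0.3],
    "PSA": [1.0, 2.0],
    "BiasTerm": [-0.4, -0.4],
})
summed = sum_one_hot_contributions(toy)
print(summed)
```

With the real model, `contributions.as_data_frame()` would take the place of `toy`. Because SHAP contributions are additive, summing the per-level columns leaves each row's total unchanged.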

h2o-ops · 2023-05-14T19:22:53Z

JIRA Issue Details

Jira Issue: PUBDEV-8035
Assignee: Michal Kurka
Reporter: Joseph Granados
State: Resolved
Fix Version: 3.32.1.1
Attachments: N/A
Development PRs: N/A

h2o-ops added the fixVersion/3.32.1.1 label May 14, 2023

h2o-ops closed this as completed May 14, 2023
