Revert XGBoost categorical encodings for contributions #7613

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 1 comment

Comments

@exalate-issue-sync

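For example, the following snippet needs to print True before the contributions can be used for SHAP plots, but it does not, because XGBoost one-hot encodes the categorical column (AGE), so the contributions frame has more columns than the predictor frame: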

{code:python}import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the factors:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
prostate["AGE"] = prostate["AGE"].asfactor()
predictors = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
response = "CAPSULE"

model = H2OXGBoostEstimator(ntrees = 100, learn_rate = 0.01, seed=2019)
model.train(training_frame=prostate, x=predictors, y=response)

# calculate SHAP values using function predict_contributions
contributions = model.predict_contributions(prostate)

# convert the H2O Frame to use with shap's visualization functions
contributions_matrix = contributions.as_data_frame().values

# shap values are calculated for all features (the last column is the bias term)
shap_values = contributions_matrix[:,0:-1]

X = prostate[predictors].as_data_frame()
print(X.values.shape == shap_values.shape){code}

Current workaround:

{code:python}import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the factors:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
prostate["AGE"] = prostate["AGE"].asfactor()
predictors = ["ID","AGE","RACE","DPROS","DCAPS","PSA","VOL","GLEASON"]
response = "CAPSULE"

model = H2OXGBoostEstimator(ntrees = 100, learn_rate = 0.01, seed=2019)
model.train(training_frame=prostate, x=predictors, y=response)

# calculate SHAP values using function predict_contributions
contributions = model.predict_contributions(prostate)

# sum one-hot contribs: group columns by the name before the first "."
groups = {c: [] for c in [x.split(".")[0] for x in contributions.columns]}
for c in contributions.columns:
    groups[c.split(".")[0]].append(c)

rc = None
for k,v in groups.items():
    c = contributions[v].sum(axis=1, return_frame=True)
    c.columns = [k]
    if rc is None:
        rc = c
    else:
        rc = rc.cbind(c)

# convert the H2O Frame to use with shap's visualization functions
contributions_matrix = rc.as_data_frame().values

# shap values are calculated for all features
shap_values = contributions_matrix[:,0:-1]

# expected values is the last returned column
expected_value = contributions_matrix[:,-1].min()

X = prostate[predictors].as_data_frame()
print(X.values.shape == shap_values.shape){code}
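
The grouping step in the workaround can be checked without an H2O cluster. The following is a minimal sketch in plain Python, assuming that one-hot contribution columns are named `<feature>.<level>` (which is what the workaround's split on "." relies on). The column names, the values, and the helper `sum_onehot_contributions` are made up for illustration and are not part of the H2O API.

```python
def sum_onehot_contributions(columns, rows, sep="."):
    """Collapse one-hot contribution columns back to one column per feature.

    columns: contribution column names, e.g. "AGE.43" (assumed naming scheme)
    rows:    list of rows, each a list of floats aligned with `columns`
    Returns (feature_names, summed_rows), features in first-seen order.
    """
    # Group column positions by the feature name before the first separator.
    groups = {}
    for position, name in enumerate(columns):
        groups.setdefault(name.split(sep, 1)[0], []).append(position)

    feature_names = list(groups)
    summed_rows = [
        [sum(row[p] for p in groups[feature]) for feature in feature_names]
        for row in rows
    ]
    return feature_names, summed_rows


# Hypothetical contributions: AGE is one-hot encoded, PSA is numeric.
columns = ["AGE.43", "AGE.65", "AGE.missing(NA)", "PSA", "BiasTerm"]
rows = [
    [0.25, 0.0, 0.0, -0.5, 1.0],
    [0.0, -0.125, 0.0625, 0.75, 1.0],
]

names, summed = sum_onehot_contributions(columns, rows)
print(names)
print(summed)
```

Note that the grouped columns come out in the order they first appear in the contributions, which is not necessarily the order of `predictors`. Comparing shapes alone does not catch that, so the columns may need to be reordered before they are passed to a SHAP plot together with `X`.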

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8035
Assignee: Michal Kurka
Reporter: Joseph Granados
State: Resolved
Fix Version: 3.32.1.1
Attachments: N/A
Development PRs: N/A
