GBM Interaction Gain and Variable Importance Inconsistency #6530

Open · exalate-issue-sync bot opened this issue Feb 21, 2023 · 3 comments
It seems that there is an inconsistency and possible inaccuracy in the calculation of variable importance for regression models.

Based on the documentation [here](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html#variable-importance-calculation-gbm-drf), the variable importance should be the decrease in squared error from a node and its child nodes. When I try to recreate that calculation using a simple decision stump, I cannot reconcile the values.

Additionally, there also seems to be an inconsistency compared to the output of h2o.feature_interaction. For a single-split tree, I would have expected the metrics to be the same; however, they differ, and oddly, the gain value is negative, which seems incorrect. A reprex is below.

Please let me know if anything about my approach is wrong. Thank you!

# set up
library(data.table)
library(h2o) # 3.36.1.4
h2o.init()
set.seed(12345)

# create dummy data
train_data <- data.frame(x1 = runif(n = 100, min = 1, max = 100), x2 = runif(n = 100, min = 1, max = 100))
train_data$y <- runif(n = 100)*10 + train_data$x1 * 1.5 +  train_data$x2 * -2 
train_data_h2o <- as.h2o(train_data)

# build dummy GBM model (decision tree)
gbm_model <- h2o.gbm(training_frame = train_data_h2o, x = c("x1","x2"), y = "y", ntrees = 1, max_depth = 1, min_rows = 1, seed = 12345)
# look at variable importance table
gbm_model@model$variable_importances

# generate predictions 
train_data_h2o <- h2o.cbind(train_data_h2o, h2o.predict(gbm_model, train_data_h2o))

# get single tree in GBM
tree <- h2o.getModelTree(model = gbm_model, tree_number = 1)
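
# (editor's note, a small optional check: the H2OTree object exposes the split
# variable and threshold directly, which confirms what the stump split on
# before the manual reconstruction below)
tree@features     # expected to show the split feature (here x2) at the root
tree@thresholds   # the split threshold referenced below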

# calculate predictions after first split on x2
train_data_h2o$first_split_pred <- h2o.ifelse(train_data_h2o$x2 >= tree@thresholds[1], gbm_model@model$init_f + tree@predictions[[3]], gbm_model@model$init_f + tree@predictions[[2]])
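
# (editor's note, a hedged sanity check: compare the manual reconstruction
# against the model's own predictions from the h2o.cbind above -- a nonzero
# difference would signal a mismatch, e.g. from float32 vs. float64 threshold
# precision as noted in the comments below)
h2o.sum(h2o.abs(train_data_h2o$predict - train_data_h2o$first_split_pred)) # ~0 expected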


########### Attempt to calculate relative_importance values from gbm_model@model$variable_importances
# first calculate each node's SSE 
# get SSE of root node
init_f_sse <- sum( (train_data_h2o$y - gbm_model@model$init_f)^2 ) # 518029.3

# calculate SSE from x2's child nodes
x2_right_sse <- sum((train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$first_split_pred)^2) # 225849
x2_left_sse <- sum((train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$first_split_pred)^2) # 244693.2

# x2's relative_importance manual calculation
init_f_sse - x2_right_sse - x2_left_sse # 47487.08

# compare to variable_importances table
gbm_model@model$variable_importances[1,2] # 249932 -- doesn't match above
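
# (editor's note, one hypothesis for the gap -- an assumption, not confirmed:
# h2o.gbm defaults to learn_rate = 0.1, so tree@predictions are already
# shrunk; the reported gain may instead be based on unshrunk leaf means of y.
# A sketch of that alternative calculation:)
df <- as.data.frame(train_data_h2o)
split <- df$x2 >= tree@thresholds[1]
sse_children <- sum((df$y[!split] - mean(df$y[!split]))^2) +
  sum((df$y[split] - mean(df$y[split]))^2)
init_f_sse - sse_children # compare against the reported relative_importance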

feat_int <- h2o.feature_interaction(gbm_model)

feat_int[[1]]$gain # -0.0078125 

exalate-issue-sync bot added the R label on Feb 21, 2023

Michal Kurka commented: Thank you for the report and your investigation - it does look suspicious.

I would just be cautious about trying to manually reconstruct the tree predictions; what I would do is use leaf node assignment to see where the observations actually end up: https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/predict_leaf_node_assignment.H2OModel.html
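
A minimal sketch of that approach, assuming the objects from the reprex above (the leaf-assignment column for the first tree is typically named T1.C1, but the name may vary by model):

leaf_assign <- h2o.predict_leaf_node_assignment(gbm_model, train_data_h2o, type = "Node_ID")
h2o.table(leaf_assign$T1.C1) # observation counts per leaf of the single tree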

Even a simple comparison using >= can in some cases give a different result due to precision differences (float32 vs. float64):

h2o.ifelse(train_data_h2o$x2 >= tree@thresholds[1], gbm_model@model$init_f + tree@predictions[[3]], gbm_model@model$init_f + tree@predictions[[2]])
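
To illustrate the precision point, here is a small base-R sketch (an illustration only, not H2O's internal code) that round-trips a double through float32, the way a stored split threshold would be represented:

as_f32 <- function(x) readBin(writeBin(x, raw(), size = 4), "numeric", size = 4, n = length(x))
x <- 57.123456789 # a float64 value near a split point
thr <- as_f32(x)  # the float32 rounding a stored threshold undergoes
x >= thr          # may flip depending on which way the rounding went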

We will take a look.

Dominick Sullivan commented: I’m the OP of this ticket (wasn’t signed in at the time of the post). Just wanted to follow up on the progress of this. Thanks again!

h2o-ops commented May 10, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8884
Assignee: Adam Valenta
Reporter: N/A
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
