GBM Interaction Gain and Variable Importance Inconsistency #6530

Open · exalate-issue-sync bot opened this issue Feb 21, 2023 · 3 comments
It seems that there is an inconsistency and possible inaccuracy in the calculation of variable importance for regression models.

Based on the documentation [here](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html#variable-importance-calculation-gbm-drf), the variable importance should be the decrease in squared error from a node and its child nodes. When I try to recreate that calculation using a simple decision stump, I cannot reconcile the values.

Additionally, there also seems to be an inconsistency compared to the output of h2o.feature_interaction. For a single-split tree, I would have expected the metrics to be the same; however, they differ, and oddly, the gain value is negative, which seems incorrect. A reprex is below.

Please let me know if anything about my approach is wrong. Thank you!

# set up
library(data.table)
library(h2o) # 3.36.1.4
h2o.init()
set.seed(12345)

# create dummy data
train_data <- data.frame(x1 = runif(n = 100, min = 1, max = 100), x2 = runif(n = 100, min = 1, max = 100))
train_data$y <- runif(n = 100)*10 + train_data$x1 * 1.5 +  train_data$x2 * -2 
train_data_h2o <- as.h2o(train_data)

# build dummy GBM model (decision tree)
gbm_model <- h2o.gbm(training_frame = train_data_h2o, x = c("x1","x2"), y = "y", ntrees = 1, max_depth = 1, min_rows = 1, seed = 12345)
# look at variable importance table
gbm_model@model$variable_importances

# generate predictions 
train_data_h2o <- h2o.cbind(train_data_h2o, h2o.predict(gbm_model, train_data_h2o))

# get single tree in GBM
tree <- h2o.getModelTree(model = gbm_model, tree_number = 1)
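
# (editor's note, a small optional check: the H2OTree object exposes the split
# variable and threshold directly, which confirms what the stump split on
# before the manual reconstruction below)
tree@features     # expected to show the split feature (here x2) at the root
tree@thresholds   # the split threshold referenced below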

# calculate predictions after first split on x2
train_data_h2o$first_split_pred <- h2o.ifelse(train_data_h2o$x2 >= tree@thresholds[1], gbm_model@model$init_f + tree@predictions[[3]], gbm_model@model$init_f + tree@predictions[[2]])
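
# (editor's note, a hedged sanity check: compare the manual reconstruction
# against the model's own predictions from the h2o.cbind above -- a nonzero
# difference would signal a mismatch, e.g. from float32 vs. float64 threshold
# precision as noted in the comments below)
h2o.sum(h2o.abs(train_data_h2o$predict - train_data_h2o$first_split_pred)) # ~0 expected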


########### Attempt to calculate relative_importance values from gbm_model@model$variable_importances
# first calculate each node's SSE 
# get SSE of root node
init_f_sse <- sum( (train_data_h2o$y - gbm_model@model$init_f)^2 ) # 518029.3

# calculate SSE from x2's child nodes
x2_right_sse <- sum((train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$first_split_pred)^2) # 225849
x2_left_sse <- sum((train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$first_split_pred)^2) # 244693.2

# x2's relative_importance manual calculation
init_f_sse - x2_right_sse - x2_left_sse # 47487.08

# compare to variable_importances table
gbm_model@model$variable_importances[1,2] # 249932 -- doesn't match above
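
# (editor's note, one hypothesis for the gap -- an assumption, not confirmed:
# h2o.gbm defaults to learn_rate = 0.1, so tree@predictions are already
# shrunk; the reported gain may instead be based on unshrunk leaf means of y.
# A sketch of that alternative calculation:)
df <- as.data.frame(train_data_h2o)
split <- df$x2 >= tree@thresholds[1]
sse_children <- sum((df$y[!split] - mean(df$y[!split]))^2) +
  sum((df$y[split] - mean(df$y[split]))^2)
init_f_sse - sse_children # compare against the reported relative_importance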

feat_int <- h2o.feature_interaction(gbm_model)

feat_int[[1]]$gain # -0.0078125 

exalate-issue-sync bot added the R label on Feb 21, 2023

Michal Kurka commented: Thank you for the report and your investigation - it does look suspicious.

I would just be cautious about trying to manually reconstruct the tree predictions; what I would do is use leaf node assignment to see where the observations actually end up: https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/predict_leaf_node_assignment.H2OModel.html
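
A minimal sketch of that approach, assuming the objects from the reprex above (the leaf-assignment column for the first tree is typically named T1.C1, but the name may vary by model):

leaf_assign <- h2o.predict_leaf_node_assignment(gbm_model, train_data_h2o, type = "Node_ID")
h2o.table(leaf_assign$T1.C1) # observation counts per leaf of the single tree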

Even a simple comparison using >= can in some cases give a different result due to precision differences (float32 vs. float64):

h2o.ifelse(train_data_h2o$x2 >= tree@thresholds[1], gbm_model@model$init_f + tree@predictions[[3]], gbm_model@model$init_f + tree@predictions[[2]])
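
To illustrate the precision point, here is a small base-R sketch (an illustration only, not H2O's internal code) that round-trips a double through float32, the way a stored split threshold would be represented:

as_f32 <- function(x) readBin(writeBin(x, raw(), size = 4), "numeric", size = 4, n = length(x))
x <- 57.123456789 # a float64 value near a split point
thr <- as_f32(x)  # the float32 rounding a stored threshold undergoes
x >= thr          # may flip depending on which way the rounding went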

We will take a look.

Dominick Sullivan commented: I’m the OP of this ticket (wasn’t signed in at the time of the post). Just wanted to follow up on the progress of this. Thanks again!

h2o-ops commented May 10, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8884
Assignee: Adam Valenta
Reporter: N/A
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
