It seems that there is an inconsistency, and a possible inaccuracy, in the calculation of variable importance for regression models.
Based on the documentation (https://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html#variable-importance-calculation-gbm-drf), the variable importance should be the decrease in squared error from a node to its child nodes. When I try to recreate that calculation using a simple decision stump, I cannot reconcile the values.
Additionally, there also seems to be an inconsistency compared to the output of h2o.feature_interaction. For a single-split tree, I would have expected the metrics to be the same; however, they are different, and oddly, the gain value is negative, which seems incorrect. Please find the code with a reprex below.
Please let me know if anything about my approach is wrong. Thank you!
# set up
library(data.table)
library(h2o) # 3.36.1.4
h2o.init()
set.seed(12345)
# create dummy data
train_data <- data.frame(x1 = runif(n = 100, min = 1, max = 100), x2 = runif(n = 100, min = 1, max = 100))
train_data$y <- runif(n = 100)*10 + train_data$x1 * 1.5 + train_data$x2 * -2
train_data_h2o <- as.h2o(train_data)
# build dummy GBM model (decision tree)
gbm_model <- h2o.gbm(training_frame = train_data_h2o, x = c("x1","x2"), y = "y", ntrees = 1, max_depth = 1, min_rows = 1, seed = 12345)
# look at variable importance table
gbm_model@model$variable_importances
# generate predictions
train_data_h2o <- h2o.cbind(train_data_h2o, h2o.predict(gbm_model, train_data_h2o))
# get single tree in GBM
tree <- h2o.getModelTree(model = gbm_model, tree_number = 1)
# calculate predictions after first split on x2
train_data_h2o$first_split_pred <- h2o.ifelse(train_data_h2o$x2 >= tree@thresholds[1], gbm_model@model$init_f + tree@predictions[[3]], gbm_model@model$init_f + tree@predictions[[2]])
########### Attempt to calculate relative_importance values from gbm_model@model$variable_importances
# first calculate each node's SSE
# get SSE of root node
init_f_sse <- sum( (train_data_h2o$y - gbm_model@model$init_f)^2 ) # 518029.3
# calculate SSE from x2's child nodes
x2_right_sse <- sum((train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 >= tree@thresholds[1],]$first_split_pred)^2) # 225849
x2_left_sse <- sum((train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$y - train_data_h2o[train_data_h2o$x2 < tree@thresholds[1],]$first_split_pred)^2) # 244693.2
# x2's relative_importance manual calculation
init_f_sse - x2_right_sse - x2_left_sse # 47487.08
# compare to variable_importances table
gbm_model@model$variable_importances[1,2] # 249932 -- doesn't match above
feat_int <- h2o.feature_interaction(gbm_model)
feat_int[[1]]$gain # -0.0078125
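As a sanity check independent of H2O, the documented squared-error reduction for a decision stump can be computed in plain base R. This is only a sketch of the formula as I read it from the docs (SSE of the parent minus SSE of the two children, each node predicting its own mean); it is not necessarily the exact quantity H2O reports internally:

```r
# Hedged sketch: SSE-reduction importance for a single split, base R only.
# Assumes each node predicts the mean of the observations it contains.
sse <- function(y) sum((y - mean(y))^2)

stump_gain <- function(y, x, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  sse(y) - sse(left) - sse(right)  # decrease in squared error from the split
}

# Same dummy data as in the reprex above
set.seed(12345)
x1 <- runif(100, 1, 100); x2 <- runif(100, 1, 100)
y  <- runif(100) * 10 + x1 * 1.5 - x2 * 2
stump_gain(y, x2, median(x2))  # gain for an illustrative split at the median
```

One caveat: h2o.gbm defaults to learn_rate = 0.1, so the single tree is fit to y - init_f and its leaf values are shrunk before prediction. Whether H2O computes the reported importance before or after that shrinkage is an assumption I cannot confirm from the docs, and it may account for part of the discrepancy.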
Michal Kurka commented: Thank you for the report and your investigation - it does look suspicious.
I would just be cautious about trying to manually reconstruct the tree predictions; what I would do is use leaf node assignment to see where the observations actually end up: https://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/predict_leaf_node_assignment.H2OModel.html
Even a simple comparison using >= can in some cases give a different result due to differences in precision (float32 vs. float64).
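To illustrate the suggestion above (a sketch; it assumes the gbm_model and train_data_h2o objects from the reprex), h2o.predict_leaf_node_assignment asks H2O which leaf each row falls into, avoiding the need to replay the >= comparison by hand:

```r
# Let H2O report the leaf each observation lands in, rather than
# reconstructing the split manually (which can differ due to precision).
leaf_assign <- h2o.predict_leaf_node_assignment(gbm_model, train_data_h2o)
h2o.table(leaf_assign[, 1])  # row counts per leaf of the single tree
```

Comparing these counts against the manual x2 >= threshold partition would show whether the float32/float64 precision issue is actually in play here.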
Dominick Sullivan commented: I'm the OP of this ticket (I wasn't signed in at the time of the post). Just wanted to follow up on the progress of this. Thanks again!