categorical_encoding - xgboost : Enum and Eigen does not work in h2o.grid(). #7052

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 3 comments

Comments

@exalate-issue-sync

Hi,

The "Enum" and "Eigen" categorical encodings do not work for "xgboost" in h2o.grid().
It can be reproduced with this:

#--------------
library(h2o)
h2o.init()

# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed ("YES") or not ("NO")
# original data can be found at http://www.transtats.bts.gov/

airlines <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors

airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines["FlightNum"] <- as.factor(airlines["FlightNum"])

# set the predictor names and the response column name

predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum")
response <- "IsDepDelayed"

# split into train and validation

airlines_splits <- h2o.splitFrame(data = airlines, ratios = 0.8, seed = 1234)
train <- airlines_splits[[1]]
valid <- airlines_splits[[2]]

#---- Grid Search
hyper_params <- list(
  categorical_encoding = c(
    'OneHotExplicit', 'OneHotInternal',
    'Binary', 'Eigen', 'SortByResponse',
    'EnumLimited', 'Enum', 'LabelEncoder'
  )
)

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: list(strategy = "RandomDiscrete")

# this XGBoost model uses early stopping once the validation AUC doesn't improve
# by at least 0.01% for 5 consecutive scoring events

grid <- h2o.grid(
  x = predictors, y = response, training_frame = train, validation_frame = valid,
  algorithm = "xgboost",
  grid_id = "air_xgboost", hyper_params = hyper_params,
  stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
  search_criteria = list(strategy = "Cartesian"), parallelism = 0,
  seed = 1234)

# Sort the grid models by AUC

sorted_grid <- h2o.getGrid("air_xgboost", sort_by = "auc", decreasing = TRUE)
sorted_grid

#--------------

Which produces this output:

sorted_grid
H2O Grid Details
================

Grid ID: air_xgboost
Used hyper parameters:
  • categorical_encoding
Number of models: 6
Number of failed models: 2

Hyper-Parameter Search Summary: ordered by decreasing auc
  categorical_encoding  model_ids                             auc
1 Binary                XGBoost_model_1647599361285_7139  0.74622
2 LabelEncoder          XGBoost_model_1647599361285_7144  0.74464
3 OneHotExplicit        XGBoost_model_1647599361285_7137  0.74074
4 OneHotInternal        XGBoost_model_1647599361285_7138  0.74024
5 EnumLimited           XGBoost_model_1647599361285_7142  0.62515
6 SortByResponse        XGBoost_model_1647599361285_7141  0.49867

Failed models

categorical_encoding  status_failed  msgs_failed
Eigen                 FAIL           "Override toEigenVec for this Algo!"
Enum                  FAIL           "Illegal argument(s) for XGBoost model: XGBoost_model_1647599361285_7143. Details: ERRR on field: _categorical_encoding: Enum encoding is not supported for XGBoost in current H2O.\n"

#----------- END OF OUTPUT -------

If this is the expected behaviour, the parameter's options should be updated.

Thanks!
Carlos Ortega (Spain)

@exalate-issue-sync exalate-issue-sync bot added the R label May 11, 2023
@exalate-issue-sync

Tomas Fryda commented: This seems to be expected behavior. For more information, see the XGBoost part of https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html.
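
As a possible workaround while these two encodings are rejected, the grid can be limited to the encodings that trained successfully in the report above. This is a minimal sketch in base R: the names `all_encodings` and `unsupported_for_xgboost` are hypothetical, and the unsupported set is an assumption taken only from the two failures shown in this issue, so it may differ in other H2O versions.

```r
# All encodings tried in the original grid search.
all_encodings <- c(
  "OneHotExplicit", "OneHotInternal",
  "Binary", "Eigen", "SortByResponse",
  "EnumLimited", "Enum", "LabelEncoder"
)

# Assumption: this set comes only from the failed models reported in this
# issue; check the categorical_encoding docs for the H2O version in use.
unsupported_for_xgboost <- c("Eigen", "Enum")

# setdiff() keeps the order of its first argument and drops the rejected values.
hyper_params <- list(
  categorical_encoding = setdiff(all_encodings, unsupported_for_xgboost)
)

print(hyper_params$categorical_encoding)
```

The resulting `hyper_params` list can be passed to the same `h2o.grid()` call used in the reproduction script, which should then build only the six encodings that did not fail there.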

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8626
Assignee: Adam Valenta
Reporter: N/A
State: Resolved
Fix Version: 3.36.0.4
Attachments: N/A
Development PRs: Available

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#6122
#6128

@h2o-ops h2o-ops closed this as completed May 14, 2023