GLM interaction predict with NPE #6540
Comments
Wendy Wong commented:
{noformat}
library(h2o) # version 3.36.0.3

data <- read.csv(system.file("extdata", "prostate.csv", package = "h2o"))

# creating fake factor variable for repr. example
data <- data %>% mutate(state = as.factor(state.abb[DPROS]),

# initialize h2o
h2o.init()

# generating training frame
data_train <- data[1:300,] %>% as.h2o()

# specify model vars and interactions
model_interaction_pairs <- list(c('numeric_var1', 'state'))

interaction_glm <- h2o.glm(y = "dependent_var",

# taking the first 6 observations in the data set
bad <- head(data)

# generates out of bounds error since AR isn't present in the data set.
bad <- bad %>%

# taking the first 7 observations in the data set, now AR is included.
good <- head(data, 7)

# no error since all states in train are present.
good <- good %>%

# to further illustrate:
unique(data[1:300,]$state) # states present in train
{noformat}
Wendy Wong commented: To preface, this question is specific to h2o package version {{3.36.0.3}}. I have not yet tested it on other versions, but for my purposes {{3.36.0.3}} is unfortunately mandatory.

In this reproducible example, every level of a categorical variable used in an interaction ({{state}}) must appear at least once in the new data for predictions to be generated without an error. For convenience's sake, we can pretend that the data sets {{good}} and {{bad}} are entirely new observations that the model has not seen. In the training set, {{state}} takes on the values AK, AL, AZ, or AR. For some reason, if one of these states is not present in the new data, {{h2o.predict()}} generates the error:

{noformat}
java.lang.RuntimeException: DistributedException from localhost: 'Index 6 out of bounds for length 3', caused by java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds for length 3
{noformat}

Predicting on this data set does work:

{noformat}
> good
{noformat}

Predicting on this data set does not, and returns the "java.lang.ArrayIndexOutOfBoundsException" error since AR is not present in the data:

{noformat}
> bad
{noformat}

I have some possible solutions to this, but they definitely aren't as convenient as I'd like.

1. Add rows to the new data set for every unique state in the training frame, and remove those rows after predicting. This works, but outside of this example I'd need to implement quite a few steps and checks to make it dynamic (i.e. 0 categorical interactions, >1 categorical interactions, and the corresponding unique values present in the training frame).
2. Modify the h2o model object to remove the variable and interaction variable that are not in use for the new data. I can't seem to get this to work, and it might be hard to make dynamic outside of this example. It is also probably not best practice to modify a model object.
3. Add the new data to the original set, predict on everything, then filter by some indicator. This is also not ideal since the data outside of this example is pretty big.

I don't quite understand why every level of a categorical variable must be present in the new data being predicted, since each prediction should depend only on its own row. Is this just a limitation of h2o, or am I missing some additional argument or alternative function? Are there any other ways to use categorical interactions in an h2o GLM when the new data doesn't include every category?
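Before calling {{h2o.predict()}}, it may help to check which training levels a new data set is missing. The following is a minimal sketch in base R only; {{missing_levels}} is a hypothetical helper name (not part of h2o), and the two vectors are made-up stand-ins for the {{state}} column in the training and new data.

```r
# Hypothetical helper (not an h2o function): return the levels that occur in
# the training column but not in the new-data column.
missing_levels <- function(train_col, new_col) {
  setdiff(unique(as.character(train_col)), unique(as.character(new_col)))
}

# Made-up stand-ins for the `state` column in training and new data.
train_state <- factor(c("AK", "AL", "AZ", "AR", "AL"))
new_state   <- factor(c("AK", "AL", "AZ"))

missing_levels(train_state, new_state)
# [1] "AR"
```

An empty result would mean every training level is present, which is the situation in which the report above says prediction succeeds.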
Wendy Wong commented:
{noformat}
# ugly fix 1 (works)
bad1 <- bind_rows(bad, data.table(unique(data[1:300,]$state)) %>% select(state = V1))
bad1_h2o <- bad1 %>% as.h2o()
bad1 %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm, bad1_h2o))) %>%
  filter(!is.na(dependent_var))

# ugly fix 2 (failed)
interaction_glm2 <- interaction_glm
interaction_glm2@model[["domains"]][[1]] <- paste0(unique(bad$state))
interaction_glm2@model[["domains"]][[2]] <- paste0(unique(bad$state))
interaction_glm2@model[["coefficients_table"]] <- interaction_glm@model[["coefficients_table"]] %>% filter(!grepl("AR",names))
interaction_glm2@model[["standardized_coefficient_magnitudes"]] <- interaction_glm@model[["standardized_coefficient_magnitudes"]] %>% filter(!grepl("AR",names))
interaction_glm2@model[["coefficients"]] <- interaction_glm@model[["coefficients"]][c(-4, -8)]
interaction_glm2@model[["model_summary"]][["number_of_predictors_total"]] <- interaction_glm@model[["model_summary"]][["number_of_predictors_total"]] - 2
interaction_glm2@model[["model_summary"]][["number_of_active_predictors"]] <- interaction_glm@model[["model_summary"]][["number_of_active_predictors"]] - 2

# doesn't work
bad2 <- bad %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm2, bad_h2o)))
{noformat}
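The padding approach in "ugly fix 1" could be made generic along the following lines. This is only a sketch under stated assumptions: {{pad_missing_levels}} and the {{is_padding}} column are hypothetical names (not part of h2o), the data frames are made-up stand-ins, and the new data is assumed to have at least one row to copy as a template. Only base R is used for the padding; the h2o calls appear as comments because they need a running cluster and the fitted model.

```r
# Hypothetical helper (not part of h2o): for each categorical column, append
# one placeholder row per training level missing from new_data and flag it.
pad_missing_levels <- function(new_data, train_data, cat_cols) {
  stopifnot(nrow(new_data) > 0)  # a template row is needed for the fillers
  new_data$is_padding <- FALSE
  for (col in cat_cols) {
    new_data[[col]] <- as.character(new_data[[col]])
    absent <- setdiff(unique(as.character(train_data[[col]])), new_data[[col]])
    if (length(absent) > 0) {
      # Copy the first row so the other columns hold valid values.
      filler <- new_data[rep(1, length(absent)), , drop = FALSE]
      filler[[col]] <- absent
      filler$is_padding <- TRUE
      new_data <- rbind(new_data, filler)
    }
    new_data[[col]] <- as.factor(new_data[[col]])
  }
  rownames(new_data) <- NULL
  new_data
}

# Made-up stand-ins for the training and new data.
train <- data.frame(state = factor(c("AK", "AL", "AZ", "AR")), x = c(1, 2, 3, 4))
new   <- data.frame(state = factor(c("AK", "AL", "AZ")), x = c(5, 6, 7))

padded <- pad_missing_levels(new, train, "state")

# With h2o running and the model from this issue, usage might look like this
# (not run here):
# pred <- as.vector(h2o.predict(interaction_glm, as.h2o(padded)))
# new$prediction <- pred[!padded$is_padding]
```

Passing several column names in {{cat_cols}} pads each column in turn, and passing an empty vector leaves the data unchanged apart from the flag column. Whether this avoids the error on versions other than {{3.36.0.3}} has not been checked here.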
JIRA Issue Details
Jira Issue: PUBDEV-8949