Identify and implement correct behavior for “remove_collinear_columns = TRUE” with “lambda > 0” #7577
Comments
Wendy commented: The GLM parameter remove_collinear_columns, when using the IRLSM solver, removes collinear columns from the Gram matrix when there is no regularization (lambda = 0) or when there is only an L2 penalty (lambda > 0 and alpha = 0). In the current implementation, if remove_collinear_columns = TRUE and lambda > 0 with 0 < alpha < 1, only the L2 penalty is applied during the optimization process. Normally the L1 and L2 penalties are used to combat collinear columns, but there are cases where we do not want regularization at all, such as when we want to compute p-values. In addition, an L2 penalty alone forces coefficients to be small but not sparse; the L1 penalty is the regularization that forces sparse coefficients.
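The distinction above (L2 shrinks, L1 sparsifies) can be seen directly from the elastic-net penalty lambda * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2). A minimal NumPy sketch, purely illustrative and not H2O code, comparing the L1 proximal step (soft-thresholding, which produces exact zeros) with pure ridge shrinkage (which only scales coefficients):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2)."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: sets z to exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta = np.array([0.03, -0.5, 2.0])
# L1 proximal step zeros out small coefficients -> sparse solution
print(soft_threshold(beta, 0.1))
# Pure L2 (ridge) shrinkage only rescales -> small but never exactly zero
print(beta / (1.0 + 0.1))
```

This is why building the model with only the L2 penalty (as happens with remove_collinear_columns = TRUE and 0 < alpha < 1) cannot reproduce the sparse coefficients an L1 penalty would give.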
Wendy commented: Hence, in the current implementation, if you set alpha and lambda to non-zero values between 0 and 1, runs with remove_collinear_columns = TRUE and FALSE will generate different results, and this is expected. Say lambda = 0.01 and alpha = 0.5: with remove_collinear_columns = TRUE, only the L2 penalty is used in building the GLM model; with remove_collinear_columns = FALSE, both the L1 and L2 penalties are used. This is what causes the two runs to yield different coefficients. It is the expected behavior and it is correct.
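The coefficient difference between the two penalty regimes can be demonstrated with a small self-contained experiment. This sketch uses a simple proximal-gradient (ISTA) solver for the elastic-net objective; it is an illustration of the mathematics, not H2O's IRLSM solver, and all names here are made up for the example:

```python
import numpy as np

def elastic_net_ista(X, y, lam, alpha, steps=5000):
    """Proximal-gradient (ISTA) solver for
    (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2).
    Illustrative textbook algorithm, not H2O's IRLSM."""
    n, p = X.shape
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam * (1.0 - alpha))
    b = np.zeros(p)
    for _ in range(steps):
        grad = X.T @ (X @ b - y) / n + lam * (1.0 - alpha) * b
        z = b - lr * grad
        b = np.sign(z) * np.maximum(np.abs(z) - lr * lam * alpha, 0.0)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, 0.0, 0.0, -2.0]) + 0.1 * rng.normal(size=200)

b_l2_only = elastic_net_ista(X, y, lam=0.5, alpha=0.0)  # L2 only, as when RCC=TRUE
b_l1_l2 = elastic_net_ista(X, y, lam=0.5, alpha=0.5)    # L1 + L2, as when RCC=FALSE
print(b_l2_only)  # all coefficients shrunk, none exactly zero
print(b_l1_l2)    # noise coefficients driven to exactly zero
```

The two fits disagree on the same data, mirroring the behavior the comment describes for the two remove_collinear_columns settings.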
Wendy commented: This code will generate different models, because when remove_collinear_columns = TRUE there is no regularization, while when remove_collinear_columns = FALSE the L1 penalty is used to build the model (only a fragment of the snippet survived the JIRA import):

```r
data("iris")
# Set up Response and Predictors
response <- c("Petal.Width")
# ... (remainder of the snippet was lost in the import)
glmnet.h2o.wo.rcc@model$coefficients_table
```
Wendy commented: This code will generate the same model because no regularization is applied with either remove_collinear_columns = TRUE or FALSE. If there are no collinear columns, no columns are removed, and the two models should produce the same coefficients on the iris dataset (fragment):

```r
glmnet.h2o.w.rcc.lambda0 <- h2o.glm(
# ... (remainder of the snippet was lost in the import)
glmnet.h2o.w.rcc.lambda0@model$coefficients_table
glmnet.h2o.wo.rcc.lambda0@model$coefficients_table
```
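The "no collinear columns, so nothing is removed" case hinges on detecting linear dependence among predictors. A minimal NumPy sketch of that idea, greedily keeping only columns that increase the matrix rank; this mimics the removal in spirit only and is not H2O's Gram-matrix implementation:

```python
import numpy as np

def drop_collinear_columns(X, tol=1e-8):
    """Keep each column only if adding it increases the rank of the kept set.
    Spirit of collinear-column removal; not H2O's actual algorithm."""
    keep = []
    for j in range(X.shape[1]):
        candidate = X[:, keep + [j]]
        if np.linalg.matrix_rank(candidate, tol=tol) == len(keep) + 1:
            keep.append(j)
    return keep

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X = np.hstack([X, 2 * X[:, [0]] - X[:, [1]]])  # column 3 = 2*col0 - col1 (collinear)
print(drop_collinear_columns(X))  # -> [0, 1, 2]
```

On full-rank predictors (as in the iris example above with its four numeric columns), the kept set is all columns, so the TRUE and FALSE runs see identical data.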
Wendy commented: The following example produces the same models regardless of the remove_collinear_columns setting. The reason is that the solver in this case is coordinate_descent, not IRLSM (fragment):

```r
setwd(normalizePath(dirname(R.utils::commandArgs(asValues=TRUE)$"f")))

check.glm.remove.collinear.columns <- function() {
  # Import a sample binary outcome train/test set into R
  data("iris")
  set.seed(999)
  data <- iris
  # Set up Response and Predictors
  response <- c("Petal.Width")
  # create beta constraints
  constraints <- as.h2o(data.frame(names = predictor,
                                   lower_bounds = rep(-10000, length(predictor)),
                                   upper_bounds = rep(10000, length(predictor))))
  # ... (remainder of the snippet was lost in the import)
  glmnet.h2o.lambda0@model$coefficients_table
}
```
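For context on the solver mentioned above: coordinate descent updates one coefficient at a time with a closed-form soft-threshold step. A self-contained NumPy sketch of the textbook cyclic algorithm for the elastic net; this illustrates the method generically and is not H2O's coordinate_descent solver:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, sweeps=200):
    """Cyclic coordinate descent for
    (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding column j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.0]) + 0.05 * rng.normal(size=100)
print(elastic_net_cd(X, y, lam=0.1, alpha=0.5))
```

Because this solver handles collinearity through the penalty itself rather than through a Gram matrix, it is plausible that remove_collinear_columns has no effect on it, consistent with the comment above.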
Wendy commented: I shall probably add a warning for the case where remove_collinear_columns = TRUE but solver != IRLSM.
JIRA Issue: PUBDEV-8072
Linked PRs from JIRA
In some cases, "remove_collinear_columns = TRUE" resets lambda to 0, but not always.