
Identify and implement correct behavior for “remove_collinear_columns = TRUE” with “lambda > 0” #7577

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 8 comments

Comments

@exalate-issue-sync

In some cases, “remove_collinear_columns = TRUE” resets “lambda” to 0, but not always.

@exalate-issue-sync (Author)

Wendy commented: The GLM parameter remove_collinear_columns, when using the IRLSM solver, is used to remove collinear columns from the Gram matrix when there is no regularization (lambda = 0) or when there is only an L2 penalty (lambda > 0 and alpha = 0). In fact, if remove_collinear_columns = TRUE and both lambda and alpha are non-zero with 0 < alpha < 1, only the L2 penalty is applied during the optimization process when building the GLM model.

Normally the L1 and L2 penalties are used to combat collinear columns. However, there are cases where we don’t want regularization at all, such as when we want to compute p-values. In addition, an L2 penalty alone forces coefficients to be small but not sparse; it is the L1 penalty that forces sparse coefficients.
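
For reference, the p-value use case mentioned above looks like this (a minimal sketch; it assumes a running H2O cluster and uses the same iris columns as the examples below):

{noformat}# Sketch: p-values require no regularization (lambda = 0) and the IRLSM
# solver; remove_collinear_columns = TRUE drops linearly dependent columns
# from the Gram matrix so the standard errors are well defined.
library(h2o)
h2o.init()
data("iris")
iris.data <- as.h2o(iris)
glm.pvalues <- h2o.glm(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length"),
  y = "Petal.Width",
  training_frame = iris.data,
  solver = "IRLSM",
  lambda = 0,
  compute_p_values = TRUE,
  remove_collinear_columns = TRUE
)
glm.pvalues@model$coefficients_table{noformat}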

@exalate-issue-sync (Author)

Wendy commented: Hence, in the current implementation, if you set alpha and lambda to non-zero values between 0 and 1, runs with remove_collinear_columns = TRUE and FALSE will generate different results, and this is expected.

When, say, lambda = 0.01, alpha = 0.5, and remove_collinear_columns = TRUE, only the L2 penalty is used in building the GLM model. If remove_collinear_columns = FALSE, both L1 and L2 penalties are used. This is what causes the two runs to yield different coefficients; it is the expected and correct behavior.
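
For example, with the predictor, response, and iris.data setup from the code below, the pair of runs described above would look like this (a sketch; the model names are illustrative):

{noformat}# lambda = 0.01, alpha = 0.5: with remove_collinear_columns = TRUE only
# the L2 penalty is applied, so the coefficients differ from the FALSE
# run, which applies both the L1 and L2 penalties.
m.rcc.true <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                      alpha = 0.5, lambda = 0.01,
                      remove_collinear_columns = TRUE, seed = 1)
m.rcc.false <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                       alpha = 0.5, lambda = 0.01,
                       remove_collinear_columns = FALSE, seed = 1)
m.rcc.true@model$coefficients_table
m.rcc.false@model$coefficients_table{noformat}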

@exalate-issue-sync (Author)

Wendy commented: This code will generate different models because when remove_collinear_columns = TRUE there is no regularization (with alpha = 1 the penalty is pure L1, which is dropped), while when remove_collinear_columns = FALSE the L1 penalty is used to build the model:

{noformat}data("iris")
data <- iris
iris.data <- as.h2o(data)

Set up Response and Predictors

response <- c("Petal.Width")
predictor <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
glmnet.h2o.wo.rcc <- h2o.glm(
x = predictor,
y = response,
training_frame = iris.data,
alpha = 1,
lambda_search = FALSE,
lambda = .01,
max_iterations = 200,
standardize = TRUE,
seed=1
)
glmnet.h2o.w.rcc <- h2o.glm(
x = predictor,
y = response,
training_frame = iris.data,
alpha = 1,
lambda_search = FALSE,
lambda = .01,
max_iterations = 200,
remove_collinear_columns = TRUE,
standardize = TRUE,
seed=1
){noformat}

glmnet.h2o.wo.rcc@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.477795 1.199333
2 Sepal.Length -0.029807 -0.024682
3 Sepal.Width 0.076599 0.033387
4 Petal.Length 0.430312 0.759629
glmnet.h2o.w.rcc@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163

@exalate-issue-sync (Author)

Wendy commented: This code will generate the same model because no regularization is applied whether remove_collinear_columns is TRUE or FALSE (lambda = 0). Since the iris dataset has no collinear columns, no columns are removed, and the two models should produce the same coefficients:

{noformat}glmnet.h2o.w.rcc.lambda0 <- h2o.glm(
  x = predictor,
  y = response,
  training_frame = iris.data,
  alpha = 1,
  lambda_search = FALSE,
  lambda = 0.0,
  max_iterations = 200,
  remove_collinear_columns = TRUE,
  standardize = TRUE,
  seed = 1
)
glmnet.h2o.wo.rcc.lambda0 <- h2o.glm(
  x = predictor,
  y = response,
  training_frame = iris.data,
  alpha = 1,
  lambda_search = FALSE,
  lambda = 0.0,
  max_iterations = 200,
  remove_collinear_columns = FALSE,
  standardize = TRUE,
  seed = 1
){noformat}

glmnet.h2o.w.rcc.lambda0@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163

glmnet.h2o.wo.rcc.lambda0@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163
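
As a quick check, the two coefficient vectors can be compared directly (a one-liner using the model names from the run above):

{noformat}# Should return TRUE: lambda = 0 means no regularization either way, and
# iris has no collinear predictors to remove.
all.equal(glmnet.h2o.w.rcc.lambda0@model$coefficients_table$coefficients,
          glmnet.h2o.wo.rcc.lambda0@model$coefficients_table$coefficients){noformat}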

@exalate-issue-sync (Author)

Wendy commented: The following example produces the same models regardless of the remove_collinear_columns setting. The reason is that the solver used in this case is coordinate_descent, not IRLSM.

{noformat}setwd(normalizePath(dirname(R.utils::commandArgs(asValues=TRUE)$"f")))
source("../../../scripts/h2o-r-test-setup.R")

check.glm.remove.collinear.columns <- function() {
  # Import the iris dataset into R
  data("iris")
  set.seed(999)
  data <- iris
  # Per-row weights and fold assignments (nrow(data), not length(data),
  # so every one of the 150 rows gets a value)
  data$weight_column <- runif(nrow(data), .1, 1)
  data$cv_fold <- c(rep(1, nrow(data)/3), rep(2, nrow(data)/3), rep(3, nrow(data)/3))
  iris.data <- as.h2o(data)

  # Set up Response and Predictors
  response <- c("Petal.Width")
  predictor <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

  # Create beta constraints
  constraints <- as.h2o(data.frame(names = predictor,
                                   lower_bounds = rep(-10000, length(predictor)),
                                   upper_bounds = rep(10000, length(predictor))))

  glmnet.h2o.lambda0p2 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.2,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p2.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.2,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p1 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.1,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p1.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.1,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )

  glmnet.h2o.lambda0@model$coefficients_table
  glmnet.h2o.lambda0.wo.rcc@model$coefficients_table
  glmnet.h2o.lambda0p1@model$coefficients_table
  glmnet.h2o.lambda0p1.wo.rcc@model$coefficients_table
  glmnet.h2o.lambda0p2@model$coefficients_table
  glmnet.h2o.lambda0p2.wo.rcc@model$coefficients_table
  browser()
}{noformat}
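
To confirm the solver dependence, one could pin the solver explicitly; h2o.glm accepts a solver argument (a sketch, simplified to the default gaussian family and reusing the predictor, response, and iris.data setup above):

{noformat}# Force IRLSM instead of letting the solver fall back to coordinate
# descent. With solver = "IRLSM", lambda > 0, and
# remove_collinear_columns = TRUE, the TRUE/FALSE runs would be expected
# to diverge again, while under COORDINATE_DESCENT the option has no effect.
m.irlsm <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                   alpha = 1, lambda = 0.1, solver = "IRLSM",
                   remove_collinear_columns = TRUE, seed = 1)
m.cd <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                alpha = 1, lambda = 0.1, solver = "COORDINATE_DESCENT",
                remove_collinear_columns = TRUE, seed = 1)
m.irlsm@model$coefficients_table
m.cd@model$coefficients_table{noformat}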

@exalate-issue-sync (Author)

Wendy commented: I shall probably add a warning in this case, when remove_collinear_columns = TRUE but the solver is not IRLSM.
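
Pending a backend change, a client-side guard could look like this (purely illustrative; glm_with_rcc_guard is a hypothetical helper, not part of the h2o package):

{noformat}# Hypothetical wrapper (not part of h2o): warn when
# remove_collinear_columns = TRUE is combined with a non-IRLSM solver,
# since the option only takes effect under IRLSM.
glm_with_rcc_guard <- function(..., solver = "AUTO",
                               remove_collinear_columns = FALSE) {
  if (remove_collinear_columns && solver != "IRLSM") {
    warning("remove_collinear_columns = TRUE only applies to the IRLSM solver; ",
            "it may be ignored for solver = ", solver)
  }
  h2o.glm(..., solver = solver,
          remove_collinear_columns = remove_collinear_columns)
}{noformat}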

@h2o-ops (Collaborator)

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8072
Assignee: Wendy
Reporter: Joseph Granados
State: Resolved
Fix Version: 3.32.1.2
Attachments: N/A
Development PRs: Available

@h2o-ops (Collaborator)

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#5419

h2o-ops closed this as completed May 14, 2023