
Identify and implement correct behavior for “remove_collinear_columns = TRUE” with “lambda > 0” #7577

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 8 comments

Comments

@exalate-issue-sync

In some cases, “remove_collinear_columns = TRUE” resets “lambda” to 0, but not always.

@exalate-issue-sync (Author)

Wendy commented: The GLM parameter remove_collinear_columns, when using the IRLSM solver, is used to remove collinear columns from the Gram matrix when there is no regularization (lambda = 0) or when there is only an L2 penalty (lambda > 0 and alpha = 0). In fact, if remove_collinear_columns = TRUE and both lambda and alpha are non-zero with 0 < alpha < 1, only the L2 penalty is applied during the optimization process when building the GLM model.

Normally the L1 and L2 penalties are used to combat collinear columns. However, there are cases where we don’t want regularization at all, such as when we want to compute p-values. In addition, an L2 penalty alone forces coefficients to be small but not sparse; it is the L1 penalty that forces sparse coefficients.
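
For reference, the p-value use case mentioned above looks like this (a minimal sketch; it assumes a running H2O cluster and uses the same iris columns as the examples below):

{noformat}# Sketch: p-values require no regularization (lambda = 0) and the IRLSM
# solver; remove_collinear_columns = TRUE drops linearly dependent columns
# from the Gram matrix so the standard errors are well defined.
library(h2o)
h2o.init()
data("iris")
iris.data <- as.h2o(iris)
glm.pvalues <- h2o.glm(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length"),
  y = "Petal.Width",
  training_frame = iris.data,
  solver = "IRLSM",
  lambda = 0,
  compute_p_values = TRUE,
  remove_collinear_columns = TRUE
)
glm.pvalues@model$coefficients_table{noformat}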

@exalate-issue-sync (Author)

Wendy commented: Hence, in the current implementation, if you set alpha and lambda to non-zero values between 0 and 1, runs with remove_collinear_columns = TRUE and FALSE will generate different results, and this is expected.

When, say, lambda = 0.01, alpha = 0.5, and remove_collinear_columns = TRUE, only the L2 penalty is used in building the GLM model. If remove_collinear_columns = FALSE, both L1 and L2 penalties are used. This is what causes the two runs to yield different coefficients; it is the expected and correct behavior.
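
For example, with the predictor, response, and iris.data setup from the code below, the pair of runs described above would look like this (a sketch; the model names are illustrative):

{noformat}# lambda = 0.01, alpha = 0.5: with remove_collinear_columns = TRUE only
# the L2 penalty is applied, so the coefficients differ from the FALSE
# run, which applies both the L1 and L2 penalties.
m.rcc.true <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                      alpha = 0.5, lambda = 0.01,
                      remove_collinear_columns = TRUE, seed = 1)
m.rcc.false <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                       alpha = 0.5, lambda = 0.01,
                       remove_collinear_columns = FALSE, seed = 1)
m.rcc.true@model$coefficients_table
m.rcc.false@model$coefficients_table{noformat}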

@exalate-issue-sync (Author)

Wendy commented: This code will generate different models because when remove_collinear_columns = TRUE there is no regularization (with alpha = 1 the penalty is pure L1, which is dropped), while when remove_collinear_columns = FALSE the L1 penalty is used to build the model:

{noformat}data("iris")
data <- iris
iris.data <- as.h2o(data)

Set up Response and Predictors

response <- c("Petal.Width")
predictor <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
glmnet.h2o.wo.rcc <- h2o.glm(
x = predictor,
y = response,
training_frame = iris.data,
alpha = 1,
lambda_search = FALSE,
lambda = .01,
max_iterations = 200,
standardize = TRUE,
seed=1
)
glmnet.h2o.w.rcc <- h2o.glm(
x = predictor,
y = response,
training_frame = iris.data,
alpha = 1,
lambda_search = FALSE,
lambda = .01,
max_iterations = 200,
remove_collinear_columns = TRUE,
standardize = TRUE,
seed=1
){noformat}

glmnet.h2o.wo.rcc@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.477795 1.199333
2 Sepal.Length -0.029807 -0.024682
3 Sepal.Width 0.076599 0.033387
4 Petal.Length 0.430312 0.759629
glmnet.h2o.w.rcc@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163

@exalate-issue-sync (Author)

Wendy commented: This code will generate the same model because no regularization is applied whether remove_collinear_columns is TRUE or FALSE (lambda = 0). Since the iris dataset has no collinear columns, no columns are removed, and the two models should produce the same coefficients:

{noformat}glmnet.h2o.w.rcc.lambda0 <- h2o.glm(
  x = predictor,
  y = response,
  training_frame = iris.data,
  alpha = 1,
  lambda_search = FALSE,
  lambda = 0.0,
  max_iterations = 200,
  remove_collinear_columns = TRUE,
  standardize = TRUE,
  seed = 1
)
glmnet.h2o.wo.rcc.lambda0 <- h2o.glm(
  x = predictor,
  y = response,
  training_frame = iris.data,
  alpha = 1,
  lambda_search = FALSE,
  lambda = 0.0,
  max_iterations = 200,
  remove_collinear_columns = FALSE,
  standardize = TRUE,
  seed = 1
){noformat}

glmnet.h2o.w.rcc.lambda0@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163

glmnet.h2o.wo.rcc.lambda0@model$coefficients_table
Coefficients: glm coefficients
names coefficients standardized_coefficients
1 Intercept -0.240307 1.199333
2 Sepal.Length -0.207266 -0.171630
3 Sepal.Width 0.222829 0.097123
4 Petal.Length 0.524083 0.925163
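
As a quick check, the two coefficient vectors can be compared directly (a one-liner using the model names from the run above):

{noformat}# Should return TRUE: lambda = 0 means no regularization either way, and
# iris has no collinear predictors to remove.
all.equal(glmnet.h2o.w.rcc.lambda0@model$coefficients_table$coefficients,
          glmnet.h2o.wo.rcc.lambda0@model$coefficients_table$coefficients){noformat}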

@exalate-issue-sync (Author)

Wendy commented: The following example produces the same models regardless of the remove_collinear_columns setting. The reason is that the solver used in this case is coordinate_descent, not IRLSM.

{noformat}setwd(normalizePath(dirname(R.utils::commandArgs(asValues=TRUE)$"f")))
source("../../../scripts/h2o-r-test-setup.R")

check.glm.remove.collinear.columns <- function() {
  # Import the iris dataset into R
  data("iris")
  set.seed(999)
  data <- iris
  # Per-row weights and fold assignments (nrow(data), not length(data),
  # so every one of the 150 rows gets a value)
  data$weight_column <- runif(nrow(data), .1, 1)
  data$cv_fold <- c(rep(1, nrow(data)/3), rep(2, nrow(data)/3), rep(3, nrow(data)/3))
  iris.data <- as.h2o(data)

  # Set up Response and Predictors
  response <- c("Petal.Width")
  predictor <- c("Sepal.Length", "Sepal.Width", "Petal.Length")

  # Create beta constraints
  constraints <- as.h2o(data.frame(names = predictor,
                                   lower_bounds = rep(-10000, length(predictor)),
                                   upper_bounds = rep(10000, length(predictor))))

  glmnet.h2o.lambda0p2 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.2,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p2.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.2,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p1 <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.1,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = TRUE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )
  glmnet.h2o.lambda0p1.wo.rcc <- h2o.glm(
    x = predictor,
    y = response,
    training_frame = iris.data,
    ignore_const_cols = TRUE,
    family = 'tweedie',
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0,
    alpha = 1,
    lambda_search = FALSE,
    lambda = 0.1,
    fold_column = "cv_fold",
    weights_column = "weight_column",
    max_iterations = 200,
    remove_collinear_columns = FALSE,
    keep_cross_validation_models = TRUE,
    beta_constraints = constraints,
    standardize = TRUE,
    seed = 1
  )

  glmnet.h2o.lambda0@model$coefficients_table
  glmnet.h2o.lambda0.wo.rcc@model$coefficients_table
  glmnet.h2o.lambda0p1@model$coefficients_table
  glmnet.h2o.lambda0p1.wo.rcc@model$coefficients_table
  glmnet.h2o.lambda0p2@model$coefficients_table
  glmnet.h2o.lambda0p2.wo.rcc@model$coefficients_table
  browser()
}{noformat}
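
To confirm the solver dependence, one could pin the solver explicitly; h2o.glm accepts a solver argument (a sketch, simplified to the default gaussian family and reusing the predictor, response, and iris.data setup above):

{noformat}# Force IRLSM instead of letting the solver fall back to coordinate
# descent. With solver = "IRLSM", lambda > 0, and
# remove_collinear_columns = TRUE, the TRUE/FALSE runs would be expected
# to diverge again, while under COORDINATE_DESCENT the option has no effect.
m.irlsm <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                   alpha = 1, lambda = 0.1, solver = "IRLSM",
                   remove_collinear_columns = TRUE, seed = 1)
m.cd <- h2o.glm(x = predictor, y = response, training_frame = iris.data,
                alpha = 1, lambda = 0.1, solver = "COORDINATE_DESCENT",
                remove_collinear_columns = TRUE, seed = 1)
m.irlsm@model$coefficients_table
m.cd@model$coefficients_table{noformat}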

@exalate-issue-sync (Author)

Wendy commented: I shall probably add a warning in this case, when remove_collinear_columns = TRUE but the solver is not IRLSM.
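
Pending a backend change, a client-side guard could look like this (purely illustrative; glm_with_rcc_guard is a hypothetical helper, not part of the h2o package):

{noformat}# Hypothetical wrapper (not part of h2o): warn when
# remove_collinear_columns = TRUE is combined with a non-IRLSM solver,
# since the option only takes effect under IRLSM.
glm_with_rcc_guard <- function(..., solver = "AUTO",
                               remove_collinear_columns = FALSE) {
  if (remove_collinear_columns && solver != "IRLSM") {
    warning("remove_collinear_columns = TRUE only applies to the IRLSM solver; ",
            "it may be ignored for solver = ", solver)
  }
  h2o.glm(..., solver = solver,
          remove_collinear_columns = remove_collinear_columns)
}{noformat}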

@h2o-ops (Collaborator)

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8072
Assignee: Wendy
Reporter: Joseph Granados
State: Resolved
Fix Version: 3.32.1.2
Attachments: N/A
Development PRs: Available

@h2o-ops (Collaborator)

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#5419

h2o-ops closed this as completed May 14, 2023