q = 1 -> more than one variable selected #23
Comments
|
The (average) number of covariates selected per data subset exceeds q when glmnet.lasso is used. Hence, it looks like a bug in the glmnet.lasso fit function. Of note: if you use lars.lasso, everything works as expected:
## example from ?stabsel
library("stabs")
data("bodyfat", package = "TH.data")
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro3c anthro4 anthro3b anthro3a waistcirc hipcirc
# 0.00 0.00 0.00 0.01 0.01 0.02 0.11 0.90 0.95
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.lasso$max)
# [1] 2
However, if you use glmnet.lasso:
(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso, cutoff = 0.75,
PFER = 1))
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro3b anthro4 anthro3c anthro3a waistcirc hipcirc
# 0.00 0.00 0.02 0.07 0.08 0.11 0.40 0.92 0.94
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.glmnet$max)
# [1] 2.54
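As a side note, the PFER of 0.454 reported above is the tighter bound obtained under the unimodality assumption; for comparison, the original Meinshausen and Buehlmann bound, PFER <= q^2 / ((2 * cutoff - 1) * p), can be computed directly (a minimal sketch using the values from the output above):
## original Meinshausen & Buehlmann bound for the settings above
cutoff <- 0.75
q <- 2
p <- ncol(bodyfat[, -2])        # 9 covariates
q^2 / ((2 * cutoff - 1) * p)
# [1] 0.8888889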
|
Currently, we use
glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2], dfmax = q - 1)
to achieve models with at maximum q variables (with q = 2 in the example above). Alternatively, one could use
glmnet::glmnet(model.matrix(~. - 1, bodyfat[, -2]), bodyfat[, 2], pmax = q)
However, this selects at maximum q variables, but possibly fewer.
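The practical difference between the two calls can be seen directly (a minimal sketch reusing the bodyfat data loaded above; glmnet reports the number of non-zero coefficients per step in its df component):
library("glmnet")
x <- model.matrix(~ . - 1, bodyfat[, -2])
y <- bodyfat[, 2]
q <- 2
fit.df <- glmnet(x, y, dfmax = q - 1)  # path may end with more than q - 1 variables (anti-conservative)
fit.pm <- glmnet(x, y, pmax = q)       # never more than q non-zero coefficients (conservative)
## number of non-zero coefficients in the last model of each path
c(dfmax = tail(fit.df$df, 1), pmax = tail(fit.pm$df, 1))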
The only thing that prohibits this is that you fix dfmax in glmnet for me (which is also apparently not correct here, but that's not my point). Does this make sense as a use case that I would like stabs to support?
|
@berndbischl - thanks, now you blew my cover! But yeah, that sounds like a reasonable idea. @hofnerb - do you think that might work? I need to use the glmnet.lasso function in order to assume a Poisson or Gamma distribution for the dependent variable. However, one could remove the dfmax limitation and simply work with the selection probabilities. What do you think about this approach, and how reliable are those probabilities anyway (as you mentioned, they currently don't add up)? I'd really appreciate your help, as I want to include this analysis in a paper of mine (currently under revision). Many thanks, Clemens
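For illustration, passing the distributional assumption on to glmnet should work via args.fitfun, since the fit function forwards additional arguments to glmnet (the count response below is purely made up to sketch the call):
## hypothetical count outcome, only to illustrate family = "poisson"
set.seed(29)
y.count <- rpois(nrow(bodyfat), lambda = 5)
stab.pois <- stabsel(x = bodyfat[, -2], y = y.count,
                     fitfun = glmnet.lasso,
                     args.fitfun = list(family = "poisson"),
                     cutoff = 0.75, PFER = 1)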
|
@competulix You can simply use the selection frequencies from stabsel, but you should not remove the dfmax limitation. Well, you can do this, but it would be a different idea than stability selection. I would rather propose to use the current implementation (or we try to fix the issue). Theoretically, this should not really be a big problem, just some coding work. One simply would have to check that the fit really just contains the first q selected variables. Currently, the selection path of the glmnet fit (one entry per lambda step, giving the indices of the non-zero coefficients) can look like this:
# $s0
# NULL
#
# $s1
# [1] 3
#
# $s2
# [1] 3
#
# $s3
# [1] 2 3 6
#
# $s4
# [1] 2 3 6
This also seems to be the reason why the selection frequencies do not add up as expected.
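A rough sketch of that idea (not the implementation itself; it reuses the bodyfat example from above and simply cuts the path at the last step with at most q non-zero coefficients):
library("glmnet")
x <- model.matrix(~ . - 1, bodyfat[, -2])
y <- bodyfat[, 2]
q <- 2
fit <- glmnet(x, y, dfmax = q - 1)
keep <- max(which(fit$df <= q))                  # last step with at most q variables
selected <- which(as.matrix(fit$beta)[, keep] != 0)
names(selected)                                  # at most q variable names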
|
@competulix could you try the following fit function and compare the results with the results from glmnet.lasso?
glmnet.lasso2 <- function(x, y, q, ...) {
if (!requireNamespace("glmnet", quietly=TRUE))
stop("Package ", sQuote("glmnet"), " needed but not available")
if (is.data.frame(x)) {
message("Note: ", sQuote("x"),
" is coerced to a model matrix without intercept")
x <- model.matrix(~ . - 1, x)
}
if ("lambda" %in% names(list(...)))
stop("It is not permitted to specify the penalty parameter ", sQuote("lambda"),
" for lasso when used with stability selection.")
## fit model
fit <- glmnet::glmnet(x, y, pmax = q, ...)
## which coefficients are non-zero?
selected <- predict(fit, type = "nonzero")
selected <- selected[[length(selected)]]
ret <- logical(ncol(x))
ret[selected] <- TRUE
names(ret) <- colnames(x)
## compute selection paths
cf <- fit$beta
sequence <- as.matrix(cf != 0)
## return both
return(list(selected = ret, path = sequence))
}
Actually, with pmax = q the model contains at most q variables per subsample, but possibly fewer.
Hence, if our realised q is smaller than the specified q, the resulting error control is (even more) conservative. At least to get some impression of stable effects, both should be fine. Please contact me via email (simply ask @berndbischl) for more help.
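For completeness, a usage sketch for the alternative fit function (reusing the bodyfat example; the exact frequencies may of course differ from glmnet.lasso):
stab.lasso2 <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = glmnet.lasso2, cutoff = 0.75, PFER = 1)
sum(stab.lasso2$max)   # with pmax = q this should not exceed q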
|
@hofnerb - Thank you for the quick reply - I'll try this and get back to you asap. |
|
I just had a look at another implementation of stability selection in package hdi and found that they also use the conservative approximation of choosing at most q variables (i.e., pmax = q). Hence, I changed the code in the package. Per default, glmnet.lasso now uses this conservative approach; the old, anti-conservative behaviour based on dfmax can still be requested via args.fitfun:
stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso, args.fitfun = list(type = "anti"),
                       cutoff = 0.75, PFER = 1)
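With the new default, the same check as above can be repeated (a sketch; the exact frequencies will vary slightly between runs):
stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
sum(stab.glmnet$max)   # with the conservative default this should not exceed q = 2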
|
Wow, this was fast - are you going to send that to CRAN as well?
|
I am currently preparing a new release version (as the last version is already quite old) but am not sure if this is going to work out today. If not, you can always install the package from GitHub as explained in the README.md.
|
OK, sounds great! GitHub installation works for now - publication-wise, an official CRAN release is preferable though ;-). Thank you very much - I really like your work!
|
The new version is on CRAN now. Thanks for your input. |
|
I have a quick question:
I used stabsel in combination with glmnet.lasso and set q to 1.
However, the results show that more than one variable has been selected - how is that possible?
Thank you in advance.