
q = 1 -> more than one variable selected #23

Closed
competulix opened this issue Jan 25, 2017 · 12 comments

Labels: bug

@competulix commented Jan 25, 2017

I have a quick question: I used stabsel in combination with glmnet.lasso and set q to 1. However, the results show that more than one variable was selected - how is that possible?

stabsel(x = data[predict], y = data$freq_calls, fitfun=glmnet.lasso, args.fitfun=list(family="poisson"), PFER = 0.2, q=1, sampling.type="SS")

Stability Selection with unimodality assumption

Selected variables:
   A    E 
   2    6 

Selection probabilities:
    O   N.  Ag.   Ed    I    G    C    E    A 
 0.00 0.01 0.01 0.02 0.09 0.10 0.19 0.66 0.67 

---
Cutoff: 0.65; q: 1; PFER (*):  0.192 
   (*) or expected number of low selection probability variables
PFER (specified upper bound):  0.2 
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)

Thanks in advance.

@hofnerb added the bug label Jan 25, 2017
@hofnerb (Owner) commented Jan 25, 2017

The (average) number of covariates selected per data subset, q, does not necessarily correspond to the number of stably selected variables. However, the latter should usually be <= q.

Hence, it looks like a bug in the glmnet.lasso fitfun.


Of note:

If you use lars.lasso, everything works as expected, as can be seen from the sum of selection probabilities:

## example from ?stabsel
library("stabs")
data("bodyfat", package = "TH.data")
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = lars.lasso, cutoff = 0.75,
                       PFER = 1))
# Stability Selection with unimodality assumption
# 
# Selected variables:
# waistcirc   hipcirc 
#         2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth     anthro3c      anthro4     anthro3b     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.00         0.01         0.01         0.02         0.11         0.90         0.95 
# 
# ---
#     Cutoff: 0.75; q: 2; PFER (*):  0.454 
# (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.lasso$max)
# [1] 2

However, if you use glmnet.lasso, the sum of the selection frequencies exceeds the pre-specified q:

(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                        fitfun = glmnet.lasso, cutoff = 0.75,
                        PFER = 1))
# Stability Selection with unimodality assumption
# 
# Selected variables:
# waistcirc   hipcirc 
#         2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth     anthro3b      anthro4     anthro3c     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.02         0.07         0.08         0.11         0.40         0.92         0.94 
# 
# ---
#     Cutoff: 0.75; q: 2; PFER (*):  0.454 
# (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)
sum(stab.glmnet$max)
# [1] 2.54
@hofnerb (Owner) commented Jan 25, 2017

Currently, we use

glmnet::glmnet(model.matrix(~ . - 1, bodyfat[, -2]), bodyfat[, 2], dfmax = 2 - 1)

(i.e., dfmax = q - 1) to obtain models with at most q variables. However, on average, more than q variables are selected.

Alternatively, one could use

glmnet::glmnet(model.matrix(~ . - 1, bodyfat[, -2]), bodyfat[, 2], pmax = q)

However, this selects at most q variables, i.e., on average fewer than q. How can we best achieve exactly q variables?
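
For illustration, a minimal sketch (not from the original thread; it assumes the bodyfat example from above and that glmnet is installed) of how the two constraints behave at the end of the lambda path:

## Sketch, assuming the bodyfat example from above
library("glmnet")
x <- model.matrix(~ . - 1, bodyfat[, -2])
y <- bodyfat[, 2]
q <- 2

fit_dfmax <- glmnet(x, y, dfmax = q - 1)  # may overshoot q (see above)
fit_pmax  <- glmnet(x, y, pmax = q)       # never more than q variables

## number of non-zero coefficients at the last lambda of each path
sum(fit_dfmax$beta[, ncol(fit_dfmax$beta)] != 0)
sum(fit_pmax$beta[, ncol(fit_pmax$beta)] != 0)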

@berndbischl commented Jan 25, 2017

@hofnerb

  1. I know @competulix, that's "Clemens"; I recommended stabs for one of his current papers.

  2. Maybe an idea: I guess I would be interested in simply running stability selection without ever arriving at a "final set" of selected variables - just calculate the selection frequencies and then "look at them".
     That also means I don't have to worry about the hard problem of fixing q and PFER before I know my results.

The only thing that prohibits this is that you fix dfmax in glmnet for me (which is also apparently not correct here, but that's not my point).

Does this make sense as a use case that I would like stabs to support?

@competulix (Author) commented Jan 27, 2017

@berndbischl - thanks, now you've blown my cover! But yeah, that sounds like a reasonable idea.

@hofnerb - do you think that might work? I need to use the glmnet.lasso function in order to assume a Poisson or gamma distribution for the dependent variable. However, one could remove the dfmax limitation and stop at the selection probabilities. What do you think about this approach, and how reliable are those probabilities anyway (as you mentioned, they currently don't sum to q)?

I'd really appreciate your help as I want to include this analysis in a paper of mine (currently under revision).

Many thanks

Clemens

@hofnerb (Owner) commented Jan 27, 2017

@competulix You can simply use the selection frequencies from stabsel, but you should not REMOVE the dfmax limitation and run the model until convergence.

Well, you can do this, but it would be a different idea than stability selection. I would rather propose using the current implementation (or we try to fix the issue).
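
For what it's worth, a minimal sketch (assuming the stab.glmnet object from the bodyfat example above) of how to inspect the selection frequencies directly; the max component of a stabsel object holds each variable's maximum selection frequency along the path:

## Sketch, assuming stab.glmnet from the bodyfat example above
sort(stab.glmnet$max, decreasing = TRUE)  # selection frequencies per variable
plot(stab.glmnet, type = "paths")         # selection paths along the penalty grid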

Theoretically, this should not be a big problem, just some coding work: one would simply have to check that the fit really contains at most the first q variables and nothing more. I just tried to do this but ran into another issue: (in my example above) the path sometimes jumps from one selected variable straight to three (!):

selected
# $s0
# NULL
# 
# $s1
# [1] 3
# 
# $s2
# [1] 3
# 
# $s3
# [1] 2 3 6
# 
# $s4
# [1] 2 3 6

This also seems to be the reason why pmax = q selects <= q variables. It appears to be an algorithmic issue of package glmnet: the steps along the solution path are simply too big.
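
For reference, a sketch of how such a path inspection can be reproduced (assuming x, y, and q from the bodyfat sketch above); predict() with type = "nonzero" returns, for each lambda, the indices of the non-zero coefficients:

## Sketch, assuming x, y, and q from the bodyfat sketch above
fit <- glmnet::glmnet(x, y, dfmax = q - 1)
selected <- predict(fit, type = "nonzero")
selected[1:5]  # non-zero coefficient indices for the first five lambdas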

@hofnerb (Owner) commented Jan 27, 2017

@competulix could you try the following fitfun and compare its results with those from glmnet.lasso?

glmnet.lasso2 <- function(x, y, q, ...) {
    if (!requireNamespace("glmnet", quietly = TRUE))
        stop("Package ", sQuote("glmnet"), " needed but not available")

    if (is.data.frame(x)) {
        message("Note: ", sQuote("x"),
                " is coerced to a model matrix without intercept")
        x <- model.matrix(~ . - 1, x)
    }

    if ("lambda" %in% names(list(...)))
        stop("It is not permitted to specify the penalty parameter ",
             sQuote("lambda"), " for lasso when used with stability selection.")

    ## fit the model; pmax ensures that at most q variables are ever active
    fit <- glmnet::glmnet(x, y, pmax = q, ...)

    ## which coefficients are non-zero at the last lambda of the path?
    selected <- predict(fit, type = "nonzero")
    selected <- selected[[length(selected)]]
    ret <- logical(ncol(x))
    ret[selected] <- TRUE
    names(ret) <- colnames(x)
    ## compute selection paths (one column per lambda)
    cf <- fit$beta
    sequence <- as.matrix(cf != 0)
    ## return both
    return(list(selected = ret, path = sequence))
}
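
A usage sketch for comparison (assuming the bodyfat example from above); with this pmax-based fitfun, the sum of the selection frequencies should stay at or below q:

## Usage sketch, assuming the bodyfat example from above
stab.glmnet2 <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                        fitfun = glmnet.lasso2, cutoff = 0.75, PFER = 1)
sum(stab.glmnet2$max)  # should now be <= q = 2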

Actually, with glmnet.lasso2 we should be on the safe side, as the upper bound for the PFER is defined as follows (with p the number of variables):

PFER <= 1/(2 * cutoff - 1) * q^2 / p

Hence, if our realised q is smaller than the pre-specified one, this leads to a conservative approximation. On the other hand, this upper bound is conservative anyway, so it should not be an issue if we use the anti-conservative glmnet.lasso implementation.
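
As a quick numeric check (an illustration, not from the original thread): plugging the bodyfat example values (cutoff = 0.75, q = 2, p = 9 predictors) into this basic bound gives

1 / (2 * 0.75 - 1) * 2^2 / 9
# [1] 0.8888889

The value of 0.454 printed above is lower because those results use the tighter bound under the unimodality assumption (see the header of the printed output).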

At least for getting an impression of the stable effects, both should be fine.

Please contact me via email (simply ask @berndbischl) for more help.

@competulix (Author) commented Jan 27, 2017

@hofnerb - Thank you for the quick reply - I'll try this and get back to you asap.

@hofnerb (Owner) commented Jan 27, 2017

I just had a look at another implementation of stability selection in package hdi and found that they also use the conservative approximation of choosing at most q variables.

Hence, I changed the code in the package. By default, glmnet.lasso now computes the conservative version (called glmnet.lasso2 above). The old, anti-conservative version can be used by setting args.fitfun = list(type = "anticonservative") (or an abbreviation thereof), e.g.:

stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = glmnet.lasso, args.fitfun = list(type = "anti"),
                       cutoff = 0.75, PFER = 1)
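
To compare the two modes, a sketch (assuming the bodyfat example from above) that contrasts the sums of the selection frequencies:

## Sketch, assuming the bodyfat example from above
stab.cons <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                     fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
sum(stab.cons$max)    # conservative default: should be <= q
sum(stab.glmnet$max)  # anti-conservative: may exceed q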
@hofnerb closed this in eaa3b49 Jan 27, 2017
@competulix (Author) commented Jan 27, 2017

Wow, this was fast - are you going to send that to CRAN as well?

@hofnerb (Owner) commented Jan 27, 2017

I am currently preparing a new release (as the last version is already quite old) but am not sure if this will work out today. If not, you can always install the package from GitHub as explained in the README.md.

@competulix (Author) commented Jan 27, 2017

OK, sounds great! GitHub installation is the current option - publication-wise, an official CRAN release is preferable though ;-).

Thank you very much - I really like your work!

@hofnerb (Owner) commented Feb 1, 2017

The new version is on CRAN now. Thanks for your input.
