Can I use stabsel on non-zero matrix after lasso? #21

Closed
raechin opened this issue Jan 4, 2017 · 1 comment

raechin commented Jan 4, 2017

Dear Hofner,

I have many large matrices (1000 obs × 15000 vars) for lasso and variable selection. To speed things up, I think it would be much faster to run stabsel() only on the variables with non-zero lasso coefficients (i.e., with the zero-coefficient columns of the x matrix removed).

Is this reasonable? Running stabsel() on the full x matrix returns p-values for all variables in x, while running it on the reduced x matrix is much faster. The order of the resulting p-values seems to be consistent between x and the reduced x, but the values differ.

Here is my code:

library(glmnet)
library(stabs)

# data
set.seed(1234)
data("bodyfat", package = "TH.data")
x <- as.matrix(bodyfat[, -2])
y <- bodyfat[, 2]

# use cv.glmnet to find the best lambda
cv.fit <- cv.glmnet(x, y, alpha = 1)
cv.fit$lambda.min
res <- glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.fit$lambda.min)

# xuse: columns of x with non-zero lasso coefficients
nonZero <- which(res$beta != 0)
xuse <- as.matrix(x[, nonZero])

# stabsel on xuse (zero-coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

# stabsel on the full x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]


output:

stab.lasso1$phat
              s0
waistcirc   1.00
hipcirc     1.00
kneebreadth 0.97
anthro3a    0.68
anthro3b    0.92
anthro3c    0.57

stab.lasso2$phat
               s0
age          0.52
waistcirc    0.99
hipcirc      1.00
elbowbreadth 0.45
kneebreadth  0.94
anthro3a     0.57
anthro3b     0.85
anthro3c     0.56
anthro4      0.17

stab.lasso1$selected
  waistcirc     hipcirc kneebreadth    anthro3b
          1           2           3           5
stab.lasso2$selected
  waistcirc     hipcirc kneebreadth    anthro3b
          2           3           5           7

Running stabsel() on x or on the reduced x (xuse) selects the same variables. But are there any potential problems with running stabsel() on the reduced x?

Thank you!


hofnerb commented Jan 11, 2017

Dear @raechin,

Some notes regarding stability selection:

  1. You do not get p-values. What you obtain are selection frequencies, which show you how stable your results are.
  2. You should not restrict yourself to variables that were pre-selected by the same algorithm; this runs counter to the idea of stability selection. You can select a subset of variables for other reasons, but I would not do so unless necessary, and I would especially not reduce my data set to only a set of somehow-predictive variables. Use a (much) larger set of candidate variables so that stability selection can actually distinguish stable from unstable variables.
  3. You cannot fix lambda. The idea of stability selection is NOT to specify a penalty parameter, but to restrict the model by specifying the average number of non-zero coefficients (q) and then to see how often each variable is selected. Because you pre-specify lambda, you always get the same variables. Using stability selection correctly gives you, with the full data set:
set.seed(1234)
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1) 
# Stability Selection with unimodality assumption
# 
# Selected variables:
#     waistcirc   hipcirc 
#             2         3 
# 
# Selection probabilities:
#     age elbowbreadth  kneebreadth      anthro4     anthro3b     anthro3c     anthro3a    waistcirc      hipcirc 
#    0.00         0.00         0.00         0.04         0.05         0.10         0.36         0.96         0.97 
# 
# ---
# Cutoff: 0.75; q: 2; PFER (*):  0.454 
#    (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)

and with the subset:

set.seed(1234)
stabsel(x = xuse, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1) 
# Stability Selection with unimodality assumption
# 
# Selected variables:
#     waistcirc   hipcirc 
#             1         2 
# 
# Selection probabilities:
#     kneebreadth    anthro3b    anthro3c    anthro3a   waistcirc     hipcirc 
#            0.00        0.08        0.10        0.37        0.96        0.97 
# 
# ---
# Cutoff: 0.75; q: 2; PFER (*):  0.68 
#     (*) or expected number of low selection probability variables
# PFER (specified upper bound):  1 
# PFER corresponds to signif. level 0.113 (without multiplicity adjustment)

As you can see, in this case the same variables are selected very stably. However, the selection frequencies differ overall, and in other cases different variables might end up in your final subset. Furthermore, the three stability selection parameters q (the average number of selected variables), cutoff (the selection frequency above which variables are termed stable) and PFER (the per-family error rate) depend on each other, but also on the number of candidate variables p. In the examples above, the realized PFER is 0.454 with all variables and 0.68 with the subset. If the numbers of variables differ more strongly, the parameters may also differ more strongly.
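To get a feel for how these parameters interact without running the whole resampling procedure, you can (if I recall the interface correctly; please check ?stabsel_parameters) use the stabsel_parameters() helper from stabs, which computes the missing one of cutoff / q / PFER from the other two, given p. A sketch for the two settings above:

```r
library(stabs)

# All p = 9 bodyfat predictors: cutoff and PFER fixed, q is derived
stabsel_parameters(p = 9, cutoff = 0.75, PFER = 1)

# The reduced set of p = 6 predictors: the same cutoff/PFER now
# imply a different q and hence a different realized PFER
stabsel_parameters(p = 6, cutoff = 0.75, PFER = 1)
```

Comparing the two printed parameter sets shows directly why the realized error bounds differ between the full and the reduced design matrix.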

All in all, please have a look at the README and relevant literature:

  • For theoretical background: Meinshausen, N. and Bühlmann, P. (2010), Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72: 417–473. doi:10.1111/j.1467-9868.2010.00740.x
  • For stability selection with boosting (and some background on the package): Benjamin Hofner, Luigi Boccuto and Markus Goeker (2015). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16:144. doi:10.1186/s12859-015-0575-3

The latter publication also gives you some ideas about how to choose your stability selection parameters.
