Can I use stabsel on non-zero matrix after lasso? #21
Dear @raechin, some notes regarding stability selection:
set.seed(1234)
stabsel(x = x, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 2 3
#
# Selection probabilities:
# age elbowbreadth kneebreadth anthro4 anthro3b anthro3c anthro3a waistcirc hipcirc
# 0.00 0.00 0.00 0.04 0.05 0.10 0.36 0.96 0.97
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.454
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.0504 (without multiplicity adjustment)

And with the subset:

set.seed(1234)
stabsel(x = xuse, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
# Stability Selection with unimodality assumption
#
# Selected variables:
# waistcirc hipcirc
# 1 2
#
# Selection probabilities:
# kneebreadth anthro3b anthro3c anthro3a waistcirc hipcirc
# 0.00 0.08 0.10 0.37 0.96 0.97
#
# ---
# Cutoff: 0.75; q: 2; PFER (*): 0.68
# (*) or expected number of low selection probability variables
# PFER (specified upper bound): 1
# PFER corresponds to signif. level 0.113 (without multiplicity adjustment)

As you can see, in this case the same variables are very stably selected. However, the selection frequencies differ overall, and in other cases different variables might end up in your final subset. Furthermore, the three stability selection parameters q (the average number of selected variables), cutoff (the selection frequency above which variables are termed stable), and PFER (the per-family error rate) depend on each other, but also on the number of candidate variables p. In the examples above you can see that the realized PFER is 0.454 with all variables and 0.68 with the subset. If the numbers of variables differ more strongly, the parameters may also differ more strongly. All in all, please have a look at the README and the relevant literature:
The latter publication also gives you some ideas about how to choose your stability selection parameters.
Dear Hofner,
I have many large matrices (1000 obs × 15000 vars) on which I run the lasso for variable selection. To speed things up, I think it would be much faster to run stabsel() only on the variables whose lasso coefficients are nonzero (i.e., columns of the x matrix with zero coefficients are removed).
Is this reasonable? Running stabsel() on the full x matrix returns selection probabilities for all variables in x, while running it on the reduced x matrix is much faster. The ordering of the resulting selection probabilities seems consistent between x and the reduced x, but the values differ.
Here is my code:
output:
Running stabsel() on x or on the reduced x (xuse) selects the same variables. But is there any potential problem with running stabsel() on the reduced x?
Thank you!
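The subsetting workflow described above can be sketched roughly as follows. This is a minimal sketch, not the code from the original post: it assumes the glmnet and stabs packages, x and y are a predictor matrix and response as in the question, and the choice of lambda.min for the initial lasso fit is purely illustrative.

```r
library(glmnet)  # initial lasso fit
library(stabs)   # stabsel()

# Fit the lasso with a cross-validated penalty (lambda.min is an
# illustrative choice; any fixed lambda would work the same way)
cv <- cv.glmnet(x, y, alpha = 1)

# Extract coefficients at lambda.min, dropping the intercept row
beta <- coef(cv, s = "lambda.min")[-1, 1]

# Keep only the columns with nonzero lasso coefficients
xuse <- x[, beta != 0, drop = FALSE]

# Stability selection on the reduced matrix
set.seed(1234)
stabsel(x = xuse, y = y, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
```

Note that, as discussed above, pre-filtering changes p, so for the same cutoff and q the realized PFER on the reduced matrix will generally differ from that on the full matrix.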