Graphical models query #12

Closed
richardbeare opened this issue Sep 6, 2016 · 7 comments

@richardbeare richardbeare commented Sep 6, 2016

Hi,
Not sure if this is the forum you'd like to use for queries - let me know if it isn't.

I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups. I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficient for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.
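
A minimal sketch of the quantity of interest, for concreteness. It assumes the two estimated precision (inverse covariance) matrices are already available; the matrices `theta1` and `theta2` below are made up, and the helper `partial_cor` is a hypothetical name, not a JGL function.

```r
# Convert a precision (inverse covariance) matrix to partial correlations:
# rho_ij = -theta_ij / sqrt(theta_ii * theta_jj) for i != j.
partial_cor <- function(theta) {
  pc <- -cov2cor(theta)   # cov2cor rescales by the diagonal
  diag(pc) <- 1
  pc
}

# Two made-up 3 x 3 precision matrices standing in for the two groups
theta1 <- matrix(c( 2.0, -0.8,  0.0,
                   -0.8,  2.0, -0.5,
                    0.0, -0.5,  2.0), nrow = 3, byrow = TRUE)
theta2 <- matrix(c( 2.0, -0.2,  0.0,
                   -0.2,  2.0, -0.5,
                    0.0, -0.5,  2.0), nrow = 3, byrow = TRUE)

# Difference in partial correlation for each edge (upper triangle only,
# so every undirected edge is counted once)
ut <- upper.tri(theta1)
edge_diff <- partial_cor(theta1)[ut] - partial_cor(theta2)[ut]
edge_diff
```

Only the edge between nodes 1 and 2 differs between the two made-up groups, so it is the only non-zero entry of `edge_diff`.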

I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.

Thanks


@hofnerb hofnerb commented Sep 6, 2016

In principle, this is the correct place for your query. However, I cannot really advise on the use of stability selection with graphical models, as I am no expert in the latter. There is, however, some literature on this combination:

  • Already the original paper considers graphical models.
  • Another paper on graphical models with stability selection is given here

Searching the web will surely provide more examples of various flavors of graphical modelling with stability selection.

On your general questions about stability selection:

As shown in our article (http://www.biomedcentral.com/content/pdf/s12859-015-0575-3.pdf), q should be large enough to capture all anticipated variables but (usually much) smaller than the number of available predictors. Meinshausen and Bühlmann (http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x/abstract) propose in one place to choose q = sqrt(0.8 * p) or sqrt(0.8 * alpha * p), where alpha is, for example, 0.05 (i.e., the significance level). Yet these choices are not applicable in all cases. I'd suggest playing around, looking at the selection frequencies, and keeping an eye on the PFER.
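
A small sketch of how these heuristics relate to the PFER. It assumes the Meinshausen and Bühlmann bound PFER <= q^2 / ((2 * cutoff - 1) * p) with a cutoff of 0.9; the function name `mb_pfer_bound` and the value of `p` are made up for illustration and are not part of stabs.

```r
# Meinshausen-Buhlmann upper bound on the per-family error rate:
# PFER <= q^2 / ((2 * cutoff - 1) * p), valid for 0.5 < cutoff <= 1.
mb_pfer_bound <- function(p, q, cutoff) {
  stopifnot(cutoff > 0.5, cutoff <= 1)
  q^2 / ((2 * cutoff - 1) * p)
}

p <- 1000                          # number of candidate predictors (made up)
alpha <- 0.05

q_pfer1 <- sqrt(0.8 * p)           # heuristic aimed at PFER <= 1
q_alpha <- sqrt(0.8 * alpha * p)   # heuristic aimed at PFER <= alpha

mb_pfer_bound(p, q_pfer1, cutoff = 0.9)
mb_pfer_bound(p, q_alpha, cutoff = 0.9)
```

With a cutoff of 0.9 the factor (2 * cutoff - 1) equals 0.8, which is where the 0.8 in both heuristics comes from. In practice q is an integer, so rounding it down keeps the bound at or below the target.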

Regarding your final question:
How is your data structured, i.e., what do you mean by grouped data? Do you have multiple measurements on the same subject? In that case, you should perhaps consider resampling individuals instead of resampling observations. Yet, I haven't seen any stability selection case with grouped data where the grouping was taken into account.

Please note that resampling with subsamples of size n/2 is important for the derivation of the bound for the PFER, so I am not sure how a different resampling scheme would affect the theoretical properties!


@richardbeare richardbeare commented Sep 6, 2016

Thanks for your comments - I realise this is a potentially tricky question.
Are you aware of R packages for stability selection with graphical models?
I didn't see any during a quick CRAN search.



@hofnerb hofnerb commented Sep 7, 2016

Sorry, I don't know such a package. (I also don't know any other package which implements the Shah/Samworth bounds which are usually preferable).

However, I would love to add the relevant functions to stabs. What I would need is a function that takes arguments x, y and q (further arguments can be passed along via ...) and returns the selected variables and potentially the selection path. See the README (https://github.com/hofnerb/stabs/blob/master/README.md) for details. In that case, you could use the complete infrastructure of stabs (i.e., resampling, error control, parameter computation, ...). This would be my preferred way.

If we need to resample individuals rather than cases, we could consider implementing such resampling functionality as well. However, you can always do this by hand if you use stabsel and provide user-specified folds.
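
A minimal sketch of building such folds by hand, resampling subjects rather than rows. The layout (20 subjects, 3 measurements each) is made up, and it is an assumption here that the folds are supplied as a 0/1 weight matrix with one row per observation and one column per subsample; check the stabsel documentation for the exact format it expects.

```r
set.seed(1)
# Made-up layout: 20 subjects with 3 repeated measurements each
subject <- rep(seq_len(20), each = 3)
n <- length(subject)
B <- 50                                   # number of subsamples

ids <- unique(subject)
# One column per subsample; 1 = observation is in the subsample, 0 = left out.
# Half of the subjects are drawn, and all of their rows go in together.
folds <- sapply(seq_len(B), function(b) {
  keep <- sample(ids, floor(length(ids) / 2))
  as.integer(subject %in% keep)
})

dim(folds)
colSums(folds)[1:5]
```

Because every subject here has the same number of measurements, each subsample also contains exactly n/2 rows. With unbalanced subjects the row count would only be approximately n/2, which is one place where the theory behind the PFER bound may no longer apply exactly.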

Another way would be to do the resampling (with samples of size floor(n/2)) on your own and compute the PFER for a given cutoff (aka threshold) and q (or, analogously, any one of the three parameters given the other two) via the function

stabsel_parameters(p, cutoff, q,  PFER, B, assumption, ...)
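
A rough sketch of this second route in base R. The data, the toy selector `select_top_q`, and the closed-form cutoff are all assumptions made for illustration: the selector is not part of stabs, and the cutoff is obtained by hand from the Meinshausen and Bühlmann bound PFER <= q^2 / ((2 * cutoff - 1) * p) rather than from stabsel_parameters, which would normally do this computation.

```r
set.seed(42)
n <- 200; p <- 50; q <- 5; B <- 100
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 2 * x[, 2] + rnorm(n)   # only variables 1 and 2 matter

# Toy selector (an assumption, not part of stabs): keep the q variables
# with the largest absolute marginal correlation with y
select_top_q <- function(x, y, q) {
  score <- abs(cor(x, y))[, 1]
  seq_len(ncol(x)) %in% order(score, decreasing = TRUE)[seq_len(q)]
}

# Selection indicators over B subsamples of size floor(n / 2)
sel <- sapply(seq_len(B), function(b) {
  idx <- sample(n, floor(n / 2))
  select_top_q(x[idx, , drop = FALSE], y[idx], q)
})
freq <- rowMeans(sel)                      # selection frequency per variable

# Solve the bound for the cutoff that gives the target PFER
target_pfer <- 1
cutoff <- min(1, (q^2 / (p * target_pfer) + 1) / 2)

stable <- which(freq >= cutoff)            # variables declared stable
```

Here q = 5, p = 50 and a target PFER of 1 give a cutoff of 0.75. Fixing any two of cutoff, q and PFER determines the third through the same inequality.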

If the first way is doable, you could either provide a patch (i.e., the relevant code) or pointers to the relevant packages and functions. I would then assist you in writing the required function(s) and manual(s).


@richardbeare richardbeare commented Sep 7, 2016

Thanks,
I can see an easy starting point for graphical models without groups -
maybe the group approach will become clear later. However, how should q be
interpreted for the graphical case? Is it the number of non-zero entries in
the inverse, or some function of that? I can certainly make a start on a
prototype.
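
A small sketch of that reading of q, assuming each undirected edge is counted once (upper triangle only) and the diagonal is ignored. The precision matrix below is made up.

```r
# Made-up sparse precision matrix on 4 nodes with edges 1-2 and 3-4
theta <- matrix(c( 1.0, 0.3,  0.0,  0.0,
                   0.3, 1.0,  0.0,  0.0,
                   0.0, 0.0,  1.0, -0.4,
                   0.0, 0.0, -0.4,  1.0), nrow = 4, byrow = TRUE)

ut <- upper.tri(theta)              # count each undirected edge once
n_candidates <- sum(ut)             # p * (p - 1) / 2 possible edges
n_selected <- sum(theta[ut] != 0)   # edges present in this estimate

c(candidates = n_candidates, selected = n_selected)
```

Under this reading, the number of "variables" that enters the error bound is the number of candidate edges, not the number of nodes, and q is compared against `n_selected`.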


@hofnerb hofnerb commented Sep 7, 2016

Correct. See Meinshausen and Bühlmann
http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x/abstract

[image: graphical_model]
https://cloud.githubusercontent.com/assets/8823088/18311767/698aa6c0-7506-11e6-98e8-48059ef172b8.PNG

@richardbeare richardbeare commented Sep 8, 2016

Was just starting to look into coding this and discovered the "pulsar"
package:

https://cran.r-project.org/package=pulsar

It looks like it might do a lot of the work. I'm having a read now to see
how it works.


@richardbeare richardbeare commented Sep 14, 2016

Hi,
I've had a look at the other packages I mentioned - pulsar and huge. They
seem to focus on the "stars" methodology, which is strictly about selection
of regularization level. While they use resampling to test stability at a
given regularization level, they don't combine the results of different
resamplings in the stability selection style.

I've written a couple of stubs for testing graphical methods - see what you
think. I'm curious about how the number of subsamples and the cutoff should
change in this scenario. Note that these stubs require a slightly modified
version of stabs:

devtools::install_github("richardbeare/stabs", ref="GraphTrialsA")

library(stabs)  ## the modified version installed above

## Build a regularisation path from max down to min, optionally log-spaced
getLamPath <- function (max, min, len, log = FALSE)
{
  if (max < min)
    stop("Did you flip min and max?")
  if (log) {
    min <- log(min)
    max <- log(max)
  }
  lams <- seq(max, min, length.out = len)
  if (log)
    exp(lams)
  else lams
}

set.seed(10010)
p <- 40 ; n <- 1000
dat  <- huge::huge.generator(n, p, "hub", verbose=FALSE, v=.1, u=.5)

## fitfun for stabsel: y is unused (kept for the fitfun interface),
## q is the number of edges to select
stabs.quic <- function(x, y, q, ...)
{
  if (!requireNamespace("QUIC", quietly = TRUE)) {
    stop("Package ", sQuote("QUIC"), " is required but not available")
  }
  ## sort out a lambda path
  empirical.cov <- cov(x)
  max.cov <- max(abs(empirical.cov[upper.tri(empirical.cov)]))
  lams <- getLamPath(max.cov, max.cov*0.05, len=40)
  est <- QUIC::QUIC(empirical.cov, rho=1, path=lams, msg=0)
  ut <- upper.tri(empirical.cov)
  ## number of edges (non-zero upper-triangle entries) at each lambda
  qvals <- sapply(seq_along(lams), function(idx){
    m <- est$X[,,idx]
    sum(m[ut] != 0)
  })

  ## Not sure if it is better to have more or less than q.
  ## Take the first lambda with at least q edges; if none reaches q, fall
  ## back to the smallest lambda (which.max alone would return index 1).
  lamidx <- if (any(qvals >= q)) which.max(qvals >= q) else length(lams)
  ## Need to return the entire upper triangle - think about how to save
  ## ram later
  M <- est$X[,,lamidx][ut]
  selected <- (M != 0)
  ## selection path: one column per lambda up to the chosen one
  s <- sapply(seq_len(lamidx), function(idx){
    est$X[,,idx][ut] != 0
  })
  colnames(s) <- as.character(seq_len(ncol(s)))
  return(list(selected=selected, path=s))
}

sq <- stabsel(x=dat$data, y=dat$data, fitfun=stabs.quic, cutoff=0.75,
              PFER=1)
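
Since the stub returns an unnamed logical vector over the upper triangle, the following sketch shows one way to map positions back to node pairs. It is an illustration only: the `selected` vector below is made up, and naming the entries is a suggestion rather than something the stub above or stabs does.

```r
p <- 4
ut <- upper.tri(matrix(0, p, p))
# Row/column index of every upper-triangle entry, in the same column-major
# order that m[ut] uses
edges <- which(ut, arr.ind = TRUE)
edge_names <- paste(edges[, "row"], edges[, "col"], sep = "-")

# Made-up logical vector standing in for the `selected` element returned
# by a fitfun such as stabs.quic
selected <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
names(selected) <- edge_names
names(selected)[selected]
```

Naming the entries this way would make the selection frequencies readable as edges ("1-2", "3-4") rather than as positions in the upper triangle.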
@hofnerb hofnerb closed this Sep 22, 2016