Cox regression cross-validation seems broken #10

Closed
schuemie opened this issue Dec 13, 2014 · 8 comments

@schuemie
Member

We see some weird behavior in the cross-validation, but it's hard to reproduce. At higher variance levels we often, but not always, see 'ill configured for prior'. Also, in the example below it seems a bit unlikely that the log likelihood really stays almost constant from 0.000001 to 10, and at the 'optimal' variance (e.g., 0.000215443) all betas are typically zero, which hardly seems optimal.

library(Cyclops)

data <- simulateData(nstrata = 1, nrows = 1000, ncovars = 200, model = "survival")
cyclopsData <- convertToCyclopsData(data$outcomes, data$covariates, modelType = "cox")

prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(noiseLevel = "quiet", lowerLimit = 0.000001, upperLimit = 10)
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control)
msuchard self-assigned this Dec 15, 2014
@msuchard
Member

Certainly easy to believe; I have not personally checked CV under the Cox model (which handles observation weights slightly differently than the other models). Will look into this.

@msuchard
Member

msuchard commented Jan 6, 2015

Try the latest commit to master (112353a) and let me know how things go. The CCD update steps and log likelihoods were being calculated incorrectly when data were held out and there were survival ties.
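
For anyone following along, here is a minimal sketch of the quantity involved: a weighted Cox partial log-likelihood with the Breslow approximation for tied event times, in which held-out rows receive weight zero and therefore must drop out of both the event terms and the risk-set sums. This is illustrative R only, not the actual Cyclops C++ implementation:

# Illustrative only (not Cyclops internals): weighted Cox partial log-likelihood
# with the Breslow approximation for tied event times. Cross-validation hold-out
# can be expressed as weight = 0, which must remove a row from both the event
# terms and the risk-set denominators.
weightedCoxLogLik <- function(beta, X, time, status, weight) {
  eta <- as.vector(X %*% beta)                   # linear predictors
  ll <- 0
  for (t in sort(unique(time[status == 1]))) {   # distinct event times
    events <- which(time == t & status == 1)     # (possibly tied) events at t
    atRisk <- which(time >= t)                   # risk set at t
    d <- sum(weight[events])                     # weighted event count
    ll <- ll + sum(weight[events] * eta[events]) -
      d * log(sum(weight[atRisk] * exp(eta[atRisk])))
  }
  ll
}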

@schuemie
Member Author

schuemie commented Jan 8, 2015

Almost! But there's still some funky stuff. When I do:

library(Cyclops)
set.seed(100)
data <- simulateData(nstrata = 1, nrows = 1000, ncovars = 200, model = "survival")
cyclopsData <- convertToCyclopsData(data$outcomes, data$covariates, modelType = "cox")

prior <- createPrior("laplace", useCrossValidation = TRUE)
control <- createControl(noiseLevel = "quiet", lowerLimit = 0.000001, upperLimit = 10, seed = 100)
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control)

The output only reports a non-zero predictive log likelihood for the first fold:

Performing 10-fold cross-validation [seed = 100]
Running at Laplace(1414.21) Grid-point #1 at 1e-006 Fold #1 Rep #1 pred log like = -4841.02
Running at Laplace(1414.21) Grid-point #1 at 1e-006 Fold #2 Rep #1 pred log like = 0
Running at Laplace(1414.21) Grid-point #1 at 1e-006 Fold #3 Rep #1 pred log like = 0
Running at Laplace(1414.21) Grid-point #1 at 1e-006 Fold #4 Rep #1 pred log like = 0
Running at Laplace(1414.21) Grid-point #1 at 1e-006 Fold #5 Rep #1 pred log like = 0
...

@msuchard
Member

msuchard commented Jan 9, 2015

I believe I have pinpointed the problem. The Cox model uses pid to identify strata rather than subjects, while the code in AbstractSelector.cpp assumes that pid identifies exchangeable sampling units. That assumption holds for all other models except conditional logistic regression. We should discuss what we mean by an exchangeable sampling unit for this model. Basically, do we assume:

  • Strata are small and we have a whole bunch of them, so that each stratum is independently sampled, or
  • Strata are large and we have few of them, so that we want to independently sample rows within strata?

In either case, I am working on a fix for the Cox model; a sketch contrasting the two fold-assignment schemes follows.
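
To make the two options concrete, here is an illustrative sketch in R (the helper names are hypothetical, not Cyclops code) of fold assignment under each assumption:

# Illustrative sketch only (not Cyclops code): two ways to define the
# exchangeable sampling unit when assigning cross-validation folds.

# Option 1: whole strata are the sampling units; every row of a stratum
# lands in the same fold.
assignFoldsByStratum <- function(stratumId, nFolds = 10) {
  strata <- unique(stratumId)
  foldOfStratum <- sample(rep_len(seq_len(nFolds), length(strata)))
  foldOfStratum[match(stratumId, strata)]
}

# Option 2: rows are the sampling units; rows are sampled into folds
# independently within each stratum.
assignFoldsWithinStrata <- function(stratumId, nFolds = 10) {
  fold <- integer(length(stratumId))
  for (s in unique(stratumId)) {
    idx <- which(stratumId == s)
    fold[idx] <- sample(rep_len(seq_len(nFolds), length(idx)))
  }
  fold
}

With a single stratum, as in the example above, treating strata as the sampling units leaves nothing to partition, which is consistent with the degenerate fold behavior reported above.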

@msuchard
Member

msuchard commented Jan 9, 2015

@schuemie, try again with commit 1129a39. Of course, I have probably broken something else now.

@schuemie
Member Author

This certainly works, but I do now see a different issue: at least in this example, the likelihood has a local optimum far from the global one, and the auto-search lands us in the wrong spot. Any thoughts?

library(Cyclops)
set.seed(1)
data <- simulateCyclopsData(nstrata = 1, nrows = 1000, ncovars = 200, model = "survival")
cyclopsData <- convertToCyclopsData(data$outcomes, data$covariates, modelType = "cox")
prior <- createPrior("laplace", useCrossValidation = TRUE)

# Using grid search finds a small optimal variance
control <- createControl(noiseLevel = "quiet", lowerLimit = 0.000001, upperLimit = 10, seed = 1)
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control, forceNewObject = TRUE)

# The auto-search starts at large values, and settles in a local optimum
control <- createControl(noiseLevel = "quiet", seed = 1, cvType = "auto")
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control, forceNewObject = TRUE)

# Forcing a different starting variance helps find the global(?) optimum
control <- createControl(noiseLevel = "quiet", seed = 1, cvType = "auto", startingVariance = 0.000001)
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control, forceNewObject = TRUE)

@msuchard
Member

I noticed similar behavior with the previous seed (100) as well. The default variance estimate from Genkin et al. seems very far off the mark. Prior expert knowledge (i.e., me running cross-validation for MSCCS in the OMOP experiment) suggests variances < 1. Why not just start your searches with

startingVariance = 0.1

until we develop a better heuristic?
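
Concretely, with the setup from the previous comment, that would look like:

# Suggested interim default: start the auto-search at a small variance
control <- createControl(noiseLevel = "quiet", seed = 1, cvType = "auto",
                         startingVariance = 0.1)
fit <- fitCyclopsModel(cyclopsData, prior = prior, control = control,
                       forceNewObject = TRUE)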

@schuemie
Member Author

Will do. I'm closing this issue now. Thanks so much, Marc; this is just in time for our next study!
