
simulate new random effects/conditional modes conditional on observed data #388

Open
alireza202 opened this Issue Jul 7, 2016 · 14 comments


alireza202 commented Jul 7, 2016 edited

I'm doing a multilevel negative binomial regression:

fit_model <- glmer.nb(formula = y ~ x + (x | item), data = training)

and do the prediction for new data using:

testing_results <- predict(object = fit_model, newdata = testing, type = 'response')

Since I need the prediction interval, I use bootMer:

boot1 <- lme4::bootMer(fit_model, 
                       function(x) predict(x, newdata = testing, re.form = NULL, type = 'response'), 
                       nsim = 200, 
                       use.u = FALSE, 
                       type = 'parametric', 
                       parallel = 'multicore', 
                       ncpus = 4)

PI.boot <- data.frame(
  fit   = apply(boot1$t, 2, mean),
  lower = apply(boot1$t, 2, quantile, probs = 0.05, na.rm = TRUE),
  upper = apply(boot1$t, 2, quantile, probs = 0.95, na.rm = TRUE))

Now when I plot the mean that I get from bootMer and the one from predict, they wildly differ. Here is an example:

[screenshot: observed data with predict and bootMer predictions]

Red is the data, green is the prediction from the predict function, and blue is the prediction from bootMer, with the corresponding confidence interval. Left of the dotted line is the training period; the right side is the testing period. Am I doing something wrong here? I tried the same thing with a Poisson regression and got the same result.

Owner

bbolker commented Jul 7, 2016

Nothing obvious springs to mind (although I read this fairly quickly). Can we have a reproducible example please?

When I change use.u = TRUE, it actually matches the predict result, but the prediction interval is too tight.

[screenshot: same plot with use.u = TRUE; bootMer mean now matches predict, but the interval is very narrow]

I'm not sure I understand how use.u works here. The above is the result for a single item, by the way.

I can't share the data as it is. If need be, I can anonymize it and share it.

alireza202 commented Jul 7, 2016 edited

In the bootMer help it says:

If use.u is FALSE and type is "parametric", each simulation generates new values of both
the “spherical” random effects u and the i.i.d. errors eps, using rnorm() with parameters 
corresponding to the fitted model x.

Since it uses rnorm(), does this mean it can't be used for negative binomial or Poisson models?
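For what it's worth, as far as I understand that passage describes the linear-mixed-model case: rnorm() is only used for the "spherical" random effects u, while the responses themselves are regenerated with simulate(), which respects the model family. A quick sketch to check this (a toy Poisson model built from the built-in sleepstudy data; not part of the original thread):

```r
library(lme4)
## fake count response so we can fit a Poisson GLMM
dd <- sleepstudy
dd$count <- round(dd$Reaction / 10)
gm <- glmer(count ~ Days + (1 | Subject), data = dd, family = poisson)
## parametric simulation draws from the fitted family:
ss <- simulate(gm, seed = 1)[[1]]
all(ss == round(ss))   # Poisson draws are integers
```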

Owner

bbolker commented Jul 7, 2016

We really need a reproducible example. An anonymized version would be fine; a simplified version (i.e. construct a small, artificial data set that reproduces the general problem) would be even better.

Contributor

pitakakariki commented Jul 7, 2016

Just a hunch: what does plot(ranef(fit_model)) look like?

Here is how it looks:

[screenshot: plot of ranef(fit_model)]

X axis is x and Y axis is intercept.

Ben, I'm preparing a data set to share.

Here is the RData file with a smaller set of the training and testing data, along with the R code:

bootMer_issue.zip

I got a few warnings while running the glmer.nb and bootMer commands that I didn't get with the full data set; it might be because there are fewer items in the sample.

Also, it would be great if you could point me toward any article or writeup on the math behind bootMer.

Owner

bbolker commented Jul 8, 2016 edited

I'm afraid there isn't much more available about the math behind bootMer - what's in the documentation is (in the grand tradition of R documentation :-( ) correct and more or less complete, but very opaque if you don't know what you're reading.

There are a number of things going on here. The basic problem is that item-specific confidence intervals are quite tricky, and I don't think any of the issues are specific to your data set or to negative binomial models (sorry for making you put together a reproducible example that I turned out not to need). I have set up the example below with the built-in sleepstudy data, which is smaller and is a linear mixed model, hence much faster to run the bootstrap simulations on.

Confidence and prediction intervals for unobserved levels

(I think after all that this may not have been your particular question, but it was useful to think about ...) For unobserved levels, you need to make a population-level prediction (i.e. you don't have any information about the new, previously unobserved group), and there are three levels of uncertainty to consider: (1) parameter uncertainty (fixed- and random-effect, i.e. "beta" and "theta"); (2) variation of groups around population mean values; (3) conditional variation, i.e. Gaussian/Poisson/binomial distributions around predicted group-level values.

Therefore, when you simulate you have to use use.u=FALSE (or, preferably, re.form=~0) to resample new values of the conditional modes (= "random effects"). However, if you want this uncertainty in the bootstrap output, you have to use simulate rather than predict, because predict makes deterministic predictions; it doesn't have the capability (and arguably shouldn't) to make "random" predictions for the new levels. That is what simulate() is for.

The following code illustrates three different levels of variation.

library(reshape2)
library(lme4)
library(ggplot2); theme_set(theme_bw())
library(plyr)

## set up test/train (only 1 subject in test ...)
nSubj <- length(levels(sleepstudy$Subject))
train <- subset(sleepstudy,Subject %in% levels(Subject)[1:(nSubj-1)])
test <- subset(sleepstudy,Subject==levels(Subject)[nSubj])

## fit model
fm1 <- lmer(Reaction ~ Days + (Days | Subject), train)

## standard population-level prediction
pframe <- data.frame(test,pred=predict(fm1,newdata=test,re.form=NA))
pframe.m <- melt(pframe,id.vars=c("Days","Subject"))

sfun1 <- function(x) {
    simulate(x,newdata=test,re.form=~0,
             allow.new.levels=TRUE)[[1]]
}
sfun2 <- function(x) {
    simulate(x,newdata=test,re.form=~0,
             allow.new.levels=TRUE,cond.sim=FALSE)[[1]]
}

## param, RE, and conditional
b1 <- bootMer(fm1,FUN=sfun1,nsim=100,seed=101)
## param and RE (no conditional)
b2 <- bootMer(fm1,FUN=sfun2,nsim=100,seed=101)
## param only
b3 <- bootMer(fm1,FUN=function(x) predict(x,newdata=test,re.form=~0),
              ## re.form=~0 is equivalent to use.u=FALSE
              nsim=100,seed=101)
bootsum <- function(x,ext="_1") {
   d <- data.frame(aaply(x$t,2,
           function(x) c(mean(x),quantile(x,c(0.025,0.975)))))
   d <- setNames(d,paste0(c("bpred","lwr","upr"),ext))
   return(d)
}
dd <- data.frame(test,bootsum(b1),
                 bootsum(b2,"_2"),
                 bootsum(b3,"_3"))

ggplot(pframe.m,aes(Days,value,colour=variable))+geom_point()+
    geom_line()+
    geom_ribbon(data=dd,colour=NA,
                aes(y=bpred_1,ymin=lwr_1,ymax=upr_1),alpha=0.3)+
    geom_ribbon(data=dd,colour=NA,
                fill="blue",
                aes(y=bpred_2,ymin=lwr_2,ymax=upr_2),alpha=0.3)+
    geom_ribbon(data=dd,colour=NA,
                fill="red",
                aes(y=bpred_3,ymin=lwr_3,ymax=upr_3),alpha=0.3)

The points are the population-level prediction; the narrowest CI is parametric uncertainty only; the middle one is parametric + random-effects variation; the outer is parametric + random-effects + conditional variation.

[plot: population-level prediction with the three nested bootstrap intervals]

Confidence and prediction intervals for observed levels

This one is, believe it or not, even slightly trickier.

Here we have parametric uncertainty and conditional variation as before, but the random-effect (group-level predictor) variation is more subtle. We do have some information about the uncertainty of the conditional modes (see ?ranef.merMod), but using it is a little bit tricky because we need to decide what to do about the possible covariation with the fixed-effect parameters. That is complicated (it has been discussed a few times on r-sig-mixed-models), mainly because it's both theoretically and practically difficult (in the lme4 framework) to quantify the covariance between the fixed-effect parameters and the conditional mode values.

  • if we use predict(.,re.form=NULL) as the function for bootMer we will get the parameter uncertainty only;
  • if we use simulate(.,re.form=NULL) (equivalent to use.u=TRUE) we will get the parameter uncertainty and conditional variance only.
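As a concrete illustration of those two options (a sketch on sleepstudy, not from the original thread; the object names are made up):

```r
library(lme4)
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
## parameter uncertainty only, conditioning on the observed groups
## (use.u = TRUE keeps the fitted spherical random effects u)
b.pred <- bootMer(fm, FUN = function(x) predict(x, re.form = NULL),
                  nsim = 50, use.u = TRUE, seed = 101)
## parameter uncertainty + conditional (residual) variation
b.sim <- bootMer(fm, FUN = function(x) simulate(x, re.form = NULL)[[1]],
                 nsim = 50, use.u = TRUE, seed = 101)
## the simulate()-based intervals should be noticeably wider
q <- function(b) apply(b$t, 2, quantile, probs = c(0.025, 0.975))
mean(q(b.sim)[2, ] - q(b.sim)[1, ]) > mean(q(b.pred)[2, ] - q(b.pred)[1, ])
```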

At present there is no easy way to incorporate the uncertainty in the conditional mode, but I may try to do something with it soon ...
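One rough way to approximate this by hand (an illustration only, not an lme4 feature; the mvrnorm draw here ignores the covariance with the fixed-effect parameters, which is exactly the complication described above) is to draw new conditional modes from their estimated conditional distributions:

```r
library(lme4)
library(MASS)
fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
re <- ranef(fm, condVar = TRUE)$Subject
pv <- attr(re, "postVar")   # conditional (co)variance array: 2 x 2 x n
## one new conditional-mode vector per subject, centred on the
## conditional means, with the subject-specific conditional covariance
new_modes <- t(sapply(seq_len(nrow(re)), function(i)
    MASS::mvrnorm(1, mu = unlist(re[i, ]), Sigma = pv[, , i])))
dimnames(new_modes) <- list(rownames(re), colnames(re))
```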

Thanks for the comprehensive answer Ben. I will closely follow future developments then.

alireza202 referenced this issue in paul-buerkner/brms Jul 11, 2016

Closed

Understanding predict function #82

Owner

bbolker commented Aug 26, 2016

So, just to clarify this for the record: this boils down to a wishlist item to be able to simulate new values of the conditional modes, conditioned on the observations (rather than the extremes of setting the conditional modes/random effects to zero or simulating them unconditionally)?

My goal is to get full prediction interval/capture all uncertainties for GLMM models. I'm not sure what elements it needs.

Owner

bbolker commented Aug 31, 2016

That may not be a sufficiently precise statement. For a start, I'm assuming that you want to compute the uncertainty for particular subjects, conditioning on the observed values?

That would be accurate.

bbolker changed the title from bootMer results very different from predict result to simulate new random effects/conditional modes conditional on observed data Sep 2, 2016
