
open thread for mgcv smooth issues #928

Open
bbolker opened this issue Jul 19, 2023 · 25 comments

Comments

@bbolker
Contributor

bbolker commented Jul 19, 2023

See notes. Currently known outstanding issues:

  • setting good default starting values for smooth terms
  • better printing of output (hide fixed terms? print smooth terms in a way more consistent with mgcv?)
  • more general downstream methods (plot smooth terms? work with gratia methods?)
  • relax "no random effects in dispersion formulas" constraint so that users can add smooths to the dispersion model?
  • open the can of worms about term-by-term choice of population-level prediction for REs (see Implement re.form option in predict() also for simulate() and possibly also for residuals() #888)
@seananderson
Contributor

Good to see this moving forward! I had long hoped to get around to this myself.

A few things:

There are a large number of options available within s() that, in my experience, users familiar with mgcv immediately try to use, but only some of them are compatible with smooth2random(). I've added checks that stop when those options are used, as they come up.

Some examples of settings that will fail with unhelpful messages:

library(glmmTMB)
data("sleepstudy", package = "lme4")

m3 <- glmmTMB(Reaction ~ s(Days, bs = "re"), data = sleepstudy, REML = TRUE)
#> Error in dimnames(x) <- dn: length of 'dimnames' [2] not equal to array extent

m4 <- glmmTMB(Reaction ~ s(Days, fx = TRUE), data = sleepstudy, REML = TRUE)
#> Error in t.default(s$re$rand$Xr): argument is not a matrix

Sometimes setting m in s() results in a mismatch with the outputs of mgcv, although in trying to create examples here I realized that this is only the case with some datasets, and I think glmmTMB matches gamm4, so maybe it's fine.

I noticed this strange behaviour around the REML argument, which is important to note when testing:

library(glmmTMB)
library(gamm4)
set.seed(0)
dat <- mgcv::gamSim(1, n = 400, scale = 2)
#> Gu & Wahba 4 term additive model
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
dat$y <- as.numeric(dat$y)

## defaults don't match: glmmTMB defaults to REML = FALSE,
## but gamm4 apparently defaults to REML = TRUE
br <- gamm4(y ~ s(x0), data = dat, random = ~ (1 | fac))
br1 <- glmmTMB(y ~ s(x0) + (1 | fac), data = dat)
logLik(br$mer) # looks like REML = TRUE
#> 'log Lik.' -1127.131 (df=5)
logLik(br1)
#> 'log Lik.' -1126.471 (df=5)

## but this matches:
br <- gamm4(y ~ s(x0), data = dat, random = ~ (1 | fac), REML = FALSE)
br1 <- glmmTMB(y ~ s(x0) + (1 | fac), data = dat, REML = FALSE)
logLik(br$mer)
#> 'log Lik.' -1126.471 (df=5)
logLik(br1)
#> 'log Lik.' -1126.471 (df=5)

## and so does this:
br <- gamm4(y ~ s(x0), data = dat, random = ~ (1 | fac), REML = TRUE)
br1 <- glmmTMB(y ~ s(x0) + (1 | fac), data = dat, REML = TRUE)
logLik(br$mer)
#> 'log Lik.' -1127.131 (df=5)
logLik(br1)
#> 'log Lik.' -1127.131 (df=5)

Some packages have chosen diagonal.penalty = FALSE and some diagonal.penalty = TRUE in mgcv::smoothCon(). In sdmTMB I had followed the lead of brms, which uses diagonal.penalty = TRUE, but gamm4 and now glmmTMB use diagonal.penalty = FALSE, so I will probably switch to diagonal.penalty = FALSE to match them. With ML the fit is the same although the linear fixed effects differ (the scale of the basis matrix, and hence of the coefficients, changes), but with REML the result can be slightly different. I only mention this in case a user ever wonders why the fixed-effect component is dramatically different in glmmTMB vs. brms (often by an order of magnitude).
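
For reference, a minimal sketch of the diagonal.penalty difference using only mgcv (not glmmTMB internals): reparameterizing so the penalty becomes the identity rescales the basis columns, and hence the reported coefficients, without changing the ML fit.

library(mgcv)
set.seed(1)
d <- data.frame(x = runif(100))
## same smooth, two parameterizations
s1 <- smoothCon(s(x), data = d, absorb.cons = TRUE, diagonal.penalty = FALSE)[[1]]
s2 <- smoothCon(s(x), data = d, absorb.cons = TRUE, diagonal.penalty = TRUE)[[1]]
range(s1$X)               # basis columns on the original scale
range(s2$X)               # rescaled columns -> coefficients change scale too
zapsmall(diag(s2$S[[1]])) # penalty is now diagonal: 1 for penalized columns, 0 for the null space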

I have had challenges getting prediction on new data with t2() to match mgcv so I'll be curious if that can be worked out here.

Adding by terms isn't too bad, and I might be able to tackle it, but you'll probably get to it first.

In general, I haven't seen problems fitting with ML.

I'm happy to help with testing and working out a good plotting interface—possibly interfacing with gratia. Rolling your own ends up being a pain given the wide array of smoother configurations.

@bbolker
Contributor Author

bbolker commented Oct 6, 2023

Naive attempts to fit a 2D thin plate spline didn't go too well ...

library(glmmTMB)
d <- data.frame(z = as.vector(volcano),
                x = as.vector(row(volcano)),
                y = as.vector(col(volcano)))
set.seed(1)
d$z <- d$z + rnorm(length(volcano), sd=15)
d <- d[sample(nrow(d), 100), ]

d$pos <- numFactor(d$x, d$y)
d$group <- factor(rep(1, nrow(d)))

f <- glmmTMB(z ~ 1 + exp(pos + 0 | group), data=d,
             control = glmmTMBControl(parallel = 10))

## does something but not the right thing?
f2 <- glmmTMB(z ~ 1 + s(x, y), data=d,
             control = glmmTMBControl(parallel = 10))

## error
f3 <- glmmTMB(z ~ 1 + s(x, y, bs = "tp"), data=d,
             control = glmmTMBControl(parallel = 10))

## additive model
f4 <- glmmTMB(z ~ 1 + s(x) + s(y), data=d,
             control = glmmTMBControl(parallel = 10))

@authagag

authagag commented Oct 10, 2023

NA handling is not working correctly; it seems to be setting up the smooth before handling the NAs.

glmm_all_sm <- glmmTMB(data=util, formula= UTIL_trans ~ factor(POSTPFSFL) +s(BAGE) + ( 1|ID ),  family=ordbeta, na.action = na.omit )
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
  NA/NaN/Inf in foreign function call (arg 1)

vs

glmm_all_sm <- glmmTMB(data = na.omit(util[, c("UTIL_trans", "POSTPFSFL", "BAGE", "ID")]),
                       formula = UTIL_trans ~ factor(POSTPFSFL) + s(BAGE) + (1 | ID),
                       family = ordbeta, na.action = na.omit)
glmm_all_sm
Formula:          UTIL_trans ~ factor(POSTPFSFL) + s(BAGE) + (1 | ID)
Data: na.omit(util[, c("UTIL_trans", "POSTPFSFL", "BAGE", "ID")])
      AIC       BIC    logLik  df.resid 
-2445.535 -2400.812  1230.768      1971 
Random-effects (co)variances:

Conditional model:
 Groups Name        Std.Dev.  Corr          
 dummy  dummy1      2.461e-07 0.00 (homdiag)
 ID     (Intercept) 1.027e+00               

Number of obs: 1979 / Conditional model: ID, 246; dummy, 1

Dispersion parameter for ordbeta family (): 8.55 

lower cutoff estimate: 2.46e-10, 0.98 

Fixed Effects:

Conditional model:
          (Intercept)  factor(POSTPFSFL)Post               s(BAGE)1  
               1.9054                -0.2563                -0.1304  

@authagag

authagag commented Oct 10, 2023

Transformations not being handled in smooths

glmm_all_sm <- glmmTMB(data=na.omit(util[,c("UTIL_trans","POSTPFSFL","BAGE","ID")]), formula= UTIL_trans ~ factor(POSTPFSFL) +s(log(BAGE),k=4) + ( 1|ID ), family=ordbeta, na.action = na.omit )

Error in names(dat) <- object$term : 
  'names' attribute [1] must be the same length as the vector [0]

@bbolker
Contributor Author

bbolker commented Oct 10, 2023

Thanks @authagag. Now that this is exposed in a CRAN version, I can see that I (or someone???) will have some work to do getting this from "experimental" to "robust and ready for a wide range of use cases" ...

@authagag

authagag commented Oct 10, 2023

I was writing my own version for gamlss so I could use smooths with ordered beta regression for a paper I am working on. This came along and is saving me a lot of work. I have a vested interest :)

@bbolker
Contributor Author

bbolker commented Oct 10, 2023

Hmm. The NA handling is easy; the transformation-within-smooth case will take some fixes to the formula-processing machinery ... (see https://github.com/glmmTMB/glmmTMB/tree/smooth_fixes):

glmmTMB:::findbars_x(Reaction ~ s(log(Days+1)), default.special = NULL, target = "s")
[[1]]
s(log(`+`))

[[2]]
s(log(Days))

[[3]]
s(log(1))

@bbolker
Contributor Author

bbolker commented Oct 11, 2023

These all work now, with the smooth_fixes branch referred to above:

library(glmmTMB)
data(sleepstudy, package = "lme4")
ss <- sleepstudy
ss[3, "Days"] <- NA
ss$d <- ss$Days+1
m1 <- glmmTMB(Reaction ~ s(Days), sleepstudy)
m2 <- update(m1, data = ss)
m3 <- update(m1, . ~ s(log(Days+1)))
m4 <- update(m1, . ~ s(log(d)), data = ss) ## OK
m5 <- update(m1, . ~ s(log(d), k = 4), data = ss)

@bbolker
Contributor Author

bbolker commented Oct 15, 2023

There are some more complex examples that still don't work, which I need to fix here. Here, keep_args[3] is 1, which means that only the first argument of s(...) is kept. Is it safe to assume that all unnamed arguments should be kept (e.g. in s(x, y, bs = "tp", k = 4)) and that this can be signalled by passing NA (or 0 or ...)?
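
As a side note, here is a hypothetical sketch (not the existing keep_args mechanism) of the "keep all unnamed arguments" idea: the unnamed arguments of an s() call can be picked out directly from the call object.

cl <- quote(s(x, y, bs = "tp", k = 4))
args <- as.list(cl)[-1]
nm <- names(args)
unnamed <- if (is.null(nm)) seq_along(args) else which(nm == "")
args[unnamed]  # the covariates x and y, i.e. the arguments to keep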

@authagag

authagag commented Oct 20, 2023

Hi, I don't think that fully addresses all the amazing (or stupid) transformations that could be expressed in s(); it only seems to look for "+" in the specials, but one could imagine something like s(cos((y^2*3)/2), ...) in the formula. Would that be picked up? I am still trying to figure out how to download the different branches, so I can't test it.

I remember the survival package has an untangle.specials() function that handles a lot of formula tomfoolery pretty robustly. I know that things like this are an absolute bear for functions like predict().
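
A hedged sketch of that survival approach: untangle.specials() locates the special terms in a terms object, however elaborately they are written (the output components described in the comment are from memory).

library(survival)
tt <- terms(y ~ x1 + s(cos((x2^2 * 3)/2)) + s(x3, k = 4), specials = "s")
survival::untangle.specials(tt, "s", order = 1)
## $vars holds the s(...) terms exactly as written in the formula,
## $terms their positions in the terms object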

@authagag

What needs to be done to get the output to work with the gratia, emmeans, and marginaleffects packages?
Maybe I can contribute something?

@bbolker
Contributor Author

bbolker commented Oct 20, 2023

I think

remotes::install_github("glmmTMB/glmmTMB/glmmTMB@smooth_fixes")

should work for installing the current smooth-fix-development branch.

The weird example you suggested above does work with the current branch.

For predict, I think you're referring to the issues with data-dependent bases, but I think we have that covered?
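
(For context, the usual mgcv route for data-dependent bases at prediction time is a sketch like this: keep the smooth object from the original smoothCon() call and rebuild the design matrix for new data with PredictMat(), rather than reconstructing the basis from the new data.)

library(mgcv)
set.seed(1)
old_data <- data.frame(x = runif(50))
new_data <- data.frame(x = runif(10))
sm <- smoothCon(s(x), data = old_data, absorb.cons = TRUE)[[1]]
Xnew <- PredictMat(sm, new_data)  # basis evaluated at new data, consistent with the fit
dim(Xnew)                         # 10 x 9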

I know nothing about all the downstream packages -- it would be great to at least start testing them. (I can imagine that gratia is the biggest challenge; maybe ask Gavin Simpson for starting points?)

@bbolker
Contributor Author

bbolker commented Oct 20, 2023

FWIW, 2D splines now appear to work (at least at a very quick glance): https://github.com/glmmTMB/glmmTMB/blob/smooth_fixes/misc/spline_volcano.R

I don't have a reproducible version for the example in this comment, but this works now:

library(glmmTMB) ## with smooth_fixes branch
library(mgcv)
data("sleepstudy", package = "lme4")
m1 <- glmmTMB(Reaction ~ s(log(Days+1),k=4), data = sleepstudy, REML = TRUE)
m2 <- gam(Reaction ~ s(log(Days+1),k=4), data = sleepstudy, method = "REML")

(need log(Days+1) here because Days starts from 0)

I'm not sure this is actually fully working:

  • if we specify start = list(theta=-3) we can get a positive-definite Hessian. The results still differ slightly (all.equal(predict(m1), unname(c(predict(m2)))) says the predictions differ by about 3%), and there is a difference of 0.6 in the log-likelihoods (m2 is better)
  • FWIW I think it was the k=4 rather than the transformation that was the problem ...

@gavinsimpson

gavinsimpson commented Oct 25, 2023

Handling "glmmTMB" models in {gratia} should be OK, but will involve some tedious writing of new methods and turning existing functions into S3 methods. I've set up an issue gavinsimpson/gratia#240 in {gratia}'s GitHub repo to track what's needed and progress, but I have only just started to look into what would be actually needed to make this all work.

The main thing is that you are returning the "mgcv.smooth" objects output by mgcv::smoothCon() so much of the work is already in {gratia}. Some things that we might consider stretch goals will need some figuring out (like parametric_effects() and data_slice()), but they aren't needed immediately.

@mebrooks
Contributor

mebrooks commented Jan 5, 2024

I agree with the comment above that k=4 was a problem, but I also can't get the example working.

> data("sleepstudy", package = "lme4")
> m1 <- glmmTMB(Reaction ~ s(log(Days+1),k=4), data = sleepstudy, REML = TRUE)
Error in names(dat) <- object$term : 
  'names' attribute [1] must be the same length as the vector [0]

I also found these other things that break.

> tmbgamm <- glmmTMB(count~s(DOY, k=3), data=Salamanders)
> VarCorr(tmbgamm)

Conditional model:
Error in r[2, 1] : subscript out of bounds

Since specifying k seemed to be the problem, I tried omitting it, but then had an issue with the unique covariate combinations being fewer than the maximum degrees of freedom.

> #subset of data with fewer DOY levels
> Ssub <- subset(Salamanders, DOY %in% unique(Salamanders$DOY)[1:5])
> glmmTMB(count~s(DOY), data=Ssub)
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
  A term has fewer unique covariate combinations than specified maximum degrees of freedom
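
(For reference, the check behind that last error: mgcv requires the basis dimension k to be no larger than the number of unique covariate values, so it is worth checking that first. Here there are only 5 unique DOY values, so s(DOY) would need k <= 5 rather than the default k = 10.)

> length(unique(Ssub$DOY))
[1] 5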

@bbolker
Contributor Author

bbolker commented Jan 6, 2024

Are you using the smooth_fixes branch?

m1 <- glmmTMB(Reaction ~ s(log(Days+1),k=4), data = sleepstudy, REML = TRUE, start = list(theta=-3))

works for me with the smooth_fixes branch (as commented above, we need to tweak the starting value to get a pos-def/non-singular result)

However, I can replicate your next (VarCorr) error. (For some reason print.VarCorr.glmmTMB is looking for the correlation even though the variance structure is homdiag ...)

update: fixed the VarCorr printing problem. I'm going to set up a PR for what's been done so far - no reason it shouldn't be in the master branch ...

@bbolker bbolker mentioned this issue Jan 6, 2024
@mebrooks
Contributor

mebrooks commented Jan 8, 2024

Yeah @bbolker, sorry about that. I didn't see that it was on a separate branch. Thanks for the PR. I just merged it.

@mebrooks
Contributor

mebrooks commented Feb 20, 2024

I'm helping a student with their data, and based on analyses with splines we could use the

  • relax "no random effects in dispersion formulas" constraint so that users can add smooths to the dispersion model?

item from the wishlist above. So I'm thinking about trying to do it. I don't see any changes in smooth_fixes that work towards this, but maybe I'm missing them. My plan was to search through the code for cond, zi, and disp to see where we allow random effects in the first two but not the last. Am I missing anything obvious?

EDIT: following up in #997

@bbolker
Contributor Author

bbolker commented Feb 20, 2024

I don't think you're missing anything obvious, but I think this could be a medium- to large-size job because there could be a lot of places to fix. In the short term I would be tempted to fall back to non-penalized splines (splines::ns, splines::bs, etc.) for the dispersion component ... however, relaxing the constraint would be worthwhile in any case, to enable more general location-scale modeling.
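
A minimal sketch of that fallback, using only the standard dispformula argument (sleepstudy is just a stand-in dataset here):

library(glmmTMB)
library(splines)
data("sleepstudy", package = "lme4")
m_disp <- glmmTMB(Reaction ~ Days + (1 | Subject),
                  dispformula = ~ ns(Days, df = 3),  # unpenalized spline, no random effects needed
                  data = sleepstudy)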

@AllInCade

AllInCade commented Mar 7, 2024

Predictions for bivariate smooths don't seem to work. The model object outputs 2 columns to represent the 2D smooth, so there is a mismatch in dimensions between the data frame on which the smooth is evaluated (the test data) and the model object's output. It works fine for s(U) + s(V), s(U)*s(V), etc. Using t2() does not work (I assume the "t2" function hasn't been "copied" into glmmTMB yet). I've tried manual approaches with PredictMat() etc., but I'm out of my depth with how mgcv handles this.

> summary(testmod)
 Family: gaussian  ( identity )
Formula:          spd ~ s(time) + s(U, V)
Data: train_data

     AIC      BIC   logLik deviance df.resid 
-17179.7 -17130.9   8596.9 -17193.7     7867 

Random effects:

Conditional model:
 Groups   Name   Variance  Std.Dev. Corr          
 dummy    dummy1 8.835e-05 0.009399 0.00 (homdiag)
 dummy.1  dummy1 5.317e+01 7.292069 0.00 (homdiag)
 Residual        6.325e-03 0.079530               
Number of obs: 7870, groups:  dummy, 1

Dispersion estimate for gaussian family (sigma^2): 0.00633 

Conditional model:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.1540193  0.0008965    5749   <2e-16 ***
s(time)1    -0.0014914  0.0047952       0    0.756    
s(U,V)1      0.1492392  0.0059513      25   <2e-16 ***
s(U,V)2     -0.2936664  0.0052754     -56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> predict(testmod, test_data, type="response")
Error in names(dat) <- object$term : 
  'names' attribute [2] must be the same length as the vector [1]

@bbolker
Contributor Author

bbolker commented Mar 7, 2024

Thanks for the report @AllInCade. Could we have a reproducible example, please? (We can make one up ourselves, but it will go faster if you can construct one ...)

@AllInCade

AllInCade commented Mar 8, 2024

I can upload the data set I've been using (it's a small subset of a huge dataset), which I have cleaned up etc. I've tried all kinds of things and tricks that have solved prediction errors before, but can't get this to work (haven't tried very hard manually, but the automatic way seems to not work). I may have dreamed this, but I could have sworn that I made predictions and plotted etc with 2D smooths with glmmTMB a few weeks ago. Certainly don't have anything saved that works...

wind_data.csv


library(dplyr)
library(lubridate)
library(glmmTMB)

## assuming wind_data_full is read from the attached wind_data.csv
wind_data_full <- read.csv("wind_data.csv")
wind_data_full$datetime_UTC <- ymd_hms(wind_data_full$datetime_UTC)

wind_data <- wind_data_full %>%
  filter(datetime_UTC >= ymd_hms('2007-09-10 00:00:00') & datetime_UTC <= ymd_hms('2007-10-15 23:59:59'))
  
wind_data <- wind_data %>%
  arrange(datetime_UTC) %>% # Make sure the data is sorted by datetime_UTC
  mutate(time = as.numeric(difftime(datetime_UTC, min(datetime_UTC), units = "hours")) + 1)

wind_data$datetime_UTC <- NULL # date type variables sometimes cause problems with predict()

set.seed(123) 
wind_data <- wind_data[order(wind_data$time), ]
total_rows <- nrow(wind_data)
train_size <- floor(0.6 * total_rows)
valid_size <- floor(0.2 * total_rows)
test_size <- total_rows - train_size - valid_size 

train_data <- wind_data[1:train_size, ]
valid_data <- wind_data[(train_size + 1):(train_size + valid_size), ]
test_data <- wind_data[(train_size + valid_size + 1):total_rows, ]

test_mod <- glmmTMB(spd ~ s(time) + s(U,V), data = train_data) 
predictions <- predict(test_mod, valid_data, type="response")

I may add that a colleague of mine and I have been using and testing smooths with glmmTMB extensively; we have found quite a few things that need work, and we have solutions for several of them. We've also been implementing ridge-regularized GAM(M)s in glmmTMB using the smoothCon and smooth2random machinery, with our own automatic (GCV-based) smoothness-selection procedure. We've found that these models tend to outperform the equivalent mgcv::gam models in terms of predictive accuracy (as measured by RMSE) and don't suffer from undersmoothing with MLE (rather than REML). So perhaps it could be an idea to implement a control option for the regularization method (ISSD vs. ridge) in glmmTMB, especially when undersmoothing (overfitting) is a concern and predictive accuracy and computational time are the primary goals (e.g. forecasting).
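
For readers unfamiliar with that machinery, here is a minimal sketch (not the implementation described above) of the smoothCon() + smooth2random() route: the smooth is split into an unpenalized null-space block and a penalized block, and the penalized block can then be handed to mixed-model or ridge machinery with the smoothing parameter chosen however you like (e.g. by GCV).

library(mgcv)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

sm <- smoothCon(s(x), data = d, absorb.cons = TRUE)[[1]]
re <- smooth2random(sm, names(d), type = 2)  # type = 2 is the gamm4-style split
Xf <- re$Xf                                  # null-space columns -> fixed effects
Xr <- re$rand$Xr                             # penalized columns -> random/ridge block

## plain ridge fit of the penalized block for one candidate lambda
X <- cbind(1, Xf, Xr)
lambda <- 1
pen <- diag(c(rep(0, 1 + ncol(Xf)), rep(lambda, ncol(Xr))))
beta <- solve(crossprod(X) + pen, crossprod(X, d$y))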

Additionally, we've seen attempts to compare glmmTMB models with smooth predictors to mgcv::gam models, but these are not exactly equivalent frameworks, since the smooths from gam (directly from smooth.construct / smoothCon) are naturally parameterized and not re-parameterized for mixed-model estimation (via smooth2random). mgcv::gamm() and gamm4() models are identical to glmmTMB models with equivalent model formulas (i.e. given that the family parameterizations are equal, etc.): the basis-function matrices are exactly identical, and so are all model outputs across the mixed-model frameworks.

We'll release the full paper on glmmTMB spline regression in a couple of months, so stay tuned. We'll have everything (code, data, etc.) uploaded to GitHub as well.

bbolker added a commit that referenced this issue Mar 20, 2024
@bbolker
Contributor Author

bbolker commented Mar 20, 2024

I can't replicate this with the current devel (on-its-way-to-CRAN) version, currently version 1.1.9. What version are you using?

@AllInCade

AllInCade commented Mar 21, 2024

I can't replicate this with the current devel (on-its-way-to-CRAN) version, currently version 1.1.9. What version are you using?

That's interesting. Like I said I am almost certain that I've had it work in some other files / models. I was/am using version 1.1.8.

Edit: It seems to work in 1.1.9, so that's good :)

@gavinsimpson

@bbolker One addition that would be useful would be to store, in each smooth, which "coefficients" belong to that smooth. This info is typically stored in the $first.para and $last.para components of each smooth, but unless glmmTMB has the standard coef() function to extract all the model coefficients, this first/last approach is probably a non-starter without further effort, because of how the parameters are split into fixef() and ranef() terms in a GLMM.

Following a quick look at fixef() and ranef() outputs for a glmmTMB model with multiple smooths, it wasn't immediately clear to me how to know which coefs belong to which smooths, e.g.

> ranef(m_glmmTMB2)                                                                                             
$dummy
      dummy1      dummy2     dummy3    dummy4     dummy5   dummy6     dummy7   dummy8    dummy1   dummy2
1 0.02046113 -0.08115577 -0.5104386 -0.385678 -0.1893127 1.407065 -0.2008775 1.480433 0.1304175 0.285876
     dummy3     dummy4    dummy5    dummy6     dummy7    dummy8    dummy1   dummy2    dummy3    dummy4    dummy5
1 0.4071222 -0.1196957 0.4838889 -1.190392 -0.1950292 -1.356725 -15.68131 15.02009 -21.55647 -11.08096 -40.61811
    dummy6    dummy7   dummy8
1 13.18978 -19.92904 9.275013

for this model:

library("gratia")
library("glmmTMB")
df <- data_sim("eg1", seed = 2)
m_glmmTMB2 <- glmmTMB(y ~ s(x0) + s(x1) + s(x2),
    data = df, REML = TRUE)

I presume the duplicated names dummyN actually encode the 8 wiggly bits of the basis for each smooth (with the default k = 10 we have 9 basis functions after the sum-to-zero constraint is applied, and 1 of those is in the penalty null space so it ends up in the fixed effects, leaving 8 to go into the random-effects model matrix).
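
That arithmetic can be checked with mgcv alone; a quick sketch, independent of glmmTMB's internals:

library(mgcv)
set.seed(2)
dd <- data.frame(x0 = runif(100))
sm <- smoothCon(s(x0), data = dd, absorb.cons = TRUE)[[1]]
ncol(sm$X)                    # 9 columns after the sum-to-zero constraint
re <- smooth2random(sm, names(dd), type = 2)
ncol(re$Xf)                   # 1 null-space column -> fixed effects
ncol(re$rand$Xr)              # 8 penalized columns -> random effects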

Do these elements in $dummy need to be called this or could you use the information in $modelInfo$reTrms$cond$smooth_info[[N]]$sm$label to label the ranefs in a way that makes it easier to identify what's what?

(and thinking ahead, for when smooths are allowed in other linear predictors, how are those different linear predictors identified when extracting model "coefficients"?)
