
Predictions for truncated_nbinom2 models #350

Closed

tom-peatman opened this issue Jun 14, 2024 · 10 comments

@tom-peatman commented Jun 14, 2024

I've been fitting delta_truncated_nbinom2 models with sdmTMB to some rare-event captures data collected by observers.

It appears that predict.sdmTMB does not return correct predictions for the positive component of the model (at the response scale), or I'm not interpreting the results appropriately.

I'm running sdmTMB v0.5.0.9008 on R v4.4.0. I have reviewed recent commits and didn't see any recent activity that would address this - apologies if it has already been addressed in a more recent version.

As an example, below I simulate some data from negative binomial distributions with treatment-specific means and a shared overdispersion term, and remove any 0s:

library(dplyr)
library(tidyr)
library(sdmTMB)

n_draws <- 150
mod_pars <- data.frame(treatment = letters[1:2], mu = c(0.1, 2), phi = 0.1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
    mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

It appears that the link-space predictions from predict.sdmTMB may correspond to the parameter mu of a truncated negative binomial distribution whose mean is mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1):

prd_data <- data.frame(treatment = letters[1:2])

## predictions in link space
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp

## predictions in response space appear incorrect
## e.g. mean response can't be < 1 for a truncated negative binomial
pred_lp %>% mutate(., response = exp(est))

## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)

## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]

## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)

## these appear to correspond to training data means
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))

I can manually adjust the predictions for the positives component for my delta_truncated_nbinom2 models (as above) to get response scale predictions, but wanted to check that this adjustment is actually correct.
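
For the delta case, the adjustment I'm applying looks roughly like the following (a sketch only: fit_delta stands in for one of my delta_truncated_nbinom2 fits, and I'm assuming tidy() can return the positive component's dispersion via model = 2):

## manual response-scale prediction for a delta_truncated_nbinom2 fit (sketch)
p <- predict(fit_delta, newdata = prd_data)
phi <- tidy(fit_delta, "ran_pars", model = 2)$estimate[1]  ## dispersion of positive component (assumed API)
mu2 <- exp(p$est2)                                  ## untruncated mean of the positive component
mu2_trunc <- mu2 / (1 - (1 + mu2 / phi) ^ (-phi))   ## zero-truncated mean
plogis(p$est1) * mu2_trunc                          ## combined expected response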

Many thanks in advance, any advice would be much appreciated! And thanks again for the sdmTMB package - I've been really impressed with how flexible and easy to use it is.

@tom-peatman (Author)

(edited the original post to include sdmTMB and R versions)

@ericward-noaa (Collaborator)

I don't think there's much wrong here? -- the truncated negative binomial code used in sdmTMB is taken from glmmTMB (see here: https://github.com/pbs-assess/sdmTMB/blob/c70fa5c3528cfabc6c8136d21b22b430f149a368/src/utils.h#L337C8-L337C25). The simulation below seems to recover the parameters OK -- I think the main issue is that one of your initial means was very small (0.1), which was causing things to blow up / not converge for some other seeds I tried.

library(dplyr)
library(tidyr)
library(sdmTMB)
  
set.seed(1)
n_draws <- 150
mod_pars <- data.frame(treatment = letters[1:2], mu = c(2, 5), phi = 0.1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

prd_data <- data.frame(treatment = letters[1:2])

## predictions in link space
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#>   treatment       est
#> 1         a 0.6644778
#> 2         b 1.5506004

## predictions in response space appear incorrect
## e.g. mean response can't be < 1 for a truncated negative binomial
pred_lp %>% mutate(., response = exp(est))
#>   treatment       est response
#> 1         a 0.6644778 1.943475
#> 2         b 1.5506004 4.714300

## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)

## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]

## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 2.336000 4.812081

## these appear to correspond to training data means
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 2 × 2
#>   treatment  mean
#>   <chr>     <dbl>
#> 1 a          2.34
#> 2 b          4.81

Created on 2024-06-14 with reprex v2.1.0

@tom-peatman (Author) commented Jun 16, 2024

Thanks @ericward-noaa for looking into this and for your quick response. And apologies if my pretty contrived example was not particularly clear (and for not having checked whether the model fitting had successfully converged...).

And thanks also for the link to the rtruncated_nbinom function. I tried to expose it to R so I could use it to simulate the training dataset directly rather than truncating random draws from a negative binomial distribution - unfortunately my C++ skills weren't up to the task.
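
In case it's useful, one simple way to draw directly from the zero-truncated distribution in R is a small rejection sampler (a rough sketch only, not a substitute for exposing the C++ rtruncated_nbinom function):

## draw from a zero-truncated negative binomial by redrawing any 0s
## (simple rejection sampler; slow when P(Y = 0) is large)
rtrunc_nbinom2 <- function(n, mu, size) {
  mu <- rep_len(mu, n)
  y <- rnbinom(n, mu = mu, size = size)
  while (any(y == 0)) {
    idx <- which(y == 0)
    y[idx] <- rnbinom(length(idx), mu = mu[idx], size = size)
  }
  y
}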

The parameter estimates in your example do appear to be more consistent with the means of the negative binomial distribution(s) used to simulate the training data, as well as with the actual means of the training data after removal of the 0s. However, the differences between the negative binomial and truncated negative binomial distributions become more pronounced at lower means (all else equal), as the difference in means is larger due to the higher proportion of 0s.
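
To illustrate (a quick back-of-the-envelope calculation with alpha = 1, matching phi in mod_pars below):

## zero probability and zero-truncated mean for a few values of mu, with alpha = 1
mu_demo <- c(0.1, 0.5, 2, 5)
p0_demo <- (1 + mu_demo) ^ (-1)   ## P(Y = 0) for the untruncated distribution when alpha = 1
data.frame(mu = mu_demo, p0 = p0_demo, trunc_mean = mu_demo / (1 - p0_demo))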

Hopefully the example below more clearly demonstrates why I think the predictions from predict calls are incorrect for truncated_nbinom2 models (and positive components of delta_truncated_nbinom2 models):

library(dplyr)
library(tidyr)
library(sdmTMB)

set.seed(1)

n_draws <- 250
mod_pars <- data.frame(treatment = letters[1:4], mu = c(0.5, 0.75, 1, 2.5), phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
## predictions in link space
prd_data <- data.frame(treatment = letters[1:4])
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#>   treatment          est
#> 1         a -0.561669862
#> 2         b -0.197211601
#> 3         c -0.001578785
#> 4         d  0.960275459
## predictions in response space:
pred_lp %>% mutate(., response = exp(est))
#>   treatment          est  response
#> 1         a -0.561669862 0.5702560
#> 2         b -0.197211601 0.8210169
#> 3         c -0.001578785 0.9984225
#> 4         d  0.960275459 2.6124160
## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
#> # A tibble: 4 × 2
#>   treatment mean_response
#>   <chr>             <dbl>
#> 1 a                  1.48
#> 2 b                  1.70
#> 3 c                  1.83
#> 4 d                  3.17
## treatment specific mean responses in the training dataset:
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 4 × 2
#>   treatment  mean
#>   <chr>     <dbl>
#> 1 a          1.56
#> 2 b          1.81
#> 3 c          1.98
#> 4 d          3.59

My understanding is that predict should return the predicted mean response, in which case the predictions are biased, and also inconsistent with both the assumptions of the model (a zero-truncated negative binomial can't have a mean response < 1) and data simulated from the fitted model. The predictions are also suspiciously similar to the means of the (untruncated) negative binomial distributions that were used to generate the underlying data (i.e. mod_pars$mu).

Using the formula for the mean of a zero-truncated negative binomial from https://grodri.github.io/glms/notes/countmoments (i.e. the untruncated mean divided by the probability of a non-zero), I am able to recover predicted means that are consistent with both the training dataset and simulated responses from the fitted model:

## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)

## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]

## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 1.561798 1.809524 1.984962 3.585366
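
As a cross-check on the formula itself, independent of sdmTMB (a rough Monte Carlo comparison at one of the simulation settings above):

## empirical vs. analytical zero-truncated mean at mu = 0.5, alpha = 1
draws <- rnbinom(1e6, mu = 0.5, size = 1)
mean(draws[draws > 0])            ## empirical mean of the non-zero draws (~1.5)
0.5 / (1 - (1 + 0.5) ^ (-1))      ## analytical truncated mean = 1.5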

Taken together, these suggest to me that the predict calls are not returning the predicted mean response from the truncated negative binomial distribution, but perhaps instead a parameter that partly defines the mean? Again, I tried to dig into the source code, but didn't make much progress.

Thanks again for your assistance, and apologies if this is still unclear. I'm aware this isn't the gold standard in a minimal reproducible example!

Created on 2024-06-16 with reprex v2.1.0

@ericward-noaa (Collaborator) commented Jun 17, 2024

Thanks @tom-peatman for the clear example. I fleshed this out a bit more, adding more levels (I was interested in whether there was a gradient in any bias, etc.). I don't really see one here -- the estimated means seem consistent with the original means (before data are removed) -- and simulated data are consistent with the empirical means after zeros are removed:

library(dplyr)
library(tidyr)
library(sdmTMB)
library(ggplot2)

set.seed(1)

n_draws <- 250 
true_mu <- seq(0.2, 3, length.out = 20)
mod_pars <- data.frame(treatment = letters[1:length(true_mu)], mu = true_mu, phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)

## simulate data from negative binomial distribution
sim_data <- sim_data %>%
  mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))

## remove 0s
sim_data <- sim_data %>% filter(., response > 0)

## fit model to simulated data
fit <- sdmTMB(
  response ~ treatment,
  data = sim_data, spatial = FALSE,
  family = truncated_nbinom2()
)

sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf

## predictions in link / response space
prd_data <- data.frame(treatment = letters[1:length(true_mu)])
pred_lp <- predict(fit, newdata = prd_data, type = "link", se_fit = TRUE) %>% 
  mutate(response = exp(est), lo = exp(est - 2*est_se), hi = exp(est + 2*est_se))
#> Prediction can be slow when `se_fit = TRUE` and random fields are included
#> (i.e., `re_form = NA`). Consider using the `nsim` argument to take draws from
#> the joint precision matrix and summarizing the standard devation of those
#> draws.
pred_lp$true_mu <- true_mu

# There doesn't appear to be a consistent bias here
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

## treatment specific mean responses in the training dataset:
sim_mean <- sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
pred_lp$sim_mean <- sim_mean$mean

# When we look at the means in the data with zeros removed (blue dots), these values don't 
# include 0s, so are higher
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  geom_point(aes(treatment, sim_mean), col="blue") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_resp <- sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
pred_lp$sim_resp <- sim_resp$mean_response

# We can also compare the simulated values (orange dots) to the filtered original data, and these are
# quite similar (neither includes 0s)
ggplot(pred_lp, aes(treatment, response)) + 
  geom_linerange(aes(ymin = lo, ymax = hi)) + 
  geom_point() + 
  geom_point(aes(treatment, true_mu), col="red") + 
  geom_point(aes(treatment, sim_mean), col="blue") + 
  geom_point(aes(treatment, sim_resp), col="orange") + 
  xlab("Treatment") + ylab("Estimated response") + 
  coord_flip() + theme_bw()

Created on 2024-06-17 with reprex v2.1.0

@tom-peatman (Author)

Thanks @ericward-noaa for this, very helpful.

It looks like I've fundamentally misunderstood what is returned by predict calls for truncated_nbinom2 models, and for the positive component of delta_truncated_nbinom2 models. My expectation was that predict should return the predicted mean response from the zero-truncated distribution (at either the link or response scale). However, it seems that predict actually returns the linear predictor, which in the case of the truncated_nbinom2 distribution is the mean of the untruncated distribution.

I suppose that truncated_nbinom2 models can be used in different situations, and depending on the context users might have different expectations of what predictions should be returned by predict.

In a situation where 0s occur but are not observable, and the data are modelled with a truncated_nbinom2 model, predictions from the untruncated distribution may be appropriate, i.e. the predictions reflect the full range of possible responses (including 0s).

However, in the case of delta_truncated_nbinom2 models, I'd argue that predictions for the positives component should come from the zero-truncated distribution: the positives component models counts when present, so 0s are not a possible outcome.

For what it's worth, I think it would be worth adding something to the predict documentation that explains what is returned for delta_truncated_nbinom2 models, and/or a warning message when predicting from them, to avoid any misunderstandings by users.

@tom-peatman (Author)

Thinking more about this overnight, there are also potential cases where a delta model is being fitted, but with the presence/absence and positives components fitted separately so that different formulas can be specified for each component. Here, with a truncated_nbinom2 model for the positives, I think predictions from the truncated distribution are appropriate?
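
For example (a rough sketch of that two-stage workflow, using a hypothetical data frame dat with a presence indicator and an extra depth covariate, rather than anything from the reprexes above):

## stage 1: presence/absence, allowed its own formula
fit_bin <- sdmTMB(
  present ~ treatment + depth,
  data = dat, spatial = FALSE,
  family = binomial()
)

## stage 2: counts when present, zero-truncated
fit_pos <- sdmTMB(
  response ~ treatment,
  data = filter(dat, response > 0), spatial = FALSE,
  family = truncated_nbinom2()
)

## combined expectation: P(present) * mean of the *truncated* distribution
p1 <- predict(fit_bin, newdata = prd_data, type = "response")$est
mu <- exp(predict(fit_pos, newdata = prd_data, type = "link")$est)
phi <- tidy(fit_pos, "ran_pars")$estimate[1]
p1 * mu / (1 - (1 + mu / phi) ^ (-phi))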

I'm not sure how helpful any of this is - the dialogue has been very informative for me at least so thanks again @ericward-noaa for your input!

@ericward-noaa (Collaborator) commented Jun 18, 2024

Agree -- this is a great discussion, and I appreciate the thoughts. I'll work on folding some of your comments into the documentation to make it clearer what is being returned by the predict() function. I think in implementing this, the goal was to try to be as consistent as possible with implementations in other packages like glmmTMB, but I think your points on the delta models are good, and the current behaviour is potentially confusing to users.

On the glmmTMB side, I should have added this to my last code snippet, but we can fit the glmmTMB equivalent with their truncated_nbinom2() family and show that the link-scale predictions are perfectly correlated:

library(glmmTMB)

## refit the last sdmTMB example with glmmTMB's truncated_nbinom2() family
fit_glmmTMB <- glmmTMB(
  response ~ treatment,
  data = sim_data,
  family = truncated_nbinom2()
)

## link-scale predictions from the two packages
pred_glmmTMB <- predict(fit_glmmTMB, newdata = prd_data, type = "link")

cor(pred_lp$est, pred_glmmTMB)

@seananderson (Member)

I also didn't realize until this thread that the linear predictors on the truncated count distributions were predicting the expectation for the untruncated distributions. The help file in glmmTMB doesn't note this, but does link to the aster package vignette where the algorithms are described.

This part in that vignette is relevant:
[screenshot of the relevant passage from the aster vignette omitted]

Indeed, although the link-level predictions are coming back correctly, I believe we have the combined expectation from delta_truncated_nbinom2() wrong.

I had missed this part:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L1046-L1047
where log_nzprob is calculated here:
https://github.com/glmmTMB/glmmTMB/blob/4ab751238c2494d1a0f8959551fb4486bc3ff909/glmmTMB/src/glmmTMB.cpp#L207-L213

That converts the mean of the untruncated distribution to the mean of the truncated distribution before combining that linear predictor with the probability of a non-zero.
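
In plain R terms, that conversion is just dividing the untruncated mean by the probability of a non-zero (a sketch of the arithmetic for nbinom2, not the actual C++):

## equivalent of the log_nzprob step for nbinom2, on the natural scale
## mu = mean of the untruncated distribution, phi = size/dispersion
nzprob <- function(mu, phi) 1 - (phi / (phi + mu)) ^ phi   ## P(Y > 0)
truncated_mean <- function(mu, phi) mu / nzprob(mu, phi)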

I'll get this fixed.

Here were my experiments:

library(sdmTMB)

set.seed(1)
d0 <- data.frame(y = rnbinom(n = 4e4, mu = 3, size = 5))

mean(d0$y)
#> [1] 2.997775
fit <- sdmTMB(
  y ~ 1,
  data = d0, 
  spatial = FALSE,
  family = delta_truncated_nbinom2()
)
fit$sd_report
#> sdreport(.) result
#>        Estimate  Std. Error
#> b_j    2.260764 0.017098746
#> b_j2   1.094016 0.004466339
#> ln_phi 1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p <- predict(fit)
est <- plogis(p$est1) * exp(p$est2)
mean(d0$y)
#> [1] 2.997775
mean(est)
#> [1] 2.704267
# which is what is happening here:
p <- predict(fit, type = "response")
mean(p$est)
#> [1] 2.704267
library(glmmTMB)
#> 
#> Attaching package: 'glmmTMB'
#> The following objects are masked from 'package:sdmTMB':
#> 
#>     lognormal, nbinom1, nbinom2, truncated_nbinom1, truncated_nbinom2,
#>     tweedie
fit2 <- glmmTMB::glmmTMB(
  y ~ 1,
  ziformula = ~ 1,
  data = d0, 
  family = glmmTMB::truncated_nbinom2()
)
# matches:
fit2$sdr
#> sdreport(.) result
#>         Estimate  Std. Error
#> beta    1.094016 0.004466339
#> betazi -2.260764 0.017098746
#> betad   1.572268 0.026848602
#> Maximum gradient component: 0.001687826
p2 <- predict(fit2)

# matches:
mean(exp(p2))
#> [1] 2.986243
p2 <- predict(fit2, type = "zlink")
mean(p2)
#> [1] -2.260764
plogis(mean(p2))
#> [1] 0.094425
# the mean of the *truncated* distribution:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
s <- as.list(fit2$sdr, "Estimate")
plogis(s$betazi)  # glmmTMB's zero-inflation probability (sign-flipped relative to sdmTMB's presence probability)
#> [1] 0.094425
s1 <- s$beta      # log(mu), the log mean of the *untruncated* distribution
etad <- s$betad   # log(phi), the log dispersion (size) parameter

# convert to truncated mean, mirroring glmmTMB's log_nzprob calculation:
logspace_add <- DPQ::logspace.add
logspace_sub <- DPQ::logspace.sub
(s2 <- logspace_add(0, s1 - etad))         # log(1 + mu / phi)
#> [1] 0.4823433
(ans <- logspace_sub(0, -exp(etad) * s2))  # log(1 - (1 + mu / phi)^(-phi)) = log P(Y > 0)
#> [1] -0.1030396
xx <- exp(s1) / exp(ans)                   # mu / P(Y > 0) = mean of the truncated distribution
xx
#> [1] 3.310355
# which matches:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355
# combined with glmmTMB:
p2 <- predict(fit2, type = "response")
mean(p2)
#> [1] 2.997775
# matches (with sdmTMB):
prob_1 <- plogis(tidy(fit, model = 1)$estimate)
xx * prob_1
#> [1] 2.997775

Created on 2024-06-18 with reprex v2.1.0

seananderson added a commit that referenced this issue Jun 18, 2024
@seananderson (Member)

OK, this is fixed in the main branch. See these new unit tests: https://github.com/pbs-assess/sdmTMB/blob/main/tests/testthat/test-truncated-dists.R

The link-scale predictions remain unchanged. However, the response-scale predictions and any internal calculations that rely on response-scale predictions (including, importantly, the delta families) now use the truncated mean.
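
If anyone wants to confirm the new behaviour on one of the truncated_nbinom2 fits above (a minimal check, assuming the development version with the fix installed and reusing fit/prd_data from one of the earlier reprexes):

## response-scale predictions should now equal the manually computed truncated means
phi <- tidy(fit, "ran_pars")$estimate[1]
mu <- exp(predict(fit, newdata = prd_data, type = "link")$est)
pred_resp <- predict(fit, newdata = prd_data, type = "response")
all.equal(pred_resp$est, mu / (1 - (1 + mu / phi) ^ (-phi)))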

Thanks for pointing this out, @tom-peatman!

@tom-peatman (Author)

Thanks @seananderson for fixing this so quickly, much appreciated! And thanks again to you and @ericward-noaa for being so responsive to my questions on this, I'm very grateful.
