-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Predictions for truncated_nbinom2
models
#350
Comments
(edited the original post to include sdmTMB and R versions) |
I don't think there's much wrong here? -- the truncated negative binomial code used in sdmTMB is taken from glmmTMB (see here: https://github.com/pbs-assess/sdmTMB/blob/c70fa5c3528cfabc6c8136d21b22b430f149a368/src/utils.h#L337C8-L337C25). The simulation below seems to recover the parameters ok -- I think the major issue is that one of your initial means was very small (0.1) and that was causing things to blow up / not converge for other some other seeds I tried. library(dplyr)
library(tidyr)
library(sdmTMB)
set.seed(1)
n_draws <- 150
mod_pars <- data.frame(treatment = letters[1:2], mu = c(2, 5), phi = 0.1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
prd_data <- data.frame(treatment = letters[1:2])
## predictions in link space
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#> treatment est
#> 1 a 0.6644778
#> 2 b 1.5506004
## predictions in response space appear incorrect
## e.g. mean response can't be < 1 for a truncated negative binomial
pred_lp %>% mutate(., response = exp(est))
#> treatment est response
#> 1 a 0.6644778 1.943475
#> 2 b 1.5506004 4.714300
## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)
## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]
## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 2.336000 4.812081
## these appear to correspond to training data means
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 2 × 2
#> treatment mean
#> <chr> <dbl>
#> 1 a 2.34
#> 2 b 4.81 Created on 2024-06-14 with reprex v2.1.0 |
Thanks @ericward-noaa for looking into this and your quick response. And apologies if my pretty contrived example was not particularly clear (and for not having checked whether the model fitting had successfully converged...). And thanks also for the link to the rtruncated_nbinom function. I tried to expose it to R so I could use it to simulate the training dataset directly rather than truncating random draws from a negative binomial distribution - unfortunately my C++ skills weren't up to the task. The parameter estimates in your example do appear to be more consistent with the means of the negative binomial distribution(s) used to simulate the training data, as well as the actual means of the training data after removal of the 0s. But, the differences between the negative binomial and truncated negative binomial distributions become more pronounced with lower means (all else equal), as the difference in means are larger due to more 0s. Hopefully the example below more clearly demonstrates why I think the predictions from library(dplyr)
library(tidyr)
library(sdmTMB)
set.seed(1)
n_draws <- 250
mod_pars <- data.frame(treatment = letters[1:4], mu = c(0.5, 0.75, 1, 2.5), phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf ## predictions in link space
prd_data <- data.frame(treatment = letters[1:4])
pred_lp <- predict(fit, newdata = prd_data, type = "link")
pred_lp
#> treatment est
#> 1 a -0.561669862
#> 2 b -0.197211601
#> 3 c -0.001578785
#> 4 d 0.960275459 ## predictions in response space:
pred_lp %>% mutate(., response = exp(est))
#> treatment est response
#> 1 a -0.561669862 0.5702560
#> 2 b -0.197211601 0.8210169
#> 3 c -0.001578785 0.9984225
#> 4 d 0.960275459 2.6124160 ## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
#> # A tibble: 4 × 2
#> treatment mean_response
#> <chr> <dbl>
#> 1 a 1.48
#> 2 b 1.70
#> 3 c 1.83
#> 4 d 3.17 ## treatment specific mean responses in the training dataset:
sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
#> # A tibble: 4 × 2
#> treatment mean
#> <chr> <dbl>
#> 1 a 1.56
#> 2 b 1.81
#> 3 c 1.98
#> 4 d 3.59 My understanding is that Using the formula for the mean of a zero truncated negative binomial from https://grodri.github.io/glms/notes/countmoments, I am able to recover predicted means that are consistent with both the training dataset, and simulated responses from the fitted model: ## calculate mean of the truncated negative binomial distribution
## assuming mean = mu * (1 - (1 + alpha * mu) ^ (-1 / alpha)) ^ (-1)
## get dispersion term
phi <- tidy(fit, "ran_pars")
alpha <- 1 / phi$estimate[1]
## predicted means in response space:
exp(pred_lp$est) * (1 - (1 + alpha * exp(pred_lp$est)) ^ (-1 / alpha)) ^ (-1)
#> [1] 1.561798 1.809524 1.984962 3.585366 These in combination suggest to me that the predict calls are not returning the predicted mean response from the truncated negative binomial distribution - perhaps instead a parameter that in part defines the mean? Again, I tried unsuccessfully to dig in to the source code, but didn't make much progress. Thanks again for your assistance, and apologies if this is still unclear. I'm aware this isn't the gold standard in a minimal reproducible example! Created on 2024-06-16 with reprex v2.1.0 |
Thanks @tom-peatman for the clear example. I fleshed this out a bit more, adding more levels (I was interested if there was a gradient in any bias, etc). I don't really see one here -- the estimated means seem consistent with the original means (before data are removed) -- and simulated data are consistent with empirical means, after zeros are removed: library(dplyr)
library(tidyr)
library(sdmTMB)
library(ggplot2)
set.seed(1)
n_draws <- 250
true_mu <- seq(0.2, 3, length.out = 20)
mod_pars <- data.frame(treatment = letters[1:length(true_mu)], mu = true_mu, phi = 1)
sim_data <- expand_grid(draw_id = 1:n_draws, mod_pars)
## simulate data from negative binomial distribution
sim_data <- sim_data %>%
mutate(., response = rnbinom(n = nrow(sim_data), mu = mu, size = phi^(-1)))
## remove 0s
sim_data <- sim_data %>% filter(., response > 0)
## fit model to simulated data
fit <- sdmTMB(
response ~ treatment,
data = sim_data, spatial = FALSE,
family = truncated_nbinom2()
)
sanity(fit)
#> ✔ Non-linear minimizer suggests successful convergence
#> ✔ Hessian matrix is positive definite
#> ✔ No extreme or very small eigenvalues detected
#> ✔ No gradients with respect to fixed effects are >= 0.001
#> ✔ No fixed-effect standard errors are NA
#> ✔ No standard errors look unreasonably large
#> ✔ No sigma parameters are < 0.01
#> ✔ No sigma parameters are > 100
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
#> Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
#> -Inf
## predictions in link / response space
prd_data <- data.frame(treatment = letters[1:length(true_mu)])
pred_lp <- predict(fit, newdata = prd_data, type = "link", se_fit = TRUE) %>%
mutate(response = exp(est), lo = exp(est - 2*est_se), hi = exp(est + 2*est_se))
#> Prediction can be slow when `se_fit = TRUE` and random fields are included
#> (i.e., `re_form = NA`). Consider using the `nsim` argument to take draws from
#> the joint precision matrix and summarizing the standard devation of those
#> draws.
pred_lp$true_mu <- true_mu
# There doesn't appear to be a consistent bias here
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw() ## treatment specific mean responses in the training dataset:
sim_mean <- sim_data %>% group_by(., treatment) %>% summarise(., mean = mean(response))
pred_lp$sim_mean <- sim_mean$mean
# When we look at the means in the data with zeros removed (blue dots), these values don't
# include 0s, so are higher
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
geom_point(aes(treatment, sim_mean), col="blue") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw() ## mean of simulated responses
sim_responses <- simulate(fit)
sim_responses <- data.frame(treatment = sim_data$treatment, sim_response = as.numeric(sim_responses))
sim_resp <- sim_responses %>% group_by(treatment) %>% summarise(mean_response = mean(sim_response))
pred_lp$sim_resp <- sim_resp$mean_response
# We can also compare the simualted values (orange dots) to the filtered original data, and these are
# quite similar (neither include 0)
ggplot(pred_lp, aes(treatment, response)) +
geom_linerange(aes(ymin = lo, ymax = hi)) +
geom_point() +
geom_point(aes(treatment, true_mu), col="red") +
geom_point(aes(treatment, sim_mean), col="blue") +
geom_point(aes(treatment, sim_resp), col="orange") +
xlab("Treatment") + ylab("Estimated response") +
coord_flip() + theme_bw() Created on 2024-06-17 with reprex v2.1.0 |
Thanks @ericward-noaa for this, very helpful. It looks like I've fundamentally misunderstood what is returned by I suppose that In a situation where 0s occur but the 0s are not observable, and modelled with a However, in the case of For what it's worth, I think it would be worth considering adding something to the |
Thinking more about this overnight, there's also potential cases where a delta model is being fit, but with the delta and positives components fitted separately so that different formulas can be specified for each component. Here, with a I'm not sure how helpful any of this is - the dialogue has been very informative for me at least so thanks again @ericward-noaa for your input! |
Agree -- this is a great discussion, and I appreciate the thoughts. I'll work on folding some of your comments into the documentation to make it more clear what is being returned with the predict() function. I think in implementing this, the goal was try to be as consistent as possible with implementations in other packages like glmmTMB, but I think your points on the delta models are good and potentially confusing to users. On the glmmTMB side, I should have added this in my last code snippet, but we can add on the glmmTMB equivalent with their
|
I also didn't realize until this thread that the linear predictors on the truncated count distributions were predicting the expectation for the untruncated distributions. The help file in glmmTMB doesn't note this, but does link to the aster package vignette where the algorithms are described. This part in that vignette is relevant: Indeed, although the link-level predictions are coming back correctly, I believe we have the combined expectation from the I had missed this part: That converts the mean of the untruncated distribution to the mean of the the truncated distribution before combining that linear predictor with the probability of a non zero. I'll get this fixed. Here were my experiments: library(sdmTMB)
set.seed(1)
d0 <- data.frame(y = rnbinom(n = 4e4, mu = 3, size = 5))
mean(d0$y)
#> [1] 2.997775 fit <- sdmTMB(
y ~ 1,
data = d0,
spatial = FALSE,
family = delta_truncated_nbinom2()
)
fit$sd_report
#> sdreport(.) result
#> Estimate Std. Error
#> b_j 2.260764 0.017098746
#> b_j2 1.094016 0.004466339
#> ln_phi 1.572268 0.026848602
#> Maximum gradient component: 0.001687826 p <- predict(fit)
est <- plogis(p$est1) * exp(p$est2)
mean(d0$y)
#> [1] 2.997775 mean(est)
#> [1] 2.704267 # which is what is happening here:
p <- predict(fit, type = "response")
mean(p$est)
#> [1] 2.704267 library(glmmTMB)
#>
#> Attaching package: 'glmmTMB'
#> The following objects are masked from 'package:sdmTMB':
#>
#> lognormal, nbinom1, nbinom2, truncated_nbinom1, truncated_nbinom2,
#> tweedie fit2 <- glmmTMB::glmmTMB(
y ~ 1,
ziformula = ~ 1,
data = d0,
family = glmmTMB::truncated_nbinom2()
)
# matches:
fit2$sdr
#> sdreport(.) result
#> Estimate Std. Error
#> beta 1.094016 0.004466339
#> betazi -2.260764 0.017098746
#> betad 1.572268 0.026848602
#> Maximum gradient component: 0.001687826 p2 <- predict(fit2)
# matches:
mean(exp(p2))
#> [1] 2.986243 p2 <- predict(fit2, type = "zlink")
mean(p2)
#> [1] -2.260764 plogis(mean(p2))
#> [1] 0.094425 # the mean of the *truncated* distribution:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355 s <- as.list(fit2$sdr, "Estimate")
plogis(s$betazi)
#> [1] 0.094425 s1 <- s$beta
etad <- s$betad
# convert to truncated mean:
logspace_add <- DPQ::logspace.add
logspace_sub <- DPQ::logspace.sub
(s2 <- logspace_add(0, s1 - etad))
#> [1] 0.4823433 (ans <- logspace_sub(0, -exp(etad) * s2))
#> [1] -0.1030396 xx <- exp(s1) / exp(ans)
xx
#> [1] 3.310355 # which matches:
p2 <- predict(fit2, type = "conditional")
mean(p2)
#> [1] 3.310355 # combined with glmmTMB:
p2 <- predict(fit2, type = "response")
mean(p2)
#> [1] 2.997775 # matches (with sdmTMB):
prob_1 <- plogis(tidy(fit, model = 1)$estimate)
xx * prob_1
#> [1] 2.997775 Created on 2024-06-18 with reprex v2.1.0 |
OK, this is fixed in the main branch. See these new unit tests: https://github.com/pbs-assess/sdmTMB/blob/main/tests/testthat/test-truncated-dists.R The link-scale predictions remain unchanged. However, the response-scale predictions and any internal calculations that rely on response-scale predictions (including, importantly, the delta families) now use the truncated mean. Thanks for pointing this out, @tom-peatman! |
Thanks @seananderson for fixing this so quickly, much appreciated! And thanks again to you and @ericward-noaa for being so responsive to my questions on this, I'm very grateful. |
I've been fitting
delta_truncated_nbinom2
models withsdmTMB
to some rare-event captures data collected by observers.It appears that
predict.sdmTMB
does not return correct predictions for the positive component of the model (at the response scale), or I'm not interpreting the results appropriately.I'm running sdmTMB v0.5.0.9008 on R v 4.4.0 - but have reviewed recent commits and didn't see recent activity that would address this. Apologies if it has been addressed in a more recent version.
As an example, below I simulate some data from a negative binomial distributions with treatment-specific means and a shared overdispersion term, and remove any 0s:
It appears that the link space predictions from
predict.sdmTMB
may correspond to the parametermu
for a truncated negative binomial distribution where the mean of the distribution is equal tomu * (1 - (1 + alpha * mu) ^ (-1 / alpha) ^ (-1)
:I can manually adjust the predictions for the positives component for my
delta_truncated_nbinom2
models (as above) to get response scale predictions, but wanted to check that this adjustment is actually correct.Many thanks in advance, any advice would be much appreciated! And thanks again for the
sdmTMB
package - I've been really impressed with how flexible and easy to use it is.The text was updated successfully, but these errors were encountered: