/
beta-binomial-deviance-residuals.Rmd
216 lines (165 loc) · 11 KB
/
beta-binomial-deviance-residuals.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "Beta-binomial Deviance Residuals"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Beta-binomial Deviance Residuals}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(extras)
library(tidyr)
library(ggplot2)
library(viridis)
library(scales)
```
### Beta-binomial distribution
The beta-binomial distribution is useful when we wish to incorporate additional variation into the probability parameter of the binomial distribution, $p$.
It accomplishes this by drawing $p$ from a beta distribution.
The parameters of the beta-binomial are the number of trials, $n$, and the shape parameters of the beta distribution, $\alpha$ and $\beta$.
The likelihood of the beta-binomial can be described as follows:
$$P(x | n, \alpha, \beta) = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x + 1)}
\frac{\Gamma(x + \alpha) \Gamma(n - x + \beta)}{\Gamma(n + \alpha + \beta)}
\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)},$$
where $x$ is a natural number $≤ n$, $\alpha$ and $\beta$ are $>0$, and where $\Gamma()$ represents the gamma function.
The log-likelihood is:
$$\log(P(x | n, \alpha, \beta)) = \log(\Gamma(n + 1)) - \log(\Gamma(x + 1)) - \log(\Gamma(n - x + 1)) +
\log(\Gamma(x + \alpha)) + \\ \log(\Gamma(n - x + \beta)) - \log(\Gamma(n + \alpha + \beta)) +
\log(\Gamma(\alpha + \beta)) - \log(\Gamma(\alpha)) - \log(\Gamma(\beta)).$$
The expected value of the beta-binomial distribution is $n \cdot \frac{\alpha}{\alpha + \beta}$.
This is the same as the expected value of the binomial distribution ($n \cdot p$), with the $p$ replaced by the expected value of the beta distribution, $\frac{\alpha}{\alpha + \beta}$.
A parameterization frequently used for the beta-binomial distribution uses this expected probability ($p$) as a parameter, with a dispersion parameter ($\theta$) that specifies the variance in the probability.
Our particular parameterization of $\theta$ is slightly unconventional, but has useful properties when modelling.
When $\theta = 0$, the beta-binomial reverts to the binomial distribution.
When $\theta = 1$ and $p = 0.5$, the parameters of the beta distribution become $\alpha = 1$ and $\beta = 1$, which correspond to a uniform distribution for the beta-binomial probability parameter.
The relationships between $p$ and $\theta$ and $\alpha$ and $\beta$ are defined as follows:
$$p = \frac{\alpha}{\alpha + \beta}$$
$$\theta = \frac{2}{\alpha + \beta}$$
$$\alpha = 2\cdot\frac{p}{\theta}$$
$$\beta = 2\cdot\frac{1 - p}{\theta}$$
### Deviance:
Recall that the likelihood of a model is the probability of the data set given the model ($P(\text{data}|\text{model})$).
The *deviance* of a model, $D$, is defined by:
$$D(\text{model},\text{data}) = 2(\log(P(\text{data}|\text{saturated model})) - \log(P(\text{data}|\text{fitted model}))),$$
where the saturated model is the model that perfectly fits the data.
The saturated model usually uses the value of $x$ in place of the mean parameter, thereby fitting through the data points.
For example, calculating the saturated log-likelihood of the normal distribution involves using the $x$ value(s) in place of the mean parameter $\mu$ in the log-likelihood.
Recall from above that the expected value of the binomial distribution is $n \cdot p$.
Since $n$ is fixed in the binomial distribution, $p = \frac{x}{n}$ is used to calculate the saturated log-likelihood.
In the case of the beta-binomial distribution, there is some confusion about what should be used as the saturated model.
One suggested approach is to use the saturated log-likelihood of the binomial distribution, as described above, effectively setting the $\theta$ dispersion term to 0.
A second suggested approach is to use the same method as described for the binomial distribution (i.e., replacing $p$ with $\frac{x}{n}$), and holding the $\theta$ value constant.
Both of these approaches have issues.
The problem with the first approach is that likelihood profile of the beta-binomial becomes increasingly dispersed relative to the binomial as $\theta$ increases (Fig. 1).
Consequently, increasing $\theta$ increases the deviance even if the expected value of the fitted model is the same as the saturated binomial model.
```{r, echo = FALSE, fig.width = 7, fig.height = 4, fig.cap = "Fig. 1: Beta-binomial likelihood profile for $x = 1$ and $n = 5$, for different values of $\\theta$. $\\theta = 0$ corresponds to the binomial case."}
n_samp <- 1000
x <- 1
size <- 5
prob <- 0.3
theta <- c(0, 0.1, 0.5, 1)
prob_seq <- seq(0, 1, length.out = n_samp)
lik <- data.frame(prob_seq = prob_seq,
lik_0.0 = exp(log_lik_beta_binom(x, size, prob_seq, theta[1])),
lik_0.1 = exp(log_lik_beta_binom(x, size, prob_seq, theta[2])),
lik_0.5 = exp(log_lik_beta_binom(x, size, prob_seq, theta[3])),
lik_1.0 = exp(log_lik_beta_binom(x, size, prob_seq, theta[4])))
tb <- as_tibble(lik)
tb <- pivot_longer(tb, cols = c(lik_0.0, lik_0.1, lik_0.5, lik_1.0),
names_to = "theta", names_prefix = "lik_",
values_to = "lik")
ggplot(tb, aes(x = prob_seq, y = lik, colour = theta)) +
geom_line() +
scale_colour_viridis(discrete = TRUE) +
xlab("p") +
ylab("likelihood")
```
The problem with the second approach is that for some combinations of $p$ and $\theta$, the resultant deviance values are negative.
This is an issue for two reasons.
Firstly, it is nonsensical because by definition the saturated model should be the best possible fit to the data.
Secondly, it is computationally problematic, because calculating deviance residuals, as we will do in the next section, involves taking the square root of the deviance.
Here is an example of a case where the deviance is negative.
Using $x = 1$, $n = 5$, $p = 0.3$, and $\theta = 0.5$, we calculate the saturated log-likelihood:
$$\alpha_{sat} = \frac{2 (\frac{x}{n})}{\theta} = \frac{2(0.3)}{0.5} = 0.8$$
$$\beta_{sat} = \frac{2 (1 - \frac{x}{n})}{\theta} = \frac{2 (1 - 0.3)}{0.5} = 3.2$$
$$ \log(P(x|n, p_{sat}, \theta_{sat}) = -1.355106,$$
the fitted log-likelihood:
$$\alpha_{fit} = \frac{2 (p)}{\theta} = \frac{2(0.3)}{0.5} = 1.2$$
$$\beta_{fit} = \frac{2 (1 - p)}{\theta} = \frac{2 (1 - 0.3}{0.5} = 2.8$$
$$\log(P(x|n, p_{fit}, \theta_{fit})) = -1.32999,$$
and finally, the deviance:
\begin{aligned}
D(n, p, \theta, x) &= 2(\log(P(x|n, p_{sat}, \theta_{sat})) - \log(P(x|n, p_{fit}, \theta_{fit}))) \\
&= 2(-1.355106 - -1.32999) \\
&= -0.050232.
\end{aligned}
In summary, the deviance is negative because the likelihood at the expected value, $p = \frac{x}{n}$ is less than the likelihood of the observed value (Fig. 2).
The solution to the problem appears to be to choose the value of $p$ that maximizes the likelihood.
However, the beta-binomial distribution has closed-form solutions only for certain values of $\alpha$ and $\beta$.
Thus, we must generally search for the value of $p$ that maximizes the likelihood for each data point in order to calculate the saturated log-likelihood.
We will refer to the optimized $p$ value for the $i^{th}$ data point as $p_i^*$, and generally as $p^*$.
We use $p^*$ to calculate the saturated log-likelihood that produces deviances that are (a) strictly positive, and (b) relative to the value of $\theta$.
The value of $p_i^*$ for the example case with $x = 1$, $n = 5$, and $\theta = 0.5$ is shown in Figure 2.
```{r, echo = FALSE, fig.width = 7, fig.height = 4, fig.cap = "Fig. 2: Beta-binomial likelihood profile for $x = 1$, $n = 5$, and $\\theta = 0.5$. The dashed vertical line shows the likelihood at the expected value of the beta-binomial distribution ($n \\cdot p$) where $p = \\frac{x}{n} = \\frac{1}{5} = 0.2$. As you can see, this is not the $p$ for which the likelihood is maximized. The solid vertical line shows the likelihood at its maximum point. In this case, $p^* = 0.26$."}
n_samp <- 1000
x <- 1
size <- 5
prob <- 0.3
theta <- 0.5
opt_beta_binom <- function(prob, x, size, theta) {
-log_lik_beta_binom(x = x, size = size, prob = prob, theta = theta)
}
opt_p <- stats::optimize(opt_beta_binom, interval = c(0, 1), x = x,
size = size, theta = theta)$minimum
prob_seq <- seq(0, 1, length.out = n_samp)
lik <- data.frame(prob_seq = prob_seq,
lik = exp(log_lik_beta_binom(x, size, prob_seq, theta)))
p_df <- data.frame(val = c(x / size, opt_p),
probability = c("Expected value (x / n) = 0.20", "Optimized value (p*) = 0.26"))
ggplot(lik, aes(x = prob_seq, y = lik)) +
geom_line(colour = viridis(1, begin = 0.66)) +
geom_vline(data = p_df, mapping = aes(linetype = probability, xintercept = val)) +
xlab("p") +
ylab("likelihood")
```
### Deviance residuals
The unit deviance, $d_i$ refers to the deviance for the $i^{th}$ data point, $x_i$:
$$d_i = 2(\log(P(x_i|n, p_i^*, \theta)) - \log(P(x_i|n, p, \theta))).$$
The *deviance residual* is the signed squared root of the unit deviance,
$$r_i = \text{sign}(x_i - (n p_i^*)) \sqrt{d_i}.$$
The following plot shows histograms of deviance residuals for 10,000 randomly generated data points with $n = 50$ and various values of $p$ and $\theta$.
```{r, echo = FALSE, fig.height = 8, fig.width = 7, fig.cap = "Fig. 4: Histograms of deviance residuals for 10,000 beta-binomial data points simulated with $n = 50$, $p = \\{0.3, 0.5, 0.9 \\}$, and $\\theta = \\{0.0, 0.1, 0.5, 1.0 \\}$. The percentages within each panel (i.e., each combination of $p$ and $\\theta$) sum to 100%."}
n_samp <- 10000
size <- 50
prob <- c(0.3, 0.5, 0.9)
theta <- c(0, 0.1, 0.5, 1)
res_mat <- matrix(NA, nrow = n_samp * length(theta), ncol = length(prob))
set.seed(101)
for (i in 1:length(prob)) {
res <- NULL
for (j in 1:length(theta)) {
x <- ran_beta_binom(n_samp, size, prob[i], theta[j])
res <- c(res, res_beta_binom(x, size, prob[i], theta[j]))
}
res_mat[, i] <- res
}
df <- as.data.frame(res_mat)
colnames(df) <- c("prob_0.3", "prob_0.5", "prob_0.9")
df$theta <- rep(theta, each = n_samp)
df <- pivot_longer(df, cols = c(prob_0.3, prob_0.5, prob_0.9), names_to = "prob",
names_prefix = "prob_", values_to = "res")
ggplot(df, aes(x = res)) +
geom_histogram(aes(y = after_stat(density) * 0.1), binwidth = 0.1) +
scale_y_continuous(labels = percent) +
facet_grid(~prob ~theta, labeller = label_bquote(cols = theta: .(theta), rows = p: .(prob))) +
xlab("Deviance residuals") +
ylab("Percent (%)")
```
Recall that $\theta = 0$ refers to the binomial case.
When $p = 0.5$ and $\theta > 0$, deviance residuals close to 0 become less common; as $\theta$ increases, the distribution of deviance residuals becomes less continuous (Fig. 4).
When $p$ deviates from $0.5$ and $\theta > 0$, the distribution of the deviance residuals becomes increasingly skewed and discontinuous (Fig. 4).
Due to the skewed and discontinuous nature of the distribution of the deviance residuals, the goodness of fit of a beta-binomial model cannot be assessed from the deviance residuals alone but must be evaluated using posterior predictive checking.