-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path2019-saddlepoint-approximation.Rmd
More file actions
769 lines (587 loc) · 37 KB
/
2019-saddlepoint-approximation.Rmd
File metadata and controls
769 lines (587 loc) · 37 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
---
title: "Approximate Densities for Sums of Variables: Negative Binomials and Saddlepoint"
date: 2019-06-20
tags: ["R","Stan","Modelling"]
output:
blogdown::html_page:
toc: true
toc_depth: 1
---
```{r setup, message=FALSE, warning=FALSE, echo = FALSE}
library(rstan)
library(knitr)
library(here)
library(tidyverse)
library(cowplot)
library(rstanmodeldev) #My custom package for SBC and related stuff, can be found at https://github.com/martinmodrak/rstanmodeldev
knitr::opts_chunk$set(echo = FALSE, fig.height = 3)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
fits_cache_dir <- here("cache")
if(!dir.exists(fits_cache_dir)) {
dir.create(fits_cache_dir)
}
results_cache_dir <- here("content","post","_saddlepoint_cache")
if(!dir.exists(results_cache_dir)) {
dir.create(results_cache_dir)
}
sbc_N_steps <- 500
#sbc_cores <- parallel::detectCores()
sbc_cores <- parallel::detectCores() - 1
```
I recently needed to find the distribution of sum of non-identical but independent negative binomial (NB) random variables. Although for some special cases the [sum is itself NB](https://stats.stackexchange.com/questions/45318/conditional-on-the-total-what-is-the-distribution-of-negative-binomials/354433#354433), analytical solution is not feasible in the general case. However it turns out there is a very handy tool called "Saddlepoint approximation" that is useful whenever you need densities of sums of arbitrary random variables. In this post I use the sum of NBs as a case study on how to derive your own approximations for basically any sum of independent random variables, show some tricks needed to get the approximation working in Stan and evaluate it against simpler approximations. To give credit where credit is due, I was introduced to the saddlepoint method via [Cross Validated answer on sum of Gamma variables](https://stats.stackexchange.com/questions/72479/generic-sum-of-gamma-random-variables/137318#137318).
Spoiler: it turns out the saddlepoint approximation is not that great for actual inference (at least for the cases I tested), but it is still a cool piece of math and I spent too much researching it to not show you this whole post.
# The Approximation - Big Picture
The saddlepoint approximation uses the [cumulant-generating function](https://en.wikipedia.org/wiki/Cumulant) (CGF) of a distribution to compute an approximate density at a given point. The neat part about CGFs is that the CGF of the sum of several variables is the sum of the individual CGFs! And CGFs are easy to come by, because the CGF is just the log of the moment-generating function and Wikipedia helpfully lists moment-generating functions for almost all distributions. Figuring out the CGF of almost any sum variable (including variables from different families) is thus relatively straightforward. The CGF can also easily be derived for [general linear combinations of random variables](http://mathworld.wolfram.com/Cumulant-GeneratingFunction.html).
The actual method for approximating density $f$ at point $x$, given the cumulant-generating function $K$, and its first and second derivatives ($K^\prime,K^{\prime\prime}$) is as follows:
1) find the saddlepoint $s_x$ by solving:
$$
K^\prime(s_x) = x
$$
Generally, there is no closed-form solution for $s_x$, but since $K(x)$ is always convex, $K^\prime$ is always increasing, making it a nice target for numerical solutions. Still, since a different solution is needed for each $x$, finding $s_x$ tends to be a computational bottleneck.
2) Once we have $s_x$, we can approximate
$$
f(x) \simeq \frac1{\sqrt{2\pi K''(s_x)}} \exp(K(s_x) - x s_x)
$$
The nice thing about the saddlepoint approximation is that it can easily produce approximations for both discrete and continous densities, and doesn't constrain the approximation to be normal (unlike Laplace approximation). One thing to note is that the saddlepoint approximation in the form above does not necessarily integrate to 1, so a renormalization might be needed if you are interested in the actual density. But to use in Stan, unnormalized density is all that's needed.
# Saddlepoint for Sum of NBs
The moment-generating function of NB distribution parametrized by number of failures $r$ and probability of success $p$ is:
$$
M(t) = \left( \frac{1 - p}{1 - p e^t} \right)^r
$$
So, taking the log and summing over $n$ independent NB variables, the cumulant of sum of NB is:
$$
K(t) = \sum_{i=1}^{n} r_i \left[ \log(1-p_i) - \log(1 - p_i e^t) \right]
$$
We now transform to the more useful parametrization of NB via mean $\mu$ and precision $\phi$ (i.e. $Var(X) = \mu + \frac{\mu^2}{\phi}$), where we have:
$$
r_i = \phi_i \\
p_i = \frac{\mu_i}{\phi_i + \mu_i} \\
K(t) = \sum_{i=1}^{n} \phi_i \left[ \log \frac{\phi_i}{\phi_i + \mu_i} - \log \left(1 - \frac{\mu_i e^t}{\phi_i + \mu_i} \right) \right] = \\
=\sum_{i=1}^{n} \phi_i \left[ \log(\phi_i) - \log(\phi_i + \mu_i ( 1 - e^t)) \right]
$$
Note that $K(t)$ does exist only when $\forall i:\phi_i + \mu_i ( 1 - e^t) > 0$ this constrains $t$ such that:
$$
\begin{align}
\tag{*}
\forall i : t &< log \left(\frac{\phi_i}{\mu_i} + 1 \right)
\end{align}
$$
The first and second derivatives of $K$ are:
$$
K^\prime (t) = \sum_{i=1}^{n} \frac{\phi_i \mu_i e^t}{\phi_i + \mu_i (1 - e^t)} \\
K^{\prime\prime} (t) = \sum_{i=1}^{n} \frac{\phi_i \mu_i (\phi_i + \mu_i) e^t}{(\phi_i + \mu_i (1 - e ^t))^2} \\
$$
It turns out that the saddlepoint $s_x$ is not defined when $x = 0$, since the numerator of $K^\prime(t)$ is positive for all $t$ and the denominator has to be positive for $K$ to exist. But for this special case, the density can be easily computed, as $f(0) = \prod_i P(X_i =0) = \prod_i NB(0 | \mu_i,\phi_i)$. The non-existence of the saddlepoint solution for boundaries of the domain is actually a recurring theme, as the existence of the solution is guaranteed only for the inner points, so it is useful to check for this when developing your approximations.
# Implementing the Approximation in Stan
This has all been a nice math excercise, but how can we translate that into a piece of code we could use? The only problematic part is solving for $s_x$, once we have it, the rest is a simple math that Stan will digest easily. Luckily, Stan has the built-in [`algebra_solver`](https://mc-stan.org/docs/2_19/functions-reference/functions-algebraic-solver.html) that can solve equations AND provide derivatives of the solution wrt. parameters. There is only a minor problem - we have an upper bound on $s_x$ from the equation $(*)$ and `algebra_solver` turns out not to work when there are boundaries (even when initialized within the boundaries). Instead we use the same method Stan uses for bounds on parameters and solve for unbounded $y_x$ where:
$$
s_x = \min_i{log \left(\frac{\phi_i}{\mu_i} + 1 \right)} -e^{y_x}
$$
So let us get our hands dirty and show some code, starting with how to write the saddlepoint equation in a way that the `algebra_solver` can handle. Since $K^\prime$ is always positive, we transform the equation to log scale - partly because we might have some big $\sum\mu_i$ out there and partly because it seems nice - I didn't test the non-log version. So the equation we are actually solving for $s_x$ is:
$$
\log \sum_{i=1}^{n} \exp \left( \log\phi_i + \log \mu_i + s_x - \log(\phi_i + \mu_i - \mu_i \exp(s_x) \right) - x = 0
$$
Translated into Stan we get:
```{r, comment=NA}
functions_lines <- readLines(here("static/post/2019-saddlepoint-approximation/sum_nb_functions.stan"))
cat(paste0(functions_lines[2:24], collapse = "\n"))
```
Above, `y` are the unconstrained unknowns, which we transform via `s_transform` to the constrained space. Further we extract $\mu_i$ and $\phi_i$ from `theta` which can be parameters while `x_i` contains the observed sums (data). Since we have no real number data, `x_r` is ignored. The `algebra_solver` will try to find `y` such that `value` is 0 which is exactly when `s` is the solution to the saddlepoint equation.
We use the `nb_sum_log_Kd_eq` function to compute the actual saddlepoint density:
```{r, comment=NA}
cat(paste0(functions_lines[26:67], collapse = "\n"))
```
The above shows how the `algebra_solver` is called - we combine $\mu_i$ and $\phi_i$ as params, pass a guess (0 works great, so we don't need to worry about details). The only weird part is `dummy_x_r` - I want it to be just an empty array, but it has to be of type `real` and has to be data. And I didn't find a way to make the compiler understand that unless I pass `dummy_x_r` from outside as in
```
transformed data {
real dummy_x_r[0];
}
...
model {
sums ~ neg_binomial_sum_lpmf(mus, phis, dummy_x_r);
}
```
# A Simple Baseline
To assess, how useful the saddlepoint approximation is in practice, we'll compare it to a straightforward application of [Method of moments](https://en.wikipedia.org/wiki/Method_of_moments_(statistics)). This is just a fancy name for choosing a distribution family and choosing it's parameters so that mean, variance (and possibly higher moments) match those of the desired distribution. In case of NBs, when $Y_i \sim NB(\mu_i, \phi_i)$ then
$$
E \left(\sum Y_i \right) = \sum \mu_i \\
Var \left(\sum Y_i \right) = \sum \left( \mu_i + \frac{\mu_i^2}{\phi_i} \right)
$$
Simply because both mean and variance are linear operators. Maybe sum of NBs isn't that different from a NB distribution, so let's approximate
$$
\sum Y_i \approx NB(\bar\mu, \bar\phi)
$$
Solving for $\bar\mu$ and $\bar\phi$ by matching the mean and variance of the approximate distribution gives:
$$
\bar \mu = \sum \mu_i \\
\bar \phi = \frac{ \left(\sum \mu_i \right)^2 }{\sum \frac{\mu_i^2}{\phi_i}}
$$
This can be implemented very directly in Stan as:
```{r, comment=NA}
cat(paste0(functions_lines[69:74], collapse = "\n"))
```
# Eyeballing Masses
As a first look, we will see how well do both approximations match the empirical mass function - we simulate a lot of sums of NBs, bin the range of observed values and compute empirical mass as the proportion of the samples that fits in each bin. For the approximations, we sum the mass for all values belonging to the bins.
```{r plot_approx_fun, message=FALSE, results="hide", cache = TRUE}
expose_stan_functions(here("static","post","2019-saddlepoint-approximation","sum_nb_functions.stan"), show_compiler_warnings = FALSE)
## from http://tr.im/hH5A
logsumexp <- function (x) {
y = max(x)
y + log(sum(exp(x - y)))
}
plot_approximation_densities <- function(means, phis, log_scale) {
n_means = length(means)
N_sums <- 1e5
sums = numeric(N_sums)
for(n in 1:N_sums) {
sums[n] = sum(rnbinom(n_means, mu = means, size = phis))
}
breaks = seq(min(sums),max(sums) + 1, length.out = 30) %>% round()
#breaks = seq(min(sums) * 0.95,max(sums)* 1.01, length.out = 30) %>% round()
#breaks = quantile(sums, probs = seq(0, 1, length.out = 30))
#breaks[length(breaks)] <- breaks[length(breaks)] + 1 #The intervals are left closed, so including the max value here
#Prepare data for output
moments = numeric(length(breaks) - 1)
saddlepoint = numeric(length(breaks) - 1)
empirical = numeric(length(breaks) - 1)
for(b in 1:(length(breaks) - 1)) {
#Compute the mass for each breaks[b] to breaks[b + 1] regions
if(log_scale) {
empirical[b] = log(mean(sums >= breaks[b] & sums < breaks[b + 1]))
} else {
empirical[b] = mean(sums >= breaks[b] & sums < breaks[b + 1])
}
break_length <- breaks[b + 1] - breaks[b]
moments_mass <- numeric(break_length)
saddlepoint_mass <- numeric(break_length)
for(i in 1:break_length) {
y <- breaks[b] + i - 1
moments_mass[i] <- neg_binomial_sum_moments_lpmf(y, means, phis)
saddlepoint_mass[i] <- neg_binomial_sum_saddlepoint_lpmf(y, means, phis, dummy_x_r = array(0,0))
}
if(any(is.na(moments_mass))) {
stop("NA - moments")
}
if(any(is.na(saddlepoint_mass))) {
stop("NA - saddlepoint")
}
if(log_scale) {
moments[b] = logsumexp(moments_mass)
saddlepoint[b] = logsumexp(saddlepoint_mass)
} else {
moments[b] = sum(exp(moments_mass))
saddlepoint[b] = sum(exp(saddlepoint_mass))
}
}
if(log_scale) {
scale_name <- "log binned mass"
} else {
scale_name <- "binned mass"
}
data.frame(x = breaks[-length(breaks)], empirical, moments, saddlepoint) %>%
gather("type","value", -x) %>%
mutate("is_main" = if_else(type == "empirical", "yes","no")) %>%
ggplot(aes(x = x, y = value, color = type)) + geom_line(alpha = 0.5, size = 2) +
scale_y_continuous(scale_name) +
scale_color_brewer(type = "qual", palette = 2)
}
means_ex1 <- c(800, 1600)
phis_ex1 <- c(10, 1)
means_ex2 <- c(50, 100, 1300, 2000)
phis_ex2 <- rep(10, length(means_ex2))
```
The saddlepoint approximation improves notably over moments when the Fano factors of the summed variables are vastly different and we do not sum a large number of values, below we show mass and log mass for the case when $\mu = \{`r paste0(means_ex1, collapse = ", ")` \}$ and $\phi = \{`r paste0(phis_ex1, collapse = ", ")` \}$:
```{r plot_approx_different, cache = TRUE, dependson="plot_approx_fun"}
plot_approximation_densities(means_ex1, phis_ex1, log_scale = FALSE)
plot_approximation_densities(means_ex1, phis_ex1, log_scale = TRUE)
```
It is visible that the saddlepoint mass tracks the empirical mass very tightly both in the bulk and in the tail (visible better on the log mass) - note that the tail of the empirical log mass is jittery due to low number of samples in the tail.
On the other hand, when we sum a lot of variables which are not very different and/or when $\phi_i$ are large, the sum becomes normal-ish and both approximation work well - let us for example look at the case when $\mu = \{`r paste0(means_ex2, collapse = ", ")` \}$ and $\phi = \{`r paste0(phis_ex2, collapse = ", ")` \}$:
```{r plot_approx_similar, cache = TRUE, dependson="plot_approx_fun"}
plot_approximation_densities(means_ex2, phis_ex2, log_scale = FALSE)
plot_approximation_densities(means_ex2, phis_ex2, log_scale = TRUE)
```
This gives us some intuition where to expect differences.
# Evaluating Performance
```{r, message = FALSE, results = "hide", cache = TRUE}
model_file <- here("static","post","2019-saddlepoint-approximation","test_sum_nb.stan")
model_sum_nb <- stan_model(model_file, isystem = here("static","post","2019-saddlepoint-approximation"))
```
We evaluate the model using [Simulation-based Calibration](https://arxiv.org/abs/1804.06788) (SBC). The main idea is that when I generate data exactly the way the model assumes, then for any $c$ the $c\%$ posterior interval should contain the true value an unobserved parameter in exactly $c\%$ of the cases. In other words the quantile in which the true value is found in the posterior distribution should be uniformly distributed. There are some caveats to this, read the paper for details.
I am using my own implementation of SBC which is in my not very well documented, likely-never-on-CRAN package [`rstanmodeldev`](https://github.com/martinmodrak/rstanmodeldev). We run `r sbc_N_steps` simulations for each of the test cases. If you want to see under the hood, the code for this post is available [at the GitHub repo of this blog](https://github.com/martinmodrak/blog/blob/master/content/post/2019-saddlepoint-approximation.Rmd).
The first test case I will use is that I observe the sum of $G+1$ variables where I know $\mu_i$ and $\phi_i$ for $i \in {1 .. G}$ while $\mu_{G+1}$ and $\phi_{G+1}$ is unknown and has to be infered from $N$ observations of the sum.
In all cases, both observed and unobserved $\phi_i$ are drawn following the [advice of Dan simpson](https://statmodeling.stat.columbia.edu/2018/04/03/justify-my-love/), i.e.:
$$
\phi_{raw} \sim HalfN(0, 1) \\
\phi = \frac{1}{\sqrt{\phi_{raw}}}
$$
This is how the model looks-like in Stan ( [`test_sum_nb.stan`](/post/2019-saddlepoint-approximation/test_sum_nb.stan) ):
```{r, comment=NA}
cat(paste0(readLines(model_file), collapse = "\n"))
```
Most notably, the way the sum of NBs is implemented is given as data. The [`sum_nb_functions.stan`](/post/2019-saddlepoint-approximation/sum_nb_functions.stan) include contains the functions shown above.
And this is an R method to generate simulated data - this is a function that given parameters of the observed data gives a function that on each call generates both `true` and `observed` data in a format that matches the Stan model:
```{r, echo = TRUE}
generator <- function(G, N, method = "saddlepoint", observed_mean_mus, observed_sd_mus, mu_prior_mean, mu_prior_sd) {
if(method == "saddlepoint") {
method_id = 0
} else if (method == "moments") {
method_id = 1
} else {
stop("Invalid method")
}
function() {
all_mus <- rlnorm(G + 1, observed_mean_mus, observed_sd_mus)
all_mus[G + 1] <- rlnorm(1, mu_prior_mean, mu_prior_sd)
all_phis <- 1 / sqrt(abs(rnorm(G + 1)))
sums <- array(-1, N)
for(n in 1:N) {
sums[n] <- sum(rnbinom(G + 1, mu = all_mus, size = all_phis))
}
list(
observed = list(
N = N,
sums = sums,
G = G,
method = method_id,
mus = array(all_mus[1:G], G),
phis = array(all_phis[1:G], G),
mu_prior_mean = mu_prior_mean,
mu_prior_sd = mu_prior_sd
),
true = list(
extra_mu = all_mus[G+1],
extra_phi = all_phis[G+1]
)
)
}
}
```
## Sum of Two NBs
```{r}
small_N <- 10
mu_prior_mean_small <- 5
mu_prior_sd_small <- 3
observed_mean_mus_small <- 5
observed_sd_mus_small <- 3
seed_small <- 68752245 #I use the same seed so I evaluate on exactly the same data
```
Here we test a sum of two NBs - the means of both observed and unobserved NB are chosen randomly from LogNormal(`r mu_prior_mean_small`,`r mu_prior_sd_small`) We observe `r small_N` sums in each run.
### Saddlepoint
First, let's look at diagnostics for the saddlepoint approximation:
```{r}
results_file_saddlepoint_small <- paste0(results_cache_dir, "/saddlepoint_small.rds")
if(!file.exists(results_file_saddlepoint_small) ||
length(unique(readRDS(results_file_saddlepoint_small)$params$run)) < sbc_N_steps) {
cache_dir_saddlepoint_small <- paste0(fits_cache_dir,"/saddlepoint_small")
if(!dir.exists(cache_dir_saddlepoint_small)) {
dir.create(cache_dir_saddlepoint_small)
}
set.seed(seed_small)
sbc_res_saddlepoint_small <- sbc(model_sum_nb, generator(G = 1, N = small_N, "saddlepoint",
observed_mean_mus = observed_mean_mus_small,
observed_sd_mus = observed_sd_mus_small,
mu_prior_mean = mu_prior_mean_small,
mu_prior_sd = mu_prior_sd_small),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
cache_dir = cache_dir_saddlepoint_small,
cores = sbc_cores,
cluster_options = list(outfile = here("cache","cluster_out.txt")))
saveRDS(sbc_res_saddlepoint_small, results_file_saddlepoint_small)
} else {
sbc_res_saddlepoint_small <- readRDS(results_file_saddlepoint_small)
}
summarise_sbc_diagnostics(sbc_res_saddlepoint_small) %>% kable()
```
All the columns except for `median_total_time` represent proportion of fits that have a problem with divergences/treedepth/lowe n_eff etc. We see that some small number of runs ended with divergencies. This is not great, but we will ingore it for now. The `n_eff` and `Rhat` diagnostics are okay. We also note that the model is quite slow - `r round(summarise_sbc_diagnostics(sbc_res_saddlepoint_small)$median_total_time)` seconds for just 10 observations is high.
Let's look at the SBC histogram at two resolutions:
```{r}
plot_sbc_histogram(sbc_res_saddlepoint_small$params)
plot_sbc_histogram(sbc_res_saddlepoint_small$params, binwidth = 2)
```
Here we would like to see a uniform distribution. The gray area is a rough 99% confidence interval, so very few bars should actually be outside this. While the histogram for $\mu_{G+1}$ looks OK, the consistent trend and several outliers for $\phi_{G+1}$ indicates that the approximation has some problems and consistently underestimates the true value.
Finally we can look at a scatter plot of true value vs. posterior median:
```{r}
plot_sbc_scatter(sbc_res_saddlepoint_small$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
The blue line indicates perfect match (true value = posterior median)
As in the above plot, we see that $\mu_{G+1}$ is inferred quite precisely, especially for larger true values, while the results for $\phi_{G+1}$ are diffuse, often dominated by the priors (the prior density peaks at around 1.7) and have a slight tendency to be below the perfect prediction line. We also see that low true values of $\mu_{G+1}$ tend to be overestimated - this is not unexpected as when the observed $\mu$ is large and unobserved small it is hard to infer it's exact value and the posterior is largely influenced by prior (which has large mean).
### Moments
We can now do the same for the method of moments approximation, starting with the diagnostics:
```{r}
results_file_moments_small <- paste0(results_cache_dir, "/moments_small.rds")
if(!file.exists(results_file_moments_small) ||
length(unique(readRDS(results_file_moments_small)$params$run)) < sbc_N_steps) {
cache_dir_moments_small <- paste0(fits_cache_dir,"/moments_small")
if(!dir.exists(cache_dir_moments_small)) {
dir.create(cache_dir_moments_small)
}
set.seed(seed_small)
sbc_res_moments_small <- sbc(model_sum_nb, generator(G = 1, N = small_N, "moments",
observed_mean_mus = observed_mean_mus_small,
observed_sd_mus = observed_sd_mus_small,
mu_prior_mean = mu_prior_mean_small,
mu_prior_sd = mu_prior_sd_small),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
init_r = 0.5, #Sometimes, to small inits caused the chain to be stuck
cache_dir = cache_dir_moments_small,
cores = sbc_cores,
cluster_options = list(outfile = here("cache","cluster_out.txt")))
saveRDS(sbc_res_moments_small, results_file_moments_small)
} else {
sbc_res_moments_small <- readRDS(results_file_moments_small)
}
summarise_sbc_diagnostics(sbc_res_moments_small) %>% kable()
```
We see some small number of divergences and low n_eff and high Rhat (which go usually hand in hand). This is comparable to the saddlepoint case.
The histogram:
```{r}
plot_sbc_histogram(sbc_res_moments_small$params)
plot_sbc_histogram(sbc_res_moments_small$params, binwidth = 2)
```
The histrograms look very slightly worse than the saddlepoint approximation - although there is no consistent trend, more bars are outside the confidence interval or close to the border, indicating some issues, although I honestly don't really understand what is going on.
And the scatterplot, which looks quite similar to the saddlepoint version:
```{r}
plot_sbc_scatter(sbc_res_moments_small$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
## Sum of 21 NBs
```{r}
N_large <- 20
seed_large <- 32157588
mu_prior_mean_large <- 5
mu_prior_sd_large <- 3
observed_mean_mus_large <- 2
observed_mean_sd_large <- 1
```
Further, we can check the case where there are 20 known variables with low means and one NB is unknown with a large mean - we want the unobserved mean to have notable influence on the total outcome, hence we choose it to be larger. In particular, the observed means are drawn from LogNormal(`r observed_mean_mus_large`,`r observed_mean_sd_large`) and the mean to be inferred is drawn from LogNormal(`r mu_prior_mean_large`,`r mu_prior_sd_large`)
### Saddlepoint
Looking at the statisics, we see only very few divergences, but quite large median time:
```{r}
results_file_saddlepoint_large <- paste0(results_cache_dir, "/saddlepoint_large.rds")
if(!file.exists(results_file_saddlepoint_large) ||
length(unique(readRDS(results_file_saddlepoint_large)$params$run)) < sbc_N_steps) {
cache_dir_saddlepoint_large <- paste0(fits_cache_dir,"/saddlepoint_large")
if(!dir.exists(cache_dir_saddlepoint_large)) {
dir.create(cache_dir_saddlepoint_large)
}
set.seed(seed_large)
sbc_res_saddlepoint_large <- sbc(model_sum_nb, generator(G = 20, N = N_large, "saddlepoint",
observed_mean_mus = observed_mean_mus_large,
observed_sd_mus = observed_mean_sd_large,
mu_prior_mean = mu_prior_mean_large,
mu_prior_sd = mu_prior_sd_large),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
init_r = 0.5, #Sometimes, to small inits caused the chain to be stuck
cores = sbc_cores,
cache_dir = cache_dir_saddlepoint_large,
cluster_options = list(outfile = here("cache","cluster_out.txt")))
saveRDS(sbc_res_saddlepoint_large, results_file_saddlepoint_large)
} else {
sbc_res_saddlepoint_large <- readRDS(results_file_saddlepoint_large)
}
summarise_sbc_diagnostics(sbc_res_saddlepoint_large) %>% kable()
```
The histogram:
```{r}
plot_sbc_histogram(sbc_res_saddlepoint_large$params)
plot_sbc_histogram(sbc_res_saddlepoint_large$params, binwidth = 2)
```
We see that especially for $\phi_{G+1}$ the results are discouraging with the true value frequently being in the low quantiles of the posterior.
```{r}
plot_sbc_scatter(sbc_res_saddlepoint_large$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
The scatterplot is than quite similar to the previous cases.
### Moments
The statistics for moments show short running time but a larger amount of convergence issues:
```{r}
results_file_moments_large <- paste0(results_cache_dir, "/moments_large.rds")
if(!file.exists(results_file_moments_large) ||
length(unique(readRDS(results_file_moments_large)$params$run)) < sbc_N_steps) {
cache_dir_moments_large <- paste0(fits_cache_dir,"/moments_large")
if(!dir.exists(cache_dir_moments_large)) {
dir.create(cache_dir_moments_large)
}
set.seed(seed_large)
sbc_res_moments_large <- sbc(model_sum_nb, generator(G = 20, N = N_large, "moments",
observed_mean_mus = observed_mean_mus_large,
observed_sd_mus = observed_mean_sd_large,
mu_prior_mean = mu_prior_mean_large,
mu_prior_sd = mu_prior_sd_large),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
cache_dir = cache_dir_moments_large,
cores = sbc_cores,
cluster_options = list(outfile = here("cache","cluster_out.txt")))
saveRDS(sbc_res_moments_large, results_file_moments_large)
} else {
sbc_res_moments_large <- readRDS(results_file_moments_large)
}
summarise_sbc_diagnostics(sbc_res_moments_large) %>% kable()
```
The histograms:
```{r}
plot_sbc_histogram(sbc_res_moments_large$params)
plot_sbc_histogram(sbc_res_moments_large$params, binwidth = 2)
```
The histograms hint at consistent underestimation of $\mu_{G+1}$ and overestimation of $\phi_{G+1}$, problematic especially for $\phi_{G+1}$.
```{r}
plot_sbc_scatter(sbc_res_moments_large$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
Once again the scatter is similar, the only interesting feature are the few outliers for $\mu_{G+1}$ where the true value is large but the posterior median is very small. Those likely correspond to the divergent runs, but they cannot account for the full skew of the SBC histograms - this is more likely caused by the string of underestimated points just below the blue line on the top right.
## Sum Defined by Series
The first model is not very useful when `G` is large, because the posterior gets dominated by the prior. To better test what happens with large `G`, we instead us a single parameter to define all $\mu_i$ and $\phi_i$ as a geometric series, i.e. $\mu_i = \mu_{base} k^{(i - 1)}$ where $k$ is known while $\mu_{base}$ is the unknown parameter, similarly for $\phi_i$. The Stan code is:
```{r, comment=NA}
series_model_file <- here("static","post","2019-saddlepoint-approximation","test_sum_nb_series.stan")
cat(paste0(readLines(series_model_file), collapse = "\n"))
```
```{r, message = FALSE, results = "hide", cache = TRUE}
series_model_sum_nb <- stan_model(series_model_file, isystem = here("static","post","2019-saddlepoint-approximation"))
```
The R code for simulation is then:
```{r, echo = TRUE}
generator_series <- function(G, N, method = "saddlepoint", mu_prior_mean, mu_prior_sd, mu_series_coeff, phi_series_coeff) {
if(method == "saddlepoint") {
method_id = 0
} else if (method == "moments") {
method_id = 1
} else {
stop("Invalid method")
}
function() {
mu <- rlnorm(1, mu_prior_mean, mu_prior_sd)
phi <- 1 / sqrt(abs(rnorm(1)))
all_mus <- mu * mu_series_coeff ^ (0:(G - 1))
all_phis <- phi * phi_series_coeff ^ (0:(G - 1))
sums <- array(-1, N)
for(n in 1:N) {
sums[n] <- sum(rnbinom(G, mu = all_mus, size = all_phis))
}
list(
observed = list(
N = N,
sums = sums,
G = G,
method = method_id,
mu_prior_mean = mu_prior_mean,
mu_prior_sd = mu_prior_sd,
mu_series_coeff = mu_series_coeff,
phi_series_coeff = phi_series_coeff
),
true = list(
mu = mu,
phi = phi
)
)
}
}
```
```{r}
# Test single run, only for development
# data <- generator_series(G = 30, N = 10, method = "saddlepoint",
# mu_prior_mean = 10, mu_prior_sd = 4,
# mu_series_coeff = 0.75, phi_series_coeff = 0.9
# )
# fit <- sampling(series_model_sum_nb, data = data$observed)
# evaluate_all_params(rstan::extract(fit), data$true)
```
```{r}
G_series <- 30
N_series <- 10
seed_series <- 8720045
mu_prior_mean_series <- 8
mu_prior_sd_series <- 3
mu_series_coeff <- 0.75
phi_series_coeff <- 0.9
```
In the following we draw $\mu_{base}$ from LogNormal(`r mu_prior_mean_series`, `r mu_prior_sd_series`) and use `mu_series_coeff = ` `r mu_series_coeff` and `phi_series_coeff` = `r phi_series_coeff`.
### Saddlepoint
Once again let's look at the diagnostics for saddlepoint approximation:
```{r}
results_file_saddlepoint_series <- paste0(results_cache_dir, "/saddlepoint_series.rds")
if(!file.exists(results_file_saddlepoint_series) ||
length(unique(readRDS(results_file_saddlepoint_series)$params$run)) < sbc_N_steps) {
cache_dir_saddlepoint_series <- paste0(fits_cache_dir,"/saddlepoint_series")
if(!dir.exists(cache_dir_saddlepoint_series)) {
dir.create(cache_dir_saddlepoint_series)
}
set.seed(seed_series)
sbc_res_saddlepoint_series <- sbc(series_model_sum_nb,
generator_series(G = G_series, N = N_series, method = "saddlepoint",
mu_prior_mean = mu_prior_mean_series,
mu_prior_sd = mu_prior_sd_series,
mu_series_coeff = mu_series_coeff,
phi_series_coeff = phi_series_coeff
),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
cores = sbc_cores,
cache_dir = cache_dir_saddlepoint_series,
cluster_options = list(outfile = here("cache","cluster_out.txt"))
)
saveRDS(sbc_res_saddlepoint_series, results_file_saddlepoint_series)
} else {
sbc_res_saddlepoint_series <- readRDS(results_file_saddlepoint_series)
}
summarise_sbc_diagnostics(sbc_res_saddlepoint_series) %>% kable()
```
We see quite a few divergences, but I didn't investigate them in detail. The SBC histograms follow:
```{r}
plot_sbc_histogram(sbc_res_saddlepoint_series$params)
plot_sbc_histogram(sbc_res_saddlepoint_series$params, binwidth = 2)
```
The histrograms hint at some problems for $\mu$.
```{r}
plot_sbc_scatter(sbc_res_saddlepoint_series$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
The scatterplot shows that the estimation is quite reasonable for both $\mu$ and $\phi$ - definitely better than the previous model, as we got rid of the cases where the data do not identify the true values well.
### Moments
The diagnostics and plots for method of moments is:
```{r}
results_file_moments_series <- paste0(results_cache_dir, "/moments_series.rds")
if(!file.exists(results_file_moments_series) ||
length(unique(readRDS(results_file_moments_series)$params$run)) < sbc_N_steps) {
cache_dir_moments_series <- paste0(fits_cache_dir,"/moments_series")
if(!dir.exists(cache_dir_moments_series)) {
dir.create(cache_dir_moments_series)
}
set.seed(seed_series)
sbc_res_moments_series <- sbc(series_model_sum_nb,
generator_series(G = G_series, N = N_series, method = "moments",
mu_prior_mean = mu_prior_mean_series,
mu_prior_sd = mu_prior_sd_series,
mu_series_coeff = mu_series_coeff,
phi_series_coeff = phi_series_coeff
),
N_steps = sbc_N_steps,
control = list(adapt_delta = 0.95),
cache_dir = cache_dir_moments_series,
cores = sbc_cores,
cluster_options = list(outfile = here("cache","cluster_out.txt"))
)
saveRDS(sbc_res_moments_series, results_file_moments_series)
} else {
sbc_res_moments_series <- readRDS(results_file_moments_series)
}
summarise_sbc_diagnostics(sbc_res_moments_series) %>% kable()
```
We see a bunch of problems, comparable to the saddlepoint version. Let's look at the SBC histograms:
```{r}
plot_sbc_histogram(sbc_res_moments_series$params)
plot_sbc_histogram(sbc_res_moments_series$params, binwidth = 2)
```
And those are surprisingly nice, showing no clear trend or outliers!
```{r}
plot_sbc_scatter(sbc_res_moments_series$params, x_axis_trans = "log10", y_axis_trans = "log10")
```
The scatterplot is very similar to the saddlepoint case.
# Summing up
We see that in the regimes we tested, the saddlepoint approximation for sum of negative binomials provides somewhat better inferences for small number of variables at the cost of much increased computation times. For sums of large number of variables, it may even be worse than the moments method. So it is probably not very practical unless you have few variables you need that extra bit of precision. But it is a neat mathematical trick and of interest on its own. It is also possible that for some low-mean regimes the difference is bigger.
# Saddlepoint Approximations for Other Families
If you want to use saddlepoint approximation for other than NB variables, but don't want to do the math on your own, there are some worked out on the Internet:
- Sum of Gamma variables: [Answer on Cross Validated](https://stats.stackexchange.com/questions/72479/generic-sum-of-gamma-random-variables/137318#137318)
- Sum of binomials: [Liu & Quertermous: Approximating the Sum of Independent Non-Identical Binomial Random Variables](https://arxiv.org/abs/1712.01410)
Thanks for reading!