# The bars of Gangelt: Bayesian analysis of the Infection Fatality Rate in Gangelt, and a simple extrapolation to infection-counts in Germany

(Miscellaneous text and maths notes, initiated by Jakob. Feel free to copy and paste into the notebook we use; some of it we might want to exclude or put in an appendix or supplement.)

Stuff we might want to read:
https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/conditionsanddiseases/bulletins/coronaviruscovid19infectionsurvey/england10may2020

## Introduction
- to be written - Philipp first draft?

## A Bayesian model of infections and fatalities in Gangelt



[might be a bit long and dry]

We concentrate on estimation of the infection fatality rate (IFR) $\lambda$ in Gangelt, and in a second step we extrapolate to infection-counts in Germany.

For the infection rate $\theta$ in Gangelt, we simply follow the results of Streeck et al 2020, i.e. we do not provide a separate Bayesian analysis for it (see e.g. https://statmodeling.stat.columbia.edu/2020/05/01/simple-bayesian-analysis-inference-of-coronavirus-infection-rate-from-the-stanford-study-in-santa-clara-county/ for a Bayesian analysis of the infection rate in another study). In Streeck et al, the infection rate $\theta$ is estimated as 0.1553 with a confidence interval of (0.1231, 0.1896). To preserve this uncertainty estimate, we assume that the infection rate follows a Beta distribution with parameters $\alpha_\theta$ and $\beta_\theta$, chosen such that this distribution has the same mean and variance as the estimate, i.e.

$$\theta \sim \mbox{Beta}(\alpha_\theta = ?, \beta_\theta = ?),$$ using the fact that 

$ \mbox{E}(\theta) = \frac{\alpha_\theta}{\alpha_\theta+\beta_\theta}$ and $\mbox{Var}(\theta) = \frac{\alpha_\theta \beta_\theta}{(\alpha_\theta + \beta_\theta)^2(\alpha_\theta + \beta_\theta+1)}$.
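The moment matching can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the reported interval is treated as a 95% interval that is roughly normal, so that its width equals $2 z \cdot \mbox{sd}$; the text above does not say how the interval was constructed. The helper name `beta_from_mean_var` is made up for illustration.

```python
from statistics import NormalDist


def beta_from_mean_var(mean, var):
    """Return (alpha, beta) of the Beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie in (0, 1)")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    # From Var = mean * (1 - mean) / (alpha + beta + 1), solve for nu = alpha + beta.
    nu = mean * (1.0 - mean) / var - 1.0
    return mean * nu, (1.0 - mean) * nu


# Point estimate and interval as quoted from Streeck et al in the text above.
theta_mean = 0.1553
ci_low, ci_high = 0.1231, 0.1896

# ASSUMPTION: the interval is a 95% interval and approximately normal.
z = NormalDist().inv_cdf(0.975)
theta_sd = (ci_high - ci_low) / (2.0 * z)

alpha_theta, beta_theta = beta_from_mean_var(theta_mean, theta_sd ** 2)
print(alpha_theta, beta_theta)
```

If the interval in the paper was built differently (e.g. by bootstrap, or at another level), only the line computing `theta_sd` needs to change.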

[We could include a little plot here if we want?]

We assume that infections in Gangelt occur independently from each other at an infection rate $\theta$, i.e. that the total infection count $I_G$ (for fixed $\theta$) is given by a binomial distribution,

$$ I_G \mid \theta \sim \mbox{Binomial}(n_G,\theta), $$

where $n_G$ is the total number of inhabitants in Gangelt, $n_G=?$. Averaged over the uncertainty in the infection rate, the infection count in Gangelt follows a beta-binomial distribution,

$$ I_G \sim \mbox{BetaBinomial}(n_G, \alpha_\theta,\beta_\theta). $$

We assume that, amongst infected individuals, fatalities occur independently at infection fatality rate $\lambda$, i.e. 

$$ F_G \mid I_G \sim \mbox{Binomial}(I_G, \lambda), $$

which (marginalized over $I_G$) yields a likelihood $P(F_G=f_G \mid \lambda)$, with observed fatality count $f_G= 7$.

[Note Jakob: I am not sure what the functional form of the likelihood is, to be honest. I don't think that a 'thinned' beta-binomial distribution is still beta-binomial, but honestly I don't quite know. Based on Wiuf and Stumpf 2006, it does not look like the beta-binomial distribution is closed under binomial subsampling: it is not listed, but I also did not work through their conditions. Need to verify or word carefully. Maybe someone can look into this? Matthias? See below]
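Whatever its functional form, the likelihood $P(F_G = f_G \mid \lambda)$ can be evaluated numerically by summing the binomial fatality probability over all possible infection counts, weighted by the beta-binomial probabilities. The following is a minimal sketch using only the standard library. The population size and the Beta parameters are hypothetical placeholders (the real values are still open above), and the function names are made up for illustration.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient 'n choose k'."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def betabinom_logpmf(k, n, a, b):
    """log P(I = k) for I ~ BetaBinomial(n, a, b)."""
    return log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)


def binom_logpmf(k, n, p):
    """log P(F = k) for F ~ Binomial(n, p), with 0 < p < 1."""
    return log_choose(n, k) + k * log(p) + (n - k) * log(1.0 - p)


def fatality_likelihood(f, lam, n, a, b):
    """P(F_G = f | lambda), summing over all infection counts i >= f."""
    return sum(
        exp(betabinom_logpmf(i, n, a, b) + binom_logpmf(f, i, lam))
        for i in range(f, n + 1)
    )


# HYPOTHETICAL placeholder values, not taken from the study.
n_G, alpha_theta, beta_theta = 1000, 70.0, 380.0
f_G = 7  # observed fatality count, as stated in the text

for lam in (0.001, 0.005, 0.02):
    print(lam, fatality_likelihood(f_G, lam, n_G, alpha_theta, beta_theta))
```

One known property may help with the open question: thinning a binomial count is again binomial, so $F_G \mid \theta, \lambda \sim \mbox{Binomial}(n_G, \lambda\theta)$. The marginal distribution of $F_G$ is therefore a binomial mixed over $\lambda\theta$, which is a rescaled Beta variable rather than a Beta variable when $\lambda < 1$. The sum above does not rely on this.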

Finally, we need a prior distribution over $\lambda$. We choose a Beta-distribution with parameters $\alpha_\lambda$ and $\beta_\lambda$, 

$$ \lambda \sim \mbox{Beta}(\alpha_\lambda,\beta_\lambda). $$

We choose these parameters such that the prior has a mean of $??$ and a variance of $??$, resulting in $\alpha_\lambda =??$ and $\beta_\lambda = ??$. We believe that this provides a reasonable description given prior studies in ? and ?, with a prior variance chosen large enough to not overly constrain the resulting inference. Below, we provide an empirical exploration of how strongly the inference results would be affected by a different choice of prior. Using this notebook, it is also possible to explore further choices of the prior.
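Because $\lambda$ is one-dimensional, the posterior can be approximated on a grid for any likelihood function, which also makes it cheap to compare priors. The sketch below is illustrative only: the prior parameters are hypothetical, and the likelihood is a stand-in (a plain binomial with a fixed, hypothetical infection count), not the marginal likelihood of the model. Any function returning $P(F_G = f_G \mid \lambda)$ can be passed in instead.

```python
def grid_posterior(likelihood, a_lam, b_lam, n_grid=20000):
    """Posterior weights for lambda on a midpoint grid over (0, 1).

    likelihood: function lam -> value proportional to P(F_G = f_G | lam)
    a_lam, b_lam: parameters of the Beta prior on lambda
    """
    grid = [(j + 0.5) / n_grid for j in range(n_grid)]
    # Unnormalised Beta prior density; constants cancel when normalising.
    unnorm = [
        likelihood(lam) * lam ** (a_lam - 1.0) * (1.0 - lam) ** (b_lam - 1.0)
        for lam in grid
    ]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]


def summarize(grid, weights, level=0.95):
    """Posterior mean and equal-tailed credible interval from grid weights."""
    mean = sum(lam * w for lam, w in zip(grid, weights))
    lo_q, hi_q = (1.0 - level) / 2.0, 1.0 - (1.0 - level) / 2.0
    cum, lo, hi, found_lo = 0.0, grid[0], grid[-1], False
    for lam, w in zip(grid, weights):
        cum += w
        if not found_lo and cum >= lo_q:
            lo, found_lo = lam, True
        if cum >= hi_q:
            hi = lam
            break
    return mean, lo, hi


# STAND-IN likelihood: binomial with a fixed, HYPOTHETICAL infection count.
f_G, i_fixed = 7, 2000


def stand_in_likelihood(lam):
    return lam ** f_G * (1.0 - lam) ** (i_fixed - f_G)


# HYPOTHETICAL priors: a flat one and a more informative one.
for a_lam, b_lam in [(1.0, 1.0), (2.0, 200.0)]:
    grid, weights = grid_posterior(stand_in_likelihood, a_lam, b_lam)
    print((a_lam, b_lam), summarize(grid, weights))
```

A grid is used here because it needs no sampler and no closed-form posterior; the grid resolution `n_grid` should be fine enough relative to the posterior width.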

Lots of results to be described...

## Two approximate, simpler likelihoods

[I really like these, but we might want to discuss the actual derivations in an appendix to keep the flow. This section is unfinished, so skip it for now.]

The model described above provides a simple, yet arguably reasonable, generative model of how observed infections and fatalities are related to the underlying infection and fatality rates. However, Bayesian inference over $\lambda$ is complicated by the fact that we do not have a closed-form likelihood, but have to numerically average over $I_G$. We here provide two approximate models which lead to slightly simpler calculations. As we will show [hopefully...], the resulting inferences are basically unchanged:

### Approximating the fatality count by a beta-binomial model

The fatality-count is given by a beta-binomial count under binomial subsampling-- this will, in general, not result in another beta-binomial model (see Wiuf and Stumpf 2006), but it is reasonable to approximate it by a beta-binomial distribution with the same mean and variance. To derive the mean and variance of $F_G$, note that it can be modelled as a random sum of Bernoulli random variables, i.e. 
$ F_G = \sum_{i=1}^{I_G} B_i$, where $B_i \sim \mbox{Bernoulli}(\lambda)$. Thus, 

$$\mbox{E}(F_G) = \lambda\, \mbox{E}(I_G) $$
and
$$\mbox{Var}(F_G)= \mbox{E}(\mbox{Var}(F_G| I_G)) + \mbox{Var}(\mbox{E}(F_G| I_G)) = \lambda(1-\lambda)\mbox{E}(I_G) + \lambda^2 \mbox{Var}(I_G), $$

where $\mbox{E}(I_G)$ and $\mbox{Var}(I_G)$ are given by the mean and variance of a beta binomial distribution. Thus, having derived the mean and variance of $F_G$, one can approximate the fatality-count distribution by a beta binomial with matching moments. 
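A minimal sketch of this moment matching, assuming the standard beta-binomial moments and using $\mbox{Var}(F_G \mid I_G) = I_G\,\lambda(1-\lambda)$ and $\mbox{E}(F_G \mid I_G) = \lambda I_G$ in the law of total variance. All numerical values and function names are hypothetical, chosen only for illustration.

```python
def betabinom_moments(n, a, b):
    """Mean and variance of BetaBinomial(n, a, b)."""
    s = a + b
    mean = n * a / s
    var = n * a * b * (s + n) / (s * s * (s + 1.0))
    return mean, var


def fatality_moments(lam, n, a, b):
    """Mean and variance of F_G from the laws of total expectation and variance."""
    e_i, var_i = betabinom_moments(n, a, b)
    mean = lam * e_i
    var = lam * (1.0 - lam) * e_i + lam ** 2 * var_i
    return mean, var


def match_betabinom(mean, var, n):
    """Return (a, b) such that BetaBinomial(n, a, b) has the given mean and variance."""
    p = mean / n
    binom_var = n * p * (1.0 - p)
    # A beta-binomial is always overdispersed relative to the binomial.
    if not binom_var < var < n * binom_var:
        raise ValueError("variance is outside the range a beta-binomial can match")
    s = (n * binom_var - var) / (var - binom_var)  # s = a + b
    return p * s, (1.0 - p) * s


# HYPOTHETICAL placeholder values, not taken from the study.
n_G, alpha_theta, beta_theta, lam = 1000, 70.0, 380.0, 0.004

mean_f, var_f = fatality_moments(lam, n_G, alpha_theta, beta_theta)
a_f, b_f = match_betabinom(mean_f, var_f, n_G)
print(mean_f, var_f, a_f, b_f)
```

Comparing the matched beta-binomial probabilities with a direct numerical evaluation of the likelihood would show how close the approximation is in the parameter range of interest.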

[Someone would need to check the equations, probably at least one error in them, and also implement the moment matching, but then we have a simple likelihood which I would guess is very close to the true one. There is even a chance it is identical, in which case of course we need to reword the above...]


### Approximating the infection count by a negative-binomial model

The second approximation [which I am yet to work out] exploits the fact that the infection count is expected to be substantially lower than the total population size. In that limit (I think) one can approximate the beta-binomial distribution by a negative binomial one, and as the negative binomial distribution is closed under binomial subsampling, the fatality count would then again follow a negative binomial distribution, which would give a closed-form likelihood.

[Well, if we do either or both of these approximations, someone should just check empirically both that these models are indeed close approximations, and that it does not look like they are identical... if that is the case then I am probably wrong and the beta-binomial family is closed under binomial subsampling]. Note that the advantage of having an analytical likelihood is that we can very easily use numerical integration to calculate the posterior, otherwise we have to sum over all possible values of $I_G$, which is also fine but a bit annoying. Also, I just enjoyed working through this, so Matthias please do me the favour of checking whether it makes a difference ;-)

## Discussion

We emphasize that the analysis above rests on several distributional assumptions. These include: 

* The estimate of the infection rate and its uncertainty by Streeck et al is correct, and a Beta distribution is an appropriate distribution to capture this uncertainty.

* Infections in Gangelt are (conditional on the infection rate) independent from each other. 

* Fatalities in Gangelt are (conditional on the fatality rate) independent from each other.

In addition, for any extrapolation to infection-counts in Germany, one would additionally have to assume:

* Both the recorded fatalities in Gangelt ($F_G= ?$) and in Germany ($F_D= ?$) are correct. 

* The infection-fatality rate in Gangelt and in Germany are the same, $\lambda = \lambda_G = \lambda_D$.

Violation of any of these assumptions, or any combination of them, could have a substantial impact on the results. There are reasonable grounds to challenge several of them, e.g. the assumption that fatalities are independent, or that the infection fatality rate in Gangelt and in Germany is the same (as also acknowledged by the authors of Streeck et al 2020).

We do not take a position on whether these assumptions are appropriate. However, we would expect that violations of these assumptions will generally lead to an increase in the resulting uncertainty. Thus, we would expect that a more conservative analysis (which e.g. models fluctuations in infection rates or statistical dependence in fatalities, e.g. by marginalizing over the associated uncertainty) would likely lead to bigger error bars.

In addition, we emphasize that the results of any Bayesian analysis depend on the prior distributions assumed. In our case, this is most prominently the case for the prior on the fatality rate. In particular, we showed that our inference on the posterior distribution over $\lambda$ (i.e. the error bars on the IFR) is manifestly dependent on the prior. One could interpret this dependence on the prior as a weakness of the Bayesian analysis. However, the advantage of the Bayesian analysis is that the dependence on prior assumptions is made explicit, and can be explored quantitatively. In addition, the dependence on the prior highlights the fact that the data from Gangelt, in itself, only weakly constrains the IFR.

Following the above analysis, one can conclude that the study by Streeck et al only provides weak constraints both on the infection fatality rate in Gangelt and on the extrapolated total infection count in Germany. This is consistent with the naive intuition that any estimate of a fatality rate from a local study would depend on a small number of events (here: 7 fatalities), and would therefore be of limited reliability.

It has been stated that the goal of the study was to provide "constraints for models" (JHM: best to find a reference where they did say it, e.g. in the press release). As pointed out above, given limited data, no such study would be able to fully constrain models or their parameters. This is particularly evident in the very example of how these numbers could be used in a "theoretical model" to extrapolate to an infection count in Germany. As shown above, the results of any such extrapolation would be strongly affected by uncertainty in the underlying parameters. This highlights a more general point: given that it is rarely possible to fully constrain any parameter of a model or simulation, any simulation results should ideally be interpreted by considering an ensemble of models which are generated from different parameter sets, each of which is consistent with the observed data (technically: parameters which are sampled from the posterior distribution). [I was tempted to put a plug here for simulation-based inference but not sure this is the right place...]

Thus, the Gangelt study, in itself, only provides a weak constraint on the IFR. However, even such weak constraints can be highly valuable if they are combined with multiple studies (as are e.g. currently underway, links): by pooling the results of multiple such studies, one can aim both to constrain IFRs more strongly and to estimate regional variations in IFRs. Such meta-analyses benefit strongly from openly shared data and statistical analysis protocols.