notes/testing_distrib.Rmd

---
title: "models for testing"
---

```{r pkgs,message=FALSE,echo=FALSE}
library(tidyverse)
library(ggplot2); theme_set(theme_bw())
library(viridis)
library(bbmle)
source("covid19testing_funs.R")
```

## Introduction

How should we think about the connection between numbers of (1) infected people in the population (2) tests done (3) cases reported?

Most generally/mechanistically, on any day we have some number of people who are on day $\tau$ of their infectious period. On a given day, an infectious ($I$) person may be subclinical (asymptomatic or so mildly symptomatic that they don't report symptoms); mild; severe (hospitalized); or very severe (ICU/ventilator). The progression among these stages will be different for different people (although in general they will progress toward greater severity). The more severe the symptoms, the more likely someone is to be tested.

Testing will also depend on how they got infected, e.g. imported cases and close contacts are more likely to be tested. (An asymptomatic, community-infected person will basically never be tested.)

In a complete agent-based model (or a model based on a complex compartmental structure), we could keep track of people moving through all of these categories and assign them probabilities of being tested. In a sufficiently realistic model we could even take into account detailed testing criteria and phenomena such as limited testing resources (so that, e.g., low-risk, mild cases would never be tested).

There are two extreme scenarios that are easy to reason about.

* if testing is random, the proportion of positive tests should be the same as the prevalence (or incidence - I'm not being precise about whether we are measuring a flow or a stock, but I don't think it makes a difference), regardless of either. Increasing the number of tests when the prevalence is constant leads to a proportional increase in the number of cases. The proportion of positive tests is $i$ and the number of cases (i.e., positive tests) is $iT$. (I'm using $i$, $t$ for proportion of the population infected or tested and $I$, $T$ as the number infected or tested: $I=iN$, $T=tN$.) If the prevalence is increasing exponentially with rate $r_1$ and testing is increasing at rate $r_2$, the number of cases increases at rate $r_1 + r_2$ (and doubling time $\log(2)/(r1+r2)$).
* if testing is perfectly focused on infected people, then tests are 100% positive until we run out of infected people (i.e. proportion of pop tested > prevalence). The proportion of positive tests is $\max(1,i/t)$ and the number of cases is $\max(T,I)$.

Now let's consider that in general we test people in (approximate) order of their probability of infection (ignoring tests skipped because of constraints). We can think about a distribution in the population of probability of infection (higher for known contacts of infected people, travellers from high-risk areas, etc.), and a distribution in the probability of testing (which ideally lines up with the risk). We could think about having two distributions, but these can't really be separated (FIXME: explain better!).

## Machinery

Let's use a Beta distribution with mean equal to prevalence $i$ as the distribution of 'probability infected'. (An alternative that might be more analytically tractable is the [Kumaraswamy distribution](https://en.wikipedia.org/wiki/Kumaraswamy_distribution).)  As the dispersion $\phi=1/(a+b)$ goes to infinity we end up with point masses $1-i$ on 0 and $i$ on 1; as it goes to 0 we end up with a point mass on $i$. If we always test 'from the top down' (i.e. calculate the mean value of the top fraction $T/N$ of the population distribution), these two extreme cases correspond to perfect testing (cases=$\min(T,I)$) and random testing (cases=$\textrm{Binomial}(i,T)$).

If $B$ is the Beta distribution with mean $i$ and dispersion $\gamma$, $\Phi_B$ is the CDF, and $Q$ is the inverse CDF (i.e., the quantile function) then our expected proportion positive from testing a fraction $T$ is $\left(\int_{Q(1-T/N)}^1 B(y,\phi) y \, dy\right)/(1-\Phi_B(Q(1-T/N)))$, i.e. "the mean of the upper tail of the infection-probability distribution".

We can get a little bit farther analytically.
Given that $B(x,a,b) \propto x^{a-1} (1-x)^{b-1}$, the mean is $a/(a+b)=i \to a \phi =i \to a = i/\phi, b=(a+b)-a = 1/\phi - i/\phi = (1-i)/\phi$. We can show that $B(x,a,b) \cdot x = a/(a+b) B(x,a+1,b)$, so we should be able to do the integral directly by computing the appropriate CDF (or complementary CDF) of the Beta distribution.

$B(x,a,b) =\frac{1}{\mathcal{B}(a,b)} x^{a-1} (1-x)^{b-1}$, where $\mathcal{B}(a,b)$ is the beta function, which satisfies 
$$\mathcal{B}(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\,.$$
It follows that 
$$
\begin{aligned}
B(x,a,b)x = \frac{1}{\mathcal{B}(a,b)} x^{a} (1-x)^{b-1} &= \frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)}\times \frac{1}{\mathcal{B}(a+1,b)} x^{a} (1-x)^{b-1}\\
&=\frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)} B(x,a+1,b)\,,
\end{aligned}
$$
which, using the identity $\Gamma(z+1) = z\Gamma(z)$, simplifies to
$$
\begin{aligned}
B(x,a,b)x
&=\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} B(x,a+1,b)\\
&= \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a+b)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} B(x,a+1,b) = \frac{a}{a+b}B(x,a+1,b)\,.
\end{aligned}
$$
Since we set the mean $a/(a+b)$ of the distribution $B(x,a,b)$ to $i$, we have $B(x,a,b)x = iB(x,a+1,b)$.

In order to make the dispersion parameter a little more interpretable we will transform the parameter $\gamma$ from $(0,\infty)$  to a value $\phi=-\log(1-\gamma)$ (i.e. $\gamma=1-\exp(-\phi)$), so that 0 corresponds to random testing and 1 corresponds to perfectly focused testing.

If we wanted to get more realistic/detailed we could relax the assumption of a Beta distribution and instead characterize the distribution of probability infected as a mixture of types (general population, symptomatic, travel history, etc.), with the constraint of a mean of $i$ \ldots

## Numerical examples

```{r plots0}
par(las=1,bty="l")
curve(prop_pos_test(i=0.01,t=0.001,x),from=0.001,to=0.999,
      xlab=expression("testing focus"~(phi)),
      ylab="proportion of positive tests",
      main=c("1% prevalence, 0.1% testing"),
      ylim=c(0,1))
abline(h=0.01,lty=2)
```

What happens as we test more people than are actually infected in the population?
```{r plots1}
par(las=1,bty="l")
x <- seq(0.001, 0.4, length.out =  101)
y <- prop_pos_test(t=x,i=0.01,phi=0.5)
## something funky about curve(),
## throws Error in from <= 0 || to <= 0 : 'length = 2' in coercion to 'logical(1)'
## hard to debug because of curve + mapply()/Vectorize() fanciness ...
plot(x, y, type = "l",
     log="x",xlab="proportion tested",
     ylab="prop positive tests",
     main=expression(list("prevalence=1%",phi==0.5)))
abline(v=0.01,lty=2)
```

```{r plots2}
dd <- (expand.grid(time=1:25,phi=c(0.001,0.2,0.5,0.8,0.999))
  %>% as_tibble()
  %>% mutate(inc=0.001*exp(log(2)/3*time),
             tests=1e-4*exp(log(2)/4*time),
             pos_prop=prop_pos_test(inc,tests,phi),
             cases=pos_prop*tests)
  ## issues with pivot_longer? use gather() instead
  ## "Error: `spec` must have `.name` and `.value` columns"
  %>% gather("var","value",inc,tests,pos_prop,cases)
)
print(ggplot(dd,aes(time,value,col=phi,group=phi))
  + geom_line()
  + facet_wrap(~var,scale="free")
  + scale_y_log10()
  + scale_colour_viridis_c()
)
```

**FIXME:** direct labels? drop incidence and testing, include as reference-slope lines?

## Estimation

When is $\phi$ identifiable?

- can we make a combined model with hospitalizations/deaths?
- if we assume that prevalence is increasing exponentially (and we know the rates of testing) can we estimate $\phi$?
- what other comparisons are possible?
- there is *some* information in the model because change in $i$ over time is constrained by the dynamical parameters ...
- can we set a sensible prior on $\phi$ and integrate over the possibilities?

## To do

### Simulations

- The first thing to do is the minimal identifiability test: if we simulate data directly using `prop_pos_test`, then add some noise (e.g. binomial testing outcomes), and then we try to take the observed testing data (total tests and positive tests) and get an MLE of both phi and the parameters of the exponential growth of prevalence (initial value and growth rate), can we do it?  If $t$ is the fraction of the population tested and $N$ is the population size and $f(\tau)=\textrm{prop\_pos}(i(\tau),t(\tau),\phi)$ is the fraction positive, then we want to simulate $c(\tau) \sim \textrm{Binomial}(f,tN)$, the number of confirmed positive cases at time $\tau$. Not worrying about sensitivity/specificity yet (tests are perfect).

MLE: want to estimate $i_0$, $r$ (growth rate of prevalence), and $\phi$. We know $t(\tau)$ and $c(\tau)$ (no time lags yet!)
   
```{r testsim}
## t (test vector) and c (confirmation vector) are defined; tau is a time vector
set.seed(101)
## scalar parameters
true_pars <- c(I_0=100, r=log(2)/3, phi=0.5,
               T_0=100, r_t=log(2)/4, N=1e6)
true_pars["i_0"] <- true_pars["I_0"]/true_pars["N"]
## vectors of observed stuff
dd <- data.frame(tau=1:20)
dd$T <- round(true_pars["T_0"]*exp(true_pars["r_t"]*dd$tau))
dd$t <- dd$T/true_pars["N"]
## simulating confirmation vector
dd$c <- with(c(as.list(true_pars), dd),
             rbinom(length(tau),
                    size=T,  ## number of tests at time tau
                    prob=prop_pos_test(i_0*exp(r*tau), t, phi)
                    )
             )
mle_out<- mle2(c~dbinom(prob=prop_pos_test(plogis(logit_i_0)*exp(r*tau),
                                           t, exp(log_phi),
                                 debug=FALSE, phiscale = "unconstrained"),
              size=t*N),
              start=list(logit_i_0=qlogis(true_pars["i_0"]),
                         r=true_pars["r"], log_phi=log(true_pars["phi"]) ),
              data= list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"]),
              control=list(maxit=1000)
)
mle_est0 <- coef(mle_out)
```


```{r pp,cache=TRUE}
pp <- profile(mle_out)
```

```{r}
dd <- as.data.frame(pp)
ggplot(dd,aes(focal,abs(z))) + geom_point() + geom_line() +
  facet_wrap(~param,scale="free_x")+
  scale_y_continuous(limits=c(NA,5))
```

```{r}
invlink <- function(x) c(plogis(x[1]),x[2],exp(x[3]))
tt <- (broom::tidy(mle_out)
  %>% mutate(lwr=estimate-2*std.error,
             upr=estimate+2*std.error)
  %>% select(term,estimate,lwr,upr)
  %>% mutate_if(is.numeric,invlink)
  %>% mutate(term=c("i_0","r","phi"))
  %>% mutate(true=true_pars[c("i_0","r","phi")])
)
```

- The next simulation is for understanding better what we're really modeling here. The simulation should be a more mechanistic description of the infection and testing process: not sure how this actually worked. There must be some infection process (e.g. an SIR model ...) and some process by which people get assigned a probability of being chosen for testing, which we can match up with the mathematical description above. The testing process is also mechanistic; after testing people get removed from the testable pool.

How do we model a testing policy?
How do we make the testing policy match up with our distribution? 

Give every individual two (correlated) numbers which correspond to infection risk and testing risk? Increase testing risk at onset of symptoms?

Test asymptomatic (susceptible) people at one rate and symptomatic (infectious) people at another rate?

A fraction of the incidence moves into a "testing pool"; people in the testing pool are tested at a constant rate. Every person gets a number sampled from our beta distribution when they become infected, which governs their chance of being selected


### Kumaraswamy distribution

Is the inverse CDF for the Kumaraswamy distribution tractable?  Seems that way:

$$
\begin{split}
q & =1-(1-x^a)^b \\
(1-q)^{1/b} & = 1-x^a \\
x = (1- (1-q)^{1/b})^{1/a}
\end{split}
$$

(this could be important if we want to embed this machinery in Stan/JAGS/etc. where `pbeta()`/`qbeta()` may not exist and/or be sufficiently robust ...)

Unfortunately the tradeoff is that the expression for the mean is complicated:

$$
\frac{b\Gamma\left(1+ \frac{1}{a}\right) \Gamma(b)}{\Gamma\left(1+\frac{1}{a}+b\right)}
$$

so reparameterizing in terms of mean/dispersion might be hard (and the integral trick with the Beta also might not work \ldots)

### brain dump/misc

How does this relate to other ways that people are thinking about testing vs incidence? e.g. graphs in https://docs.google.com/spreadsheets/d/1ecQ0t1Sn2maR2b9sUacA3RSIUTHFlT-V1XiNnOFiBpw/edit#gid=2019730603