# <center>Day-1 Practical Session, 25 May 2021</center>
## <center>Part 1: Epidemic Prevalence Estimation</center>
#### <center> *Li-Chun Zhang*<sup>1,2,3</sup> and *Melike Oguz-Alper*<sup>2</sup> </center>
  
##### <center> <sup>1</sup>*University of Southampton (L.Zhang@soton.ac.uk)*, <sup>2</sup>*Statistics Norway*, <sup>3</sup>*University of Oslo* </center>
***

<a id="size-biased"></a>
### Illustration I: Size-biased sampling and adaptive network tracing

This illustration compares the efficiency of the Horvitz-Thompson estimator (HTE) of epidemic prevalence under Adaptive Cluster Sampling (ACS) by adaptive network tracing with that of the HTE based on the initial sample $s_0$ alone. The initial sample is selected with either Simple Random Sampling ($\eta=1$) or *size-biased* sampling ($\eta\neq 1$), where $\eta$ is the *odds-ratio* between the case and non-case groups, defined by $$\eta=\frac{Pr(i\in s_0|y_i=1)}{Pr(i\in s_0|y_i=0)}\cdot$$

***
We will use the R-function <font color=green>**mainEpi**</font> described below.
***
#### Description of the population and sampling strategies
* Population graph: $G=(U,A)$, where $(ij)\in A$ if $i\in U$ and $j\in U$ are in contact
* Population size is denoted by $N$
* The population consists of cases and non-cases: $y_i=1$ if case, and $y_i=0$ otherwise
* A *case network* in $G$ contains all case-nodes connected to each other
* The set of case networks in $G$ is denoted by $\Omega$
* The total number of cases in the population is denoted by $Y=\sum_{i\in U}y_i$ (**Y** within the function below)
* All cases in the population are divided into $M$ case networks of equal size $K$
* The initial sample $s_0$ of size $n_0$ is selected with either SRS or Poisson sampling with probabilities $\pi_i\propto \eta$ if $y_i=1$ and $\pi_i\propto 1$ otherwise
* For ACS, every case network that contains at least one case selected in $s_0$ is taken into the sample in full
* The set of sampled case networks is denoted by $\Omega_s$
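
The setup above can be sketched on a toy population. This is only an illustrative sketch: the values of `N`, `M`, `K`, `n0` and `eta`, and the names `w`, `hit` and `s_acs`, are hypothetical and are not taken from the function described below.

```r
# Toy population (hypothetical values): the first Y = M*K units are cases,
# grouped into M case networks of equal size K
N <- 20; M <- 2; K <- 3; n0 <- 5; eta <- 2
Y <- M * K

# Network index: 1,...,1, 2,...,2 for cases and 0 for non-cases
kidx <- c(rep(seq_len(M), each = K), rep(0, N - Y))
y <- as.integer(kidx > 0)

# Size-biased inclusion probabilities: proportional to eta for cases and
# to 1 for non-cases, scaled so that they sum to the expected sample size n0
w <- ifelse(y == 1, eta, 1)
p <- n0 * w / sum(w)

# One Poisson draw of s0, followed by adaptive network tracing:
# every case network hit by s0 is observed in full
set.seed(1)
s0 <- which(runif(N) < p)
hit <- setdiff(unique(kidx[s0]), 0)
s_acs <- sort(union(s0, which(kidx %in% hit)))
```

With these probabilities the ratio `p[1] / p[N]` equals `eta`, which matches the definition of $\eta$ given at the start of the illustration.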

***

#### Description of R-function <font color=green>**mainEpi**</font>
##### 1. Function parameters
* **N**: the population size; default value $100\,000$
* **theta**: prevalence in the population, i.e. the proportion of the cases in the population, $\theta=Y/N$; default value $0.01$
* **f**: sampling fraction, $f=n_0/N$; default value $0.01$
* **M**: the number of case-networks in the population; default value $10$
* **lift**: the odds-ratio between cases and non-cases, denoted by $\eta$; default value $1$
* **B**: number of replications; default value $100$

##### 2. Main steps of the function
* Equal case-network size, **K**, is calculated from **N**, **theta** and **M** as $N\theta/M$ rounded to the nearest integer, and the total number of cases is calculated by **Y**=**MK**
* Index for case-networks, **kidx**, is created for all cases and set to $0$ for all non-cases, such that the first **Y** units in the population take a network id number while the last $N-Y$ units take value $0$, that is, $\underbrace{1,1,\dots,1}_{K \text{ times}},\underbrace{2,2,\dots,2}_{K \text{ times}},\dots,\underbrace{M,M,\dots,M}_{K \text{ times}},\underbrace{0,0,\dots,0}_{(N-Y) \text{ times}}$
* Inclusion probabilities, $\pi_i=\mathrm{Pr}(i\in s_0)$, denoted by **p** inside the function, are calculated as proportional to **lift** if $y_i=1$ and to $1$ otherwise, and are scaled so that they sum to $n_0$
* Inclusion probabilities of case-networks are calculated by 
    * $\pi_{(\kappa)}=1-(1-p)^K$, denoted by **pr.k** inside the function, under Poisson sampling of $s_0$,
    * $\pi_{(\kappa)}=1-\binom{N-K}{n_0}/\binom{N}{n_0}$, denoted by **p.k** inside the function, under SRS of $s_0$
* A $Y\times 2$ array is created, called **nidx** inside the function, the first and the second columns of which are filled with the case-network sizes and their inclusion probabilities, respectively 
* The HTE of the prevalence in the population:

    $\hat{\theta}_{HT} = \frac{1}{N}\sum_{\kappa\in \Omega_s}  \frac{y_{\kappa}}{\pi_{(\kappa)}}$, where $y_{\kappa}$ is the number of cases in network $\kappa$

* Sampling variances of the HTE based on $s_0$ are calculated under SRS and Poisson sampling, denoted by **v.srs** and **v.pois** inside the function:
$$V_{srs}(\hat{\theta}_{HT;s_0})=\big(1-\frac{n_0}{N}\big)\frac{\theta(1-\theta)}{n_0},\quad V_{pois}(\hat{\theta}_{HT;s_0})=\frac{1}{N^2}\sum_{i \in U}\big(\frac{1}{\pi_i}-1\big)y_i$$
* Sampling variance of the HTE under ACS is calculated by
$$V_{acs}(\hat{\theta}_{HT})=\frac{1}{N^2}\left\{\sum_{\kappa\in\Omega}\big(\frac{1}{\pi_{(\kappa)}}-1\big)K^2+\sum_{\kappa\in\Omega}\sum_{\ell\neq \kappa\in\Omega}\big(\frac{\pi_{(\kappa\ell)}}{\pi_{(\kappa)}\pi_{(\ell)}}-1\big)K^2\right\}\cdot$$

Here, we have $\pi_{(\kappa\ell)}=\pi_{(\kappa)}+\pi_{(\ell)}-(1-\bar\pi_{(\kappa\cup\ell)})$, where $\bar\pi_{(\kappa\cup\ell)}$ is the probability that no unit of $\kappa\cup\ell$ is selected in $s_0$, which equals $(1-p)^{2K}$ under Poisson sampling.
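
As a worked example of the formulas above, the following sketch evaluates the network inclusion probabilities and the two terms of $V_{acs}$ for the default design ($N=100\,000$, $\theta=0.01$, $M=10$, $f=0.01$, $\eta=1$). The variable names are illustrative and are not those used inside the function; `lchoose` is used only to evaluate the binomial ratio on the log scale.

```r
# Default design: N = 1e5, theta = 0.01, M = 10, f = 0.01, eta = 1
N <- 1e5; M <- 10; K <- 100; n0 <- 1000
p <- n0 / N                      # unit inclusion probability when eta = 1

# Network inclusion probability under Poisson sampling of s0
pi_pois <- 1 - (1 - p)^K

# Network inclusion probability under SRS of s0:
# 1 - choose(N-K, n0) / choose(N, n0), evaluated on the log scale
pi_srs <- 1 - exp(lchoose(N - K, n0) - lchoose(N, n0))

# Joint inclusion probability of two distinct networks under Poisson sampling
pi_joint <- 2 * pi_pois - (1 - (1 - p)^(2 * K))

# The two terms of V_acs: single-network term and cross-network term
v_diag  <- M * (1 / pi_pois - 1) * K^2 / N^2
v_cross <- M * (M - 1) * (pi_joint / pi_pois^2 - 1) * K^2 / N^2
v_acs   <- v_diag + v_cross

print(c(pi_pois = pi_pois, pi_srs = pi_srs, se_acs = sqrt(v_acs)))
```

Under Poisson sampling, distinct case networks are selected independently, so $\pi_{(\kappa\ell)}=\pi_{(\kappa)}\pi_{(\ell)}$ and the cross term is zero up to rounding error. Here `pi_pois` is about $0.634$.
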
* Simulation study
    * **B** random samples are selected from the population with *Sequential Poisson Sampling* (SPS)
    * For each replication, the population prevalence is estimated based on $s_0$ and the sample of case-networks observed by adaptive tracing 
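
A single replication of the simulation step might look as follows. This is a minimal sketch on a hypothetical toy population (`N`, `M`, `K`, `n0` and `eta` are made-up values), and it assumes, as the function does, that the Poisson formula $1-(1-p)^K$ is used for the network inclusion probability even though $s_0$ is drawn by SPS.

```r
set.seed(2021)
N <- 1000; M <- 5; K <- 4; n0 <- 100; eta <- 2   # hypothetical toy values
Y <- M * K
kidx <- c(rep(seq_len(M), each = K), rep(0, N - Y))
w <- ifelse(kidx > 0, eta, 1)
p <- n0 * w / sum(w)

# Sequential Poisson sampling: keep the n0 units with the smallest u_i / p_i
u <- runif(N) / p
s0 <- sort(order(u)[seq_len(n0)])
s0y <- s0[kidx[s0] > 0]                 # cases in the initial sample

# HTE based on s0 alone
est_s0 <- sum(1 / p[s0y]) / N

# HTE under ACS: each sampled network contributes K / pi_(kappa) exactly once
pi_net <- 1 - (1 - p[1])^K              # p[1] is the inclusion probability of a case
nets <- unique(kidx[s0y])
est_acs <- length(nets) * K / pi_net / N

print(c(theta = Y / N, est_s0 = est_s0, est_acs = est_acs))
```

Repeating these steps **B** times and comparing the Monte Carlo variances of the two estimators gives a simulated relative efficiency.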
    
##### 3. Main outputs of the function
* Relative efficiency (RE) of the HTE under ACS against the HTE based on the initial sample $s_0$ selected with either SRS or Poisson sampling, given as the variance of the former divided by the variance of the latter
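
To get a feel for how the analytic relative efficiency depends on $\eta$, one could evaluate the variance formulas directly. The helper `re_analytic` below is hypothetical (it is not part of the function described here) and assumes Poisson sampling of $s_0$, under which the cross-network term of $V_{acs}$ vanishes.

```r
# Hypothetical helper: analytic RE of the HTE under ACS against the HTE
# based on s0, assuming Poisson sampling of s0
re_analytic <- function(N = 1e5, theta = 0.01, f = 0.01, M = 10, lift = 1) {
  K  <- trunc(N * theta / M + 0.5)          # common network size
  Y  <- M * K                               # total number of cases
  n0 <- trunc(N * f + 0.5)                  # initial sample size
  p_case <- n0 * lift / (lift * Y + N - Y)  # inclusion probability of a case
  pi_net <- 1 - (1 - p_case)^K              # network inclusion probability
  v_pois <- Y * (1 / p_case - 1) / N^2      # variance of the HTE based on s0
  v_acs  <- M * (1 / pi_net - 1) * K^2 / N^2
  v_acs / v_pois
}

re <- sapply(c(1, 2, 5), function(eta) re_analytic(lift = eta))
print(re)
```

Values below $1$ indicate that the HTE under ACS has the smaller variance.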
    
***

mainEpi <- function(N=10^5,theta=0.01,f=0.01,M=10,lift=1,B=100)
{
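  # K = common case-network size (rounded), Y = total number of cases
  K = trunc(N*theta/M + 0.5); Y = M*K
  # kidx = network id (1..M) for the first Y units (the cases), 0 for non-cases
  kidx = rep(0,N); kidx[1:Y] = c(t(array(1:M,c(M,K))))
  y = 1*(kidx>0)
  # n0 = initial sample size (rounded)
  n0 = trunc(N*f+0.5)
  cat("(M, K, Y, n0) =",c(length(unique(kidx))-1,K,sum(y),n0),"\n")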
  
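  # Unit inclusion probabilities: proportional to lift for cases and to 1 for non-cases,
  # scaled to sum to n0
  p = rep(n0/N,N); p[1:Y] = lift*p[1:Y]; p = n0*p/sum(p)
  # Network inclusion probabilities for sizes 1..K under Poisson sampling; p[1] is the
  # inclusion probability of a case (max(p) would pick the non-case value when lift < 1)
  pr.k = p.k = 1-exp(c(1:K)*log(1-p[1]))
  cat('Summary of pr.k under pois:','\n'); print(summary(pr.k))
  if (lift==1) {
    # Under SRS: 1 - choose(N-m,n0)/choose(N,n0), computed on the log scale
    for (m in 1:K) { p.k[m] = 1-exp(sum(log(N-m+1-c(1:n0)) - log(N+1-c(1:n0)))) }
    cat('Summary of p.k if SRS:','\n'); print(summary(p.k))
  }
  # nidx: for each case, its network size (column 1) and network inclusion probability (column 2)
  nidx = array(0,c(Y,2))
  for (k in 1:M) { idk = kidx[1:Y]==k; nidx[idk,1] = sum(idk); nidx[idk,2] = pr.k[sum(idk)] }
  
  # Analytic variances of the HTE based on s0 (SRS, Poisson) and of the HTE under ACS
  v.srs = (1-n0/N)*theta*(1-theta)/n0
  v.pois = sum((1/p-1)*y)/N^2
  cat("SE(initial) by SRS, Pois =",sqrt(c(v.srs,v.pois)),"\n")
  # tmp = joint inclusion probability of two distinct case-networks under Poisson sampling
  tmp = 1 + exp(2*K*log(1-p[1])) - 2*exp(K*log(1-p[1]))
  v.acs = c(M*(1/pr.k[K]-1), M*(M-1)*(tmp/pr.k[K]^2 -1))*K^2/N^2
  cat("SE(ACS) =",sqrt(sum(v.acs)),"\t leading term:",sqrt(v.acs[1]),"\n")
  cat("RE(analytic) =",sum(v.acs)/v.pois,"\n")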
  
  mat = array(0,c(B,5))
  for (i in 1:B) {
    # s0 by SPS (sequential Poisson sampling), s0y = effective sample of cases
    u = runif(N,0,1)/p
    s0 = sort((c(1:N)[order(u)])[1:n0])
    s0y = s0[s0<=Y]
    mat[i,3] = length(s0y)
    mat[i,5] = n0-mat[i,3]
    # initial sample estimator
    mat[i,1] = sum(1/p[s0y])/N
    # HT estimator, need only the first unit in s0 for each case-network
    idk = s0y[!duplicated(kidx[s0y])]
    mat[i,2] = sum(nidx[idk,1]/nidx[idk,2])/N
    mat[i,4] = length(idk)
    # total number of observed units: non-cases in s0 plus all cases in the sampled networks
    mat[i,5] = mat[i,5] + sum(nidx[idk,1])
  }
  
  emat = rbind(colMeans(mat), sqrt(diag(var(mat))))
  colnames(emat) = c("Est.s0","Est.ACS","No cases s0",'No sample case-ntw','No sample units')
  rownames(emat) = c("MC-Mean","MC-SD")
  print(emat) 
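  # RE(simulation): MC variance of the ACS estimator relative to (i) the MC variance of the
  # initial-sample estimator and (ii) the analytic Poisson variance v.pois
  cat("RE(simulation) =",var(mat[,2])/c(var(mat[,1]),v.pois),"\n")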
}