# <center>Day-1 Practical Session, 25 May 2021</center>
## <center>Part 1: Epidemic Prevalence Estimation</center>
#### <center> *Li-Chun Zhang*<sup>1,2,3</sup> and *Melike Oguz-Alper*<sup>2</sup> </center>
  
##### <center> <sup>1</sup>*University of Southampton (L.Zhang@soton.ac.uk)*, <sup>2</sup>*Statistics Norway*, <sup>3</sup>*University of Oslo* </center>
***

<a id="size-biased"></a>
### Illustration I: Size-biased sampling and adaptive network tracing

In this illustration, the efficiency of the Horvitz-Thompson estimator (HTE) of epidemic prevalence under Adaptive Cluster Sampling (ACS) by adaptive network tracing will be compared with the HTE based on the initial sample $s_0$, selected with either Simple Random Sampling ($\eta=1$) or *size-biased* sampling with $\eta\neq 1$, where $\eta$ is the *odds-ratio* between case and non-case groups, which is defined by $$\eta=\frac{Pr(i\in s_0|y_i=1)}{Pr(i\in s_0|y_i=0)}\cdot$$
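
To make the role of $\eta$ concrete, the following minimal sketch computes first-order inclusion probabilities that are $\eta$ times larger for cases than for non-cases and that sum to $n_0$. The helper name `size_biased_probs` and the layout of the population (cases first) are assumptions made for illustration only; they are not part of the session material.

```r
# Hypothetical helper (not part of mainEpi): inclusion probabilities for an
# initial sample of expected size n0, with cases eta times as likely to be
# selected as non-cases.
size_biased_probs <- function(N, Y, n0, eta) {
  y <- c(rep(1, Y), rep(0, N - Y))   # cases first, then non-cases
  w <- ifelse(y == 1, eta, 1)        # relative selection weights
  p <- n0 * w / sum(w)               # rescale so that sum(p) equals n0
  list(y = y, p = p)
}

out <- size_biased_probs(N = 1e5, Y = 1000, n0 = 1000, eta = 2)
p_case    <- out$p[out$y == 1][1]    # Pr(i in s0 | y_i = 1)
p_noncase <- out$p[out$y == 0][1]    # Pr(i in s0 | y_i = 0)
c(p_case = p_case, p_noncase = p_noncase, ratio = p_case / p_noncase)
```

With $\eta=1$ every unit gets $\pi_i=n_0/N$, which corresponds to the equal-probability case.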

***
We will use the R-function <font color=green>**mainEpi**</font> described below.
***
#### Description of the population and sampling strategies
* Population graph: $G=(U,A)$, where $(ij)\in A$ if $i\in U$ and $j\in U$ are in contact
* Population size is denoted by $N$
* The population consists of cases and non-cases: $y_i=1$ if case, and $y_i=0$ otherwise
* A *case network* in $G$ contains all case-nodes connected to each other
* The set of case networks in $G$ is denoted by $\Omega$
* The total number of cases in the population is denoted by $Y=\sum_{i\in U}y_i$ within the function below
* All cases in the population are divided into $M$ case networks of equal size $K$
* The initial sample $s_0$ of size $n_0$ is selected with either SRS or Poisson sampling with probabilities $\pi_i\propto \eta$ for cases and $\pi_i\propto 1$ for non-cases
* For ACS, every case-network that contains at least one case selected in $s_0$ is taken into the sample in full
* The set of sampled case networks is denoted by $\Omega_s$
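
The setup above can be sketched on a toy population. The sizes and the hand-picked initial sample `s0` below are illustrative assumptions, chosen only to show how adaptive tracing expands $s_0$ into the final sample.

```r
# Toy population: N = 12 units, M = 2 case-networks of size K = 3
N <- 12; M <- 2; K <- 3; Y <- M * K
kidx <- c(rep(1:M, each = K), rep(0, N - Y))  # network id, 0 for non-cases
y <- as.integer(kidx > 0)                     # case indicator

# Hypothetical initial sample, chosen by hand for illustration
s0 <- c(2, 8, 11)

# Adaptive tracing: every case in s0 brings in its whole case-network
hit    <- unique(kidx[s0][kidx[s0] > 0])      # ids of the sampled networks
traced <- which(kidx %in% hit)                # all cases in those networks
s      <- sort(union(s0, traced))             # final ACS sample
s
```

Here only unit 2 is a case, so network 1 (units 1 to 3) is traced, while the non-cases 8 and 11 contribute no further units.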

***

#### Description of R-function <font color=green>**mainEpi**</font> 
##### 1. Function parameters
* **N**: the population size; default value $100\,000$
* **theta**: prevalence in the population, i.e. the proportion of the cases in the population, $\theta=Y/N$; default value $0.01$
* **f**: sampling fraction, $f=n_0/N$; default value $0.01$
* **M**: the number of case-networks in the population; default value $10$
* **lift**: the odds-ratio between cases and non-cases, denoted by $\eta$; default value $1$
* **B**: number of replications; default value $100$

##### 2. Main steps of the function
* Equal case-network size, **K**, is calculated by rounding $N\theta/M$ to the nearest integer, and the total number of cases is then calculated by **Y**=**MK**
* Index for case-networks, **kidx**, is created for all cases and set to $0$ for all non-cases, such that the first **Y** units in the population take a network id number while the last $N-Y$ units take value $0$, that is, $\underbrace{1,1,\dots,1}_{K \text{ times}},\underbrace{2,2,\dots,2}_{K \text{ times}},\dots,\underbrace{M,M,\dots,M}_{K \text{ times}},\underbrace{0,0,\dots,0}_{(N-Y) \text{ times}}$
* Inclusion probabilities, $\pi_i=\mathrm{Pr}(i\in s_0)$, denoted by **p** inside the function, are calculated as proportional to **lift** if $y_i=1$, and $1$ otherwise
* Inclusion probabilities of case-networks are calculated by 
    * $\pi_{(\kappa)}=1-(1-p)^K$, denoted by **pr.k** inside the function, under Poisson sampling of $s_0$,
    * $\pi_{(\kappa)}=1-\binom{N-K}{n_0}/\binom{N}{n_0}$, denoted by **p.k** inside the function, under SRS of $s_0$
* A $Y\times 2$ array, called **nidx** inside the function, is created; its first and second columns hold, for each case, the size of its case-network and the inclusion probability of that network, respectively 
* The HTE of the prevalence in the population:

    $\hat{\theta}_{HT} = \frac{1}{N}\sum_{\kappa\in \Omega_s}  \frac{y_{\kappa}}{\pi_{(\kappa)}}$, where $y_{\kappa}$ is the number of cases in network $\kappa$

* Sampling variances of the HTE based on $s_0$ are calculated under SRS and Poisson sampling: **v.srs** and **v.pois**
$$V_{srs}(\hat{\theta}_{HT;s_0})=\big(1-\frac{n_0}{N}\big)\frac{\theta(1-\theta)}{n_0},\quad V_{pois}(\hat{\theta}_{HT;s_0})=\frac{1}{N^2}\sum_{i \in U}\big(\frac{1}{\pi_i}-1\big)y_i$$
* Sampling variance of the HTE under ACS is calculated by
$$V_{acs}(\hat{\theta}_{HT})=\frac{1}{N^2}\left\{\sum_{\kappa\in\Omega}\big(\frac{1}{\pi_{(\kappa)}}-1\big)K^2+\sum_{\kappa\in\Omega}\sum_{\ell\neq \kappa\in\Omega}\big(\frac{\pi_{(\kappa\ell)}}{\pi_{(\kappa)}\pi_{(\ell)}}-1\big)K^2\right\}\cdot$$

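Here, we have $\pi_{(\kappa\ell)}=\pi_{(\kappa)}+\pi_{(\ell)}-(1-\bar\pi_{(\kappa\cup\ell)})$, where $\bar\pi_{(\kappa\cup\ell)}$ is the probability that no unit of either network is selected in $s_0$, with $\bar\pi_{(\kappa\cup\ell)}=(1-p)^{2K}$ under Poisson sampling. In that case $\pi_{(\kappa\ell)}=\pi_{(\kappa)}\pi_{(\ell)}$, so the second sum vanishes and the printed SE(ACS) coincides with its leading term.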
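
As a check on the formulas above, the following sketch evaluates the network inclusion probabilities and the analytic variances for the default setting with `lift = 1`. The variable names (`pi_pois`, `pi_srs`, `pi_joint`, `v_pois`, `v_acs`) are chosen here for illustration, and the SRS probability is computed with `lchoose()` rather than the loop used inside the function.

```r
N <- 1e5; theta <- 0.01; f <- 0.01; M <- 10
K  <- trunc(N * theta / M + 0.5); Y <- M * K   # network size, number of cases
n0 <- trunc(N * f + 0.5)                       # initial sample size
p  <- n0 / N                                   # pi_i when lift = 1

# Inclusion probability of a case-network of size K
pi_pois <- 1 - (1 - p)^K                                   # Poisson sampling of s0
pi_srs  <- 1 - exp(lchoose(N - K, n0) - lchoose(N, n0))    # SRS of s0

# Joint inclusion probability of two distinct networks under Poisson sampling
pi_joint <- pi_pois + pi_pois - (1 - (1 - p)^(2 * K))

v_pois <- Y * (1 / p - 1) / N^2                # HTE based on s0, Poisson
v_acs  <- (M * (1 / pi_pois - 1) +
           M * (M - 1) * (pi_joint / pi_pois^2 - 1)) * K^2 / N^2

round(c(pi_pois = pi_pois, pi_srs = pi_srs,
        se_pois = sqrt(v_pois), se_acs = sqrt(v_acs),
        re = v_acs / v_pois), 6)
```

The standard errors and the ratio should agree with the `SE(initial)`, `SE(ACS)` and `RE(analytic)` lines printed by `mainEpi()` with default parameters.
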
* Simulation study
    * **B** random samples are selected with *Sequential Poisson Sampling* (SPS) from the population
    * For each replication, the population prevalence is estimated based on $s_0$ and the sample of case-networks observed by adaptive tracing 
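
A single SPS draw, as used in each replication, can be sketched as follows. The seed and the choice `lift = 2` are arbitrary assumptions for illustration; the ranking step mirrors the one in the function body.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
N <- 1e5; Y <- 1000; n0 <- 1000; lift <- 2
p <- rep(1, N); p[1:Y] <- lift     # relative weights: cases are units 1..Y
p <- n0 * p / sum(p)               # target inclusion probabilities

# Sequential Poisson sampling: rank units by u_i / p_i, keep the n0 smallest
u  <- runif(N) / p
s0 <- sort(order(u)[1:n0])

length(s0)                         # sample size is fixed at n0
sum(s0 <= Y)                       # number of cases caught in s0
```

Unlike ordinary Poisson sampling, SPS yields a fixed sample size, so the Poisson variance formula serves as an approximation in this setting.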
    
##### 3. Main outputs of the function
* Relative efficiency of the HTE under ACS against the HTE based on the initial sample $s_0$ selected with either SRS or Poisson sampling 
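
Reading the function body, the relative efficiencies it prints can be written out as below. The labels $\mathrm{RE}$ and $\widehat{V}_{MC}$ (the Monte Carlo variance over the **B** replications) are notation introduced here for illustration.

$$\mathrm{RE}_{\text{analytic}}=\frac{V_{acs}(\hat{\theta}_{HT})}{V_{pois}(\hat{\theta}_{HT;s_0})},\qquad \mathrm{RE}_{\text{simulation}}=\left(\frac{\widehat{V}_{MC}(\hat{\theta}_{HT})}{\widehat{V}_{MC}(\hat{\theta}_{HT;s_0})},\ \frac{\widehat{V}_{MC}(\hat{\theta}_{HT})}{V_{pois}(\hat{\theta}_{HT;s_0})}\right)$$

Values below $1$ indicate that the HTE under ACS has the smaller variance.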
    
***

In [37]:
mainEpi <- function(N=10^5,theta=0.01,f=0.01,M=10,lift=1,B=100)
{
  K = trunc(N*theta/M + 0.5); Y = M*K
  kidx = rep(0,N); kidx[1:Y] = c(t(array(1:M,c(M,K))))
  y = 1*(kidx>0)
  n0 = trunc(N*f+0.5)
  cat("(M, K, Y, n0) =",c(length(unique(kidx))-1,K,sum(y),n0),"\n")
  
  p = rep(n0/N,N); p[1:Y] = lift*p[1:Y]; p = n0*p/sum(p)
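  # inclusion probability of a case-network of size 1,...,K under Poisson sampling
  pr.k = p.k = 1-exp(c(1:K)*log(1-max(p)))
  cat('Summary of pr.k under pois:','\n'); print(summary(pr.k))
  if (lift==1) {
    # exact counterpart under SRS: 1 - choose(N-m,n0)/choose(N,n0)
    for (m in 1:K) { p.k[m] = 1-exp(sum(log(N-m+1-c(1:n0)) - log(N+1-c(1:n0)))) }
    cat('Summary of p.k if SRS:','\n'); print(summary(p.k))
  }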
  nidx = array(0,c(Y,2))
  for (k in 1:M) { idk = kidx[1:Y]==k; nidx[idk,1] = sum(idk); nidx[idk,2] = pr.k[sum(idk)] }
  
  v.srs = (1-n0/N)*theta*(1-theta)/n0
  v.pois = sum((1/p-1)*y)/N^2
  cat("SE(initial) by SRS, Pois =",sqrt(c(v.srs,v.pois)),"\n")
  tmp = 1 + exp(2*K*log(1-max(p))) - 2*exp(K*log(1-max(p)))
  v.acs = c(M*(1/pr.k[K]-1), M*(M-1)*(tmp/pr.k[K]^2 -1))*K^2/N^2
  cat("SE(ACS) =",sqrt(sum(v.acs)),"\t leading term:",sqrt(v.acs[1]),"\n")
  cat("RE(analytic) =",sum(v.acs)/v.pois,"\n")
  
  mat = array(0,c(B,5))
  for (i in 1:B) {
# s0 by SPS (sequential Poisson sampling), s0y = effective sample of cases 
    u = runif(N,0,1)/p
    s0 = sort((c(1:N)[order(u)])[1:n0])
    s0y = s0[s0<=Y]
    mat[i,3] = length(s0y)
    mat[i,5] = n0-mat[i,3]
# initial sample estimator      
    mat[i,1] = sum(1/p[s0y])/N
# HT estimator, need only the first unit in s0 for each case-network
    idk = s0y[!duplicated(kidx[s0y])]
    mat[i,2] = sum(nidx[idk,1]/nidx[idk,2])/N
    mat[i,4] = length(idk)
    mat[i,5] = mat[i,5] + sum(nidx[idk,1])
  }
  
  emat = rbind(colMeans(mat), sqrt(diag(var(mat))))
  colnames(emat) = c("Est.s0","Est.ACS","No cases s0",'No sample case-ntw','No sample units')
  rownames(emat) = c("MC-Mean","MC-SD")
  print(emat) 
  cat("RE(simulation) =",var(mat[,2])/c(var(mat[,1]),v.pois),"\n")
}

In [38]:
mainEpi(N=10^5,theta=0.01,f=0.01,M=10,lift=1,B=100) # With default parameters, or just run mainEpi() 

(M, K, Y, n0) = 10 100 1000 1000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0100  0.2280  0.3980  0.3724  0.5306  0.6340 
Summary of p.k if SRS: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0100  0.2280  0.3981  0.3725  0.5307  0.6342 
SE(initial) by SRS, Pois = 0.003130655 0.003146427 
SE(ACS) = 0.002402847 	 leading term: 0.002402847 
RE(analytic) = 0.5831995 
             Est.s0     Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.009760000 0.009842773    9.760000            6.24000       1614.2400
MC-SD   0.002730745 0.002092615    2.730745            1.32665        130.7703
RE(simulation) = 0.5872406 0.4423268 


In [39]:
sqrt((1-1600/100000)*0.01*0.99/1600) # SE under SRS for n0=E(|s|)=1600

In [40]:
mainEpi(lift=2) # Size-biased sampling

(M, K, Y, n0) = 10 100 1000 1000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0198  0.4025  0.6358  0.5720  0.7780  0.8647 
SE(initial) by SRS, Pois = 0.003130655 0.00222486 
SE(ACS) = 0.001251022 	 leading term: 0.001251022 
RE(analytic) = 0.3161729 
             Est.s0     Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.009862650 0.010050033   19.530000           8.690000      1849.47000
MC-SD   0.002185925 0.001181869    4.328564           1.021931        99.94286
RE(simulation) = 0.2923266 0.2821848 


In [41]:
sqrt((1-1800/100000)*0.01*0.99/1800) # SE under SRS for n0=E(|s|)=1800
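
The sample sizes 1600 and 1800 used in these SRS comparisons can be related to the expected size of the final ACS sample. The sketch below approximates it as the expected number of non-cases in $s_0$ plus all cases in the traced networks. The helper `expected_acs_size` is hypothetical, and it treats the SPS inclusion probabilities as if they were exactly the Poisson ones, which is an approximation.

```r
# Hypothetical helper (not part of mainEpi): approximate expected size of the
# final ACS sample = non-cases in s0 + all cases in the traced networks.
expected_acs_size <- function(N = 1e5, theta = 0.01, f = 0.01, M = 10, lift = 1) {
  K  <- trunc(N * theta / M + 0.5); Y <- M * K
  n0 <- trunc(N * f + 0.5)
  p_case <- n0 * lift / (N + Y * (lift - 1))  # pi_i for a case
  pi_k   <- 1 - (1 - p_case)^K                # network inclusion probability
  (n0 - Y * p_case) + M * K * pi_k            # non-cases in s0 + traced cases
}

e1 <- expected_acs_size(lift = 1)
e2 <- expected_acs_size(lift = 2)
round(c(lift1 = e1, lift2 = e2))
```

These approximations can be compared with the `No sample units` column of the Monte Carlo summaries for the two runs above.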

In [42]:
eta <- 2
n <- 1800
N <- 100000; Y <- 1000
sqrt(((N+Y*(eta-1))/(n*eta)-1)*Y/N^2) # SE under Poisson for n0=E(|s|)=1800; pr.i becomes n0*lift/(N+Y*(lift-1)) for cases, and n0/(N+Y*(lift-1)) for non-cases

In [43]:
mainEpi(lift=15)

(M, K, Y, n0) = 10 100 1000 1000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1316  0.9735  0.9992  0.9340  1.0000  1.0000 
SE(initial) by SRS, Pois = 0.003130655 0.0008124038 
SE(ACS) = 2.73223e-06 	 leading term: 2.73223e-06 
RE(analytic) = 1.131073e-05 
             Est.s0    Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.009947640 0.01000001   130.89000                 10      1869.11000
MC-SD   0.000806691 0.00000000    10.61436                  0        10.61436
RE(simulation) = 0 0 


In [47]:
mainEpi(M=500) # Small networks

(M, K, Y, n0) = 500 2 1000 1000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01000 0.01247 0.01495 0.01495 0.01742 0.01990 
Summary of p.k if SRS: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01000 0.01248 0.01495 0.01495 0.01743 0.01990 
SE(initial) by SRS, Pois = 0.003130655 0.003146427 
SE(ACS) = 0.003138511 	 leading term: 0.003138511 
RE(analytic) = 0.9949749 
             Est.s0     Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.009890000 0.009869347    9.890000           9.820000     1009.750000
MC-SD   0.003413861 0.003408595    3.413861           3.391552        3.388558
RE(simulation) = 0.9969176 1.173588 


In [46]:
mainEpi(lift=2,M=500) # Small networks, size-biased sampling

(M, K, Y, n0) = 500 2 1000 1000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01980 0.02465 0.02951 0.02951 0.03436 0.03921 
SE(initial) by SRS, Pois = 0.003130655 0.00222486 
SE(ACS) = 0.002213707 	 leading term: 0.002213707 
RE(analytic) = 0.99 
             Est.s0     Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.009589950 0.009599141   18.990000          18.820000      1018.65000
MC-SD   0.002209415 0.002227838    4.375079           4.367881         4.39783
RE(simulation) = 1.016746 1.002679 


In [48]:
mainEpi(lift=2,M=500,f=0.05) # Small networks, size-biased sampling, larger n0

(M, K, Y, n0) = 500 2 1000 5000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09901 0.12131 0.14361 0.14361 0.16592 0.18822 
SE(initial) by SRS, Pois = 0.001371496 0.0009539392 
SE(ACS) = 0.0009287649 	 leading term: 0.0009287649 
RE(analytic) = 0.9479167 
              Est.s0      Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.0098899200 0.0098747805   97.920000           92.93000     5087.940000
MC-SD   0.0008667783 0.0008099955    8.581964            7.62274        7.294415
RE(simulation) = 0.8732712 0.720981 


In [49]:
mainEpi(lift=1,M=500,f=0.05) # Small networks, SRS, larger n0

(M, K, Y, n0) = 500 2 1000 5000 
Summary of pr.k under pois: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05000 0.06187 0.07375 0.07375 0.08563 0.09750 
Summary of p.k if SRS: 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05000 0.06188 0.07375 0.07375 0.08563 0.09750 
SE(initial) by SRS, Pois = 0.001371496 0.001378405 
SE(ACS) = 0.001360618 	 leading term: 0.001360618 
RE(analytic) = 0.974359 
             Est.s0     Est.ACS No cases s0 No sample case-ntw No sample units
MC-Mean 0.010094000 0.010100513   50.470000          49.240000     5048.010000
MC-SD   0.001466241 0.001430149    7.331205           6.971979        6.791996
RE(simulation) = 0.9513759 1.076488 
