# Prevalence estimation

Consider a population of interest and a known condition, such as a disease or
a binary behavior. A key quantity is the proportion of individuals in this
population exposed at time $t$, called the *prevalence*.
Suppose that a diagnostic test is used to detect the presence or
absence of this condition in each individual. Mathematically, let $\theta \in
(0,1)$ be the prevalence (the parameter of interest) and let $Y_i$ be the
indicator of the presence of the condition in the $i$th individual.

Assume for simplicity that all tests are performed at time $t$ and that the
sample is $\{y_1, \dots, y_n\}$. The maximum likelihood estimator is then the
apparent prevalence:

```{math}
:label: eq:naive-estimator
\hat{\theta} = \frac{1}{n}\sum_{i=1}^n y_i
```
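
As a minimal sketch of {eq}`eq:naive-estimator`, the apparent prevalence is just the sample mean of the binary outcomes. The function name `naive_prevalence` and the sample `y` below are hypothetical and chosen only for illustration.

```python
def naive_prevalence(y):
    """Sample proportion of positive outcomes (apparent prevalence)."""
    if len(y) == 0:
        raise ValueError("the sample must be non-empty")
    if any(value not in (0, 1) for value in y):
        raise ValueError("outcomes must be coded as 0 or 1")
    return sum(y) / len(y)


# Hypothetical sample: 1 = condition observed, 0 = not observed.
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
theta_hat = naive_prevalence(y)
print(theta_hat)  # 3 positives out of 10 observations
```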

However, this estimator has two problems in this context: it assumes a perfect
diagnostic test, which is rarely the case, and it assumes independent
observations, whereas the samples in RDS are dependent by construction
(network structure).

The first problem in {eq}`eq:naive-estimator` has been addressed several times
in the literature, for example by {cite:t}`mcinturff2004modelling`. The second
problem was studied by {cite:t}`heckathorn1997,heckathorn2002`, who proposed an
estimator based largely on Markov chain theory and social network theory.
{cite:t}`volz2008probability` improved it with the RDS II estimator, which
accounts for the network degree:

\begin{equation}
    \hat{\theta}^{\text{RDS II}} = \frac{\sum_{i=1}^n y_i \delta_i^{-1}}{\sum_{i=1}^n \delta_i^{-1}},
\end{equation}

where $\delta_i$ is the $i$th individual's degree. However, this remains an
active area of research.
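
The RDS II estimator is a weighted mean of the outcomes with weights inversely proportional to the degrees. The sketch below illustrates the formula only; the function name `rds_ii_estimator` and the toy outcomes and degrees are hypothetical, and the code assumes that every reported degree is strictly positive.

```python
def rds_ii_estimator(y, degrees):
    """Inverse-degree weighted mean of binary outcomes (RDS II formula)."""
    if len(y) != len(degrees) or len(y) == 0:
        raise ValueError("y and degrees must be non-empty and of equal length")
    if any(d <= 0 for d in degrees):
        raise ValueError("degrees must be strictly positive")
    weights = [1.0 / d for d in degrees]
    numerator = sum(w * y_i for w, y_i in zip(weights, y))
    return numerator / sum(weights)


# Hypothetical outcomes and self-reported network degrees.
y = [1, 0, 1, 0]
degrees = [2, 4, 4, 1]

theta_rds = rds_ii_estimator(y, degrees)
print(theta_rds)         # weighted estimate
print(sum(y) / len(y))   # unweighted apparent prevalence, for comparison
```

In this toy sample the low-degree individual with $y_i = 0$ receives the largest weight, so the weighted estimate falls below the unweighted one. When all degrees are equal, the two estimates coincide.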

Let $I$ be an index set, let $Y_i$ be the indicator of the $i$th individual's
exposure to the disease, and let $T_i$ indicate whether the test of the $i$th
individual is positive at time $t$. Suppose that $\{Y_i\}_{i \in I}$ and
$\{T_i\}_{i \in I}$ are each families of independent and identically
distributed random variables with $\Pr(Y_i = 1) = \theta$ and
$\Pr(T_i = 1) = p$. We say that $\theta$ is the prevalence and $p$ is the
apparent prevalence in the population.

If the test is perfect, then $T_i = Y_i$ for every $i$ (with probability one),
and $\theta = p$.
Unfortunately, this does not hold in practice, which makes it important to
evaluate the performance of the diagnostic test. The following definitions are
used:

**Specificity:** Probability that an unexposed individual is correctly identified by a negative test. In mathematical terms,
  conditioned on $Y = 0$, the *specificity* $\gamma_e$ is the probability
  of $T = 0$:
  
\begin{equation}
\gamma_e = \Pr(T = 0|Y = 0). 
\end{equation} 

**Sensitivity:** Probability that an exposed individual is correctly identified by a positive test. In mathematical terms,
  conditioned on $Y = 1$, the *sensitivity* $\gamma_s$ is the probability
  of $T = 1$:
  
\begin{equation}
\gamma_s = \Pr(T = 1|Y = 1). 
\end{equation} 

Prevalence and apparent prevalence are related by:

\begin{equation}
    p = \gamma_s\theta + (1-\gamma_e)(1-\theta).
\end{equation}
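
One way to see where this relation comes from is the law of total probability, conditioning on the true status $Y$ and using the definitions of sensitivity and specificity given above:

\begin{equation}
\begin{aligned}
    p = \Pr(T = 1) &= \Pr(T = 1|Y = 1)\Pr(Y = 1) + \Pr(T = 1|Y = 0)\Pr(Y = 0) \\
      &= \gamma_s\theta + (1-\gamma_e)(1-\theta),
\end{aligned}
\end{equation}

where the second line uses $\Pr(T = 1|Y = 0) = 1 - \Pr(T = 0|Y = 0) = 1 - \gamma_e$.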

The intuition behind this equation is simple: the positive tests comprise
the correctly identified exposed individuals and the
incorrectly identified unexposed individuals. Observe that if $\gamma_s = \gamma_e = 1$,
we have the trivial case $p = \theta$. Moreover, if $\gamma_s = \gamma_e = 0.5$, then $p = 0.5$ regardless of $\theta$, so the test carries no information about $\theta$.
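
The two special cases can be checked numerically. The sketch below evaluates $p = \gamma_s\theta + (1-\gamma_e)(1-\theta)$ directly; the function name `apparent_from_true` and the numerical values are hypothetical and serve only as an illustration.

```python
def apparent_from_true(theta, gamma_s, gamma_e):
    """Apparent prevalence p implied by prevalence, sensitivity and specificity."""
    for name, value in (("theta", theta), ("gamma_s", gamma_s), ("gamma_e", gamma_e)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_positives = gamma_s * theta                    # exposed and test positive
    false_positives = (1.0 - gamma_e) * (1.0 - theta)   # unexposed and test positive
    return true_positives + false_positives


# Perfect test: the apparent prevalence equals the prevalence.
print(apparent_from_true(0.1, 1.0, 1.0))

# gamma_s = gamma_e = 0.5: p is 0.5 whatever the prevalence is.
print(apparent_from_true(0.1, 0.5, 0.5))
print(apparent_from_true(0.9, 0.5, 0.5))

# Imperfect but informative test (hypothetical values): p differs from theta.
print(apparent_from_true(0.1, 0.9, 0.95))
```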