In [1]:
from __future__ import division, print_function
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import math
import numpy as np
import numpy.random
import scipy as sp
import scipy.stats

from ballot_comparison import findNmin_ballot_comparison_rates
from hypergeometric import trihypergeometric_optim

# Example of a hybrid audit in a large election

There are two strata. One contains every CVR county and the other contains every no-CVR county.
There were 2 million ballots cast in the election, 1.9 million in the CVR stratum and 100,000 in the no-CVR stratum.

In the CVR stratum, the diluted margin was $21\%$: there were 1,102,000 votes reported for A, 703,000 votes reported for B, and 95,000 invalid ballots.
In the no-CVR stratum, the diluted margin was $-10\%$: there were 42,500 votes reported for A, 52,500 votes for B, and 5,000 invalid ballots.
A won overall, with 1,144,500 votes to B's 755,500, but lost in the no-CVR stratum.

The reported vote margin between A and B is 389,000 votes, a "diluted margin" of $389000/2000000 = 19.45\%$.
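The overall figures can be checked directly from the reported stratum tallies. This is a minimal sketch; the dictionary layout and variable names are ours, chosen only for illustration.

```python
# Reported tallies copied from the text above:
# (votes for A, votes for B, ballots cast in the stratum).
strata = {
    "CVR": (1102000, 703000, 1900000),
    "no-CVR": (42500, 52500, 100000),
}

total_A = sum(a for a, b, n in strata.values())
total_B = sum(b for a, b, n in strata.values())
total_ballots = sum(n for a, b, n in strata.values())

# Overall reported margin in votes, and as a fraction of all ballots cast.
overall_margin = total_A - total_B
diluted_margin = overall_margin / total_ballots

print("A:", total_A, "B:", total_B)
print("margin:", overall_margin, "diluted margin:", diluted_margin)
```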

For any $\lambda$, the reported outcome of the election is correct if the overstatement of the margin in the CVR stratum is less than $389000\lambda$ votes and if the overstatement of the margin in the no-CVR stratum is less than $389000(1-\lambda)$ votes. 
For this example, we set $\lambda = 0.9$, roughly reflecting the relative sizes of the two strata.
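As a quick illustration of what this choice of $\lambda$ implies, the sketch below computes the overstatement each stratum is allowed to absorb. The names `lam`, `cvr_threshold`, and `nocvr_threshold` are ours and appear only here.

```python
margin = 389000            # overall reported margin, in votes
N1, N2 = 1900000, 100000   # ballots cast in the CVR and no-CVR strata
lam = 0.9                  # share of the margin allotted to the CVR stratum

# Overstatement thresholds implied by lam.
cvr_threshold = lam * margin
nocvr_threshold = (1 - lam) * margin

# For comparison, the CVR stratum's share of all ballots cast.
cvr_share = N1 / (N1 + N2)

print("CVR threshold:", round(cvr_threshold))
print("no-CVR threshold:", round(nocvr_threshold))
print("CVR share of ballots:", cvr_share)
```

The two thresholds always sum to the overall margin, so a different $\lambda$ only shifts allowable error from one stratum to the other.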

We want to limit the risk of certifying an incorrect outcome to at most $\alpha=5\%$. 
We allocate risk unequally between the two strata.
The total risk is at most $1 - (1-\alpha_1)(1-\alpha_2)$.
One choice is to set $\alpha_1 = 1\%$ and to solve for the value of $\alpha_2$
which makes $1 - (1-\alpha_1)(1-\alpha_2) = \alpha$.
In this case, $\alpha_1=1\%$ and $\alpha_2 = 4.04\%$ achieve the desired risk limit.
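The value of $\alpha_2$ follows from rearranging the bound to $\alpha_2 = 1 - (1-\alpha)/(1-\alpha_1)$. A minimal sketch of that arithmetic, with variable names of our own choosing:

```python
alpha = 0.05    # overall risk limit
alpha1 = 0.01   # risk limit chosen for the CVR stratum

# Solve 1 - (1 - alpha1) * (1 - alpha2) = alpha for alpha2.
alpha2 = 1 - (1 - alpha) / (1 - alpha1)

# Rounding alpha2 down to 0.0404 keeps the combined bound at or below alpha.
alpha2_rounded = 0.0404
combined = 1 - (1 - alpha1) * (1 - alpha2_rounded)

print("alpha2:", alpha2)
print("combined bound with alpha2 = 0.0404:", combined)
```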

We test the following pair of null hypotheses, using independent samples from the two strata:

* the overstatement in the CVR stratum is at least $389000\lambda$. We test this at significance level (risk limit) $\alpha_1$ using a ballot-level comparison audit.

* the overstatement in the no-CVR stratum is at least $389000(1-\lambda)$. We test this at significance level (risk limit) $\alpha_2$ using a ballot-polling audit.

If either null is not rejected, we hand count the corresponding stratum completely, adjust the null
in the other stratum to reflect the now-known tally in the hand-counted stratum, and then determine
whether more auditing is needed in the stratum that was not fully hand counted.



In [2]:
lambda1 = 0.9
lambda2 = 1-lambda1
alpha1 = 0.01
alpha2 = 0.0404
margin = 389000
N1 = 1900000
N2 = 100000

# CVR stratum

We compute the sample size needed to confirm the election outcome, for a number of assumed rates of error in the population of ballots.

We take the chosen $\lambda$ from above and plug it in as the parameter `null_lambda` in the function below.

We set $\gamma = 1.03905$ as in "A Gentle Introduction to Risk-limiting Audits."

In [3]:
# Assuming that the audit will find no errors
findNmin_ballot_comparison_rates(alpha=alpha1, gamma=1.03905, 
                                r1=0, s1=0, r2=0, s2=0,
                                reported_margin=margin, N=N1, null_lambda=lambda1)

40.0

In [4]:
# Assuming that the audit will find 1-vote overstatements at rate 0.1%
findNmin_ballot_comparison_rates(alpha=alpha1, gamma=1.03905, 
                                r1=0.001, s1=0, r2=0, s2=0,
                                reported_margin=margin, N=N1, null_lambda=lambda1)

40.0

# No-CVR stratum

Below, we compute the sample size $n$ needed to confirm the election outcome.

Define
$$
    c = \text{reported margin in the stratum} - \lambda_2 \times \text{overall reported margin}.
$$

The reported margin in the stratum could be large or small, but it is known. 
Below, we will vary it just to see the effect.

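$c$ defines the null hypothesis. We test the null that the actual margin in the stratum is less than or equal to $c$: $A_{w, 2} - A_{\ell, 2} \leq c$, where $A_{w, 2}$ and $A_{\ell, 2}$ are the actual numbers of votes for the reported winner and loser in the stratum. Here, $A_{w, 2}$ is an unknown nuisance parameter.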
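
To see where $c$ comes from, write $V$ for the overall reported margin and $V_2$ for the reported margin in the stratum (both symbols are introduced only for this note), with $\lambda_2 = 1-\lambda$. The overstatement in the stratum is the reported stratum margin minus the actual one, so requiring it to be at least $\lambda_2 V$ rearranges to a bound on the actual margin:

$$
    V_2 - (A_{w,2} - A_{\ell,2}) \geq \lambda_2 V
    \iff
    A_{w,2} - A_{\ell,2} \leq V_2 - \lambda_2 V = c.
$$

With the reported tallies in this example, $c = (42500 - 52500) - 0.1 \times 389000 = -48900$.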

In practice, we will maximize the $p$-value over all possible pairs $(A_{w,2}, A_{\ell, 2})$ in the null.

In [5]:
# Assuming that the stratum reported margin is accurate

# We don't know the actual vote totals A_{w,2}, A_{\ell,2}, so maximize the p-value over all possibilities.

np.random.seed(292018)
pop = np.array([0]*52500 + [1]*42500 + [np.nan]*5000)
c = (42500-52500) - lambda2*margin
print("c= ", c)
for n in range(20, 10000, 20):
    sample = np.random.choice(pop, n)
    pval = trihypergeometric_optim(sample, popsize=N2, null_margin=c)
    print("n=", n, ", pvalue=", pval)
    if pval < alpha2:
        break

c=  -48899.99999999999
n= 20 , pvalue= 0.00136312335797
