In [1]:
%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()

# The minimum sample size for estimating the binomial parameter

*The binomial parameter is estimated as $p = \frac{k}{n}$, where $k$ is the number of successes and $n$ the number of trials. How good is this estimate (which is computed on a **sample** of size $n$) as an approximation of the real **population** parameter? Translated: how big does $n$ have to be for $p$ to be reliable?*

## Intro: the De Moivre-Laplace theorem and how we use it

**Statement**: as $n \to \infty$, the binomial distribution of the number of successes is approximated by a gaussian distribution with mean $np$ and standard deviation $\sqrt{np(1-p)}$.

**Proof**: go see [[Wikipedia]](#wiki). 

For our purposes here, we're focussed on the binomial parameter, not the count of successes. The binomial parameter is the count divided by $n$, so its mean and standard deviation are those of the count scaled by $\frac{1}{n}$. We infer that when $n$ is large enough the distribution of the binomial parameter becomes a gaussian with mean $\mu = p$ and standard deviation $\sigma = \sqrt{\frac{p(1-p)}{n}}$. That's pretty much what we will use to answer our question.
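A quick simulation can make this scaling concrete. The sketch below is only an illustration: the values of `p_true`, `n` and the number of repetitions are arbitrary choices, not numbers from this notebook, and `simulate_proportions` is a hypothetical helper written for this check.

```python
import math
import random
import statistics


def simulate_proportions(p, n, n_samples, seed=0):
    """Run n_samples binomial experiments of n trials each; return k/n for each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(n_samples)]


# Illustrative values (assumptions, not taken from the text)
p_true, n = 0.3, 400
props = simulate_proportions(p_true, n, n_samples=4000)

empirical_mean = statistics.fmean(props)
empirical_sd = statistics.pstdev(props)
theoretical_sd = math.sqrt(p_true * (1 - p_true) / n)

print(f"mean of k/n: {empirical_mean:.4f} (theory: {p_true})")
print(f"sd of k/n:   {empirical_sd:.4f} (theory: {theoretical_sd:.4f})")
```

The empirical mean and standard deviation of $k/n$ should land close to $p$ and $\sqrt{\frac{p(1-p)}{n}}$, up to simulation noise.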

## Choosing the margin of error and confidence level

How big the sample size has to be depends entirely on the margin of error we allow around $p$ and on the confidence level we want, that is, how sure we want to be about it. There is no number without an error.

First of all, let us choose the desired margin of error. Say we want $p$ to fluctuate within a range of $0.01$ on either side; we can then write

$$
p \pm 0.01
$$

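Second, we need to decide the confidence level $1 - \alpha$ we want to apply. Calling $z_{\alpha/2}$ the $z$-score that leaves a probability $\alpha/2$ in the upper tail of the standard gaussian, we use the statement from the De Moivre-Laplace theorem above and the [$z$-score](../concepts/z-score.ipynb) to affirm that the error margin $e$ is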

$$
e = z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}
$$
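As a minimal sketch of this formula, the snippet below computes $z_{\alpha/2}$ from the standard gaussian and then the margin $e$. It uses only Python's standard library; the function name `margin_of_error` and the example values ($p = 0.5$, $n = 1000$) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Margin e = z_{alpha/2} * sqrt(p (1 - p) / n) under the gaussian approximation."""
    alpha = 1 - confidence
    # z_{alpha/2}: standard gaussian quantile at 1 - alpha/2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative values (assumptions, not taken from the text)
e = margin_of_error(p=0.5, n=1000, confidence=0.95)
print(f"margin of error: {e:.4f}")
```

Raising the confidence level increases $z_{\alpha/2}$ and widens the margin, while a larger $n$ shrinks it as $\frac{1}{\sqrt{n}}$.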

### Proving what we said above

The main source I used for this is [[the page from PennState about this]](#penn). 

From the notions of [confidence levels and intervals](../concepts/confidence-level-interval.ipynb), we set our confidence level to $1-\alpha$ and call $x$ our random variable of interest, namely the binomial parameter, which follows a gaussian of mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$. We can then write

$$
P \left(-z_{\alpha/2} \leq \frac{x - p}{\sqrt{\frac{p(1-p)}{n}}} \leq z_{\alpha/2}\right) = 1 - \alpha \ ,
$$

which means that, with probability $1 - \alpha$,

$$
p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \leq x \leq p + z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \ ,
$$

so the margin of error is the half-width of this interval, $z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$, as stated above.
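The interval can be checked empirically: if the gaussian approximation holds, about a fraction $1 - \alpha$ of simulated estimates should fall within the margin around $p$. The sketch below assumes illustrative values ($p = 0.3$, $n = 400$, 95% confidence), and `coverage` is a hypothetical helper. Since the binomial is discrete, the result is only approximately equal to the nominal level.

```python
import math
import random
from statistics import NormalDist


def coverage(p, n, confidence=0.95, n_samples=4000, seed=1):
    """Fraction of simulated estimates k/n that fall within the margin around p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    e = z * math.sqrt(p * (1 - p) / n)
    hits = 0
    for _ in range(n_samples):
        # One binomial experiment: n Bernoulli trials with success probability p
        x = sum(rng.random() < p for _ in range(n)) / n
        hits += abs(x - p) <= e
    return hits / n_samples


# Illustrative values (assumptions, not taken from the text)
cov = coverage(p=0.3, n=400)
print(f"empirical coverage: {cov:.3f} (nominal: 0.95)")
```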

## Calculating the minimum $n$

So from what we said above, once both the margin of error and the confidence level are set, we can solve $e = z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$ for $n$ and compute the minimum required $n$ as

$$
n_{\text{min}} = \frac{z_{\alpha/2}^2 p(1-p)}{e^2} \ .
$$
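Here is a small sketch of this computation, rounding up since $n$ must be an integer. The function name `min_sample_size` and the example inputs (a guess of $p = 0.2$, margin $0.01$, 95% confidence) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def min_sample_size(p, e, confidence=0.95):
    """Smallest integer n with z_{alpha/2}^2 * p (1 - p) / e^2 <= n."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z**2 * p * (1 - p) / e**2)


# Illustrative values (assumptions, not taken from the text)
n_min = min_sample_size(p=0.2, e=0.01, confidence=0.95)
print(n_min)  # 6147
```

Note that this requires a prior guess of $p$, which is exactly the quantity being estimated.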

In a stricter, more conservative case, we'd use the maximum of $p(1-p)$ instead (which is $\frac{1}{4}$, reached at $p = \frac{1}{2}$), so that

$$
n_{\text{min}} = \frac{z_{\alpha/2}^2 \frac{1}{2} \frac{1}{2}}{e^2} \ .
$$
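The conservative bound needs no guess for $p$, so it is easy to tabulate. The sketch below evaluates it for a few margins and confidence levels; the function name `conservative_sample_size` and the grid of values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def conservative_sample_size(e, confidence=0.95):
    """Worst-case minimum n, using p (1 - p) = 1/4 (that is, p = 1/2)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 / (4 * e**2))


# Illustrative grid of margins and confidence levels (assumptions)
for confidence in (0.90, 0.95, 0.99):
    for e in (0.05, 0.01):
        n_min = conservative_sample_size(e, confidence)
        print(f"confidence={confidence:.2f}, e={e:.2f}: n_min={n_min}")

n_95_01 = conservative_sample_size(0.01, 0.95)  # 9604
n_95_05 = conservative_sample_size(0.05, 0.95)  # 385
```

With the margin of $0.01$ chosen earlier and a 95% confidence level, the worst case calls for 9604 trials. Relaxing the margin to $0.05$ brings this down to 385, since $n_{\text{min}}$ scales as $\frac{1}{e^2}$.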

## References

1. <a name="penn"></a> [This great page from PennState on the topic](https://onlinecourses.science.psu.edu/stat500/node/30)
2. <a name="wiki"></a> [Wikipedia on the theorem](https://en.wikipedia.org/wiki/De_Moivre–Laplace_theorem)