# Optimal Sample Size

In [1]:
import scipy.stats
import numpy as np
np.random.seed(2222)

First, generate a true proportion $p$ at random, but do not look at its value.

In [2]:
p = np.random.rand(1)

In [3]:
# p is a one-element array, so index it to get a plain scalar before converting to int
total_population = [1] * int(1000000 * p[0]) + [0] * int(1000000 * (1 - p[0]))

In [4]:
np.random.shuffle(total_population)

So we have a population in which each member's measurement is represented by a 1 or a 0. Let's take a look at a snippet.

In [5]:
total_population[:10]

[0, 1, 1, 0, 0, 1, 1, 0, 1, 1]

It is important to choose the confidence level and the margin of error now, before drawing any samples.

Here I will choose a confidence level of 95% and calculate the corresponding z-score.

In [6]:
z = scipy.stats.norm.ppf(1- 0.05/2)
print("z value for 95% confidence level", z)

z value for 95% confidence level 1.959963984540054
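
As a dependency-free cross-check of the z-score, the same quantile can be computed with Python's standard-library `statistics.NormalDist`. This is a minimal sketch; the helper name `z_for_confidence` is hypothetical and not part of the notebook.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score for a confidence level such as 0.95."""
    alpha = 1 - level
    # Put alpha/2 in each tail, then take the upper quantile of the standard normal
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, z_for_confidence(level))
```

Higher confidence levels give larger z-scores, which in turn call for larger samples.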


Now we need to set a margin of error. I will use 0.01, which means the estimate $\hat{p}$ should fall within 1 percentage point of the true $p$.

In [7]:
moe = 0.01

Let's set $\sigma = \frac{1}{2}$, which is the largest possible standard deviation for a 0/1 (Bernoulli) measurement and occurs at $p = \frac{1}{2}$.

In [8]:
sigma = 0.5
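
To see why 0.5 is the worst case, the sketch below scans a grid of candidate proportions and evaluates the Bernoulli standard deviation $\sqrt{p(1-p)}$ at each one. The grid and the names `bernoulli_sd`, `grid`, and `worst_p` are illustrative assumptions, not part of the notebook.

```python
import math

def bernoulli_sd(p):
    """Standard deviation of a 0/1 measurement with success probability p."""
    return math.sqrt(p * (1 - p))

# Candidate proportions 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]
worst_p = max(grid, key=bernoulli_sd)
print(worst_p, bernoulli_sd(worst_p))
```

Because the true $p$ is unknown when the sample size is chosen, using the largest possible $\sigma$ gives a sample size that is large enough for any $p$.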

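We can now calculate the required sample size $n$ from $n = \frac{z^2 \sigma^2}{\mathrm{moe}^2}$, which with $\sigma = \frac{1}{2}$ simplifies to $\frac{z^2}{(2\,\mathrm{moe})^2}$.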

In [9]:
n = z**2/(2*moe)**2
print(n)

9603.647051735314
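
The calculation generalizes to other settings. Here is a minimal sketch, assuming the normal-approximation formula $n = (z\sigma/\mathrm{moe})^2$, which reduces to the expression in the cell above when $\sigma = \frac{1}{2}$, and rounding up to a whole number of samples. The function name `required_sample_size` is hypothetical.

```python
import math

def required_sample_size(z, sigma, moe):
    """Smallest whole n with z * sigma / sqrt(n) <= moe (normal approximation)."""
    if moe <= 0:
        raise ValueError("moe must be positive")
    return math.ceil((z * sigma / moe) ** 2)

z_95 = 1.959963984540054  # z-score for 95% confidence, as printed earlier
print(required_sample_size(z_95, 0.5, 0.01))  # worst-case sigma, margin of 0.01
print(required_sample_size(z_95, 0.5, 0.02))  # doubling the margin cuts n to about a quarter
```

Since $n$ scales with $1/\mathrm{moe}^2$, halving the margin of error roughly quadruples the required sample.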


Finally, let's run 10 tests. If the sample size is correct, each test has at least a 95% chance of giving a $\hat{p}$ within 0.01 of $p$, so we should expect at least 9 of the 10 to succeed (the samples are still random, so this is not guaranteed).

In [10]:
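for i in range(10):
    # Draw int(n+1) = 9604 members at random; np.random.choice samples with replacement by default
    sample = np.random.choice(total_population, int(n+1))
    p_hat = sum(sample) / len(sample)
    # Check whether the estimate lands within the margin of error of the true p
    correct = abs(p_hat - p) < moe
    print("sample",i,":", correct, "p hat value = ", p_hat)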

sample 0 : [ True] p hat value =  0.7282382340691379
sample 1 : [ True] p hat value =  0.7242815493544357
sample 2 : [ True] p hat value =  0.7303206997084548
sample 3 : [ True] p hat value =  0.7235526863806747
sample 4 : [ True] p hat value =  0.7275093710953769
sample 5 : [ True] p hat value =  0.7256351520199916
sample 6 : [ True] p hat value =  0.7346938775510204
sample 7 : [ True] p hat value =  0.720637234485631
sample 8 : [ True] p hat value =  0.7325072886297376
sample 9 : [ True] p hat value =  0.7337567680133278


In [11]:
print("True p value =", p[0])

True p value = 0.7272001796427789
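
Ten runs are a small check. A larger simulation gives a steadier estimate of how often $\hat{p}$ lands within the margin of error. The sketch below is a stand-in for the notebook's procedure under stated assumptions: it draws binomial counts directly, which is equivalent to sampling with replacement from a 0/1 population, and it uses a rounded copy of the true proportion, a separate seed, and a hypothetical trial count.

```python
import numpy as np

rng = np.random.default_rng(0)   # separate generator; seed chosen arbitrarily
p_true = 0.7272                  # rounded copy of the true p printed above
n_samples = 9604                 # required sample size, rounded up
moe = 0.01
trials = 10_000                  # hypothetical number of repeated experiments

# Each binomial draw is the number of 1s in one sample of n_samples members
counts = rng.binomial(n_samples, p_true, size=trials)
p_hats = counts / n_samples
coverage = np.mean(np.abs(p_hats - p_true) < moe)
print("fraction of trials within the margin of error:", coverage)
```

Because the sample size was chosen for the worst case $\sigma = \frac{1}{2}$ and the actual $\sigma$ here is smaller, the observed fraction should come out at or above 95%.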
