In [36]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.max_columns', 500)
import seaborn as sns
sns.set_style("dark")
plt.rcParams['figure.figsize'] = 8, 6
from tqdm import tqdm, tqdm_notebook
from scipy import stats

Standard error:

$$\text{SE}_{\overline{x}} = \frac{s}{\sqrt{n}}$$

where $s$ is the sample standard deviation and $n$ is the sample size.

If $x \sim \text{Bernoulli}\left(\theta\right)$:

$$\text{SE}_{\overline{x}} = \sqrt{\frac{\theta\left(1 - \theta\right)}{n}}$$

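For large $n$, the central limit theorem gives $\frac{\overline{x} - \theta}{\text{SE}_{\overline{x}}} \sim \mathcal{N}\left(0, 1\right)$ approximately, so normal quantiles can be used for confidence intervals and tests.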
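As a quick numerical check of the two standard-error formulas above, here is a minimal sketch. The helper names `standard_error` and `bernoulli_se`, the seed, and the values `theta = 0.3`, `n = 10000` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from scipy import stats


def standard_error(x):
    """Standard error of the mean, s / sqrt(n), with s the sample std (ddof=1)."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / np.sqrt(len(x))


def bernoulli_se(theta, n):
    """Standard error of the mean of n Bernoulli(theta) draws."""
    return np.sqrt(theta * (1.0 - theta) / n)


# Illustrative values (assumptions chosen only for this example)
rng = np.random.RandomState(0)
theta, n = 0.3, 10000
x = rng.binomial(1, theta, size=n)

se_sample = standard_error(x)          # estimated from the data
se_formula = bernoulli_se(theta, n)    # from the known theta

# Normal-approximation 95% confidence interval for theta
z = stats.norm.ppf(0.975)
ci = (x.mean() - z * se_sample, x.mean() + z * se_sample)

print("SE from sample: %.5f, SE from formula: %.5f" % (se_sample, se_formula))
print("95%% CI for theta: (%.4f, %.4f)" % ci)
```

The two standard errors should agree closely, because for 0/1 data the sample variance is essentially $\hat\theta(1 - \hat\theta)$.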


For the difference of two means with equal or unequal sample sizes and unequal variances (Welch's t-test):

$$t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

then $t$ approximately follows a Student distribution with $d$ degrees of freedom, where $d$ is given by the Welch–Satterthwaite approximation:

$$d = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\left(\frac{s_1^2}{n_1}\right)^2 \frac{1}{n_1 - 1} + \left(\frac{s_2^2}{n_2}\right)^2 \frac{1}{n_2 - 1}}$$
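The statistic $t$ and the degrees of freedom $d$ above can be computed directly and compared against `scipy.stats.ttest_ind` with `equal_var=False`. This is a sketch: the function name `welch_t` and the simulated samples are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def welch_t(x1, x2):
    """Return (t, d, two-sided p-value) for the unequal-variance t-test."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    v1 = x1.var(ddof=1) / n1   # s_1^2 / n_1
    v2 = x2.var(ddof=1) / n2   # s_2^2 / n_2
    t = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
    # Degrees of freedom from the formula for d above
    d = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p_value = 2.0 * stats.t.sf(abs(t), df=d)
    return t, d, p_value


# Simulated samples with different sizes and variances (illustrative only)
rng = np.random.RandomState(1)
a = rng.normal(0.0, 1.0, size=200)
b = rng.normal(0.2, 2.0, size=300)

t, d, p_value = welch_t(a, b)
t_ref, p_ref = stats.ttest_ind(a, b, equal_var=False)

print("manual: t=%.6f d=%.2f p=%.6f" % (t, d, p_value))
print("scipy:  t=%.6f p=%.6f" % (t_ref, p_ref))
```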

In [35]:
level = 0.975
z = stats.norm.ppf(level)
print level, z

0.975 1.95996398454


In [42]:
p = 0.0011

n = z*z/(p*(1 - p))
print p, n

p = 0.001

n = z*z/(p*(1 - p))
print p, n

0.0011 3496.08098062
0.001 3845.30412482


In [None]:
stats.t.ppf(0.975, df=20000)

In [136]:
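level = 0.975                # two-sided 95% confidence
p_c = 0.001                  # control rate
q = 0.1                      # relative lift; treatment rate is (1 + q)*p_c
z = stats.norm.ppf(level)
print (1 + q)*p_c

# per-group n at which z * SE(difference) equals the absolute lift q*p_c,
# with the summed variance approximated by (2 + q)*p_c*(1 - p_c)
n = z*z*(2 + q)*(p_c - p_c*p_c)/(q*q*p_c*p_c)
print n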

0.0011
805899.645993


In [91]:
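p = 0.001                    # control rate
q = 0.0011                   # treatment rate

level = 0.975                # two-sided 95% confidence
z = stats.norm.ppf(level)
# z = stats.t.ppf(level, df=2)
print level
print z

# per-group n at which z * SE(difference) equals |p - q|
n = ((z/(p - q))**2)*(p*(1 - p) + q*(1 - q))

print n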

0.975
1.95996398454
805857.389946


In [135]:
p = 0.001
q = 0.0011
# p = 0.5
# q = 0.55

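a = 0.01                     # alpha: Type I error rate
b = 0.01                     # beta: Type II error rate
ta = stats.norm.ppf(1 - a)   # upper alpha quantile (positive)
tb = stats.norm.ppf(b)       # lower beta quantile (negative)
print a, ta
print b, tb

# sample size from equating the two expressions for the threshold c
n = (tb*np.sqrt(q*(1 - q)) - ta*np.sqrt(p*(1 - p)))/(q - p)
n = n*n
print n

# rejection threshold: reject H0 if the observed proportion exceeds c
c = p + ta*np.sqrt(p*(1 - p)/n)
print c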

0.01 2.32634787404
0.01 -2.32634787404
2269318.38438
0.00104881009877


So, in summary, if the pollster collects data on n = 1001 voters, and rejects his null hypothesis H0: p = 0.50 if the proportion of sampled voters who favor the political candidate is greater than 0.5367, he will have a 1% chance of committing a Type I error and a 20% chance of committing a Type II error if the population proportion p were actually 0.55.
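The error rates quoted for the pollster example can be sanity-checked with the exact binomial distribution instead of the normal approximation. This is a sketch using only the numbers stated in the paragraph above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n = 1001            # number of sampled voters
c = 0.5367          # reject H0 if the sample proportion exceeds c
p0, p1 = 0.50, 0.55 # null and alternative proportions

# Largest count k whose proportion k/n does not exceed c
k = int(np.floor(c * n))

# Type I error: reject although p = p0, i.e. P(X > k | p0)
type_1 = stats.binom.sf(k, n, p0)
# Type II error: fail to reject although p = p1, i.e. P(X <= k | p1)
type_2 = stats.binom.cdf(k, n, p1)

print("Type I error:  %.4f" % type_1)
print("Type II error: %.4f" % type_2)
```

The exact values should land close to the nominal 1% and 20%, with small deviations because the design itself relies on the normal approximation.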

Determining the required sample size takes four inputs:

- The desired α level, that is, your willingness to commit a Type I error.
- The desired power or, equivalently, the desired β level, that is, your willingness to commit a Type II error.
- A meaningful difference from the value of the parameter that is specified in the null hypothesis.
- The standard deviation of the sample statistic or, at least, an estimate of the standard deviation (the "standard error") of the sample statistic.

$$\begin{array}{rcl}
c &=& p + t_\alpha \sqrt{\frac{p(1 - p)}{n}} \\
c &=& q + t_\beta \sqrt{\frac{q(1 - q)}{n}} \\
n &=& \left(\frac{t_\beta \sqrt{q(1 - q)} - t_\alpha \sqrt{p(1 - p)}}{p - q}\right)^2
\end{array}$$
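- $p$ is the proportion under the null hypothesis (control)
- $q$ is the proportion under the alternative (treatment)
- $c$ is the rejection threshold for the observed proportion
- $t_\alpha = \Phi^{-1}\left(1 - \alpha\right)$ and $t_\beta = \Phi^{-1}\left(\beta\right)$ are standard normal quantiles, so $t_\beta$ is negative for $\beta < 0.5$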
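The derivation above can be wrapped in a small helper. This is a minimal sketch, assuming a one-sided test with $q > p$ and the quantile convention $t_\alpha = \Phi^{-1}(1 - \alpha)$, $t_\beta = \Phi^{-1}(\beta)$; the function name `sample_size` is hypothetical.

```python
import numpy as np
from scipy import stats


def sample_size(p, q, alpha, beta):
    """Return (n, c): required sample size and rejection threshold.

    p: proportion under H0, q: proportion under the alternative (q > p),
    alpha: Type I error rate, beta: Type II error rate.
    """
    t_alpha = stats.norm.ppf(1.0 - alpha)   # positive quantile
    t_beta = stats.norm.ppf(beta)           # negative quantile for beta < 0.5
    root_n = (t_beta * np.sqrt(q * (1 - q))
              - t_alpha * np.sqrt(p * (1 - p))) / (p - q)
    n = root_n ** 2
    c = p + t_alpha * np.sqrt(p * (1 - p) / n)
    return n, c


# Pollster example: alpha = 1%, beta = 20%, p = 0.50 vs q = 0.55
n1, c1 = sample_size(0.50, 0.55, alpha=0.01, beta=0.20)
print("pollster: n = %d, c = %.4f" % (int(np.ceil(n1)), c1))

# Rare-event case from the cell above: p = 0.001 vs q = 0.0011
n2, c2 = sample_size(0.001, 0.0011, alpha=0.01, beta=0.01)
print("rare event: n = %.2f, c = %.8f" % (n2, c2))
```

Rounding $n$ up with `np.ceil` gives the smallest integer sample size that meets both error targets under the normal approximation.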