# Hypothesis testing: A/B Test


### Example 
(adapted from "Data Science from Scratch" by Joel Grus; I have also added the math)

An advertiser has developed an energy drink targeted at data scientists. 

Advertisement A: "taste great!"
Advertisement B: "less bias!"

Suppose "taste great!" gets 200 clicks out of 1,000 views and "less bias!" gets 180 clicks out of 1,000 views. 

----
Think of each view of an ad as a Bernoulli trial: the viewer either clicks or does not, with

\begin{equation*}
p_A=P(\mbox{someone clicks on ad } A)
\end{equation*}

Let $N_A$ be the number of views for advertisement A and $n_A$ the number of clicks. Writing $X_i=1$ if the $i$-th viewer clicks and $X_i=0$ otherwise, 
$n_A= X_1 + X_2 + \cdots + X_{N_A}$ follows a Binomial distribution $Bin(N_A, p_A)$ (the $X_i$'s are iid).

If $N_A$ is large enough, the Central Limit Theorem gives the approximation

\begin{equation*}
\dfrac{n_A}{N_A} \sim N(\mu_A, \sigma_A^2)
\end{equation*}


with $\mu_A=p_A$ and $\sigma_A=\sqrt{\frac{p_A(1-p_A)}{N_A}}$
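
As a quick sanity check of these two moments, the sketch below compares them with the exact mean and standard deviation of $n_A/N_A$ computed from `scipy.stats.binom`. The value `p_A = 0.2` is an assumed click probability chosen only for illustration; it is not an estimate from the data.

```python
import math
from scipy.stats import binom

N_A = 1000   # number of views
p_A = 0.2    # assumed click probability (illustrative only)

# Exact moments of the click count n_A ~ Bin(N_A, p_A), rescaled to the rate n_A / N_A
exact_mean = binom.mean(N_A, p_A) / N_A
exact_std = binom.std(N_A, p_A) / N_A

# Moments used in the normal approximation
mu_A = p_A
sigma_A = math.sqrt(p_A * (1 - p_A) / N_A)

print(exact_mean, mu_A)
print(exact_std, sigma_A)
```

The mean and standard deviation agree exactly; the Central Limit Theorem is only needed for the shape of the distribution, not for these two moments.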

Similarly, 

\begin{equation*}
\dfrac{n_B}{N_B} \sim N\left(p_B, \frac{p_B(1-p_B)}{N_B} \right)
\end{equation*}

**Assumption:** $\frac{n_A}{N_A}$ and $\frac{n_B}{N_B}$ are independent and normally distributed. 

Then, 

\begin{equation*}
\frac{n_A}{N_A} - \frac{n_B}{N_B}  \sim N \left(p_A-p_B, \frac{p_A(1-p_A)}{N_A} + \frac{p_B(1-p_B)}{N_B} \right)
\end{equation*}


**Note:** $\sigma_A^2$ and $\sigma_B^2$ are not given, because they depend on the unknown $p_A$ and $p_B$. Formally we would have to work with the sample variances $S_A^2$ and $S_B^2$, so $\frac{n_A}{N_A}$ and $\frac{n_B}{N_B}$ would follow a Student's t distribution. However, since $N_A$ and $N_B$ are large enough, $\sigma_A^2 \approx S_A^2$ and $\sigma_B^2 \approx S_B^2$, and the distributions of $\frac{n_A}{N_A}$ and $\frac{n_B}{N_B}$ can be approximated by normals. 

Standardizing the difference gives



\begin{equation*}
Z= \dfrac{\frac{n_A}{N_A} - \frac{n_B}{N_B} -(p_A- p_B)}{\sqrt{\frac{p_A(1-p_A)}{N_A} + \frac{p_B(1-p_B)}{N_B}}} \sim N(0,1)
\end{equation*}
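
The claim that $Z$ is approximately standard normal can be checked by simulation. The following is a minimal sketch, assuming a common click probability `p = 0.2` for both ads and 100,000 simulated experiments (both values are illustrative choices, not part of the original example).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible

N_A = N_B = 1000   # views per ad
p = 0.2            # assumed common click probability (so p_A - p_B = 0)
n_sim = 100_000    # number of simulated experiments (illustrative choice)

# Simulate click counts for both ads
n_A = rng.binomial(N_A, p, size=n_sim)
n_B = rng.binomial(N_B, p, size=n_sim)

# Standardize the difference of click rates using the true variances
std_dev = np.sqrt(p * (1 - p) / N_A + p * (1 - p) / N_B)
Z = (n_A / N_A - n_B / N_B - 0.0) / std_dev

# For a standard normal these should be close to 0 and 1
print(Z.mean(), Z.std())
```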






### The elements of a statistical test


Null Hypothesis:  
\begin{equation*}
 H_0: p_A-p_B=0
\end{equation*}

Alternative Hypothesis:
\begin{equation*}
H_a: p_A-p_B \neq 0
\end{equation*}

Test statistic:
\begin{equation*}
\frac{n_A}{N_A} - \frac{n_B}{N_B}
\end{equation*}
 



Use a significance level of $\alpha=0.05$: reject $H_0$ when $|Z| \geq z_{\alpha/2}$, so that

\begin{equation*}
P(\mbox{ reject } H_0 \mid H_0 \mbox{ is true })= P\left(|Z| \geq z_{\alpha/2} \mid p_A=p_B \right)= P(Z \leq -z_{\alpha/2} \mid p_A=p_B) + P(Z \geq z_{\alpha/2} \mid p_A=p_B) = \alpha
\end{equation*}

By symmetry, each tail has probability $\alpha/2$: $P(Z \leq -z_{\alpha/2})=0.025$ and $P(Z \geq z_{\alpha/2})=0.025.$
From the standard normal table, 

\begin{equation*}
z_{\alpha/2}= 1.96 
\end{equation*}

(since $P(Z \geq 1.96)=0.025$)
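
Instead of a table, the critical value can be obtained from the inverse CDF of the standard normal. This is a small sketch using `scipy.stats.norm.ppf`; the helper name `critical_value` and the extra significance levels are illustrative assumptions.

```python
from scipy.stats import norm

def critical_value(alpha):
    """Return z_{alpha/2}, the point with upper-tail probability alpha/2."""
    return norm.ppf(1 - alpha / 2)

# Critical values for a few common two-sided significance levels
for alpha in (0.10, 0.05, 0.01):
    print(alpha, critical_value(alpha))
```

For `alpha = 0.05` this gives approximately 1.96, matching the table value.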



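Now we compute the **observed statistic**, writing $\bar{y}_A = \frac{n_A}{N_A}$ and $\bar{y}_B = \frac{n_B}{N_B}$ for the observed click rates: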

\begin{equation*}
\dfrac{\bar{y}_A - \bar{y}_B -(p_A- p_B)}{\sqrt{\frac{p_A(1-p_A)}{N_A} + \frac{p_B(1-p_B)}{N_B}}}
\end{equation*}

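If $H_0$ is true, then $p_A - p_B = 0$. Estimating the unknown $p_A$ and $p_B$ in the denominator by the observed click rates $\frac{200}{1000}$ and $\frac{180}{1000}$: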

\begin{equation*}
\dfrac{\bar{y}_A - \bar{y}_B -(p_A- p_B)}{\sqrt{\frac{p_A(1-p_A)}{N_A} + \frac{p_B(1-p_B)}{N_B}}}=
\dfrac{\bar{y}_A - \bar{y}_B}{\sqrt{\frac{p_A(1-p_A)}{N_A} + \frac{p_B(1-p_B)}{N_B}}}= \dfrac{\frac{200}{1000} - \frac{180}{1000}}{0.0175385}=1.14
\end{equation*}


Since $|1.14| < 1.96$, the observed value is not in the rejection region, so we DO NOT reject $H_0$. 

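Aside calculation for the denominator (with $p_A$ and $p_B$ replaced by the observed click rates):

$\dfrac{p_A(1-p_A)}{N_A} + \dfrac{p_B(1-p_B)}{N_B}= \dfrac{\frac{200}{1000}(1-\frac{200}{1000})}{1000} + \dfrac{\frac{180}{1000}(1-\frac{180}{1000})}{1000}= 0.0003076$, and $\sqrt{0.0003076} \approx 0.0175385$.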
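
The arithmetic of the aside can be reproduced step by step. This sketch uses only the click counts from the example; the variable names are chosen here for illustration.

```python
import math

N_A, n_A = 1000, 200   # views and clicks for "taste great!"
N_B, n_B = 1000, 180   # views and clicks for "less bias!"

# Observed click rates, used as estimates of p_A and p_B
ybar_A = n_A / N_A
ybar_B = n_B / N_B

# Estimated variance of the difference of click rates, and its square root
variance = ybar_A * (1 - ybar_A) / N_A + ybar_B * (1 - ybar_B) / N_B
std_error = math.sqrt(variance)

# Observed value of the test statistic under H0
z_obs = (ybar_A - ybar_B) / std_error

print(round(variance, 7), round(std_error, 7), round(z_obs, 2))
```

The three printed values should agree with the 0.0003076, 0.0175385 and 1.14 used in the text.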

**P-value**

\begin{equation*}
\mbox{ P-value }= P(Z \geq 1.14) + P(Z \leq -1.14)= 2  \times 0.1271 = 0.2542
\end{equation*}
(well above $\alpha = 0.05$)

The p-value is large enough that we cannot conclude the two ads have different click rates. 
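
The p-value and the resulting decision can be wrapped in a small helper. This is a sketch, assuming a two-sided test against a standard normal; the function name `two_sided_p_value` is hypothetical, and `norm.sf` is SciPy's survival function, $P(Z > z)$.

```python
from scipy.stats import norm

def two_sided_p_value(z):
    """Return P(|Z| >= |z|) for a standard normal Z."""
    return 2 * norm.sf(abs(z))

alpha = 0.05
p_value = two_sided_p_value(1.14)   # observed statistic, rounded as in the text
reject = p_value < alpha            # reject H0 only if the p-value is below alpha

print(p_value, reject)
```

Taking `abs(z)` makes the helper give the same answer whether the observed statistic is positive or negative.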

**Exercise**

What happens if "less bias!" gets only 150 clicks out of 1,000 views? 





In [1]:
import math
from scipy.stats import norm   # Imports the normal distribution

In [2]:
def estimated_parameter(N,n):
    # Observed click rate and its estimated standard deviation
    ybar=n/N
    sigma=math.sqrt(ybar*(1-ybar)/N)
    return ybar, sigma

In [3]:
estimated_parameter(1000,200)

(0.2, 0.01264911064067352)

In [4]:
def a_b_test_statistic(N_A,n_A,N_B,n_B):
    # estimated_parameter returns standard deviations; they are squared below
    ybar_A, sigma_A= estimated_parameter(N_A,n_A)
    ybar_B, sigma_B= estimated_parameter(N_B,n_B)
    return (ybar_A-ybar_B)/math.sqrt(sigma_A**2 + sigma_B**2)

In [5]:
z=a_b_test_statistic(1000,200,1000,180)
print(z)

1.1403464899034472


In [6]:
norm.ppf(0.025, loc=0, scale=1)     # lower critical value, -z_{alpha/2}


-1.9599639845400545


In [7]:
norm.cdf(-1.14, loc=0, scale=1)

0.1271431505627983

In [8]:
pvalue=2*norm.cdf(-1.14, loc=0, scale=1)   # two-sided p-value for the rounded z=1.14
pvalue

0.2542863011255966