# Hypothesis Testing and Inference
We often want to test whether a certain hypothesis is likely to be true. Hypotheses are typically assertions like "this coin is fair" or "data scientists love Python" that can be translated into statistics about data. As we established before, a binomial distribution can be approximated by a normal distribution with

$$ \mu =  np \quad
\sigma = \sqrt{np(1-p)}
$$


In [26]:
from typing import Tuple
import math

def normal_approximatiom_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """ Returns the mean and standard deviation corresponding to a Binomial(n,p) """
    mu = n*p
    sigma = math.sqrt(n*p*(1-p))

    return mu, sigma

normal_approximatiom_to_binomial(1000,0.2)


(200.0, 12.649110640673518)
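As a rough sanity check on these formulas, the sketch below simulates Binomial(1000, 0.2) draws and compares their sample mean and standard deviation with $np$ and $\sqrt{np(1-p)}$. The helper `simulate_binomial`, the seed, and the number of draws are illustrative choices, not part of the original notebook.

```python
import math
import random
import statistics

def simulate_binomial(n: int, p: float, rng: random.Random) -> int:
    """Draw one Binomial(n, p) value as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
n, p = 1000, 0.2
draws = [simulate_binomial(n, p, rng) for _ in range(1000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.pstdev(draws)
theory_mean = n * p
theory_sd = math.sqrt(n * p * (1 - p))

print(f"simulated mean = {sample_mean:.1f}, formula mean = {theory_mean:.1f}")
print(f"simulated sd   = {sample_sd:.2f}, formula sd   = {theory_sd:.2f}")
```

The simulated values should land close to, but not exactly on, the formula values, since they are estimates from a finite number of draws.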

When a random variable follows a normal distribution, we can use normal_cdf to find the probability that its realised value lies within or outside a particular interval. By default, normal_cdf(x) returns the area under the standard normal curve to the left of x, that is $P(Z \le x)$.

In [2]:
from helpers import normal_cdf

normal_probability_below = normal_cdf
print(f"The probability below Z=1: {normal_probability_below(1)}")

# The probability that the variable is above the threshold
def normal_probability_above(lo: float, mu: float=0, sigma: float=1) -> float:
    """ Returns P(Z >= lo) for a Normal(mu, sigma) variable """
    return 1 - normal_cdf(lo,mu,sigma)

print(f"The probability above Z=1: {normal_probability_above(1)}")

# The probability that the variable is between lo and hi
def normal_probability_between(lo: float, hi: float, mu: float=0, sigma: float=1) -> float:
    return normal_cdf(hi, mu,sigma) - normal_cdf(lo, mu, sigma)

print(f"The probability between z=[-1,1]: {normal_probability_between(-1,1)}")

# The probability that the variable is not between lo and hi
def normal_probability_outside(lo: float, hi: float, mu: float=0, sigma: float=1) -> float:
    return 1-normal_probability_between(lo,hi,mu,sigma)

print(f"The probability outside z=[-1,1]: {normal_probability_outside(-1,1)}")

The probability below Z=1: 0.8413447460685429
The probability above Z=1: 0.15865525393145707
The probability between z=[-1,1]: 0.6826894921370859
The probability outside z=[-1,1]: 0.31731050786291415
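The `helpers` module is not shown in this notebook. As a minimal sketch, assuming `normal_cdf` is the usual error-function form of the normal CDF, it could look like the following. The loop then checks the familiar areas within one, two and three standard deviations of the mean.

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma), written with the error function."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

# Probability within one, two and three standard deviations of the mean
for k in (1, 2, 3):
    inside = normal_cdf(k) - normal_cdf(-k)
    print(f"P(-{k} <= Z <= {k}) = {inside:.4f}")
```

If the real helper is implemented differently (for example with a lookup table or a library call), the numbers should still agree to several decimal places.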


We can also go in the reverse direction. Say we are given a probability and want to know which region it corresponds to. For example, if we want to find the interval centered at the mean that contains 60% of the probability, we find the cutoffs where the upper and lower tails each contain 20% of the probability (by symmetry).


Show the three intervals with a chart showing the area and bounds we want to compute.




In [31]:
from typing import Tuple
from helpers import inverse_normal_cdf

def normal_upper_bound(probability: float, mu: float=0, sigma: float=1) -> float:
    """Returns the z for which P(Z <= z) = probability """
    return inverse_normal_cdf(probability,mu, sigma)

probability= 0.8413447460685429
# We should get z close to 1, since the probability below Z=1 was 0.8413447460685429 above
print(f"The z, for which P(Z <= z) = {probability} is z = {normal_upper_bound(probability)}")

def normal_lower_bound(probability: float, mu: float=0, sigma: float=1) -> float:
    """ Returns the z for which P(Z >= z) = probability """
    return inverse_normal_cdf(1-probability, mu, sigma)

# We should get z close to -1, by symmetry with the result above
print(f"The z, for which P(Z >= z) = {probability} is z = {normal_lower_bound(probability)}")


def normal_two_sided_bounds(probability: float, mu: float=0, sigma:float=1) -> Tuple[float, float]:
    """ Returns the symmetric (about the mean) bounds that contain the specified prob """
    tail_probability = (1-probability)/2
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound

print(f"The z, for which P(z_1 <= Z <= z_2) = {probability} is z = {normal_two_sided_bounds(probability)}") 


The z, for which P(Z <= z) = 0.8413447460685429 is z = 0.9999847412109375
The z, for which P(Z >= z) = 0.8413447460685429 is z = -0.9999847412109375
The z, for which P(z_1 <= Z <= z_2) = 0.8413447460685429 is z = (-1.4096832275390625, 1.4096832275390625)
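The results above are close to, but not exactly, ±1, which is what a numerical search would produce. Since `helpers.inverse_normal_cdf` is not shown, the sketch below is only an assumption about how such an inverse could be written: a bisection search on the CDF. The search range of ±10 and the `tolerance` default are illustrative choices.

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1,
                       tolerance: float = 1e-5) -> float:
    """Find x with normal_cdf(x, mu, sigma) close to p, using bisection."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    # Search on the standard normal scale, then rescale at the end.
    low_z, hi_z = -10.0, 10.0
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z  # midpoint is too small, keep the upper half
        else:
            hi_z = mid_z   # midpoint is large enough, keep the lower half
    return mu + sigma * (low_z + hi_z) / 2

z = inverse_normal_cdf(0.8413447460685429)
print(f"z = {z:.4f}")
```

Bisection works here because the CDF is strictly increasing, so each comparison halves the interval that must contain the answer.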


## Example of Hypothesis test: is a coin fair?
Say that we flip a coin $n=1000$ times and let $X$ be the number of heads. If our hypothesis of fairness is true (the coin gives heads 50% of the time and tails 50% of the time), $X$ should be approximately normally distributed with mean $\mu = np = 1000 \times 0.5 = 500$ and standard deviation $\sigma = \sqrt{np(1-p)} = \sqrt{250} \approx 15.8$.

In [25]:
num_of_flips = 1000
probability = 0.5
mu_0, sigma_0 = normal_approximatiom_to_binomial(num_of_flips, probability)
print(f"mean = {mu_0} , sigma = {sigma_0}")


mean = 500.0 , sigma = 15.811388300841896


But the question remains: how willing are we to make a type 1 error ("false positive"), in which we reject the hypothesis even though it is true? For example, say we flip the coin 1000 times and get heads 350 times. How can we know whether we should take this to mean the coin is unfair?

To do this, we must make a decision about significance. Let's define $H_0$ as the null hypothesis, which represents some default position, and $H_1$ as the alternative hypothesis. We typically set the significance level to 5% or 1% and say that we will reject the null hypothesis $H_0$ if the observed statistic falls outside a set of bounds. Choosing 5% means these bounds cover $95\%$ $(= 100\% - 5\%)$ of the probability under $H_0$. Therefore, if the experiment produces a value within this range we do not reject $H_0$; otherwise we reject it.

Assuming $p$ really equals 0.5 (that is, $H_0$ is true), there is only a 5% chance that we observe an $X$ that lies outside this interval.

In [32]:
#given significance at 5%, we can compute the bounds that give us 95% probability - which is the confidence interval.
prob=0.95
lower, upper = normal_two_sided_bounds(prob, mu_0, sigma_0)
print(f"{prob*100}% Confidence interval ({lower}, {upper})")

95.0% Confidence interval (469.011020350622, 530.988979649378)
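To see what the 5% significance level means in practice, the sketch below repeats the 1000-flip experiment many times with a genuinely fair coin and counts how often the number of heads falls outside the bounds printed above. Each such case would be a type 1 error. The seed and the number of repetitions are arbitrary choices for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary choice)
lower, upper = 469.011020350622, 530.988979649378  # bounds printed above

num_experiments = 1000
rejections = 0
for _ in range(num_experiments):
    # One experiment: 1000 flips of a fair coin
    heads = sum(1 for _ in range(1000) if rng.random() < 0.5)
    if heads < lower or heads > upper:
        rejections += 1  # H_0 is true here, so this is a false positive

false_positive_rate = rejections / num_experiments
print(f"Rejected a true H_0 in {rejections} of {num_experiments} runs "
      f"({false_positive_rate:.1%})")
```

The rate should be in the neighbourhood of 5%, though not exactly 5%, because the head count is discrete and the simulation itself is random.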


Imagine we get 530 heads after running the experiment. We can compute the p-value, which is the probability, assuming $H_0$ is true, that we would see a value at least as extreme as the one we actually observed.

## 1. What a p-value is really doing

- Imagine the **null hypothesis** $H_0$ is true.
- You collect data and calculate a test statistic (like the sample mean or number of successes).
- The **p-value** is the probability of seeing a statistic *at least as extreme* as yours, **if $H_0$ were actually true**.

So it's not "the probability that the null is true."  
It's "how surprising my data would be under the null."

## 2. Why p-value > significance means we don't reject $H_0$

- The **significance level** ($\alpha$, often 0.05) is the cutoff for how much false-positive risk we're willing to tolerate.
- If **p-value < $\alpha$**:
  - The data is so extreme that it would rarely occur if $H_0$ were true.
  - ⇒ We reject $H_0$.
- If **p-value > $\alpha$**:
  - The data is not unusually extreme under the null.
  - ⇒ We don't have enough evidence to reject $H_0$.

Important: "Don't reject" does **not** mean "prove the null is true."  
It just means the evidence is insufficient to overturn it.


## 3. Why multiply by 2 in a two-sided test

- In a **one-sided test**, "extreme" means large deviations in one direction only (e.g. "too many heads").
- In a **two-sided test**, "extreme" means large deviations in *either* direction (too many **or** too few).
- Therefore, the two-tailed p-value is **twice the one-tailed probability**:
  $$
  p_{\text{two-sided}} = 2 \times P(\text{statistic at least as extreme in one tail})
  $$
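
The doubling can be seen numerically for the coin example. The sketch below is self-contained and assumes the usual error-function form of the normal CDF. It uses 529.5 rather than 530 because head counts are whole numbers, so "at least 530 heads" is approximated by the normal area above 529.5.

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

mu_0, sigma_0 = 500.0, math.sqrt(250)  # fair coin, 1000 flips
x = 529.5                               # 530 heads, shifted by 0.5

# One-sided: only "too many heads" counts as extreme
one_sided = 1 - normal_cdf(x, mu_0, sigma_0)
# Two-sided: "too many" or "too few" heads both count as extreme
two_sided = 2 * one_sided

print(f"one-sided p-value = {one_sided:.3f}")
print(f"two-sided p-value = {two_sided:.3f}")
```

At a 5% significance level these two values fall on opposite sides of the cutoff (roughly 0.031 versus 0.062), which is one reason the choice between a one-sided and a two-sided test should be made before looking at the data.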


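Because the number of heads is discrete while the normal distribution is continuous, we apply a continuity correction of $\pm 0.5$: the event "at least 530 heads" is approximated by the normal probability above 529.5.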

In [41]:
def two_sided_p_value(x: float, mu: float=0, sigma: float=1) -> float:
    """ How likely are we to see a value at least as extreme as x """
    if x >= mu:
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        return 2 * normal_probability_below(x, mu, sigma)

two_sided_p_value(529.5, mu_0, sigma_0)


import random

# Simulate the experiment under H_0 (a fair coin) and count how often the
# number of heads is at least as extreme as 530 (in either direction)
extreme_value_count = 0
num_of_experiments = 1000
for _ in range(num_of_experiments):
    num_heads = sum(1 if random.random() < 0.5 else 0 for _ in range(1000))
    if num_heads >= 530 or num_heads <= 470:
        extreme_value_count += 1

print(f"How many times did we get an extreme value: {extreme_value_count}, therefore p-value= { extreme_value_count/num_of_experiments}")


How many times did we get an extreme value: 68, therefore p-value= 0.068
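
The simulated p-value changes from run to run. As a cross-check, the sketch below sums the binomial probabilities directly (exact up to floating-point rounding) and compares the result with the continuity-corrected normal approximation. The helper `binomial_pmf` is introduced here for illustration and is not part of the original notebook.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

n, p = 1000, 0.5

# Two-sided p-value from the binomial distribution itself:
# add up P(X = k) for every count at least as extreme as 530
exact_p_value = sum(binomial_pmf(k, n, p)
                    for k in range(n + 1) if k >= 530 or k <= 470)

# Normal approximation with the continuity correction
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
approx_p_value = 2 * (1 - normal_cdf(529.5, mu, sigma))

print(f"binomial p-value = {exact_p_value:.4f}")
print(f"normal approximation = {approx_p_value:.4f}")
```

The two values should agree closely, and a simulated estimate such as the one above should scatter around them.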


When the p-value is greater than 5%, the result we got isn't unusually extreme under the null. Therefore we do not have enough evidence to reject $H_0$.

To do good science, you should determine your hypotheses before looking at the data, you should clean your data without the hypotheses in mind, and you should keep in mind that p-values are not substitutes for common sense.

## Example: Running an A/B Test:
Assume we are in charge of website experience optimization. One of our advertisers has developed a new energy drink targeted at data scientists, and the VP of Ads wants our help choosing between advertisement A ("tastes great!") and advertisement B ("less bias!").

We decide to run an experiment, randomly showing visitors one of the ads and tracking how many people click on each one. If 990 out of 1,000 A-viewers click their ad while only 10 out of 1,000 B-viewers click theirs, we can confidently say A is the better ad.

However, what if the difference isn't so large? Here we use *statistical inference*.

Let's say that $N_A$ people see ad A and $n_A$ of them click on it. We can think of each ad view as a Bernoulli trial (1 for clicked, 0 for not clicked) with probability $p_A$ that someone clicks. We estimate $p_A$ with the observed click rate $\hat{p}_A = n_A/N_A$.

For Ad A:
- Let $X_{A1},\dots,X_{AN_A}$ be independent Bernoulli($p_A$) random variables.
- The count of clicks, $S_A = \sum_i X_{Ai}$, is then Binomial($N_A$, $p_A$), and $n_A$ is its observed value.

By the central limit theorem, if $N_A$ is large then $\hat{p}_A$ is approximately a normal random variable with mean $p_A$ and standard deviation $\sigma_A = \sqrt{p_A(1-p_A)/N_A}$. To see where these values come from, start from the mean and variance of the binomial count:

$$E[S_A] = N_Ap_A \quad
VAR[S_A] = N_Ap_A(1-p_A)
$$

Using these values, we can compute the mean and variance of the click rate $\hat{p}_A = S_A/N_A$:
$$ 
E[\hat{p}_A] = E[S_A/N_A] = \frac{1}{N_A} E[S_A] = \frac{1}{N_A} N_A p_A = p_A
$$

$$ 
VAR[\hat{p}_A] = \frac{1}{N_A^2}VAR[S_A] = \frac{1}{N_A^2}N_Ap_A(1-p_A) = \frac{1}{N_A}p_A(1-p_A)
$$

We want to test the null hypothesis:
$$H_0: p_A = p_B $$
In other words, the click-through probability is the same for both ads. The alternative hypothesis is
$$H_1: p_A \neq p_B $$

So the statistic of interest is the difference between the two observed click-through rates, $\hat{p}_A$ and $\hat{p}_B$:
$$
\hat{p}_B - \hat{p}_A
$$
Since $\hat{p}_A$ and $\hat{p}_B$ are independent and each approximately normally distributed, their difference is also approximately normal, with mean

$$
E[\hat{p}_B - \hat{p}_A] = \frac{1}{N_B} E[S_{B}] - \frac{1}{N_A} E[S_{A}] = p_B - p_A
$$

and variance

$$
VAR[\hat{p}_B - \hat{p}_A] = \sigma_B^2 + \sigma_A^2,
$$
using the identity $VAR(X-Y) = VAR(X) + VAR(-Y) + 2COV(X,-Y)$ together with $VAR(-Y) = VAR(Y)$ and $COV(X,-Y)=0$, which holds because the two groups are independent. Here $\sigma_A^2 = p_A(1-p_A)/N_A$ and $\sigma_B^2 = p_B(1-p_B)/N_B$.

### Constructing the Z-statistic

This is essentially centering and scaling the normal random variable. The general recipe is:

$$ Z = \frac{\text{observed} - \text{expected}}{\text{standard deviation under } H_0} $$


- "Observed" = $\hat{p}_B - \hat{p}_A$
- "Expected" under $H_0$ = 0: if $p_A=p_B=p$, the difference has mean zero. Under the null hypothesis we assume both groups share the same true rate, and since we don't know $p$ we use the pooled estimate $\hat{p}$ (total clicks divided by total views):
$$ \hat{p} = \frac{n_A+n_B}{N_A+N_B} $$
- "Standard deviation" under $H_0$ = $\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{N_B}+\frac{1}{N_A}\right)}$

So:
$$
Z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p} (1-\hat{p})\left(\frac{1}{N_B}+\frac{1}{N_A}\right)}}
$$

Then:
1. Convert Z to p-value using standard normal.

2. Compare to $\alpha$ (0.05).

3. Report both p-value and effect size (difference in proportions).
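
The recipe and the three steps can be sketched in code as follows. This is a minimal illustration, not the notebook's own implementation: the function name `pooled_z_test` and its return format are made up here, and the pooled estimate is taken as total clicks divided by total views, $(n_A+n_B)/(N_A+N_B)$.

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def pooled_z_test(N_A: int, n_A: int, N_B: int, n_B: int,
                  alpha: float = 0.05) -> dict:
    """Two-sided z-test of H_0: p_A = p_B using the pooled click rate."""
    p_A = n_A / N_A
    p_B = n_B / N_B
    p_pool = (n_A + n_B) / (N_A + N_B)          # shared rate under H_0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / N_A + 1 / N_B))
    z = (p_B - p_A) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))       # step 1: Z -> p-value
    return {
        "effect": p_B - p_A,                     # step 3: effect size
        "z": z,
        "p_value": p_value,
        "reject_H0": p_value < alpha,            # step 2: compare to alpha
    }

result = pooled_z_test(N_A=1000, n_A=200, N_B=1000, n_B=180)
print(result)
```

Using `abs(z)` means the same code handles a positive or a negative difference, which is what a two-sided test requires.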

p-value = probability of data "as or more extreme" under $H_0$

- If $p < \alpha$ → reject $H_0$

- If $p > \alpha$ → insufficient evidence, do not reject.

One-tailed = directional; two-tailed = non-directional.

What the p-value actually measures

- It's not "the probability of being 1.14 SDs away" exactly.

- It's "the probability of seeing a difference at least this extreme (≤ -0.02 or ≥ +0.02) if the true difference were zero."

That's why we double the tail.

### When to use one-tailed vs two-tailed

**One-tailed**: use when your hypothesis is directional.
Example: "We believe Ad B has a higher click-through than Ad A."
Then you only care if 

$$p_B > p_A$$

- The rejection region is one side of the normal curve.

**Two-tailed**: use when your hypothesis is non-directional.
Example: "We believe Ad B is different from Ad A."
Then you must consider both 

$$p_B > p_A \quad \text{and} \quad p_B < p_A $$

- So you split α into two tails, and double the tail probability in the p-value.

Rule of thumb: unless you had a very strong reason before the experiment to only care about one direction, you do a two-tailed test.

### Confidence Intervals

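A 95% confidence interval for the difference is:

$$
(\hat p_B-\hat p_A) \pm 1.96 \times SE,
$$

where $SE = \sqrt{\hat{p}_A(1-\hat{p}_A)/N_A + \hat{p}_B(1-\hat{p}_B)/N_B}$ is the estimated standard error of the difference.

If 0 lies outside this interval, it corresponds to rejecting $H_0$ at the 5% level.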
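
As a small worked sketch, the interval can be computed for two hypothetical campaigns of 1,000 views each. The function name `difference_confidence_interval` is illustrative, and the standard error is assumed to be the unpooled one, built from each group's own estimated click rate.

```python
import math
from typing import Tuple

def difference_confidence_interval(N_A: int, n_A: int, N_B: int, n_B: int,
                                   z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for p_B - p_A (unpooled standard error)."""
    p_A = n_A / N_A
    p_B = n_B / N_B
    se = math.sqrt(p_A * (1 - p_A) / N_A + p_B * (1 - p_B) / N_B)
    diff = p_B - p_A
    return diff - z_crit * se, diff + z_crit * se

for clicks_B in (180, 150):
    low, high = difference_confidence_interval(1000, 200, 1000, clicks_B)
    verdict = "contains 0" if low <= 0 <= high else "excludes 0"
    print(f"200 vs {clicks_B} clicks: ({low:.4f}, {high:.4f}) {verdict}")
```

An interval that contains 0 means a true difference of zero is still compatible with the data, while an interval that excludes 0 matches a rejection of $H_0$ at the 5% level.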


In [50]:
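import math
from typing import Tuple

# In code this looks like:
def estimate_parameters(N: int, n: int) -> Tuple[float, float]:
    """ Returns the estimated click rate and its standard deviation """
    p = n/N  # the click-through rate
    sigma = math.sqrt(p*(1-p)/N)
    return p, sigma

estimate_parameters(1000,200)

def a_b_test_statistic(N_A: int, n_A: int, N_B: int, n_B: int) -> float:
    """ z-statistic for p_B - p_A. Note: this uses the unpooled standard
    deviation sqrt(sigma_A^2 + sigma_B^2) rather than the pooled estimate
    described above; the two give very similar values here. """
    p_A, sigma_A = estimate_parameters(N_A, n_A)
    p_B, sigma_B = estimate_parameters(N_B, n_B)
    return (p_B - p_A)/ math.sqrt(sigma_A**2 + sigma_B **2)

# Example 1: 200 vs 180 clicks out of 1000 views each
z = a_b_test_statistic(1000,200,1000,180)

two_sided_p_value(z)

# Example 2: 200 vs 150 clicks out of 1000 views each
z_1 = a_b_test_statistic(1000,200,1000,150)
two_sided_p_value(z_1)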

0.003189699706216853

# What does this all mean?

## 📊 How to frame it for a business audience

### Example 1 (200 vs 180 clicks):

$\hat{p}_A = 20\%$, $\hat{p}_B = 18\%$.

Difference = 2 percentage points.

$Z = -1.14$, p-value $= 0.25$ → not statistically significant at 5%.

Interpretation: With this sample size, the observed 2-percentage-point gap could easily be noise. We'd need more data before concluding Ad A really outperforms Ad B.

### Example 2 (200 vs 150 clicks):

$\hat{p}_A = 20\%$, $\hat{p}_B = 15\%$.

Difference = 5 percentage points.

$Z = -2.94$, p-value $= 0.003$ → significant at 5%.

Interpretation: The evidence strongly suggests Ad A is better. At scale, that 5-percentage-point lift in CTR could mean thousands of extra conversions or significant added revenue.