# Hypothesis Testing

**Guide**

- Confidence Intervals (CIs): what they really mean and how they're computed.
- Hypothesis Testing Framework: the logic of testing claims.
- Errors, Power, and p-values: how to interpret statistical decisions.
- Real-world examples: A/B testing and ML model comparison.
- Two-sample & paired t-tests: hands-on calculations and model performance comparison.

🧩 Analogy: "The Pizza Slice Game"

- Imagine you own a pizza shop.
- Each pizza (a sample) has 8 slices (data points); every slice you could ever bake makes up the population.
- You keep baking pizzas all day: some have slightly bigger slices, some smaller.

Now, if you randomly take one slice from each pizza, you'll get different slice weights every time (this is sampling).

If you repeat this many times and compute the average slice weight for each pizza, those averages form a sampling distribution.

- The mean of this distribution ≈ population mean.
- The spread (how much those sample means vary) is measured by the Standard Error (SE).

As you sample larger pizzas (n ↑), your SE ↓ because averages become more stable.

🧮 Tiny Example

Let's say:
- Population mean = 50g
- Population SD = 10g

If we take samples of size n=25:

$$SE = \frac{SD}{\sqrt{n}} = \frac{10}{\sqrt{25}}=2$$

If we increase sample size to 100:

$$SE = \frac{10}{\sqrt{100}}=1$$

→ Quadrupling the sample size halves the SE, so our sample mean estimates are twice as precise.
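As a rough check on the formula, the sketch below simulates many samples and compares the spread of their means with SD/√n. It assumes a Normal population with mean 50 and SD 10 purely for illustration (the text does not say what shape the population has); the function names `standard_error` and `simulated_se`, the seed, and the number of repetitions are arbitrary choices made for this example.

```python
import numpy as np

def standard_error(sd, n):
    """Theoretical standard error of the sample mean: SD / sqrt(n)."""
    return sd / np.sqrt(n)

def simulated_se(mean, sd, n, repeats=20_000, seed=0):
    """Empirical SE: standard deviation of many simulated sample means."""
    rng = np.random.default_rng(seed)
    # Each row is one sample of size n from an assumed Normal population.
    samples = rng.normal(loc=mean, scale=sd, size=(repeats, n))
    return samples.mean(axis=1).std(ddof=1)

for n in (25, 100):
    print(f"n={n}: formula SE = {standard_error(10, n):.2f}, "
          f"simulated SE = {simulated_se(50, 10, n):.2f}")
```

The simulated values should land close to 2 and 1, matching the hand calculation.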

🎯 Central Limit Theorem (CLT)

No matter how weird or skewed the population is (as long as its variance is finite), if each sample is large enough, the distribution of sample means tends toward a Normal distribution.

That's why we can use z-scores and t-scores to build confidence intervals and run hypothesis tests, even when the raw data isn't normal.
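To see the CLT at work, here is a small simulation sketch. It assumes an Exponential(1) population as a stand-in for "skewed data" (my choice, not something from the text) and measures how lopsided the distribution of sample means is for several sample sizes. The helper names `skewness` and `sample_mean_skewness` are made up for this example.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean of cubed z-scores (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

def sample_mean_skewness(n, repeats=20_000, seed=0):
    """Skewness of the means of `repeats` samples of size n from Exponential(1)."""
    rng = np.random.default_rng(seed)
    samples = rng.exponential(scale=1.0, size=(repeats, n))
    return skewness(samples.mean(axis=1))

for n in (1, 5, 30, 200):
    print(f"n={n:>3}: skewness of sample means = {sample_mean_skewness(n):.2f}")
```

The skewness should shrink toward 0 as n grows, which is the "tends toward Normal" behaviour described above. For this population the theoretical value is 2/√n, so the simulated numbers can be compared against that.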

✅ Quick Check for You:
Can you restate why SE gets smaller as sample size increases, and what that means for our confidence about the population mean?

You nailed two key ideas:

1. Shape: With larger samples, the sampling distribution of the mean looks more Normal (thanks to the Central Limit Theorem).

2. Precision: As sample size n increases, the Standard Error (SE) shrinks, meaning our sample mean "wiggles" less around the true population mean.

In short: larger samples → narrower uncertainty → better estimate.

Now let's turn that idea into something concrete:

🎯 Confidence Intervals (CIs)

Confidence intervals use this sampling variability (SE) to describe how uncertain we are about the population mean.

A 95% Confidence Interval for the mean is:

$$\bar{x} \pm z \times SE$$

where z=1.96 for 95% confidence (under the normal curve).

🍕 Example (continuing the pizza story)

Suppose:
- Sample mean slice weight $\bar{x} = 50$ g
- SD = 10g
- n=25
- SE = 2g (from $10/\sqrt{25}$)

Then 95% CI:

$$50 \pm 1.96 \times 2 = 50 \pm 3.92$$

So: (46.08g, 53.92g)

Interpretation (the correct one):

> If we repeated this sampling process many times, about 95% of the confidence intervals we compute would contain the true population mean.

It's not "there's a 95% chance the mean is between 46.08 and 53.92", because the population mean is fixed; it's our interval that varies from sample to sample.

Exactly right ✅

- When n increases, the standard error (SE) shrinks (since $SE = \frac{SD}{\sqrt{n}}$).
- Smaller SE → smaller "± margin" → narrower confidence interval.

That's the mathematical expression of what you just said: more data = more precision.
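The narrowing can be sketched in a few lines. The helper `mean_ci` and the constant `Z_95` are names invented for this illustration; the numbers are the pizza example's (mean 50, SD 10). Like the text, it uses z = 1.96. With a small sample and an SD estimated from the data, a t critical value is usually preferred.

```python
import math

Z_95 = 1.96  # z critical value for 95% confidence, as used in the text

def mean_ci(sample_mean, sd, n, z=Z_95):
    """Return (low, high) for sample_mean ± z * sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    margin = z * se
    return sample_mean - margin, sample_mean + margin

for n in (25, 100, 400):
    low, high = mean_ci(50, 10, n)
    print(f"n={n:>3}: CI = ({low:.2f}, {high:.2f}), width = {high - low:.2f}")
```

Each time n is quadrupled, the interval width halves (7.84, then 3.92, then 1.96), because the margin scales with $1/\sqrt{n}$.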

We'll look at three types of confidence intervals that appear both in statistics and machine learning:
- CI for a mean: what we just did.
- CI for a proportion: e.g., model accuracy or success rate.
- CI for difference of means: e.g., comparing Model A vs. Model B.

🧩 CI for a Proportion (e.g., model accuracy)

Suppose a model correctly classifies 840 out of 1000 test samples.
Then:

$$\hat{p} = \frac{840}{1000} =0.84$$

The standard error for a proportion is:

$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.84(0.16)}{1000}} \approx 0.0116$$

A 95% CI:

$$0.84 \pm 1.96 \times 0.0116 = (0.817, 0.863)$$

Interpretation:
> We're 95% confident that the true model accuracy (if tested on all future data) lies between 81.7% and 86.3%.

💡 What is a proportion?

A proportion is just a fraction that tells us the part of the whole that meets some condition.

Example:
If 840 out of 1000 predictions are correct,

$$\text{proportion} = \frac{840}{1000} = 0.84$$

That means 84% of items had the outcome we're tracking (here, correct predictions).

In general:

$$\text{proportion} = \frac{\text{number of successes}}{\text{total number of trials}}$$

🧩 What is $\hat{p}$ ("p-hat")?

The hat symbol (^) means "estimated from the sample."

So $\hat{p}$ (read "p-hat") is our sample proportion: our best estimate of the true population probability p.

Example:
- p = true model accuracy on all possible data (unknown)
- $\hat{p} = 0.84 =$ observed accuracy on our test sample (known)

So, $\hat{p}$ is like saying:
> "Based on what we've seen, we estimate the true probability of success is 0.84."

⚖️ Relationship between $\hat{p}$ and probability

They are related but not identical:

| Symbol      | Meaning                           | Known/Unknown                |
| :---------- | :-------------------------------- | :--------------------------- |
| $p$       | True probability in population    | Unknown (we try to infer it) |
| $\hat{p}$ | Observed proportion in our sample | Known (we compute it)        |


You can think of $\hat{p}$ as the empirical probability (what actually happened in our sample), while $p$ is the theoretical (true) probability we're trying to learn about.

🧠 Analogy

If you flip a coin 100 times and get 53 heads:
- $\hat{p} = 0.53$: what you observed
- $p$: the true probability (0.5 if the coin is fair), which we're testing or estimating
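The coin analogy is easy to simulate. The sketch below assumes a fair coin (p = 0.5) and repeats the 100-flip experiment several times; `sample_proportions` is a hypothetical helper written for this illustration.

```python
import numpy as np

def sample_proportions(p, n_flips, n_experiments, seed=0):
    """Simulate n_experiments runs of n_flips coin flips; return each run's p-hat."""
    rng = np.random.default_rng(seed)
    # Number of heads in each experiment, drawn from Binomial(n_flips, p).
    heads = rng.binomial(n=n_flips, p=p, size=n_experiments)
    return heads / n_flips

p_true = 0.5  # assumed fair coin; in practice p is unknown
p_hats = sample_proportions(p_true, n_flips=100, n_experiments=10)
print("p-hat per experiment:", np.round(p_hats, 2))
print(f"average p-hat: {p_hats.mean():.3f} (true p = {p_true})")
```

Each experiment gives a slightly different $\hat{p}$, while $p$ never changes. That gap between the fixed truth and the varying estimate is what the standard error quantifies.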

✅ Quick check:
Suppose your model correctly classified 90 out of 120 samples.
Can you tell me what $\hat{p}$ (the sample proportion) is in this case?
Then we'll use that number to compute its standard error next.

```python
import numpy as np
```

```python
def proportion_se(success, sample):
    """Return the sample proportion and its standard error."""
    proportion = success / sample
    # SE of a proportion: sqrt(p_hat * (1 - p_hat) / n)
    se = np.sqrt(proportion * (1 - proportion) / sample)
    return proportion, se
```

```python
success = 90
sample_size = 120
proportion, pse = proportion_se(success, sample_size)

# print result
print(f"Proportion: {proportion:.4f}")
print(f"Proportion Standard Error: {pse:.4f}")
```

```
Proportion: 0.7500
Proportion Standard Error: 0.0395
```


So now you've got the two key pieces for a proportion CI:

- $\hat{p} = 0.75$
- $SE \approx 0.0395$

This means:

> If you repeatedly drew samples of 120 items, your sample accuracies would typically vary by about ±0.04 around the true accuracy.

🎯 Quick reinforcement

Let's turn this into a 95% Confidence Interval.
We use:

$$\hat{p} \pm 1.96 \times SE$$

```python
margin_error = 1.96 * pse
print(margin_error)
```

```
0.0774758026741253
```

```python
# calculate confidence interval
confidence_low = proportion - margin_error
confidence_high = proportion + margin_error

print(f"We are 95% confident that the true success rate is between {confidence_low * 100:.4f}% and {confidence_high * 100:.4f}%")
```

```
We are 95% confident that the true success rate is between 67.2524% and 82.7476%
```


Calculation (final step)

You found the margin of error:

$$ME = 1.96 \times SE = 1.96 \times 0.03953 \approx 0.0775$$

So the 95% CI for the proportion is:

$$\hat{p} \pm ME = 0.75 \pm 0.0775$$
$$CI \approx (0.6725, 0.8275)$$

(about 67.3% to 82.7%, rounding the unrounded bounds 0.67252 and 0.82748)

Quick checks

- Is the normal approximation valid? $n\hat{p} = 120 \times 0.75 = 90$ and $n(1 - \hat{p}) = 30$ are both greater than 5 (a common rule of thumb), so the normal approximation and this formula are fine.
- Formula used: $SE = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$, with the 95% z value $\approx 1.96$ (standard texts: Agresti & Coull; Casella & Berger).
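The rule-of-thumb check above can be written as a tiny helper. `normal_approx_ok` is a hypothetical name, and the default threshold of 5 follows the text; some textbooks use a stricter threshold such as 10.

```python
def normal_approx_ok(successes, n, threshold=5):
    """Check the rule of thumb: n*p_hat and n*(1 - p_hat) both exceed threshold."""
    failures = n - successes
    # n * p_hat equals the success count; n * (1 - p_hat) equals the failure count.
    return successes > threshold and failures > threshold

print(normal_approx_ok(90, 120))   # 90 successes, 30 failures -> True
print(normal_approx_ok(118, 120))  # only 2 failures -> False
```

When the check fails, as in the second call, the $\hat{p} \pm 1.96 \times SE$ interval is less trustworthy because the sampling distribution of $\hat{p}$ is too skewed to look Normal.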

Interpretation (the correct way)
> If we repeated this sampling process many times and computed a 95% CI each time, about 95% of those intervals would contain the true population proportion p.

It does not mean "there's a 95% chance the true proportion lies in this interval" for this single computed interval: the true proportion is fixed; the interval is random.

Short analogy

Think of throwing a net at a fixed target with a slightly shaky hand. Each interval you build is a net you throw; about 95% of the nets you throw over many experiments will catch the bullseye. This particular net (the interval) either caught it or it didn't; we just know the method catches it 95% of the time in the long run.
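The long-run claim can be checked with a simulation. The sketch below assumes the true proportion is 0.75 (in real work it is unknown), repeats the 120-item experiment many times, builds the $\hat{p} \pm 1.96 \times SE$ interval each time (often called the Wald interval), and counts how often the interval contains the true value. The function name, seed, and number of experiments are illustrative choices.

```python
import numpy as np

def wald_ci_coverage(p_true, n, experiments=20_000, z=1.96, seed=0):
    """Fraction of simulated 95% CIs (p_hat ± z * SE) that contain p_true."""
    rng = np.random.default_rng(seed)
    # One p_hat per simulated experiment of n trials.
    p_hat = rng.binomial(n=n, p=p_true, size=experiments) / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    low, high = p_hat - z * se, p_hat + z * se
    return float(np.mean((low <= p_true) & (p_true <= high)))

coverage = wald_ci_coverage(p_true=0.75, n=120)
print(f"Share of intervals containing the true p: {coverage:.3f}")
```

The share should come out near 0.95. It will not be exactly 0.95, because the formula relies on the normal approximation and the simulation itself has random noise.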