# Chapter 6: Hypothesis Testing Fundamentals

**Core Goal:** Make principled decisions about parameter values using data and controlled error rates.

**Motivation:** Often we need to decide whether data support a specific claim about a parameter. Is a new drug better than placebo? Has the mean changed from its historical value? Is one method superior to another? Hypothesis testing provides a formal framework for making such decisions while controlling the probability of incorrect conclusions. Unlike confidence intervals, which estimate parameters, hypothesis tests assess specific claims and provide yes/no decisions with known error rates.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 6.1 Null and Alternative Hypotheses

**Null Hypothesis (H₀):** The claim we test, typically representing "no effect" or "no difference."

**Alternative Hypothesis (H₁ or Hₐ):** What we conclude if we reject the null hypothesis.

**Types of alternative hypotheses:**
- **Two-sided:** $H_1: \theta \neq \theta_0$ (parameter differs from null value)
- **Right-sided:** $H_1: \theta > \theta_0$ (parameter greater than null value)
- **Left-sided:** $H_1: \theta < \theta_0$ (parameter less than null value)

**Motivation:** Hypothesis tests start by assuming the null hypothesis is true, then ask whether observed data are sufficiently unlikely under this assumption to reject it. The burden of proof is on demonstrating departure from H₀. This asymmetric structure protects against claiming effects that don't exist.

### Example: Testing Mean

**Research question:** Has the population mean changed from its historical value μ₀ = 100?

**Hypotheses:**
- H₀: μ = 100 (no change)
- H₁: μ ≠ 100 (has changed)

**Data collection:** Take a random sample and compute the sample mean.

## 6.2 Test Statistics and Rejection Regions

**Test Statistic:** A function of the data used to make the decision.

**For testing μ = μ₀ with known σ:** $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

**For testing μ = μ₀ with unknown σ:** $T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$

**Rejection Region:** Values of the test statistic that lead to rejecting H₀.

**Motivation:** Test statistics standardize the difference between data and null hypothesis. Under H₀, they follow known distributions (standard normal, t-distribution), allowing us to determine how extreme the observed value is.

In [None]:
np.random.seed(42)
mu_0 = 100; true_mu = 105; sigma = 15

In [None]:
data = stats.norm(true_mu, sigma).rvs(50)
xbar = np.mean(data); n = len(data)

In [None]:
# Z = (X̄ - μ₀)/(σ/√n): Test statistic under H₀ (known variance)
z_stat = (xbar - mu_0) / (sigma / np.sqrt(n))

In [None]:
print(f"Sample mean: X̄ = {xbar:.2f}")
print(f"Test statistic: Z = {z_stat:.3f}")
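The rejection region defined above can also be computed directly. The following is a minimal sketch, assuming a two-sided z-test at α = 0.05 with the same μ₀, σ, and n as the running example; the seed and the names `sample`, `z_obs`, and `reject` are illustrative choices, not part of the original cells.

```python
import numpy as np
from scipy import stats

# Assumed values mirroring the running example (mu_0 = 100, sigma = 15, n = 50)
rng = np.random.default_rng(42)  # illustrative seed
mu_0, true_mu, sigma, n = 100, 105, 15, 50
sample = rng.normal(true_mu, sigma, size=n)

alpha = 0.05
z_obs = (sample.mean() - mu_0) / (sigma / np.sqrt(n))

# Two-sided rejection region: |Z| > z_{alpha/2}
z_crit = stats.norm.ppf(1 - alpha / 2)
reject = abs(z_obs) > z_crit

print(f"Critical value: z_crit = {z_crit:.3f}")
print(f"Observed Z = {z_obs:.3f} -> {'Reject H0' if reject else 'Fail to reject H0'}")
```

Comparing the statistic to the critical value gives the same decision as comparing the two-sided p-value to α; the two are different views of one rule.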

## 6.3 Type I and Type II Errors

**Type I Error:** Reject H₀ when H₀ is true (false positive). Its probability is denoted α.

**Type II Error:** Fail to reject H₀ when H₀ is false (false negative). Its probability is denoted β.

**Significance Level (α):** Maximum allowable Type I error probability, typically α = 0.05

**Power (1-β):** Probability of correctly rejecting H₀ when it is false

| **Reality** | **Reject H₀** | **Fail to Reject H₀** |
|------------|-------------|---------------------|
| **H₀ True** | Type I Error (α) | Correct |
| **H₀ False** | Correct (Power = 1-β) | Type II Error (β) |

**Motivation:** All hypothesis tests involve risk of error. We cannot eliminate both types simultaneously. By convention, we control Type I error at α (usually 0.05) and try to maximize power. The asymmetry reflects that false positives (claiming effects that don't exist) are often more costly than false negatives.
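Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test with known σ and the running example's values (μ₀ = 100, σ = 15, n = 50); the helper `rejection_rate`, the seed, and the number of simulations are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # illustrative seed
mu_0, sigma, n, alpha = 100, 15, 50, 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)
n_sims = 20_000

def rejection_rate(mu):
    """Fraction of simulated samples (true mean mu) where the two-sided z-test rejects H0."""
    samples = rng.normal(mu, sigma, size=(n_sims, n))
    z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type1_rate = rejection_rate(mu_0)  # H0 true: every rejection is a Type I error
power_sim = rejection_rate(105)    # H0 false (mu = 105): every rejection is correct
type2_rate = 1 - power_sim         # failures to reject are Type II errors

print(f"Estimated Type I error rate: {type1_rate:.3f}")
print(f"Estimated power at mu = 105: {power_sim:.3f}")
print(f"Estimated Type II error rate: {type2_rate:.3f}")
```

The estimated Type I rate should land close to α, while the Type II rate depends on how far the true mean is from μ₀.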

## 6.4 P-values

**P-value:** Probability of observing a test statistic at least as extreme as the one obtained, assuming H₀ is true.

**For two-sided test:** $\text{p-value} = P(|Z| \geq |z_{obs}| \mid H_0)$

**Decision rule:** Reject H₀ if p-value < α

**Interpretation:**
- Small p-value (< α): Data are unlikely under H₀ → reject H₀
- Large p-value (≥ α): Data are consistent with H₀ → fail to reject H₀

**Motivation:** P-values quantify how surprising the data are under the null hypothesis. They provide a continuous measure of evidence against H₀, though the reject/fail-to-reject decision is binary. Smaller p-values indicate stronger evidence against H₀.

In [None]:
# p = P(|Z| ≥ |z_obs| | H₀): Two-sided p-value
p_value = 2 * stats.norm.sf(abs(z_stat))  # sf = 1 - cdf, more accurate in the tails

In [None]:
print(f"p-value = {p_value:.4f}")
print(f"Decision at α=0.05: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'}")

### Visualizing P-value

**The p-value is the area in the tails beyond the observed test statistic.**

In [None]:
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x)

In [None]:
# Draw curve, shading, and labels in one cell so they all land on the same figure
plt.plot(x, y, 'b-', linewidth=2, label='Standard Normal')
plt.axvline(z_stat, color='r', linestyle='--', label=f'Observed Z = {z_stat:.2f}')
plt.axvline(-z_stat, color='r', linestyle='--')
# Shade p-value region
x_right = x[x >= abs(z_stat)]
x_left = x[x <= -abs(z_stat)]
plt.fill_between(x_right, stats.norm.pdf(x_right), alpha=0.3, color='red', label='p-value region')
plt.fill_between(x_left, stats.norm.pdf(x_left), alpha=0.3, color='red')
plt.xlabel('Z'); plt.ylabel('Density')
plt.title(f'P-value = {p_value:.4f} (Shaded Area)'); plt.legend()
plt.show()

## 6.5 One-Sample t-Test

**Test:** H₀: μ = μ₀ versus H₁: μ ≠ μ₀

**When σ unknown:** Use t-test with test statistic $T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}$ under H₀ (exact for normally distributed data, approximate for large n otherwise)

**P-value:** $P(|T| \geq |t_{obs}| \mid H_0)$ using $t_{n-1}$ distribution

**Motivation:** When population variance is unknown (the typical case), we use sample standard deviation and t-distribution. This is the most common hypothesis test in practice.

In [None]:
# Estimate variance from data
s = np.std(data, ddof=1)

In [None]:
# T = (X̄ - μ₀)/(S/√n): t-statistic with unknown variance
t_stat = (xbar - mu_0) / (s / np.sqrt(n))

In [None]:
# p-value from t-distribution with n-1 degrees of freedom
p_value_t = 2 * stats.t.sf(abs(t_stat), df=n-1)  # sf = 1 - cdf, more accurate in the tails

In [None]:
print(f"t-statistic: T = {t_stat:.3f}")
print(f"p-value = {p_value_t:.4f}")
print(f"Decision at α=0.05: {'Reject H₀' if p_value_t < 0.05 else 'Fail to reject H₀'}")

### Using scipy.stats for t-Test

In [None]:
# stats.ttest_1samp: Built-in one-sample t-test
t_result = stats.ttest_1samp(data, mu_0)

In [None]:
print(f"Scipy t-test: t = {t_result.statistic:.3f}, p = {t_result.pvalue:.4f}")
print(f"Matches manual calculation: {np.isclose(t_result.statistic, t_stat) and np.isclose(t_result.pvalue, p_value_t)}")

## 6.6 Power of a Test

**Power:** Probability of rejecting H₀ when it is false: $\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_1 \text{ true})$

**Factors affecting power:**
1. **Effect size:** Larger difference from H₀ → higher power
2. **Sample size:** Larger n → higher power
3. **Significance level:** Larger α → higher power (but more Type I errors)
4. **Variance:** Smaller σ² → higher power

**Motivation:** Power quantifies our ability to detect effects when they exist. Low-power studies waste resources by being unlikely to detect true effects. Power analysis helps plan adequate sample sizes before data collection.

### Power Calculation Example

**Question:** What is the power to detect μ = 105 when testing H₀: μ = 100 with n=50, σ=15, α=0.05?

In [None]:
# Rejection region: |Z| > z_{α/2}
alpha = 0.05; z_crit = stats.norm.ppf(1 - alpha/2)

In [None]:
# Under H₁: Z ~ N(δ√n/σ, 1) where δ = μ₁ - μ₀
delta = true_mu - mu_0; non_centrality = delta * np.sqrt(n) / sigma

In [None]:
# Power = P(|Z| > z_{α/2} | H₁): Probability of rejection under alternative
power = 1 - stats.norm.cdf(z_crit - non_centrality) + stats.norm.cdf(-z_crit - non_centrality)

In [None]:
print(f"Power to detect μ = {true_mu}: {power:.3f}")
print(f"Probability of Type II error (β): {1-power:.3f}")

### Power as Function of Sample Size

In [None]:
sample_sizes = np.arange(10, 200, 5)
powers = []

In [None]:
for ns in sample_sizes:
    nc = delta * np.sqrt(ns) / sigma
    power_ns = 1 - stats.norm.cdf(z_crit - nc) + stats.norm.cdf(-z_crit - nc)  # avoid shadowing built-in pow
    powers.append(power_ns)

In [None]:
# Plot and label in one cell so everything lands on the same figure
plt.plot(sample_sizes, powers, linewidth=2)
plt.axhline(0.8, color='r', linestyle='--', label='80% power (conventional)')
plt.xlabel('Sample Size n'); plt.ylabel('Power')
plt.title(f'Power to Detect μ = {true_mu} (α = {alpha})'); plt.legend()
plt.show()
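The power formula can be inverted to plan a sample size. The sketch below assumes the same two-sided z-test setting (μ₀ = 100, μ₁ = 105, σ = 15, α = 0.05) and an 80% power target; `z_test_power`, `n_required`, and `n_approx` are hypothetical names introduced here. It searches for the smallest n that reaches the target and compares it with the common closed-form approximation $n \approx \left(\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta}\right)^2$, which ignores the negligible far-tail rejection probability.

```python
import numpy as np
from scipy import stats

mu_0, mu_1, sigma, alpha, target = 100, 105, 15, 0.05, 0.80
z_crit = stats.norm.ppf(1 - alpha / 2)

def z_test_power(n):
    """Power of the two-sided z-test at sample size n (known sigma)."""
    nc = (mu_1 - mu_0) * np.sqrt(n) / sigma
    return stats.norm.sf(z_crit - nc) + stats.norm.cdf(-z_crit - nc)

# Smallest n whose power reaches the target (brute-force search)
n_required = next(n for n in range(2, 10_000) if z_test_power(n) >= target)

# Closed-form approximation, rounded up to the next integer
z_beta = stats.norm.ppf(target)
n_approx = int(np.ceil(((z_crit + z_beta) * sigma / (mu_1 - mu_0)) ** 2))

print(f"Smallest n with power >= {target:.0%}: {n_required}")
print(f"Closed-form approximation: {n_approx}")
```

For a t-test with unknown σ the required n is typically slightly larger, so this figure is better read as a lower bound.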

## 6.7 Relationship Between Tests and Confidence Intervals

**Duality:** Hypothesis tests and confidence intervals are complementary.

**Connection:** A two-sided test fails to reject H₀: μ = μ₀ at level α if and only if μ₀ lies in the (1-α) confidence interval.

**Example:**
- 95% Confidence Interval: [98, 112]
- Test H₀: μ = 105 at α=0.05 → Fail to reject (105 is in interval)
- Test H₀: μ = 95 at α=0.05 → Reject (95 is not in interval)

**Motivation:** This duality shows that tests and confidence intervals provide equivalent information. Confidence intervals are often more informative because they show all plausible values, not just yes/no for a single value.

In [None]:
# 95% confidence interval
ci = stats.t.interval(0.95, df=n-1, loc=xbar, scale=s/np.sqrt(n))

In [None]:
print(f"95% Confidence Interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"Test H₀: μ = {mu_0} → {'Reject' if mu_0 < ci[0] or mu_0 > ci[1] else 'Fail to reject'}")
print(f"μ₀ = {mu_0} is {'NOT in' if mu_0 < ci[0] or mu_0 > ci[1] else 'in'} confidence interval")
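The duality holds for every candidate null value, not just μ₀ = 100. As a rough check, the sketch below assumes a freshly simulated sample and a hypothetical grid of candidate values, then confirms that the two-sided t-test decision and confidence-interval membership agree at each one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # illustrative seed
sample = rng.normal(105, 15, size=50)
n = len(sample)
xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

alpha = 0.05
lo, hi = stats.t.interval(1 - alpha, df=n - 1, loc=xbar, scale=se)

# Hypothetical grid of null values to test
candidates = np.linspace(95, 115, 41)
agree = []
for mu in candidates:
    p = stats.ttest_1samp(sample, mu).pvalue
    fail_to_reject = p >= alpha
    in_ci = lo <= mu <= hi
    agree.append(fail_to_reject == in_ci)

print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
print(f"Test and CI agree for all {len(candidates)} candidates: {all(agree)}")
```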

## 6.8 One-Sided Tests

**Right-sided test:** H₀: μ ≤ μ₀ versus H₁: μ > μ₀

**Rejection region:** Z > z_α (or T > t_{α,n-1})

**P-value:** P(Z ≥ z_obs | μ = μ₀), computed at the boundary value μ₀ of H₀

**Left-sided test:** H₀: μ ≥ μ₀ versus H₁: μ < μ₀

**Rejection region:** Z < -z_α (or T < -t_{α,n-1})

**P-value:** P(Z ≤ z_obs | μ = μ₀), computed at the boundary value μ₀ of H₀

**Motivation:** One-sided tests are appropriate when deviation in only one direction is of interest. They have higher power than two-sided tests for detecting effects in the specified direction, but cannot detect effects in the opposite direction.

In [None]:
# Right-sided test: H₀: μ ≤ 100 versus H₁: μ > 100
p_value_right = stats.t.sf(t_stat, df=n-1)  # sf = 1 - cdf, more accurate in the tails

In [None]:
print(f"Right-sided p-value: {p_value_right:.4f}")
print(f"Two-sided p-value: {p_value_t:.4f}")
print("One-sided p-value is half of two-sided (when test statistic positive)")
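One-sided p-values can also be requested from SciPy directly. This sketch assumes a SciPy version in which `stats.ttest_1samp` accepts the `alternative` keyword (added in SciPy 1.6, to the best of my knowledge); the sample is simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # illustrative seed
sample = rng.normal(105, 15, size=50)
mu_0 = 100

two_sided = stats.ttest_1samp(sample, mu_0)                    # H1: mu != mu_0
right = stats.ttest_1samp(sample, mu_0, alternative="greater")  # H1: mu > mu_0
left = stats.ttest_1samp(sample, mu_0, alternative="less")      # H1: mu < mu_0

print(f"t = {two_sided.statistic:.3f}")
print(f"Two-sided p:   {two_sided.pvalue:.4f}")
print(f"Right-sided p: {right.pvalue:.4f}")
print(f"Left-sided p:  {left.pvalue:.4f}")
```

The right- and left-sided p-values sum to 1, and the smaller of the two is half the two-sided p-value.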

## 6.9 Common Misinterpretations of P-values

**P-value is NOT:**
1. ❌ Probability that H₀ is true
2. ❌ Probability that results occurred by chance
3. ❌ Importance or size of effect
4. ❌ Probability of making an error

**P-value IS:**
✅ Probability of observing data at least as extreme as what we got, **assuming H₀ is true**

**Correct statement (for an observed p-value of 0.012):** "If H₀ were true and we repeated this experiment many times, we would observe a test statistic this extreme or more in 1.2% of experiments."

**Motivation:** P-values are commonly misunderstood, leading to incorrect conclusions. They tell us about compatibility of data with H₀, not about the truth of H₀.
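The repeated-experiment reading can be checked by simulation: when H₀ is true, p-values are approximately uniform on [0, 1], so about 5% of experiments give p < 0.05. The sketch below assumes normal data generated with H₀ true (μ = μ₀ = 100, σ = 15, n = 50); the seed, simulation count, and thresholds are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # illustrative seed
mu_0, sigma, n, n_sims = 100, 15, 50, 10_000

# Each row is one repeated experiment with H0 true
samples = rng.normal(mu_0, sigma, size=(n_sims, n))
p_values = stats.ttest_1samp(samples, mu_0, axis=1).pvalue

# Fraction of experiments with a p-value below each threshold
frac_below = {t: np.mean(p_values < t) for t in (0.01, 0.05, 0.10)}
for threshold, frac in frac_below.items():
    print(f"Fraction of experiments with p < {threshold:.2f}: {frac:.3f}")
```

Each fraction should sit near its threshold, which is a statement about data under H₀ and says nothing about the probability that H₀ itself is true.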

## 6.10 Practical Versus Statistical Significance

**Statistical Significance:** p-value < α (typically 0.05)

**Practical Significance:** Effect size large enough to matter in practice

**Possible scenarios:**
1. Statistically significant AND practically significant → Important finding
2. Statistically significant but NOT practically significant → Trivial effect detected with large n
3. NOT statistically significant but practically significant → Important effect missed due to small n (low power)
4. NOT statistically significant and NOT practically significant → No evidence of meaningful effect

**Motivation:** With large sample sizes, even tiny, meaningless effects can be statistically significant. Always consider effect size alongside p-value. Statistical significance alone does not establish practical importance, and with small samples a practically important effect can fail to reach significance.
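Scenario 2 is easy to reproduce. The sketch below assumes a hypothetical true mean of 100.3 (a shift of 0.02 standard deviations) and a very large sample, then reports the p-value alongside a standardized effect size (Cohen's d, here estimated as the mean difference divided by the sample standard deviation).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # illustrative seed
mu_0, sigma = 100, 15
true_mu = 100.3   # hypothetical tiny shift: 0.02 standard deviations
n = 1_000_000

sample = rng.normal(true_mu, sigma, size=n)
result = stats.ttest_1samp(sample, mu_0)

# Standardized effect size (Cohen's d for a one-sample comparison)
cohens_d = (sample.mean() - mu_0) / sample.std(ddof=1)

print(f"p-value: {result.pvalue:.3g}")
print(f"Cohen's d: {cohens_d:.3f}")
```

The p-value is far below 0.05, yet the effect size is tiny; whether a shift that small matters is a subject-matter question the test cannot answer.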

## Summary: Hypothesis Testing Framework

1. **State hypotheses:** H₀ (null) and H₁ (alternative)
2. **Choose significance level:** α (typically 0.05)
3. **Compute test statistic:** Standardized measure of departure from H₀
4. **Calculate p-value:** Probability of data at least this extreme under H₀
5. **Make decision:** Reject H₀ if p-value < α; otherwise fail to reject
6. **Interpret:** Consider statistical significance, effect size, and practical importance

**Remember:** Failing to reject H₀ does not prove H₀ is true—it means insufficient evidence against it.
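The framework steps can be collected into one small function. This is a sketch only, assuming a two-sided one-sample t-test; `one_sample_t_test`, its returned keys, and `demo_sample` are hypothetical names introduced for illustration.

```python
import numpy as np
from scipy import stats

def one_sample_t_test(sample, mu_0, alpha=0.05):
    """Two-sided one-sample t-test of H0: mu = mu_0, following the framework steps."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    xbar = sample.mean()
    s = sample.std(ddof=1)
    t_stat = (xbar - mu_0) / (s / np.sqrt(n))        # step 3: test statistic
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # step 4: p-value
    return {
        "t": t_stat,
        "p": p_value,
        "reject": bool(p_value < alpha),             # step 5: decision
        "effect_size": (xbar - mu_0) / s,            # step 6: standardized effect
    }

rng = np.random.default_rng(5)  # illustrative seed
demo_sample = rng.normal(105, 15, size=50)
report = one_sample_t_test(demo_sample, mu_0=100)
print(report)
```

Returning the effect size next to the decision keeps step 6 in view, so a rejection is not read without its magnitude.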

## Key Takeaways

- **Hypothesis tests assess specific claims:** Unlike confidence intervals which estimate parameters, tests answer yes/no questions about parameter values.

- **Null hypothesis gets benefit of doubt:** We assume H₀ is true and require strong evidence (small p-value) to reject it. This asymmetry favors avoiding false positives.

- **Two types of errors are unavoidable:** Type I (false positive) and Type II (false negative). We control Type I error at α and try to maximize power (minimize Type II error).

- **P-values measure compatibility with H₀:** Small p-value means data are unlikely under H₀, providing evidence against it. Large p-value means data are consistent with H₀.

- **Statistical significance ≠ practical importance:** With large n, tiny meaningless effects can be statistically significant. Always consider effect size.

- **Power depends on effect size and sample size:** Larger effects and larger samples yield higher power to detect departures from H₀.

- **Tests and confidence intervals are dual:** Failing to reject H₀: μ = μ₀ is equivalent to μ₀ being in the confidence interval.

- **Failure to reject ≠ proof of H₀:** Absence of evidence is not evidence of absence; the test may simply lack the power to detect the effect.