# Chapter 8: Non-parametric and Robust Inference

**Core Goal:** Develop inference methods that do not require strong distributional assumptions and that remain valid when assumptions are violated.

**Motivation:** Classical parametric methods (t-tests, Analysis of Variance, linear regression) assume specific distributions (usually normal) and are sensitive to violations of these assumptions. Real data often contain outliers, have heavy tails, or arise from unknown distributions. Non-parametric methods make minimal assumptions about the underlying distribution, providing valid inference across a wide range of scenarios. Robust methods maintain good performance even when assumptions are violated. This chapter develops distribution-free and outlier-resistant inference procedures that work reliably in practice.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 8.1 Parametric versus Non-parametric Methods

**Parametric Method:** Assumes data follow a specific distribution family characterized by a finite number of parameters.

**Non-parametric Method:** Makes minimal or no assumptions about the underlying distribution. Also called distribution-free methods.

**Motivation:** Parametric methods are powerful when assumptions hold but can fail catastrophically when violated. Non-parametric methods sacrifice some efficiency under ideal conditions for validity across a broader range of scenarios. They are particularly valuable when: (1) sample sizes are too small to assess distributional assumptions, (2) data contain outliers or have heavy tails, (3) the underlying distribution is unknown or complex, or (4) we want inference that is robust to assumption violations.

In [None]:
# Example: Compare t-test (parametric) with Wilcoxon test (non-parametric)
np.random.seed(42)

In [None]:
# Normal data: Both methods work
normal_data = stats.norm(5, 2).rvs(20)

In [None]:
# H₀: median = 4 (or mean = 4 for t-test)
t_result = stats.ttest_1samp(normal_data, 4)
wilcoxon_result = stats.wilcoxon(normal_data - 4)

In [None]:
print(f"t-test p-value: {t_result.pvalue:.4f}")
print(f"Wilcoxon p-value: {wilcoxon_result.pvalue:.4f}")

**Key Tradeoff:** Non-parametric methods have slightly lower power (efficiency) than parametric methods when parametric assumptions hold, but remain valid when assumptions fail.

## 8.2 Sign Test

**Sign Test:** Tests whether the median of a distribution equals a specified value by counting how many observations exceed that value.

**Test Statistic:** $S = \#\{X_i > m_0\}$ where $m_0$ is the hypothesized median.

**Null Distribution:** Under $H_0: m = m_0$, $S \sim \text{Binomial}(n, 0.5)$

**Motivation:** The sign test is the simplest non-parametric test for location. It only uses the direction (sign) of deviations from the hypothesized median, not their magnitude. This makes it extremely robust to outliers and applicable to any continuous distribution. It requires no distributional assumptions beyond continuity. While it discards magnitude information (making it less powerful than alternatives), its simplicity and complete robustness make it valuable for quick checks and when outliers are present.

In [None]:
# Test H₀: median = 50 versus H₁: median ≠ 50
data = np.array([48, 52, 55, 49, 53, 51, 47, 54, 50, 56, 45, 58])

In [None]:
# S = number of observations > 50: Count positive signs
# Observations equal to the hypothesized median carry no sign, so they are dropped from n
hypothesized_median = 50
S = np.sum(data > hypothesized_median)
n = np.sum(data != hypothesized_median)
print(f"Test statistic S = {S} out of n = {n} non-tied observations")

In [None]:
# Under H₀, S ~ Binomial(n, 0.5): Calculate two-sided p-value (capped at 1)
p_value = min(1.0, 2 * min(stats.binom.cdf(S, n, 0.5), stats.binom.sf(S - 1, n, 0.5)))
print(f"Sign test p-value: {p_value:.4f}")

**Interpretation:** A p-value above 0.05 means there is insufficient evidence to reject $H_0$ at the 5% level. The sign test uses only the count of observations above versus below the hypothesized median, not their actual values.
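As a cross-check, the sign test can also be sketched with `scipy.stats.binomtest`, which computes the binomial p-value directly. The helper name `sign_test` is hypothetical, and dropping observations equal to $m_0$ is an assumed (though common) convention for handling ties.

```python
import numpy as np
from scipy import stats

def sign_test(x, m0):
    """Two-sided sign test for H0: median = m0 (hypothetical helper)."""
    x = np.asarray(x)
    n_pos = int(np.sum(x > m0))   # observations above m0
    n_neg = int(np.sum(x < m0))   # observations below m0
    n_eff = n_pos + n_neg         # observations equal to m0 are dropped
    # Under H0, n_pos ~ Binomial(n_eff, 0.5)
    p = stats.binomtest(n_pos, n_eff, p=0.5, alternative="two-sided").pvalue
    return n_pos, n_eff, p

sample = np.array([48, 52, 55, 49, 53, 51, 47, 54, 50, 56, 45, 58])
n_pos, n_eff, p = sign_test(sample, 50)
print(f"S = {n_pos} of {n_eff} non-tied observations, p-value = {p:.4f}")
```

One observation equals 50 exactly, so the effective sample size is 11 rather than 12.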

## 8.3 Wilcoxon Signed-Rank Test

**Wilcoxon Signed-Rank Test:** Tests whether the median (or center of symmetry) equals a specified value using both signs and ranks of deviations.

**Test Statistic:** $W^+ = \sum_{i=1}^n R_i^+ \cdot \mathbb{1}(X_i > m_0)$ where $R_i^+$ is the rank of $|X_i - m_0|$

**Assumptions:** Continuous distribution that is symmetric about its median.

**Motivation:** The Wilcoxon signed-rank test improves upon the sign test by incorporating magnitude information through ranks. It uses both the sign (positive or negative) and the rank (relative size) of deviations from the hypothesized median. This makes it more powerful than the sign test while maintaining robustness to outliers. Large deviations receive more weight than small deviations, but through ranks rather than raw values, providing outlier resistance. Under normality, it achieves about 95% of the efficiency of the t-test, making it an excellent default choice when distributional assumptions are uncertain.

In [None]:
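# Rank |Xᵢ - m₀|: Use ranks of absolute deviations from hypothesized median
# Zero differences are dropped before ranking (scipy's default zero_method='wilcox')
differences = data - hypothesized_median
nonzero_differences = differences[differences != 0]
ranks = stats.rankdata(np.abs(nonzero_differences))

In [None]:
# W⁺ = sum of ranks for positive differences
W_plus = np.sum(ranks[nonzero_differences > 0])
print(f"W⁺ = {W_plus}")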

In [None]:
# Wilcoxon test using scipy (handles ties and exact/asymptotic p-values)
wilcoxon_result = stats.wilcoxon(differences)
print(f"Wilcoxon test p-value: {wilcoxon_result.pvalue:.4f}")

**Comparison with t-test:**

In [None]:
t_result = stats.ttest_1samp(data, hypothesized_median)
print(f"t-test p-value: {t_result.pvalue:.4f} | Wilcoxon p-value: {wilcoxon_result.pvalue:.4f}")

**Efficiency:** For normal data, the Wilcoxon signed-rank test has an asymptotic relative efficiency of $3/\pi \approx 0.955$ compared to the t-test, so in large samples it needs only about 5% more data to reach the same power.
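The 5% figure follows from reading asymptotic relative efficiency as a limiting ratio of required sample sizes. Writing $n_t$ and $n_W$ (notation introduced here for illustration) for the sample sizes the t-test and the Wilcoxon test need to reach the same power:

$$\text{ARE}(W, t) = \lim \frac{n_t}{n_W} = \frac{3}{\pi} \approx 0.955 \quad\Longrightarrow\quad n_W \approx \frac{n_t}{0.955} \approx 1.047\, n_t$$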

## 8.4 Mann-Whitney U Test (Wilcoxon Rank-Sum Test)

**Mann-Whitney U Test:** Tests whether two independent samples come from the same distribution, against the alternative that one distribution is stochastically larger. When the two distributions have the same shape, this is a test for equal medians.

**Test Statistic:** $U = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \mathbb{1}(X_i > Y_j)$ counts how many times observations from sample 1 exceed observations from sample 2.

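**Equivalent Form:** $U = R_1 - \frac{n_1(n_1+1)}{2}$ where $R_1$ is the sum of the ranks of sample 1 in the combined ranking. The complementary count of pairs with $X_i < Y_j$ is $n_1n_2 - U = n_1n_2 + \frac{n_1(n_1+1)}{2} - R_1$.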
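The link between pair counting and rank sums can be checked numerically. In the sketch below (illustrative simulated data, no ties), the number of pairs with $X_i > Y_j$ equals $R_1 - \frac{n_1(n_1+1)}{2}$, and the number of pairs with $X_i < Y_j$ equals $n_1n_2 + \frac{n_1(n_1+1)}{2} - R_1$. Recent SciPy versions report the statistic for the first sample; older versions may report the other one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=8)    # sample 1 (illustrative data)
y = rng.normal(55, 10, size=10)   # sample 2 (illustrative data)
n1, n2 = len(x), len(y)

# Pair counting: number of (i, j) with x_i > y_j, via broadcasting
u_pairs = np.sum(x[:, None] > y[None, :])

# Rank-sum route: R1 is the sum of sample-1 ranks in the pooled ranking
ranks = stats.rankdata(np.concatenate([x, y]))
r1 = ranks[:n1].sum()
u_ranks = r1 - n1 * (n1 + 1) / 2

# Complementary count: number of pairs with x_i < y_j
u_other = n1 * n2 + n1 * (n1 + 1) / 2 - r1

u_scipy = stats.mannwhitneyu(x, y, alternative="two-sided").statistic
print(f"pairs: {u_pairs}, ranks: {u_ranks}, complement: {u_other}, scipy: {u_scipy}")
```

The two counts always sum to $n_1 n_2$ when there are no ties, so either one determines the other.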

**Motivation:** The Mann-Whitney test is the non-parametric analog of the two-sample t-test. It tests whether one distribution tends to produce larger values than another, without assuming normality. The test ranks all observations from both samples together, then checks if one group's ranks are systematically higher. This makes it robust to outliers and applicable to any continuous distributions with the same shape. It is particularly useful when comparing groups with skewed distributions, ordinal data, or when outliers are present.

In [None]:
# Compare two independent samples: Test if medians differ
group1 = stats.norm(50, 10).rvs(15)
group2 = stats.norm(55, 10).rvs(15)

In [None]:
# Mann-Whitney U test (non-parametric)
mw_result = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(f"Mann-Whitney U = {mw_result.statistic:.0f}, p-value = {mw_result.pvalue:.4f}")

In [None]:
# Compare with two-sample t-test (parametric)
t_result = stats.ttest_ind(group1, group2)
print(f"Two-sample t-test: t = {t_result.statistic:.3f}, p-value = {t_result.pvalue:.4f}")

### Demonstration with Outliers

In [None]:
# Add extreme outlier to group2: Test robustness to outliers
group2_contaminated = np.append(group2, 200)
group1_extended = np.append(group1, stats.norm(50, 10).rvs(1))

In [None]:
mw_outlier = stats.mannwhitneyu(group1_extended, group2_contaminated, alternative='two-sided')
t_outlier = stats.ttest_ind(group1_extended, group2_contaminated)

In [None]:
# Compare p-values before and after adding the outlier instead of asserting the outcome
print(f"Mann-Whitney p-value: {mw_result.pvalue:.4f} -> {mw_outlier.pvalue:.4f} with outlier")
print(f"t-test p-value: {t_result.pvalue:.4f} -> {t_outlier.pvalue:.4f} with outlier")

**Result:** Mann-Whitney test is robust to the extreme outlier, while t-test is affected. Ranks transform the outlier to just the highest rank, limiting its influence.

## 8.5 Kruskal-Wallis Test

**Kruskal-Wallis Test:** Non-parametric test for comparing medians across three or more independent groups.

**Test Statistic:** $H = \frac{12}{n(n+1)}\sum_{i=1}^k \frac{R_i^2}{n_i} - 3(n+1)$ where $R_i$ is sum of ranks for group $i$.

**Null Distribution:** Under $H_0$, $H$ approximately follows $\chi^2_{k-1}$ for large samples.
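To make the formula concrete, the sketch below computes $H$ by hand from pooled ranks and compares it with `stats.kruskal`. The data are simulated for illustration and contain no ties, so no tie correction is needed; the p-value uses the $\chi^2_{k-1}$ approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(mu, 10, size=12) for mu in (50, 55, 52)]  # illustrative data

pooled = np.concatenate(groups)
n = len(pooled)
ranks = stats.rankdata(pooled)

# Split the pooled ranks back into the original groups and sum them
split_points = np.cumsum([len(g) for g in groups])[:-1]
rank_sums = [r.sum() for r in np.split(ranks, split_points)]

# H = 12 / (n (n + 1)) * sum(R_i^2 / n_i) - 3 (n + 1)
h_manual = 12 / (n * (n + 1)) * sum(
    r_sum**2 / len(g) for r_sum, g in zip(rank_sums, groups)
) - 3 * (n + 1)
p_manual = stats.chi2.sf(h_manual, df=len(groups) - 1)

h_scipy, p_scipy = stats.kruskal(*groups)
print(f"manual: H = {h_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  H = {h_scipy:.4f}, p = {p_scipy:.4f}")
```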

**Motivation:** The Kruskal-Wallis test extends the Mann-Whitney test to multiple groups, serving as the non-parametric analog of one-way Analysis of Variance. It tests whether at least one group's distribution differs from the others by comparing the average ranks across groups. Like all rank-based tests, it is robust to outliers and does not require normality. It is particularly useful for comparing multiple groups with skewed distributions or when sample sizes are too small to verify normality assumptions.

In [None]:
# Compare three independent groups: Test if distributions differ
groupA = stats.norm(50, 10).rvs(12)
groupB = stats.norm(55, 10).rvs(12)
groupC = stats.norm(52, 10).rvs(12)

In [None]:
# Kruskal-Wallis H-test (non-parametric Analysis of Variance)
kw_result = stats.kruskal(groupA, groupB, groupC)
print(f"Kruskal-Wallis H = {kw_result.statistic:.3f}, p-value = {kw_result.pvalue:.4f}")

In [None]:
# Compare with one-way Analysis of Variance (parametric)
f_statistic, anova_pvalue = stats.f_oneway(groupA, groupB, groupC)
print(f"One-way Analysis of Variance: F = {f_statistic:.3f}, p-value = {anova_pvalue:.4f}")

**Post-hoc Comparisons:** If Kruskal-Wallis rejects, perform pairwise Mann-Whitney tests with Bonferroni correction to identify which groups differ.

In [None]:
# Pairwise comparisons with Bonferroni correction: α / number of comparisons
alpha_bonferroni = 0.05 / 3
print(f"\nPairwise comparisons (Bonferroni-corrected α = {alpha_bonferroni:.4f}):")

In [None]:
pairs = [('A', 'B', groupA, groupB), ('A', 'C', groupA, groupC), ('B', 'C', groupB, groupC)]
for name1, name2, g1, g2 in pairs:
    result = stats.mannwhitneyu(g1, g2, alternative='two-sided')
    print(f"{name1} vs {name2}: p = {result.pvalue:.4f}")

## 8.6 Rank Correlation

**Rank Correlation:** Measures association between two variables using ranks rather than raw values.

**Motivation:** Pearson correlation measures linear association, and the standard inference for it assumes bivariate normality. Rank correlations (Spearman and Kendall) are more general: they detect monotonic relationships (not just linear), are robust to outliers, and make no distributional assumptions. They are particularly useful for ordinal data, non-linear monotonic relationships, and when outliers are present.

### Spearman's Rank Correlation

**Spearman's ρ:** Pearson correlation computed on ranks. When there are no ties this reduces to $\rho_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$, where $d_i$ is the difference between the two ranks of observation $i$.

**Interpretation:** $\rho_s = 1$: perfect monotone increasing, $\rho_s = -1$: perfect monotone decreasing, $\rho_s = 0$: no monotone relationship.

**Motivation:** Spearman correlation measures how well the relationship between variables can be described by a monotonic function. It converts values to ranks, then computes ordinary correlation on these ranks. This makes it insensitive to the specific numerical relationship and robust to outliers in the tails.
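The three routes to Spearman's ρ (the $d_i$ formula, Pearson correlation on ranks, and `stats.spearmanr`) agree when there are no ties. The variable names `a` and `b` and the simulated data below are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(size=25)
b = a**3 + rng.normal(scale=0.5, size=25)   # monotone trend plus noise

rank_a, rank_b = stats.rankdata(a), stats.rankdata(b)
n = len(a)

# Route 1: closed-form expression in the rank differences (valid without ties)
d = rank_a - rank_b
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# Route 2: ordinary Pearson correlation applied to the ranks
rho_on_ranks = np.corrcoef(rank_a, rank_b)[0, 1]

# Route 3: SciPy
rho_scipy, _ = stats.spearmanr(a, b)

print(f"{rho_formula:.6f} {rho_on_ranks:.6f} {rho_scipy:.6f}")
```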

In [None]:
# Non-linear but monotonic relationship: y = x²
x = np.linspace(0, 10, 30)
y = x**2 + np.random.normal(0, 10, 30)

In [None]:
# Pearson correlation (measures linear relationship)
pearson_r, pearson_p = stats.pearsonr(x, y)
print(f"Pearson r = {pearson_r:.3f}, p-value = {pearson_p:.4f}")

In [None]:
# Spearman correlation (measures monotonic relationship)
spearman_rho, spearman_p = stats.spearmanr(x, y)
print(f"Spearman ρ = {spearman_rho:.3f}, p-value = {spearman_p:.4f}")

In [None]:
plt.scatter(x, y)
plt.xlabel('x'); plt.ylabel('y'); plt.title('Non-linear Monotonic Relationship')
plt.text(1, 80, f'Pearson r = {pearson_r:.2f}\nSpearman ρ = {spearman_rho:.2f}', fontsize=12)

**Result:** Spearman's ρ measures the strength of the monotonic trend directly, whereas Pearson's r measures only linear association and is pulled down by the curvature. The exact values depend on the simulated noise.

### Kendall's Tau

**Kendall's τ:** $\tau = \frac{\text{concordant pairs} - \text{discordant pairs}}{\binom{n}{2}}$

**Concordant Pair:** $(X_i, Y_i)$ and $(X_j, Y_j)$ where $(X_i - X_j)(Y_i - Y_j) > 0$

**Motivation:** Kendall's tau estimates the probability that two randomly chosen pairs agree in their ordering minus the probability that they disagree, which gives it a direct interpretation in terms of pairwise comparisons. It generally behaves better than Spearman's ρ in small samples. When the data contain ties, the tau-b variant (the default in `stats.kendalltau`) adjusts the denominator for them.
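A direct count of concordant and discordant pairs makes the definition concrete. This is a minimal sketch on simulated, tie-free data (the names `u` and `v` are illustrative); without ties, the result matches SciPy's default tau-b.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
u = rng.normal(size=15)
v = u + rng.normal(scale=0.8, size=15)

concordant = 0
discordant = 0
for i, j in combinations(range(len(u)), 2):
    s = (u[i] - u[j]) * (v[i] - v[j])
    if s > 0:
        concordant += 1   # the pair is ordered the same way in u and v
    elif s < 0:
        discordant += 1   # the pair is ordered in opposite ways

n_pairs = len(u) * (len(u) - 1) // 2   # C(n, 2)
tau_manual = (concordant - discordant) / n_pairs
tau_scipy, _ = stats.kendalltau(u, v)
print(f"manual tau = {tau_manual:.4f}, scipy tau = {tau_scipy:.4f}")
```

The double loop is $O(n^2)$, which is fine for illustration but slow for large samples.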

In [None]:
# Kendall tau (based on concordant/discordant pairs)
kendall_tau, kendall_p = stats.kendalltau(x, y)
print(f"Kendall τ = {kendall_tau:.3f}, p-value = {kendall_p:.4f}")

**Comparison:** Kendall's τ is typically smaller in magnitude than Spearman's ρ but has better small-sample properties and more direct interpretation.

## 8.7 Robustness Concepts

**Robust Estimator:** An estimator whose performance degrades gracefully under violations of assumptions or in the presence of outliers.

**Motivation:** Classical estimators (sample mean, least squares) are optimal under ideal conditions but can fail catastrophically with small amounts of contamination. Robust statistics provides estimators that maintain good performance across a broader range of conditions, accepting slight efficiency loss under ideal conditions for protection against worst-case scenarios. This reflects a practical reality: we rarely know the true distribution, outliers occur frequently, and catastrophic failure is often more costly than slight efficiency loss.

### Breakdown Point

**Breakdown Point:** The smallest fraction of contamination that can cause an estimator to take arbitrarily large values.

**Sample Mean:** Breakdown point = 1/n (a single extreme outlier can make mean arbitrarily large)

**Sample Median:** Breakdown point = 0.5 asymptotically (about half of the data must be contaminated to break the median)

**Motivation:** Breakdown point quantifies the worst-case robustness of an estimator. A breakdown point of 0.5 is the maximum possible: the estimator remains bounded as long as fewer than half of the observations are arbitrarily contaminated. The sample mean has the lowest possible breakdown point (1/n), making it extremely fragile. The median achieves the maximum breakdown point, providing excellent worst-case protection.

In [None]:
# Demonstrate breakdown points: Clean data
clean_data = stats.norm(50, 10).rvs(100)
print(f"Clean data: Mean = {np.mean(clean_data):.2f}, Median = {np.median(clean_data):.2f}")

In [None]:
# Add 5% contamination (5 extreme outliers): Test breakdown resistance
contaminated_data = np.append(clean_data[:95], [1000, 1000, 1000, 1000, 1000])
print(f"5% outliers: Mean = {np.mean(contaminated_data):.2f}, Median = {np.median(contaminated_data):.2f}")

In [None]:
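# Add 49% contamination: Mean is ruined, median stays within the range of the clean data
# (at exactly 50% the median of 100 points would average a clean value with an outlier)
heavily_contaminated = np.append(clean_data[:51], np.repeat(1000, 49))
print(f"49% outliers: Mean = {np.mean(heavily_contaminated):.2f}, Median = {np.median(heavily_contaminated):.2f}")

**Observation:** The mean is severely affected by even 5% contamination. The median barely moves at 5% and, although it drifts toward the upper end of the clean values under heavy contamination, it stays bounded as long as fewer than half of the observations are contaminated, which reflects its 0.5 breakdown point.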

### Trimmed Mean

**Trimmed Mean:** Mean computed after removing a specified percentage of extreme values from each tail.

**α-Trimmed Mean:** Remove α proportion from each tail, then compute mean of remaining data.

**Motivation:** The trimmed mean balances efficiency and robustness. It is more efficient than the median under normality (using more of the data) but more robust than the mean (protecting against extreme outliers by trimming). A 10% or 20% trimmed mean is often a good practical compromise. The breakdown point of an α-trimmed mean is α, so 20% trimming gives breakdown point 0.2.
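The definition can be sketched by hand and compared with `scipy.stats.trim_mean`. The helper `trimmed_mean` and the small data set are hypothetical; the number of points cut from each tail is taken as `int(alpha * n)`, which assumes `alpha < 0.5`.

```python
import numpy as np
from scipy import stats

def trimmed_mean(x, alpha):
    """Mean after dropping int(alpha * n) points from each tail (hypothetical helper)."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * len(x))           # number of points cut from each tail
    return x[k:len(x) - k].mean()

values = np.array([2.0, 4.0, 5.0, 5.0, 6.0, 7.0, 7.0, 8.0, 9.0, 250.0])

print(f"mean:             {values.mean():.3f}")
print(f"10% trimmed mean: {trimmed_mean(values, 0.1):.3f}")
print(f"scipy trim_mean:  {stats.trim_mean(values, 0.1):.3f}")
```

With ten observations and 10% trimming, one point is removed from each tail, so the value 250 no longer enters the average.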

In [None]:
from scipy.stats import trim_mean
# 10% trimmed mean: Remove 10% from each tail
trimmed_10 = trim_mean(contaminated_data, 0.1)
print(f"10% trimmed mean: {trimmed_10:.2f}")

In [None]:
# 20% trimmed mean: Remove 20% from each tail
trimmed_20 = trim_mean(contaminated_data, 0.2)
print(f"20% trimmed mean: {trimmed_20:.2f}")

In [None]:
print(f"\nComparison with 5% contamination:")
print(f"Mean: {np.mean(contaminated_data):.2f} | Median: {np.median(contaminated_data):.2f}")
print(f"10% Trimmed: {trimmed_10:.2f} | 20% Trimmed: {trimmed_20:.2f}")

**Result:** Trimmed means provide intermediate robustness between mean (fragile) and median (very robust), making them practical choices for real data.

### Median Absolute Deviation

**Median Absolute Deviation:** Robust scale estimator: $\text{MAD} = \text{median}(|X_i - \text{median}(X)|)$

**Normalized Median Absolute Deviation:** $\text{MAD}_n = 1.4826 \cdot \text{MAD}$ (consistent estimator of standard deviation under normality)

**Motivation:** Just as the median robustly estimates location, the Median Absolute Deviation robustly estimates scale (spread). The standard deviation is as fragile as the mean: a single outlier can make it arbitrarily large. The Median Absolute Deviation has breakdown point 0.5, making it highly resistant to outliers. The constant 1.4826 rescales it so that it is approximately equal to the standard deviation for normal data, allowing direct comparison.

In [None]:
# MAD = median(|Xᵢ - median(X)|): Robust scale estimator
mad = np.median(np.abs(contaminated_data - np.median(contaminated_data)))
mad_normalized = 1.4826 * mad

In [None]:
print(f"Standard Deviation: {np.std(contaminated_data, ddof=1):.2f} (affected by outliers)")
print(f"Normalized Median Absolute Deviation: {mad_normalized:.2f} (robust to outliers)")

**Application:** Use Median Absolute Deviation instead of standard deviation for outlier detection: $|X_i - \text{median}| > k \cdot \text{MAD}_n$ (typically $k=3$)
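The detection rule can be sketched as a small function. The helper `mad_outliers` and the example values are hypothetical; `stats.median_abs_deviation(..., scale="normal")` is SciPy's built-in normalized version and is shown only as a cross-check.

```python
import numpy as np
from scipy import stats

def mad_outliers(x, k=3.0):
    """Flag points with |x - median| > k * MAD_n (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    mad_n = 1.4826 * np.median(np.abs(x - center))   # normalized MAD
    return np.abs(x - center) > k * mad_n

values = np.array([9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0, 25.0])
flags = mad_outliers(values)

print("flagged:", values[flags])
print(f"normalized MAD (scipy): {stats.median_abs_deviation(values, scale='normal'):.4f}")
print(f"standard deviation:     {values.std(ddof=1):.4f}")
```

If more than half of the values are identical, the MAD is zero and this rule flags every other point, so it is worth checking for that case in practice.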

## 8.8 Bootstrap Methods

**Bootstrap:** A resampling method for estimating the sampling distribution of a statistic by repeatedly resampling from the observed data with replacement.

**Procedure:**
1. Draw sample $X^*$ of size $n$ with replacement from observed data $X$
2. Compute statistic $\hat{\theta}^*$ from $X^*$
3. Repeat steps 1-2 many times (typically 1000-10000)
4. Use distribution of $\hat{\theta}^*$ to approximate sampling distribution of $\hat{\theta}$
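
The four steps above can be sketched as a generic function that accepts any statistic. The name `bootstrap_distribution`, its parameters, and the simulated sample are assumptions made for illustration; the sketch uses NumPy's `default_rng` generator.

```python
import numpy as np

def bootstrap_distribution(x, statistic, n_resamples=2000, seed=0):
    """Return bootstrap replicates of `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):                         # step 3: repeat
        resample = rng.choice(x, size=n, replace=True)   # step 1: resample with replacement
        estimates[b] = statistic(resample)               # step 2: recompute the statistic
    return estimates                                     # step 4: approximate sampling distribution

sample = np.random.default_rng(42).normal(50, 10, size=30)   # illustrative data
boot = bootstrap_distribution(sample, np.median)
print(f"sample median = {np.median(sample):.2f}, bootstrap SE = {boot.std(ddof=1):.2f}")
```

Passing the statistic as a function keeps the resampling loop unchanged when switching between the median, a trimmed mean, or any other estimator.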

**Motivation:** The bootstrap provides a general-purpose method for quantifying uncertainty when theoretical formulas are unavailable or intractable. It makes minimal distributional assumptions, essentially treating the sample as a surrogate for the population. While not a substitute for exact theory when that is available, it is invaluable for complex estimators (trimmed mean, median, robust regression coefficients) where distributional theory is difficult. Because it resamples the observed data rather than a fitted model, this form of the bootstrap is non-parametric.

In [None]:
# Bootstrap Standard Error for trimmed mean
original_data = stats.norm(50, 10).rvs(30)
original_trimmed_mean = trim_mean(original_data, 0.1)

In [None]:
# Bootstrap resampling: Generate 5000 bootstrap samples
n_bootstrap = 5000
bootstrap_estimates = [trim_mean(np.random.choice(original_data, size=len(original_data), replace=True), 0.1) 
                       for _ in range(n_bootstrap)]

In [None]:
# Bootstrap Standard Error = standard deviation of bootstrap distribution
bootstrap_se = np.std(bootstrap_estimates, ddof=1)
print(f"Bootstrap Standard Error of 10% trimmed mean: {bootstrap_se:.2f}")

### Bootstrap Confidence Interval

**Percentile Method:** Use quantiles of bootstrap distribution as confidence interval.

**95% Confidence Interval:** $(\hat{\theta}^*_{0.025}, \hat{\theta}^*_{0.975})$ where $\hat{\theta}^*_\alpha$ is $\alpha$ quantile of bootstrap distribution.

**Motivation:** The percentile method provides a simple confidence interval from the empirical quantiles of the bootstrap distribution, with no normality assumption. It adapts to the shape of the sampling distribution, including skewness, although its coverage can be inaccurate in small samples or for strongly biased estimators.

In [None]:
# 95% percentile bootstrap Confidence Interval
bootstrap_ci = np.percentile(bootstrap_estimates, [2.5, 97.5])
print(f"95% Bootstrap Confidence Interval: [{bootstrap_ci[0]:.2f}, {bootstrap_ci[1]:.2f}]")

In [None]:
plt.hist(bootstrap_estimates, bins=50, density=True, alpha=0.7, edgecolor='black')
plt.axvline(original_trimmed_mean, color='r', linewidth=2, label='Original estimate')
plt.axvline(bootstrap_ci[0], color='g', linestyle='--', label='95% Confidence Interval')
plt.axvline(bootstrap_ci[1], color='g', linestyle='--')
plt.xlabel('Trimmed Mean'); plt.ylabel('Density'); plt.title('Bootstrap Distribution')
plt.legend()

## 8.9 Choosing Between Parametric and Non-parametric Methods

**Decision Framework:**

**Use Parametric Methods When:**
- Distributional assumptions are reasonable and verified
- Sample size is large enough to check assumptions
- Data are clean with no outliers
- Maximum efficiency is important
- Theoretical properties are needed (exact p-values, optimality)

**Use Non-parametric Methods When:**
- Distributional assumptions are uncertain or violated
- Sample size is too small to verify assumptions
- Data contain outliers or have heavy tails
- Robustness is more important than efficiency
- Data are ordinal or ranks are more meaningful than values

**Practical Compromise:**
- Report both parametric and non-parametric results
- If they agree, conclusions are robust
- If they disagree, investigate why (outliers? non-normality?)
- Use robust methods (trimmed mean, bootstrap) as middle ground
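
The "report both" strategy can be sketched as a small helper that runs a parametric and a rank-based test side by side. The function name `compare_two_samples`, the 0.05 threshold, and the simulated data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def compare_two_samples(a, b, alpha=0.05):
    """Run a t-test and a Mann-Whitney test on the same data (hypothetical helper)."""
    t_p = stats.ttest_ind(a, b).pvalue
    mw_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
    # Agreement means both tests land on the same side of alpha
    agree = bool((t_p < alpha) == (mw_p < alpha))
    return {"t_test_p": float(t_p), "mann_whitney_p": float(mw_p), "agree": agree}

rng = np.random.default_rng(7)
a = rng.normal(50, 10, size=20)
b = np.append(rng.normal(58, 10, size=19), 300.0)   # one extreme outlier in b

report = compare_two_samples(a, b)
print(report)
if not report["agree"]:
    print("Tests disagree: check for outliers or non-normality before choosing a method")
```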

## Summary: Non-parametric and Robust Inference Framework

**Non-parametric tests provide distribution-free inference:**
- **Sign test:** Uses only signs of deviations (most robust, least powerful)
- **Wilcoxon signed-rank:** Uses signs and ranks (robust and reasonably powerful)
- **Mann-Whitney:** Two-sample comparison using ranks (robust alternative to t-test)
- **Kruskal-Wallis:** Multiple-group comparison using ranks (robust alternative to Analysis of Variance)
- **Rank correlations:** Spearman and Kendall detect monotonic relationships (robust to outliers and non-linearity)

**Robust estimators resist outliers:**
- **Median:** Breakdown point 0.5 (maximum possible)
- **Trimmed mean:** Balances efficiency and robustness
- **Median Absolute Deviation:** Robust scale estimator

**Bootstrap provides general-purpose uncertainty quantification:**
- Works for any estimator without requiring theoretical formulas
- Naturally non-parametric and distribution-free
- Provides Standard Errors and confidence intervals through resampling

**Key tradeoff:** Efficiency (parametric) versus robustness (non-parametric). Practical strategy: verify assumptions when possible, use robust methods when uncertain.

## Key Takeaways

- **Non-parametric methods sacrifice efficiency for validity:** They are slightly less powerful under ideal conditions but remain valid when assumptions fail, making them safer choices when distributional assumptions are uncertain.

- **Ranks provide natural robustness:** Converting values to ranks automatically limits the influence of outliers, as extreme values become just the highest or lowest ranks.

- **Breakdown point quantifies worst-case robustness:** Sample mean has breakdown point 1/n (fragile), sample median has breakdown point 0.5 (maximally robust). This explains why median is preferred for data with outliers.

- **Trimmed mean offers practical compromise:** It is more efficient than median under normality but more robust than mean against outliers. A 10-20% trim is often a good default choice.

- **Rank correlations detect monotonic relationships:** Spearman and Kendall correlations measure monotonic association (not just linear), making them more general than Pearson correlation and robust to outliers.

- **Bootstrap provides distribution-free uncertainty quantification:** When theoretical formulas are unavailable or intractable, bootstrap resampling provides empirical Standard Errors and confidence intervals with minimal assumptions.

- **Practical strategy: Report both parametric and non-parametric results:** If they agree, conclusions are robust. If they disagree, investigate the cause (outliers, non-normality) and use this information to choose the appropriate method.