# Chapter 4: Method of Moments and Other Estimation Methods

**Core Goal:** Estimate parameters by matching sample moments to population moments.

**Motivation:** Maximum Likelihood Estimation is powerful but sometimes requires complex optimization and strong distributional assumptions. Method of Moments provides a simpler alternative: equate sample moments (mean, variance, etc.) to their population counterparts and solve for parameters. This approach often yields closed-form estimators that are easy to compute and reasonably effective. While generally less efficient than Maximum Likelihood Estimation, Method of Moments serves as a quick alternative and provides starting values for numerical Maximum Likelihood optimization.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 4.1 Moments of a Distribution

**k-th Moment:** $\mu_k = E[X^k] = \int x^k f(x)dx$ for continuous random variable

**k-th Central Moment:** $\mu_k' = E[(X - \mu)^k]$ where $\mu = E[X]$

**Common moments:**
- **First moment:** $\mu_1 = E[X]$ (mean)
- **Second moment:** $\mu_2 = E[X^2]$
- **Second central moment:** $\mu_2' = E[(X-\mu)^2] = \sigma^2$ (variance)

**Motivation:** Moments summarize key distributional characteristics. The first few moments often uniquely determine a distribution's parameters, providing a pathway to estimation.

In [None]:
# Normal distribution N(μ=5, σ²=4)
dist = stats.norm(5, 2)

In [None]:
# E[X] = μ: First moment equals mean
first_moment = dist.mean()

In [None]:
# E[X²] = μ² + σ²: Second moment for normal distribution
second_moment = dist.mean()**2 + dist.var()

In [None]:
print(f"First moment E[X] = {first_moment}")
print(f"Second moment E[X²] = {second_moment}")
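
The integral definition of a moment can also be checked numerically. The sketch below integrates $x^k f(x)$ with `scipy.integrate.quad` and compares the result with the frozen distribution's own `moment` method. The helper name `raw_moment` and the variable names are assumptions made for this illustration.

```python
import numpy as np
from scipy import integrate, stats

def raw_moment(dist, k):
    """Approximate E[X^k] by numerically integrating x^k * f(x) over the real line."""
    value, _abs_err = integrate.quad(lambda x: x**k * dist.pdf(x), -np.inf, np.inf)
    return value

normal_dist = stats.norm(5, 2)             # N(mu=5, sigma^2=4)
first = raw_moment(normal_dist, 1)         # should be close to mu = 5
second = raw_moment(normal_dist, 2)        # should be close to mu^2 + sigma^2 = 29
central_second = second - first**2         # variance recovered from the raw moments

print(f"E[X] ≈ {first:.4f}, E[X²] ≈ {second:.4f}, Var(X) ≈ {central_second:.4f}")
print(f"Built-in second moment: {normal_dist.moment(2):.4f}")
```

The last line of the computation uses the identity $\sigma^2 = E[X^2] - (E[X])^2$, which links raw and central second moments.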

## 4.2 Sample Moments

**k-th Sample Moment:** $m_k = \frac{1}{n}\sum_{i=1}^n X_i^k$

**Properties:**
- $m_1 = \bar{X}$ (sample mean)
- $m_2 = \frac{1}{n}\sum_{i=1}^n X_i^2$
- By the Law of Large Numbers: $m_k \xrightarrow{P} \mu_k$ as $n \to \infty$, provided $E|X|^k < \infty$

**Motivation:** Sample moments consistently estimate population moments. This convergence provides the foundation for Method of Moments estimation.

In [None]:
np.random.seed(42)
data = stats.norm(5, 2).rvs(100)

In [None]:
# m₁ = (1/n)ΣXᵢ: First sample moment (sample mean)
m1 = np.mean(data)

In [None]:
# m₂ = (1/n)ΣXᵢ²: Second sample moment
m2 = np.mean(data**2)

In [None]:
print(f"First sample moment m₁ = {m1:.3f} (true E[X] = 5)")
print(f"Second sample moment m₂ = {m2:.3f} (true E[X²] = 29)")
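
To see the Law of Large Numbers at work, the following sketch computes the first two sample moments for growing sample sizes. The sample sizes, the seed, and the use of `np.random.default_rng` are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
true_m1, true_m2 = mu, mu**2 + sigma**2    # population moments: 5 and 29

moment_errors = {}
for size in [100, 10_000, 1_000_000]:
    x = rng.normal(mu, sigma, size=size)
    m1_n, m2_n = np.mean(x), np.mean(x**2)  # first and second sample moments
    moment_errors[size] = (abs(m1_n - true_m1), abs(m2_n - true_m2))
    print(f"n = {size:>9,}: m1 = {m1_n:.4f}, m2 = {m2_n:.4f}")
```

The absolute errors stored in `moment_errors` should shrink roughly like $1/\sqrt{n}$, though any single run is subject to sampling noise.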

## 4.3 Method of Moments Estimation

**Method of Moments Principle:** Equate sample moments to population moments and solve for parameters.

**For k parameters:** Set $m_j = \mu_j(\theta_1, ..., \theta_k)$ for $j = 1, ..., k$ and solve the system of k equations.

**Motivation:** If population moments are functions of parameters, matching sample moments to population moments provides k equations in k unknowns. Solving this system yields Method of Moments estimators. This approach is intuitive, often yields closed-form solutions, and requires minimal distributional assumptions.
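
When the moment equations cannot be solved by hand, the same system can be handed to a numerical root finder. The sketch below is one way to do this with `scipy.optimize.root`. The function `solve_moment_equations` and its signature are hypothetical, and the gamma raw moments $E[X] = \alpha/\beta$ and $E[X^2] = \alpha(\alpha+1)/\beta^2$ serve only as a test case with a known answer.

```python
import numpy as np
from scipy import optimize

def solve_moment_equations(population_moments, sample_moments, theta0):
    """Solve mu_j(theta) = m_j, j = 1..k, for theta with a numerical root finder."""
    target = np.asarray(sample_moments, dtype=float)

    def residual(theta):
        return np.asarray(population_moments(theta), dtype=float) - target

    solution = optimize.root(residual, x0=np.asarray(theta0, dtype=float))
    if not solution.success:
        raise RuntimeError(solution.message)
    return solution.x

def gamma_raw_moments(theta):
    """First two raw moments of Gamma(alpha, beta) in the rate parameterization."""
    alpha, beta = theta
    return [alpha / beta, alpha * (alpha + 1) / beta**2]

# Feeding in the exact moments of Gamma(alpha=3, beta=2) should recover (3, 2).
theta_hat = solve_moment_equations(gamma_raw_moments, [1.5, 3.0], theta0=[1.0, 1.0])
print(theta_hat)
```

With real data, `sample_moments` would be replaced by $m_1, \dots, m_k$ computed from the sample; a poor starting point `theta0` can make the root finder fail, so closed-form solutions remain preferable when they exist.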

### Example: Normal Distribution

**Population moments:** $E[X] = \mu$, $E[X^2] = \mu^2 + \sigma^2$

**Method of Moments equations:**
1. $m_1 = \mu$
2. $m_2 = \mu^2 + \sigma^2$

**Solutions:**
- $\hat{\mu}_{MM} = m_1 = \bar{X}$
- $\hat{\sigma}^2_{MM} = m_2 - m_1^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$
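
The last equality in the variance estimator follows from expanding the square and using $\frac{1}{n}\sum X_i = \bar{X} = m_1$:

$$
\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2
= \frac{1}{n}\sum_{i=1}^n X_i^2 - 2\bar{X}\cdot\frac{1}{n}\sum_{i=1}^n X_i + \bar{X}^2
= m_2 - 2m_1^2 + m_1^2
= m_2 - m_1^2
$$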

In [None]:
# μ̂ₘₘ = m₁: Method of Moments estimator for mean
mu_mm = m1

In [None]:
# σ̂²ₘₘ = m₂ - m₁²: Method of Moments estimator for variance
sigma_sq_mm = m2 - m1**2

In [None]:
print(f"Method of Moments estimates: μ̂ = {mu_mm:.3f}, σ̂² = {sigma_sq_mm:.3f}")
print(f"True parameters: μ = 5, σ² = 4")

**Note:** For normal distribution, Method of Moments estimators match Maximum Likelihood Estimators.
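
This agreement can be checked numerically. As a sketch, `scipy.stats.norm.fit` returns maximum likelihood estimates of the location and scale, which can be compared with the moment estimates. The variable names and the seed are chosen for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(5, 2, size=100)

# Method of Moments: mu_hat = m1, sigma2_hat = m2 - m1^2
mu_hat_mm = np.mean(sample)
sigma2_hat_mm = np.mean(sample**2) - mu_hat_mm**2

# Maximum likelihood fit; norm.fit returns (loc, scale) = (mu_hat, sigma_hat)
mu_hat_mle, sigma_hat_mle = stats.norm.fit(sample)

print(f"MoM: mu = {mu_hat_mm:.4f}, sigma^2 = {sigma2_hat_mm:.4f}")
print(f"MLE: mu = {mu_hat_mle:.4f}, sigma^2 = {sigma_hat_mle**2:.4f}")
```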

### Example: Exponential Distribution

**Distribution:** $X \sim \text{Exp}(\lambda)$ with density $f(x) = \lambda e^{-\lambda x}$

**Population moment:** $E[X] = 1/\lambda$

**Method of Moments equation:** $m_1 = 1/\lambda$

**Method of Moments estimator:** $\hat{\lambda}_{MM} = 1/m_1 = 1/\bar{X}$

**Note:** Matches Maximum Likelihood Estimator.
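
To see why the two estimators coincide, the log-likelihood of an i.i.d. exponential sample can be maximized directly:

$$
\ell(\lambda) = \sum_{i=1}^n \log\left(\lambda e^{-\lambda X_i}\right) = n\log\lambda - \lambda\sum_{i=1}^n X_i,
\qquad
\ell'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n X_i = 0
\;\Longrightarrow\;
\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar{X}}
$$

Since $\ell''(\lambda) = -n/\lambda^2 < 0$, this stationary point is a maximum.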

In [None]:
true_lambda = 2
exp_data = stats.expon(scale=1/true_lambda).rvs(100)

In [None]:
# λ̂ₘₘ = 1/X̄: Method of Moments estimator for exponential rate
lambda_mm = 1 / np.mean(exp_data)

In [None]:
print(f"Method of Moments estimate: λ̂ₘₘ = {lambda_mm:.3f}")
print(f"True λ = {true_lambda}")

### Example: Gamma Distribution

**Distribution:** $X \sim \text{Gamma}(\alpha, \beta)$

**Population moments:** $E[X] = \alpha/\beta$, $\text{Var}(X) = \alpha/\beta^2$

**Method of Moments equations:**
1. $m_1 = \alpha/\beta$
2. $m_2 - m_1^2 = \alpha/\beta^2$

**Method of Moments estimators:**
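- $\hat{\alpha}_{MM} = m_1^2 / (m_2 - m_1^2) = \bar{X}^2 / s^2$
- $\hat{\beta}_{MM} = m_1 / (m_2 - m_1^2) = \bar{X} / s^2$

Here $s^2$ denotes the divisor-$n$ sample variance, $s^2 = m_2 - m_1^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$, not the unbiased divisor-$(n-1)$ version.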

In [None]:
true_alpha, true_beta = 3, 2
gamma_data = stats.gamma(true_alpha, scale=1/true_beta).rvs(200)

In [None]:
mean_gamma = np.mean(gamma_data)
# Divisor n (ddof=0) so that var_gamma equals m₂ - m₁², as in the moment equations
var_gamma = np.var(gamma_data)

In [None]:
# α̂ₘₘ = X̄²/s²: Method of Moments estimator for shape parameter
alpha_mm = mean_gamma**2 / var_gamma

In [None]:
# β̂ₘₘ = X̄/s²: Method of Moments estimator for rate parameter
beta_mm = mean_gamma / var_gamma

In [None]:
print(f"Method of Moments estimates: α̂ₘₘ = {alpha_mm:.3f}, β̂ₘₘ = {beta_mm:.3f}")
print(f"True parameters: α = {true_alpha}, β = {true_beta}")
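
The gamma shape parameter has no closed-form maximum likelihood estimator, so this is a natural place to use moment estimates as starting values for a numerical optimizer. The sketch below writes out the negative log-likelihood and minimizes it with `scipy.optimize.minimize`. The function name `gamma_neg_loglik`, the Nelder-Mead method, the sample size, and the seed are all choices made for this illustration.

```python
import numpy as np
from scipy import optimize, special

def gamma_neg_loglik(params, x):
    """Negative log-likelihood of Gamma(alpha, beta) in the rate parameterization."""
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf                      # outside the parameter space
    n_obs = x.size
    loglik = (n_obs * alpha * np.log(beta) - n_obs * special.gammaln(alpha)
              + (alpha - 1) * np.sum(np.log(x)) - beta * np.sum(x))
    return -loglik

rng = np.random.default_rng(7)
sample = rng.gamma(shape=3.0, scale=1 / 2.0, size=2000)   # true alpha = 3, beta = 2

# Method of Moments estimates serve as the starting point
xbar_g, var_g = np.mean(sample), np.var(sample)
start = np.array([xbar_g**2 / var_g, xbar_g / var_g])

fit = optimize.minimize(gamma_neg_loglik, x0=start, args=(sample,), method="Nelder-Mead")
alpha_mle, beta_mle = fit.x

print(f"MoM start: alpha = {start[0]:.3f}, beta = {start[1]:.3f}")
print(f"MLE:       alpha = {alpha_mle:.3f}, beta = {beta_mle:.3f}")
```

Because the starting point is already close to the optimum, the optimizer typically needs few iterations; a fixed arbitrary start would also work here but can be fragile for harder likelihoods.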

## 4.4 Properties of Method of Moments Estimators

**Consistency:** Method of Moments estimators are consistent under mild regularity conditions.

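**Proof sketch:** By the Law of Large Numbers, $m_k \xrightarrow{P} \mu_k$. If the estimator is a continuous function of the sample moments, the continuous mapping theorem gives $\hat{\theta}_{MM} \xrightarrow{P} \theta$.

**Asymptotic Normality:** Method of Moments estimators are asymptotically normal: under suitable moment and smoothness conditions, $\sqrt{n}(\hat{\theta}_{MM} - \theta)$ converges in distribution to a normal law (Central Limit Theorem plus the delta method).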

**Efficiency:** Method of Moments estimators are generally NOT asymptotically efficient (do not achieve Cramér-Rao Lower Bound).

**Motivation:** Method of Moments estimators have good large-sample properties but are typically less efficient than Maximum Likelihood Estimators. They provide quick, reasonable estimates when Maximum Likelihood Estimation is difficult.
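
Asymptotic normality can be illustrated by simulation. For the exponential rate estimator $\hat{\lambda}_{MM} = 1/\bar{X}$, the delta method gives an approximate variance of $\lambda^2/n$, so the standardized estimates $\sqrt{n}(\hat{\lambda}_{MM} - \lambda)/\lambda$ should look roughly standard normal. The replication count, sample size, and seed below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(123)
lam, n, reps = 2.0, 500, 5000

# Each row is one simulated sample of size n from Exp(rate=lam)
samples = rng.exponential(scale=1 / lam, size=(reps, n))
lam_hat = 1 / samples.mean(axis=1)

# Standardize using the delta-method variance lam^2 / n
z = np.sqrt(n) * (lam_hat - lam) / lam
coverage = np.mean(np.abs(z) < 1.96)

print(f"mean(z) = {z.mean():.3f}, std(z) = {z.std():.3f}")
print(f"share with |z| < 1.96: {coverage:.3f}")
```

A small positive mean of `z` is expected, because $1/\bar{X}$ is biased upward in finite samples even though it is consistent.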

### Demonstrating Consistency

**As sample size increases, Method of Moments estimator converges to true parameter.**

In [None]:
sample_sizes = [20, 50, 100, 300, 1000]
np.random.seed(123)

In [None]:
# Generate Method of Moments estimates for increasing sample sizes
mm_estimates = [1/np.mean(stats.expon(scale=1/true_lambda).rvs(n)) for n in sample_sizes]

In [None]:
plt.plot(sample_sizes, mm_estimates, 'o-', markersize=8)
plt.axhline(true_lambda, color='r', linestyle='--', label='True λ')
plt.xlabel('Sample Size n'); plt.ylabel('λ̂ₘₘ')
plt.title('Consistency of Method of Moments Estimator'); plt.legend()

## 4.5 Comparing Maximum Likelihood Estimation and Method of Moments

**Maximum Likelihood Estimation advantages:**
1. Asymptotically efficient (attains the Cramér-Rao Lower Bound as $n \to \infty$ under regularity conditions)
2. Invariance property
3. Theoretically principled

**Method of Moments advantages:**
1. Often closed-form solutions
2. Simpler to compute
3. Requires weaker distributional assumptions
4. Good starting values for Maximum Likelihood numerical optimization

**Motivation:** Maximum Likelihood Estimation is theoretically superior but computationally more demanding. Method of Moments provides quick, reasonable estimates and serves as a practical complement to Maximum Likelihood Estimation.

### Efficiency Comparison: Normal Variance

**For normal distribution with known mean:**
- Maximum Likelihood Estimator: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum(X_i - \mu)^2$
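- Method of Moments: Same as Maximum Likelihood Estimator when the second central moment $E[(X-\mu)^2] = \sigma^2$ is matched (matching the raw second moment instead gives $m_2 - \mu^2$, which differs in finite samples)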

**For normal distribution with unknown mean:**
- Maximum Likelihood Estimator: $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum(X_i - \bar{X})^2$ (biased)
- Method of Moments: Same as Maximum Likelihood Estimator

**In this case, both methods agree.**
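
The two methods do not always agree. A standard textbook case, not among the examples above, is $X \sim \text{Uniform}(0, \theta)$: since $E[X] = \theta/2$, the moment estimator is $2\bar{X}$, while the maximum likelihood estimator is the sample maximum. The simulation below is a sketch comparing their mean squared errors, with $\theta$, the sample size, and the seed chosen arbitrarily. Note that this model is non-regular, so the usual Cramér-Rao theory does not apply to it.

```python
import numpy as np

rng = np.random.default_rng(2024)
theta, n, reps = 10.0, 50, 10_000

samples = rng.uniform(0, theta, size=(reps, n))
theta_mm = 2 * samples.mean(axis=1)     # solves m1 = theta / 2
theta_mle = samples.max(axis=1)         # the largest observation maximizes the likelihood

mse_mm = np.mean((theta_mm - theta) ** 2)
mse_mle = np.mean((theta_mle - theta) ** 2)

print(f"MSE (Method of Moments):  {mse_mm:.4f}")
print(f"MSE (Maximum Likelihood): {mse_mle:.4f}")
```

The moment estimator is unbiased with variance $\theta^2/(3n)$, whereas the maximum is biased downward but concentrates around $\theta$ much faster, which is why its mean squared error comes out smaller in this setting.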

## 4.6 Bayesian Estimation: A Different Paradigm

**Bayesian Approach:** Treat parameter $\theta$ as random variable with prior distribution $p(\theta)$.

**Bayes' Theorem:** $p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)} \propto p(x | \theta) p(\theta)$

**Posterior Distribution:** $p(\theta | x)$ combines prior beliefs with data likelihood.

**Point Estimates:**
- **Posterior mean:** $\hat{\theta}_{Bayes} = E[\theta | x]$
- **Maximum a posteriori (MAP):** $\hat{\theta}_{MAP} = \arg\max_\theta p(\theta | x)$

**Motivation:** Bayesian estimation incorporates prior information about parameters and provides full posterior distributions rather than just point estimates. This approach is philosophically different from frequentist methods (Maximum Likelihood Estimation, Method of Moments) but often yields similar results with uninformative priors.

### Example: Bayesian Estimation of Normal Mean

**Model:** $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known

**Prior:** $\mu \sim N(\mu_0, \tau^2)$

**Posterior:** $\mu | x \sim N\left(\frac{\tau^2 n \bar{x} + \sigma^2 \mu_0}{\tau^2 n + \sigma^2}, \frac{\tau^2 \sigma^2}{\tau^2 n + \sigma^2}\right)$

**Posterior mean:** $\hat{\mu}_{Bayes} = \frac{\tau^2 n}{\tau^2 n + \sigma^2}\bar{x} + \frac{\sigma^2}{\tau^2 n + \sigma^2}\mu_0$

**Interpretation:** Weighted average of data (sample mean) and prior (prior mean).

In [None]:
# Data from N(μ=5, σ²=4)
data_bayes = stats.norm(5, 2).rvs(20)

In [None]:
mu_0, tau_sq, sigma_sq = 0, 100, 4  # Prior: N(0, 100), known variance
n = len(data_bayes); xbar = np.mean(data_bayes)

In [None]:
# μ̂_Bayes = wX̄ + (1-w)μ₀: Weighted average of data and prior
weight_data = (tau_sq * n) / (tau_sq * n + sigma_sq)

In [None]:
mu_bayes = weight_data * xbar + (1 - weight_data) * mu_0
print(f"Bayesian estimate (posterior mean): {mu_bayes:.3f}")

In [None]:
print(f"Maximum Likelihood Estimator (sample mean): {xbar:.3f}")
print(f"Prior mean: {mu_0}")

**Note:** With weak prior (large $\tau^2$), Bayesian estimate approaches Maximum Likelihood Estimator.
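
The effect of the prior variance can be made concrete by evaluating the posterior mean formula for several values of $\tau^2$. The function name `posterior_mean_normal` and the numbers used are illustrative assumptions.

```python
def posterior_mean_normal(xbar, n, sigma_sq, mu_0, tau_sq):
    """Posterior mean of mu for N(mu, sigma_sq) data with a N(mu_0, tau_sq) prior."""
    w = (tau_sq * n) / (tau_sq * n + sigma_sq)   # weight placed on the sample mean
    return w * xbar + (1 - w) * mu_0

xbar_demo, n_demo, sigma_sq_demo, mu_0_demo = 5.0, 20, 4.0, 0.0

posterior_means = {}
for tau_sq_demo in [0.01, 0.1, 1.0, 100.0]:
    posterior_means[tau_sq_demo] = posterior_mean_normal(
        xbar_demo, n_demo, sigma_sq_demo, mu_0_demo, tau_sq_demo
    )
    print(f"tau^2 = {tau_sq_demo:>6}: posterior mean = {posterior_means[tau_sq_demo]:.3f}")
```

A tight prior (small $\tau^2$) pulls the estimate toward $\mu_0 = 0$, while a diffuse prior leaves it close to the sample mean of 5.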

## 4.7 Robust Estimation

**Robust Estimator:** Estimator whose performance does not degrade severely under violations of assumptions or presence of outliers.

**Breakdown Point:** Fraction of outliers an estimator can tolerate before giving arbitrarily bad results.

**Examples:**
- **Sample mean:** Breakdown point = 0% (single outlier can ruin it)
- **Sample median:** Breakdown point = 50% (can tolerate up to half outliers)
- **Trimmed mean:** Breakdown point = $\alpha$, the fraction trimmed from each tail

**Motivation:** Real data often contain outliers or violate model assumptions. Robust estimators sacrifice some efficiency under ideal conditions for stability under violations. This tradeoff is often worthwhile in practice.
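
The breakdown point can be illustrated by replacing a growing share of a sample with a wildly wrong value and watching when each estimator fails. The data, the outlier value, and the contamination levels below are made up for this sketch.

```python
import numpy as np

clean = np.linspace(45, 55, 21)      # 21 evenly spaced values centered at 50
outlier_value = 1e6

results = {}
for n_bad in [0, 1, 5, 10, 11]:
    contaminated = clean.copy()
    contaminated[:n_bad] = outlier_value   # overwrite the n_bad smallest values
    results[n_bad] = (np.mean(contaminated), np.median(contaminated))
    print(f"{n_bad:>2} of 21 replaced: mean = {results[n_bad][0]:>12.1f}, "
          f"median = {results[n_bad][1]:>10.1f}")
```

The mean is ruined by a single replaced value, while the median stays within the range of the clean data until more than half of the observations (11 of 21) are replaced.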

### Demonstrating Lack of Robustness: Sample Mean

**Single outlier dramatically affects sample mean but not median.**

In [None]:
clean_data = stats.norm(50, 5).rvs(20)
contaminated_data = np.append(clean_data, 500)  # Add extreme outlier

In [None]:
# X̄: Sample mean (not robust to outliers)
mean_clean = np.mean(clean_data); mean_contaminated = np.mean(contaminated_data)

In [None]:
# Median: Robust to outliers (50% breakdown point)
median_clean = np.median(clean_data); median_contaminated = np.median(contaminated_data)

In [None]:
print(f"Clean data - Mean: {mean_clean:.2f}, Median: {median_clean:.2f}")
print(f"With outlier - Mean: {mean_contaminated:.2f}, Median: {median_contaminated:.2f}")

**Result:** Adding the single outlier shifts the mean by about $(500 - \bar{X})/21 \approx 21$ units, while the median typically moves by less than 1 unit.

### Trimmed Mean: Compromise Between Efficiency and Robustness

**Trimmed Mean:** Remove fraction of smallest and largest observations, then average.

**α-trimmed mean:** Remove α fraction from each tail, average the middle $(1-2\alpha)$ fraction.

**Properties:**
- More robust than mean (positive breakdown point)
- More efficient than median (uses more data)
- Compromise between robustness and efficiency
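
The definition translates almost directly into code. The helper `alpha_trimmed_mean` below is a hypothetical name for a minimal sketch; it drops $\lfloor \alpha n \rfloor$ observations from each tail, which is also how `scipy.stats.trim_mean` chooses the cut. The toy data are invented for the example.

```python
import numpy as np
from scipy import stats

def alpha_trimmed_mean(x, alpha):
    """Average the observations left after dropping floor(alpha * n) from each tail."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(alpha * x.size)               # number of points removed per tail
    if 2 * k >= x.size:
        raise ValueError("alpha is too large: no observations remain")
    return x[k:x.size - k].mean()

values = np.array([48, 52, 47, 50, 51, 49, 53, 50, 46, 54, 500], dtype=float)
manual = alpha_trimmed_mean(values, 0.1)
library = stats.trim_mean(values, 0.1)

print(f"Manual 10% trimmed mean: {manual:.3f}")
print(f"scipy trim_mean:         {library:.3f}")
```

With 11 observations and $\alpha = 0.1$, one value is removed from each end, so the outlier 500 and the smallest value 46 are both discarded before averaging.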

In [None]:
# 10% trimmed mean: Remove 10% from each tail
trimmed_mean = stats.trim_mean(contaminated_data, 0.1)

In [None]:
print(f"Mean: {mean_contaminated:.2f}")
print(f"10% Trimmed mean: {trimmed_mean:.2f}")
print(f"Median: {median_contaminated:.2f}")

## 4.8 M-Estimators

**M-Estimator:** Generalization of Maximum Likelihood Estimation where we minimize $\sum \rho(X_i, \theta)$ over $\theta$ for some loss function $\rho$.

**Maximum Likelihood Estimation:** Special case with $\rho(x, \theta) = -\log f(x; \theta)$

**Huber M-Estimator:** Combines least squares (quadratic) for small residuals with absolute deviation (linear) for large residuals.

**Motivation:** M-estimators provide a framework for robust estimation. By choosing $\rho$ to downweight outliers, we can achieve robustness while maintaining reasonable efficiency. Huber's method is a popular compromise.
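
As a concrete sketch, the Huber location estimate can be found by minimizing the summed Huber loss of the residuals $X_i - \theta$. The threshold `delta = 1.5`, the toy data, and the use of `scipy.optimize.minimize_scalar` are assumptions of this example. In practice the residuals are usually rescaled by a robust scale estimate first, a step that is skipped here.

```python
import numpy as np
from scipy import optimize

def huber_loss(r, delta):
    """Quadratic for |r| <= delta, linear beyond that."""
    abs_r = np.abs(r)
    return np.where(abs_r <= delta, 0.5 * r**2, delta * (abs_r - 0.5 * delta))

def huber_location(x, delta=1.5):
    """Location estimate minimizing sum_i huber_loss(x_i - theta)."""
    x = np.asarray(x, dtype=float)

    def objective(theta):
        return np.sum(huber_loss(x - theta, delta))

    result = optimize.minimize_scalar(
        objective, bounds=(x.min(), x.max()), method="bounded"
    )
    return result.x

toy = np.array([48, 49, 50, 51, 52, 47, 53, 50, 500], dtype=float)
huber_estimate = huber_location(toy)

print(f"mean = {toy.mean():.2f}, median = {np.median(toy):.2f}, "
      f"Huber = {huber_estimate:.2f}")
```

The outlier contributes only a bounded pull on the Huber estimate because its loss grows linearly, so the estimate stays near the bulk of the data while the mean is dragged to 100.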

## 4.9 Bootstrap for Estimator Comparison

**Bootstrap:** Resample data with replacement to estimate sampling distribution of any statistic.

**Use for comparison:** Estimate variance of different estimators empirically and compare.

**Motivation:** When theoretical variance formulas are unavailable or complex, bootstrap provides empirical comparison of estimator precision.

In [None]:
# Bootstrap to compare mean versus median variance
B = 2000; np.random.seed(42)

In [None]:
data_for_bootstrap = stats.norm(50, 10).rvs(50)
n_boot = len(data_for_bootstrap)

In [None]:
# Generate bootstrap samples and compute statistics
boot_means = [np.mean(np.random.choice(data_for_bootstrap, n_boot, replace=True)) for _ in range(B)]

In [None]:
boot_medians = [np.median(np.random.choice(data_for_bootstrap, n_boot, replace=True)) for _ in range(B)]

In [None]:
print(f"Bootstrap SE(mean): {np.std(boot_means):.3f}")
print(f"Bootstrap SE(median): {np.std(boot_medians):.3f}")
print(f"Efficiency ratio: {np.var(boot_medians)/np.var(boot_means):.2f}")

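**Result:** For normal data, asymptotic theory gives $\text{Var}(\text{median})/\text{Var}(\text{mean}) \to \pi/2 \approx 1.57$. The bootstrap ratio from a single sample of 50 is a noisy estimate of this value, so expect a ratio of similar size rather than exactly 1.57.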
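
The resampling loop can be wrapped in a small reusable function. The sketch below, with the hypothetical name `bootstrap_se`, uses NumPy's `Generator` API and draws all resampling indices at once. It assumes the statistic accepts an `axis` argument, as `np.mean` and `np.median` do. For the mean, the output can be sanity-checked against the textbook standard error $s/\sqrt{n}$.

```python
import numpy as np

def bootstrap_se(x, statistic, n_resamples=2000, rng=None):
    """Bootstrap standard error of a statistic that accepts an `axis` argument."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Each row of idx indexes one resample of size n drawn with replacement
    idx = rng.integers(0, x.size, size=(n_resamples, x.size))
    replicates = statistic(x[idx], axis=1)
    return np.std(replicates, ddof=1)

rng = np.random.default_rng(42)
sample = rng.normal(50, 10, size=50)

se_mean = bootstrap_se(sample, np.mean, rng=rng)
se_median = bootstrap_se(sample, np.median, rng=rng)
analytic_se_mean = np.std(sample, ddof=1) / np.sqrt(sample.size)

print(f"Bootstrap SE(mean):   {se_mean:.3f} (analytic: {analytic_se_mean:.3f})")
print(f"Bootstrap SE(median): {se_median:.3f}")
```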

## 4.10 Choosing Among Estimation Methods

**Maximum Likelihood Estimation when:**
- Model is well-specified and assumptions hold
- Efficiency is critical
- Large sample size available
- Computational resources available for numerical optimization

**Method of Moments when:**
- Need quick estimates
- Starting values for Maximum Likelihood numerical optimization
- Model specification uncertain
- Closed-form solution desired

**Robust methods when:**
- Outliers present or suspected
- Model assumptions questionable
- Cost of outlier influence high
- Willing to sacrifice efficiency for stability

**Bayesian methods when:**
- Prior information available
- Full posterior distribution desired (not just point estimate)
- Hierarchical modeling needed
- Small sample with informative prior

## Summary: The Estimation Toolbox

1. **Maximum Likelihood Estimation:** Maximize likelihood—asymptotically efficient, principled, requires correct model
2. **Method of Moments:** Match sample and population moments—simple, quick, less efficient
3. **Bayesian:** Combine prior and likelihood—full posterior, incorporates prior information
4. **Robust:** Downweight outliers—stable under violations, lower efficiency under ideal conditions

**No universal winner—choose based on context, assumptions, and priorities.**

## Key Takeaways

- **Method of Moments is simple and intuitive:** Equate sample moments to population moments and solve. Often yields closed-form estimators that are easy to compute.

- **Maximum Likelihood Estimation is asymptotically superior:** Achieves minimum variance asymptotically, while Method of Moments generally does not. But Method of Moments is computationally simpler.

- **Both methods are consistent and asymptotically normal:** Good large-sample properties for both, but Maximum Likelihood Estimation has smaller asymptotic variance.

- **Method of Moments provides Maximum Likelihood Estimation starting values:** When Maximum Likelihood Estimation requires numerical optimization, Method of Moments estimates give good initial guesses.

- **Robustness matters in practice:** Sample mean is optimal under normality but catastrophic with outliers. Median and trimmed mean sacrifice efficiency for robustness.

- **Bayesian estimation incorporates prior information:** When prior knowledge exists, Bayesian methods combine it with data. With uninformative priors, results approach Maximum Likelihood Estimation.

- **No single method is always best:** Maximum Likelihood Estimation for efficiency, Method of Moments for simplicity, robust methods for contaminated data, Bayesian for incorporating priors. Context determines choice.

- **Bootstrap enables empirical comparison:** When theoretical properties are hard to derive, bootstrap provides empirical variance estimates for comparing methods.