# Chapter 3: Maximum Likelihood Estimation

**Core Goal:** Find parameter values that make observed data most probable.

**Motivation:** Given sample data and a parametric model, which parameter value best explains what we observed? Maximum Likelihood Estimation answers this by choosing the parameter that maximizes the likelihood of the observed data, that is, their joint probability (or joint density, for continuous data) viewed as a function of the parameter. This principle is intuitive, theoretically justified, and produces estimators with excellent properties. Under regularity conditions, Maximum Likelihood Estimators are consistent, asymptotically normal, and asymptotically efficient, making them a cornerstone of statistical inference.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 3.1 Likelihood Function

**Likelihood Function:** $L(\theta; x) = f(x; \theta)$ viewed as a function of parameter $\theta$ for fixed data $x$.

**For independent observations:** $L(\theta; x_1, ..., x_n) = \prod_{i=1}^n f(x_i; \theta)$

**Motivation:** The likelihood function represents how probable the observed data are as a function of different parameter values. While probability fixes the parameter and varies the data, likelihood fixes the data and varies the parameter. This perspective shift is crucial: we ask which parameter values make our observed data more or less likely. Higher likelihood means the parameter value provides a better explanation for the data we actually observed.

In [None]:
np.random.seed(42)
true_mu = 5; data = stats.norm(true_mu, 1).rvs(20)

In [None]:
# L(μ) = ∏f(xᵢ; μ): Product of densities at observed data points
mu_values = np.linspace(0, 10, 100)

In [None]:
# Compute likelihood for each candidate parameter value
likelihoods = [np.prod(stats.norm(mu, 1).pdf(data)) for mu in mu_values]

In [None]:
plt.plot(mu_values, likelihoods, linewidth=2)
plt.axvline(true_mu, color='r', linestyle='--', label=f'True μ = {true_mu}')

In [None]:
plt.xlabel('Parameter μ'); plt.ylabel('Likelihood L(μ)')
plt.title('Likelihood Function for Normal Mean'); plt.legend()

**Key Insight:** The likelihood peaks at the sample mean, which lies near the true parameter value, showing which $\mu$ makes the observed data most probable.

## 3.2 Log-Likelihood Function

**Log-Likelihood:** $\ell(\theta; x) = \log L(\theta; x) = \sum_{i=1}^n \log f(x_i; \theta)$

**Motivation:** Products are computationally unstable and analytically unwieldy. Taking logarithms converts products to sums, which are easier to compute and differentiate. Since logarithm is monotonically increasing, maximizing log-likelihood is equivalent to maximizing likelihood. The log transformation prevents numerical underflow (likelihoods can be extremely small) and simplifies derivative calculations needed for optimization.

In [None]:
# ℓ(μ) = Σlog f(xᵢ; μ): Sum of log-densities for numerical stability
log_likelihoods = [np.sum(stats.norm(mu, 1).logpdf(data)) for mu in mu_values]

In [None]:
plt.plot(mu_values, log_likelihoods, linewidth=2)
plt.axvline(true_mu, color='r', linestyle='--', label=f'True μ = {true_mu}')

In [None]:
plt.xlabel('Parameter μ'); plt.ylabel('Log-Likelihood ℓ(μ)')
plt.title('Log-Likelihood Function (Same Maximum as Likelihood)'); plt.legend()

**Computational Advantage:** Log-likelihood avoids underflow and is much more stable for numerical optimization.
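
The underflow problem can be seen directly with a larger sample. The sketch below is illustrative only: the seed, the sample size of 2000, and the names `big_sample`, `raw_likelihood`, and `log_likelihood` are assumptions chosen for this example, not part of the analysis above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
big_sample = rng.normal(loc=5.0, scale=1.0, size=2000)

# Raw likelihood: a product of 2000 densities, each at most about 0.4
raw_likelihood = np.prod(stats.norm(5.0, 1.0).pdf(big_sample))

# Log-likelihood: the same information as a sum of log-densities
log_likelihood = np.sum(stats.norm(5.0, 1.0).logpdf(big_sample))

print(f"Raw likelihood: {raw_likelihood}")
print(f"Log-likelihood: {log_likelihood:.2f}")
```

The product falls below the smallest representable double and is reported as zero, so every candidate parameter would look equally bad, while the sum of log-densities stays finite and comparable across candidates.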

## 3.3 Maximum Likelihood Estimator

**Maximum Likelihood Estimator:** $\hat{\theta}_{MLE} = \arg\max_\theta L(\theta; x) = \arg\max_\theta \ell(\theta; x)$

**Motivation:** The Maximum Likelihood Estimator chooses the parameter value that makes the observed data most probable under the assumed model. This is an intuitive principle: among all possible parameter values, select the one under which what we actually observed would be most likely to occur. Maximum Likelihood Estimation provides a unified framework that works for virtually any parametric model, and the resulting estimators have strong theoretical properties.

In [None]:
# μ̂ₘₗₑ = arg max L(μ): Parameter value maximizing likelihood
mu_hat_mle = mu_values[np.argmax(log_likelihoods)]

In [None]:
print(f"Maximum Likelihood Estimator: μ̂ₘₗₑ = {mu_hat_mle:.3f}")
print(f"Sample mean: X̄ = {np.mean(data):.3f}")

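**Result:** For a normal distribution with known variance, the Maximum Likelihood Estimator of the mean equals the sample mean. The grid-search estimate above agrees with $\bar{X}$ only up to the grid spacing (about 0.1 here).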

## 3.4 Finding Maximum Likelihood Estimator Analytically

**Analytic Method:** Solve $\frac{\partial \ell(\theta)}{\partial \theta} = 0$ (score equation)

**Motivation:** When the log-likelihood is differentiable and concave, we can find the Maximum Likelihood Estimator by setting the derivative to zero. This is often simpler than numerical optimization and provides closed-form solutions that reveal the estimator's structure. The derivative of log-likelihood is called the score function, and it plays a central role in likelihood theory.

### Example: Normal Mean with Known Variance

**Model:** $X_1, ..., X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known

**Log-likelihood:** $\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2$

**Score function:** $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu)$

**Setting to zero:** $\sum_{i=1}^n(x_i - \mu) = 0 \implies \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{X}$

In [None]:
# μ̂ₘₗₑ = X̄: Sample mean is Maximum Likelihood Estimator for normal mean
mu_hat_analytic = np.mean(data)

In [None]:
print(f"Analytic Maximum Likelihood Estimator: {mu_hat_analytic:.3f}")
print(f"Grid-search estimate from Section 3.3: {mu_hat_mle:.3f} (agrees up to grid spacing)")

### Example: Normal Variance with Known Mean

**Model:** $X_1, ..., X_n \sim N(\mu, \sigma^2)$ with $\mu$ known

**Log-likelihood:** $\ell(\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2$

**Score:** $\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i - \mu)^2$

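**Setting to zero:** Multiplying the score by $2(\sigma^2)^2$ gives $-n\sigma^2 + \sum_{i=1}^n(x_i - \mu)^2 = 0$

**Maximum Likelihood Estimator:** $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2$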

In [None]:
# σ̂²ₘₗₑ = (1/n)Σ(xᵢ - μ)²: Average squared deviation from known mean
sigma_sq_hat = np.mean((data - true_mu)**2)

In [None]:
print(f"Maximum Likelihood Estimator for σ²: {sigma_sq_hat:.3f}")
print("True σ² = 1.0")

### Both Parameters Unknown

**Joint Maximum Likelihood Estimators:**
- $\hat{\mu}_{MLE} = \bar{X}$
- $\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2$

**Note:** $\hat{\sigma}^2_{MLE}$ divides by $n$, making it biased. The unbiased estimator divides by $n-1$.

In [None]:
# σ̂²ₘₗₑ = (1/n)Σ(xᵢ - X̄)²: Biased Maximum Likelihood Estimator (divides by n)
sigma_sq_mle_biased = np.mean((data - np.mean(data))**2)

In [None]:
# s² = (1/(n-1))Σ(xᵢ - X̄)²: Unbiased estimator (divides by n-1)
sigma_sq_unbiased = np.var(data, ddof=1)

In [None]:
print(f"Maximum Likelihood Estimator (biased): {sigma_sq_mle_biased:.3f}")
print(f"Unbiased estimator: {sigma_sq_unbiased:.3f}")
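
The bias of the divide-by-$n$ estimator is a statement about its average over repeated samples, so a single sample cannot show it. The following simulation is a minimal sketch under assumed settings (seed, $n = 20$, 20000 replications, and all variable names are hypothetical choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, sigma_sq_true, n_reps = 20, 1.0, 20000

# Each row is one simulated sample of size n from N(5, 1)
samples = rng.normal(loc=5.0, scale=np.sqrt(sigma_sq_true), size=(n_reps, n))

# Average each estimator over all replications
mle_mean = np.var(samples, axis=1, ddof=0).mean()       # divides by n
unbiased_mean = np.var(samples, axis=1, ddof=1).mean()  # divides by n - 1

print(f"Average of MLE (divide by n):       {mle_mean:.3f}")
print(f"Average of s^2 (divide by n - 1):   {unbiased_mean:.3f}")
print(f"Theoretical mean of MLE, (n-1)/n:   {(n - 1) / n * sigma_sq_true:.3f}")
```

The average of the Maximum Likelihood Estimator should sit near $(n-1)\sigma^2/n$, while the $n-1$ version averages close to $\sigma^2$; the gap shrinks as $n$ grows.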

## 3.5 Maximum Likelihood Estimation for Bernoulli Distribution

**Model:** $X_1, ..., X_n \sim \text{Bernoulli}(p)$

**Probability Mass Function:** $P(X = x) = p^x(1-p)^{1-x}$ for $x \in \{0, 1\}$

**Likelihood:** $L(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}$

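**Log-likelihood:** $\ell(p) = \left(\sum x_i\right) \log p + \left(n - \sum x_i\right) \log(1-p)$

**Score:** $\frac{\partial \ell}{\partial p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p}$, which equals zero at $p = \frac{1}{n}\sum x_i$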

**Maximum Likelihood Estimator:** $\hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}$

**Motivation:** For binary data, Maximum Likelihood Estimation yields the sample proportion of successes, which is the natural estimator.

In [None]:
true_p = 0.3
bernoulli_data = stats.bernoulli(true_p).rvs(100)

In [None]:
# p̂ₘₗₑ = X̄: Sample proportion is Maximum Likelihood Estimator for Bernoulli parameter
p_hat_mle = np.mean(bernoulli_data)

In [None]:
print(f"True p = {true_p}")
print(f"Maximum Likelihood Estimator: p̂ₘₗₑ = {p_hat_mle:.3f}")
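
As a cross-check on the closed form, the Bernoulli log-likelihood can also be maximized numerically. This is a sketch, not the only way to do it: it assumes `scipy.optimize.minimize_scalar` with the bounded method, and the data, seed, and names (`flips`, `neg_log_lik`) are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)  # arbitrary seed
flips = rng.binomial(n=1, p=0.3, size=100)  # simulated 0/1 outcomes
k, n = flips.sum(), flips.size

def neg_log_lik(p):
    """Negative Bernoulli log-likelihood in terms of the success count k."""
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

# Keep p strictly inside (0, 1) so both logarithms stay finite
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_numeric = res.x
p_closed = flips.mean()

print(f"Numerical maximizer: {p_numeric:.4f}")
print(f"Sample proportion:   {p_closed:.4f}")
```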

## 3.6 Maximum Likelihood Estimation for Exponential Distribution

**Model:** $X_1, ..., X_n \sim \text{Exp}(\lambda)$

**Probability Density Function:** $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x > 0$

**Log-likelihood:** $\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^n x_i$

**Score:** $\frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i$

**Maximum Likelihood Estimator:** $\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{X}}$

**Motivation:** For exponential data (waiting times, lifetimes), Maximum Likelihood Estimation gives the reciprocal of the sample mean.

In [None]:
true_lambda = 2
exp_data = stats.expon(scale=1/true_lambda).rvs(50)

In [None]:
# λ̂ₘₗₑ = 1/X̄: Reciprocal of sample mean for exponential rate
lambda_hat_mle = 1 / np.mean(exp_data)

In [None]:
print(f"True λ = {true_lambda}")
print(f"Maximum Likelihood Estimator: λ̂ₘₗₑ = {lambda_hat_mle:.3f}")
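
Two quick checks of the exponential result are sketched below: the score evaluated at $1/\bar{X}$ should vanish, and SciPy's own fit should agree. The data and names are hypothetical; note that `scipy.stats.expon` is parameterized by `scale` $= 1/\lambda$, and `floc=0` is passed to hold the location at zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
waits = rng.exponential(scale=1 / 2.0, size=50)  # simulated with rate λ = 2

lam_hat = 1 / waits.mean()  # closed-form MLE

def score(lam):
    """Score for the exponential rate: n/λ - Σxᵢ."""
    return waits.size / lam - waits.sum()

# SciPy returns (loc, scale); the rate estimate is 1/scale
loc_fit, scale_fit = stats.expon.fit(waits, floc=0)

print(f"Score at the MLE:     {score(lam_hat):.2e}")
print(f"Closed-form estimate: {lam_hat:.4f}")
print(f"SciPy fit, 1/scale:   {1 / scale_fit:.4f}")
```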

## 3.7 Numerical Maximum Likelihood Estimation

**Motivation:** Many models lack closed-form Maximum Likelihood Estimators. Numerical optimization finds the maximum by iterative algorithms. This approach works for arbitrarily complex models, though it requires careful implementation to avoid local maxima and convergence issues.

In [None]:
from scipy.optimize import minimize
np.random.seed(123)

In [None]:
# Generate data from normal distribution
data_normal = stats.norm(10, 2).rvs(100)

In [None]:
# -ℓ(θ): Negative log-likelihood (minimize instead of maximize)
def neg_log_likelihood(params):
    return -np.sum(stats.norm(params[0], params[1]).logpdf(data_normal))

In [None]:
# Numerical optimization to find Maximum Likelihood Estimators
result = minimize(neg_log_likelihood, x0=[0, 1], method='L-BFGS-B', bounds=[(None, None), (0.001, None)])

In [None]:
print(f"Numerical Maximum Likelihood Estimators: μ̂ = {result.x[0]:.3f}, σ̂ = {result.x[1]:.3f}")
print(f"Analytic: μ̂ = {np.mean(data_normal):.3f}, σ̂ = {np.std(data_normal, ddof=0):.3f}")
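
The normal model above has a closed form, so it mainly validates the optimizer. A case where numerical optimization is actually needed is the gamma distribution, whose shape parameter has no closed-form Maximum Likelihood Estimator. The sketch below makes several assumptions for illustration: simulated data, a log-scale parameterization to keep both parameters positive, method-of-moments starting values, and the Nelder-Mead method.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(4)  # arbitrary seed
gamma_data = rng.gamma(shape=3.0, scale=2.0, size=500)

def neg_log_lik(log_params):
    """Negative gamma log-likelihood, parameterized by log(shape), log(scale)."""
    shape, scale = np.exp(log_params)
    return -np.sum(stats.gamma(a=shape, scale=scale).logpdf(gamma_data))

# Method-of-moments values give a reasonable starting point
m, v = gamma_data.mean(), gamma_data.var()
x0 = np.log([m**2 / v, v / m])

res = minimize(neg_log_lik, x0=x0, method="Nelder-Mead")
shape_hat, scale_hat = np.exp(res.x)

print(f"Shape estimate: {shape_hat:.3f}, scale estimate: {scale_hat:.3f}")
print(f"shape * scale = {shape_hat * scale_hat:.3f}, sample mean = {m:.3f}")
```

Optimizing on the log scale avoids the need for explicit bounds, and the last line is a sanity check: at the gamma maximum, shape times scale equals the sample mean.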

## 3.8 Invariance Property of Maximum Likelihood Estimator

**Invariance Property:** If $\hat{\theta}_{MLE}$ is the Maximum Likelihood Estimator of $\theta$, then $g(\hat{\theta}_{MLE})$ is the Maximum Likelihood Estimator of $g(\theta)$ for any function $g$.

**Motivation:** This property simplifies estimation of transformed parameters. If we want to estimate a function of a parameter, we simply apply that function to the Maximum Likelihood Estimator of the original parameter. No separate optimization is needed. Many other estimation criteria lack this property: unbiasedness, for example, is generally not preserved under nonlinear transformations.

In [None]:
# For normal data: μ̂ₘₗₑ = X̄
mu_hat = np.mean(data_normal)

In [None]:
# g(μ̂ₘₗₑ): Maximum Likelihood Estimator of g(μ) by invariance property
tau_hat = np.exp(mu_hat)  # Estimate e^μ

In [None]:
print(f"μ̂ₘₗₑ = {mu_hat:.3f}")
print(f"Maximum Likelihood Estimator of e^μ: e^μ̂ = {tau_hat:.3f}")

**Example:** If $\hat{\lambda}_{MLE}$ is the Maximum Likelihood Estimator of the exponential rate $\lambda$, then the Maximum Likelihood Estimator of the mean $1/\lambda$ is $1/\hat{\lambda}_{MLE} = \bar{X}$.
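
Invariance can be checked numerically for the exponential case by maximizing the likelihood written directly in terms of the mean $\theta = 1/\lambda$ and comparing with the reciprocal of the rate estimate. The data, seed, and function name below are assumptions for this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)  # arbitrary seed
lifetimes = rng.exponential(scale=0.5, size=50)  # true rate 2, true mean 0.5

lam_hat = 1 / lifetimes.mean()  # MLE of the rate λ

def neg_log_lik_mean(theta):
    """Negative log-likelihood in terms of the mean: n log θ + Σxᵢ/θ."""
    return lifetimes.size * np.log(theta) + lifetimes.sum() / theta

# Separate optimization over θ, done only to confirm invariance
res = minimize_scalar(neg_log_lik_mean, bounds=(1e-3, 10.0), method="bounded")
theta_hat_direct = res.x
theta_hat_invariance = 1 / lam_hat

print(f"Direct maximization over θ: {theta_hat_direct:.4f}")
print(f"Invariance, 1/λ̂:            {theta_hat_invariance:.4f}")
```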

## 3.9 Score Function

**Score Function:** $S(\theta) = \frac{\partial \ell(\theta)}{\partial \theta}$

**Properties:**
1. $E[S(\theta)] = 0$ (expected score is zero at true parameter)
2. At Maximum Likelihood Estimator: $S(\hat{\theta}_{MLE}) = 0$

**Motivation:** The score function measures the slope of log-likelihood. Its expectation being zero means that on average, the log-likelihood has no upward or downward trend at the true parameter. The score plays a central role in maximum likelihood theory and appears in asymptotic distribution results and efficiency calculations.

In [None]:
# S(μ) = ∂ℓ/∂μ: Derivative of log-likelihood (score function)
def score_normal_mean(mu, data, sigma=1):
    return np.sum((data - mu)) / sigma**2

In [None]:
mu_grid = np.linspace(3, 7, 100)
scores = [score_normal_mean(mu, data) for mu in mu_grid]

In [None]:
plt.plot(mu_grid, scores, linewidth=2)
plt.axhline(0, color='black', linestyle='-', linewidth=0.5)

In [None]:
plt.xlabel('Parameter μ'); plt.ylabel('Score S(μ)')
plt.title('Score Function: Zero at Maximum Likelihood Estimator')

**Observation:** Score crosses zero exactly where likelihood is maximized.
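
Property 1 concerns the score at the true parameter averaged over repeated samples, which a single plot cannot show. A small Monte Carlo sketch follows; the seed, replication count, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed
mu_true, sigma_known, n, n_reps = 5.0, 1.0, 20, 50000

# Each row is an independent sample of size n from N(mu_true, sigma_known^2)
samples = rng.normal(mu_true, sigma_known, size=(n_reps, n))

# Score of each simulated sample, evaluated at the true mean
scores_at_truth = (samples - mu_true).sum(axis=1) / sigma_known**2

mean_score = scores_at_truth.mean()
var_score = scores_at_truth.var()

print(f"Average score at true μ:  {mean_score:.3f}  (theory: 0)")
print(f"Variance of the score:    {var_score:.2f}  (theory: n/σ² = {n / sigma_known**2:.0f})")
```

Individual scores are far from zero; only their average is. Their variance, $n/\sigma^2$ here, is the Fisher Information of the sample.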

## 3.10 Fisher Information

**Fisher Information:** $I(\theta) = E\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}\right]$

**For sample of size n:** $I_n(\theta) = n I(\theta)$

**Motivation:** Fisher Information quantifies how much information the data contain about the parameter. Higher information means the log-likelihood is more sharply peaked, allowing more precise estimation. Fisher Information appears in the Cramér-Rao Lower Bound and in the asymptotic variance of Maximum Likelihood Estimators. It connects the curvature of log-likelihood to estimation precision.

### Example: Normal Distribution with Known Variance

**Log-likelihood for one observation:** $\ell(\mu; x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$

**First derivative:** $\frac{\partial \ell}{\partial \mu} = \frac{x - \mu}{\sigma^2}$

**Second derivative:** $\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{1}{\sigma^2}$

**Fisher Information:** $I(\mu) = \frac{1}{\sigma^2}$

In [None]:
# I(μ) = 1/σ²: Fisher Information for normal mean
sigma = 2; fisher_info = 1 / sigma**2

In [None]:
print(f"Fisher Information for one observation: I(μ) = {fisher_info:.3f}")
print(f"Fisher Information for n=100: I₁₀₀(μ) = {100 * fisher_info:.1f}")

**Interpretation:** Information increases with sample size and decreases with variance. More observations or less noise mean more information about $\mu$.
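
The two expressions in the definition can be compared by simulation. For the normal mean the second derivative is constant, so this sketch uses the exponential model from Section 3.6 instead, where $\log f(x;\lambda) = \log\lambda - \lambda x$, the per-observation score is $1/\lambda - x$, and the second derivative is $-1/\lambda^2$. The seed, sample size, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
lam = 2.0
x = rng.exponential(scale=1 / lam, size=200000)

# Form 1: expected squared score, estimated by a sample average
score_per_obs = 1 / lam - x
info_from_score = np.mean(score_per_obs**2)

# Form 2: negative expected second derivative, available exactly
info_from_curvature = 1 / lam**2

print(f"E[score²] (Monte Carlo):   {info_from_score:.4f}")
print(f"-E[second derivative]:     {info_from_curvature:.4f}")
```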

## 3.11 Asymptotic Properties of Maximum Likelihood Estimator

**Under regularity conditions, Maximum Likelihood Estimators have three key asymptotic properties:**

1. **Consistency:** $\hat{\theta}_{MLE} \xrightarrow{P} \theta$ as $n \to \infty$
2. **Asymptotic Normality:** $\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} N(0, 1/I(\theta))$
3. **Asymptotic Efficiency:** Maximum Likelihood Estimator achieves Cramér-Rao Lower Bound asymptotically

**Motivation:** These properties explain why Maximum Likelihood Estimation is preferred in practice. Consistency guarantees convergence to the truth with enough data. Asymptotic normality enables construction of confidence intervals and hypothesis tests using normal theory. Asymptotic efficiency means that, among regular estimators, none has lower asymptotic variance. Together, these properties make Maximum Likelihood Estimators the gold standard for large-sample inference.

### Demonstrating Consistency

**Consistency:** As sample size increases, Maximum Likelihood Estimator converges to true parameter value.

In [None]:
true_theta = 5
sample_sizes = [10, 30, 100, 300, 1000, 3000]

In [None]:
# Generate Maximum Likelihood Estimators for increasing sample sizes
mles = [stats.norm(true_theta, 2).rvs(n).mean() for n in sample_sizes]

In [None]:
plt.plot(sample_sizes, mles, 'o-', markersize=8, linewidth=2)
plt.axhline(true_theta, color='r', linestyle='--', linewidth=2, label='True θ')

In [None]:
plt.xlabel('Sample Size n'); plt.ylabel('θ̂ₘₗₑ')
plt.title('Consistency: Maximum Likelihood Estimator Converges to True Value'); plt.legend()

### Demonstrating Asymptotic Normality

**Asymptotic Normality:** Sampling distribution of Maximum Likelihood Estimator becomes approximately normal for large $n$.

In [None]:
# Generate sampling distribution of Maximum Likelihood Estimator
mle_estimates = [stats.norm(true_theta, 2).rvs(100).mean() for _ in range(2000)]

In [None]:
# Var(θ̂ₘₗₑ) ≈ 1/(nI(θ)): Asymptotic variance formula
asymptotic_var = 1 / (100 * fisher_info)

In [None]:
plt.hist(mle_estimates, bins=40, density=True, alpha=0.7, edgecolor='black')
x = np.linspace(4, 6, 100); plt.plot(x, stats.norm(true_theta, np.sqrt(asymptotic_var)).pdf(x), 'r-', linewidth=2)

In [None]:
plt.xlabel('θ̂ₘₗₑ'); plt.ylabel('Density')
plt.title('Asymptotic Normality of Maximum Likelihood Estimator')

In [None]:
print(f"Empirical standard deviation: {np.std(mle_estimates):.3f}")
print(f"Theoretical (1/√(nI(θ))): {np.sqrt(asymptotic_var):.3f}")

## 3.12 Observed Information and Standard Errors

**Observed Information:** $J(\hat{\theta}) = -\frac{\partial^2 \ell(\theta)}{\partial \theta^2}\Big|_{\theta = \hat{\theta}}$

**Estimated Standard Error:** $\widehat{SE}(\hat{\theta}_{MLE}) = \frac{1}{\sqrt{J(\hat{\theta})}}$

**Motivation:** Observed information uses the actual data to estimate the Fisher Information of the sample, $I_n(\theta) = nI(\theta)$. The negative second derivative of log-likelihood at the Maximum Likelihood Estimator quantifies how peaked the likelihood is. A sharper peak means more information and smaller standard error. This provides a practical way to quantify uncertainty in Maximum Likelihood Estimates.

In [None]:
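# J(μ̂) = n/σ²: Observed information for normal mean
# `data` (Section 3.1) was simulated with σ = 1, so use that value rather than sigma = 2 from Section 3.10
observed_info = len(data) / 1.0**2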

In [None]:
# SE(μ̂ₘₗₑ) = 1/√J(μ̂): Standard error from observed information
se_mle = 1 / np.sqrt(observed_info)

In [None]:
print(f"Observed Information: J(μ̂) = {observed_info:.2f}")
print(f"Standard Error of Maximum Likelihood Estimator: {se_mle:.3f}")
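
When the second derivative is not available in closed form, observed information can be approximated by a central finite difference of the log-likelihood. The sketch below assumes simulated normal data with known $\sigma = 1$, a step size `h` chosen by hand, and a 1.96 multiplier for an approximate 95% Wald interval; all names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)  # arbitrary seed
obs = rng.normal(5.0, 1.0, size=20)

def log_lik(mu):
    """Log-likelihood of the normal mean with σ fixed at 1."""
    return np.sum(stats.norm(mu, 1.0).logpdf(obs))

mu_hat = obs.mean()

# Central difference approximation to the second derivative at the MLE
h = 1e-3
second_deriv = (log_lik(mu_hat + h) - 2 * log_lik(mu_hat) + log_lik(mu_hat - h)) / h**2

observed_info_fd = -second_deriv
se_fd = 1 / np.sqrt(observed_info_fd)
ci = (mu_hat - 1.96 * se_fd, mu_hat + 1.96 * se_fd)

print(f"Finite-difference J(μ̂): {observed_info_fd:.4f}  (exact: n/σ² = {obs.size})")
print(f"Standard error: {se_fd:.4f}")
print(f"Approximate 95% interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```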

## 3.13 Multiparameter Maximum Likelihood Estimation

**Vector Parameter:** $\theta = (\theta_1, ..., \theta_p)$

**Score Vector:** $S(\theta) = \left(\frac{\partial \ell}{\partial \theta_1}, ..., \frac{\partial \ell}{\partial \theta_p}\right)$

**Maximum Likelihood Estimator:** Solve $S(\hat{\theta}) = 0$

**Fisher Information Matrix:** $I_n(\theta) = E\left[S(\theta)S(\theta)^T\right]$, which equals $nI(\theta)$ for independent, identically distributed observations

**Asymptotic Distribution:** $\hat{\theta}_{MLE}$ is approximately $N(\theta, I_n^{-1}(\theta))$ for large $n$

**Motivation:** Many realistic models have multiple parameters. The multiparameter Maximum Likelihood framework extends naturally to vector parameters. The Fisher Information becomes a matrix capturing information about each parameter and their relationships.

In [None]:
# Normal(μ, σ²): Two-parameter Maximum Likelihood Estimation
data_multi = stats.norm(10, 3).rvs(200)

In [None]:
# μ̂ₘₗₑ = X̄: Maximum Likelihood Estimator for mean
mu_hat_multi = np.mean(data_multi)

In [None]:
# σ̂²ₘₗₑ = (1/n)Σ(xᵢ - X̄)²: Maximum Likelihood Estimator for variance
sigma_sq_hat_multi = np.mean((data_multi - mu_hat_multi)**2)

In [None]:
print(f"Maximum Likelihood Estimators: μ̂ = {mu_hat_multi:.3f}, σ̂² = {sigma_sq_hat_multi:.3f}")
print("True values: μ = 10, σ² = 9")
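
The information matrix turns these point estimates into standard errors. For the $N(\mu, \sigma^2)$ model parameterized by $(\mu, \sigma^2)$, the per-observation Fisher Information matrix is the standard result $\mathrm{diag}\left(1/\sigma^2,\ 1/(2\sigma^4)\right)$. The sketch below plugs in the estimates; the data, seed, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)  # arbitrary seed
sample = rng.normal(10.0, 3.0, size=200)
n = sample.size

mu_hat = sample.mean()
sigma_sq_hat = np.mean((sample - mu_hat) ** 2)

# Per-observation Fisher Information matrix for (μ, σ²), evaluated at the estimates
fisher_matrix = np.array([
    [1 / sigma_sq_hat, 0.0],
    [0.0, 1 / (2 * sigma_sq_hat**2)],
])

# Approximate covariance of the estimators: inverse of n times the matrix
cov_hat = np.linalg.inv(n * fisher_matrix)
se_mu, se_sigma_sq = np.sqrt(np.diag(cov_hat))

print(f"SE(μ̂)  = {se_mu:.3f}")
print(f"SE(σ̂²) = {se_sigma_sq:.3f}")
```

The zero off-diagonal entries mean the two estimators are asymptotically uncorrelated in this model, which is not true of multiparameter models in general.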

## 3.14 Advantages and Limitations of Maximum Likelihood Estimation

**Advantages:**
1. **Principled:** Clear interpretation - parameter making data most probable
2. **General:** Applicable to virtually any parametric model
3. **Efficient:** Achieves Cramér-Rao Lower Bound asymptotically
4. **Invariant:** Invariance property simplifies transformed parameters
5. **Normal:** Asymptotically normal, enabling standard inference

**Limitations:**
1. **Model dependence:** Requires correctly specified probability model
2. **Finite-sample bias:** May be biased in small samples
3. **Computational:** May require numerical optimization
4. **Regularity conditions:** Asymptotic theory requires technical assumptions
5. **Not robust:** Sensitive to outliers and model misspecification

**Motivation:** Understanding both strengths and weaknesses guides appropriate use of Maximum Likelihood Estimation. It excels with large samples and correctly specified models but may struggle with small samples or model violations.
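
The sensitivity to outliers is easy to see for the normal mean, where the estimator is the sample mean. In this sketch a single hypothetical recording error of 50 is appended to 20 simulated observations; the seed and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)  # arbitrary seed
clean = rng.normal(5.0, 1.0, size=20)
contaminated = np.append(clean, 50.0)  # one hypothetical recording error

# Under the normal model the MLE of μ is the sample mean in both cases
mle_clean = clean.mean()
mle_contaminated = contaminated.mean()

print(f"MLE without the outlier: {mle_clean:.3f}")
print(f"MLE with the outlier:    {mle_contaminated:.3f}")
```

One bad value out of 21 moves the estimate by roughly two standard deviations of the data, because the normal model treats the outlier as informative.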

## Summary: Maximum Likelihood Estimation Framework

1. **Specify parametric model** $f(x; \theta)$ for the data
2. **Construct likelihood** $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$ or log-likelihood $\ell(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$
3. **Maximize likelihood** by solving $\frac{\partial \ell}{\partial \theta} = 0$ or using numerical optimization
4. **Verify maximum** by checking that the second derivative is negative at $\hat{\theta}$ (the log-likelihood is locally concave)
5. **Quantify uncertainty** using observed information $J(\hat{\theta})$ and standard errors
6. **Use asymptotic theory** for large-sample inference: $\hat{\theta}_{MLE}$ is approximately $N(\theta, 1/(nI(\theta)))$
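
The six steps can be traced end to end on the Bernoulli model of Section 3.5. This is a minimal sketch with simulated data; the seed, sample size, names, and the 1.96 multiplier for an approximate 95% interval are assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed
x = rng.binomial(1, 0.3, size=400)
n, k = x.size, x.sum()

# Steps 1-2: Bernoulli model and its log-likelihood
def log_lik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Step 3: the score equation gives the sample proportion
p_hat = k / n

# Step 4: second derivative of the log-likelihood at p_hat should be negative
second_deriv = -k / p_hat**2 - (n - k) / (1 - p_hat) ** 2

# Step 5: observed information and standard error
observed_info = -second_deriv
se = 1 / np.sqrt(observed_info)

# Step 6: approximate 95% interval from asymptotic normality
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"p̂ = {p_hat:.3f}, log-likelihood at p̂ = {log_lik(p_hat):.2f}")
print(f"Second derivative at p̂: {second_deriv:.1f}")
print(f"SE = {se:.4f}, interval = ({ci_low:.3f}, {ci_high:.3f})")
```

Here the observed information simplifies to $n/(\hat{p}(1-\hat{p}))$, so the standard error matches the familiar $\sqrt{\hat{p}(1-\hat{p})/n}$.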

## Key Takeaways

- **Likelihood quantifies plausibility:** Likelihood function shows which parameter values make observed data more or less probable. Maximizing likelihood chooses the most plausible parameter.

- **Log-likelihood simplifies computation:** Taking logarithms converts products to sums, preventing numerical underflow and simplifying derivatives while preserving the maximum.

- **Maximum Likelihood Estimators have excellent properties:** Consistency, asymptotic normality, and asymptotic efficiency make Maximum Likelihood Estimation the preferred method for large samples.

- **Invariance property is powerful:** Estimating functions of parameters requires only applying the function to Maximum Likelihood Estimators, no separate optimization needed.

- **Fisher Information quantifies precision:** Higher information means sharper likelihood peak and more precise estimation. Information increases with sample size and data quality.

- **Observed information provides standard errors:** Negative second derivative of log-likelihood at Maximum Likelihood Estimator gives practical uncertainty quantification.

- **Numerical methods handle complex models:** When no closed-form solution exists, numerical optimization finds Maximum Likelihood Estimators, extending applicability to arbitrary models.

- **Model specification matters:** Maximum Likelihood Estimation performance depends critically on correct model specification. Misspecified models can lead to biased and inconsistent estimators of the quantities of interest.