# Chapter 2: Properties of Estimators

**Core Goal:** Understand what makes an estimator "good" and develop criteria for comparing different estimation procedures.

**Motivation:** Not all estimators are created equal. Given sample data, we could propose many different functions to estimate a parameter. How do we choose among them? This chapter develops a systematic framework for evaluating and comparing estimators based on mathematical criteria: unbiasedness, efficiency, consistency, and sufficiency. Understanding these properties allows us to identify optimal estimators and quantify their performance.

In [None]:
import numpy as np
import scipy.stats as stats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

## 2.1 Estimators: Basic Concepts

**Estimator:** A rule or function that assigns to each possible sample a value used to estimate the parameter.

**Estimate:** The specific numerical value obtained by applying the estimator to observed data.

**Motivation:** This distinction is crucial. An estimator is a random variable (it depends on the random sample), while an estimate is a fixed number (it comes from the particular sample we observed). The estimator $\bar{X}$ is a function; the estimate $\bar{x} = 52.3$ is a number. Understanding estimators as random variables allows us to study their probabilistic properties.

In [None]:
# Setup: Population N(μ=50, σ²=100)
np.random.seed(42); true_mu, true_sigma = 50, 10

In [None]:
population = stats.norm(loc=true_mu, scale=true_sigma)
sample = population.rvs(30)

In [None]:
# Estimator X̄ applied to this sample gives an estimate
estimate = np.mean(sample); print(f"Estimate from this sample: {estimate:.2f}")

## 2.2 Unbiasedness

**Definition:** An estimator $\hat{\theta}$ is unbiased for parameter $\theta$ if $E[\hat{\theta}] = \theta$.

**Bias:** The systematic error in an estimator: $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$

**Motivation:** Unbiasedness means the estimator is correct "on average" across all possible samples. If we could repeat the sampling process infinitely many times, the average of all estimates would equal the true parameter. This property is desirable because it means the estimator has no systematic tendency to overestimate or underestimate. However, unbiasedness alone does not guarantee a good estimator; we must also consider variability.

In [None]:
# Simulate many estimates to verify unbiasedness of sample mean
estimates_mu = [population.rvs(30).mean() for _ in range(5000)]

In [None]:
bias = np.mean(estimates_mu) - true_mu
print(f"Bias of sample mean X̄: {bias:.4f} (close to 0, consistent with unbiasedness)")

In [None]:
plt.hist(estimates_mu, bins=50, density=True, alpha=0.7, edgecolor='black')
plt.axvline(true_mu, color='r', linestyle='--', linewidth=2, label='True μ'); plt.legend()

### Biased versus Unbiased Variance Estimators

**Biased Estimator:** $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X})^2$ divides by $n$

**Unbiased Estimator:** $S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2$ divides by $n-1$

**Motivation:** When we compute deviations from the sample mean rather than the true mean, we underestimate variance on average. Dividing by $n-1$ instead of $n$ compensates for this bias: one degree of freedom is "used up" by estimating the mean from the same data. The argument `ddof=1` (delta degrees of freedom) of `np.var` implements this correction in NumPy; the default `ddof=0` divides by $n$.
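
The size of the bias can be worked out directly. The following is a short derivation, assuming only that the $X_i$ are independent with common mean $\mu$ and finite variance $\sigma^2$:

$$\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar{X} - \mu)^2$$

$$E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2$$

$$E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2, \qquad E[S^2] = \sigma^2, \qquad \text{Bias}(\hat{\sigma}^2) = -\frac{\sigma^2}{n}$$

For the setup used here ($n = 30$, $\sigma^2 = 100$), this gives $E[\hat{\sigma}^2] = \frac{29}{30} \cdot 100 \approx 96.67$.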

In [None]:
# Biased variance estimator (divides by n)
biased_variances = [np.var(population.rvs(30), ddof=0) for _ in range(5000)]

In [None]:
# Unbiased variance estimator (divides by n-1)
unbiased_variances = [np.var(population.rvs(30), ddof=1) for _ in range(5000)]

In [None]:
print(f"True σ² = {true_sigma**2}")
print(f"E[biased estimator] = {np.mean(biased_variances):.2f}, E[unbiased estimator] = {np.mean(unbiased_variances):.2f}")

In [None]:
plt.hist(biased_variances, bins=50, alpha=0.5, label='Biased (n)', density=True, edgecolor='black')
plt.hist(unbiased_variances, bins=50, alpha=0.5, label='Unbiased (n-1)', density=True, edgecolor='black'); plt.legend()

**Key Result:** The biased estimator systematically underestimates $\sigma^2$, while the unbiased estimator correctly targets it on average.

## 2.3 Mean Squared Error

**Mean Squared Error:** $\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$ measures overall estimation accuracy.

**Decomposition:** $\text{MSE}(\hat{\theta}) = \text{Bias}^2(\hat{\theta}) + \text{Var}(\hat{\theta})$

**Motivation:** Mean Squared Error provides a comprehensive measure of estimator quality by combining both bias and variance. It represents the expected squared distance from the true parameter value. The bias-variance decomposition reveals a fundamental tradeoff: sometimes accepting small bias can substantially reduce variance, yielding lower overall Mean Squared Error. This decomposition explains why unbiased estimators are not always optimal.
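
The decomposition follows from adding and subtracting $E[\hat{\theta}]$ inside the square. A brief derivation, using only the definitions of bias and variance given above:

$$E[(\hat{\theta} - \theta)^2] = E\left[\left((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta)\right)^2\right]$$

$$= E\left[(\hat{\theta} - E[\hat{\theta}])^2\right] + 2\,(E[\hat{\theta}] - \theta)\,E\left[\hat{\theta} - E[\hat{\theta}]\right] + (E[\hat{\theta}] - \theta)^2$$

$$= \text{Var}(\hat{\theta}) + 0 + \text{Bias}^2(\hat{\theta})$$

The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$ and $E[\hat{\theta}] - \theta$ is a constant.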

In [None]:
# Mean Squared Error for sample mean
mse_mean = np.mean((np.array(estimates_mu) - true_mu)**2); print(f"MSE(X̄) = {mse_mean:.2f}")

In [None]:
# Since sample mean is unbiased: MSE equals Variance
variance_mean = np.var(estimates_mu); print(f"Variance(X̄) = {variance_mean:.2f}")

In [None]:
# Theoretical variance of sample mean
theoretical_variance = true_sigma**2 / 30; print(f"Theoretical: σ²/n = {theoretical_variance:.2f}")

### Bias-Variance Tradeoff

**Principle:** Small bias combined with low variance can yield better Mean Squared Error than zero bias with high variance.

**Motivation:** This tradeoff appears throughout statistics. For example, ridge regression and lasso introduce bias but reduce variance, often improving prediction. The optimal estimator minimizes Mean Squared Error, not necessarily bias alone.

In [None]:
# Compare Mean Squared Error: biased versus unbiased variance estimator
mse_biased = np.mean((np.array(biased_variances) - true_sigma**2)**2)

In [None]:
mse_unbiased = np.mean((np.array(unbiased_variances) - true_sigma**2)**2)
print(f"MSE biased: {mse_biased:.2f}, MSE unbiased: {mse_unbiased:.2f}")

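**Observation:** The unbiased estimator has slightly higher Mean Squared Error: removing the bias inflates the variance by more than the squared bias it eliminates. The gap is small at $n = 30$, and the two sets of estimates come from separate simulation runs, so the printed values can be close. This illustrates that unbiasedness does not guarantee optimality.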
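
Because the difference is only a few percent, it helps to compute both estimators on the same simulated samples and compare with the closed-form values for normal data, $\text{MSE}(\hat{\sigma}^2) = \frac{2n-1}{n^2}\sigma^4$ and $\text{MSE}(S^2) = \frac{2}{n-1}\sigma^4$. The sketch below is one way to do this; the seed, the number of simulations, and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative seed
n, sigma2, n_sims = 30, 100.0, 20_000     # n and sigma^2 match the chapter's setup

# One batch of samples shared by both estimators, so the comparison is paired
samples = rng.normal(loc=50.0, scale=np.sqrt(sigma2), size=(n_sims, n))
var_biased = samples.var(axis=1, ddof=0)      # divides by n
var_unbiased = samples.var(axis=1, ddof=1)    # divides by n - 1

mse_biased_paired = np.mean((var_biased - sigma2) ** 2)
mse_unbiased_paired = np.mean((var_unbiased - sigma2) ** 2)

# Closed-form MSE values, valid for normal data
mse_biased_theory = (2 * n - 1) / n**2 * sigma2**2
mse_unbiased_theory = 2 / (n - 1) * sigma2**2

print(f"Simulated:   biased {mse_biased_paired:.1f}, unbiased {mse_unbiased_paired:.1f}")
print(f"Theoretical: biased {mse_biased_theory:.1f}, unbiased {mse_unbiased_theory:.1f}")
```

Sharing the samples removes most of the Monte Carlo noise from the comparison, since the two estimators differ only by the constant factor $(n-1)/n$ on each sample.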

## 2.4 Consistency

**Definition:** A sequence of estimators $\hat{\theta}_n$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$.

**Convergence in Probability:** For any $\epsilon > 0$, $P(|\hat{\theta}_n - \theta| > \epsilon) \to 0$ as $n \to \infty$

**Motivation:** Consistency is a minimal requirement for reasonable estimators: with enough data, the estimator should get arbitrarily close to the true value with high probability. Unlike unbiasedness (a finite-sample property), consistency is an asymptotic property. An estimator can be biased but consistent if the bias vanishes as sample size increases. Consistency guarantees that, for a sufficiently large sample, the estimate lies close to the true value with high probability.

In [None]:
# Demonstrate consistency: distribution of X̄ concentrates around μ
sample_sizes = [10, 30, 100, 300, 1000, 3000]

In [None]:
# For each n, generate many sample means
distributions_by_n = {n: [population.rvs(n).mean() for _ in range(1000)] for n in sample_sizes}

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)  # shared x-axis makes the narrowing visible
for ax, n in zip(axes.flat, sample_sizes):
    ax.hist(distributions_by_n[n], bins=30, edgecolor='black')
    ax.axvline(true_mu, color='r', linewidth=2); ax.set_title(f'n = {n}')
plt.tight_layout()
# Distribution becomes increasingly concentrated around true parameter value

**Visual Interpretation:** As sample size increases, the distribution becomes narrower and tighter around the true parameter, demonstrating consistency.
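
The definition of convergence in probability can also be checked numerically. The sketch below estimates $P(|\bar{X}_n - \mu| > \epsilon)$ for a few sample sizes and compares it with the Chebyshev bound $\sigma^2 / (n\epsilon^2)$. The tolerance $\epsilon = 1$, the seed, and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)        # illustrative seed
mu, sigma = 50.0, 10.0                # population used throughout the chapter
eps = 1.0                             # illustrative tolerance
n_reps = 2000

exceed_probs = {}
chebyshev_bounds = {}
for n in [10, 100, 1000]:
    # n_reps independent sample means, each based on n observations
    sample_means = rng.normal(mu, sigma, size=(n_reps, n)).mean(axis=1)
    exceed_probs[n] = np.mean(np.abs(sample_means - mu) > eps)
    chebyshev_bounds[n] = min(1.0, sigma**2 / (n * eps**2))
    print(f"n = {n:5d}: P(|X̄ - μ| > {eps}) ≈ {exceed_probs[n]:.4f}, "
          f"Chebyshev bound = {chebyshev_bounds[n]:.4f}")
```

The estimated probability falls toward zero as $n$ grows and stays below the Chebyshev bound, which is loose but sufficient to prove consistency of $\bar{X}$.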

## 2.5 Efficiency and Relative Efficiency

**Efficiency:** Among unbiased estimators, the most efficient one has smallest variance.

**Relative Efficiency:** $\text{Eff}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$

**Motivation:** Efficiency measures precision. Given two unbiased estimators, we prefer the one with smaller variance because its estimates fall close to the truth more reliably. Relative efficiency allows quantitative comparison: if $\text{Eff}(\hat{\theta}_1, \hat{\theta}_2) = 1.5$, the first estimator is 50% more efficient, meaning that (when both variances shrink like $1/n$) the second estimator requires about 50% more data to achieve the same precision.

In [None]:
# Compare three estimators for population mean: mean, median, trimmed mean
n_simulations = 5000; n = 50

In [None]:
means = [population.rvs(n).mean() for _ in range(n_simulations)]
medians = [np.median(population.rvs(n)) for _ in range(n_simulations)]

In [None]:
from scipy.stats import trim_mean
trimmed_means = [trim_mean(population.rvs(n), 0.1) for _ in range(n_simulations)]

In [None]:
print(f"Variance(mean): {np.var(means):.2f}")
print(f"Variance(median): {np.var(medians):.2f}, Variance(trimmed): {np.var(trimmed_means):.2f}")

In [None]:
plt.hist(means, bins=50, alpha=0.5, label='Mean', density=True, edgecolor='black')
plt.hist(medians, bins=50, alpha=0.5, label='Median', density=True, edgecolor='black'); plt.legend()

**Result for Normal Data:** Sample mean has smallest variance, making it most efficient for estimating the center of a normal distribution.

In [None]:
relative_efficiency = np.var(medians) / np.var(means)
print(f"Relative efficiency (median vs mean): {relative_efficiency:.2f}")

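**Interpretation:** For normal data the ratio approaches $\pi/2 \approx 1.57$ as $n$ grows, so the median needs approximately 57% more observations to achieve the same precision as the mean. The simulated value at $n = 50$ should be close to, but not exactly, this limit.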
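
The 57% figure comes from the large-sample variance of the sample median. As a sketch, using the standard approximation $\text{Var}(\text{median}) \approx \frac{1}{4n f(m)^2}$, where $f$ is the population density and $m$ the population median:

$$f(\mu) = \frac{1}{\sigma\sqrt{2\pi}} \quad\Rightarrow\quad \text{Var}(\text{median}) \approx \frac{1}{4n} \cdot 2\pi\sigma^2 = \frac{\pi\sigma^2}{2n}$$

$$\frac{\text{Var}(\text{median})}{\text{Var}(\bar{X})} \approx \frac{\pi\sigma^2 / (2n)}{\sigma^2 / n} = \frac{\pi}{2} \approx 1.571$$

With $n = 50$ and $\sigma^2 = 100$, this gives $\text{Var}(\bar{X}) = 2$ and $\text{Var}(\text{median}) \approx 3.14$.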

## 2.6 Cramér-Rao Lower Bound

**Cramér-Rao Lower Bound:** For any unbiased estimator $\hat{\theta}$: $\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$

**Fisher Information:** $I(\theta) = E\left[\left(\frac{\partial \log f(X;\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log f(X;\theta)}{\partial \theta^2}\right]$

**Motivation:** The Cramér-Rao Lower Bound establishes the theoretical minimum variance achievable by any unbiased estimator. Fisher Information quantifies how much information the data contain about the parameter: higher information means lower minimum variance. An estimator that achieves this bound is called efficient and cannot be improved upon (in terms of variance) among unbiased estimators. This bound provides a benchmark for evaluating estimator quality.

In [None]:
# For Normal(μ, σ²) with known σ: Fisher Information I(μ) = 1/σ²
fisher_information = 1 / true_sigma**2; print(f"Fisher Information I(μ) = {fisher_information:.4f}")

In [None]:
# Cramér-Rao Lower Bound for n=30 observations
cramer_rao_bound = 1 / (30 * fisher_information); print(f"Cramér-Rao Lower Bound = {cramer_rao_bound:.2f}")

In [None]:
# Variance of sample mean
variance_sample_mean = true_sigma**2 / 30; print(f"Var(X̄) = {variance_sample_mean:.2f}")

In [None]:
print(f"Sample mean achieves Cramér-Rao Lower Bound: {np.isclose(cramer_rao_bound, variance_sample_mean)}")
print("Therefore, X̄ is an efficient estimator for normal mean")

**Conclusion:** Sample mean achieves the Cramér-Rao Lower Bound, proving it is the most efficient unbiased estimator for the normal mean.
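
The value $I(\mu) = 1/\sigma^2$ can be checked against the definition of Fisher Information as the expected squared score. For the normal density with known $\sigma$, the score is $\frac{\partial}{\partial \mu} \log f(x; \mu) = (x - \mu)/\sigma^2$. The sketch below estimates $E[\text{score}^2]$ by Monte Carlo; the seed and the number of draws are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)        # illustrative seed
mu, sigma = 50.0, 10.0                # population used throughout the chapter

x = rng.normal(mu, sigma, size=200_000)
score = (x - mu) / sigma**2           # derivative of log f(x; mu) with respect to mu
fisher_mc = np.mean(score**2)         # Monte Carlo estimate of E[score^2]
fisher_exact = 1 / sigma**2

print(f"Monte Carlo I(μ) ≈ {fisher_mc:.5f}, exact 1/σ² = {fisher_exact:.5f}")
print(f"Mean of score ≈ {np.mean(score):.5f} (the score has expectation 0)")
```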

## 2.7 Sufficiency

**Sufficient Statistic:** $T(X)$ is sufficient for $\theta$ if the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$.

**Formal Definition:** $P(X | T(X), \theta) = P(X | T(X))$ for all $\theta$

**Motivation:** A sufficient statistic captures all information in the sample relevant to estimating the parameter. Once we know $T(X)$, the rest of the data provides no additional information about $\theta$. Sufficiency is important because: (1) it achieves data reduction without information loss, and (2) by the Rao-Blackwell theorem, we can improve any unbiased estimator by conditioning on a sufficient statistic.

### Factorization Theorem

**Theorem:** $T(X)$ is sufficient for $\theta$ if and only if the joint density factors as: $f(x; \theta) = g(T(x), \theta) \cdot h(x)$

**Motivation:** The factorization theorem provides a practical method for identifying sufficient statistics without computing conditional distributions.
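
As a worked instance, take $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Splitting the sum of squares around $\bar{x}$ gives the factorization:

$$f(x; \mu) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right)$$

$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$

$$f(x; \mu) = \underbrace{\exp\left(-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right)}_{g(\bar{x},\, \mu)} \cdot \underbrace{(2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar{x})^2\right)}_{h(x)}$$

Since $\mu$ appears only in $g$, and $g$ depends on the data only through $\bar{x}$, the sample mean is sufficient for $\mu$.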

In [None]:
# Example: For Normal(μ, σ²) with known σ², sample mean X̄ is sufficient for μ
sample1 = np.array([48, 50, 52]); sample2 = np.array([45, 50, 55])

In [None]:
print(f"Sample 1: {sample1}, X̄₁ = {np.mean(sample1):.1f}")
print(f"Sample 2: {sample2}, X̄₂ = {np.mean(sample2):.1f}")

**Key Insight:** Both samples have the same size and the same mean (50). With $\sigma^2$ known, they contain identical information about $\mu$ despite having different individual values, because the sample mean is sufficient.
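
One way to make "identical information" concrete is to compare the two log-likelihoods as functions of $\mu$: if $\bar{X}$ is sufficient, they can differ only by a constant that does not involve $\mu$. The sketch below assumes a known $\sigma = 10$ and an arbitrary grid of candidate values for $\mu$; the helper `log_likelihood` is a hypothetical name introduced for illustration.

```python
import numpy as np
from scipy import stats

sigma = 10.0                                   # assumed known
sample1 = np.array([48, 50, 52])
sample2 = np.array([45, 50, 55])
mu_grid = np.linspace(30, 70, 9)               # illustrative candidate values of mu

def log_likelihood(sample, mu):
    """Normal log-likelihood of the sample at mean mu (sigma fixed)."""
    return stats.norm.logpdf(sample, loc=mu, scale=sigma).sum()

differences = np.array([log_likelihood(sample1, m) - log_likelihood(sample2, m)
                        for m in mu_grid])
print(np.round(differences, 6))                # the same value at every mu
```

The difference is constant in $\mu$, so both samples lead to the same likelihood-based conclusions about $\mu$.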

## 2.8 Minimum Variance Unbiased Estimator

**Minimum Variance Unbiased Estimator:** An unbiased estimator with smallest variance among all unbiased estimators.

**Rao-Blackwell Theorem:** If $\hat{\theta}$ is unbiased and $T$ is sufficient, then $\tilde{\theta} = E[\hat{\theta}|T]$ has variance less than or equal to $\text{Var}(\hat{\theta})$, with equality only if $\hat{\theta}$ is a function of $T$.

**Motivation:** The Rao-Blackwell theorem shows how to systematically improve any unbiased estimator: condition it on a sufficient statistic. This process maintains unbiasedness while reducing variance. Combined with sufficiency and the Cramér-Rao bound, this theorem provides a constructive method for finding optimal estimators.

In [None]:
# Example: Estimating Bernoulli parameter p
true_p = 0.6; bernoulli_population = stats.bernoulli(true_p)

In [None]:
# Naive unbiased estimator: just use first observation X₁
naive_estimates = [bernoulli_population.rvs(10)[0] for _ in range(5000)]

In [None]:
# Improved estimator using sufficient statistic: sample mean X̄
improved_estimates = [np.mean(bernoulli_population.rvs(10)) for _ in range(5000)]

In [None]:
print(f"Variance(X₁) = {np.var(naive_estimates):.4f}")
print(f"Variance(X̄) = {np.var(improved_estimates):.4f} (substantially lower!)")

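**Result:** The sample mean has much lower variance than a single observation ($p(1-p)/n$ versus $p(1-p)$). Because $E[X_1 \mid \sum_i X_i] = \bar{X}$, the sample mean is exactly the Rao-Blackwell improvement of $X_1$ obtained by conditioning on the sufficient statistic $\sum_i X_i$.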
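
The conditional expectation behind this improvement can be checked numerically: among simulated samples with $\sum_i X_i = t$, the average of $X_1$ should be close to $t/n$. The sketch below uses the same $p$ and $n$ as the example; the seed, the number of simulations, and the minimum group size of 1000 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)            # illustrative seed
p, n, n_sims = 0.6, 10, 50_000

samples = rng.binomial(1, p, size=(n_sims, n))
first_obs = samples[:, 0]                 # naive estimator X1
totals = samples.sum(axis=1)              # sufficient statistic T = sum of the X_i

# Empirical E[X1 | T = t] for each t that occurs often enough to average reliably
cond_means = {}
for t in range(n + 1):
    mask = totals == t
    if mask.sum() >= 1000:
        cond_means[t] = first_obs[mask].mean()

for t, m in cond_means.items():
    print(f"t = {t:2d}: E[X1 | T = t] ≈ {m:.3f}, t/n = {t / n:.1f}")

var_naive = first_obs.var()
var_rb = (totals / n).var()
print(f"Var(X1) ≈ {var_naive:.4f} (theory {p * (1 - p):.4f})")
print(f"Var(X̄) ≈ {var_rb:.4f} (theory {p * (1 - p) / n:.4f})")
```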

## 2.9 Asymptotic Properties

**Asymptotic Properties:** Behavior of estimators as sample size $n \to \infty$.

**Motivation:** Exact finite-sample properties of estimators are often mathematically intractable. Asymptotic properties provide approximations that become increasingly accurate with larger samples. They are useful because: (1) they are often easier to derive, (2) they provide theoretical justification for procedures used with large samples, and (3) modern datasets are frequently large enough for asymptotics to be accurate.

### Asymptotic Unbiasedness

**Definition:** $\lim_{n \to \infty} E[\hat{\theta}_n] = \theta$

**Motivation:** An estimator may be biased in finite samples but become unbiased asymptotically. This is weaker than unbiasedness but still desirable. Many maximum likelihood estimators have this property.

In [None]:
# Maximum Likelihood Estimator for σ² is biased but asymptotically unbiased
sample_sizes = [10, 30, 100, 500, 2000]

In [None]:
biases = [np.mean([np.var(population.rvs(n), ddof=0) for _ in range(1000)]) - true_sigma**2 
          for n in sample_sizes]

In [None]:
plt.plot(sample_sizes, biases, 'o-', markersize=8)
plt.axhline(0, color='r', linestyle='--', linewidth=2); plt.xlabel('Sample Size n'); plt.ylabel('Bias')

**Result:** Bias approaches zero as sample size increases, demonstrating asymptotic unbiasedness.
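
The simulated biases can be compared with the exact value $\text{Bias}(\hat{\sigma}^2) = -\sigma^2/n$. The sketch below also computes the Monte Carlo standard error of each simulated bias, using $\text{Var}(\hat{\sigma}^2) = 2(n-1)\sigma^4/n^2$, which holds for normal data. It assumes 1000 replications per sample size, as in the simulation above; the variable names are illustrative.

```python
import numpy as np

sigma2 = 100.0                                  # true variance from the chapter's setup
ns = np.array([10, 30, 100, 500, 2000])
n_reps = 1000                                   # replications per sample size

theoretical_bias = -sigma2 / ns                 # E[sigma_hat^2] - sigma^2

# Standard deviation of sigma_hat^2 for normal data, then the standard error
# of an average over n_reps independent replications
sd_estimator = np.sqrt(2 * (ns - 1)) * sigma2 / ns
mc_standard_error = sd_estimator / np.sqrt(n_reps)

for n, b, se in zip(ns, theoretical_bias, mc_standard_error):
    print(f"n = {n:5d}: theoretical bias = {b:8.3f}, Monte Carlo SE ≈ {se:.3f}")
```

At the largest sample sizes the Monte Carlo standard error exceeds the bias itself, so the simulated points near zero may fall on either side of the horizontal line.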

### Asymptotic Normality

**Definition:** $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)$, where $\sigma^2$ here denotes the asymptotic variance of the estimator.

**Motivation:** Many estimators are asymptotically normally distributed under mild conditions, regardless of the shape of the population distribution (for the sample mean, a finite population variance suffices). This allows us to construct approximate confidence intervals and hypothesis tests using normal theory, even when exact distributions are unknown or complex. Maximum likelihood estimators generally have this property, making them attractive for inference.

In [None]:
# Sample mean is asymptotically normal even from non-normal population
exponential_population = stats.expon(scale=2)

In [None]:
# Standardized sample means: √n(X̄ - μ) / σ
n = 100; true_mean_exp = 2; true_sd_exp = 2

In [None]:
standardized_means = [np.sqrt(n) * (exponential_population.rvs(n).mean() - true_mean_exp) / true_sd_exp 
                      for _ in range(5000)]

In [None]:
plt.hist(standardized_means, bins=50, density=True, alpha=0.7, edgecolor='black')
x = np.linspace(-4, 4, 100); plt.plot(x, stats.norm.pdf(x), 'r-', linewidth=2, label='N(0,1)')

In [None]:
plt.legend(); plt.title('Asymptotic Normality: Standardized Means from Exponential Distribution')
# Distribution closely matches standard normal despite non-normal population
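
Asymptotic normality is what justifies the usual interval $\bar{X} \pm z_{0.975}\, S/\sqrt{n}$ for a mean. The sketch below checks how often this interval covers the true mean when the data are exponential with mean 2 and $n = 100$. The seed and the number of simulations are illustrative, and the population standard deviation is estimated from each sample rather than treated as known.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)                 # illustrative seed
true_mean, n, n_sims = 2.0, 100, 5000
z = stats.norm.ppf(0.975)                      # about 1.96

samples = rng.exponential(scale=true_mean, size=(n_sims, n))
sample_means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

lower = sample_means - z * standard_errors
upper = sample_means + z * standard_errors
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Empirical coverage of the nominal 95% interval: {coverage:.3f}")
```

Coverage close to, though typically a little below, 95% is what one would expect for a skewed population at this sample size.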

## 2.10 Comparing Estimators: Practical Framework

**Evaluation Criteria:**
1. **Unbiasedness:** $E[\hat{\theta}] = \theta$
2. **Efficiency:** Minimum variance among unbiased estimators
3. **Consistency:** $\hat{\theta}_n \xrightarrow{P} \theta$ as $n \to \infty$
4. **Mean Squared Error:** $\text{MSE} = \text{Bias}^2 + \text{Variance}$
5. **Sufficiency:** Captures all information about parameter

**Motivation:** No single criterion determines the best estimator. Different criteria may favor different estimators. A practical evaluation considers multiple properties simultaneously, balancing theoretical optimality with robustness and computational feasibility.

In [None]:
def evaluate_estimator(estimates, true_value):
    return {'bias': np.mean(estimates) - true_value, 'variance': np.var(estimates), 
            'mse': np.mean((np.array(estimates) - true_value)**2)}

In [None]:
comparison_results = {'mean': evaluate_estimator(means, true_mu),
                     'median': evaluate_estimator(medians, true_mu),
                     'trimmed_mean': evaluate_estimator(trimmed_means, true_mu)}

In [None]:
import pandas as pd
pd.DataFrame(comparison_results).T

**Interpretation:** For normal data, sample mean dominates with lowest variance and Mean Squared Error, confirming theoretical results.

## 2.11 Robustness

**Robust Estimator:** An estimator whose performance does not degrade substantially under violations of assumptions or presence of outliers.

**Motivation:** Real data often violate theoretical assumptions. Outliers, heavy tails, and asymmetry are common. While sample mean is optimal for normal data, it is highly sensitive to outliers. Robust estimators like median and trimmed mean sacrifice some efficiency under ideal conditions for better performance under violations. The choice between efficiency and robustness depends on how much we trust our assumptions.

In [None]:
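def contaminated_sample(n, contamination_proportion=0.1):
    # Round the outlier count and give the remainder to the clean component, so the total is exactly n
    n_outliers = int(round(n * contamination_proportion))
    return np.concatenate([stats.norm(50, 10).rvs(n - n_outliers),
                           stats.norm(50, 50).rvs(n_outliers)])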

In [None]:
# Compare estimators on contaminated data (90% normal + 10% outliers)
contaminated_means = [np.mean(contaminated_sample(50)) for _ in range(5000)]

In [None]:
contaminated_medians = [np.median(contaminated_sample(50)) for _ in range(5000)]
contaminated_trimmed = [trim_mean(contaminated_sample(50), 0.1) for _ in range(5000)]

In [None]:
print(f"Bias - Mean: {np.mean(contaminated_means)-true_mu:.2f}, Median: {np.mean(contaminated_medians)-true_mu:.2f}")
print(f"MSE - Mean: {np.mean((np.array(contaminated_means)-true_mu)**2):.2f}, Median: {np.mean((np.array(contaminated_medians)-true_mu)**2):.2f}")

In [None]:
plt.hist(contaminated_means, bins=50, alpha=0.5, label='Mean', density=True, edgecolor='black')
plt.hist(contaminated_medians, bins=50, alpha=0.5, label='Median', density=True, edgecolor='black'); plt.legend()

**Result:** With contamination, median has lower Mean Squared Error than mean. Robustness becomes more valuable than efficiency when assumptions are violated.

## 2.12 Bootstrap for Estimator Properties

**Bootstrap:** A computational method for estimating the sampling distribution of a statistic by resampling the observed data with replacement.

**Motivation:** Theoretical formulas for bias, variance, and confidence intervals are often unavailable or intractable for complex estimators. The bootstrap provides a general-purpose method for approximating these quantities using only the observed sample. It treats the sample as a surrogate population and estimates properties empirically through resampling. While not a substitute for exact theory when available, bootstrap is invaluable for complex problems.

In [None]:
original_sample = population.rvs(50)
original_estimate = np.mean(original_sample)

In [None]:
# Bootstrap: resample with replacement, recompute statistic
bootstrap_estimates = [np.mean(np.random.choice(original_sample, size=50, replace=True)) 
                       for _ in range(5000)]

In [None]:
bootstrap_se = np.std(bootstrap_estimates)
print(f"Bootstrap Standard Error: {bootstrap_se:.2f}")

In [None]:
theoretical_se = true_sigma / np.sqrt(50)
print(f"Theoretical Standard Error: {theoretical_se:.2f}")

In [None]:
plt.hist(bootstrap_estimates, bins=50, density=True, edgecolor='black')
plt.title('Bootstrap Distribution of Sample Mean'); plt.xlabel('Bootstrap X̄')

**Application:** Bootstrap estimates standard error without requiring knowledge of the population distribution or complex formulas.
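
The same resampling loop also yields a bias estimate and a percentile confidence interval. The sketch below applies it to the plug-in variance $\hat{\sigma}^2$ (divide by $n$), a statistic chosen here for illustration because its bias is known to be negative. The seed, the number of resamples, and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(5)                     # illustrative seed
sample = rng.normal(50.0, 10.0, size=50)           # stand-in for the observed data
n_boot = 5000

# Each row of idx selects one resample of the data, drawn with replacement
idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
boot_vars = sample[idx].var(axis=1, ddof=0)        # plug-in variance on each resample

plug_in = sample.var(ddof=0)
boot_bias = boot_vars.mean() - plug_in             # bootstrap estimate of bias
boot_se = boot_vars.std(ddof=1)                    # bootstrap standard error
ci_low, ci_high = np.percentile(boot_vars, [2.5, 97.5])

print(f"Plug-in variance: {plug_in:.2f}")
print(f"Bootstrap bias estimate: {boot_bias:.2f} (compare -plug_in/n = {-plug_in / sample.size:.2f})")
print(f"Bootstrap SE: {boot_se:.2f}, 95% percentile interval: ({ci_low:.2f}, {ci_high:.2f})")
```

The bootstrap bias estimate is negative and of roughly the size $-\hat{\sigma}^2/n$, mirroring the known bias of the estimator without using its formula.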

## Summary: Ideal Estimator Properties

**Optimal estimator characteristics:**
- **Unbiased:** $E[\hat{\theta}] = \theta$ (correct on average)
- **Efficient:** Achieves Cramér-Rao Lower Bound (minimum variance)
- **Consistent:** $\hat{\theta}_n \xrightarrow{P} \theta$ (converges to truth)
- **Sufficient:** Based on sufficient statistic (uses all information)
- **Robust:** Performs well under violations of assumptions

**Reality:** No estimator is simultaneously optimal under all criteria for all problems. Practical choice requires balancing theoretical optimality with robustness, computational feasibility, and the specific goals of analysis.

## Key Takeaways

- **Multiple criteria exist for evaluating estimators:** Unbiasedness, efficiency, consistency, Mean Squared Error, and sufficiency each capture different aspects of estimator quality. No single criterion dominates.

- **Unbiasedness does not guarantee optimality:** The bias-variance tradeoff shows that accepting small bias can reduce Mean Squared Error. Unbiasedness is desirable but not always most important.

- **The sample mean is the Minimum Variance Unbiased Estimator for the normal mean:** It is unbiased, achieves the Cramér-Rao Lower Bound, is based on a sufficient statistic, and is consistent. This makes it theoretically optimal for normal data.

- **Median is more robust but less efficient:** While median has higher variance than mean for normal data, it performs better with outliers or heavy-tailed distributions. The choice depends on confidence in normality assumptions.

- **Cramér-Rao Lower Bound provides theoretical benchmark:** It establishes the best possible variance for unbiased estimators, allowing us to assess whether an estimator can be improved.

- **Asymptotic properties simplify analysis:** Consistency and asymptotic normality are often easier to establish than exact finite-sample properties, and they justify large-sample approximations commonly used in practice.