# Chapter 2: Properties of Estimators - Exercises

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 30px; border-radius: 10px; color: white;'>
<h1 style='margin: 0; font-size: 2.5em;'>PLAI Academy</h1>
<p style='margin: 10px 0 0 0; font-size: 1.2em; opacity: 0.9;'>Statistical Inference • Chapter 2 Exercises</p>
</div>

---

## Exercise Structure

**20 exercises** covering bias, variance, efficiency, consistency, and MSE:

1. **Exercises 1-5**: Core estimator properties
2. **Exercises 6-10**: Advanced topics from statistics textbooks
3. **Exercises 11-15**: Machine learning applications
4. **Exercises 16-20**: Current research (2025+)

---

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import trim_mean
sns.set_theme(style='whitegrid', palette='husl')

---

## Part 1: Core Concepts (Exercises 1-5)

### Exercise 2.1: Verifying Unbiasedness

**Tasks:**
1. Create a population N(μ=50, σ²=100)
2. Draw 10,000 samples of size n=30
3. For each sample, calculate:
   - Sample mean X̄
   - Biased variance estimator: σ̂² = (1/n)Σ(xᵢ-X̄)²
   - Unbiased variance estimator: s² = (1/(n-1))Σ(xᵢ-X̄)²
4. Calculate E[X̄], E[σ̂²], and E[s²]
5. Verify: E[X̄] = μ, E[s²] = σ², E[σ̂²] < σ²
6. Calculate the bias of σ̂²
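
The tasks above can be sketched as follows. This is a minimal starting point rather than a reference solution; the seed and the variable names are arbitrary choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
mu, sigma2, n, n_samples = 50.0, 100.0, 30, 10_000

# Each row is one sample of size n from N(mu, sigma2)
samples = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, n))

xbar = samples.mean(axis=1)
var_biased = samples.var(axis=1, ddof=0)    # divides by n
var_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1

# Averages over the simulated samples approximate the expectations
print(f"E[mean]       ~ {xbar.mean():.3f} (target {mu})")
print(f"E[s^2]        ~ {var_unbiased.mean():.3f} (target {sigma2})")
print(f"E[sigma_hat^2] ~ {var_biased.mean():.3f} (below {sigma2})")

empirical_bias = var_biased.mean() - sigma2
theoretical_bias = -sigma2 / n              # E[sigma_hat^2] - sigma^2 = -sigma^2 / n
print(f"bias of sigma_hat^2: empirical {empirical_bias:.3f}, theory {theoretical_bias:.3f}")
```

The empirical bias should land near -σ²/n = -3.33, up to Monte Carlo noise.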

In [None]:
# Your solution here


### Exercise 2.2: Bias-Variance Tradeoff

**Reference**: [Stanford CS109 - Properties of Estimators](https://web.stanford.edu/class/archive/cs/cs109/cs109.1218/files/student_drive/7.6.pdf)

**Tasks:**
1. Create a population with true parameter θ = 10
2. Compare three estimators:
   - Unbiased: θ̂₁ = X̄ (sample mean)
   - Shrinkage, lower variance: θ̂₂ = 0.9X̄ + 1 (bias 1 − 0.1θ, which is zero at θ = 10 but not elsewhere)
   - Highly biased: θ̂₃ = 8 (constant)
3. For each estimator, calculate:
   - Bias
   - Variance
   - MSE = Bias² + Variance
4. Which estimator has lowest MSE?
5. Plot MSE decomposition (bias² vs variance) for each estimator
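
One way to set up the decomposition is sketched below. The exercise does not fix the population or the sample size, so the normal population with σ = 2 and n = 25 are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, n_rep = 10.0, 2.0, 25, 20_000   # sigma, n, n_rep are assumed values

# Sample means of n_rep simulated samples from N(theta, sigma^2)
xbar = rng.normal(theta, sigma, size=(n_rep, n)).mean(axis=1)

estimators = {
    "mean": xbar,
    "0.9*mean + 1": 0.9 * xbar + 1.0,
    "constant 8": np.full(n_rep, 8.0),
}

results = {}
for name, est in estimators.items():
    bias = est.mean() - theta
    variance = est.var()                    # ddof=0 so that MSE = bias^2 + variance exactly
    mse = np.mean((est - theta) ** 2)
    results[name] = (bias, variance, mse)
    print(f"{name:>13}: bias={bias:+.4f}  var={variance:.4f}  mse={mse:.4f}")
```

Rerunning with a different `theta` shows how the ranking of the three estimators depends on the true parameter value.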

In [None]:
# Your solution here


### Exercise 2.3: Efficiency Comparison

**Tasks:**
1. For a normal population N(100, 15²), compare three location estimators:
   - Sample mean
   - Sample median
   - 10% trimmed mean
2. Generate 5000 samples of size n=50 for each
3. Calculate variance of each estimator
4. Compute relative efficiency: Eff(median, mean) = Var(mean)/Var(median)
5. Verify the theoretical result: the median has ~64% efficiency (2/π asymptotically) for normal data
6. Repeat for exponential population - which estimator is most efficient now?
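
A possible sketch for the normal case follows. It takes the relative efficiency of the median as Var(mean)/Var(median), the ratio that corresponds to the ~64% figure in task 5. The helper `trimmed_mean` is a hypothetical NumPy stand-in that drops 10% from each tail, which is the same convention `scipy.stats.trim_mean` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 50, 5000

def trimmed_mean(samples, prop):
    """Row-wise mean after dropping a fraction `prop` from each tail."""
    n_obs = samples.shape[1]
    k = int(prop * n_obs)
    return np.sort(samples, axis=1)[:, k:n_obs - k].mean(axis=1)

def estimator_variances(samples):
    """Variance of each location estimator across the simulated samples."""
    return {
        "mean": samples.mean(axis=1).var(),
        "median": np.median(samples, axis=1).var(),
        "trimmed10": trimmed_mean(samples, 0.10).var(),
    }

normal_var = estimator_variances(rng.normal(100, 15, size=(n_rep, n)))
eff_median = normal_var["mean"] / normal_var["median"]
eff_trimmed = normal_var["mean"] / normal_var["trimmed10"]
print(f"Var(mean)={normal_var['mean']:.3f}  Eff(median)={eff_median:.3f}  Eff(trimmed)={eff_trimmed:.3f}")
```

`estimator_variances` can be reused for task 6 by passing exponential samples. In that case the mean and median target different population quantities, so it helps to decide first which one is being estimated.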

In [None]:
# Your solution here


### Exercise 2.4: Consistency Demonstration

**Tasks:**
1. Consider estimating the upper bound θ of Uniform(0, θ)
2. One estimator: θ̂₁ = 2X̄ (method of moments)
3. Another: θ̂₂ = max(X₁,...,Xₙ) × (n+1)/n (bias-corrected maximum likelihood)
4. For n = [10, 30, 100, 500, 2000]:
   - Generate 1000 samples of each size
   - Calculate both estimators
   - Plot distributions
5. Show both estimators are consistent by examining how distributions concentrate
6. Which converges faster? Calculate MSE(n) for both
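
The MSE comparison can be sketched as below, with both estimators targeting θ. The true value θ = 5 is an assumption, since the exercise leaves it open. For reference, Var(2X̄) = θ²/(3n), while the variance of the scaled maximum is θ²/(n(n+2)), so the two should shrink at visibly different rates.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n_rep = 5.0, 1000                  # theta = 5 is an assumed true value
sizes = [10, 30, 100, 500, 2000]

mse_mom, mse_max = {}, {}
for n in sizes:
    x = rng.uniform(0, theta, size=(n_rep, n))
    est_mom = 2 * x.mean(axis=1)                # method of moments
    est_max = x.max(axis=1) * (n + 1) / n       # scaled sample maximum
    mse_mom[n] = np.mean((est_mom - theta) ** 2)
    mse_max[n] = np.mean((est_max - theta) ** 2)
    print(f"n={n:>4}: MSE(2*mean)={mse_mom[n]:.5f}  MSE(scaled max)={mse_max[n]:.7f}")
```

Plotting both MSE curves against n on log-log axes makes the difference in convergence rate easy to read off.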

In [None]:
# Your solution here


### Exercise 2.5: Cramér-Rao Lower Bound

**Reference**: [Bias, MSE, Relative Efficiency](https://uregina.ca/~kozdron/Teaching/Regina/252Winter16/Handouts/ch3.pdf)

**Tasks:**
1. For Poisson(λ), the Fisher Information per observation is I(λ) = 1/λ
2. Generate data from Poisson(λ=5) with n=30
3. The sample mean X̄ estimates λ
4. Calculate:
   - Cramér-Rao Lower Bound: CRLB = 1/(n×I(λ))
   - Empirical variance of X̄ from 10,000 samples
   - Theoretical variance: Var(X̄) = λ/n
5. Show that X̄ achieves the CRLB (it's efficient)
6. Try a different estimator (e.g., median) and show it has higher variance
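
A minimal sketch of the comparison, using the values stated in the tasks; only the seed and the variable names are added here.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, n_rep = 5.0, 30, 10_000

fisher_info = 1.0 / lam            # Fisher information of one Poisson observation
crlb = 1.0 / (n * fisher_info)     # algebraically equal to lam / n

x = rng.poisson(lam, size=(n_rep, n))
var_mean = x.mean(axis=1).var()            # empirical variance of the sample mean
var_median = np.median(x, axis=1).var()    # empirical variance of the sample median

print(f"CRLB        = {crlb:.4f}")
print(f"Var(mean)   = {var_mean:.4f}")
print(f"Var(median) = {var_median:.4f}")
```

The sample median of Poisson data is discrete and is not centered on λ in general, so its variance is only part of the comparison; its MSE around λ is also worth reporting.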

In [None]:
# Your solution here


---

## Part 2: Advanced Problems (Exercises 6-10)

### Exercise 2.6: Rao-Blackwell Improvement

**Tasks:**
1. For Bernoulli(p) data with n observations
2. Start with crude estimator: θ̂₀ = X₁ (just the first observation)
3. The sufficient statistic is T = ΣXᵢ
4. Apply Rao-Blackwell: θ̂* = E[θ̂₀|T] = T/n (the sample proportion)
5. Simulate p=0.6, n=20 with 5000 repetitions
6. Calculate Var(θ̂₀) and Var(θ̂*)
7. Verify: Var(θ̂*) = Var(θ̂₀) × (1/n)
8. Show both are unbiased but θ̂* has much lower variance
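
The simulation in tasks 5 to 8 could look like this sketch; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_rep = 0.6, 20, 5000

x = rng.binomial(1, p, size=(n_rep, n))   # each row is one Bernoulli sample
crude = x[:, 0].astype(float)             # theta_0 = X_1
rao_blackwell = x.mean(axis=1)            # E[X_1 | T] = T / n

print(f"mean of crude estimator:     {crude.mean():.4f}")
print(f"mean of Rao-Blackwellized:   {rao_blackwell.mean():.4f}")
print(f"Var(crude) = {crude.var():.4f}, Var(RB) = {rao_blackwell.var():.4f}")
print(f"variance ratio = {crude.var() / rao_blackwell.var():.2f} (theory: n = {n})")
```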

In [None]:
# Your solution here


### Exercise 2.7: James-Stein Estimator

The famous James-Stein estimator shows that the MLE X is inadmissible under squared-error loss for dimensions ≥3.

**Tasks:**
1. Fix a vector of p=5 means μ = [1, 2, 3, 4, 5]
2. Observe X ~ N(μ, I) where I is identity matrix
3. Compare estimators:
   - MLE: δ̂_MLE = X
   - James-Stein: δ̂_JS = (1 - (p-2)/(||X||²)) × X
4. Repeat 10,000 times and calculate:
   - Average squared error: ||δ̂ - μ||²
5. Show James-Stein dominates MLE (lower MSE)
6. This violates the intuition that the MLE X is optimal!
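
A compact sketch of the risk comparison, using the shrinkage formula from task 3 as written (no positive-part correction). Both estimators are evaluated on the same draws, which keeps the comparison stable even though the gap is small for this μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p, n_rep = mu.size, 10_000

x = mu + rng.standard_normal((n_rep, p))     # one draw X ~ N(mu, I) per row

# James-Stein shrinks each observation vector toward the origin
shrink = 1.0 - (p - 2) / np.sum(x ** 2, axis=1, keepdims=True)
js = shrink * x

risk_mle = np.mean(np.sum((x - mu) ** 2, axis=1))
risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(f"average squared error: MLE={risk_mle:.3f}  James-Stein={risk_js:.3f}")
```

Moving μ closer to the origin (for example, scaling it by 0.2) makes the improvement much larger, which is a useful follow-up experiment.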

In [None]:
# Your solution here


### Exercise 2.8: Robustness vs Efficiency

**Tasks:**
1. Generate clean data: N(50, 10²), n=40
2. Generate contaminated data: 90% N(50, 10²) + 10% N(50, 50²)
3. Compare estimators: mean, median, 10% trimmed mean, 20% trimmed mean
4. For clean data, calculate:
   - MSE of each estimator
   - Relative efficiency vs mean
5. For contaminated data, recalculate MSE
6. Create a table showing MSE(clean) and MSE(contaminated) for each
7. Which estimator offers best efficiency-robustness balance?
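
The clean-versus-contaminated table can be built along these lines. The number of repetitions is an assumption, and `trimmed_mean` is a hypothetical helper that trims the stated fraction from each tail.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, n_rep = 50.0, 40, 5000      # n_rep is an assumed repetition count

def trimmed_mean(samples, prop):
    """Row-wise mean after dropping a fraction `prop` from each tail."""
    n_obs = samples.shape[1]
    k = int(prop * n_obs)
    return np.sort(samples, axis=1)[:, k:n_obs - k].mean(axis=1)

def mse_by_estimator(samples):
    estimates = {
        "mean": samples.mean(axis=1),
        "median": np.median(samples, axis=1),
        "trim10": trimmed_mean(samples, 0.10),
        "trim20": trimmed_mean(samples, 0.20),
    }
    return {name: float(np.mean((est - mu) ** 2)) for name, est in estimates.items()}

clean = rng.normal(mu, 10.0, size=(n_rep, n))
# Each observation comes from the wide component with probability 0.10
is_wide = rng.random((n_rep, n)) < 0.10
contaminated = rng.normal(mu, np.where(is_wide, 50.0, 10.0))

mse_clean = mse_by_estimator(clean)
mse_contaminated = mse_by_estimator(contaminated)
for name in mse_clean:
    print(f"{name:>7}: clean={mse_clean[name]:.2f}  contaminated={mse_contaminated[name]:.2f}")
```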

In [None]:
# Your solution here


### Exercise 2.9: Asymptotic Normality

**Tasks:**
1. For exponential data Exp(λ=2), the MLE is λ̂ = 1/X̄
2. For sample sizes n = [10, 30, 100, 500]:
   - Generate 5000 samples
   - Calculate λ̂ for each
   - Standardize: Z = √n(λ̂ - λ)/λ, i.e. (λ̂ - λ)/se(λ̂) with se(λ̂) = λ/√n
3. Plot standardized distributions
4. Overlay N(0,1) density
5. Use Kolmogorov-Smirnov test to measure convergence to normality
6. At what sample size is normal approximation accurate?
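
The sketch below standardizes with the asymptotic result √n(λ̂ − λ) → N(0, λ²), so Z = √n(λ̂ − λ)/λ. To stay within NumPy and the standard library, it computes the Kolmogorov-Smirnov distance by hand in a hypothetical helper `ks_distance_to_normal`. `scipy.stats.kstest(z, 'norm')` returns the same statistic together with a p-value.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
lam, n_rep = 2.0, 5000
sizes = [10, 30, 100, 500]

def ks_distance_to_normal(z):
    """Kolmogorov-Smirnov distance between the sample z and N(0, 1)."""
    z = np.sort(z)
    m = z.size
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    upper = np.arange(1, m + 1) / m - cdf   # empirical CDF above the normal CDF
    lower = cdf - np.arange(0, m) / m       # empirical CDF below the normal CDF
    return float(max(upper.max(), lower.max()))

ks = {}
for n in sizes:
    x = rng.exponential(scale=1.0 / lam, size=(n_rep, n))  # NumPy uses scale = 1 / rate
    lam_hat = 1.0 / x.mean(axis=1)
    z = np.sqrt(n) * (lam_hat - lam) / lam
    ks[n] = ks_distance_to_normal(z)
    print(f"n={n:>3}: KS distance={ks[n]:.4f}  mean(Z)={z.mean():+.3f}")
```

With 5000 replications the KS distance has a noise floor of roughly 0.01 to 0.02, so differences below that level should not be over-interpreted.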

In [None]:
# Your solution here


### Exercise 2.10: Bootstrap Standard Errors

**Reference**: [Estimation Theory](http://www.dliebl.com/RM_ES_Script/estimation-theory.html)

**Tasks:**
1. Generate original sample of n=50 from Gamma(shape=3, scale=2)
2. Calculate these statistics: mean, median, 20% trimmed mean, standard deviation
3. For each statistic:
   - Perform 5000 bootstrap resamples
   - Calculate the statistic on each resample
   - Estimate SE from bootstrap distribution
4. Compare bootstrap SE to theoretical SE (where known)
5. Construct 95% bootstrap confidence intervals using:
   - Percentile method
   - Normal approximation method
6. Which method gives more accurate coverage?
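
A sketch of the bootstrap loop in vectorized form. It covers the mean, median, and standard deviation; the 20% trimmed mean can be added in the same way. Only the mean has a simple plug-in standard error (s/√n) to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 50, 5000
sample = rng.gamma(shape=3.0, scale=2.0, size=n)

# Resample indices with replacement: one bootstrap resample per row
idx = rng.integers(0, n, size=(n_boot, n))
resamples = sample[idx]

boot = {
    "mean": resamples.mean(axis=1),
    "median": np.median(resamples, axis=1),
    "std": resamples.std(axis=1, ddof=1),
}
boot_se = {name: float(vals.std(ddof=1)) for name, vals in boot.items()}

# Plug-in standard error of the mean, for comparison
se_mean_formula = sample.std(ddof=1) / np.sqrt(n)
print(f"SE(mean): bootstrap={boot_se['mean']:.4f}  formula={se_mean_formula:.4f}")

# Two 95% intervals for the mean
ci_percentile = np.percentile(boot["mean"], [2.5, 97.5])
ci_normal = sample.mean() + np.array([-1.96, 1.96]) * boot_se["mean"]
print("percentile CI:", ci_percentile, " normal CI:", ci_normal)
```

Answering task 6 needs an outer loop: draw many original samples, build both intervals for each, and count how often they contain the true mean (shape × scale = 6).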

In [None]:
# Your solution here


---

## Part 3: ML Applications (Exercises 11-15)

### Exercise 2.11: Bias-Variance in Ridge Regression

Ridge regression introduces bias to reduce variance.

**Tasks:**
1. Generate data: y = Xβ + ε with p=10 features, n=50, high correlation between features
2. For λ = [0, 0.1, 1, 10, 100]:
   - Generate 200 training sets
   - Fit ridge regression: β̂ = (X'X + λI)⁻¹X'y
   - Predict on fixed test point
3. For each λ, calculate:
   - Bias² of predictions
   - Variance of predictions
   - Total MSE
4. Plot bias², variance, and MSE vs λ
5. Find optimal λ that minimizes MSE
6. Compare to OLS (λ=0)
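
One way to organize the simulation is sketched below. Several details are assumptions made for illustration: equicorrelated features with ρ = 0.9, unit noise, the particular `beta` and test point `x0`, and a design matrix held fixed across training sets so that only the noise is redrawn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_train_sets = 50, 10, 200
rho, noise_sd = 0.9, 1.0                     # assumed correlation and noise level
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]

# Equicorrelated design, held fixed so only the noise changes between sets
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
beta = np.linspace(-1.0, 1.0, p)             # assumed true coefficients
x0 = np.linspace(1.0, 2.0, p)                # assumed fixed test point
target = x0 @ beta

# One column of responses per training set
Y = (X @ beta)[:, None] + noise_sd * rng.standard_normal((n, n_train_sets))

results = {}
for lam in lambdas:
    # Ridge solution (X'X + lam I)^{-1} X'y for all training sets at once
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    preds = x0 @ beta_hat
    bias2 = (preds.mean() - target) ** 2
    variance = preds.var()
    results[lam] = (bias2, variance, bias2 + variance)
    print(f"lambda={lam:>6}: bias^2={bias2:.4f}  var={variance:.4f}  mse={bias2 + variance:.4f}")

best_lam = min(results, key=lambda lam: results[lam][2])
print("lambda with lowest prediction MSE:", best_lam)
```

Which λ wins depends strongly on `beta`, `x0`, and the noise level, so it is worth varying them before drawing conclusions.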

In [None]:
# Your solution here


### Exercise 2.12: Consistency of SGD Estimators

Stochastic Gradient Descent produces estimators that should be consistent, provided the step size is chosen so that the iterates converge.

**Tasks:**
1. True model: y = 3x + 2 + ε
2. For dataset sizes n = [100, 500, 1000, 5000, 10000]:
   - Generate data
   - Run mini-batch SGD with batch_size=32
   - Record final β̂ estimates
3. Repeat 100 times for each n
4. Plot distribution of β̂ for each sample size
5. Calculate:
   - Bias(β̂) vs n
   - Var(β̂) vs n
   - MSE(β̂) vs n
6. Verify: MSE → 0 as n → ∞ (consistency)
7. How does learning rate affect consistency?
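
A reduced sketch of the experiment follows. To keep it fast it uses fewer sample sizes and repetitions than the exercise asks for. The step-size schedule, number of epochs, and standard normal inputs and noise are all assumptions. At small n the few available updates leave a visible optimization error on top of the sampling error, which is part of what task 7 probes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_beta = np.array([3.0, 2.0])            # slope, intercept
sizes, n_rep = [100, 1000, 10_000], 20      # reduced grid, for speed

def sgd_fit(x, y, batch_size=32, epochs=5, lr0=0.1, decay=0.01):
    """Mini-batch SGD for least squares with a decaying step size (assumed schedule)."""
    features = np.column_stack([x, np.ones_like(x)])
    beta = np.zeros(2)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(x.size)     # reshuffle the data every epoch
        for start in range(0, x.size, batch_size):
            batch = order[start:start + batch_size]
            residual = features[batch] @ beta - y[batch]
            grad = features[batch].T @ residual / batch.size
            beta -= lr0 / (1.0 + decay * step) * grad
            step += 1
    return beta

mse = {}
for n in sizes:
    estimates = []
    for _ in range(n_rep):
        x = rng.standard_normal(n)
        y = true_beta[0] * x + true_beta[1] + rng.standard_normal(n)
        estimates.append(sgd_fit(x, y))
    estimates = np.array(estimates)
    mse[n] = float(np.mean(np.sum((estimates - true_beta) ** 2, axis=1)))
    print(f"n={n:>5}: mean estimate={estimates.mean(axis=0)}  MSE={mse[n]:.5f}")
```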

In [None]:
# Your solution here


### Exercise 2.13: Estimator Comparison for Model Selection

**Tasks:**
1. Generate data with 20 features, only 5 truly predictive
2. Compare model selection criteria:
   - AIC = 2k - 2ln(L)
   - BIC = k×ln(n) - 2ln(L)
   - Cross-validation MSE
3. For each criterion:
   - Try all possible feature subsets (or use forward selection)
   - Select best model
4. Repeat 200 times with different data
5. Calculate for each criterion:
   - Probability of selecting correct features
   - Average number of false positives
   - Average test MSE
6. Which criterion is most consistent at finding the true model?

In [None]:
# Your solution here


### Exercise 2.14: Variance Estimation in Deep Learning

**Tasks:**
1. Train a neural network on a regression task
2. Estimate prediction variance using three methods:
   - Monte Carlo dropout (run forward pass 100 times with dropout enabled)
   - Bootstrap (retrain on 50 bootstrapped datasets)
   - Ensemble (train 50 networks with different initializations)
3. For a test set:
   - Get mean prediction and variance from each method
   - Compare variance estimates
4. Evaluate uncertainty quality:
   - Do wider intervals have higher error?
   - Calculate calibration
5. Which method gives most reliable uncertainty estimates?

In [None]:
# Your solution here


### Exercise 2.15: Efficient Estimation in Logistic Regression

**Tasks:**
1. Generate binary classification data with known β
2. Compare three estimators for β:
   - Maximum Likelihood (standard logistic regression)
   - Bias-reduced estimator (Firth's penalized likelihood)
   - Bayesian posterior mean with weak prior
3. For n = [50, 100, 500]:
   - Generate 500 datasets
   - Fit all three methods
4. Calculate for each method:
   - Bias
   - Variance
   - MSE
5. Which achieves lowest MSE?
6. How does sample size affect relative performance?

In [None]:
# Your solution here


---

## Part 4: Contemporary Research, 2025+ (Exercises 16-20)

### Exercise 2.16: Estimator Properties in Double Descent

**Reference**: [Rethinking Bias-Variance for Neural Networks (2025)](https://dl.acm.org/doi/pdf/10.5555/3524938.3525936)

Modern neural networks exhibit "double descent", where test error falls again as the parameter count grows past the interpolation threshold.

**Tasks:**
1. Generate data: y = f(x) + ε with n=100 samples
2. Fit polynomial models of degree d = [2, 5, 10, 20, 50, 99, 100, 150, 200]
3. For each degree (repeat 200 times):
   - Train on all n samples (once d + 1 > n the least-squares fit is not unique; use the minimum-norm solution)
   - Predict on test set
4. Calculate:
   - Bias²(d)
   - Variance(d)
   - Test MSE(d)
5. Plot all three vs model complexity
6. Observe:
   - Classical U-shape (underfitting → overfitting)
   - Second descent after interpolation threshold
7. Does variance continue increasing, or does it decrease in over-parameterized regime?
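
The measurement can be set up as in the sketch below. Several choices are assumptions rather than part of the exercise: the true function sin(2πx), the noise level, a Legendre basis (raw monomials become numerically unusable at these degrees), and 30 repetitions instead of 200 to keep the run short. `np.linalg.lstsq` returns the minimum-norm solution when there are more coefficients than samples. Whether a clear second descent appears depends on these choices, so treat the output as something to investigate rather than a guaranteed picture.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n, n_rep, noise_sd = 100, 30, 0.3           # n_rep and noise_sd are assumed values
degrees = [2, 5, 10, 20, 50, 99, 100, 150, 200]

def f(x):
    return np.sin(2 * np.pi * x)            # assumed true function

x_test = np.linspace(-0.95, 0.95, 50)
preds = {d: [] for d in degrees}
for _ in range(n_rep):
    x = rng.uniform(-1, 1, n)
    y = f(x) + noise_sd * rng.standard_normal(n)
    for d in degrees:
        # Least squares in the Legendre basis; minimum-norm when d + 1 > n
        coef, *_ = np.linalg.lstsq(legendre.legvander(x, d), y, rcond=None)
        preds[d].append(legendre.legvander(x_test, d) @ coef)

summary = {}
for d in degrees:
    p = np.array(preds[d])                  # shape (n_rep, number of test points)
    bias2 = float(np.mean((p.mean(axis=0) - f(x_test)) ** 2))
    variance = float(np.mean(p.var(axis=0)))
    summary[d] = (bias2, variance, bias2 + variance + noise_sd ** 2)
    print(f"d={d:>3}: bias^2={bias2:.3e}  var={variance:.3e}  test MSE={summary[d][2]:.3e}")
```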

In [None]:
# Your solution here


### Exercise 2.17: Uncertainty in LLM Reward Models

**Reference**: [Uncertainty Quantification for LLM Reward Learning (arxiv 2025)](https://arxiv.org/abs/2512.03208)

**Tasks:**
1. Simulate reward model: given pairs (response_A, response_B), predict which is better
2. True preferences follow Bradley-Terry model: P(A > B) = exp(r_A)/(exp(r_A)+exp(r_B))
3. Collect n comparisons, fit reward model
4. For the reward estimate r̂:
   - Calculate asymptotic variance using Fisher Information
   - Construct confidence intervals
5. Verify coverage by simulation:
   - Generate 1000 datasets
   - For each, construct 95% CI
   - Check if true r is contained
6. How does uncertainty scale with n?
7. What if there are heterogeneous annotators (some more reliable)?
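
For the simplest case of a single pair of responses, the coverage check can be sketched as follows. Only the difference δ = r_A − r_B is identifiable from comparisons, so the sketch estimates δ; the true rewards and n = 200 are assumed values. Under Bradley-Terry, P(A > B) is the logistic function of δ, and the Fisher information for δ from n comparisons is n·p(1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
r_a, r_b = 1.0, 0.5                        # assumed true rewards
n, n_datasets = 200, 1000
delta = r_a - r_b
p_true = np.exp(r_a) / (np.exp(r_a) + np.exp(r_b))   # Bradley-Terry P(A > B)

wins = rng.binomial(n, p_true, size=n_datasets)      # times A is preferred, per dataset
p_hat = np.clip(wins / n, 1 / (2 * n), 1 - 1 / (2 * n))  # guard against 0 or 1
delta_hat = np.log(p_hat / (1 - p_hat))              # MLE of r_A - r_B

# Asymptotic standard error from the Fisher information, with p_hat plugged in
se = 1.0 / np.sqrt(n * p_hat * (1 - p_hat))
covered = np.abs(delta_hat - delta) <= 1.96 * se
coverage = float(covered.mean())
print(f"true delta={delta:.2f}  mean estimate={delta_hat.mean():.3f}  coverage={coverage:.3f}")
```

Repeating this for several values of `n` shows the 1/√n scaling of the interval width asked about in task 6.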

In [None]:
# Your solution here


### Exercise 2.18: Efficiency of LLM Survey Estimators

**Reference**: [How Many Survey Respondents is an LLM Worth? (arxiv 2025)](https://arxiv.org/abs/2502.17773)

**Tasks:**
1. True population has opinion distribution: [40% A, 35% B, 25% C]
2. LLM simulates responses with slight bias: [42% A, 34% B, 24% C]
3. Compare two estimators:
   - Real survey: p̂_real from n_real human responses
   - LLM-augmented: combine n_real humans + n_LLM simulated responses
4. For fixed budget (n_real = 100):
   - Vary n_LLM = [0, 100, 500, 1000, 5000]
   - Calculate MSE of combined estimator
5. Find optimal weight: p̂ = w×p̂_real + (1-w)×p̂_LLM
6. When does adding LLM data help vs hurt?
7. Propose efficiency adjustment for LLM bias
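
For a single opinion category, the MSE of the weighted estimator has a closed form when the human and LLM samples are independent: MSE(w) = w²·Var(p̂_real) + (1 − w)²·[Var(p̂_LLM) + bias²]. Minimizing this quadratic gives w* = MSE_LLM / (Var_real + MSE_LLM). The sketch below evaluates it for option A; the function names are illustrative.

```python
def llm_mse(p_true, p_llm, n_llm):
    """MSE of the LLM-only proportion: sampling variance plus squared bias."""
    return p_llm * (1 - p_llm) / n_llm + (p_llm - p_true) ** 2

def combined_mse(w, var_real, mse_llm):
    """MSE of w * p_real + (1 - w) * p_llm for independent components."""
    return w ** 2 * var_real + (1 - w) ** 2 * mse_llm

p_true, p_llm, n_real = 0.40, 0.42, 100     # option A from the exercise
var_real = p_true * (1 - p_true) / n_real

table = {}
for n_llm in [100, 500, 1000, 5000]:
    m = llm_mse(p_true, p_llm, n_llm)
    w_opt = m / (var_real + m)              # minimizer of the quadratic in w
    w_pool = n_real / (n_real + n_llm)      # weight implied by simply pooling responses
    table[n_llm] = (w_opt, combined_mse(w_opt, var_real, m), combined_mse(w_pool, var_real, m))
    print(f"n_LLM={n_llm:>4}: w*={w_opt:.3f}  MSE(w*)={table[n_llm][1]:.6f}  "
          f"MSE(pooled)={table[n_llm][2]:.6f}  MSE(real only)={var_real:.6f}")
```

Because the squared bias does not shrink with n_LLM, the LLM term has a floor of (p_LLM − p)², which bounds how much extra simulated data can help.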

In [None]:
# Your solution here


### Exercise 2.19: Causal Inference Estimator Comparison

**Reference**: [A/B Testing at Scale (ResearchGate 2025)](https://www.researchgate.net/publication/392470551)

**Tasks:**
1. Simulate A/B test with n=100,000 users
2. Treatment effect: τ = 0.03 (3% lift)
3. Add confounders (user characteristics that affect both treatment and outcome)
4. Compare estimators:
   - Simple difference in means (ignoring confounders)
   - Regression adjustment
   - Inverse propensity weighting
   - Doubly-robust estimator
5. For each, calculate:
   - Bias
   - Variance
   - MSE
   - Coverage of 95% CI
6. Which estimator is most efficient?
7. How do results change with perfect randomization vs observational data?
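
A single-replication sketch of the observational setting is given below, covering three of the four estimators (the doubly-robust estimator is left as part of the exercise). The data-generating process is an assumption: one standard normal confounder, a logistic propensity, and a continuous outcome with an additive effect τ. The IPW estimator uses the true propensity in its normalized (Hajek) form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 100_000, 0.03

x = rng.standard_normal(n)                       # single confounder (assumed)
propensity = 1.0 / (1.0 + np.exp(-x))            # P(treated | x)
t = rng.random(n) < propensity                   # boolean treatment indicator
y = tau * t + 0.5 * x + rng.standard_normal(n)   # assumed outcome model

# 1. Difference in means, ignoring the confounder
naive = y[t].mean() - y[~t].mean()

# 2. Regression adjustment: OLS of y on [1, t, x], read off the coefficient on t
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
reg_adj = coef[1]

# 3. Inverse propensity weighting with normalized weights
w1, w0 = t / propensity, (~t) / (1 - propensity)
ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

print(f"true effect={tau}  naive={naive:.4f}  regression={reg_adj:.4f}  IPW={ipw:.4f}")
```

Replacing `propensity` with a constant 0.5 gives the randomized case in task 7. Wrapping the block in a loop over replications yields the bias, variance, MSE, and coverage asked for in task 5.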

In [None]:
# Your solution here


### Exercise 2.20: Bootstrap Bias Correction in Model Evaluation

**Reference**: [Bootstrap Bias Corrected Cross Validation (PMC 2025)](https://pmc.ncbi.nlm.nih.gov/articles/PMC7304018/)

**Tasks:**
1. Train a complex model (e.g., random forest) on n=200 samples
2. Evaluate performance using:
   - Training error (optimistic bias)
   - Standard cross-validation
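   - Bootstrap .632 estimator: 0.632×E_oob + 0.368×E_train, where E_oob is the error on observations left out of each bootstrap resample
   - Bootstrap .632+ (weights adjusted for the degree of overfitting)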
3. Generate 100 independent test sets to get "true" performance
4. For each evaluation method, calculate:
   - Bias (compared to true performance)
   - Variance across 50 repetitions
   - MSE
5. Which method gives most accurate performance estimate?
6. How does model complexity affect the bias?

In [None]:
# Your solution here


---

## Exercise Completion Checklist

- [ ] Exercises 1-5: Core properties (bias, variance, efficiency)
- [ ] Exercises 6-10: Advanced topics (Rao-Blackwell, James-Stein, robustness)
- [ ] Exercises 11-15: ML applications (Ridge, SGD, neural networks)
- [ ] Exercises 16-20: 2025+ research (double descent, LLMs, causal inference)

---

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;'>
<p style='margin: 0; font-size: 1.1em;'>Exercises curated by <strong>PLAI Academy</strong></p>
<p style='margin: 5px 0 0 0; opacity: 0.8;'>Statistical Inference • 2025</p>
</div>