# Chapter 1: Introduction to Statistical Inference - Exercises

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 30px; border-radius: 10px; color: white;'>
<h1 style='margin: 0; font-size: 2.5em;'>PLAI Academy</h1>
<p style='margin: 10px 0 0 0; font-size: 1.2em; opacity: 0.9;'>Statistical Inference • Chapter 1 Exercises</p>
</div>

---

## Exercise Structure

This notebook contains **20 exercises** organized into four sections:

1. **Exercises 1-5**: Fundamental concepts directly from chapter material
2. **Exercises 6-10**: Advanced problems from classical statistics literature
3. **Exercises 11-15**: Applications in AI/Machine Learning
4. **Exercises 16-20**: Contemporary problems (2025+)

---

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='whitegrid', palette='husl')

---

## Part 1: Fundamental Concepts (Exercises 1-5)

### Exercise 1.1: Population vs Sample

A quality control inspector wants to estimate the average weight of cereal boxes produced in a factory. The factory produces 10,000 boxes per day.

**Tasks:**
1. Define the population and a potential sample
2. Generate a population of 10,000 box weights with mean μ = 500g and standard deviation σ = 15g
3. Draw a random sample of n = 30 boxes
4. Calculate the sample mean and compare it to the population mean
5. Repeat steps 3-4 ten times. What do you observe about the sample means?
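
As a possible starting point, the sketch below builds a finite population and draws simple random samples from it without replacement. The seed and variable names are arbitrary choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only

# Finite population of 10,000 box weights (grams)
population = rng.normal(loc=500, scale=15, size=10_000)

# One simple random sample of n = 30 boxes, drawn without replacement
sample = rng.choice(population, size=30, replace=False)
print(f"population mean: {population.mean():.2f}, sample mean: {sample.mean():.2f}")

# Ten repeated samples: each sample mean differs, but all cluster near 500
sample_means = [rng.choice(population, size=30, replace=False).mean()
                for _ in range(10)]
print(np.round(sample_means, 2))
```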

In [None]:
# Your solution here


### Exercise 1.2: Sampling Distribution Simulation

**Reference**: [Using Simulations to Explore Sampling Distributions (PMC, 2025)](https://pmc.ncbi.nlm.nih.gov/articles/PMC12549070/)

This exercise demonstrates how sampling distributions illustrate the behavior of statistics across repeated samples.

**Tasks:**
1. Create a population that follows N(100, 225) (mean=100, variance=225)
2. Draw 1000 samples of size n=25 from this population
3. Calculate the sample mean for each sample
4. Plot the distribution of these 1000 sample means
5. Compare the mean and standard deviation of the sampling distribution to theoretical values:
   - E[X̄] = μ = 100
   - SE(X̄) = σ/√n = 15/√25 = 3
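
The comparison in task 5 can be sketched as follows. This version samples directly from N(100, 15²) rather than from a stored finite population, which is a simplifying assumption. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

mu, sigma, n, n_samples = 100, 15, 25, 1000  # sigma = sqrt(225)

# Each row is one sample of size n; the row means form the sampling distribution
samples = rng.normal(loc=mu, scale=sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)

theoretical_se = sigma / np.sqrt(n)      # 15 / 5 = 3
empirical_se = sample_means.std(ddof=1)  # should land close to 3

print(f"mean of sample means: {sample_means.mean():.2f} (theory: {mu})")
print(f"SD of sample means:   {empirical_se:.2f} (theory: {theoretical_se:.2f})")
```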

In [None]:
# Your solution here


### Exercise 1.3: Central Limit Theorem with Non-Normal Data

The Central Limit Theorem states that, for sufficiently large n, the mean of independent draws from a population with finite variance is approximately normally distributed, whatever the shape of the population distribution.

**Tasks:**
1. Create an exponential population with λ = 0.5 (mean = 2, variance = 4)
2. For sample sizes n = [5, 10, 30, 50, 100]:
   - Draw 1000 samples of size n
   - Calculate the mean of each sample
   - Plot the distribution of sample means
3. Observe how the sampling distribution becomes more normal as n increases
4. Calculate the skewness of each sampling distribution to quantify its departure from normality (a normal distribution has skewness 0)
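
A minimal sketch of the skewness comparison, leaving the plots to you. One detail is easy to get wrong: NumPy parameterizes the exponential by `scale = 1/λ`, not by the rate. The theoretical value 2/√n is the skewness of the mean of n independent exponential draws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # arbitrary seed
lam, n_samples = 0.5, 1000

skewness = {}
for n in (5, 10, 30, 50, 100):
    # scale = 1 / lambda = 2 is the population mean
    means = rng.exponential(scale=1 / lam, size=(n_samples, n)).mean(axis=1)
    skewness[n] = stats.skew(means)
    print(f"n={n:3d}  sample skewness={skewness[n]:.3f}  theory={2 / np.sqrt(n):.3f}")
```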

In [None]:
# Your solution here


### Exercise 1.4: Confidence Intervals

**Tasks:**
1. Generate a sample of n=50 from N(μ=75, σ=12)
2. Calculate a 95% confidence interval for the population mean using:
   - X̄ ± t_{α/2, n-1} × (s/√n)
3. Repeat this process 100 times, creating 100 different confidence intervals
4. Plot all 100 confidence intervals and color them:
   - Green if they contain the true mean (μ=75)
   - Red if they don't
5. Calculate the coverage rate (proportion that contain μ). Does it match the 95% confidence level?
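
The coverage calculation in tasks 2, 3 and 5 might look like the sketch below, which builds all 100 intervals at once. Plotting is left out, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
mu, sigma, n, n_reps, conf = 75, 12, 50, 100, 0.95

# Critical value t_{alpha/2, n-1}
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)

# Row r is the r-th sample of size n
data = rng.normal(mu, sigma, size=(n_reps, n))
xbar = data.mean(axis=1)
half_width = t_crit * data.std(axis=1, ddof=1) / np.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width

covered = (lower <= mu) & (mu <= upper)  # True where the CI contains mu
coverage = covered.mean()
print(f"coverage: {coverage:.2f}")
```

With only 100 intervals the observed coverage will fluctuate around 0.95 from run to run, so an exact match is not expected.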

In [None]:
# Your solution here


### Exercise 1.5: Hypothesis Testing and P-values

A researcher claims that a new teaching method increases test scores. Historical data shows the population mean is μ₀ = 70. After implementing the new method, a sample of 36 students has a mean score of 73.5 with standard deviation s = 8.

**Tasks:**
1. State the null and alternative hypotheses
2. Calculate the test statistic: t = (X̄ - μ₀)/(s/√n)
3. Calculate the p-value for a one-sided test
4. Make a decision at α = 0.05 significance level
5. Simulate the null distribution (assuming H₀ is true) and visualize where your test statistic falls
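
The arithmetic in tasks 2 and 3 can be checked with a few lines. The sketch assumes the one-sided alternative H₁: μ > 70, which matches the researcher's claim of an increase.

```python
import numpy as np
from scipy import stats

xbar, mu0, s, n = 73.5, 70.0, 8.0, 36

se = s / np.sqrt(n)                     # 8 / 6, about 1.333
t_stat = (xbar - mu0) / se              # 3.5 / 1.333 = 2.625
p_value = stats.t.sf(t_stat, df=n - 1)  # upper-tail area under t(35)

print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")
```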

In [None]:
# Your solution here


---

## Part 2: Advanced Statistical Problems (Exercises 6-10)

### Exercise 1.6: Power Analysis

**Reference**: [Statistical Inference - Casella & Berger](https://www.ctanujit.org/uploads/2/5/3/9/25393293/_solutions_manual_of_casella_berger.pdf)

Statistical power is the probability of correctly rejecting a false null hypothesis.

**Tasks:**
1. Consider testing H₀: μ = 100 vs H₁: μ = 105 with σ = 15, n = 25, α = 0.05
2. Calculate the critical value for rejecting H₀
3. Calculate the probability of Type II error (β) when the true mean is 105
4. Calculate power = 1 - β
5. Create a power curve showing power for true means ranging from 100 to 110
6. How does increasing sample size to n = 50 affect power?
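
One way to organize tasks 2-6 is a small power function. The sketch assumes σ is known, so it uses an upper-tailed z-test. The function name `power_one_sided_z` is a hypothetical helper, not something defined in the chapter.

```python
import numpy as np
from scipy import stats

mu0, mu1, sigma, alpha = 100, 105, 15, 0.05

def power_one_sided_z(mu_true, n):
    """Power of the upper-tailed z-test of H0: mu = mu0 (sigma known)."""
    se = sigma / np.sqrt(n)
    crit = mu0 + stats.norm.ppf(1 - alpha) * se   # reject when x-bar > crit
    beta = stats.norm.cdf((crit - mu_true) / se)  # P(fail to reject | mu_true)
    return 1 - beta

power_25 = power_one_sided_z(mu1, 25)
power_50 = power_one_sided_z(mu1, 50)
print(f"n=25: power={power_25:.3f}   n=50: power={power_50:.3f}")

# Points for a power curve over true means 100..110
true_means = np.linspace(100, 110, 51)
curve = power_one_sided_z(true_means, 25)
```

At the null value the function returns α, which is a quick sanity check on the critical value.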

In [None]:
# Your solution here


### Exercise 1.7: Comparing Two Populations

Two manufacturing processes produce light bulbs. Process A has been tested extensively with known mean lifetime μ_A = 1000 hours and σ_A = 100 hours. Process B is new.

**Tasks:**
1. Generate data for Process A (n=40) and Process B (n=40, with true mean=1050, σ=100)
2. Perform a two-sample t-test to determine if Process B has a different mean lifetime
3. Calculate the 95% confidence interval for the difference in means
4. Repeat this experiment 1000 times and calculate:
   - The proportion of times you correctly reject H₀ (empirical power)
   - The distribution of p-values when H₀ is false
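
The repeated experiment in task 4 can be vectorized so that each row is one replication. The sketch uses Student's equal-variance t-test, which is an assumption justified here by σ = 100 in both processes. The seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)  # arbitrary seed
n, n_reps, alpha = 40, 1000, 0.05

# Row r holds the data for replication r
a = rng.normal(1000, 100, size=(n_reps, n))  # Process A
b = rng.normal(1050, 100, size=(n_reps, n))  # Process B (H0 is false here)

# One two-sample t-test per row
p_values = stats.ttest_ind(a, b, axis=1).pvalue
empirical_power = (p_values < alpha).mean()
print(f"empirical power: {empirical_power:.3f}")
```

A histogram of `p_values` shows the second quantity the task asks for: under a false H₀ the p-values pile up near zero instead of being uniform.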

In [None]:
# Your solution here


### Exercise 1.8: Chi-Squared Distribution and Sample Variance

For a normal population, a scaled version of the sample variance follows a chi-squared distribution:
(n-1)s²/σ² ~ χ²(n-1)

**Tasks:**
1. Generate 5000 samples of size n=20 from N(50, 100) (variance σ²=100)
2. For each sample, calculate s² and the standardized quantity (n-1)s²/σ²
3. Plot the histogram of these standardized quantities
4. Overlay the theoretical χ²(19) distribution
5. Use this relationship to construct a 95% confidence interval for σ²
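
For task 5, the interval comes from inverting the pivot: since (n-1)s²/σ² lies between the 2.5% and 97.5% quantiles of χ²(n-1) with probability 0.95, σ² lies between (n-1)s²/χ²₀.₉₇₅ and (n-1)s²/χ²₀.₀₂₅. A sketch for a single sample, with an arbitrary seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
n, sigma2 = 20, 100.0

sample = rng.normal(50, np.sqrt(sigma2), size=n)
s2 = sample.var(ddof=1)  # unbiased sample variance

# Quantiles of chi-squared with n - 1 = 19 degrees of freedom
chi2_lo = stats.chi2.ppf(0.025, df=n - 1)
chi2_hi = stats.chi2.ppf(0.975, df=n - 1)

# Note the swap: the upper quantile gives the lower bound
ci_lower = (n - 1) * s2 / chi2_hi
ci_upper = (n - 1) * s2 / chi2_lo
print(f"s^2 = {s2:.1f}, 95% CI for sigma^2: ({ci_lower:.1f}, {ci_upper:.1f})")
```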

In [None]:
# Your solution here


### Exercise 1.9: Multiple Testing Problem

When conducting multiple hypothesis tests, the probability of at least one Type I error increases.

**Tasks:**
1. Simulate 20 independent hypothesis tests where all null hypotheses are true
2. For each test:
   - Generate data from N(0, 1) with n=30
   - Test H₀: μ = 0 at α = 0.05
3. Count how many tests incorrectly reject H₀
4. Repeat this entire process 1000 times
5. What is the probability of at least one false rejection? Compare to theoretical: 1 - (1-0.05)²⁰
6. Apply Bonferroni correction (α/20) and recalculate the family-wise error rate
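
The full simulation fits in one array if the data are stored as replications × tests × observations. This is a sketch with an arbitrary seed; `ttest_1samp` is applied along the last axis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
m, n, alpha, n_reps = 20, 30, 0.05, 1000

# data[r, j, :] is the sample for test j in replication r; every H0 is true
data = rng.normal(0, 1, size=(n_reps, m, n))
p_values = stats.ttest_1samp(data, popmean=0, axis=2).pvalue  # shape (n_reps, m)

# Family-wise error rate: share of replications with at least one rejection
fwer_uncorrected = (p_values < alpha).any(axis=1).mean()
fwer_bonferroni = (p_values < alpha / m).any(axis=1).mean()
fwer_theory = 1 - (1 - alpha) ** m  # about 0.64 for 20 independent tests

print(f"uncorrected: {fwer_uncorrected:.3f} (theory {fwer_theory:.3f})")
print(f"Bonferroni:  {fwer_bonferroni:.3f}")
```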

In [None]:
# Your solution here


### Exercise 1.10: Sample Size Determination

**Tasks:**
1. You want to estimate the population mean with a margin of error E = 2 at 95% confidence
2. From a pilot study, you estimate σ = 12
3. Calculate the required sample size: n = (z_{α/2} × σ / E)²
4. Verify this by simulation:
   - Generate 1000 samples of the calculated size
   - For each, construct a 95% CI
   - Calculate the average half-width of these CIs and compare it to E = 2 (the full width should be about 2E = 4)
5. What happens to required sample size if you want E = 1? E = 0.5?
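
The formula in task 3 can be wrapped in a short helper. `required_n` is a hypothetical name; the sketch treats σ as known and rounds up, since a fractional sample size is not possible.

```python
import math
from scipy import stats

def required_n(margin, sigma, conf=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin (sigma treated as known)."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / margin) ** 2)

for E in (2, 1, 0.5):
    print(f"E = {E}: n = {required_n(E, sigma=12)}")
```

Because n scales with 1/E², halving the margin of error roughly quadruples the required sample size.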

In [None]:
# Your solution here


---

## Part 3: AI/Machine Learning Applications (Exercises 11-15)

### Exercise 1.11: Train-Test Split and Sampling Distributions

In machine learning, we split data into training and test sets. Understanding sampling variation is crucial.

**Tasks:**
1. Create a dataset of 1000 samples with a known relationship: y = 2x + 3 + noise
2. Perform 100 different random 80-20 train-test splits
3. For each split, fit a linear model on training data and evaluate on test data
4. Plot the distribution of:
   - Training R² scores
   - Test R² scores
5. Calculate the mean and standard deviation of test performance
6. What does this tell you about the sampling distribution of model performance?

In [None]:
# Your solution here


### Exercise 1.12: Bias-Variance Decomposition in Practice

**Reference**: [Reconciling modern machine learning and bias-variance trade-off (PNAS)](https://www.pnas.org/doi/10.1073/pnas.1903070116)

**Tasks:**
1. Generate data: y = sin(2πx) + ε where ε ~ N(0, 0.1²)
2. For polynomial models of degree d = [1, 2, 5, 10, 20]:
   - Generate 100 training datasets of size n=50
   - Fit the model to each training set
   - Evaluate predictions at a fixed test point x=0.5
3. For each model degree, calculate:
   - Bias² = (E[ŷ] - y_true)²
   - Variance = Var(ŷ)
   - MSE = Bias² + Variance + σ², where σ² = 0.1² is the irreducible noise variance
4. Plot bias², variance, and total MSE vs model complexity
5. Identify the optimal model complexity

In [None]:
# Your solution here


### Exercise 1.13: Confidence Intervals for Model Performance

Model performance metrics are statistics with sampling distributions.

**Tasks:**
1. Train a classification model on a dataset (use sklearn's make_classification)
2. Evaluate accuracy on a test set of size n=200
3. Construct a 95% confidence interval for the true accuracy using:
   - Normal approximation: p̂ ± z_{α/2} × √(p̂(1-p̂)/n)
   - Wilson score interval (better coverage when n is small or p̂ is close to 0 or 1)
4. Repeat the experiment 1000 times with different random seeds
5. Verify that approximately 95% of the CIs contain the true population accuracy (approximate it with a very large held-out set)
6. How does sample size affect CI width?
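
The two intervals in task 3 can be written as plain functions. `wald_interval` and `wilson_interval` are hypothetical helper names, and the accuracy value in the example is illustrative rather than a result from any model.

```python
import numpy as np
from scipy import stats

def wald_interval(p_hat, n, conf=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(p_hat, n, conf=0.95):
    """Wilson score interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

acc, n_test = 0.9, 200  # illustrative values only
print("Wald:  ", wald_interval(acc, n_test))
print("Wilson:", wilson_interval(acc, n_test))
```

Trying `p_hat = 1.0` shows the difference clearly: the Wald interval collapses to a single point, while the Wilson interval still has positive width.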

In [None]:
# Your solution here


### Exercise 1.14: Hypothesis Testing for Model Comparison

**Tasks:**
1. Train two different models (e.g., logistic regression vs random forest) on the same dataset
2. Use 10-fold cross-validation to get 10 accuracy measurements for each model
3. Perform a paired t-test to determine if one model is significantly better:
   - H₀: μ_diff = 0 (no difference in performance)
   - H₁: μ_diff ≠ 0 (one model is better)
4. Calculate the p-value and effect size (Cohen's d)
5. Discuss: Why is a paired test more appropriate than an independent samples test here?

In [None]:
# Your solution here


### Exercise 1.15: P-values and Model Selection

In feature selection, we often test many features for significance, leading to multiple testing issues.

**Tasks:**
1. Generate a dataset with 100 features, where only 5 are truly predictive
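2. For each feature, test its association with the target variable (e.g., a t-test on the regression slope for a continuous target, or a two-sample t-test between classes for a binary target)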
3. Sort features by p-value and select features with p < 0.05
4. How many false positives do you get?
5. Apply False Discovery Rate (FDR) control using Benjamini-Hochberg procedure
6. Compare the selected features before and after FDR correction
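
For task 5, the Benjamini-Hochberg step-up rule is short enough to write by hand: sort the p-values, find the largest rank k with p₍ₖ₎ ≤ q·k/m, and reject the k smallest. The function name and the example p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                     # indices that sort p ascending
    thresholds = q * np.arange(1, m + 1) / m  # q * k / m for rank k
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank whose p-value passes
        reject[order[: k + 1]] = True   # reject everything up to that rank
    return reject

# Illustrative p-values, not output from any dataset
p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216])
print("p < 0.05:          ", int((p < 0.05).sum()), "selected")
print("Benjamini-Hochberg:", int(benjamini_hochberg(p).sum()), "selected")
```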

In [None]:
# Your solution here


---

## Part 4: Contemporary Problems, 2025+ (Exercises 16-20)

### Exercise 1.16: Uncertainty Quantification in LLMs

**Reference**: [Uncertainty Quantification for Large Language Models (ACL 2025)](https://aclanthology.org/2025.acl-tutorials.3/)

Modern LLMs need to quantify uncertainty in their predictions.

**Tasks:**
1. Simulate an LLM classification task with 4 classes
2. Generate 1000 predictions where the model outputs probability distributions over classes
3. For each prediction, calculate:
   - Predictive entropy: H = -Σ p_i log(p_i)
   - Maximum probability
4. Establish a confidence threshold based on entropy
5. Create a calibration plot:
   - Bin predictions by confidence level
   - Compare predicted confidence to actual accuracy
6. Discuss: How would you construct confidence intervals for LLM outputs?
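
The two uncertainty measures in task 3 can be computed as below. The Dirichlet draw is only a hypothetical stand-in for model outputs, and its concentration parameter is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n_pred, n_classes = 1000, 4

# Hypothetical stand-in for LLM outputs: each row is a probability vector
probs = rng.dirichlet(alpha=np.full(n_classes, 0.5), size=n_pred)

eps = 1e-12  # guards against log(0)
entropy = -(probs * np.log(probs + eps)).sum(axis=1)  # H = -sum p_i log p_i
max_prob = probs.max(axis=1)
max_entropy = np.log(n_classes)  # attained by the uniform distribution

print(f"mean entropy: {entropy.mean():.3f} (maximum possible: {max_entropy:.3f})")
print(f"mean max probability: {max_prob.mean():.3f}")
```

Entropy and maximum probability move in opposite directions, so a threshold on either can serve as the confidence cutoff in task 4.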

In [None]:
# Your solution here


### Exercise 1.17: A/B Testing with Network Effects

**Reference**: [Causal Inference with Dyadic Data (arxiv 2025)](https://arxiv.org/html/2505.20780)

Modern A/B tests must account for interference when users interact.

**Tasks:**
1. Simulate a social network with 1000 users and random connections
2. Implement a treatment where treated users have probability p=0.6 of success vs p=0.5 for control
3. Add network spillover: neighbors of treated users also get a small boost (p+0.05)
4. Randomly assign 500 users to treatment, 500 to control
5. Compare:
   - Naive estimate: difference in means ignoring spillover
   - Adjusted estimate accounting for network exposure
6. Construct confidence intervals for the true treatment effect
7. How does spillover bias the naive estimate?

In [None]:
# Your solution here


### Exercise 1.18: Statistical Inference for Neural Network Predictions

**Reference**: [A Modern Take on the Bias-Variance Tradeoff in Neural Networks (arXiv 2018)](https://arxiv.org/abs/1810.08591)

Neural networks are trained with stochastic optimization, creating uncertainty in predictions.

**Tasks:**
1. Create a simple regression dataset
2. Train a small neural network (2 hidden layers) 50 times with different random initializations
3. For a fixed test input x*, collect all 50 predictions
4. Analyze the distribution of predictions:
   - Calculate mean prediction and standard deviation
   - Construct a 95% prediction interval
5. Compare this to the sampling distribution approach:
   - Generate 50 different training datasets (bootstrap)
   - Train one network on each
   - How does this uncertainty compare to initialization uncertainty?
6. Propose a method to quantify total predictive uncertainty

In [None]:
# Your solution here


### Exercise 1.19: LLM Survey Simulation and Statistical Inference

**Reference**: [How Many Survey Respondents is an LLM Worth? (arxiv 2025)](https://arxiv.org/abs/2502.17773)

Researchers use LLMs to simulate survey responses. We need statistical methods to validate these simulations.

**Tasks:**
1. Simulate a "true" population with binary opinions (60% favor, 40% oppose)
2. Create an "LLM simulator" that generates responses with some bias (65% favor, 35% oppose)
3. Draw samples of size n=[50, 100, 200, 500] from both populations
4. For each sample size:
   - Calculate the proportion favoring
   - Construct 95% confidence intervals
   - Test H₀: p_true = p_LLM
5. At what sample size can you reliably detect the 5-percentage-point bias?
6. Propose a method to adjust LLM simulations to match true population parameters
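
The test in task 4 can be sketched as a pooled two-proportion z-test. This is one reasonable choice, not the only one. `two_proportion_ztest` is a hypothetical helper, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test of H0: p1 = p2; returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * stats.norm.sf(abs(z))

rng = np.random.default_rng(11)  # arbitrary seed
p_true, p_llm = 0.60, 0.65
for n in (50, 100, 200, 500):
    x_true = rng.binomial(n, p_true)  # number favoring in the "true" sample
    x_llm = rng.binomial(n, p_llm)    # number favoring in the LLM sample
    z, p = two_proportion_ztest(x_true, n, x_llm, n)
    print(f"n={n:4d}  z={z:+.2f}  p={p:.3f}")
```

A single run per sample size says little about reliability. Repeating each comparison many times and recording the rejection rate gives the empirical power that task 5 asks about.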

In [None]:
# Your solution here


### Exercise 1.20: Confidence Intervals for LLM Reward Models

**Reference**: [Uncertainty Quantification for LLM Reward Learning (arxiv 2025)](https://arxiv.org/abs/2512.03208)

Reward models in RLHF need confidence intervals for their reward estimates.

**Tasks:**
1. Simulate a reward model that scores text completions from 0-10
2. Generate 1000 human feedback examples with true rewards drawn from N(μ=7, σ=2)
3. Train a simple reward predictor (linear model on text features)
4. For new completions, the predictor gives point estimates
5. Implement three approaches to construct 95% confidence intervals:
   - Bootstrap: resample training data, retrain, get prediction distribution
   - Asymptotic: use standard error from model fitting
   - Empirical: use residuals from training data
6. Evaluate coverage: do 95% of intervals contain true rewards?
7. Which method gives the most reliable intervals? The tightest?

In [None]:
# Your solution here


---

## Exercise Completion Checklist

- [ ] Exercises 1-5: Fundamental concepts
- [ ] Exercises 6-10: Advanced statistical problems
- [ ] Exercises 11-15: AI/ML applications
- [ ] Exercises 16-20: Contemporary 2025+ problems

---

<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; text-align: center;'>
<p style='margin: 0; font-size: 1.1em;'>Exercises curated by <strong>PLAI Academy</strong></p>
<p style='margin: 5px 0 0 0; opacity: 0.8;'>Statistical Inference • 2025</p>
</div>