# Sampling Methodology Exercise

**CMVP Capstone — Statistics Foundations**

This notebook replicates the *Statistics Exercise* workbook.
You will:
1. Generate a synthetic population of fixture wattages
2. Draw a random sample and compare to the population
3. Walk through descriptive statistics step by step
4. Calculate required sample sizes for various precision/confidence targets

---

## 1. Import the functions

In [None]:
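import os
import sys

# Make the sibling scripts/ folder importable (assumes the notebook is run from its own directory)
sys.path.insert(0, os.path.join(os.getcwd(), '..', 'scripts'))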

from sampling_exercise import (
    generate_population, draw_sample, calc_stats,
    sample_size_infinite, sample_size_finite, z_score, histogram
)
import math

## 2. Generate a population

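Imagine a building with 1,000 light fixtures. The population is generated with a target mean wattage of 100 W and a target standard deviation of 25 W (a CV of 0.25); the observed statistics of the generated values will differ slightly from these targets.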

In [None]:
POP_MEAN = 100
POP_STD = 25
POP_SIZE = 1000
SEED = 42

population = generate_population(POP_MEAN, POP_STD, POP_SIZE, seed=SEED)
pop_stats = calc_stats(population)

print(f"Target:   mean={POP_MEAN}, std_dev={POP_STD}, N={POP_SIZE}")
print(f"Observed: mean={pop_stats['mean']:.2f}, std_dev={pop_stats['std_dev']:.2f}")
print(f"Range:    {pop_stats['min']:.1f} – {pop_stats['max']:.1f}")
print(f"CV:       {pop_stats['cv']:.4f} ({pop_stats['cv']*100:.2f}%)")
print("\nDistribution:")
histogram(population)
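
The implementation of `generate_population` is not shown in this notebook. As a rough sketch only, assuming it draws independent values from a normal distribution with a seeded generator, it might look like the following. The name `generate_population_sketch` is hypothetical and is not part of `sampling_exercise`.

```python
import random
import statistics

def generate_population_sketch(mean, std_dev, size, seed=None):
    """Hypothetical stand-in: draw `size` normally distributed wattages."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [rng.gauss(mean, std_dev) for _ in range(size)]

demo_pop = generate_population_sketch(100, 25, 1000, seed=42)
print(f"mean={statistics.mean(demo_pop):.2f}, std_dev={statistics.stdev(demo_pop):.2f}")
```

Because the values are random draws, the observed mean and standard deviation land near the targets rather than exactly on them, which is why the cell above prints both.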

## 3. Draw a random sample

In [None]:
SAMPLE_SIZE = 30

sample = draw_sample(population, SAMPLE_SIZE, seed=SEED + 1)
samp_stats = calc_stats(sample)

print(f"Sample (first 10): {', '.join(f'{x:.1f}' for x in sample[:10])}...")
print()
print(f"{'Statistic':<20} {'Population':<15} {'Sample':<15} {'% Diff':<10}")
print("-" * 60)
for label, pk, sk in [
    ('Mean', pop_stats['mean'], samp_stats['mean']),
    ('Std Dev', pop_stats['std_dev'], samp_stats['std_dev']),
    ('CV', pop_stats['cv'], samp_stats['cv']),
]:
    pct = (sk - pk) / pk * 100 if pk != 0 else 0
    # Format the percentage first so the % sign stays attached to the number
    print(f"{label:<20} {pk:<15.2f} {sk:<15.2f} {f'{pct:+.1f}%':<10}")
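
The notebook does not show how `draw_sample` selects fixtures. A minimal sketch, assuming simple random sampling without replacement, is given below; `draw_sample_sketch` is a hypothetical name used only for illustration.

```python
import random

def draw_sample_sketch(population, n, seed=None):
    """Hypothetical stand-in: simple random sample of n items without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

demo_ids = list(range(1000))  # stand-in for 1,000 fixture IDs
picked = draw_sample_sketch(demo_ids, 30, seed=43)
print(len(picked), len(set(picked)))  # no fixture is picked twice
```

Sampling without replacement matches the physical exercise: once a fixture has been measured, it is not measured again.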

## 4. Step-by-step descriptive statistics

In [None]:
deviations = [x - samp_stats['mean'] for x in sample]
sq_devs = [d ** 2 for d in deviations]

print(f"{'#':<4} {'Watts':<10} {'Deviation':<12} {'Dev²':<12}")
print("-" * 38)
for i, (w, d, d2) in enumerate(zip(sample, deviations, sq_devs), 1):
    print(f"{i:<4} {w:<10.1f} {d:<12.2f} {d2:<12.2f}")
print("-" * 38)
print(f"{'Sum':<4} {sum(sample):<10.1f} {'':12} {sum(sq_devs):<12.2f}")
print()
print(f"Sample Variance = {sum(sq_devs):.2f} / ({samp_stats['n']} - 1) = {samp_stats['variance']:.2f}")
print(f"Sample Std Dev  = sqrt({samp_stats['variance']:.2f}) = {samp_stats['std_dev']:.2f}")
print(f"CV              = {samp_stats['std_dev']:.2f} / {samp_stats['mean']:.2f} = {samp_stats['cv']:.4f}")
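
The manual calculation above can be cross-checked against Python's standard `statistics` module, which also uses the n − 1 denominator for sample variance. The wattage values in this sketch are made up for illustration and are not notebook output.

```python
import math
import statistics

watts_demo = [92.0, 105.5, 118.0, 76.5, 101.0, 88.5]  # illustrative values only

# Manual route: mean, sum of squared deviations, then divide by n - 1
mean_demo = sum(watts_demo) / len(watts_demo)
sum_sq_dev = sum((w - mean_demo) ** 2 for w in watts_demo)
var_manual = sum_sq_dev / (len(watts_demo) - 1)
std_manual = math.sqrt(var_manual)
cv_manual = std_manual / mean_demo

# Library route for comparison
print(f"variance: manual={var_manual:.4f}  statistics={statistics.variance(watts_demo):.4f}")
print(f"std dev:  manual={std_manual:.4f}  statistics={statistics.stdev(watts_demo):.4f}")
print(f"CV:       {cv_manual:.4f}")
```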

## 5. Sample size calculator

### Formulas

| Step | Formula |
|---|---|
| Infinite population | $n_0 = \left(\frac{Z \cdot CV}{P}\right)^2$ |
| Finite population correction | $n = \frac{n_0 \cdot N}{n_0 + N}$ |

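Here $Z$ is the z-score for the desired confidence level, $CV$ is the coefficient of variation (standard deviation divided by the mean), $P$ is the desired relative precision expressed as a fraction (for example, 0.10 for ±10%), and $N$ is the population size. Round the final $n$ up to the next whole fixture.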
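
Both formulas can be sketched in plain Python. The helper names below (`z_from_confidence`, `n_infinite`, `n_finite`) are hypothetical and are not the functions imported from `sampling_exercise`, whose implementation is not shown here. The sketch assumes a two-sided z-score from the standard normal distribution and uses the target CV of 0.25 as an illustrative input.

```python
import math
from statistics import NormalDist

def z_from_confidence(confidence_pct):
    """Two-sided standard-normal z-score for a confidence level in percent (assumption)."""
    alpha = 1 - confidence_pct / 100
    return NormalDist().inv_cdf(1 - alpha / 2)

def n_infinite(z, cv, precision):
    """n0 = (Z * CV / P)^2, ignoring population size."""
    return (z * cv / precision) ** 2

def n_finite(n0, pop_size):
    """Finite population correction: n = n0 * N / (n0 + N)."""
    return n0 * pop_size / (n0 + pop_size)

z_demo = z_from_confidence(90)
n0_demo = n_infinite(z_demo, cv=0.25, precision=0.10)
n_demo = n_finite(n0_demo, 1000)
print(f"z = {z_demo:.3f}, n0 = {n0_demo:.1f}, n = {n_demo:.1f}, required = {math.ceil(n_demo)}")
```

Rounding up with `math.ceil` is deliberate: rounding down would leave the sample slightly short of the precision target.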

In [None]:
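# Use the CV estimated from the sample drawn in section 3
cv = samp_stats['cv']
confidence = 90   # confidence level, in percent
precision = 0.10  # relative precision as a fraction (±10%)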

z = z_score(confidence)
n0 = sample_size_infinite(z, cv, precision)
n_final = sample_size_finite(n0, POP_SIZE)

print(f"Z({confidence}%) = {z:.3f},  CV = {cv:.4f},  P = {precision:.2f}")
print(f"n₀ = ({z:.3f} × {cv:.4f} / {precision:.2f})² = {n0:.1f}")
print(f"n  = ({n0:.1f} × {POP_SIZE}) / ({n0:.1f} + {POP_SIZE}) = {n_final:.1f}")
print(f"\nRequired sample size: {math.ceil(n_final)}")

### Scenario comparison

In [None]:
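# The last header needs a nested f-string; otherwise {POP_SIZE} is printed literally
print(f"{'Confidence':<12} {'Precision':<12} {'n₀':<10} {f'n (N={POP_SIZE})':<10}")
print("-" * 47)
for conf in [80, 90, 95]:
    z_val = z_score(conf)  # depends only on the confidence level
    for prec in [0.05, 0.10, 0.20]:
        n0_val = sample_size_infinite(z_val, cv, prec)
        n_val = sample_size_finite(n0_val, POP_SIZE)
        # Format each label before padding so the columns line up with the header
        print(f"{f'{conf}%':<12} {f'±{prec*100:.0f}%':<12} {math.ceil(n0_val):<10} {math.ceil(n_val):<10}")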

## 6. Exercises

**Try these:**
1. Change `SAMPLE_SIZE` to 10, then 100. How does the sample CV compare to the population CV?
2. If your project requires ±5% precision at 95% confidence, how many fixtures must you sample?
3. What happens to the finite population correction when the population is very large (N = 100,000)?
4. Re-run the sampling cell in section 3 several times, changing the `seed` passed to `draw_sample` each time. How much does the sample mean vary?
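
As a possible starting point for exercises 3 and 4, the sketch below works without the `sampling_exercise` module. It assumes a normally distributed population and simple random sampling without replacement; the value `n0_try = 68.0` and all variable names are illustrative choices, not values from the workbook.

```python
import random
import statistics

# Exercise 3: apply n = n0 * N / (n0 + N) for increasingly large populations
n0_try = 68.0  # illustrative infinite-population sample size
for big_n in (100, 1_000, 100_000):
    n_corr = n0_try * big_n / (n0_try + big_n)
    print(f"N={big_n:>7,}  n={n_corr:.1f}")

# Exercise 4: draw 20 samples of 30 with different seeds and compare their means
pop_rng = random.Random(0)
demo_population = [pop_rng.gauss(100, 25) for _ in range(1000)]
sample_means = []
for seed in range(20):
    picks = random.Random(seed).sample(demo_population, 30)
    sample_means.append(statistics.mean(picks))

print(f"sample means range from {min(sample_means):.1f} to {max(sample_means):.1f}")
print(f"std dev of sample means: {statistics.stdev(sample_means):.2f}")
```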

---
*CMVP Capstone · Counterfactual Designs*