# Level 2 - Week 5 - 01 Controlled Iteration

**Estimated time:** 60-90 minutes

## Learning Objectives

- Change one variable at a time
- Track runs with run_id
- Record metrics per run


## Overview

RAG systems improve through disciplined iteration: change one thing, measure, and record the result.

If you change multiple things at once, you can’t tell which change caused an improvement or a regression.

## The underlying theory: metrics are noisy signals

When you run an evaluation on a finite set of questions, a metric is an estimate of true performance.

If a metric is an average of per-item outcomes $x_i$ (e.g. hit=0/1), then:

$$
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Small $n$ makes the estimate noisy: each item carries weight $1/n$, so with $n=20$ a single hard item flipping between hit and miss moves the metric by 0.05.
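To make that sensitivity concrete, the sketch below computes the hit rate as the mean of 0/1 outcomes and measures how far one flipped item moves it at two eval-set sizes. The outcome lists are made-up illustrative data, not results from a real run, and the helper names are hypothetical.

```python
def hit_rate(outcomes: list[int]) -> float:
    """Mean of per-item 0/1 outcomes (the estimate mu-hat)."""
    return sum(outcomes) / len(outcomes)


def shift_from_one_flip(outcomes: list[int]) -> float:
    """Change in hit rate when the first miss (0) becomes a hit (1)."""
    flipped = list(outcomes)
    flipped[flipped.index(0)] = 1
    return hit_rate(flipped) - hit_rate(outcomes)


# Hypothetical outcomes: 50% hits on a small and a larger eval set.
small_eval = [1, 0] * 10    # n = 20
large_eval = [1, 0] * 100   # n = 200

small_shift = shift_from_one_flip(small_eval)
large_shift = shift_from_one_flip(large_eval)
print("n=20  shift:", round(small_shift, 3))
print("n=200 shift:", round(large_shift, 3))
```

The shift is exactly $1/n$ in each case, which is why the same single item matters ten times more on the smaller set.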

### Confidence intuition (rule-of-thumb)

For a 0/1 metric (a proportion) like hit rate, a rough standard error is:

$$
\mathrm{SE}(\hat{p}) \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
$$

So if $\hat{p}=0.5$ and $n=20$, $\mathrm{SE}\approx 0.11$.

Practical implication: a small change (e.g. +0.03) may be sampling noise on small eval sets.
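One way to use the rule of thumb in reverse is to solve it for $n$: $n \ge \hat{p}(1-\hat{p}) / \mathrm{SE}^2$. The sketch below does that for a few target standard errors. The function name and the target values are illustrative assumptions, and the result is only as good as the rule-of-thumb formula itself.

```python
import math


def n_for_target_se(p_hat: float, target_se: float) -> int:
    """Smallest n with sqrt(p(1-p)/n) <= target_se, from the rule-of-thumb SE."""
    return math.ceil(p_hat * (1.0 - p_hat) / target_se**2)


# Hypothetical targets; p_hat = 0.5 is the worst case for a proportion.
for target in (0.11, 0.03, 0.015):
    print("target SE", target, "-> n >=", n_for_target_se(0.5, target))
```

Because $n$ grows with $1/\mathrm{SE}^2$, halving the noise takes roughly four times as many eval items.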

## Controlled iteration: isolate cause → effect

Freeze everything except one variable (chunk size/overlap, top_k, embedding model, rerank on/off). Keep the eval set and prompt template fixed.
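A lightweight guard can enforce this before an eval starts. The sketch below compares a candidate config against a baseline and fails unless exactly one key differs. The function names are hypothetical, and it assumes configs are flat dictionaries like the ones used in this lesson.

```python
def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Return the sorted keys whose values differ between two configs."""
    missing = object()  # sentinel so a key absent on one side counts as a change
    all_keys = set(baseline) | set(candidate)
    return sorted(
        k for k in all_keys
        if baseline.get(k, missing) != candidate.get(k, missing)
    )


def assert_single_change(baseline: dict, candidate: dict) -> str:
    """Return the one changed key, or raise if zero or several keys changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return diff[0]


baseline = {"chunk_size": 600, "overlap": 100, "top_k": 5}
candidate = {"chunk_size": 800, "overlap": 100, "top_k": 5}
print("changed:", assert_single_change(baseline, candidate))
```

The check only covers what is in the config dictionary, so the eval set and prompt template still need to be pinned separately (for example by recording a file hash or version).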

## Practical run artifacts (minimum)

Per run id, save:

- config used
- metrics summary
- top failures with evidence

These artifacts make each run reproducible and give you a known-good state to roll back to.

### Sample code

The cell below defines a list of configs that differ in a single variable (`chunk_size`) and builds a timestamp-based `run_id`.


In [None]:
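from datetime import datetime, timezone

# Two configs that differ only in chunk_size; overlap and top_k stay fixed.
configs = [
    {'chunk_size': 600, 'overlap': 100, 'top_k': 5},
    {'chunk_size': 800, 'overlap': 100, 'top_k': 5},
]

# Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12).
run_id = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
print('run_id', run_id)
print('configs', configs)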


### Student fill-in

Add a metrics placeholder for each config.


In [None]:
results = []
for cfg in configs:
    # TODO: run eval and fill metrics
    results.append({'config': cfg, 'metrics': {}})

print(results)


## Self-check

- Are you changing only one variable?
- Do you keep run artifacts?


## Practical run-id + folder layout

A simple pattern:

- `runs/2026-01-27_1400_chunk800_overlap150/`
  - `config.json`
  - `metrics.json`
  - `failures.json`
  - `samples.md`

This gives you a paper trail for comparisons and rollbacks.
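The layout above could be written with a small helper like the following sketch. The function name, the `top_k` value, and the placeholder metric are assumptions for illustration. `samples.md` is left out because its contents depend on your pipeline, and the demo writes to a temporary directory rather than a real `runs/` folder.

```python
import json
import tempfile
from pathlib import Path


def save_run_artifacts(root: Path, run_id: str, config: dict,
                       metrics: dict, failures: list[dict]) -> Path:
    """Write config.json, metrics.json and failures.json under root/run_id."""
    run_dir = root / run_id
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an earlier run
    for name, payload in (("config.json", config),
                          ("metrics.json", metrics),
                          ("failures.json", failures)):
        (run_dir / name).write_text(json.dumps(payload, indent=2, sort_keys=True))
    return run_dir


# Demo in a temporary directory; all values below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = save_run_artifacts(
        Path(tmp) / "runs",
        "2026-01-27_1400_chunk800_overlap150",
        config={"chunk_size": 800, "overlap": 150, "top_k": 5},
        metrics={"hit_rate": None},  # fill in after running the eval
        failures=[],
    )
    saved_files = sorted(p.name for p in run_dir.iterdir())
    reloaded_config = json.loads((run_dir / "config.json").read_text())

print(saved_files)
```

Passing `exist_ok=False` is a deliberate choice here: a run directory that already exists raises an error instead of being silently overwritten, which protects the paper trail.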

### Exercise: standard error intuition

Compute the rule-of-thumb standard error for a proportion metric.

Use this to sanity-check whether small metric deltas are plausibly noise.

In [None]:
import math


def se_proportion(p_hat: float, n: int) -> float:
    return math.sqrt((p_hat * (1.0 - p_hat)) / n)


p_hat = 0.5
n = 20
se = se_proportion(p_hat, n)  # compute once and reuse
print("SE:", round(se, 3))
print("Typical 1-sigma band:", (round(p_hat - se, 3), round(p_hat + se, 3)))

# Example question:
# If hit@k moves from 0.50 to 0.55 on n=20, is that obviously meaningful?
print("delta=0.05 vs SE=", round(se, 3))
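
A before/after comparison involves two noisy estimates, so the noise on the difference is larger than the SE of either run alone. The sketch below applies the same rule of thumb to the example question (0.50 to 0.55 on $n=20$), treating the two runs as independent samples. That independence is an assumption: runs scored on the same eval set are paired, so this usually overstates the noise somewhat. The function name is hypothetical.

```python
import math


def se_difference(p_before: float, p_after: float, n: int) -> float:
    """Rule-of-thumb SE of (p_after - p_before), treating the two runs as independent."""
    return math.sqrt(p_before * (1 - p_before) / n + p_after * (1 - p_after) / n)


p_before, p_after, n = 0.50, 0.55, 20
delta = p_after - p_before
se_diff = se_difference(p_before, p_after, n)
ratio = delta / se_diff
print("delta:", round(delta, 3))
print("SE of difference:", round(se_diff, 3))
print("delta / SE:", round(ratio, 2))
```

A ratio well below 1 means the observed delta is smaller than the typical run-to-run wobble, so it should not drive a decision on its own.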

In [None]:
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    run_id: str
    change: str
    metric_before: float
    metric_after: float
    notes: str


# Example experiment table (fill with your own numbers)
results_table: list[RunResult] = [
    RunResult(run_id="baseline", change="none", metric_before=0.45, metric_after=0.45, notes="baseline"),
    RunResult(run_id="chunk800", change="chunk_size=800", metric_before=0.45, metric_after=0.52, notes="improved recall, citations ok"),
]

for r in results_table:
    print(r)

### Student fill-in

- Choose one variable to change (e.g. `chunk_size`).
- Keep everything else fixed.
- Record:
  - before/after metric values
  - 3–5 concrete failure examples
  - a short note describing the tradeoff (e.g. recall improved but citations got worse).

In [None]:
# TODO: Create your own run table entry.
#
# Example:
# results_table.append(
#     RunResult(run_id="...", change="...", metric_before=..., metric_after=..., notes="...")
# )
#
# Then print the updated table.

## Self-check

- Did you change exactly one variable?
- Is your eval set fixed across the tuning burst?
- Can you reproduce a run from artifacts (config + outputs)?
- Did you record at least one regression and a rollback decision?