# Level 2 - Week 5 - 01 Controlled Iteration

**Estimated time:** 60-90 minutes

## Learning Objectives

- Change one variable at a time
- Track runs with run_id
- Record metrics per run


## Overview

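Controlled iteration means changing one variable per run, so that any shift in a metric can be attributed to that change.
Keep the eval set fixed while you test each change; if the data and the config both move, you cannot tell which one caused the difference.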
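
One way to enforce the one-variable rule is to compare each candidate config against a baseline and count how many keys differ. The sketch below assumes configs are flat dictionaries like the ones used in this lesson; `changed_keys` is a hypothetical helper written for illustration, not part of any library.

```python
def changed_keys(baseline: dict, candidate: dict) -> list:
    """Return the sorted keys whose values differ between two flat configs."""
    all_keys = set(baseline) | set(candidate)
    # Sentinel so a key that is missing on one side always counts as a difference.
    missing = object()
    return sorted(
        k for k in all_keys
        if baseline.get(k, missing) != candidate.get(k, missing)
    )


baseline = {"chunk_size": 600, "overlap": 100, "top_k": 5}
candidate = {"chunk_size": 800, "overlap": 100, "top_k": 5}

diff = changed_keys(baseline, candidate)
print("changed:", diff)
if len(diff) != 1:
    print("warning: this comparison changes more than one variable")
```

If the check reports more than one changed key, split the experiment into separate runs before comparing metrics.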

## Practice Steps

- Define a list of configs.
- Record run_id and metrics per config.


### Sample code

The cell below defines a config list that varies only `chunk_size` and generates a timestamp-based `run_id`.


In [None]:
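from datetime import datetime, timezone

# Two configs that differ only in chunk_size, so any metric change
# can be attributed to that one variable.
configs = [
    {'chunk_size': 600, 'overlap': 100, 'top_k': 5},
    {'chunk_size': 800, 'overlap': 100, 'top_k': 5},
]

# Timestamp-based run_id in UTC. datetime.utcnow() is deprecated since Python 3.12.
run_id = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
print('run_id', run_id)
print('configs', configs)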


### Student fill-in

Add a metrics placeholder for each config.


In [None]:
results = []
for cfg in configs:
    # TODO: run the eval for cfg and fill in its metrics
    results.append({'run_id': run_id, 'config': cfg, 'metrics': {}})

print(results)


## Self-check

- Are you changing only one variable?
- Do you keep run artifacts?
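
To keep run artifacts, one option is to write each run's configs and metrics to a JSON file named after the `run_id`. This is a minimal sketch: the file layout and the `save_run` and `load_run` helpers are assumptions for illustration, and the `run_id` value is a placeholder. The example writes to a temporary directory; in practice you would pass a persistent directory such as a hypothetical `runs/` folder.

```python
import json
import tempfile
from pathlib import Path


def save_run(run_id: str, results: list, out_dir: str) -> Path:
    """Write one run's results to <out_dir>/<run_id>.json and return the path."""
    path = Path(out_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"run_id": run_id, "results": results}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_run(path) -> dict:
    """Read a saved run back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder run_id and results, for illustration only.
example_results = [{"config": {"chunk_size": 600, "overlap": 100, "top_k": 5}, "metrics": {}}]

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_run("20240101_000000", example_results, out_dir=tmp_dir)
    restored = load_run(saved_path)

print(restored["run_id"], len(restored["results"]))
```

Saving one file per run makes it possible to reopen an old run and compare it against a new one without rerunning the eval.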


Legacy practice content from practice.ipynb

# Level 2 - Week 5 Practice: RAG Evaluation Basics

**Estimated time:** 60–90 minutes

## Learning Objectives

- Define a minimal evaluation set
- Track simple quality metrics (coverage, refusal, non-empty)
- Log failures for iteration
- Produce a small evaluation summary


## Overview

This practice builds a tiny evaluation loop. The goal is not perfect metrics
but a consistent routine for comparing changes.

You will:

1. Create a small evaluation dataset (10–20 items).
2. Implement a loop that scores outputs and logs failures.
3. Summarize results in a dictionary.

## Practice Steps

- Fill in the sample dataset below.
- Implement evaluation checks.
- Print a summary and a few failures.


In [None]:
# Legacy practice content
TASK_5_1_GUIDE = """
Task 5.1: Evaluation loop

Implement evaluation checks and track simple metrics.

Checklist:
- Define 10-20 eval items
- Track answer_nonempty and citation_coverage
- Log failures for review
"""

print(TASK_5_1_GUIDE)


In [None]:
# Legacy practice content
from typing import List, Dict

EvalItem = Dict

items: List[EvalItem] = [
    {"query": "What is the return policy?", "answer": "Returns in 30 days.", "citations": ["doc-1"]},
    {"query": "Do you offer refunds?", "answer": "", "citations": []},
]

def evaluate(items: List[EvalItem]) -> Dict:
    """Count simple quality signals and collect failing items."""
    failures = []
    n_answers = 0
    n_cited = 0
    for item in items:
        has_answer = bool(item.get("answer"))
        has_citations = bool(item.get("citations"))
        n_answers += has_answer
        n_cited += has_citations
        # An item fails if it lacks an answer or lacks citations.
        if not (has_answer and has_citations):
            failures.append(item)
    return {
        "n": len(items),
        # Both metrics are raw counts; divide by "n" to get a rate.
        "answer_nonempty": n_answers,
        "citation_coverage": n_cited,
        "failures": failures,
    }

summary = evaluate(items)
print(summary)
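
The counts in `summary` are easier to compare across runs when expressed as rates. The sketch below converts them and also adds a refusal rate, which the learning objectives mention but `evaluate` does not compute. The refusal check is an assumption: it treats an answer as a refusal if it contains one of a few hypothetical marker phrases, which you should adapt to your system's actual refusal wording. The third sample item is made up for illustration.

```python
# Hypothetical refusal phrases; replace with the wording your system uses.
REFUSAL_MARKERS = ("i don't know", "i cannot answer", "not enough information")


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains any refusal marker (case-insensitive)."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def rates(items: list) -> dict:
    """Express non-empty answers, citation coverage, and refusals as fractions of n."""
    n = len(items)
    if n == 0:
        # Avoid division by zero on an empty eval set.
        return {"n": 0, "answer_nonempty_rate": 0.0,
                "citation_coverage_rate": 0.0, "refusal_rate": 0.0}
    return {
        "n": n,
        "answer_nonempty_rate": sum(bool(i.get("answer")) for i in items) / n,
        "citation_coverage_rate": sum(bool(i.get("citations")) for i in items) / n,
        "refusal_rate": sum(is_refusal(i.get("answer") or "") for i in items) / n,
    }


sample_items = [
    {"query": "What is the return policy?", "answer": "Returns in 30 days.", "citations": ["doc-1"]},
    {"query": "Do you offer refunds?", "answer": "", "citations": []},
    {"query": "Who founded the company?", "answer": "I don't know.", "citations": []},
]

sample_rates = rates(sample_items)
print(sample_rates)
```

Note that a refusal still counts as a non-empty answer here, so the two rates overlap; whether that is the right accounting depends on how you want to score refusals.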


### Task 5.2: Failure review

Print a small subset of failures for quick iteration notes.


In [None]:
# Legacy practice content
for item in summary["failures"][:3]:
    print("failure:", item)


## Self-check

- Is your dataset small but representative?
- Are failures visible and easy to inspect?
- Can you compare runs consistently?
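
To compare runs consistently, it can help to diff two summaries metric by metric. The sketch below assumes each summary is a dictionary shaped like the output of `evaluate` and that both runs used the same evaluation set; `compare_runs` is a hypothetical helper and the numbers are placeholders, not real results.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every numeric metric both runs share."""
    if baseline.get("n") != candidate.get("n"):
        # Different eval set sizes mean the comparison is not controlled.
        raise ValueError("runs were scored on eval sets of different sizes")
    deltas = {}
    for key, base_value in baseline.items():
        if key == "n" or key not in candidate:
            continue
        cand_value = candidate[key]
        # Skip non-numeric entries such as the list of failures.
        if isinstance(base_value, (int, float)) and isinstance(cand_value, (int, float)):
            deltas[key] = cand_value - base_value
    return deltas


# Placeholder summaries for two runs on the same 10-item eval set.
run_a = {"n": 10, "answer_nonempty": 8, "citation_coverage": 6, "failures": []}
run_b = {"n": 10, "answer_nonempty": 9, "citation_coverage": 6, "failures": []}

deltas = compare_runs(run_a, run_b)
print(deltas)
```

Matching `n` is only a cheap sanity check; it does not prove the two runs used identical items, so keep the eval set itself under version control or saved alongside each run.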
