# Week 2 — Part 03: Compare Runs + Report Lab

**Estimated time:** 60–90 minutes

---

## Pre-study (Self-learn)

The Foundations Course assumes you have completed Self-learn. If you need a refresher on evaluation metrics:

- [Foundations Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Evaluation metrics (accuracy/precision/recall/F1)](../self_learn/Chapters/4/02_core_concepts.md)

---

## What success looks like (end of Part 03)

- You compare at least two runs that differ in exactly one variable.
- You write a short `report.md` that explains:
  - what changed
  - what happened
  - what you think caused it
  - what you'd try next

### Checkpoint

After running this notebook:
- `output/compare_runs/report.md` exists and compares two runs
- The comparison explains why the change mattered

## Learning Objectives

- Compare ML experiment runs systematically
- Write evidence-based reports
- Identify the key variable that changed between runs

### What this part covers
This notebook teaches you to **compare experiment runs systematically** and write an evidence-based report.

The core discipline: change **one variable at a time**, measure the effect, and write down what you found and why you think it happened.

**Why this matters for LLM work:** The same discipline applies when comparing prompts, models, or temperatures. You need a consistent comparison framework — same inputs, same metrics, same artifact structure — to know whether a change actually helped.
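
The one-variable rule can be checked mechanically before any metrics are compared. The sketch below is one possible way to do that, assuming each run's configuration is available as a plain dict; `changed_fields` and `assert_single_change` are hypothetical helper names, not part of this notebook.

```python
def changed_fields(config_a: dict, config_b: dict) -> dict:
    """Return {field: (value_a, value_b)} for every field that differs.

    A field missing from one config is treated as None (an assumption
    made for this sketch).
    """
    keys = sorted(set(config_a) | set(config_b))
    return {
        key: (config_a.get(key), config_b.get(key))
        for key in keys
        if config_a.get(key) != config_b.get(key)
    }


def assert_single_change(config_a: dict, config_b: dict) -> str:
    """Return the name of the one changed field, or raise ValueError."""
    diff = changed_fields(config_a, config_b)
    if len(diff) != 1:
        raise ValueError(
            f"expected exactly one changed field, found {len(diff)}: {diff}"
        )
    return next(iter(diff))


# Illustrative configs only; the field names mirror the exercise later on.
baseline = {"seed": 42, "test_size": 0.2, "max_iter": 200}
variant = {"seed": 42, "test_size": 0.2, "max_iter": 400}
print(assert_single_change(baseline, variant))  # max_iter
```

The same check would apply to a prompt or temperature comparison, provided those settings are recorded in the config dict.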

## Overview

Comparing runs requires consistent fields and consistent artifacts.

In this lab you will:

- load or create a small list of runs
- select the best run using a clear rule
- compute a summary
- write report artifacts under `output/compare_runs/`

If you need a refresher on evaluation metrics, use the Self-learn links at the top of the notebook.

### What this cell does
Creates a list of 3 run records (simulating saved experiment results) and writes them to `output/compare_runs/runs.json`.

**Key design:** Each run has a `run_id`, `model`, `accuracy`, `f1`, and `notes`. The `notes` field records what changed — this is how you track "what was different about this run?" without digging through code history.

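**Notice:** `run_002` changed only `max_iter` (more iterations), so its gain over the baseline can be attributed to that one change. `run_003` switched to a different model (`rf` = Random Forest) and its note records only "higher depth", so it differs from the baseline in more than one way and the cause of its improvement is harder to pin down. A good experiment changes **one thing at a time** so you know what caused the improvement.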

In [None]:
import json
from pathlib import Path

runs = [
    {"run_id": "run_001", "model": "logreg", "accuracy": 0.84, "f1": 0.82, "notes": "baseline"},
    {"run_id": "run_002", "model": "logreg", "accuracy": 0.87, "f1": 0.86, "notes": "more iterations"},
    {"run_id": "run_003", "model": "rf", "accuracy": 0.89, "f1": 0.88, "notes": "higher depth"},
]

out_dir = Path("output/compare_runs")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "runs.json").write_text(json.dumps(runs, indent=2), encoding="utf-8")
print("wrote", out_dir / "runs.json")
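
Because later cells depend on every record having the same fields, it can help to read the file back and validate it before comparing anything. The following is a minimal sketch, assuming the record layout shown above; `load_runs` and `format_table` are hypothetical helpers introduced here for illustration.

```python
import json
from pathlib import Path

# Field names taken from the run records in this notebook.
REQUIRED_FIELDS = {"run_id", "model", "accuracy", "f1", "notes"}


def load_runs(path: Path) -> list[dict]:
    """Read run records and fail early if any record lacks a required field."""
    runs = json.loads(Path(path).read_text(encoding="utf-8"))
    for run in runs:
        missing = REQUIRED_FIELDS - set(run)
        if missing:
            run_id = run.get("run_id", "<unknown>")
            raise KeyError(f"{run_id} is missing {sorted(missing)}")
    return runs


def format_table(runs: list[dict]) -> str:
    """Render the runs as a Markdown table, one row per run."""
    header = "| run_id | model | accuracy | f1 | notes |"
    divider = "| --- | --- | --- | --- | --- |"
    rows = [
        f"| {r['run_id']} | {r['model']} | {r['accuracy']:.2f} "
        f"| {r['f1']:.2f} | {r['notes']} |"
        for r in runs
    ]
    return "\n".join([header, divider, *rows])


# Only prints if the file from the previous cell is present.
runs_path = Path("output/compare_runs/runs.json")
if runs_path.exists():
    print(format_table(load_runs(runs_path)))
```

A table like this can be pasted straight into `report.md`, which keeps the report and the stored records in step.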

### What this cell does
Defines `select_best_run_todo()` and `summarize_runs_todo()` — two functions you need to implement.

**`select_best_run_todo()`** should return the run with the highest accuracy (tie-break on F1). This is a deliberate design choice: you pick *one* primary metric to rank by, then use a secondary metric to break ties.

**`summarize_runs_todo()`** should return a summary dict with the best run, average accuracy, and total run count. This summary is what goes into your report.

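**Your task:** `select_best_run_todo()` is a placeholder that always returns the first run; replace it with the ranking rule above. `summarize_runs_todo()` already returns the three fields listed, and the Appendix solution extends it with average F1. The solution is in the Appendix if you get stuck.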

In [None]:
def select_best_run_todo(runs):
    """TODO: return the best run.

    Criteria (suggested):

    - highest accuracy
    - tie-break: highest f1
    """
    return runs[0]


def summarize_runs_todo(runs):
    """TODO: return a small summary dict used for reporting."""
    best = select_best_run_todo(runs)
    avg_acc = sum(r["accuracy"] for r in runs) / len(runs)
    return {"best": best, "avg_accuracy": round(avg_acc, 3), "n": len(runs)}


summary = summarize_runs_todo(runs)
print(summary)

### What this cell does
Defines `write_report_todo()` — a function that writes a Markdown report summarizing the comparison.

**What a good report includes:**
- Total number of runs (so readers know the scope)
- Average accuracy across all runs (baseline context)
- The best run with its exact `run_id`, model, metrics, and notes

**Why write to a file?** A report in a file is an artifact — it can be committed to git, shared with teammates, and referenced in future retrospectives. A print statement disappears when the notebook closes.

In [None]:
def write_report_todo(path: Path, summary: dict) -> None:
    """TODO: write a markdown report.

    Suggested sections:

    - Total runs
    - Average accuracy
    - Best run (with run_id, model, metrics, notes)
    """
    lines = ["# Run Comparison Report", "", f"Total runs: {summary['n']}"]
    path.write_text("\n".join(lines), encoding="utf-8")


write_report_todo(out_dir / "report.md", summary)
print("wrote", out_dir / "report.md")

## Appendix: Solutions (peek only after trying)

Reference implementations for the TODO functions in this notebook.

In [None]:
def select_best_run_todo(runs):
    return max(runs, key=lambda r: (r["accuracy"], r["f1"]))


def summarize_runs_todo(runs):
    best = select_best_run_todo(runs)
    avg_acc = sum(r["accuracy"] for r in runs) / len(runs)
    avg_f1 = sum(r["f1"] for r in runs) / len(runs)
    return {
        "best": best,
        "avg_accuracy": round(avg_acc, 3),
        "avg_f1": round(avg_f1, 3),
        "n": len(runs),
    }


def write_report_todo(path: Path, summary: dict) -> None:
    lines = ["# Run Comparison Report", ""]
    lines.append(f"Total runs: {summary['n']}")
    lines.append(f"Average accuracy: {summary['avg_accuracy']}")
    if "avg_f1" in summary:
        lines.append(f"Average f1: {summary['avg_f1']}")
    lines.append("")

    best = summary["best"]
    lines.append("## Best run")
    lines.append(f"- run_id: {best['run_id']}")
    lines.append(f"- model: {best['model']}")
    lines.append(f"- accuracy: {best['accuracy']}")
    lines.append(f"- f1: {best['f1']}")
    lines.append(f"- notes: {best['notes']}")

    path.write_text("\n".join(lines), encoding="utf-8")


summary_solution = summarize_runs_todo(runs)
write_report_todo(out_dir / "report_solution.md", summary_solution)
print("wrote", out_dir / "report_solution.md")
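
The sample runs never tie on accuracy, so they do not exercise the tie-break. A quick check with deliberately tied records, sketched below, shows how the tuple key behaves; `pick_best` is a hypothetical stand-in that uses the same key as the reference solution, and the records are made up for the test.

```python
def pick_best(runs):
    # Tuples compare element by element: accuracy decides first,
    # and f1 is consulted only when accuracies are equal.
    return max(runs, key=lambda r: (r["accuracy"], r["f1"]))


# Made-up records: "a" and "b" tie on accuracy, "c" has the highest f1.
tied = [
    {"run_id": "a", "accuracy": 0.90, "f1": 0.85},
    {"run_id": "b", "accuracy": 0.90, "f1": 0.88},
    {"run_id": "c", "accuracy": 0.89, "f1": 0.95},
]

best = pick_best(tied)
assert best["run_id"] == "b"  # wins the tie on f1; "c" loses on accuracy

# The same key gives a full ranking, best first.
ranked = sorted(tied, key=lambda r: (r["accuracy"], r["f1"]), reverse=True)
print([r["run_id"] for r in ranked])  # ['b', 'a', 'c']
```

If two runs tie on both metrics, `max` returns the first one it encounters, so the order of the input list becomes the final tie-break.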

## Exercise: Live experiment comparison

Goal:

- Run **two** experiments that differ by exactly one change.
- Write a short `output/compare_runs/report.md` explaining:
  - what changed
  - what happened (metrics)
  - what you think caused it
  - what you'd try next

Checkpoint:

- `output/compare_runs/report.md` exists and mentions both experiments.

In [None]:
from dataclasses import dataclass

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


@dataclass
class Config:
    seed: int = 42
    test_size: float = 0.2
    max_iter: int = 200


data = load_iris(as_frame=True)
X = data.data
y = data.target


def run_experiment(cfg: Config):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=cfg.test_size, random_state=cfg.seed, stratify=y
    )
    m = LogisticRegression(max_iter=cfg.max_iter)
    m.fit(X_train, y_train)
    pred = m.predict(X_val)

    return {
        "config": cfg.__dict__.copy(),
        "metrics": {
            "accuracy": float(accuracy_score(y_val, pred)),
            "f1_macro": float(f1_score(y_val, pred, average="macro")),
        },
    }


cfg_a = Config()
cfg_b = Config(seed=42, test_size=0.2, max_iter=cfg_a.max_iter * 2)

run_a = run_experiment(cfg_a)
run_b = run_experiment(cfg_b)

report_md = "\n".join(
    [
        "# Experiment Comparison Report",
        "",
        "## What changed",
        "TODO: Describe the one change you made (max_iter / solver / model type).",
        "",
        "## Results",
        f"- Experiment A config: {run_a['config']}",
        f"- Experiment A metrics: {run_a['metrics']}",
        f"- Experiment B config: {run_b['config']}",
        f"- Experiment B metrics: {run_b['metrics']}",
        "",
        "## Why you think it happened",
        "TODO: Write 2-5 sentences.",
        "",
        "## Next experiment",
        "TODO: What will you try next?",
        "",
    ]
)

(out_dir / "report.md").write_text(report_md, encoding="utf-8")
print("wrote", out_dir / "report.md")
run_a, run_b
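
The checkpoint can be verified with a few lines of code rather than by eye. This is a rough sketch, assuming the report keeps the four section headings from the template above; `check_report` is a hypothetical helper, and the TODO check is a simple convention rather than a rule of the course.

```python
from pathlib import Path

# Section headings copied from the report template in this exercise.
REQUIRED_HEADINGS = [
    "## What changed",
    "## Results",
    "## Why you think it happened",
    "## Next experiment",
]


def check_report(text: str) -> list[str]:
    """Return a list of problems; an empty list means the report looks complete."""
    lines = [line.strip() for line in text.splitlines()]
    problems = [
        f"missing section: {heading}"
        for heading in REQUIRED_HEADINGS
        if heading not in lines
    ]
    # Template placeholders start with "TODO"; flag any that remain.
    if any(line.startswith("TODO") for line in lines):
        problems.append("unfinished TODO placeholder")
    return problems


# Only runs if the report from the exercise cell is present.
report_path = Path("output/compare_runs/report.md")
if report_path.exists():
    issues = check_report(report_path.read_text(encoding="utf-8"))
    print(issues or "report looks complete")
```

Run against the unedited template, this reports the leftover TODO lines, which is a useful reminder that the generated file is only a starting point.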

### Solution: Live experiment comparison

Reference approach for comparing experiments and writing a short report.

In [None]:
def format_run(run: dict) -> str:
    cfg = run["config"]
    metrics = run["metrics"]
    return "\n".join(
        [
            f"- config: {cfg}",
            f"- accuracy: {metrics['accuracy']}",
            f"- f1_macro: {metrics['f1_macro']}",
        ]
    )


report_solution = "\n".join(
    [
        "# Experiment Comparison Report",
        "",
        "## What changed",
        "In Experiment B, I increased `max_iter` while holding `seed` and `test_size` constant.",
        "",
        "## Results",
        "### Experiment A",
        format_run(run_a),
        "",
        "### Experiment B",
        format_run(run_b),
        "",
        "## Why you think it happened",
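        "Logistic regression sometimes needs more optimization steps to converge; increasing `max_iter` can improve metrics if the baseline run stopped before converging. If both runs report identical metrics, the solver had likely already converged at the lower setting.",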
        "",
        "## Next experiment",
        "Try a different solver (e.g. `lbfgs` vs `newton-cg`) or add feature scaling and compare again.",
        "",
    ]
)

(out_dir / "report_solution.md").write_text(report_solution, encoding="utf-8")
print("wrote", out_dir / "report_solution.md")
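
Listing both runs' metrics leaves the reader to do the subtraction. A small helper that reports the difference per metric, sketched here, makes the "what happened" section easier to read. It assumes the `{"config": ..., "metrics": ...}` shape returned by `run_experiment`; `metric_deltas` and `format_deltas` are hypothetical names, and the numbers in the example are placeholders, not results from this lab.

```python
def metric_deltas(run_a: dict, run_b: dict) -> dict:
    """Return B minus A for every metric that both runs report."""
    shared = set(run_a["metrics"]) & set(run_b["metrics"])
    return {
        name: run_b["metrics"][name] - run_a["metrics"][name]
        for name in sorted(shared)
    }


def format_deltas(deltas: dict) -> str:
    """Render deltas as Markdown bullets with an explicit sign."""
    return "\n".join(f"- {name}: {delta:+.4f}" for name, delta in deltas.items())


# Placeholder values for illustration only.
example_a = {"metrics": {"accuracy": 0.90, "f1_macro": 0.89}}
example_b = {"metrics": {"accuracy": 0.93, "f1_macro": 0.93}}
print(format_deltas(metric_deltas(example_a, example_b)))
```

A delta of exactly zero is itself a finding worth writing down: it suggests the changed variable had no measurable effect on this split.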