# Week 5 — Part 03: Benchmarking script (latency + quality artifacts)

**Estimated time:** 90–150 minutes

## Learning Objectives

- Explain why benchmarks must control variables
- Measure latency as a distribution (not a single number)
- Save benchmark artifacts to disk for manual quality review
- Build a small benchmark harness for local models (Ollama)


### What this part covers
This notebook builds a **benchmarking harness** — a script that runs the same prompts across multiple models and records latency + outputs for comparison.

**Why benchmark systematically?**
- Latency is a distribution, not a single number. One fast run doesn't tell you the typical or worst-case performance.
- Quality varies by task. A model that's fast for simple extraction may be slow or inaccurate for complex reasoning.
- Without consistent inputs (same prompts, same settings), you're comparing apples to oranges.

**Benchmark hygiene rules:**
1. Same prompt set for all models
2. Warmup run per model (don't count model-loading time)
3. Save all outputs to disk for manual quality review

## Overview

A benchmark must be consistent:

- same prompt set
- same measurement method
- same saved outputs

We will write a small benchmark harness that:

- loops over prompts
- loops over models
- records latency
- saves outputs to disk for later quality review

---

## What success looks like (end of Part 03)

- You can run a benchmark over **at least 2 models** and **at least 5 prompts**.
- You write JSON artifacts to `output/benchmarks/`.
- You produce a `summary.json` with basic latency distribution statistics.

If you cannot reach Ollama, fix the local runtime first (Week 5 Part 01).

---

## Underlying theory: benchmarking is measurement under controlled conditions

You are trying to estimate:

- **speed** (latency / throughput)
- **quality** (correctness, format adherence, completeness)

The key rule is controlling variables:

- same prompts
- same settings
- same machine state as much as possible

Latency is a distribution, not a single number. Two useful summaries:

- average latency (typical case)
- slowest case / tail latency (worst case)

Practical implication: a model that is “fast on average” but has very slow worst cases may still feel bad in a demo.
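
To make the mean-versus-tail point concrete, here is a small sketch using made-up latencies. The numbers are hypothetical and chosen only to show how one slow run affects each summary; they are not measurements from any model.

```python
import statistics

# Hypothetical latencies in seconds: nine quick runs and one slow outlier.
latencies_s = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9, 1.1, 9.0]

mean_s = statistics.mean(latencies_s)      # pulled upward by the single outlier
median_s = statistics.median(latencies_s)  # closer to what a typical run feels like
worst_s = max(latencies_s)                 # the run a demo audience remembers

print(f"mean={mean_s:.2f}s median={median_s:.2f}s worst={worst_s:.2f}s")
```

No single number here describes the model: the mean sits well above the typical run, and the worst case is far above both. This is why the harness records every latency instead of one figure.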

### What this cell does
The first code cell below defines the benchmark infrastructure:

- **`BenchmarkItem`** — one (model, prompt, prompt_id) combination to run
- **`call_ollama_local()`** — makes one HTTP call and returns model, prompt_id, response, and latency
- **`summarize_latencies()`** — computes n, average, P95, and max latency from a list of results

**Why P95 latency?** Average latency describes the typical case. P95 (the 95th percentile) is the latency that 95% of requests finish at or under, so it shows how slow the tail gets without being dominated by a single outlier. A model with avg=1s but P95=10s will feel slow on roughly 1 in 20 requests, which is important to know before choosing it for production.
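
One common way to compute a percentile from a small sample is the nearest-rank method. The sketch below is an illustration under that assumption; `percentile_nearest_rank` is a hypothetical helper name, and other definitions (for example, interpolating between neighbors) give slightly different values.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the smallest value such that at least pct% of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * pct / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical sample: 20 latencies of 1.0s, 2.0s, ..., 20.0s.
latencies_s = [float(i) for i in range(1, 21)]
p95 = percentile_nearest_rank(latencies_s, 95)
print("p95_latency_s=", p95)
```

With only a handful of samples, the nearest-rank P95 is simply the maximum, so treat P95 as informative only once you have collected a few dozen runs.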

## Benchmark harness (example)

Benchmark hygiene notes:

- consider a warmup run per model (do not record it) to avoid counting model load time
- keep prompts short enough that you are comparing how the models generate output, not mostly the time each one spends processing a long input
- avoid changing the prompt set while comparing models (version your prompt list)
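
One lightweight way to version a prompt list is to store a short hash of it next to your results. This is a sketch, not part of the harness: `prompt_set_fingerprint` is a hypothetical helper, and it assumes prompts are `(prompt_id, prompt)` tuples.

```python
import hashlib
import json
from typing import List, Tuple


def prompt_set_fingerprint(prompts: List[Tuple[str, str]]) -> str:
    """Return a short, stable hash of the (prompt_id, prompt) pairs."""
    # Sorting makes the fingerprint independent of list order.
    canonical = json.dumps(sorted(prompts), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


prompts_v1 = [
    ("p01", "Summarize: Local inference trades cost for hardware constraints."),
    ("p02", "Write 3 bullet points about benchmarking."),
]
print("prompt_set=", prompt_set_fingerprint(prompts_v1))
```

If two benchmark runs report different fingerprints, their latency numbers were measured on different inputs and should not be compared directly.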

Next you’ll implement a simple benchmark harness that saves JSON artifacts to disk.

### What this cell does
The `run_benchmark()` cell (the second code cell below, after the helpers) is the main harness: it loops over all (model, prompt) combinations, runs each one, saves a JSON artifact per result, and writes a `summary.json`.

**Key design decisions:**
- **Warmup per model:** The first call to a model is slow (loading weights into memory). The warmup call is discarded so it doesn't skew your latency measurements.
- **One JSON file per result:** Naming pattern `{model}_{prompt_id}.json` makes it easy to find specific results later. You can inspect any individual output without parsing a large combined file.
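- **`summary.json`:** Aggregated statistics (n, avg, P95, max latency) pooled across all runs, including every model in the benchmark. Use it as a quick sanity check; for a model-by-model comparison, group the per-result files by their `model` field.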

**What to check after running:** Open `output/benchmarks/summary.json` and verify the latency numbers look reasonable for your hardware.
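
"Reasonable" is hardware-dependent, but a few structural checks can be automated. The sketch below assumes the summary uses the keys `n`, `avg_latency_s`, `p95_latency_s`, and `max_latency_s`; `check_summary` is a hypothetical helper, not part of the harness.

```python
from typing import Any, Dict, List


def check_summary(summary: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a summary dict (empty if none)."""
    if summary.get("n", 0) <= 0:
        return ["no results recorded (n == 0)"]
    problems: List[str] = []
    for key in ("avg_latency_s", "p95_latency_s", "max_latency_s"):
        value = summary.get(key)
        if not isinstance(value, (int, float)) or value <= 0:
            problems.append(f"{key} is missing or not positive: {value!r}")
    if problems:
        return problems
    # Neither the mean nor a percentile can exceed the maximum.
    if summary["p95_latency_s"] > summary["max_latency_s"]:
        problems.append("p95_latency_s exceeds max_latency_s")
    if summary["avg_latency_s"] > summary["max_latency_s"]:
        problems.append("avg_latency_s exceeds max_latency_s")
    return problems


example = {"n": 4, "avg_latency_s": 1.2, "p95_latency_s": 2.0, "max_latency_s": 2.0}
print(check_summary(example))
```

To run it on a real file, load the JSON first, for example with `json.loads(Path("output/benchmarks/summary.json").read_text(encoding="utf-8"))`.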

In [None]:
import json
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List


try:
    import requests
except ImportError as e:  # pragma: no cover
    # Defer the failure so the notebook still loads; callers raise a clear error later.
    requests = None
    _requests_import_error = e


@dataclass(frozen=True)
class BenchmarkItem:
    model: str
    prompt: str
    prompt_id: str


def call_ollama_local(host: str, item: BenchmarkItem, *, timeout_s: float = 120.0) -> Dict[str, Any]:
    """Send one non-streaming request to Ollama and time the HTTP round trip."""
    if requests is None:
        raise RuntimeError("requests is required: %s" % _requests_import_error)

    url = "%s/api/generate" % host.rstrip("/")
    payload = {"model": item.model, "prompt": item.prompt, "stream": False}
    # perf_counter() is monotonic, so system clock adjustments cannot distort the measurement.
    t0 = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=timeout_s)
    latency_s = time.perf_counter() - t0  # stop the clock before JSON parsing
    resp.raise_for_status()
    data = resp.json()
    return {
        "model": item.model,
        "prompt_id": item.prompt_id,
        "response": data.get("response", ""),
        "latency_s": latency_s,
    }


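def summarize_latencies(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize latency as a distribution: count, mean, P95 (nearest rank), and max."""
    lats = sorted(float(r["latency_s"]) for r in results)
    if not lats:
        return {"n": 0}
    # Nearest-rank P95: 1-based rank = ceil(0.95 * n), computed with integer arithmetic.
    p95_rank = (95 * len(lats) + 99) // 100
    return {
        "n": len(lats),
        "avg_latency_s": sum(lats) / len(lats),
        "p95_latency_s": lats[p95_rank - 1],
        "max_latency_s": lats[-1],
    }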


print("benchmark helpers ready")

### What this cell does
The walkthrough cell (under "Guided walkthrough" below) runs a tiny benchmark (1 model × 2 prompts) to validate the full harness end-to-end.

**What to notice:**
- `list_models_via_tags()` auto-discovers available models — no hardcoding needed
- Only 1 model is used for the first run (`models[:1]`) to keep it fast
- If Ollama is not running, the error message tells you exactly what to do

**Your task:** Implement `list_models_via_tags_todo()` — a reusable version of the model discovery function. Once implemented, you can use it in any script to avoid hardcoding model names. The solution is in the Appendix.

In [None]:
from typing import Set


def run_benchmark(items: List[BenchmarkItem], *, host: str, out_dir: Path) -> List[Dict[str, Any]]:
    out_dir.mkdir(parents=True, exist_ok=True)
    results: List[Dict[str, Any]] = []

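    seen_models: Set[str] = set()
    for item in items:
        if item.model not in seen_models:
            # Warm up each model once; the result is discarded so load time is not recorded.
            seen_models.add(item.model)
            _ = call_ollama_local(host, BenchmarkItem(item.model, "Warmup", "warmup"), timeout_s=120.0)

        r = call_ollama_local(host, item, timeout_s=120.0)
        results.append(r)
        # Model names can contain ":" or "/" (e.g. "llama3.1:latest"), which are unsafe in filenames.
        safe_model = item.model.replace("/", "_").replace(":", "_")
        out_path = out_dir / ("%s_%s.json" % (safe_model, item.prompt_id))
        out_path.write_text(json.dumps(r, indent=2), encoding="utf-8")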

    summary = summarize_latencies(results)
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return results


# Example usage (requires Ollama running locally):
# host = "http://localhost:11434"
# prompts = [
#     ("p01", "Summarize: Large language models are useful but require careful evaluation."),
#     ("p02", "Extract JSON with keys {name, email} from: 'Name: Sam, Email: sam@example.com'"),
#     ("p03", "Write 3 bullet points about caching."),
# ]
# items = [BenchmarkItem(model=m, prompt=p, prompt_id=pid) for m in ["llama3.1", "qwen2.5"] for pid, p in prompts]
# results = run_benchmark(items, host=host, out_dir=Path("output/benchmarks"))
# print("wrote", len(results), "results")

## Guided walkthrough: run a tiny benchmark once

Goal: verify the full flow **models × prompts → artifacts on disk → summary.json**.

### Checkpoint

- You should see JSON files created under `output/benchmarks/`.
- If Ollama is not running, you should see a clear error and next steps.

Tip: start with 1 model and 2–3 prompts, then scale up.
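
To confirm the checkpoint without opening a file browser, you can list the artifacts from Python. This is a minimal sketch; `list_artifacts` is a hypothetical helper, and it assumes the per-result files sit directly in the output directory next to `summary.json`.

```python
from pathlib import Path
from typing import List


def list_artifacts(out_dir: Path) -> List[Path]:
    """Return per-result JSON files in out_dir, excluding summary.json."""
    if not out_dir.is_dir():
        return []  # nothing has been written yet
    return sorted(p for p in out_dir.glob("*.json") if p.name != "summary.json")


artifacts = list_artifacts(Path("output/benchmarks"))
print("artifact_count=", len(artifacts))
for path in artifacts:
    print("-", path.name)
```

After a run with 1 model and 2 prompts you would expect two per-result files here, plus the `summary.json` that the helper skips.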

In [None]:
from typing import List

DEFAULT_HOST = "http://localhost:11434"

# Tiny benchmark inputs (start small, then scale up)
PROMPTS = [
    ("p01", "Summarize: Local inference trades cost for hardware constraints."),
    ("p02", "Write 3 bullet points about benchmarking."),
]


def list_models_via_tags(host: str) -> List[str]:
    # Minimal helper used by the walkthrough.
    if requests is None:
        raise RuntimeError("requests is required: %s" % _requests_import_error)
    resp = requests.get("%s/api/tags" % host, timeout=2.0)
    resp.raise_for_status()
    data = resp.json()
    return [m.get("name") for m in data.get("models", []) if m.get("name")]


try:
    models = list_models_via_tags(DEFAULT_HOST)
    if not models:
        print("No local models found.")
        print("Next steps:")
        print("- Pull a model: ollama pull <model>")
        print("- Confirm: ollama list")
    else:
        # Use 1 model for the first run to validate the harness.
        MODELS = models[:1]
        items = [BenchmarkItem(model=m, prompt=p, prompt_id=pid) for m in MODELS for pid, p in PROMPTS]
        results = run_benchmark(items, host=DEFAULT_HOST, out_dir=Path("output/benchmarks"))
        print("wrote_results=", len(results))
        print("summary_path=", Path("output/benchmarks/summary.json").resolve())
except Exception as e:
    print("benchmark_ok=False")
    print("error=", type(e).__name__, str(e))
    print("Next steps:")
    print("- Ensure Ollama is running: ollama serve")
    print("- Ensure /api/tags is reachable: http://localhost:11434/api/tags")
    print("- Ensure you pulled a model: ollama pull <model>")

## Exercise: make model discovery reusable

Implement the TODO below.

Goal:

- Return a list of model names available locally (via `GET /api/tags`).
- Fail fast with a clear error if Ollama is not reachable.

You’ll use this to avoid hardcoding `MODELS = [...]` in future scripts.

In [None]:
from typing import List


def list_models_via_tags_todo(host: str = "http://localhost:11434") -> List[str]:
    # TODO: implement.
    # Suggested approach:
    # - GET {host}/api/tags with a short timeout
    # - parse JSON: data['models'][*]['name']
    # - return List[str]
    return []


print("Implement list_models_via_tags_todo().")

## How to compare models

Compare:

- **Speed**: average latency + slowest case
- **Quality**: read saved outputs for:
  - correctness
  - adherence to format
  - completeness
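
For the speed half of the comparison, it helps to split latencies by model before summarizing. The sketch below assumes each result is a dict with `model` and `latency_s` keys, as in the saved artifacts; `latency_by_model` is a hypothetical helper and the sample values are made up.

```python
from collections import defaultdict
from typing import Any, Dict, List


def latency_by_model(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group latencies by model and report count, average, and slowest case."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for r in results:
        grouped[r["model"]].append(float(r["latency_s"]))
    return {
        model: {
            "n": len(lats),
            "avg_latency_s": sum(lats) / len(lats),
            "max_latency_s": max(lats),
        }
        for model, lats in grouped.items()
    }


# Hypothetical results, shaped like the per-result JSON artifacts.
sample = [
    {"model": "model-a", "prompt_id": "p01", "latency_s": 1.0},
    {"model": "model-a", "prompt_id": "p02", "latency_s": 3.0},
    {"model": "model-b", "prompt_id": "p01", "latency_s": 2.0},
]
for model, stats in latency_by_model(sample).items():
    print(model, stats)
```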

Simple “quality heuristics” without heavy math:

- for JSON prompts: count parse failures
- for extraction: check if required keys exist
- for summaries: check length caps and whether key facts are present
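
The first two heuristics can be sketched in a few lines. This is an illustration under simple assumptions: `json_quality_report` is a hypothetical helper, it treats each response as a string that should be a bare JSON object, and it counts anything else (including JSON wrapped in extra prose) as a failure.

```python
import json
from typing import Dict, Iterable, List


def json_quality_report(responses: List[str], required_keys: Iterable[str]) -> Dict[str, int]:
    """Count responses that fail to parse or that lack the required keys."""
    required = set(required_keys)
    report = {"total": len(responses), "parse_failures": 0, "missing_keys": 0}
    for text in responses:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            report["parse_failures"] += 1
            continue
        # Valid JSON that is not an object, or an object lacking a key, counts as missing.
        if not isinstance(parsed, dict) or not required.issubset(parsed):
            report["missing_keys"] += 1
    return report


# Hypothetical model outputs for the {name, email} extraction prompt.
outputs = [
    '{"name": "Sam", "email": "sam@example.com"}',
    "Sure! Here is the JSON you asked for.",
    '{"name": "Sam"}',
    "[1, 2]",
]
print(json_quality_report(outputs, ["name", "email"]))
```

These counts do not replace reading the outputs; they only tell you which files to open first.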

---

## References

- Python `time`: https://docs.python.org/3/library/time.html
- Python `timeit`: https://docs.python.org/3/library/timeit.html

## Appendix: Solutions (peek only after trying)

Reference implementation for `list_models_via_tags_todo`.

In [None]:
from typing import List


def list_models_via_tags_todo(host: str = "http://localhost:11434") -> List[str]:
    if requests is None:
        raise RuntimeError("requests is required: %s" % _requests_import_error)
    resp = requests.get("%s/api/tags" % host, timeout=2.0)
    resp.raise_for_status()
    data = resp.json()
    return [m.get("name") for m in data.get("models", []) if m.get("name")]


try:
    print(list_models_via_tags_todo()[:5])
except Exception as e:
    print("error=", type(e).__name__, str(e))