# Week 5 — Part 03: Benchmarking script (latency + quality artifacts)

**Estimated time:** 90–150 minutes

## Learning Objectives

- Explain why benchmarks must control variables
- Measure latency as a distribution (not a single number)
- Save benchmark artifacts to disk for manual quality review
- Build a small benchmark harness for local models (Ollama)


## Overview

A benchmark must be consistent:

- same prompt set
- same measurement method
- same output format, saved the same way for every model

We will write a small benchmark harness that:

- loops over prompts
- loops over models
- records latency
- saves outputs to disk for later quality review

---

## Underlying theory: benchmarking is measurement under controlled conditions

You are trying to estimate:

- **speed** (latency / throughput)
- **quality** (correctness, format adherence, completeness)
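Latency and throughput answer different questions: latency is how long one request takes, throughput is how many tokens arrive per second. The sketch below shows one way to derive throughput, assuming the response body carries generation counters such as the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` reports. `tokens_per_second` is a hypothetical helper name and the sample values are made up.

```python
from __future__ import annotations

from typing import Any, Optional


def tokens_per_second(data: dict[str, Any]) -> Optional[float]:
    """Return generated tokens per second, or None if the counters are missing."""
    eval_count = data.get("eval_count")
    eval_duration_ns = data.get("eval_duration")
    if not eval_count or not eval_duration_ns:
        return None
    # eval_duration is assumed to be in nanoseconds; convert to seconds first.
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values only, not a measurement: 120 tokens generated over 4 seconds.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(tokens_per_second(sample))
```

Returning `None` instead of raising keeps the helper usable on responses that do not include the counters.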

The key rule is controlling variables:

- same prompts
- same settings
- same machine state as much as possible
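One way to keep "same settings" honest is to define the generation options once and reuse them for every request. This is a minimal sketch, assuming the server accepts an `options` object with `temperature`, `seed`, and `num_predict`, as Ollama's generate endpoint documents. `FIXED_OPTIONS`, `build_payload`, and the specific values are illustrative choices, not part of the harness in this lesson.

```python
from __future__ import annotations

from typing import Any

# Assumed settings for illustration; pick values once and reuse them for every model.
FIXED_OPTIONS: dict[str, Any] = {"temperature": 0.0, "seed": 42, "num_predict": 256}


def build_payload(model: str, prompt: str) -> dict[str, Any]:
    """Build a request body where only the model and prompt vary between runs."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Copy so that callers cannot mutate the shared settings by accident.
        "options": dict(FIXED_OPTIONS),
    }


a = build_payload("llama3.1", "Write 3 bullet points about caching.")
b = build_payload("qwen2.5", "Write 3 bullet points about caching.")
print(a["options"] == b["options"])  # both requests carry identical settings
```

Capping the output length matters for latency comparisons in particular: a model that writes twice as much will look slower even if it generates tokens at the same rate.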

Latency is a distribution, not a single number. Two useful summaries:

- average latency (the typical cost per request, though a few slow calls can skew it)
- tail latency, such as p95 or the slowest case (the worst case a user will notice)

Practical implication: a model that is “fast on average” but has very slow worst cases may still feel bad in a demo.
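To make the point concrete, the sketch below summarizes two made-up latency lists that share the same average but have very different tails. The numbers are synthetic, and `summarize` is a hypothetical helper that uses a nearest-rank p95.

```python
from __future__ import annotations

import math
import statistics


def summarize(latencies: list[float]) -> dict[str, float]:
    """Summarize a non-empty list of latencies (seconds) as a small distribution."""
    if not latencies:
        raise ValueError("need at least one latency")
    ordered = sorted(latencies)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


# Synthetic data: both lists average 2.0 seconds.
steady = [2.0] * 10
spiky = [1.0] * 9 + [11.0]

print("steady:", summarize(steady))
print("spiky: ", summarize(spiky))
```

The averages match, but the spiky list has one request that takes 11 seconds. Reporting the median and a tail value next to the mean makes that difference visible.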

## Benchmark harness (example)

Benchmark hygiene notes:

- consider a warmup run per model (do not record it) to avoid counting model load time
- keep prompts short enough that you are comparing how the models generate, not mostly how long they take to process a long input
- avoid changing the prompt set while comparing models (version your prompt list)
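A lightweight way to version a prompt list is to store a short content hash next to the results, so two runs can be compared only when their hashes match. This is a sketch of one possible approach. `prompt_set_version` is a hypothetical helper, and truncating the digest to 12 characters is an arbitrary choice.

```python
from __future__ import annotations

import hashlib
import json


def prompt_set_version(prompts: list[tuple[str, str]]) -> str:
    """Return a short, order-independent fingerprint of (prompt_id, prompt) pairs."""
    # Sorting makes the fingerprint independent of list order; compact separators
    # keep the serialized form stable.
    canonical = json.dumps(sorted(prompts), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


prompts_v1 = [
    ("p01", "Summarize: Large language models are useful but require careful evaluation."),
    ("p03", "Write 3 bullet points about caching."),
]
prompts_reordered = list(reversed(prompts_v1))
prompts_edited = [prompts_v1[0], ("p03", "Write 5 bullet points about caching.")]

print(prompt_set_version(prompts_v1) == prompt_set_version(prompts_reordered))  # same content
print(prompt_set_version(prompts_v1) == prompt_set_version(prompts_edited))     # content changed
```

Writing the fingerprint into each saved artifact or into the summary file makes it harder to compare results that came from different prompt sets by mistake.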

Next you’ll implement a simple benchmark harness that saves JSON artifacts to disk.

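```python
from __future__ import annotations

import json
import math
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any

# requests is optional at import time so this cell still loads without it;
# call_ollama_local raises a clear error if it is missing.
try:
    import requests
except ImportError as e:  # pragma: no cover
    requests = None
    _requests_import_error = e


@dataclass(frozen=True)
class BenchmarkItem:
    model: str
    prompt: str
    prompt_id: str


def call_ollama_local(host: str, item: BenchmarkItem, *, timeout_s: float = 120.0) -> dict[str, Any]:
    """Send one non-streaming generate request and time it end to end."""
    if requests is None:
        raise RuntimeError(f"requests is required: {_requests_import_error}")

    url = f"{host.rstrip('/')}/api/generate"
    payload = {"model": item.model, "prompt": item.prompt, "stream": False}
    # perf_counter is monotonic, so system clock adjustments cannot distort the measurement.
    t0 = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=timeout_s)
    latency_s = time.perf_counter() - t0
    resp.raise_for_status()
    data = resp.json()
    return {
        "model": item.model,
        "prompt_id": item.prompt_id,
        # Keep the prompt in the artifact so each file can be reviewed on its own.
        "prompt": item.prompt,
        "response": data.get("response", ""),
        "latency_s": latency_s,
    }


def summarize_latencies(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize latencies as count, mean, nearest-rank p95, and max."""
    lats = sorted(float(r["latency_s"]) for r in results)
    if not lats:
        return {"n": 0}
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(lats)) - 1
    return {
        "n": len(lats),
        "avg_latency_s": sum(lats) / len(lats),
        "p95_latency_s": lats[p95_index],
        "max_latency_s": lats[-1],
    }


print("benchmark helpers ready")
```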

```python
def run_benchmark(items: list[BenchmarkItem], *, host: str, out_dir: Path) -> list[dict[str, Any]]:
    out_dir.mkdir(parents=True, exist_ok=True)
    results: list[dict[str, Any]] = []

    seen_models: set[str] = set()
    for item in items:
        # Unrecorded warmup call the first time each model appears, so model load time
        # is not counted. This works best when items are grouped by model.
        if item.model not in seen_models:
            seen_models.add(item.model)
            warmup = BenchmarkItem(model=item.model, prompt="Warmup", prompt_id="warmup")
            _ = call_ollama_local(host, warmup, timeout_s=120.0)

        r = call_ollama_local(host, item, timeout_s=120.0)
        results.append(r)
        # Model tags can contain ":" or "/" (for example "llama3.1:8b"), which are unsafe in file names.
        safe_model = item.model.replace("/", "_").replace(":", "_")
        out_path = out_dir / f"{safe_model}_{item.prompt_id}.json"
        out_path.write_text(json.dumps(r, indent=2), encoding="utf-8")

    # Summarize per model as well as overall; a pooled number alone cannot compare models.
    summary = {
        "overall": summarize_latencies(results),
        "by_model": {
            m: summarize_latencies([r for r in results if r["model"] == m])
            for m in sorted(seen_models)
        },
    }
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return results


# Example usage (requires Ollama running locally):
# host = "http://localhost:11434"
# prompts = [
#     ("p01", "Summarize: Large language models are useful but require careful evaluation."),
#     ("p02", "Extract JSON with keys {name, email} from: 'Name: Sam, Email: sam@example.com'"),
#     ("p03", "Write 3 bullet points about caching."),
# ]
# items = [BenchmarkItem(model=m, prompt=p, prompt_id=pid) for m in ["llama3.1", "qwen2.5"] for pid, p in prompts]
# results = run_benchmark(items, host=host, out_dir=Path("output/benchmarks"))
# print("wrote", len(results), "results")
```

## How to compare models

Compare:

- **Speed**: average latency + slowest case
- **Quality**: read saved outputs for:
  - correctness
  - adherence to format
  - completeness
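For the speed half of the comparison, the saved artifacts can be reloaded and grouped by model. The sketch below assumes each artifact is a JSON object with `model`, `prompt_id`, and `latency_s` keys and that a `summary.json` file may sit in the same directory. `load_artifacts` and `latency_table` are hypothetical helpers, and the demo records are made up.

```python
from __future__ import annotations

import json
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Any


def load_artifacts(out_dir: Path) -> dict[str, list[dict[str, Any]]]:
    """Read every per-prompt artifact in out_dir and group the records by model."""
    by_model: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for path in sorted(out_dir.glob("*.json")):
        if path.name == "summary.json":  # skip the aggregate file
            continue
        record = json.loads(path.read_text(encoding="utf-8"))
        by_model[record["model"]].append(record)
    return dict(by_model)


def latency_table(by_model: dict[str, list[dict[str, Any]]]) -> list[tuple[str, int, float, float]]:
    """Return (model, n, average latency, slowest latency) rows sorted by model name."""
    rows = []
    for model, records in sorted(by_model.items()):
        lats = [float(r["latency_s"]) for r in records]
        rows.append((model, len(lats), sum(lats) / len(lats), max(lats)))
    return rows


# Demo with made-up records written to a temporary directory.
fake_records = [
    {"model": "model-a", "prompt_id": "p01", "latency_s": 1.0},
    {"model": "model-a", "prompt_id": "p02", "latency_s": 3.0},
    {"model": "model-b", "prompt_id": "p01", "latency_s": 2.0},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for rec in fake_records:
        name = f"{rec['model']}_{rec['prompt_id']}.json"
        (demo_dir / name).write_text(json.dumps(rec), encoding="utf-8")
    (demo_dir / "summary.json").write_text("{}", encoding="utf-8")
    demo_rows = latency_table(load_artifacts(demo_dir))

for model, n, avg, worst in demo_rows:
    print(f"{model}: n={n} avg={avg:.2f}s max={worst:.2f}s")
```

Reading the `response` field of the same records, side by side per `prompt_id`, covers the quality half of the comparison.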

Simple “quality heuristics” without heavy math:

- for JSON prompts: count parse failures
- for extraction: check if required keys exist
- for summaries: check length caps and whether key facts are present
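The heuristics above can be sketched as three small checks. The function names, the word cap, and the key-fact list are illustrative assumptions, and substring matching is only a rough proxy for "the fact is present".

```python
from __future__ import annotations

import json


def json_parse_ok(text: str) -> bool:
    """True if the whole response parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def has_required_keys(text: str, required: set[str]) -> bool:
    """True if the response is a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required.issubset(obj.keys())


def summary_ok(text: str, *, max_words: int, key_facts: list[str]) -> bool:
    """True if the summary respects the word cap and mentions every key fact."""
    within_cap = len(text.split()) <= max_words
    lowered = text.lower()
    return within_cap and all(fact.lower() in lowered for fact in key_facts)


# Made-up responses for illustration.
responses = [
    '{"name": "Sam", "email": "sam@example.com"}',
    '{"name": "Sam"}',
    "Sure! Here is the JSON: {",
]
parse_failures = sum(not json_parse_ok(r) for r in responses)
print("parse failures:", parse_failures)
print("with required keys:", sum(has_required_keys(r, {"name", "email"}) for r in responses))
```

These checks are strict on purpose: a response that wraps valid JSON in extra prose or a code fence counts as a parse failure, which is often what a downstream parser would experience too.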

---

## References

- Python `time`: https://docs.python.org/3/library/time.html
- Python `timeit`: https://docs.python.org/3/library/timeit.html