# Fundamental Course, Week 5 Capstone Practice: Local inference benchmarking (Ollama)

This notebook is a guided mini-capstone. Your goal is to produce **benchmark artifacts + written conclusions** comparing local models.

## What success looks like

- You benchmark **2+ models** on a **consistent prompt set** (suggested: 10–20 prompts).
- You save results to disk under:
  - `output/week_05_bench/results.json`
- You can answer:
  - best model for speed
  - best model for quality
  - major failure modes (timeouts, low-quality outputs, format failures)

## References (docs)
- Ollama (official): https://ollama.com/
- Ollama GitHub (docs): https://github.com/ollama/ollama
- Python `time` (official): https://docs.python.org/3/library/time.html
- `requests` (official docs): https://requests.readthedocs.io/

## Setup

This assumes you have Ollama running locally.

Checklist:

- `ollama --version` works
- `ollama serve` is running
- `ollama list` shows at least 2 models (pull more if needed)

Default endpoint:

- `http://localhost:11434`

If the endpoint is different, update `OLLAMA_BASE_URL`.
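One way to avoid editing the notebook for each machine is to read the base URL from an environment variable. This is a minimal sketch only: the variable name `OLLAMA_BASE_URL` is an assumption made for illustration (the notebook itself only defines a Python constant with that name), and `resolve_base_url` is a hypothetical helper.

```python
import os

# Hypothetical environment variable name, chosen for this sketch.
ENV_VAR = "OLLAMA_BASE_URL"
DEFAULT_BASE_URL = "http://localhost:11434"


def resolve_base_url(env=None) -> str:
    """Return the base URL from the environment, or the default, without a trailing slash."""
    env = os.environ if env is None else env
    return env.get(ENV_VAR, DEFAULT_BASE_URL).rstrip("/")


# With an empty environment the default endpoint is used.
print(resolve_base_url({}))
```

Stripping the trailing slash keeps later f-strings such as `f'{OLLAMA_BASE_URL}/api/generate'` from producing a double slash.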

### Checkpoint

After you run the health check cell below, you should see a list of available model names.

In [None]:
import json
import time
from dataclasses import dataclass
from typing import List, Dict, Any

import requests


In [None]:
OLLAMA_BASE_URL = 'http://localhost:11434'
GENERATE_URL = f'{OLLAMA_BASE_URL}/api/generate'
GENERATE_URL


## Step 1: Define a benchmark set (run after Step 0 health check)

Keep your benchmark **small and consistent** so model comparisons are fair.

Suggested prompt categories (include at least 2–3 prompts per category):

- JSON extraction (format adherence)
- summarization (faithfulness)
- short instruction following

Start with 3 prompts to verify the loop works. Then scale to 10–20 prompts.
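For the JSON extraction category, format adherence can be checked mechanically rather than by eye. The sketch below is one possible check, not part of the original notebook: `check_json_keys` is a hypothetical helper that looks for the outermost braces in the model output and verifies that the parsed object contains the required keys.

```python
import json


def check_json_keys(text: str, required_keys: set) -> bool:
    """Return True if `text` contains a JSON object that has all required keys."""
    # Models often wrap JSON in extra prose, so slice from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return False
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= set(obj)


# Made-up model output for illustration.
sample = 'Sure! {"name": "John Doe", "email": "john@example.com"} Hope that helps.'
print(check_json_keys(sample, {"name", "email"}))
```

The slicing heuristic is deliberately simple; it will reject outputs that contain several separate JSON objects, which may or may not be what you want to count as a format failure.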

In [None]:
PROMPTS = [
    # Start small (3 prompts) to validate the benchmark loop works.
    "Summarize: Local inference trades API cost for hardware constraints.",
    "Extract as JSON with keys {name, email} from: John Doe (john@example.com).",
    "Write 3 bullets: pros/cons of local inference.",
]

# TODO: choose 2–3 models you have pulled (see the health check output / `ollama list`).
# Tip: include at least one smaller/faster model and one larger/slower model.
MODELS = []

PROMPTS

## Step 0: Health check (run this first)

If this fails, fix Ollama first before debugging your benchmark code.

In [None]:
from pathlib import Path

TAGS_URL = f"{OLLAMA_BASE_URL}/api/tags"

try:
    resp = requests.get(TAGS_URL, timeout=2.0)
    resp.raise_for_status()
    tags = resp.json()
    AVAILABLE_MODELS = [m.get("name") for m in tags.get("models", []) if m.get("name")]
    print("ollama_ok=True")
    print("available_models=", AVAILABLE_MODELS)
except Exception as e:
    AVAILABLE_MODELS = []
    print("ollama_ok=False")
    print("error=", type(e).__name__, str(e))
    print("Next steps:")
    print("- Start server: ollama serve")
    print("- Pull a model: ollama pull <model>")
    print("- Confirm: ollama list")

## Ollama generate call

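We call the HTTP API and measure wall-clock latency around the request.
If you get connection errors, confirm Ollama is running. If the server responds with an HTTP error instead (for example a 404), check that the model name matches one shown by `ollama list`.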
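Besides wall-clock latency, a non-streaming `/api/generate` response also carries the server's own timing fields, such as `eval_count` (generated tokens) and `eval_duration` (in nanoseconds). The sketch below derives tokens per second from those two fields; `tokens_per_second` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Any, Dict, Optional


def tokens_per_second(out: Dict[str, Any]) -> Optional[float]:
    """Generation throughput from a response dict, or None if the timing fields are missing."""
    eval_count = out.get("eval_count")
    eval_duration_ns = out.get("eval_duration")
    if not eval_count or not eval_duration_ns:
        return None
    # eval_duration is reported in nanoseconds; convert to seconds first.
    return eval_count / (eval_duration_ns / 1e9)


# Made-up response fragment for illustration.
fake_out = {"response": "...", "eval_count": 100, "eval_duration": 2_000_000_000}
print(tokens_per_second(fake_out))
```

Throughput is less sensitive to output length than total latency, so it can be a fairer speed comparison when models produce answers of different sizes.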


In [None]:
def ollama_generate(model: str, prompt: str, *, timeout_s: float = 60.0) -> Dict[str, Any]:
    payload = {
        'model': model,
        'prompt': prompt,
        'stream': False,
    }
    resp = requests.post(GENERATE_URL, json=payload, timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()


## Benchmark loop (starter)

Stores latency and a small output sample for later inspection.
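A single timing per prompt can be noisy, and the first request to a model may include the time to load it into memory. One option, sketched here under those assumptions, is to repeat each call a few times and report the median. `time_call_ms` is a hypothetical helper, and `time.sleep` stands in for the real request.

```python
import statistics
import time
from typing import Callable, List


def time_call_ms(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `fn` `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - t0) * 1000)
    return durations


# Stand-in workload; in the notebook this could be
# lambda: ollama_generate(model, prompt)
samples = time_call_ms(lambda: time.sleep(0.01), repeats=3)
print("median_ms=", round(statistics.median(samples), 1))
```

The median is used instead of the mean so that one slow outlier (such as a cold start) does not dominate the reported number.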


In [None]:
# Auto-fill MODELS (so the notebook can run end-to-end)
# For the assignment, you should still choose MODELS yourself.
if not MODELS:
    try:
        MODELS = AVAILABLE_MODELS[:2]
    except Exception:
        MODELS = []

if not MODELS:
    print("No MODELS selected.")
    print("Next steps:")
    print("- Ensure you pulled at least one model: ollama pull <model>")
    print("- Set MODELS = [...] to model names shown by the health check / ollama list")

results = []
for model in MODELS:
    for prompt in PROMPTS:
        # perf_counter() is monotonic, so it is safer than time.time() for measuring intervals.
        t0 = time.perf_counter()
        try:
            out = ollama_generate(model, prompt)
            ok = True
            text = out.get('response', '')
        except Exception as e:
            ok = False
            text = f'ERROR: {type(e).__name__}: {e}'
        latency_ms = int((time.perf_counter() - t0) * 1000)
        results.append({
            'model': model,
            'prompt': prompt,
            'ok': ok,
            'latency_ms': latency_ms,
            'output_preview': text[:400],
        })

results

## Step 3: Save artifacts

Write a single `results.json` file so you can review quality later and share evidence.

### Checkpoint

After running the next cell, you should have:

- `output/week_05_bench/results.json`
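To confirm the artifact is usable later, it can help to read it back and check that every record has the fields the benchmark loop writes. This is a small sketch; `find_bad_records` and `load_results` are hypothetical helpers, and the key set mirrors the dictionary built in the benchmark loop.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Keys written by the benchmark loop for each (model, prompt) pair.
REQUIRED_KEYS = {"model", "prompt", "ok", "latency_ms", "output_preview"}


def find_bad_records(records: List[Dict[str, Any]]) -> List[int]:
    """Return indexes of records that are missing any of the expected keys."""
    return [i for i, r in enumerate(records) if not REQUIRED_KEYS <= set(r)]


def load_results(path: Path) -> List[Dict[str, Any]]:
    """Read the results file written by the save cell."""
    return json.loads(path.read_text(encoding="utf-8"))
```

For example, `find_bad_records(load_results(Path("output/week_05_bench/results.json")))` should return an empty list after a clean run.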

In [None]:
OUT_DIR = Path("output/week_05_bench")
OUT_DIR.mkdir(parents=True, exist_ok=True)

out_path = OUT_DIR / "results.json"
out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
print("wrote=", out_path.resolve())

## Step 4: Conclusions template

Fill this in after you run at least 2 models.

- Best for speed:
- Best for quality:
- Biggest failure modes:
- When you would choose each model:

Optional: paste 1–2 example outputs that demonstrate the differences.
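The "best for speed" and "failure modes" lines are easier to fill in with a per-model summary. The following is a minimal sketch, assuming records shaped like those in `results.json`; `summarize_by_model` is a hypothetical helper and the sample records are made up.

```python
import statistics
from collections import defaultdict
from typing import Any, Dict, List


def summarize_by_model(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group records by model; report run count, success rate, and median latency of successful runs."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["model"]].append(r)

    summary = {}
    for model, rows in grouped.items():
        # Failed runs (for example timeouts) are excluded from the latency figure.
        ok_latencies = [r["latency_ms"] for r in rows if r["ok"]]
        summary[model] = {
            "runs": len(rows),
            "ok_rate": len(ok_latencies) / len(rows),
            "median_latency_ms": statistics.median(ok_latencies) if ok_latencies else None,
        }
    return summary


# Made-up records for illustration; model names are placeholders.
demo = [
    {"model": "model-a", "ok": True, "latency_ms": 100},
    {"model": "model-a", "ok": True, "latency_ms": 300},
    {"model": "model-a", "ok": False, "latency_ms": 60000},
    {"model": "model-b", "ok": False, "latency_ms": 5},
]
print(summarize_by_model(demo))
```

Latency alone does not settle "best for quality"; that still requires reading the saved output previews.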

## Optional: Choose your models (do this before running the benchmark loop)

You already have a `MODELS = []` TODO in the Step 1 cell.

If you want to override what the notebook auto-selects, set `MODELS` manually (2–3 models) and then re-run the benchmark loop cell.

Checklist:

- At least one smaller/faster model
- At least one larger/slower model

You can copy names from the health check output (`AVAILABLE_MODELS`).
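Before re-running the benchmark, a quick check can catch model names that were mistyped. This sketch compares names by exact string match, which is an assumption: a name copied without its tag may not match a listed name that includes one. `missing_models` is a hypothetical helper and the names are placeholders.

```python
from typing import List


def missing_models(selected: List[str], available: List[str]) -> List[str]:
    """Return selected model names that do not appear in the available list (exact match)."""
    available_set = set(available)
    return [m for m in selected if m not in available_set]


# Placeholder names for illustration.
print(missing_models(["model-a:latest", "model-x:latest"], ["model-a:latest", "model-b:latest"]))
```

In the notebook this could be called as `missing_models(MODELS, AVAILABLE_MODELS)`; an empty list means every selected name was reported by the health check.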

In [None]:
print("AVAILABLE_MODELS=", AVAILABLE_MODELS)
print("MODELS=", MODELS)

print("If you change MODELS, re-run the benchmark loop cell to regenerate results.")

## Appendix: Solutions (peek only after trying)

Reference defaults that make the notebook runnable.

Use these only if you got stuck; for a meaningful comparison you should still choose your own prompt set and models.

In [None]:
# Reference: pick up to 2 models automatically if you left MODELS empty.
# For the assignment, explicitly set MODELS yourself.
if not MODELS and AVAILABLE_MODELS:
    MODELS = AVAILABLE_MODELS[:2]
    print("MODELS auto-filled:", MODELS)

# Reference: expand prompts (copy this pattern)
# PROMPTS = PROMPTS + ["Add more prompts here..."]
