# Level 1 - Week 5 Practice (Starter Notebook)

This notebook provides starter code to benchmark **local inference with Ollama**.

## References (docs)
- Ollama (official): https://ollama.com/
- Ollama GitHub (docs): https://github.com/ollama/ollama
- Python `time` (official): https://docs.python.org/3/library/time.html
- `requests` (official docs): https://requests.readthedocs.io/


## Setup

This notebook assumes Ollama is already running locally. The default endpoint is:
- `http://localhost:11434`

If the endpoint is different, update `OLLAMA_BASE_URL`.
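Before benchmarking, it can help to confirm that the server is reachable and to see which models are already pulled. The sketch below assumes the `/api/tags` endpoint described in the Ollama API docs, which returns a JSON object with a `models` list. The helper names `parse_model_names` and `list_local_models` are illustrative and are not part of Ollama or this notebook. It uses only the standard library so it can run before anything else is installed.

```python
import json
import urllib.request
from typing import Any, Dict, List


def parse_model_names(tags_payload: Dict[str, Any]) -> List[str]:
    """Pull model names out of a /api/tags style payload (assumed shape)."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def list_local_models(base_url: str = "http://localhost:11434",
                      timeout_s: float = 5.0) -> List[str]:
    """Return names of locally pulled models.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout_s) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_local_models()` with Ollama running should return names you can copy into `MODELS`. An exception here usually means the server is not running or the base URL is wrong.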


In [None]:
import json
import time
from dataclasses import dataclass
from typing import List, Dict, Any

import requests


In [None]:
OLLAMA_BASE_URL = 'http://localhost:11434'
GENERATE_URL = f'{OLLAMA_BASE_URL}/api/generate'
GENERATE_URL


## Define a small benchmark set

Keep it small and consistent so you can compare models fairly.


In [None]:
PROMPTS = [
    'Summarize: LLMs can be used for extraction, summarization, and classification.',
    'Extract as JSON: name, email from: John Doe (john@example.com).',
    'Write 3 bullets: pros/cons of local inference.',
]
MODELS = [
    # TODO: replace with models you have pulled, e.g. 'llama3.1:8b'
    'llama3',
]
PROMPTS
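To keep comparisons consistent across runs, it may also help to pin the sampling settings, not just the prompts. The sketch below builds a request body with an `options` object. The keys `temperature`, `seed`, and `num_predict` are taken from the Ollama API docs, while the helper name `build_payload` and the default values are assumptions chosen for illustration.

```python
from typing import Any, Dict


def build_payload(model: str, prompt: str, *, temperature: float = 0.0,
                  seed: int = 42, max_tokens: int = 256) -> Dict[str, Any]:
    """Build a non-streaming /api/generate body with fixed sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,  # 0.0 reduces run-to-run variation
            "seed": seed,                # fixed seed (assumed value)
            "num_predict": max_tokens,   # cap on generated tokens (assumed value)
        },
    }
```

Capping the output length keeps one unusually long answer from dominating the latency numbers, at the cost of possibly truncating some responses.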


## Ollama generate call

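We call the HTTP API with streaming disabled, so the full response arrives as a single JSON object. Latency is measured around this call in the benchmark loop.
If you get connection errors, confirm Ollama is running and your model name is correct.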


In [None]:
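def ollama_generate(model: str, prompt: str, *, timeout_s: float = 60.0) -> Dict[str, Any]:
    """Send one non-streaming request to /api/generate and return the parsed JSON body."""
    payload = {
        'model': model,
        'prompt': prompt,
        'stream': False,  # one JSON object instead of a stream of chunks
    }
    resp = requests.post(GENERATE_URL, json=payload, timeout=timeout_s)
    resp.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return resp.json()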
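Besides the generated text, the JSON body of a non-streaming response carries timing fields that can complement wall-clock latency. The sketch below assumes the `eval_count`, `eval_duration`, and `load_duration` fields described in the Ollama API docs, with durations in nanoseconds. The helper names are illustrative, and both return `None` when a field is missing rather than guessing.

```python
from typing import Any, Dict, Optional

NS_PER_S = 1_000_000_000  # durations are assumed to be reported in nanoseconds


def tokens_per_second(response: Dict[str, Any]) -> Optional[float]:
    """Generation throughput from eval_count / eval_duration, or None if unavailable."""
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count * NS_PER_S / duration_ns


def load_seconds(response: Dict[str, Any]) -> Optional[float]:
    """Time spent loading the model for this request, in seconds, or None if unavailable."""
    duration_ns = response.get("load_duration")
    if duration_ns is None:
        return None
    return duration_ns / NS_PER_S
```

Throughput is often a fairer speed metric than raw latency when models produce answers of different lengths.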


## Benchmark loop (starter)

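Each run records wall-clock latency, whether the call succeeded, and the first 400 characters of the output for later inspection. The first call to a model may be slower because it includes model load time.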


In [None]:
results = []
for model in MODELS:
    for prompt in PROMPTS:
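        # perf_counter() is a monotonic clock intended for timing short intervals
        t0 = time.perf_counter()
        try:
            out = ollama_generate(model, prompt)
            ok = True
            text = out.get('response', '')
        except Exception as e:
            # Record the failure and keep going instead of aborting the benchmark
            ok = False
            text = f'ERROR: {type(e).__name__}: {e}'
        latency_ms = int((time.perf_counter() - t0) * 1000)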
        results.append({
            'model': model,
            'prompt': prompt,
            'ok': ok,
            'latency_ms': latency_ms,
            'output_preview': text[:400],
        })

results
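A per-model summary makes the raw list easier to compare. The sketch below groups rows shaped like the entries in `results` (keys `model`, `ok`, `latency_ms`) and reports run count, failures, and median and maximum latency over successful runs. The function name `summarize_by_model` is an assumption for illustration. The median is used because a single slow first call would skew a mean.

```python
from statistics import median
from typing import Any, Dict, List


def summarize_by_model(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Aggregate benchmark rows into one summary dict per model."""
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for row in rows:
        grouped.setdefault(row["model"], []).append(row)

    summary: Dict[str, Dict[str, Any]] = {}
    for model, model_rows in grouped.items():
        # Only successful runs contribute to the latency statistics
        ok_latencies = [r["latency_ms"] for r in model_rows if r["ok"]]
        summary[model] = {
            "runs": len(model_rows),
            "failures": len(model_rows) - len(ok_latencies),
            "median_latency_ms": median(ok_latencies) if ok_latencies else None,
            "max_latency_ms": max(ok_latencies) if ok_latencies else None,
        }
    return summary
```

With the loop above, `summarize_by_model(results)` returns one entry per name in `MODELS`.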


## TODO: Write conclusions

- Which model is best for speed?
- Which model is best for quality?
- What are the failure modes (timeouts, low quality, context issues)?

Optional next step: export `results` to JSON and include it in your report.
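One possible way to do the export, as a minimal sketch: the helper name `save_results` and the default filename `benchmark_results.json` are assumptions, and the rows are expected to be plain dicts of strings, booleans, and numbers like those built in the benchmark loop.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def save_results(rows: List[Dict[str, Any]],
                 path: Union[str, Path] = "benchmark_results.json") -> Path:
    """Write benchmark rows to a UTF-8 JSON file and return the path written."""
    out_path = Path(path)
    out_path.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False),  # keep non-ASCII output readable
        encoding="utf-8",
    )
    return out_path
```

Calling `save_results(results)` writes the file next to the notebook. Note that `output_preview` is already truncated to 400 characters, so the export does not contain full model outputs.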
