# Week 5 — Part 02: Calling Ollama via HTTP (minimal client)

**Estimated time:** 60–90 minutes

## Learning Objectives

- Treat local inference as a service call with the same reliability concerns
- Implement a minimal HTTP client for Ollama `/api/generate`
- Apply timeouts and basic parsing/validation
- Save outputs + latency as artifacts for later comparison


### What this part covers
This notebook builds a **minimal HTTP client for Ollama** — treating local inference as a service call with the same reliability concerns as a cloud API.

Even though Ollama runs on your machine, your Python script is still making a network-style call:
- your client process sends a request to `localhost:11434`
- the Ollama server process does the work
- you receive a response (or a failure)

So the same engineering principles apply: **always set timeouts, log failures, validate responses.**

## Overview

Ollama exposes a local HTTP API. This lets you treat local inference like a normal service call.

---

## What success looks like (end of Part 02)

- You can make one request to `POST http://localhost:11434/api/generate` and receive a response.
- You record latency in seconds.
- You can save a JSON artifact to disk under `output/ollama_runs/`.

If you cannot reach the endpoint, confirm Ollama is running before debugging your client code.
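
A quick reachability check can separate "server is down" from "client bug". The helper below is a sketch, not part of the notebook: `ollama_is_reachable` is a hypothetical name, and the only thing it assumes about the server is that it answers an HTTP GET at its root URL. It uses the standard library so it still works if `requests` is not installed, and it bypasses any configured proxy because the target is local.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_reachable(host: str = "http://localhost:11434", timeout_s: float = 2.0) -> bool:
    """Return True if some HTTP server answers at `host` within `timeout_s` seconds.

    This only proves that something is listening; it does not verify
    that the listener is Ollama or that a model is available.
    """
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(host, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is up.
        return True
    except (OSError, http.client.HTTPException):
        # Refused connections, unreachable hosts, and timeouts land here.
        return False
```

If this returns `False`, start Ollama before changing any client code.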

---

## Underlying theory: local inference is still a distributed system (just smaller)

The client and the Ollama server are separate processes that talk over HTTP on the loopback interface, so a request can be refused, stall, or return something unexpected, just like a call to a remote service. That means:

- always set timeouts
- log failures with enough context to debug
- treat response parsing as untrusted input (validate what you need)
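
To make "validate what you need" concrete, here is one possible validation step for a decoded `/api/generate` payload. It is a sketch under assumptions: `extract_response_text` is a hypothetical helper name, and the check for an `error` key assumes the server reports failures as a JSON object with that field.

```python
from typing import Any


def extract_response_text(data: Any) -> str:
    """Validate a decoded payload and return its text.

    Raises ValueError with a specific message instead of silently
    returning an empty string.
    """
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    if "error" in data:
        # Assumption: error payloads carry a top-level "error" field.
        raise ValueError(f"server reported an error: {data['error']}")
    text = data.get("response")
    if not isinstance(text, str):
        raise ValueError("payload has no string 'response' field")
    if not text.strip():
        raise ValueError("model returned an empty response")
    return text
```

Raising on bad payloads keeps an empty or malformed answer from being recorded as a successful run.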

We’ll implement a minimal client that:

- sends a prompt
- receives text output
- prints it
- records latency

### What this cell does
The first code cell in this notebook defines `OllamaGenerateRequest` (typed parameters for one call) and `call_ollama()`, the core HTTP client function.

**Walk through `call_ollama()` step by step:**
1. Build the URL: `{host}/api/generate`
2. Build the payload: `model`, `prompt`, `stream=False` (wait for the full response), and `options.temperature=0.0` (minimizes sampling randomness so repeated runs are easier to compare)
3. Record `t0 = time.time()` before the call
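4. `requests.post(..., timeout=req.timeout_s)`: the timeout covers connecting and each wait for data from the server, not the total request time. With `stream=False` nothing arrives until generation finishes, so in practice it bounds how long you wait for the model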
5. `resp.raise_for_status()` — raises an exception for any 4xx/5xx response
6. Parse the response JSON and extract `response` text
7. Return a dict with model, response text, and latency

**`stream=False` vs `stream=True`:** With `stream=False`, Ollama waits until the full response is generated and returns it as a single JSON object. With `stream=True`, which is the API's default when `stream` is omitted, you receive partial output as it is generated (better for interactive UX, more complex to parse). Use `stream=False` for batch/pipeline work.
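
To show what "more complex to parse" means, the sketch below reassembles a streamed reply. It assumes the streaming format is newline-delimited JSON in which each line carries a `response` fragment and the last line has `done` set to true. `join_stream_chunks` is a hypothetical helper and works on any iterable of lines, so it can be tried without a server.

```python
import json
from typing import Iterable, Union


def join_stream_chunks(lines: Iterable[Union[str, bytes]]) -> str:
    """Concatenate the `response` fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw or not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of generation
    return "".join(parts)


sample = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
]
print(join_stream_chunks(sample))  # prints: Hello
```

With `requests`, the lines would come from `resp.iter_lines()` on a request made with `stream=True` in both the payload and the `requests.post()` call.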

## Minimal client (Python)

Dependency:

- `requests` (commonly used for HTTP)

This notebook calls Ollama’s local endpoint:

- `POST http://localhost:11434/api/generate`

Two practical notes:

- The timeout is a policy choice. Slower hardware or larger models may need longer.
- When you later build a benchmark, consider warmup: the first call can be slower due to model loading.
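
Both notes can be sketched in a few lines. `requests` accepts a `(connect, read)` tuple as its `timeout`, which lets a call fail fast when nothing is listening while still allowing a long wait for generation. The values and the `timed_runs` helper below are illustrative assumptions, not part of the notebook.

```python
import time
from typing import Callable, List

# (connect, read) in seconds: illustrative values, tune them for your machine.
TIMEOUT = (3.0, 120.0)


def timed_runs(fn: Callable[[], object], *, runs: int = 3, warmup: int = 1) -> List[float]:
    """Call `fn` warmup + runs times and return latencies of the measured runs only."""
    latencies = []
    for i in range(warmup + runs):
        t0 = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - t0
        if i >= warmup:
            # Warmup calls absorb one-off costs such as model loading.
            latencies.append(elapsed)
    return latencies
```

For example, `timed_runs(lambda: call_ollama(req), runs=3)` would discard the first call and time the next three.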

---

## Guided walkthrough: one request end-to-end

Goal: prove the contract **request → response → latency → save artifact**.

### Checkpoint

- If Ollama is running and you have at least one model pulled, you should see:
  - `ok=True`
  - a non-empty `response_preview`
  - an artifact path under `output/ollama_runs/`

If not, you should see a clear error and next steps.

### What this cell does
The second code cell makes one real request to Ollama and saves the result as a JSON artifact under `output/ollama_runs/`.

**What to check before running:**
- `DEFAULT_MODEL` must match a model name shown by `ollama list`
- Ollama must be running (`ollama serve`)

**What to check after running:**
- `ok=True` and a non-empty `response_preview`
- A JSON file created under `output/ollama_runs/`
- `latency_s`: the first call is usually slower because the model has to load. Subsequent calls to the same model are typically faster while it stays loaded.

**If it fails:** The cell prints the exception type and message, followed by next steps for the most common causes (Ollama not running, model not pulled, wrong model name).

In [None]:
from __future__ import annotations

import json
import time
from dataclasses import dataclass
from pathlib import Path


try:
    import requests
except ImportError as e:  # pragma: no cover
    # Defer the failure so this cell still runs; call_ollama() reports it.
    requests = None
    _requests_import_error = e


@dataclass(frozen=True)
class OllamaGenerateRequest:
    model: str
    prompt: str
    host: str = "http://localhost:11434"
    timeout_s: float = 60.0


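def call_ollama(req: OllamaGenerateRequest) -> dict:
    """Send one non-streaming generate request and return text plus latency."""
    if requests is None:
        raise RuntimeError(f"requests is required: {_requests_import_error}")

    url = f"{req.host}/api/generate"
    payload = {
        "model": req.model,
        "prompt": req.prompt,
        "stream": False,  # one JSON object after generation finishes
        "options": {"temperature": 0.0},  # reduce sampling randomness
    }

    t0 = time.time()  # start of the latency measurement
    resp = requests.post(url, json=payload, timeout=req.timeout_s, headers={"X-Client": "level-1"})
    resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    data = resp.json()
    response_text = data.get("response", "")  # empty string if the field is missing
    return {
        "model": req.model,
        "response": response_text,
        "latency_s": time.time() - t0,  # includes request, generation, and JSON parsing
    }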


print("call_ollama() ready")

### What this cell does
The exercise cell (the third code cell) defines `call_ollama_safe()`, a wrapper that **never raises**. Instead of propagating exceptions, it catches them and returns a structured dict with an `ok` flag and an `error` string.

**Why a "safe" wrapper?** When benchmarking multiple models, you want to collect all results — including failures — without the loop crashing on the first error. A safe wrapper lets you store failures as data: `{"ok": false, "error": "ConnectionError: ..."}` is a valid benchmark result.

**Your task:** Implement `call_ollama_safe()` using `call_ollama()`. Requirements:
- Always return a dict with keys: `ok`, `model`, `latency_s`, `response`, `error`
- On success: `ok=True`, `error=None`
- On failure: `ok=False`, `error="{ExceptionType}: {message}"`

In [None]:
DEFAULT_HOST = "http://localhost:11434"
DEFAULT_MODEL = "llama3.1"  # TODO: change to a model you have pulled (see `ollama list`)
DEFAULT_PROMPT = "Say hello in one sentence."

req = OllamaGenerateRequest(model=DEFAULT_MODEL, prompt=DEFAULT_PROMPT, host=DEFAULT_HOST, timeout_s=60.0)

try:
    out = call_ollama(req)
    ok = True
    err = None
except Exception as e:
    out = {"model": req.model, "response": "", "latency_s": None}
    ok = False
    err = f"{type(e).__name__}: {e}"

print("ok=", ok)
if err:
    print("error=", err)
    print("Next steps:")
    print("- Ensure Ollama is running: ollama serve")
    print("- Ensure you pulled a model: ollama pull <model>")
    print("- Set DEFAULT_MODEL to a model shown by: ollama list")
else:
    print("latency_s=", out.get("latency_s"))
    print("response_preview=", (out.get("response") or "")[:200])
    # save_result() is defined in the "Next: saving artifacts" cell; run that cell first.
    saved = save_result(out, out_dir=Path("output/ollama_runs"))
    print("saved_to=", saved)


## Exercise: make the client more robust

Implement the TODO below.

Goal:

- Return a dict that *always* has keys: `ok`, `model`, `latency_s`, `response`, `error`.
- Never raise in this helper (errors should be captured as strings).

This pattern makes benchmarking easier because you can store failures as data.
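
As an illustration of "failures as data", the sketch below summarizes a list of result dicts in the shape described above. `summarize_runs` is a hypothetical helper, and the sample records are made up for the example.

```python
import statistics
from typing import Any, Dict, List


def summarize_runs(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate result dicts that carry `ok`, `latency_s`, and `error` keys."""
    ok_runs = [r for r in results if r.get("ok")]
    latencies = [r["latency_s"] for r in ok_runs if r.get("latency_s") is not None]
    return {
        "total": len(results),
        "succeeded": len(ok_runs),
        "failed": len(results) - len(ok_runs),
        "median_latency_s": statistics.median(latencies) if latencies else None,
        "errors": [r.get("error") for r in results if not r.get("ok")],
    }


# Made-up records for illustration only.
example = [
    {"ok": True, "model": "a", "latency_s": 1.0, "response": "hi", "error": None},
    {"ok": True, "model": "a", "latency_s": 3.0, "response": "hey", "error": None},
    {"ok": False, "model": "b", "latency_s": None, "response": "", "error": "ConnectionError: refused"},
]
summary = summarize_runs(example)
```

Because a failed call is just another record, the loop that collects results never needs its own `try`/`except`.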


In [None]:
from typing import Any, Dict


def call_ollama_safe(req: OllamaGenerateRequest) -> Dict[str, Any]:
    # TODO: implement using call_ollama(req).
    # Requirements:
    # - include ok flag
    # - include error string on failure
    # - include latency_s when available
    return {
        "ok": False,
        "model": req.model,
        "latency_s": None,
        "response": "",
        "error": "TODO: implement call_ollama_safe(req)",
    }


print("Implement call_ollama_safe().")

## Next: saving artifacts

After you can make one successful request, the next step is to save results to disk so you can:

- inspect outputs later for quality
- compare runs across models
- share evidence with others

The `save_result()` helper below writes one JSON file per run. The guided walkthrough cell above calls it, so run this cell before re-running that one.

In [None]:
def save_result(out: dict, *, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp so two runs within the same second do not overwrite each other.
    out_path = out_dir / f"ollama_{time.time_ns()}.json"
    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
    return out_path


# Example usage (requires Ollama running locally):
# req = OllamaGenerateRequest(model="llama3.1", prompt="Say hello in one sentence")
# out = call_ollama(req)
# print(json.dumps(out, indent=2))
# print("saved to", save_result(out, out_dir=Path("output/ollama_runs")))
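
Once a few artifacts exist, they can be read back for comparison. The sketch below assumes the `ollama_*.json` naming used by `save_result()`. `load_results` and the added `artifact` key are hypothetical, introduced only for this example.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_results(out_dir: Path) -> List[Dict[str, Any]]:
    """Read every ollama_*.json artifact in `out_dir`, sorted by filename."""
    results = []
    for path in sorted(out_dir.glob("ollama_*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        record["artifact"] = path.name  # remember which file the record came from
        results.append(record)
    return results
```

For example, `load_results(Path("output/ollama_runs"))` returns one dict per saved run, ready for side-by-side inspection.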

## How to run (CLI)

```bash
python call_ollama.py --model llama3.1 --prompt "Say hello in one sentence"
```

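This assumes you have exported the client to a script named `call_ollama.py` that accepts `--model` and `--prompt`; the notebook cells above do not define that command-line interface. If the command prints a response, you've proven local inference end-to-end.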
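
The script itself is not shown in this part, so the following is only one way its argument handling might look. `build_parser` and `parse_args` are hypothetical names, and the `--host` and `--timeout-s` flags are additions for illustration that mirror the defaults of `OllamaGenerateRequest`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags used in the command above, plus two optional extras."""
    parser = argparse.ArgumentParser(description="Send one prompt to a local Ollama server.")
    parser.add_argument("--model", required=True, help="model name as shown by ollama list")
    parser.add_argument("--prompt", required=True, help="prompt text to send")
    # Assumed extras, not part of the command shown above.
    parser.add_argument("--host", default="http://localhost:11434")
    parser.add_argument("--timeout-s", type=float, default=60.0)
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or the process arguments when `argv` is None."""
    return build_parser().parse_args(argv)
```

In the script, the parsed values would be passed to `OllamaGenerateRequest(model=args.model, prompt=args.prompt, host=args.host, timeout_s=args.timeout_s)` and then to `call_ollama()`.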

---

## Common pitfalls

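- Ollama service not running: requests fail almost immediately with a connection error. Start it with `ollama serve`.
- Wrong model name: `DEFAULT_MODEL` must match a name shown by `ollama list`; otherwise the server answers with an HTTP error.
- Timeouts on slow hardware: raise `timeout_s` or try a smaller model.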
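
These pitfalls can be turned into a small lookup over the `"{ExceptionType}: {message}"` strings that the safe wrapper produces. The sketch is hedged: `hint_for_error` is a hypothetical helper, the exception names are those raised by `requests`, and treating a 404 as "model not found" is an assumption about how the server reports an unknown model.

```python
def hint_for_error(error: str) -> str:
    """Map an "{ExceptionType}: {message}" string to a likely next step."""
    kind = error.split(":", 1)[0].strip()
    if kind in {"ConnectionError", "ConnectTimeout"}:
        return "Ollama is probably not running; start it with `ollama serve`."
    if kind in {"ReadTimeout", "Timeout"}:
        return "The call timed out; raise timeout_s or try a smaller model."
    if kind == "HTTPError" and "404" in error:
        # Assumption: an unknown model name yields a 404 response.
        return "The server did not find the model; check the name with `ollama list`."
    return "Unrecognized failure; read the full error message."
```

For instance, `hint_for_error("ReadTimeout: read timed out")` points at the timeout setting rather than at the server.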

---

## References

- Ollama docs/issues: https://github.com/ollama/ollama
- Requests timeouts: https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts

## Appendix: Solutions (peek only after trying)

Reference implementation for `call_ollama_safe`.

In [None]:
from typing import Any, Dict


def call_ollama_safe(req: OllamaGenerateRequest) -> Dict[str, Any]:
    try:
        out = call_ollama(req)
        return {
            "ok": True,
            "model": out.get("model", req.model),
            "latency_s": out.get("latency_s"),
            "response": out.get("response", ""),
            "error": None,
        }
    except Exception as e:
        return {
            "ok": False,
            "model": req.model,
            "latency_s": None,
            "response": "",
            "error": "%s: %s" % (type(e).__name__, e),
        }


print(call_ollama_safe(req))