# Week 5 — Part 02: Calling Ollama via HTTP (minimal client)

**Estimated time:** 60–90 minutes

## Learning Objectives

- Treat local inference as a service call with the same reliability concerns as a remote API call
- Implement a minimal HTTP client for Ollama `/api/generate`
- Apply timeouts and basic parsing/validation
- Save outputs + latency as artifacts for later comparison


## Overview

Ollama exposes a local HTTP API. This lets you treat local inference like a normal service call.

---

## Underlying theory: local inference is still a distributed system (just smaller)

Even though the model is on your machine, your Python script is still making a network-style call:

- your client process sends a request
- the Ollama server process does work
- you receive a response (or a failure)

So the same engineering principles apply:

- always set timeouts
- log failures with enough context to debug
- treat the response body as untrusted input (validate the fields you need)
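
The last principle can be sketched as a small helper that checks a parsed JSON body before anything downstream uses it. This is an illustration, not part of the notebook's client: the name `extract_response_text` is hypothetical, and the `response` and `error` field names are assumptions about what the server's JSON replies contain.

```python
from typing import Any


def extract_response_text(data: Any) -> str:
    """Return the generated text, or raise ValueError if the body is not usable."""
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    if "error" in data:
        # Assumption: the server reports failures in an "error" field.
        raise ValueError(f"server reported an error: {data['error']}")
    text = data.get("response")
    if not isinstance(text, str):
        raise ValueError("missing or non-string 'response' field")
    return text


print(extract_response_text({"response": "Hello!", "done": True}))
```

Raising on a malformed body keeps an empty or missing answer from being saved as if it were a real model output.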

We’ll implement a minimal client that:

- sends a prompt
- receives text output
- prints it
- records latency

## Minimal client (Python)

Dependency:

- `requests` (commonly used for HTTP)

This notebook shows a minimal call to Ollama’s local endpoint:

- `POST http://localhost:11434/api/generate`

Two practical notes:

- The timeout is a policy choice: the 60-second default used in the code below is a starting point, and slower hardware or larger models may need longer.
- When you later build a benchmark, consider warmup: the first call can be slower due to model loading.
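
The warmup note can be sketched as a timing helper that runs a few untimed calls first. The helper name `time_calls` and its parameters are hypothetical, and the workload here is a stand-in so the sketch runs without a server.

```python
from __future__ import annotations

import time
from typing import Callable


def time_calls(fn: Callable[[], object], *, warmup: int = 1, runs: int = 3) -> list[float]:
    """Call fn() `warmup` times untimed, then return latencies for `runs` timed calls."""
    for _ in range(warmup):
        fn()  # absorbs one-off costs such as model loading
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    return latencies


# Stand-in workload so the sketch runs without a server.
latencies = time_calls(lambda: sum(range(10_000)), warmup=1, runs=3)
print(f"{len(latencies)} timed runs, fastest {min(latencies):.6f}s")
```

With the client from this notebook, `fn` could be `lambda: call_ollama(req)`, so the first (slow) call is excluded from the recorded latencies.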

```python
from __future__ import annotations

import json
import time
from dataclasses import dataclass
from pathlib import Path


try:
    import requests
except Exception as e:  # pragma: no cover
    requests = None
    _requests_import_error = e


@dataclass(frozen=True)
class OllamaGenerateRequest:
    model: str
    prompt: str
    host: str = "http://localhost:11434"
    timeout_s: float = 60.0  # policy choice; raise for slow hardware or large models


def call_ollama(req: OllamaGenerateRequest) -> dict:
    if requests is None:
        raise RuntimeError(f"requests is required: {_requests_import_error}")

    url = f"{req.host}/api/generate"
    payload = {
        "model": req.model,
        "prompt": req.prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"temperature": 0.0},
    }

    # perf_counter() is monotonic, which makes it safer than time.time() for durations.
    t0 = time.perf_counter()
    resp = requests.post(url, json=payload, timeout=req.timeout_s, headers={"X-Client": "level-1"})
    resp.raise_for_status()
    data = resp.json()
    latency_s = time.perf_counter() - t0

    # Validate the one field we depend on instead of silently defaulting to "".
    response_text = data.get("response") if isinstance(data, dict) else None
    if not isinstance(response_text, str):
        raise ValueError(f"unexpected response shape from {url}: {data!r}")

    return {
        "model": req.model,
        "response": response_text,
        "latency_s": latency_s,
    }


print("call_ollama() ready")
```

```python
def save_result(out: dict, *, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp so two saves within the same second do not overwrite each other.
    out_path = out_dir / f"ollama_{time.time_ns()}.json"
    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
    return out_path


# Example usage (requires Ollama running locally):
# req = OllamaGenerateRequest(model="llama3.1", prompt="Say hello in one sentence")
# out = call_ollama(req)
# print(json.dumps(out, indent=2))
# print("saved to", save_result(out, out_dir=Path("output/ollama_runs")))
```

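## How to run (CLI)

The command below assumes the cells above are saved as `call_ollama.py` together with an argument parser for `--model` and `--prompt`; the notebook cells themselves do not define one.

```bash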
python call_ollama.py --model llama3.1 --prompt "Say hello in one sentence"
```

If this works, you’ve proven local inference end-to-end.
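
The notebook cells do not include a command-line entry point, so the following is one possible sketch of the argument parsing such a script would need. `--model` and `--prompt` come from the command above; `build_parser`, `--host`, and `--timeout-s` are hypothetical additions that mirror the fields of `OllamaGenerateRequest`.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Send one prompt to a local Ollama server.")
    parser.add_argument("--model", required=True, help="model name, e.g. llama3.1")
    parser.add_argument("--prompt", required=True, help="prompt text to send")
    # Hypothetical extra flags mirroring the dataclass defaults.
    parser.add_argument("--host", default="http://localhost:11434", help="Ollama base URL")
    parser.add_argument("--timeout-s", type=float, default=60.0, help="request timeout in seconds")
    return parser


# Parse an explicit argument list so the sketch runs anywhere.
args = build_parser().parse_args(["--model", "llama3.1", "--prompt", "Say hello in one sentence"])
print(args.model, args.timeout_s)

# In call_ollama.py, the parsed values would feed the notebook's dataclass:
# req = OllamaGenerateRequest(model=args.model, prompt=args.prompt,
#                             host=args.host, timeout_s=args.timeout_s)
# print(json.dumps(call_ollama(req), indent=2))
```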

---

## Common pitfalls

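- Ollama service not running: the request fails with a connection error before any model work starts
- wrong model name: the name must match a model that has already been pulled (check with `ollama list`)
- timeouts for slow hardware: raise `timeout_s`, and expect the first call after startup to be the slowest because the model has to load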
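
The first two pitfalls can be caught before sending a prompt by asking the server which models it has. The sketch below checks a parsed `/api/tags` payload. The helper name `model_is_listed` is hypothetical, and two details are assumptions: that the payload has the shape `{"models": [{"name": ...}]}` and that a name without a tag means `:latest`.

```python
from __future__ import annotations


def model_is_listed(tags: dict, model: str) -> bool:
    """Check whether `model` appears in a parsed /api/tags payload."""
    # Assumption: a name without a tag refers to the ":latest" tag.
    wanted = model if ":" in model else f"{model}:latest"
    # "or []" also covers a payload where "models" is null.
    names = {m.get("name") for m in (tags.get("models") or []) if isinstance(m, dict)}
    return wanted in names


sample = {"models": [{"name": "llama3.1:latest"}, {"name": "mistral:7b"}]}
print(model_is_listed(sample, "llama3.1"))  # True
print(model_is_listed(sample, "llama3"))    # False

# Against a live server (requires Ollama running locally):
# import requests
# tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
# print(model_is_listed(tags, "llama3.1"))
```

If the `GET` itself raises a connection error, the service is most likely not running, which separates the first pitfall from the second.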

---

## References

- Ollama docs/issues: https://github.com/ollama/ollama
- Requests timeouts: https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts