# Week 5 — Part 01: Local inference concepts + setup checklist

**Estimated time:** 45–75 minutes

## Learning Objectives

- Define inference and local inference
- Explain how moving from hosted APIs to local inference changes constraints
- Understand how model size, context length, and quantization affect latency and memory
- Follow a practical setup checklist for Ollama


## Overview

**Inference** = using a trained model to generate outputs.

**Local inference** = you run the model on your own machine.

Local inference is useful for:

- privacy (data stays local)
- cost control (no per-request billing)
- offline capability

Trade-offs:

- quality may be lower than top hosted models
- performance depends on your CPU/GPU/RAM/VRAM

---

## What success looks like (end of Part 01)

- You can run:
  - `ollama --version`
  - `ollama list`
- You can confirm the local server is reachable either by:
  - CLI (`ollama list` succeeds), or
  - HTTP (`GET http://localhost:11434/api/tags` succeeds)

If you cannot complete the checks above, fix runtime/environment issues before writing client/benchmark code.

---

## Underlying theory: moving the boundary changes your constraints

When you use a hosted API, the provider owns the compute and you mostly worry about:

- request formatting
- rate limits
- latency and cost

When you run locally, you become the provider. That means **hardware is now part of your system design**.

You can think of local inference performance as a function:

$$
\text{latency} = f(\text{model size},\ \text{context length},\ \text{hardware},\ \text{quantization})
$$

Practical implication:

- if a model does not fit in RAM/VRAM, it may fail to load, or it may run very slowly because weights spill from VRAM into system RAM or from RAM to disk (thrashing)
- even if it fits, throughput/latency can vary dramatically across machines
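
One way to make the function above more concrete is to split a request into three phases: loading the model, processing the prompt (prefill), and generating tokens (decode). The sketch below is a simplification under that assumption. The function name `estimate_latency_s`, its parameters, and the example speeds are hypothetical illustrations, not measurements.

```python
def estimate_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    load_s: float = 0.0,
) -> float:
    """Rough end-to-end latency: load time + prefill time + decode time."""
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("speeds must be positive")
    prefill_s = prompt_tokens / prefill_tokens_per_s
    decode_s = output_tokens / decode_tokens_per_s
    return load_s + prefill_s + decode_s


# Illustrative numbers only: the same request on a faster and a slower machine.
fast = estimate_latency_s(500, 200, prefill_tokens_per_s=1000, decode_tokens_per_s=50)
slow = estimate_latency_s(500, 200, prefill_tokens_per_s=100, decode_tokens_per_s=5)
print(f"fast={fast:.1f}s slow={slow:.1f}s")
```

In this toy model, model size, quantization, and hardware all act through the two speed parameters, while context length acts through `prompt_tokens`.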

## Setup checklist (practical)

1. Install Ollama
2. Start the Ollama service
3. Pull a model
4. Run a test prompt

What to do and what “success” looks like:

1. **Install Ollama**
    - Goal: have the `ollama` CLI available.
    - Verify: `ollama --version` prints a version.

2. **Start the Ollama service**
    - Goal: local server process ready to accept requests.
    - Verify: `ollama serve` starts without immediately exiting.
    - Common failure: a port conflict (for example, another Ollama instance already listening on the default port 11434) or permission issues.

3. **Pull a model**
    - Goal: download at least one model with `ollama pull <model_name>`.
    - Verify: `ollama list` shows the model.
    - Practical note: start small to avoid memory failures.

4. **Run a test prompt**
    - Goal: confirm request → generation works locally.
    - Verify: `ollama run <model_name>` produces output and doesn’t crash.
    - Note: first run can be slow due to model loading.

```python
import platform
import shutil
import subprocess
from typing import List


def try_run(cmd: List[str]) -> None:
    print("$", " ".join(cmd))
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=False)
        print("returncode=", out.returncode)
        if out.stdout:
            print(out.stdout.strip())
        if out.stderr:
            print(out.stderr.strip())
    except FileNotFoundError:
        print("command not found")


print("python=", platform.python_version())
print("platform=", platform.platform())
print("ollama_in_path=", shutil.which("ollama") is not None)

try_run(["ollama", "--version"])
try_run(["ollama", "list"])
```

## Guided check (recommended): verify the Ollama HTTP endpoint

Even though Ollama runs on your machine, it exposes a local HTTP API (default `http://localhost:11434`).

This is useful because:

- it confirms the server is actually running
- it matches how your later Python client and benchmark will call Ollama

### Checkpoint

- If the server is running, you should see a JSON response containing a `models` list.
- If it is not running, you will typically see a connection-refused error.


```python
import json


try:
    import requests
except ImportError as e:  # requests is optional; the error is reported when it is needed
    requests = None
    _requests_import_error = e


def fetch_ollama_tags(host: str = "http://localhost:11434", *, timeout_s: float = 2.0) -> dict:
    if requests is None:
        raise RuntimeError(f"requests is required for HTTP health checks: {_requests_import_error}")
    url = f"{host}/api/tags"
    # Short timeout so a stopped server fails fast instead of hanging.
    resp = requests.get(url, timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()


try:
    tags = fetch_ollama_tags()
    model_names = [m.get("name") for m in tags.get("models", [])]
    print("ollama_http_ok=True")
    print("models=", model_names)
except Exception as e:
    print("ollama_http_ok=False")
    print("error=", type(e).__name__, str(e))
    print("Next steps:")
    print("- Ensure Ollama is installed: ollama --version")
    print("- Ensure the server is running: ollama serve")
    print("- Ensure at least one model is pulled: ollama pull <model>")
    print("- Then re-run this cell")
```
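
Once the tags endpoint responds, step 4 of the checklist (run a test prompt) can also be exercised over HTTP. The sketch below assumes Ollama's `POST /api/generate` endpoint with `"stream": false`, and assumes the response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The helper names are hypothetical, and the standard library is used so the example does not depend on `requests`.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str) -> dict:
    # "stream": False asks for one JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(response: dict) -> float:
    # Assumes eval_count = generated tokens and eval_duration is in nanoseconds.
    eval_ns = response.get("eval_duration", 0)
    if eval_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (eval_ns / 1e9)


def generate(
    model: str,
    prompt: str,
    host: str = "http://localhost:11434",
    timeout_s: float = 120.0,
) -> dict:
    # Generous timeout: the first request may include model loading time.
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running and a model pulled, `generate("<model_name>", "Say hello")` should return a dictionary whose `response` field holds the generated text. Passing that dictionary to `tokens_per_second` gives a rough decode speed for your machine.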

## What “model size / context window / quantization” mean

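- **Size (e.g., 7B, 13B)**: larger models often give better quality but are slower and need more memory.
- **Context window**: the maximum number of tokens (prompt plus generated output) the model can handle in one request. Longer contexts use more memory and take longer to process.
- **Quantization**: a smaller memory footprint, usually at a small cost in quality.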

More concrete intuition:

- model size is roughly the number of parameters
- more parameters usually means more compute per generated token
- quantization stores weights with fewer bits, reducing memory and often increasing speed on constrained hardware

Practical rule of thumb: local inference is often bottlenecked by memory bandwidth and/or VRAM capacity, not just CPU speed.
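
The bandwidth point can be turned into a back-of-the-envelope bound. Generating each token requires reading roughly all of the model weights once, so decode speed is at most about memory bandwidth divided by model size. The sketch below assumes that simplification; the function name and the bandwidth figures are illustrative assumptions, not measurements of any specific device.

```python
def max_decode_tokens_per_s(model_size_gb: float, memory_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return memory_bandwidth_gb_s / model_size_gb


# Assumed example: a model whose quantized weights occupy about 4 GB.
model_gb = 4.0
for label, bandwidth in [
    ("system RAM (assumed ~50 GB/s)", 50.0),
    ("discrete GPU (assumed ~500 GB/s)", 500.0),
]:
    bound = max_decode_tokens_per_s(model_gb, bandwidth)
    print(f"{label}: at most ~{bound:.1f} tokens/s")
```

Real throughput is usually lower than this bound, but the ratio helps explain why the same model can feel very different on two machines with similar CPUs.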

For the Fundamental Course, focus on the practical effect:

- if it doesn’t fit, you can’t run it

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    # Rough estimate for weights only: params * bits per weight -> bytes -> GB.
    # Excludes the context (KV) cache and runtime overhead.
    # Dividing by 1024 ** 3 strictly gives GiB; it is labeled GB here for simplicity.
    params = params_billion * 1_000_000_000
    bytes_used = params * (bits_per_weight / 8)
    return bytes_used / (1024 ** 3)


sizes = [7, 13, 70]
for s in sizes:
    for bits in [4, 8, 16]:
        print(f"{s}B @ {bits}-bit: {estimate_memory_gb(s, bits):.2f} GB")
```
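
The estimate above covers weights only. Context length adds memory on top, because the runtime caches keys and values for every token in the context. A minimal sketch of that cost follows, assuming a standard transformer KV cache stored at 16-bit precision. The function name is hypothetical, and the layer and head counts are assumed values roughly in the range of a 7B Llama-style model, not taken from any specific model card.

```python
def estimate_kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes = 16-bit precision (assumption)
) -> float:
    # Factor of 2: one key vector and one value vector per token, per layer, per KV head.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / (1024 ** 3)


# Assumed shape: 32 layers, 32 KV heads, head dimension 128.
for ctx in [2048, 4096, 8192]:
    gib = estimate_kv_cache_gib(32, 32, 128, ctx)
    print(f"context={ctx}: ~{gib:.1f} GiB of KV cache")
```

Models that share KV heads across query heads (grouped-query attention) have a smaller `n_kv_heads` and therefore a smaller cache, so treat these numbers as an order-of-magnitude guide.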

```python
def choose_model_for_hardware(vram_gb: float) -> str:
    # TODO: implement a rule-of-thumb mapping.
    # Example:
    # - vram < 8 -> prefer 3B or smaller
    # - vram < 16 -> prefer 7B
    # - otherwise -> 13B+ (if latency acceptable)
    #
    # Placeholder behavior (so the notebook can run end-to-end):
    return "TODO: implement choose_model_for_hardware(vram_gb)"


print("Implement choose_model_for_hardware().")
```

## References

- Ollama: https://ollama.com/
- Ollama GitHub: https://github.com/ollama/ollama
- Hugging Face model cards: https://huggingface.co/docs/hub/model-cards

## Exercise: implement a health check function

In later notebooks you’ll call Ollama via HTTP. Before you do that, you need a quick local health check.

Requirements:

- Return `True` only when Ollama is reachable.
- If you use HTTP, use a short timeout (fast failure).
- If you use CLI, check return codes.

When done, you should be able to run:

- `check_ollama_status()` → `True` when Ollama is running, otherwise `False`.


```python
def check_ollama_status() -> bool:
    # TODO: implement a quick local health check.
    # Option 1: run `ollama list` and check return code.
    # Option 2: attempt a small HTTP request to localhost:11434.
    #
    # Placeholder behavior (so the notebook can run end-to-end):
    return False


print("Implement check_ollama_status().")
```

## Appendix: Solutions (peek only after trying)

Reference implementations for the TODO exercises above.

```python
import subprocess  # also imported in the setup cell; repeated here for clarity


def choose_model_for_hardware(vram_gb: float) -> str:
    if vram_gb < 8:
        return "Prefer a 3B (or smaller) model / stronger quantization (e.g., 4-bit)"
    if vram_gb < 16:
        return "Prefer a 7B model (quantized if needed)"
    if vram_gb < 24:
        return "Try a 13B model (quantized if needed)"
    return "Try 13B+ (and compare speed/quality); consider context length and latency"


def check_ollama_status() -> bool:
    # Fast path: HTTP tags endpoint (fetch_ollama_tags comes from the guided check cell)
    try:
        _ = fetch_ollama_tags(timeout_s=1.5)
        return True
    except Exception:
        pass

    # Fallback: CLI `ollama list`, with a timeout so a stuck process cannot hang the check
    try:
        out = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, check=False, timeout=5
        )
        return out.returncode == 0
    except Exception:
        return False


print(choose_model_for_hardware(6))
print(choose_model_for_hardware(12))
print(choose_model_for_hardware(20))
print("ollama_ok=", check_ollama_status())
```
