# Week 4 — Part 05: LLM Client Skeleton Lab

**Estimated time:** 120–150 minutes

---

## Pre-study (Self-learn)

The Foundations Course assumes you have completed Self-learn. If you need a refresher on reliability patterns:

- [Foundations Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Chapter 5: Resource Monitoring and Containerization](../self_learn/Chapters/5/Chapter5.md)

---

## What success looks like (end of Part 05)

- You have a reusable LLM client with:
  - Timeouts
  - Retries with backoff
  - Rate-limit handling (429 responses; added in the practice exercises)
  - Caching
  - Logging

### Checkpoint

After running this notebook:
- You can call the LLM client with retries
- You can demonstrate cache hits

## Learning Objectives

- Build a production-ready LLM client
- Combine timeouts, retries, caching, and logging
- Create reusable client code

### What this part covers
This notebook assembles all the reliability layers from Parts 01–04 into a **single reusable `LLMClient` class** — your building block for every LLM project going forward.

The client combines:
- **Cache check** (Part 04) — return cached result if available
- **Provider call** with timeout (Part 01)
- **Retry loop** with backoff (Part 02)
- **Structured logging** (Part 04)

**Design principle:** The provider-specific API call is isolated in `_provider_call()`. Everything else (caching, retries, logging) is provider-agnostic. Swapping from OpenAI to Ollama to Anthropic only requires changing `_provider_call()`.
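
The isolation described above can be sketched in a few lines. This is a minimal illustration, not the lab's client: `Request`, `BaseClient`, `EchoProvider`, and `UpperProvider` are hypothetical names, and the two "providers" are fakes standing in for real SDK calls, which are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    model: str
    prompt: str


class BaseClient:
    """Provider-agnostic shell: shared logic lives in call()."""

    def _provider_call(self, req: Request, *, timeout_s: float) -> str:
        raise NotImplementedError("subclasses supply the provider-specific call")

    def call(self, req: Request, *, timeout_s: float = 30.0) -> str:
        # In the full client, caching, retries, and logging wrap this line.
        return self._provider_call(req, timeout_s=timeout_s)


class EchoProvider(BaseClient):
    # Fake provider 1 (assumption for illustration only).
    def _provider_call(self, req: Request, *, timeout_s: float) -> str:
        return f"echo: {req.prompt}"


class UpperProvider(BaseClient):
    # Fake provider 2 (assumption for illustration only).
    def _provider_call(self, req: Request, *, timeout_s: float) -> str:
        return req.prompt.upper()


req = Request(model="demo", prompt="hello")
results = [c.call(req) for c in (EchoProvider(), UpperProvider())]
print(results)
```

Only `_provider_call()` differs between the two subclasses; `call()` is inherited unchanged, which is the property the lab's `LLMClient` relies on.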

## Overview

Your goal is a single module you can reuse across projects that provides:

- timeouts
- retries + backoff
- basic rate limit handling
- basic caching
- logging

In this lab you’ll implement a provider-agnostic skeleton and keep the provider-specific API call isolated in `_provider_call`.

If you want the deeper reliability-boundary discussion, use the Self-learn links at the top of the notebook.

### What this cell does
The first code cell below defines `LLMRequest`, `make_cache_key()`, `SimpleMemoryCache`, and the full `LLMClient` class.

**Walk through `LLMClient.call()` step by step:**
1. Generate a unique `request_id` (UUID) for tracing
2. Compute the cache key from the request
3. Check cache — if hit, log and return immediately
4. Enter retry loop (up to `max_retries + 1` attempts):
   - Call `_provider_call()` with timeout
   - On success: log, cache the result, return
   - On failure: log the error, sleep with backoff, retry
5. After all retries exhausted: raise `RuntimeError`

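**Your task:** Implement `_provider_call()` using your chosen provider's SDK (OpenAI, Ollama, etc.). The rest of the client (cache check, retry loop, logging) works as written and needs no changes for this lab. Note that `call()` currently retries on any `Exception`; narrowing that to transient errors is up to you once `_provider_call()` raises distinct error types.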
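
One practical detail about the logging steps: fields passed through `extra=` become attributes on the `LogRecord`, but the default `logging.basicConfig()` format prints only level, logger name, and message, so `request_id` and `latency_s` are not visible in the output. The following is a minimal sketch of a formatter that surfaces them as JSON. `JsonExtraFormatter` and the logger name `llm_client_demo` are hypothetical names chosen for this example.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else was supplied via `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))


class JsonExtraFormatter(logging.Formatter):
    """Render the message plus any `extra` fields as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for key, value in vars(record).items():
            # Skip built-in attributes and the ones other formatters may add.
            if key not in _STANDARD_ATTRS and key not in ("message", "asctime"):
                payload[key] = value
        return json.dumps(payload, default=str, sort_keys=True)


# Log into an in-memory stream so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonExtraFormatter())

demo_logger = logging.getLogger("llm_client_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger
demo_logger.addHandler(handler)

demo_logger.info("llm_cache_hit", extra={"request_id": "abc123", "model": "demo"})
line = stream.getvalue().strip()
print(line)
```

Attaching a formatter like this to the notebook's own handler is one option; a dedicated structured-logging library is another.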

## Skeleton design

We’ll define:

- a request payload (model + prompt + settings)
- a stable cache key
- a `call()` method

The cache key must represent the “effective input” to the model. If two requests differ in any setting that can change output, they must not share a key.
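
A minimal sketch of that rule, using the same hashing approach as the lab's `make_cache_key()`. The names `Request` and `cache_key` are stand-ins for this example, and the `system_prompt` field is a hypothetical addition (it anticipates practice exercise 1).

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Request:
    model: str
    prompt: str
    temperature: float = 0.0
    system_prompt: str = ""  # hypothetical extra setting, for illustration


def cache_key(req: Request) -> str:
    # asdict() picks up every field, so new settings enter the key automatically.
    raw = json.dumps(asdict(req), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


a = Request(model="demo", prompt="hello")
b = Request(model="demo", prompt="hello")
c = Request(model="demo", prompt="hello", temperature=0.7)
d = Request(model="demo", prompt="hello", system_prompt="Be terse.")

same_inputs_share_key = cache_key(a) == cache_key(b)
distinct_keys = len({cache_key(r) for r in (a, c, d)})
print(same_inputs_share_key, distinct_keys)
```

Because the key is built from `asdict(req)`, a setting only affects the key if it is a field on the request. A setting passed to the provider some other way (for example a hard-coded `max_tokens`) would silently be left out of the key.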

Next you’ll implement a provider-agnostic skeleton you can adapt later.

### What this cell does
The second code cell (after the practice exercises) defines `add_jitter()` and `backoff_delay()`, two helpers for improving the retry backoff strategy.

**`backoff_delay(attempt)`**: exponential backoff, `min(cap, base * 2^(attempt-1))`. With `base=0.2` and `cap=2.0` (the values used in the demo loop), attempts 1–5 give 0.2s, 0.4s, 0.8s, 1.6s, 2.0s (capped). The function's own defaults are `base=0.5` and `cap=8.0`.

**`add_jitter(delay_s)`**: randomizes the delay within `[0, delay_s]` ("full jitter"). This prevents multiple clients from retrying at exactly the same moment after a shared failure.

**Your task:** Implement both functions. The cell keeps its `TODO` markers but already contains working bodies that match the Appendix, so write your own versions first and then compare.

In [None]:
import hashlib
import json
import logging
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


@dataclass(frozen=True)
class LLMRequest:
    model: str
    prompt: str
    temperature: float = 0.0


def make_cache_key(req: LLMRequest) -> str:
    raw = json.dumps(asdict(req), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SimpleMemoryCache:
    def __init__(self) -> None:
        self._store = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


class LLMClient:
    def __init__(self, cache: Optional[SimpleMemoryCache] = None) -> None:
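        # Compare against None: `cache or ...` would discard an empty cache that defines __len__.
        self._cache = cache if cache is not None else SimpleMemoryCache()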

    def _provider_call(self, req: LLMRequest, *, timeout_s: float) -> str:
        # TODO: Implement provider-specific HTTP/API call.
        # This method must:
        # - respect timeout_s
        # - raise clear exceptions for retry classification
        time.sleep(min(0.05, float(timeout_s)))
        return "echo: %s" % req.prompt

    def call(self, req: LLMRequest, *, timeout_s: float = 30.0, max_retries: int = 2) -> str:
        request_id = str(uuid.uuid4())
        cache_key = make_cache_key(req)

        cached = self._cache.get(cache_key)
        if cached is not None:
            logger.info("llm_cache_hit", extra={"request_id": request_id, "model": req.model})
            return cached

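        last_err = None
        for attempt in range(max_retries + 1):
            # perf_counter() is monotonic, so latency is unaffected by wall-clock adjustments.
            t0 = time.perf_counter()
            try:
                text = self._provider_call(req, timeout_s=timeout_s)
                logger.info(
                    "llm_call_ok",
                    extra={
                        "request_id": request_id,
                        "model": req.model,
                        "latency_s": time.perf_counter() - t0,
                        "attempt": attempt,
                    },
                )
                self._cache.set(cache_key, text)
                return text
            except Exception as e:
                # Every exception is treated as retryable here; narrow this once
                # _provider_call() raises distinct error types.
                last_err = e
                logger.warning(
                    "llm_call_failed",
                    extra={
                        "request_id": request_id,
                        "model": req.model,
                        "latency_s": time.perf_counter() - t0,
                        "attempt": attempt,
                        "error_type": type(e).__name__,
                    },
                )
                if attempt < max_retries:
                    # Exponential backoff without jitter: 1s, 2s, then capped at 4s.
                    time.sleep(min(2 ** attempt, 4))

        # Chain the last provider error so the traceback shows the root cause.
        raise RuntimeError("LLM call failed after retries: %s" % last_err) from last_err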


# Quick sanity call (still provider-stubbed)
client = LLMClient()
print(client.call(LLMRequest(model="demo", prompt="hello"), timeout_s=1.0, max_retries=0))
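
To demonstrate a cache hit, it helps to count how often the provider is actually invoked. The sketch below is a stripped-down stand-in rather than the lab's `LLMClient`: `CountingProvider` and `CachedCaller` are hypothetical names, and the cache is keyed on the raw prompt for brevity.

```python
from typing import Callable, Dict


class CountingProvider:
    """Fake provider that records how many times it was called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"echo: {prompt}"


class CachedCaller:
    """Check the cache first, then call the provider, as LLMClient.call() does."""

    def __init__(self, provider: Callable[[str], str]) -> None:
        self._provider = provider
        self._cache: Dict[str, str] = {}

    def call(self, prompt: str) -> str:
        if prompt in self._cache:  # cache hit: the provider is skipped
            return self._cache[prompt]
        text = self._provider(prompt)
        self._cache[prompt] = text
        return text


provider = CountingProvider()
caller = CachedCaller(provider)
first = caller.call("hello")
second = caller.call("hello")    # same key, served from the cache
third = caller.call("goodbye")   # new key, provider runs again
print(first == second, provider.calls)
```

With the notebook's own client, the equivalent check is to call `client.call(...)` twice with an identical `LLMRequest` and look for the `llm_cache_hit` log message on the second call.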

## Practice exercises

1) Extend `LLMRequest` to include `system_prompt` and update `make_cache_key()` accordingly.

2) Add jitter to the backoff so many clients do not retry at the same times.

3) Add a simple 429 handler:

- if `Retry-After` is present and small enough, sleep that long
- otherwise backoff

4) Add structured output validation (from Week 3) as an optional mode.
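
For exercise 3, one possible shape for the delay decision is sketched below. It assumes you can read the raw `Retry-After` header value from your provider's 429 response (how you obtain it depends on the SDK and is not shown). `retry_delay` and `max_retry_after_s` are hypothetical names, and only the delay-in-seconds form of the header is handled.

```python
from typing import Optional


def retry_delay(
    retry_after: Optional[str],
    attempt: int,
    *,
    max_retry_after_s: float = 10.0,
    base: float = 0.5,
    cap: float = 8.0,
) -> float:
    """Pick a sleep time after a 429 response.

    retry_after is the raw Retry-After header value, or None if absent.
    HTTP-date values and out-of-range numbers fall back to capped
    exponential backoff.
    """
    fallback = min(cap, base * (2 ** max(0, attempt - 1)))
    if retry_after is None:
        return fallback
    try:
        seconds = float(retry_after.strip())
    except ValueError:
        return fallback  # not a number, e.g. an HTTP-date
    if 0.0 <= seconds <= max_retry_after_s:
        return seconds   # small enough: honor the server's hint
    return fallback      # negative, NaN, or too long to wait


print(retry_delay("3", attempt=1))
print(retry_delay(None, attempt=2))
print(retry_delay("120", attempt=3))
```

Capping the honored value keeps a single request from blocking for minutes; whether to fail fast instead of backing off in that case is a design choice for your pipeline.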

In [None]:
import random


def add_jitter(delay_s: float) -> float:
    # TODO: implement jitter.
    # Example: "full jitter" uniform(0, delay_s).
    return random.uniform(0.0, max(0.0, float(delay_s)))


def backoff_delay(attempt: int, *, base: float = 0.5, cap: float = 8.0) -> float:
    # TODO: implement exponential backoff with cap.
    raw = base * (2 ** max(0, attempt - 1))
    return min(cap, float(raw))


for a in range(0, 5):
    d = backoff_delay(a + 1, base=0.2, cap=2.0)
    print("attempt", a + 1, "delay", d, "jittered", add_jitter(d))

## Next steps

- Implement `_provider_call()` using your chosen provider SDK.
- Add structured output validation from Week 3.
- Use this client in later pipeline/capstone work.
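
When you implement `_provider_call()`, raising distinct exception types lets the retry loop skip errors that will never succeed. The sketch below shows the idea under stated assumptions: `RetryableError`, `FatalError`, and `call_with_retries` are hypothetical names, and the two failing functions simulate a timeout and a rejected API key rather than calling a real provider.

```python
import time
from typing import Callable


class RetryableError(Exception):
    """Transient failure (timeout, 429, 5xx): worth another attempt."""


class FatalError(Exception):
    """Permanent failure (bad credentials, invalid request): do not retry."""


def call_with_retries(fn: Callable[[], str], *, max_retries: int = 2, sleep_s: float = 0.0) -> str:
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of attempts: surface the last transient error
            time.sleep(sleep_s)
        # FatalError (and anything unexpected) propagates immediately.
    raise AssertionError("unreachable: the loop always returns or raises")


attempts = {"flaky": 0, "fatal": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    attempts["flaky"] += 1
    if attempts["flaky"] < 3:
        raise RetryableError("simulated timeout")
    return "ok"


def bad_key() -> str:
    attempts["fatal"] += 1
    raise FatalError("simulated rejected API key")


flaky_result = call_with_retries(flaky)
try:
    call_with_retries(bad_key)
except FatalError:
    pass
print(flaky_result, attempts)
```

In a real `_provider_call()`, the mapping from SDK exceptions or HTTP status codes to these two classes is the provider-specific part.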

## References

- Python logging: https://docs.python.org/3/library/logging.html
- Tenacity (for more robust retries): https://tenacity.readthedocs.io/

## Exercise: Persist raw failures

Goal:

- If the provider call fails (e.g. timeout), persist a short JSON record under `output/`.
- Return the written path.

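Checkpoint:

- Trigger a failure and confirm `output/raw_failure.json` exists. The stubbed `_provider_call()` never raises, even with a tiny `timeout_s`, so the cell below forces a failure with a subclass that always times out.

In [None]:
from pathlib import Path


OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)


class AlwaysTimeoutClient(LLMClient):
    # Test double: every provider call fails, so call() exhausts its retries and raises.
    def _provider_call(self, req: LLMRequest, *, timeout_s: float) -> str:
        raise TimeoutError("simulated provider timeout after %ss" % timeout_s)


def persist_raw_failure_todo(payload: dict, err: Exception, *, filename: str = "raw_failure.json") -> Path:
    # TODO: implement. Write a JSON record (payload, error type, message) instead of the placeholder.
    out_path = OUTPUT_DIR / filename
    out_path.write_text("TODO\n", encoding="utf-8")
    return out_path


try:
    AlwaysTimeoutClient().call(LLMRequest(model="demo", prompt="this will fail"), timeout_s=0.001, max_retries=0)
except Exception as e:
    p = persist_raw_failure_todo({"model": "demo", "prompt": "this will fail"}, e)
    print("saved failure to", p)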

## Appendix: Solutions (peek only after trying)

Reference implementations for `add_jitter`, `backoff_delay`, and `persist_raw_failure_todo`.

In [None]:
def add_jitter(delay_s: float) -> float:
    return random.uniform(0.0, max(0.0, float(delay_s)))


def backoff_delay(attempt: int, *, base: float = 0.5, cap: float = 8.0) -> float:
    raw = base * (2 ** max(0, attempt - 1))
    return min(cap, float(raw))


for a in range(0, 5):
    d = backoff_delay(a + 1, base=0.2, cap=2.0)
    print("solution attempt", a + 1, "delay", d, "jittered", add_jitter(d))

In [None]:
def persist_raw_failure_todo(payload: dict, err: Exception, *, filename: str = "raw_failure.json") -> Path:
    out_path = OUTPUT_DIR / filename
    record = {
        "payload": payload,
        "error_type": type(err).__name__,
        "error": str(err),
    }
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_path


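class AlwaysTimeoutClient(LLMClient):
    # Test double: forces the failure path, since the stubbed _provider_call() never raises.
    def _provider_call(self, req: LLMRequest, *, timeout_s: float) -> str:
        raise TimeoutError("simulated provider timeout after %ss" % timeout_s)


try:
    AlwaysTimeoutClient().call(LLMRequest(model="demo", prompt="this will fail"), timeout_s=0.001, max_retries=0)
except Exception as e:
    p = persist_raw_failure_todo({"model": "demo", "prompt": "this will fail"}, e, filename="raw_failure_solution.json")
    print("saved failure to", p)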