# Week 4 — Part 04: Caching and observability (logging)

**Estimated time:** 75–120 minutes

## Learning Objectives

- Design safe cache keys for pure-ish LLM calls
- Implement an in-memory cache and a simple file cache
- Add minimum viable request logging (request id, latency, success/failure, attempt)


## Overview

Two practical realities of LLM APIs:

- calls can be expensive
- failures are hard to debug without logs

Caching reduces cost and latency for repeated requests. Logging makes failures diagnosable.

---

## Underlying theory: caching is memoization of a pure-ish function

If your LLM call were a pure function:

$$
 y = f(x)
$$

then caching would be memoization: store $f(x)$ so repeated calls return instantly.

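LLM calls are only “pure-ish”: the output depends on more than the prompt text (model, system prompt, and sampling settings such as temperature), and with a nonzero temperature the same inputs can still produce different outputs. So your cache key must include **every input that can change the result**.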

Practical implication: incorrect cache keys cause **silent wrong answers**, which are worse than visible failures.
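
To make the memoization idea concrete, here is a minimal sketch using `functools.lru_cache` on a stand-in pure function. `slow_square` and `call_count` are hypothetical names used only for illustration; no LLM is involved. The arguments act as the cache key, so anything not passed as an argument cannot influence which cached value is returned.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body really runs


@lru_cache(maxsize=None)
def slow_square(x: int) -> int:
    """Stand-in for an expensive pure function (hypothetical, not an LLM call)."""
    global call_count
    call_count += 1
    return x * x


first = slow_square(12)   # computed
second = slow_square(12)  # served from the cache; the body does not run again
third = slow_square(13)   # different argument, so a separate cache entry

print(first, second, third, call_count)
```

The body runs twice for three calls, because the repeated argument is answered from the cache.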

## Caching

Cache when:

- the same request repeats
- you are iterating on downstream code

The cache key must include everything that changes the output, at minimum:

- model name
- system prompt
- user prompt
- temperature

Common cache pitfalls:

- forgetting system prompt / tool context in the key
- caching when temperature is high (outputs are intentionally stochastic)
- caching errors (you accidentally “remember” a failure)
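
The first pitfall can be shown with a small sketch. `incomplete_key` and `full_key` are hypothetical helpers written for this illustration: the first hashes only the user prompt, the second hashes all four fields listed above. Two requests that differ only in their system prompt collide under the incomplete key, which is how a cache ends up returning a silently wrong answer.

```python
import hashlib
import json


def incomplete_key(user_prompt: str) -> str:
    # Deliberately wrong: ignores model, system prompt, and temperature.
    return hashlib.sha256(user_prompt.encode("utf-8")).hexdigest()


def full_key(model: str, system_prompt: str, user_prompt: str, temperature: float) -> str:
    # Serialize every output-affecting field in a stable order before hashing.
    payload = {
        "model": model,
        "system_prompt": system_prompt,
        "user_prompt": user_prompt,
        "temperature": temperature,
    }
    raw = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Same user prompt, different system prompts: the answers should differ.
req_a = ("demo", "Answer in English.", "Summarize the report.", 0.0)
req_b = ("demo", "Answer in French.", "Summarize the report.", 0.0)

collides = incomplete_key(req_a[2]) == incomplete_key(req_b[2])
distinct = full_key(*req_a) != full_key(*req_b)
print("incomplete key collides:", collides)
print("full key distinguishes:", distinct)
```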

In [None]:
from __future__ import annotations

import hashlib
import json
import logging
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class LLMRequest:
    model: str
    system_prompt: str
    user_prompt: str
    temperature: float = 0.0


def make_cache_key(req: LLMRequest) -> str:
    # Key must include every field that can change the output.
    raw = json.dumps(asdict(req), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


req = LLMRequest(model="demo", system_prompt="You are helpful.", user_prompt="Hello", temperature=0.0)
print("cache_key=", make_cache_key(req)[:12])

In [None]:
class SimpleMemoryCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


cache = SimpleMemoryCache()
key = "k1"
print(cache.get(key))
cache.set(key, "value")
print(cache.get(key))

In [None]:
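class SimpleFileCache:
    # Stores all entries in a single JSON file. Every get/set re-reads the
    # whole file and set rewrites it, so this is only suitable for small,
    # single-process use (no locking, no atomic writes).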
    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)
        if not self.path.exists():
            self.path.write_text("{}", encoding="utf-8")

    def _read(self) -> dict[str, str]:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _write(self, data: dict[str, str]) -> None:
        self.path.write_text(json.dumps(data, ensure_ascii=False, sort_keys=True, indent=2), encoding="utf-8")

    def get(self, key: str) -> str | None:
        data = self._read()
        return data.get(key)

    def set(self, key: str, value: str) -> None:
        data = self._read()
        data[key] = value
        self._write(data)


file_cache = SimpleFileCache(Path("output/cache/llm_cache.json"))
k = "demo"
print(file_cache.get(k))
file_cache.set(k, "hello")
print(file_cache.get(k))

## Logging (minimum viable request log)

A minimal request log should include:

- request id
- model
- latency
- success/failure
- failure location (network vs parsing vs validation)

Two extra fields that help later:

- prompt length (or token estimate)
- retry attempt count
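
One way to make these fields visible is to emit each log record as a JSON line. The sketch below is an assumption about how you might do that with the standard `logging` module; `JsonExtraFormatter` and the `demo.json` logger name are hypothetical. It relies on the fact that values passed through `extra=` become attributes on the `LogRecord`, and it does not handle exception tracebacks.

```python
import io
import json
import logging

# Attribute names present on every LogRecord; anything else came from `extra`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonExtraFormatter(logging.Formatter):
    """Hypothetical formatter: renders the message plus any `extra` fields as JSON."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for name, value in record.__dict__.items():
            if name not in _STANDARD_ATTRS:
                payload[name] = value
        # default=str keeps the formatter from failing on non-JSON values.
        return json.dumps(payload, sort_keys=True, default=str)


stream = io.StringIO()  # capture output in memory so it can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonExtraFormatter())

json_logger = logging.getLogger("demo.json")
json_logger.setLevel(logging.INFO)
json_logger.addHandler(handler)
json_logger.propagate = False  # avoid duplicate output via the root logger

json_logger.info(
    "llm_call_ok",
    extra={"request_id": "req_001", "model": "demo", "latency_s": 0.05, "attempt": 0},
)

line = stream.getvalue().strip()
print(line)
```

In a real application you would attach the handler to `sys.stderr` or a file instead of an in-memory buffer.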

In [None]:
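# Note: basicConfig's default format prints only the level, logger name, and
# message. Fields passed via `extra` are attached to the LogRecord but are not
# displayed unless a custom Formatter includes them.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")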


def fake_llm_call(text: str) -> str:
    time.sleep(0.05)
    return text.upper()


def logged_call(request_id: str, req: LLMRequest) -> str:
    # perf_counter() is a monotonic clock, so the measured latency is not
    # affected by system clock adjustments (time.time() can be).
    t0 = time.perf_counter()
    try:
        out = fake_llm_call(req.user_prompt)
        logger.info(
            "llm_call_ok",
            extra={
                "request_id": request_id,
                "model": req.model,
                "latency_s": time.perf_counter() - t0,
                "prompt_len": len(req.system_prompt) + len(req.user_prompt),
                "attempt": 0,
            },
        )
        return out
    except Exception as e:
        logger.warning(
            "llm_call_failed",
            extra={
                "request_id": request_id,
                "model": req.model,
                "latency_s": time.perf_counter() - t0,
                "attempt": 0,
                "error_type": type(e).__name__,
            },
        )
        # Re-raise so the caller still sees the failure.
        raise


print(logged_call("req_001", req))

In [None]:
def cached_call(cache_obj: SimpleMemoryCache, req: LLMRequest) -> str:
    key = make_cache_key(req)
    hit = cache_obj.get(key)
    if hit is not None:
        logger.info("llm_cache_hit", extra={"model": req.model})
        return hit

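    # If the call raises, the exception propagates before cache_obj.set(),
    # so failures are never stored in the cache.
    out = fake_llm_call(req.user_prompt)
    cache_obj.set(key, out)
    logger.info("llm_cache_set", extra={"model": req.model})
    return out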


cache2 = SimpleMemoryCache()
print(cached_call(cache2, req))
print(cached_call(cache2, req))

In [None]:
def make_cache_key_todo(req: LLMRequest) -> str:
    # TODO: extend the key so it would remain correct if you add fields like:
    # - top_p
    # - max_tokens
    # - tool schema / tool definitions
    # - few-shot examples
    raise NotImplementedError


def should_cache(req: LLMRequest) -> bool:
    # TODO: implement policy, e.g.
    # - cache only if temperature == 0.0
    # - avoid caching very large prompts
    raise NotImplementedError


print("Implement make_cache_key_todo() and should_cache().")

## References

- `functools.lru_cache`: https://docs.python.org/3/library/functools.html#functools.lru_cache
- Python logging: https://docs.python.org/3/library/logging.html