# Week 4 — Part 03: Rate limiting + graceful degradation

**Estimated time:** 60–90 minutes

## What success looks like (end of Part 03)

- You can explain what a 429 means and how clients should respond.
- You can parse `Retry-After` and decide whether to wait or degrade.
- You can implement a deterministic degradation strategy (shrink prompt, fallback model).

### Checkpoint

After running this notebook, you should be able to:

- show a `decide_on_429(...)` output for at least 2 cases
- show a degraded prompt that respects `max_chars`

## Learning Objectives

- Understand rate limiting as a capacity allocation policy
- Handle HTTP 429 (Retry-After, backoff, retry)
- Implement a small local token bucket limiter
- Practice graceful degradation strategies

## Overview

Rate limits protect providers and enforce fair usage.

Your client should behave gracefully:

- pause and retry
- or degrade (fallback model, smaller prompt, cached response)

---

## Underlying theory: rate limiting is a capacity allocation policy

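You can think of a provider as having finite capacity. Rate limiting decides how that capacity is shared: each client (typically identified by API key, account, or IP address) is held to a maximum request rate.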

A common conceptual model is a token bucket:

- bucket capacity $B$
- tokens refill at rate $r$ tokens/second
- each request spends tokens

If there are not enough tokens, requests are rejected or delayed.

Practical implication:

- bursts may succeed, but sustained high QPS will hit limits
- your client must treat 429s as normal and recover gracefully
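To make the burst-versus-sustained distinction concrete, here is a small sketch under the token bucket model above. A bucket that starts full can admit at most $B + rT$ unit-cost requests in a window of $T$ seconds. The function name and the example numbers are illustrative assumptions, not any provider's actual limits.

```python
import math


def max_requests_in_window(capacity: float, refill_per_s: float, window_s: float) -> int:
    """Upper bound on unit-cost requests a token bucket admits in one window.

    Assumes the bucket starts full and every request costs exactly 1 token.
    """
    if capacity < 0 or refill_per_s < 0 or window_s < 0:
        raise ValueError("capacity, refill_per_s and window_s must be non-negative")
    return math.floor(capacity + refill_per_s * window_s)


# Illustrative numbers: B = 5 tokens, r = 1 token/second.
burst = max_requests_in_window(5, 1.0, 0.0)       # instantaneous burst
sustained = max_requests_in_window(5, 1.0, 60.0)  # one full minute
print(burst, sustained)
```

With these numbers, a burst of 5 requests can succeed immediately, but over a full minute the client is held to roughly 1 request per second regardless of how fast it sends.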

## HTTP 429

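HTTP 429 is the “Too Many Requests” status code: the server is telling the client that it has sent too many requests in a given period.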

Client behavior:

- respect the `Retry-After` header if present
- otherwise back off (for example, exponentially with jitter) and retry a bounded number of times
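When no `Retry-After` header is available, one common way to back off is exponential backoff with "full jitter". The sketch below is one possible shape, not a prescribed implementation; `backoff_delay_s`, `base_s`, and `cap_s` are names and defaults assumed for illustration.

```python
import random
from typing import Optional


def backoff_delay_s(
    attempt: int,
    *,
    base_s: float = 0.5,
    cap_s: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Delay before retry number `attempt` (0-based), using full jitter.

    The upper bound doubles with each attempt (base_s * 2**attempt) up to
    cap_s; the actual delay is drawn uniformly from [0, upper bound].
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so very large attempt numbers cannot overflow.
    upper = min(cap_s, base_s * 2.0 ** min(attempt, 32))
    return rng.uniform(0.0, upper)


rng = random.Random(0)  # seeded only to make the demo repeatable
for attempt in range(5):
    print(attempt, round(backoff_delay_s(attempt, rng=rng), 3))
```

The randomization spreads retries from many clients over time, so they do not all hit the provider again at the same instant.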

Graceful degradation options (choose based on your product):

- return a clear “busy, try later” message
- fall back to a cheaper/faster model
- reduce prompt size / requested output length
- serve a cached result if correctness allows

In [None]:
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Optional


def parse_retry_after_seconds(value: str) -> Optional[float]:
    """Parse Retry-After header.

    For Level 1, we support the most common form: integer seconds.
    (HTTP also allows an HTTP-date; production clients should support both.)
    """
    v = value.strip()
    if not v:
        return None
    try:
        seconds = int(v)
        return max(0.0, float(seconds))
    except ValueError:
        return None


print(parse_retry_after_seconds("2"))
print(parse_retry_after_seconds("0"))
print(parse_retry_after_seconds(""))
print(parse_retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT"))  # unsupported here
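
The docstring above notes that `Retry-After` may also carry an HTTP-date. A possible extension that handles both forms is sketched below using the standard library's `email.utils.parsedate_to_datetime`. The name `parse_retry_after` and the injectable `now` parameter are assumptions made for this sketch, and `now` must be timezone-aware if supplied.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, *, now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header cannot be parsed.

    Accepts delay-seconds ("120") or an HTTP-date
    ("Wed, 21 Oct 2015 07:28:00 GMT").
    """
    v = value.strip()
    if not v:
        return None
    if v.isascii() and v.isdigit():  # delay-seconds: a non-negative integer
        return float(v)
    try:
        when = parsedate_to_datetime(v)
    except (TypeError, ValueError, IndexError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones raise ValueError.
        return None
    if when.tzinfo is None:
        # Assumption: treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    # A date in the past means "retry now", so clamp at zero.
    return max(0.0, (when - current).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(parse_retry_after("not a date"))
```

Passing `now` explicitly keeps the function deterministic in tests; in real use it would default to the current UTC time.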

In [None]:
@dataclass
class TokenBucket:
    capacity: float
    refill_per_s: float
    tokens: float
    last_refill_s: float

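    @classmethod
    def create(cls, *, capacity: float, refill_per_s: float) -> "TokenBucket":
        # Use a monotonic clock: wall-clock time (time.time) can jump when the
        # system clock is adjusted, which would corrupt the refill calculation.
        now = time.monotonic()
        return cls(capacity=capacity, refill_per_s=refill_per_s, tokens=capacity, last_refill_s=now)

    def _refill(self) -> None:
        now = time.monotonic()
        dt = max(0.0, now - self.last_refill_s)
        # Add the tokens earned since the last refill, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + dt * self.refill_per_s)
        self.last_refill_s = now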

    def allow(self, cost: float = 1.0) -> bool:
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket.create(capacity=5, refill_per_s=1.0)

for i in range(12):
    ok = bucket.allow(cost=1.0)
    print(f"i={i:02d} allowed={ok} tokens_left={bucket.tokens:.2f}")
    time.sleep(0.2)
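
The limiter above only answers "allowed or not". If you prefer to delay a request instead of rejecting it, you need to know how long to wait. The helper below is a standalone sketch of that calculation; `seconds_until_allowed` is a hypothetical name and takes the bucket state as plain numbers so it can be read on its own.

```python
from typing import Optional


def seconds_until_allowed(
    tokens: float,
    cost: float,
    *,
    capacity: float,
    refill_per_s: float,
) -> Optional[float]:
    """Seconds to wait until `cost` tokens are available.

    Returns 0.0 if the request can go now, and None if it can never succeed
    (cost exceeds capacity, or the bucket does not refill).
    """
    if tokens >= cost:
        return 0.0
    if cost > capacity or refill_per_s <= 0:
        return None
    # The missing tokens arrive at refill_per_s tokens per second.
    return (cost - tokens) / refill_per_s


print(seconds_until_allowed(0.5, 1.0, capacity=5, refill_per_s=1.0))
print(seconds_until_allowed(0.0, 10.0, capacity=5, refill_per_s=1.0))
```

The `None` case matters in practice: a request that costs more than the bucket can ever hold should be rejected or split, not queued forever.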

## Graceful degradation (what to do when limited)

When you hit rate limits, you typically have two broad options:

- **Wait and retry** (respecting `Retry-After` if provided)
- **Degrade** (return something cheaper/faster or less precise)

Common degradation choices:

- Return a clear “busy, try later” message
- Fall back to a cheaper/faster model
- Reduce prompt size (trim history, remove low-value context)
- Reduce requested output length
- Serve a cached result (only if acceptable for correctness)

The correct choice depends on your product:

- For interactive UX, a fast partial answer may be better than waiting.
- For offline pipelines, retrying may be preferable to degrading quality.

In [None]:
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RateLimitResponse:
    status_code: int
    retry_after: Optional[str] = None


def decide_on_429(
    resp: RateLimitResponse,
    *,
    max_wait_s: float,
) -> Tuple[str, Optional[float]]:
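    """Return (action, wait_seconds).

    action is one of: "wait", "degrade".

    "wait" is returned only when Retry-After parses to a delay of at most
    max_wait_s. A missing, unparseable, or too-long Retry-After yields
    ("degrade", None); this simplified policy does not back off and retry.
    """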
    if resp.retry_after is not None:
        s = parse_retry_after_seconds(resp.retry_after)
        if s is not None and s <= max_wait_s:
            return ("wait", s)
    return ("degrade", None)


print(decide_on_429(RateLimitResponse(429, retry_after="2"), max_wait_s=5.0))
print(decide_on_429(RateLimitResponse(429, retry_after="60"), max_wait_s=5.0))
print(decide_on_429(RateLimitResponse(429, retry_after=None), max_wait_s=5.0))
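
The wait-or-degrade decision usually sits inside a bounded retry loop. The sketch below shows one way to wire it up without a real HTTP client. The `send` and `degrade` callables, the `Reply` tuple shape, and the default limits are all assumptions made for illustration, and the `Retry-After` parsing is inlined (integer seconds only) so the block runs on its own.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport shape for this sketch:
# (status_code, retry_after_header, body).
Reply = Tuple[int, Optional[str], str]


def call_with_429_handling(
    send: Callable[[], Reply],
    degrade: Callable[[], str],
    *,
    max_wait_s: float = 5.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `send` up to max_attempts times; wait on a short Retry-After, else degrade."""
    for _ in range(max_attempts):
        status, retry_after, body = send()
        if status != 429:
            # Other error statuses are out of scope for this sketch.
            return body
        try:
            wait_s = float(int(retry_after)) if retry_after is not None else None
        except ValueError:
            wait_s = None
        if wait_s is None or wait_s < 0 or wait_s > max_wait_s:
            return degrade()
        sleep(wait_s)
    # Still rate limited after the final attempt.
    return degrade()


# Demo with canned replies and a fake sleep that only records the waits.
replies = iter([(429, "1", ""), (200, None, "full answer")])
slept = []
result = call_with_429_handling(
    lambda: next(replies),
    lambda: "busy, try later",
    sleep=slept.append,
)
print(result, slept)
```

Injecting `sleep` keeps the demo instant and makes the waiting behavior easy to assert on in tests.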

In [None]:
def degrade_request(prompt: str, *, max_chars: int) -> str:
    # Exercise: replace this baseline with your own deterministic shrinking strategy.
    # Requirements:
    # - Never exceed max_chars.
    # - Preserve the beginning of the prompt (often contains instructions).
    # - Consider preserving the end too (often contains the latest user input).
    if max_chars <= 0:
        return ""
    if len(prompt) <= max_chars:
        return prompt

    # Baseline: split the budget evenly between the head and the tail.
    # tail >= 1 here, so prompt[-tail:] never degenerates to prompt[-0:] (the whole prompt).
    head = max_chars // 2
    tail = max_chars - head
    return prompt[:head] + prompt[-tail:]


def choose_fallback_model(primary: str) -> str:
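    # Exercise: extend this simple fallback mapping.
    # Example behavior:
    # - if primary is "gpt-4", fall back to "gpt-4-mini"
    # - otherwise fall back to a safe default
    # Note: the model names here are illustrative placeholders; check your
    # provider's model list for the identifiers that actually exist.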
    mapping = {
        "gpt-4": "gpt-4-mini",
        "gpt-4o": "gpt-4o-mini",
    }
    return mapping.get(primary, "gpt-4o-mini")


print("degraded:", degrade_request("INSTRUCTIONS..." + ("x" * 200) + "...LATEST", max_chars=60))
print("fallback:", choose_fallback_model("gpt-4"))
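
One optional refinement to head-plus-tail trimming is to insert a visible marker where text was cut, so the model (and anyone reading logs) can tell the prompt was shortened. The variant below is a sketch under that assumption; `shrink_with_marker` and the default marker text are hypothetical, and the marker counts toward the `max_chars` budget.

```python
def shrink_with_marker(prompt: str, *, max_chars: int, marker: str = "\n[...]\n") -> str:
    """Keep the head and tail of `prompt`, joined by `marker`, within max_chars."""
    if max_chars <= 0:
        return ""
    if len(prompt) <= max_chars:
        return prompt
    budget = max_chars - len(marker)
    if budget < 2:
        # Not enough room for the marker plus text on both sides: plain truncation.
        return prompt[:max_chars]
    head = budget // 2
    tail = budget - head  # tail >= 1 because budget >= 2
    return prompt[:head] + marker + prompt[-tail:]


prompt = "INSTRUCTIONS..." + ("x" * 200) + "...LATEST"
short = shrink_with_marker(prompt, max_chars=60)
print(len(short))
print(short)
```

The function is deterministic: the same prompt and budget always yield the same output, which keeps degraded requests reproducible.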

## References

- HTTP 429: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429

## Appendix: Solutions (peek only after trying)

Reference implementations for `degrade_request` and `choose_fallback_model`.

In [None]:
def degrade_request(prompt: str, *, max_chars: int) -> str:
    if max_chars <= 0:
        return ""
    if len(prompt) <= max_chars:
        return prompt

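    # With a very small budget, keep only the head.
    if max_chars <= 10:
        return prompt[:max_chars]

    # Give 60% of the budget to the head and the rest to the tail.
    # max_chars > 10 here, so keep_tail >= 1 and the tail slice is safe.
    keep_head = max(1, int(max_chars * 0.6))
    keep_tail = max_chars - keep_head
    return prompt[:keep_head] + prompt[-keep_tail:]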


def choose_fallback_model(primary: str) -> str:
    mapping = {
        "gpt-4": "gpt-4-mini",
        "gpt-4o": "gpt-4o-mini",
        "claude-3-opus": "claude-3-sonnet",
    }
    return mapping.get(primary, "gpt-4o-mini")


print("degraded:", degrade_request("INSTRUCTIONS..." + ("x" * 200) + "...LATEST", max_chars=60))
print("fallback:", choose_fallback_model("gpt-4"))