# Level 2 - Week 6 - 03 Reliability and Step Logs

**Estimated time:** 60-90 minutes

## Learning Objectives

- Add bounded retries
- Use backoff
- Log step metadata


## Overview

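An agent turns one request into a chain of model and tool calls, so it amplifies failures:

- more calls, so more places for one of them to fail
- more chances to hang on a slow dependency
- more hidden state, which makes a failure harder to reconstruct

So reliability is non-negotiable.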

## Reliability rules

- every network call has a timeout
- retries are bounded and use backoff
- tool failures have fallbacks
- every step is logged
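The fallback rule can be sketched as a small helper. This is a minimal illustration, not a prescribed design: `call_with_fallback`, `flaky_search`, and `cached_search` are hypothetical names, and the returned dict shape is an assumption chosen so that the failure reason is kept rather than swallowed.

```python
from typing import Any, Callable, Dict


def call_with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
) -> Dict[str, Any]:
    """Run primary(); if it raises, run fallback() and report why."""
    try:
        return {"result": primary(), "used_fallback": False, "error": None}
    except Exception as exc:  # broad on purpose in this sketch
        return {
            "result": fallback(),
            "used_fallback": True,
            "error": f"{type(exc).__name__}: {exc}",
        }


def flaky_search() -> list:
    # Hypothetical tool that always times out.
    raise TimeoutError("search timed out after 2.0s")


def cached_search() -> list:
    # Hypothetical degraded answer: an empty result set.
    return []


outcome = call_with_fallback(flaky_search, cached_search)
print(outcome)
```

Keeping the error string alongside the fallback result means the step log can still explain why the degraded path was taken.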

## Underlying theory: retries trade latency for success probability

If each independent attempt succeeds with probability $p$, then after $n$ attempts the probability of at least one success is:

$$
P(\text{success by } n) = 1 - (1-p)^n
$$

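This is why a small number of retries can dramatically improve reliability. With $p = 0.8$, three attempts raise the success probability to $1 - 0.2^3 = 0.992$. The cost is latency: each extra attempt adds its own duration plus the wait before it. The formula also assumes attempts fail independently; during a persistent outage, retries help far less.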
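The formula is easy to evaluate directly. The sketch below computes it and also searches for the smallest attempt count that reaches a target probability; `success_probability`, `attempts_needed`, and the `max_attempts` guard are illustrative names and choices, not part of the lesson's code.

```python
from typing import Optional


def success_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def attempts_needed(p: float, target: float, max_attempts: int = 50) -> Optional[int]:
    """Smallest n with success_probability(p, n) >= target, or None if not reached."""
    for n in range(1, max_attempts + 1):
        if success_probability(p, n) >= target:
            return n
    return None


for n in range(1, 5):
    print(n, round(success_probability(0.8, n), 4))

print(attempts_needed(0.8, 0.999))
```

The gain per attempt shrinks quickly, which is one reason a small retry bound is usually enough.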

## Exponential backoff intuition

A common schedule, where $k$ is the zero-based retry index, is:

$$
\text{delay}_k = \min(\text{cap}, \text{base} \cdot 2^k)
$$

Jitter (a small random offset added to each delay) keeps many clients that failed at the same moment from retrying in lockstep.
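To make the schedule concrete, here is a small sketch that evaluates it for the first few retries. The defaults (`base=0.2`, `cap=2.0`) and the 20% proportional jitter are assumptions for illustration; other jitter strategies exist.

```python
import random


def backoff_delay(k: int, base: float = 0.2, cap: float = 2.0) -> float:
    """Capped exponential delay for zero-based retry index k."""
    return min(cap, base * 2**k)


def jittered_delay(
    k: int,
    base: float = 0.2,
    cap: float = 2.0,
    jitter_fraction: float = 0.2,
    rng=random,
) -> float:
    """Backoff delay plus a random extra of up to jitter_fraction of that delay."""
    delay = backoff_delay(k, base, cap)
    return delay + rng.uniform(0.0, delay * jitter_fraction)


schedule = [backoff_delay(k) for k in range(6)]
print(schedule)  # [0.2, 0.4, 0.8, 1.6, 2.0, 2.0]

# Two clients on the same retry index now wait slightly different times.
print(jittered_delay(2), jittered_delay(2))
```

The cap matters from the fifth retry onward here: without it the delay would keep doubling past 2 seconds.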

## Practice Steps

- Implement a bounded retry wrapper with exponential backoff + jitter.
- Define a step log record shape.
- Produce logs that let you answer: “what did the agent do, and why did it stop?”

### Sample code

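A bounded retry wrapper with exponential backoff and jitter:

```python
import random
import time


def with_retries(fn, max_retries: int = 2, base_delay_s: float = 0.2, cap_delay_s: float = 2.0):
    """Call fn() up to max_retries + 1 times, sleeping between failed attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: re-raise the last error to the caller.
            if attempt >= max_retries:
                raise

            # Capped exponential delay, plus up to 20% random jitter.
            delay = min(cap_delay_s, base_delay_s * (2**attempt))
            jitter = random.uniform(0.0, delay * 0.2)
            time.sleep(delay + jitter)
```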
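Catching every `Exception` also retries programming errors that will never succeed. One possible variation, sketched below, retries only on a chosen set of exception types and accepts an injectable `sleep` so the behaviour can be checked without real waiting. The name `with_retries_on`, the `retry_on` default, and the `sleep` parameter are assumptions added for this illustration.

```python
import random
import time
from typing import Any, Callable, Tuple, Type


def with_retries_on(
    fn: Callable[[], Any],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    max_retries: int = 2,
    base_delay_s: float = 0.2,
    cap_delay_s: float = 2.0,
    sleep: Callable[[float], Any] = time.sleep,
) -> Any:
    """Retry fn() only for exception types listed in retry_on."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt >= max_retries:
                raise
            delay = min(cap_delay_s, base_delay_s * (2**attempt))
            sleep(delay + random.uniform(0.0, delay * 0.2))


calls = {"count": 0}


def flaky() -> str:
    # Hypothetical tool: fails twice with a timeout, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


slept = []  # records the requested delays instead of actually sleeping
result = with_retries_on(flaky, sleep=slept.append)
print(result, calls["count"], len(slept))  # ok 3 2
```

Any exception outside `retry_on`, such as a `ValueError` from bad input, propagates on the first attempt.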

### Student fill-in

- Add jitter to retries (randomness) to prevent synchronized retry spikes.
- Define a step log record shape for each tool call.

Then create one example step log dict and confirm it has all required fields.

```python
STEP_LOG_FIELDS = [
    "request_id",
    "step_index",
    "tool",
    "input",
    "output",
    "error",
    "latency_ms",
]

example_step_log = {
    "request_id": "req_123",
    "step_index": 1,
    "tool": "search",
    "input": {"query": "refund policy", "top_k": 3},
    "output": {"hits": [{"chunk_id": "kb#001", "score": 0.82}]},
    "error": None,
    "latency_ms": 183,
}

# Confirm the example contains every required field.
missing = [field for field in STEP_LOG_FIELDS if field not in example_step_log]
assert not missing, f"missing fields: {missing}"

print(STEP_LOG_FIELDS)
print(example_step_log)
```
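Records like the one above are most useful when they are produced automatically around each tool call. The following is a minimal sketch under stated assumptions: `run_step`, `fake_search`, and `broken_tool` are hypothetical helpers, the log is a plain in-memory list, and errors are stored as strings rather than re-raised.

```python
import time
from typing import Any, Callable, Dict, List

STEP_LOG_FIELDS = [
    "request_id", "step_index", "tool", "input", "output", "error", "latency_ms",
]


def run_step(
    request_id: str,
    step_index: int,
    tool: str,
    tool_fn: Callable[..., Any],
    tool_input: Dict[str, Any],
    log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Call tool_fn(**tool_input), time it, and append one step record to log."""
    start = time.perf_counter()
    output, error = None, None
    try:
        output = tool_fn(**tool_input)
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    record = {
        "request_id": request_id,
        "step_index": step_index,
        "tool": tool,
        "input": tool_input,
        "output": output,
        "error": error,
        "latency_ms": round((time.perf_counter() - start) * 1000),
    }
    log.append(record)
    return record


def fake_search(query: str, top_k: int) -> Dict[str, Any]:
    # Hypothetical tool returning a canned hit.
    return {"hits": [{"chunk_id": "kb#001", "score": 0.82}][:top_k]}


def broken_tool() -> None:
    # Hypothetical tool that always fails.
    raise RuntimeError("tool unavailable")


step_log: List[Dict[str, Any]] = []
run_step("req_123", 1, "search", fake_search, {"query": "refund policy", "top_k": 3}, step_log)
run_step("req_123", 2, "broken", broken_tool, {}, step_log)
for record in step_log:
    print(record)
```

Reading the list in order answers what the agent did, and the last record's `error` field shows why it stopped.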

## Self-check

- Are retries bounded?
- Do step logs include request_id?
