
# 8) EAFP vs LBYL: Pythonic error style

---

## 1. what the acronyms mean

* **EAFP**: *Easier to Ask Forgiveness than Permission*
  → just do the thing, and handle the error *if it happens*.
* **LBYL**: *Look Before You Leap*
  → check conditions ahead of time, only proceed if it’s safe.

---

## 2. concrete example: dict lookup

### EAFP

```python
try:
    value = d[key]
except KeyError:
    value = default
```

* attempt → if missing, handle it.
* short, clear.

### LBYL

```python
value = d[key] if key in d else default
```

* check first (`key in d`).
* avoids exception, but adds an extra lookup.
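For this particular case there is also a third option: `dict.get` does a single lookup and returns a fallback when the key is missing. A minimal sketch, where `d` and `default` are placeholder values made up for illustration:

```python
# Illustrative data; `d` and `default` are placeholder names.
d = {"model": "small"}
default = "n/a"

# dict.get performs one lookup and returns the fallback when the key is missing.
present = d.get("model", default)
missing = d.get("temperature", default)

print(present)  # small
print(missing)  # n/a
```

This only fits when a fallback value is all you need; if a missing key should trigger other work (logging, recomputing), the `try`/`except KeyError` form is still the more flexible one.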

---

## 3. another example: file access

### EAFP

```python
try:
    with open("data.txt") as f:
        contents = f.read()
except FileNotFoundError:
    contents = ""
```

* open the file, deal with error if it doesn’t exist.

### LBYL

```python
import os
if os.path.exists("data.txt"):
    with open("data.txt") as f:
        contents = f.read()
else:
    contents = ""
```

* check first, then open.

---

## 4. why Python favors EAFP

* **concise**: fewer conditionals, clearer happy path.
* **race-condition safe**:
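  imagine two processes: you check that the file exists, but in the split second before you open it, another process deletes it. → the check passed, yet `open` still raises.
  with EAFP, you just try and handle the real outcome.
* **fits Python's exception model**: a `try` block costs almost nothing when no exception is raised, and catching a specific exception type is explicit about what can go wrong.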

---

## 5. when LBYL is still useful

* when **failure is common** and exceptions would flood your program.
* when **readability** is better with an explicit check (e.g., user input validation).
* when you want to **avoid using exceptions for control flow** in very tight loops (exceptions are slower than `if` checks if triggered constantly).

example (LBYL makes sense):

```python
if isinstance(x, int) and x > 0:
    do_something(x)
else:
    raise ValueError("x must be positive int")
```

---

## 6. rule of thumb

* **use EAFP** for system interactions (files, network, DB, dict lookups, concurrency).
* **use LBYL** for input validation, data contracts, or cases where failure is frequent.

---

✅ **mental model**:

* EAFP = “just do it, I’ll handle it if it breaks.” (more Pythonic)
* LBYL = “check all conditions first, then act.” (more defensive, often used in C/Java style)




# 9) Retrying with backoff (bread-and-butter for agents)

```py
import time
import random

def retry(func, *, attempts=5, base=0.2, jitter=0.1, exceptions=(TimeoutError,)):
    for i in range(attempts):
        try:
            return func()
        except exceptions as e:
            if i == attempts - 1:
                raise
            sleep = base * (2 ** i) + random.uniform(0, jitter)
            time.sleep(sleep)

```

* Only retry **transient** errors (timeouts, 5xx).
* **Don’t** retry deterministic failures (e.g., 400/validation errors).

---

## 1. why retrying matters for agents

agents almost always talk to **external systems** (APIs, DBs, tools).
those systems sometimes fail for reasons that *might* go away if you just try again:

* network blip
* server overloaded → 5xx error
* temporary quota throttling

if you don’t retry, your agent looks “fragile.”
if you retry *wisely*, your agent looks “resilient.”

---

## 2. what the code is doing

```python
def retry(func, *, attempts=5, base=0.2, jitter=0.1, exceptions=(TimeoutError,)):
    for i in range(attempts):
        try:
            return func()                     # try the risky call
        except exceptions as e:               # only handle specific transient errors
            if i == attempts - 1:             # if this was the last attempt
                raise                         # ...give up and raise the error
            sleep = base * (2 ** i) + random.uniform(0, jitter)
            time.sleep(sleep)                 # wait a bit before retrying
```

* **attempts=5** → try up to 5 times.
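* **base=0.2** → first wait ≈ 0.2 seconds (plus jitter).
* **exponential backoff** → wait time doubles after each failure (0.2, 0.4, 0.8, 1.6 s with the default 5 attempts; there is no wait after the last one).
* **jitter** → add a small random amount (0 to 0.1 s by default) so many clients don't retry in sync.
* **exceptions=(TimeoutError,)** → only retry if the error is an instance of one of these types; anything else propagates immediately.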
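To make the schedule concrete, the deterministic part of the waits can be computed directly. This is a small sketch; `backoff_delays` is a helper name invented here, not part of the `retry` function above:

```python
def backoff_delays(attempts=5, base=0.2):
    """Return the deterministic part of each wait (jitter excluded)."""
    # There is no wait after the final attempt, hence attempts - 1 delays.
    return [base * (2 ** i) for i in range(attempts - 1)]

delays = backoff_delays()
print(delays)  # [0.2, 0.4, 0.8, 1.6]

# Worst case before jitter: about 3 seconds of waiting in total.
total = sum(delays)
```

So with the defaults, a call that keeps failing costs roughly 3 seconds of sleep (plus up to 0.4 s of jitter) before the final error is raised.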

---

## 3. transient vs deterministic errors

* **transient**: will probably succeed on retry. (timeouts, 5xx server errors, connection resets).
* **deterministic**: retrying won’t help. (bad request 400, invalid JSON, auth failure).

👉 retrying deterministic errors just wastes time and resources.
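For HTTP-based tools, one way to encode this split is a small status-code check. The helper below is a hypothetical sketch: the name `is_retryable_status` and the exact set of codes are assumptions, and should be adjusted to the API you are calling.

```python
# Hypothetical helper: the name and the exact status list are assumptions.
RETRYABLE_4XX = {408, 429}  # request timeout, too many requests

def is_retryable_status(status: int) -> bool:
    """Treat 5xx and a few 4xx codes as transient; everything else as deterministic."""
    return status in RETRYABLE_4XX or 500 <= status <= 599

print(is_retryable_status(503))  # True
print(is_retryable_status(429))  # True
print(is_retryable_status(400))  # False
print(is_retryable_status(401))  # False
```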

---

## 4. usage in practice

```python
def call_api():
    # simulate flaky API
    if random.random() < 0.5:
        raise TimeoutError("network hiccup")
    return {"status": "ok"}

result = retry(call_api, exceptions=(TimeoutError,))
print(result)
```

about half the time it succeeds on the first try; otherwise it retries until it works or, after 5 failed attempts (roughly a 3% chance with these numbers), re-raises the `TimeoutError`.

---

## 5. where this shows up in real agents

* **LLM API calls** (OpenAI/Anthropic/Azure) — network can drop, servers can 5xx.
* **database queries** — connections reset.
* **web scraping / API integrations** — rate limiting.

without retry logic, agents break constantly. with it, they “self-heal” on common hiccups.

---

## 6. best practices

* **be selective**: retry only transient error types.
* **limit attempts**: don’t loop forever.
* **use backoff + jitter**: polite to the server, avoids dogpiling.
* **log retries**: so you can debug flaky systems later.
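The practices above could be folded into the helper like this. It is a sketch rather than a definitive implementation: `retry_logged`, the `max_delay` cap, and the injectable `sleep` parameter are additions made up for illustration.

```python
import logging
import random
import time

logger = logging.getLogger("retry")

def retry_logged(func, *, attempts=5, base=0.2, jitter=0.1, max_delay=5.0,
                 exceptions=(TimeoutError,), sleep=time.sleep):
    """Variant of `retry` that logs each failure and caps the wait.

    `max_delay` and `sleep` are illustrative additions; `sleep` can be
    swapped out so tests do not actually wait.
    """
    for i in range(attempts):
        try:
            return func()
        except exceptions as e:
            if i == attempts - 1:
                logger.error("giving up after %d attempts: %s", attempts, e)
                raise
            delay = min(base * (2 ** i), max_delay) + random.uniform(0, jitter)
            logger.warning("attempt %d/%d failed (%s); retrying in %.2fs",
                           i + 1, attempts, e, delay)
            sleep(delay)

calls = {"n": 0}

def flaky():
    # Fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("network hiccup")
    return "ok"

result = retry_logged(flaky, sleep=lambda s: None)  # skip real waiting in the demo
print(result)  # ok
```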

---

✅ **mental model**:
retrying with backoff is like being polite on a busy phone line:

* don’t slam redial instantly.
* wait longer each time.
* add randomness so everyone’s not retrying in unison.



# 10) Adding context with `contextlib`

```py
from contextlib import suppress

with suppress(KeyError):
    del cache["stale"]       # ignore if missing
```

Create your own contextual wrapper:

```py
from contextlib import contextmanager

@contextmanager
def add_context(msg):
    try:
        yield
    except Exception as e:
        raise RuntimeError(f"{msg}: {e}") from e

with add_context("Fetching embeddings"):
    compute_embeddings(docs)
```

---

This section is about a very powerful but subtle tool: **context managers** from `contextlib`. the idea is that you can wrap a block of code in a `with` statement to change how errors (or resources) are handled inside.

let’s break it down.

---

## 1. `suppress` — ignoring specific errors

```python
from contextlib import suppress

with suppress(KeyError):
    del cache["stale"]  # ignore if missing
```

* **normally**: `del cache["stale"]` would raise `KeyError` if `"stale"` is not in the dict.
* **with suppress(KeyError)**: Python catches that exception and ignores it; any other exception type still propagates.

👉 it’s like saying: “I know this might fail with `KeyError`, but if it does, I don’t care — just keep going.”

### when to use

* cleanup steps (delete temp files, clear dict keys).
* “best effort” operations where missing is fine.
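A temp-file cleanup is a typical case. The sketch below uses a throwaway file created with `tempfile` and shows that `suppress` behaves like a `try`/`except ... pass` around the same call:

```python
import os
import tempfile
from contextlib import suppress

# Create and close a temporary file so there is something to remove.
fd, path = tempfile.mkstemp()
os.close(fd)

with suppress(FileNotFoundError):
    os.remove(path)   # file exists: it is removed

with suppress(FileNotFoundError):
    os.remove(path)   # already gone: FileNotFoundError is swallowed

# The second block is equivalent to:
try:
    os.remove(path)
except FileNotFoundError:
    pass

print(os.path.exists(path))  # False
```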

---

## 2. `contextmanager` — creating your own wrappers

```python
from contextlib import contextmanager

@contextmanager
def add_context(msg):
    try:
        yield
    except Exception as e:
        raise RuntimeError(f"{msg}: {e}") from e
```

* `@contextmanager` lets you define a **custom context manager** using a generator.
* everything **before** `yield` runs when you enter the `with` block.
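* code **after** `yield` runs when the block exits normally; if the block raises, the exception is re-raised at the `yield`, so only a surrounding `except`/`finally` (as in `add_context`) gets to run.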
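The enter/exit ordering is easiest to see by recording events. In this sketch, `tracked` and `events` are names invented for the demonstration; the `finally` makes the exit step run whether or not the block raises:

```python
from contextlib import contextmanager

events = []  # illustrative record of what ran, in order

@contextmanager
def tracked(label):
    events.append(f"enter {label}")      # runs when the with block is entered
    try:
        yield
    finally:
        # finally runs on normal exit and when the block raises
        events.append(f"exit {label}")

with tracked("ok"):
    events.append("body")

try:
    with tracked("boom"):
        raise ValueError("bad input")
except ValueError:
    events.append("error propagated")

print(events)
# ['enter ok', 'body', 'exit ok', 'enter boom', 'exit boom', 'error propagated']
```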

---

## 3. usage of `add_context`

```python
with add_context("Fetching embeddings"):
    compute_embeddings(docs)
```

* if `compute_embeddings` raises an exception, it will be **caught** inside `add_context`.
* then re-raised as a `RuntimeError` with extra context:

  ```
  RuntimeError: Fetching embeddings: original-error-message
  ```
* `from e` chains the original exception as `__cause__`, so its traceback is printed alongside the new one.

👉 this is about **adding useful context to errors** so they’re easier to debug.
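Here is a runnable version of the same idea that inspects the chained exception. `compute_embeddings` is a stand-in written for this example (it just calls `float` on each item), not a real embedding function:

```python
from contextlib import contextmanager

@contextmanager
def add_context(msg):
    try:
        yield
    except Exception as e:
        raise RuntimeError(f"{msg}: {e}") from e

def compute_embeddings(docs):
    # Stand-in for the real function; fails the way float("abc") does.
    return [float(d) for d in docs]

try:
    with add_context("Fetching embeddings"):
        compute_embeddings(["abc"])
except RuntimeError as err:
    caught = err

print(caught)
# Fetching embeddings: could not convert string to float: 'abc'
print(type(caught.__cause__).__name__)
# ValueError
```

The wrapper changes the exception *type* to `RuntimeError`, so callers that need to branch on the original type should look at `__cause__`.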

---

## 4. why this matters for agents

agents often do multi-step pipelines: fetch data → embed → query vector DB → call model.

without context:

```
ValueError: could not convert string to float
```

(where in the pipeline did that happen?)

with context:

```
RuntimeError: Fetching embeddings: could not convert string to float
```

now you know exactly *which stage* failed.

---

## 5. what to focus on

* **`suppress`**: makes error handling cleaner when ignoring specific exceptions is intentional.
* **`@contextmanager`**: lets you create **scoped error handling / cleanup** logic.
* **adding context**: makes logs and tracebacks much more readable (critical when debugging agents).

---

✅ **mental model**:

* `suppress` = “I *expect* this error, and it’s safe to ignore.”
* custom `contextmanager` = “wrap this block so any errors tell me not only *what* went wrong, but also *where in my pipeline* it happened.”




# 11) Top-level “error boundary” for an agent step

Wrap each agent **tool call / step** so one failure doesn’t crash the whole run:

```py
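class RetryableToolError(Exception):
    """Transient tool failure. Defined here only so the snippet runs on its own."""

def run_step(step_fn):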
    try:
        return {"ok": True, "value": step_fn()}
    except RetryableToolError as e:
        return {"ok": False, "retryable": True, "error": str(e)}
    except Exception as e:
        # unknown non-retryable
        return {"ok": False, "retryable": False, "error": f"{type(e).__name__}: {e}"}
```

This pattern is the **agent equivalent of a seatbelt**: you wrap each tool/step so *any* exception becomes **data** (not a crash). Then the controller (planner/policy) can make decisions with a small, predictable schema.

Here’s how it maps to agent development, plus a slightly upgraded version you can drop in.

# Why an “error boundary” matters for agents

* **Stability:** One flaky tool call won’t kill the whole run; it returns `{ok: False, ...}` instead.
* **Uniform control flow:** Your planner only reasons over a small, consistent shape (pass/fail \[+ retryable]).
* **Separation of concerns:** Tool code uses idiomatic `raise ...`; the boundary converts that into structured results.
* **Good UX:** You can summarize failures to the user without dumping tracebacks, while still logging them for debugging.
* **Metrics & observability:** Counting `ok`, `retryable`, `non_retryable` per tool is trivial.

# A sturdier boundary you’ll actually want

Adds timing, traceback for logs, and a hook to classify exceptions:

```python
import logging
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None
    retryable: bool = False
    took_s: float = 0.0

def run_step(step_fn: Callable[[], Any],
             *,
             classify: Optional[Callable[[Exception], Tuple[bool, str]]] = None
            ) -> StepResult:
    t0 = time.perf_counter()  # monotonic clock, suited to measuring elapsed time
    try:
        val = step_fn()
        return StepResult(ok=True, value=val, took_s=time.perf_counter() - t0)
    except Exception as e:
        # Catch Exception rather than BaseException so KeyboardInterrupt and
        # SystemExit still propagate and can stop the run.
        # classify → (retryable?, message)
        retryable, msg = (classify(e) if classify else (False, f"{type(e).__name__}: {e}"))
        logging.exception("Step failed: %s", msg)  # includes full traceback
        return StepResult(ok=False, error=msg, retryable=retryable, took_s=time.perf_counter() - t0)
```

**Why these fields:**

* `ok`/`error`/`retryable` → minimal policy inputs.
* `took_s` → latency budget & timeouts.
* Logging with `logging.exception` → you keep the full traceback outside the agent’s reasoning loop.

# Where it plugs into your agent

**Inside the controller:**

```python
res = run_step(lambda: call_search_tool(query), classify=classify_tool_errors)

if res.ok:
    planner.observe(f"Search OK in {res.took_s:.2f}s")
    use(res.value)
else:
    if res.retryable:
        planner.schedule_retry("search", reason=res.error)
    else:
        planner.fallback("search", message=res.error)  # e.g., different tool or user summary
```

# Classifying exceptions (tiny policy brain)

You decide what’s retryable:

```python
def classify_tool_errors(e: BaseException):
    from json import JSONDecodeError
    # transient
    if isinstance(e, (TimeoutError, ConnectionError)):
        return True, f"{type(e).__name__}: {e}"
    # library-specific timeouts/status errors (adjust per stack)
    # if isinstance(e, httpx.TimeoutException): return True, ...
    if isinstance(e, JSONDecodeError):
        return False, "Tool returned invalid JSON"
    # everything else is non-retryable by default
    return False, f"Unexpected {type(e).__name__}: {e}"
```

# How it plays with other pieces you learned

* **Retries/backoff (#9):** Wrap your tool function with a retry helper *inside* `step_fn`. The boundary just reports the final outcome.
* **Context managers (#6 & #10):** Use `with`/`contextlib` *inside* the step; boundary keeps exceptions structured if setup/teardown fails.
* **EAFP style (#8):** Steps can be written EAFP (just `try`/`except`). The boundary turns any exception that escapes into a `StepResult`.
* **Logging vs traceback (#7):** Boundary logs full traceback (for you), returns a concise message (for the agent).
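As a sketch of the first point, here is a retry helper running *inside* the step, with the boundary seeing only the final outcome. Both `retry` and `run_step` are compact stand-ins for the versions shown earlier (no jitter, dict result), and `flaky_search` / `always_down` are made-up tools:

```python
import time

def retry(func, *, attempts=3, base=0.0, exceptions=(TimeoutError,)):
    # Compact stand-in for the retry helper from #9 (zero base delay for the demo).
    for i in range(attempts):
        try:
            return func()
        except exceptions:
            if i == attempts - 1:
                raise
            time.sleep(base * (2 ** i))

def run_step(step_fn):
    # Compact stand-in for the boundary: exceptions become a plain dict.
    try:
        return {"ok": True, "value": step_fn()}
    except Exception as e:
        return {"ok": False, "error": f"{type(e).__name__}: {e}"}

calls = {"n": 0}

def flaky_search():
    # Hypothetical tool: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("network hiccup")
    return ["doc-1", "doc-2"]

def always_down():
    # Hypothetical tool that never recovers.
    raise TimeoutError("still down")

recovered = run_step(lambda: retry(flaky_search))
exhausted = run_step(lambda: retry(always_down))

print(recovered)  # {'ok': True, 'value': ['doc-1', 'doc-2']}
print(exhausted)  # {'ok': False, 'error': 'TimeoutError: still down'}
```

The planner never sees the two intermediate timeouts; it only sees that the step eventually succeeded, or that retries were exhausted.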

# Practical tips

* Keep the **try** inside the boundary **tight**—only the step call.
* **Don’t** swallow errors silently: always log at the boundary.
* Add a **deadline** if your agent has a time budget (abort long steps):

  * Track elapsed with `took_s`, or wrap the step in a timeout (thread/async/signal, depending on your stack).
* Include a **tool name** in logs/metrics: `logging.exception("[web_search] failed: %s", msg)`.
* Make results **serializable** if you persist runs (avoid raw exceptions inside `StepResult`).
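On the last tip: because `StepResult` holds only plain values (the error is a string, not an exception object), it can round-trip through JSON. A minimal sketch, repeating the dataclass so the block runs on its own and assuming `value` itself is JSON-compatible:

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional

@dataclass
class StepResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None
    retryable: bool = False
    took_s: float = 0.0

failed = StepResult(ok=False, error="TimeoutError: network hiccup",
                    retryable=True, took_s=1.25)

line = json.dumps(asdict(failed))   # one JSON line per step, e.g. for a run log
print(line)
# {"ok": false, "value": null, "error": "TimeoutError: network hiccup", "retryable": true, "took_s": 1.25}

restored = StepResult(**json.loads(line))
print(restored == failed)  # True
```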

# TL;DR

An error boundary converts unpredictable exceptions into a **predictable contract** (`ok/value` or `error/retryable/took_s`) per step. That simplicity lets your agent plan, retry, and fall back reliably, while you retain deep debuggability via logs and classification.


# 12) Validating inputs (before the expensive call)

Keep validation **close to the boundary** (API request construction, tool arguments):

```py
def build_search_query(q: str, k: int):
    if not isinstance(q, str) or not q.strip():
        raise ValueError("q must be a non-empty string")
    if not isinstance(k, int) or not (1 <= k <= 50):
        raise ValueError("k must be an int 1..50")
    return {"q": q.strip(), "k": k}
```

This turns vague downstream errors into fast, clear ones.

---

# What you should learn

## 1) Validate at the **boundary**

A “boundary” is where your code meets the outside world (user input, tool args, API payloads, files). Put checks **right there**, before downstream work (network, DB, model calls). It makes errors:

* **Fast** (caught immediately)
* **Clear** (precise message)
* **Cheap** (no wasted requests)

## 2) Fail fast with the **right exception**

* Wrong **type** → `TypeError`
* Wrong **value/shape/range** → `ValueError`
* Unknown **option** → `ValueError` (or `KeyError` if it’s a dict-like selection)

Keep messages actionable:

> “k must be an int 1..50 (got '7.0')”

## 3) Keep `try` blocks **tight** by validating first

If you validate before the expensive call, the `try` around the call handles only *runtime* issues (timeouts, 5xx), not preventable input mistakes.
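A sketch of what that looks like in practice. `http_search` is a stand-in that always times out, and returning an empty list on timeout is just one possible fallback chosen for the demo:

```python
def build_search_query(q, k):
    if not isinstance(q, str) or not q.strip():
        raise ValueError("q must be a non-empty string")
    if not isinstance(k, int) or not (1 <= k <= 50):
        raise ValueError("k must be an int 1..50")
    return {"q": q.strip(), "k": k}

def http_search(payload):
    # Stand-in for a real HTTP call (an assumption of this sketch); always times out.
    raise TimeoutError("search backend did not respond")

def search(q, k):
    # Input mistakes raise here, outside the try, so they are never
    # mistaken for network trouble.
    payload = build_search_query(q, k)
    try:
        return http_search(payload)   # the only line expected to fail at runtime
    except TimeoutError:
        return []                     # degrade to "no results" on a timeout

print(search("cats", 5))  # []
```

Calling `search("", 5)` raises `ValueError` immediately and never reaches the network call.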

## 4) Separate **validation** from **normalization**

* **Validation**: reject bad inputs (raise).
* **Normalization**: coerce *safe* inputs into a canonical form (strip whitespace, lowercase, apply defaults).
  Normalize first when it’s harmless, then validate the normalized value.

```python
def build_search_query(q, k):
    # normalize
    if isinstance(q, str):
        q = q.strip()
    # validate
    if not isinstance(q, str) or not q:
        raise ValueError("q must be a non-empty string")
    if not isinstance(k, int) or not (1 <= k <= 50):
        raise ValueError("k must be an int 1..50")
    # return canonical shape
    return {"q": q, "k": k}
```

## 5) Choose your stance: **strict** vs **forgiving**

* **Strict**: raise on anything off-spec → best for code-facing APIs and internal tools.
* **Forgiving**: gently coerce (`" 42 "` → `42`) but *still* raise on ambiguous/bad data → best for user-facing inputs.
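A forgiving parser for `k` might look like the sketch below. `coerce_k` is a hypothetical helper, and where to draw the line (here: int-like strings are accepted, `"7.0"` and booleans are not) is a design choice rather than a rule:

```python
def coerce_k(raw, lo=1, hi=50):
    """Forgiving parser: accept ints and int-like strings, reject everything else."""
    if isinstance(raw, bool):
        # bool is a subclass of int; treat True/False as a mistake, not as 1/0
        raise TypeError("k must be an int, not bool")
    if isinstance(raw, str):
        try:
            raw = int(raw.strip())
        except ValueError:
            # "7.0", "ten", "" and similar are ambiguous, so they are rejected
            raise ValueError(f"k must be an int {lo}..{hi} (got {raw!r})") from None
    if not isinstance(raw, int):
        raise TypeError(f"k must be an int (got {type(raw).__name__})")
    if not (lo <= raw <= hi):
        raise ValueError(f"k must be an int {lo}..{hi} (got {raw})")
    return raw

print(coerce_k(" 42 "))  # 42
print(coerce_k(7))       # 7
```

A strict version would simply drop the `str` branch and raise `TypeError` for anything that is not already an `int`.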

## 6) Make contracts **explicit** with types & enums

Use `typing` hints and small enums to constrain inputs.

```python
from enum import Enum

class SortOrder(str, Enum):
    RELEVANCE = "relevance"
    DATE = "date"

def build_query(q: str, k: int, order: SortOrder = SortOrder.RELEVANCE):
    ...
```

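Now type checkers and linters can catch many mistakes *before runtime*. The hints are not enforced when the code runs, so values arriving from outside (an LLM, a request body) still need a runtime check.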
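At runtime, the enum itself can do the checking: calling `SortOrder(raw)` returns the matching member or raises `ValueError`. The sketch below wraps that in a hypothetical `parse_order` helper to produce a more actionable message:

```python
from enum import Enum

class SortOrder(str, Enum):
    RELEVANCE = "relevance"
    DATE = "date"

def parse_order(raw: str) -> SortOrder:
    """Hypothetical helper: turn a raw string (e.g. from a tool call) into a SortOrder."""
    try:
        return SortOrder(raw)
    except ValueError:
        allowed = ", ".join(o.value for o in SortOrder)
        raise ValueError(f"order must be one of: {allowed} (got {raw!r})") from None

print(parse_order("date") is SortOrder.DATE)  # True
```

An unknown value such as `"newest"` raises `ValueError: order must be one of: relevance, date (got 'newest')`.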

## 7) Aggregate errors when it helps UX

For form-like inputs, it’s nicer to return all issues at once:

```python
def validate_args(args: dict):
    errors = []
    if "q" not in args or not isinstance(args["q"], str) or not args["q"].strip():
        errors.append("q must be a non-empty string")
    if "k" not in args or not isinstance(args["k"], int) or not (1 <= args["k"] <= 50):
        errors.append("k must be an int 1..50")
    if errors:
        raise ValueError("; ".join(errors))
```

## 8) For agents: validation = **tool safety**

Every tool your agent calls should:

* Validate its **inputs** (from the planner/LLM)
* Validate **responses** (schema check)
* Translate errors to your **Result** or domain exceptions

```python
def search_tool(q: str, k: int):
    # input guard
    payload = build_search_query(q, k)

    # call (could raise timeout/HTTP errors)
    data = http_search(payload)

    # output guard
    if not isinstance(data, dict) or "results" not in data:
        raise ValueError("tool response missing 'results'")
    return data["results"]
```

Combined with your error boundary, the agent sees a clean `{ok|error}`; you get crisp messages.

## 9) Testing validation is simple and powerful

* Happy path: valid inputs return canonical data.
* Bad type/value: use `pytest.raises` and check the **message**.

```python
import pytest

def test_build_query_bad_k():
    # match is a regular expression, so the literal dots are escaped
    with pytest.raises(ValueError, match=r"k must be an int 1\.\.50"):
        build_search_query("hello", 0)
```

# Quick checklist (tape near your keyboard)

* Validate at **entry points** (tool adapters, API builders).
* **Normalize** safe things first (strip, lower).
* Raise **TypeError** for type, **ValueError** for content.
* Messages must be **actionable** and precise.
* Keep **try** around only the truly risky call.
* After the call, **validate outputs** (schema/keys/ranges).
* Convert to your agent’s **Result** at the boundary.

# Two drop-in helpers

```python
def require_str(name: str, v, *, nonempty=True) -> str:
    if not isinstance(v, str):
        raise TypeError(f"{name} must be str (got {type(v).__name__})")
    v2 = v.strip()
    if nonempty and not v2:
        raise ValueError(f"{name} must be a non-empty string")
    return v2

def require_int_in(name: str, v, lo: int, hi: int) -> int:
    # bool is a subclass of int, so reject True/False explicitly
    if isinstance(v, bool) or not isinstance(v, int):
        raise TypeError(f"{name} must be int (got {type(v).__name__})")
    if not (lo <= v <= hi):
        raise ValueError(f"{name} must be in [{lo},{hi}] (got {v})")
    return v
```

Use them inside tools to keep guard code short and consistent.


