
Knowing the **most common Python exception types** (and when they tend to occur) gives you a major advantage when designing robust code or AI agents.

Here’s a curated list, grouped by **category**, with plain-language explanations and examples. These are the ones you’ll encounter **most often**.

---

## 🧮 Type & Value Errors

| Exception            | When It Happens                   | Example                            |
| -------------------- | --------------------------------- | ---------------------------------- |
| **`TypeError`**      | Wrong data type used              | `len(5)` or `"a" + 2`              |
| **`ValueError`**     | Right type, wrong value           | `int("hello")`                     |
| **`AssertionError`** | `assert` fails during dev/testing | `assert x > 0, "must be positive"` |

🧠 **You’ll use these often when validating input or agent arguments.**

---

## 📦 Collections & Lookups

| Exception            | When It Happens                 | Example                       |
| -------------------- | ------------------------------- | ----------------------------- |
| **`IndexError`**     | List index out of bounds        | `my_list[100]`                |
| **`KeyError`**       | Dict key not found              | `my_dict["missing"]`          |
| **`AttributeError`** | Accessing nonexistent attribute | `x.foo` where `x` is an `int` |

🔍 Very common when dealing with structured data, APIs, or agent memory.

---

## 📂 Files & I/O

| Exception               | When It Happens    | Example                                            |
| ----------------------- | ------------------ | -------------------------------------------------- |
| **`FileNotFoundError`** | File doesn’t exist | `open("missing.txt")`                              |
| **`PermissionError`**   | Can’t access file  | Trying to write to a read-only location            |
| **`IOError`**           | General I/O issues | Alias of `OSError` since Python 3.3; mostly seen in older code |

🧠 Use `with open(...) as f:` so files are always closed, even when an error occurs. It does not prevent the errors above, so still catch them.

---

## 🌐 Network / Agent Tooling

| Exception                  | When It Happens         | Example                    |
| -------------------------- | ----------------------- | -------------------------- |
| **`TimeoutError`**         | Operation took too long | HTTP request, socket, etc. |
| **`ConnectionError`**      | Network is unreachable  | Failed API/tool call       |
| **`json.JSONDecodeError`** | Bad JSON parsing        | `json.loads("not json")`   |

📡 These are **bread-and-butter** for agent toolchains calling external APIs.
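
As a minimal sketch of handling the last row of the table, the helper below wraps `json.loads` and turns a parse failure into a fallback value. The name `parse_tool_json` and its `default` parameter are illustrative assumptions, not part of any library.

```python
import json

def parse_tool_json(text, default=None):
    """Parse JSON text from a tool; return `default` if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # e.msg, e.lineno and e.colno describe where parsing failed
        print(f"Bad JSON at line {e.lineno}, column {e.colno}: {e.msg}")
        return default

good = parse_tool_json('{"answer": 42}')
bad = parse_tool_json("not json", default={})
```

Returning a default keeps the caller simple; a tool that must not continue on bad data would re-raise instead.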

---

## 🧪 Testing / Internal Checks

| Exception                 | When It Happens               | Example                                |
| ------------------------- | ----------------------------- | -------------------------------------- |
| **`RuntimeError`**        | Something unexpected happened | “This should never happen” placeholder |
| **`NotImplementedError`** | Placeholder method            | In base classes or abstract tools      |

🧠 Often used in early-stage prototypes or when enforcing agent behaviors.
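
One way to picture the `NotImplementedError` row is a base class whose subclasses must supply the real behavior. The class names `BaseTool`, `EchoTool`, and `UnfinishedTool` below are hypothetical and exist only for this sketch.

```python
class BaseTool:
    """Hypothetical base class: subclasses must override run()."""

    def run(self, query):
        raise NotImplementedError(f"{type(self).__name__} must implement run()")

class EchoTool(BaseTool):
    def run(self, query):
        return f"echo: {query}"

class UnfinishedTool(BaseTool):
    pass  # run() has not been written yet

try:
    UnfinishedTool().run("hi")
except NotImplementedError as e:
    message = str(e)

result = EchoTool().run("hi")
```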

---

## ⚙️ Custom Exceptions (You Define)

For agent logic, you’ll commonly define your own, such as:

```python
class RetryableToolError(Exception): pass
class NonRetryableToolError(Exception): pass
```

So you can **structure how the agent reacts**:

* Retryable? Retry.
* Non-retryable? Summarize or skip.
* Unexpected? Log and crash the step.
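
The three reactions above could be sketched as a small dispatch function. The name `decide_action` and the action labels are assumptions made for illustration; the exception classes are restated so the snippet runs on its own.

```python
class RetryableToolError(Exception):
    """Transient failure: worth trying again."""

class NonRetryableToolError(Exception):
    """Permanent failure: retrying will not help."""

def decide_action(exc):
    """Map an exception to the agent's next move (labels are illustrative)."""
    if isinstance(exc, RetryableToolError):
        return "retry"
    if isinstance(exc, NonRetryableToolError):
        return "skip"
    return "crash_step"  # unexpected: log it and fail this step

actions = [decide_action(e) for e in (
    RetryableToolError("503 from server"),
    NonRetryableToolError("bad input"),
    KeyError("email"),
)]
```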

---

## ✅ Summary Table (Quick Reference)

| Category         | Exceptions                                                            |
| ---------------- | --------------------------------------------------------------------- |
| Type/Value       | `TypeError`, `ValueError`, `AssertionError`                           |
| Data access      | `IndexError`, `KeyError`, `AttributeError`                            |
| I/O              | `FileNotFoundError`, `PermissionError`, `IOError`                     |
| Network/External | `TimeoutError`, `ConnectionError`, `json.JSONDecodeError`             |
| Internal control | `RuntimeError`, `NotImplementedError`                                 |
| Custom/Agent     | `RetryableToolError`, `NonRetryableToolError`, `ValidationError`, etc |



## ✅ 1. **TypeError** – *Wrong data type used*

This error happens when an operation or function is applied to an object of an **inappropriate type**.




In [1]:
# Example: Trying to get the length of an integer
x = 5
print(len(x))  # ❌ TypeError

TypeError: object of type 'int' has no len()

In [2]:
# Fix
x = "hello"
print(len(x))  # ✅ OK: strings have a length

5


## ✅ 2. **ValueError** – *Right type, wrong value*

This happens when you pass the **right data type**, but the **content is invalid**.




In [4]:
# Example: Trying to convert an invalid string to an integer
num = int("hello")  # ❌ ValueError

ValueError: invalid literal for int() with base 10: 'hello'

In [3]:
# Fix
num = int("42")  # ✅ OK







## ✅ 3. **AssertionError** – *assert fails during dev/testing*

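`assert` is used to check conditions **during development**. It raises an `AssertionError` if the condition is `False`. Assertions are skipped entirely when Python runs with the `-O` flag, so don't rely on them to validate input in production code.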




In [7]:
x = -3
assert x > 0, "x must be positive"  # ❌ AssertionError

AssertionError: x must be positive

In [6]:
# Fix
x = 5
assert x > 0, "x must be positive"  # ✅ No error

In [5]:
# Handling the ValueError from section 2 with try/except
try:
    num = int("hello")
except ValueError:
    print("Oops! Could not convert to int.")

Oops! Could not convert to int.


## 📦 Collections & Lookup Errors

These errors are extremely common when working with:

* Dictionaries (`KeyError`)
* Lists (`IndexError`)
* Objects (`AttributeError`)

### 🧠 What You’re Practicing:

| Error Type       | Cause                                       | Fix Strategy                    |
| ---------------- | ------------------------------------------- | ------------------------------- |
| `KeyError`       | Dict key doesn’t exist                      | Use `.get()` or check with `in` |
| `IndexError`     | List index is out of range                  | Use `len()` or try/except block |
| `AttributeError` | Object doesn’t have that method or property | Use `hasattr()` or type checks  |

---

### ✅ Tip for Agent Devs:

These errors often show up in:

* Tool inputs and outputs
* JSON API responses (`dict["missing_key"]`)
* Data wrangling (`obj.nonexistent_attribute`)

In agents, it’s good practice to:

* Catch these explicitly
* Add context: `raise KeyError("Missing 'email' in profile") from e`
* Use structured fallback values
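
The practices above can be sketched with two small helpers: one re-raises with context, the other returns a fallback. The function names are hypothetical; the `person` dict mirrors the one used in this notebook.

```python
def get_email(profile):
    """Return profile['email'], re-raising with context if it is missing."""
    try:
        return profile["email"]
    except KeyError as e:
        # `from e` keeps the original KeyError attached as __cause__
        raise KeyError("Missing 'email' in profile") from e

def get_email_or_default(profile, default="Not provided"):
    """Structured fallback: never raises for a missing key."""
    return profile.get("email", default)

person = {"name": "Ada", "age": 30}
fallback = get_email_or_default(person)

try:
    get_email(person)
except KeyError as err:
    cause = err.__cause__  # the original KeyError('email')
```

Which variant fits depends on whether the missing field is fatal for the step or merely optional.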



In [8]:
# 📦 1. KeyError: Missing dictionary key
print("KeyError Example")
person = {"name": "Ada", "age": 30}
try:
    print(person["email"])  # ❌ KeyError
except KeyError as e:
    print(f"Caught KeyError: {e}")
print()

KeyError Example
Caught KeyError: 'email'



In [9]:
# ✅ Fix: use .get() or check key
email = person.get("email", "Not provided")
print(f"Email (safe access): {email}")
print("-" * 40)

Email (safe access): Not provided
----------------------------------------


In [14]:
# 📦 2. IndexError: List index out of range
print("IndexError Example")
colors = ["red", "green", "blue"]
try:
    print(colors[5])  # ❌ IndexError
except IndexError as e:
    print(f"Caught IndexError: {e}")
print("-" * 40)

IndexError Example
Caught IndexError: list index out of range
----------------------------------------


In [11]:
# ✅ Fix: check that the index is in range (negative indices count from the end)
index = 1
if -len(colors) <= index < len(colors):
    print(f"Color at index {index}: {colors[index]}")
print("-" * 40)

Color at index 1: green
----------------------------------------


In [15]:
# 📦 3. AttributeError: Object missing attribute
print("AttributeError Example")
x = 42
try:
    print(x.lower())  # ❌ integers have no .lower()
except AttributeError as e:
    print(f"Caught AttributeError: {e}")
print("-" * 40)

AttributeError Example
Caught AttributeError: 'int' object has no attribute 'lower'
----------------------------------------


In [16]:
# ✅ Fix: only call .lower() on strings
word = "Hello"
print(word.lower())  # ✅ OK

hello


Let's now explore 📂 **Files & I/O Errors**, which commonly occur when reading, writing, or accessing files or other input/output resources.

These are especially relevant when your agent is:

* Reading documents, logs, or datasets
* Saving or exporting results
* Downloading or uploading to/from disk
* Interacting with APIs that touch the file system

---

## 📂 Files & I/O Error Examples — Practice Code


## 💡 Key Learnings

| Exception              | Cause                              | Fix / Best Practice                         |
| ---------------------- | ---------------------------------- | ------------------------------------------- |
| `FileNotFoundError`    | File doesn’t exist                 | Check if file exists first, handle with try |
| `PermissionError`      | No permission to read/write        | Avoid system-protected locations            |
| `IOError`              | Generic file/network error         | Alias of `OSError` in Python 3; catch `OSError` |
| ✅ Use `with open(...)` | Automatically handles file closing | Prevents leaks, reduces errors              |

---

## 🧠 Agent & DS Tip:

Use file error handling when:

* Loading training data
* Logging results or saving checkpoints
* Writing temp files or responses to disk
* Letting users upload or select file paths
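
A minimal sketch of these ideas follows. It relies on `try/except` rather than checking for the file first, since a file can disappear between the check and the `open()` call. The helper name `read_text_or_default` is an assumption, and the temporary directory only keeps the demo from touching real files.

```python
from pathlib import Path
import tempfile

def read_text_or_default(path, default=""):
    """Read a text file, returning `default` if it is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return default
    except PermissionError:
        return default

with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "notes.txt"
    existing.write_text("hello", encoding="utf-8")
    found = read_text_or_default(existing)
    missing = read_text_or_default(Path(tmp) / "missing.txt", default="<no file>")
```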




In [17]:
import os

# 📂 1. FileNotFoundError
print("FileNotFoundError Example")
try:
    with open("missing_file.txt", "r") as f:
        data = f.read()
except FileNotFoundError as e:
    print(f"Caught FileNotFoundError: {e}")
print("-" * 40)

FileNotFoundError Example
Caught FileNotFoundError: [Errno 2] No such file or directory: 'missing_file.txt'
----------------------------------------


In [18]:
# 📂 2. PermissionError
print("PermissionError Example")
try:
    # Writing to a protected location. If the notebook runs as root (as on Colab),
    # the write succeeds and nothing is caught, which is why the output below is empty.
    with open("/root/secret.txt", "w") as f:
        f.write("Access denied?")
except PermissionError as e:
    print(f"Caught PermissionError: {e}")
print("-" * 40)

PermissionError Example
----------------------------------------


In [19]:
# 📂 3. IOError (an alias of OSError since Python 3.3; shown because older code uses it)
print("IOError Example (Simulated)")
try:
    # Simulating an I/O issue by manually raising one
    raise IOError("Simulated I/O failure during read/write")
except IOError as e:
    print(f"Caught IOError: {e}")
print("-" * 40)

IOError Example (Simulated)
Caught IOError: Simulated I/O failure during read/write
----------------------------------------


In [20]:
# ✅ Best practice: use 'with' to safely open/close files
print("Best Practice: with open(...) as f:")
try:
    # Create a file safely
    with open("my_data.txt", "w") as f:
        f.write("This file is safely written.\n")
    print("File written successfully.")
except OSError as e:  # covers FileNotFoundError, PermissionError, and other I/O failures
    print(f"Could not write file: {e}")


Best Practice: with open(...) as f:
File written successfully.



## 🧠 1. Most Common Error Types in Agent Development

When building AI Agents (tool-using, async, API-integrating, user-facing), these are the **top error types** you're most likely to encounter:

| Error Type                          | Typical Cause                                        | Common Location                        |
| ----------------------------------- | ---------------------------------------------------- | -------------------------------------- |
| `TypeError` / `ValueError`          | Invalid inputs or wrong assumptions about data       | Tool inputs, prompt templates, parsing |
| `KeyError` / `IndexError`           | Missing dict/list data                               | JSON tool responses, memory buffers    |
| `AttributeError`                    | Incorrect object assumptions (e.g., `.name` on None) | Chained calls, dynamic objects         |
| `TimeoutError` / `requests.Timeout` | External APIs are slow, fail to respond              | Tool calls, web scraping, APIs         |
| `ConnectionError`                   | Network dropped, server unreachable                  | API and DB access                      |
| `FileNotFoundError`                 | File dependencies missing                            | Loading tools, cached files, logs      |
| `RuntimeError`                      | Unexpected program state                             | Internal bugs in agent planning        |
| `Exception` (fallback)              | Catch-all when something really goes wrong           | Top-level agent loop/tool step         |

---

## ✅ 2. Best Practices for Handling These Errors

### 🔒 Handle what you **expect**

```python
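import requests

try:
    response = requests.get(url, timeout=5)  # `url` is supplied by the caller
except requests.Timeout:
    retry()  # placeholder: retry with backoff
except requests.ConnectionError:
    log_offline_state()  # placeholder: record that the network is down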
```

### 🛡️ Use `try / except / else / finally` carefully

* Keep `try` blocks **tight**: only the risky lines.
* Use `except` for **specific** errors (not bare `except`).
* Use `else` for code that should run only when the `try` block raised nothing.
* Use `finally` to clean up (e.g., close DB or file).
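
To see the order in which the four clauses run, here is a small sketch. The function `load_config` and the `events` list are made up for the demonstration.

```python
import json

events = []  # records the order in which each clause runs

def load_config(text):
    try:
        config = json.loads(text)   # only the risky line lives in `try`
    except json.JSONDecodeError:
        events.append("except")
        config = {}
    else:
        events.append("else")       # runs only if no exception was raised
    finally:
        events.append("finally")    # always runs
    return config

ok = load_config('{"debug": true}')
broken = load_config("{oops")
```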

### 🧱 Add structure to agent tool calls:

```python
def run_tool_call(tool_fn):
    try:
        return {"ok": True, "value": tool_fn()}
    except RetryableToolError as e:
        return {"ok": False, "retryable": True, "error": str(e)}
    except Exception as e:
        return {"ok": False, "retryable": False, "error": f"{type(e).__name__}: {e}"}
```

### 📄 Use Logging (instead of just printing):

```python
import logging
logging.exception("tool_call failed")  # call this inside an `except` block; it logs the traceback
```
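
`logging.exception()` only has a traceback to record when it is called from inside an exception handler. The sketch below shows that placement and captures the log output in memory so it can be inspected. The logger name `agent.tools` and the function `flaky_tool` are arbitrary choices for this example.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("agent.tools")  # logger name is an arbitrary choice
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output in `stream` only

def flaky_tool():
    raise ValueError("bad payload")

try:
    flaky_tool()
except ValueError:
    # Inside the handler, logging.exception picks up the active traceback
    logger.exception("tool_call failed")

log_text = stream.getvalue()
```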

### 🧪 Validate Inputs Early:

```python
if not isinstance(query, str) or not query.strip():
    raise ValueError("query must be a non-empty string")
```

### 🔁 Retry External Calls (transient failures only!)

Use exponential backoff + jitter for things like:

* `requests.Timeout`
* `ConnectionError`
* 5xx API status codes
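
As a rough sketch of the arithmetic behind exponential backoff with jitter, the generator below yields the wait before each retry. The names `backoff_delays` and `is_retryable_status` and the default values are assumptions chosen for illustration.

```python
import random

def backoff_delays(attempts, base=0.2, jitter=0.1):
    """Yield the wait before each retry: base * 2**i plus random jitter."""
    for i in range(attempts):
        yield base * (2 ** i) + random.uniform(0, jitter)

def is_retryable_status(status_code):
    """Treat only 5xx responses as transient."""
    return 500 <= status_code < 600

# jitter=0 makes the schedule deterministic: roughly 0.2, 0.4, 0.8, 1.6 seconds
delays = list(backoff_delays(4, jitter=0))
```

The jitter term spreads out clients that fail at the same moment so they do not all retry in lockstep.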

---

## 🚀 3. How to Level Up Your Agent Error Handling Skills

### 🎯 Develop these core reflexes:

| Skill                            | How to Build It                                                  |
| -------------------------------- | ---------------------------------------------------------------- |
| **Read tracebacks fluently**     | Practice running buggy code, locate the real error cause         |
| **Identify expected failures**   | Think through what might go wrong (timeout? bad input? 404?)     |
| **Use structured error returns** | Avoid crashing the whole agent with clear `{ok, error}` wrappers |
| **Write custom exceptions**      | Define `RetryableToolError`, `ToolParseError`, etc               |
| **Use logging everywhere**       | Track what failed and **why**, especially in async tools         |
| **Write pytest for failures**    | Use `pytest.raises()` to test edge cases & validation            |
| **Learn context managers**       | For safe file/net resource handling                              |

---

## 🧭 Want a practical training plan?

Here's a simple **Error Handling Skill Ladder**:

| Level | Focus Area                             | Practice With                              |
| ----- | -------------------------------------- | ------------------------------------------ |
| 🟢 1  | Tracebacks + `try/except` basics       | Built-in type/value/index errors           |
| 🟡 2  | Tool APIs + JSON parsing + retries     | `requests`, `openai`, `json.loads`         |
| 🟠 3  | Custom exceptions + logging + timeouts | Structured error responses in agents       |
| 🔴 4  | Async tools, backoff strategies        | `asyncio`, `aiohttp`, `concurrent.futures` |
| 🟣 5  | Framework-level guardrails             | Agent wrappers, toolkits, `FastAPI`, etc.  |

---

## ✅ TL;DR: What To Do Next

* **Expect + handle** known errors
* **Log + observe** unknown ones (don't blindly suppress)
* Build a **structured error boundary** around each agent step
* Treat errors as **valuable signals**, not just bugs
* Practice reading real-world tracebacks to build intuition




Here’s a **complete reusable error-handling scaffold + checklist** tailored for **AI Agent development**. You can drop this directly into your agent framework or adapt it to your style.

---

# 📘 Agent Error Handling Template

### ✅ Basic Structure

```python
import logging
import time
import json

class RetryableToolError(Exception): pass
class NonRetryableToolError(Exception): pass

def run_agent_step(step_fn):
    """Wrap one agent step/tool call and return a structured result."""
    try:
        return {
            "ok": True,
            "value": step_fn()
        }
    except RetryableToolError as e:
        logging.warning("Retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": True,
            "error": str(e)
        }
    except Exception as e:
        logging.error("Non-retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": False,
            "error": f"{type(e).__name__}: {e}"
        }
```

---

### ⚙️ Example Tool Function

```python
def call_tool():
    # call_tool_raw() is a placeholder that returns (status_code, response_text)
    status, text = call_tool_raw()

    if status >= 500:
        raise RetryableToolError(f"server 5xx: {status}")
    if status >= 400:
        raise NonRetryableToolError(f"client 4xx: {status}")

    # Keep the try block tight: only the parsing step can raise JSONDecodeError
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as e:
        raise NonRetryableToolError("Invalid JSON from tool") from e

    if "answer" not in payload:
        raise NonRetryableToolError("missing 'answer' field")
    return payload["answer"]
```

---

### 🔁 Retry with Backoff (for transient failures)

```python
import logging
import random
import time

def retry(func, attempts=5, base=0.2, jitter=0.1, exceptions=(RetryableToolError,)):
    """Call func(), retrying on `exceptions` with exponential backoff plus jitter."""
    for i in range(attempts):
        try:
            return func()
        except exceptions as e:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base * (2 ** i) + random.uniform(0, jitter)
            logging.info("Retrying after %.2fs due to: %s", delay, e)
            time.sleep(delay)
```
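
A usage sketch for the retry helper: the version below restates the same logic with one addition, an injectable `sleep` parameter, which is an assumption made here so the demo does not actually wait. `flaky_tool` is a hypothetical tool that fails twice before succeeding.

```python
import random
import time

class RetryableToolError(Exception):
    pass

def retry(func, attempts=5, base=0.2, jitter=0.1,
          exceptions=(RetryableToolError,), sleep=time.sleep):
    """Same logic as above, with `sleep` injectable so demos do not wait."""
    for i in range(attempts):
        try:
            return func()
        except exceptions:
            if i == attempts - 1:
                raise
            sleep(base * (2 ** i) + random.uniform(0, jitter))

calls = {"count": 0}

def flaky_tool():
    """Hypothetical tool that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RetryableToolError("temporary outage")
    return "done"

waits = []
result = retry(flaky_tool, sleep=waits.append)  # record waits instead of sleeping
```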

---

## ✅ Agent Developer Error Handling Checklist

> Tape this next to your monitor or keyboard. Use it when building every agent or tool call.

### 🔍 Before the Call

* [ ] Validate inputs (`str`, `int`, non-empty, valid range)
* [ ] Default to `with open(...)` or `async with` when needed
* [ ] Wrap **only the risky lines** inside `try`

### 🔧 During the Call

* [ ] Catch **specific exceptions**, not just `Exception`
* [ ] Add custom error messages with `raise ... from e`
* [ ] Wrap external APIs with retry + timeout
* [ ] Use logging — `logging.exception()` inside `except`

### 🧱 After the Call

* [ ] Return a **structured error**: `{ok, error, retryable}`
* [ ] Handle agent steps with a **top-level wrapper** (don’t crash full run)
* [ ] Test known failure modes with `pytest.raises()`

---

## 🧪 What to Log vs Raise?

| Condition                  | Do This                       |
| -------------------------- | ----------------------------- |
| Expected, transient error  | Log & Retry                   |
| Permanent user input error | Log once & Skip/Fallback      |
| Unexpected failure         | Log traceback + Raise for dev |

---



In [None]:
import logging
import time
import json

class RetryableToolError(Exception): pass
class NonRetryableToolError(Exception): pass

def run_agent_step(step_fn):
    """Wrap one agent step/tool call and return a structured result."""
    try:
        return {
            "ok": True,
            "value": step_fn()
        }
    except RetryableToolError as e:
        logging.warning("Retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": True,
            "error": str(e)
        }
    except Exception as e:
        logging.error("Non-retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": False,
            "error": f"{type(e).__name__}: {e}"
        }

Let’s break this down **line by line** so you not only understand it, but can **reuse** the pattern in your agent development work with confidence.

---

# 🧠 Goal of This Code

This is a **top-level error handling wrapper** for a **single tool step** in an agent. The purpose is to:

* Handle errors gracefully (instead of crashing the agent)
* Distinguish between **retryable** and **non-retryable** errors
* Return a **structured result** `{ok, value/error, retryable}` that the agent controller can act on

---

# 🔍 Walkthrough

```python
import logging
import time
import json
```

✅ **Standard libraries** that you'll often need:

* `logging`: for proper error logging (instead of `print`)
* `time`: not used in this cell; the retry helper shown earlier uses `time.sleep()` for backoff
* `json`: used if tools return or require JSON payloads

---

```python
class RetryableToolError(Exception): pass
class NonRetryableToolError(Exception): pass
```

✅ **Custom exception classes**:

* These let you **categorize** errors your tools might raise
* You can raise them in tools when:

  * Something failed but you want to retry (`RetryableToolError`)
  * It’s a permanent failure like a 400 or bad input (`NonRetryableToolError`)

Think of these like “labels” on your errors to decide whether to **retry** or **fail fast**.

---

```python
def run_agent_step(step_fn):
```

✅ This is the **top-level wrapper** for a **single step** in your agent.

It expects `step_fn` to be a function that:

* Encapsulates one agent tool call
* Might raise exceptions

So instead of doing:

```python
result = call_tool()
```

You’d do:

```python
result = run_agent_step(call_tool)
```

---

```python
    try:
        return {
            "ok": True,
            "value": step_fn()
        }
```

✅ If the tool call (`step_fn()`) succeeds:

* Return `ok = True`
* Return the result inside `"value"`

This is the **happy path**.

---

```python
    except RetryableToolError as e:
        logging.warning("Retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": True,
            "error": str(e)
        }
```

✅ If the step fails with a **retryable** error:

* Log it as a warning
* Mark it as `ok = False`, `retryable = True`
* Include the error message for context

This allows the **controller** to decide:

> Should I try again? Back off? Or skip?

---

```python
    except Exception as e:
        logging.error("Non-retryable error: %s", e, exc_info=True)
        return {
            "ok": False,
            "retryable": False,
            "error": f"{type(e).__name__}: {e}"
        }
```

✅ If any **other error** happens:

* Log it as an error
* Mark `ok = False`, `retryable = False`
* Include both the type and message for clarity

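This covers **unexpected errors** and bugs. Because `NonRetryableToolError` has no `except` clause of its own, it also lands here and is reported with `retryable = False`.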

---

# 🧠 What You Should Be Learning

| Concept                                   | Why It Matters                                   |
| ----------------------------------------- | ------------------------------------------------ |
| ✅ `try/except` around just the risky call | Keeps error handling scoped and clear            |
| ✅ Custom error types                      | Helps downstream logic know what to retry        |
| ✅ Structured return `{ok, error}`         | Makes agent flow **robust and inspectable**      |
| ✅ Logging                                 | Debugging is easier with tracebacks and messages |
| ❌ No bare `except:`                       | `except Exception` at the boundary still lets `KeyboardInterrupt` and `SystemExit` propagate |

---

## 🔁 How It Fits in Agent Design

This is **your foundation**. All tools, steps, or subroutines the agent calls should be **wrapped like this** so the agent:

* Doesn't crash
* Can decide what to do next
* Logs problems clearly
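
One possible way to apply the wrapper to many tools without repeating the call is a decorator. This is a sketch of an alternative form, not something the template above requires. The decorator name `agent_step` and the `divide` tool are assumptions.

```python
import functools
import logging

class RetryableToolError(Exception):
    pass

def agent_step(fn):
    """Decorator form of the wrapper: the decorated tool returns a result dict."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "value": fn(*args, **kwargs)}
        except RetryableToolError as e:
            logging.warning("Retryable error: %s", e, exc_info=True)
            return {"ok": False, "retryable": True, "error": str(e)}
        except Exception as e:
            logging.error("Non-retryable error: %s", e, exc_info=True)
            return {"ok": False, "retryable": False, "error": f"{type(e).__name__}: {e}"}
    return wrapper

@agent_step
def divide(a, b):
    return a / b

good = divide(6, 3)
bad = divide(1, 0)
```

The trade-off is that a decorated tool can no longer be called "raw", so tools that are also used outside the agent may be better wrapped at the call site.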






## ✅ **Why This Wrapper Matters in Agent Design**

### 🧰 **It’s a reusable, standardized wrapper**

You use `run_agent_step()` to wrap any tool function like:

```python
result = run_agent_step(fetch_user_profile)
```

No matter what the tool does — API call, file access, DB query — this wrapper:

* **Catches errors** without crashing the agent
* **Classifies** them as retryable or not
* **Returns a predictable object** like:

  ```python
  {'ok': False, 'retryable': True, 'error': 'server 5xx: 503'}
  ```

---

## 🤖 **Why This Helps the Agent**

Agents need to:

* Stay alive through partial failures
* Know **how to respond** to failures (retry, skip, fallback, or escalate)
* Avoid wasting cognitive budget parsing inconsistent error outputs

This wrapper provides:

| Feature                 | Benefit to Agent                            |
| ----------------------- | ------------------------------------------- |
| `ok: True/False`        | Easy to detect success/failure              |
| `retryable: True/False` | Logic can retry only when safe              |
| `error: str`            | Can be logged, summarized, or shown to user |
| Logging                 | Maintainers can debug without print()       |

---

## 🧠 **Mental Model**

Think of this wrapper as your agent’s **“shock absorber”**:

* It catches all the turbulence from each tool
* Translates it into a **calm, structured format**
* Lets the agent controller focus on **what to do next**, not how to interpret chaos

---

## 🔁 Example

Imagine wrapping multiple tools:

```python
step1 = run_agent_step(lambda: fetch_weather(location))
step2 = run_agent_step(lambda: call_llm_tool(prompt))
step3 = run_agent_step(lambda: summarize_results(step1, step2))
```

Now your agent controller just needs to check:

```python
if not step2["ok"] and step2["retryable"]:
    retry_call()
```
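
The controller check above could be turned into a bounded retry loop along these lines. This is a sketch under stated assumptions: `run_with_retries`, `max_attempts`, and the `fetch_weather` stub are hypothetical, and the wrapper is condensed (logging omitted) so the snippet runs on its own.

```python
class RetryableToolError(Exception):
    pass

def run_agent_step(step_fn):
    """Condensed version of the wrapper (logging omitted for brevity)."""
    try:
        return {"ok": True, "value": step_fn()}
    except RetryableToolError as e:
        return {"ok": False, "retryable": True, "error": str(e)}
    except Exception as e:
        return {"ok": False, "retryable": False, "error": f"{type(e).__name__}: {e}"}

def run_with_retries(step_fn, max_attempts=3):
    """Re-run a step while it fails with retryable=True, up to max_attempts (>= 1)."""
    for _ in range(max_attempts):
        result = run_agent_step(step_fn)
        if result["ok"] or not result["retryable"]:
            break
    return result

attempts = {"n": 0}

def fetch_weather():
    """Hypothetical tool: the first call fails transiently, the second succeeds."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RetryableToolError("rate limited")
    return {"temp_c": 21}

weather = run_with_retries(fetch_weather)
broken = run_with_retries(lambda: {}["missing"])  # KeyError: not retried
```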

---

## ✅ Summary: Why It’s Powerful

| 💡 Principle                         | ✨ Benefit                        |
| ------------------------------------ | -------------------------------- |
| Reusable pattern                     | One consistent way to wrap tools |
| Predictable return shape             | Easier agent logic               |
| Handles expected & unexpected errors | More robust agents               |
| Logging built-in                     | Easier debugging                 |
| Supports retry logic                 | Enables resilient retries        |






## ✅ This **is** the best-practice approach for error handling in agents

### You’ve discovered what’s often called:

> **An Error Boundary + Standardized Contract**

It’s one of the **most effective ways** to:

* Make your agents modular and composable
* Prevent brittle error-handling logic
* Focus your thinking where it matters (task logic, not plumbing)

---

## 🧠 What You’ve Built with `run_agent_step`

| Feature            | What It Means                                                                                        |
| ------------------ | ---------------------------------------------------------------------------------------------------- |
| ✅ **Modular**      | You can wrap any tool or function without rewriting new error-handling every time                    |
| ✅ **Standardized** | The return format is always `{ok, value?, error?, retryable?}`, which agents can easily reason about |
| ✅ **Robust**       | Unexpected failures don’t crash the agent, they’re logged and surfaced cleanly                       |
| ✅ **Debuggable**   | Logging gives full tracebacks, even for retryable errors                                             |
| ✅ **Retry-Aware**  | Separating `RetryableToolError` vs `NonRetryableToolError` lets your agent retry safely              |
| ✅ **Extensible**   | You can later add things like metrics, timeouts, tracing, etc. without changing the agent logic      |

---

## 💥 Why This Is So Powerful

By creating this **one small abstraction**, you:

* Reduce repetitive error handling across tools
* Give your agent controller *one consistent way* to check tool results
* **Minimize branching logic** and “what if this fails?” overhead
* Create a foundation for **retries, logging, fallbacks, summarization**, and more

This is especially critical in agents because:

* They run long chains of uncertain steps
* They rely on APIs and tools that can fail unpredictably
* You often want graceful degradation (not "one step failed = game over")

---

## 🚫 Without This

You’d end up writing things like this for every tool:

```python
try:
    result = call_tool()
except TimeoutError:
    ...  # do this
except ValueError:
    ...  # do that
```

And then later have to figure out:

* “Did this tool return a value or throw?”
* “Is this something I should retry?”
* “What even happened?”

Which leads to:

* Spaghetti logic
* Hard-to-debug behavior
* Unmaintainable agents

---

## ✅ Final Answer: Is This the Best Approach?

> **Yes, in most cases.** Wrapping every step/tool in a standardized error-handling function like `run_agent_step` is a common and effective practice for agent development, as long as the tools inside still raise specific, well-classified exceptions.

You're not only handling errors — you’re building a reliable, maintainable **framework** for intelligent agent behavior.





## ❓ Your Question, Restated

You've learned that:

* Python has many specific error types (`TypeError`, `KeyError`, etc.)
* You’re expected to handle each one **individually** and **thoughtfully**

But now we’re saying:

> “Just wrap everything in `run_agent_step()` and return a simple dict like `{ok, retryable, error}`.”

So how can this **simplified wrapper** possibly handle the **full complexity** of all those error types?

---

## 💡 Here's the Core Insight

### 🔁 You still handle specific errors inside the wrapped function

The **specific error handling** still happens — but it’s **encapsulated** inside the function you pass to `run_agent_step()`.

Let’s walk through it step by step:

---

### 🧱 Step 1: Build the tool with detailed error logic

Inside your tool (e.g. `call_tool()`), you handle **specific exceptions** like this:

```python
def call_tool():
    # api_call(), valid_json(), and parse() are placeholders for your own helpers
    response = api_call()

    if response.status_code >= 500:
        raise RetryableToolError("5xx server error")
    if response.status_code == 404:
        raise NonRetryableToolError("Not found")
    if not valid_json(response.text):
        raise NonRetryableToolError("Malformed JSON")
    return parse(response.text)
```

This is where all your **specific logic** lives.

---

### 🧱 Step 2: Wrap it in a standard boundary

Now you wrap it:

```python
result = run_agent_step(call_tool)
```

That’s it.

---

### 🔄 Step 3: `run_agent_step()` standardizes the response

No matter *what* happened inside `call_tool()`, the wrapper returns one of:

```python
{'ok': True, 'value': 42}
```

or

```python
{'ok': False, 'retryable': True, 'error': "5xx server error"}
```

or

```python
{'ok': False, 'retryable': False, 'error': "NonRetryableToolError: Malformed JSON"}
```

So your agent doesn't need to worry about whether the failure was:

* a `TimeoutError`,
* a `TypeError`,
* a `ValueError`,
* or a `CustomToolError`.

It just sees:
🟥 Something went wrong
🔁 Can I retry it or not?

---

## ✅ Why This Works

You're **still using all your error-handling skills** to:

* Detect problems
* Classify them
* Add logging
* Raise meaningful exceptions

But then you **funnel all of that** into a **unified shape** for the agent to consume:

```python
{ok, retryable, error}
```

The agent now has **one simple interface** to make decisions like:

```python
if not result["ok"] and result["retryable"]:
    retry()
elif not result["ok"]:
    log_and_continue()
```

---

## 📦 Analogy: Error Wrapping Is Like an Adapter

Think of `run_agent_step()` like an **adapter** in hardware:

| Tool Output                                   | Adapter (`run_agent_step`) | Agent Logic       |
| --------------------------------------------- | -------------------------- | ----------------- |
| TimeoutError (raised as `RetryableToolError`) | retryable=True             | retry it          |
| KeyError                                      | retryable=False            | fallback          |
| JSON error                                    | retryable=False            | explain to user   |
| OK                                            | value                      | move to next step |

You **preserve the detail internally**, but only expose **structured decisions** to the outer world.
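
For the mapping in the table to hold, something inside the tool has to translate raw exceptions into the two custom labels, because the wrapper treats every non-`RetryableToolError` exception as non-retryable. A minimal sketch of that translation follows. The helper `call_and_classify` and the choice of which built-in errors count as transient are assumptions.

```python
class RetryableToolError(Exception):
    pass

class NonRetryableToolError(Exception):
    pass

# Assumption: which built-in errors count as transient is a per-project choice
TRANSIENT = (TimeoutError, ConnectionError)

def call_and_classify(fn):
    """Run fn(), translating raw exceptions into the two custom labels."""
    try:
        return fn()
    except TRANSIENT as e:
        raise RetryableToolError(f"{type(e).__name__}: {e}") from e
    except (KeyError, ValueError) as e:  # json.JSONDecodeError is a ValueError
        raise NonRetryableToolError(f"{type(e).__name__}: {e}") from e

def slow_tool():
    raise TimeoutError("took too long")

def sloppy_tool():
    return {"name": "Ada"}["email"]

labels = []
for tool in (slow_tool, sloppy_tool):
    try:
        call_and_classify(tool)
    except RetryableToolError:
        labels.append("retry")
    except NonRetryableToolError:
        labels.append("fallback")
```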

---

## 🔑 Final Summary

> You are still expected to know and handle specific error types — but you’re doing it **inside the tool** (where the error originates), not at the top level.

At the **agent controller level**, you simplify everything to:

* success/failure
* retryable or not
* clear string error

This is what makes **complex agent orchestration feasible** without turning it into a pile of `try/except` soup.


What you just described is one of the **most important design patterns in AI agent development**:

---

## 🧠 Two-Tier Error Handling Strategy

### 🔹 1. **Function-Level Error Handling**

This is where you:

* Know what *might* go wrong based on the logic, API, or data
* Catch **specific exceptions** (like `ValueError`, `KeyError`, `TimeoutError`)
* Decide if the error is **retryable** or **non-retryable**
* Add **context** or clean up resources
* Raise custom exceptions like `RetryableToolError`

This layer is all about **accuracy, safety, and clean code**.

```python
import json
import requests

def call_tool():
    # api_call() is a placeholder that returns a requests.Response
    try:
        response = api_call()
        if response.status_code >= 500:
            raise RetryableToolError("Server error")
        payload = response.json()  # parse once, reuse below
        if "answer" not in payload:
            raise NonRetryableToolError("Missing 'answer'")
        return payload["answer"]
    except requests.Timeout as e:
        raise RetryableToolError("Timed out") from e
    except json.JSONDecodeError as e:
        raise NonRetryableToolError("Bad JSON") from e
```

---

### 🔹 2. **Agent-Level (Top-Level) Error Boundary**

This is where you:

* Don’t care *what* specifically failed
* Just care:
  ➕ Did it work?
  🔁 Can I retry?
  🟥 What was the error message?

This layer is all about **simple decision-making** and **error isolation**.

```python
result = run_agent_step(call_tool)

if result["ok"]:
    do_something(result["value"])
elif result["retryable"]:
    try_again()
else:
    log_failure(result["error"])
```

---

## 🤖 Why This Helps the Agent

You're exactly right:

> It **reduces the mental overhead** for the model.

Instead of the agent needing to understand:

* The details of every tool’s failure case
* Every exception type in Python
* How to retry certain errors but not others

You give it a **simple, universal contract**:

```python
{
  "ok": False,
  "retryable": True,
  "error": "Timed out"
}
```

So now the agent can reason cleanly:

> "This step failed. It was retryable. Let me try again or pick an alternative tool."

---

## ✅ Why It Improves Reliability

This pattern leads to:

* 🔒 Fewer crashes
* 🔁 Safer retries
* 🔍 Easier debugging
* 🧩 Easier composition of tools and chains
* 🧠 Cleaner LLM reasoning

And it keeps **your agent logic** clean and general — not tied to any one tool’s quirks.

---

## 🔚 Summary

You’ve got it 100% correct:

| Layer                     | Purpose                      | Error Handling Style                         |
| ------------------------- | ---------------------------- | -------------------------------------------- |
| ✅ **Function/tool level** | Deal with known, local risks | Catch specific exceptions, raise custom ones |
| ✅ **Agent/top level**     | Coordinate behavior          | Interpret `{ok, retryable, error}` result    |

This two-tier split is a common way to structure **production-grade agents**.

