# Chapter 15: Building Stateful Agents
**From: Zero to AI Agent**

## Overview
In this chapter, you'll learn about:
- Understanding agent state management
- Defining state schemas
- State updates and transformations
- Checkpointing and persistence
- Implementing retry logic
- Handling partial failures
- State visualization and monitoring


In [None]:
!pip install -q -r requirements.txt

from dotenv import load_dotenv
load_dotenv()

---
## Section 15.1: Understanding agent state management

In [None]:
# From: no_state_demo.py

# From: Zero to AI Agent, Chapter 15, Section 15.1
# File: no_state_demo.py

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class CounterState(TypedDict):
    count: int

def increment(state: CounterState) -> dict:
    new_count = state["count"] + 1
    print(f"Count is now: {new_count}")
    return {"count": new_count}

graph = StateGraph(CounterState)
graph.add_node("increment", increment)
graph.add_edge(START, "increment")
graph.add_edge("increment", END)

app = graph.compile()  # No checkpointer!

# Run three times
for i in range(3):
    result = app.invoke({"count": 0})


In [None]:
# From: with_state_demo.py

# From: Zero to AI Agent, Chapter 15, Section 15.1
# File: with_state_demo.py

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class CounterState(TypedDict):
    count: int

def increment(state: CounterState) -> dict:
    new_count = state["count"] + 1
    print(f"Count is now: {new_count}")
    return {"count": new_count}

graph = StateGraph(CounterState)
graph.add_node("increment", increment)
graph.add_edge(START, "increment")
graph.add_edge("increment", END)

# Add checkpointer!
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)

# The secret ingredient: thread_id
config = {"configurable": {"thread_id": "my-counter"}}

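# Run three times with the SAME thread_id.
# Seed each run from the saved count: "count" has no reducer, so passing
# {"count": 0} every time would overwrite the checkpointed value.
for i in range(3):
    saved = app.get_state(config).values  # Empty dict before the first run
    result = app.invoke({"count": saved.get("count", 0)}, config)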

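To make the role of `thread_id` concrete, here is a toy, dictionary-backed stand-in for a checkpointer. It is not how `MemorySaver` is implemented; the class `ToyCheckpointer`, its `load`/`save` methods, and `run_increment` are hypothetical names used only for illustration. The sketch shows the idea the demo relies on: state is stored and looked up per thread, so runs that share a `thread_id` continue from the saved value, while a different `thread_id` starts fresh.

```python
import copy


class ToyCheckpointer:
    """Hypothetical in-memory store: one saved state per thread_id."""

    def __init__(self):
        self._store = {}

    def load(self, thread_id: str) -> dict:
        # Unknown threads start from an empty state
        return copy.deepcopy(self._store.get(thread_id, {}))

    def save(self, thread_id: str, state: dict) -> None:
        self._store[thread_id] = copy.deepcopy(state)


def run_increment(checkpointer: ToyCheckpointer, thread_id: str) -> int:
    """Load the thread's state, increment its counter, and save it back."""
    state = checkpointer.load(thread_id)
    state["count"] = state.get("count", 0) + 1
    checkpointer.save(thread_id, state)
    return state["count"]


toy = ToyCheckpointer()
alice_counts = [run_increment(toy, "alice") for _ in range(3)]
bob_counts = [run_increment(toy, "bob") for _ in range(1)]
print(alice_counts)  # [1, 2, 3]
print(bob_counts)    # [1]
```

The two threads never see each other's counter, which is the same isolation Exercise 15.1.2 asks you to verify with real thread IDs.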

---
### Section 15.1 Exercises

### Exercise 15.1.1: Conversation Counter

Build a simple agent that:
- Tracks how many times a user has talked to it
- Greets returning users differently from new users
- Uses MemorySaver for persistence during the session

Try invoking it multiple times with the same thread_id and see the count increase.

In [None]:
# Your code here


### Exercise 15.1.2: Multi-User Tracker

Create an agent that:
- Supports multiple users (different thread_ids)
- Tracks each user's visit count separately
- Demonstrates that thread_ids isolate state completely

Run it with "alice" and "bob" thread_ids and verify they have separate counts.

In [None]:
# Your code here


### Exercise 15.1.3: State History Explorer

Build a 3-node workflow that:
- Each node adds something to the state
- After running, use `get_state_history()` to print all snapshots
- Show how state evolved through the graph

This helps you understand how checkpointing captures every step.

In [None]:
# Your code here


---
## Section 15.2: Defining state schemas

In [None]:
# From: well_designed_schema.py

# From: Zero to AI Agent, Chapter 15, Section 15.2
# File: well_designed_schema.py

"""
Example of a well-designed state schema.
"""

from typing import TypedDict, Annotated, Optional
from operator import add
from pydantic import BaseModel, Field
from enum import Enum
from datetime import datetime

# Enums for controlled values
class AgentStatus(str, Enum):
    IDLE = "idle"
    THINKING = "thinking"
    ACTING = "acting"
    DONE = "done"
    ERROR = "error"

# Pydantic for validated sub-structures
class Message(BaseModel):
    role: str = Field(pattern=r'^(user|assistant|system)$')
    content: str = Field(min_length=1)
    timestamp: datetime = Field(default_factory=datetime.now)

# TypedDict for LangGraph state
class AgentState(TypedDict):
    """
    Main state for the conversational agent.
    
    Fields marked with Annotated[..., add] accumulate across nodes.
    Other fields are replaced with new values.
    """
    # Accumulating fields
    messages: Annotated[list[dict], add]
    action_log: Annotated[list[str], add]
    
    # Replacing fields
    status: str  # Use AgentStatus values
    current_task: Optional[str]
    iteration: int

# Validation helper
def validate_message(msg: dict) -> dict:
    """Validate and normalize a message."""
    validated = Message(**msg)
    return validated.model_dump()


---
### Section 15.2 Exercises

### Exercise 15.2.1: Validated User Profile

Create a Pydantic model for a user profile with:
- Username (3-20 characters, alphanumeric only)
- Email (valid email format)
- Age (optional, but if provided must be 13-120)
- Membership level (enum: "free", "basic", "premium")

Test it with both valid and invalid data.

In [None]:
# Your code here


### Exercise 15.2.2: Chat Message Validation

Build a LangGraph node that:
- Accepts raw message input
- Validates it with Pydantic (role must be "user" or "assistant", content not empty)
- Returns the validated message or an error

In [None]:
# Your code here


### Exercise 15.2.3: Order State Schema

Design a complete state schema for an order processing agent:
- Order with id, items list, total price, and status
- Items with name, quantity (>0), and price (>0)
- Status enum (pending, processing, shipped, delivered, cancelled)
- Validation that total equals sum of item prices × quantities

In [None]:
# Your code here


---
## Section 15.3: State updates and transformations

In [None]:
# From: reducer_demo.py

# From: Zero to AI Agent, Chapter 15, Section 15.3
# File: reducer_demo.py

"""
Demonstrates the difference between accumulating and replacing state fields.
"""

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    items: Annotated[list[str], add]  # Will accumulate
    count: int                         # Will replace

def node_a(state):
    return {"items": ["from A"], "count": 1}

def node_b(state):
    return {"items": ["from B"], "count": 2}

graph = StateGraph(State)
graph.add_node("a", node_a)
graph.add_node("b", node_b)
graph.add_edge(START, "a")
graph.add_edge("a", "b")
graph.add_edge("b", END)

app = graph.compile()
result = app.invoke({"items": [], "count": 0})

print(f"items: {result['items']}")  # ['from A', 'from B'] - accumulated!
print(f"count: {result['count']}")  # 2 - replaced!

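The accumulate-versus-replace rule can be imitated in plain Python to see what the annotation is doing. This is a simplified sketch, not LangGraph's actual merge code: `ToyState` and `apply_update` are hypothetical names, and the helper assumes the first item of the `Annotated` metadata is the reducer.

```python
from operator import add
from typing import Annotated, TypedDict, get_type_hints


class ToyState(TypedDict):
    items: Annotated[list[str], add]  # Has a reducer: accumulate
    count: int                        # No reducer: replace


def apply_update(schema: type, state: dict, update: dict) -> dict:
    """Merge `update` into `state`, using a reducer where one is annotated."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated types expose their extra arguments as __metadata__
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


toy_state = {"items": [], "count": 0}
toy_state = apply_update(ToyState, toy_state, {"items": ["from A"], "count": 1})
toy_state = apply_update(ToyState, toy_state, {"items": ["from B"], "count": 2})
print(toy_state)  # {'items': ['from A', 'from B'], 'count': 2}
```

Any function of the form `(existing, new) -> merged` can take the place of `add` here, which is the shape the custom reducers in the exercises below need to follow.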

---
### Section 15.3 Exercises

### Exercise 15.3.1: Deduplicating Reducer

Create a custom reducer that:
- Accumulates messages like `add` does
- But removes duplicates (same content)
- Preserves the order (first occurrence wins)

Test it with a graph where multiple nodes might add the same message.

In [None]:
# Your code here


### Exercise 15.3.2: Priority Queue Reducer

Build a reducer that:
- Maintains a sorted list of tasks by priority
- Each task is `{"task": str, "priority": int}`
- Higher priority items come first
- New items are inserted in the correct position

In [None]:
# Your code here


### Exercise 15.3.3: Change Tracker

Create a state schema that:
- Tracks the current value of several fields
- Also tracks a "changelog" of what changed and when
- Each node automatically logs its changes to the changelog

In [None]:
# Your code here


---
## Section 15.4: Checkpointing and persistence

In [None]:
# From: persistence_demo.py

# From: Zero to AI Agent, Chapter 15, Section 15.4
# File: persistence_demo.py

"""
Complete persistence demonstration.
"""

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver

class ChatState(TypedDict):
    messages: Annotated[list[str], add]
    turn: int

def chat_turn(state: ChatState) -> dict:
    turn = state["turn"] + 1
    return {
        "messages": [f"Turn {turn}: Hello!"],
        "turn": turn
    }

# Build graph
graph = StateGraph(ChatState)
graph.add_node("chat", chat_turn)
graph.add_edge(START, "chat")
graph.add_edge("chat", END)

# Run with persistence
DB_PATH = "chat_demo.db"

print("=== Session 1: Starting fresh ===")
with SqliteSaver.from_conn_string(DB_PATH) as saver:
    app = graph.compile(checkpointer=saver)
    config = {"configurable": {"thread_id": "demo"}}
    
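    # Two turns. Only the first call seeds the state; after that the
    # checkpointer holds it, and passing the returned messages back in
    # would duplicate them through the `add` reducer.
    state = app.invoke({"messages": [], "turn": 0}, config)
    state = app.invoke({"messages": []}, config)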
    
    print(f"Messages: {state['messages']}")
    print(f"Turn: {state['turn']}")

print("\n=== Session 2: Resuming after 'restart' ===")
with SqliteSaver.from_conn_string(DB_PATH) as saver:
    app = graph.compile(checkpointer=saver)
    config = {"configurable": {"thread_id": "demo"}}
    
    # Load existing state
    existing = app.get_state(config)
    print(f"Loaded {existing.values['turn']} turns from disk!")
    
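    # Continue. The saved state is already in the checkpointer, so pass
    # only new input; re-sending existing.values would duplicate messages.
    state = app.invoke({"messages": []}, config)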
    print(f"Messages: {state['messages']}")
    print(f"Turn: {state['turn']}")


---
### Section 15.4 Exercises

### Exercise 15.4.1: Multi-User Chat System

Build a chat system that:
- Supports multiple users with separate thread IDs
- Persists all conversations to SQLite
- Can list all conversations for a given user
- Shows message count and last activity per conversation

In [None]:
# Your code here


### Exercise 15.4.2: Checkpoint Cleanup Utility

Create a maintenance utility that:
- Reports total checkpoints and storage size
- Prunes old checkpoints (keep only last N per thread)
- Runs database VACUUM to reclaim space
- Shows before/after statistics

In [None]:
# Your code here


### Exercise 15.4.3: Conversation Export Tool

Build export/import functionality:
- Export a conversation to JSON file
- Import JSON back into a new thread
- Support "forking" (copy conversation to new thread)
- Preserve all metadata through export/import

In [None]:
# Your code here


---
## Section 15.5: Implementing retry logic

In [None]:
# From: retry_wrapper.py

# From: Zero to AI Agent, Chapter 15, Section 15.5
# File: retry_wrapper.py

"""
A reusable retry wrapper for agent nodes.
"""

import time
import random
from functools import wraps

def with_retry(max_attempts: int = 3, base_delay: float = 1.0):
    """
    Decorator that adds retry logic to any function.
    
    Usage:
        @with_retry(max_attempts=3)
        def my_flaky_function():
            # might fail sometimes
            pass
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                    
                except Exception as e:
                    last_error = e
                    
                    # Don't retry on final attempt
                    if attempt == max_attempts - 1:
                        break
                    
                    # Calculate wait time
                    wait = base_delay * (2 ** attempt)
                    jitter = random.uniform(0, wait * 0.1)
                    
                    print(f"  ⚠️ Attempt {attempt + 1} failed: {e}")
                    print(f"  ⏳ Retrying in {wait + jitter:.1f}s...")
                    
                    time.sleep(wait + jitter)
            
            # All retries exhausted
            raise last_error
        
        return wrapper
    return decorator


# Demo usage
if __name__ == "__main__":
    
    @with_retry(max_attempts=3, base_delay=0.5)
    def flaky_api_call(query: str) -> str:
        """Simulates an API that fails 60% of the time."""
        if random.random() < 0.6:
            raise ConnectionError("Service temporarily unavailable")
        return f"Result for: {query}"
    
    print("=== Retry Wrapper Demo ===\n")
    
    for i in range(3):
        try:
            result = flaky_api_call(f"Query {i+1}")
            print(f"Success: {result}\n")
        except Exception as e:
            print(f"Final failure: {e}\n")

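The wrapper waits `base_delay * (2 ** attempt)` seconds between attempts, plus up to 10% jitter. The sketch below computes only the deterministic part of that schedule so you can estimate the worst-case time a node might spend retrying. `backoff_schedule` is a hypothetical helper, not part of the chapter's code.

```python
def backoff_schedule(max_attempts: int = 3, base_delay: float = 1.0) -> list[float]:
    """Return the base wait (without jitter) before each retry.

    There are max_attempts - 1 waits, because no wait follows the final attempt.
    """
    return [base_delay * (2 ** attempt) for attempt in range(max_attempts - 1)]


print(backoff_schedule(3, 0.5))       # [0.5, 1.0]
print(backoff_schedule(5, 1.0))       # [1.0, 2.0, 4.0, 8.0]
print(sum(backoff_schedule(5, 1.0)))  # 15.0
```

Because the delay doubles each time, raising `max_attempts` by one roughly doubles the total waiting time, so it is worth checking the sum before choosing a large value.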

In [None]:
# From: resilient_node.py

# From: Zero to AI Agent, Chapter 15, Section 15.5
# File: resilient_node.py

"""
A complete example of a resilient LangGraph node.
"""

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
import random
import time

class TaskState(TypedDict):
    task: str
    result: str
    status: str
    attempt_log: Annotated[list[str], add]

def resilient_processor(state: TaskState) -> dict:
    """
    A node with built-in retry logic.
    
    This demonstrates the pattern without external decorators,
    so you can see exactly what's happening.
    """
    max_attempts = 3
    base_delay = 1.0
    log = []  # Collect every attempt so intermediate failures are not lost
    
    for attempt in range(max_attempts):
        try:
            # Simulate flaky operation (fails 60% of time)
            if random.random() < 0.6:
                raise ConnectionError("Simulated failure")
            
            # Success!
            log.append(f"Attempt {attempt + 1}: Success ✓")
            return {
                "result": f"Processed: {state['task']}",
                "status": "success",
                "attempt_log": log
            }
            
        except ConnectionError as e:
            log_entry = f"Attempt {attempt + 1}: Failed - {e}"
            
            if attempt < max_attempts - 1:
                # Exponential backoff before the next attempt
                wait = base_delay * (2 ** attempt)
                log.append(f"{log_entry} (retrying in {wait}s)")
                time.sleep(wait)
            else:
                log.append(f"{log_entry} - giving up")
    
    # All attempts failed
    return {
        "result": "",
        "status": "failed",
        "attempt_log": log
    }

# Build and test
graph = StateGraph(TaskState)
graph.add_node("process", resilient_processor)
graph.add_edge(START, "process")
graph.add_edge("process", END)

app = graph.compile(checkpointer=MemorySaver())

# Run multiple times to see retry behavior
print("=== Resilient Node Demo ===\n")

for i in range(3):
    config = {"configurable": {"thread_id": f"test-{i}"}}
    result = app.invoke({
        "task": f"Task #{i + 1}",
        "result": "",
        "status": "pending",
        "attempt_log": []
    }, config)
    
    print(f"Task #{i + 1}: {result['status']}")
    for log in result['attempt_log']:
        print(f"  {log}")
    print()


---
### Section 15.5 Exercises

### Exercise 15.5.1: Smart Retry Decorator

Create an improved `@with_retry` decorator that:
- Accepts a `RetryPolicy` object for configuration
- Only retries specific exception types
- Logs each retry attempt with timestamp
- Returns metadata about retries alongside the result

In [None]:
# Your code here


### Exercise 15.5.2: Circuit Breaker Pattern

Implement a circuit breaker that:
- Tracks failure rate over recent calls
- "Opens" (stops calling) when failure rate exceeds threshold
- Automatically "closes" (resumes) after a cooldown period
- Integrate it with a LangGraph node

In [None]:
# Your code here


### Exercise 15.5.3: Retry Dashboard

Build a simple monitoring system that:
- Tracks retry statistics across all nodes
- Reports which nodes fail most often
- Shows average retry count per successful operation
- Alerts when retry rate exceeds normal levels

In [None]:
# Your code here


---
## Section 15.6: Handling partial failures

In [None]:
# From: graceful_degradation.py

# From: Zero to AI Agent, Chapter 15, Section 15.6
# File: graceful_degradation.py

"""
Graceful degradation pattern - continue with partial results when some sources fail.
"""

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    query: str
    results: Annotated[list[str], add]
    errors: Annotated[list[str], add]
    success_count: Annotated[int, add]  # Reducer needed for parallel updates
    failure_count: Annotated[int, add]  # Reducer needed for parallel updates

def search_source_a(state: ResearchState) -> dict:
    """Simulates a successful search."""
    return {
        "results": [f"Source A result for: {state['query']}"],
        "success_count": 1  # Return increment, reducer will sum
    }

def search_source_b(state: ResearchState) -> dict:
    """Simulates a failed search."""
    # This source is "down"
    return {
        "errors": ["Source B: Connection timeout"],
        "failure_count": 1  # Return increment, reducer will sum
    }

def search_source_c(state: ResearchState) -> dict:
    """Simulates another successful search."""
    return {
        "results": [f"Source C result for: {state['query']}"],
        "success_count": 1  # Return increment, reducer will sum
    }

def summarize_results(state: ResearchState) -> dict:
    """Summarize what we got."""
    print("\n=== Research Results ===")
    print(f"Successful sources: {state['success_count']}")
    print(f"Failed sources: {state['failure_count']}")
    
    if state["results"]:
        print("\nResults retrieved:")
        for r in state["results"]:
            print(f"  ✓ {r}")
    
    if state["errors"]:
        print("\nErrors encountered:")
        for e in state["errors"]:
            print(f"  ✗ {e}")
    
    return {}

# Build the graph
graph = StateGraph(ResearchState)

graph.add_node("source_a", search_source_a)
graph.add_node("source_b", search_source_b)
graph.add_node("source_c", search_source_c)
graph.add_node("summarize", summarize_results)

# All sources feed into summarize
graph.add_edge(START, "source_a")
graph.add_edge(START, "source_b")
graph.add_edge(START, "source_c")
graph.add_edge("source_a", "summarize")
graph.add_edge("source_b", "summarize")
graph.add_edge("source_c", "summarize")
graph.add_edge("summarize", END)

app = graph.compile()

# Run it
print("=== Graceful Degradation Demo ===")
result = app.invoke({
    "query": "AI agents",
    "results": [],
    "errors": [],
    "success_count": 0,
    "failure_count": 0
})

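The comments on `success_count` and `failure_count` say a reducer is needed for parallel updates. The toy below illustrates why, outside of LangGraph: when several branches report an increment in the same step, summing keeps all of them, while keeping only one value silently loses the rest. `merge_with_add` and `merge_last_write_wins` are hypothetical helpers written for this illustration only.

```python
from functools import reduce
from operator import add

# success_count increments returned in the same step by source A and source C
parallel_updates = [1, 1]


def merge_with_add(current: int, incoming: list[int]) -> int:
    """Fold every incoming value into the current one with `add`."""
    return reduce(add, incoming, current)


def merge_last_write_wins(current: int, incoming: list[int]) -> int:
    """Keep only the last value that arrived, discarding the others."""
    return incoming[-1] if incoming else current


summed = merge_with_add(0, parallel_updates)
overwritten = merge_last_write_wins(0, parallel_updates)
print(summed)       # 2
print(overwritten)  # 1
```

This is also why each source node returns an increment of `1` rather than `state["success_count"] + 1`: the reducer does the adding, so nodes only report their own contribution.
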
In [None]:
# From: fault_tolerant_node.py

# From: Zero to AI Agent, Chapter 15, Section 15.6
# File: fault_tolerant_node.py

"""
Pattern for fault-tolerant nodes.
"""

from typing import Any, Callable, TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
import random

class RobustState(TypedDict):
    input: str
    results: Annotated[list[dict], add]
    warnings: Annotated[list[str], add]

def fault_tolerant_operation(
    name: str,
    operation: Callable[[RobustState], Any],
    fallback_value: Any = None
):
    """
    Wrap an operation with fault tolerance.
    
    Returns a node function that:
    - Tries the operation
    - Falls back on failure
    - Always returns useful state
    """
    def node(state: RobustState) -> dict:
        try:
            result = operation(state)
            return {
                "results": [{"source": name, "data": result, "status": "ok"}]
            }
        except Exception as e:
            return {
                "results": [{"source": name, "data": fallback_value, "status": "failed"}],
                "warnings": [f"{name}: {str(e)}"]
            }
    return node

# Example operations (some will fail randomly)
def flaky_api(state):
    if random.random() < 0.5:
        raise ConnectionError("Service unavailable")
    return f"API data for {state['input']}"

def reliable_cache(state):
    return f"Cached data for {state['input']}"

def sometimes_slow(state):
    if random.random() < 0.3:
        raise TimeoutError("Request timed out")
    return f"Fresh data for {state['input']}"

# Build graph with fault-tolerant nodes
graph = StateGraph(RobustState)

graph.add_node("api", fault_tolerant_operation("API", flaky_api, "N/A"))
graph.add_node("cache", fault_tolerant_operation("Cache", reliable_cache, "N/A"))
graph.add_node("fresh", fault_tolerant_operation("Fresh", sometimes_slow, "N/A"))

graph.add_edge(START, "api")
graph.add_edge("api", "cache")
graph.add_edge("cache", "fresh")
graph.add_edge("fresh", END)

app = graph.compile()

# Test it multiple times
print("=== Fault Tolerant Node Demo ===\n")

for i in range(3):
    print(f"Run {i + 1}:")
    result = app.invoke({"input": "test query", "results": [], "warnings": []})
    
    for r in result["results"]:
        status = "✓" if r["status"] == "ok" else "✗"
        print(f"  {status} {r['source']}: {r['data']}")
    
    if result["warnings"]:
        print("  Warnings:")
        for w in result["warnings"]:
            print(f"    ⚠️ {w}")
    print()


---
### Section 15.6 Exercises

### Exercise 15.6.1: Multi-Source Aggregator

Build an agent that:
- Queries 4 different "data sources" (simulate with functions)
- 2 sources randomly fail
- Aggregates successful results
- Reports which sources failed and why
- Returns a confidence score based on how many succeeded

In [None]:
# Your code here


### Exercise 15.6.2: Fallback Chain

Create a node with a chain of fallbacks:
- Try primary API (fails 70% of the time)
- If that fails, try secondary API (fails 40% of the time)
- If that fails, try cache (always succeeds but data is "stale")
- Track which source ultimately provided the data

In [None]:
# Your code here


### Exercise 15.6.3: Graceful Feature Degradation

Build a "document analyzer" that:
- Always extracts word count (core feature)
- Optionally analyzes sentiment (fails sometimes)
- Optionally extracts keywords (fails sometimes)
- Optionally summarizes (fails sometimes)
- Returns whatever it could compute with clear status for each

In [None]:
# Your code here


---
## Section 15.7: State visualization and monitoring

In [None]:
# From: state_monitor.py

# From: Zero to AI Agent, Chapter 15, Section 15.7
# File: state_monitor.py

"""
Simple state monitoring for LangGraph agents.
"""

from datetime import datetime
from typing import Any

class StateMonitor:
    """Track state changes over time."""
    
    def __init__(self, name: str = "Agent"):
        self.name = name
        self.history = []
        self.start_time = datetime.now()
    
    def record(self, node_name: str, state: dict):
        """Record state after a node runs."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "elapsed": (datetime.now() - self.start_time).total_seconds(),
            "node": node_name,
            "state_snapshot": {k: self._summarize(v) for k, v in state.items()}
        }
        self.history.append(entry)
    
    def _summarize(self, value: Any) -> str:
        """Create a short summary of a value."""
        if isinstance(value, list):
            return f"list[{len(value)}]"
        elif isinstance(value, dict):
            return f"dict[{len(value)}]"
        elif isinstance(value, str) and len(value) > 30:
            return f'"{value[:30]}..."'
        return repr(value)
    
    def report(self):
        """Print a summary report."""
        print(f"\n{'═' * 50}")
        print(f"📈 Monitor Report: {self.name}")
        print(f"{'═' * 50}")
        print(f"Total nodes executed: {len(self.history)}")
        # Keep the label even when nothing has been recorded yet
        total = f"{self.history[-1]['elapsed']:.2f}s" if self.history else "N/A"
        print(f"Total time: {total}")
        
        print(f"\n{'─' * 50}")
        print("Execution Timeline:")
        print(f"{'─' * 50}")
        
        for entry in self.history:
            print(f"  [{entry['elapsed']:5.2f}s] {entry['node']}")
            for key, summary in entry['state_snapshot'].items():
                print(f"           {key}: {summary}")
        
        print(f"{'═' * 50}\n")


def monitored_node(monitor: StateMonitor, original_func, node_name: str):
    """Wrap a node with monitoring."""
    def wrapper(state):
        result = original_func(state)
        # Merge result with state for recording
        new_state = {**state, **result}
        monitor.record(node_name, new_state)
        return result
    return wrapper


def visualize_history(history: list[dict]):
    """Create ASCII timeline of state changes."""
    print("\n📜 State Evolution Timeline")
    print("=" * 60)
    
    for i, snapshot in enumerate(history):
        # Header
        node = snapshot.get("node", f"Step {i}")
        print(f"\n┌─ {node} {'─' * (55 - len(node))}")
        
        # State changes
        state = snapshot.get("state_snapshot", {})
        for key, value in state.items():
            print(f"│  {key}: {value}")
        
        # Connector to next
        if i < len(history) - 1:
            print("│")
            print("▼")
    
    print("\n" + "=" * 60)


# Demo
if __name__ == "__main__":
    monitor = StateMonitor("Demo Agent")
    
    # Simulate some state changes
    monitor.record("start", {"query": "test", "messages": []})
    monitor.record("process", {"query": "test", "messages": ["Hello"], "status": "processing"})
    monitor.record("complete", {"query": "test", "messages": ["Hello", "Done"], "status": "complete"})
    
    monitor.report()
    visualize_history(monitor.history)


In [None]:
# From: metrics_tracker.py

# From: Zero to AI Agent, Chapter 15, Section 15.7
# File: metrics_tracker.py

"""
Track operational metrics for agents.
"""

from collections import defaultdict
from datetime import datetime

class MetricsTracker:
    """Track operational metrics."""
    
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)
        self.errors = []
    
    def increment(self, metric: str, amount: int = 1):
        """Increment a counter."""
        self.counters[metric] += amount
    
    def record_timing(self, operation: str, duration: float):
        """Record how long something took."""
        self.timings[operation].append(duration)
    
    def record_error(self, node: str, error: str):
        """Record an error."""
        self.errors.append({
            "time": datetime.now().isoformat(),
            "node": node,
            "error": error
        })
        self.increment("total_errors")
    
    def summary(self) -> dict:
        """Get metrics summary."""
        timing_stats = {}
        for op, times in self.timings.items():
            timing_stats[op] = {
                "count": len(times),
                "avg": sum(times) / len(times) if times else 0,
                "max": max(times) if times else 0
            }
        
        return {
            "counters": dict(self.counters),
            "timings": timing_stats,
            "error_count": len(self.errors),
            "recent_errors": self.errors[-5:]  # Last 5 errors
        }
    
    def print_summary(self):
        """Print formatted summary."""
        s = self.summary()
        
        print("\n📊 Metrics Summary")
        print("─" * 40)
        
        print("\nCounters:")
        for name, value in s["counters"].items():
            print(f"  {name}: {value}")
        
        print("\nTimings:")
        for op, stats in s["timings"].items():
            print(f"  {op}: avg={stats['avg']:.3f}s, max={stats['max']:.3f}s ({stats['count']} calls)")
        
        if s["recent_errors"]:
            print(f"\nRecent Errors ({s['error_count']} total):")
            for err in s["recent_errors"]:
                print(f"  [{err['node']}] {err['error']}")


# Demo
if __name__ == "__main__":
    import random
    import time
    
    tracker = MetricsTracker()
    
    # Simulate some operations
    print("=== Metrics Tracker Demo ===\n")
    
    for i in range(5):
        # Track API calls
        tracker.increment("api_calls")
        duration = random.uniform(0.1, 0.5)
        tracker.record_timing("api_call", duration)
        
        # Some failures
        if random.random() < 0.3:
            tracker.record_error("api_node", "Connection timeout")
        
        # Track processed items
        tracker.increment("items_processed", random.randint(1, 10))
    
    tracker.print_summary()

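The demo above feeds random durations into `record_timing`. In a real node you would measure the call itself. One possible way to do that is a small context manager, sketched here under stated assumptions: `timed` is a hypothetical helper, and the module-level `timings` dict and `record_timing` function are stand-ins for the tracker so the snippet runs on its own.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # Stand-in for MetricsTracker.timings


def record_timing(operation: str, duration: float) -> None:
    """Stand-in for MetricsTracker.record_timing."""
    timings[operation].append(duration)


@contextmanager
def timed(operation: str, record=record_timing):
    """Measure the wrapped block and report its duration, even on error."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record(operation, time.perf_counter() - start)


# A successful call is timed
with timed("api_call"):
    time.sleep(0.01)

# A failing call is timed too, and the exception still propagates
try:
    with timed("api_call"):
        raise ConnectionError("simulated failure")
except ConnectionError:
    pass

print(len(timings["api_call"]))  # 2
```

With the chapter's tracker you could pass `record=tracker.record_timing` instead of the stand-in, so failed calls show up in the timing statistics as well as in the error list.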

In [None]:
# From: monitored_agent.py

# From: Zero to AI Agent, Chapter 15, Section 15.7
# File: monitored_agent.py

"""
Example agent with monitoring.
"""

from typing import TypedDict, Annotated
from operator import add
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from datetime import datetime
import time

class AgentState(TypedDict):
    task: str
    steps: Annotated[list[str], add]
    result: str

# Simple metrics
metrics = {"nodes_run": 0, "total_time": 0.0}

def timed_node(name: str):
    """Decorator to add timing to nodes."""
    def decorator(func):
        def wrapper(state):
            start = time.time()
            result = func(state)
            elapsed = time.time() - start
            
            metrics["nodes_run"] += 1
            metrics["total_time"] += elapsed
            
            print(f"  ✓ {name} ({elapsed:.3f}s)")
            return result
        return wrapper
    return decorator

@timed_node("analyze")
def analyze(state: AgentState) -> dict:
    time.sleep(0.1)  # Simulate work
    return {"steps": [f"Analyzed: {state['task']}"]}

@timed_node("process")
def process(state: AgentState) -> dict:
    time.sleep(0.2)  # Simulate work
    return {"steps": ["Processed data"]}

@timed_node("complete")
def complete(state: AgentState) -> dict:
    time.sleep(0.05)  # Simulate work
    return {"result": "Done!", "steps": ["Completed"]}

# Build graph
graph = StateGraph(AgentState)
graph.add_node("analyze", analyze)
graph.add_node("process", process)
graph.add_node("complete", complete)
graph.add_edge(START, "analyze")
graph.add_edge("analyze", "process")
graph.add_edge("process", "complete")
graph.add_edge("complete", END)

app = graph.compile(checkpointer=MemorySaver())

# Run with monitoring
if __name__ == "__main__":
    print("🚀 Running monitored agent...\n")
    config = {"configurable": {"thread_id": "monitored-run"}}
    
    result = app.invoke({
        "task": "Process important data",
        "steps": [],
        "result": ""
    }, config)
    
    # Report
    print("\n📊 Metrics:")
    print(f"  Nodes run: {metrics['nodes_run']}")
    print(f"  Total time: {metrics['total_time']:.3f}s")
    print(f"\n✅ Result: {result['result']}")
    print(f"📝 Steps: {result['steps']}")


In [None]:
# From: task_manager_challenge.py

"""
Chapter 15 Challenge: Persistent Task Manager Agent

Build a complete task management agent that demonstrates:
- State schemas with Pydantic validation (15.2)
- State management with reducers (15.3)
- SQLite persistence (15.4)
- Retry logic for external sync (15.5)
- Graceful failure handling (15.6)
- Monitoring and health checks (15.7)

Commands:
- add <title>      : Add a new task
- complete <id>    : Mark task as complete
- list             : Show all tasks
- stats            : Show task statistics
- history          : Show action history
- health           : Run health check
- quit             : Exit (tasks persist!)

Run this file, add some tasks, quit, run again - your tasks should still be there!
"""

from typing import TypedDict, Annotated, Optional
from datetime import datetime
from enum import Enum
from operator import add
import uuid
import time
import random

from pydantic import BaseModel, Field, field_validator
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.sqlite import SqliteSaver


# =============================================================================
# SECTION 1: Enums and Models (15.2 - State Schemas)
# =============================================================================

# TODO: Define TaskStatus enum with values: pending, in_progress, completed, failed
class TaskStatus(str, Enum):
    pass  # Your code here


# TODO: Define TaskPriority enum with values: low, medium, high, urgent
class TaskPriority(str, Enum):
    pass  # Your code here


# TODO: Define Task Pydantic model with validation
# Fields needed: id (str), title (str), description (str), status (TaskStatus), 
#                priority (TaskPriority), created_at (datetime)
# Add a validator that ensures title is not empty
class Task(BaseModel):
    """A task with validation."""
    pass  # Your code here


# =============================================================================
# SECTION 2: State Schema with Reducers (15.3 - State Transformations)
# =============================================================================

# TODO: Define a reducer function for accumulating tasks
# Hint: Should merge existing tasks with new tasks, updating if same ID exists
def task_reducer(existing: list[dict], new: list[dict]) -> list[dict]:
    """Merge task lists, updating existing tasks by ID."""
    pass  # Your code here


# TODO: Define TaskManagerState TypedDict with:
# - tasks: list of task dicts (use Annotated with task_reducer)
# - action_history: list of strings (use Annotated with add for accumulation)
# - last_error: optional string
# - pending_command: optional string
class TaskManagerState(TypedDict):
    pass  # Your code here


# =============================================================================
# SECTION 3: Monitoring (15.7 - Visualization and Monitoring)
# =============================================================================

class TaskMonitor:
    """Track operations and metrics."""
    
    def __init__(self):
        self.operations = []
        self.start_time = datetime.now()
    
    # TODO: Implement log_operation to record operations with timestamps
    def log_operation(self, operation: str, details: str = ""):
        """Log an operation with timestamp."""
        pass  # Your code here
    
    # TODO: Implement get_report to return formatted monitoring report
    def get_report(self) -> str:
        """Get monitoring report."""
        pass  # Your code here


# Global monitor instance
monitor = TaskMonitor()


# =============================================================================
# SECTION 4: Helper Functions
# =============================================================================

def create_task(title: str, description: str = "", priority: str = "medium") -> dict:
    """Create a new task with validation."""
    # TODO: Create and validate a Task using Pydantic, return as dict
    # Handle validation errors gracefully
    pass  # Your code here


def format_task(task: dict) -> str:
    """Format a task for display."""
    status_icons = {
        "pending": "⏳",
        "in_progress": "🔄",
        "completed": "✅",
        "failed": "❌"
    }
    priority_icons = {
        "low": "🔵",
        "medium": "🟡",
        "high": "🟠",
        "urgent": "🔴"
    }
    
    icon = status_icons.get(task.get("status", "pending"), "❓")
    pri = priority_icons.get(task.get("priority", "medium"), "⚪")
    
    return f"{icon} {pri} [{task['id'][:8]}] {task['title']}"


# =============================================================================
# SECTION 5: Graph Nodes (15.1 - State Management)
# =============================================================================

def parse_command(state: TaskManagerState) -> dict:
    """Parse the pending command and route appropriately."""
    # pending_command is optional, so guard against both a missing key and None
    command = state.get("pending_command") or ""
    
    # TODO: Parse command and return appropriate routing info
    # Commands: add, complete, list, stats, history, health
    # Return dict with parsed info for next node
    pass  # Your code here


def add_task_node(state: TaskManagerState) -> dict:
    """Add a new task to the state."""
    # TODO: Extract task info from pending_command
    # Create task, log operation, return state update
    # Remember: return {"tasks": [new_task_dict], "action_history": [...]}
    pass  # Your code here


def complete_task_node(state: TaskManagerState) -> dict:
    """Mark a task as completed."""
    # TODO: Find task by ID prefix, update status to completed
    # Handle case where task not found
    # Log operation, return state update
    pass  # Your code here


def list_tasks_node(state: TaskManagerState) -> dict:
    """Display all tasks."""
    tasks = state.get("tasks", [])
    
    if not tasks:
        print("\n📋 No tasks yet! Use 'add <title>' to create one.")
    else:
        print(f"\n📋 Tasks ({len(tasks)}):")
        print("-" * 40)
        for task in tasks:
            print(f"  {format_task(task)}")
    
    return {"action_history": [f"Listed {len(tasks)} tasks"]}


def stats_node(state: TaskManagerState) -> dict:
    """Show task statistics."""
    tasks = state.get("tasks", [])
    
    # TODO: Calculate and display statistics:
    # - Total tasks
    # - Tasks by status (pending, completed, etc.)
    # - Tasks by priority
    pass  # Your code here


def history_node(state: TaskManagerState) -> dict:
    """Show action history."""
    history = state.get("action_history", [])
    
    print(f"\n📜 Action History ({len(history)} actions):")
    print("-" * 40)
    for i, action in enumerate(history[-10:], 1):  # Last 10 actions
        print(f"  {i}. {action}")
    
    return {}


# =============================================================================
# SECTION 6: Retry Logic for External Sync (15.5 - Retry Logic)
# =============================================================================

def sync_tasks_node(state: TaskManagerState) -> dict:
    """
    Simulate syncing tasks to an external service.
    This demonstrates retry logic for transient failures.
    """
    # TODO: Implement retry logic with exponential backoff
    # Simulate a flaky external API (random failures)
    # Use max_retries=3, base_delay=0.5
    # On success: return success message in action_history
    # On failure after retries: return error in last_error, don't crash
    
    max_retries = 3
    base_delay = 0.5
    
    for attempt in range(max_retries):
        try:
            # Simulate flaky API (30% chance of failure)
            if random.random() < 0.3:
                raise ConnectionError("Sync service unavailable")
            
            # Success!
            task_count = len(state.get("tasks", []))
            monitor.log_operation("sync", f"Synced {task_count} tasks")
            print(f"  ✅ Synced {task_count} tasks to cloud")
            return {"action_history": [f"Synced {task_count} tasks successfully"]}
            
        except ConnectionError as e:
            # TODO: Implement exponential backoff with jitter
            # Log the retry attempt
            # If last attempt, handle gracefully (don't crash)
            pass  # Your code here
    
    # All retries failed
    return {
        "last_error": f"Sync failed after {max_retries} attempts",
        "action_history": ["Sync failed - will retry later"]
    }


# =============================================================================
# SECTION 7: Health Check (15.7 - Monitoring)
# =============================================================================

def health_check_node(state: TaskManagerState) -> dict:
    """Run health checks on the agent."""
    # TODO: Implement health checks that verify:
    # - State is accessible
    # - No recent errors
    # - Task counts are consistent
    # Print formatted health report
    pass  # Your code here


# =============================================================================
# SECTION 8: Graph Construction
# =============================================================================

def route_command(state: TaskManagerState) -> str:
    """Route to appropriate node based on command."""
    # Take the first word of the command; empty or whitespace-only input falls through to the default
    parts = (state.get("pending_command") or "").lower().split()
    command = parts[0] if parts else ""
    
    routes = {
        "add": "add_task",
        "complete": "complete_task",
        "list": "list_tasks",
        "stats": "stats",
        "history": "history",
        "health": "health_check",
        "sync": "sync_tasks",
    }
    
    return routes.get(command, "list_tasks")


def build_graph() -> StateGraph:
    """Build the task manager graph."""
    # TODO: Create StateGraph with TaskManagerState
    # Add all nodes: add_task, complete_task, list_tasks, stats, history, health_check, sync_tasks
    # Add conditional routing from START based on command
    # All nodes should route to END
    
    builder = StateGraph(TaskManagerState)
    
    # Add nodes
    # builder.add_node("add_task", add_task_node)
    # ... add other nodes
    
    # Add conditional entry point
    # builder.add_conditional_edges(START, route_command, {...})
    
    # Add edges to END
    # builder.add_edge("add_task", END)
    # ... add other edges
    
    pass  # Your code here - return builder.compile(checkpointer=...)


# =============================================================================
# SECTION 9: Main Loop with Persistence (15.4 - Checkpointing)
# =============================================================================

def main():
    """Main entry point with SQLite persistence."""
    print("🗂️  Task Manager Agent")
    print("=" * 40)
    print("Commands: add, complete, list, stats, history, health, sync, quit")
    print("Your tasks persist across restarts!")
    print("=" * 40)
    
    # TODO: Set up SQLite persistence
    # Hint: Use SqliteSaver and pass to graph compilation
    db_path = "task_manager.db"
    
    # TODO: Build graph with checkpointer
    # app = build_graph()
    
    # TODO: Set up config with thread_id for user isolation
    # Support multiple users by changing thread_id
    user_id = input("\nEnter your user ID (or press Enter for 'default'): ").strip() or "default"
    config = {"configurable": {"thread_id": f"user_{user_id}"}}
    
    print(f"\n👤 Logged in as: {user_id}")
    
    # Try to load existing state
    # TODO: Check if user has existing tasks and show count
    
    while True:
        try:
            command = input("\n> ").strip()
            
            if not command:
                continue
            
            if command.lower() == "quit":
                print("\n👋 Goodbye! Your tasks are saved.")
                break
            
            # TODO: Invoke graph with command
            # result = app.invoke({"pending_command": command}, config)
            
        except KeyboardInterrupt:
            print("\n\n👋 Goodbye! Your tasks are saved.")
            break
        except Exception as e:
            print(f"\n❌ Error: {e}")
            monitor.log_operation("error", str(e))


if __name__ == "__main__":
    main()
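
If the retry TODO in `sync_tasks_node` feels abstract, the sketch below shows one possible shape for exponential backoff with jitter, using only the standard library. The helper name `retry_with_backoff`, its injectable `sleep` parameter, and the 10% jitter factor are assumptions made for illustration; they are not part of the challenge template or of LangGraph.

```python
import random
import time


def retry_with_backoff(operation, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and try again.

    Returns (result, None) on success, or (None, error_message) once
    every attempt has failed. It never raises ConnectionError itself.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return operation(), None
        except ConnectionError as exc:
            last_error = str(exc)
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                delay = base_delay * (2 ** attempt)
                # Jitter (assumed here: up to 10% extra) spreads out retries
                delay += random.uniform(0, delay * 0.1)
                sleep(delay)
    return None, f"Failed after {max_retries} attempts: {last_error}"
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. Inside a node, you could map the `(result, error)` pair onto `action_history` and `last_error` so a failed sync never crashes the graph.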

---
### Section 15.7 Exercises

### Exercise 15.7.1: State Diff Viewer

Build a tool that:
- Compares two state snapshots
- Shows what changed between them (added, removed, modified)
- Formats the diff in a readable way
- Highlights significant changes
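
As a possible starting point, here is a minimal sketch that diffs two flat state dictionaries. The function names `diff_states` and `format_diff` and the `+`/`-`/`~` markers are assumptions chosen for this example; nested states and "significant change" highlighting are left for you to design.

```python
def diff_states(before: dict, after: dict) -> dict:
    """Compare two flat state snapshots key by key."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    modified = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "modified": modified}


def format_diff(diff: dict) -> str:
    """Render a diff as one line per changed key."""
    lines = []
    for key in sorted(diff["added"]):
        lines.append(f"+ {key}: {diff['added'][key]!r}")
    for key in sorted(diff["removed"]):
        lines.append(f"- {key}: {diff['removed'][key]!r}")
    for key in sorted(diff["modified"]):
        old, new = diff["modified"][key]
        lines.append(f"~ {key}: {old!r} -> {new!r}")
    return "\n".join(lines) if lines else "(no changes)"
```

The snapshots can be any two state dicts, for example the input you passed to `app.invoke` and the result it returned.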

In [None]:
# Your code here


### Exercise 15.7.2: Performance Dashboard

Create a monitoring dashboard that tracks:
- Node execution counts
- Average time per node
- Success/failure rates per node
- Slowest nodes ranking
- Print a formatted report after each run
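
One way to begin is a small collector that wraps node functions and records their timings. The class name `NodeStats` and its methods are hypothetical, a sketch under the assumption that each node is a plain function taking the state; formatting the printed report is left to you.

```python
import time
from collections import defaultdict


class NodeStats:
    """Collect per-node durations and failure counts."""

    def __init__(self):
        self.durations = defaultdict(list)
        self.failures = defaultdict(int)

    def record(self, node: str, seconds: float, ok: bool = True) -> None:
        self.durations[node].append(seconds)
        if not ok:
            self.failures[node] += 1

    def timed(self, node: str, func):
        """Wrap a node function so every call is timed and its outcome recorded."""
        def wrapper(state):
            start = time.perf_counter()
            try:
                result = func(state)
            except Exception:
                self.record(node, time.perf_counter() - start, ok=False)
                raise
            self.record(node, time.perf_counter() - start)
            return result
        return wrapper

    def summary(self) -> list[dict]:
        """Return one row per node, slowest average first."""
        rows = []
        for node, times in self.durations.items():
            runs = len(times)
            rows.append({
                "node": node,
                "runs": runs,
                "avg_time": sum(times) / runs,
                "failure_rate": self.failures[node] / runs,
            })
        return sorted(rows, key=lambda r: r["avg_time"], reverse=True)
```

You would register a wrapped node in place of the original, for example `graph.add_node("increment", stats.timed("increment", increment))`, where `stats` is a `NodeStats` instance.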

In [None]:
# Your code here


### Exercise 15.7.3: Alert System

Build a simple alerting system that:
- Monitors metrics against thresholds
- Triggers alerts when thresholds are exceeded (e.g., error rate > 10%)
- Tracks alert history
- Supports different severity levels (warning, critical)
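
The sketch below shows one possible skeleton for threshold-based alerts. `AlertRule` and `AlertManager` are hypothetical names, and the rule model (alert when a metric is strictly greater than its threshold) is an assumption; your own design may need other comparisons or deduplication.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AlertRule:
    metric: str        # key to look up in the metrics dict
    threshold: float   # alert when the value is strictly greater than this
    severity: str      # "warning" or "critical"


@dataclass
class AlertManager:
    rules: list
    history: list = field(default_factory=list)

    def check(self, metrics: dict) -> list:
        """Evaluate every rule against the metrics and return the alerts triggered now."""
        triggered = []
        for rule in self.rules:
            value = metrics.get(rule.metric)
            if value is not None and value > rule.threshold:
                triggered.append({
                    "metric": rule.metric,
                    "value": value,
                    "threshold": rule.threshold,
                    "severity": rule.severity,
                    "time": datetime.now(),
                })
        # Keep every alert ever raised so the history can be inspected later
        self.history.extend(triggered)
        return triggered
```

For the example in the exercise, a rule such as `AlertRule("error_rate", 0.10, "warning")` fires whenever the reported error rate exceeds 10%.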

In [None]:
# Your code here


---
## Next Steps

- Check your answers in **chapter_15_stateful_agents_solutions.ipynb**
- Proceed to **Chapter 16**