<h1 align="center">Welcome to the InboxOps Agent Framework Demo</h1>

<p align="start">
InboxOps is an e-commerce operations company handling thousands of inbound support emails per day.
In this notebook, we'll build a production-minded multi-agent Support Email Copilot using the
<strong>Microsoft Agent Framework (Python SDK)</strong>.
</p>

---

## The InboxOps Problem

InboxOps started with a simple goal:

✅ Reply faster  
✅ Reduce agent workload  
✅ Maintain consistent tone and quality  
✅ Avoid risky or incorrect customer promises  

But as volume increased, we discovered that **a single LLM prompt is not enough**.

So we evolved the system step by step:

**V0 → V1 → V2 → Production-Ready Multi-Agent Workflows**

---

## What is an Agent?

![What is an Agent](images/what-is-agent.png)

Unlike traditional LLM deployments that simply respond to prompts, agents follow the **ReAct pattern** (Reasoning + Acting):

| Traditional LLM | Agent (ReAct) |
|-----------------|---------------|
| Input → Output | Input → Reason → Act → Observe → Repeat |
| Single response | Multi-step execution |
| No tool access | Tool integration |
| Stateless | Memory & context |

Agents autonomously decide *what* to do, *which* tools to use, and *when* to stop.
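
The Reason → Act → Observe loop in the table above can be sketched in plain Python. This is a toy illustration, not the Agent Framework's implementation: `scripted_policy` stands in for the LLM's reasoning step, and `TOOLS`, `run_react_loop`, and the order data are hypothetical names invented for this example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"Order {order_id}: marked delivered",
}

def scripted_policy(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: choose the next action from what has been observed."""
    if not observations:
        return ("lookup_order", "12345")  # Reason: order data is needed first
    return ("finish", f"Draft reply based on: {observations[-1]}")

def run_react_loop(max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):  # hard cap so the loop always terminates
        action, argument = scripted_policy(observations)  # Reason
        if action == "finish":
            return argument  # the agent decides when to stop
        observations.append(TOOLS[action](argument))  # Act, then Observe
    return "Stopped: step limit reached"

answer = run_react_loop()
print(answer)
```

The step cap matters in practice: without it, a policy that never chooses `finish` would loop indefinitely.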

---

## Workflows & Multi-Agent Orchestration

![Workflow Example](images/workflow-example.png)

Complex tasks require coordination between multiple specialized agents. The Agent Framework provides workflow primitives:

- **Sequential** — Agents execute in order (A → B → C)
- **Parallel (Fan-out/Fan-in)** — Concurrent execution with result aggregation
- **Branching** — Conditional routing based on outputs
- **Group Chat** — Collaborative multi-agent discussions
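
To make the first three patterns concrete, here is a minimal sketch using plain `asyncio` coroutines in place of agents. It does not use the Agent Framework's workflow API; `classify`, `draft`, and `review` are hypothetical stand-ins, and group chat is left out because it needs a real model to be meaningful.

```python
import asyncio

async def classify(email: str) -> str:
    return "spam" if "prize" in email.lower() else "not_spam"

async def draft(email: str) -> str:
    return f"Draft reply to: {email}"

async def review(text: str) -> str:
    return f"[reviewed] {text}"

async def sequential(email: str) -> str:
    # A -> B: each step consumes the previous step's output.
    return await review(await draft(email))

async def fan_out_fan_in(email: str) -> dict[str, str]:
    # Fan-out: run independent steps concurrently; fan-in: collect both results.
    category, reply = await asyncio.gather(classify(email), draft(email))
    return {"category": category, "reply": reply}

async def branching(email: str) -> str:
    # Conditional routing based on an earlier step's output.
    if await classify(email) == "spam":
        return "blocked"
    return await sequential(email)

print(asyncio.run(sequential("Where is my order?")))
print(asyncio.run(fan_out_fan_in("Where is my order?")))
print(asyncio.run(branching("Claim your PRIZE now")))
```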

---

## Demo Overview

We'll build the **InboxOps Support Email Copilot** that demonstrates core framework concepts:

| Section | Concept |
|---------|---------|
| 1-2 | V0: Basic Agent & Streaming |
| 3-4 | V1: Threads & Tools |
| 5-7 | V2: Approvals, Middleware, Memory |
| 8-10 | Workflows: Sequential, Branching, Parallel |
| 11-12 | Multi-Agent: Group Chat & Magentic |

---

## Prerequisites

- Azure subscription with Azure OpenAI access
- Azure OpenAI resource with deployed model (e.g., `gpt-4o-mini`)
- Azure CLI installed and authenticated (`az login`)
- Python 3.10+

# Environment Setup (InboxOps Internal Dev Environment)

InboxOps is prototyping a Support Email Copilot using Azure OpenAI and the Microsoft Agent Framework.

This notebook assumes:
- You have access to an Azure OpenAI resource
- A model deployment exists (example: `gpt-4o-mini`)
- You can authenticate with Azure CLI (`az login`)
- Python 3.10 or later is installed

> The goal is to keep the demo reproducible for developers and consistent across environments.

## Create Virtual Environment

Run the following in your terminal to set up the environment:

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Alternatively, run the cell below. It finds a Python 3.10+ interpreter, creates `.venv`, and installs the dependencies, including pre-release packages.

In [None]:
# Create and configure the virtual environment (run once)
import subprocess
import shutil

def find_python():
    """Find a Python 3.10+ interpreter on the system."""
    # Check common Python commands in order of preference
    candidates = [
        "python3.13", "python3.12", "python3.11", "python3.10",
        "python3", "python"
    ]
    
    for cmd in candidates:
        path = shutil.which(cmd)
        if path:
            # Verify version is 3.10+
            try:
                result = subprocess.run(
                    [path, "-c", "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')"],
                    capture_output=True, text=True
                )
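                version = result.stdout.strip()
                major, minor = map(int, version.split('.'))
                # Compare as a tuple so a future 4.x release also qualifies
                if (major, minor) >= (3, 10):
                    return path, version
            except (OSError, ValueError):
                # Interpreter failed to run or printed something unexpected
                continue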
    
    raise RuntimeError("No Python 3.10+ found. Please install Python 3.10 or higher.")

# Find suitable Python
python_path, python_version = find_python()
print(f"✅ Found Python {python_version}: {python_path}")

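# Create .venv (check=True stops the cell if creation fails)
subprocess.run([python_path, "-m", "venv", ".venv"], check=True)

# Install requirements with pre-release flag
# Note: .venv/bin/pip is the POSIX layout; on Windows use .venv\Scripts\pip.exe
subprocess.run([".venv/bin/pip", "install", "-r", "requirements.txt", "--pre"], check=True)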

print("\n✅ Virtual environment created at .venv")
print("   Activate with: source .venv/bin/activate")

## Initialize the InboxOps Chat Client

We create **one shared Azure OpenAI client** and reuse it across the entire notebook.

This mirrors how InboxOps would run a long-lived backend service:
- The service initializes once
- Agents are created from the same client
- Tool calls, workflows, memory, and orchestration all share the same foundation

In [None]:
from agent_framework_azure_ai import AzureAIAgentClient
import nest_asyncio
nest_asyncio.apply()

import asyncio
from dotenv import load_dotenv
from azure.identity import AzureCliCredential
from agent_framework.azure import AzureOpenAIChatClient

# Load environment variables
load_dotenv()

# Create the shared chat client - reused throughout the notebook
chat_client = AzureOpenAIChatClient(credential=AzureCliCredential())
# Second client, used only by the MCP section (4.3)
chat_client_mcp = AzureAIAgentClient(credential=AzureCliCredential())

print("✅ Environment loaded and chat_client created")

## Data Models (InboxOps Message Contracts)

InboxOps wants predictable, structured outputs—not messy free-text.

We define Pydantic schemas used across the system:
- Incoming email structure (`EmailInput`)
- Classification outputs (`ClassificationResult`)
- Draft response formats (`DraftResponse`)
- Final approval structure (`FinalResponse`)

> These schemas represent the "API contracts" between our agents, tools, and workflows.

In [None]:
from typing import Literal, Annotated
from pydantic import BaseModel, Field

# === Input Model ===
class EmailInput(BaseModel):
    """Incoming support email."""
    sender: str = Field(description="Email sender address")
    subject: str = Field(description="Email subject line")
    body: str = Field(description="Email body content")
    customer_id: str | None = Field(default=None, description="Customer ID if known")
    ticket_id: str | None = Field(default=None, description="Related ticket ID if any")

# === Classification Model ===
class ClassificationResult(BaseModel):
    """Result of email classification."""
    category: Literal["spam", "not_spam", "uncertain"] = Field(description="Email category")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence score 0-1")
    reason: str = Field(description="Brief explanation of classification")

# === Draft Response Model ===
class DraftResponse(BaseModel):
    """Draft reply to customer email."""
    subject: str = Field(description="Reply subject line")
    body: str = Field(description="Reply body")
    tone: Literal["formal", "friendly", "apologetic"] = Field(description="Tone used")
    needs_review: bool = Field(default=False, description="Flag if needs human review")

# === Final Response Model ===
class FinalResponse(BaseModel):
    """Final approved response."""
    classification: ClassificationResult
    draft: DraftResponse | None = Field(default=None, description="Draft if not spam")
    review_notes: str | None = Field(default=None, description="Reviewer comments")
    approved: bool = Field(default=False, description="Whether approved to send")

print("✅ Shared models defined: EmailInput, ClassificationResult, DraftResponse, FinalResponse")

## Sample InboxOps Emails

We'll use three realistic email types to simulate real inbox traffic:

✅ Legitimate Customer Issue — should generate a helpful response  
🚫 Spam Message — should be blocked  
⚠️ Ambiguous Request — should be routed for human review  

> This is exactly what InboxOps sees daily at scale.

In [None]:
# === LEGITIMATE EMAIL ===
LEGIT_EMAIL = EmailInput(
    sender="sarah.chen@acmecorp.com",
    subject="Order #12345 - Delivery Issue",
    body="""Hi Support Team,

I placed order #12345 last week and the tracking shows it was delivered, 
but I never received the package. I've checked with my neighbors and the building 
concierge, but no one has seen it.

This is urgent as the items were needed for a client presentation on Friday.
Can you please help me locate the package or arrange a replacement?

Thank you,
Sarah Chen
Account: ACME-7891
""",
    customer_id="CUST-7891",
    ticket_id="TKT-2024-001"
)

# === SPAM EMAIL ===
SPAM_EMAIL = EmailInput(
    sender="winner@prize-notifications.biz",
    subject="🎉 CONGRATULATIONS! You've WON $1,000,000!!!",
    body="""URGENT NOTIFICATION!!!

You have been selected as the WINNER of our international lottery!
To claim your $1,000,000 prize, simply send your bank details and 
a processing fee of $500 to unlock your winnings.

ACT NOW - This offer expires in 24 HOURS!!!

Click here to claim: http://totally-legit-prize.com/claim
""",
    customer_id=None,
    ticket_id=None
)

# === AMBIGUOUS EMAIL ===
AMBIGUOUS_EMAIL = EmailInput(
    sender="j.smith@unknown-domain.net",
    subject="Partnership Opportunity",
    body="""Hello,

I found your company online and I'm interested in discussing a potential 
business partnership. We have a new product line that might complement your services.

Can we schedule a call this week?

Best,
J. Smith
""",
    customer_id=None,
    ticket_id=None
)

print("✅ Sample emails defined: LEGIT_EMAIL, SPAM_EMAIL, AMBIGUOUS_EMAIL")

# 1. V0 — A Single Support Agent

![Agent Components](images/agent-components.png)

InboxOps started with the simplest solution:

**One agent that reads an email and drafts a reply.**

This already provides huge value:
- Faster draft creation
- More consistent tone
- Reduced repetitive typing for support reps

But this is still "V0":
- No streaming UX
- No tools
- No multi-turn context
- No approvals or governance

In [None]:
# Create the core Support Agent - we'll enhance this throughout the notebook
support_agent = chat_client.as_agent(
    name="SupportAgent",
    instructions="""You are a helpful customer support agent for an e-commerce company.
Your job is to:
1. Understand customer issues from their emails
2. Draft professional, empathetic responses
3. Provide clear next steps when possible

Always be polite, acknowledge the customer's frustration, and offer concrete solutions."""
)

print("✅ support_agent created")

## Run the SupportAgent

This is the InboxOps baseline:

**Input:** customer email  
**Output:** draft reply

At this stage, we're validating:
- Can the agent understand the issue?
- Does it respond empathetically?
- Are next steps clear and actionable?

In [None]:
# Run the support agent on our legitimate email
async def run_basic_agent():
    prompt = f"""Please draft a response to this customer email:

From: {LEGIT_EMAIL.sender}
Subject: {LEGIT_EMAIL.subject}

{LEGIT_EMAIL.body}
"""
    result = await support_agent.run(prompt)
    print("📧 Draft Response:\n")
    print(result.text)

asyncio.run(run_basic_agent())

# 2. V0.1 — Streaming Responses (Real-Time UX)

InboxOps support reps don't want to wait for a full answer.

They want a **live drafting experience**:
- The response appears token-by-token
- It feels interactive, like a "Copilot"
- Faster perceived performance

Streaming is not just cosmetic—it's a product requirement when humans are in the loop.
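
The consumption pattern behind streaming can be sketched without any model at all. In this minimal example, `fake_token_stream` is a hypothetical async generator that stands in for a streaming model; the point is only that the caller renders each chunk as it arrives instead of waiting for the full reply.

```python
import asyncio
from typing import AsyncIterator

async def fake_token_stream(text: str, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)  # simulate generation latency
        yield word + " "

async def render_stream(text: str) -> str:
    shown = []
    async for chunk in fake_token_stream(text):
        print(chunk, end="", flush=True)  # display each chunk as soon as it arrives
        shown.append(chunk)
    print()
    return "".join(shown)

full_text = asyncio.run(render_stream("Hi Sarah, we are looking into order #12345."))
```

The total time is unchanged, but the first words are visible after one delay rather than after all of them, which is where the perceived speed-up comes from.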

In [None]:
# Stream the response token by token using the SAME support_agent
async def stream_support_response():
    prompt = f"""Please draft a response to this customer email:

From: {LEGIT_EMAIL.sender}
Subject: {LEGIT_EMAIL.subject}

{LEGIT_EMAIL.body}
"""
    print("📧 Streaming Draft Response:\n")
    async for update in support_agent.run_stream(prompt):
        if update.text:
            print(update.text, end="", flush=True)
    print()  # New line after streaming

asyncio.run(stream_support_response())

# 3. V1 — Multi-Turn Conversations with Threads

![Threads and Memory](images/threads-and-memory.png)

InboxOps quickly discovered a real-world problem:

Customers don't send only one email.

They follow up:
- "Any updates?"
- "This is urgent"
- "I already tried that"

By default, agents are stateless.
So InboxOps introduced **Threads** to preserve context across multiple turns.

✅ The agent can summarize first  
✅ Then draft a response using the summary  
✅ And continue the conversation coherently

## Using Threads

Create a thread with `agent.get_new_thread()` and pass it to each call.
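
Conceptually, a thread is an ordered message history that travels with each call. The sketch below illustrates that idea only; `ToyThread` and `toy_agent_run` are hypothetical and say nothing about how the framework's thread object is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class ToyThread:
    """Hypothetical stand-in for a thread: an ordered message history."""
    messages: list[dict[str, str]] = field(default_factory=list)

def toy_agent_run(prompt: str, thread: ToyThread) -> str:
    thread.messages.append({"role": "user", "content": prompt})
    # A real model would receive the whole history; here we only report its size.
    prior_turns = sum(1 for m in thread.messages if m["role"] == "assistant")
    reply = f"(reply using {prior_turns} earlier turn(s) of context)"
    thread.messages.append({"role": "assistant", "content": reply})
    return reply

thread_demo = ToyThread()
print(toy_agent_run("Summarize the email.", thread_demo))
print(toy_agent_run("Now draft a response.", thread_demo))  # sees turn 1
print(toy_agent_run("Now draft a response.", ToyThread()))  # fresh thread: no context
```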

In [None]:
# Create a thread for multi-turn conversation
thread = support_agent.get_new_thread()

# Turn 1: Summarize the customer issue
print("Turn 1: Summarize the issue")
print("-" * 50)
result1 = await support_agent.run(
    f"Summarize the key issues in this email in 2-3 bullet points:\n\n{LEGIT_EMAIL.body}", 
    thread=thread
)
print(result1.text)
print()

# Turn 2: Draft a response (agent remembers the summary from Turn 1)
print("Turn 2: Draft response with professional tone")
print("-" * 50)
result2 = await support_agent.run(
    "Now draft a professional response addressing each of those issues. Use a formal but empathetic tone.",
    thread=thread
)
print(result2.text)

# 4. V1.1 — Tools: Connecting InboxOps Internal Systems

A drafting agent is helpful…
but a production support assistant must also be **correct**.

InboxOps needs the agent to reference real internal data, not guess.

Examples:
- SLA tier (Premium vs Standard)
- Current ticket status (Open/Resolved)
- Prior actions already taken

So we expose internal functions as tools using `@tool`.

The agent will autonomously decide when tool calls are needed.

## Define InboxOps Tools

In a real InboxOps environment these tools would call:
- CRM systems
- ticketing platforms
- order management databases

For this demo, we simulate internal systems using in-memory dictionaries.

> The key point: the Agent Framework turns Python functions into callable tools.
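
To see roughly what "turning a function into a tool" involves, the sketch below derives a tool description from a function's signature, `Annotated` metadata, and docstring using only the standard library. It is an illustration of the idea, not the framework's actual mechanism; `describe_tool` and `lookup_order_status` are hypothetical names.

```python
import inspect
from typing import Annotated, Any, get_args, get_origin, get_type_hints

def describe_tool(fn) -> dict[str, Any]:
    """Build a simple tool description from a function's signature and docstring."""
    hints = get_type_hints(fn, include_extras=True)  # keep Annotated metadata
    params = {}
    for name in inspect.signature(fn).parameters:
        hint = hints.get(name, str)
        if get_origin(hint) is Annotated:
            base, *meta = get_args(hint)
            params[name] = {"type": base.__name__, "description": str(meta[0])}
        else:
            params[name] = {"type": getattr(hint, "__name__", str(hint)), "description": ""}
    return {"name": fn.__name__, "description": inspect.getdoc(fn) or "", "parameters": params}

def lookup_order_status(order_id: Annotated[str, "Order number, e.g. 12345"]) -> str:
    """Return the shipping status for an order (hypothetical example tool)."""
    return f"Order {order_id}: delivered"

spec = describe_tool(lookup_order_status)
print(spec)
```

A description like this is what lets a model choose a tool and fill in its arguments, which is why clear parameter descriptions are worth the effort.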

In [None]:
from agent_framework import tool
# Simulated database of customer SLAs
CUSTOMER_SLAS = {
    "CUST-7891": {"tier": "Premium", "response_time": "4 hours", "replacement_policy": "Free expedited replacement"},
    "CUST-1234": {"tier": "Standard", "response_time": "24 hours", "replacement_policy": "Standard replacement"},
}

# Simulated ticket database
TICKET_STATUSES = {
    "TKT-2024-001": {"status": "Open", "priority": "High", "assigned_to": "Support Team", "last_update": "2024-01-15"},
    "TKT-2024-002": {"status": "Resolved", "priority": "Low", "assigned_to": "Bot", "last_update": "2024-01-10"},
}

@tool(name="lookup_customer_sla", description="Look up a customer's SLA tier and policies")
def lookup_customer_sla(
    customer_id: Annotated[str, Field(description="The customer ID to look up (e.g., CUST-7891)")]
) -> str:
    """Look up customer SLA information."""
    if customer_id in CUSTOMER_SLAS:
        sla = CUSTOMER_SLAS[customer_id]
        return f"Customer {customer_id}: {sla['tier']} tier, {sla['response_time']} response time, {sla['replacement_policy']}"
    return f"Customer {customer_id} not found in system."

@tool(name="get_incident_status", description="Get the current status of a support ticket")
def get_incident_status(
    ticket_id: Annotated[str, Field(description="The ticket ID to check (e.g., TKT-2024-001)")]
) -> str:
    """Get ticket status information."""
    if ticket_id in TICKET_STATUSES:
        ticket = TICKET_STATUSES[ticket_id]
        return f"Ticket {ticket_id}: Status={ticket['status']}, Priority={ticket['priority']}, Assigned to={ticket['assigned_to']}, Last update={ticket['last_update']}"
    return f"Ticket {ticket_id} not found in system."

print("✅ Support tools defined: lookup_customer_sla, get_incident_status")

## Attach Tools to Agent

Pass tools when creating the agent.

In [None]:
# Create support agent with tools
support_agent_with_tools = chat_client.as_agent(
    name="SupportAgentWithTools",
    instructions="""You are a customer support agent with access to internal systems.
When handling emails:
1. Look up the customer's SLA tier to understand their service level
2. Check ticket status if a ticket ID is mentioned
3. Use this information to provide appropriate responses and set expectations

Always be empathetic and use the customer's SLA tier to guide your response (e.g., Premium customers get expedited service).""",
    tools=[lookup_customer_sla, get_incident_status]
)

print("✅ support_agent_with_tools created")

## Execute with Tools

The agent autonomously decides when to invoke tools.

In [None]:
# Test with the legitimate email that has customer_id and ticket_id
prompt = f"""Handle this customer support email. Look up their SLA and ticket status first:

From: {LEGIT_EMAIL.sender}
Subject: {LEGIT_EMAIL.subject}
Customer ID: {LEGIT_EMAIL.customer_id}
Ticket ID: {LEGIT_EMAIL.ticket_id}

{LEGIT_EMAIL.body}
"""

result = await support_agent_with_tools.run(prompt)
print("📧 Response (with tool lookups):\n")
print(result.text)

# 4.1. V1.2 — Multimodal Input (InboxOps Visual Support)

## The Problem: Customers Send Screenshots

InboxOps customers often attach **error screenshots** instead of describing problems in text:

> "My checkout isn't working" + 🖼️ `error_screenshot.png`

Our agents need to understand images, PDFs, and attachments to provide accurate support.

## Solution: Multimodal Content

The Agent Framework supports multimodal input through `Content` objects, which can wrap text or binary data such as images.

Let's enable our Support Agent to handle customer screenshots.


In [None]:
# Create a specialized Multimodal Support Agent for handling visual issues
multimodal_support_agent = chat_client.as_agent(
    name="MultimodalSupportAgent",
    instructions="""You are a specialized customer support agent with expertise in visual issue diagnosis.

IMPORTANT: When you receive an image, you MUST:
1. Acknowledge that you can see the image
2. Describe what you observe in the screenshot
3. Identify any error messages visible
4. Provide specific troubleshooting steps based on what you see

Your responsibilities:
- Analyze both textual descriptions and visual evidence (screenshots, images)
- Identify the exact error or problem from the visual content
- Provide step-by-step resolution instructions
- Consider visual context when recommending solutions
- Prioritize urgent issues and offer temporary workarounds

Be empathetic, solution-focused, and clear in your guidance."""
)

print("✅ multimodal_support_agent created")

# Load the customer's screenshot from local images folder
image_path = "images/customer_image.png"

print("\n📧 Email with screenshot received...")
print(f"📎 Attachment: {image_path}")

# Load the image from file
with open(image_path, "rb") as f:
    image_bytes = f.read()
print("\n" + "="*80 + "\n")

# Create a multimodal message with the local image
from agent_framework import ChatMessage, Content, Role

multimodal_message = ChatMessage(
    role=Role.USER,
    contents=[
        Content.from_text(text="What error do you see in this checkout screenshot? Describe the issue and provide troubleshooting steps."),
        Content.from_data(data=image_bytes, media_type="image/png")
    ]
)

# Run the specialized multimodal support agent - pass as messages list
print("🤖 Multimodal Support Agent analyzing email and screenshot...\n")
result = await multimodal_support_agent.run(messages=[multimodal_message])
print(result.text)

# 4.2. V1.3 — Structured Output (InboxOps Ticket Metadata)

## The Problem: Downstream Systems Need JSON

After the agent drafts a response, InboxOps needs to:
- Create a ticket in the CRM with structured metadata
- Log priority, category, sentiment
- Route to the correct team

**Free-text agent output is hard to parse reliably.**
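
A small standard-library sketch shows the difference. The field names and the `parse_ticket_metadata` helper are hypothetical and simplified; the point is that JSON with an agreed vocabulary can be checked mechanically, while free text cannot.

```python
import json

ALLOWED_PRIORITIES = {"low", "medium", "high", "urgent"}

def parse_ticket_metadata(raw: str) -> dict:
    """Parse model output as JSON and reject values outside the agreed vocabulary."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on free text
    if not isinstance(data, dict) or data.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"unexpected or missing priority in: {raw!r}")
    return data

structured = '{"priority": "high", "category": "shipping"}'
free_text = "This looks fairly urgent, probably a shipping problem."

print(parse_ticket_metadata(structured))
try:
    parse_ticket_metadata(free_text)
except ValueError as exc:
    print(f"Cannot route free text: {exc}")
```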

## Solution: Structured Output with Pydantic

Use `response_format` to enforce a JSON schema:


In [None]:
from pydantic import BaseModel

class TicketMetadata(BaseModel):
    priority: str  # "low", "medium", "high", "urgent"
    category: str  # "order", "refund", "technical", etc.
    sentiment: str  # "positive", "neutral", "negative"
    estimated_resolution_time: str

agent = chat_client.as_agent(
    name="TicketMetadataExtractor",
    response_format=TicketMetadata  # Force structured output
)

Let's extract ticket metadata automatically.

In [None]:
from pydantic import BaseModel, Field

# Define ticket metadata schema
class TicketMetadata(BaseModel):
    """Structured metadata for InboxOps support tickets"""
    priority: str = Field(description="Priority level: low, medium, high, or urgent")
    category: str = Field(description="Ticket category: order, refund, technical, shipping, account, other")
    sentiment: str = Field(description="Customer sentiment: positive, neutral, or negative")
    estimated_resolution_time: str = Field(description="Estimated time to resolve (e.g., '1 hour', '24 hours', '3-5 days')")
    requires_human_review: bool = Field(description="Whether this ticket needs escalation to a human agent")

# Create a metadata extraction agent
metadata_agent = chat_client.as_agent(
    name="TicketMetadataExtractor",
    instructions="""You are an InboxOps ticket classification system.
    Extract structured metadata from customer support emails.
    Be accurate and consistent with your classifications.
    
    You must return JSON with these exact fields:
    - priority: low, medium, high, or urgent
    - category: order, refund, technical, shipping, account, or other
    - sentiment: positive, neutral, or negative
    - estimated_resolution_time: estimated time like "1 hour", "24 hours", "3-5 days"
    - requires_human_review: true or false""",
    response_format=TicketMetadata  # Enforce structured output
)

# Test with the legitimate email
test_email = LEGIT_EMAIL.body

print("📧 Extracting metadata from email...\n")
result = await metadata_agent.run(test_email)

# Debug: Show what the agent returned
print("🔍 Raw agent output:")
print(result.text)
print()

# Parse the structured output
metadata = TicketMetadata.model_validate_json(result.text)

print("📊 TICKET METADATA")
print("="*50)
print(f"Priority:              {metadata.priority}")
print(f"Category:              {metadata.category}")
print(f"Sentiment:             {metadata.sentiment}")
print(f"Est. Resolution Time:  {metadata.estimated_resolution_time}")
print(f"Needs Human Review:    {metadata.requires_human_review}")
print("\n✅ Structured output ready for CRM ingestion!")

# 4.3. V1.4 — MCP Integration (InboxOps External Tool Connections)

## The Problem: Need to Connect External Systems

InboxOps uses **Zendesk** for ticketing, **Shopify** for orders, and **Stripe** for payments.

Instead of building custom API wrappers for each system, we can use **Model Context Protocol (MCP)** to connect agents to external tools.

## Solution: MCP Tools

MCP provides a standardized way to expose tools from external systems.
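
Under the hood, MCP messages are JSON-RPC 2.0, and a tool invocation is a `tools/call` request carrying the tool name and its arguments. The sketch below only builds such a message to show its shape; it does not open a connection, and `get_ticket` is a hypothetical tool name used for illustration.

```python
import json

def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP-style JSON-RPC 2.0 request that invokes one tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# "get_ticket" is a hypothetical tool name, not part of any real server.
request = build_tool_call(1, "get_ticket", {"ticket_id": "TKT-2024-001"})
print(request)
```

Because every server speaks the same message format, the agent side needs no per-system wrapper code.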


This demo does not connect to Zendesk itself. As a stand-in, we connect to the public Microsoft Learn MCP server, which exercises the same integration path.

In [None]:
from agent_framework import ChatAgent, HostedMCPTool, ChatMessage

# Create the MCP tool with auto-approval
learn_mcp_tool = HostedMCPTool(
    name="MicrosoftLearn",
    url="https://learn.microsoft.com/api/mcp",
    approval_mode="never_require"  # Auto-approve MCP tool calls
)

# Create the agent with the MCP tool
mcp_support_agent = ChatAgent(
    chat_client=chat_client_mcp,
    name="MCPSupportAgent",
    instructions="""You are a documentation assistant agent with access to Microsoft Learn documentation via MCP. 
When asked about Azure features, you MUST use the MCP tool to search for information.""",
    tools=[learn_mcp_tool],
)

# Test: Ask a very specific recent question that requires the MCP tool
test_request = """
A customer is asking: "What are the latest Azure AI Foundry features announced in January 2026?"

You MUST use the MCP tool to search for this information.
"""

print("🔌 MCP Agent with Microsoft Learn tool connection...")
print(f"Request: {test_request}\n")

# Run with auto-approval loop
from agent_framework import AgentThread

thread = AgentThread()
max_approvals = 5  # Safety limit
approval_count = 0

result = await mcp_support_agent.run(test_request, thread=thread)

# Handle approval requests automatically
while approval_count < max_approvals:
    # Check for approval requests
    has_approval_request = False
    approval_responses = []
    
    if hasattr(result, 'messages') and result.messages:
        for msg in result.messages:
            if hasattr(msg, 'contents'):
                for content in msg.contents:
                    if content.type == "function_approval_request":
                        has_approval_request = True
                        # Auto-approve by converting to approval response
                        approval_response = content.to_function_approval_response(approved=True)
                        tool_name = content.function_call.name if hasattr(content, 'function_call') and content.function_call else 'unknown'
                        print(f"🔐 Auto-approving MCP tool call: {tool_name}")
                        approval_responses.append(approval_response)
    
    if not has_approval_request:
        break
    
    # Continue the conversation with approval responses wrapped in a message
    approval_message = ChatMessage(role="tool", contents=approval_responses)
    result = await mcp_support_agent.run(
        [approval_message],
        thread=thread
    )
    approval_count += 1

# Debug output
print(f"\n🔍 Total approvals granted: {approval_count}")
print(f"   Result text length: {len(result.text) if result.text else 0}")

print("\n📝 Agent Response:")
print("="*60)
print(result.text if result.text else "(empty response)")
print("\n" + "="*60)

# 5. V2 — Human-in-the-Loop Approval (InboxOps Safety Gate)

Drafting is safe.

**Sending an email is not.**

InboxOps policy:

✅ AI may draft responses  
🔒 A human must approve before sending  

So we mark the sending tool as approval-required.

This creates a safety mechanism:
- The agent can propose the action
- The platform pauses execution
- A human confirms or rejects
- Only then can the workflow continue
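
The four steps above amount to separating "propose" from "execute". The following sketch shows that separation in plain Python; `PendingAction`, `propose_send_email`, and `resolve` are hypothetical names, and the framework's own approval objects differ.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PendingAction:
    """An action the agent proposed but that has not run yet."""
    name: str
    arguments: dict
    execute: Callable[..., str]

def propose_send_email(to: str, subject: str) -> PendingAction:
    # Nothing is sent here; the proposal only captures what would happen.
    return PendingAction(
        "send_email",
        {"to": to, "subject": subject},
        lambda to, subject: f"Email sent to {to}: {subject}",
    )

def resolve(action: PendingAction, approved: bool) -> str:
    if not approved:
        return f"Rejected: {action.name} was not executed"
    return action.execute(**action.arguments)  # runs only after a human says yes

pending = propose_send_email("sarah.chen@acmecorp.com", "Re: Order #12345")
print(resolve(pending, approved=False))
print(resolve(pending, approved=True))
```

Keeping the arguments on the pending action lets the reviewer see exactly what will be sent before approving it.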

## Approval-Required Action Tool

We treat sending a reply as a sensitive business action.

We set:

`approval_mode="always_require"`

This ensures:
- No accidental customer emails
- No legal/compliance surprises
- Brand safety for InboxOps

In [None]:
from agent_framework import ChatMessage, Content, Role

# Tool that requires human approval before sending
@tool(approval_mode="always_require", name="send_email_reply", description="Send an email reply to the customer. Requires human approval.")
def send_email_reply(
    to: Annotated[str, Field(description="Recipient email address")],
    subject: Annotated[str, Field(description="Email subject")],
    body: Annotated[str, Field(description="Email body content")]
) -> str:
    """Send an email reply to the customer. Requires human approval."""
    # In production, this would actually send the email
    return f"✅ Email sent to {to} with subject '{subject}'"

# Create agent with the approval-required tool
approval_agent = chat_client.as_agent(
    name="ApprovalSupportAgent",
    instructions="""You are a customer support agent. When you finish drafting a response, 
you MUST call the send_email_reply tool to send it. Do not ask for permission - just call the tool.
The system will automatically handle approval. Always use the tool to send your response.""",
    tools=[lookup_customer_sla, get_incident_status, send_email_reply]
)

print("✅ approval_agent created with send_email_reply tool")

## Check for Pending Approvals

When the agent calls an approval-required tool, the tool does not execute. Instead, the run result lists the pending calls in `user_input_requests`.

In [None]:
# Ask the agent to handle and send a response
prompt = f"""Handle this email and propose sending the response using the send_email_reply tool.
The platform will automatically require human approval before execution.

From: {LEGIT_EMAIL.sender}
Subject: {LEGIT_EMAIL.subject}
Customer ID: {LEGIT_EMAIL.customer_id}

{LEGIT_EMAIL.body}
"""

result = await approval_agent.run(prompt)

# Check if approval is needed
if result.user_input_requests:
    print("🔒 APPROVAL REQUIRED!")
    for user_input_needed in result.user_input_requests:
        print(f"  Function: {user_input_needed.function_call.name}")
        print(f"  Arguments: {user_input_needed.function_call.arguments}")
else:
    print("⚠️ No approval requested - agent didn't call the tool")
    print(result.text)

## Grant Approval

Call `to_function_approval_response(True)` on the pending request to approve it, or pass `False` to reject it, then send the response back to the agent.

In [None]:
print("\n--- Handling Approval ---\n")

# Provide approval and continue the conversation
if result.user_input_requests:
    user_input_needed = result.user_input_requests[0]
    
    # Simulate human approval (in production, this would be interactive)
    user_approval = True
    print(f"✅ Human approved: {user_approval}\n")
    
    # Create approval response message
    approval_message = ChatMessage(
        role=Role.USER,
        contents=[user_input_needed.to_function_approval_response(user_approval)]
    )
    
    # Continue with approval
    final_result = await approval_agent.run([
        prompt,
        ChatMessage(role=Role.ASSISTANT, contents=[user_input_needed]),
        approval_message
    ])
    print(f"📊 Final Result:\n{final_result.text}")
else:
    print("❌ No approval was requested in the previous cell.")
    print("   The agent needs to call the send_email_reply tool to trigger approval.")
    print("   Re-run the previous cell to try again.")

# 6. V2.1 — Middleware (Observability for Production)

InboxOps engineering asked the next obvious question:

"How do we monitor this system in production?"

They need:
- execution timing
- tool call logging
- tracing / visibility for debugging
- metrics for performance

Middleware gives InboxOps **observability hooks** without rewriting agent code.

## Define Middleware

Each middleware wraps execution: it receives a `context` object and a `next` function, and calls `next(context)` to continue the chain.
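
The chaining mechanism itself is small. The sketch below composes middleware around a core handler in plain Python, assuming a dictionary as the context; `build_pipeline`, `log_calls`, and `core_agent` are hypothetical names, and the framework's real context types carry more than this.

```python
import asyncio
from typing import Awaitable, Callable

Context = dict
Handler = Callable[[Context], Awaitable[None]]
Middleware = Callable[[Context, Handler], Awaitable[None]]

def build_pipeline(middlewares: list[Middleware], core: Handler) -> Handler:
    """Wrap `core` so the first middleware in the list runs outermost."""
    handler = core
    for mw in reversed(middlewares):
        # Bind mw and the current handler as defaults to avoid late-binding bugs.
        def wrapped(ctx: Context, mw=mw, nxt=handler) -> Awaitable[None]:
            return mw(ctx, nxt)
        handler = wrapped
    return handler

async def log_calls(ctx: Context, next: Handler) -> None:
    ctx["log"].append("before")
    await next(ctx)  # hand control to the next layer
    ctx["log"].append("after")

async def core_agent(ctx: Context) -> None:
    ctx["log"].append("agent ran")
    ctx["result"] = "draft reply"

ctx = {"log": []}
asyncio.run(build_pipeline([log_calls], core_agent)(ctx))
print(ctx["log"])
```

Code placed before `await next(ctx)` runs on the way in and code after it runs on the way out, which is what makes timing and logging possible without touching the agent.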

In [None]:
from typing import Callable, Awaitable
from agent_framework import AgentRunContext, FunctionInvocationContext
import time

async def logging_agent_middleware(
    context: AgentRunContext,
    next: Callable[[AgentRunContext], Awaitable[None]],
) -> None:
    """Log agent execution with timing."""
    print(f"🚀 Agent starting... ({len(context.messages)} message(s))")
    start_time = time.time()
    
    await next(context)  # Continue to agent execution
    
    elapsed = time.time() - start_time
    print(f"✅ Agent finished in {elapsed:.2f}s")

async def logging_function_middleware(
    context: FunctionInvocationContext,
    next: Callable[[FunctionInvocationContext], Awaitable[None]],
) -> None:
    """Log function tool calls."""
    print(f"  📞 Calling: {context.function.name}({context.arguments})")
    
    await next(context)
    
    # Convert to str first: tool results are not guaranteed to be strings
    result_text = str(context.result)
    print(f"  📤 Result: {result_text[:100]}..." if len(result_text) > 100 else f"  📤 Result: {result_text}")

print("✅ Middleware defined: logging_agent_middleware, logging_function_middleware")

## Attach Middleware

Pass a list of middleware functions when creating the agent.

In [None]:
# Create agent with middleware for logging
middleware_agent = chat_client.as_agent(
    name="LoggingSupportAgent",
    instructions="You are a support agent. Look up customer information when handling requests.",
    tools=[lookup_customer_sla, get_incident_status],
    middleware=[logging_agent_middleware, logging_function_middleware]
)

# Test - you'll see logs for agent and function calls
prompt = f"Check the SLA for customer {LEGIT_EMAIL.customer_id} and ticket status for {LEGIT_EMAIL.ticket_id}"
result = await middleware_agent.run(prompt)
print(f"\n💬 Response: {result.text}")

# 6.1. V2.2 — Error Handling & Retry (InboxOps Resilience)

## The Problem: External Systems Are Unreliable

During Black Friday, InboxOps experiences:
- **Zendesk API timeouts** (504 Gateway Timeout)
- **OpenAI rate limiting** (429 Too Many Requests)
- **Database connection drops** (network failures)

Without retry logic, customers get cryptic errors and agents crash mid-conversation.

---

## Solution: Multi-Layered Error Handling

InboxOps implements 3 resilience patterns:

1. **Retry with Exponential Backoff** - Automatically retry transient failures
2. **Timeout Protection** - Prevent hung requests from blocking the system
3. **Circuit Breaker** - Stop calling failing services to prevent cascading failures
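
As a rough illustration of the first pattern, the sketch below implements retry with exponential backoff using only the standard library. The `retry_with_backoff` helper, its parameters, and the injectable `sleep` are hypothetical names chosen for this example; the notebook itself relies on the `tenacity` library.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `operation` on transient errors, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            await sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


# Demo: fail twice, then succeed, recording delays instead of really sleeping.
recorded_delays: list[float] = []
calls = {"count": 0}


async def fake_sleep(seconds: float) -> None:
    recorded_delays.append(seconds)


async def flaky_operation() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"


outcome = asyncio.run(retry_with_backoff(flaky_operation, sleep=fake_sleep))
print(outcome, recorded_delays)
```

With three attempts there are at most two waits, so the total added latency stays bounded. In a notebook cell, replace `asyncio.run(...)` with a plain `await`.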

---

In [None]:
## Solution 1: Retry with Exponential Backoff
# Uses tenacity library for production-grade retry logic

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)
import random
import asyncio

@tool
@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts
    wait=wait_exponential(multiplier=1, min=1, max=10),  # Exponential backoff between attempts, clamped to 1-10s
    retry=retry_if_exception_type((TimeoutError, ConnectionError))
)
async def create_ticket_with_retry(
    customer_id: Annotated[str, "Customer ID"],
    subject: Annotated[str, "Ticket subject"],
    description: Annotated[str, "Ticket description"]
) -> dict:
    """Create a Zendesk ticket with automatic retry on transient failures."""
    
    # Simulate API call that might fail (30% failure rate)
    if random.random() < 0.3:
        raise ConnectionError("Zendesk API temporarily unavailable")
    
    ticket_id = f"TKT-{random.randint(1000, 9999)}"
    print(f"✅ Ticket {ticket_id} created successfully")
    
    return {
        "ticket_id": ticket_id,
        "customer_id": customer_id,
        "subject": subject,
        "status": "open"
    }

# Test the retry logic
print("🔄 Testing automatic retry...")
try:
    result = await create_ticket_with_retry(
        "CUST-001",
        "Website is slow",
        "Customer reports 5+ second page load times"
    )
    print(f"Success: {result}")
except Exception as e:
    print(f"Failed after 3 attempts: {e}")

In [None]:
## Solution 2: Timeout Protection
# Prevent hung requests from blocking the system

import asyncio
import time
from asyncio import timeout as asyncio_timeout  # asyncio.timeout requires Python 3.11+

async def safe_agent_call(agent, message: str, timeout_sec: int = 30) -> str:
    """Call agent with timeout protection."""
    try:
        async with asyncio_timeout(timeout_sec):
            result = await agent.run(message)
            return result.text
    except asyncio.TimeoutError:
        return "⏱️ Request timed out. The system is under heavy load. Please try again in a few moments."
    except Exception as e:
        return f"❌ Error: {str(e)}"

# Test timeout protection
print("Testing timeout protection...")

# Create a test agent
test_agent = ChatAgent(
    chat_client=chat_client,
    name="TestAgent",
    instructions="You are a test agent."
)

# This should complete normally
response = await safe_agent_call(
    test_agent,
    "Say hello",
    timeout_sec=10
)
print(f"Response: {response[:100]}...")

In [None]:
## Solution 3: Circuit Breaker Pattern
# Stop calling failing services to prevent cascading failures

from typing import Optional, Callable
from datetime import datetime, timedelta

class CircuitBreaker:
    """Circuit breaker to prevent cascading failures."""
    
    def __init__(self, failure_threshold: int = 5, recovery_timeout: int = 60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time: Optional[datetime] = None
        self.is_open = False
    
    def call(self, func: Callable):
        """Execute function with circuit breaker protection."""
        if self.is_open:
            # Check if recovery timeout has passed
            # total_seconds() covers the full elapsed time; .seconds drops whole days
            if (datetime.now() - self.last_failure_time).total_seconds() > self.recovery_timeout:
                print("🔧 Circuit breaker attempting to close...")
                self.is_open = False
                self.failure_count = 0
            else:
                raise Exception(
                    f"⚠️ Circuit breaker is OPEN - service unavailable. "
                    f"Retry after {self.recovery_timeout} seconds."
                )
        
        try:
            result = func()
            self.failure_count = 0  # Reset on success
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            print(f"❌ Failure #{self.failure_count}: {e}")
            
            if self.failure_count >= self.failure_threshold:
                self.is_open = True
                print(f"🚨 Circuit breaker OPENED after {self.failure_count} failures")
            
            raise  # Re-raise the original exception with its traceback

# Create circuit breaker for customer database
customer_db_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

@tool
def get_customer_data_protected(
    customer_id: Annotated[str, "Customer ID"]
) -> dict:
    """Get customer data with circuit breaker protection."""
    
    def _fetch():
        # Simulate database query (sometimes fails)
        if random.random() < 0.4:  # 40% failure rate
            raise ConnectionError("Database connection timeout")
        
        return {
            "customer_id": customer_id,
            "tier": "premium",
            "email": f"{customer_id}@example.com"
        }
    
    return customer_db_breaker.call(_fetch)

# Test circuit breaker
print("Testing circuit breaker pattern...")
print("Making 10 rapid requests to a flaky service:\n")

for i in range(10):
    try:
        result = get_customer_data_protected("CUST-123")
        print(f"✅ Request {i+1}: Success")
    except Exception as e:
        print(f"❌ Request {i+1}: {str(e)[:80]}")
    
    time.sleep(0.5)  # Small delay between requests

In [None]:
## Putting It All Together: Resilient Agent

resilient_agent = ChatAgent(
    chat_client=chat_client,
    name="ResilientInboxOps",
    instructions="""
    You are InboxOps customer support with enterprise-grade resilience.
    Use tools to help customers. If a tool fails, retry automatically.
    If all retries fail, apologize and provide next steps.
    """,
    tools=[
        create_ticket_with_retry,  # Auto-retry on failures
        get_customer_data_protected  # Circuit breaker protection
    ]
)

# Test resilient agent
print("Testing resilient agent with flaky tools...\n")

test_message = """
Customer CUST-999 is reporting slow website performance.
Look up their account details and create a support ticket.
"""

try:
    response = await safe_agent_call(
        resilient_agent,
        test_message,
        timeout_sec=60  # Allow time for retries
    )
    print(f"\n✅ Agent Response:\n{response}")
except Exception as e:
    print(f"\n❌ Error: {e}")

print("\n" + "="*80)
print("✅ V2.2 Complete: InboxOps can now handle API failures gracefully!")
print("="*80)

# 6.2. V2.3 — Rate Limiting (InboxOps Black Friday Protection)

## The Problem: Email Surges Overwhelm the System

**Black Friday scenario:**
- Normal load: 1,000 emails/hour
- Black Friday: **100,000 emails/hour** 🔥

Without rate limiting:
- OpenAI API quota exhausted
- $10,000+ in unexpected costs
- System crashes

## Solution: Token Bucket Rate Limiting
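
A token bucket lets short bursts through while capping the sustained request rate: each request spends one token, and tokens refill at a fixed rate up to a maximum. The following is a minimal sketch that is independent of the Agent Framework; `TokenBucket`, `capacity`, `refill_rate`, and the injectable `clock` are illustrative names.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: int, refill_rate: float, clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow_request(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demo with a fake clock so the behaviour is deterministic.
bucket_clock = [0.0]
bucket = TokenBucket(capacity=3, refill_rate=1.0, clock=lambda: bucket_clock[0])

burst = [bucket.allow_request() for _ in range(4)]        # 3 allowed, 4th rejected
bucket_clock[0] += 2.0                                     # two seconds pass: 2 tokens refill
after_wait = [bucket.allow_request() for _ in range(3)]   # 2 allowed, 3rd rejected
print(burst, after_wait)
```

The demo cell in this section takes a simpler route and counts request timestamps inside a time window; either limiter can sit behind the same `allow_request()` check.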



Let's protect InboxOps from traffic surges.

In [None]:
import time
import asyncio
from typing import Callable, Awaitable
from agent_framework import ChatAgent, AgentRunContext

# Simple rate limiter using a sliding-window log (tracks timestamps of recent requests)
class RateLimiter:
    def __init__(self, max_requests: int, time_window: float):
        """
        Args:
            max_requests: Maximum requests allowed in time window
            time_window: Time window in seconds
        """
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []
    
    def allow_request(self) -> bool:
        """Check if request is allowed under rate limit"""
        now = time.time()
        
        # Remove old requests outside time window
        self.requests = [req_time for req_time in self.requests 
                        if now - req_time < self.time_window]
        
        # Check if under limit
        if len(self.requests) < self.max_requests:
            self.requests.append(now)
            return True
        
        return False
    
    def get_wait_time(self) -> float:
        """Get time to wait before next request is allowed"""
        if not self.requests:
            return 0.0
        
        oldest_request = min(self.requests)
        time_passed = time.time() - oldest_request
        return max(0.0, self.time_window - time_passed)

# Create rate limiter: 3 requests per 5 seconds (stricter for demo)
rate_limiter = RateLimiter(max_requests=3, time_window=5.0)

# Middleware function for rate limiting with proper signature
async def rate_limit_middleware(
    context: AgentRunContext,
    next: Callable[[AgentRunContext], Awaitable[None]]
) -> None:
    """Middleware that enforces rate limits"""
    # In production, you might queue the request or return an error.
    # For the demo, wait until a slot frees up, then re-check the limiter.
    while not rate_limiter.allow_request():
        wait_time = rate_limiter.get_wait_time()
        print(f"⚠️ RATE LIMIT EXCEEDED. Wait {wait_time:.1f}s before next request.")
        print(f"   ⏳ Waiting {wait_time:.1f}s...")
        await asyncio.sleep(max(wait_time, 0.01))
    
    await next(context)

# Create rate-limited agent
rate_limited_agent = chat_client.as_agent(
    name="RateLimitedSupportAgent",
    instructions="You are an InboxOps support agent. Answer briefly in 1-2 sentences.",
    middleware=[rate_limit_middleware]
)

# Simulate Black Friday email surge
print("🛍️ BLACK FRIDAY EMAIL SURGE SIMULATION")
print("="*60)
print("Rate Limit: 3 requests per 5 seconds")
print("Simulating rapid-fire emails to trigger rate limiting...\n")

async def simulate_email_surge():
    emails = [
        "Where is my order #12345?",
        "I need a refund for order #67890",
        "Is the sale still active?",
        "My promo code isn't working",
        "When will item XYZ be back in stock?",
        "I can't log into my account",
        "Need help with shipping address",
        "Is free shipping available?",
    ]
    
    for i, email in enumerate(emails, 1):
        print(f"📧 Email {i}/{len(emails)}: {email[:40]}...")
        start_time = time.time()
        
        # Don't await run_stream - it returns an async generator
        result = rate_limited_agent.run_stream(email)
        
        # Consume stream
        response = ""
        async for chunk in result:
            # Streaming updates expose their incremental text via `.text`
            if chunk.text:
                response += chunk.text
        
        elapsed = time.time() - start_time
        print(f"   ✅ Response: {response[:60]}...")
        print(f"   ⏱️  Processed in {elapsed:.2f}s\n")
        
        # NO delay between requests - fire them as fast as possible to trigger rate limit!
        # await asyncio.sleep(0.5)

# Run the simulation
await simulate_email_surge()

print("\n" + "="*60)
print("✅ Rate limiting protected the system from being overwhelmed!")
print("💡 In production, rate-limited requests would be queued or delayed.")
print(f"📊 Total requests in rate limiter: {len(rate_limiter.requests)}")

# 6.3. V2.4 — Caching (InboxOps FAQ Optimization)

## The Problem: Repetitive Questions Waste API Calls

InboxOps receives the same questions repeatedly:
- "What's your return policy?" (asked 500 times/day)
- "Do you ship internationally?" (asked 300 times/day)
- "How do I reset my password?" (asked 200 times/day)

**At about $0.002 per API call, these 1,000 repeated questions cost roughly $2/day, and the waste grows linearly with email volume.**

## Solution: Response Caching

Cache responses for common questions:

In [None]:
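import hashlib
import time

TTL = 3600  # Cache lifetime in seconds (1 hour)
cache = {}  # cache_key -> {"response": str, "timestamp": float}

def get_cached_response(message: str) -> str | None:
    """Return the cached response for a message, or None on a miss or expiry."""
    cache_key = hashlib.sha256(message.encode()).hexdigest()
    
    if cache_key in cache:
        cached_item = cache[cache_key]
        if time.time() - cached_item['timestamp'] < TTL:
            return cached_item['response']  # Cache hit!
    
    return None  # Cache miss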

Let's reduce InboxOps API costs with smart caching.
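
Before the full implementation, here is a small standalone sketch of the two ideas a response cache depends on: normalizing the message so trivial variants share a key, and expiring entries after a TTL. `TTLCache`, its injectable `clock`, and the placeholder answer text are assumptions made for this example.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLCache:
    """Hypothetical minimal TTL cache keyed by a normalized message hash."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key_for(message: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one key
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, message: str) -> Optional[str]:
        key = self.key_for(message)
        entry = self.entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at >= self.ttl:
            del self.entries[key]  # expired: evict and treat as a miss
            return None
        return response

    def set(self, message: str, response: str) -> None:
        self.entries[self.key_for(message)] = (self.clock(), response)


# Demo with a fake clock so expiry can be shown without waiting.
cache_clock = [0.0]
faq_cache = TTLCache(ttl_seconds=300, clock=lambda: cache_clock[0])
faq_cache.set("What is your return policy?", "(cached answer)")

hit = faq_cache.get("  what is your   RETURN policy? ")  # same key after normalization
cache_clock[0] = 301.0                                    # move past the TTL
expired = faq_cache.get("What is your return policy?")
print(hit, expired)
```

Injecting the clock keeps expiry behaviour testable; exact-match hashing still misses paraphrased questions, which is a known limit of this approach.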

In [None]:
import hashlib
import time
from typing import Optional
from agent_framework import ChatAgent, Context

# Simple in-memory cache with TTL (Time To Live)
class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600):
        """
        Args:
            ttl_seconds: Cache TTL in seconds (default: 1 hour)
        """
        self.cache = {}
        self.ttl = ttl_seconds
        self.hits = 0
        self.misses = 0
    
    def _get_cache_key(self, message: str) -> str:
        """Generate cache key from message"""
        # Normalize message (lowercase, strip whitespace)
        normalized = message.lower().strip()
        return hashlib.sha256(normalized.encode()).hexdigest()
    
    def get(self, message: str) -> Optional[str]:
        """Get cached response if available and not expired"""
        cache_key = self._get_cache_key(message)
        
        if cache_key in self.cache:
            cached_item = self.cache[cache_key]
            age = time.time() - cached_item['timestamp']
            
            if age < self.ttl:
                self.hits += 1
                print(f"   💰 CACHE HIT! (age: {age:.1f}s, saved $0.002)")
                return cached_item['response']
            else:
                # Expired, remove from cache
                del self.cache[cache_key]
        
        self.misses += 1
        return None
    
    def set(self, message: str, response: str):
        """Cache a response"""
        cache_key = self._get_cache_key(message)
        self.cache[cache_key] = {
            'response': response,
            'timestamp': time.time()
        }
    
    def get_stats(self) -> dict:
        """Get cache statistics"""
        total = self.hits + self.misses
        hit_rate = (self.hits / total * 100) if total > 0 else 0
        
        return {
            'hits': self.hits,
            'misses': self.misses,
            'hit_rate': f"{hit_rate:.1f}%",
            'size': len(self.cache),
            'estimated_savings': f"${self.hits * 0.002:.2f}"
        }

# Create cache
response_cache = ResponseCache(ttl_seconds=300)  # 5 minute TTL

# Middleware for caching (same signature as the logging middleware above)
from typing import Awaitable, Callable
from agent_framework import AgentResponse, AgentRunContext, ChatMessage, Role

async def cache_middleware(
    context: AgentRunContext,
    next: Callable[[AgentRunContext], Awaitable[None]],
) -> None:
    """Middleware that caches agent responses (non-streaming runs only)."""
    if not context.messages:
        await next(context)
        return
    
    user_message = context.messages[-1].text
    
    # Try to get cached response
    cached_response = response_cache.get(user_message)
    if cached_response:
        # Cache hit: set the result and skip the LLM call by not calling next().
        # NOTE: assumes the response type is exported as `AgentResponse`;
        # older SDK releases call it `AgentRunResponse`.
        context.result = AgentResponse(
            messages=[ChatMessage(Role.ASSISTANT, text=cached_response)]
        )
        return
    
    # Cache miss - call the agent
    await next(context)
    
    # Cache the response text
    if context.result is not None and getattr(context.result, "text", None):
        response_cache.set(user_message, context.result.text)

# Create cached agent
cached_agent = ChatAgent(
    name="CachedSupportAgent",
    chat_client=chat_client,
    instructions="""You are an InboxOps support agent. Provide concise, helpful answers to customer questions.""",
    middleware=[cache_middleware]
)

# Test with frequently asked questions
faq_questions = [
    "What is your return policy?",
    "Do you ship internationally?",
    "What is your return policy?",  # Duplicate - should hit cache
    "How do I reset my password?",
    "Do you ship internationally?",  # Duplicate - should hit cache
    "What is your return policy?",  # Duplicate - should hit cache
]

print("🧪 TESTING FAQ CACHING")
print("="*60)

async def test_caching():
    for i, question in enumerate(faq_questions, 1):
        print(f"\n📧 Question {i}: {question}")
        
        hits_before = response_cache.hits
        result = await cached_agent.run(question)
        
        # The middleware counts hits itself, so compare the counter instead of
        # calling response_cache.get() again (that would skew the statistics)
        if response_cache.hits == hits_before:
            print("   🔄 Generated new response")
        print(f"   💬 {result.text[:80]}...")

await test_caching()

# Show cache statistics
print("\n" + "="*60)
print("📊 CACHE STATISTICS")
stats = response_cache.get_stats()
for key, value in stats.items():
    print(f"   {key}: {value}")

print("\n✅ Caching dramatically reduces API costs for repetitive questions!")


# 7. V3 — Memory That Survives Beyond a Single Thread

Threads remember a conversation.

But InboxOps also needs persistent preferences across conversations, such as:
- preferred language
- preferred tone
- customer name

Example: one customer always wants **brief responses**, another requests replies in **Hebrew**, and a third is a VIP account.

This is where a ContextProvider-based memory layer becomes powerful:

✅ Extract preferences automatically  
✅ Inject them as context into future calls  
✅ Maintain consistent customer experience

## Preferences Model

Define what to remember.

In [None]:
class SupportPreferences(BaseModel):
    """User preferences for support interactions."""
    name: str | None = None
    preferred_language: Literal["English", "Hebrew", "Spanish"] = "English"
    preferred_tone: Literal["formal", "friendly", "brief"] = "formal"

print("✅ SupportPreferences model defined")

## Implement ContextProvider

A `ContextProvider` implements two methods: `invoking` (injects context before each call) and `invoked` (extracts state after each call).
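
The order in which these hooks fire is easier to see without the framework. The sketch below uses a hypothetical `SketchMemory` class and a `fake_agent_run` stand-in for the agent call; neither is part of the Agent Framework, and the name extraction is a naive string match rather than an LLM call.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMemory:
    """Hypothetical provider: remembers a name and injects it as instructions."""
    name: Optional[str] = None

    async def invoking(self, user_message: str) -> str:
        """Called before the model: return extra instructions to inject."""
        if self.name:
            return f"The user's name is {self.name}."
        return ""

    async def invoked(self, user_message: str, reply: str) -> None:
        """Called after the model: extract state from the exchange."""
        prefix = "my name is "
        lowered = user_message.lower()
        if prefix in lowered:
            start = lowered.index(prefix) + len(prefix)
            self.name = user_message[start:].strip(" .!")


async def fake_agent_run(memory: SketchMemory, user_message: str) -> str:
    """Stand-in for an agent call showing where each hook fires."""
    extra_instructions = await memory.invoking(user_message)
    reply = f"[instructions: {extra_instructions or 'none'}] reply to: {user_message}"
    await memory.invoked(user_message, reply)
    return reply


async def demo() -> list[str]:
    memory = SketchMemory()
    first = await fake_agent_run(memory, "Hi, my name is David")
    second = await fake_agent_run(memory, "What's your return policy?")
    return [first, second]


replies = asyncio.run(demo())
print(replies)
```

The first call has nothing to inject, while the second call sees the name learned after the first one. In a notebook cell, replace `asyncio.run(demo())` with `await demo()`.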

In [None]:
from collections.abc import MutableSequence, Sequence
from typing import Any

from agent_framework import ContextProvider, Context, ChatAgent, ChatMessage, ChatOptions


class SupportMemory(ContextProvider):
    """Memory that tracks user preferences for support interactions."""
    
    def __init__(self, chat_client, preferences: SupportPreferences | None = None, **kwargs: Any):
        """Create the memory.
        
        Args:
            chat_client: The chat client to use for extracting structured data
            preferences: Optional initial preferences
            **kwargs: Additional keyword arguments for deserialization
        """
        self._chat_client = chat_client
        if preferences:
            self.preferences = preferences
        elif kwargs:
            self.preferences = SupportPreferences.model_validate(kwargs)
        else:
            self.preferences = SupportPreferences()
    
    async def invoked(
        self,
        request_messages: ChatMessage | Sequence[ChatMessage],
        response_messages: ChatMessage | Sequence[ChatMessage] | None = None,
        invoke_exception: Exception | None = None,
        **kwargs: Any,
    ) -> None:
        """Extract preferences from user messages after each call."""
        # Ensure request_messages is a list
        messages_list = [request_messages] if isinstance(request_messages, ChatMessage) else list(request_messages)
        
        # Check if we have user messages
        user_messages = [msg for msg in messages_list if msg.role.value == "user"]
        
        if user_messages:
            try:
                # Use the chat client to extract structured information
                # NOTE: Use `options=` not `chat_options=`
                result = await self._chat_client.get_response(
                    messages=messages_list,
                    options=ChatOptions(
                        instructions=(
                            "Extract the user's name, preferred tone (formal/friendly/brief), "
                            "and preferred language (English/Hebrew/Spanish) from the messages if present. "
                            "If the name is not present, return null for it. If the tone or language "
                            "is not stated, use the defaults (formal, English)."
                        ),
                        response_format=SupportPreferences,
                    ),
                )
                
                # result.value should now be a SupportPreferences instance
                extracted = result.value
                
                # Update preferences with extracted data
                if extracted and isinstance(extracted, SupportPreferences):
                    if self.preferences.name is None and extracted.name:
                        self.preferences.name = extracted.name
                        print(f"   🧠 Memory updated: name = {extracted.name}")
                    
                    if extracted.preferred_tone != "formal":  # formal is default
                        self.preferences.preferred_tone = extracted.preferred_tone
                        print(f"   🧠 Memory updated: tone = {extracted.preferred_tone}")
                    
                    if extracted.preferred_language != "English":  # English is default
                        self.preferences.preferred_language = extracted.preferred_language
                        print(f"   🧠 Memory updated: language = {extracted.preferred_language}")
                        
            except Exception as e:
                print(f"   ⚠️ Failed to extract preferences: {e}")
    
    async def invoking(self, messages: ChatMessage | MutableSequence[ChatMessage], **kwargs: Any) -> Context:
        """Provide preference context before each agent call."""
        instructions: list[str] = []
        
        if self.preferences.name:
            instructions.append(f"The user's name is {self.preferences.name}. Address them by name.")
        
        instructions.append(f"Respond in {self.preferences.preferred_language}.")
        instructions.append(f"Use a {self.preferences.preferred_tone} tone.")
        
        return Context(instructions=" ".join(instructions))
    
    def serialize(self) -> str:
        """Serialize for persistence."""
        return self.preferences.model_dump_json()

print("✅ SupportMemory ContextProvider defined")

## Test Memory

The agent automatically extracts and applies preferences across turns.

In [None]:
# Create the memory provider using the existing chat_client
support_memory = SupportMemory(chat_client)

# Create the agent with memory
memory_agent = ChatAgent(
    name="MemorySupportAgent",
    instructions="You are a friendly support agent. Adapt your responses based on user preferences.",
    chat_client=chat_client,
    context_provider=support_memory,
)

# Turn 1: User introduces themselves
print("Turn 1: User introduction")
print("-" * 50)
result1 = await memory_agent.run("Hi, my name is David")
print(f"Agent: {result1.text}\n")

# Turn 2: User sets preference
print("Turn 2: Setting preference")
print("-" * 50)
result2 = await memory_agent.run("Please keep responses brief and casual")
print(f"Agent: {result2.text}\n")

# Turn 3: Ask a question - memory should apply name and brief tone
print("Turn 3: Question with preferences applied")
print("-" * 50)
result3 = await memory_agent.run("What's your return policy?")
print(f"Agent: {result3.text}\n")

# Check memory state - access the original support_memory object directly
print("🧠 Memory State (tracked by ContextProvider):")
print(f"   Name: {support_memory.preferences.name}")
print(f"   Language: {support_memory.preferences.preferred_language}")
print(f"   Tone: {support_memory.preferences.preferred_tone}")

# Workflows: InboxOps Pipeline Automation

At scale, InboxOps realized something important:

**A single agent loop is not enough.**

They need repeatable, testable execution paths:
- classify → draft → review
- routing rules (spam vs legit)
- parallel tasks (respond + summarize)
- human escalation loops

Workflows turn agent interactions into a real operational pipeline.

---

### Agent vs. Workflow

| AI Agent | Workflow |
|----------|----------|
| Single reasoning loop | Orchestrates multiple components |
| Dynamic tool selection | Predefined execution paths |
| Best for: focused tasks | Best for: multi-step processes |

### When to Use Workflows

| Pattern | Use Case |
|---------|----------|
| **Sequential** | Steps must run in order (classify → draft → review) |
| **Branching** | Different paths based on conditions (spam vs. legitimate) |
| **Parallel (Fan-out/Fan-in)** | Independent tasks that can run concurrently |
| **Group Chat** | Iterative refinement with multiple reviewers |
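
To make the parallel pattern concrete, the following sketch runs two independent tasks concurrently and merges their results using plain `asyncio`. The `draft_reply`, `summarize`, and `fan_out_fan_in` functions are hypothetical stand-ins for agent calls, not framework APIs.

```python
import asyncio


async def draft_reply(email: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an agent call
    return f"Reply to: {email}"


async def summarize(email: str) -> str:
    await asyncio.sleep(0.01)  # stands in for a second, independent agent call
    return f"Summary of: {email}"


async def fan_out_fan_in(email: str) -> dict[str, str]:
    """Run independent tasks concurrently (fan-out), then aggregate (fan-in)."""
    reply, summary = await asyncio.gather(draft_reply(email), summarize(email))
    return {"reply": reply, "summary": summary}


combined = asyncio.run(fan_out_fan_in("Where is my order?"))
print(combined)
```

Because the two tasks do not depend on each other, total latency is close to the slower of the two rather than their sum. In a notebook cell, replace `asyncio.run(...)` with a plain `await`.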

### Core Concepts

| Concept | Description |
|---------|-------------|
| **Executor** | Unit of work — agent or custom logic |
| **Edge** | Connection between executors with optional conditions |
| **WorkflowBuilder** | Constructs the execution graph |
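
How these three concepts fit together can be sketched with a toy graph runner. `MiniWorkflow` and its methods are hypothetical and far simpler than the framework's `WorkflowBuilder`; the sketch only illustrates executors as units of work and edges with optional conditions.

```python
from typing import Callable, Optional

Step = Callable[[str], str]
Condition = Callable[[str], bool]


class MiniWorkflow:
    """Hypothetical toy graph: executors are functions, edges carry optional conditions."""

    def __init__(self, start: str):
        self.start = start
        self.executors: dict[str, Step] = {}
        self.edges: dict[str, list[tuple[str, Optional[Condition]]]] = {}

    def add_executor(self, name: str, step: Step) -> "MiniWorkflow":
        self.executors[name] = step
        return self

    def add_edge(self, source: str, target: str, condition: Optional[Condition] = None) -> "MiniWorkflow":
        self.edges.setdefault(source, []).append((target, condition))
        return self

    def run(self, message: str) -> tuple[str, list[str]]:
        """Follow the first matching edge from each executor until none matches."""
        current: Optional[str] = self.start
        visited: list[str] = []
        while current is not None:
            visited.append(current)
            message = self.executors[current](message)
            current = next(
                (target for target, cond in self.edges.get(current, []) if cond is None or cond(message)),
                None,
            )
        return message, visited


workflow = (
    MiniWorkflow(start="classify")
    .add_executor("classify", lambda text: "spam" if "win a prize" in text.lower() else "legit")
    .add_executor("block", lambda label: "blocked")
    .add_executor("draft", lambda label: "draft reply")
    .add_edge("classify", "block", condition=lambda label: label == "spam")
    .add_edge("classify", "draft", condition=lambda label: label == "legit")
)

print(workflow.run("WIN A PRIZE now!!!"))
print(workflow.run("My order arrived damaged"))
```

Each run returns the final message together with the path taken, which is the property that makes workflow execution easy to test and debug.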

# 8. Workflow Pattern #1 — Sequential Pipeline (InboxOps Assembly Line)

![Sequential Workflow](images/sequential-workflow.png)

InboxOps introduced a standard pipeline for every inbound email:

1) Classify the email  
2) Draft a response  
3) Review it  

This pattern is perfect when order matters.

✅ Predictable  
✅ Easy to debug  
✅ Easy to measure  
✅ Easy to extend

## Core Concepts

| Concept | Description |
|---------|-------------|
| **Executor** | Unit of work (`@executor` or class with `@handler`) |
| **WorkflowBuilder** | Connects executors with `add_edge()` |
| `ctx.send_message()` | Pass data to next executor |
| `ctx.yield_output()` | Return final result |

## Define Executors

Create agent executors for classification, writing, and review.

In [None]:
from typing_extensions import Never
from agent_framework import (
    WorkflowBuilder, WorkflowContext, WorkflowOutputEvent,
    Executor, executor, handler, AgentExecutor, AgentExecutorRequest, AgentExecutorResponse
)

# === CLASSIFIER AGENT ===
classifier_agent = AgentExecutor(
    chat_client.as_agent(
        name="Classifier",
        instructions="""Classify incoming emails. Return JSON with:
- category: "spam", "not_spam", or "uncertain"
- confidence: float 0-1
- reason: brief explanation""",
        response_format=ClassificationResult,
    ),
    id="classifier",
)

# === DRAFT WRITER AGENT ===
writer_agent = AgentExecutor(
    chat_client.as_agent(
        name="DraftWriter",
        instructions="""Draft professional support responses. Return JSON with:
- subject: reply subject line
- body: reply body
- tone: "formal", "friendly", or "apologetic"
- needs_review: true if sensitive or complex""",
        response_format=DraftResponse,
    ),
    id="writer",
)

# === REVIEWER AGENT ===
reviewer_agent = AgentExecutor(
    chat_client.as_agent(
        name="Reviewer",
        instructions="""Review draft responses for quality. Check:
- Professionalism and tone
- Accuracy of information
- Completeness
Return approval decision with notes.""",
    ),
    id="reviewer",
)

print("✅ Workflow agents defined: classifier, writer, reviewer")

## Build & Run

Connect executors with `add_edge()` and execute.

In [None]:
# Build sequential workflow
sequential_support_workflow = (
    WorkflowBuilder()
    .set_start_executor(classifier_agent)
    .add_edge(classifier_agent, writer_agent)
    .add_edge(writer_agent, reviewer_agent)
    .build()
)

# Run with legitimate email
async def run_sequential_workflow():
    email_prompt = f"""Process this support email:

From: {LEGIT_EMAIL.sender}
Subject: {LEGIT_EMAIL.subject}
Customer ID: {LEGIT_EMAIL.customer_id}

{LEGIT_EMAIL.body}
"""
    
    print("📧 Processing email through workflow: Classify → Draft → Review\n")
    print("-" * 60)
    
    request = AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=email_prompt)],
        should_respond=True
    )
    
    # Import from the public package rather than a private module path
    from agent_framework import ExecutorCompletedEvent
    
    async for event in sequential_support_workflow.run_stream(request):
        if isinstance(event, ExecutorCompletedEvent) and event.data:
            data = event.data[0] if isinstance(event.data, list) else event.data
            if hasattr(data, 'agent_response'):
                print(f"\n✅ [{event.executor_id}]:")
                print(f"   {data.agent_response.text[:300]}...")
        elif isinstance(event, WorkflowOutputEvent):
            print(f"\n🎯 FINAL OUTPUT:")
            if isinstance(event.data, list) and event.data:
                final = event.data[0]
                if hasattr(final, 'agent_response'):
                    print(final.agent_response.text)

await run_sequential_workflow()

# 9. Workflow Pattern #2 — Branching (InboxOps Triage System)

InboxOps doesn't want to treat every message the same.

So they built a triage workflow:

🚫 Spam → Block and log  
✅ Legitimate → Draft a response  
⚠️ Uncertain → Escalate to human review  

This prevents wasted effort, reduces risk, and keeps human attention focused where needed.

## Routing Patterns

| Pattern | Use Case |
|---------|----------|
| **Conditional Edge** | Binary if/else |
| **Switch-Case** | Multi-way routing |
| **Multi-Selection** | Dynamic fan-out |
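
Switch-case routing boils down to "first matching condition wins, otherwise fall through to a default". A minimal sketch of that rule follows; the `Route` dataclass and `switch_case` function are hypothetical names and do not mirror the framework's `Case` and `Default` classes exactly.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    """Hypothetical routing rule: `condition=None` marks the default branch."""
    target: str
    condition: Optional[Callable[[str], bool]] = None


def switch_case(category: str, routes: list[Route]) -> str:
    """Return the target of the first matching route (first match wins)."""
    for route in routes:
        if route.condition is None or route.condition(category):
            return route.target
    raise ValueError(f"No route matched {category!r} and no default was given")


triage_routes = [
    Route("handle_spam", lambda c: c == "spam"),
    Route("handle_not_spam", lambda c: c == "not_spam"),
    Route("handle_uncertain"),  # default: uncertain and anything unexpected
]

for label in ["spam", "not_spam", "uncertain", "garbage"]:
    print(label, "->", switch_case(label, triage_routes))
```

Placing the default last means an unexpected category still lands in human review instead of silently dropping out of the pipeline.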

## Define Branch Handlers

Create handlers for each classification outcome.

In [None]:
from dataclasses import dataclass
from uuid import uuid4
from agent_framework import Case, Default

# Internal payload for routing
@dataclass
class ClassifiedEmail:
    email_id: str
    category: str  # spam, not_spam, uncertain
    confidence: float
    reason: str
    original_content: str

# Shared state keys
EMAIL_KEY = "current_email"

# Helper to extract JSON from markdown code blocks
def extract_json(text: str) -> str:
    """Extract JSON from text, stripping markdown code blocks if present."""
    import re
    match = re.search(r'```(?:json)?\s*([\s\S]*?)```', text)
    if match:
        return match.group(1).strip()
    return text.strip()

# Transform classification result to routable payload
@executor(id="extract_classification")
async def extract_classification(response: Any, ctx: WorkflowContext[ClassifiedEmail]) -> None:
    """Extract classification from agent response for routing."""
    if isinstance(response, list):
        response = response[0]
    
    # Extract JSON (handles markdown code blocks)
    json_text = extract_json(response.agent_response.text)
    classification = ClassificationResult.model_validate_json(json_text)
    
    # Get original email from shared state
    original_content = await ctx.get_shared_state(EMAIL_KEY) or "Unknown"
    
    payload = ClassifiedEmail(
        email_id=str(uuid4()),
        category=classification.category,
        confidence=classification.confidence,
        reason=classification.reason,
        original_content=original_content
    )
    await ctx.send_message(payload)

# Route conditions
def is_spam(message: Any) -> bool:
    return isinstance(message, ClassifiedEmail) and message.category == "spam"

def is_not_spam(message: Any) -> bool:
    return isinstance(message, ClassifiedEmail) and message.category == "not_spam"

def is_uncertain(message: Any) -> bool:
    return isinstance(message, ClassifiedEmail) and message.category == "uncertain"

# Terminal handlers
@executor(id="handle_spam")
async def handle_spam_terminal(email: ClassifiedEmail, ctx: WorkflowContext[Never, str]) -> None:
    """Handle spam: block and log."""
    await ctx.yield_output(f"🚫 SPAM BLOCKED: {email.reason} (confidence: {email.confidence:.0%})")

@executor(id="handle_not_spam")
async def handle_not_spam_continue(email: ClassifiedEmail, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Handle not_spam: forward to writer."""
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Draft a response to: {email.original_content}")],
        should_respond=True
    ))

@executor(id="finalize_draft")
async def finalize_draft(response: Any, ctx: WorkflowContext[Never, str]) -> None:
    """Output the final draft."""
    if isinstance(response, list):
        response = response[0]
    # Extract JSON (handles markdown code blocks)
    json_text = extract_json(response.agent_response.text)
    draft = DraftResponse.model_validate_json(json_text)
    await ctx.yield_output(f"✉️ DRAFT READY:\nSubject: {draft.subject}\n\n{draft.body}")

@executor(id="handle_uncertain")
async def handle_uncertain_terminal(email: ClassifiedEmail, ctx: WorkflowContext[Never, str]) -> None:
    """Handle uncertain: flag for human review."""
    await ctx.yield_output(f"⚠️ NEEDS HUMAN REVIEW: {email.reason} (confidence: {email.confidence:.0%})\n\nOriginal: {email.original_content[:200]}...")

print("✅ Branching executors defined")

## Build Switch-Case Workflow

Route based on classification result.
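
The routing rule can be sketched without the framework. This is a minimal illustration, assuming cases are checked in order, the first matching condition wins, and the default catches everything else. The `Message` class, `CASES` list, and target names are hypothetical stand-ins, not framework types.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    """Hypothetical stand-in for a classified email."""
    category: str


# Each case pairs a condition with a target name (names are illustrative only)
CASES: list[tuple[Callable[[Message], bool], str]] = [
    (lambda m: m.category == "spam", "handle_spam"),
    (lambda m: m.category == "not_spam", "handle_not_spam"),
]
DEFAULT_TARGET = "handle_uncertain"


def route(message: Message) -> str:
    """Return the first target whose condition matches, else the default."""
    for condition, target in CASES:
        if condition(message):
            return target
    return DEFAULT_TARGET


for category in ("spam", "not_spam", "uncertain", "unexpected_value"):
    print(f"{category} -> {route(Message(category))}")
```

Because the default catches any unmatched value, an unexpected category from the classifier ends up in human review rather than being dropped.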

In [None]:
# Store email and start classification
@executor(id="start_classification")
async def start_classification(email_text: str, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Store email and send for classification."""
    await ctx.set_shared_state(EMAIL_KEY, email_text)
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Classify this email:\n\n{email_text}")],
        should_respond=True
    ))

# Build branching workflow
branching_workflow = (
    WorkflowBuilder()
    .set_start_executor(start_classification)
    .add_edge(start_classification, classifier_agent)
    .add_edge(classifier_agent, extract_classification)
    # Switch-case routing
    .add_switch_case_edge_group(
        extract_classification,
        [
            Case(condition=is_spam, target=handle_spam_terminal),
            Case(condition=is_not_spam, target=handle_not_spam_continue),
            Default(target=handle_uncertain_terminal),  # Catches uncertain + unexpected
        ],
    )
    # Continue not_spam path to draft
    .add_edge(handle_not_spam_continue, writer_agent)
    .add_edge(writer_agent, finalize_draft)
    .build()
)

print("✅ Branching workflow built")

## Test Branching

Run all three email types through the workflow.

In [None]:
# Test all three paths
async def test_branching():
    test_cases = [
        ("LEGITIMATE", LEGIT_EMAIL),
        ("SPAM", SPAM_EMAIL),
        ("AMBIGUOUS", AMBIGUOUS_EMAIL),
    ]
    
    for label, email in test_cases:
        print(f"\n📧 Testing {label} email...")
        print("-" * 50)
        
        email_text = f"From: {email.sender}\nSubject: {email.subject}\n\n{email.body}"
        
        async for event in branching_workflow.run_stream(email_text):
            if isinstance(event, WorkflowOutputEvent):
                print(event.data)

await test_branching()

# 9.1. Workflow Pattern #2.5 — Checkpointing (InboxOps Batch Recovery)

## The Problem: Processing 10,000 Emails Overnight

InboxOps runs nightly batch jobs to process accumulated emails:
- **10,000 emails** need classification and routing
- Job takes **3 hours** to complete
- **What if it crashes at email 7,500?**

Without checkpointing:
- ❌ Restart from beginning
- ❌ Reprocess 7,500 emails (duplicate work, duplicate costs)
- ❌ Delays customer responses

## Solution: Workflow Checkpointing

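Save progress at each step so that a failed run can resume from the last checkpoint instead of starting over. Conceptually (exact class and method names depend on the SDK version):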

```python
from agent_framework.workflows import FileCheckpointStorage

# Enable checkpointing
workflow = SequentialWorkflow()\
    .with_checkpointing(
        storage=FileCheckpointStorage(checkpoint_dir="./checkpoints")
    )

# If workflow fails at step 3 of 5:
# Resume from step 3 (skip steps 1-2)
workflow.resume(checkpoint_id="batch_job_001")
```

Let's add fault tolerance to InboxOps batch processing.


In [None]:
from agent_framework import ChatAgent
import json
import os
import time

# Simulated batch email processing
emails_batch = [
    {"id": f"EMAIL-{i:04d}", "subject": f"Customer inquiry #{i}", "priority": "normal"}
    for i in range(1, 21)  # 20 emails for demo (imagine 10,000 in production)
]

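# Create processing agents (defined for reference; the demo loop below simulates
# their output so the checkpoint logic runs without model calls).
# Distinct names avoid overwriting the classifier_agent used by the branching workflow.
batch_classifier_agent = ChatAgent(
    name="EmailClassifier",
    chat_client=chat_client,
    instructions="Classify emails into: urgent, normal, or low priority. Return only the priority level."
)

batch_router_agent = ChatAgent(
    name="EmailRouter",
    chat_client=chat_client,
    instructions="Based on email priority, route to: urgent_queue, normal_queue, or low_queue. Return only the queue name."
)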

# Simple file-based checkpoint storage (similar to FileCheckpointStorage)
class SimpleCheckpointStorage:
    def __init__(self, checkpoint_dir: str = "./checkpoints"):
        self.checkpoint_dir = checkpoint_dir
        os.makedirs(checkpoint_dir, exist_ok=True)
    
    def save_checkpoint(self, workflow_id: str, step: int, data: dict):
        """Save checkpoint to file"""
        checkpoint_file = os.path.join(self.checkpoint_dir, f"{workflow_id}.json")
        checkpoint = {
            "workflow_id": workflow_id,
            "step": step,
            "timestamp": time.time(),
            "data": data
        }
        with open(checkpoint_file, 'w') as f:
            json.dump(checkpoint, f, indent=2)
        print(f"   💾 Checkpoint saved: step {step}")
    
    def load_checkpoint(self, workflow_id: str) -> dict | None:
        """Load checkpoint from file"""
        checkpoint_file = os.path.join(self.checkpoint_dir, f"{workflow_id}.json")
        if os.path.exists(checkpoint_file):
            with open(checkpoint_file, 'r') as f:
                return json.load(f)
        return None
    
    def clear_checkpoint(self, workflow_id: str):
        """Clear checkpoint after successful completion"""
        checkpoint_file = os.path.join(self.checkpoint_dir, f"{workflow_id}.json")
        if os.path.exists(checkpoint_file):
            os.remove(checkpoint_file)

# Create checkpoint storage
checkpoint_storage = SimpleCheckpointStorage()
workflow_id = "email_batch_2024"

# Check for existing checkpoint
checkpoint = checkpoint_storage.load_checkpoint(workflow_id)
if checkpoint:
    print("🔄 RESUMING FROM CHECKPOINT")
    print(f"   Workflow: {checkpoint['workflow_id']}")
    print(f"   Last completed step: {checkpoint['step']}")
    print(f"   Processed emails: {checkpoint['data']['processed_count']}\n")
    start_index = checkpoint['data']['processed_count']
    # Restore earlier results so the final totals cover the whole batch
    processed_emails = checkpoint['data']['processed_emails']
else:
    print("🚀 STARTING NEW BATCH PROCESSING\n")
    start_index = 0
    processed_emails = []

print(f"📧 Processing {len(emails_batch)} emails (starting from email {start_index + 1})...")
print("="*60)

for i in range(start_index, len(emails_batch)):
    email = emails_batch[i]
    print(f"\n📬 Processing: {email['id']} - {email['subject']}")
    
    # Classify email
    classification = f"priority_{email['priority']}"  # Simplified for demo
    print(f"   🏷️  Classified as: {classification}")
    
    # Route email
    queue = f"{email['priority']}_queue"
    print(f"   📮 Routed to: {queue}")
    
    processed_emails.append({
        **email,
        "classification": classification,
        "queue": queue,
        "processed_at": time.time()
    })
    
    # Save checkpoint every 5 emails
    if (i + 1) % 5 == 0:
        checkpoint_storage.save_checkpoint(
            workflow_id=workflow_id,
            step=i + 1,
            data={
                "processed_count": i + 1,
                "last_email_id": email['id'],
                "processed_emails": processed_emails
            }
        )
    
    # Simulate processing time
    time.sleep(0.1)
    
    # Simulate failure at email 12 (only on first run)
    if i == 11 and start_index == 0:
        print("\n" + "="*60)
        print("❌ SIMULATED SYSTEM CRASH AT EMAIL 12!")
        print("💾 Checkpoint saved at email 10")
        print("\n🔄 To resume, run this cell again...")
        print("="*60)
        break
else:
    # Successfully completed
    print("\n" + "="*60)
    print(f"✅ BATCH PROCESSING COMPLETE!")
    print(f"   Total processed: {len(processed_emails)} emails")
    print(f"   Success rate: 100%")
    
    # Clear checkpoint
    checkpoint_storage.clear_checkpoint(workflow_id)
    print("   🧹 Checkpoint cleared")

# Show sample results
if processed_emails:
    print("\n📊 Sample Results:")
    for email in processed_emails[:3]:
        print(f"   {email['id']}: {email['classification']} → {email['queue']}")
    if len(processed_emails) > 3:
        print(f"   ... and {len(processed_emails) - 3} more")

print("\n✅ Checkpointing enables fault-tolerant batch processing!")
print("💡 In production, use FileCheckpointStorage from agent_framework.workflows")


# 10. Workflow Pattern #3 — Fan-Out / Fan-In (Parallelization)

![Concurrent Workflow](images/concurrent-workflow.png)

InboxOps had a performance and productivity challenge. For long emails, support reps want:

✅ a customer-facing reply  
✅ an internal summary (for ticket notes)  

These tasks are independent, so InboxOps runs them in parallel and merges the results.

Total processing time then approaches the duration of the slower task rather than the sum of both, which shortens the wait for reps.

## Define Parallel Paths

For long emails: respond AND summarize concurrently.
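
The idea of running two independent tasks at once can be shown with plain `asyncio`. This is a sketch that stands apart from the framework: `fake_agent` is a hypothetical stand-in in which `asyncio.sleep` simulates model latency.

```python
import asyncio

events: list[str] = []


async def fake_agent(name: str, seconds: float) -> str:
    """Hypothetical stand-in for an agent call; sleep simulates model latency."""
    events.append(f"start {name}")
    await asyncio.sleep(seconds)
    events.append(f"end {name}")
    return f"{name} done"


async def respond_and_summarize() -> list[str]:
    # Both coroutines run concurrently; results come back in argument order
    return await asyncio.gather(
        fake_agent("reply", 0.02),
        fake_agent("summary", 0.01),
    )


# In a notebook cell, use `results = await respond_and_summarize()` instead
results = asyncio.run(respond_and_summarize())
print(results)
print(events)
```

Both tasks start before either one finishes, which is what lets the total wait approach the slower task's duration.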

In [None]:
# Summary model
class EmailSummary(BaseModel):
    """Concise email summary."""
    key_points: list[str] = Field(description="Main points from the email")
    urgency: Literal["low", "medium", "high"] = Field(description="Urgency level")
    action_required: str = Field(description="Primary action needed")

# Summarizer agent
summarizer_agent = AgentExecutor(
    chat_client.as_agent(
        name="Summarizer",
        instructions="""Summarize emails concisely. Return JSON with:
- key_points: list of main points
- urgency: low/medium/high
- action_required: primary action needed""",
        response_format=EmailSummary,
    ),
    id="summarizer",
)

# Threshold for "long" emails
LONG_EMAIL_THRESHOLD = 200  # characters

@dataclass
class EnrichedEmail:
    """Email with metadata for routing."""
    email_id: str
    content: str
    is_long: bool
    category: str

# Selection function for multi-selection routing
def select_parallel_paths(email: EnrichedEmail, target_ids: list[str]) -> list[str]:
    """Select paths based on email length."""
    # target_ids order: [respond_path, summarize_path]
    respond_id, summarize_id = target_ids
    
    if email.is_long:
        return [respond_id, summarize_id]  # Both paths in parallel
    else:
        return [respond_id]  # Only respond for short emails

# Executors for parallel paths
@executor(id="prepare_parallel")
async def prepare_parallel(classified: ClassifiedEmail, ctx: WorkflowContext[EnrichedEmail]) -> None:
    """Prepare email for parallel processing."""
    enriched = EnrichedEmail(
        email_id=classified.email_id,
        content=classified.original_content,
        is_long=len(classified.original_content) > LONG_EMAIL_THRESHOLD,
        category=classified.category
    )
    await ctx.send_message(enriched)

@executor(id="respond_path")
async def respond_path(email: EnrichedEmail, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Send to writer for response."""
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Draft a response to:\n{email.content}")],
        should_respond=True
    ))

@executor(id="summarize_path")
async def summarize_path(email: EnrichedEmail, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Send to summarizer."""
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Summarize this email:\n{email.content}")],
        should_respond=True
    ))

# Aggregator to combine parallel results
class ParallelAggregator(Executor):
    def __init__(self):
        super().__init__(id="parallel_aggregator")
    
    @handler
    async def aggregate(self, results: list[Any], ctx: WorkflowContext[Never, str]) -> None:
        """Combine response and summary."""
        output_parts = []
        
        for result in results:
            if isinstance(result, AgentExecutorResponse):
                try:
                    # extract_json strips markdown code fences before validation
                    draft = DraftResponse.model_validate_json(extract_json(result.agent_response.text))
                    output_parts.append(f"📧 DRAFT RESPONSE:\nSubject: {draft.subject}\n{draft.body}")
                except Exception:
                    try:
                        summary = EmailSummary.model_validate_json(extract_json(result.agent_response.text))
                        points = "\n".join(f"  • {p}" for p in summary.key_points)
                        output_parts.append(f"📋 SUMMARY:\n{points}\nUrgency: {summary.urgency}\nAction: {summary.action_required}")
                    except Exception:
                        output_parts.append(f"Result: {result.agent_response.text[:200]}...")
        
        # Put the separator between parts rather than gluing it onto the first one
        separator = "\n\n" + "=" * 40 + "\n\n"
        await ctx.yield_output(separator.join(output_parts))

aggregator = ParallelAggregator()

print("✅ Parallel processing executors defined")

## Build Fan-Out/Fan-In Workflow

Short emails → respond only. Long emails → respond + summarize in parallel.

In [None]:
import time

from agent_framework import WorkflowBuilder

# Constants
LONG_EMAIL_THRESHOLD = 200  # Characters

# Start executor - entry point stores email and passes it forward
@executor(id="fanout_start")
async def fanout_start(email_text: str, ctx: WorkflowContext[str]) -> None:
    """Entry point: store email length, forward email text."""
    # Store email length in shared state (informational; path selection reads the message text)
    await ctx.set_shared_state("email_length", len(email_text))
    # Store workflow start time
    await ctx.set_shared_state("workflow_start_time", time.time())
    await ctx.send_message(email_text)

# Selection function: receives the message sent by fanout_start plus the candidate target IDs
def fanout_select_paths(email_text: str, target_ids: list[str]) -> list[str]:
    """Select paths based on the length of the email text."""
    # target_ids order: [fanout_respond_prep, fanout_summarize_prep]
    if len(email_text) > LONG_EMAIL_THRESHOLD:
        return target_ids  # Both paths for long emails
    return [target_ids[0]]  # Only response path for short emails

# Response path preparer with timing
@executor(id="fanout_respond_prep")
async def fanout_respond_prep(email_text: str, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Prepare email for writer agent."""
    workflow_start = await ctx.get_shared_state("workflow_start_time")
    start_time = time.time()
    elapsed = start_time - workflow_start
    print(f"   ⏱️  [+{elapsed:.2f}s] 📝 RESPONSE PATH started")
    
    await ctx.set_shared_state("response_start_time", start_time)
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Draft a response to:\n{email_text}")],
        should_respond=True
    ))

# Summary path preparer with timing
@executor(id="fanout_summarize_prep")
async def fanout_summarize_prep(email_text: str, ctx: WorkflowContext[AgentExecutorRequest]) -> None:
    """Prepare email for summarizer agent."""
    workflow_start = await ctx.get_shared_state("workflow_start_time")
    start_time = time.time()
    elapsed = start_time - workflow_start
    print(f"   ⏱️  [+{elapsed:.2f}s] 📋 SUMMARY PATH started")
    
    await ctx.set_shared_state("summary_start_time", start_time)
    await ctx.send_message(AgentExecutorRequest(
        messages=[ChatMessage(Role.USER, text=f"Summarize this email:\n{email_text}")],
        should_respond=True
    ))

# Capture completion time immediately after writer finishes
@executor(id="capture_writer_completion")
async def capture_writer_completion(result: Any, ctx: WorkflowContext[Any]) -> None:
    """Capture writer completion time."""
    workflow_start = await ctx.get_shared_state("workflow_start_time")
    response_start = await ctx.get_shared_state("response_start_time")
    end_time = time.time()
    
    elapsed_from_start = end_time - workflow_start
    duration = end_time - response_start
    print(f"   ⏱️  [+{elapsed_from_start:.2f}s] ✅ RESPONSE PATH completed ({duration:.2f}s)")
    
    await ctx.set_shared_state("response_end_time", end_time)
    await ctx.send_message(result)

# Capture completion time immediately after summarizer finishes
@executor(id="capture_summarizer_completion")
async def capture_summarizer_completion(result: Any, ctx: WorkflowContext[Any]) -> None:
    """Capture summarizer completion time."""
    workflow_start = await ctx.get_shared_state("workflow_start_time")
    summary_start = await ctx.get_shared_state("summary_start_time")
    end_time = time.time()
    
    elapsed_from_start = end_time - workflow_start
    duration = end_time - summary_start
    print(f"   ⏱️  [+{elapsed_from_start:.2f}s] ✅ SUMMARY PATH completed ({duration:.2f}s)")
    
    await ctx.set_shared_state("summary_end_time", end_time)
    await ctx.send_message(result)

# Aggregator - combines results from parallel paths with timing
@executor(id="fanout_aggregator")
async def fanout_aggregator(results: list[Any], ctx: WorkflowContext[Never, str]) -> None:
    """Combine response and summary results with timing information."""
    response_start = await ctx.get_shared_state("response_start_time")
    summary_start = await ctx.get_shared_state("summary_start_time")
    response_end = await ctx.get_shared_state("response_end_time")
    summary_end = await ctx.get_shared_state("summary_end_time")
    
    output_parts = []
    response_time = None
    summary_time = None
    
    # Calculate durations from stored times
    if response_start and response_end:
        response_time = response_end - response_start
    if summary_start and summary_end:
        summary_time = summary_end - summary_start
    
    def fmt(seconds: float | None) -> str:
        """Format a duration; a path that recorded no timing shows as n/a."""
        return f"{seconds:.2f}s" if seconds is not None else "n/a"

    for result in results:
        if isinstance(result, AgentExecutorResponse):
            try:
                draft = DraftResponse.model_validate_json(extract_json(result.agent_response.text))
                output_parts.append(
                    f"📬 RESPONSE (completed in {fmt(response_time)}):\n"
                    f"Subject: {draft.subject}\n{draft.body}"
                )
            except Exception:
                try:
                    summary = EmailSummary.model_validate_json(extract_json(result.agent_response.text))
                    points = "\n".join(f"  • {p}" for p in summary.key_points)
                    output_parts.append(
                        f"📋 SUMMARY (completed in {fmt(summary_time)}):\n"
                        f"{points}\n"
                        f"Urgency: {summary.urgency}\n"
                        f"Action: {summary.action_required}"
                    )
                except Exception:
                    output_parts.append(f"Result: {result.agent_response.text[:200]}...")
    
    # Calculate overlap to show parallelization
    if response_time and summary_time:
        total_sequential = response_time + summary_time
        total_parallel = max(response_time, summary_time)
        time_saved = total_sequential - total_parallel
        output_parts.append(
            f"\n⚡ PARALLEL EXECUTION BENEFIT:\n"
            f"   Sequential time: {total_sequential:.2f}s\n"
            f"   Parallel time: {total_parallel:.2f}s\n"
            f"   Time saved: {time_saved:.2f}s ({time_saved/total_sequential*100:.1f}%)"
        )
    
    # Put the separator between parts rather than gluing it onto the first one
    separator = "\n\n" + "=" * 50 + "\n\n"
    await ctx.yield_output(separator.join(output_parts))

# Build the fan-out workflow
# Pattern: start -> [fanout to preparers] -> [agents] -> [capture timing] -> aggregator
fanout_workflow = (
    WorkflowBuilder()
    .set_start_executor(fanout_start)
    # Fan-out from start directly to path preparers based on email length
    .add_multi_selection_edge_group(
        fanout_start,
        targets=[fanout_respond_prep, fanout_summarize_prep],
        selection_func=fanout_select_paths,
    )
    # Each preparer sends to its agent
    .add_edge(fanout_respond_prep, writer_agent)
    .add_edge(fanout_summarize_prep, summarizer_agent)
    # Capture completion times immediately after each agent
    .add_edge(writer_agent, capture_writer_completion)
    .add_edge(summarizer_agent, capture_summarizer_completion)
    # Fan-in: collect all results
    .add_fan_in_edges([capture_writer_completion, capture_summarizer_completion], fanout_aggregator)
    .build()
)

print("✅ Fan-out/fan-in workflow built")

## Test Parallel Execution

Long emails trigger both response and summary paths concurrently.
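
The benefit reported by the aggregator is simple arithmetic: sequential time is the sum of the path durations, and parallel time is the longest one. A worked example follows, with made-up durations of 2.0s for the reply and 1.5s for the summary. These are not measurements from this notebook.

```python
def parallel_benefit(durations: list[float]) -> dict[str, float]:
    """Compare running tasks back to back with running them concurrently."""
    sequential = sum(durations)
    parallel = max(durations)
    saved = sequential - parallel
    return {
        "sequential": sequential,
        "parallel": parallel,
        "saved": saved,
        "saved_pct": saved / sequential * 100,
    }


# Hypothetical timings: 2.0s for the reply, 1.5s for the summary
benefit = parallel_benefit([2.0, 1.5])
print(f"Sequential time: {benefit['sequential']:.2f}s")
print(f"Parallel time:   {benefit['parallel']:.2f}s")
print(f"Time saved:      {benefit['saved']:.2f}s ({benefit['saved_pct']:.1f}%)")
```

The saving is largest when the paths take similar time. When one path dominates, the parallel time stays close to the sequential time.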

In [None]:
# Test with long legitimate email
async def test_fanout():
    email_text = f"From: {LEGIT_EMAIL.sender}\nSubject: {LEGIT_EMAIL.subject}\n\n{LEGIT_EMAIL.body}"
    
    print(f"📧 Testing LONG email ({len(email_text)} chars > {LONG_EMAIL_THRESHOLD} threshold)")
    print("Expected: Response AND Summary in parallel\n")
    print("-" * 60)
    
    async for event in fanout_workflow.run_stream(email_text):
        if isinstance(event, WorkflowOutputEvent):
            print(event.data)

await test_fanout()

# 11. Workflow Pattern #4 — Group Chat (InboxOps Review Committee)

![Group Chat Pattern](images/group-chat.png)

InboxOps Enterprise customers have strict requirements.

Before sending a response, the draft must pass:
1) Security review (PII / compliance)
2) Accuracy review (promises and timelines)
3) Tone review (final editor)

A Group Chat pattern allows:
- shared context across reviewers
- structured collaboration
- a "final editor" agent that ships the final result

## Key Differences

| Pattern | Coordination | Use Case |
|---------|--------------|----------|
| **Concurrent** | No coordination | Independent parallel tasks |
| **Group Chat** | Orchestrator selects speakers | Iterative refinement, shared context |
| **Magentic** | Manager with dynamic planning | Complex open-ended tasks |

## Define Specialists

Create agents with distinct review roles. All agents will see the shared conversation.

In [None]:
from agent_framework import GroupChatBuilder, GroupChatState, ConcurrentBuilder, MagenticBuilder

# Three specialized reviewers - order matters! Last one produces final output.

# 1st: Security reviewer - identifies security/compliance issues
security_reviewer = ChatAgent(
    name="SecurityReviewer",
    description="Security and compliance specialist - reviews first",
    instructions="""You are the FIRST reviewer. Analyze the support response for:
- Data exposure risks (customer IDs, case numbers that shouldn't be in emails)
- PII handling concerns (names, order details)
- Policy compliance issues

Be concise. List only the security issues you find. Do NOT rewrite the email - just identify problems for later reviewers to address.""",
    chat_client=chat_client,
)

# 2nd: Accuracy reviewer - checks facts and promises
accuracy_reviewer = ChatAgent(
    name="AccuracyReviewer", 
    description="Factual accuracy specialist - reviews second",
    instructions="""You are the SECOND reviewer. Analyze the support response for:
- Unrealistic promises or timelines
- Unverifiable claims
- Compensation appropriateness

Consider the security feedback from the previous reviewer. Be concise. List only the accuracy issues. Do NOT rewrite the email - just identify problems for the final reviewer to address.""",
    chat_client=chat_client,
)

# 3rd: Tone reviewer - applies all feedback and produces final email
tone_reviewer = ChatAgent(
    name="ToneReviewer",
    description="Tone specialist and final editor - produces revised email",
    instructions="""You are the FINAL reviewer. Your job is to:
1. Consider ALL feedback from SecurityReviewer and AccuracyReviewer
2. Review the tone and empathy of the original email
3. **PRODUCE A FINAL REVISED EMAIL** that:
   - Addresses security concerns (remove/mask sensitive identifiers if needed)
   - Fixes accuracy issues (realistic timelines, appropriate promises)
   - Maintains professional, empathetic tone
   - Is ready to send to the customer

End your response with the complete revised email in a clear format.""",
    chat_client=chat_client,
)

print("✅ Three specialist reviewers defined:")
print("   1. SecurityReviewer - identifies security issues")
print("   2. AccuracyReviewer - checks facts and promises")  
print("   3. ToneReviewer - applies all feedback and produces FINAL email")

## Build Group Chat with Round-Robin

Simple selection: each reviewer speaks in turn.
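
The two ideas behind this pattern, turn-taking and a shared transcript, can be sketched in plain Python. The reviewers here are hypothetical functions that only report how many earlier messages they could see. In the real workflow each turn is an agent call.

```python
from typing import Callable

Transcript = list[tuple[str, str]]  # (speaker, text) pairs


def make_reviewer(name: str) -> Callable[[Transcript], str]:
    """Hypothetical reviewer that reports how much of the conversation it saw."""
    def review(transcript: Transcript) -> str:
        return f"{name} saw {len(transcript)} earlier message(s)"
    return review


reviewers = {
    name: make_reviewer(name)
    for name in ("SecurityReviewer", "AccuracyReviewer", "ToneReviewer")
}


def run_group_chat(task: str, max_turns: int = 3) -> Transcript:
    """Round-robin over the reviewers; every turn receives the full transcript."""
    transcript: Transcript = [("user", task)]
    names = list(reviewers)
    for round_index in range(max_turns):
        speaker = names[round_index % len(names)]  # wraps around after the last name
        transcript.append((speaker, reviewers[speaker](transcript)))
    return transcript


chat = run_group_chat("Review this support response")
for speaker, text in chat:
    print(f"[{speaker}] {text}")
```

Stopping after three turns gives each reviewer exactly one turn, with the final editor seeing everything said before it.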

In [None]:
# Sample draft response to review
draft_to_review = """
Subject: Re: Order #12345 - Delivery Issue

Dear Sarah,

I'm so sorry to hear about the missing package! This must be incredibly frustrating.

I've located your order and can confirm it was marked as delivered on Monday. Here's what I'll do:

1. I've opened an investigation with our shipping partner (Case #INV-789)
2. As a Premium customer, I'm expediting a replacement shipment TODAY
3. The replacement will arrive by Thursday, well before your Friday presentation

Your account has also been credited $50 for the inconvenience.

If you need anything else, reply directly to this email - I'm here to help!

Best regards,
Support Team
"""

# Round-robin selector: each reviewer speaks in order
def round_robin_selector(state: GroupChatState) -> str:
    """Pick the next speaker based on round index."""
    participants = list(state.participants.keys())
    return participants[state.current_round % len(participants)]

# Build group chat with round-robin selection
# ORDER MATTERS: Security → Accuracy → Tone (final editor)
review_group_chat = (
    GroupChatBuilder()
    .with_orchestrator(selection_func=round_robin_selector, orchestrator_name="RoundRobinOrchestrator")
    .participants([security_reviewer, accuracy_reviewer, tone_reviewer])  # Order: Security → Accuracy → Tone
    .with_termination_condition(lambda msgs: len([m for m in msgs if m.role.value == "assistant"]) >= 3)
    .build()
)

print("✅ Group chat built with round-robin selection")
print("   Order: SecurityReviewer → AccuracyReviewer → ToneReviewer (final)")

## Test Round-Robin Group Chat

Each reviewer analyzes the draft in turn, building on previous insights.

In [None]:
# Run the group chat with round-robin selection
from agent_framework import AgentRunUpdateEvent

async def test_round_robin_group_chat():
    print("📝 DRAFT TO REVIEW:")
    print(draft_to_review)
    print("-" * 60)
    print("\n🔄 ROUND-ROBIN GROUP CHAT (each reviewer speaks in turn):\n")
    
    last_executor_id: str | None = None
    agent_order = []
    
    async for event in review_group_chat.run_stream(f"Review this support response:\n{draft_to_review}"):
        if isinstance(event, AgentRunUpdateEvent):
            eid = event.executor_id
            if eid != last_executor_id:
                if last_executor_id is not None:
                    print("\n")
                agent_order.append(eid)
                print(f"\n🤖 [{eid}] (Turn #{len(agent_order)}):", end=" ", flush=True)
                last_executor_id = eid
            print(event.data, end="", flush=True)
        
        elif isinstance(event, WorkflowOutputEvent):
            print("\n\n" + "=" * 60)
            print(f"📊 EXECUTION ORDER: {' → '.join(agent_order)}")
            print("=" * 60)

await test_round_robin_group_chat()

## Agent-Orchestrated Group Chat

Instead of a fixed rotation, an orchestrator agent decides which reviewer speaks next.

In [None]:
# Agent-based orchestrator for intelligent speaker selection
from typing import cast

from agent_framework import AgentRunUpdateEvent, ChatMessage, WorkflowOutputEvent

orchestrator_agent = ChatAgent(
    name="ReviewOrchestrator",
    description="Coordinates multi-agent review process",
    instructions=f"""You coordinate a team reviewing this support response:

{draft_to_review}

YOUR TEAM:
- SecurityReviewer: Identifies security/PII issues (reviews first)
- AccuracyReviewer: Checks facts and promises (reviews second)
- ToneReviewer: Final editor who produces the revised email (reviews last)

YOUR PROCESS:
1. Start with SecurityReviewer to check data safety and PII
2. Then AccuracyReviewer to verify claims and timelines
3. **Finally, ToneReviewer to produce the FINAL REVISED EMAIL** incorporating all feedback
4. If needed, you may ask follow-up questions to any reviewer
5. End when ToneReviewer delivers the complete revised email

Select speakers intelligently. CRITICAL: ToneReviewer must go last and produce the final email.""",
    chat_client=chat_client,
)

# Build group chat with agent-based orchestration
# ORDER: Security → Accuracy → Tone (final editor)
intelligent_review_chat = (
    GroupChatBuilder()
    .with_orchestrator(agent=orchestrator_agent)
    .participants([security_reviewer, accuracy_reviewer, tone_reviewer])
    .with_termination_condition(lambda msgs: len([m for m in msgs if m.role.value == "assistant"]) >= 5)
    .build()
)

# Run with detailed logging
async def test_agent_orchestrated_group_chat():
    print("📝 DRAFT TO REVIEW:")
    print(draft_to_review)
    print("-" * 60)
    print("\n🧠 AGENT-ORCHESTRATED GROUP CHAT (intelligent speaker selection):\n")
    
    last_executor_id: str | None = None
    agent_calls: dict[str, int] = {}
    
    async for event in intelligent_review_chat.run_stream("Review this support response. Security and Accuracy reviewers identify issues, then ToneReviewer produces the final revised email."):
        if isinstance(event, AgentRunUpdateEvent):
            eid = event.executor_id
            if eid != last_executor_id:
                if last_executor_id is not None:
                    print("\n")
                agent_calls[eid] = agent_calls.get(eid, 0) + 1
                print(f"\n🤖 [{eid}] (Call #{agent_calls[eid]}):", end=" ", flush=True)
                last_executor_id = eid
            print(event.data, end="", flush=True)
        
        elif isinstance(event, WorkflowOutputEvent):
            output_messages = cast(list[ChatMessage], event.data)
            
            print("\n\n" + "=" * 60)
            print("📊 EXECUTION SUMMARY")
            print("=" * 60)
            print(f"   Total calls: {sum(agent_calls.values())}")
            print("\n   Calls per agent:")
            for agent, count in sorted(agent_calls.items()):
                print(f"      {agent}: {count} call(s)")
            
            print("\n   💡 The orchestrator dynamically selected speakers")
            print("      based on what was needed at each step")
            
            print("\n" + "=" * 60)
            print("📧 FINAL REVISED EMAIL (from ToneReviewer)")
            print("=" * 60)
            for msg in reversed(output_messages):
                if msg.role.value == "assistant" and "ToneReviewer" in str(msg):
                    print(msg.text)
                    break

await test_agent_orchestrated_group_chat()

# 12. Workflow Pattern #5 — Magentic Orchestration (Dynamic Planning)

![Magentic Pattern](images/magentic-workflow.png)

InboxOps eventually expanded beyond emails.

They wanted an AI system that can:
- research patterns in customer complaints
- identify recurring issues
- propose operational improvements
- generate executive summaries

That's not a fixed pipeline anymore.

Magentic orchestration introduces a **manager agent** that:

✅ plans dynamically  
✅ delegates to specialists  
✅ iterates until the task is complete  

Of the patterns in this notebook, it is the most flexible and the best fit for open-ended problems.

## Use Case: InboxOps Weekly Support Intelligence Report

A complex task requiring:
1. **Research Agent** - Gather complaint patterns and data
2. **Analyst Agent** - Process and analyze the data
3. **Manager** - Dynamic planning and synthesis

The manager autonomously decides which agent to call and when based on progress.

In [None]:
# Magentic Orchestration: Research + Analysis workflow
import json
from typing import cast
from agent_framework import (
    AgentRunUpdateEvent,
    MagenticOrchestratorEvent,
    MagenticProgressLedger,
)

# Research Agent - gathers data and patterns from support tickets
# Note: In production, this would connect to your ticketing system
researcher_agent = ChatAgent(
    name="ResearcherAgent",
    description="Specialist in research and information gathering about support patterns and customer complaints",
    instructions="""You are an InboxOps Support Research Specialist. Your job is to:
- Gather information about customer complaint patterns
- Identify recurring issues and trends
- Provide realistic example data based on common e-commerce support scenarios

When asked about support data, provide realistic example data for categories like:
shipping issues, refund requests, product defects, billing disputes, account access.
Be concise and factual. Format data clearly for analysis.""",
    chat_client=chat_client,
)

# Analyst Agent - processes and analyzes data
# Note: In production, add HostedCodeInterpreterTool for real code execution
analyst_agent = ChatAgent(
    name="AnalystAgent",
    description="Data analyst who processes support data and creates operational insights",
    instructions="""You are an InboxOps Data Analyst. Your job is to:
- Process and analyze support ticket data
- Calculate metrics (volume trends, resolution times, escalation rates)
- Create clear tables and visualization descriptions
- Identify operational improvement opportunities

Show your calculations step by step. Format results in clear tables.""",
    chat_client=chat_client,
)

# Manager Agent - orchestrates the research workflow
manager_agent = ChatAgent(
    name="ResearchManager",
    description="Orchestrator that coordinates support intelligence workflows",
    instructions="""You manage an InboxOps research team to complete support intelligence reports.

YOUR TEAM:
- ResearcherAgent: Gathers information about support patterns and customer complaints
- AnalystAgent: Processes data, performs calculations, creates operational insights

YOUR PROCESS:
1. Break down the intelligence request into subtasks
2. Delegate to ResearcherAgent to gather relevant support data
3. Delegate to AnalystAgent to process and analyze the data
4. Continue iterating until you have comprehensive insights
5. Synthesize all findings into a final report

You dynamically decide who to call based on what's needed. You may call agents multiple times.""",
    chat_client=chat_client,
)

print("✅ Magentic agents defined: ResearcherAgent, AnalystAgent, ResearchManager")

## Build & Run Magentic Workflow

The manager dynamically plans and delegates. Watch how it calls different agents based on the evolving task.

In [None]:
# Build Magentic workflow
magentic_research_workflow = (
    MagenticBuilder()
    .participants([researcher_agent, analyst_agent])
    .with_manager(
        agent=manager_agent,
        max_round_count=10,  # Maximum delegation rounds
        max_stall_count=2,   # Replan after 2 stalls
    )
    .build()
)

# Research task - InboxOps Weekly Support Intelligence Report
research_task = """
InboxOps wants an internal weekly support intelligence report:
1. Identify top 5 complaint categories from incoming emails
2. Estimate urgency and business impact for each category
3. Calculate resolution time trends and escalation rates
4. Suggest operational improvements
5. Output a clean summary table + executive summary
"""

async def run_magentic_research():
    print("🔬 INBOXOPS SUPPORT INTELLIGENCE WORKFLOW")
    print("=" * 60)
    print(f"📋 TASK:\n{research_task}")
    print("=" * 60)
    
    last_message_id: str | None = None
    agent_calls: dict[str, int] = {}
    
    async for event in magentic_research_workflow.run_stream(research_task):
        # Track streaming from agents
        if isinstance(event, AgentRunUpdateEvent):
            message_id = event.data.message_id
            executor_id = event.executor_id
            
            if message_id != last_message_id:
                if last_message_id is not None:
                    print("\n")
                agent_calls[executor_id] = agent_calls.get(executor_id, 0) + 1
                print(f"\n🤖 [{executor_id}] (Call #{agent_calls[executor_id]}):", end=" ", flush=True)
                last_message_id = message_id
            
            print(event.data, end="", flush=True)
        
        # Track orchestration events
        elif isinstance(event, MagenticOrchestratorEvent):
            print(f"\n\n{'='*55}")
            print(f"📋 ORCHESTRATOR: {event.event_type.name}")
            print(f"{'='*55}")
            
            if isinstance(event.data, MagenticProgressLedger):
                ledger = event.data.to_dict()
                if "next_speaker" in ledger:
                    next_info = ledger.get('next_speaker', {})
                    if isinstance(next_info, dict):
                        print(f"   ➡️ Next: {next_info.get('answer', 'N/A')}")
                        reason = next_info.get('reason', '')
                        if reason:
                            print(f"   💭 Why: {reason[:100]}...")
                    else:
                        print(f"   ➡️ Next: {next_info}")
        
        # Final output
        elif isinstance(event, WorkflowOutputEvent):
            output_messages = cast(list[ChatMessage], event.data)
            
            print("\n\n" + "=" * 60)
            print("📊 EXECUTION SUMMARY")
            print("=" * 60)
            print(f"   Total agent calls: {sum(agent_calls.values())}")
            print("\n   Calls per agent:")
            for agent, count in sorted(agent_calls.items()):
                print(f"      {agent}: {count} call(s)")
            
            print("\n   ✨ Manager dynamically orchestrated:")
            print("      - Broke down complex task into subtasks")
            print("      - Called ResearcherAgent for data gathering")
            print("      - Called AnalystAgent for processing")
            print("      - Synthesized into final report")
            
            print("\n" + "=" * 60)
            print("📑 FINAL INBOXOPS INTELLIGENCE REPORT")
            print("=" * 60)
            for msg in reversed(output_messages):
                if msg.role.value == "assistant":
                    print(msg.text)
                    break

await run_magentic_research()

# 13. V4 — Evaluation & Testing (InboxOps Quality Metrics)

## The Problem: How Do We Know the Agent Works Well?

After deploying the InboxOps Support Email Copilot, stakeholders ask:

- 📊 **What's our response accuracy rate?**
- ⏱️ **How fast are we resolving tickets?**
- 😊 **Are customers satisfied?**
- 💰 **What's the ROI vs. human agents?**

**"Deploy and hope" is not a strategy.**

## Solution: Agent Evaluation Framework

Systematically measure agent performance:

```python
# Pseudocode: the measure_*/calculate_* helpers and evaluate() are placeholders
metrics = {
    "accuracy": measure_correct_responses(),
    "response_time": measure_avg_latency(),
    "customer_satisfaction": measure_csat_score(),
    "cost_per_ticket": calculate_cost(),
    "human_escalation_rate": measure_escalations()
}

# Automated testing
results = []
for test_case in test_dataset:
    prediction = await agent.run(test_case.input)  # agent.run is async
    score = evaluate(prediction.text, test_case.expected_output)
    results.append(score)
```

Let's build a comprehensive evaluation system for InboxOps.


In [None]:
from typing import List, Dict, Any
from dataclasses import dataclass
import time
from datetime import datetime

# Define test cases for InboxOps agents
@dataclass
class TestCase:
    """Test case for agent evaluation"""
    id: str
    input: str
    expected_output: Dict[str, Any]
    category: str

# Create test dataset
test_cases = [
    TestCase(
        id="TEST-001",
        input="Where is my order #12345? It's been 3 weeks!",
        expected_output={
            "category": "order_status",
            "priority": "high",
            "sentiment": "negative",
            "requires_tool": True,
            "tool_name": "get_order_status"
        },
        category="order_inquiry"
    ),
    TestCase(
        id="TEST-002",
        input="Thank you for the quick refund! Great service!",
        expected_output={
            "category": "feedback",
            "priority": "low",
            "sentiment": "positive",
            "requires_tool": False
        },
        category="positive_feedback"
    ),
    TestCase(
        id="TEST-003",
        input="I need to cancel ticket TKT-789 immediately.",
        expected_output={
            "category": "ticket_management",
            "priority": "urgent",
            "sentiment": "neutral",
            "requires_tool": True,
            "tool_name": "cancel_ticket"
        },
        category="urgent_request"
    ),
]

# Evaluation metrics
class AgentEvaluator:
    def __init__(self):
        self.results = []
    
    def evaluate_response_accuracy(self, prediction: str, expected: Dict) -> float:
        """Evaluate if response addresses the correct category"""
        # Simplified accuracy check (in production, use more sophisticated NLP)
        score = 0.0
        
        # Check category keywords
        category_keywords = {
            "order_status": ["order", "shipment", "tracking", "delivery"],
            "feedback": ["thank", "appreciate", "great"],
            "ticket_management": ["ticket", "cancel", "close"],
        }
        
        expected_category = expected.get("category", "")
        keywords = category_keywords.get(expected_category, [])
        
        if any(keyword in prediction.lower() for keyword in keywords):
            score += 0.5
        
        # Check sentiment alignment
        sentiment = expected.get("sentiment", "")
        if sentiment == "positive" and any(word in prediction.lower() for word in ["glad", "happy", "great"]):
            score += 0.25
        elif sentiment == "negative" and any(word in prediction.lower() for word in ["sorry", "apologize", "unfortunately"]):
            score += 0.25
        
        # Check priority handling
        priority = expected.get("priority", "")
        if priority == "urgent" and any(word in prediction.lower() for word in ["immediately", "right away", "asap"]):
            score += 0.25
        
        return min(score, 1.0)
    
    def evaluate_response_time(self, latency_ms: float) -> Dict[str, Any]:
        """Evaluate response time performance"""
        # SLA targets for InboxOps
        if latency_ms < 1000:
            grade = "excellent"
            score = 1.0
        elif latency_ms < 3000:
            grade = "good"
            score = 0.8
        elif latency_ms < 5000:
            grade = "acceptable"
            score = 0.6
        else:
            grade = "poor"
            score = 0.3
        
        return {
            "latency_ms": latency_ms,
            "grade": grade,
            "score": score,
            "meets_sla": latency_ms < 5000
        }
    
    async def run_evaluation(self, agent: ChatAgent, test_cases: List[TestCase]) -> Dict[str, Any]:
        """Run comprehensive evaluation on test dataset"""
        results = []
        
        print("🧪 RUNNING AGENT EVALUATION")
        print("="*60)
        
        for test_case in test_cases:
            print(f"\n📝 Test Case: {test_case.id}")
            print(f"   Input: {test_case.input[:60]}...")
            
            # Measure response time (agent.run is a coroutine, so it must be awaited)
            start_time = time.perf_counter()
            response = await agent.run(test_case.input)
            latency_ms = (time.perf_counter() - start_time) * 1000
            response_text = response.text
            
            # Evaluate accuracy
            accuracy_score = self.evaluate_response_accuracy(
                response_text,
                test_case.expected_output
            )
            
            # Evaluate response time
            timing_result = self.evaluate_response_time(latency_ms)
            
            result = {
                "test_id": test_case.id,
                "category": test_case.category,
                "accuracy_score": accuracy_score,
                "latency_ms": latency_ms,
                "timing_grade": timing_result["grade"],
                "meets_sla": timing_result["meets_sla"],
                "response_preview": (response_text[:100] + "...") if len(response_text) > 100 else response_text
            }
            
            results.append(result)
            print(f"   ✅ Accuracy: {accuracy_score:.2f}")
            print(f"   ⏱️  Latency: {latency_ms:.0f}ms ({timing_result['grade']})")
        
        # Calculate aggregate metrics
        avg_accuracy = sum(r["accuracy_score"] for r in results) / len(results)
        avg_latency = sum(r["latency_ms"] for r in results) / len(results)
        sla_compliance = sum(1 for r in results if r["meets_sla"]) / len(results) * 100
        
        summary = {
            "total_tests": len(results),
            "avg_accuracy": avg_accuracy,
            "avg_latency_ms": avg_latency,
            "sla_compliance_rate": sla_compliance,
            "timestamp": datetime.now().isoformat(),
            "detailed_results": results
        }
        
        return summary

# Create evaluator
evaluator = AgentEvaluator()

# Run evaluation on our support agent
evaluation_results = await evaluator.run_evaluation(support_agent_with_tools, test_cases)

# Display results
print("\n" + "="*60)
print("📊 EVALUATION SUMMARY")
print("="*60)
print(f"Total Test Cases:      {evaluation_results['total_tests']}")
print(f"Average Accuracy:      {evaluation_results['avg_accuracy']:.2%}")
print(f"Average Latency:       {evaluation_results['avg_latency_ms']:.0f}ms")
print(f"SLA Compliance:        {evaluation_results['sla_compliance_rate']:.1f}%")
print(f"Timestamp:             {evaluation_results['timestamp']}")

print("\n📈 DETAILED RESULTS")
print("-"*60)
for result in evaluation_results['detailed_results']:
    status = "✅" if result['accuracy_score'] >= 0.7 else "⚠️"
    print(f"{status} {result['test_id']}: Accuracy={result['accuracy_score']:.2f}, Latency={result['latency_ms']:.0f}ms")

# Calculate ROI metrics
print("\n💰 ROI ANALYSIS")
print("-"*60)
human_cost_per_email = 2.50  # $2.50 per email with human agent
ai_cost_per_email = 0.02     # $0.02 per email with AI agent
monthly_volume = 50000       # 50K emails/month

monthly_human_cost = monthly_volume * human_cost_per_email
monthly_ai_cost = monthly_volume * ai_cost_per_email
monthly_savings = monthly_human_cost - monthly_ai_cost
annual_savings = monthly_savings * 12

print(f"Monthly Email Volume:   {monthly_volume:,}")
print(f"Human Agent Cost:       ${monthly_human_cost:,.2f}/month")
print(f"AI Agent Cost:          ${monthly_ai_cost:,.2f}/month")
print(f"Monthly Savings:        ${monthly_savings:,.2f}")
print(f"Annual Savings:         ${annual_savings:,.2f}")
print(f"Cost Reduction:         {(1 - ai_cost_per_email/human_cost_per_email)*100:.1f}%")

print("\n✅ Evaluation complete! Use these metrics to:")
print("   • Track agent performance over time")
print("   • Identify areas for improvement")
print("   • Justify ROI to stakeholders")
print("   • Set SLAs and quality benchmarks")


## Solution 3: LLM-as-Judge Evaluation

Keyword matching is brittle, and manual review doesn't scale to hundreds of test cases. **LLM judges** automate quality assessment along four criteria:

- **Correctness** - Is the response factually accurate?
- **Helpfulness** - Does it solve the customer's problem?
- **Tone** - Is it professional and empathetic?
- **Completeness** - Are all questions answered?
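
Judge models do not always return clean JSON, so it helps to validate the scores before aggregating them. The sketch below is one possible approach, not part of the Agent Framework: `parse_judge_scores` and `CRITERIA` are hypothetical names, and the sample reply is made up. It takes the span from the first `{` to the last `}` in the reply, checks that each criterion is a number from 1 to 5, and derives the overall score as the plain average.

```python
import json
import re

# The four criteria listed above
CRITERIA = ("correctness", "helpfulness", "tone", "completeness")


def parse_judge_scores(reply: str) -> dict:
    """Extract a JSON object from a judge reply and validate its 1-5 scores."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # first '{' through last '}'
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))

    scores = {}
    for name in CRITERIA:
        value = data.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"{name!r} must be a number from 1 to 5, got {value!r}")
        scores[name] = float(value)

    # Recompute the overall score instead of trusting the judge's arithmetic
    scores["overall"] = sum(scores.values()) / len(CRITERIA)
    scores["explanation"] = str(data.get("explanation", ""))
    return scores


sample_reply = (
    "Here is my rating:\n"
    '{"correctness": 4, "helpfulness": 5, "tone": 5, "completeness": 4, '
    '"explanation": "Clear and polite."}'
)
parsed = parse_judge_scores(sample_reply)
print(parsed["overall"])
```

Rejecting out-of-range or missing scores early keeps a malformed judge reply from silently lowering the averages in a batch report.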

---

In [None]:
## LLM Judge Implementation

class LLMJudge:
    """Use an LLM (whichever model the judge client wraps) to evaluate agent responses."""
    
    def __init__(self, judge_client):
        self.judge = ChatAgent(
            chat_client=judge_client,
            name="QualityJudge",
            instructions="""
            You are an expert evaluator of customer support responses.
            Rate responses on a scale of 1-5 for each criterion.
            Provide a brief explanation for your rating.
            """
        )
    
    async def evaluate_response(
        self,
        customer_message: str,
        agent_response: str,
        expected_answer: str | None = None
    ) -> dict:
        """Evaluate a single response."""
        
        eval_prompt = f"""
        Evaluate this customer support response:
        
        Customer Question:
        {customer_message}
        
        Agent Response:
        {agent_response}
        
        {f'Expected Answer: {expected_answer}' if expected_answer else ''}
        
        Rate the response on these criteria (1-5 scale):
        1. Correctness: Is the information accurate?
        2. Helpfulness: Does it solve the problem?
        3. Tone: Professional and empathetic?
        4. Completeness: All questions answered?
        
        Respond in this JSON format:
        {{
            "correctness": 4,
            "helpfulness": 5,
            "tone": 5,
            "completeness": 4,
            "overall": 4.5,
            "explanation": "Brief explanation here"
        }}
        """
        
        result = await self.judge.run(eval_prompt)
        
        # Parse JSON from response
        import json
        import re
        
        # Extract JSON from a markdown code block if present, else the first {...} span
        text = result.text
        json_match = (
            re.search(r'```(?:json)?\s*({.*?})\s*```', text, re.DOTALL)
            or re.search(r'({.*})', text, re.DOTALL)
        )
        json_str = json_match.group(1) if json_match else text
        
        try:
            return json.loads(json_str)
        except json.JSONDecodeError:
            # Fallback if parsing fails
            return {
                "error": "Failed to parse judge response",
                "raw": text[:200]
            }

# Create judge
judge = LLMJudge(client)

# Test evaluation
print("Testing LLM judge...\n")

test_customer_msg = "My order #12345 hasn't arrived and it's been 2 weeks."
test_agent_response = """
I understand your frustration. Let me check order #12345 for you.
I see it shipped on May 1st via USPS. The expected delivery was May 8th.
Since it's now May 15th and you haven't received it, I'll file a missing package claim
and issue a full refund today. You should see it in 3-5 business days.
Would you like a replacement order shipped with expedited shipping at no charge?
"""

evaluation = await judge.evaluate_response(
    test_customer_msg,
    test_agent_response
)

print("Evaluation Results:")
print(json.dumps(evaluation, indent=2))

In [None]:
## Batch Evaluation

async def run_batch_evaluation(agent, test_cases: list, judge: LLMJudge):
    """Evaluate agent on multiple test cases."""
    results = []
    
    print(f"Evaluating agent on {len(test_cases)} test cases...\n")
    
    for i, case in enumerate(test_cases):
        print(f"[{i+1}/{len(test_cases)}] Testing: {case['input'][:50]}...")
        
        # Get agent response
        try:
            response = await agent.run(case['input'])
            agent_text = response.text
        except Exception as e:
            agent_text = f"ERROR: {str(e)}"
        
        # Evaluate with LLM judge
        eval_result = await judge.evaluate_response(
            case['input'],
            agent_text,
            case.get('expected')
        )
        
        results.append({
            'test_case': case,
            'agent_response': agent_text,
            'evaluation': eval_result
        })
    
    # Calculate aggregate metrics (skip cases where the judge reply could not be parsed,
    # so parse failures are not counted as zero scores)
    scored = [r['evaluation'] for r in results if 'error' not in r['evaluation']]
    if scored:
        avg_correctness = sum(e.get('correctness', 0) for e in scored) / len(scored)
        avg_helpfulness = sum(e.get('helpfulness', 0) for e in scored) / len(scored)
        avg_tone = sum(e.get('tone', 0) for e in scored) / len(scored)
        avg_completeness = sum(e.get('completeness', 0) for e in scored) / len(scored)
        
        print("\n" + "="*80)
        print("📊 INBOXOPS QUALITY REPORT")
        print("="*80)
        print(f"Total Test Cases: {len(results)}")
        print(f"\nAverage Scores (out of 5):")
        print(f"  Correctness:  {avg_correctness:.2f} {'⭐' * int(avg_correctness)}")
        print(f"  Helpfulness:  {avg_helpfulness:.2f} {'⭐' * int(avg_helpfulness)}")
        print(f"  Tone:         {avg_tone:.2f} {'⭐' * int(avg_tone)}")
        print(f"  Completeness: {avg_completeness:.2f} {'⭐' * int(avg_completeness)}")
        print(f"\nOverall Quality: {(avg_correctness + avg_helpfulness + avg_tone + avg_completeness) / 4:.2f}/5.0")
        print("="*80)
    
    return results

# Define test cases
inboxops_test_cases = [
    {
        'input': 'How do I track my order #98765?',
        'expected': 'Provide tracking link and estimated delivery date'
    },
    {
        'input': 'I want to return a defective product.',
        'expected': 'Explain return process, provide label, and timeline'
    },
    {
        'input': 'Your website is down!',
        'expected': 'Acknowledge issue, provide status, offer alternatives'
    }
]

# Run evaluation
# eval_results = await run_batch_evaluation(support_agent, inboxops_test_cases, judge)

print("\n✅ Batch evaluation framework ready!")
print("Uncomment the line above to run full evaluation.")

## Solution 4: A/B Testing Framework

Compare different agent configurations to optimize for cost, speed, and quality:

- **Model comparison** - GPT-4 vs GPT-4o vs GPT-3.5
- **Prompt engineering** - Test different instructions
- **Temperature tuning** - Creativity vs consistency
- **Tool configurations** - Which tools improve quality?
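
With only a handful of test cases, a difference in average score between two configurations can easily be noise. One way to check is a paired comparison: score both agents on the same cases, take per-case differences, and bootstrap a confidence interval for the mean difference. The sketch below uses only the standard library. The function names and the scores are illustrative assumptions, not output from the harness in this notebook.

```python
import random
from statistics import mean


def paired_deltas(scores_a, scores_b):
    """Per-test-case score difference (B minus A) for the same test cases."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both agents must be scored on the same test cases")
    return [b - a for a, b in zip(scores_a, scores_b)]


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean delta."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    low_index = round(n_resamples * alpha / 2)
    high_index = round(n_resamples * (1 - alpha / 2)) - 1
    return means[low_index], means[high_index]


# Made-up judge scores (1-5) for five test cases
scores_a = [4, 3, 5, 4, 3]
scores_b = [5, 4, 5, 4, 4]

deltas = paired_deltas(scores_a, scores_b)
low, high = bootstrap_ci(deltas)
print(f"mean delta: {mean(deltas):+.2f}")
print(f"95% CI: [{low:+.2f}, {high:+.2f}]")
```

If the interval includes zero, the test set is probably too small to declare a winner, and adding cases is usually more useful than tuning further.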

---

In [None]:
## A/B Testing Harness

class ABTestingHarness:
    """Compare two agent configurations."""
    
    def __init__(self, agent_a, agent_b, test_cases, judge):
        self.agent_a = agent_a
        self.agent_b = agent_b
        self.test_cases = test_cases
        self.judge = judge
    
    async def run_comparison(self):
        """Run both agents on all test cases and compare."""
        results_a = []
        results_b = []
        
        print(f"Running A/B test on {len(self.test_cases)} cases...\n")
        
        for i, case in enumerate(self.test_cases):
            print(f"[{i+1}/{len(self.test_cases)}] Testing: {case['input'][:40]}...")
            
            # Test Agent A
            try:
                result_a = await self.agent_a.run(case['input'])
                eval_a = await self.judge.evaluate_response(
                    case['input'],
                    result_a.text
                )
                results_a.append({
                    'response': result_a.text,
                    'tokens': getattr(result_a.usage, 'total_tokens', 0) if hasattr(result_a, 'usage') else 0,
                    'evaluation': eval_a
                })
            except Exception as e:
                results_a.append({'error': str(e)})
            
            # Test Agent B
            try:
                result_b = await self.agent_b.run(case['input'])
                eval_b = await self.judge.evaluate_response(
                    case['input'],
                    result_b.text
                )
                results_b.append({
                    'response': result_b.text,
                    'tokens': getattr(result_b.usage, 'total_tokens', 0) if hasattr(result_b, 'usage') else 0,
                    'evaluation': eval_b
                })
            except Exception as e:
                results_b.append({'error': str(e)})
        
        return self._generate_report(results_a, results_b)
    
    def _generate_report(self, results_a, results_b):
        """Generate comparison report."""
        
        # Calculate metrics
        def calc_avg(results, metric):
            values = [r['evaluation'].get(metric, 0) for r in results if 'evaluation' in r]
            return sum(values) / len(values) if values else 0
        
        def calc_tokens(results):
            values = [r.get('tokens', 0) for r in results if 'tokens' in r]
            return sum(values)
        
        report = {
            'agent_a': {
                'correctness': calc_avg(results_a, 'correctness'),
                'helpfulness': calc_avg(results_a, 'helpfulness'),
                'tone': calc_avg(results_a, 'tone'),
                'completeness': calc_avg(results_a, 'completeness'),
                'total_tokens': calc_tokens(results_a)
            },
            'agent_b': {
                'correctness': calc_avg(results_b, 'correctness'),
                'helpfulness': calc_avg(results_b, 'helpfulness'),
                'tone': calc_avg(results_b, 'tone'),
                'completeness': calc_avg(results_b, 'completeness'),
                'total_tokens': calc_tokens(results_b)
            }
        }
        
        # Calculate improvements
        report['comparison'] = {
            'quality_delta': (
                (report['agent_b']['correctness'] + report['agent_b']['helpfulness']) / 2 -
                (report['agent_a']['correctness'] + report['agent_a']['helpfulness']) / 2
            ),
            'token_savings_pct': (
                (report['agent_a']['total_tokens'] - report['agent_b']['total_tokens']) /
                report['agent_a']['total_tokens'] * 100
                if report['agent_a']['total_tokens'] > 0 else 0
            )
        }
        
        self._print_report(report)
        return report
    
    def _print_report(self, report):
        print("\n" + "="*80)
        print("📊 A/B TEST RESULTS")
        print("="*80)
        print(f"\nAgent A (e.g., GPT-4):")
        print(f"  Correctness:  {report['agent_a']['correctness']:.2f}/5")
        print(f"  Helpfulness:  {report['agent_a']['helpfulness']:.2f}/5")
        print(f"  Total Tokens: {report['agent_a']['total_tokens']:,}")
        
        print(f"\nAgent B (e.g., GPT-4o):")
        print(f"  Correctness:  {report['agent_b']['correctness']:.2f}/5")
        print(f"  Helpfulness:  {report['agent_b']['helpfulness']:.2f}/5")
        print(f"  Total Tokens: {report['agent_b']['total_tokens']:,}")
        
        print(f"\n📈 Comparison:")
        quality_delta = report['comparison']['quality_delta']
        token_savings = report['comparison']['token_savings_pct']
        
        quality_label = '✅ Better' if quality_delta > 0 else ('⚠️ Worse' if quality_delta < 0 else '➖ Same')
        cost_label = '💰 Cheaper' if token_savings > 0 else ('💸 More expensive' if token_savings < 0 else '➖ Same')
        print(f"  Quality Delta: {quality_delta:+.2f} {quality_label}")
        print(f"  Token Savings: {token_savings:+.1f}% {cost_label}")
        
        if quality_delta > 0 and token_savings > 0:
            print("\n✅ WINNER: Agent B (Better quality AND cheaper!)")
        elif abs(quality_delta) < 0.2 and token_savings > 0:
            print("\n✅ WINNER: Agent B (Similar quality, cheaper)")
        elif quality_delta < 0 and token_savings < 0:
            print("\n✅ WINNER: Agent A (Better quality AND cheaper!)")
        else:
            print("\n⚖️  Trade-off: Choose based on priorities")
        
        print("="*80)

# Example: Compare two configurations
# ab_test = ABTestingHarness(
#     agent_a=support_agent_gpt4,
#     agent_b=support_agent_gpt4o,
#     test_cases=inboxops_test_cases,
#     judge=judge
# )
# comparison = await ab_test.run_comparison()

print("\n✅ A/B testing framework ready!")
print("Use this to optimize cost vs quality tradeoffs.")

---

## 🎯 V4 Evaluation Summary

InboxOps now has three complementary evaluation techniques:

| Technique | Purpose | When to Use |
|-----------|---------|-------------|
| **Test Cases** | Regression testing | Every code change |
| **LLM Judge** | Quality assessment at scale | Weekly quality audits |
| **A/B Testing** | Optimize cost/quality tradeoffs | Before prod deployments |

### Key Metrics to Track

1. **Quality Metrics**
   - Correctness: Are responses accurate?
   - Helpfulness: Do they solve problems?
   - Tone: Professional and empathetic?
   - Completeness: All questions answered?

2. **Cost Metrics**
   - Tokens per conversation
   - Model API costs
   - Tool execution costs

3. **Performance Metrics**
   - Response time (p50, p95, p99)
   - Tool call latency
   - Concurrent load handling
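
Tail percentiles matter more than averages for support latency, because a few slow replies dominate the customer experience. A minimal sketch of computing p50/p95/p99 with the standard library follows. The `latency_percentiles` helper and the sample data are illustrative assumptions.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return p50, p95 and p99 for a list of latencies (needs at least 2 samples)."""
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up sample: 101 latencies evenly spaced from 0 ms to 1000 ms
sample_latencies = [i * 10 for i in range(101)]
percentiles = latency_percentiles(sample_latencies)
print(percentiles)
```

In practice the input would be the `latency_ms` values collected during an evaluation run, and the sample should be large enough for p99 to be meaningful.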

### Next Steps

- Set up continuous evaluation pipeline
- Define quality thresholds (e.g., >4.0/5 on all metrics)
- Implement automated alerts for quality regressions
- Run A/B tests before deploying model changes
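
A quality threshold can be enforced with a small gate that runs after each evaluation. This sketch assumes the averages come from a batch evaluation like the one earlier in this section. The `find_regressions` helper, the threshold constant, and the sample scores are hypothetical.

```python
QUALITY_THRESHOLD = 4.0  # every metric must be strictly above this (">4.0/5")


def find_regressions(avg_scores, threshold=QUALITY_THRESHOLD):
    """Return the names of metrics that do not clear the threshold."""
    return sorted(name for name, score in avg_scores.items() if score <= threshold)


# Made-up averages from a weekly evaluation run
weekly_scores = {
    "correctness": 4.3,
    "helpfulness": 4.1,
    "tone": 4.6,
    "completeness": 3.8,
}

failing = find_regressions(weekly_scores)
if failing:
    print(f"Quality regression in: {', '.join(failing)}")
else:
    print("All metrics above threshold")
```

In a CI pipeline the same check could fail the build (for example by raising an exception) instead of printing.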

---

# 14. V5 — Durable Agents: Serverless Deployment at Scale

## The Problem: Infrastructure Management

InboxOps faces production deployment challenges:

- **Auto-scaling needed** - Black Friday traffic spikes 100x
- **Long-running workflows** - Human approvals can take days/weeks
- **State persistence** - Conversations must survive server restarts
- **Cost efficiency** - Pay-per-execution, not per-hour

Traditional deployments run into:
- ❌ 30-second HTTP timeouts (approval workflows need days)
- ❌ Lost conversation state on server restart
- ❌ Manual scaling during traffic spikes
- ❌ 24/7 compute costs while waiting for human input

---

## Solution: Durable Task Extension + Azure Functions

The **Durable Task extension** for Agent Framework provides:

1. **Serverless Hosting** - Auto-scales from zero as load changes
2. **Stateful Threads** - Conversation history persisted in Cosmos DB
3. **Deterministic Orchestrations** - Multi-agent workflows that survive failures
4. **Zero-Cost Waiting** - No compute charges while waiting for human input
5. **Built-in Observability** - Durable Task Scheduler dashboard
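
The idea behind "survives failures" and "zero-cost waiting" is that an orchestration's progress is stored as a history of completed steps and the function is replayed from that history whenever it wakes up. The toy below illustrates that replay idea with a Python generator. It is not the Durable Task extension's API: `escalation_flow`, `replay`, and the step tuples are hypothetical, and a JSON round-trip stands in for durable storage.

```python
import json


def escalation_flow():
    """Toy orchestration: each yield marks a step whose result gets recorded."""
    draft = yield ("call_agent", "support-agent")
    approval = yield ("wait_for_event", "approval_decision")
    if approval["approved"]:
        final = yield ("call_agent", "specialist-agent")
        return {"status": "escalated", "response": final}
    return {"status": "denied", "response": draft}


def replay(flow, history):
    """Re-run the flow from the start, feeding it the recorded step results.

    Returns ("waiting", pending_step) if the history runs out first,
    or ("done", result) once the flow finishes.
    """
    gen = flow()
    step = next(gen)
    for recorded in history:
        try:
            step = gen.send(recorded)
        except StopIteration as stop:
            return "done", stop.value
    return "waiting", step


history = []  # in a real system this lives in durable storage
history.append("Draft reply from support agent")

# Simulate a process restart: only the serialized history survives
history = json.loads(json.dumps(history))
status, pending = replay(escalation_flow, history)
print(status, pending)  # nothing runs while the approval is pending

# Days later the approval event arrives and the specialist step completes
history.append({"approved": True})
history.append("Specialist reply")
status, result = replay(escalation_flow, history)
print(status, result)
```

Because the flow is re-executed from the top on every wake-up, orchestration code has to be deterministic. That is why durable frameworks route side effects such as agent calls through recorded steps.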

---

## Architecture: How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                     Azure Functions                         │
│                                                             │
│  ┌──────────────┐      ┌──────────────┐                   │
│  │   HTTP       │      │  Durable     │                   │
│  │   Trigger    │─────▶│  Agent       │                   │
│  └──────────────┘      └──────────────┘                   │
│                              │                              │
└──────────────────────────────┼──────────────────────────────┘
                               │
                ┌──────────────┼──────────────┐
                │              │              │
                ▼              ▼              ▼
          ┌─────────┐    ┌─────────┐   ┌──────────┐
          │ Cosmos  │    │  Azure  │   │ Durable  │
          │   DB    │    │ Storage │   │   Task   │
          │ (state) │    │ (queue) │   │Scheduler │
          └─────────┘    └─────────┘   └──────────┘
```

---

In [None]:
## Installation & Setup

# Install the durable task extension
# !pip install agent-framework-azurefunctions --pre
# !pip install azure-identity

# Note: Full durable agents require Azure Functions deployment.
# This example shows the programming model for reference.

print("""\n📦 Durable Agent Deployment Stack:

1. agent-framework-azurefunctions  # Core extension
2. Azure Functions (Flex Consumption)  # Serverless hosting
3. Cosmos DB  # Persistent state storage
4. Azure Storage  # Queue & orchestration
5. Durable Task Scheduler  # Observability dashboard

💰 Cost Model:
- Only pay for execution time
- $0.00 while waiting for human approval
- Auto-scale from 0 to thousands of instances
""")

In [None]:
## Example: Durable Agent Definition

# This is the programming model (requires Azure Functions to run)

sample_code = '''
# function_app.py
import os

from agent_framework_azurefunctions import AIAgentApp
from agent_framework import ChatAgent
from agent_framework_azure_ai import AzureAIAgentClient

app = AIAgentApp()

@app.agent(name="durable-support-agent")
def create_support_agent():
    """Create a durable agent with persistent state."""
    client = AzureAIAgentClient.from_connection_string(
        connection_string=os.environ["AZURE_AI_CONNECTION_STRING"]
    )
    
    return ChatAgent(
        chat_client=client,
        name="DurableInboxOps",
        instructions="You are InboxOps customer support.",
        tools=[create_ticket, lookup_customer, send_email]
    )

# Deploy with:
# func azure functionapp publish <your-function-app>

# The agent automatically gets:
# 1. HTTP endpoint: https://<your-app>.azurewebsites.net/api/agents/durable-support-agent
# 2. Persistent state in Cosmos DB
# 3. Auto-scaling (0 to N instances)
# 4. Built-in observability dashboard
'''

print(sample_code)

print("\n✅ Agent Features After Deployment:")
print("""
- Stateful Threads: Conversation history survives restarts
- HTTP API: RESTful endpoints for client integration
- Auto-Scale: 0→1000+ instances based on load
- Observability: Real-time dashboard of all conversations
""")

In [None]:
## Deterministic Multi-Agent Orchestration

# Durable orchestrations coordinate multiple agents reliably

orchestration_code = '''
@app.orchestration("escalation-workflow")
async def escalation_orchestration(context, input_data):
    """Durable workflow for escalating complex support tickets.
    
    This orchestration:
    - Survives server restarts
    - Waits indefinitely for human input (days/weeks)
    - No compute cost while waiting
    - Automatically resumes when event arrives
    """
    
    # Step 1: Initial support agent handles the request
    support_agent = await context.get_agent("support-agent")
    initial_response = await support_agent.run(input_data["message"])
    
    # Step 2: Check if escalation is needed
    # (naive keyword match for the demo; a structured classification is more reliable)
    if "complex" in initial_response.text.lower():
        
        # Step 3: Wait for human approval (CAN TAKE DAYS!)
        # Orchestration is serialized to Cosmos DB here
        # Compute resources are released (no compute cost while waiting)
        approval = await context.wait_for_external_event("approval_decision")
        
        # Step 4: When approval arrives, orchestration resumes
        if approval["approved"]:
            specialist = await context.get_agent("specialist-agent")
            final_response = await specialist.run(input_data["message"])
            return {
                "status": "escalated",
                "response": final_response.text
            }
        else:
            return {
                "status": "denied",
                "reason": approval["reason"]
            }
    
    return {
        "status": "resolved",
        "response": initial_response.text
    }

# To send approval event from external system:
# POST https://<your-app>.azurewebsites.net/runtime/webhooks/durabletask/instances/{instanceId}/raiseEvent/approval_decision
# Body: {"approved": true, "reason": "Authorized by manager"}
'''

print(orchestration_code)

print("\n🎯 Key Benefits:")
print("""
1. Checkpointing: State saved after each step
2. Fault Tolerance: Survives failures, automatically retries
3. Event-Driven: Wait for external events (webhooks, human input)
4. Cost-Efficient: No compute charges while waiting
5. Observable: Full execution history in dashboard
""")

In [None]:
## Parallel Orchestration Pattern

parallel_code = '''
import asyncio

@app.orchestration("quality-assessment")
async def quality_assessment_orchestration(context, email_content):
    """Run multiple specialist agents in parallel."""
    
    # Get all specialist agents
    sentiment_agent = await context.get_agent("sentiment-analyzer")
    priority_agent = await context.get_agent("priority-classifier")
    spam_agent = await context.get_agent("spam-detector")
    
    # Execute all agents concurrently
    tasks = [
        sentiment_agent.run(email_content),
        priority_agent.run(email_content),
        spam_agent.run(email_content)
    ]
    
    results = await asyncio.gather(*tasks)
    
    # Aggregate results
    return {
        "sentiment": results[0].text,
        "priority": results[1].text,
        "is_spam": results[2].text
    }
    
# Parallel execution with automatic checkpointing:
# - If one agent fails, others continue
# - Completed work is not repeated after restart
# - Full observability of each parallel branch
'''

print(parallel_code)
print("\n✅ Parallel Pattern Benefits:")
print("- Reduced latency (concurrent execution)")
print("- Fault isolation (one failure doesn't affect others)")
print("- Automatic retry of failed branches")

In [None]:
## Observability Dashboard

print("""
📊 DURABLE TASK SCHEDULER DASHBOARD

The built-in dashboard provides:

1. Agent Session Insights:
   - Complete chat history for each conversation
   - Tool calls and their results
   - Token usage and costs
   - User messages and agent responses

2. Orchestration Insights:
   - Visual execution flow diagram
   - Step-by-step execution timeline
   - Parallel branch visualization
   - Waiting states (pending approvals)

3. Performance Metrics:
   - Agent response times
   - Orchestration duration
   - Active vs queued instances
   - Success/failure rates

4. Debugging Capabilities:
   - Structured agent outputs
   - Tool invocation traces
   - External event monitoring
   - Replay and time-travel debugging

Access: https://<your-app>.azurewebsites.net/durabletask
""")

print("\n" + "="*80)
print("✅ V5 Complete: InboxOps can now deploy at enterprise scale!")
print("="*80)
print("\nNext Steps:")
print("1. Deploy to Azure Functions")
print("2. Configure Cosmos DB for state")
print("3. Enable Durable Task Scheduler dashboard")
print("4. Monitor production traffic in real-time")
print("\nLearn more: https://learn.microsoft.com/agent-framework/durable-agents")