# Evaluating Multi-Turn Conversations with Simulated Interactions

In this notebook, we'll explore a more effective approach to evaluating multi-turn customer support agents. Traditional evaluation methods that use a single input-output pair are insufficient for agents that need to adapt their tool usage as conversations evolve. Instead, we'll implement a simulation-based approach where an LLM evaluates our agent against specific success criteria.

## The Problem with Traditional Evaluation

Traditional evaluation methods for customer support agents often use a dataset where:
- Input: Customer ticket/query
- Output: Expected sequence of tool calls

This approach has significant limitations:
1. It assumes a fixed, predetermined path to resolution
2. It doesn't account for new information discovered during the conversation
3. It focuses on the exact sequence of tools rather than achieving the desired outcome

## A Better Approach: Simulation-Based Evaluation

Instead of predicting exact tool sequences, we'll define success criteria that focus on what the agent must accomplish, regardless of the specific path taken. For example:

In [None]:
success_criteria = [
    "Agent MUST call get_order_status(order_id)",
    "Agent MUST inform user cancellation is possible IFF order.status != 'shipped'"
]

This approach:
- Focuses on outcomes rather than specific steps
- Allows for multiple valid solution paths
- Better reflects real-world customer support scenarios
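
To make the contrast concrete, here is a minimal sketch (not part of the evaluation pipeline built below) that compares an exact-sequence check with an outcome-style check over a list of tool names. The helper names `exact_sequence_match` and `required_tools_called` and both sample traces are hypothetical; the tool names are borrowed from the mock tools defined later in this notebook.

```python
from typing import List

def exact_sequence_match(actual: List[str], expected: List[str]) -> bool:
    """Traditional check: the agent must call exactly these tools, in this order."""
    return actual == expected

def required_tools_called(actual: List[str], required: List[str]) -> bool:
    """Outcome-style check: every required tool was called at some point."""
    return set(required).issubset(actual)

expected_path = ["get_order_status"]

# Two hypothetical traces that both end up checking the order status
direct_trace = ["get_order_status"]
detour_trace = ["find_customer_by_email", "get_orders_by_customer_id", "get_order_status"]

for trace in (direct_trace, detour_trace):
    print(
        trace,
        "| exact match:", exact_sequence_match(trace, expected_path),
        "| required tools called:", required_tools_called(trace, ["get_order_status"]),
    )
```

The detour trace fails the exact-sequence check even though it reaches the same outcome, which is the limitation the criteria-based approach is meant to avoid.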

## Requirements

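Before we start, make sure a recent version of the OpenAI Python SDK is installed. The evaluator below uses `client.responses.parse` and `pydantic`; the latter is installed as a dependency of the SDK.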

In [None]:
%pip install openai

## Define Tools

Let's implement this simulation-based evaluation approach using mock tools.

In [9]:
# Mock tools for our customer support agent
import asyncio
import json
from typing import Dict, Any, List, Tuple
from openai import AsyncOpenAI
import getpass

api_key = getpass.getpass("Enter your OpenAI API key: ")

# Mock database of orders
ORDERS_DB = {
    "ORD123": {"status": "processing", "customer_id": "CUST456", "items": ["Product A", "Product B"]},
    "ORD456": {"status": "shipped", "customer_id": "CUST789", "items": ["Product C"]},
    "ORD789": {"status": "delivered", "customer_id": "CUST456", "items": ["Product D"]}
}

# Mock database of customers
CUSTOMERS_DB = {
    "CUST456": {"email": "customer1@example.com", "name": "John Doe"},
    "CUST789": {"email": "customer2@example.com", "name": "Jane Smith"}
}

# Tool definitions
async def find_customer_by_email(email: str) -> Dict[str, Any]:
    """Find a customer by their email address."""
    for customer_id, customer in CUSTOMERS_DB.items():
        if customer["email"] == email:
            return {"customer_id": customer_id, **customer}
    return {"error": "Customer not found"}

async def get_orders_by_customer_id(customer_id: str) -> Dict[str, Any]:
    """Get all orders for a specific customer."""
    orders = []
    for order_id, order in ORDERS_DB.items():
        if order["customer_id"] == customer_id:
            orders.append({"order_id": order_id, **order})
    return {"orders": orders}

async def get_order_status(order_id: str) -> Dict[str, Any]:
    """Get the status of a specific order."""
    if order_id in ORDERS_DB:
        return {"order_id": order_id, "status": ORDERS_DB[order_id]["status"]}
    return {"error": "Order not found"}

async def update_ticket_status(ticket_id: str, status: str) -> Dict[str, Any]:
    """Update the status of a support ticket."""
    return {"ticket_id": ticket_id, "status": status, "updated": True}

async def escalate_to_human() -> Dict[str, Any]:
    """Escalate the current issue to a human agent."""
    return {
        "status": "escalated",
        "message": "A human agent has been notified and will follow up shortly."
    }

# Dictionary mapping tool names to functions
TOOL_MAP = {
    "find_customer_by_email": find_customer_by_email,
    "get_orders_by_customer_id": get_orders_by_customer_id,
    "get_order_status": get_order_status,
    "update_ticket_status": update_ticket_status,
    "escalate_to_human": escalate_to_human
}

# Tool schemas for OpenAI API
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "find_customer_by_email",
            "description": "Find a customer by their email address.",
            "parameters": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "Customer email address"}
                },
                "required": ["email"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_orders_by_customer_id",
            "description": "Get all orders for a specific customer.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string", "description": "Customer ID"}
                },
                "required": ["customer_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Get the status of a specific order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "Order ID"}
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "update_ticket_status",
            "description": "Update the status of a support ticket.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {"type": "string", "description": "Ticket ID"},
                    "status": {"type": "string", "description": "New status"}
                },
                "required": ["ticket_id", "status"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Escalate the current issue to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": []
            }
        }
    }
]
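
Because the tool functions and their schemas are maintained by hand, they can drift apart. The following is a small sketch of a consistency check, assuming each schema follows the `{"type": "function", "function": {...}}` layout used above. The helper `find_schema_mismatches` and the `demo_*` names are hypothetical and exist only so the example runs on its own.

```python
import inspect
from typing import Any, Callable, Dict, List

def find_schema_mismatches(tool_map: Dict[str, Callable], schemas: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means map and schemas agree."""
    problems = []
    schema_by_name = {s["function"]["name"]: s["function"] for s in schemas}

    # Names that appear on only one side
    for name in sorted(set(tool_map) ^ set(schema_by_name)):
        problems.append(f"{name}: present in only one of tool map / schemas")

    # Parameter names must match between the function signature and the schema
    for name in sorted(set(tool_map) & set(schema_by_name)):
        func_params = set(inspect.signature(tool_map[name]).parameters)
        schema_params = set(schema_by_name[name]["parameters"]["properties"])
        if func_params != schema_params:
            problems.append(
                f"{name}: function takes {sorted(func_params)}, schema declares {sorted(schema_params)}"
            )
    return problems

# Tiny stand-in tool and schema (hypothetical) so the sketch is self-contained
async def demo_get_order_status(order_id: str) -> Dict[str, Any]:
    return {"order_id": order_id, "status": "processing"}

demo_map = {"get_order_status": demo_get_order_status}
demo_schemas = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Get the status of a specific order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

print(find_schema_mismatches(demo_map, demo_schemas))
```

In this notebook the same check could be run as `find_schema_mismatches(TOOL_MAP, TOOL_SCHEMAS)`.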

## Define Agents

Now we define our agents. We will define both a Planner and an Executor agent. The Planner agent is responsible for creating a plan to achieve the user's goal, while the Executor agent is responsible for executing the plan. We also define a helper function to generate a response from the tool outputs.
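
The executor's core step, looking a tool up by name and running it with the model-supplied arguments, can be exercised offline without any API call. The sketch below is an illustration under assumptions: `fake_tool_call`, `dispatch`, `DEMO_TOOLS`, and `demo_get_order_status` are hypothetical names, and the fake call object only mimics the three attributes the executor reads (`id`, `function.name`, `function.arguments`).

```python
import asyncio
import json
from types import SimpleNamespace
from typing import Any, Dict, List

async def demo_get_order_status(order_id: str) -> Dict[str, Any]:
    """Stand-in tool for this sketch."""
    return {"order_id": order_id, "status": "processing"}

DEMO_TOOLS = {"get_order_status": demo_get_order_status}

def fake_tool_call(call_id: str, name: str, arguments: Dict[str, Any]) -> SimpleNamespace:
    """Build an object with the attributes read from a tool call; arguments are a JSON string."""
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=json.dumps(arguments)),
    )

async def dispatch(calls: List[SimpleNamespace]) -> List[Dict[str, Any]]:
    """Look each tool up by name, run it, and collect one output per call."""
    outputs = []
    for call in calls:
        func = DEMO_TOOLS.get(call.function.name)
        if func is None:
            # Unknown tools produce an error output instead of being dropped
            output = {"error": f"Tool '{call.function.name}' not found"}
        else:
            output = await func(**json.loads(call.function.arguments))
        outputs.append({"tool_call_id": call.id, "tool_name": call.function.name, "output": output})
    return outputs

plan = [
    fake_tool_call("call_1", "get_order_status", {"order_id": "ORD123"}),
    fake_tool_call("call_2", "refund_order", {"order_id": "ORD123"}),  # not a registered tool
]

# In a notebook cell, use `results = await dispatch(plan)` instead of asyncio.run
results = asyncio.run(dispatch(plan))
print(results)
```

Recording an output for every call, including unknown tools, keeps each tool call paired with a tool response when the results are written back into the conversation history.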

In [25]:
class PlannerAgent:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.client = AsyncOpenAI(api_key=api_key)
        
    async def run(self, task_history: List[Dict[str, Any]]) -> Tuple[List, str]:
        """Create a tool execution plan based on user input"""
        # Call OpenAI to create a plan
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=task_history,
            tools=TOOL_SCHEMAS,
            tool_choice="auto"
        )
        
        message = response.choices[0].message
        tool_calls = message.tool_calls or []
        return tool_calls, message.content or ""

    def initialize_history(self, ticket: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Start conversation history from a ticket."""
        system_prompt = """You are a helpful customer support agent for an e-commerce company.
        Your job is to help customers with their inquiries about orders, products, and returns.
        Use the available tools to gather information and take actions on behalf of the customer.
        Always be polite, professional, and helpful."""
        
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": str(ticket)}
        ]

# Simple implementation of the Executor Agent
class ExecutorAgent:
    async def run(self, tool_calls: List, task_history: List[Dict]) -> Dict[str, Any]:
        """Execute tool calls and update conversation history"""
        tool_outputs = []

        for call in tool_calls:
            tool_name = call.function.name
            args = json.loads(call.function.arguments)

            # Get the function from our tool map
            func = TOOL_MAP.get(tool_name)
            if func is None:
                # Report unknown tools back to the model instead of silently dropping the call
                output = {"error": f"Tool '{tool_name}' not found"}
            else:
                try:
                    # Execute the tool
                    output = await func(**args)
                except Exception as e:
                    output = {"error": str(e)}
            
            # Add the tool call to history
            task_history.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": tool_name,
                        "arguments": call.function.arguments
                    }
                }]
            })
            
            # Add the tool response to history
            task_history.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(output)
            })

            tool_outputs.append({"tool_name": tool_name, "output": output})

        return {"task_history": task_history, "tool_outputs": tool_outputs}

# Generate a response from tool outputs
async def generate_response(tool_outputs: List[Dict], model: str = "gpt-4o") -> str:
    """Generate a human-readable response based on tool outputs"""
    client = AsyncOpenAI(api_key=api_key)

    system_prompt = """You are a helpful customer support agent. IMPORTANT GUIDELINES:
    1. When a customer asks about cancellation, ALWAYS check the order status first
    2. EXPLICITLY inform the customer if cancellation is possible based on the status:
    - If status is 'processing' or 'pending', tell them cancellation IS possible
    - If status is 'shipped' or 'delivered', tell them cancellation is NOT possible
    3. Always be polite, professional, and helpful"""

    # Prepare a prompt that includes the tool outputs
    prompt = "Based on the tool outputs, generate a helpful response to the customer:\n\n"
    for output in tool_outputs:
        prompt += f"{output['tool_name']} result: {json.dumps(output['output'])}\n"
        
    # Call OpenAI to generate the response
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content

    

## Evaluator Agent

The Evaluator Agent evaluates our multi-turn agent behavior using binary success criteria over full simulated conversations. This method moves beyond traditional input/output (I/O) pair evaluation, addressing the stochastic and flexible nature of agent workflows.
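
Since agent behavior is stochastic, a single run gives only one sample per criterion. One possible way to summarize several runs of the same ticket is a per-criterion pass rate. The sketch below is an assumption about how results might be aggregated, not something this notebook does; `pass_rates` is a hypothetical helper and the verdicts are made-up placeholder data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def pass_rates(runs: List[List[Tuple[str, bool]]]) -> Dict[str, float]:
    """Fraction of runs in which each criterion passed."""
    passed = defaultdict(int)
    seen = defaultdict(int)
    for run in runs:
        for criterion, ok in run:
            seen[criterion] += 1
            passed[criterion] += int(ok)
    return {criterion: passed[criterion] / seen[criterion] for criterion in seen}

# Made-up binary verdicts from three hypothetical runs of the same ticket
runs = [
    [("called get_order_status", True), ("informed about cancellation", True)],
    [("called get_order_status", True), ("informed about cancellation", False)],
    [("called get_order_status", True), ("informed about cancellation", True)],
]

for criterion, rate in pass_rates(runs).items():
    print(f"{criterion}: {rate:.0%}")
```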

In [16]:
from pydantic import BaseModel

class Verdict(BaseModel):
    criterion: str
    passed: bool
    explanation: str

class VerdictList(BaseModel):
    verdicts: list[Verdict]

async def evaluate_conversation(conversation: List[Dict], tools_used: List[str], criteria: List[str], model: str = "gpt-4o") -> Dict[str, Any]:
    """Evaluate a conversation against success criteria"""
    client = AsyncOpenAI(api_key=api_key)
    
    # Format the conversation for evaluation
    conversation_text = ""
    for message in conversation:
        role = message.get("role", "")
        content = message.get("content", "")
        if role == "user":
            conversation_text += f"Customer: {content}\n"
        elif role == "assistant" and content:
            conversation_text += f"Agent: {content}\n"
        elif role == "tool":
            conversation_text += f"Tool Output: {content}\n"
    
    # Create the evaluation prompt (one criterion per line)
    criteria_text = "\n".join(f"- {criterion}" for criterion in criteria)
    prompt = f"""
    Please evaluate this customer support conversation against the success criteria.
    
    Conversation:
    {conversation_text}
    
    Tools used: {', '.join(tools_used)}
    
    Success Criteria:
    {criteria_text}
    
    For each criterion, determine if it was met (PASS) or not met (FAIL).
    Provide a brief explanation for each verdict.
    """
    
    # Call OpenAI to evaluate
    response = await client.responses.parse(
        model=model,
        input=[
            {"role": "system", "content": "You are an objective evaluator of customer support conversations."},
            {"role": "user", "content": prompt}
        ],
        text_format=VerdictList
    )
    
    # Process the evaluation response
    eval_text = response.output_parsed
    
    # Parse the evaluation into structured results
    verdicts = eval_text.verdicts
    
    return {"verdicts": verdicts, "raw_evaluation": eval_text}
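
The structured output guarantees the shape of each verdict, but not that the model returned one verdict per criterion. A conservative guard could look like the sketch below. It assumes the evaluator echoes each criterion verbatim, which an LLM may not always do, so a non-empty result is a prompt to inspect the verdicts rather than proof of a problem. `DemoVerdict` is a stdlib stand-in with the same fields as `Verdict`, and `missing_criteria` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DemoVerdict:
    """Stand-in with the same fields as the Verdict model above."""
    criterion: str
    passed: bool
    explanation: str

def missing_criteria(criteria: List[str], verdicts: List[DemoVerdict]) -> List[str]:
    """Criteria with no matching verdict (compared after trimming whitespace)."""
    judged = {v.criterion.strip() for v in verdicts}
    return [c for c in criteria if c.strip() not in judged]

criteria = [
    "Agent MUST call get_order_status tool",
    "Agent MUST inform user cancellation is possible IFF order.status != 'shipped'",
]
# Hypothetical evaluator output that covers only the first criterion
verdicts = [DemoVerdict(criteria[0], True, "Tool was called.")]

print(missing_criteria(criteria, verdicts))
```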

## Simulation Function

Below we define a function that runs a conversation between our agent and a user, whose replies are read interactively with `input()`. Once the conversation ends, the Evaluator Agent scores the full history against the success criteria.

In [None]:
async def simulate_conversation(ticket: Dict[str, Any], criteria: List[str], max_turns: int = 5):
    """Simulate a conversation with a customer and evaluate against criteria"""
    # Initialize agents
    planner = PlannerAgent()
    executor = ExecutorAgent()
    
    # Initialize conversation history
    task_history = planner.initialize_history(ticket)
    
    # Simulate the conversation
    tools_used = []
    turns = 0
    
    print("\n🤖 Starting conversation simulation...")
    print(f"📝 Ticket: {ticket['subject']}")
    print(f"🎯 Success criteria: {', '.join(criteria)}")
    
    while turns < max_turns:
        turns += 1
        print(f"\n--- Turn {turns} ---")
        
        # Run the planner to decide what to do
        tool_calls, assistant_reply = await planner.run(task_history)
        
        # Handle the agent's response
        if tool_calls:
            # Agent wants to use tools
            tool_names = [call.function.name for call in tool_calls]
            print(f"🔧 Agent uses tools: {', '.join(tool_names)}")
            tools_used.extend(tool_names)
            
            # Execute the tools
            result = await executor.run(tool_calls, task_history)
            
            # Generate a response based on tool outputs
            response_text = await generate_response(result["tool_outputs"])
            print(f"🤖 Agent: {response_text}")
            
            # Add the response to history
            task_history.append({"role": "assistant", "content": response_text})
            
            # Check if conversation should end
            if "update_ticket_status" in tool_names:
                print("\n✅ Ticket resolved — update_ticket_status was called.")
                break
        else:
            # Agent responded directly without tools
            print(f"🤖 Agent: {assistant_reply}")
            task_history.append({"role": "assistant", "content": assistant_reply})
        
        # Get the next user message, unless this was the final turn
        if turns < max_turns:
            user_input = input("User: ")
            print(f"👤 Customer: {user_input}")
            task_history.append({"role": "user", "content": user_input})
        else:
            # Turn limit reached; end the conversation
            break
    
    # Evaluate the conversation
    print("\n📊 Evaluating conversation...")
    evaluation = await evaluate_conversation(task_history, tools_used, criteria)
    
    # Print evaluation results
    print("\n--- Evaluation Results ---")
    for verdict in evaluation["verdicts"]:
        status = "✅ PASS" if verdict.passed else "❌ FAIL"
        print(f"{status}: {verdict.criterion}")
    
    # Calculate overall score (guard against an empty verdict list)
    passed = sum(1 for v in evaluation["verdicts"] if v.passed)
    total = len(evaluation["verdicts"])
    score = (passed / total) * 100 if total else 0.0
    
    print(f"\n📈 Overall Score: {score:.1f}% ({passed}/{total} criteria met)")
    print(f"🔧 Tools Used: {', '.join(sorted(set(tools_used)))}")
    print(f"🔄 Conversation Length: {turns} turns")
    
    return {
        "conversation": task_history,
        "tools_used": tools_used,
        "evaluation": evaluation,
        "turns": turns,
        "score": score
    }
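
For unattended runs, the interactive `input()` call could be swapped for a scripted user that replays canned replies. This is a sketch of one way to do that, not how the function above works: `make_scripted_user` is a hypothetical helper, and wiring it in would require passing the callable into `simulate_conversation` and ending the loop when it returns `None`.

```python
from typing import Callable, Iterable, Optional

def make_scripted_user(replies: Iterable[str]) -> Callable[..., Optional[str]]:
    """Return an input()-like callable that replays canned replies in order.

    It returns None once the script is exhausted, so the caller can end the conversation.
    """
    remaining = iter(replies)

    def next_reply(prompt: str = "") -> Optional[str]:
        # The prompt argument is accepted for input() compatibility and ignored
        return next(remaining, None)

    return next_reply

# Hypothetical customer script for the cancellation ticket
get_user_input = make_scripted_user(["My order ID is ORD123", "Yes, please cancel it"])

print(get_user_input("User: "))
print(get_user_input("User: "))
print(get_user_input("User: "))  # script exhausted
```

A scripted user makes runs repeatable, at the cost of not reacting to what the agent actually says; an LLM-driven user would be the more flexible alternative.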

Now, let's define a ticket and our success criteria. Then we're ready to run!

In [31]:
async def run_example():
    # Define a test ticket
    ticket = {
        "id": "TICKET123",
        "subject": "Order Cancellation Request",
        "description": "I placed an order yesterday (ORD123) and would like to cancel it if it hasn't shipped yet.",
        "status": "open",
        "requester_id": "customer1@example.com"
    }
    
    # Define success criteria
    criteria = [
        "Agent MUST call get_order_status tool",
        "Agent MUST inform user cancellation is possible IFF order.status != 'shipped'"
    ]
    
    # Run the simulation
    await simulate_conversation(ticket, criteria)

await run_example()


🤖 Starting conversation simulation...
📝 Ticket: Order Cancellation Request
🎯 Success criteria: Agent MUST call get_order_status tool, Agent MUST inform user cancellation is possible IFF order.status != 'shipped'

--- Turn 1 ---
🔧 Agent uses tools: find_customer_by_email
🤖 Agent: Hello John Doe,

Thank you for reaching out to us! How can I assist you today? If you have any concerns or questions regarding your order, please provide me with your order number, and I'll be happy to help you further.

Best regards,
[Your Name]
👤 Customer: ehm my order id is 123

--- Turn 2 ---
🔧 Agent uses tools: get_order_status
🤖 Agent: Thank you for reaching out regarding your order ORD123. I checked the status, and it is currently marked as "processing." I'm happy to inform you that cancellation is possible at this stage. If you would like to proceed with canceling your order, please let me know, and I will assist you further. If there's anything else I can do for you, feel free to ask.
👤 Customer: yeah

## Conclusion

Traditional evaluation methods that rely on fixed input-output pairs are insufficient for multi-turn conversational agents. By simulating complete conversations and evaluating against outcome-based criteria, we can better assess an agent's ability to handle real-world customer support scenarios.

This approach allows us to focus on what matters—successfully resolving customer issues—rather than following a predetermined sequence of steps. As your agent evolves, you can refine your success criteria to ensure it meets your business needs while providing excellent customer service.