# AWS Collaboration Benchmark - Interactive Tutorial

This notebook demonstrates the AWS Collaboration benchmark using MASEval. The benchmark evaluates multi-agent systems on collaborative task-solving scenarios.

## What is this benchmark?

The AWS Collaboration benchmark tests how well multi-agent systems can:
- **Coordinate multiple specialist agents** (e.g., a travel expert, a restaurant expert)
- **Interact with simulated users** across multiple conversation turns
- **Use simulated tools** to accomplish realistic tasks
- **Meet user-side and system-side goals** (dual evaluation)

## Key Metrics

- **Overall GSR** (Goal Success Rate): Did the agent satisfy all requirements?
- **User GSR**: Did the agent meet user-visible goals?
- **System GSR**: Did the agent correctly invoke tools behind the scenes?
- **Partial GSR**: What percentage of goals were met?
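
As a rough illustration of how these four numbers relate, the sketch below derives them from per-assertion verdicts. The helper name `gsr_summary`, the plain-boolean input format, and the choice to pool user and system assertions for the partial score are assumptions made for this example; they are not taken from MASEval or the AWS benchmark code.

```python
from typing import Dict, List


def gsr_summary(user_results: List[bool], system_results: List[bool]) -> Dict[str, float]:
    """Hypothetical helper: derive the four GSR metrics from boolean verdicts."""

    def all_met(results: List[bool]) -> float:
        # Assumption: a side with no assertions counts as trivially satisfied.
        return 1.0 if all(results) else 0.0

    combined = user_results + system_results
    return {
        "user_gsr": all_met(user_results),
        "system_gsr": all_met(system_results),
        "overall_gsr": all_met(combined),
        # Fraction of all assertions that were met.
        "partial_gsr": sum(combined) / len(combined) if combined else 1.0,
    }


summary = gsr_summary(user_results=[True, True], system_results=[True, False])
print(summary)
```

Here the user side passes, the system side fails on one assertion, and the overall score is therefore 0 even though three of four goals were met.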

Let's dive in!

## Overall Structure (Preview)

Before we dive into implementation details, let's see the **main execution flow** to understand the big picture. This is what happens when you run the benchmark:

```python
# Configuration
framework = "smolagents"  # or "langgraph"
domain = "travel"
limit = 5

# 1. Download and process raw data
process_data(verbose=1)

# 2. Load benchmark tasks
tasks = load_tasks(domain, limit)

# 3. Load agent configuration (defines specialist agents)
agent_config = load_agent_config(domain, framework)

# 4. Setup result logging
output_dir = "results/"
logger = FileResultLogger(output_dir=output_dir, filename_pattern=f"{domain}_{framework}_{{timestamp}}.jsonl")

# 5. Create and run the benchmark
benchmark = AWSCollabBenchmark(
    agent_data=agent_config,
    tasks=tasks,
    callbacks=[logger]
)
results = benchmark.run()

# 6. Compute and display metrics
summary = compute_benchmark_metric(results)
print(f"Success Rate: {summary['success_rate']:.2%}")
```

**That's it!** The benchmark handles:
- Creating specialist agents (as defined by config)
- Simulating user interactions
- Simulating tool responses
- Evaluating both user-side and system-side goals

### Inside the Benchmark: What `AWSCollabBenchmark` Does

The `AWSCollabBenchmark` class inherits from MASEval's `Benchmark` base class and implements these key methods:

```python
class AWSCollabBenchmark(Benchmark):

    def setup_environment(self, agent_data, task):
        # Creates the Environment with simulated tools (FlightBooking, HotelBooking, etc.)
        pass

    def setup_user(self, agent_data, environment, task):
        # Creates an LLM-powered User simulator that interacts like a real user
        pass

    def setup_agents(self, agent_data, environment, task, user):
        # Creates specialist agents (travel_expert, restaurant_expert) and primary orchestrator
        pass

    def setup_evaluators(self, environment, task, agents, user):
        # Creates LLM-based evaluators for user-side and system-side assertions
        pass

    def run_agents(self, agents, task, environment):
        # Executes the agent(s) on the task and returns final answer
        pass

    def evaluate(self, evaluators, agents, final_answer, traces):
        # Runs both evaluators and computes GSR metrics (user_gsr, system_gsr, overall_gsr)
        pass
```

The base `Benchmark.run()` method orchestrates these steps automatically for each task!
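
To make the "orchestrates these steps" idea concrete, here is a minimal template-method sketch. `MiniBenchmark` is a hypothetical stand-in written for this tutorial; it is not MASEval's `Benchmark`, its hook signatures are simplified, and the real `run()` may order or wrap these calls differently.

```python
from typing import Any, Dict, List


class MiniBenchmark:
    """Hypothetical stand-in for a benchmark base class (not MASEval's Benchmark)."""

    def __init__(self, agent_data: Dict[str, Any], tasks: List[Dict[str, Any]]):
        self.agent_data = agent_data
        self.tasks = tasks

    # Hooks that a subclass would override.
    def setup_environment(self, agent_data, task):
        return {"tools": task.get("tools", [])}

    def setup_agents(self, agent_data, environment, task):
        return [agent_data["primary_agent_id"]]

    def run_agents(self, agents, task, environment):
        return f"{agents[0]} answered: {task['query']}"

    def evaluate(self, final_answer, task):
        return {"answered": bool(final_answer)}

    def run(self) -> List[Dict[str, Any]]:
        """Call the hooks in a fixed order for every task."""
        results = []
        for task in self.tasks:
            environment = self.setup_environment(self.agent_data, task)
            agents = self.setup_agents(self.agent_data, environment, task)
            final_answer = self.run_agents(agents, task, environment)
            results.append({"query": task["query"], "eval": self.evaluate(final_answer, task)})
        return results


demo = MiniBenchmark({"primary_agent_id": "orchestrator"}, [{"query": "Book a flight"}])
results = demo.run()
print(results)
```

The point of the pattern is that a subclass only fills in the hooks, while the loop over tasks lives in one place.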

Now let's break down each piece step by step.

## Step 1: Imports

We start by importing dependencies. This example requires several dependencies beyond the base library. You can install them with either of these commands:

```bash
pip install "maseval[examples]"  # with pip
uv add "maseval[examples]"       # with uv
```


In [None]:
# All the imports we need for this benchmark
import json
import os
from typing import Any, Dict, List, Optional, Sequence, Tuple

# Framework-specific imports (smolagents and langgraph)
from google.genai import Client as GoogleGenAIClient
from langchain_core.messages import SystemMessage
from langchain_core.tools import StructuredTool as LanggraphStructuredTool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition
from smolagents import FinalAnswerTool, OpenAIServerModel, Tool as SmolagentsTool, ToolCallingAgent
from typing_extensions import TypedDict

# MASEval core components
from maseval import (
    AgentAdapter,          # Wraps framework-specific agents
    Benchmark,             # Base benchmark class
    Environment,           # Simulates tools and external systems
    Evaluator,             # Evaluates agent performance
    MessageHistory,        # Stores conversation history
    ModelAdapter,          # Wraps LLM models
    Task,                  # Represents a single benchmark task
    TaskCollection,        # Collection of tasks
    ToolInvocationHistory, # Tracks tool usage
    ToolLLMSimulator,      # LLM-based tool simulator
    User,                  # Simulates user behavior
)

# MASEval utilities
from maseval.core.callbacks.result_logger import FileResultLogger
from maseval.core.config import ConfigurableMixin
from maseval.core.tracing import TraceableMixin

# Framework adapters
from maseval.interface.agents.langgraph import LangGraphAgentAdapter, LangGraphUser
from maseval.interface.agents.smolagents import SmolAgentUser, SmolAgentAdapter
from maseval.interface.inference import GoogleGenAIModelAdapter, LiteLLMModelAdapter

# Data processing utility
from process_data import process_data


### Helper Functions: Loading Tasks and Agent Configurations

The next cell defines two utility functions for loading benchmark data:

- **`load_tasks()`**: Loads task objects from the processed JSON files
  - Each task becomes a `Task` object with `query`, `environment_data`, `evaluation_data`, and `metadata`
  - Returns a `TaskCollection` (an iterable sequence of tasks)
  
- **`load_agent_config()`**: Loads the recommended multi-agent setup for a domain
  - Reads `agents.json` which defines specialist agents and their tools
  - Adds framework-specific configuration and model settings
  - Returns a dictionary with agent specifications and the primary orchestrator ID

These are simple I/O helpers; the interesting logic happens later when we create the benchmark!

In [None]:
import os
import json
from typing import Optional, Dict, Any

def load_tasks(domain: str, limit: Optional[int] = None) -> TaskCollection:
    """Load tasks from the processed data for the given domain."""
    if domain not in ["travel", "software", "mortgage"]:
        raise ValueError(f"Unsupported domain: {domain}")

    with open(os.path.join(os.getcwd(), "data", domain, "tasks.json"), "r") as f:
        tasks_list = json.load(f)

    tasks_data = []
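    # Slice only when a limit is given, so that limit=0 yields no tasks rather than all of them
    for task_dict in tasks_list[:limit] if limit is not None else tasks_list: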
        metadata = task_dict.get("metadata", {})
        tasks_data.append(
            Task(
                query=task_dict["query"],
                environment_data=task_dict["environment_data"],
                evaluation_data=task_dict["evaluation_data"],
                metadata=metadata,
            )
        )
    return TaskCollection(tasks_data)

def load_agent_config(domain: str, framework: str) -> Dict[str, Any]:
    """Load the agent configuration from the processed data for the given domain."""

    # loads the recommended setup of agents for the domain as per AWS Collab benchmark
    with open(os.path.join(os.getcwd(), "data", domain, "agents.json"), "r") as f:
        agents_data = json.load(f)

    # adds more config to the agent data
    agents_data["framework"] = framework  # Set framework as specified
    for agent in agents_data.get("agents", []):
        agent["model_config"] = {"model_id": "gemini-2.5-flash", "temperature": 0.7}

    return agents_data


### Helper Function: Model Initialization

The `get_model()` function creates a `ModelAdapter` instance for a given LLM.

**Why ModelAdapter?** MASEval uses `ModelAdapter` as a unified interface for different LLM providers:
- Wraps provider-specific clients (Google GenAI, OpenAI, etc.)
- Provides a consistent `.generate()` method across all models
- Makes it easy to swap models without changing benchmark code

This model will be used for:
- **Tool simulation**: Generating realistic tool responses
- **User simulation**: Simulating multi-turn user conversations
- **Evaluation**: LLM-as-a-judge for assertion checking

In this notebook, we default to `gemini-2.5-flash` for speed, but you can use other models like `gemini-2.5-pro` or `gpt-4o`.
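
Because callers only rely on `.generate()` returning a string, a canned-response stub can be handy for trying out prompt-building code without API keys. The sketch below is an assumption-laden illustration: `EchoModel` and `judge` are hypothetical names, and `EchoModel` does not subclass `ModelAdapter`, so whether MASEval components accept such a duck-typed object is not verified here.

```python
from typing import List


class EchoModel:
    """Hypothetical offline stand-in exposing a `.generate()` method."""

    def __init__(self, canned_response: str = "[]"):
        self.canned_response = canned_response
        self.prompts: List[str] = []  # keep prompts so tests can inspect them

    def generate(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_response


def judge(model, question: str) -> str:
    # Works with any object that has generate(prompt) -> str.
    return model.generate(f"Answer briefly: {question}").strip()


stub = EchoModel(canned_response=" yes ")
answer = judge(stub, "Is the sky blue?")
print(answer)
```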

In [None]:
def get_model(model_id: str = "gemini-2.5-flash", **kwargs) -> ModelAdapter:
    if model_id == "gemini-2.5-flash":
        google_client = GoogleGenAIClient(api_key=os.getenv("GOOGLE_API_KEY"))
        bare_model = GoogleGenAIModelAdapter(
            google_client, model_id="gemini-2.5-flash", default_generation_params=kwargs
        )
    elif model_id == "gemini-2.5-pro":
        google_client = GoogleGenAIClient(api_key=os.getenv("GOOGLE_API_KEY"))
        bare_model = GoogleGenAIModelAdapter(
            google_client, model_id="gemini-2.5-pro", default_generation_params=kwargs
        )
    elif model_id == "gpt-4o":
        bare_model = LiteLLMModelAdapter(model_id="gpt-4o", default_generation_params=kwargs)
    else:
        raise ValueError(f"Unsupported model_id: {model_id}")
    return bare_model


## Step 2: Load Data & Configure

Before we can run the benchmark, we need to:

1. **Download and process the raw benchmark data** from the AWS Collaboration benchmark repository
2. **Load tasks** for our chosen domain (travel, software, or mortgage)
3. **Configure which agent framework** to use (smolagents or langgraph)

### Configuration Parameters

Let's start by setting three key configuration variables:

- **`framework`**: Which agent framework to use for orchestrating agents
  - `"smolagents"`: Hugging Face's lightweight agent framework with built-in multi-agent support
  - `"langgraph"`: LangChain's stateful graph-based agent framework
  
- **`domain`**: Which benchmark domain to evaluate on
  - `"travel"`: Travel planning scenarios (flights, hotels, restaurants)
  - `"software"`: Software development scenarios (bug tracking, code review)
  - `"mortgage"`: Mortgage application scenarios (loan processing, document verification)
  
- **`limit`**: How many tasks to run from the domain
  - Set this to a small number (e.g., 3-5) for quick testing
  - Use `None` to run all tasks (~30 per domain)

In [None]:
# Configuration: Set these to customize your benchmark run
framework = "smolagents"  # Agent framework: "smolagents" or "langgraph"
domain = "travel"         # Benchmark domain: "travel", "software", or "mortgage"
limit = 3                 # Number of tasks to run (use None for all ~30 tasks)

### Download and Process Raw Data

The AWS Collaboration benchmark data is hosted on GitHub. The `process_data()` function:

1. **Downloads** raw JSON files from the official AWS benchmark repository
2. **Processes** the data into MASEval's `Task` format with three key components:
   - **`environment_data`**: Available tools and their specifications
   - **`evaluation_data`**: Assertions to check if the agent succeeded
   - **`metadata`**: Scenario descriptions and context
3. **Saves** processed data to `data/<domain>/` for reuse

This only needs to run once (subsequent runs will skip download if data exists).

In [None]:
# Download and process raw benchmark data from AWS repository
# This creates a data/<domain>/ directory with tasks.json, agents.json, and prompt templates
process_data(verbose=1)

# Load tasks for the chosen domain
# Returns a TaskCollection - an iterable of Task objects with query, environment_data, and evaluation_data
tasks = load_tasks(domain=domain, limit=limit)

print(f"Successfully loaded {len(tasks)} tasks from the '{domain}' domain")

### Inspect a Sample Task

Let's examine what a task looks like. Each `Task` object contains:

- **`query`**: The initial user request that starts the conversation
- **`id`**: A unique identifier (auto-generated UUID)
- **`environment_data`**: Data needed to set up the environment (tools, user profile, etc.)
- **`user_data`**: Data specific to user simulation configuration
- **`evaluation_data`**: Data needed to evaluate agent performance (assertions)
- **`metadata`**: Any additional metadata (scenario descriptions, context, etc.)

**Evaluation Assertions** define success criteria with two types:
- **User-side assertions** (prefixed with `user:`): Check if the user's goals were met
- **System-side assertions** (prefixed with `agent:`): Check if correct tools were used with correct parameters

This dual evaluation approach (user + system) is a key feature of the AWS Collaboration benchmark.

In [None]:
# Examine the first task in detail
sample_task = tasks[0]

print("=" * 80)
print("TASK QUERY (Initial User Request):")
print("=" * 80)
print(f"{sample_task.query}\n")

print("=" * 80)
print("SCENARIO (Full Context - truncated):")
print("=" * 80)
scenario = sample_task.metadata.get('scenario', 'N/A')
print(f"{scenario[:300]}...\n")

print("=" * 80)
print("EVALUATION ASSERTIONS:")
print("=" * 80)
assertions = sample_task.evaluation_data.get("assertions", [])
for i, assertion in enumerate(assertions, 1):
    # Show which type of assertion it is: "agent:" marks system-side,
    # while "user:" or no prefix marks user-side
    assertion_type = "SYSTEM-SIDE" if assertion.lower().startswith("agent:") else "USER-SIDE"
    print(f"{i}. [{assertion_type}] {assertion}")

### Load Agent Configuration

Each domain comes with a recommended multi-agent setup from the AWS benchmark:

- **Specialist agents**: Domain experts with access to specific tools (e.g., `travel_expert` can book flights/hotels)
- **Primary orchestrator agent**: Coordinates specialist agents and handles user interaction

The configuration specifies:
- Which agents exist and their roles
- Which tools each agent can use
- System prompts defining agent behavior
- Which agent is the primary orchestrator (`primary_agent_id`)

MASEval supports both **smolagents** (using `managed_agents`) and **langgraph** (using tool-calling + routing) to implement this multi-agent pattern.
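
To show the shape of the configuration, the sketch below builds a heavily abbreviated, made-up stand-in for `agents.json` and applies the same additions that `load_agent_config()` makes. The agent names and instructions are invented for illustration; only the key names (`agents`, `agent_id`, `agent_name`, `tools`, `agent_instruction`, `primary_agent_id`) come from this notebook.

```python
from typing import Any, Dict

# Hypothetical, heavily abbreviated stand-in for the contents of agents.json.
raw_config: Dict[str, Any] = {
    "primary_agent_id": "orchestrator",
    "agents": [
        {"agent_id": "orchestrator", "agent_name": "Orchestrator", "tools": [],
         "agent_instruction": "Route requests to the right specialist."},
        {"agent_id": "travel_expert", "agent_name": "Travel Expert", "tools": ["FlightBooking"],
         "agent_instruction": "Search and book flights."},
    ],
}


def decorate(config: Dict[str, Any], framework: str) -> Dict[str, Any]:
    """Mirror what load_agent_config() adds on top of the file contents."""
    config = {**config, "framework": framework}
    config["agents"] = [
        {**agent, "model_config": {"model_id": "gemini-2.5-flash", "temperature": 0.7}}
        for agent in config["agents"]
    ]
    return config


config = decorate(raw_config, "smolagents")
specialists = [a["agent_id"] for a in config["agents"] if a["agent_id"] != config["primary_agent_id"]]
print(specialists)
```

Unlike the notebook's loader, this version copies the dictionaries instead of mutating them, which keeps the raw file contents reusable across frameworks.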

In [None]:
# Load the recommended agent configuration for the chosen domain
agent_config = load_agent_config(domain=domain, framework=framework)

print("=" * 80)
print("AGENT CONFIGURATION OVERVIEW:")
print("=" * 80)
print(f"Framework: {agent_config['framework']}")
print(f"Primary Orchestrator: {agent_config['primary_agent_id']}")
print(f"Total Agents: {len(agent_config['agents'])}\n")

print("=" * 80)
print("AGENT DETAILS:")
print("=" * 80)
for agent in agent_config["agents"]:
    print(f"\n{agent['agent_id']} - {agent['agent_name']}")
    print(f"   Tools: {', '.join(agent.get('tools', [])) or 'None (orchestrator role only)'}")
    print(f"   Role: {agent['agent_instruction'][:150]}...")

### Setup Result Logger

Before running the benchmark, we'll configure a callback to save results to disk.

**What is a callback?** In MASEval, callbacks are hooks that get notified during benchmark execution. The `FileResultLogger` callback:

- Automatically saves each task's results to a JSONL file (one JSON object per line)
- Captures metrics, traces, and evaluation reports
- Uses timestamped filenames to avoid overwriting previous runs
- Enables later analysis without re-running expensive evaluations

The results will be saved to `results/<domain>_<framework>_<timestamp>.jsonl`.
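
Since JSONL is one JSON object per line, a saved run can be read back with the standard library alone. The sketch below is illustrative: `read_jsonl` is a hypothetical helper, and the record fields in the demo file are made up, so check a real results file for the fields `FileResultLogger` actually writes.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSONL file: one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny made-up file so the example runs without a real benchmark run.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "travel_smolagents_demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"task": 1, "overall_gsr": 1.0}) + "\n")
        f.write(json.dumps({"task": 2, "overall_gsr": 0.0}) + "\n")
    records = read_jsonl(demo_path)

print(len(records))
```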

In [None]:
# Setup result logging to save benchmark results
output_dir = os.path.join(os.getcwd(), "results")
os.makedirs(output_dir, exist_ok=True)

# Create a FileResultLogger callback with timestamped filename
logger = FileResultLogger(
    output_dir=output_dir,
    filename_pattern=f"{domain}_{framework}_{{timestamp}}.jsonl"  # e.g., travel_smolagents_20231111_143025.jsonl
)

print(f"Results will be saved to: {output_dir}/")
print(f"  Filename pattern: {domain}_{framework}_<timestamp>.jsonl")

## Step 3: Environment Setup - Tools and User Simulation

Now we'll implement the core simulation components:

1. **`GenericTool`**: Framework-agnostic tool with LLM-based response simulation
2. **Framework-specific wrappers**: `GenericSmolagentsTool` and `GenericLanggraphTool` 
3. **`TravelEnvironment`**: Environment class that creates and manages tools
4. **User simulation**: LLM-powered user that responds realistically to agent queries

These components enable **realistic multi-turn interactions** without needing real APIs!

### Inspect Available Tools in Task

Before implementing the environment, let's see what tools the task provides:

In [None]:
# Examine tools available in this task's environment
env_data = sample_task.environment_data

print("=" * 80)
print("AVAILABLE TOOLS IN ENVIRONMENT:")
print("=" * 80)
for tool in env_data.get("tools", []):
    print(f"\n🔧 {tool['tool_name']}")
    print(f"   Description: {tool.get('description', 'N/A')}")
    print("   Actions:")
    for action in tool.get("actions", []):
        action_desc = action.get('description', 'N/A')[:100]
        print(f"      • {action['name']}: {action_desc}...")

### Implement GenericTool (Framework-Agnostic)

The `GenericTool` class provides the core tool logic that works with any agent framework:

**Key Features:**
- **LLM-based simulation**: Uses `ToolLLMSimulator` to generate realistic responses
- **Invocation tracking**: Records all tool calls in a `ToolInvocationHistory`
- **Schema conversion**: Converts JSON schemas to tool input specifications
- **Tracing & Configuration**: Implements mixins for gathering execution traces and config
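
The schema conversion is easiest to follow on a small input. The function below is a standalone copy of the `_schema_to_inputs` logic from the `GenericTool` cell in this notebook; the flight-search schema fed into it is made up for illustration.

```python
from typing import Any, Dict


def schema_to_inputs(input_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Standalone copy of the conversion logic used by GenericTool."""
    inputs = {}
    for name, prop in input_schema.get("properties", {}).items():
        # Prefer "data_type", fall back to "type", and default to "Any".
        declared = prop.get("data_type") or prop.get("type")
        if isinstance(declared, list):
            declared_type = [d if isinstance(d, str) else "Any" for d in declared]
        else:
            declared_type = declared if isinstance(declared, str) else "Any"
        inputs[name] = {"type": declared_type, "description": prop.get("description", "")}
    return inputs


# Made-up schema for a flight search action.
schema = {
    "properties": {
        "origin": {"type": "string", "description": "Departure city"},
        "passengers": {"data_type": "integer"},
        "notes": {},
    }
}
inputs = schema_to_inputs(schema)
print(inputs)
```

A property with no declared type, such as `notes` here, falls back to `"Any"` with an empty description.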

**How it works:**
1. Agent calls tool with parameters (e.g., `search_flights(origin="NYC", destination="LAX")`)
2. `GenericTool.__call__()` forwards to `ToolLLMSimulator`
3. Simulator uses an LLM to generate a realistic response based on tool description
4. Response is logged in history and returned to agent

This approach lets us benchmark agents **without real APIs or mock data**: the LLM generates contextually appropriate responses!
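
The call flow can be tried offline by swapping the LLM for a canned responder. In the sketch below, `FakeSimulator` and `RecordingTool` are hypothetical stand-ins for `ToolLLMSimulator` and `GenericTool`; the only detail borrowed from the notebook is that the simulator is called with `actual_inputs=` and returns a `(text, details)` pair.

```python
from typing import Any, Dict, List, Tuple


class FakeSimulator:
    """Hypothetical stand-in for an LLM-backed simulator; returns canned text."""

    def __call__(self, actual_inputs: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
        text = f"Found 2 flights from {actual_inputs['origin']} to {actual_inputs['destination']}"
        return text, {"simulated": True}


class RecordingTool:
    """Forward calls to a simulator and keep a log of every invocation."""

    def __init__(self, name: str, simulator: FakeSimulator):
        self.name = name
        self.simulator = simulator
        self.invocations: List[Dict[str, Any]] = []

    def __call__(self, **kwargs: Any) -> str:
        response, details = self.simulator(actual_inputs=kwargs)
        # Record inputs and outputs so an evaluator can inspect them later.
        self.invocations.append({"inputs": kwargs, "outputs": response, "meta": details})
        return response


search_flights = RecordingTool("search_flights", FakeSimulator())
reply = search_flights(origin="NYC", destination="LAX")
print(reply)
```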

In [None]:
class GenericTool(TraceableMixin, ConfigurableMixin):
    """Framework-agnostic tool with execution logic and tracing."""

    def __init__(self, spec: Dict[str, Any]):
        super().__init__()
        self.name = spec["name"]
        self.description = spec.get("description", "")
        self.input_schema = spec.get("input_schema", {})
        self.output_schema = "string"
        self.history = ToolInvocationHistory()

        # Create schema to inputs mapping for ToolLLMSimulator
        self.inputs = self._schema_to_inputs(self.input_schema)

        # Create LLM-based simulator
        self.simulator = ToolLLMSimulator(
            model=get_model(),
            template=None,  # Uses default template
            tool_name=self.name,
            tool_description=self.description,
            tool_inputs=self.inputs,
            max_try=1,
        )

    @staticmethod
    def _schema_to_inputs(input_schema: Dict[str, Any]) -> Dict[str, Any]:
        """Convert JSON schema to inputs format for Tool interface."""
        inputs_dict = {}
        for k, prop in input_schema.get("properties", {}).items():
            declared = prop.get("data_type") or prop.get("type")
            declared_type = (
                [d if isinstance(d, str) else "Any" for d in declared]
                if isinstance(declared, list)
                else (declared if isinstance(declared, str) else "Any")
            )
            inputs_dict[k] = {"type": declared_type, "description": prop.get("description", "")}
        return inputs_dict

    def __call__(self, **kwargs) -> str:
        """Execute the tool with given inputs.

        Uses LLM simulator to generate realistic response.
        """
        response_text, details = self.simulator(actual_inputs=kwargs)
        self.history.add_invocation(
            inputs={"kwargs": kwargs},
            outputs=response_text,
            status="<Unknown>",
            timestamp=None,
            meta=details,
        )
        return response_text
    
    def gather_traces(self) -> Dict[str, Any]:
        """Gather execution traces from this tool."""
        return {
            **super().gather_traces(),
            "name": self.name,
            "invocations": self.history.to_list(),
            "total_invocations": len(self.history.to_list()),
        }
    
    def gather_config(self) -> Dict[str, Any]:
        """Gather configuration from this tool."""
        return {
            **super().gather_config(),
            "name": self.name,
            "description": self.description,
            "input_schema": self.input_schema,
        }

    def __repr__(self):
        arguments = ", ".join([f"{k}: {v['type']}" for k, v in self.inputs.items()]) if isinstance(self.inputs, dict) else str(self.inputs)
        return f"{self.__class__.__name__}: {self.name}({arguments}) -> {str(self.output_schema)}"

print("GenericTool class defined")

### Implement Framework-Specific Tool Wrappers

Since each agent framework has its own tool interface expectations, we need thin wrapper classes:

**GenericSmolagentsTool (for smolagents):**
- Inherits from `smolagents.Tool`
- Implements `forward()` method (required by smolagents)
- Wraps `GenericTool` via composition (NOT inheritance to avoid `__call__` conflicts)
- Delegates execution to wrapped `GenericTool`

**GenericLanggraphTool (for langgraph):**
- Wraps `GenericTool` using LangChain's `StructuredTool`
- Converts to LangChain's tool format
- Delegates execution to wrapped `GenericTool`

**Why composition?** Different frameworks expect different `__call__` signatures. By wrapping `GenericTool` instead of inheriting, we avoid method conflicts.
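
The composition pattern can be shown without either framework installed. In this sketch `FrameworkTool` is a hypothetical base class that mimics a framework whose `__call__` dispatches to `forward()`; it is not the smolagents `Tool` class, and `WrappedTool` is an illustrative name.

```python
from typing import Any, Callable


class FrameworkTool:
    """Hypothetical framework base class whose __call__ routes to forward()."""

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # A real framework might validate or log here before dispatching.
        return self.forward(*args, **kwargs)

    def forward(self, *args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError


class WrappedTool(FrameworkTool):
    """Holds the core tool as an attribute instead of inheriting from it."""

    def __init__(self, core_tool: Callable[..., str]):
        self.core_tool = core_tool

    def forward(self, **kwargs: Any) -> str:
        # The framework's __call__ stays intact; only forward() delegates.
        return self.core_tool(**kwargs)


def core_tool(**kwargs: Any) -> str:
    return f"called with {sorted(kwargs)}"


tool = WrappedTool(core_tool)
result = tool(city="Paris", nights=2)
print(result)
```

Had `WrappedTool` inherited from both the framework class and the core tool, Python's method resolution order would pick a single `__call__`, and one of the two behaviors would be silently lost.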

In [None]:
class GenericSmolagentsTool(SmolagentsTool, ConfigurableMixin, TraceableMixin):
    """Smolagents-specific wrapper around GenericTool.
    
    Uses composition (NOT inheritance from GenericTool) to avoid __call__ conflicts.
    Smolagents expects: __call__ → forward() → actual work
    """
    
    # Skip smolagents signature validation (like other smolagents tool wrappers)
    skip_forward_signature_validation = True
    
    def __init__(self, generic_tool: GenericTool):
        # Store wrapped GenericTool instance
        self.generic_tool = generic_tool
        
        # Set smolagents-required attributes from wrapped tool
        self.name = self.generic_tool.name
        self.description = self.generic_tool.description
        self.inputs = self.generic_tool.inputs
        self.output_type = self.generic_tool.output_schema
        
        # Initialize smolagents.Tool
        super().__init__()
    
    def forward(self, **kwargs) -> str:
        """Smolagents forward method - delegates to wrapped GenericTool."""
        return self.generic_tool(**kwargs)
    
    def gather_traces(self) -> Dict[str, Any]:
        """Gather traces from wrapped tool."""
        return self.generic_tool.gather_traces()
    
    def gather_config(self) -> Dict[str, Any]:
        """Gather config from wrapped tool."""
        return self.generic_tool.gather_config()
    
    def __repr__(self):
        return f"GenericSmolagentsTool({self.generic_tool})"


class GenericLanggraphTool(ConfigurableMixin, TraceableMixin):
    """LangGraph-specific tool wrapper."""
    
    def __init__(self, generic_tool: GenericTool):
        self.generic_tool = generic_tool
        
        # Expose name for Environment's _tools_dict storage
        self.name = generic_tool.name
        
        # Create LangChain StructuredTool wrapper
        self.tool = LanggraphStructuredTool.from_function(
            func=generic_tool.__call__,
            name=generic_tool.name,
            description=generic_tool.description,
        )
    
    def __call__(self, *args, **kwargs):
        """Delegate to wrapped LangChain tool."""
        return self.tool(*args, **kwargs)
    
    def gather_traces(self) -> Dict[str, Any]:
        """Gather traces from wrapped tool."""
        return self.generic_tool.gather_traces()
    
    def gather_config(self) -> Dict[str, Any]:
        """Gather config from wrapped tool."""
        return self.generic_tool.gather_config()
    
    def __repr__(self):
        return f"GenericLanggraphTool({self.generic_tool})"

print("Framework-specific tool wrappers defined")

### Implement TravelEnvironment

The `TravelEnvironment` class extends MASEval's `Environment` base class to create and manage tools:

**Key Responsibilities:**
1. **`setup_state()`**: Initialize environment state from task data
2. **`create_tools()`**: Create tool instances from tool specifications in task
   - Iterates through tool specs and actions
   - Creates `GenericTool` for each action
   - Wraps in framework-specific wrapper (`GenericSmolagentsTool` or `GenericLanggraphTool`)
   - Registers tools with benchmark for tracing
3. **`get_tools_for_agent()`**: Filter tools by name for specific agents
   - Specialist agents only get their assigned tools
   - Orchestrator agents typically get no tools (they delegate to specialists)

The base `Environment` class automatically stores tools in `_tools_dict` for tracing and provides the `get_tools()` method.
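
The per-agent filtering step boils down to expanding a tool name into its action-level tools. The sketch below uses plain dictionaries with made-up action names; `registry` stands in for the environment's tool dictionary and `tools_for_agent` is a hypothetical, simplified counterpart of `get_tools_for_agent()`.

```python
from typing import Any, Dict, List

# Made-up environment state: each tool groups one or more actions.
state = {
    "tools": [
        {"tool_name": "FlightBooking", "actions": [{"name": "search_flights"}, {"name": "book_flight"}]},
        {"tool_name": "HotelBooking", "actions": [{"name": "search_hotels"}]},
    ]
}
# Registry keyed by action name, standing in for the environment's tool dictionary.
registry: Dict[str, Any] = {
    name: f"<tool {name}>" for name in ("search_flights", "book_flight", "search_hotels")
}


def tools_for_agent(tool_names: List[str]) -> List[Any]:
    """Expand tool names into the registered action-level tools."""
    selected = []
    for spec in state["tools"]:
        if spec["tool_name"] in tool_names:
            for action in spec["actions"]:
                if action["name"] in registry:
                    selected.append(registry[action["name"]])
    return selected


print(tools_for_agent(["FlightBooking"]))
print(tools_for_agent([]))
```

An agent configured with an empty tool list, as orchestrators typically are, receives no tools at all.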

In [None]:
class TravelEnvironment(Environment):
    """Environment that creates framework-agnostic tools and manages them.
    
    This class demonstrates the pattern:
    1. Create GenericTool (framework-agnostic)
    2. Wrap in framework-specific wrapper
    3. Register with benchmark for tracing
    """

    def __init__(self, task_data: Dict[str, Any], framework: str, benchmark: Optional[Benchmark] = None):
        self.benchmark = benchmark
        self._framework = framework
        super().__init__(task_data)

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from task data."""
        return task_data.get("environment_data", {})

    def create_tools(self) -> list:
        """Create tool instances from tool specifications.
        
        Returns:
            List of tool objects. The base Environment class will handle
            storing them in self._tools_dict for tracing.
        """
        tools_list = []
        seen_tool_names = set()
        
        # Determine which wrapper class to use based on framework
        framework = self._framework
        tool_classes = {"smolagents": GenericSmolagentsTool, "langgraph": GenericLanggraphTool}
        if framework not in tool_classes:
            raise ValueError(f"Unsupported framework: {framework}. Must be one of {list(tool_classes.keys())}")
        WrapperClass = tool_classes[framework]

        for tool_spec in self.state.get("tools", []):
            for action_spec in tool_spec.get("actions", []):
                tool_name = action_spec.get("name")
                if tool_name and tool_name not in seen_tool_names:
                    # Step 1: Create framework-agnostic GenericTool
                    generic_tool = GenericTool(action_spec)
                    
                    # Step 2: Wrap it for the specific framework
                    tool = WrapperClass(generic_tool)
                    tools_list.append(tool)
                    seen_tool_names.add(tool_name)
                    
                    # Step 3: Register tool with benchmark for top-level trace/config collection
                    if self.benchmark:
                        self.benchmark.register("tools", tool_name, tool)

        return tools_list

    def get_tools_for_agent(self, agent_tools_names: List[str]) -> List[Any]:
        """Get framework-specific tools for an agent by tool name.
        
        This is a custom method for this benchmark that filters tools by name.
        Tools are already the correct type from create_tools().
        """
        agent_tools = []
        for tool_name in agent_tools_names:
            for tool_spec in self.state.get("tools", []):
                if tool_spec.get("tool_name") == tool_name:
                    for action_spec in tool_spec.get("actions", []):
                        action_name = action_spec.get("name")
                        # Look up tool in the _tools_dict managed by base Environment
                        if action_name in self._tools_dict:
                            agent_tools.append(self._tools_dict[action_name])
        return agent_tools

print("TravelEnvironment class defined")

## Step 4: Implement Evaluator

The evaluator is crucial for the AWS Collaboration benchmark's **dual evaluation** approach. We'll implement:

1. **`LLMAssertionEvaluator`**: LLM-based evaluator that checks assertions
2. **Dual evaluation**: Separate user-side and system-side evaluation
3. **GSR metrics**: Goal Success Rate computation

**Why LLM-as-a-Judge?** 
- Assertions are natural language (e.g., "User was offered a flight departing in the afternoon")
- Requires understanding context and nuance
- An LLM judge can handle paraphrase and implied meaning that rule-based string matching would miss

### Implement LLMAssertionEvaluator

The `LLMAssertionEvaluator` class extends MASEval's `Evaluator` base class:

**Key Responsibilities:**
1. **Parse assertions**: Filter assertions by type (user-side vs system-side)
   - User assertions start with `user:` or have no prefix
   - System assertions start with `agent:`
   
2. **Format evaluation prompt**: Build a prompt with:
   - Scenario description (what user wants)
   - Conversation history (agent-user interaction)
   - Tool invocations (for system-side only)
   - Assertions to check
   
3. **LLM judgment**: Use LLM to evaluate each assertion as True/False with reasoning

4. **Compute GSR metrics**:
   - **GSR** (Goal Success Rate): 1.0 if ALL assertions true, else 0.0
   - **Partial GSR**: Percentage of assertions that are true

**Evaluation Types:**
- **User-side (`gsr_type="user"`)**: Did the agent satisfy user-visible goals?
- **System-side (`gsr_type="system"`)**: Did the agent use tools correctly?
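
The prefix rule for separating the two assertion types can be sketched as follows. `split_assertions` is a hypothetical helper written to match the rule described above (`agent:` means system-side, anything else is user-side); the notebook's own `_parse_assertions` may differ, for example in whether it strips the prefix. The sample assertions are made up.

```python
from typing import List


def split_assertions(assertions: List[str], gsr_type: str) -> List[str]:
    """Hypothetical filter: keep the assertions relevant to one evaluation type."""
    selected = []
    for assertion in assertions:
        is_system = assertion.strip().lower().startswith("agent:")
        # Keep system assertions for "system", everything else for "user".
        if (gsr_type == "system") == is_system:
            selected.append(assertion)
    return selected


# Made-up assertions for illustration.
assertions = [
    "user: The user is offered an afternoon flight.",
    "agent: search_flights is called with destination LAX.",
    "The user receives a booking confirmation.",
]
user_side = split_assertions(assertions, "user")
system_side = split_assertions(assertions, "system")
print(len(user_side), len(system_side))
```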

In [None]:
class LLMAssertionEvaluator(Evaluator):
    """Framework-agnostic evaluator using LLM for assertion checking.
    
    This evaluator uses an LLM to judge whether assertions are satisfied
    by examining the conversation history and tool invocations.
    """

    def __init__(
        self,
        model: ModelAdapter,
        task: Task,
        gsr_type: str = "user",
    ):
        """Initialize the evaluator.
        
        Args:
            model: The model to use for evaluation (LLM-as-a-judge)
            task: The task being evaluated (contains assertions and scenario)
            gsr_type: Either "user" or "system" for evaluation type
        """
        self.model = model
        self.task = task
        self.gsr_type = gsr_type
        
        # Load appropriate prompt template
        # These templates guide the LLM on how to evaluate assertions
        template_file = "user.txt" if gsr_type == "user" else "system.txt"
        template_path = os.path.join(os.getcwd(), "prompt_templates", template_file)
        with open(template_path, "r") as f:
            self.template = f.read()

    def __call__(self, trace: MessageHistory, tool_traces: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """Evaluate the trace against assertions.
        
        Args:
            trace: Message history from agent execution (agent-user conversation)
            tool_traces: Optional tool traces from environment (required for system-side evaluation)
        
        Returns:
            Dict with keys: gsr, partial_gsr, report (list of assertion judgments)
        """
        # Parse assertions for this evaluation type (filter user vs system)
        all_assertions = self.task.evaluation_data["assertions"]
        assertions = self._parse_assertions(all_assertions)
        
        # Format conversation history into a readable string
        history = self._format_conversation_history(trace)
        
        # Get scenario description (what the user wants to accomplish)
        scenario = self.task.metadata.get("scenario", None)
        if scenario is None:
            raise ValueError("Task metadata must include 'scenario' for GSR evaluation")
        
        # Build evaluation prompt based on type
        if self.gsr_type == "user":
            # User-side: Only needs conversation history
            prompt = self.template.replace("{{scenario}}", scenario)
            prompt = prompt.replace("{{history}}", history)
            prompt = prompt.replace("{{assertions}}", "\n".join(assertions))
        else:  # system
            # System-side: Also needs tool invocations
            invocations = self._format_tool_invocations(tool_traces or {})
            prompt = self.template.replace("{{scenario}}", scenario)
            prompt = prompt.replace("{{history}}", history)
            prompt = prompt.replace("{{invocations}}", invocations)
            prompt = prompt.replace("{{assertions}}", "\n".join(assertions))
        
        # Get LLM judgment
        response = self.model.generate(prompt).strip()
        
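        # Clean up response (remove a surrounding markdown code fence if present).
        # str.strip(chars) removes a *set* of characters, so strip("json") could also
        # eat leading or trailing 'j', 's', 'o', 'n' characters from the payload itself.
        if response.startswith("```"):
            response = response.strip("`").strip()
            if response.lower().startswith("json"):
                response = response[len("json"):].strip()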
        
        try:
            # Parse JSON response
            report = json.loads(response)
            
            # Handle wrapped responses (e.g., {"assertions": [...]})
            for wrapper_key in ["assertions", "results"]:
                if isinstance(report, dict) and wrapper_key in report:
                    report = report[wrapper_key]
                    break
            
            # Ensure it's a list
            if isinstance(report, dict):
                report = [report]
            
            # Compute GSR metrics from the report
            gsr, partial_gsr = self._compute_gsr(report)
            
            # Add assertion type to each result for tracking
            for item in report:
                item["assertion_type"] = self.gsr_type
            
            return {
                "gsr": gsr,
                "partial_gsr": partial_gsr,
                "report": report
            }
            
        except json.JSONDecodeError as e:
            # Handle parse errors gracefully
            return {
                "gsr": 0.0,
                "partial_gsr": 0.0,
                "report": [],
                "error": f"Failed to decode JSON: {e}",
                "raw_response": response
            }
    
    def _parse_assertions(self, assertions: List[str]) -> List[str]:
        """Parse assertions and filter by type (user or system).
        
        User assertions: start with "user:" or have no prefix
        System assertions: start with "agent:"
        """
        parsed = []
        user_prefix, system_prefix = "user:", "agent:"
        
        for assertion in assertions:
            assertion = assertion.strip()
            
            if self.gsr_type == "user":
                if assertion.lower().startswith(user_prefix):
                    # Remove prefix
                    parsed.append(assertion[len(user_prefix):].strip())
                elif not assertion.lower().startswith(system_prefix):
                    # No prefix means user assertion (AWS default)
                    parsed.append(assertion)
            else:  # system
                if assertion.lower().startswith(system_prefix):
                    # Remove prefix
                    parsed.append(assertion[len(system_prefix):].strip())
        
        return parsed
    
    def _format_conversation_history(self, trace: MessageHistory) -> str:
        """Format conversation history for the prompt.
        
        Converts MessageHistory into a readable string format.
        """
        formatted_lines = []
        for msg in trace:
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            
            # Handle content that might be a list (smolagents format)
            if isinstance(content, list):
                content_str = " ".join(
                    item.get("text", "") if isinstance(item, dict) else str(item)
                    for item in content
                )
            else:
                content_str = str(content)
            
            formatted_lines.append(f"{role}: {content_str}")
        
        return "\n".join(formatted_lines)
    
    def _format_tool_invocations(self, tool_traces: Dict[str, Any]) -> str:
        """Format tool invocations from environment traces for system-side evaluation.
        
        Args:
            tool_traces: Tool traces dictionary from execution_traces["environment"]["tools"]
        
        Returns:
            Formatted string showing which tools were called with what inputs/outputs
        """
        invocations_lines = []
        
        for tool_name, tool_data in tool_traces.items():
            # Check if tool has invocation records
            invocations = tool_data.get("invocations", [])
            if invocations:
                for inv in invocations:
                    invocations_lines.append(
                        f"Tool: {tool_name}\n"
                        f"  Inputs: {inv.get('inputs', {})}\n"
                        f"  Outputs: {inv.get('outputs', '')}\n"
                        f"  Status: {inv.get('status', 'Unknown')}"
                    )
        
        return "\n".join(invocations_lines) if invocations_lines else "No tool invocations recorded"
    
    def _compute_gsr(self, report: List[Dict[str, Any]]) -> Tuple[float, float]:
        """Compute Goal Success Rate metrics.
        
        Returns:
            Tuple of (gsr, partial_gsr)
            - gsr: Binary score (1.0 if ALL assertions True, else 0.0)
            - partial_gsr: Percentage of True assertions
        """
        if not report:
            return 1.0, 1.0
        
        # Count True and False answers from LLM judgments
        true_count = sum(1 for item in report if str(item.get("answer", "")).lower() == "true")
        false_count = sum(1 for item in report if str(item.get("answer", "")).lower() == "false")
        total = true_count + false_count
        
        if total == 0:
            return 1.0, 1.0
        
        # GSR: 1.0 only if ALL assertions are True (strict)
        gsr = 1.0 if false_count == 0 else 0.0
        
        # Partial GSR: percentage of True assertions (lenient)
        partial_gsr = true_count / total
        
        return gsr, partial_gsr

print("LLMAssertionEvaluator class defined")
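
To make the scoring rules concrete, here is a small standalone sketch that mirrors the prefix filtering in `_parse_assertions` and the strict/partial scoring in `_compute_gsr`. The helper names `split_assertions` and `score_report` and the sample assertions are illustrative assumptions; they are not part of MASEval or the benchmark data.

```python
from typing import Dict, List, Tuple


def split_assertions(assertions: List[str]) -> Dict[str, List[str]]:
    """Split raw assertions into user-side and system-side lists by prefix."""
    buckets: Dict[str, List[str]] = {"user": [], "system": []}
    for raw in assertions:
        text = raw.strip()
        lowered = text.lower()
        if lowered.startswith("agent:"):
            buckets["system"].append(text[len("agent:"):].strip())
        elif lowered.startswith("user:"):
            buckets["user"].append(text[len("user:"):].strip())
        else:
            # No prefix is treated as a user-side assertion
            buckets["user"].append(text)
    return buckets


def score_report(report: List[Dict[str, str]]) -> Tuple[float, float]:
    """Return (gsr, partial_gsr) for a list of {"answer": "True"/"False"} judgments."""
    true_count = sum(1 for item in report if str(item.get("answer", "")).lower() == "true")
    false_count = sum(1 for item in report if str(item.get("answer", "")).lower() == "false")
    total = true_count + false_count
    if total == 0:
        return 1.0, 1.0  # same vacuous-success convention as _compute_gsr
    return (1.0 if false_count == 0 else 0.0), true_count / total


# Hypothetical sample data, for illustration only
sample = [
    "user: The user is told the flight is booked",
    "agent: book_flight is called with the requested date",
    "The user receives a confirmation number",
]
buckets = split_assertions(sample)
print(buckets["user"])
print(buckets["system"])

gsr, partial = score_report([{"answer": "True"}, {"answer": "False"}, {"answer": "True"}])
print(gsr, round(partial, 2))
```

A single `False` judgment drops the strict GSR to 0.0, while the partial GSR still credits the two assertions that passed.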

### How Evaluation Works in Practice

Here's the evaluation flow during benchmark execution:

**1. Agent runs and completes the task:**
   - Interacts with simulated user
   - Calls simulated tools
   - Generates final answer

**2. Benchmark collects traces:**
   - `MessageHistory`: All agent-user messages
   - Tool invocation history: Which tools were called, with what parameters

**3. User-side evaluation:**
   ```python
   user_result = user_evaluator(trace=message_history)
   # Returns: {"gsr": 1.0, "partial_gsr": 1.0, "report": [...]}
   ```

**4. System-side evaluation:**
   ```python
   system_result = system_evaluator(
       trace=message_history, 
       tool_traces=environment.gather_traces()["tools"]
   )
   # Returns: {"gsr": 0.0, "partial_gsr": 0.67, "report": [...]}
   ```

**5. Compute overall metrics:**
   - **Overall GSR** = User GSR × System GSR (both must be 1.0)
   - **User GSR** = Did agent satisfy user's goals?
   - **System GSR** = Did agent use tools correctly?
   - **Partial GSRs** = Percentage of assertions met

This dual evaluation ensures agents don't just satisfy users through incorrect means; they must also use the right tools with the right parameters!
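
As a worked example of step 5, the sketch below combines the two example results from steps 3 and 4. The function name `combine_results` is an assumption made for illustration; the benchmark's own aggregation lives in its `evaluate()` method.

```python
from typing import Any, Dict


def combine_results(user_result: Dict[str, Any], system_result: Dict[str, Any]) -> Dict[str, float]:
    """Combine user-side and system-side results into overall metrics."""
    user_gsr = float(user_result.get("gsr", 0.0))
    system_gsr = float(system_result.get("gsr", 0.0))
    # Both sides are binary, so the product is 1.0 only when both are 1.0
    overall_gsr = user_gsr * system_gsr
    return {"user_gsr": user_gsr, "system_gsr": system_gsr, "overall_gsr": overall_gsr}


# Values taken from the example return comments in steps 3 and 4
user_result = {"gsr": 1.0, "partial_gsr": 1.0, "report": []}
system_result = {"gsr": 0.0, "partial_gsr": 0.67, "report": []}
print(combine_results(user_result, system_result))
```

Here the user-side goals were all met, but one or more system-side assertions failed, so the overall GSR is 0.0.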

## Step 5: Implement the Benchmark Class

Now we bring everything together in the `AWSCollabBenchmark` class! This class:

1. **Extends** MASEval's `Benchmark` base class
2. **Implements** required setup methods for environment, user, agents, and evaluators
3. **Orchestrates** the entire benchmark execution flow
4. **Uses a minimal branching pattern**: Framework-agnostic logic with small framework-specific branches

The base `Benchmark` class will call these methods automatically when you run `benchmark.run()`!
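
The idea of a base class calling your hooks is the classic template-method pattern. The following is only a toy sketch of that idea, not MASEval's actual `Benchmark` implementation: `ToyBenchmark`, `EchoBenchmark`, and the simplified hook signatures are assumptions made for illustration. The `eval` key in each result mirrors what `compute_benchmark_metric` reads later in this notebook.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ToyBenchmark(ABC):
    """Toy base class: run() drives the hooks that subclasses implement."""

    def __init__(self, tasks: List[Dict[str, Any]]):
        self.tasks = tasks

    @abstractmethod
    def setup_environment(self, task: Dict[str, Any]) -> Any: ...

    @abstractmethod
    def run_agents(self, task: Dict[str, Any], environment: Any) -> Any: ...

    @abstractmethod
    def evaluate(self, task: Dict[str, Any], answer: Any) -> Dict[str, float]: ...

    def run(self) -> List[Dict[str, Any]]:
        results = []
        for task in self.tasks:
            environment = self.setup_environment(task)   # per-task setup
            answer = self.run_agents(task, environment)  # execution
            results.append({"task": task["query"], "eval": [self.evaluate(task, answer)]})
        return results


class EchoBenchmark(ToyBenchmark):
    """Trivial subclass: the 'agent' echoes the query in upper case."""

    def setup_environment(self, task):
        return {"tools": []}

    def run_agents(self, task, environment):
        return task["query"].upper()

    def evaluate(self, task, answer):
        return {"overall_gsr": 1.0 if answer == task["expected"] else 0.0}


results = EchoBenchmark([{"query": "hi", "expected": "HI"}]).run()
print(results)
```

The subclass never writes a loop over tasks; it only fills in the hooks, which is the division of labour this step relies on.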

The `AWSCollabBenchmark` class implements these key methods:

**Setup Methods** (called for each task):
- **`setup_environment()`**: Creates `TravelEnvironment` with simulated tools
- **`setup_user()`**: Creates LLM-powered user simulator (`SmolAgentUser` or `LangGraphUser`)
- **`setup_agents()`**: Creates multi-agent system (orchestrator + specialists)
- **`setup_evaluators()`**: Creates user-side and system-side evaluators

**Execution Methods**:
- **`run_agents()`**: Executes agents on the task
- **`evaluate()`**: Runs both evaluators and computes GSR metrics

**Helper Methods**:
- **`_create_langgraph_agent()`**: Creates LangGraph agent graph (framework-specific)
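
The minimal branching pattern mentioned above can be sketched in isolation: do all shared preprocessing first, then branch only to pick a constructor. The notebook itself uses `if`/`elif`; the dictionary lookup below is a swapped-in variation of the same idea. All names here (`UserSettings`, `FACTORIES`, `build_user`, and the two factory functions) are hypothetical stand-ins, not MASEval classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UserSettings:
    """Framework-agnostic settings computed once, before any branching."""
    name: str
    scenario: str


# Hypothetical stand-ins for framework-specific user classes
def make_smolagents_user(settings: UserSettings) -> str:
    return f"smolagents user '{settings.name}' for: {settings.scenario}"


def make_langgraph_user(settings: UserSettings) -> str:
    return f"langgraph user '{settings.name}' for: {settings.scenario}"


FACTORIES: Dict[str, Callable[[UserSettings], str]] = {
    "smolagents": make_smolagents_user,
    "langgraph": make_langgraph_user,
}


def build_user(framework: str, scenario: str) -> str:
    # Shared preprocessing (framework-agnostic)
    settings = UserSettings(name="Simulated User", scenario=scenario.strip())
    # The only framework-specific step: choose the constructor
    try:
        factory = FACTORIES[framework]
    except KeyError:
        raise ValueError(f"Unsupported framework: {framework}") from None
    return factory(settings)


print(build_user("langgraph", " book a flight "))
```

Adding a third framework would then mean adding one factory and one dictionary entry, leaving the shared logic untouched.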

Let's implement each method step by step!

In [None]:

class AWSCollabBenchmark(Benchmark):
    """AWS Collaboration benchmark using minimal branching pattern."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task) -> Environment:
        # Shared logic (framework-agnostic)
        task_data = {
            "environment_data": task.environment_data,
            "query": task.query,
            "evaluation_data": task.evaluation_data,
            "metadata": task.metadata,
        }
        framework: str = agent_data["framework"]
        return TravelEnvironment(task_data, framework=framework, benchmark=self)

    def setup_user(self, agent_data: Dict[str, Any], environment: Environment, task: Task) -> User:
        # Shared preprocessing (framework-agnostic)
        user_profile = task.environment_data.get("user_profile", {})
        scenario = task.metadata.get("scenario", "")
        initial_prompt = task.query

        # Get framework from agent_data to create appropriate user type
        framework = agent_data["framework"]

        # Framework-specific user class only
        if framework == "smolagents":
            return SmolAgentUser(
                name="Simulated User",
                model=get_model(),
                user_profile=user_profile,
                scenario=scenario,
                initial_prompt=initial_prompt,
            )
        elif framework == "langgraph":
            return LangGraphUser(
                name="Simulated User",
                model=get_model(),
                user_profile=user_profile,
                scenario=scenario,
                initial_prompt=initial_prompt,
            )
        else:
            raise ValueError(f"Unsupported framework: {framework}")

    def setup_agents(
        self, agent_data: Dict[str, Any], environment: Environment, task: Task, user: User | None
    ) -> Tuple[List[AgentAdapter], Dict[str, AgentAdapter]]:
        if not isinstance(environment, TravelEnvironment):
            raise TypeError("Expected TravelEnvironment")

        # Shared preprocessing (framework-agnostic)
        framework = agent_data["framework"]
        primary_agent_id = agent_data["primary_agent_id"]
        primary_spec = next(a for a in agent_data["agents"] if a["agent_id"] == primary_agent_id)
        
        # Extract model config from agent spec
        model_config = primary_spec.get("model_config", {})
        model_id = model_config.get("model_id", "gemini-2.5-flash")
        temperature = model_config.get("temperature", 0.7)

        # Framework-specific agent creation only       
        if framework == "smolagents":
            # Initialize smolagents model
            smolagents_model = OpenAIServerModel(
                model_id=model_id,
                api_base="https://generativelanguage.googleapis.com/v1beta/openai/",
                api_key=os.getenv("GOOGLE_API_KEY"),
            )

            # Create specialist agents (sub-agents)
            specialist_agents = []
            for agent_spec in agent_data["agents"]:
                if agent_spec["agent_id"] == primary_agent_id:
                    continue  # Skip the primary agent itself
                
                # Get tools for this specialist
                specialist_tools = environment.get_tools_for_agent(agent_spec.get("tools", []))
                specialist_tools.append(FinalAnswerTool())
                
                # Create specialist agent
                specialist = ToolCallingAgent(
                    model=smolagents_model,
                    tools=specialist_tools,
                    name=agent_spec["agent_name"],
                    description=agent_spec["agent_instruction"],
                    verbosity_level=0,
                )
                specialist_agents.append(specialist)

            # Get primary agent tools (usually empty for orchestrators)
            primary_agent_tools = environment.get_tools_for_agent(primary_spec.get("tools", []))
            
            # Build primary agent tool list
            tools = primary_agent_tools + [FinalAnswerTool()]
            user_tool = user.get_tool() if user else None
            if user_tool:
                tools.append(user_tool)

            # Create primary orchestrator agent with managed_agents
            agent = ToolCallingAgent(
                model=smolagents_model,
                tools=tools,
                managed_agents=specialist_agents if specialist_agents else None,
                name=primary_spec["agent_name"],
                description=primary_spec["agent_instruction"],
                verbosity_level=0,
            )

            wrapper = SmolAgentAdapter(agent, primary_agent_id)

        elif framework == "langgraph":
            # Initialize LangChain model
            langgraph_model = ChatGoogleGenerativeAI(
                model=model_id,
                google_api_key=os.getenv("GOOGLE_API_KEY"),
                temperature=temperature,
            )

            # Get primary agent tools
            primary_agent_tools = environment.get_tools_for_agent(primary_spec.get("tools", []))
            user_tool = user.get_tool() if user else None
            
            # Build tool list
            tools = primary_agent_tools[:]
            if user_tool:
                tools.append(user_tool)

            # Create agent graph
            graph = self._create_langgraph_agent(
                model=langgraph_model,
                tools=tools,
                agent_name=primary_spec["agent_name"],
                agent_instruction=primary_spec["agent_instruction"],
            )

            wrapper = LangGraphAgentAdapter(graph, primary_agent_id)

        else:
            raise ValueError(f"Unsupported framework: {framework}")

        agents_dict: Dict[str, AgentAdapter] = {primary_agent_id: wrapper}
        return [wrapper], agents_dict

    def _create_langgraph_agent(self, model: Any, tools: List[Any], agent_name: str, agent_instruction: str):
        """Helper method to create a LangGraph agent (only called for langgraph framework)."""
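        # Local imports keep this helper self-contained
        from typing import Annotated
        from langgraph.graph.message import add_messages

        class AgentState(TypedDict):
            # add_messages appends each node's output to the history; a plain
            # List[Any] channel would be overwritten on every state update
            messages: Annotated[List[Any], add_messages]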

        # Bind tools to LLM
        llm_with_tools = model.bind_tools(tools)

        # Define the agent node
        def call_model(state: AgentState):
            messages = state["messages"]
            # Add system message with agent instruction if not present
            has_system = any(isinstance(m, SystemMessage) for m in messages)
            if not has_system:
                system_message = SystemMessage(content=f"{agent_name}: {agent_instruction}")
                messages = [system_message] + messages
            
            response = llm_with_tools.invoke(messages)
            return {"messages": [response]}

        # Build the graph
        workflow = StateGraph(AgentState)
        workflow.add_node("agent", call_model)
        workflow.add_node("tools", ToolNode(tools))
        workflow.set_entry_point("agent")
        workflow.add_conditional_edges("agent", tools_condition, {"tools": "tools", END: END})
        workflow.add_edge("tools", "agent")

        return workflow.compile()

    def setup_evaluators(self, environment, task, agents, user) -> Sequence[Evaluator]:
        """Create both user-side and system-side evaluators."""
        user_evaluator = LLMAssertionEvaluator(
            get_model(), task, gsr_type="user"
        )
        system_evaluator = LLMAssertionEvaluator(
            get_model(), task, gsr_type="system"
        )
        return [user_evaluator, system_evaluator]

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment) -> Any:
        # Execute agents and return their final answers (not full traces)
        answers = [agent.run(task.query) for agent in agents]
        return answers[0] if len(answers) == 1 else answers

    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any],
    ) -> List[Dict[str, Any]]:
        """Evaluate using both user-side and system-side evaluators.
        
        Following AWS Multi-Agent Collaboration paper (https://arxiv.org/html/2412.05449v1):
        - User-side assertions: Evaluate behaviors observable by the user (conversation only)
        - System-side assertions: Evaluate behaviors NOT observable by user (tool calls, internal state)
        
        Returns aggregated evaluation results matching AWS paper format:
        - user_gsr: Goal Success Rate from user perspective
        - system_gsr: Goal Success Rate from system perspective  
        - overall_gsr: Combined GSR (both must pass)
        - supervisor_gsr: Supervisor agent success rate (user-side OR overall success)
        - partial_gsr: Percentage of all assertions that passed
        - user_partial_gsr: Percentage of user assertions that passed (bonus metric)
        - system_partial_gsr: Percentage of system assertions that passed (bonus metric)
        - report: Combined list of all assertion judgments
        """
        # Use user traces for user-observable conversation (already filtered by User class)
        # The user's history only contains messages from user_input tool interactions
        user_trace = traces.get("user", {})
        user_observable_messages = MessageHistory(user_trace.get("history", []))
        
        # Extract primary agent's full message history for system-side evaluation
        primary_agent_id = list(agents.keys())[0]
        agent_trace = traces.get("agents", {}).get(primary_agent_id, {})
        all_messages = MessageHistory(agent_trace.get("messages", []))
        
        # Extract tool invocations from top-level tools traces (registered directly)
        tool_traces = traces.get("tools", {})
        
        # Run both evaluators
        results = []
        for evaluator in evaluators:
            # Pass tool traces to system evaluator, user-observable messages to user evaluator
            if isinstance(evaluator, LLMAssertionEvaluator) and evaluator.gsr_type == "system":
                # System evaluator gets full message history + tool traces
                result = evaluator(all_messages, tool_traces)
            else:
                # User evaluator gets only user-observable messages
                result = evaluator(user_observable_messages)
            results.append(result)
        
        # Combine results (first is user, second is system based on setup_evaluators order)
        user_result = results[0] if len(results) > 0 else {"gsr": 0.0, "partial_gsr": 0.0, "report": []}
        system_result = results[1] if len(results) > 1 else {"gsr": 0.0, "partial_gsr": 0.0, "report": []}
        
        # Compute overall metrics
        combined_report = user_result.get("report", []) + system_result.get("report", [])
        
        # Overall GSR: Both user and system must have GSR=1.0
        overall_gsr = 1.0 if (user_result.get("gsr", 0.0) == 1.0 and system_result.get("gsr", 0.0) == 1.0) else 0.0
        
        # Supervisor GSR: Per AWS paper Table 1 - "If overall GSR is 1 or supervisor agent is reliable, then score is 1; else 0"
        # Interpretation: Supervisor is considered successful if overall task succeeds OR if supervisor did its job correctly
        # (i.e., user-side assertions passed, meaning supervisor communicated correctly with user regardless of sub-agent failures)
        supervisor_gsr = 1.0 if (overall_gsr == 1.0 or user_result.get("gsr", 0.0) == 1.0) else 0.0
        
        # Overall partial GSR: percentage of ALL assertions that passed
        if combined_report:
            total_true = sum(1 for item in combined_report if str(item.get("answer", "")).lower() == "true")
            total_assertions = len(combined_report)
            overall_partial_gsr = total_true / total_assertions if total_assertions > 0 else 1.0
        else:
            overall_partial_gsr = 1.0
        
        # Return in AWS format
        return [{
            "user_gsr": user_result.get("gsr", 0.0),
            "user_partial_gsr": user_result.get("partial_gsr", 0.0),
            "system_gsr": system_result.get("gsr", 0.0),
            "system_partial_gsr": system_result.get("partial_gsr", 0.0),
            "overall_gsr": overall_gsr,
            "overall_partial_gsr": overall_partial_gsr,
            "supervisor_gsr": supervisor_gsr,
            "report": combined_report,
        }]


## Step 6: Run Benchmark

In [None]:

benchmark = AWSCollabBenchmark(agent_data=agent_config, tasks=tasks, callbacks=[logger])
results = benchmark.run()


## Step 7: Get final results

In [None]:

def compute_benchmark_metric(results):
    """
    Compute the benchmark summary and mean of each numeric metric across all results.
    Returns a dict with summary and mean metrics for printing.
    """
    if not results:
        return {
            "total_tasks": 0,
            "successful_tasks": 0,
            "success_rate": 0.0,
            "mean_metrics": {},
        }
    total_tasks = len(results)
    # Aggregate metrics from nested 'eval' list in each result
    metric_sums = {}
    metric_counts = {}
    successful_tasks = 0

    for res in results:
        evals = res.get('eval', [])
        # For success rate, use 'overall_gsr' if present in any eval entry
        found_success = False
        for entry in evals:
            for k, v in entry.items():
                if isinstance(v, (int, float)):
                    metric_sums.setdefault(k, 0.0)
                    metric_counts.setdefault(k, 0)
                    metric_sums[k] += v
                    metric_counts[k] += 1
            if not found_success and entry.get('overall_gsr', 0.0) == 1.0:
                found_success = True
        if found_success:
            successful_tasks += 1

    success_rate = successful_tasks / total_tasks if total_tasks > 0 else 0.0
    mean_metrics = {k: (metric_sums[k] / metric_counts[k] if metric_counts[k] else 0.0) for k in metric_sums}
    return {
        "total_tasks": total_tasks,
        "successful_tasks": successful_tasks,
        "success_rate": success_rate,
        "mean_metrics": mean_metrics,
    }

# Compute benchmark summary and mean metrics
summary = compute_benchmark_metric(results)

# Print summary
print("\n--- Benchmark Summary ---")
print(f"Total Tasks: {summary['total_tasks']}")
print(f"Successful Tasks (Overall GSR=1.0): {summary['successful_tasks']}")
print(f"Success Rate: {summary['success_rate']:.2%}")

print("\nMean metrics across all tasks:")
for k, v in summary['mean_metrics'].items():
    print(f"  {k:<20} {v:.4f}")

# Report where the results were saved
print("\n--- Benchmark Complete ---")
print(f"Results saved to: {output_dir}")
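
To inspect saved results later, a minimal sketch for reading a `.jsonl` file back is shown below. It assumes the usual JSON Lines convention of one JSON object per line. The exact fields written by `FileResultLogger` are not shown in this notebook, so the record shape and the file name used here are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical records written to a temporary file; the real file's fields may differ
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "travel_smolagents_demo.jsonl"
    demo_path.write_text(
        '{"eval": [{"overall_gsr": 1.0}]}\n{"eval": [{"overall_gsr": 0.0}]}\n',
        encoding="utf-8",
    )
    records = read_jsonl(demo_path)

passed = sum(1 for r in records if r["eval"][0]["overall_gsr"] == 1.0)
print(f"{passed}/{len(records)} tasks passed")
```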


## Closing Remarks

This notebook demonstrated implementing the AWS Collaboration benchmark using MASEval's orchestration framework. By extending the `Benchmark` base class and implementing six abstract methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`, `run_agents`, `evaluate`), we created a complete multi-agent evaluation pipeline. The framework handled task iteration, trace collection, configuration gathering, and result aggregation automatically, so we could focus purely on domain-specific logic.

**Key patterns demonstrated:**
- **Minimal branching pattern**: Framework-agnostic core logic with small smolagents/langgraph branches only at instantiation
- **LLM-powered simulation**: Realistic tools, user interactions, and evaluation without real APIs or mock data
- **Dual evaluation**: Separate user-side (observable behavior) and system-side (tool correctness) assertions for comprehensive assessment
- **Composition over inheritance**: `GenericTool` wrapped by framework-specific adapters to avoid method signature conflicts
- **Automatic tracing & config**: Registry pattern collects execution traces and configurations from all components for reproducibility
- **Callback extensibility**: `FileResultLogger` demonstrates lifecycle hooks for custom logging, metrics, or external integrations
- **Strict core/interface separation**: `maseval/core` has minimal dependencies; framework adapters live in `maseval/interface` with optional dependencies
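
To illustrate the composition-over-inheritance point from the list above, here is a minimal sketch under stated assumptions. `GenericToolSketch` and both adapter classes are hypothetical and do not reproduce MASEval's `GenericTool` or any real framework's tool interface; they only show why wrapping avoids signature clashes.

```python
from typing import Any, Callable, Dict, List


class GenericToolSketch:
    """Framework-neutral tool: a name plus a callable taking keyword arguments."""

    def __init__(self, name: str, fn: Callable[..., Any]):
        self.name = name
        self.fn = fn

    def execute(self, **kwargs: Any) -> Any:
        return self.fn(**kwargs)


class DictCallAdapter:
    """Adapter for a hypothetical framework that calls tools with a single dict."""

    def __init__(self, tool: GenericToolSketch):
        self._tool = tool  # composition: hold the tool, do not subclass it
        self.name = tool.name

    def __call__(self, payload: Dict[str, Any]) -> Any:
        return self._tool.execute(**payload)


class PositionalCallAdapter:
    """Adapter for a hypothetical framework that calls tools positionally."""

    def __init__(self, tool: GenericToolSketch, arg_names: List[str]):
        self._tool = tool
        self._arg_names = arg_names

    def run(self, *args: Any) -> Any:
        # Map positional arguments back onto the tool's keyword interface
        return self._tool.execute(**dict(zip(self._arg_names, args)))


search = GenericToolSketch("search_flights", lambda origin, dest: f"{origin}->{dest}: 3 flights")
print(DictCallAdapter(search)({"origin": "ZRH", "dest": "LIS"}))
print(PositionalCallAdapter(search, ["origin", "dest"]).run("ZRH", "LIS"))
```

Each adapter exposes the call shape its framework expects, while the wrapped tool keeps a single `execute` signature.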

All of these patterns are **optional**. MASEval is flexible and can be used in many other ways, for example with real tools, real users, or MCP servers. This notebook shows just one possible workflow!

**Contributing:** MASEval is designed to be extended. We welcome contributions, whether adding new framework adapters, implementing domain-specific benchmarks, improving documentation, or suggesting architectural improvements. Open an issue or PR on GitHub, or reach out with feedback on design patterns and use cases you'd like to see supported!
