# 5 A Day Benchmark

[![Open Notebook on GitHub](https://img.shields.io/badge/Open%20Notebook%20on-GitHub-blue?logo=github)](https://github.com/parameterlab/MASEval/blob/main/examples/five_a_day_benchmark/five_a_day_benchmark.ipynb)

This tutorial is available as a Jupyter notebook; clone the repo and run it yourself!

**What This Notebook Demonstrates**

This is a **complete, production-ready example** of evaluating multi-agent systems using the MASEval library. By the end, you'll understand how to:

1. **Build multi-agent systems** with orchestrators and specialists
2. **Create framework-agnostic tools** that work across agent libraries  
3. **Organize systematic evaluation** using Tasks, Environments, and Benchmarks
4. **Implement custom evaluators** (unit tests, LLM judges, pattern matching)
5. **Run reproducible benchmarks** with automatic tool tracing


**Prerequisites**: Familiarity with LLM agents and basic Python programming.

## The 5-A-Day Benchmark

We implement 5 diverse tasks representing real-world agent scenarios:

| Task | Scenario | Tools | Evaluation |
|------|----------|-------|------------|
| 0 | Email & Banking | email, banking | LLM judge, pattern matching |
| 1 | Finance Calculation | stock_price, calculator, family_info | Arithmetic verification |
| 2 | Code Generation | python_executor | Unit tests, complexity analysis |
| 3 | Calendar Scheduling | calendar | Slot matching, logic validation |
| 4 | Hotel Optimization | hotel_search | Ranking, search strategy |

We'll use a **multi-agent architecture**: an orchestrator delegates to specialized agents (banking specialist, email specialist, etc.), demonstrating MASEval's support for complex agent systems.

## Part 1: Understanding Multi-Agent Systems

Before diving into MASEval, let's build the multi-agent system we'll be evaluating.

### 1.1 Imports and Setup

We use **smolagents** as our agent framework (MASEval also supports LangGraph and LlamaIndex).

The imports below include:
- **Standard libraries**: `json`, `os`, `Path` for file handling
- **Helper utilities**: Functions from this example's `utils.py` and `tools.py`  
- **smolagents**: The agent framework we'll use
- **MASEval core**: The evaluation orchestration library

In [None]:
# ruff: noqa E402
# Setup: Set working directory to project root for proper imports
# This must happen FIRST before any other imports
import os
import sys
from pathlib import Path
import json
from typing import Any, Dict, List, Sequence
from rich.console import Console
from rich.panel import Panel

# Determine notebook directory and set working directory to project root
_notebook_dir = Path(__file__).parent if "__file__" in dir() else Path.cwd()
if _notebook_dir.name == "five_a_day_benchmark":
    _project_root = _notebook_dir.parent.parent
    os.chdir(_project_root)
    # Add project root to path so `examples.five_a_day_benchmark.*` imports work
    if str(_project_root) not in sys.path:
        sys.path.insert(0, str(_project_root))
    # Also add the example directory for local imports (utils, tools, evaluators)
    if str(_notebook_dir) not in sys.path:
        sys.path.insert(0, str(_notebook_dir))
    print(f"Working directory set to: {os.getcwd()}")


# Utility functions from this example
# - derive_seed(): Creates reproducible seeds from task_id + agent_id
# - sanitize_name(): Cleans agent names for framework compatibility
from utils import derive_seed, sanitize_name

# Tool collection classes and helpers
# - EmailToolCollection, BankingToolCollection: Pre-built tool groups
# - filter_tool_adapters_by_prefix(): Selects tools by name prefix
# - get_states(): Initializes tool state objects (email inboxes, bank accounts, etc.)
from tools import (
    EmailToolCollection,
    BankingToolCollection,
    CalculatorToolCollection,
    CodeExecutionToolCollection,
    FamilyInfoToolCollection,
    StockPriceToolCollection,
    CalendarToolCollection,
    HotelSearchToolCollection,
    MCPCalendarToolCollection,
    filter_tool_adapters_by_prefix,
    get_states,
)

# smolagents: Our chosen agent framework
from smolagents import ToolCallingAgent, LiteLLMModel, FinalAnswerTool

# MASEval core components
from maseval import Benchmark, Environment, Task, TaskQueue, AgentAdapter, Evaluator, ModelAdapter
from maseval.interface.agents.smolagents import SmolAgentAdapter

# Import evaluators module (dynamically loaded later)
import evaluators


def load_benchmark_data(
    config_type: str = "multi",
    framework: str = "smolagents",
    model_id: str = "gemini-2.5-flash",
    temperature: float = 0.7,
    limit: int | None = None,
    seed: int | None = None,
    task_indices: list[int] | None = None,
) -> tuple[TaskQueue, list[Dict[str, Any]]]:
    """Load tasks and agent configurations.

    Args:
        config_type: 'single' or 'multi' agent configuration
        framework: Agent framework to use
        model_id: Model identifier
        temperature: Model temperature
        limit: Optional limit on number of tasks (None = all 5)
        seed: Random seed for reproducibility
        task_indices: Optional list of task indices to load (e.g., [0, 2, 4])

    Returns:
        Tuple of (TaskQueue, list of agent configs)
    """
    data_dir = Path("examples/five_a_day_benchmark/data")

    with open(data_dir / "tasks.json", "r") as f:
        tasks_raw = json.load(f)
    with open(data_dir / f"{config_type}agent.json", "r") as f:
        configs_raw = json.load(f)

    # Apply limit first (an explicit None check so that limit=0 is not mistaken for "no limit")
    if limit is not None:
        tasks_raw = tasks_raw[:limit]
        configs_raw = configs_raw[:limit]

    # Then apply task_indices filter if specified
    if task_indices is not None:
        tasks_raw = [tasks_raw[i] for i in task_indices if i < len(tasks_raw)]
        configs_raw = [configs_raw[i] for i in task_indices if i < len(configs_raw)]

    tasks_data = []
    configs_data = []

    for task_dict, config in zip(tasks_raw, configs_raw):
        task_id = task_dict["metadata"]["task_id"]
        task_dict["environment_data"]["agent_framework"] = framework

        # Create Task object
        tasks_data.append(
            Task(
                query=task_dict["query"],
                environment_data=task_dict["environment_data"],
                evaluation_data=task_dict["evaluation_data"],
                metadata=task_dict["metadata"],
            )
        )

        # Enrich config with framework and model info
        config["framework"] = framework
        config["model_config"] = {"model_id": model_id, "temperature": temperature}

        # Derive seeds for reproducibility
        if seed is not None:
            for agent_spec in config["agents"]:
                agent_spec["seed"] = derive_seed(seed, task_id, agent_spec["agent_id"])

        configs_data.append(config)

    return TaskQueue(tasks_data), configs_data
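
The `derive_seed()` helper lives in `utils.py` and is not shown in this notebook. As a rough sketch of how such a helper could work (an assumption, not the actual implementation), one option is to hash the base seed together with the task and agent IDs, so each agent gets a stable but distinct seed. The function name `derive_seed_sketch` and the IDs below are placeholders.

```python
import hashlib


def derive_seed_sketch(base_seed: int, task_id: str, agent_id: str) -> int:
    """Hypothetical stand-in for utils.derive_seed(); the real helper may differ."""
    # Combine all inputs into one string so a change to any of them changes the hash
    key = f"{base_seed}:{task_id}:{agent_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep 4 bytes so the seed fits in an unsigned 32-bit integer
    return int.from_bytes(digest[:4], "big")


# Placeholder IDs for illustration only
seed_a = derive_seed_sketch(42, "task_0", "orchestrator")
seed_b = derive_seed_sketch(42, "task_0", "email_specialist")
print(seed_a, seed_b)
```

A cryptographic digest is used here instead of Python's built-in `hash()`, because string hashes are salted per process and would not be stable across runs.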

### 1.2 Model Factory

We need LLMs to power our agents. This factory function creates models using LiteLLM, which provides a unified interface to many providers (OpenAI, Anthropic, Google, etc.).

In [None]:
import litellm

# Tell litellm to drop unsupported params (like 'seed' for Gemini)
litellm.drop_params = True


def get_model(model_id: str, temperature: float = 0.7, seed: int | None = None):
    """Create a model instance compatible with smolagents.

    Args:
        model_id: Gemini model name without the provider prefix (e.g., 'gemini-2.5-flash')
        temperature: Randomness (0.0 = deterministic, 1.0 = creative)
        seed: Random seed for reproducible outputs (ignored for models that don't support it)

    Returns:
        LiteLLMModel configured for smolagents
    """
    return LiteLLMModel(
        model_id=f"gemini/{model_id}",  # Prefix determines provider
        api_key=os.getenv("GOOGLE_API_KEY"),
        temperature=temperature,
        seed=seed,  # Will be dropped by litellm for providers that don't support it
    )


# Test the model factory
model = get_model("gemini-2.5-flash", temperature=0.7, seed=42)
print(f"Created model: {model.model_id}")
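
The factory above hardcodes the `gemini/` prefix and the `GOOGLE_API_KEY` variable. If you want to try other providers, one possible approach is a small routing table that maps a model name to its provider prefix and API key variable. This is a sketch under assumptions: `PROVIDER_TABLE` and `resolve_provider` are hypothetical names, and the prefix and environment variable entries should be checked against the LiteLLM documentation before use.

```python
# Hypothetical routing table: model-name prefix -> (LiteLLM provider prefix, API key env var).
# The entries are assumptions for illustration; verify them against the LiteLLM docs.
PROVIDER_TABLE = {
    "gemini": ("gemini", "GOOGLE_API_KEY"),
    "gpt": ("openai", "OPENAI_API_KEY"),
    "claude": ("anthropic", "ANTHROPIC_API_KEY"),
}


def resolve_provider(model_id: str) -> tuple[str, str]:
    """Return (provider-qualified model id, name of the env var holding the key)."""
    for name_prefix, (provider, env_var) in PROVIDER_TABLE.items():
        if model_id.startswith(name_prefix):
            return f"{provider}/{model_id}", env_var
    raise ValueError(f"No provider configured for model '{model_id}'")


print(resolve_provider("gemini-2.5-flash"))
```

The returned pair could then feed `LiteLLMModel(model_id=..., api_key=os.getenv(...))` in place of the hardcoded values.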

### 1.3 Loading Task Data

We use the `load_benchmark_data()` function to load tasks and agent configurations. Let's load Task 0 (Email & Banking) to examine its structure.

In [None]:
# Load only Task 0 for the demonstration in Part 1
task_data, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
    seed=42,
    task_indices=[0],  # Restrict loading to the first task
)

# Extract the first (and only) task and config
task_0: Task = task_data[0]
config_0: Dict[str, Any] = agent_configs[0]

print("=" * 60)
print("TASK 0: Email & Banking")
print("=" * 60)
print(f"\nUser Query:\n{task_0.query}\n")
print(f"Required Tools: {task_0.environment_data['tools']}")
print(f"\nEvaluators: {task_0.evaluation_data['evaluators']}")

#### Multi-Agent Configuration

For this task, we use **3 agents**:
1. **Orchestrator** - Coordinates specialists  
2. **Banking Specialist** - Handles financial data
3. **Email Specialist** - Manages email operations

In [None]:
print("Multi-Agent Setup:")
print(f"Agent Type: {config_0['agent_type']}")
print(f"Primary Agent: {config_0['primary_agent_id']}\n")

for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f"   Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f"   Role: {agent_spec['agent_instruction'][:80]}...")
    print()
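
The fields read above suggest a config shape with `agent_type`, `primary_agent_id`, and a list of `agents`. A small validation helper can catch malformed configs before any model call is made. The sketch below is hypothetical (it is not part of MASEval), and the sample config uses placeholder values, not the contents of `multiagent.json`.

```python
from typing import Any, Dict, List


def validate_agent_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an agent config (empty list = valid)."""
    problems: List[str] = []
    agents = config.get("agents", [])
    agent_ids = [spec.get("agent_id") for spec in agents]

    if config.get("primary_agent_id") not in agent_ids:
        problems.append("primary_agent_id does not match any agent")
    if len(agent_ids) != len(set(agent_ids)):
        problems.append("agent_id values are not unique")
    for spec in agents:
        for field in ("agent_name", "agent_instruction", "tools"):
            if field not in spec:
                problems.append(f"agent '{spec.get('agent_id')}' is missing '{field}'")
    return problems


# Placeholder config for illustration only
sample_config = {
    "agent_type": "multi",
    "primary_agent_id": "orchestrator",
    "agents": [
        {"agent_id": "orchestrator", "agent_name": "Orchestrator", "agent_instruction": "Coordinate.", "tools": []},
        {"agent_id": "email", "agent_name": "Email Specialist", "agent_instruction": "Handle email."},
    ],
}
print(validate_agent_config(sample_config))
```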

### 1.4 Creating Tools for Specialist Agents

Tools are functions agents can call. We'll create email and banking tools for our specialists.

**Key insight**: Our tools are "framework-agnostic" `BaseTool` objects that convert to any framework (smolagents, LangGraph, LlamaIndex).

In [None]:
# Initialize state objects from task data
# These hold the actual data (emails, bank transactions) that tools operate on
env_data = task_0.environment_data.copy()
states = get_states(env_data["tools"], env_data)

# Create tool collections (tools are examples)
email_tools = EmailToolCollection(states["email_state"])
banking_tools = BankingToolCollection(states["banking_state"])

# Convert to smolagents format (returns tool adapters with tracing support)
email_adapters = [tool.to_smolagents() for tool in email_tools.get_sub_tools()]
banking_adapters = [tool.to_smolagents() for tool in banking_tools.get_sub_tools()]

# Extract raw smolagents Tool objects
all_tool_adapters = email_adapters + banking_adapters
all_tools = [adapter.tool for adapter in all_tool_adapters]

print(f"Created {len(all_tools)} tools:")
for tool in all_tools:
    print(f"  - {tool.name}: {tool.description[:60]}...")
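
To illustrate the idea behind framework-agnostic tools with tracing adapters, here is a minimal sketch that does not depend on MASEval or smolagents. The class names `SketchTool` and `TracingAdapter` are hypothetical; the real `BaseTool` and `to_smolagents()` conversion will differ in detail.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class TracingAdapter:
    """Wraps a tool function and records every invocation."""

    tool: Callable[..., Any]
    name: str
    invocations: List[Dict[str, Any]] = field(default_factory=list)

    def __call__(self, **kwargs: Any) -> Any:
        result = self.tool(**kwargs)
        # Store inputs and output so an evaluator can inspect them later
        self.invocations.append({"inputs": kwargs, "output": result})
        return result


@dataclass
class SketchTool:
    """Framework-neutral tool definition: a name, a description, and a function."""

    name: str
    description: str
    func: Callable[..., Any]

    def to_adapter(self) -> TracingAdapter:
        # A real conversion would also build the framework's own tool object
        return TracingAdapter(tool=self.func, name=self.name)


add_tool = SketchTool("add", "Add two numbers.", lambda a, b: a + b)
adapter = add_tool.to_adapter()
adapter(a=2, b=3)
print(adapter.invocations)
```

Keeping the tool definition separate from the adapter is what lets one definition serve several frameworks: only the conversion step changes.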

### 1.5 Building the Multi-Agent System

Now we build our 3 agents:
- **Specialist agents** get tools + `FinalAnswerTool()` (to return results)
- **Orchestrator** gets specialists as `managed_agents` (can delegate to them)

In [None]:
# Build specialist agents
def build_agents(agent_data: Dict[str, Any], environment: Environment) -> tuple[list[ToolCallingAgent], Dict[str, ToolCallingAgent]]:
    """Create multi-agent system with orchestrator and specialists."""
    model_id = agent_data["model_config"]["model_id"]
    temperature = agent_data["model_config"]["temperature"]

    primary_agent_id = agent_data["primary_agent_id"]
    agents_specs = agent_data["agents"]
    all_tool_adapters = environment.get_tools()  # Dict mapping tool name -> tool adapter

    # Build specialists first
    specialist_agents = []
    for agent_spec in agents_specs:
        if agent_spec["agent_id"] == primary_agent_id:
            continue

        seed = agent_spec.get("seed")
        model = get_model(model_id, temperature, seed)
        spec_tool_adapters = filter_tool_adapters_by_prefix(all_tool_adapters, agent_spec["tools"])
        spec_tools = [adapter.tool for adapter in spec_tool_adapters.values()]
        spec_tools.append(FinalAnswerTool())

        specialist = ToolCallingAgent(
            model=model,
            tools=spec_tools,
            name=sanitize_name(agent_spec["agent_name"]),
            description=agent_spec["agent_instruction"],
            instructions=agent_spec["agent_instruction"],
            verbosity_level=0,
        )
        specialist_agents.append(specialist)

    # Build orchestrator
    primary_spec = next(a for a in agents_specs if a["agent_id"] == primary_agent_id)
    primary_seed = primary_spec.get("seed")
    primary_model = get_model(model_id, temperature, primary_seed)

    orchestrator = ToolCallingAgent(
        model=primary_model,
        tools=[FinalAnswerTool()],
        managed_agents=specialist_agents if specialist_agents else None,
        name=sanitize_name(primary_spec["agent_name"]),
        instructions=primary_spec["agent_instruction"],
        verbosity_level=0,
    )

    return [orchestrator], {agent.name: agent for agent in specialist_agents}
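
`build_agents()` relies on `filter_tool_adapters_by_prefix()` from `tools.py` to give each specialist only its own tools. Its source is not shown here; a plausible sketch, assuming tool names start with the collection name listed in the agent's `tools` field, might look like this. The function name and tool names below are placeholders.

```python
from typing import Any, Dict, List


def filter_by_prefix_sketch(adapters: Dict[str, Any], prefixes: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for filter_tool_adapters_by_prefix(); the real helper may differ."""
    # Keep an adapter when its name starts with any of the requested prefixes
    return {name: adapter for name, adapter in adapters.items() if any(name.startswith(p) for p in prefixes)}


# Placeholder tool names; the real names come from each tool collection
all_adapters = {"email_send": object(), "email_read": object(), "banking_get_balance": object()}
print(sorted(filter_by_prefix_sketch(all_adapters, ["email"])))
```

With an empty prefix list the result is empty, which matches an orchestrator that only delegates.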

## Part 2: Organizing Evaluation with MASEval

We've built a multi-agent system. Now let's see how **MASEval** helps us evaluate it systematically across multiple tasks.

MASEval provides three key abstractions:

1. **Task** - A single evaluation scenario (query + environment + evaluation criteria)
2. **Environment** - Manages tools and state for a task
3. **Benchmark** - Orchestrates running agents on tasks and collecting results

### 2.1 The Environment Class

An `Environment` creates and manages tools for each task. It also enables automatic tool tracing.

**Key methods**:
- `setup_state()`: Initialize tool state (email inboxes, bank accounts, etc.)
- `create_tools()`: Create and convert tools to framework-specific format

In [None]:
class FiveADayEnvironment(Environment):
    """Environment that creates framework-specific tools from task data."""

    def __init__(self, task_data: Dict[str, Any], callbacks: List | None = None):
        super().__init__(task_data, callbacks)

    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from task data."""
        env_data = task_data["environment_data"].copy()
        tool_names = env_data.get("tools", [])

        # Create state objects (e.g., email inboxes, bank accounts)
        states = get_states(tool_names, env_data)
        env_data.update(states)

        return env_data

    def create_tools(self) -> Dict[str, Any]:
        """Create and convert tools to framework-specific format, keyed by name."""
        tools_dict: Dict[str, Any] = {}

        # Map tool names to their collection classes
        tool_mapping = {
            "email": (EmailToolCollection, lambda: (self.state["email_state"],)),
            "banking": (BankingToolCollection, lambda: (self.state["banking_state"],)),
            "calculator": (CalculatorToolCollection, lambda: ()),
            "python_executor": (CodeExecutionToolCollection, lambda: (self.state["python_executor_state"],)),
            "family_info": (FamilyInfoToolCollection, lambda: (self.state["family_info"],)),
            "stock_price": (StockPriceToolCollection, lambda: (self.state["stock_price_lookup"],)),
            "calendar": (CalendarToolCollection, lambda: (self.state["calendar_state"],)),
            "hotel_search": (HotelSearchToolCollection, lambda: (self.state["hotel_search_state"],)),
            "my_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["my_calendar_mcp_state"],)),
            "other_calendar_mcp": (MCPCalendarToolCollection, lambda: (self.state["other_calendar_mcp_state"],)),
        }

        for tool_name in self.state["tools"]:
            if tool_name in tool_mapping:
                ToolClass, get_init_args = tool_mapping[tool_name]
                tool_instance = ToolClass(*get_init_args())

                # Get base tools and convert to framework format
                for base_tool in tool_instance.get_sub_tools():
                    framework_tool = base_tool.to_smolagents()
                    tool_key = getattr(base_tool, "name", None) or str(type(base_tool).__name__)
                    tools_dict[tool_key] = framework_tool

        return tools_dict
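
Note that `tool_mapping` pairs each class with a lambda instead of the constructor arguments themselves. A likely reason is that state lookups then happen only for the tools a task requests, so a missing state key for an unused tool cannot raise. The standalone sketch below demonstrates that behavior with placeholder classes; it is an illustration, not MASEval code.

```python
from typing import Any, Callable, Dict, Tuple

# Only the calculator is requested, so no email state exists
state: Dict[str, Any] = {"tools": ["calculator"]}


class Calculator:
    """Placeholder tool class that needs no state."""


class Email:
    """Placeholder tool class that needs an inbox."""

    def __init__(self, inbox: list):
        self.inbox = inbox


# Each entry defers its state lookup behind a zero-argument lambda
registry: Dict[str, Tuple[type, Callable[[], tuple]]] = {
    "calculator": (Calculator, lambda: ()),
    "email": (Email, lambda: (state["email_state"],)),  # would raise KeyError if called now
}

created = {}
for tool_name in state["tools"]:
    tool_class, get_init_args = registry[tool_name]
    created[tool_name] = tool_class(*get_init_args())

print(list(created))
```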

### 2.2 Check Agents

Let's verify our agent setup by building agents for the first task and inspecting their configuration.

First, we check the agent config.

In [None]:
print(f"{config_0['task_description']}")

for i, agent_spec in enumerate(config_0["agents"], 1):
    print(f"{i}. {agent_spec['agent_name']} (ID: {agent_spec['agent_id']})")
    print(f"   Tools: {agent_spec['tools'] if agent_spec['tools'] else 'None (delegates only)'}")
    print(f"   Role: {agent_spec['agent_instruction'][:80]}...")
    print()

Now we build the agents for the first task.

In [None]:
# Build the agents for task 0
# Note: model_config is already set by load_benchmark_data()

# Create environment from task data
environment_0 = FiveADayEnvironment(
    {
        "environment_data": task_0.environment_data,
        "query": task_0.query,
        "evaluation_data": task_0.evaluation_data,
        "metadata": task_0.metadata,
    }
)

# Build agents using the build_agents function
agents_to_run, agents_to_monitor = build_agents(config_0, environment_0)

print(f"\nBuilt Agents for Task: {task_0.metadata['task_id']}")
print(f"{'=' * 60}")
print(f"\nAgents to run: {[agent.name for agent in agents_to_run]}")
print(f"Agents to monitor: {list(agents_to_monitor.keys())}")

# Print details for each agent
for agent in agents_to_run:
    print(f"\n  Agent: {agent.name}")
    # smolagents stores tools as a dict with string keys
    print(f"    Tools: {list(agent.tools.keys())}")
    if hasattr(agent, "managed_agents") and agent.managed_agents:
        # managed_agents is also a dict with string keys
        print(f"    Managed agents: {list(agent.managed_agents.keys())}")
        for agent_name, managed in agent.managed_agents.items():
            print(f"      - {managed.name}: {list(managed.tools.keys())}")

print("\nAll agents built successfully.")
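
The names printed above passed through `sanitize_name()`, which this example uses to make display names such as "Banking Specialist" acceptable to the framework. Its source is in `utils.py`; the sketch below is only a guess at the kind of cleaning involved, assuming the framework expects names that are valid Python identifiers.

```python
import keyword
import re


def sanitize_name_sketch(name: str) -> str:
    """Hypothetical stand-in for utils.sanitize_name(); the real helper may differ."""
    # Replace every run of characters that cannot appear in an identifier
    cleaned = re.sub(r"\W+", "_", name.strip()).strip("_").lower()
    # Identifiers cannot be empty, start with a digit, or be a reserved keyword
    if not cleaned or cleaned[0].isdigit() or keyword.iskeyword(cleaned):
        cleaned = f"agent_{cleaned}"
    return cleaned


print(sanitize_name_sketch("Banking Specialist"))  # banking_specialist
print(sanitize_name_sketch("5-A-Day Orchestrator"))  # agent_5_a_day_orchestrator
```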

### 2.3 The Benchmark Class

A `Benchmark` orchestrates the entire evaluation process. It implements 5 key methods:

1. **setup_environment()** - Create tools for a task
2. **setup_agents()** - Build agents with appropriate tools
3. **setup_evaluators()** - Create task-specific evaluators
4. **run_agents()** - Execute agents and collect responses
5. **evaluate()** - Run evaluators on agent outputs

In [None]:
class FiveADayBenchmark(Benchmark):
    """5-A-Day benchmark with multi-agent support."""

    def setup_environment(self, agent_data: Dict[str, Any], task: Task) -> Environment:
        """Create environment from task data."""
        task_data = {
            "environment_data": task.environment_data,
            "query": task.query,
            "evaluation_data": task.evaluation_data,
            "metadata": task.metadata,
        }

        environment = FiveADayEnvironment(task_data)

        # Register all tools for tracing
        for tool_name, tool_adapter in environment.get_tools().items():
            self.register("tools", tool_name, tool_adapter)

        return environment

    def setup_agents(
        self, agent_data: Dict[str, Any], environment: Environment, task: Task, user=None
    ) -> tuple[list[SmolAgentAdapter], Dict[str, SmolAgentAdapter]]:
        """Create multi-agent system with orchestrator and specialists."""
        agents_to_run, agents_to_monitor = build_agents(agent_data, environment)

        # Create adapters for the primary agent(s) to run
        adapters_to_run = [SmolAgentAdapter(agent, agent.name) for agent in agents_to_run]

        # This ensures all agent traces are collected by the benchmark
        all_agents = {agent.name: agent for agent in agents_to_run} | agents_to_monitor
        adapters_to_monitor = {name: SmolAgentAdapter(agent, name) for name, agent in all_agents.items()}
        return adapters_to_run, adapters_to_monitor

    def setup_evaluators(self, environment, task, agents, user) -> Sequence[Evaluator]:
        """Create evaluators based on task's evaluation criteria."""
        if not task.evaluation_data["evaluators"]:
            return []

        evaluator_instances = []
        for name in task.evaluation_data["evaluators"]:
            evaluator_class = getattr(evaluators, name)
            evaluator_instances.append(evaluator_class(task, environment, user))

        return evaluator_instances

    def run_agents(self, agents: Sequence[AgentAdapter], task: Task, environment: Environment, query: str) -> Sequence[Any]:
        """Execute agents and return their final answers."""
        answers = [agent.run(query) for agent in agents]
        return answers

    def get_model_adapter(self, model_id: str, **kwargs) -> ModelAdapter:
        """Return a model adapter for benchmark components that need LLM access.

        This benchmark doesn't use simulated tools, user simulators, or LLM judges,
        so this method is not called during execution.
        """
        raise NotImplementedError("This benchmark doesn't use model adapters for tools/users/evaluators.")

    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any],
    ) -> list[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            filtered_traces = evaluator.filter_traces(traces)
            results.append(evaluator(filtered_traces, final_answer))
        return results
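
The `evaluate()` method makes two calls on each evaluator: `filter_traces(traces)` and then `evaluator(filtered_traces, final_answer)`. The sketch below shows a pattern-matching evaluator with that same two-step shape. It is hypothetical: it does not subclass `maseval.Evaluator`, its constructor is simplified compared with the `(task, environment, user)` signature used in `setup_evaluators()`, and the trace layout is a placeholder.

```python
from typing import Any, Dict


class KeywordEvaluatorSketch:
    """Hypothetical pattern-matching evaluator; it does not subclass maseval.Evaluator."""

    def __init__(self, required_keywords: list[str], tool_name: str):
        self.required_keywords = required_keywords
        self.tool_name = tool_name

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        # Keep only the trace of the tool this evaluator cares about
        return {self.tool_name: traces.get("tools", {}).get(self.tool_name, {})}

    def __call__(self, traces: Dict[str, Any], final_answer: Any) -> Dict[str, Any]:
        answer = str(final_answer).lower()
        missing = [kw for kw in self.required_keywords if kw.lower() not in answer]
        tool_used = bool(traces.get(self.tool_name))
        return {"keywords_found": not missing, "missing_keywords": missing, "tool_used": tool_used}


# Placeholder trace layout and answer for illustration only
evaluator = KeywordEvaluatorSketch(["deposit", "received"], tool_name="banking")
traces = {"tools": {"banking": {"invocations": [{"inputs": {}, "output": "ok"}]}}}
filtered = evaluator.filter_traces(traces)
print(evaluator(filtered, "The deposit was received on Monday."))
```

Filtering first keeps each evaluator focused on the part of the trace it needs, which also makes evaluators easier to test in isolation.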

### 2.4 Loading All Tasks

Now let's load all 5 tasks to run the full benchmark. We reuse `load_benchmark_data()` without specifying `task_indices` to get all tasks.

In [None]:
# Reload all 5 tasks for the benchmark
tasks, agent_configs = load_benchmark_data(
    config_type="multi",
    framework="smolagents",
    model_id="gemini-2.5-flash",
    temperature=0.7,
    seed=42,
    # No task_indices = load all tasks
)

print(f"Loaded {len(tasks)} tasks:")
for i, task in enumerate(tasks):
    print(f"  {i}. {task.metadata['task_id']}: {task.metadata['description']}")

### 2.5 Running the Benchmark

Now we can run the complete benchmark! MASEval will:
1. Create environments for each task
2. Build multi-agent systems with appropriate tools
3. Run agents and collect traces (tool calls, messages, etc.)
4. Evaluate results using task-specific evaluators
5. Log everything to a file

In [None]:
# Create and run benchmark (will take approx. 2 min)
benchmark = FiveADayBenchmark(
    agent_data=agent_configs,
    fail_on_setup_error=True,
    fail_on_task_error=True,
    fail_on_evaluation_error=True,
)

results = benchmark.run(tasks=tasks)
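
To make the five steps listed above concrete, here is a heavily simplified sketch of a per-task loop. It is not MASEval's implementation: `run_benchmark_sketch` and the toy hooks are hypothetical, and the real `Benchmark.run()` also handles tracing, error policies, and logging.

```python
from typing import Any, Callable, Dict, List


def run_benchmark_sketch(tasks: List[Dict[str, Any]], hooks: Dict[str, Callable[..., Any]]) -> List[Dict[str, Any]]:
    """Hypothetical per-task loop; MASEval's real Benchmark.run() handles far more."""
    results = []
    for task in tasks:
        environment = hooks["setup_environment"](task)
        agents = hooks["setup_agents"](environment, task)
        evaluators = hooks["setup_evaluators"](environment, task)
        answers = hooks["run_agents"](agents, task["query"])
        scores = [evaluate(answers) for evaluate in evaluators]
        results.append({"task_id": task["task_id"], "eval": scores})
    return results


# Toy hooks standing in for the Benchmark methods defined earlier
hooks = {
    "setup_environment": lambda task: {"tools": task["tools"]},
    "setup_agents": lambda env, task: [lambda query: query.upper()],
    "setup_evaluators": lambda env, task: [lambda answers: {"non_empty": all(answers)}],
    "run_agents": lambda agents, query: [agent(query) for agent in agents],
}
tasks = [{"task_id": "demo", "query": "hello", "tools": []}]
print(run_benchmark_sketch(tasks, hooks))
```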

### 2.6 Examining Results

Let's look at the results for two tasks to understand the output structure.

In [None]:
console = Console()

for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    traces = task["traces"]
    agent_traces = traces["agents"]
    print(f"Traces available for agents: {list(agent_traces.keys())}")
    orchestrator_name = list(agent_traces.keys())[0]
    print(f"Last 5 messages for '{orchestrator_name}'")
    messages = agent_traces[orchestrator_name]["messages"]
    for msg in messages[-5:]:
        role = msg.get("role", "unknown")
        # Guard against missing, empty, or string-valued content
        raw_content = msg.get("content") or ""
        if isinstance(raw_content, str):
            content = raw_content
        else:
            content = raw_content[0].get("text", "")
        panel = Panel.fit(
            content,
            title=f" {role} ",
            title_align="left",
        )
        console.print(panel)

In [None]:
# print results for first two tasks
for task in results[:2]:
    task_id = task["task_id"]
    print("=" * 60)
    print(f"Results for Task ID: {task_id}")
    print("=" * 60)
    eval_results = task["eval"]
    for evals in eval_results:
        for k, v in evals.items():
            print(f"{k:<35} {v}")
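
Printing every metric is useful for inspection, but a per-task summary is often easier to compare across runs. The sketch below assumes each result has the `task_id` and `eval` keys used above and that some metrics are booleans; the metric names and values are placeholders, since the real ones depend on the evaluators.

```python
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of boolean metrics that are True, per task."""
    summary: Dict[str, float] = {}
    for task_result in results:
        flags = [v for evals in task_result["eval"] for v in evals.values() if isinstance(v, bool)]
        # Tasks without boolean metrics get 0.0 rather than a division error
        summary[task_result["task_id"]] = sum(flags) / len(flags) if flags else 0.0
    return summary


# Placeholder results; real metric names depend on the evaluators
fake_results = [
    {"task_id": "task_a", "eval": [{"passed": True, "score": 0.8}, {"used_tool": False}]},
    {"task_id": "task_b", "eval": [{"passed": True}]},
]
print(summarize_results(fake_results))
```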

## Summary and Key Takeaways

### What You've Learned

You now understand how to build production agent benchmarks with MASEval:

#### Part 1: Multi-Agent Systems
- **Model creation** with LiteLLM for framework compatibility
- **Framework-agnostic tools** that convert to any agent library
- **Multi-agent architecture** with orchestrators and specialists
- **Tool state management** for realistic task environments

#### Part 2: MASEval Framework
- **Task abstraction** packages queries, environments, and evaluation criteria
- **Environment class** creates tools and enables automatic tracing
- **Benchmark class** orchestrates evaluation across multiple tasks
- **Custom evaluators** for diverse evaluation approaches (unit tests, LLM judges, etc.)
- **Automatic tracing** captures all tool calls and agent interactions

### Key Design Patterns

1. **Separation of Concerns**:
   - Tasks define WHAT to evaluate
   - Environments provide the world in which agents act (tools and state)
   - Benchmarks orchestrate WHEN and WHERE
   - Evaluators determine SUCCESS

2. **Framework Agnostic**:
   - Same tasks work with smolagents, LangGraph, LlamaIndex
   - Tools convert automatically to framework-specific formats
   - Easy to compare frameworks on identical tasks

3. **Reproducibility**:
   - Seeds derived systematically from task_id + agent_id
   - All parameters logged automatically
   - Results saved in structured JSONL format
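
Since results are saved as JSONL (one JSON object per line), they can be loaded back with the standard library alone. The sketch below writes a tiny placeholder file and reads it again; the file name and record fields are assumptions, so check the log file your run actually produces.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny placeholder log, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "results.jsonl"
    log_path.write_text('{"task_id": "task_a", "eval": []}\n{"task_id": "task_b", "eval": []}\n', encoding="utf-8")
    records = read_jsonl(log_path)

print([r["task_id"] for r in records])
```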

## Next Steps

1. **Explore evaluators** — Check `evaluators/` for different evaluation strategies
2. **Try single-agent mode** — Load `data/singleagent.json` to compare architectures
3. **Run from CLI** — Use `five_a_day_benchmark.py` for scripted runs with different frameworks
4. **Add custom tasks** — Create your own task definitions and evaluators
5. **Compare frameworks** — Run the same benchmark with LangGraph or LlamaIndex

## Resources

- [MASEval Documentation](https://github.com/parameterlab/MASEval)
- Example code: [`examples/five_a_day_benchmark/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark)
- Example data: [`examples/five_a_day_benchmark/data/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/data)
- Tool implementations: [`examples/five_a_day_benchmark/tools/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/tools)
- Evaluator implementations: [`examples/five_a_day_benchmark/evaluators/`](https://github.com/parameterlab/MASEval/tree/main/examples/five_a_day_benchmark/evaluators)