# Tiny Tutorial

[![Open Notebook on GitHub](https://img.shields.io/badge/Open%20Notebook%20on-GitHub-blue?logo=github)](https://github.com/parameterlab/MASEval/blob/main/examples/introduction/tutorial.ipynb)

This tutorial is also available as a Jupyter notebook: clone the repo and run it yourself!

## What You'll Learn

- **Build your first agent** — Create tools and agents with smolagents
- **Run a minimal benchmark** — One task, one agent, end-to-end
- **Understand the core abstractions** — Tasks, Environments, Evaluators working together


This tutorial first uses [`smolagents`](https://huggingface.co/docs/smolagents/en/index) to introduce agents, then builds a minimal single-task benchmark with MASEval.

## Setup

First, let's install the required dependencies and import the libraries we need.

In [None]:
# Install dependencies (uncomment if needed)
# !pip install "maseval[smolagents]"
# !pip install litellm

import os
import json
from pathlib import Path
from typing import Any, Dict, List, Optional

# Set your API key
# os.environ["GOOGLE_API_KEY"] = "your-api-key-here"
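
A missing API key otherwise only surfaces later, when the model is first called. The sketch below fails fast instead; `require_env` is a hypothetical helper written for this tutorial, not part of smolagents, LiteLLM, or MASEval.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; assign os.environ[{name!r}] before running the model cells."
        )
    return value
```

Calling `require_env("GOOGLE_API_KEY")` right after the setup cell would raise immediately if the key is missing.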

## Part 1: Agent Initialization with smolagents

Let's start by building an agent using smolagents. We'll create a simple agent that can handle email and banking tasks.

### 1.1 Define Custom Tools

For this example, we'll create simplified versions of email and banking tools. In the full benchmark, these tools are more sophisticated and stateful.

In [None]:
from smolagents import Tool

class SimpleBankingTool(Tool):
    """A simple tool to retrieve banking transactions."""
    
    name = "get_transactions"
    description = "Retrieve recent banking transactions. Returns a list of transactions with date, description, amount, and type."
    inputs = {}
    output_type = "string"
    
    def __init__(self, transactions: List[Dict], **kwargs):
        super().__init__(**kwargs)
        self.transactions = transactions
    
    def forward(self) -> str:
        """Return all transactions as formatted string."""
        if not self.transactions:
            return "No transactions found."
        
        result = "Recent Transactions:\n"
        for txn in self.transactions:
            result += f"- {txn['date']}: {txn['description']} - ${txn['amount']} ({txn['type']})\n"
        return result


class SimpleEmailTool(Tool):
    """A simple tool to send emails."""
    
    name = "send_email"
    description = "Send an email to a recipient. Provide the recipient email, subject, and body text."
    inputs = {
        "to": {"type": "string", "description": "Recipient email address"},
        "subject": {"type": "string", "description": "Email subject line"},
        "body": {"type": "string", "description": "Email body text"}
    }
    output_type = "string"
    
    def __init__(self, sent_emails: List, **kwargs):
        super().__init__(**kwargs)
        self.sent_emails = sent_emails  # Store sent emails for tracking
    
    def forward(self, to: str, subject: str, body: str) -> str:
        """Send an email and store it."""
        email = {"to": to, "subject": subject, "body": body}
        self.sent_emails.append(email)
        return f"Email sent successfully to {to}"

print("Tools defined successfully!")
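
The formatting logic in `SimpleBankingTool.forward` can be checked without smolagents or an API key. The sketch below copies that logic into a plain function (`format_transactions` is a hypothetical name used only here) to show the string the agent would receive.

```python
from typing import Dict, List


def format_transactions(transactions: List[Dict]) -> str:
    """Plain-function mirror of SimpleBankingTool.forward (illustrative only)."""
    if not transactions:
        return "No transactions found."
    lines = ["Recent Transactions:"]
    for txn in transactions:
        lines.append(
            f"- {txn['date']}: {txn['description']} - ${txn['amount']} ({txn['type']})"
        )
    return "\n".join(lines) + "\n"


sample = [
    {
        "date": "2025-11-15",
        "description": "Tenant Deposit - Sarah Johnson",
        "amount": 2000,
        "type": "deposit",
    }
]
print(format_transactions(sample))
```

Note that a negative amount such as `-450` is rendered as `$-450` by this format string, which the model has to interpret as an expense.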

### 1.2 Create Tool Instances with Data

Now let's instantiate our tools with the actual data from the benchmark task.

In [None]:
# Sample banking data from the benchmark
banking_transactions = [
    {
        "date": "2025-11-15",
        "description": "Tenant Deposit - Sarah Johnson",
        "amount": 2000,
        "type": "deposit"
    },
    {
        "date": "2025-11-17",
        "description": "Rent Payment - Sarah Johnson",
        "amount": 1500,
        "type": "deposit"
    },
    {
        "date": "2025-11-16",
        "description": "Property Maintenance",
        "amount": -450,
        "type": "expense"
    }
]

# List to track sent emails
sent_emails = []

# Create tool instances
banking_tool = SimpleBankingTool(transactions=banking_transactions)
email_tool = SimpleEmailTool(sent_emails=sent_emails)

print(f"Created {len([banking_tool, email_tool])} tools")

### 1.3 Initialize the Agent

Now we'll create a smolagents agent with our custom tools and give it clear instructions.

In [None]:
from smolagents import ToolCallingAgent, LiteLLMModel

# Initialize the model
model = LiteLLMModel(
    model_id="gemini/gemini-2.5-flash",
    api_key=os.getenv("GOOGLE_API_KEY"),
    temperature=0.7
)

# Create the agent with tools and instructions
agent = ToolCallingAgent(
    tools=[banking_tool, email_tool],
    model=model,
    instructions="""You are a helpful assistant that helps users with email and banking tasks.
Use the available tools to retrieve information and take appropriate actions.
Be professional and thorough in your responses."""
)

print("Agent initialized successfully!")

### 1.4 Test the Agent

Let's test our agent with the actual task query from the benchmark.

In [None]:
# The task query from the benchmark
query = """Sarah Johnson emailed me to confirm that I received her payment for the deposit 
and first month's rent. Please check my transactions and send an email reply accordingly."""

# Run the agent
response = agent.run(query)

print("\n" + "="*60)
print("AGENT RESPONSE:")
print("="*60)
print(response)
print("="*60)

### 1.5 Inspect What Happened

Let's check if the agent sent an email and what it contained.

In [None]:
print("Emails sent by the agent:")
print("\n")

if sent_emails:
    for i, email in enumerate(sent_emails, 1):
        print(f"Email #{i}")
        print(f"To: {email['to']}")
        print(f"Subject: {email['subject']}")
        print(f"Body:\n{email['body']}")
        print("\n" + "-"*60 + "\n")
else:
    print("No emails were sent.")

## Part 2: Evaluating Agents with MASEval

Now that we understand how the agent works, let's see how MASEval helps us systematically evaluate agent performance across multiple tasks.

MASEval provides:
- **Tasks**: Define queries, environments, and evaluation criteria
- **Environments**: Manage tool state and provide context
- **Evaluators**: Measure agent performance using various metrics
- **Benchmarks**: Orchestrate execution and collect results

### 2.1 Import MASEval Components

Let's import the core MASEval components we'll need.

In [None]:
from maseval import Benchmark, Environment, Evaluator, Task, TaskCollection
from maseval.interface.agents.smolagents import SmolAgentAdapter

print("MASEval components imported successfully!")

### 2.2 Load Task Data

The Five-A-Day benchmark uses JSON files to define tasks. Let's load the first task (Email & Banking).

In [None]:
# Load task data from JSON
data_dir = Path("data")

with open(data_dir / "tasks.json", "r") as f:
    tasks_data = json.load(f)

# Get the first task (Email & Banking)
task_data = tasks_data[0]

print("Task Query:")
print(task_data["query"])
print("\nTools Required:")
print(task_data["environment_data"]["tools"])
print("\nEvaluators:")
print(task_data["evaluation_data"]["evaluators"])
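
If you do not have the repository's `data/tasks.json` at hand, the sketch below shows what one entry might look like. The field names are inferred from how this notebook reads the data; the values, the tool names, and the evaluator names are assumptions made for illustration and may differ from the real file.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical task entry: keys follow the accesses made in this notebook,
# all values are invented for illustration.
example_task = {
    "query": "Please check my transactions and reply to Sarah Johnson.",
    "environment_data": {
        "tools": ["banking", "email"],
        "banking": {
            "bank_transactions": [
                {
                    "date": "2025-11-15",
                    "description": "Tenant Deposit - Sarah Johnson",
                    "amount": 2000,
                    "type": "deposit",
                }
            ]
        },
    },
    "evaluation_data": {
        "evaluators": ["FinancialAccuracyEvaluator", "EmailSentEvaluator"],
        "expected_deposit_amount": 2000,
        "expected_rent_amount": 1500,
    },
    "metadata": {
        "task_id": "email_banking",
        "complexity": "simple",
        "skills_tested": ["tool_use"],
    },
}

# Round-trip through a temporary tasks.json to mimic the loading cell.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tasks.json"
    path.write_text(json.dumps([example_task], indent=2))
    with open(path, "r") as f:
        loaded = json.load(f)

first = loaded[0]
print(sorted(first.keys()))
```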

### 2.3 Create a Task Object

MASEval uses `Task` objects to encapsulate all information about a benchmark task.

In [None]:
# Create a Task instance
task = Task(
    query=task_data["query"],
    environment_data=task_data["environment_data"],
    evaluation_data=task_data["evaluation_data"],
    metadata=task_data["metadata"]
)

print(f"Created task: {task.metadata['task_id']}")
print(f"Complexity: {task.metadata['complexity']}")
print(f"Skills tested: {', '.join(task.metadata['skills_tested'])}")

### 2.4 Define a Custom Environment

The `Environment` class manages tool state and provides tools to the agent. Here's a simplified version of the FiveADayEnvironment.

In [None]:
class SimpleEnvironment(Environment):
    """Simplified environment for the Email & Banking task."""
    
    def setup_state(self, task_data: Dict[str, Any]) -> Dict[str, Any]:
        """Initialize environment state from task data."""
        return task_data.copy()
    
    def create_tools(self) -> list:
        """Create tool instances from environment data."""
        # Get banking transactions from environment data
        transactions = self.state.get("banking", {}).get("bank_transactions", [])
        
        # Create tool instances - track sent emails for evaluation
        self.sent_emails: List[Dict] = []
        banking_tool = SimpleBankingTool(transactions=transactions)
        email_tool = SimpleEmailTool(sent_emails=self.sent_emails)
        
        return [banking_tool, email_tool]

print("Environment class defined!")
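
The environment and the email tool share one list object, which is what later lets the evaluators read `environment.sent_emails`. Here is a minimal sketch of that pattern with plain stand-in classes; `FakeEmailTool` and `FakeEnvironment` are hypothetical and not part of smolagents or MASEval.

```python
from typing import Dict, List


class FakeEmailTool:
    """Stand-in for SimpleEmailTool: appends to a list it was handed."""

    def __init__(self, sent_emails: List[Dict]):
        self.sent_emails = sent_emails  # same list object, not a copy

    def forward(self, to: str, subject: str, body: str) -> str:
        self.sent_emails.append({"to": to, "subject": subject, "body": body})
        return f"Email sent successfully to {to}"


class FakeEnvironment:
    """Stand-in for SimpleEnvironment: owns the list and passes it to the tool."""

    def __init__(self):
        self.sent_emails: List[Dict] = []
        self.email_tool = FakeEmailTool(self.sent_emails)


env = FakeEnvironment()
env.email_tool.forward("sarah@example.com", "Payment received", "Thanks!")
print(len(env.sent_emails))  # the environment sees the tool's write
```

Because the tool mutates the list in place rather than rebinding it, no extra synchronisation between tool and environment is needed.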

### 2.5 Create Custom Evaluators

Evaluators measure agent performance. Let's create two evaluators:
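1. **FinancialAccuracyEvaluator**: Reads the expected deposit and rent amounts from the task and reports them; in this simplified version the score depends only on whether an email was sent
2. **EmailSentEvaluator**: Checks if the agent sent an email and reports its recipient and subject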

In [None]:
class FinancialAccuracyEvaluator(Evaluator):
    """Reports the expected payment amounts; scores on whether an email was sent."""
    
    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment
    
    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces to check tool usage."""
        return traces.get("environment", {})
    
    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Report the expected amounts and score on whether an email was sent."""
        # Expected values from task evaluation data
        expected_deposit = self.task.evaluation_data["expected_deposit_amount"]
        expected_rent = self.task.evaluation_data["expected_rent_amount"]
        
        # Check if emails were sent by looking at environment state
        sent_emails = getattr(self.environment, 'sent_emails', [])
        email_sent = len(sent_emails) > 0
        
        return {
            "evaluator": "FinancialAccuracyEvaluator",
            "email_sent": email_sent,
            "emails_count": len(sent_emails),
            "expected_deposit": expected_deposit,
            "expected_rent": expected_rent,
            "score": 1.0 if email_sent else 0.0,
            "message": "Agent sent confirmation email" if email_sent else "No email was sent"
        }


class EmailSentEvaluator(Evaluator):
    """Evaluates if the agent sent an email and reports its recipient and subject."""
    
    def __init__(self, task: Task, environment: Environment, user=None):
        """Initialize with task, environment, and optional user."""
        super().__init__(task, environment, user)
        self.task = task
        self.environment = environment
    
    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        """Filter to environment traces."""
        return traces.get("environment", {})
    
    def __call__(self, traces: Dict[str, Any], final_answer: Optional[str] = None) -> Dict[str, Any]:
        """Check if an email was sent and report the last one's recipient and subject."""
        sent_emails = getattr(self.environment, 'sent_emails', [])
        
        if not sent_emails:
            return {
                "evaluator": "EmailSentEvaluator",
                "email_sent": False,
                "score": 0.0,
                "error": "No email was sent"
            }
        
        # Get the last email that was sent
        email_data = sent_emails[-1]
        
        return {
            "evaluator": "EmailSentEvaluator",
            "email_sent": True,
            "score": 1.0,
            "recipient": email_data.get("to"),
            "subject": email_data.get("subject"),
            "message": "Agent successfully sent an email"
        }

print("Evaluators defined!")
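
As written, both evaluators score on whether an email was sent; neither inspects the amounts in it. One possible extension checks the email body for the expected values. The sketch assumes amounts appear in the body as digits, optionally with thousands separators or decimals. `mentions_amount` is a hypothetical helper, not a MASEval function.

```python
import re


def mentions_amount(body: str, amount: float) -> bool:
    """Return True if `body` contains a number equal to `amount`."""
    # Matches numbers such as 2000, 2,000, 2,000.00 or 1500.5
    for match in re.findall(r"\d[\d,]*(?:\.\d+)?", body):
        if float(match.replace(",", "")) == float(amount):
            return True
    return False


body = "I received your deposit of $2,000 and rent of $1,500.00."
print(mentions_amount(body, 2000), mentions_amount(body, 1500))
```

Inside `FinancialAccuracyEvaluator.__call__`, such a check could be applied to `sent_emails[-1]["body"]` with `expected_deposit` and `expected_rent`. An exact numeric match is a deliberate simplification, since a model may also spell amounts out in words.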

### 2.6 Create a Custom Benchmark

The `Benchmark` class orchestrates task execution and evaluation. We'll create a simplified version.

In [None]:
from maseval import AgentAdapter
from typing import Sequence, Tuple

class SimpleBenchmark(Benchmark):
    """Simplified benchmark for the tutorial."""
    
    def setup_environment(self, agent_data: Dict[str, Any], task: Task) -> Environment:
        """Create an environment for the task."""
        return SimpleEnvironment(task.environment_data)
    
    def setup_agents(
        self,
        agent_data: Dict[str, Any],
        environment: Environment,
        task: Task,
        user=None
    ) -> Tuple[Sequence[AgentAdapter], Dict[str, AgentAdapter]]:
        """Create an agent for the task."""
        # Initialize model
        model = LiteLLMModel(
            model_id="gemini/gemini-2.5-flash",
            api_key=os.getenv("GOOGLE_API_KEY"),
            temperature=0.7
        )
        
        # Create agent with environment tools
        agent = ToolCallingAgent(
            tools=environment.get_tools(),
            model=model,
            instructions="""You are a helpful assistant. Help users with email and banking tasks 
by using the available tools to retrieve information and take appropriate actions. 
Be professional and thorough in your responses."""
        )
        
        # Wrap agent in adapter for MASEval
        agent_adapter = SmolAgentAdapter(agent, "main_agent")
        
        # Return (agents_to_run, agents_dict)
        return [agent_adapter], {"main_agent": agent_adapter}
    
    def setup_evaluators(
        self,
        environment: Environment,
        task: Task,
        agents: Sequence[AgentAdapter],
        user=None
    ) -> Sequence[Evaluator]:
        """Create evaluators for the task."""
        return [
            FinancialAccuracyEvaluator(task, environment, user),
            EmailSentEvaluator(task, environment, user)
        ]
    
    def run_agents(
        self,
        agents: Sequence[AgentAdapter],
        task: Task,
        environment: Environment
    ) -> Any:
        """Execute the agent and return the final answer."""
        # Run the main agent with the task query
        agent = agents[0]
        result = agent.run(task.query)
        return result
    
    def evaluate(
        self,
        evaluators: Sequence[Evaluator],
        agents: Dict[str, AgentAdapter],
        final_answer: Any,
        traces: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        """Evaluate agent performance."""
        results = []
        for evaluator in evaluators:
            # Filter traces for this evaluator
            filtered_traces = evaluator.filter_traces(traces)
            # Run evaluation
            result = evaluator(filtered_traces, final_answer)
            results.append(result)
        return results

print("Benchmark class defined!")
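
The `evaluate` loop can be tried in isolation to see how `filter_traces` narrows what each evaluator receives. In the sketch below, `CountingEvaluator` is a hypothetical stand-in and the shape of the `traces` dictionary is invented for illustration; the real traces collected by MASEval may look different.

```python
from typing import Any, Dict, List, Optional


class CountingEvaluator:
    """Hypothetical stand-in: scores 1.0 if any environment traces exist."""

    def filter_traces(self, traces: Dict[str, Any]) -> Dict[str, Any]:
        return traces.get("environment", {})

    def __call__(
        self, traces: Dict[str, Any], final_answer: Optional[str] = None
    ) -> Dict[str, Any]:
        return {
            "evaluator": "CountingEvaluator",
            "score": 1.0 if traces else 0.0,
            "keys": sorted(traces),
        }


def evaluate(evaluators, final_answer, traces) -> List[Dict[str, Any]]:
    """Same loop as SimpleBenchmark.evaluate, outside the class."""
    return [ev(ev.filter_traces(traces), final_answer) for ev in evaluators]


# Invented trace structure, for illustration only.
traces = {
    "environment": {"tool_calls": ["get_transactions", "send_email"]},
    "agents": {},
}
results = evaluate([CountingEvaluator()], "done", traces)
print(results)
```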

### 2.7 Run the Benchmark

Now let's run the benchmark on our task and see the results!

In [None]:
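# Create benchmark instance with agent configuration
# Note: SimpleBenchmark.setup_agents above hardcodes these same values and does not read agent_data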
agent_data = {
    "model_id": "gemini/gemini-2.5-flash",
    "temperature": 0.7
}

benchmark = SimpleBenchmark(agent_data=agent_data, progress_bar=False)

# Create task collection
tasks = TaskCollection([task])

# Run the benchmark
print("Running benchmark...\n")
reports = benchmark.run(tasks=tasks)

print("\n" + "="*60)
print("BENCHMARK COMPLETE")
print("="*60)

### 2.8 Analyze the Results

Let's examine the evaluation results to see how well our agent performed.

In [None]:
# Get results for the first (and only) task
report = reports[0]

print(f"Task ID: {report['task_id']}")
print(f"Status: {report['status']}")
print("\nEvaluation Results:")
print("-" * 60)

if report.get("eval"):
    for eval_result in report["eval"]:
        print(f"\nEvaluator: {eval_result.get('evaluator', 'Unknown')}")
        print(f"Score: {eval_result.get('score', 'N/A')}")
        
        # Print relevant details
        for key, value in eval_result.items():
            if key not in ["evaluator", "score"]:
                print(f"  {key}: {value}")
else:
    print("No evaluation results available.")
    if report.get("error"):
        print(f"\nError: {report['error']}")

print("\n" + "="*60)
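
Once there is more than one evaluator, a compact summary is often easier to read than the raw dictionaries. The sketch below averages the `score` fields of a report. `summarize` is a hypothetical helper, and `example_report` is hand-written in the shape accessed by the cell it follows, with illustrative values.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize(report: Dict[str, Any]) -> Dict[str, Any]:
    """Average the numeric 'score' fields in report['eval'], if any."""
    results: List[Dict[str, Any]] = report.get("eval") or []
    scores = [r["score"] for r in results if isinstance(r.get("score"), (int, float))]
    return {
        "n_evaluators": len(results),
        "mean_score": mean(scores) if scores else None,
        "failed": [r.get("evaluator", "Unknown") for r in results if r.get("score") == 0.0],
    }


# Hand-written report; values are illustrative, not real benchmark output.
example_report = {
    "task_id": "example",
    "status": "success",
    "eval": [
        {"evaluator": "FinancialAccuracyEvaluator", "score": 1.0},
        {"evaluator": "EmailSentEvaluator", "score": 0.0, "error": "No email was sent"},
    ],
}
print(summarize(example_report))
```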

## Summary

In this tutorial, you learned:

### Part 1: Agent Development
- How to create custom tools for smolagents
- How to initialize and configure a ToolCallingAgent
- How to test your agent with queries

### Part 2: Systematic Evaluation with MASEval
- How to structure tasks with queries, environments, and evaluation criteria
- How to create custom environments that manage tool state
- How to write evaluators that measure specific aspects of agent performance
- How to run benchmarks and analyze results

## Next Steps

1. **Try the Five-A-Day Benchmark notebook** — A production-ready example with multi-agent systems and diverse evaluators
2. Create your own custom evaluators for your specific use case
3. Experiment with different agent frameworks (LangGraph, LlamaIndex)
4. Add callbacks for logging and tracing

For more information, visit the [MASEval repository](https://github.com/parameterlab/MASEval).