<!-- NOTEBOOK_METADATA source: "Jupyter Notebook" title: "Agent Evaluation: How to Evaluate LLM Agents" seoTitle: "Agent Evaluation: How to Evaluate LLM Agents (Metrics, Strategies & Examples)" description: "Complete guide to agent evaluation. Learn agent evaluation metrics like trajectory accuracy and tool selection, evaluation strategies (black-box, glass-box, white-box), and how to build automated agent evaluation pipelines with LLM-as-a-judge scoring." category: "Evaluation" -->

# Agent Evaluation: How to Evaluate LLM Agents

Evaluating AI agents is fundamentally different from evaluating simple LLM calls. Agents make autonomous, multi-step decisions — calling tools, searching databases, and chaining reasoning — which means a single accuracy score on the final output is not enough. You need to evaluate **what the agent did** (its trajectory), **how it did it** (each individual step), and **whether the result is correct** (the final response).

This guide provides a comprehensive framework for agent evaluation. You will learn how to measure agent behavior at three levels — final response, trajectory, and single step — and how to automate these evaluations using [Langfuse](https://langfuse.com) for [tracing](/docs/observability/overview), [datasets](/docs/evaluation/experiments/datasets), and [LLM-as-a-judge evaluations](/docs/evaluation/evaluation-methods/llm-as-a-judge). While the code examples use Pydantic AI, the evaluation strategies apply to any agent framework including [LangGraph](/guides/cookbook/example_langgraph_agents), [OpenAI Agents](/docs/observability/sdk/instrumentation), and others.

## What is an LLM Agent?

An LLM agent is more than just a single call to a language model. It's an autonomous system that operates in a continuous loop of reasoning and action. The loop begins when the LLM receives an input — either from a user or as feedback from a previous step. Based on this input, the LLM decides on an **action**, which often involves calling an external tool like a search API, a database query, or a code interpreter. This action interacts with an **environment**, which then produces **feedback** (like search results or data) that is fed back to the LLM.

This cycle of reasoning, action, environment interaction, and feedback continues until the agent decides to stop and generate a final answer. This entire sequence of events is what we call a **"trace"** or a **"trajectory"** — and it's what makes agent evaluation uniquely challenging compared to evaluating a single LLM call.

<Frame>
![LLM Agent](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/agent-overview.png)
</Frame>
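
As a rough illustration of this loop, the sketch below replaces the LLM with a hard-coded policy and the environment with a single stub tool. The names `fake_llm`, `TOOLS`, and `run_loop` are hypothetical and not part of any framework; the point is only to show how a trajectory accumulates as the loop runs.

```python
from typing import Callable

# Hypothetical tools standing in for a real environment (search API, database, ...).
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for '{query}'",
}

def fake_llm(question: str, feedback: list[str]) -> dict:
    """Stand-in for the model: search once, then answer using the feedback."""
    if not feedback:
        return {"type": "tool", "tool_name": "search", "args": question}
    return {"type": "final", "answer": f"Answer based on {feedback[-1]}"}

def run_loop(question: str, max_steps: int = 5) -> tuple[str, list[dict]]:
    trajectory: list[dict] = []  # every action taken, in order
    feedback: list[str] = []     # what the environment has returned so far
    for _ in range(max_steps):   # hard cap so a confused agent cannot loop forever
        action = fake_llm(question, feedback)
        if action["type"] == "final":
            return action["answer"], trajectory
        trajectory.append({"tool_name": action["tool_name"], "args": action["args"]})
        feedback.append(TOOLS[action["tool_name"]](action["args"]))
    return "stopped: step limit reached", trajectory

answer, trajectory = run_loop("What is Langfuse?")
```

The returned `trajectory` is the object that trajectory-level evaluation inspects, while `answer` is all that a final-response evaluation sees.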

## Why Agent Evaluation Matters

Evaluating agents is critical because they can fail in ways that simple LLM applications cannot. A chatbot might give a wrong answer, but an agent might call the wrong tool, execute actions in the wrong order, get stuck in a loop, or produce a correct final answer through an unsafe or inefficient path. Without structured agent evaluation, these failure modes are invisible — you only see the final output, not the broken reasoning chain that produced it.

Agent evaluation lets you catch regressions before they reach users, compare different agent configurations (models, prompts, tool sets) objectively, and build confidence that your agent handles edge cases correctly. For more on evaluation fundamentals, see [Evaluation Concepts](/docs/evaluation/core-concepts).

## Common Agent Evaluation Challenges

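When building and evaluating agents, three problems show up again and again: limited **understanding** of what the agent actually does, weak **specification** of what it should do, and poor **generalization** beyond handpicked examples.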

- **Lack of observability.** You often don't know what the agent actually does on real traffic — what tools it calls, what reasoning it follows, and where it gets stuck. Without systematic [trace inspection](/docs/observability/overview) and linking traces to user feedback, debugging is guesswork. This is why [agent monitoring](/blog/2024-07-ai-agent-observability-with-langfuse) is the foundation for any evaluation strategy.

- **Underspecified tasks.** Prompts and examples frequently don't encode what "good" behavior is, so the agent improvises in unpredictable ways. Clear evaluation criteria (what tools should be called, what facts should be included) force you to define success concretely.

- **Failure to generalize.** Even once you've tightened the spec, the agent may perform well on a few handpicked examples but fail on slightly different real-world queries. Systematic, [dataset-based evaluations](/docs/evaluation/experiments/datasets) at scale are the only way to check robustness.

<Frame>
![LLM Agent](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/evaluations.png)
</Frame>

## Agent Evaluation Metrics

Before choosing an evaluation strategy, it helps to understand what you can measure. Agent evaluation metrics fall into several categories depending on whether you're assessing the outcome, the process, or the system performance.

| Metric | What It Measures | Evaluation Level |
| --- | --- | --- |
| **Task Completion** | Does the final output fully satisfy the user's request? | Final Response |
| **Factual Accuracy** | Are all facts in the response correct and verifiable? | Final Response |
| **Tool Selection Accuracy** | Did the agent choose the correct tools for the task? | Trajectory |
| **Trajectory Correctness** | Did the agent follow the expected sequence of actions? | Trajectory |
| **Trajectory Efficiency** | Did the agent reach the answer with minimal unnecessary steps? | Trajectory |
| **Search/Query Quality** | Are search queries relevant and well-formed? | Single Step |
| **Reasoning Quality** | Does each step follow logically from the previous context? | Single Step |
| **Cost & Latency** | How much does each agent run cost in tokens and wall-clock time? | System |
| **Safety & Guardrails** | Does the agent stay within defined operational boundaries? | All Levels |

In this guide, we focus on the first six metrics and show how to automate them using [LLM-as-a-judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge). For cost and latency tracking, see [Token & Cost Tracking](/docs/observability/features/token-and-cost-tracking).
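
Some of the trajectory-level metrics can also be computed without a judge model. The sketch below shows one possible set of definitions. These formulas are assumptions made for illustration, not definitions from Langfuse: tool selection is scored as precision and recall over the set of tools, and efficiency as the ratio of expected steps to actual steps.

```python
def tool_selection_scores(expected: list[str], actual: list[str]) -> dict[str, float]:
    """Precision and recall over the set of tools (order and repeats ignored)."""
    expected_set, actual_set = set(expected), set(actual)
    hits = len(expected_set & actual_set)
    precision = hits / len(actual_set) if actual_set else 1.0
    recall = hits / len(expected_set) if expected_set else 1.0
    return {"precision": precision, "recall": recall}

def trajectory_efficiency(expected: list[str], actual: list[str]) -> float:
    """Ratio of expected steps to steps actually taken, capped at 1.0."""
    if not actual:
        return 1.0 if not expected else 0.0
    return min(1.0, len(expected) / len(actual))

expected = ["getLangfuseOverview", "searchLangfuseDocs"]
# Hypothetical run: the right tools, but the search was repeated three times.
actual = ["getLangfuseOverview", "searchLangfuseDocs", "searchLangfuseDocs", "searchLangfuseDocs"]

scores = tool_selection_scores(expected, actual)
efficiency = trajectory_efficiency(expected, actual)
```

In this example the agent picked the right tools (precision and recall of 1.0) but needed twice as many steps as expected (efficiency of 0.5), which is why selection and efficiency are worth tracking separately.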

## The 3 Phases of Agent Evaluation

Agent evaluation is not a one-time activity — it evolves as your agent matures. The process has three distinct phases:

**Phase 1: Early Development (Manual Tracing)**  
When you're first building an agent, the most valuable thing you can do is inspect its [traces](/docs/observability/overview). Manual tracing gives you immediate insight into the agent's reasoning, tool calls, and failure points. Use Langfuse's trace viewer to step through each action the agent took.

**Phase 2: First Users (Online Evaluation)**  
As real users interact with your agent, implement feedback mechanisms — like thumbs-up/thumbs-down buttons — to flag problematic traces for review. You can also set up automated [online evaluators](/docs/evaluation/core-concepts#online-evaluation) that score production traces in real time.

**Phase 3: Scaling (Offline Evaluation)**  
The final phase, and the focus of this guide, is creating an automated offline evaluation pipeline. As you scale, you can't manually review every trace. You need a "gold standard" [dataset](/docs/evaluation/experiments/datasets) of inputs and their expected outputs or trajectories. This benchmark allows you to [run experiments](/docs/evaluation/experiments/experiments-via-sdk), prevent regressions, and confidently iterate on prompts, models, and tool configurations.

<Frame>
![LLM Agent](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/issues.png)
</Frame>
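
For the user feedback described in Phase 2, a thumbs-up/thumbs-down click can be stored as a score on the corresponding trace. The helper below is a minimal sketch: `record_user_feedback` and the score name `user-feedback` are hypothetical, and it assumes the client exposes a `create_score(...)` method that accepts these keyword arguments, as the Langfuse Python SDK v3 does. Check the signature against your SDK version.

```python
from typing import Any, Optional

def record_user_feedback(
    client: Any,
    trace_id: str,
    thumbs_up: bool,
    comment: Optional[str] = None,
) -> dict:
    """Attach a thumbs-up/thumbs-down rating to a trace as a 0/1 score."""
    payload = {
        "trace_id": trace_id,
        "name": "user-feedback",        # hypothetical score name
        "value": 1 if thumbs_up else 0,
        "comment": comment,
    }
    # Assumption: `client.create_score` accepts these keyword arguments.
    client.create_score(**payload)
    return payload
```

Traces with a value of 0 are natural candidates for manual review and, later, for new dataset items.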

## Three Agent Evaluation Strategies

This guide covers three practical, automated evaluation strategies. Each operates at a different level of granularity and answers a different question about your agent's behavior.

| Strategy | Level | Question It Answers | Pros | Cons |
| --- | --- | --- | --- | --- |
| **Final Response** (Black-Box) | Output only | Is the answer correct? | Framework-agnostic, simple to set up | Doesn't explain *why* a failure occurred |
| **Trajectory** (Glass-Box) | Full trace | Did the agent take the right path? | Pinpoints where reasoning broke down | Requires defining expected tool sequences |
| **Single Step** (White-Box) | Per-step | Is each individual decision correct? | Most granular, like a unit test | Most effort to set up and maintain |

**1) Final Response Evaluation (Black-Box):**  
This method evaluates only the user's input and the agent's final answer, ignoring the internal steps entirely. It's the simplest to set up and works with any agent framework, but it cannot tell you *why* a failure occurred.

**2) Trajectory Evaluation (Glass-Box):**  
This method checks whether the agent took the "correct path." It compares the agent's actual sequence of tool calls against the expected sequence from a benchmark dataset. When the final answer is wrong, trajectory evaluation pinpoints exactly where in the reasoning process the failure occurred.

**3) Single Step Evaluation (White-Box):**  
This is the most granular evaluation strategy, acting like a unit test for agent reasoning. Instead of running the whole agent, it tests each decision-making step in isolation to see if it produces the expected next action. This is especially useful for validating that search queries, API parameters, or tool selections are correct.
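
To make the unit-test analogy concrete, the sketch below checks one decision at a time against a fixed state. `choose_next_action` is a hypothetical stand-in for a single LLM decision call; in a real agent you would call the model with the recorded history instead of this hard-coded logic.

```python
def choose_next_action(question: str, history: list[dict]) -> dict:
    """Hypothetical decision step. In a real agent this would be one LLM call."""
    if not history:
        return {"tool_name": "getLangfuseOverview", "args": {}}
    if len(history) == 1:
        return {"tool_name": "searchLangfuseDocs", "args": {"query": question}}
    return {"tool_name": None, "args": {}}  # None means: stop and answer

def check_step(question: str, history: list[dict], expected_tool) -> bool:
    """Single-step check: given a fixed state, is the next action the expected one?"""
    return choose_next_action(question, history)["tool_name"] == expected_tool

# Fixed states act like unit-test fixtures; the full agent is never run.
question = "How long are traces retained?"
state_after_overview = [{"tool_name": "getLangfuseOverview", "args": {}}]

first_step_ok = check_step(question, [], "getLangfuseOverview")
second_step_ok = check_step(question, state_after_overview, "searchLangfuseDocs")
```

Because each check starts from a recorded state, a failure points at one specific decision rather than at the run as a whole.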

## Implementation: Evaluate an Agent Step-by-Step

Below, we define a sample agent, create a benchmark dataset, and set up automated [LLM-as-a-judge](/docs/evaluation/evaluation-methods/llm-as-a-judge) evaluations in Langfuse. While the code uses Pydantic AI, the evaluation patterns generalize to any agent framework.

> **Want to see agent evaluation with other frameworks?** Check out the [LangGraph Agent Evaluation](/guides/cookbook/example_langgraph_agents) guide for a LangGraph-specific walkthrough.



### Step 0: Install Packages

In [None]:
%pip install -q --upgrade "pydantic-ai[mcp]" langfuse openai nest_asyncio aiohttp

### Step 1: Set Environment Variables

Get your Langfuse API keys from [project settings](https://cloud.langfuse.com).

In [None]:
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com"  # US region

os.environ["OPENAI_API_KEY"] = "sk-proj-..."

### Step 2: Enable Langfuse Tracing

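Create the Langfuse client, verify the credentials, and enable instrumentation for all Pydantic AI agents so that every agent run is exported to Langfuse as a trace.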

In [None]:
from langfuse import get_client
from pydantic_ai.agent import Agent

langfuse = get_client()
assert langfuse.auth_check(), "Langfuse auth failed - check your keys"

Agent.instrument_all()
print("✅ Pydantic AI instrumentation enabled")

### Step 3: Create Agent

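Build an agent that searches Langfuse docs using the [Langfuse Docs MCP Server](https://langfuse.com/docs/docs-mcp). The `process_tool_call` hook records the name and arguments of every tool call, and the resulting `tool_call_history` is attached to the trace metadata so the trajectory can be evaluated later.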

In [None]:
from typing import Any
from pydantic_ai import Agent, RunContext
from pydantic_ai.mcp import MCPServerStreamableHTTP, CallToolFunc, ToolResult

LANGFUSE_MCP_URL = "https://langfuse.com/api/mcp"

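async def run_agent(item, system_prompt="You are an expert on Langfuse. ", model="openai:gpt-4o-mini"):
    """Run the docs agent on one dataset item and return (answer, tool_call_history)."""
    langfuse.update_current_trace(input=item.input)

    # Collect every MCP tool call so the trajectory can be evaluated later.
    tool_call_history = []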

    async def process_tool_call(
        ctx: RunContext[Any],
        call_tool: CallToolFunc,
        tool_name: str,
        args: dict[str, Any],
    ) -> ToolResult:
        tool_call_history.append({"tool_name": tool_name, "args": args})
        return await call_tool(tool_name, args)
    
    langfuse_docs_server = MCPServerStreamableHTTP(
        url=LANGFUSE_MCP_URL,
        process_tool_call=process_tool_call,
    )

    agent = Agent(
        model=model,
        system_prompt=system_prompt,
        toolsets=[langfuse_docs_server],
    )

    async with agent:
        result = await agent.run(item.input["question"])
        
        langfuse.update_current_trace(
            output=result.output,
            metadata={"tool_call_history": tool_call_history},
        )

        return result.output, tool_call_history
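
The evaluators in the later steps need the trajectory as a list of tool names and the search query as a plain string. The helper below is a small sketch of how the recorded `tool_call_history` could be reduced to that shape. `summarize_tool_calls` is a hypothetical name, and the argument key `query` is an assumption about the search tool's schema that you should verify against the actual tool definition.

```python
def summarize_tool_calls(tool_call_history: list[dict]) -> dict:
    """Split recorded calls into a tool-name trajectory and the search queries used."""
    trajectory = [call["tool_name"] for call in tool_call_history]
    search_terms = [
        call["args"].get("query", "")  # assumption: the search argument is named "query"
        for call in tool_call_history
        if call["tool_name"] == "searchLangfuseDocs"
    ]
    return {"trajectory": trajectory, "search_terms": search_terms}

# Hypothetical history in the shape produced by `process_tool_call` above.
history = [
    {"tool_name": "getLangfuseOverview", "args": {}},
    {"tool_name": "searchLangfuseDocs", "args": {"query": "data retention"}},
]
summary = summarize_tool_calls(history)
```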

### Step 4: Create Evaluation Dataset

Build a benchmark dataset with test cases. Each case includes:
- `input`: User question
- `expected_output.response_facts`: Key facts the response must contain
- `expected_output.trajectory`: Expected sequence of tool calls
- `expected_output.search_term`: Expected search query (if applicable)

In [None]:
test_cases = [
    {
        "input": {"question": "What is Langfuse?"},
        "expected_output": {
            "response_facts": [
                "Open Source LLM Engineering Platform",
                "Product modules: Tracing, Evaluation and Prompt Management"
            ],
            "trajectory": ["getLangfuseOverview"],
        }
    },
    {
        "input": {"question": "How to trace a python application with Langfuse?"},
        "expected_output": {
            "response_facts": [
                "Python SDK, you can use the observe() decorator",
                "Lots of integrations, LangChain, LlamaIndex, Pydantic AI, and many more."
            ],
            "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
            "search_term": "Python Tracing"
        }
    },
    {
        "input": {"question": "How to connect to the Langfuse Docs MCP server?"},
        "expected_output": {
            "response_facts": [
                "Connect via the MCP server endpoint: https://langfuse.com/api/mcp",
                "Transport protocol: `streamableHttp`"
            ],
            "trajectory": ["getLangfuseOverview"]
        }
    },
    {
        "input": {"question": "How long are traces retained in langfuse?"},
        "expected_output": {
            "response_facts": [
                "By default, traces are retained indefinitely",
                "You can set custom data retention policy in the project settings"
            ],
            "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
            "search_term": "Data retention"
        }
    }
]

DATASET_NAME = "pydantic-ai-mcp-agent-evaluation"

dataset = langfuse.create_dataset(name=DATASET_NAME)
for case in test_cases:
    langfuse.create_dataset_item(
        dataset_name=DATASET_NAME,
        input=case["input"],
        expected_output=case["expected_output"]
    )
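
Before uploading, it can help to check that every test case has the fields the evaluators rely on. The validator below is a sketch under the assumption that `response_facts` and `trajectory` are always required and that `search_term` only makes sense when a search call is expected. `validate_case` is a hypothetical helper, not part of the Langfuse SDK.

```python
REQUIRED_EXPECTED_KEYS = {"response_facts", "trajectory"}

def validate_case(case: dict) -> list[str]:
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    if not case.get("input", {}).get("question"):
        problems.append("missing input.question")
    expected = case.get("expected_output", {})
    for key in sorted(REQUIRED_EXPECTED_KEYS - set(expected)):
        problems.append(f"missing expected_output.{key}")
    # A search term only makes sense if a search call is expected in the trajectory.
    if "search_term" in expected and "searchLangfuseDocs" not in expected.get("trajectory", []):
        problems.append("search_term given but trajectory has no searchLangfuseDocs call")
    return problems

good_case = {
    "input": {"question": "How long are traces retained in langfuse?"},
    "expected_output": {
        "response_facts": ["By default, traces are retained indefinitely"],
        "trajectory": ["getLangfuseOverview", "searchLangfuseDocs"],
        "search_term": "Data retention",
    },
}
# Deliberately broken case: empty question, no trajectory, orphaned search term.
bad_case = {
    "input": {"question": ""},
    "expected_output": {"response_facts": ["x"], "search_term": "y"},
}

good_problems = validate_case(good_case)
bad_problems = validate_case(bad_case)
```

Running `validate_case` over `test_cases` before the upload loop catches malformed items locally instead of during an experiment run.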

### Step 5: Set Up Evaluators

Create three LLM-as-a-judge evaluators in the Langfuse UI, one for each aspect of agent behavior described above. The [LLM-as-a-judge documentation](https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge) explains how to set them up.

#### 1. Final Response Evaluation (Black Box)

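Checks whether the final answer contains the expected facts (`response_facts`). Because it only looks at the output, it works regardless of how the agent is implemented.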

<Frame>
![Final Response Evaluation](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/eval-final-response.png)
</Frame>

**Prompt template:**

```markdown
You are a teacher grading a student based on the factual correctness of their statements.

### Examples

#### Example 1:
- Response: "The sun is shining brightly."
- Facts to verify: ["The sun is up.", "It is a beautiful day."]
- Reasoning: The response includes both facts.
- Score: 1

#### Example 2:
- Response: "When I was in the kitchen, the dog was there"
- Facts to verify: ["The cat is on the table.", "The dog is in the kitchen."]
- Reasoning: The response mentions the dog but not the cat.
- Score: 0

### New Student Response

- Response: {{response}}
- Facts to verify: {{facts_to_verify}}
```
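
To preview what the judge model receives for a given test case, you can fill the placeholders locally. The snippet below is only a sketch for inspection purposes: `render` is a hypothetical helper that does plain string substitution, and it is not how Langfuse maps variables internally (that mapping is configured in the evaluator setup).

```python
import json

# Last section of the prompt template above.
JUDGE_TEMPLATE = (
    "### New Student Response\n\n"
    "- Response: {{response}}\n"
    "- Facts to verify: {{facts_to_verify}}\n"
)

def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders; non-strings are JSON-encoded like the examples."""
    rendered = template
    for name, value in variables.items():
        text = value if isinstance(value, str) else json.dumps(value)
        rendered = rendered.replace("{{" + name + "}}", text)
    return rendered

preview = render(
    JUDGE_TEMPLATE,
    {
        "response": "Langfuse is an open source LLM engineering platform.",
        "facts_to_verify": ["Open Source LLM Engineering Platform"],
    },
)
```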

#### 2. Trajectory Evaluation (Glass Box)

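Verifies that the agent called the expected tools. The prompt below compares the two lists without regard to order, so tighten the instructions if the exact sequence of calls matters for your agent.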

<Frame>
![Trajectory Evaluation](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/eval-trajectory.png)
</Frame>

**Prompt template:**

```markdown
You are comparing two lists of strings. Check whether the lists contain exactly the same items. Order does not matter.

## Examples

Expected: ["searchWeb", "visitWebsite"]
Output: ["searchWeb"]
Reasoning: Output missing "visitWebsite".
Score: 0

Expected: ["drawImage", "visitWebsite", "speak"]
Output: ["visitWebsite", "speak", "drawImage"]
Reasoning: Output matches expected items.
Score: 1

Expected: ["getNews"]
Output: ["getNews", "watchTv"]
Reasoning: Output contains unexpected "watchTv".
Score: 0

## This Exercise

Expected: {{expected}}
Output: {{output}}
```
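
Since this check is a plain list comparison, it can also be done in code, which is cheaper and deterministic. The function below is a sketch of such an alternative to the judge prompt, not a replacement that Langfuse prescribes. `trajectory_match` is a hypothetical name, and collapsing repeated calls in the unordered mode is an assumption of this sketch.

```python
def trajectory_match(expected: list[str], output: list[str], ordered: bool = False) -> int:
    """Return 1 if the tool lists match, else 0.

    ordered=False mirrors the prompt above (same items, any order); repeated
    calls to the same tool are collapsed. ordered=True requires the exact sequence.
    """
    if ordered:
        return int(expected == output)
    return int(set(expected) == set(output))

# The three examples from the prompt template.
missing_tool = trajectory_match(["searchWeb", "visitWebsite"], ["searchWeb"])
same_items = trajectory_match(
    ["drawImage", "visitWebsite", "speak"], ["visitWebsite", "speak", "drawImage"]
)
extra_tool = trajectory_match(["getNews"], ["getNews", "watchTv"])
```

An LLM judge remains useful when the expected trajectory is described loosely, for example when several tools are acceptable for the same step.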

#### 3. Search Quality Evaluation

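Evaluates a single step: whether the query the agent passed to the search tool corresponds to the expected `search_term`. For test cases without a `search_term`, no search is expected.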

<Frame>
![Single Step Evaluation](/images/cookbook/example_pydantic_ai_mcp_agent_evaluation/eval-single-step.png)
</Frame>

**Prompt template:**

```markdown
You are grading whether a student searched for the right information. The search term should correspond vaguely with the expected term.

### Examples

Response: "How can I contact support?"
Expected search topics: Support
Reasoning: Response searches for support.
Score: 1

Response: "Deployment"
Expected search topics: Tracing
Reasoning: Response doesn't match expected topic.
Score: 0

Response: (empty)
Expected search topics: (empty)
Reasoning: No search expected, no search done.
Score: 1

### New Student Response

Response: {{search}}
Expected search topics: {{expected_search_topic}}
```

Create these evaluators in Langfuse UI under **Prompts** → **Create Evaluator**.

### Step 6: Run Experiments

Run the agent once on every dataset item with the default prompt and model. `run_experiment` calls `run_agent` for each item, and `result.format()` returns a readable summary of the run.

In [None]:
dataset = langfuse.get_dataset(DATASET_NAME)

result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=run_agent
)

print(result.format())
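
Alongside the judge-based evaluators configured in the UI, the experiments SDK documentation describes passing code-based evaluator functions to an experiment run. The sketch below assumes such a function receives `input`, `output`, and `expected_output` as keyword arguments, and that `output` has the `(answer, tool_call_history)` shape returned by `run_agent`. It returns a plain dict for illustration, so adapt the return value to the result type your SDK version expects. The name `search_expectation_evaluator` and the score name are hypothetical.

```python
def search_expectation_evaluator(*, input, output, expected_output, **kwargs) -> dict:
    """Score 1 if a docs search happened exactly when the test case expects one."""
    _, tool_call_history = output  # shape returned by `run_agent`: (answer, history)
    searched = any(
        call["tool_name"] == "searchLangfuseDocs" for call in tool_call_history
    )
    search_expected = bool(expected_output.get("search_term"))
    return {
        "name": "search_expected",  # hypothetical score name
        "value": int(searched == search_expected),
        "comment": f"searched={searched}, expected={search_expected}",
    }
```

This covers the deterministic part of the search check (was a search made when one was expected), while the judge prompt from Step 5 still decides whether the query itself was on topic.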

### Step 7: Compare Multiple Configurations

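Run one experiment per combination of system prompt and model, then compare the scores across runs. `functools.partial` fixes the prompt and model for each run so that the task still only needs the dataset item.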

In [None]:
from functools import partial

system_prompts = {
    "simple": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Cite sources when appropriate."
    ),
    "nudge_search": (
        "You are an expert on Langfuse. "
        "Answer user questions accurately and concisely using the available MCP tools. "
        "Always cite sources when appropriate. "
        "When unsure, use getLangfuseOverview then search the docs. You can use these tools multiple times."
    )
}

models = ["openai:gpt-5-mini", "openai:gpt-5-nano"]

dataset = langfuse.get_dataset(DATASET_NAME)

for prompt_name, prompt_content in system_prompts.items():
    for test_model in models:
        task = partial(
            run_agent,
            system_prompt=prompt_content,
            model=test_model,
        )

        result = dataset.run_experiment(
            name=f"Test: {prompt_name} {test_model}",
            description="Comparing prompts and models",
            task=task
        )

        print(result.format())

## Agent Evaluation Best Practices

Based on our experience helping teams evaluate agents in production, here are key best practices:

1. **Start with tracing, not scoring.** Before you build automated evaluations, spend time manually reviewing agent traces. The patterns you observe will inform what metrics matter most for your use case. Use [Langfuse tracing](/docs/observability/overview) to inspect every tool call, reasoning step, and intermediate output.

2. **Define success criteria before writing evaluators.** For each test case, explicitly define what "correct" looks like at each level — the expected final answer, the expected tool sequence, and the expected search queries. Vague criteria lead to unreliable evaluations.

3. **Use all three evaluation levels together.** Final response evaluation tells you *what* went wrong. Trajectory evaluation tells you *where* it went wrong. Single step evaluation tells you *why* it went wrong. Together, they give you a complete picture.

4. **Build your dataset from real failures.** The most valuable test cases come from production traces where the agent failed. Use [annotation queues](/docs/evaluation/evaluation-methods/annotation-queues) to systematically review and label problematic traces, then add them to your evaluation dataset.

5. **Run evaluations in CI/CD.** Integrate agent evaluation into your deployment pipeline using [experiments via SDK](/docs/evaluation/experiments/experiments-via-sdk). Block deployments that cause score regressions on your benchmark dataset.

6. **Compare configurations systematically.** When changing prompts, models, or tools, run the same evaluation dataset across all configurations to make data-driven decisions. The experiment comparison view in Langfuse makes this straightforward.
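
The CI/CD gate from point 5 can be as small as a comparison of aggregate scores against a stored baseline. The sketch below assumes you have already computed a mean score per evaluator for both runs. The function names, the tolerance of 0.02, and the numbers are illustrative assumptions, not measured results.

```python
def find_regressions(
    baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.02
) -> dict[str, float]:
    """Return {metric: drop} for every metric that fell by more than `tolerance`."""
    regressions = {}
    for metric, base_value in baseline.items():
        drop = base_value - current.get(metric, 0.0)  # a missing metric counts as 0
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions

def gate(baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.02) -> int:
    """Exit code for a CI step: 0 = pass, 1 = block the deployment."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, drop in regressions.items():
        print(f"REGRESSION {metric}: -{drop}")
    return 1 if regressions else 0

# Illustrative numbers only, not measured results.
baseline_scores = {"final_response": 0.90, "trajectory": 0.80}
current_scores = {"final_response": 0.91, "trajectory": 0.70}

exit_code = gate(baseline_scores, current_scores)
# In a CI job you would finish with: raise SystemExit(exit_code)
```

A small tolerance avoids blocking deployments on noise from non-deterministic judge scores; how large it should be depends on the size of your dataset.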

## Next Steps

Now that you have a working agent evaluation pipeline, here are ways to extend it:

- **Scale your dataset** with [synthetic data generation](/guides/cookbook/example_synthetic_datasets) to cover more edge cases
- **Add online evaluation** to score production traces in real time using [LLM-as-a-Judge](/docs/evaluation/evaluation-methods/llm-as-a-judge)
- **Evaluate multi-turn conversations** if your agent handles [multi-turn dialogue](/guides/cookbook/example_evaluating_multi_turn_conversations)
- **Monitor agent performance** over time with [custom dashboards](/docs/metrics/features/custom-dashboards) and [score analytics](/docs/evaluation/evaluation-methods/score-analytics)
- **Explore the full evaluation roadmap** in our [comprehensive evaluation guide](/blog/2025-11-12-evals)

## Frequently Asked Questions

### What is agent evaluation?

Agent evaluation is the process of systematically testing and measuring the performance of AI agents — autonomous systems that use LLMs to make decisions, call tools, and complete multi-step tasks. Unlike evaluating a single LLM call, agent evaluation must assess the entire trajectory of actions, not just the final output.

### How is agent evaluation different from LLM evaluation?

Standard LLM evaluation checks whether a model produces a correct or high-quality response to a given prompt. Agent evaluation is more complex because agents make multiple decisions in sequence — choosing which tools to call, what parameters to pass, and when to stop. You need to evaluate not just the final answer, but also the reasoning path (trajectory) and each individual decision (single step).

### What are the main types of agent evaluation?

There are three main types: **Final Response (Black-Box)** evaluation checks only the end result; **Trajectory (Glass-Box)** evaluation checks whether the agent took the correct sequence of actions; and **Single Step (White-Box)** evaluation tests each individual decision in isolation. Most production systems use a combination of all three.

### How do I build an agent evaluation dataset?

Start by defining test cases that represent your most common and most critical user interactions. Each test case should include the user input, expected facts in the response, the expected sequence of tool calls (trajectory), and expected parameters for key tool calls. Grow your dataset over time by adding cases from real production failures.

### Can I use LLM-as-a-judge for agent evaluation?

Yes. LLM-as-a-judge is one of the most effective approaches for agent evaluation because agent outputs are often too complex for simple rule-based checks. You can use different judge prompts for each evaluation level — one for final response quality, one for trajectory correctness, and one for individual step quality. See the [LLM-as-a-Judge documentation](/docs/evaluation/evaluation-methods/llm-as-a-judge) for setup instructions.

### How often should I run agent evaluations?

Run offline evaluations (experiments) before every deployment that changes prompts, models, or tool configurations. Run online evaluations continuously on production traces to catch issues in real traffic. For a comprehensive approach, see the [evaluation overview](/docs/evaluation/overview).