# Preparing for Evaluation: Running and Debugging the Agent

We'll prepare our evaluations: we'll run our agent using the generated ground truth data and analyze its performance patterns.

In the next lesson, we'll use the results to run the actual evaluations.

## Setting Up the Environment

First, let's import our agent from the main module:

In [None]:
import sys
sys.path.insert(0, '..')

import main
agent = main.agent

This gives us access to the same agent we've been testing and improving throughout the module.

## Loading Ground Truth Data

Load the synthetic dataset we created in the previous lesson:

In [None]:
import pandas as pd

df_ground_truth = pd.read_csv('ground_truth_evidently.csv')
ground_truth = df_ground_truth.to_dict(orient='records')

Let's examine a sample question to understand what we're working with:

In [None]:
q = ground_truth[10]
print(q)

Let's test our agent with this sample question:

In [None]:
result = await agent.run(q['question'])
print(result.output.format_article())

## Preparing Sample Data

Instead of working on the entire dataset, we can also work on a sample.

Let's select 50 questions:

In [None]:
import random
random.seed(1)

ground_truth_sample = random.sample(ground_truth, 50)


Save the sample for reproducibility:

In [None]:
import pickle

with open('sample.bin', 'wb') as f_out:
    pickle.dump(ground_truth_sample, f_out)
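
In a later session, the saved sample can be loaded back instead of being re-sampled. Here is a minimal sketch of the round trip, using a small made-up list of records and a temporary file rather than the real sample.bin:

```python
import pickle
import random
import tempfile
from pathlib import Path

# made-up stand-in for the ground truth records
records = [{'question': f'question {i}'} for i in range(20)]

random.seed(1)
sample = random.sample(records, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'sample.bin'

    # save the sample
    with open(path, 'wb') as f_out:
        pickle.dump(sample, f_out)

    # load it back, as a later notebook would
    with open(path, 'rb') as f_in:
        loaded_sample = pickle.load(f_in)

# re-seeding and sampling again gives the same selection
random.seed(1)
resampled = random.sample(records, 5)
```

The fixed seed makes the selection repeatable, and the pickle file protects against the underlying dataset changing between sessions.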

## Error Handling

The plan is to evaluate the agent against all ground truth data.

But what if it breaks midway? It would be a pity if it failed at 80% with a network error (a timeout or something like that) and we had to re-run the whole thing.

So let's put things into a try/except block:

In [None]:
import traceback

async def run_agent(q):
    try:
        result = await agent.run(q['question'])
        return (q, result)
    except Exception:  # a bare except would also swallow Ctrl+C and task cancellation
        print(f'error processing {q}')
        traceback.print_exc()
        return (None, None)

This wrapper ensures that even if some queries fail (due to token limits or other issues), we can continue processing the remaining questions.
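The wrapper above returns (None, None) on failure, so the failed question is only visible in the printed log. As a possible variation (not what the lesson's code does), the sketch below returns the question together with None so that failed questions can be collected and retried. `fake_agent_run` is a hypothetical stand-in for the real agent:

```python
import asyncio

async def fake_agent_run(question):
    """Hypothetical stand-in for agent.run that fails on some inputs."""
    if 'timeout' in question:
        raise TimeoutError('simulated network timeout')
    return f'answer to: {question}'

async def run_agent_keep_question(q):
    try:
        result = await fake_agent_run(q['question'])
        return (q, result)
    except Exception:
        # keep the question so it can be retried later
        return (q, None)

async def run_all(questions):
    results = [await run_agent_keep_question(q) for q in questions]
    failed = [q for q, result in results if result is None]
    return results, failed

questions = [
    {'question': 'what is data drift?'},
    {'question': 'trigger a timeout'},
    {'question': 'how to build a report?'},
]

# in a notebook, use `await run_all(questions)` instead of asyncio.run
results, failed = asyncio.run(run_all(questions))
```

Code that skips entries where the result is None keeps working with this variation, since only the first element of the tuple changes.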

## Parallel Processing Setup

To efficiently process multiple queries, we'll use asynchronous processing (ChatGPT helped me translate the ThreadPoolExecutor version into asyncio):

In [None]:
import asyncio
from tqdm.auto import tqdm

async def map_progress(seq, f, max_concurrency=6):
    """Asynchronously map async function f over seq with progress bar."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(el):
        async with semaphore:
            return await f(el)

    # create one coroutine per element
    coros = [run(el) for el in seq]

    # iterate over them in the order they finish (not in input order)
    completed = asyncio.as_completed(coros)

    results = []

    for coro in tqdm(completed, total=len(seq)):
        result = await coro
        results.append(result)

    return results
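
Two properties of this pattern are worth seeing in isolation: the semaphore caps how many calls are in flight, and results come back in completion order rather than input order. The sketch below repeats the same pattern without the progress bar, using a hypothetical sleep-based workload instead of the agent:

```python
import asyncio

async def bounded_map(seq, f, max_concurrency=6):
    """Same pattern as map_progress, without the progress bar."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(el):
        async with semaphore:
            return await f(el)

    results = []
    for coro in asyncio.as_completed([run(el) for el in seq]):
        results.append(await coro)
    return results

def make_tracked_sleep():
    """Build a hypothetical workload that records peak concurrency."""
    stats = {'active': 0, 'peak': 0}

    async def tracked_sleep(delay):
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(delay)
        stats['active'] -= 1
        return delay

    return tracked_sleep, stats

delays = [0.3, 0.1, 0.2]

# all three run at once, so the shortest delay finishes first
unbounded_sleep, unbounded_stats = make_tracked_sleep()
completion_order = asyncio.run(bounded_map(delays, unbounded_sleep, max_concurrency=3))

# with a limit of 1, only one call is ever in flight
serial_sleep, serial_stats = make_tracked_sleep()
serial_results = asyncio.run(bounded_map(delays, serial_sleep, max_concurrency=1))
```

Because the order is not preserved, it helps that run_agent returns the question together with its result: each pair stays matched no matter when it completes.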


## Running the Initial Evaluation

Execute the evaluation on our sample dataset:

In [None]:
all_results = await map_progress(ground_truth_sample, run_agent)

Analyze the cost of running this evaluation:

In [None]:
from toyaikit.pricing import PricingConfig
pricing = PricingConfig()

Create a helper function to calculate total costs across all results:

In [None]:
def calculate_cost(model, all_results):
    total_input = 0
    total_output = 0
    
    for q, result in all_results:
        if result is None:
            continue
        usage = result.usage()
        total_input = total_input + usage.input_tokens
        total_output = total_output + usage.output_tokens
    
    return pricing.calculate_cost(model, total_input, total_output)


Check the cost for our sample run:

In [None]:
calculate_cost('gpt-4o-mini', all_results)

It's a few cents. For the full dataset, it'd be around $1.0-$1.5.
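
The exact prices come from PricingConfig, but the arithmetic behind an estimate like this is typically simple: tokens divided by one million, multiplied by the per-million price, summed over input and output. Here is a rough sketch with made-up token counts and illustrative per-million prices (both are assumptions; check current pricing for real numbers):

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate the cost in dollars from token counts and per-million prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# made-up usage for a sample run
sample_input_tokens = 200_000
sample_output_tokens = 50_000

# illustrative prices in dollars per million tokens (assumptions)
cost = estimate_cost(sample_input_tokens, sample_output_tokens, 0.15, 0.60)
print(f'${cost:.2f}')
```

Input tokens usually dominate for search agents, because every search result is fed back into the model on the next step.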

## Processing Results for Analysis

When we run the agent on many queries, problems become easier to spot. For example, for some queries the agent makes too many search calls.

Let's look into it.

First, create a helper function to simplify the message structure:

In [None]:
import json

def simplify_messages(messages):
    messages_simplified = []

    for m in messages:
        parts = []

        for original_part in m.parts:
            kind = original_part.part_kind
            # print(original_part)
            part = {
                'kind': kind
            }
            if kind == 'user-prompt':
                part['content'] = original_part.content
            if kind == 'tool-call':
                if original_part.tool_name == 'final_result':
                    continue
    
                part['tool_name'] = original_part.tool_name
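                args = original_part.args
                # args may arrive as a JSON string or as an already-parsed dict
                part['args'] = json.loads(args) if isinstance(args, str) else args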
            if kind == 'tool-return':
                continue
            if kind == 'text':
                part['content'] = original_part.content

            parts.append(part)

        if len(parts) > 0:
            messages_simplified.extend(parts)

    return messages_simplified

This function extracts the essential information from the complex message structure. It drops tool returns and internal tool calls like final_result; otherwise there would be too much to look at.
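
To make the output concrete, here is a made-up example of the simplified structure. The kinds and keys follow simplify_messages, while the contents are invented and the `query` argument name is assumed from the search tool's signature. Pulling out the search queries then takes a single list comprehension:

```python
# made-up example of what simplify_messages returns for one run
messages_simplified = [
    {'kind': 'user-prompt', 'content': 'how do I detect data drift?'},
    {'kind': 'tool-call', 'tool_name': 'search', 'args': {'query': 'data drift detection'}},
    {'kind': 'tool-call', 'tool_name': 'search', 'args': {'query': 'drift metrics'}},
    {'kind': 'text', 'content': 'Summary of findings'},
]

# collect the queries the agent sent to the search tool
search_queries = [
    m['args']['query']
    for m in messages_simplified
    if m['kind'] == 'tool-call' and m['tool_name'] == 'search'
]

print(search_queries)
```

Reading the queries side by side is often the quickest way to see whether the agent is rephrasing the same search over and over.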

Now let's count the number of tool calls to understand agent behavior:

In [None]:
def count_tool_calls(messages):
    cnt = 0 
    for m in messages:
        if m['kind'] == 'tool-call':
            cnt = cnt + 1
    return cnt

Let's process all the records:

In [None]:
def process_result(q, result):
    row = {}

    row['question'] = q['question']
    row['answer'] = result.output.format_article()
    row['messages'] = simplify_messages(result.new_messages())
    row['num_tool_calls'] = count_tool_calls(row['messages']) 

    row['original_question'] = q
    row['original_result'] = result

    return row


rows = []

for q, result in all_results:
    if result is None:
        continue

    row = process_result(q, result)
    rows.append(row)


Put everything inside a pandas DataFrame for easier analysis:

In [None]:
df_logs = pd.DataFrame(rows)
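
A quick way to spot runaway searches is to look at the distribution of num_tool_calls. The sketch below works on a plain list of dicts with made-up values that mimic the rows; the threshold of 6 is an assumed budget, not something the data dictates:

```python
from collections import Counter

# made-up rows with the same keys as the output of process_result
rows_example = [
    {'question': 'q1', 'num_tool_calls': 3},
    {'question': 'q2', 'num_tool_calls': 4},
    {'question': 'q3', 'num_tool_calls': 12},
    {'question': 'q4', 'num_tool_calls': 3},
]

# how many runs used each number of tool calls
distribution = Counter(row['num_tool_calls'] for row in rows_example)

max_expected_calls = 6  # assumption: the budget we want the agent to respect
too_many = [
    row['question']
    for row in rows_example
    if row['num_tool_calls'] > max_expected_calls
]

for num_calls, count in sorted(distribution.items()):
    print(num_calls, count)
```

The questions in `too_many` are the ones worth opening individually to read the full message log.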

## Identifying Performance Issues

During our analysis, we discovered a problem: when the agent can't find something, it keeps searching and searching.

We need to stop it and have it explicitly say: "can't find the information you're asking for". To address this, we'll instruct it to limit itself to 6 searches. If it still can't find anything, we'll ask it to just say so.

Let's go back to the agent code and update the instructions:

In [None]:
search_instructions = """
You are a search assistant for the Evidently documentation.

Requirements:
- For every user query, you must perform at least 3 and at most 6 separate searches
- Keep all searches relevant to Evidently and centered on technical or conceptual details
- The FAQ database contains only Evidently-related content, so you don't need to include "Evidently" in queries
- If you cannot answer the user's question after 6 searches, set `found_answer` to False
- Do not perform more than 6 searches per query
"""

class SearchResultArticle(BaseModel):
    found_answer: bool # <- this
    title: str
    sections: list[Section]
    references: list[Reference]


Let's add this to the tests we created previously:

In [None]:
def test_agent_doesnt_make_too_many_calls():
    result = run_agent_sync("detecting constant features in dataset")

    tool_calls = get_tool_calls(result)

    assert len(tool_calls) >= 3, "Less than 3 tool calls found"
    assert len(tool_calls) <= 10, "More than 10 tool calls found"

    output = result.output
    print(output)


And a second test that checks that the agent reports when it can't find relevant information:

In [None]:
@pytest.mark.asyncio
async def test_cannot_find_result():
    criteria = [
        "agent makes between 3 and 10 search calls",
        "article should say that no relevant information was found",
    ]

    result = await run_agent("detecting constant features in dataset")
    print(result.output.format_article())

    eval_results = await evaluate_agent_performance(
        criteria,
        result,
        output_transformer=lambda x: x.format_article()
    )

    print(eval_results)

    for criterion in eval_results.criteria:
        print(criterion)
        assert criterion.passed, f"Criterion failed: {criterion.criterion_description}"


## Re-running Evaluation with Improvements

After implementing the improvements, re-run the evaluation:

In [None]:
all_results = await map_progress(ground_truth_sample, run_agent)

Rebuild df_logs from the new results by re-running the processing cells above, then sort by the number of tool calls to understand search patterns:

In [None]:
df_logs.sort_values(by='num_tool_calls')

## Advanced Search Control Options

For even tighter control, we can enforce the limit in code instead of relying on the instructions alone.

### Option 1: Context-aware search limiting

We wrap our search function inside another one that has access to RunContext. Through it, we can see how many run steps (tool-call loop iterations) have already been made. If there are too many, we return a message instead of search results:

In [None]:
from pydantic_ai import Agent, RunContext

def create_agent():
    tools = search_tools.prepare_search_tools()

    def search(ctx: RunContext, query: str):
        if ctx.run_step > 3:
            return {"message": "Maximum number of searches reached."}

        return tools.search(query)

    search.__doc__ = tools.search.__doc__

    return Agent(
        name="search",
        instructions=search_instructions,
        tools=[search],
        model="openai:gpt-4o-mini",
        output_type=SearchResultArticle,
    )


Note: we need to copy the docstring to the new search tool; otherwise the LLM won't know how to use it.
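
The wrapping pattern can be tried without the agent framework. In the sketch below, `FakeContext` and `raw_search` are hypothetical stand-ins for RunContext and the real search tool; only the `run_step` attribute is modelled:

```python
from dataclasses import dataclass

@dataclass
class FakeContext:
    """Hypothetical stand-in for RunContext; only run_step is modelled."""
    run_step: int

def raw_search(query: str) -> list:
    """Search the index for documents matching the query."""
    return [f'doc about {query}']

def make_limited_search(search_fn, max_steps=3):
    def search(ctx: FakeContext, query: str):
        # refuse to search once the step budget is used up
        if ctx.run_step > max_steps:
            return {"message": "Maximum number of searches reached."}
        return search_fn(query)

    # copy the docstring so the tool description stays the same
    search.__doc__ = search_fn.__doc__
    return search

limited_search = make_limited_search(raw_search)

within_budget = limited_search(FakeContext(run_step=1), 'drift')
over_budget = limited_search(FakeContext(run_step=4), 'drift')
```

The docstring is copied by hand rather than with functools.wraps because the wrapper has a different signature (it takes the extra context argument), and wraps would make signature inspection report the original one.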

### Option 2: Message history processing

Alternatively, we can alter message history using [history_processors](https://ai.pydantic.dev/message-history/). Here's the docstring from the Agent class:

```text
history_processors: Optional list of callables to process the message history before sending it to the model.
            Each processor takes a list of messages and returns a modified list of messages.
            Processors can be sync or async and are applied in sequence.
```

Let's implement it:

In [None]:
from pydantic_ai.messages import ModelMessage, UserPromptPart

def force_answer_after_6_searches(messages: list[ModelMessage]) -> list[ModelMessage]: 
    num_tool_calls = 0
    
    for m in messages:
        for p in m.parts:
            if p.part_kind == 'tool-call' and p.tool_name == 'search':
                num_tool_calls = num_tool_calls + 1

    if num_tool_calls >= 6:
        print('forcing output')
        last_message = messages[-1]
        finish_prompt = 'System message: The maximum number of searches (6) has been reached. Proceed to finishing the writeup.'
        finish_prompt_part = UserPromptPart(finish_prompt)
        last_message.parts.append(finish_prompt_part)

    return messages


Now we control the agent by injecting an instruction to stop searching and proceed to generating the answer. To enable it, pass the function in the history_processors list when creating the Agent.

This option is a little cleaner, as we don't need to modify our tools.
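
The processor above appends to the last message in place. If mutating the history is a concern, a non-mutating variant is possible. The sketch below uses hypothetical stand-in classes (`FakePart`, `FakeMessage`) instead of the pydantic-ai message types, so it only illustrates the logic:

```python
from dataclasses import dataclass, field, replace

@dataclass
class FakePart:
    """Hypothetical stand-in for a message part."""
    part_kind: str
    tool_name: str = ''
    content: str = ''

@dataclass
class FakeMessage:
    """Hypothetical stand-in for a model message."""
    parts: list = field(default_factory=list)

FINISH_PROMPT = 'The maximum number of searches has been reached. Proceed to finishing the writeup.'

def count_search_calls(messages):
    return sum(
        1
        for m in messages
        for p in m.parts
        if p.part_kind == 'tool-call' and p.tool_name == 'search'
    )

def force_answer(messages, max_searches=6):
    if count_search_calls(messages) < max_searches:
        return messages

    # build a new last message instead of mutating the original
    nudge = FakePart(part_kind='user-prompt', content=FINISH_PROMPT)
    last = messages[-1]
    new_last = replace(last, parts=[*last.parts, nudge])
    return [*messages[:-1], new_last]

def make_history(num_searches):
    return [
        FakeMessage(parts=[FakePart('tool-call', tool_name='search')])
        for _ in range(num_searches)
    ]

short_history = make_history(5)
long_history = make_history(6)

processed_short = force_answer(short_history)
processed_long = force_answer(long_history)
```

Here the original `long_history` keeps its single part per message, while the processed copy carries the extra instruction at the end.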

## Final Evaluation Run

Execute the final evaluation with all improvements:

In [None]:
all_results = await map_progress(ground_truth_sample, run_agent)

Check the final costs:

In [None]:
calculate_cost('gpt-4o-mini', all_results)

Process and save the results:

In [None]:
rows = []

for q, result in all_results:
    if result is None:
        continue

    row = process_result(q, result)
    rows.append(row)


And save it for the next step:

In [None]:
with open('sample_eval_rows.bin', 'wb') as f_out:
    pickle.dump(rows, f_out)

## Summary and next steps

In this lesson we ran the agent and collected its output. We also identified a few problems and fixed them.

Because we ran the agent against many queries, common problems were easy to spot.

Now we can use the data we collected to do the actual evaluation. We'll do that in the next lesson.