# LLM Judges for Agent Evaluation 

We'll implement an LLM-based evaluation system to systematically assess our agent's performance using the results we collected in the previous lesson.

## Setting Up the Environment

First, let's load the data we have already prepared for evaluations:

In [None]:
import pickle

with open('sample_eval_rows.bin', 'rb') as f_in:
    rows = pickle.load(f_in)

## Initial Evaluation Approach

Let's start with a simple evaluation prompt:

In [None]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent’s answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user’s instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user’s question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
"""

And structured output classes:

In [None]:
from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    reasoning: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

We run it like any other Pydantic AI agent:

In [None]:
from pydantic_ai import Agent

eval_agent = Agent(
    name='eval_agent',
    model='gpt-5-mini',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)

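# run() is a coroutine; we await it later with a formatted user prompt:
# result = await eval_agent.run(user_prompt)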

This is a good start but has a potential problem: the model can hallucinate check names and create inconsistent evaluations. We need a more structured approach.

## Structured Evaluation with Pydantic

To ensure consistency and prevent hallucination, we'll enforce the evaluation structure with Enums:

In [None]:
from enum import Enum
from pydantic import BaseModel, Field

class CheckName(str, Enum):
    instructions_follow = "instructions_follow"
    instructions_avoid = "instructions_avoid" 
    answer_relevant = "answer_relevant"
    answer_clear = "answer_clear"
    answer_citations = "answer_citations"
    completeness = "completeness"
    tool_call_search = "tool_call_search"

CHECK_DESCRIPTIONS = {
    CheckName.instructions_follow: "The agent followed the user's instructions (in <INSTRUCTIONS>)",
    CheckName.instructions_avoid: "The agent avoided doing things it was told not to do",
    CheckName.answer_relevant: "The response directly addresses the user's question",
    CheckName.answer_clear: "The answer is clear and correct",
    CheckName.answer_citations: "The response includes proper citations or sources when required",
    CheckName.completeness: "The response is complete and covers all key aspects of the request",
    CheckName.tool_call_search: "Is the search tool invoked?"
}

class EvaluationCheck(BaseModel):
    check_name: CheckName = Field(description="The type of evaluation check")
    reasoning: str = Field(description="The reasoning behind the check result")
    check_pass: bool = Field(description="Whether the check passed (True) or failed (False)")
    
class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck] = Field(description="List of all evaluation checks")
    summary: str = Field(description="Evaluation summary")

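To see why the Enum helps, here is a minimal standard-library sketch. The `DemoCheckName` enum, the `parse_check_name` helper, and the misspelled name are made up for illustration. Converting a string to an Enum member fails for any value outside the allowed set, and Pydantic applies the same kind of membership check when it validates `check_name`, so a hallucinated name should surface as a validation error instead of silently creating a new check.

```python
from enum import Enum


class DemoCheckName(str, Enum):
    # A shortened, illustrative version of CheckName
    answer_relevant = "answer_relevant"
    answer_clear = "answer_clear"


def parse_check_name(raw: str):
    """Return the matching enum member, or None if the name is not allowed."""
    try:
        return DemoCheckName(raw)
    except ValueError:
        # Raised for any string that is not one of the enum values
        return None


valid = parse_check_name("answer_clear")      # a known check name
invalid = parse_check_name("answer_clarity")  # a plausible hallucinated name
print(valid is DemoCheckName.answer_clear, invalid)
```

With the plain `str` field from the first version, both names would have been accepted without complaint.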

Now we can generate the checklist instructions:

In [None]:
def generate_checklist_text():
    checklist_items = []
    for check_name in CheckName:
        description = CHECK_DESCRIPTIONS[check_name]
        checklist_items.append(f"- {check_name.value}: {description}")
    return "\n".join(checklist_items)


eval_instructions = f"""
Use this checklist to evaluate the quality of an AI agent’s answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

{generate_checklist_text()}

Output true/false for each check and provide a short explanation for your judgment.
"""


And use the generated instructions when creating the agent:

In [None]:
from pydantic_ai import Agent

eval_agent = Agent(
    name='eval_agent',
    model='gpt-5-mini',  # ideally a different model than the one used by the agent being evaluated
    instructions=eval_instructions,  # the checklist generated from CheckName above
    output_type=EvaluationChecklist
)

eval_agent.run(...)

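Note that the Enum only constrains which names are allowed. It does not by itself guarantee that every check appears exactly once in the output. A small helper like the hypothetical `find_coverage_problems` below (not part of the original code) could flag missing or duplicated checks before the results are aggregated. The `returned` list is a made-up example of an incomplete judge response.

```python
from collections import Counter

# The check names defined in CheckName, in checklist order
EXPECTED_CHECKS = [
    "instructions_follow",
    "instructions_avoid",
    "answer_relevant",
    "answer_clear",
    "answer_citations",
    "completeness",
    "tool_call_search",
]


def find_coverage_problems(returned_names, expected=EXPECTED_CHECKS):
    """Return (missing, duplicated) check names for one judge response."""
    counts = Counter(returned_names)
    missing = [name for name in expected if counts[name] == 0]
    duplicated = [name for name in expected if counts[name] > 1]
    return missing, duplicated


# Made-up response: one check repeated, several skipped
returned = ["instructions_follow", "answer_relevant", "answer_relevant", "answer_clear"]
missing, duplicated = find_coverage_problems(returned)
print("missing:", missing)
print("duplicated:", duplicated)
```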
## Running the Evals

Now let's prepare the user prompt for each of the records we will evaluate:

In [None]:
import json

user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

def format_prompt(rec):
    answer = rec['original_result'].output.format_article()
    
    logs = '\n'.join(json.dumps(l) for l in rec['messages'])
                     
    return user_prompt_format.format(
        instructions=agent._instructions,  # assumes `agent` (the agent being evaluated) is already defined in the session
        question=rec['question'],
        answer=answer,
        log=logs
    )


Let's run it! We'll use the same code for parallel execution:

In [None]:
import asyncio
from tqdm.auto import tqdm

async def map_progress(seq, f, max_concurrency=6):
    """Asynchronously map async function f over seq with progress bar."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(el):
        async with semaphore:
            return await f(el)

    # create one coroutine per element
    coros = [run(el) for el in seq]

    # iterate over the results as they finish (completion order, not input order)
    completed = asyncio.as_completed(coros)

    results = []

    for coro in tqdm(completed, total=len(seq)):
        result = await coro
        results.append(result)

    return results

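One detail worth knowing: `asyncio.as_completed` yields results in completion order, not input order. That is fine here as long as each result carries its own record, but if positional alignment with the input matters, a variant built on `asyncio.gather` keeps the original order. The sketch below uses a hypothetical `map_ordered` helper and a toy `slow_double` coroutine, and it leaves out the progress bar.

```python
import asyncio


async def map_ordered(seq, f, max_concurrency=6):
    """Map async function f over seq, returning results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(el):
        # Limit how many calls are in flight at once
        async with semaphore:
            return await f(el)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(run(el) for el in seq))


async def slow_double(x):
    # Earlier elements sleep longer, so they finish later
    await asyncio.sleep(0.01 * (5 - x))
    return x * 2


# In a notebook, use `await map_ordered(...)` instead of asyncio.run
ordered = asyncio.run(map_ordered([1, 2, 3, 4], slow_double, max_concurrency=2))
print(ordered)
```

The trade-off is that `gather` only returns once everything has finished, so a live progress bar needs extra work.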

Create a function we'll apply to each row:

In [None]:
async def evaluate_record(rec):
    user_prompt = format_prompt(rec)
    result = await eval_agent.run(user_prompt)
    return rec, result

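As written, a single failed request (a timeout, a rate limit, an output that fails validation) would propagate out of `map_progress` and abort the whole batch. One possible guard, sketched here with a hypothetical `safe` wrapper and a toy `flaky` coroutine, is to catch the exception and return `None` as the result. This is an assumption about how failures could be handled, not something the original code does, and any downstream loop would then need to skip `None` results the way `calculate_cost` does.

```python
import asyncio


def safe(f):
    """Wrap an async record function so failures yield (rec, None)."""
    async def wrapper(rec):
        try:
            return await f(rec)
        except Exception as exc:
            # Report the failure and keep the record so it can be retried later
            print(f"evaluation failed: {exc!r}")
            return rec, None
    return wrapper


async def flaky(rec):
    # Toy stand-in for evaluate_record that fails on one input
    if rec == "bad":
        raise RuntimeError("simulated API error")
    return rec, f"evaluated {rec}"


async def demo():
    guarded = safe(flaky)
    return [await guarded(rec) for rec in ["ok", "bad"]]


# In a notebook, use `await demo()` instead of asyncio.run
demo_results = asyncio.run(demo())
print(demo_results)
```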
And run the evals!

In [None]:
all_results = await map_progress(rows, evaluate_record)

Let's see how much it cost:

In [None]:
from toyaikit.pricing import PricingConfig
pricing = PricingConfig()

def calculate_cost(model, all_results):
    total_input = 0
    total_output = 0
    
    for q, result in all_results:
        if result is None:
            continue

        usage = result.usage()
        total_input = total_input + usage.input_tokens
        total_output = total_output + usage.output_tokens
    
    return pricing.calculate_cost(model, total_input, total_output)

calculate_cost('gpt-5-mini', all_results)

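Token-based pricing is generally a weighted sum of input and output token counts. As a sketch of that arithmetic, with a hypothetical `estimate_cost` function (the per-million-token prices and the token totals below are placeholders for illustration, not actual `gpt-5-mini` pricing or measured usage):

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimate cost in dollars from token counts and per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Placeholder numbers: 2M input tokens, 100k output tokens
example_cost = estimate_cost(
    input_tokens=2_000_000,
    output_tokens=100_000,
    input_price_per_million=0.5,   # placeholder price, not a real quote
    output_price_per_million=4.0,  # placeholder price, not a real quote
)
print(f"${example_cost:.2f}")
```

Judge prompts include the full log, so input tokens tend to dominate the evaluation cost.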

So the total cost of one evaluation run is $0.34:

- $0.12 for running the agent
- $0.22 for evaluating



## Displaying the Results

Now let's display the results (and save them too):

In [None]:
import pandas as pd

eval_results = []

for rec, result in all_results:
    eval_row = rec.copy()
    eval_result = result.output
    checks = {c.check_name.value: c.check_pass for c in eval_result.checklist}
    eval_row.update(checks)
    eval_row['summary'] = eval_result.summary
    eval_results.append(eval_row)


df_eval = pd.DataFrame(eval_results)

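The cell above builds the DataFrame but does not write anything to disk. A minimal way to persist the results, assuming pickle is acceptable here as it was for loading the input rows, might look like this. The `save_results` and `load_results` helpers, the file name `eval_results.bin`, and the toy rows are placeholders.

```python
import pickle
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write the list of evaluation rows to a pickle file."""
    with open(path, "wb") as f_out:
        pickle.dump(results, f_out)


def load_results(path):
    """Read the list of evaluation rows back from a pickle file."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


# Toy stand-in for eval_results
toy_results = [{"question": "q1", "answer_relevant": True, "summary": "ok"}]

# A temporary directory keeps the example self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "eval_results.bin"
    save_results(toy_results, path)
    restored = load_results(path)

print(restored == toy_results)
```

Saving the raw rows rather than only the averages makes it possible to re-aggregate later without paying for the judge again.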

Now let's compute the pass rate for each check:

In [None]:
eval_columns = [check_name.value for check_name in CheckName]
df_eval[eval_columns].mean()

We get something like this:

```text
instructions_follow    0.36
instructions_avoid     0.60
answer_relevant        0.98
answer_clear           0.50
answer_citations       0.58
completeness           0.10
tool_call_search       0.96
```

It's not perfect but at least we know where we stand now.

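A natural next step is to sort the pass rates so the weakest checks come first. The sketch below does that in plain Python using the numbers shown above; with the DataFrame, calling `sort_values()` on the result of `mean()` should give the same ordering. The choice of the bottom three is arbitrary.

```python
# Pass rates copied from the output above
pass_rates = {
    "instructions_follow": 0.36,
    "instructions_avoid": 0.60,
    "answer_relevant": 0.98,
    "answer_clear": 0.50,
    "answer_citations": 0.58,
    "completeness": 0.10,
    "tool_call_search": 0.96,
}

# Sort ascending by pass rate and keep the three weakest checks
weakest = sorted(pass_rates.items(), key=lambda item: item[1])[:3]

for name, rate in weakest:
    print(f"{name}: {rate:.2f}")
```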
## New Check: answer_match

In addition to the checks above, we can add another one: `answer_match`. We ask the judge to compare the agent's answer with the original article that the question was generated from. If the two match, the check should evaluate to true.

For that, we will need to add a way to fetch the original file. Let's create an index for that:

In [None]:
import docs

github_data = docs.read_github_data()
parsed_data = docs.parse_data(github_data)

file_index = {d['filename']: d['content'] for d in parsed_data}

Now we can use it to get the content of any file:

In [None]:
file_index['metrics/all_metrics.mdx']

Let's update the prompt:

In [None]:
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<REFERENCE>{reference}</REFERENCE>
<LOG>{log}</LOG>
""".strip()

def format_prompt(rec):
    original_filename = rec['original_question']['filename']
    reference = file_index[original_filename]

    answer = rec['original_result'].output.format_article()

    logs = '\n'.join(json.dumps(l) for l in rec['messages'])

    return user_prompt_format.format(
        instructions=agent._instructions,
        question=rec['question'],
        answer=answer,
        reference=reference,
        log=logs
    )


The opening of the evaluation instructions now also mentions the reference:

```text
Use this checklist to evaluate the quality of an AI agent’s answer
(<ANSWER>) to a user question (<QUESTION>). We also include the
entire log (<LOG>) for analysis. In <REFERENCE> you will see
the file from which the user question was generated.
```

And the new checklist item:

```text
- answer_match: The ANSWER is similar to the REFERENCE
```
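With the Enum-based setup from earlier, adding the check could amount to one new Enum member and one new description, after which the checklist text is regenerated from them. This is a sketch of how that might look, not necessarily the exact code used to produce the lesson's results:

```python
from enum import Enum


class CheckName(str, Enum):
    instructions_follow = "instructions_follow"
    instructions_avoid = "instructions_avoid"
    answer_relevant = "answer_relevant"
    answer_clear = "answer_clear"
    answer_citations = "answer_citations"
    completeness = "completeness"
    tool_call_search = "tool_call_search"
    answer_match = "answer_match"  # the new check


CHECK_DESCRIPTIONS = {
    CheckName.instructions_follow: "The agent followed the user's instructions (in <INSTRUCTIONS>)",
    CheckName.instructions_avoid: "The agent avoided doing things it was told not to do",
    CheckName.answer_relevant: "The response directly addresses the user's question",
    CheckName.answer_clear: "The answer is clear and correct",
    CheckName.answer_citations: "The response includes proper citations or sources when required",
    CheckName.completeness: "The response is complete and covers all key aspects of the request",
    CheckName.tool_call_search: "Is the search tool invoked?",
    CheckName.answer_match: "The ANSWER is similar to the REFERENCE",
}


def generate_checklist_text():
    """Render one '- name: description' line per check, in Enum order."""
    return "\n".join(
        f"- {check.value}: {CHECK_DESCRIPTIONS[check]}" for check in CheckName
    )


checklist_text = generate_checklist_text()
print(checklist_text)
```

Because the Pydantic models reference `CheckName`, they and the agent need to be re-created after the Enum changes so that the judge is allowed to return `answer_match`.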

## Summary

This systematic evaluation approach provides objective metrics to guide agent improvements and track progress over time.

The cost of $0.34 per evaluation cycle is also reasonable, so we can use this approach to tune our agent, for example to select the best chunking approach.