# Project: Ambient Agents with LangGraph - Module 3: Agent Evaluations

In [1]:
from dotenv import load_dotenv
load_dotenv()

%load_ext autoreload
%autoreload 2

# Evaluating Agents

In the previous module, we built an email assistant that uses a router to triage emails and then passes each email to the agent for response generation. The next step is to evaluate how well it works in production. Testing guides our decisions about the agent architecture with quantifiable metrics such as response quality, token usage, latency, and triage accuracy.

# How to run evaluations

## Pytest / Vitest

Pytest and Vitest are popular testing frameworks for Python and JavaScript/TypeScript, respectively. 

LangSmith integrates with these frameworks to allow us to write and run tests that log results to LangSmith. In this module, we will use Pytest for Python examples.

## LangSmith Datasets

We can also create a dataset in LangSmith and run our agent against the dataset using the LangSmith evaluate API.

# Test cases

We will define a set of example emails we want to handle along with a few things to test. The test cases are in `src/email_assistant/eval/email_dataset.py` and contain the following:
- **Input Emails**: A collection of diverse email examples
- **Ground Truth Classifications**: `Respond`, `Notify`, `Ignore`
- **Expected Tool Calls**: Tools called for each email that requires a response
- **Response Criteria**: What makes a good response for emails requiring a response

Note that we need to have both
- End-to-end "integration" tests (e.g., Input Emails -> Agent -> Final Output VS Response Criteria)
- Tests for specific steps in our workflow (e.g., Input Emails -> Agent -> Classification VS Ground Truth Classifications)
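To make the two test levels concrete, here is a minimal sketch of which fields each kind of test consumes. The `EmailTestCase` container and both helper functions are hypothetical and are not part of `email_dataset.py`, which stores these values as parallel lists.

```python
from dataclasses import dataclass


@dataclass
class EmailTestCase:
    """One row of the test set (hypothetical container for illustration)."""
    email_input: dict
    triage_output: str           # ground truth: "respond", "notify", or "ignore"
    expected_tool_calls: list
    response_criteria: str


def step_test_fields(case: EmailTestCase) -> tuple:
    """A triage-step test only needs the email and its ground-truth label."""
    return case.email_input, case.triage_output


def end_to_end_test_fields(case: EmailTestCase) -> tuple:
    """An end-to-end test needs the email and the criteria for the final output."""
    return case.email_input, case.response_criteria


# Sample values loosely based on the first dataset example
case = EmailTestCase(
    email_input={"subject": "Quick question about API documentation"},
    triage_output="respond",
    expected_tool_calls=["write_email", "done"],
    response_criteria="Acknowledge the question and confirm it will be investigated",
)
```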

In [5]:
from email_assistant.eval.email_dataset import email_inputs, expected_tool_calls, triage_outputs_list, response_criteria_list

test_case_idx = 0

print("Email Input:", email_inputs[test_case_idx])
print("Expected Triage Output:", triage_outputs_list[test_case_idx])
print("Expected Tool Calls:", expected_tool_calls[test_case_idx])
print("Response Criteria:", response_criteria_list[test_case_idx])

Email Input: {'author': 'Alice Smith <alice.smith@company.com>', 'to': 'Lance Martin <lance@company.com>', 'subject': 'Quick question about API documentation', 'email_thread': "Hi Lance,\n\nI was reviewing the API documentation for the new authentication service and noticed a few endpoints seem to be missing from the specs. Could you help clarify if this was intentional or if we should update the docs?\n\nSpecifically, I'm looking at:\n- /auth/refresh\n- /auth/validate\n\nThanks!\nAlice"}
Expected Triage Output: respond
Expected Tool Calls: ['write_email', 'done']
Response Criteria: 
• Send email with write_email tool call to acknowledge the question and confirm it will be investigated  



# Pytest Example

In [7]:
import pytest
from email_assistant.eval.email_dataset import email_inputs, expected_tool_calls
from email_assistant.utils import format_messages_string, extract_tool_calls
from email_assistant.email_assistant import email_assistant

from langsmith import testing as t

In [8]:
@pytest.mark.langsmith
@pytest.mark.parametrize(
    'email_input, expected_calls',
    [
        # Pick some examples with email reply expected
        (email_inputs[0], expected_tool_calls[0]),
        (email_inputs[3], expected_tool_calls[3]),
    ]
)
def test_email_dataset_tool_calls(email_input, expected_calls):
    """Test if email processing contains expected tool calls.
    
    This test confirms that all expected tools are called during email processing,
    but does not check the order of tool invocations or the number of invocations
    per tool. Additional checks for these aspects could be added if desired.
    """
    # Run the email assistant
    messages = [{'role': 'user', 'content': str(email_input)}]
    result = email_assistant.invoke({'messages': messages})

    # Extract tool calls from messages list
    extracted_tool_calls = extract_tool_calls(result['messages'])

    # Check if all expected tool calls are in the extracted ones
    missing_calls = [
        call for call in expected_calls if call.lower() not in extracted_tool_calls
    ]

    t.log_outputs({
        'missing_calls': missing_calls,
        'extracted_tool_calls': extracted_tool_calls,
        'response': format_messages_string(result['messages']),
    })

    # Test passes if no expected calls are missing
    assert len(missing_calls) == 0

To run with Pytest and log test results to LangSmith, we only need to add the `@pytest.mark.langsmith` decorator to our test function and place it in a file named `test_tools.py` in the same directory as this notebook.

We can pass dataset examples to the test function via `@pytest.mark.parametrize`, as shown above.

We can run the test from the command line with:
```bash
LANGSMITH_TEST_SUITE='Email assistant: Test Tools For Interrupt' pytest ./test_tools.py
```

After that, we can view the results in the LangSmith UI. The `assert len(missing_calls) == 0` is logged to the `Pass` column in LangSmith.
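The docstring of the test above notes that it does not check the order or the number of tool invocations. If we wanted those checks, one possible sketch looks like this. Both helpers are hypothetical and are not part of `email_assistant.utils`. They assume tool calls are given as a flat list of tool names.

```python
from collections import Counter


def calls_in_order(expected: list, actual: list) -> bool:
    """Return True if `expected` appears in `actual` as an ordered subsequence."""
    remaining = iter(name.lower() for name in actual)
    # `in` consumes the iterator up to the match, so later names must come later
    return all(name.lower() in remaining for name in expected)


def call_count_mismatches(expected_counts: dict, actual: list) -> dict:
    """Map each tool whose call count differs to a (wanted, observed) pair."""
    observed = Counter(name.lower() for name in actual)
    return {
        name: (wanted, observed[name])
        for name, wanted in expected_counts.items()
        if observed[name] != wanted
    }
```

Inside the test, these could back extra assertions such as `assert calls_in_order(expected_calls, extracted_tool_calls)`.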

# LangSmith Datasets Example

In the previous example with Pytest, we evaluated the tool-calling accuracy of the email assistant. The dataset we evaluate now targets the triage step specifically: classifying whether an email requires a response.

## Dataset definition

We can create a dataset in LangSmith with the LangSmith SDK:

In [10]:
from langsmith import Client

from email_assistant.eval.email_dataset import examples_triage

# Initialize LangSmith client
client = Client()

# Dataset name
dataset_name = "Email Triage Evaluation"

# Create dataset if it doesn't exist
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="A dataset of emails and their triage decisions."
    )
    # Add examples to the dataset
    client.create_examples(
        dataset_id=dataset.id,
        examples=examples_triage
    )

## Target function

The dataset has the following structure, with an email input and a ground truth triage classification for the email as output:

```python
    examples_triage = [
        {
            "inputs": {"email_input": email_input_1},
            "outputs": {"classification": triage_output_1},   # NOTE: This becomes the reference_output in the created dataset
        }, ...
    ]
```
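As a minimal sketch, a list in this shape could be built from two parallel lists. The helper name `build_triage_examples` and the sample values are hypothetical. The actual `examples_triage` is defined in `email_dataset.py`.

```python
def build_triage_examples(email_inputs: list, triage_outputs: list) -> list:
    """Pair each email with its ground-truth label in the inputs/outputs shape."""
    if len(email_inputs) != len(triage_outputs):
        raise ValueError("email_inputs and triage_outputs must have the same length")
    return [
        {
            "inputs": {"email_input": email},
            "outputs": {"classification": label},
        }
        for email, label in zip(email_inputs, triage_outputs)
    ]


# Hypothetical sample data
sample_examples = build_triage_examples(
    [{"subject": "Quick question about API documentation"}, {"subject": "Newsletter"}],
    ["respond", "ignore"],
)
```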

In [11]:
print("Dataset Example Input (inputs):", examples_triage[0]['inputs'])

Dataset Example Input (inputs): {'email_input': {'author': 'Alice Smith <alice.smith@company.com>', 'to': 'Lance Martin <lance@company.com>', 'subject': 'Quick question about API documentation', 'email_thread': "Hi Lance,\n\nI was reviewing the API documentation for the new authentication service and noticed a few endpoints seem to be missing from the specs. Could you help clarify if this was intentional or if we should update the docs?\n\nSpecifically, I'm looking at:\n- /auth/refresh\n- /auth/validate\n\nThanks!\nAlice"}}


In [12]:
print("Dataset Example Reference Output (reference_outputs):", examples_triage[0]['outputs'])

Dataset Example Reference Output (reference_outputs): {'classification': 'respond'}


We will define a function that takes the dataset inputs and passes them to our email assistant.

In [None]:
def target_email_assistant(inputs: dict) -> dict:
    """Process an email through the workflow"""
    response = email_assistant.nodes['triage_router'].invoke({
        'email_input': inputs['email_input']
    })
    return {
        'classification_decision': response.update['classification_decision'],
    }

## Evaluator function

We will create an evaluator function to compare
- Reference outputs: `"reference_outputs": {'classification': triage_output_1} ...`
- Agent outputs: `"outputs": {'classification_decision': agent_output_1} ...`

We want to evaluate if the agent's output matches the reference output.

In [None]:
def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the reference output."""
    return outputs['classification_decision'].lower() == reference_outputs['classification'].lower()
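A boolean return value is the simplest option. To our knowledge, LangSmith custom evaluators may also return a dictionary with fields such as `key`, `score`, and `comment`, which lets us name the metric and record why an example failed. The variant below is a sketch under that assumption, so check the SDK documentation for the supported fields. The function name and the metric key `classification_match` are our own choices.

```python
def classification_evaluator_with_comment(outputs: dict, reference_outputs: dict) -> dict:
    """Compare labels case-insensitively and explain the result in a comment."""
    # .get() avoids a KeyError if the target function returned no decision
    predicted = str(outputs.get("classification_decision", "")).strip().lower()
    expected = str(reference_outputs.get("classification", "")).strip().lower()
    return {
        "key": "classification_match",   # assumed metric name
        "score": predicted == expected,
        "comment": f"predicted={predicted!r}, expected={expected!r}",
    }
```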

## Running evaluation

The evaluate API takes care of the rest. It passes the `inputs` dict from our dataset to the target function, then passes the target function's `outputs` together with the `reference_outputs` dict from our dataset to the evaluator function.

This is similar to what we did with Pytest, where we passed the dataset example inputs and reference outputs to the test function with `@pytest.mark.parametrize`.

In [None]:
# Set to true if we want to kick off evaluation
run_expt = True
if run_expt:
    experiment_results_workflow = client.evaluate(
        # Run agent 
        target_email_assistant,
        # Dataset name   
        data=dataset_name,
        # Evaluator
        evaluators=[classification_evaluator],
        # Name of the experiment
        experiment_prefix="Email assistant workflow", 
        # Number of concurrent evaluations
        max_concurrency=2, 
    )
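To make the data flow concrete, here is a simplified local loop that mimics how inputs, outputs, and reference outputs are wired together. It is an illustration only, not how LangSmith implements `evaluate`. `stub_target` is a hypothetical keyword rule standing in for the real triage router, and the examples are made up.

```python
def stub_target(inputs: dict) -> dict:
    """Stand-in for the triage step: a keyword rule, NOT the real router."""
    subject = inputs["email_input"]["subject"].lower()
    decision = "respond" if "question" in subject else "ignore"
    return {"classification_decision": decision}


def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:
    """Same exact-match check as the evaluator defined earlier."""
    return outputs["classification_decision"].lower() == reference_outputs["classification"].lower()


def run_local_evaluation(target, examples: list, evaluators: list) -> list:
    """Run target on each example's inputs, then score against its reference."""
    rows = []
    for example in examples:
        outputs = target(example["inputs"])
        scores = {
            evaluator.__name__: evaluator(outputs=outputs, reference_outputs=example["outputs"])
            for evaluator in evaluators
        }
        rows.append({"outputs": outputs, "scores": scores})
    return rows


examples = [
    {"inputs": {"email_input": {"subject": "Quick question about API documentation"}},
     "outputs": {"classification": "respond"}},
    {"inputs": {"email_input": {"subject": "Weekly newsletter"}},
     "outputs": {"classification": "ignore"}},
    {"inputs": {"email_input": {"subject": "Build server is down"}},
     "outputs": {"classification": "notify"}},
]

rows = run_local_evaluation(stub_target, examples, [classification_evaluator])
accuracy = sum(row["scores"]["classification_evaluator"] for row in rows) / len(rows)
```

The stub has no rule for `notify`, so the third example fails and the accuracy reflects two correct classifications out of three.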

# LLM-as-Judge evaluation

We have shown unit tests for the triage step (using evaluate()) and tool calling (using Pytest).

Now, we will use an LLM as a judge to evaluate our agent's execution against a set of success criteria.

First, we define a structured output schema for our LLM grader that contains a grade and justification for the grade.

In [None]:
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model


class CriteriaGrade(BaseModel):
    """Score the response against specific criteria."""
    justification: str = Field(
        description="The justification for the grade and score, including specific examples from the response."
    )
    grade: bool = Field(
        description="Does the response meet the provided criteria?"
    )


# Create a global LLM for evaluation to avoid recreating it for each test
criteria_eval_llm = init_chat_model('openai:gpt-4o')
criteria_eval_structured_llm = criteria_eval_llm.with_structured_output(CriteriaGrade)

In [None]:
email_input = email_inputs[0]
print("Email Input:", email_input)
success_criteria = response_criteria_list[0]
print("Success Criteria:", success_criteria)

Our email assistant is invoked with the email input and the response is formatted into a string. These are all then passed to the LLM grader to receive a grade and justification for the grade.

In [None]:
response = email_assistant.invoke({"email_input": email_input})

In [None]:
from email_assistant.eval.prompts import RESPONSE_CRITERIA_SYSTEM_PROMPT

all_messages_str = format_messages_string(response['messages'])
eval_result = criteria_eval_structured_llm.invoke([
        {"role": "system",
            "content": RESPONSE_CRITERIA_SYSTEM_PROMPT},
        {"role": "user",
            "content": f"""\n\n Response criteria: {success_criteria} \n\n Assistant's response: \n\n {all_messages_str} \n\n Evaluate whether the assistant's response meets the criteria and provide justification for your evaluation."""}
    ])

eval_result
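Since the same grader prompt is reused for every test case, the message construction can be factored into a small helper. This is a sketch: `build_grader_messages` is a hypothetical name, and the system prompt below is a placeholder rather than the real `RESPONSE_CRITERIA_SYSTEM_PROMPT`.

```python
def build_grader_messages(system_prompt: str, criteria: str, response_text: str) -> list:
    """Assemble the system and user messages sent to the LLM grader."""
    user_content = (
        f"Response criteria: {criteria}\n\n"
        f"Assistant's response:\n\n{response_text}\n\n"
        "Evaluate whether the assistant's response meets the criteria "
        "and provide justification for your evaluation."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


# Placeholder inputs for illustration
grader_messages = build_grader_messages(
    system_prompt="You grade email assistant responses.",
    criteria="Acknowledge the question and confirm it will be investigated",
    response_text="Hi Alice, thanks for flagging this. I will look into it.",
)
```

The resulting list could then be passed to `criteria_eval_structured_llm.invoke(...)` in place of the inline messages.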

In [None]:
RESPONSE_CRITERIA_SYSTEM_PROMPT

# Running against a larger test suite

Now that we have seen how to evaluate our agent using Pytest and evaluate(), and seen an example of using an LLM as a judge, we can run evaluations over a bigger test suite to get a better sense of how our agent performs across a wider variety of examples.

We can run our `email_assistant` against a larger test suite by running

```bash
    LANGSMITH_TEST_SUITE='Email assistant: Test Full Response Interrupt' \
    LANGSMITH_EXPERIMENT='email_assistant' pytest tests/test_response.py --agent-module email_assistant
```

In `test_response.py`, we pass our dataset examples into test functions that Pytest runs and logs to our `LANGSMITH_TEST_SUITE`:

```python
    # Reference output key
    @pytest.mark.langsmith(output_keys=["criteria"])
    # Variable names and a list of tuples with the test cases
    # Each test case is (email_input, email_name, criteria, expected_calls)
    @pytest.mark.parametrize("email_input,email_name,criteria,expected_calls",create_response_test_cases())
    def test_response_criteria_evaluation(email_input, email_name, criteria, expected_calls):
```
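The body of `create_response_test_cases()` is not shown here. One plausible shape, sketched under the assumption that only emails with a ground-truth triage of `respond` have response criteria worth grading, is below. The name `build_response_test_cases`, its parameters, and the sample data are hypothetical.

```python
def build_response_test_cases(email_inputs, email_names, criteria_list,
                              triage_outputs, tool_calls) -> list:
    """Return (email_input, email_name, criteria, expected_calls) tuples.

    Assumption: emails triaged as 'notify' or 'ignore' are skipped because
    they produce no response to grade.
    """
    cases = []
    for email, name, criteria, triage, calls in zip(
        email_inputs, email_names, criteria_list, triage_outputs, tool_calls
    ):
        if triage.lower() == "respond":
            cases.append((email, name, criteria, calls))
    return cases


# Hypothetical sample data
response_cases = build_response_test_cases(
    email_inputs=[{"subject": "API docs question"}, {"subject": "Newsletter"}],
    email_names=["email_input_1", "email_input_2"],
    criteria_list=["Acknowledge the question", "No response needed"],
    triage_outputs=["respond", "ignore"],
    tool_calls=[["write_email", "done"], []],
)
```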

We will use LLM-as-judge with a grading schema:
```python
    class CriteriaGrade(BaseModel):
        """Score the response against specific criteria."""
        grade: bool = Field(description="Does the response meet the provided criteria?")
        justification: str = Field(description="The justification for the grade and score, including specific examples from the response.")
```

We will evaluate the agent response relative to the criteria:

```python
    # Evaluate against criteria
    eval_result = criteria_eval_structured_llm.invoke([
        {"role": "system",
            "content": RESPONSE_CRITERIA_SYSTEM_PROMPT},
        {"role": "user",
            "content": f"""\n\n Response criteria: {criteria} \n\n Assistant's response: \n\n {all_messages_str} \n\n Evaluate whether the assistant's response meets the criteria and provide justification for your evaluation."""}
    ])
```

## Getting results

We can get the results of our evaluation by reading the tracing project associated with our experiment:

In [None]:
# TODO: Copy our experiment name here
experiment_name = "email_assistant:8286b3b8"
# Set this to load expt results
load_expt = False
if load_expt:
    email_assistant_experiment_results = client.read_project(project_name=experiment_name, include_stats=True)
    print("Latency p50:", email_assistant_experiment_results.latency_p50)
    print("Latency p99:", email_assistant_experiment_results.latency_p99)
    print("Token Usage:", email_assistant_experiment_results.total_tokens)
    print("Feedback Stats:", email_assistant_experiment_results.feedback_stats)
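For intuition about the latency statistics, p50 is the median run latency and p99 is the value that 99% of runs stay at or below. The sketch below computes both with the nearest-rank method on made-up latencies in seconds. LangSmith may use a different percentile definition, so small differences from the values it reports are possible.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-run latencies in seconds
latencies = [1.2, 0.8, 3.5, 1.0, 9.7]
latency_p50 = percentile(latencies, 50)
latency_p99 = percentile(latencies, 99)
```

With only five runs, p99 is simply the slowest run, which is why tail latencies are more informative on larger test suites.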