# Evaluating our Agent

We've written an agentic workflow that uses a router to triage the email and then passes the email to the agent for response generation. How can we be sure that it will work well in production? This is why testing is important: it guides our decisions about our agent architecture with quantifiable metrics like response quality, token usage, latency, or triage accuracy. [LangSmith](https://docs.smith.langchain.com/) offers two primary ways to test agents: a Pytest integration and datasets run through the evaluate API.

![overview-img](img/overview_eval.png)

## Test Approaches 

### Pytest

[Pytest](https://docs.pytest.org/en/stable/) is well known to many developers as a powerful tool for writing tests within the Python ecosystem. LangSmith integrates with Pytest so that we can write tests, run them against each assistant, and log the results to LangSmith. Pytest is a great way to get started quickly with a framework you're already familiar with.

### LangSmith Datasets 

You can also create a dataset [in LangSmith](https://docs.smith.langchain.com/evaluation) and run each assistant against the dataset using the LangSmith evaluate API. LangSmith datasets are great for teams who are collaboratively building out their test suite. You can leverage production traces, annotation queues, and more, to add examples to an ever-growing golden dataset.

## Test Cases

Testing often starts with defining the test cases, which can be a challenging process. In this case, we'll just define a set of example emails we want to handle along with a few things to test. You can see the test cases in `eval/email_dataset.py`, which contains the following:

1. **Input Emails**: A collection of diverse email examples
2. **Ground Truth Classifications**: `Respond`, `Notify`, `Ignore`
3. **Expected Tool Calls**: Tools called for each email that requires a response
4. **Response Criteria**: What makes a good response for emails requiring replies

## Pytest Example

Here's a simple example of testing using Pytest. 

We will test whether our `email_assistant` makes the right tool calls when responding to the emails.

In [None]:
%cd ..
%load_ext autoreload
%autoreload 2

In [None]:
import pytest
from eval.email_dataset import email_inputs, expected_tool_calls
from email_assistant.email_assistant import email_assistant
from email_assistant.utils import extract_tool_calls, format_messages_string

from langsmith import testing as t
from dotenv import load_dotenv

load_dotenv(".env", override=True)

@pytest.mark.langsmith
@pytest.mark.parametrize(
    "email_input, expected_calls",
    [   # Pick some examples with e-mail reply expected
        (email_inputs[0],expected_tool_calls[0]),
        (email_inputs[3],expected_tool_calls[3]),
    ],
)
def test_email_dataset_tool_calls(email_input, expected_calls):
    """Test if email processing contains expected tool calls."""
    # Run the email assistant
    messages = [{"role": "user", "content": str(email_input)}]
    result = email_assistant.invoke({"messages": messages})
            
    # Extract tool calls from messages list
    extracted_tool_calls = extract_tool_calls(result['messages'])
            
    # Check if all expected tool calls are in the extracted ones
    missing_calls = [call for call in expected_calls if call.lower() not in extracted_tool_calls]
    
    # Log the outputs so they show up alongside the test result in LangSmith
    t.log_outputs({
        "missing_calls": missing_calls,
        "extracted_tool_calls": extracted_tool_calls,
        "response": format_messages_string(result['messages']),
    })

    # Test passes if no expected calls are missing
    assert len(missing_calls) == 0


You'll notice a few things. First, to [run with Pytest and log test results to LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest), we only need to add the `@pytest.mark.langsmith` decorator to our test function and place it in a file, as in `notebooks/test_tools.py`. Second, we can pass dataset examples to the test function via `@pytest.mark.parametrize`, as shown [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest#parametrize-with-pytestmarkparametrize). We can then run the test from the command line (the leading `!` is only needed when running the command from a notebook cell). From the project root, run:

```
! LANGSMITH_TEST_SUITE='Email assistant: Test Tools'  pytest notebooks/test_tools.py
```

We can view the results in the LangSmith UI. The outcome of `assert len(missing_calls) == 0` is logged to the `Pass` column. The values passed to `log_outputs` appear in the `Outputs` column, and the test function's arguments appear in the `Inputs` column. Each input passed via `@pytest.mark.parametrize` is logged as a separate row under the project named by `LANGSMITH_TEST_SUITE`, which is found under `Datasets & Experiments`.

![Test Results](img/test_result.png)
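
The test above relies on `extract_tool_calls`, whose implementation is not shown in this notebook. As a rough illustration of the missing-call check, here is a minimal sketch that assumes dict-style messages carrying a `tool_calls` list. The helper names `extract_tool_call_names` and `find_missing_calls`, the message shape, and the tool names are all hypothetical and may differ from the real helper in `email_assistant.utils`.

```python
from typing import Any


def extract_tool_call_names(messages: list[dict[str, Any]]) -> list[str]:
    """Collect lower-cased tool names from dict-style messages (assumed shape)."""
    names = []
    for message in messages:
        # Assumption: assistant messages carry a "tool_calls" list of {"name": ...} dicts.
        for call in message.get("tool_calls") or []:
            names.append(call["name"].lower())
    return names


def find_missing_calls(expected: list[str], extracted: list[str]) -> list[str]:
    """Return the expected tool names that never appear in the extracted list."""
    return [name for name in expected if name.lower() not in extracted]


# Hypothetical conversation: the agent checked a calendar but never sent the email.
messages = [
    {"role": "user", "content": "Can we meet on Tuesday?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"name": "check_calendar_availability", "args": {}}],
    },
    {"role": "tool", "content": "Tuesday 2pm is free"},
]

extracted = extract_tool_call_names(messages)
missing = find_missing_calls(["check_calendar_availability", "write_email"], extracted)
print(missing)
```

A non-empty `missing` list is what would make the assertion in the test fail, and logging it makes the failure easy to diagnose in LangSmith.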

## LangSmith Datasets 

### Dataset Definition 

In addition to the Pytest approach, we can also [create a dataset in LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_programmatically#create-a-dataset) with the LangSmith SDK. The cell below creates a dataset from the test cases in the `eval/email_dataset.py` file.

In [18]:
from langsmith import Client
import matplotlib.pyplot as plt

from eval.email_dataset import examples_triage

# Initialize LangSmith client
client = Client()

# Dataset name
dataset_name = "Interrupt Workshop: E-mail Triage Dataset"

# Create dataset if it doesn't exist
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(
        dataset_name=dataset_name, 
        description="A dataset of e-mails and their triage decisions."
    )
    # Add examples to the dataset
    client.create_examples(dataset_id=dataset.id, examples=examples_triage)

### Run Agents 

The dataset has the following structure, with an e-mail input and a ground truth classification for the e-mail as output.

In [None]:
# NOTE: This is just an example, this cell won't run
examples_triage = [
  {
      "inputs": {"email_input": email_input_1},
      "outputs": {"classification": triage_output_1},
  }, ...
]
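
To make the structure concrete, here is a small runnable sketch of how such a list could be assembled from parallel lists of inputs and labels. The email contents and the `triage_outputs` name are hypothetical stand-ins, not the actual contents of `eval/email_dataset.py`.

```python
# Hypothetical stand-ins for the data kept in eval/email_dataset.py.
email_inputs = [
    {
        "author": "alice@example.com",
        "subject": "Question about API docs",
        "email_thread": "Could you clarify the /users endpoint?",
    },
    {
        "author": "noreply@example.com",
        "subject": "Weekly newsletter",
        "email_thread": "This week's top stories.",
    },
]
triage_outputs = ["respond", "ignore"]

# Guard against silently dropping examples when the lists drift out of sync.
if len(email_inputs) != len(triage_outputs):
    raise ValueError("Each email input needs exactly one ground truth label")

# Pair each input with its ground truth label in the inputs/outputs layout.
examples_triage = [
    {"inputs": {"email_input": email}, "outputs": {"classification": label}}
    for email, label in zip(email_inputs, triage_outputs)
]

print(examples_triage[1]["outputs"])
```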

We define functions that take dataset inputs and pass them to each agent we want to evaluate. The function just takes the `inputs` dict from the dataset and passes it to the agent. It returns a dict with the agent's output. Here, we specifically look to evaluate the `email_assistant`'s triage classification decision.

In [19]:
def target_email_assistant(inputs: dict) -> dict:
    """Process an email through the workflow-based email assistant."""
    response = email_assistant.invoke({"email_input": inputs["email_input"]})
    return {"classification_decision": response['classification_decision']}

The LangSmith [evaluate API](https://docs.smith.langchain.com/evaluation) passes the `inputs` dict to this function. 

### Evaluator Function 

We also create an evaluator function. What do we want to evaluate? We have reference outputs in our dataset and agent outputs defined in the functions above.

* Reference outputs: `"outputs": {"classification": triage_output_1} ...`
* Agent outputs: `"outputs": {"classification_decision": agent_output_1} ...`

We want to evaluate whether the agent's output matches the reference output. So we simply need an evaluator function that compares the two, where `outputs` is the agent's output and `reference_outputs` is the reference output from the dataset.

In [20]:
def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["classification_decision"].lower() == reference_outputs["classification"].lower()
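
As a quick sanity check, the evaluator can be called directly on hand-written dicts before wiring it into an experiment. The function is repeated here only so the snippet runs on its own, and the sample values are made up for illustration.

```python
def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:
    """Same comparison as above, repeated so this snippet is self-contained."""
    return outputs["classification_decision"].lower() == reference_outputs["classification"].lower()


# Casing differences between the agent output and the reference are ignored.
match = classification_evaluator(
    {"classification_decision": "RESPOND"},
    {"classification": "respond"},
)

# A genuinely different label is reported as a mismatch.
mismatch = classification_evaluator(
    {"classification_decision": "notify"},
    {"classification": "ignore"},
)

print(match, mismatch)
```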

### Running Evaluation

Now, the question is: how are these things hooked together? The evaluate API takes care of it for us. It passes the `inputs` dict from our dataset to the target function. It passes the `outputs` dict from our dataset to the evaluator function as `reference_outputs`. And it passes the output of our agent to the evaluator function as `outputs`. Note that this is similar to what we did with Pytest, where we passed the dataset example inputs and reference outputs to the test function with `@pytest.mark.parametrize`.

![overview-img](img/eval_detail.png)
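
Conceptually, the wiring described above amounts to a simple loop. The sketch below is a local, simplified illustration of that data flow and not how LangSmith implements `evaluate`: it does no tracing, logging, or concurrency. The names `run_local_evaluation` and `stub_target` are hypothetical, and the stub stands in for the real agent so the example runs without any API keys.

```python
from typing import Callable


def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:
    """Case-insensitive exact match, mirroring the evaluator in this section."""
    return outputs["classification_decision"].lower() == reference_outputs["classification"].lower()


def stub_target(inputs: dict) -> dict:
    """Hypothetical stand-in for the agent: anything with a question mark gets a reply."""
    decision = "respond" if "?" in inputs["email_input"] else "ignore"
    return {"classification_decision": decision}


def run_local_evaluation(
    target: Callable[[dict], dict],
    evaluators: list[Callable[[dict, dict], bool]],
    examples: list[dict],
) -> list[dict]:
    """Feed dataset inputs to the target, then score its outputs against the references."""
    rows = []
    for example in examples:
        # Dataset "inputs" go to the target function.
        outputs = target(example["inputs"])
        # Agent outputs and dataset "outputs" (the references) go to each evaluator.
        scores = {ev.__name__: ev(outputs, example["outputs"]) for ev in evaluators}
        rows.append({"outputs": outputs, "scores": scores})
    return rows


examples = [
    {"inputs": {"email_input": "Can you review my PR?"}, "outputs": {"classification": "respond"}},
    {"inputs": {"email_input": "50% off all items today"}, "outputs": {"classification": "ignore"}},
    {"inputs": {"email_input": "Build failed on main"}, "outputs": {"classification": "notify"}},
]

rows = run_local_evaluation(stub_target, [classification_evaluator], examples)
accuracy = sum(row["scores"]["classification_evaluator"] for row in rows) / len(rows)
print(f"accuracy: {accuracy:.2f}")
```

The stub never predicts `notify`, so the third example is scored as a miss. This is the kind of per-example signal that the real experiment surfaces in the LangSmith UI.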

In [21]:
experiment_results_workflow = client.evaluate(
    # Run agent 
    target_email_assistant,
    # Dataset name   
    data=dataset_name,
    # Evaluator
    evaluators=[classification_evaluator],
    # Name of the experiment
    experiment_prefix="E-mail assistant workflow", 
    # Number of concurrent evaluations
    max_concurrency=2, 
)

View the evaluation results for experiment: 'E-mail assistant workflow-92f462c6' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cdd3f95a-aaf7-4b19-8fc4-74bc5a7df870/compare?selectedSessions=909b3c47-7454-4b5d-9637-151f1dc6b674





📧 Classification: RESPOND - This email requires a response
🔔 Classification: NOTIFY - This email contains important information
📧 Classification: RESPOND - This email requires a response
📧 Classification: RESPOND - This email requires a response
🔔 Classification: NOTIFY - This email contains important information
🔔 Classification: NOTIFY - This email contains important information
🔔 Classification: NOTIFY - This email contains important information
📧 Classification: RESPOND - This email requires a response
📧 Classification: RESPOND - This email requires a response
📧 Classification: RESPOND - This email requires a response
🚫 Classification: IGNORE - This email can be safely ignored
🔔 Classification: NOTIFY - This email contains important information
📧 Classification: RESPOND - This email requires a response
📧 Classification: RESPOND - This email requires a response
🚫 Classification: IGNORE - This email can be safely ignored


We can view the experiment results in the LangSmith UI.

![Test Results](img/eval.png)

### Running against a Larger Test Suite

Now that we've seen how to evaluate our agent using Pytest and `evaluate()`, we can run evaluations over a larger test suite to get a better sense of how it performs across more examples.

Let's run our `email_assistant` against a larger test suite.
```
! LANGSMITH_TEST_SUITE='Email assistant: Test Full Response' LANGSMITH_EXPERIMENT='email_assistant' pytest tests/test_response.py --agent-module email_assistant
```

Now let's take a look at this experiment in the LangSmith UI and look into what our agent did well, and what it could improve on.

### Getting Results

We can also get the results of the evaluation programmatically by reading our experiment project. This is useful if we want to create our own visualizations of our agent's performance.

In [24]:
# TODO: Copy your experiment name here
experiment_name = "email_assistant:3c9967e4"
email_assistant_experiment_results = client.read_project(project_name=experiment_name, include_stats=True)

print("Latency p50:", email_assistant_experiment_results.latency_p50)
print("Latency p99:", email_assistant_experiment_results.latency_p99)
print("Token Usage:", email_assistant_experiment_results.total_tokens)
print("Feedback Stats:", email_assistant_experiment_results.feedback_stats)



Latency p50: 0:00:08.390000
Latency p99: 0:00:19.439500
Token Usage: 76879
Feedback Stats: {'pass': {'n': 16, 'avg': 0.9375, 'stdev': 0.24206145913796356, 'errors': 0, 'values': {}}}
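
The feedback stats can be turned into simple counts for custom reporting. The sketch below works on a copy of the dict printed above and assumes that `avg` is the mean of 0/1 pass scores, so `avg * n` gives the number of passing examples. The helper name `summarize_pass_rate` is hypothetical.

```python
# Copy of the feedback stats printed above.
feedback_stats = {
    "pass": {"n": 16, "avg": 0.9375, "stdev": 0.24206145913796356, "errors": 0, "values": {}}
}


def summarize_pass_rate(stats: dict, key: str = "pass") -> dict:
    """Derive pass/fail counts, assuming `avg` is the mean of binary scores."""
    entry = stats[key]
    total = entry["n"]
    # Round to absorb floating point noise when multiplying the mean by the count.
    passed = round(entry["avg"] * total)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": entry["avg"],
    }


summary = summarize_pass_rate(feedback_stats)
print(summary)
```

Under that assumption, the experiment above had 15 of 16 examples pass, leaving a single failure worth inspecting in the LangSmith UI.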
