# Agent Evaluation

When evaluating agents, two metrics are especially important:

1. End-to-end success rate: does the agent achieve the desired final response or state? This could involve completing a task correctly or providing an accurate answer.

2. Stepwise performance: how much does your agent's trajectory deviate from a reference "ground truth" trajectory?

Computing the success rate (1) is comparatively simple: we can grade the agent's final output or state against the original input query or, if available, a labeled "ground truth".
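As a minimal sketch of this metric, assume each example pairs the agent's final answer with a labeled reference, and that a normalized exact match is an acceptable comparison. Both are assumptions made for illustration. `normalize` and `success_rate` are hypothetical helpers, and an LLM grader could replace the string comparison.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def success_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of examples whose final answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical final answers and labels: two of the three match after normalization.
predictions = ["45", "The difference is 14 degrees Fahrenheit", "unknown"]
references = ["45", "the difference is  14 degrees fahrenheit", "42"]
print(success_rate(predictions, references))
```

Each example contributes either 0 or 1 to the total, which is exactly the lack of smoothness discussed next.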

It has some drawbacks, though. It is not smooth: on complex tasks the agent either succeeds or fails, with no credit for "almost". That makes incremental improvements on a given dataset harder to measure, because examples tend to be either too hard or too easy. It is also less useful for evaluating copilots, which may take one or more steps before receiving human input; not everything needs to be fully autonomous.

Evaluating stepwise performance (2) is a bit more nuanced. You can approximate trajectory effectiveness with coarse evaluators like the agent trajectory evaluator in LangChain, but you can compute more appropriate metrics using an approach akin to [Teacher Forcing](https://en.wikipedia.org/wiki/Teacher_forcing): at each step, the agent is given the ground-truth history rather than its own earlier outputs, and only its next step is scored.

Here's the high-level algorithm:

#### Inputs
- `dataset`: Each example's inputs is an object containing a `trajectory`, a list of states (intermediate steps).
- `score_fn(predicted, ground_truth)`: A scoring function that computes a metric by comparing the predicted state to the ground truth state.
- `agent`: The agent graph to evaluate.

#### Algorithm Pseudocode

```python
from statistics import mean


def evaluate_agent(dataset, score_fn, agent):
    all_scores = []
    
    for example in dataset:
        trajectory = example['trajectory']
        example_scores = []
        
        for i in range(len(trajectory) - 1):
            current_state = trajectory[i]
            predicted_next_state = agent.invoke(current_state)
            ground_truth_next_state = trajectory[i + 1]
            # How well did the agent/copilot do on this step?
            step_score = score_fn(predicted_next_state, ground_truth_next_state)
            example_scores.append(step_score)
        avg_example_score = mean(example_scores)
        all_scores.append(avg_example_score)
    
    final_score = mean(all_scores)
    return final_score
```
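To make the pseudocode concrete, here is a self-contained toy run. The `CountingAgent` class, the integer "states", and the `exact_match` scorer are hypothetical stand-ins chosen so the loop can run without any model calls. They are not part of LangGraph or LangChain.

```python
from statistics import mean


class CountingAgent:
    """Hypothetical stand-in agent: predicts the next state by adding 1."""

    def invoke(self, state: int) -> int:
        return state + 1


def exact_match(predicted: int, ground_truth: int) -> float:
    return 1.0 if predicted == ground_truth else 0.0


def evaluate_agent(dataset, score_fn, agent) -> float:
    all_scores = []
    for example in dataset:
        trajectory = example["trajectory"]
        # Teacher forcing: every step starts from the ground-truth state,
        # so one early mistake does not derail the later steps.
        example_scores = [
            score_fn(agent.invoke(trajectory[i]), trajectory[i + 1])
            for i in range(len(trajectory) - 1)
        ]
        all_scores.append(mean(example_scores))
    return mean(all_scores)


toy_dataset = [
    {"trajectory": [0, 1, 2, 3]},  # the agent gets all 3 steps right
    {"trajectory": [5, 6, 8]},  # the agent gets 1 of 2 steps right
]
final_score = evaluate_agent(toy_dataset, exact_match, CountingAgent())
print(final_score)
```

The second example shows the smoothness benefit: the agent fails the last step but still earns partial credit for the step it got right.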


Let's implement this below, using a simple search agent as an example.

## Prerequisites

In [1]:
# %pip install -U langgraph langsmith langchain_openai tavily-python

In [2]:
import os

# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"

# os.environ["OPENAI_API_KEY"] = "YOUR OPENAI KEY"
# os.environ["TAVILY_API_KEY"] = "YOUR API KEY"

## Agent

We'll evaluate a simple agent that can call a search tool.

In [3]:
import json
from typing import List

from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.messages import AnyMessage, ToolMessage
from langchain_openai import ChatOpenAI

from langgraph.graph import END, MessageGraph
from langgraph.prebuilt.tool_executor import (
    ToolExecutor,
    ToolInvocation,
    create_tool_invocations,
)

tools = [TavilySearchResults(max_results=4)]
model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, streaming=True).bind_tools(
    tools
)


tool_executor = ToolExecutor(tools)


# Define the function that determines whether to continue or not
def should_continue(state: List[AnyMessage]):
    last_message = state[-1]
    if "tool_calls" not in last_message.additional_kwargs:
        return END
    else:
        return "action"


def call_tool(state: List[AnyMessage]):
    last_message = state[-1]
    actions = create_tool_invocations(last_message)
    responses = tool_executor.batch(actions)
    return [
        ToolMessage(tool_call_id=action.id, content=json.dumps(response))
        for action, response in zip(actions, responses)
    ]


# Define the actual graph
workflow = MessageGraph()

workflow.add_node("agent", model)
workflow.add_node("action", call_tool)

workflow.set_entry_point("agent")
workflow.add_conditional_edges(
    "agent",
    should_continue,
)

workflow.add_edge("action", "agent")
agent_graph = workflow.compile()

#### Example Usage

In [4]:
from langchain_core.messages import HumanMessage

for step in agent_graph.stream(
    [
        HumanMessage(
            content="What's the difference between temperatures in SF and LA right now?"
        )
    ]
):
    print(step)

{'agent': AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_QHPldugnmJiluIjlHv1vJt5f', 'function': {'arguments': '{"query": "current temperature in San Francisco"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'index': 1, 'id': 'call_8ZvIxZCYBbL0Ya3obQFKWIwp', 'function': {'arguments': '{"query": "current temperature in Los Angeles"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls'}, id='1bf8b424-9497-452b-bb39-82aa6a849c28')}
{'agent': AIMessage(content='The current temperature in San Francisco is 59°F and in Los Angeles, it is not specified in the search results. Would you like me to try to find the current temperature in Los Angeles again?', response_metadata={'finish_reason': 'stop'}, id='b461a87e-875e-448c-ac83-35b44543d48c')}


## Dataset

We'll make a dataset containing expected trajectories.

In [5]:
import random
import string

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage


# Utilities to generate usable fake trajectories
def generate_fake_tool_id():
    # Generate a fake OpenAI tool invocation ID
    prefix = "call_"
    length = 24
    characters = string.ascii_letters + string.digits
    random_string = "".join(random.choice(characters) for _ in range(length))
    return prefix + random_string


def _create_fake_turn(inputs: list, expected_outputs: list):
    assert len(expected_outputs) == len(inputs)
    tool_calls = [
        {
            "index": i,
            "id": generate_fake_tool_id(),
            "function": {
                "arguments": json.dumps(input_args),
                "name": "tavily_search_results_json",
            },
            "type": "function",
        }
        for i, input_args in enumerate(inputs)
    ]
    tool_messages = [
        ToolMessage(
            tool_call_id=tool_call["id"],
            content=json.dumps(response),
            additional_kwargs={"name": tool_call["function"]["name"]},
            # name=tool_call["function"]["name"],
        )
        for tool_call, response in zip(tool_calls, expected_outputs)
    ]
    return [
        AIMessage(content="", additional_kwargs={"tool_calls": tool_calls}),
        *tool_messages,
    ]


# Since we are teacher forcing here, we are essentially mocking the actual inputs + outputs
inputs = [
    {
        # Evaluate 4 steps here
        "trajectory": [
            HumanMessage(
                content="What was the age of the director of Dune 2's brother on March 1, 2024"
            ),
            *_create_fake_turn(
                [{"query": "dune 2 director"}],
                [{"results": ["The director is Denis Villeneuve"]}],
            ),
            *_create_fake_turn(
                [{"query": "Denis Villeneuve Brother"}],
                [{"results": ["Denis Villeneuve's brother is Martin Villeneuve"]}],
            ),
            *_create_fake_turn(
                [{"query": "Martin Villeneuve Birthday"}],
                [{"results": ["Martin Villeneuve's birthday is March 13th, 1978"]}],
            ),
            # March 1, 2024 - March 13th, 1978
            AIMessage(content="45"),
        ],
    },
    {
        # Evaluate only 2 steps, but first one is parallel to save time.
        "trajectory": [
            HumanMessage(
                content="What is the difference between the current temperatures in SF and LA?"
            ),
            *_create_fake_turn(
                [{"query": "temperature in SF"}, {"query": "temperature in LA"}],
                [
                    {
                        "results": [
                            "the tempareture in SF is 75 degrees Fahrenheit",
                        ]
                    },
                    {
                        "results": [
                            "the temperature in LA is 89 degrees Fahrenheit",
                        ]
                    },
                ],
            ),
            AIMessage(
                content="The difference in temperatures is 14 degrees Fahrenheit"
            ),
        ],
    },
    # Can add more similar examples manually or capture good trajectories using graph.stream() and add them in.
]

## Stepwise Score

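Next, define a function that scores a single predicted step against the expected one. Tool-invocation steps are graded on whether the calls are sufficiently similar; direct responses are graded on factual correctness.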

In [6]:
from typing import Union

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_openai import ChatOpenAI


class InvocationSimilarity(BaseModel):
    reasoning: str
    is_similar: bool


class Correctness(BaseModel):
    reasoning: str
    is_correct: bool


def score_step(predicted: Union[list, AnyMessage], expected: list):
    expected_message = expected[-1]
    predicted_message = predicted[-1] if isinstance(predicted, list) else predicted
    if expected_message.additional_kwargs.get("tool_calls"):
        # Compare the tool invocation step
        if "tool_calls" not in predicted_message.additional_kwargs:
            return 0
        expected_calls = [
            tool_call["function"]
            for tool_call in expected_message.additional_kwargs["tool_calls"]
        ]
        predicted_calls = [
            tool_call.get("function")
            for tool_call in predicted_message.additional_kwargs["tool_calls"]
        ]

        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a teacher grading an agent's tool usage."),
                (
                    "user",
                    "Are the following function calls sufficiently similar to be considered passing?"
                    "\n\nExpected:\n```\n{expected}\n```\n\nPredicted:\n```\n{predicted}\n```",
                ),
            ]
        )
        llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(
            InvocationSimilarity
        )
        response = (prompt | llm).invoke(
            {
                "expected": expected_calls,
                "predicted": predicted_calls,
            }
        )
        score = response.is_similar

    else:
        # Compare the direct response
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a teacher grading a student's response."),
                (
                    "user",
                    "Is the following predicted response factually correct, according to the expected response?"
                    "\n\nExpected:\n```\n{expected}\n```\n\nPredicted:\n```\n{predicted}\n```",
                ),
            ]
        )
        llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(Correctness)
        response = (prompt | llm).invoke(
            {
                "expected": expected_message.content,
                "predicted": predicted_message.content,
            }
        )
        score = response.is_correct
    return score
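
An LLM grader is flexible but costs a model call per step. As a hedged alternative for the tool-invocation branch, the sketch below compares calls deterministically. It assumes each call is a dict with `name` and `arguments` keys, the same shape as `tool_call["function"]` above, and that exact argument equality is strict enough for your use case. `tool_calls_match` is a hypothetical helper, not a LangChain API.

```python
import json


def tool_calls_match(expected_calls: list[dict], predicted_calls: list[dict]) -> bool:
    """Return True when both lists hold the same (name, arguments) calls, ignoring order."""

    def canonical(call: dict) -> tuple:
        # Arguments arrive as a JSON string; parse and re-serialize with sorted
        # keys so key order and whitespace do not affect the comparison.
        args = json.loads(call["arguments"])
        return (call["name"], json.dumps(args, sort_keys=True))

    return sorted(map(canonical, expected_calls)) == sorted(
        map(canonical, predicted_calls)
    )


expected = [
    {"name": "tavily_search_results_json", "arguments": '{"query": "temperature in SF"}'},
    {"name": "tavily_search_results_json", "arguments": '{"query": "temperature in LA"}'},
]
predicted = list(reversed(expected))  # same calls, different order
print(tool_calls_match(expected, predicted))
print(tool_calls_match(expected, predicted[:1]))  # one call is missing
```

The trade-off is that a semantically equivalent query such as "SF temperature" would fail this check, whereas the LLM grader could accept it.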

## Stepwise Eval Harness

Now we'll plug this into the test harness described in the pseudocode above.

In [7]:
from typing import Callable

from langchain_core.load import load
from langchain_core.runnables import Runnable, RunnableLambda
from typing_extensions import TypedDict

from langgraph.graph import StateGraph


class HarnessState(TypedDict):
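    # Score for each agent step evaluated so far
    scores: list
    # The full ground-truth trajectory for this example
    trajectory: list
    # Index into the trajectory where the next evaluation starts
    step: int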


def evaluate_next_step(
    state: HarnessState, agent: Runnable, score_fn: Runnable, agent_node: str
):
    scores = state.get("scores") or []
    full_trajectory = state["trajectory"]
    this_step = state.get("step") or 0
    for this_step in range(this_step, len(full_trajectory)):
        if isinstance(full_trajectory[this_step], AIMessage):
            break
    # Input to the graph
    agent_input = full_trajectory[:this_step]
    # Expected state after running the agent_node
    expected_output = full_trajectory[: this_step + 1]
    score = None
    for step_state in agent.stream(agent_input):
        if agent_node in step_state:
            score = score_fn.invoke(
                {"predicted": step_state[agent_node], "expected": expected_output}
            )
            # Only propagate until the agent generates an output
            break
    if score is None:
        print("Agent run not found.")
        score = 0
    scores.append(score)
    return {
        **state,
        "scores": scores,
        "step": this_step + 1,
    }


def should_continue(state: HarnessState):
    if state["step"] < len(state["trajectory"]):
        return "eval_next_step"
    return END


def create_harness(agent: Runnable, score_fn: Callable, agent_node: str):
    builder = StateGraph(HarnessState)
    builder.add_node(
        "eval_next_step",
        RunnableLambda(evaluate_next_step).bind(
            agent=agent,
            score_fn=RunnableLambda(lambda x: score_fn(**x)).with_config(
                run_name="ScoreStep"
            ),
            agent_node=agent_node,
        ),
    )
    builder.add_conditional_edges("eval_next_step", should_continue)
    builder.set_entry_point("eval_next_step")
    return load | builder.compile().with_config(run_name="Stepwise Evaluator")
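
The core of `evaluate_next_step` is how it slices the trajectory: it scans forward to the next AI message, feeds the agent everything before it, and treats that AI message as the expected output. The sketch below isolates that slicing, using plain `(role, content)` tuples instead of LangChain message objects. The tuple representation and the `split_for_teacher_forcing` helper are simplifications assumed for illustration.

```python
def split_for_teacher_forcing(trajectory: list[tuple[str, str]]):
    """Yield (agent_input, expected_message) for every AI turn in the trajectory."""
    for i, (role, _content) in enumerate(trajectory):
        if role == "ai":
            # The agent sees the ground-truth prefix; its output is compared
            # against the ground-truth AI message at position i.
            yield trajectory[:i], trajectory[i]


toy_trajectory = [
    ("human", "What is the difference between the temperatures in SF and LA?"),
    ("ai", "search: temperature in SF; temperature in LA"),
    ("tool", "SF is 75 degrees Fahrenheit"),
    ("tool", "LA is 89 degrees Fahrenheit"),
    ("ai", "The difference is 14 degrees Fahrenheit"),
]

pairs = list(split_for_teacher_forcing(toy_trajectory))
for agent_input, expected in pairs:
    print(len(agent_input), "->", expected[1])
```

Because the tool results in each prefix come from the recorded trajectory, the second evaluation step does not depend on whether the agent's first tool call was correct.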

In [8]:
harness = create_harness(agent_graph, score_step, "agent")

In [9]:
for step in harness.stream(inputs[1]):
    print(step)

{'eval_next_step': {'scores': [True], 'trajectory': [HumanMessage(content='What is the difference between the current temperatures in SF and LA?', id='ee0e74d1-b1b3-4f12-b456-bba9531c0b8c'), AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_oxFEzoWB4FQ1BwNoGWAJ4Dk3', 'function': {'arguments': '{"query": "temperature in SF"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}, {'index': 1, 'id': 'call_2XvuSDlvDmE1uuCIMpbHQpjT', 'function': {'arguments': '{"query": "temperature in LA"}', 'name': 'tavily_search_results_json'}, 'type': 'function'}]}), ToolMessage(content='{"results": ["the tempareture in SF is 75 degrees Fahrenheit"]}', additional_kwargs={'name': 'tavily_search_results_json'}, tool_call_id='call_oxFEzoWB4FQ1BwNoGWAJ4Dk3'), ToolMessage(content='{"results": ["the temperature in LA is 89 degrees Fahrenheit"]}', additional_kwargs={'name': 'tavily_search_results_json'}, tool_call_id='call_2XvuSDlvDmE1uuCIMpbHQpjT'), AIMessage(content='The

## Evaluate in LangSmith

LangSmith makes it easy to track and share eval metrics over time. Let's run the harness over the dataset with the LangSmith client's `run_on_dataset` function.

In [10]:
import uuid

from langchain_core.load.dump import dumpd
from langsmith import Client

client = Client()

dataset_name = f"Agent Trajectories {uuid.uuid4().hex[:6]}"
dataset = client.create_dataset(dataset_name)
client.create_examples(
    inputs=dumpd(inputs),
    dataset_id=dataset.id,
)

In [11]:
from langchain.smith import RunEvalConfig


def report_scores(run, example):
    harness_output = run.outputs
    scores = harness_output["scores"]
    stepwise_score = survived_until = None
    if scores:
        # Fraction of steps the agent got right
        stepwise_score = sum(scores) / len(scores)
        # Fraction of the trajectory completed before the first failed step
        survived_until = next(
            (i for i, x in enumerate(scores) if not x), len(scores)
        ) / len(scores)
    return {
        "results": [
            {
                "key": "stepwise_score",
                "score": stepwise_score,
            },
            {"key": "survived_until", "score": survived_until},
        ]
    }


eval_config = RunEvalConfig(evaluators=[report_scores])

test_results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=harness,
    evaluation=eval_config,
    # Other experiment metadata
    project_metadata={
        "model": "gpt-3.5-turbo",
        "agent_type": "openai-tools",
    },
)

View the evaluation results for project 'long-story-10' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/79b8c9c0-0fb8-49aa-8609-2cd93980ff38/compare?selectedSessions=aec90f5c-da24-4f4b-ae44-4cf0c601e3a7

View all tests for Dataset Agent Trajectories f9dccf at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/79b8c9c0-0fb8-49aa-8609-2cd93980ff38
[------------------------------------------------->] 2/2
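
To see how the two metrics in `report_scores` differ, here is a small worked example on hand-written score lists. The `summarize` helper is hypothetical and mirrors the arithmetic above without needing a LangSmith run object.

```python
def summarize(scores: list[bool]) -> dict:
    """Compute the same two aggregates as report_scores for one example."""
    if not scores:
        return {"stepwise_score": None, "survived_until": None}
    # Index of the first failed step, or len(scores) if every step passed
    first_failure = next((i for i, ok in enumerate(scores) if not ok), len(scores))
    return {
        "stepwise_score": sum(scores) / len(scores),
        "survived_until": first_failure / len(scores),
    }


partial = summarize([True, True, False, True])  # third step fails
perfect = summarize([True, True])
print(partial)
print(perfect)
```

In the first case `stepwise_score` is 0.75 because three of four steps pass, while `survived_until` is 0.5 because the agent would have gone off track halfway through without teacher forcing.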