# Evaluating Agent Trajectories

Good evaluation is key for quickly iterating on your agent's prompts and tools. Here we provide an example of how to use the TrajectoryEvalChain to evaluate the efficacy of the actions taken by your agent.

## Setup

Let's start by defining our agent.

In [8]:
from langchain import Wikipedia
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
from langchain.memory import ConversationBufferMemory
from langchain import LLMMathChain
from langchain.llms import OpenAI

from langchain import SerpAPIWrapper

docstore = DocstoreExplorer(Wikipedia())

math_llm = OpenAI(temperature=0)

llm_math_chain = LLMMathChain.from_llm(llm=math_llm, verbose=True)

search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search. Must call before lookup.",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup. Only call after a successful 'Search'.",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for doing calculations. Expects strict numeric input.",
    ),
    Tool(
        name="Search-the-Web-SerpAPI",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
]

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="output"
)

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-0613")

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
    memory=memory,
    return_intermediate_steps=True,  # This is needed for the evaluation later
    handle_parsing_errors=True,
)

## Test the Agent

Now let's try our agent out on some example queries.

In [None]:
query_one = (
    "How many ping pong balls would it take to fill the entire Empire State Building?"
)

test_outputs_one = agent({"input": query_one}, return_only_outputs=False)

This looks alright. Let's try it out on another query.

In [10]:
query_two = "If you laid the Eiffel Tower end to end, how many would you need to cover the US from coast to coast?"

test_outputs_two = agent({"input": query_two}, return_only_outputs=False)



> Entering new  chain...

Invoking: `Search` with `length of the US from coast to coast`


== Watercraft ==
Invoking: `Search` with `distance from coast to coast of the US`


The Oregon Coast is a coastal region of the U.S. state of Oregon. It is bordered by the Pacific Ocean to its west and the Oregon Coast Range to the east, and stretches approximately 362 miles (583 km) from the California state border in the south to the Columbia River in the north. The region is not a specific geological, environmental, or political entity, and includes the Columbia River Estuary.
The Oregon Beach Bill of 1967 allows free beach access to everyone.  In return for a pedestrian easement and relief from construction, the bill eliminates property taxes on private beach land and allows its owners to retain certain beach land rights.Traditionally, the Oregon Coast is regarded as three distinct sub–regions:
The North Coast, which s

This doesn't look so good. Let's try running some evaluation.
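Before scoring, it can help to view the trajectory in a compact form. The sketch below is one way to do that. It assumes each intermediate step is an `(action, observation)` pair whose action exposes `tool` and `tool_input` attributes, as LangChain's `AgentAction` does. `FakeAction` and `format_trajectory` are hypothetical names introduced here so the example runs without an API key.

```python
from typing import Any, NamedTuple, Sequence, Tuple


class FakeAction(NamedTuple):
    """Hypothetical stand-in for an agent action (only `tool` and `tool_input`)."""

    tool: str
    tool_input: Any


def format_trajectory(steps: Sequence[Tuple[Any, Any]], max_obs_chars: int = 80) -> str:
    """Render (action, observation) pairs as numbered, human-readable lines."""
    lines = []
    for i, (action, observation) in enumerate(steps, start=1):
        obs = str(observation)
        # Truncate long observations so each step stays on one short line.
        if len(obs) > max_obs_chars:
            obs = obs[:max_obs_chars] + "..."
        lines.append(f"Step {i}: {action.tool}({action.tool_input!r}) -> {obs}")
    return "\n".join(lines)


# Illustrative steps modeled on the run above; not real agent output.
example_steps = [
    (FakeAction("Search", "length of the US from coast to coast"), "== Watercraft =="),
    (FakeAction("Calculator", "2 + 2"), "Answer: 4"),
]
print(format_trajectory(example_steps))
```

Under the same assumption, `format_trajectory(test_outputs_two["intermediate_steps"])` should give a quick view of which tools the agent called and what came back.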

## Evaluating the Agent

Let's start by defining the TrajectoryEvalChain.

In [11]:
from langchain.evaluation.agents import TrajectoryEvalChain

# Define chain
eval_llm = ChatOpenAI(temperature=0, model_name="gpt-4")
eval_chain = TrajectoryEvalChain.from_llm(
    llm=eval_llm,  # Note: This must be a chat model
    agent_tools=agent.tools,
    return_reasoning=True,
)
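With `return_reasoning=True`, the evaluation result carries both a score and the grader's reasoning. As a rough illustration of how a grader's free-text verdict could be split into those two parts, here is a hypothetical sketch. It is not LangChain's own parser, and it assumes the verdict ends with a line of the form `Score: <1-5>`; `split_verdict` is a name made up for this example.

```python
import re
from typing import Tuple


def split_verdict(text: str) -> Tuple[str, int]:
    """Split grader output into (reasoning, score), assuming a trailing 'Score: N'."""
    stripped = text.strip()
    match = re.search(r"Score:\s*([1-5])\s*$", stripped)
    if match is None:
        raise ValueError("no trailing 'Score: <1-5>' found in grader output")
    # Everything before the score marker is treated as the reasoning.
    reasoning = stripped[: match.start()].strip()
    return reasoning, int(match.group(1))


# Made-up verdict text, for illustration only.
reasoning, score = split_verdict(
    "The agent searched twice but never found a usable distance.\nScore: 2"
)
print(score, "-", reasoning)
```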

Let's try evaluating the first query.

In [13]:
question, steps, answer = (
    test_outputs_one["input"],
    test_outputs_one["intermediate_steps"],
    test_outputs_one["output"],
)

# Call the chain directly with the variables unpacked above. In the recorded
# run, eval_chain.evaluate_intermediate_steps(...) failed with
# "TypeError: Chain.__call__() got an unexpected keyword argument 'reference'",
# so this uses the same direct call as the second query.
evaluation = eval_chain(
    inputs={
        "question": question,
        "answer": answer,
        "agent_trajectory": eval_chain.get_agent_trajectory(steps),
    },
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

That seems about right. Let's try the second query.

In [None]:
question, steps, answer = (
    test_outputs_two["input"],
    test_outputs_two["intermediate_steps"],
    test_outputs_two["output"],
)

evaluation = eval_chain(
    inputs={
        "question": question,
        "answer": answer,
        "agent_trajectory": eval_chain.get_agent_trajectory(steps),
    },
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

That also sounds about right. In conclusion, the TrajectoryEvalChain allows us to use GPT-4 to score both our agent's outputs and tool use in addition to giving us the reasoning behind the evaluation.
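To compare prompt or tool changes, the same evaluation can be repeated over several queries and summarized. The following is a minimal sketch, assuming each evaluation is a dict with a numeric `score` key as in the cells above. `summarize_scores` and the `passing` threshold are hypothetical, and the sample values are placeholders rather than real results.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_scores(evaluations: List[Dict[str, Any]], passing: int = 4) -> Dict[str, float]:
    """Summarize 1-5 trajectory scores: mean, minimum, and share at or above `passing`."""
    scores = [ev["score"] for ev in evaluations]
    if not scores:
        raise ValueError("no evaluations to summarize")
    return {
        "mean": mean(scores),
        "min": min(scores),
        "pass_rate": sum(s >= passing for s in scores) / len(scores),
    }


# Placeholder evaluations, for illustration only.
sample_evaluations = [
    {"score": 4, "reasoning": "placeholder"},
    {"score": 1, "reasoning": "placeholder"},
]
summary = summarize_scores(sample_evaluations)
print(summary)
```

Tracking the minimum alongside the mean makes it easier to spot a single badly handled query that an average would hide.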