# Generic Agent Evaluation

Good evaluation is key for quickly iterating on your agent's prompts and tools. Here we provide an example of how to use the TrajectoryEvalChain to evaluate your agent.

## Setup

Let's start by defining our agent.

In [2]:
from langchain import Wikipedia
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
from langchain.memory import ConversationBufferMemory
from langchain import LLMMathChain
from langchain.llms import OpenAI

from langchain import SerpAPIWrapper

docstore = DocstoreExplorer(Wikipedia())

math_llm = OpenAI(temperature=0)

llm_math_chain = LLMMathChain(llm=math_llm, verbose=True)

search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for doing calculations",
    ),
    Tool(
        name="Search the Web (SerpAPI)",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
]

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="output"
)

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    return_intermediate_steps=True, # This is needed for the evaluation later
)

## Testing the Agent

Now let's try our agent out on some example queries.

In [3]:
query_one = "How many ping pong balls would it take to fill the entire Empire State Building?"

test_outputs_one = agent({"input": query_one}, return_only_outputs=False)

> Entering new AgentExecutor chain...
{
    "action": "Search the Web (SerpAPI)",
    "action_input": "How many ping pong balls would it take to fill the entire Empire State Building?"
}
Observation: Give or take. 12.8 billion. The volume of the Empire State Building Googles in at around 37 million ft³. A golf ball comes in at about 2.5 in³.
Thought:{
    "action": "Final Answer",
    "action_input": "It would take approximately 12.8 billion ping pong balls to fill the entire Empire State Building, give or take. The volume of the Empire State Building is around 37 million cubic feet, while a golf ball has a volume of about 2.5 cubic inches."
}

> Finished chain.


This looks good! Let's try it out on another query.

In [4]:
query_two = "If you laid the Eiffel Tower end to end, how many would you need to cover the US from coast to coast?"

test_outputs_two = agent({"input": query_two}, return_only_outputs=False)

> Entering new AgentExecutor chain...
{
    "action": "Calculator",
    "action_input": "The length of the Eiffel Tower is 324 meters. The distance from the east coast to the west coast of the US is approximately 4,828 kilometers. 1 kilometer is equal to 0.001 kilometers. Therefore, the number of Eiffel Towers needed to cover the US from coast to coast would be (4,828,000 meters / 324 meters) = 14,888.89 Eiffel Towers (rounded to the nearest hundredth)."
}

> Entering new LLMMathChain chain...
The length of the Eiffel Tower is 324 meters. The distance from the east coast to the west coast of the US is approximately 4,828 kilometers. 1 kilometer is equal to 0.001 kilometers. Therefore, the number of Eiffel Towers needed to cover the US from coast to coast would be (4,828,000 meters / 324 meters) = 14,888.89 Eiffel Towers (rounded to the nearest hundredth).
```text
4828000 / 324
```
...numexpr.evaluate("4828000 / 324")...

Answer:

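This doesn't look so good: the agent did its reasoning inside the Calculator input, misstated the kilometer conversion, and supplied its own result before the tool ran. Let's try running some evaluation.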
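A quick way to see what went wrong is to redo the arithmetic from the Calculator input by hand. The sketch below uses only the figures that appear in the trace (324 meters, 4,828 kilometers, and the claimed 14,888.89). The variable names are illustrative and not part of the notebook.

```python
# Figures copied from the Calculator input in the trace above.
eiffel_tower_m = 324
coast_to_coast_km = 4_828
claimed_towers = 14_888.89

# 1 kilometer is 1,000 meters; the trace misstates this conversion,
# although the 4,828,000 meters it goes on to use is correct.
coast_to_coast_m = coast_to_coast_km * 1_000

towers = coast_to_coast_m / eiffel_tower_m

print(f"computed:   {towers:.2f}")
print(f"claimed:    {claimed_towers:.2f}")
print(f"difference: {towers - claimed_towers:.2f}")
```

The division comes out at about 14,901.23, so the figure the agent wrote into its own tool input is off by roughly 12 towers, even though the expression handed to `numexpr` is the right one.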

## Evaluating the Agent

Let's start by defining the TrajectoryEvalChain.

In [5]:
from langchain.evaluation.agents import TrajectoryEvalChain

# Define chain
eval_chain = TrajectoryEvalChain.from_llm(
    llm=ChatOpenAI(temperature=0, model_name="gpt-4"), # Note: This must be a ChatOpenAI model
    agent_tools=agent.tools,
    return_reasoning=True,
)
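The evaluator scores the whole trajectory rather than the final answer alone, so it helps to picture what the intermediate steps look like. The following is a rough, self-contained sketch that assumes each step is an (action, observation) pair. `Action` and `format_trajectory` are hypothetical stand-ins written for this illustration, and the text layout is an assumption, not the format LangChain itself produces.

```python
from typing import List, NamedTuple, Tuple


class Action(NamedTuple):
    """Hypothetical stand-in for the action half of one intermediate step."""

    tool: str
    tool_input: str


def format_trajectory(steps: List[Tuple[Action, str]]) -> str:
    """Flatten (action, observation) pairs into numbered text blocks."""
    blocks = []
    for number, (action, observation) in enumerate(steps, start=1):
        blocks.append(
            f"Step {number}:\n"
            f"Tool used: {action.tool}\n"
            f"Tool input: {action.tool_input}\n"
            f"Tool output: {observation}"
        )
    return "\n\n".join(blocks)


# One step, using the tool call and observation from the first query's trace.
steps = [
    (
        Action(
            tool="Search the Web (SerpAPI)",
            tool_input="How many ping pong balls would it take to fill the entire Empire State Building?",
        ),
        "Give or take. 12.8 billion. The volume of the Empire State Building "
        "Googles in at around 37 million ft³. A golf ball comes in at about 2.5 in³.",
    )
]

print(format_trajectory(steps))
```

Laying the steps out this way makes it easy to see what a grader can check: which tool was chosen, what it was asked, and whether the observation actually supports the final answer.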

Let's try evaluating the first query.

In [6]:
question, steps, answer = test_outputs_one["input"], test_outputs_one["intermediate_steps"], test_outputs_one["output"]

evaluation = eval_chain(
    inputs={"question": question, "answer": answer, "agent_trajectory": eval_chain.get_agent_trajectory(steps)},
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  2
Reasoning:  First, let's evaluate the final answer. The final answer is not helpful because it uses the volume of a golf ball instead of a ping pong ball. The question was about ping pong balls, not golf balls.

Second, does the AI language model use a logical sequence of tools to answer the question? The model used the Search the Web (SerpAPI) tool, which is appropriate for finding information about the volume of the Empire State Building. However, the model should have searched for the volume of a ping pong ball instead of a golf ball.

Third, does the AI language model use the tools in a helpful way? The model used the SerpAPI tool to find the volume of the Empire State Building, which is helpful. However, it did not use the tool to find the correct volume of a ping pong ball.

Fourth, does the AI language model use too many steps to answer the question? The model used only one step, which is appropriate for this question.

Fifth, are the appropriate tools used

That seems about right. Let's try the second query.

In [8]:
question, steps, answer = test_outputs_two["input"], test_outputs_two["intermediate_steps"], test_outputs_two["output"]

evaluation = eval_chain(
    inputs={"question": question, "answer": answer, "agent_trajectory": eval_chain.get_agent_trajectory(steps)},
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

ValueError: invalid literal for int() with base 10: '3\n\nThe model provided a helpful answer but did not use the appropriate tools to verify the input values and repeated the same calculation in both steps.'
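The traceback shows that the evaluator replied with a score followed by free text, and that the whole string was then passed to `int()`. One way to tolerate such a reply is to pull the leading digit out before converting it. This is a minimal sketch only: `parse_score` is a hypothetical helper, not part of LangChain, and it assumes the score is a single digit from 1 to 5 at the start of the reply.

```python
import re
from typing import Tuple


def parse_score(raw: str) -> Tuple[int, str]:
    """Split a reply such as '3\\n\\nSome explanation' into (score, rest)."""
    # Accept optional leading whitespace, then exactly one digit from 1 to 5.
    match = re.match(r"\s*([1-5])\b", raw)
    if match is None:
        raise ValueError(f"no score from 1 to 5 at the start of: {raw!r}")
    score = int(match.group(1))
    remainder = raw[match.end():].strip()
    return score, remainder


# The string reported in the ValueError above.
raw_reply = (
    "3\n\nThe model provided a helpful answer but did not use the appropriate "
    "tools to verify the input values and repeated the same calculation in both steps."
)

score, comment = parse_score(raw_reply)
print("Score from 1 to 5: ", score)
print("Trailing comment: ", comment)
```

Keeping the trailing text instead of discarding it preserves the evaluator's explanation, which here is the most useful part of the reply.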

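That also sounds about right, even though the cell itself failed: the evaluator answered with a score of 3 followed by free text, and that combined string could not be parsed as an integer. In conclusion, the TrajectoryEvalChain allows us to use GPT-4 to score both our agent's outputs and its tool use, and it gives us the reasoning behind each evaluation.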