# Generic Agent Evaluation

Good evaluation is key to iterating quickly on your agent's prompts and tools. This example shows how to use the ReactEvalChain to score both an agent's final answer and the way it used its tools.

## Setup

Let's start by defining our agent.

In [2]:
from langchain import Wikipedia
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.agents.react.base import DocstoreExplorer
from langchain.memory import ConversationBufferMemory
from langchain import LLMMathChain
from langchain.llms import OpenAI

from langchain import SerpAPIWrapper

docstore = DocstoreExplorer(Wikipedia())

math_llm = OpenAI(temperature=0)

llm_math_chain = LLMMathChain(llm=math_llm, verbose=True)

search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=docstore.search,
        description="useful for when you need to ask with search",
    ),
    Tool(
        name="Lookup",
        func=docstore.lookup,
        description="useful for when you need to ask with lookup",
    ),
    Tool(
        name="Calculator",
        func=llm_math_chain.run,
        description="useful for doing calculations",
    ),
    Tool(
        name="Search the Web (SerpAPI)",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    ),
]

memory = ConversationBufferMemory(
    memory_key="chat_history", return_messages=True, output_key="output"
)

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    return_intermediate_steps=True,  # Needed later: the evaluator scores these steps
)
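
Each Tool pairs a name with a callable and a description. With this agent type, the model selects a tool by emitting a JSON blob with "action" and "action_input" keys, as the traces in the next section show. The sketch below illustrates that routing idea only. `fake_calculator`, `fake_search`, `TOOLS`, and `dispatch` are hypothetical names made up for illustration and are not part of LangChain, whose real executor also handles output parsing, memory, and error cases.

```python
import json


# Stand-in tools: plain functions keyed by the names the agent may emit.
# These are illustrative stubs, not the LangChain Tool objects defined above.
def fake_calculator(expression: str) -> str:
    return f"Answer: (would evaluate {expression!r})"


def fake_search(query: str) -> str:
    return f"(would search for {query!r})"


TOOLS = {"Calculator": fake_calculator, "Search": fake_search}


def dispatch(action_blob: str) -> str:
    """Parse a JSON action blob and route it to the named tool."""
    action = json.loads(action_blob)
    name, tool_input = action["action"], action["action_input"]
    if name == "Final Answer":
        # The agent is done; the input is the answer itself.
        return tool_input
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](tool_input)


blob = '{"action": "Calculator", "action_input": "2 + 2"}'
print(dispatch(blob))
```

Because the tool name is matched as a string, short and distinct names make it harder for the model to emit one that matches nothing.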

## Testing the Agent

Now let's try our agent out on some example queries.

In [17]:
query_one = "How many ping pong balls would it take to fill the entire Empire State Building?"

test_outputs_one = agent({"input": query_one}, return_only_outputs=False)

> Entering new AgentExecutor chain...
{
    "action": "Calculator",
    "action_input": "37,000,000 cubic feet / 2.5 cubic inches per ping pong ball"
}

> Entering new LLMMathChain chain...
37,000,000 cubic feet / 2.5 cubic inches per ping pong ball
```python
print(37000000 / (2.5 * (1/1728)))
```

Answer: 25574400000.0

> Finished chain.

Observation: Answer: 25574400000.0

Thought:{
    "action": "Final Answer",
    "action_input": "It would take approximately 25.6 billion ping pong balls to fill the Empire State Building."
}

> Finished chain.


This looks good! Let's try it out on another query.
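
Before moving on, the calculator step in the trace above can be double-checked by hand. The sketch below redoes the unit conversion explicitly. The two input figures are the ones the agent supplied without looking them up, so this confirms only the arithmetic and not the underlying estimates.

```python
# Unit-conversion check for the calculator step in the trace above.
CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3  # 1728

building_volume_ft3 = 37_000_000  # figure the agent assumed, not verified
ball_volume_in3 = 2.5             # figure the agent assumed, not verified

# Convert the building volume to cubic inches, then divide by one ball's volume.
building_volume_in3 = building_volume_ft3 * CUBIC_INCHES_PER_CUBIC_FOOT
balls = building_volume_in3 / ball_volume_in3

print(f"{balls:,.0f} balls (about {balls / 1e9:.1f} billion)")
# prints: 25,574,400,000 balls (about 25.6 billion)
```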

In [19]:
query_two = "If you laid the Eiffel Tower end to end, how many would you need to cover the US from coast to coast?"

test_outputs_two = agent({"input": query_two}, return_only_outputs=False)

> Entering new AgentExecutor chain...
{
    "action": "Search",
    "action_input": "length of the Eiffel Tower in miles"
}
Observation: The following are examples of orders of magnitude for different lengths.
Thought:{
    "action": "Final Answer",
    "action_input": "I apologize for the confusion in my previous response. Unfortunately, I was mistaken in suggesting that the Eiffel Tower could be laid end to end to cover the US from coast to coast. The Eiffel Tower is approximately 1,063 feet tall, which is much shorter than the width of the United States. To cover the US from coast to coast, you would need a much larger structure or object. I hope this helps clarify things for you."
}

> Finished chain.


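This doesn't look so good: the search returned an irrelevant result, and the agent never computed the number the question asked for. Let's try running some evaluation.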
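
For reference, the calculation the question calls for is a single division once both lengths are in the same unit. The sketch below uses the tower height quoted in the agent's own answer. The coast-to-coast distance is an assumed round figure chosen purely for illustration and does not come from the original run.

```python
import math

FEET_PER_MILE = 5280

tower_height_ft = 1_063  # height quoted in the agent's final answer
us_width_miles = 2_800   # ASSUMPTION: rough coast-to-coast distance, for illustration only

# Put both lengths in feet, then round up to whole towers.
us_width_ft = us_width_miles * FEET_PER_MILE
towers_needed = math.ceil(us_width_ft / tower_height_ft)

print(f"Roughly {towers_needed:,} towers laid end to end")
```

An agent following a sensible tool sequence would look up both lengths with Search and then hand this division to Calculator.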

## Evaluating the Agent

Let's start by defining the ReactEvalChain.

In [20]:
from langchain.evaluation.react import ReactEvalChain

# Define chain
react_eval_chain = ReactEvalChain.from_llm(
    llm=ChatOpenAI(temperature=0, model_name="gpt-4"),  # Note: This must be a ChatOpenAI model
    agent_tools=agent.tools,
    return_reasoning=True,
)
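
Because the agent was created with return_intermediate_steps=True, its result includes an "intermediate_steps" list that pairs each action with the observation it produced. The cells below pass that list through get_agent_trajectory to turn it into text for the evaluator. The exact text the library produces is not shown in this example, so the following is only a sketch of the idea. `StubAction` and `format_trajectory` are hypothetical names and stand in for the library's own classes and formatting.

```python
from typing import List, NamedTuple, Tuple


class StubAction(NamedTuple):
    """Hypothetical stand-in for the agent's action record (not a LangChain class)."""

    tool: str
    tool_input: str


def format_trajectory(steps: List[Tuple[StubAction, str]]) -> str:
    """Render (action, observation) pairs as numbered, human-readable steps."""
    return "\n\n".join(
        f"Step {i}:\nTool used: {action.tool}\n"
        f"Tool input: {action.tool_input}\nTool output: {observation}"
        for i, (action, observation) in enumerate(steps, start=1)
    )


# One step, mirroring the calculator call from the first test query.
steps = [
    (
        StubAction("Calculator", "37,000,000 cubic feet / 2.5 cubic inches per ping pong ball"),
        "Answer: 25574400000.0",
    )
]
print(format_trajectory(steps))
```

Flattening the steps into plain text lets the grading model read the whole tool sequence alongside the question and final answer.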

Let's try evaluating the first query.

In [22]:
question, steps, answer = (
    test_outputs_one["input"],
    test_outputs_one["intermediate_steps"],
    test_outputs_one["output"],
)

evaluation = react_eval_chain(
    inputs={
        "question": question,
        "answer": answer,
        "agent_trajectory": react_eval_chain.get_agent_trajectory(steps),
    },
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  3
Reasoning:  First, let's evaluate the final answer. The final answer is helpful as it provides an approximate number of ping pong balls needed to fill the Empire State Building. However, the model did not provide any information on how it arrived at the 37,000,000 cubic feet figure for the Empire State Building's volume.

Second, does the model use a logical sequence of tools to answer the question? The model only used the Calculator tool, but it should have used the Search or Lookup tool to find the volume of the Empire State Building and the volume of a ping pong ball.

Third, does the AI language model use the tools in a helpful way? The model used the Calculator tool in a helpful way to perform the calculation, but it did not use the Search or Lookup tool to find the necessary information.

Fourth, does the AI language model use too many steps to answer the question? The model used only one step, which is not too many, but it should have used additional steps 

That seems about right. Let's try the second query.

In [24]:
question, steps, answer = (
    test_outputs_two["input"],
    test_outputs_two["intermediate_steps"],
    test_outputs_two["output"],
)

evaluation = react_eval_chain(
    inputs={
        "question": question,
        "answer": answer,
        "agent_trajectory": react_eval_chain.get_agent_trajectory(steps),
    },
)

print("Score from 1 to 5: ", evaluation["score"])
print("Reasoning: ", evaluation["reasoning"])

Score from 1 to 5:  1
Reasoning:  i. Is the final answer helpful?
The final answer is not helpful because it does not provide the information requested by the user. The user asked how many Eiffel Towers laid end to end would be needed to cover the US from coast to coast, but the answer only states that the Eiffel Tower is much shorter than the width of the United States.

ii. Does the AI language use a logical sequence of tools to answer the question?
The AI language model used only one tool, which was the Search tool. However, it did not use the tool effectively to find the necessary information to answer the question.

iii. Does the AI language model use the tools in a helpful way?
The AI language model did not use the tools in a helpful way. The Search tool was used to find the length of the Eiffel Tower, but the output was not helpful, and the model did not attempt to find the width of the United States or use the Calculator tool to determine how many Eiffel Towers would be needed.

That also sounds about right. In conclusion, the ReactEvalChain lets us use GPT-4 to score both our agent's final answers and its tool use, and it returns the reasoning behind each score.
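
When iterating on prompts and tools, it can help to track these scores across a set of queries instead of reading each evaluation by hand. The sketch below shows one possible way to do that. `summarize_scores`, `stub_evaluate`, and the `passing_score` threshold are hypothetical and not part of LangChain. The stub replays the two scores reported above so the example runs without any API calls, and in practice `evaluate` would wrap the agent run plus the evaluation chain.

```python
from statistics import mean
from typing import Callable, Dict, List


def summarize_scores(
    queries: List[str],
    evaluate: Callable[[str], Dict[str, object]],
    passing_score: int = 4,  # hypothetical threshold on the 1-to-5 scale
) -> Dict[str, object]:
    """Run `evaluate` on each query and summarize the 1-to-5 scores."""
    # int() guards against the score arriving as a string.
    results = {q: int(evaluate(q)["score"]) for q in queries}
    return {
        "mean_score": mean(list(results.values())),
        "failing": [q for q, s in results.items() if s < passing_score],
    }


# Stub evaluator that replays the two scores reported above.
recorded = {"ping pong balls": 3, "Eiffel Towers": 1}


def stub_evaluate(query: str) -> Dict[str, object]:
    return {"score": recorded[query], "reasoning": "(omitted)"}


summary = summarize_scores(list(recorded), stub_evaluate)
print(summary)
```

Rerunning the same query set after each prompt or tool change gives a rough before-and-after comparison, with the stored reasoning available for any query that regresses.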