# Tool Usage

Let's see how to evaluate an agent's ability to use tools.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.
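Before running the rest of the notebook, it can help to confirm that the credentials are actually visible to the Python process. The sketch below is a minimal check; the variable name `LANGCHAIN_API_KEY` is an assumption based on the usual LangSmith setup, and `missing_langsmith_vars` is a hypothetical helper, not part of `langchain_benchmarks`.

```python
import os

# Assumed variable name from the usual LangSmith setup; adjust for your deployment.
REQUIRED_VARS = ("LANGCHAIN_API_KEY",)


def missing_langsmith_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_langsmith_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("LangSmith credentials found.")
```

If the check reports a missing variable, set it in the shell that launches the notebook server rather than inside the notebook, so the key does not end up in saved output.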

In [2]:
registry

Name,Type,Dataset ID,Description
Tool Usage - Typewriter (1 func),ToolUsageTask,placeholder,"Environment with a single function that accepts a single letter as input, and ""prints"" it on a piece of paper. The objective of this task is to evaluate the ability to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Typewriter,ToolUsageTask,placeholder,"Environment with 26 functions each representing a letter of the alphabet. In this variation of the typewriter task, there are 26 parameterless functions, where each function represents a letter of the alphabet (instead of a single function that takes a letter as an argument). The object is to evaluate the ability of use the functions to repeat the given string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Relational Data,ToolUsageTask,e95d45da-aaa3-44b3-ba2b-7c15ff6e46f5,"Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently."
Multiverse Math,ToolUsageTask,placeholder,"An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math."
Email Extraction,ExtractionTask,https://smith.langchain.com/public/36bdfe7d-3cd1-4b36-b957-d12d95810a2b/d,"A dataset of 42 real emails deduped from a spam folder, with semantic HTML tags removed, as well as a script for initial extraction and formatting of other emails from an arbitrary .mbox file like the one exported by Gmail. Some additional cleanup of the data was done by hand after the initial pass. See https://github.com/jacoblee93/oss-model-extraction-evals."


In [3]:
task = registry["Tool Usage - Relational Data"]
task

0,1
Name,Tool Usage - Relational Data
Type,ToolUsageTask
Dataset ID,e95d45da-aaa3-44b3-ba2b-7c15ff6e46f5
Description,Environment with fake data about users and their locations and favorite foods. The environment prov...


In [4]:
print(task.description)

Environment with fake data about users and their locations and favorite foods.

The environment provides a set of tools that can be used to query the data.

The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data.

The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question.

Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question.

Success is measured by the ability to answer the question correctly, and efficiently.



Clone the dataset associated with this task:

In [5]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Relational Data already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/f2b5a831-8eef-4bc7-b6de-68078b87350f.


## Define an agent

Let's build an agent that we can use for evaluation.

In [9]:
from langchain_benchmarks.tool_usage import agents

In [10]:
agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

Let's test that our agent works

In [11]:
agent = agent_factory.create()

In [12]:
agent.invoke({"question": "who is bob?"})

{'question': 'who is bob?',
 'output': 'Bob is a user with the name "Bob".',
 'intermediate_steps': [(AgentActionMessageLog(tool='find_users_by_name', tool_input={'name': 'bob'}, log="\nInvoking: `find_users_by_name` with `{'name': 'bob'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "name": "bob"\n}', 'name': 'find_users_by_name'}})]),
   [{'id': 21, 'name': 'Bob'},
    {'id': 41, 'name': 'Donna'},
    {'id': 1, 'name': 'Alice'},
    {'id': 35, 'name': 'Charlie'},
    {'id': 42, 'name': 'Eve'},
    {'id': 43, 'name': 'Frank The Cat'}]),
  (AgentActionMessageLog(tool='get_user_name', tool_input={'user_id': 21}, log="\nInvoking: `get_user_name` with `{'user_id': 21}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "user_id": 21\n}', 'name': 'get_user_name'}})]),
   'Bob')]}

## Eval

Now let's evaluate the agent.

In [13]:
from langchain_benchmarks.tool_usage import STANDARD_AGENT_EVALUATOR
from langsmith.client import Client

In [14]:
client = Client()

In [15]:
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=STANDARD_AGENT_EVALUATOR,
    verbose=True,
    tags=["openai-functions"],
)

View the evaluation results for project 'test-puzzled-fold-42' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/3d206d3f-aad1-4226-86a6-4161857e5bca?eval=true

View all tests for Dataset Tool Usage - Relational Data at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/f2b5a831-8eef-4bc7-b6de-68078b87350f
[------------------------------------------------->] 21/21
 Eval quantiles:
                                0.25  0.5  0.75      mean  mode
Intermediate steps correctness   0.0  1.0   1.0  0.571429   1.0
# steps / # expected steps       1.0  1.0   1.0  2.285714   1.0
correctness                      0.0  1.0   1.0  0.666667   1.0


## Inspect

Let's take a closer look at the underlying results.

A few things to note:

* Mean correctness is about 67% (so it's messing up on roughly one question in three!)
* The number of tool invocations made by the agent can be very large; e.g., 15 invocations, when only a single invocation was actually needed.
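To make the second point concrete, the `# steps / # expected steps` column appears to be the number of tool calls the agent made divided by the number of calls in the reference trajectory. The sketch below reproduces that arithmetic on plain lists; `step_ratio` is a hypothetical helper for illustration, not the evaluator's actual implementation.

```python
def step_ratio(actual_steps, expected_steps):
    """Ratio of tool calls made to tool calls expected (1.0 is ideal)."""
    if not expected_steps:
        raise ValueError("expected_steps must contain at least one tool name")
    return len(actual_steps) / len(expected_steps)


# One expected call, but the agent ran until the 15-step iteration limit.
runaway = step_ratio(["get_current_user_id"] * 15, ["get_current_user_id"])

# One call made, one call expected.
ideal = step_ratio(["get_food_name"], ["get_food_name"])

print(runaway, ideal)
```

A ratio well above 1.0 is a quick signal that a run wasted calls, even in cases where the final answer turned out to be correct.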

In [82]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient='records'))

In [83]:
df['correctness'].mean()

0.6666666666666666

In [84]:
df['num_expected_steps'] = df['reference.expected_steps'].apply(len)
df['actual_number_of_steps'] = df['output.intermediate_steps'].apply(len)

In [68]:
df.head()

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,correctness,input.question,output.question,output.output,output.intermediate_steps,reference.reference,reference.order_matters,reference.expected_steps,num_expected_steps,actual_number_of_steps
0,1,1.0,1,What is the city for location ID 1?,What is the city for location ID 1?,The city for location ID 1 is New York.,[(tool='get_city_for_location' tool_input={'lo...,New York,True,[get_city_for_location],1,1
1,1,1.0,1,What is the name of food with id 6?,What is the name of food with id 6?,The name of the food with ID 6 is Pasta.,[(tool='get_food_name' tool_input={'food_id': ...,Pasta,True,[get_food_name],1,1
2,1,1.0,1,what is eve's user id?,what is eve's user id?,Eve's user ID is 42.,[(tool='find_users_by_name' tool_input={'name'...,42,True,[find_users_by_name],1,1
3,0,15.0,0,get the current user id,get the current user id,Agent stopped due to iteration limit or time l...,[(tool='get_current_user_id' tool_input={} log...,35,True,[get_current_user_id],1,15
4,1,1.0,0,How many users by the name of bob?,How many users by the name of bob?,"There are multiple users with the name ""Bob"".",[(tool='find_users_by_name' tool_input={'name'...,1,True,[find_users_by_name],1,1


In [85]:
df = df.sort_values('actual_number_of_steps', ascending=False)

In [86]:
df

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,correctness,input.question,output.question,output.output,output.intermediate_steps,reference.reference,reference.order_matters,reference.expected_steps,num_expected_steps,actual_number_of_steps
3,0,15.0,0,get the current user id,get the current user id,Agent stopped due to iteration limit or time l...,[(tool='get_current_user_id' tool_input={} log...,35,True,[get_current_user_id],1,15
7,0,7.5,0,weather in LA right now?,weather in LA right now?,Agent stopped due to iteration limit or time l...,[(tool='find_locations_by_name' tool_input={'c...,"Sunny, Temperature: 75°F",True,"[find_locations_by_name, get_current_weather_f...",2,15
8,0,7.5,0,time in chicago,time in chicago,Agent stopped due to iteration limit or time l...,[(tool='find_locations_by_name' tool_input={'c...,2023-11-14 11:15 AM,True,"[find_locations_by_name, get_current_time_for_...",2,15
15,0,2.0,1,whats the name of the city where bob lives?,whats the name of the city where bob lives?,The name of the city where Bob lives is Los An...,[(tool='get_user_location' tool_input={'user_i...,Los Angeles,True,"[find_users_by_name, get_user_location, get_ci...",3,6
20,0,1.0,1,do bob and alice live in the same city?,do bob and alice live in the same city?,"No, Bob and Alice do not live in the same city...",[(tool='find_users_by_name' tool_input={'name'...,no,False,"[find_users_by_name, get_user_location, get_ci...",5,5
13,0,2.0,0,Frank who is Even's friend is allergic to dair...,Frank who is Even's friend is allergic to dair...,"Frank is not allergic to dairy, so he can eat ...",[(tool='find_users_by_name' tool_input={'name'...,yes,True,"[find_users_by_name, get_food_allergic_ingredi...",2,4
18,1,1.0,1,do alice and charlie use the same email provider?,do alice and charlie use the same email provider?,"No, Alice uses the email provider ""gmail.com"" ...",[(tool='find_users_by_name' tool_input={'name'...,no,True,"[find_users_by_name, get_user_email, get_user_...",3,3
16,0,1.0,1,Donna is about to go outside. Does she need an...,Donna is about to go outside. Does she need an...,"Yes, Donna needs an umbrella because it is cur...",[(tool='find_users_by_name' tool_input={'name'...,yes,True,"[find_users_by_name, get_user_location, get_cu...",3,3
14,1,1.0,1,what is the current users favorite color and n...,what is the current users favorite color and n...,The current user's favorite color is yellow an...,[(tool='get_current_user_id' tool_input={} log...,yellow and Charlie,True,"[get_current_user_id, get_user_favorite_color,...",3,3
11,1,1.0,1,what is the current users favorite color?,what is the current users favorite color?,The current user's favorite color is yellow.,[(tool='get_current_user_id' tool_input={} log...,yellow,True,"[get_current_user_id, get_user_favorite_color]",2,2
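Runs that hit the iteration limit often repeat the same tool call with the same arguments. As a rough way to flag this, the sketch below finds the longest run of identical consecutive calls. It assumes each step has already been reduced to a `(tool_name, tool_input)` pair; `longest_repeated_call` is a hypothetical helper, not something the library provides.

```python
from itertools import groupby


def longest_repeated_call(calls):
    """Return (call, run_length) for the longest run of identical consecutive calls.

    `calls` is a list of (tool_name, tool_input) pairs. Returns (None, 0) when empty.
    """
    best_call, best_len = None, 0
    # groupby compares neighbours with ==, so dict inputs work without hashing.
    for call, group in groupby(calls):
        run_len = sum(1 for _ in group)
        if run_len > best_len:
            best_call, best_len = call, run_len
    return best_call, best_len


# Illustrative trajectory: one lookup followed by the same call repeated.
example_calls = [
    ("find_locations_by_name", {"city": "Chicago"}),
    ("get_current_time_for_location", {"location_id": 3}),
    ("get_current_time_for_location", {"location_id": 3}),
    ("get_current_time_for_location", {"location_id": 3}),
]
repeated_call, repeat_count = longest_repeated_call(example_calls)
```

With real results, the pairs could be built from each step's `tool` and `tool_input` attributes, which are visible in the agent output shown earlier.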


Here is an example of a poorly behaving agent that seems to have gotten stuck in a loop:

In [90]:
df['output.intermediate_steps'].loc[8]

[(AgentActionMessageLog(tool='find_locations_by_name', tool_input={'city': 'Chicago'}, log="\nInvoking: `find_locations_by_name` with `{'city': 'Chicago'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "city": "Chicago"\n}', 'name': 'find_locations_by_name'}})]),
  [{'id': 3, 'city': 'Chicago'},
   {'id': 5, 'city': 'Miami'},
   {'id': 2, 'city': 'Los Angeles'},
   {'id': 4, 'city': 'Houston'},
   {'id': 1, 'city': 'New York'}]),
 (AgentActionMessageLog(tool='get_current_time_for_location', tool_input={'location_id': 3}, log="\nInvoking: `get_current_time_for_location` with `{'location_id': 3}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "location_id": 3\n}', 'name': 'get_current_time_for_location'}})]),
  '2023-11-14 11:15 AM'),
 (AgentActionMessageLog(tool='get_current_time_for_location', tool_input={'location_id': 3}, log="\nInvoking: `get_current_time_for_location` with `{'lo