# Relational Data 


Let's see how to evaluate an agent's ability to use tools.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

For this code to work, please configure LangSmith environment variables with your credentials.
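
As a quick pre-flight check, the sketch below reports which credentials are still unset. The variable names `LANGCHAIN_API_KEY` and `OPENAI_API_KEY` are assumptions based on common LangSmith and OpenAI setups; the helper `missing_env_vars` is hypothetical and not part of `langchain_benchmarks`.

```python
import os

# Assumed variable names; adjust them to match your LangSmith / OpenAI setup.
REQUIRED_VARS = ("LANGCHAIN_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before running the notebook:", ", ".join(missing))
```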

In [2]:
task = registry["Tool Usage - Relational Data"]
task

0,1
Name,Tool Usage - Relational Data
Type,ToolUsageTask
Dataset ID,https://smith.langchain.com/public/1d89f4b3-5f73-48cf-a127-2fdeb22f6d84/d
Description,"Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently."


In [3]:
print(task.description)

Environment with fake data about users and their locations and favorite foods.

The environment provides a set of tools that can be used to query the data.

The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data.

The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question.

Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question.

Success is measured by the ability to answer the question correctly, and efficiently.
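
To make "efficiently" concrete, here is a minimal sketch of two checks one could run on a single example: the ratio of tool calls made to tool calls expected, and whether the sequence of tools matches the reference. Both functions are hypothetical illustrations, not the evaluator shipped with `langchain_benchmarks`, and the order-insensitive comparison is an assumption about how an "order matters" flag might be used.

```python
def step_ratio(actual_steps, expected_steps):
    """Tool calls made divided by tool calls expected (expected must be non-empty)."""
    return len(actual_steps) / len(expected_steps)


def steps_match(actual_steps, expected_steps, order_matters=True):
    """Hypothetical check: exact sequence match, or same tools regardless of order."""
    if order_matters:
        return list(actual_steps) == list(expected_steps)
    return sorted(actual_steps) == sorted(expected_steps)


expected = ["find_users_by_name", "get_user_email"]

# An agent that follows the reference path exactly.
direct = ["find_users_by_name", "get_user_email"]
print(step_ratio(direct, expected), steps_match(direct, expected))  # 1.0 True

# An agent that makes one extra call before getting to the answer.
detour = ["list_user_ids", "find_users_by_name", "get_user_email"]
print(step_ratio(detour, expected), steps_match(detour, expected))  # 1.5 False
```

A ratio above 1 means the agent used more calls than the reference path, and a ratio of 0 means it answered without calling any tool.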


Clone the dataset associated with this task.

In [4]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Relational Data already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/69c0e0d0-91b5-4183-bed0-6628e76964dc.


## Define an agent

Let's build an agent that we can use for evaluation.

In [5]:
from langchain_benchmarks.tool_usage import agents

In [6]:
agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

Let's test that our agent works:

In [7]:
agent = agent_factory()

In [8]:
agent.invoke({"question": "who is bob?"})

{'question': 'who is bob?',
 'output': 'Bob is a user with the ID 21.',
 'intermediate_steps': [(AgentActionMessageLog(tool='find_users_by_name', tool_input={'name': 'bob'}, log="\nInvoking: `find_users_by_name` with `{'name': 'bob'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "name": "bob"\n}', 'name': 'find_users_by_name'}})]),
   [{'id': 21, 'name': 'Bob'},
    {'id': 41, 'name': 'Donna'},
    {'id': 1, 'name': 'Alice'},
    {'id': 35, 'name': 'Charlie'},
    {'id': 42, 'name': 'Eve'},
    {'id': 43, 'name': 'Frank The Cat'}])]}
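
The `intermediate_steps` entry in the output above is a list of `(action, observation)` pairs. The sketch below pulls out the tool names and inputs in call order. It uses a stand-in `FakeAction` type so it runs without an agent; the real action objects carry more fields, and `tool_calls` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the action object shown above; only `.tool` and `.tool_input` are modeled.
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])


def tool_calls(result):
    """Return (tool name, tool input) pairs in the order the agent made the calls."""
    return [
        (action.tool, action.tool_input)
        for action, _observation in result["intermediate_steps"]
    ]


result = {
    "question": "who is bob?",
    "output": "Bob is a user with the ID 21.",
    "intermediate_steps": [
        (FakeAction("find_users_by_name", {"name": "bob"}), [{"id": 21, "name": "Bob"}]),
    ],
}

print(tool_calls(result))  # [('find_users_by_name', {'name': 'bob'})]
```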

## Eval

Now let's evaluate the agent.

In [9]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import STANDARD_AGENT_EVALUATOR

In [10]:
client = Client()

In [11]:
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=STANDARD_AGENT_EVALUATOR,
    verbose=True,
    tags=["openai-functions"],
)

View the evaluation results for project 'test-warm-whip-57' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/048077f0-52ca-4bae-8792-ec5e2a817d38?eval=true

View all tests for Dataset Tool Usage - Relational Data at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/69c0e0d0-91b5-4183-bed0-6628e76964dc
[------------------------------------------------->] 21/21
 Eval quantiles:
                                    0.25       0.5      0.75      mean  \
Intermediate steps correctness  0.000000  1.000000  1.000000  0.714286   
# steps / # expected steps      1.000000  1.000000  1.000000  0.928571   
correctness                     1.000000  1.000000  1.000000  0.809524   
execution_time                  5.098939  5.098939  5.098939  5.098939   

                                    mode  
Intermediate steps correctness  1.000000  
# steps / # expected steps      1.000000  
correctness                     1.000000  
execution_time    

## Inspect

Here, we'll take a closer look at the underlying results.

In [18]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [19]:
df["correctness"].mean()

0.8095238095238095

In [20]:
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["actual_number_of_steps"] = df["output.intermediate_steps"].apply(len)
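
With step counts available as columns, one possible next step is to break correctness down by difficulty and to count runs where the agent never called a tool. The sketch below does this on a small stand-in frame whose values are made up for illustration; with the real results you would apply the same calls to `df`.

```python
import pandas as pd

# Made-up stand-in rows; in the notebook, use `df` from `test_run.to_dataframe()` instead.
sample = pd.DataFrame(
    {
        "correctness": [1, 0, 1, 0, 1, 1],
        "num_expected_steps": [5, 4, 3, 3, 2, 2],
        "actual_number_of_steps": [5, 0, 3, 0, 2, 2],
    }
)

# Mean correctness and number of examples at each difficulty level.
by_difficulty = sample.groupby("num_expected_steps")["correctness"].agg(["mean", "count"])
print(by_difficulty)

# Runs where the agent answered without calling any tool.
no_tools = sample[sample["actual_number_of_steps"] == 0]
print(len(no_tools), "runs made no tool calls")
```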

In [21]:
df.head()

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,correctness,execution_time,input.question,output.question,output.output,output.intermediate_steps,reference.reference,reference.order_matters,reference.expected_steps,num_expected_steps,actual_number_of_steps
0,0,1.0,1,5.098939,do bob and alice live in the same city?,do bob and alice live in the same city?,"No, Bob and Alice do not live in the same city...",[(tool='find_users_by_name' tool_input={'name'...,no,False,"[find_users_by_name, get_user_location, get_ci...",5,5
1,0,0.0,0,5.098939,Is it likely that Donna is outside with an umb...,Is it likely that Donna is outside with an umb...,"I'm sorry, but I don't have access to real-tim...",[],yes,False,"[find_users_by_name, get_user_location, get_cu...",4,0
2,1,1.0,1,5.098939,do alice and charlie use the same email provider?,do alice and charlie use the same email provider?,"No, Alice and Charlie do not use the same emai...",[(tool='find_users_by_name' tool_input={'name'...,no,True,"[find_users_by_name, get_user_email, get_user_...",3,3
3,0,0.0,0,5.098939,Is it likely that Donna is awake right now?,Is it likely that Donna is awake right now?,"I'm sorry, but I don't have access to informat...",[],yes,True,"[find_users_by_name, get_user_location, get_cu...",3,0
4,0,1.0,1,5.098939,Donna is about to go outside. Does she need an...,Donna is about to go outside. Does she need an...,Donna is currently in a location where it is r...,[(tool='find_users_by_name' tool_input={'name'...,yes,True,"[find_users_by_name, get_user_location, get_cu...",3,3


In [22]:
df = df.sort_values("actual_number_of_steps", ascending=False)

In [23]:
df

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,correctness,execution_time,input.question,output.question,output.output,output.intermediate_steps,reference.reference,reference.order_matters,reference.expected_steps,num_expected_steps,actual_number_of_steps
0,0,1.0,1,5.098939,do bob and alice live in the same city?,do bob and alice live in the same city?,"No, Bob and Alice do not live in the same city...",[(tool='find_users_by_name' tool_input={'name'...,no,False,"[find_users_by_name, get_user_location, get_ci...",5,5
2,1,1.0,1,5.098939,do alice and charlie use the same email provider?,do alice and charlie use the same email provider?,"No, Alice and Charlie do not use the same emai...",[(tool='find_users_by_name' tool_input={'name'...,no,True,"[find_users_by_name, get_user_email, get_user_...",3,3
4,0,1.0,1,5.098939,Donna is about to go outside. Does she need an...,Donna is about to go outside. Does she need an...,Donna is currently in a location where it is r...,[(tool='find_users_by_name' tool_input={'name'...,yes,True,"[find_users_by_name, get_user_location, get_cu...",3,3
5,0,1.0,0,5.098939,whats the name of the city where bob lives?,whats the name of the city where bob lives?,The name of the city where Bob lives is New York.,[(tool='list_user_ids' tool_input={} log='\nIn...,Los Angeles,True,"[find_users_by_name, get_user_location, get_ci...",3,3
6,1,1.0,1,5.098939,what is the current users favorite color and n...,what is the current users favorite color and n...,The current user's favorite color is yellow an...,[(tool='get_current_user_id' tool_input={} log...,yellow and Charlie,True,"[get_current_user_id, get_user_favorite_color,...",3,3
7,0,1.5,1,5.098939,Frank who is Even's friend is allergic to dair...,Frank who is Even's friend is allergic to dair...,"Frank's favorite food is the salad, which cont...",[(tool='find_users_by_name' tool_input={'name'...,yes,True,"[find_users_by_name, get_food_allergic_ingredi...",2,3
11,1,1.0,1,5.098939,list the allergens in chocolate,list the allergens in chocolate,The allergens in chocolate are milk and soy.,[(tool='find_foods_by_name' tool_input={'food'...,"milk, soy",True,"[find_foods_by_name, get_food_allergic_ingredi...",2,2
15,1,1.0,1,5.098939,what is alice's email address?,what is alice's email address?,Alice's email address is alice@gmail.com.,[(tool='find_users_by_name' tool_input={'name'...,alice@gmail.com,True,"[find_users_by_name, get_user_email]",2,2
14,1,1.0,1,5.098939,find donna's favorite color,find donna's favorite color,Donna's favorite color is green.,[(tool='find_users_by_name' tool_input={'name'...,green,True,"[find_users_by_name, get_user_favorite_color]",2,2
13,1,1.0,1,5.098939,weather in LA right now?,weather in LA right now?,The current weather in Los Angeles is sunny wi...,[(tool='find_locations_by_name' tool_input={'c...,"Sunny, Temperature: 75°F",True,"[find_locations_by_name, get_current_weather_f...",2,2
