# Introduction

Tool Usage tasks are designed to evaluate how well an agent can use tools to accomplish an objective.

Each task defines an environment in which the agent operates. The environment consists of a set of tools and a way to read the state of the environment (more on that below).

The tasks allow you to stress test the agent in different ways:

* Can the agent use a single tool effectively?
* Can the agent use more than 10 tools effectively?
* Can the agent correctly incorporate information returned by the tool (and ignore internal knowledge)?

To help in this evaluation, each task is associated with a LangSmith dataset that includes input/output examples of varying difficulties.

## Schema

To make it possible to evaluate different agent implementations, we use a standardized schema. We'll illustrate it with the following example taken from the tool usage tasks.

### Dataset

Each task corresponds to a LangSmith dataset with the following schema:

Inputs:

|     name    |     type    |     meaning            |
| ----------- | ----------- | -----------------------|
| question    | str         | the user question      |


Outputs:

|     name      |     type        |     meaning                                            |
| ------------- | --------------- | ------------------------------------------------------|
| reference     | str             | the expected answer                                   |
| expected_steps| List[str]       | the list of tools that should be invoked              |
| order_matters | bool            | whether the tools should be invoked in the specific order |
| state         | Optional[Any]   | the state of the system after the agent has taken its actions |



Here's an [example](https://smith.langchain.com/public/1d89f4b3-5f73-48cf-a127-2fdeb22f6d84/d/e82a0faf-00b9-40a5-a0e3-9723d923e58e/e) that contains the following keys/values:

```json
{
  "input": {"question": "weather in LA right now?"},
  "output": {
    "reference": "Sunny, Temperature: 75°F",
    "order_matters": true,
    "expected_steps": [
      "find_locations_by_name",
      "get_current_weather_for_location"
    ]
  }
}
```
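The schema above can be checked mechanically. Here is a minimal sketch, assuming a record is a plain dictionary shaped like the example; `validate_example` is a hypothetical helper written for illustration and is not part of LangChain Benchmarks.

```python
from typing import Any, Dict, List


def validate_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems: List[str] = []
    inputs = example.get("input", {})
    outputs = example.get("output", {})

    if not isinstance(inputs.get("question"), str):
        problems.append("input.question must be a str")
    if not isinstance(outputs.get("reference"), str):
        problems.append("output.reference must be a str")

    steps = outputs.get("expected_steps")
    if not (isinstance(steps, list) and all(isinstance(s, str) for s in steps)):
        problems.append("output.expected_steps must be a list of str")
    if not isinstance(outputs.get("order_matters"), bool):
        problems.append("output.order_matters must be a bool")

    # "state" is optional and may be anything, so it is not checked.
    return problems


example = {
    "input": {"question": "weather in LA right now?"},
    "output": {
        "reference": "Sunny, Temperature: 75°F",
        "order_matters": True,
        "expected_steps": [
            "find_locations_by_name",
            "get_current_weather_for_location",
        ],
    },
}

print(validate_example(example))  # []
```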


### Agent

To work with the evaluators provided by LangChain Benchmarks (of course, you're free to write your own evaluators!), an agent must accept `question` as an input and return:

```json
{
    "output": "It's super sunny. Like 75F", // the output from the agent
    "intermediate_steps": [... "find_locations_by_name" ...], // list of the intermediate steps taken by the agent (see format in LangChain)
    "state": .., // Can be anything; this is the state of the environment after the agent has taken all of its actions (optional key)
}
```
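To compare an agent's run against `expected_steps`, the tool names have to be pulled out of `intermediate_steps`. In LangChain, each intermediate step is an `(action, observation)` pair whose action carries a `tool` attribute. The sketch below uses a small stand-in class (`FakeAction`, an assumption made so the example runs without LangChain installed) and a hypothetical helper, `tool_names`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FakeAction:
    """Stand-in for LangChain's AgentAction; only `tool` is used below."""

    tool: str
    tool_input: Dict[str, Any] = field(default_factory=dict)


def tool_names(intermediate_steps: List[Tuple[Any, Any]]) -> List[str]:
    """Return the names of the invoked tools, in invocation order."""
    return [action.tool for action, _observation in intermediate_steps]


steps = [
    (FakeAction("find_locations_by_name"), "<observation>"),
    (FakeAction("get_current_weather_for_location"), "<observation>"),
]

print(tool_names(steps))
# ['find_locations_by_name', 'get_current_weather_for_location']
```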

## Tasks

You can check an up-to-date list of tool usage tasks in the registry:    

In [1]:
from langchain_benchmarks import registry

registry.filter(Type="ToolUsageTask")

Name,Type,Dataset ID,Description
Tool Usage - Typewriter (1 tool),ToolUsageTask,59577193-8938-4ccf-92a7-e8a96bcf4f86,"Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."
Tool Usage - Typewriter (26 tools),ToolUsageTask,128af05e-aa00-4e3b-a958-d166dd450581,"Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument."
Tool Usage - Relational Data,ToolUsageTask,1d89f4b3-5f73-48cf-a127-2fdeb22f6d84,"Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently."
Multiverse Math,ToolUsageTask,594f9f60-30a0-49bf-b075-f44beabf546a,"An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math."


Let's look at what a tool usage task is in a bit more detail.

In [2]:
task = registry["Tool Usage - Typewriter (26 tools)"]
task

0,1
Name,Tool Usage - Typewriter (26 tools)
Type,ToolUsageTask
Dataset ID,128af05e-aa00-4e3b-a958-d166dd450581
Description,"Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument."


Tool usage tasks are associated with an environment:

---------
```python

import dataclasses
from typing import Any, Callable, List, Optional

from langchain.tools import BaseTool


@dataclasses.dataclass(frozen=True)
class ToolUsageEnvironment:
    """An instance of an environment for tool usage."""

    tools: List[BaseTool]
    """The tools that can be used in the environment."""

    read_state: Optional[Callable[[], Any]] = None
    """A function that returns the current state of the environment."""

```

--------------

Here, we'll dig into the typewriter task a bit to explain what the environment state represents.

The typewriter task has 26 tools, each of which prints a letter on a piece of virtual paper.

In [3]:
env = task.create_environment()
env.tools[:4]

[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f490480fe20>),
 StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f490480fec0>),
 StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f490480ff60>),
 StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f4904844040>)]

In [4]:
env.tools[0].invoke({})  # Invoke a()
env.tools[0].invoke({})  # invoke a()
env.tools[2].invoke({})  # invoke c()

'OK'

In [5]:
env.read_state()  # Shows the content of the virtual paper

'aac'
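To make the relationship between tools and state concrete, here is a minimal pure-Python sketch of a typewriter-like environment. It is an illustration only, not the library's implementation: `create_typewriter` and its internals are hypothetical names. Each letter gets a parameterless function that appends to a shared "paper", and `read_state` returns what has been typed so far.

```python
import string
from typing import Callable, Dict, List, Tuple


def create_typewriter() -> Tuple[Dict[str, Callable[[], str]], Callable[[], str]]:
    """Create 26 parameterless key functions plus a state reader (hypothetical sketch)."""
    paper: List[str] = []  # shared state, private to this environment instance

    def make_key(letter: str) -> Callable[[], str]:
        def press() -> str:
            paper.append(letter)
            return "OK"

        return press

    keys = {letter: make_key(letter) for letter in string.ascii_lowercase}

    def read_state() -> str:
        return "".join(paper)

    return keys, read_state


keys, read_state = create_typewriter()
keys["a"]()
keys["a"]()
keys["c"]()
print(read_state())  # aac

# A second environment starts with a blank page.
_, other_read_state = create_typewriter()
print(repr(other_read_state()))  # ''
```

Because the paper lives inside each call to `create_typewriter`, two environments never share state, which is the property the evaluation relies on.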

## Agent Factory

For evaluation, we need an agent factory that will create a new instance of an agent executor for every evaluation run.

The `AgentExecutor` should accept `question` as an input and include the fields `output`, `intermediate_steps` and potentially `state` in its response. To do this, we will wrap the agent executor in an adapter (`apply_agent_executor_adapter`) that helps match the expected schema.

Please reference the LangChain documentation to see how to [use and implement agents](https://python.langchain.com/docs/modules/agents/).

In [6]:
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI

from langchain_benchmarks.schema import ToolUsageTask
from langchain_benchmarks.tool_usage.agents import apply_agent_executor_adapter

In [7]:
class AgentFactory:
    def __init__(self, task: ToolUsageTask, model: str) -> None:
        self.task = task
        self.model = model

    def __call__(self):
        # This factory creates a new environment for every agent run.
        # The reason is that the environment may be associated with an environment state (e.g., typewriter)
        # which is changed by the actions of the agent.
        # At the end of the run, the environment state will be read.
        env = self.task.create_environment()  # Create a new environment for every agent run!
        tools = env.tools
        llm = ChatOpenAI(temperature=0, model=self.model)
        agent_executor = initialize_agent(
            tools,
            llm,
            agent=AgentType.OPENAI_FUNCTIONS,
            return_intermediate_steps=True,
            handle_parsing_errors=True,
        )
        # Apply the adapters so that inputs and outputs match dataset schema
        # state_reader automatically adds the state of the environment at the end of the run.
        return apply_agent_executor_adapter(agent_executor, state_reader=env.read_state)
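
The idea behind the adapter can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual `apply_agent_executor_adapter`: the helper `adapt` and the `fake_agent` below are hypothetical. The wrapper renames `question` to the `input` key the executor expects and, if a state reader is given, attaches the environment state after the run.

```python
from typing import Any, Callable, Dict, List, Optional

AgentFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def adapt(run_agent: AgentFn, state_reader: Optional[Callable[[], Any]] = None) -> AgentFn:
    """Wrap an agent so it accepts {"question": ...} and reports the final state."""

    def invoke(inputs: Dict[str, Any]) -> Dict[str, Any]:
        result = dict(run_agent({"input": inputs["question"]}))
        if state_reader is not None:
            # Read the state only after the agent has taken all of its actions.
            result["state"] = state_reader()
        return result

    return invoke


# A fake agent that "types" its input onto a shared page.
paper: List[str] = []


def fake_agent(inputs: Dict[str, Any]) -> Dict[str, Any]:
    for letter in inputs["input"]:
        paper.append(letter)
    return {"input": inputs["input"], "output": "done"}


agent = adapt(fake_agent, state_reader=lambda: "".join(paper))
result = agent({"question": "abc"})
print(result)  # {'input': 'abc', 'output': 'done', 'state': 'abc'}
```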

In [8]:
from langchain import globals

globals.set_verbose(True)

In [9]:
agent_factory = AgentFactory(task, model="gpt-3.5-turbo-1106")

Let's check that the agent works:

In [10]:
agent = agent_factory()

In [12]:
agent.invoke({"question": "abc"})



> Entering new AgentExecutor chain...

Invoking: `a` with `{}`


OK
Invoking: `b` with `{}`


OK
Invoking: `c` with `{}`


OK
You've successfully typed "abc"! Is there anything else you'd like to do?

> Finished chain.


{'input': 'abc',
 'output': 'You\'ve successfully typed "abc"! Is there anything else you\'d like to do?',
 'intermediate_steps': [(AgentActionMessageLog(tool='a', tool_input={}, log='\nInvoking: `a` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{}', 'name': 'a'}})]),
   'OK'),
  (AgentActionMessageLog(tool='b', tool_input={}, log='\nInvoking: `b` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{}', 'name': 'b'}})]),
   'OK'),
  (AgentActionMessageLog(tool='c', tool_input={}, log='\nInvoking: `c` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{}', 'name': 'c'}})]),
   'OK')],
 'state': 'abc'}

## Benchmarking

How does one evaluate an agent? Given a particular task and input, an agent uses tools to produce an output AND/OR change the state of the environment.

To evaluate an agent, we can check the following:

1. Did the agent use the expected tools?
2. Did the agent use the tools in the most effective way; e.g., was the order of tool invocation correct?
3. Did the environment end up in the correct final state after the agent used the tools? (e.g., does my calendar contain all the scheduled meetings?)
4. Did the agent's output match the expected reference output?

Each task is associated with a standard evaluator that performs an evaluation appropriate for the task; for example:

1. Compare the output to the reference using an LLM that grades the response.
2. Compare equality of expected_steps to the list of tools in intermediate_steps -- simple list equality
3. Compare the state of the environment against expected state (if present in the dataset and in the agent)

This evaluator will be used below when we benchmark on all tasks!
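
The trajectory and state checks listed above can be sketched as follows. This is a hedged illustration, not the library's evaluator: `steps_match` and `state_matches` are hypothetical helpers. The text describes simple list equality; treating `order_matters = False` as a multiset comparison is an assumption made here.

```python
from collections import Counter
from typing import Any, List, Optional


def steps_match(expected: List[str], actual: List[str], order_matters: bool = True) -> bool:
    """Compare expected tool names with the tools the agent actually invoked."""
    if order_matters:
        return expected == actual  # simple list equality
    # Assumption: when order does not matter, compare as multisets.
    return Counter(expected) == Counter(actual)


def state_matches(expected_state: Any, actual_state: Any) -> Optional[bool]:
    """Return None when the dataset has no expected state to compare against."""
    if expected_state is None:
        return None
    return expected_state == actual_state


expected = ["find_locations_by_name", "get_current_weather_for_location"]
swapped = list(reversed(expected))

print(steps_match(expected, expected))  # True
print(steps_match(expected, swapped))  # False
print(steps_match(expected, swapped, order_matters=False))  # True
print(state_matches("abc", "abc"))  # True
print(state_matches(None, "abc"))  # None
```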

In [13]:
eval_config = task.get_eval_config()
eval_config

RunEvalConfig(evaluators=[], custom_evaluators=[<langchain_benchmarks.tool_usage.evaluators.AgentTrajectoryEvaluator object at 0x7f49003b9990>], reference_key=None, prediction_key=None, input_key=None, eval_llm=None)

Set up code to run against all tasks

In [23]:
import datetime

from langsmith.client import Client

from langchain_benchmarks import (
    __version__,
    clone_public_dataset,
    model_registry,
    registry,
)
from langchain_benchmarks.rate_limiting import RateLimiter
from langchain_benchmarks.tool_usage.agents import (
    AnthropicToolUserFactory,
    CustomAgentFactory,
    OpenAIAgentFactory,
    OpenAIAssistantFactory,
)

In [24]:
import uuid
experiment_uuid = uuid.uuid4().hex

In [25]:
tests = [
    # 2-tuple of (architecture, model name)
    ("openai_functions", "gpt-3.5-turbo-1106"),  # Requires OpenAI credentials
    ("openai_functions", "gpt-3.5-turbo-0613"),
    ("openai_functions", "gpt-4-1106-preview"),
    ("openai_functions", "gpt-4-0613"),
    ("openai_functions", "mistral-7b-instruct-v0.1"),  # Requires AnyScale credentials
    # Requires Anthropic credentials and setting up Anthropic's tool usage package
    ("anthropic_tool_user", "claude-2.1"),
]

In [None]:
client = Client()  # Launch langsmith client for cloning datasets
today = datetime.date.today().isoformat()
rate_limiter = RateLimiter(requests_per_second=2)

for task in registry:
    if task.type != "ToolUsageTask":
        continue

    dataset_name = task.name
    clone_public_dataset(task.dataset_id, dataset_name=dataset_name)

    for arch, model in tests:
        print()
        print(f"Benchmarking {task.name} with model: {model} and arch: {arch}")
        eval_config = task.get_eval_config()

        if arch == "openai_functions":
            agent_factory = OpenAIAgentFactory(
                task, model=model, rate_limiter=rate_limiter
            )
        elif arch == "custom_agent":
            agent_factory = CustomAgentFactory(
                task, model=model, rate_limiter=rate_limiter
            )
        elif arch == "anthropic_tool_user":
            agent_factory = AnthropicToolUserFactory(task)
        else:
            raise ValueError(f"Unknown architecture: {arch}")

        client.run_on_dataset(
            dataset_name=dataset_name,
            llm_or_chain_factory=agent_factory,
            evaluation=eval_config,
            verbose=False,
            project_name=f"{model}-{task.name}-{today}-{experiment_uuid}",
            tags=[model],
            concurrency_level=5,
            project_metadata={
                "model": model,
                "id": experiment_uuid,
                "task": task.name,
                "date": today,
                "langchain_benchmarks_version": __version__,
                "arch": arch,
            },
        )