# Introduction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-benchmarks/blob/main/docs/source/notebooks/tool_usage/intro.ipynb)

Tool Usage tasks are designed to evaluate how well an agent can use tools to accomplish an objective.

Each task defines an environment within which the agent operates. The environment consists of a set of pre-defined tools and a way to read the state of the environment (more on that below).

Different tasks define different environments, objectives, and tools, allowing you to stress-test different aspects of the agent:

* Can the agent use a single tool effectively?
* Can an agent use more than 10 tools effectively?
* Can an agent ignore what it already knows and rely on the output from tools instead?

To help in this evaluation, each task is associated with a LangSmith dataset that includes input/output examples of varying difficulties.

## Evaluation

How does one evaluate an agent? Given a particular task and input, an agent uses tools to produce an output AND/OR change the state of the environment.

To evaluate an agent, we can check the following:

1. Did the agent use the expected tools?
2. Did the agent use the tools in the most effective way? (e.g., was the order of tool invocations correct?)
3. Did the environment end up in the correct final state after the agent used the tools? (e.g., does my calendar contain all the scheduled meetings?)
4. Did the agent's output match the expected reference output?

## Schema

To make it possible to evaluate different agent implementations, we use a standardized schema. We'll illustrate it with the following example, taken from a tool usage dataset:

### Dataset

https://smith.langchain.com/public/1d89f4b3-5f73-48cf-a127-2fdeb22f6d84/d/e82a0faf-00b9-40a5-a0e3-9723d923e58e/e

```json
{
  "input": {"question": "weather in LA right now?"},  // The user's question
  "output": {
    "reference": "Sunny, Temperature: 75°F",  // The expected answer (when one exists)
    "order_matters": true,  // Whether the order of the expected steps is meaningful
    "expected_steps": [  // The tools the agent should have invoked
      "find_locations_by_name",
      "get_current_weather_for_location"
    ]
  }
}
```


### Agent

To work with the evaluators provided by LangChain Benchmarks (you're free to write your own evaluators, of course!), an agent must accept `question` as an input and return:

```json
{
    "output": "It's super sunny. Like 75F", // The output from the agent
    "intermediate_steps": [... "find_locations_by_name" ...], // The intermediate steps taken by the agent (see format in LangChain)
    "state": .., // Optional key; can be anything. The state of the environment after the agent has taken all of its actions
}
```
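As a rough illustration of this contract, the sketch below packages the result of an arbitrary callable into the expected shape. The names `run_agent`, `fake_agent`, and `Step`, the tuple layout of each step, and the tool inputs and observations are all hypothetical stand-ins made up for this example; they are not part of LangChain Benchmarks, and LangChain's own intermediate-step format differs.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical step record: (tool name, tool input, observation).
# This is only a stand-in for LangChain's intermediate-step format.
Step = Tuple[str, Dict[str, Any], str]


def run_agent(
    agent: Callable[[str], Tuple[str, List[Step]]],
    question: str,
    read_state: Optional[Callable[[], Any]] = None,
) -> Dict[str, Any]:
    """Call `agent` on `question` and package the result in the expected shape."""
    answer, steps = agent(question)
    result: Dict[str, Any] = {"output": answer, "intermediate_steps": steps}
    if read_state is not None:
        # "state" is an optional key; include it only when a reader is supplied.
        result["state"] = read_state()
    return result


def fake_agent(question: str) -> Tuple[str, List[Step]]:
    """Toy agent that always 'calls' the same two tools (made-up inputs/outputs)."""
    steps: List[Step] = [
        ("find_locations_by_name", {"name": "LA"}, "made-up observation"),
        ("get_current_weather_for_location", {"location_id": 1}, "made-up observation"),
    ]
    return "It's super sunny. Like 75F", steps


result = run_agent(fake_agent, "weather in LA right now?")
print(sorted(result))  # ['intermediate_steps', 'output']
```

Passing a `read_state` callable (for example, the environment's own `read_state`) adds the optional `state` key to the returned dictionary.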

## Standard Evaluators

The different tasks are associated with standard evaluators.

The standard agent evaluator does the following:

1. Compares the output to the reference using an LLM that grades the response.
2. Compares `expected_steps` to the list of tools in `intermediate_steps` using simple list equality.
3. Compares the state of the environment against the expected state (if present in both the dataset and the agent's output).

The evaluator does not use `order_matters` at the moment.
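The step comparison can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes each intermediate step is an `(action, observation)` pair whose action exposes a `.tool` attribute, and `FakeAction`, `tools_used`, and `steps_match` are hypothetical names. The `order_matters=False` branch is a hypothetical extension shown only for contrast, since the standard evaluator does not use that flag.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple


@dataclass
class FakeAction:
    """Stand-in for an agent action; only the tool name matters here."""

    tool: str


def tools_used(intermediate_steps: Sequence[Tuple[Any, Any]]) -> List[str]:
    """Extract the tool names, in call order, from (action, observation) pairs."""
    return [action.tool for action, _observation in intermediate_steps]


def steps_match(expected: List[str], actual: List[str], order_matters: bool = True) -> bool:
    """Compare tool lists: plain list equality, or a multiset comparison
    when order is declared irrelevant (hypothetical extension)."""
    if order_matters:
        return expected == actual
    return Counter(expected) == Counter(actual)


# The agent called the right tools, but in the wrong order.
steps = [
    (FakeAction("get_current_weather_for_location"), "..."),
    (FakeAction("find_locations_by_name"), "..."),
]
expected = ["find_locations_by_name", "get_current_weather_for_location"]
actual = tools_used(steps)

print(steps_match(expected, actual))                       # False
print(steps_match(expected, actual, order_matters=False))  # True
```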

## Tasks

You can check an up-to-date list of tool usage tasks in the registry:    

In [5]:
from langchain_benchmarks import registry

registry.filter(Type="ToolUsageTask")

| Name | Type | Dataset ID | Description |
|------|------|------------|-------------|
| Tool Usage - Typewriter (1 tool) | ToolUsageTask | 59577193-8938-4ccf-92a7-e8a96bcf4f86 | Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. |
| Tool Usage - Typewriter (26 tools) | ToolUsageTask | 128af05e-aa00-4e3b-a958-d166dd450581 | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |
| Tool Usage - Relational Data | ToolUsageTask | 1d89f4b3-5f73-48cf-a127-2fdeb22f6d84 | Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly, and efficiently. |
| Multiverse Math | ToolUsageTask | 594f9f60-30a0-49bf-b075-f44beabf546a | An environment that contains a few basic math operations, but with altered results. For example, multiplication of `5*3` will be re-interpreted as `5*3*1.1`. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. |
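To make the Multiverse Math description more concrete, here is a minimal sketch of what an "altered" operation could look like. Only the `5*3*1.1` rule comes from the task description; the function name `multiverse_multiply` is hypothetical and the task's real tools may be defined differently.

```python
import math


def multiverse_multiply(a: float, b: float) -> float:
    """Multiply two numbers, then scale the product by 1.1 (the altered rule)."""
    return a * b * 1.1


result = multiverse_multiply(5, 3)

# Ordinary arithmetic would give 15; the altered rule gives roughly 16.5.
print(result)

# Swapping the arguments gives the same result, so commutativity is retained.
print(math.isclose(multiverse_multiply(3, 5), result))
```

An agent that answers "15" from memory instead of calling the tool would be wrong in this environment, which is exactly what the task is meant to detect.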


Let's look at what a tool usage task is in a bit more detail.

In [12]:
task = registry["Tool Usage - Typewriter (26 tools)"]
task

| Field | Value |
|-------|-------|
| Name | Tool Usage - Typewriter (26 tools) |
| Type | ToolUsageTask |
| Dataset ID | 128af05e-aa00-4e3b-a958-d166dd450581 |
| Description | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |


Tool usage tasks are associated with an environment:

---------
```python
import dataclasses
from typing import Any, Callable, List, Optional

from langchain.tools import BaseTool


@dataclasses.dataclass(frozen=True)
class ToolUsageEnvironment:
    """An instance of an environment for tool usage."""

    tools: List[BaseTool]
    """The tools that can be used in the environment."""

    read_state: Optional[Callable[[], Any]] = None
    """A function that returns the current state of the environment."""
```

--------------

Here, we'll dig into the typewriter task a bit to explain what the environment state represents.

The typewriter task has 26 tools, each of which prints a letter on a piece of virtual paper:

In [41]:
env = task.create_environment()

env.tools[:4]

[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f9d713a49d0>),
 StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f9d713a4a60>),
 StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f9d713a4af0>),
 StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f9d713a4b80>)]

In [42]:
env.read_state()

''

In [43]:
env.tools[3].invoke({})  # Invoke d()

'OK'

In [44]:
env.tools[5].invoke({})  # Invoke f()

'OK'

In [45]:
env.read_state()  # Shows the content of the virtual paper

'df'
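The behavior above suggests that the tools share a piece of mutable state that `read_state` exposes. The following is a minimal, dependency-free sketch of how such an environment might be built with closures. `ToyEnvironment` and `create_typewriter` are hypothetical names, the tools are plain callables rather than LangChain tools, and the actual implementation in LangChain Benchmarks may differ.

```python
import dataclasses
from typing import Any, Callable, Dict, List, Optional


@dataclasses.dataclass(frozen=True)
class ToyEnvironment:
    """Simplified stand-in for ToolUsageEnvironment using plain callables."""

    tools: Dict[str, Callable[[], str]]
    read_state: Optional[Callable[[], Any]] = None


def create_typewriter(letters: str = "abcdefghijklmnopqrstuvwxyz") -> ToyEnvironment:
    """Build an environment with one parameterless tool per letter."""
    paper: List[str] = []  # shared state captured by every closure below

    def make_tool(letter: str) -> Callable[[], str]:
        def type_letter() -> str:
            paper.append(letter)  # "print" the letter on the virtual paper
            return "OK"

        return type_letter

    return ToyEnvironment(
        tools={letter: make_tool(letter) for letter in letters},
        read_state=lambda: "".join(paper),
    )


toy_env = create_typewriter()
toy_env.tools["d"]()
toy_env.tools["f"]()
print(toy_env.read_state())  # df
```

Because each call to `create_typewriter` creates a fresh `paper` list, every environment starts from an empty state, which keeps separate evaluation runs from interfering with each other.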