# Typewriter: 26 Tools


Let's see how to evaluate an agent's ability to use tools.

    A task where the agent must type a given string one letter at a time.

    In this variation of the task, the agent is given access to 26 parameterless functions,
    each representing a letter of the alphabet.
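
To make the setup concrete, here is a minimal sketch of what an environment with 26 parameterless tools could look like in plain Python. It is not the `langchain_benchmarks` implementation: the names `make_typewriter_tools` and `typed`, and the `"OK"` return value, are assumptions made for illustration.

```python
import string
from typing import Callable, Dict, List, Tuple


def make_typewriter_tools() -> Tuple[Dict[str, Callable[[], str]], List[str]]:
    """Build 26 parameterless functions, one per lowercase letter.

    Hypothetical helper: each function appends its letter to a shared buffer.
    """
    typed: List[str] = []  # shared state recording what has been "typed"

    def make_tool(letter: str) -> Callable[[], str]:
        def tool() -> str:
            typed.append(letter)
            return "OK"

        tool.__name__ = letter  # the tool is named after its letter
        return tool

    tools = {letter: make_tool(letter) for letter in string.ascii_lowercase}
    return tools, typed


tools, typed = make_typewriter_tools()

# Typing "abc" means invoking the tools 'a', 'b', and 'c' in that order.
for ch in "abc":
    tools[ch]()

print("".join(typed))
```

Because the tools take no arguments, the only thing the agent can get right or wrong is which tool it calls and in what order.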

In [None]:
import os

os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your API key.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

In [2]:
task = registry["Tool Usage - Typewriter (26 tools)"]
task

Name,Tool Usage - Typewriter (26 tools)
Type,ToolUsageTask
Dataset ID,128af05e-aa00-4e3b-a958-d166dd450581
Description,"Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument."


Clone the dataset associated with this task.

In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Typewriter (26 tools) already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/2f462c7a-f9b9-46e7-b96b-7469e965f478.


Let's build an agent that we can use for evaluation.

In [4]:
env = task.create_environment()

In [5]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory()
agent.invoke({"question": "foo"})

{'question': 'foo',
 'output': "Could not parse tool input: {'arguments': '', 'name': 'f'} because the `arguments` is not valid JSON.",
 'intermediate_steps': [(AgentAction(tool='_Exception', tool_input='Invalid or incomplete response', log="Could not parse tool input: {'arguments': 'f', 'name': 'f'} because the `arguments` is not valid JSON."),
   'Invalid or incomplete response'),
  (AgentAction(tool='_Exception', tool_input='Invalid or incomplete response', log="Could not parse tool input: {'arguments': '', 'name': 'f'} because the `arguments` is not valid JSON."),
   'Invalid or incomplete response'),
  (AgentAction(tool='_Exception', tool_input='Invalid or incomplete response', log="Could not parse tool input: {'arguments': '', 'name': 'f'} because the `arguments` is not valid JSON."),
   'Invalid or incomplete response'),
  (AgentAction(tool='_Exception', tool_input='Invalid or incomplete response', log="Could not parse tool input: {'arguments': '', 'name': 'f'} because the `argu
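
The output above shows the model calling the tool `f` with an empty `arguments` string (and, in the first step, with `'f'`), which the parser rejects because neither is valid JSON. The sketch below reproduces the empty-string case with the standard `json` module. `parse_tool_arguments` is a hypothetical helper written for illustration, not part of LangChain or `langchain_benchmarks`, and it only shows one possible way to tolerate empty arguments for parameterless tools.

```python
import json


def parse_tool_arguments(raw: str) -> dict:
    """Hypothetical helper: treat an empty arguments string as 'no arguments'."""
    if not raw.strip():
        return {}
    return json.loads(raw)


# An empty string is not a valid JSON document, so strict parsing fails.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(f"Strict parsing failed: {exc}")

# The tolerant helper maps the empty string to an empty argument dict.
print(parse_tool_arguments(""))
print(parse_tool_arguments("{}"))
```

Note that a non-empty, non-JSON string such as `'f'` would still raise in this sketch.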

## Eval

Now let's evaluate the agent on the full dataset.

In [6]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import STANDARD_AGENT_EVALUATOR

client = Client()

test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory,  # The factory is callable, as in agent_factory() above.
    evaluation=STANDARD_AGENT_EVALUATOR,
    verbose=True,
    tags=["gpt-3.5-turbo-16k"],
)

View the evaluation results for project 'test-notable-artist-76' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/5c828160-9f7f-4f01-84ea-05f8a498d031?eval=true

View all tests for Dataset Tool Usage - Typewriter (26 tools) at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/2f462c7a-f9b9-46e7-b96b-7469e965f478
[>                                                 ] 0/20

Chain failed for example 2d4e99fc-8495-468e-8429-6c25a2d176f3 with inputs {'question': 'keyboard'}
Error Type: InternalServerError, Message: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID b658bca90fb852f4d236fc368bc65bcc in your email.)', 'type': 'server_error', 'param': None, 'code': None}}


[------------------->                              ] 8/20

Chain failed for example 8af5bd36-fc11-4b23-9019-f642cfaf8a01 with inputs {'question': 'horse'}
Error Type: InternalServerError, Message: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID 3c40664804cb6e8c84e0e8796dbc0a6d in your email.)', 'type': 'server_error', 'param': None, 'code': None}}


[------------------------------------------------->] 20/20
 Eval quantiles:
                                    0.25   0.5   0.75      mean  mode
Intermediate steps correctness  0.000000  0.00  0.000  0.000000  0.00
# steps / # expected steps      0.703571  0.75  1.375  1.007551  0.75
Correct Final State             0.000000  0.00  0.000  0.055556  0.00
correctness                     0.000000  0.00  0.000  0.111111  0.00
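
To help read the table above, here is a rough sketch of how trajectory-style metrics such as "# steps / # expected steps" and "Intermediate steps correctness" might be computed from the sequence of tool names an agent invoked. This is an assumption about the general idea, not the actual logic of `STANDARD_AGENT_EVALUATOR`; the function names `step_ratio` and `trajectory_matches` and the example trajectory are hypothetical.

```python
from typing import List


def step_ratio(actual: List[str], expected: List[str]) -> float:
    """Number of tool calls made divided by the number expected."""
    return len(actual) / len(expected)


def trajectory_matches(actual: List[str], expected: List[str]) -> int:
    """1 if the tools were invoked in exactly the expected order, else 0."""
    return int(actual == expected)


# Hypothetical run: the agent was asked to type "horse" but stopped one letter short.
expected = list("horse")
actual = ["h", "o", "r", "s"]

print(step_ratio(actual, expected))          # 0.8
print(trajectory_matches(actual, expected))  # 0
```

Under this reading, a ratio below 1 means the agent made fewer tool calls than expected, and a ratio above 1 means it made more. Two of the 20 examples failed with server errors, and the reported means of 0.055556 and 0.111111 are consistent with 1 and 2 successes out of the remaining 18 examples, though the output does not state this explicitly.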
