# Typewriter: 26 Tools

This is a variation of the typewriter task in which the agent has access to 26 parameterless tools.

Each tool represents a letter of the alphabet (e.g., 'a', 'b', 'c').

The agent can use each tool to "print" the corresponding letter on a piece of virtual paper.

The objective for the agent is to "print" the user's input on the paper exactly.

---------

In [None]:
import os

os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your api key.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

In [2]:
task = registry["Tool Usage - Typewriter (26 tools)"]
task

| Name | Tool Usage - Typewriter (26 tools) |
|---|---|
| Type | ToolUsageTask |
| Dataset ID | 128af05e-aa00-4e3b-a958-d166dd450581 |
| Description | Environment with 26 tools each tool represents a letter of the alphabet. The objective of this task is to evaluate the model's ability the use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typer writer task, where 26 parameterless tools are given instead of a single tool that takes a letter as an argument. |


Clone the dataset associated with this task.

In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Typewriter (26 tools) already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/5051c0ae-16be-4afa-b914-84acbc5e9659.


Let's build an agent that we can use for evaluation.

## The Environment

The environment consists of 26 tools and a virtual paper.

Each tool is responsible for printing a letter on the paper that corresponds to it.

In [9]:
env = task.create_environment()

In [10]:
env.tools[:5]

[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f099cebd310>),
 StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f097f56f940>),
 StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f097f56ff70>),
 StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchemaSchema'>, func=<function _create_typing_func.<locals>.func at 0x7f096421b040>),
 StructuredTool(name='e', description='e() -> str - Run to Type the letter "e".', args_schema=<class 'pydantic.v1.main.eSchemaSchema'>, func=<function _create_typing_func.<loca

In [12]:
env.tools[0].invoke({})

'OK'

In [13]:
env.tools[3].invoke({})

'OK'

In [14]:
env.read_state()

'ad'
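
To make the behaviour above concrete, here is a minimal plain-Python sketch of such an environment. It is not the library's implementation: `make_typewriter` and its inner helpers are hypothetical names, and the only things taken from the notebook are the 26 parameterless tools, the `'OK'` return value, and a readable paper state.

```python
import string
from typing import Callable, Dict, List, Tuple


def make_typewriter() -> Tuple[Dict[str, Callable[[], str]], Callable[[], str]]:
    """Build 26 parameterless tools that all write to one shared paper."""
    paper: List[str] = []  # the virtual paper, shared by every tool

    def make_tool(letter: str) -> Callable[[], str]:
        # Each tool closes over its own letter and the shared paper.
        def tool() -> str:
            paper.append(letter)
            return "OK"

        return tool

    tools = {letter: make_tool(letter) for letter in string.ascii_lowercase}

    def read_state() -> str:
        # The state is simply everything typed so far, in order.
        return "".join(paper)

    return tools, read_state


tools, read_state = make_typewriter()
tools["a"]()
tools["d"]()
print(read_state())  # ad
```

Because the paper lives inside the closure, each call to `make_typewriter()` yields an independent environment, which is the property the evaluation relies on.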

## Agent Factory

For evaluation, we need an agent factory that will create a new instance of an agent executor for every evaluation run.

We'll use the `OpenAIAgentFactory` provided with LangChain Benchmarks. See the `intro` section to learn how to define your own.

In [15]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory()
agent.invoke({"question": "abc"})

{'question': 'abc',
 'output': 'abc\nabc',
 'intermediate_steps': [(AgentActionMessageLog(tool='a', tool_input={}, log='\nInvoking: `a` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '', 'name': 'a'}})]),
   'OK'),
  (AgentActionMessageLog(tool='b', tool_input={}, log='\nInvoking: `b` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '', 'name': 'b'}})]),
   'OK'),
  (AgentActionMessageLog(tool='c', tool_input={}, log='\nInvoking: `c` with `{}`\n\n\n', message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '', 'name': 'c'}})]),
   'OK')],
 'state': 'abc'}
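
The reason for using a factory rather than a single agent instance is state isolation: the paper must start empty for every example. The sketch below illustrates this with a toy stand-in. `ToyAgent` and `toy_agent_factory` are hypothetical and unrelated to `OpenAIAgentFactory`; the toy agent simply types its input without calling any model.

```python
from typing import Dict, List


class ToyAgent:
    """Hypothetical stand-in for an agent executor that owns its paper."""

    def __init__(self) -> None:
        self.paper: List[str] = []

    def invoke(self, inputs: Dict[str, str]) -> Dict[str, str]:
        # Type every letter of the question onto this agent's paper.
        for letter in inputs["question"]:
            self.paper.append(letter)
        return {"question": inputs["question"], "state": "".join(self.paper)}


def toy_agent_factory() -> ToyAgent:
    """Return a brand-new agent (and therefore a blank paper) per call."""
    return ToyAgent()


# Reusing one agent leaks state from the first example into the second.
shared = ToyAgent()
shared.invoke({"question": "ab"})
leaked = shared.invoke({"question": "cd"})["state"]

# A factory call per example keeps each run independent.
fresh = toy_agent_factory().invoke({"question": "cd"})["state"]

print(leaked)  # abcd
print(fresh)   # cd
```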

## Eval

Let's evaluate the agent now.

In [16]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import get_eval_config

client = Client()

test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=get_eval_config(),
    verbose=True,
    tags=["gpt-3.5-turbo-16k"],
)

View the evaluation results for project 'test-mealy-ink-37' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/d5562dcb-7bea-432d-8e41-3fcf3f6f2247?eval=true

View all tests for Dataset Tool Usage - Typewriter (26 tools) at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/5051c0ae-16be-4afa-b914-84acbc5e9659
[----------->                                      ] 5/20

Chain failed for example c0ee0026-e11b-4036-b7f0-135ac9e82d66 with inputs {'question': 'horse'}
Error Type: InternalServerError, Message: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID 8707858df9212b40a8d4f22a0027d2a2 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}


[-------------->                                   ] 6/20

Chain failed for example 4ae1d1c0-4c34-4ef0-afd8-292be2e53b8d with inputs {'question': 'school'}
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': "'s()' does not match '^[a-zA-Z0-9_-]{1,64}$' - 'messages.2.function_call.name'", 'type': 'invalid_request_error', 'param': None, 'code': None}}


[----------------------------->                    ] 12/20

Chain failed for example e03000da-4c4b-4060-a798-0e71f3c3ff90 with inputs {'question': 'keyboard'}
Error Type: InternalServerError, Message: Error code: 500 - {'error': {'message': 'The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID d20bfa7a39d9ee8c80e72070a6aafab9 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}


[------------------------------------------------->] 20/20
 Eval quantiles:
                                     0.25        0.5       0.75       mean  \
Intermediate steps correctness   0.000000   0.000000   1.000000   0.294118   
# steps / # expected steps       1.000000   1.125000   2.142857   1.722598   
Correct Final State              0.000000   1.000000   1.000000   0.529412   
correctness                      0.000000   0.000000   1.000000   0.470588   
execution_time                  38.794961  38.794961  38.794961  38.794961   

                                     mode  
Intermediate steps correctness   0.000000  
# steps / # expected steps       1.000000  
Correct Final State              1.000000  
correctness                      0.000000  
execution_time                  38.794961  
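
One of the failures in the log above is a 400 error for the input `'school'`, where the function name `'s()'` is rejected because it does not match the pattern `^[a-zA-Z0-9_-]{1,64}$`. The short check below applies that same pattern with Python's `re` module to show why `s` passes and `s()` does not. The helper name `is_valid_function_name` is hypothetical; only the pattern comes from the error message.

```python
import re

# Pattern copied from the BadRequestError message in the log.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def is_valid_function_name(name: str) -> bool:
    """Return True if the name matches the pattern quoted in the error."""
    return NAME_PATTERN.match(name) is not None


print(is_valid_function_name("s"))    # True
print(is_valid_function_name("s()"))  # False
```

The parentheses are outside the allowed character set, so the request appears to have failed on a malformed tool name rather than on the task itself.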


In [23]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [69]:
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["actual_number_of_steps"] = (
    df["output.intermediate_steps"]
    .apply(lambda x: None if not isinstance(x, list) else len(x))
    .fillna("")
)
# Assign the result back instead of using inplace=True on a column selection.
df["output.Error"] = df["output.Error"].fillna("")

In [71]:
df["Correct Final State"].mean()

0.5294117647058824
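
Note that pandas `mean()` skips missing values by default, so this figure covers only the runs that produced a score. The value 0.5294... equals 9/17, which is consistent with 20 examples of which 3 failed with an error. The 9 correct / 8 incorrect / 3 failed split below is an inference from those numbers, not something the notebook reports directly. The arithmetic can be reproduced without pandas:

```python
import math

# Inferred split: 9 correct, 8 incorrect, 3 runs that errored (no score).
scores = [1.0] * 9 + [0.0] * 8 + [math.nan] * 3

# Mimic the default skip-missing behaviour: average only the scored runs.
completed = [s for s in scores if not math.isnan(s)]
mean_completed = sum(completed) / len(completed)

# Alternative view: count every errored run as a failure.
mean_all = sum(completed) / len(scores)

print(round(mean_completed, 4))  # 0.5294
print(round(mean_all, 4))        # 0.45
```

Which figure is more appropriate depends on whether server-side errors should count against the agent.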

In [73]:
df[
    [
        "input.question",
        "output.state",
        "num_expected_steps",
        "actual_number_of_steps",
        "Correct Final State",
        "output.Error",
    ]
]

|  | input.question | output.state | num_expected_steps | actual_number_of_steps | Correct Final State | output.Error |
|---|---|---|---|---|---|---|
| 0 | a | a | 1 | 1.0 | 1.0 |  |
| 1 | aa | aa | 2 | 2.0 | 1.0 |  |
| 2 | aaa | aaaaaaaaaaaaaaa | 3 | 15.0 | 0.0 |  |
| 3 | aaaa | aaaaaaaaaaaaaaa | 4 | 15.0 | 0.0 |  |
| 4 | dog | dog | 3 | 4.0 | 1.0 |  |
| 5 | cat | cat | 3 | 4.0 | 1.0 |  |
| 6 | hand | hand | 4 | 4.0 | 1.0 |  |
| 7 | head | hhhhhhhhhhhhhhh | 4 | 15.0 | 0.0 |  |
| 8 | house | house | 5 | 5.0 | 1.0 |  |
| 9 | horse |  | 5 |  |  | InternalServerError("Error code: 500 - {'error... |
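
To read the table, it helps to recompute two of the per-example quantities by hand. The sketch below takes a few rows from the table and derives the step ratio and the final-state check, assuming that the expected steps are simply the letters of the input in order (as the task description states) and that the final state is correct when it equals the input. The function `summarize` is hypothetical, and the benchmark's own evaluators may compute these values differently.

```python
from typing import Tuple


def summarize(question: str, state: str, actual_steps: int) -> Tuple[float, float]:
    """Return (steps / expected steps, final-state correctness) for one row."""
    expected_steps = list(question)  # 'abc' -> ['a', 'b', 'c']
    ratio = actual_steps / len(expected_steps)
    correct_final_state = float(state == question)
    return ratio, correct_final_state


# (input.question, output.state, actual_number_of_steps) copied from the table.
rows = [
    ("aa", "aa", 2),
    ("aaa", "aaaaaaaaaaaaaaa", 15),
    ("dog", "dog", 4),
    ("head", "hhhhhhhhhhhhhhh", 15),
]

for question, state, actual_steps in rows:
    ratio, correct = summarize(question, state, actual_steps)
    print(f"{question}: ratio={ratio:.2f}, correct_final_state={correct}")
```

The rows with 15 steps show the agent repeating a single tool, while `dog` reaches the right final state despite recording one more step than expected.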
