# Typewriter: Single Tool
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-benchmarks/blob/main/docs/source/notebooks/tool_usage/typewriter_1.ipynb)


    In this task, an agent is given access to a single tool called "type_letter".
    This tool takes one argument called "letter" which is expected to be a character.
    
    The agent must repeat the input string from the user, printing one
    character at a time on a piece of virtual paper.
    
    The agent is evaluated based on its ability to print the correct string using
    the "type_letter" tool.

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

In [2]:
task = registry["Tool Usage - Typewriter (1 tool)"]
task

0,1
Name,Tool Usage - Typewriter (1 tool)
Type,ToolUsageTask
Dataset ID,59577193-8938-4ccf-92a7-e8a96bcf4f86
Description,"Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."


Clone the dataset associated with this task.

In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Typewriter (1 tool) already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/25850d74-d4e0-41ac-81a1-dfc78a79660b.


## The Environment

The environment consists of a single tool and a virtual paper.

The tool accepts a single letter as input and prints that letter on the virtual paper. If successful, the tool returns the output "OK".

To determine what's written on the paper, one needs to read the environment state.

In [16]:
env = task.create_environment()

In [17]:
env.tools

[StructuredTool(name='type_letter', description='type_letter(letter: str) -> str - Print the given letter on the paper.', args_schema=<class 'pydantic.v1.main.type_letterSchemaSchema'>, func=<function create_typer.<locals>.type_letter at 0x7f538cc0e040>)]

In [19]:
tool = env.tools[0]

In [21]:
tool.invoke({'letter': 'a'})

'OK'

In [22]:
tool.invoke({'letter': 'b'})

'OK'

In [24]:
env.read_state()

'ab'
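
To make the relationship between the tool and the paper concrete, here is a minimal stand-alone sketch of an environment that behaves like the cells above. It is not the library's implementation: `make_typewriter` is a hypothetical helper, and the sketch assumes the tool simply appends whatever it receives without validating it.

```python
from typing import Callable, List, Tuple


def make_typewriter() -> Tuple[Callable[[str], str], Callable[[], str]]:
    """Hypothetical stand-in for the benchmark environment (illustration only)."""
    paper: List[str] = []  # the virtual paper, shared by both closures

    def type_letter(letter: str) -> str:
        # Assumption: the letter is appended as-is, with no validation.
        paper.append(letter)
        return "OK"

    def read_state() -> str:
        # The state is everything typed so far, in order.
        return "".join(paper)

    return type_letter, read_state


type_letter, read_state = make_typewriter()
type_letter("a")
type_letter("b")
print(read_state())  # ab
```

Keeping the paper in a closure means the only way to change it is through `type_letter`, which mirrors how the agent can only affect the environment through its tool.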

## Agent

Let's build an agent that we can use for evaluation.

In [4]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory.create()
agent.invoke({"question": "abc"})

{'question': 'abc',
 'output': 'a, b, c',
 'intermediate_steps': [(AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'a'}, log="\nInvoking: `type_letter` with `{'letter': 'a'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "a"\n}', 'name': 'type_letter'}})]),
   'OK'),
  (AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'b'}, log="\nInvoking: `type_letter` with `{'letter': 'b'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "b"\n}', 'name': 'type_letter'}})]),
   'OK'),
  (AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'c'}, log="\nInvoking: `type_letter` with `{'letter': 'c'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "c"\n}', 'name': 'type_letter'}})]),
   'OK')],
 'state': 'abc'}
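
Note that the `output` text ('a, b, c') differs from the `state` ('abc'); what lands on the paper depends on the tool calls. A rough sketch of how one might reconstruct the typed string from `intermediate_steps` follows. The `Action` dataclass is a hypothetical stand-in for the action objects printed above, keeping only the `tool` and `tool_input` fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Action:
    """Hypothetical stand-in for the agent action objects shown above."""

    tool: str
    tool_input: Dict[str, str]


def typed_string(steps: List[Tuple[Action, str]]) -> str:
    """Concatenate the letters passed to type_letter calls that returned 'OK'."""
    return "".join(
        action.tool_input["letter"]
        for action, observation in steps
        if action.tool == "type_letter" and observation == "OK"
    )


steps = [
    (Action("type_letter", {"letter": "a"}), "OK"),
    (Action("type_letter", {"letter": "b"}), "OK"),
    (Action("type_letter", {"letter": "c"}), "OK"),
]
print(typed_string(steps))  # abc
```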

## Eval

Now let's evaluate the agent.

In [5]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import STANDARD_AGENT_EVALUATOR

client = Client()

test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=STANDARD_AGENT_EVALUATOR,
    verbose=True,
    tags=["gpt-3.5-turbo-16k"],
)

View the evaluation results for project 'test-shiny-curve-39' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/c66bbd6e-cce5-461d-9287-97391bd2f668?eval=true

View all tests for Dataset Tool Usage - Typewriter (1 tool) at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/25850d74-d4e0-41ac-81a1-dfc78a79660b
[------------------------------------------------->] 20/20
 Eval quantiles:
                                     0.25        0.5       0.75       mean  \
Intermediate steps correctness   1.000000   1.000000   1.000000   0.950000   
# steps / # expected steps       1.000000   1.000000   1.000000   1.700000   
Correct Final State              1.000000   1.000000   1.000000   0.950000   
correctness                      1.000000   1.000000   1.000000   0.800000   
execution_time                  34.058961  34.058961  34.058961  34.058961   

                                     mode  
Intermediate steps correctness   1.000000 

## Inspect

You can take a look at the underlying results.

In [6]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [7]:
df["correctness"].mean()

0.8

In [8]:
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["actual_number_of_steps"] = df["output.intermediate_steps"].apply(len)

In [9]:
df.head()

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,Correct Final State,correctness,execution_time,input.question,output.question,output.output,output.intermediate_steps,output.state,reference.state,reference.reference,reference.expected_steps,num_expected_steps,actual_number_of_steps
0,0,15.0,0,0,34.058961,a,a,Agent stopped due to iteration limit or time l...,[(tool='type_letter' tool_input={'letter': 'a'...,aaaaaaaaaaaaaaa,a,a,[type_letter],1,15
1,1,1.0,1,1,34.058961,aa,aa,aa\naa,[(tool='type_letter' tool_input={'letter': 'a'...,aa,aa,aa,"[type_letter, type_letter]",2,2
2,1,1.0,1,0,34.058961,aaa,aaa,a\na,[(tool='type_letter' tool_input={'letter': 'a'...,aaa,aaa,aaa,"[type_letter, type_letter, type_letter]",3,3
3,1,1.0,1,0,34.058961,aaaa,aaaa,a\na,[(tool='type_letter' tool_input={'letter': 'a'...,aaaa,aaaa,aaaa,"[type_letter, type_letter, type_letter, type_l...",4,4
4,1,1.0,1,1,34.058961,dog,dog,d\no\ng,[(tool='type_letter' tool_input={'letter': 'd'...,dog,dog,dog,"[type_letter, type_letter, type_letter]",3,3
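
The `# steps / # expected steps` column appears to be the ratio of the two counts computed above (for example, 15 actual steps against 1 expected step gives 15.0). The sketch below recomputes that ratio in plain Python for the five rows displayed; the `step_ratio` key and the `over_budget` list are names introduced here for illustration only.

```python
# Step counts copied from the five rows shown in the table above.
rows = [
    {"question": "a", "num_expected_steps": 1, "actual_number_of_steps": 15},
    {"question": "aa", "num_expected_steps": 2, "actual_number_of_steps": 2},
    {"question": "aaa", "num_expected_steps": 3, "actual_number_of_steps": 3},
    {"question": "aaaa", "num_expected_steps": 4, "actual_number_of_steps": 4},
    {"question": "dog", "num_expected_steps": 3, "actual_number_of_steps": 3},
]

for row in rows:
    # A ratio above 1 means the agent made more tool calls than needed.
    row["step_ratio"] = row["actual_number_of_steps"] / row["num_expected_steps"]

over_budget = [row["question"] for row in rows if row["step_ratio"] > 1]
print(over_budget)  # ['a']
```

Among the rows shown, only the single-letter input "a" exceeds its expected step count; that is the run that stopped at the iteration limit after typing the letter 15 times.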
