# Typewriter: Single Tool

In this task, an agent is given access to a single tool called "type_letter".
This tool takes one argument called "letter" which is expected to be a character.

The agent must repeat the input string from the user, printing one
character at a time on a piece of virtual paper.

The agent is evaluated based on its ability to print the correct string using
the "type_letter" tool.
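As a rough illustration of what "printing the correct string" means, here is a minimal sketch. It assumes the expected trajectory is one `type_letter` call per character of the input, in order. The helpers `expected_calls` and `is_correct` are hypothetical names and are not part of `langchain_benchmarks`.

```python
def expected_calls(text: str) -> list[dict]:
    """Build one hypothetical type_letter call per character, in order."""
    return [{"tool": "type_letter", "letter": ch} for ch in text]


def is_correct(text: str, typed: list[str]) -> bool:
    """Return True when the typed letters reproduce the input exactly."""
    return "".join(typed) == text


calls = expected_calls("abc")
print([call["letter"] for call in calls])
print(is_correct("abc", ["a", "b", "c"]))
print(is_correct("a", ["a", "b", "c"]))
```

Under this reading, typing extra letters is as much a failure as typing too few.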

--------

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

In [2]:
task = registry["Tool Usage - Typewriter (1 tool)"]
task

Field,Value
Name,Tool Usage - Typewriter (1 tool)
Type,ToolUsageTask
Dataset ID,59577193-8938-4ccf-92a7-e8a96bcf4f86
Description,"Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper. The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string."


Clone the dataset associated with this task.

In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Tool Usage - Typewriter (1 tool) already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/82ca6840-cf23-4bb0-a9be-55237ebbe9d3.


## The Environment

The environment consists of a single tool and a virtual paper.

The tool accepts a single letter as input and prints the letter on the virtual paper. If successful, the tool returns the output "OK".

To determine what's written on the paper, one needs to read the environment state.

In [4]:
env = task.create_environment()

In [5]:
env.tools

[StructuredTool(name='type_letter', description='type_letter(letter: str) -> str - Print the given letter on the paper.', args_schema=<class 'pydantic.v1.main.type_letterSchemaSchema'>, func=<function create_typer.<locals>.type_letter at 0x7f3e404a07c0>)]

In [6]:
tool = env.tools[0]

In [7]:
tool.invoke({"letter": "a"})

'OK'

In [8]:
tool.invoke({"letter": "b"})

'OK'

In [9]:
env.read_state()

'ab'
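The behaviour shown above can be mimicked with a few lines of plain Python. This is only a sketch of the idea, not the `langchain_benchmarks` implementation: the class `PaperEnvironment` is hypothetical, and rejecting inputs longer than one character is an assumption of this sketch.

```python
class PaperEnvironment:
    """Hypothetical stand-in for the typewriter environment."""

    def __init__(self) -> None:
        self._paper: list[str] = []

    def type_letter(self, letter: str) -> str:
        """Append one character to the paper and acknowledge with 'OK'."""
        if len(letter) != 1:
            # Assumption: this sketch rejects anything but a single character.
            raise ValueError("expected a single character")
        self._paper.append(letter)
        return "OK"

    def read_state(self) -> str:
        """Return everything typed so far as one string."""
        return "".join(self._paper)


env_sketch = PaperEnvironment()
env_sketch.type_letter("a")
env_sketch.type_letter("b")
print(env_sketch.read_state())  # ab
```

The tool's return value carries no information about the paper, which is why the state has to be read separately.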

## Agent Factory

For evaluation, we need an agent factory that will create a new instance of an agent executor for every evaluation run.

We'll use the `OpenAIAgentFactory` provided with LangChain Benchmarks; see the `intro` section for how to define your own.

In [10]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory()

In [11]:
agent.invoke({"question": "abc"})

{'input': 'abc',
 'output': 'a, b, c',
 'intermediate_steps': [(AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'a'}, log="\nInvoking: `type_letter` with `{'letter': 'a'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "a"\n}', 'name': 'type_letter'}})]),
   'OK'),
  (AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'b'}, log="\nInvoking: `type_letter` with `{'letter': 'b'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "b"\n}', 'name': 'type_letter'}})]),
   'OK'),
  (AgentActionMessageLog(tool='type_letter', tool_input={'letter': 'c'}, log="\nInvoking: `type_letter` with `{'letter': 'c'}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "letter": "c"\n}', 'name': 'type_letter'}})]),
   'OK')],
 'state': 'abc'}

## Eval

Now let's evaluate the agent with several models.

In [12]:
import uuid

from langsmith.client import Client

from langchain_benchmarks.tool_usage import get_eval_config

experiment_uuid = uuid.uuid4().hex[:4]

client = Client()

models = ["gpt-3.5-turbo-1106", "gpt-3.5-turbo-0613", "gpt-4-0613"]

for model in models:
    print()
    # Evaluate the trajectory and the final state, but not the output,
    # which is meaningless for this task.
    eval_config = get_eval_config(output_evaluation="none")
    agent_factory = agents.OpenAIAgentFactory(task, model=model)
    test_run = client.run_on_dataset(
        dataset_name=task.name,
        llm_or_chain_factory=agent_factory,
        evaluation=eval_config,
        verbose=False,
        project_name=f"typewriter-1-{model}-{experiment_uuid}",
        tags=[model],
        project_metadata={
            "model": model,
            "arch": "openai-functions-agent",
            "id": experiment_uuid,
        },
    )


View the evaluation results for project 'typewriter-1-gpt-3.5-turbo-1106-7709' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/d29cf7d9-9cfa-4fcd-8380-8c339b940972?eval=true

View all tests for Dataset Tool Usage - Typewriter (1 tool) at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/82ca6840-cf23-4bb0-a9be-55237ebbe9d3
[------------------------------------------------->] 20/20
View the evaluation results for project 'typewriter-1-gpt-3.5-turbo-0613-7709' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/044be5ad-0871-4b08-bf5c-1dd6ba94f53b?eval=true

View all tests for Dataset Tool Usage - Typewriter (1 tool) at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/82ca6840-cf23-4bb0-a9be-55237ebbe9d3
[------------------------------------------------->] 20/20
View the evaluation results for project 'typewriter-1-gpt-4-0613-7709' at:
https://smith.langchain.com/o/ebba

## Inspect

You can take a look at the underlying results.

In [19]:
import pandas as pd
from langsmith.client import Client

client = Client()
projects = list(
    client.list_projects(reference_dataset_name="Tool Usage - Typewriter (1 tool)")
)

dfs = []
for project in projects:
    first_root_run = next(
        client.list_runs(project_name=project.name, execution_order=1)
    )
    # Temporary workaround: recover the model name from the tags of the first root run
    tags = first_root_run.tags
    test_results = client.get_test_results(project_name=project.name)
    test_results["model"] = tags[0]
    dfs.append(test_results)


df = pd.concat(dfs)

df["actual_steps"] = df["outputs.intermediate_steps"].apply(
    lambda steps: [step[0]["tool"] for step in steps]
)
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["num_actual_steps"] = df["actual_steps"].apply(len)
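To show what the `step[0]["tool"]` extraction does, here is a small sketch on a hand-written trajectory. It assumes each serialized step is an `(action, observation)` pair whose action is a dict with `tool` and `tool_input` keys, as the agent output earlier suggests. The `steps` list itself is made up for illustration.

```python
# Hypothetical serialized trajectory: (action, observation) pairs.
steps = [
    ({"tool": "type_letter", "tool_input": {"letter": "a"}}, "OK"),
    ({"tool": "type_letter", "tool_input": {"letter": "b"}}, "OK"),
]

# Same extraction as in the cell above: keep only the tool name of each step.
actual_steps = [step[0]["tool"] for step in steps]
# The typed string can be rebuilt from the tool inputs.
typed = "".join(step[0]["tool_input"]["letter"] for step in steps)

print(actual_steps)  # ['type_letter', 'type_letter']
print(len(actual_steps))  # 2
print(typed)  # ab
```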

### Stats

This is a simple task: the agent uses a single tool that takes only one argument (the character to type).

Given this simplicity, we expect all models to do well, ideally scoring 100%.

In [20]:
correct_df = (
    df.groupby("model")["feedback.Correct Final State"].sum().to_frame("# correct")
)
count_df = df.groupby("model").size().to_frame("n")

columns = [
    "feedback.Correct Final State",
    "feedback.Intermediate steps correctness",
    "execution_time",
    "feedback.# steps / # expected steps",
]

df.groupby("model")[columns].mean().join(correct_df).join(count_df)

model,feedback.Correct Final State,feedback.Intermediate steps correctness,execution_time,feedback.# steps / # expected steps,# correct,n
gpt-3.5-turbo-0613,0.95,0.95,18.880388,1.7,19.0,20
gpt-3.5-turbo-1106,0.9,0.75,22.471857,1.012455,18.0,20
gpt-4-0613,0.9,0.9,22.663781,1.09375,18.0,20
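As a quick consistency check on this table: assuming `feedback.Correct Final State` is a 0/1 score per example, its mean should equal `# correct` divided by `n`. The counts below are copied from the table; the variable names are illustrative only.

```python
# (# correct, n) per model, copied from the summary table.
summary = {
    "gpt-3.5-turbo-0613": (19, 20),
    "gpt-3.5-turbo-1106": (18, 20),
    "gpt-4-0613": (18, 20),
}

# Mean of a 0/1 column is the fraction of examples scored 1.
accuracy = {model: correct / n for model, (correct, n) in summary.items()}

for model, value in accuracy.items():
    print(f"{model}: {value:.2f}")
```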


### Individual

In [22]:
columns = [
    "input.question",
    "model",
    "outputs.state",
    "reference.state",
    "feedback.Correct Final State",
    "num_expected_steps",
    "num_actual_steps",
]
df[columns].sort_values(by=["input.question", "model"]).head()

example_id,input.question,model,outputs.state,reference.state,feedback.Correct Final State,num_expected_steps,num_actual_steps
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-3.5-turbo-0613,aaaaaaaaaaaaaaa,a,0.0,1,15
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-3.5-turbo-1106,a,a,1.0,1,1
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-4-0613,abc,a,0.0,1,3
5b40cb96-ae09-438e-b940-d24445bb5d67,aa,gpt-3.5-turbo-0613,aa,aa,1.0,2,2
5b40cb96-ae09-438e-b940-d24445bb5d67,aa,gpt-3.5-turbo-1106,aa,aa,1.0,2,2
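To make the per-row columns concrete, the sketch below recomputes them for the three rows with input `a` shown above. It assumes that `Correct Final State` is an exact string match between `outputs.state` and `reference.state`, which is consistent with these rows but is not confirmed here. The `Row` type and `score` helper are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    output_state: str
    reference_state: str
    num_expected_steps: int
    num_actual_steps: int


def score(row: Row) -> tuple[float, float]:
    """Return (correct_final_state, step_ratio) for one row."""
    # Assumption: correctness is an exact match of the final paper state.
    correct = 1.0 if row.output_state == row.reference_state else 0.0
    ratio = row.num_actual_steps / row.num_expected_steps
    return correct, ratio


# Values copied from the rows for the input "a".
rows = [
    Row("gpt-3.5-turbo-0613", "a" * 15, "a", 1, 15),
    Row("gpt-3.5-turbo-1106", "a", "a", 1, 1),
    Row("gpt-4-0613", "abc", "a", 1, 3),
]

for row in rows:
    print(row.model, score(row))
```

Both failures here come from typing too much, either by repeating the letter or by continuing past it.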


In [23]:
df[columns].sort_values(by=["input.question", "model"])

example_id,input.question,model,outputs.state,reference.state,feedback.Correct Final State,num_expected_steps,num_actual_steps
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-3.5-turbo-0613,aaaaaaaaaaaaaaa,a,0.0,1,15
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-3.5-turbo-1106,a,a,1.0,1,1
89bb564a-ddee-4a36-8a3d-d093eef415ca,a,gpt-4-0613,abc,a,0.0,1,3
5b40cb96-ae09-438e-b940-d24445bb5d67,aa,gpt-3.5-turbo-0613,aa,aa,1.0,2,2
5b40cb96-ae09-438e-b940-d24445bb5d67,aa,gpt-3.5-turbo-1106,aa,aa,1.0,2,2
5b40cb96-ae09-438e-b940-d24445bb5d67,aa,gpt-4-0613,aa,aa,1.0,2,2
288d6483-c618-4e34-9b86-275b490e0975,aaa,gpt-3.5-turbo-0613,aaa,aaa,1.0,3,3
288d6483-c618-4e34-9b86-275b490e0975,aaa,gpt-3.5-turbo-1106,aaa,aaa,1.0,3,3
288d6483-c618-4e34-9b86-275b490e0975,aaa,gpt-4-0613,aaa,aaa,1.0,3,3
915bd4b5-a536-4849-8cb6-8a658407c2c9,aaaa,gpt-3.5-turbo-0613,aaaa,aaaa,1.0,4,4
