# Multiverse Math

Let's see how to evaluate an agent's ability to use tools.

    Solve basic math question using the provided tools.

    Must use the provided tools to solve the math question.

    To make sure that innate knowledge is not used, the math operations have been altered to yield different results than expected.

    The modified operations should yield different results, but still retain appropriate properties. For example, the modified multiplication operation should still be commutative.

    Please note that the modified operations are not guaranteed to even make sense in the real world since not all properties will be retained (e.g., distributive property).
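
To make the description above concrete, here is a minimal sketch of what altered operations could look like. The two formulas are assumptions inferred from outputs in this notebook (`add(3, 5)` returning `9.2`, and the reference answer `32.34` for the apples-and-oranges question); the tools that ship with the task may be defined differently.

```python
import math


# Hypothetical stand-ins for the altered operations, inferred from the
# outputs shown in this notebook. They are NOT copied from the library.
def add(a: float, b: float) -> float:
    """Altered addition: the usual sum shifted by a constant."""
    return a + b + 1.2


def multiply(a: float, b: float) -> float:
    """Altered multiplication: the usual product scaled by a constant."""
    return 1.1 * (a * b)


# Commutativity is retained: swapping the arguments changes nothing.
assert multiply(3, 7) == multiply(7, 3)
assert add(3, 7) == add(7, 3)

# Distributivity is not retained: a * (b + c) differs from a * b + a * c.
lhs = multiply(2, add(3, 4))
rhs = add(multiply(2, 3), multiply(2, 4))
print("distributive?", math.isclose(lhs, rhs))
```

A model that answers from memory (for example, `3 + 5 = 8`) will disagree with operations like these, which is what makes tool use observable.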

For this code to work, please configure LangSmith environment variables with your credentials.

In [1]:
import os
os.environ["LANGCHAIN_API_KEY"] = "sk-..." # Your LangSmith API key
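
If you prefer not to hard-code the key in the notebook, one option is to export it in your shell and only check for it here. The sketch below assumes a hypothetical helper named `require_env`; it is not part of LangSmith or `langchain_benchmarks`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Export it in your shell before running this notebook."
        )
    return value
```

Calling `require_env("LANGCHAIN_API_KEY")` at the top of the notebook surfaces a missing credential immediately instead of partway through an evaluation run.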

In [18]:
from langchain_benchmarks import clone_public_dataset, registry

In [3]:
task = registry["Multiverse Math"]
task

Name,Multiverse Math
Type,ToolUsageTask
Dataset ID,https://smith.langchain.com/public/594f9f60-30a0-49bf-b075-f44beabf546a/d
Description,"An environment that contains a few basic math operations, but with altered results. For example, mu..."


Clone the dataset associated with this task.

In [6]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Multiverse Math already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/ddca73f1-ceda-4562-8c49-7ee0a9df2a01.


Let's build an agent that we can use for evaluation.

In [10]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory.create()
agent.invoke({"question": "how much is 3 + 5"})

{'question': 'how much is 3 + 5',
 'output': 'In this alternate mathematical universe, the result of adding 3 and 5 is 9.2.',
 'intermediate_steps': [(AgentActionMessageLog(tool='add', tool_input={'a': 3, 'b': 5}, log="\nInvoking: `add` with `{'a': 3, 'b': 5}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "a": 3,\n  "b": 5\n}', 'name': 'add'}})]),
   9.2)]}
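
The `intermediate_steps` entry is a list of `(action, observation)` pairs, where the action carries `tool` and `tool_input` fields, as the output above shows. A minimal sketch for flattening it into something easier to read follows; `FakeAction` and `summarize_steps` are hypothetical names used only for illustration, with `FakeAction` standing in for the real action object so the example runs on its own.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class FakeAction(NamedTuple):
    """Hypothetical stand-in exposing the same `tool` / `tool_input` fields."""

    tool: str
    tool_input: Dict[str, Any]


def summarize_steps(result: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any], Any]]:
    """Turn (action, observation) pairs into (tool name, arguments, observation) tuples."""
    return [
        (action.tool, action.tool_input, observation)
        for action, observation in result.get("intermediate_steps", [])
    ]


# Sample shaped like the agent output shown above.
sample = {
    "question": "how much is 3 + 5",
    "intermediate_steps": [(FakeAction("add", {"a": 3, "b": 5}), 9.2)],
}
print(summarize_steps(sample))
```

Because the function only reads the `tool` and `tool_input` attributes, it should also work on the real result returned by `agent.invoke`, assuming those attributes are present as in the output above.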

## Eval

Now let's evaluate the agent.

In [13]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import STANDARD_AGENT_EVALUATOR

client = Client()

test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=STANDARD_AGENT_EVALUATOR,
    verbose=True,
    tags=["gpt-3.5-turbo-16k"],
)

View the evaluation results for project 'test-excellent-potato-37' at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/projects/p/e350cda0-4e1d-49eb-8483-574172d1c635?eval=true

View all tests for Dataset Multiverse Math at:
https://smith.langchain.com/o/e081f11e-fbd2-41b4-9fa8-5d76c76ef854/datasets/ddca73f1-ceda-4562-8c49-7ee0a9df2a01
[------------------------------------------------->] 10/10
 Eval quantiles:
                                    0.25       0.5      0.75      mean  \
Intermediate steps correctness   0.00000   0.00000   0.00000   0.10000   
# steps / # expected steps       5.00000   7.50000   8.62500   7.75000   
correctness                      0.00000   0.00000   0.00000   0.10000   
execution_time                  38.76436  38.76436  38.76436  38.76436   

                                    mode  
Intermediate steps correctness   0.00000  
# steps / # expected steps       5.00000  
correctness                      0.00000  
execution_time          

## Inspect

You can take a look at the underlying results.

In [14]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [15]:
df["correctness"].mean()

0.1

In [16]:
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["actual_number_of_steps"] = df["output.intermediate_steps"].apply(len)

In [17]:
df.head()

Unnamed: 0,Intermediate steps correctness,# steps / # expected steps,correctness,execution_time,input.question,output.question,output.output,output.intermediate_steps,reference.reference,reference.expected_steps,num_expected_steps,actual_number_of_steps
0,0,15.0,0,38.76436,Add 2 and 3,Add 2 and 3,Agent stopped due to iteration limit or time l...,"[(tool='add' tool_input={'a': 2, 'b': 3} log=""...",6.2,[add],1,15
1,0,15.0,0,38.76436,Subtract 3 from 2,Subtract 3 from 2,Agent stopped due to iteration limit or time l...,"[(tool='subtract' tool_input={'a': 2, 'b': 3} ...",-4.0,[subtract],1,15
2,0,9.0,1,38.76436,What is -5 if evaluated using the negate funct...,What is -5 if evaluated using the negate funct...,-5.0\n-5.0,"[(tool='negate' tool_input={'a': -5} log=""\nIn...",-5.0,[negate],1,9
3,1,1.0,0,38.76436,what is the result of 2 to the power of 3?,what is the result of 2 to the power of 3?,The result of 2 to the power of 3 is 32.,"[(tool='power' tool_input={'a': 2, 'b': 3} log...",32.0,[power],1,1
4,0,7.5,0,38.76436,I ate 1 apple and 2 oranges every day for 7 da...,I ate 1 apple and 2 oranges every day for 7 da...,Agent stopped due to iteration limit or time l...,"[(tool='add' tool_input={'a': 1, 'b': 2} log=""...",32.34,"[multiply, add]",2,15
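
To help read the step-related columns, here is a small plain-Python sketch of how they could be computed for a single row. It is an assumption about what the metrics mean, not the evaluator's actual implementation: the ratio is taken as actual steps divided by expected steps (which matches the rows above, e.g. 15 / 2 = 7.5), and step correctness is taken as an exact match of the tool-name sequence. The function names are hypothetical.

```python
from typing import Sequence


def step_ratio(actual_tools: Sequence[str], expected_tools: Sequence[str]) -> float:
    """Number of tool calls made, relative to the number expected."""
    if not expected_tools:
        raise ValueError("expected_tools must not be empty")
    return len(actual_tools) / len(expected_tools)


def steps_match(actual_tools: Sequence[str], expected_tools: Sequence[str]) -> bool:
    """True only when the same tools were called in the same order."""
    return list(actual_tools) == list(expected_tools)


# Row 3 above: one `power` call, one expected.
print(step_ratio(["power"], ["power"]), steps_match(["power"], ["power"]))

# Row 4 above: 15 calls against 2 expected. The table truncates the real
# sequence, so the tool names here are placeholders, only the length matters.
looping = ["add"] * 15
print(step_ratio(looping, ["multiply", "add"]), steps_match(looping, ["multiply", "add"]))
```

Under this reading, a ratio far above 1 together with "Agent stopped due to iteration limit" in the output suggests that the agent kept calling tools in a loop rather than returning an answer.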
