# Multiverse Math

In this task, the agent operates in an alternate universe in which basic mathematical operations such as addition and multiplication behave differently.

The agent must use tools that allow it to carry out calculations in this universe.

This task can help verify that an agent is able to ignore its own knowledge of math and instead correctly use information returned by the tools.

The modified mathematical operations yield different results but still retain some properties (e.g., the modified multiplication operation is still commutative).

Please note that the modified operations are not guaranteed to make sense in the real world, since not all properties are retained (e.g., the distributive property).
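To make these properties concrete, here is a minimal sketch. The two functions are hypothetical stand-ins, not the benchmark's actual tools: `multiply` follows the `5*3*1.1` example from the task description, and `add` assumes a constant offset of 1.2, which is only inferred from the `3 + 5 = 9.2` result that appears in this notebook.

```python
import math


def multiply(a: float, b: float) -> float:
    """Hypothetical stand-in: the ordinary product scaled by 1.1."""
    return 1.1 * a * b


def add(a: float, b: float) -> float:
    """Hypothetical stand-in: the ordinary sum plus an assumed offset of 1.2."""
    return a + b + 1.2


# Commutativity survives the modification: a * b equals b * a.
commutative = math.isclose(multiply(2, 4), multiply(4, 2))

# Distributivity does not: a * (b + c) differs from a * b + a * c.
lhs = multiply(2, add(3, 5))
rhs = add(multiply(2, 3), multiply(2, 5))
distributive = math.isclose(lhs, rhs)

print(f"commutative={commutative}, distributive={distributive}")
```

Under these assumed definitions, an agent that rewrites `a * (b + c)` as `a * b + a * c` before calling the tools would get a different answer, which is why the algebraic shortcuts of ordinary math cannot be trusted here.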

------------------

For this code to work, please configure LangSmith environment variables with your credentials.

In [1]:
import os

os.environ["LANGCHAIN_API_KEY"] = "ls_.."  # Your LangSmith API key

In [2]:
import uuid

experiment_uuid = uuid.uuid4().hex[:4]

In [3]:
from langchain_benchmarks import clone_public_dataset, registry

In [4]:
task = registry["Multiverse Math"]
task

| Field | Value |
|---|---|
| Name | Multiverse Math |
| Type | ToolUsageTask |
| Dataset ID | 594f9f60-30a0-49bf-b075-f44beabf546a |
| Description | An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. |


Clone the dataset associated with this task.

In [5]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Multiverse Math already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/108bdc68-1808-4b60-92ef-fbd9bd7e1ad0.


## The Environment

Let's check the environment

In [7]:
env = task.create_environment()
env.tools[:5]

[StructuredTool(name='multiply', description='multiply(a: float, b: float) -> float - Multiply two numbers; a * b.', args_schema=<class 'pydantic.v1.main.multiplySchemaSchema'>, func=<function multiply at 0x1639e3560>),
 StructuredTool(name='add', description='add(a: float, b: float) -> float - Add two numbers; a + b.', args_schema=<class 'pydantic.v1.main.addSchemaSchema'>, func=<function add at 0x1639e36a0>),
 StructuredTool(name='divide', description='divide(a: float, b: float) -> float - Divide two numbers; a / b.', args_schema=<class 'pydantic.v1.main.divideSchemaSchema'>, func=<function divide at 0x1639e3600>),
 StructuredTool(name='subtract', description='subtract(a: float, b: float) -> float - Subtract two numbers; a - b.', args_schema=<class 'pydantic.v1.main.subtractSchemaSchema'>, func=<function subtract at 0x1639e3880>),
 StructuredTool(name='power', description='power(a: float, b: float) -> float - Raise a number to a power; a ** b.', args_schema=<class 'pydantic.v1.main.p

In this universe, multiplying 2 by 4 yields 8.8:

In [8]:
env.tools[0].invoke({"a": 2, "b": 4})

8.8

The task instructions:

In [9]:
task.instructions

'You are requested to solve math questions in an alternate mathematical universe. The operations have been altered to yield different results than expected. Do not guess the answer or rely on your  innate knowledge of math. Use the provided tools to answer the question. While associativity and commutativity apply, distributivity does not. Answer the question using the fewest possible tools. Only include the numeric response without any clarifications.'

## Agent

Let's build an agent that we can use for evaluation.

In [10]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-3.5-turbo-16k")

# Let's test that our agent works
agent = agent_factory.create()
agent.invoke({"question": "how much is 3 + 5"})

{'question': 'how much is 3 + 5',
 'output': 'The result of 3 + 5 in this alternate mathematical universe is 9.2.',
 'intermediate_steps': [(AgentActionMessageLog(tool='add', tool_input={'a': 3, 'b': 5}, log="\nInvoking: `add` with `{'a': 3, 'b': 5}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "a": 3,\n  "b": 5\n}', 'name': 'add'}})]),
   9.2)]}
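The result above pairs each tool call with the value the tool returned. As a rough sketch of how one might list which tools the agent used, the snippet below rebuilds that result with `SimpleNamespace` as a stand-in for the agent action object. It assumes only the `tool` and `tool_input` attributes visible in the output, and `summarize_steps` is a hypothetical helper, not part of `langchain_benchmarks`.

```python
from types import SimpleNamespace

# Stand-in for the agent result shown above. SimpleNamespace mimics the
# `tool` and `tool_input` attributes of the real agent action object.
result = {
    "question": "how much is 3 + 5",
    "output": "The result of 3 + 5 in this alternate mathematical universe is 9.2.",
    "intermediate_steps": [
        (SimpleNamespace(tool="add", tool_input={"a": 3, "b": 5}), 9.2),
    ],
}


def summarize_steps(result: dict) -> list:
    """Return a (tool name, tool input, observation) tuple for each step."""
    return [
        (action.tool, action.tool_input, observation)
        for action, observation in result["intermediate_steps"]
    ]


steps = summarize_steps(result)
tools_used = [name for name, _, _ in steps]
print(tools_used)
```

The same unpacking should work on a real agent result, provided each intermediate step is an `(action, observation)` pair as it is in the output above.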

## Eval

Let's evaluate the agent now.

In [11]:
from langsmith.client import Client

from langchain_benchmarks.tool_usage import get_eval_config

client = Client()

eval_config = get_eval_config()
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=agent_factory.create,
    evaluation=eval_config,
    verbose=True,
    project_name=f"oai-functions-gpt-3.5-turbo-16k-{experiment_uuid}",
    project_metadata={
        "model": "gpt-3.5-turbo-16k",
        "arch": "openai-functions-agent",
    },
)

View the evaluation results for project 'oai-functions-gpt-3.5-turbo-16k-ea06' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/a7afaecf-faf2-4bf8-933e-39f08f06c8af?eval=true

View all tests for Dataset Multiverse Math at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/108bdc68-1808-4b60-92ef-fbd9bd7e1ad0
[------------------------------------------------->] 10/10



 Eval quantiles:
                      inputs.question               outputs.question  \
count                              10                             10   
unique                             10                             10   
top     convert 15 degrees to radians  convert 15 degrees to radians   
freq                                1                              1   
mean                              NaN                            NaN   
std                               NaN                            NaN   
min                               NaN                            NaN   
25%                               NaN                            NaN   
50%                               NaN                            NaN   
75%                               NaN                            NaN   
max                               NaN                            NaN   

                                     outputs.output  \
count                                            10   
unique 

## Inspect

You can take a look at the underlying results.

In [12]:
test_run.get_aggregate_feedback()

| | inputs.question | outputs.question | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|
| count | 10 | 10 | 10 | 10 | 10.0 | 10.0 | 10.0 | 0.0 | 10.0 |
| unique | 10 | 10 | 10 | 10 | | | | 0.0 | |
| top | convert 15 degrees to radians | convert 15 degrees to radians | 15 degrees is approximately 0.0417 radians. | [(tool='divide' tool_input={'a': 15, 'b': 180}... | | | | | |
| freq | 1 | 1 | 1 | 1 | | | | | |
| mean | | | | | 0.7 | 1.033333 | 0.4 | | 8.212706 |
| std | | | | | 0.483046 | 0.399073 | 0.516398 | | 2.945678 |
| min | | | | | 0.0 | 0.333333 | 0.0 | | 5.384465 |
| 25% | | | | | 0.25 | 1.0 | 0.0 | | 6.064272 |
| 50% | | | | | 1.0 | 1.0 | 0.0 | | 7.099023 |
| 75% | | | | | 1.0 | 1.0 | 1.0 | | 10.367093 |
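As a sanity check on how to read these aggregates, the sketch below assumes the two correctness scores are binary (0 or 1 per example). The source does not state this, but the quantiles are consistent with it. Under that assumption, a mean of 0.4 corresponds to 4 correct answers out of 10, and a mean of 0.7 to 7 runs whose intermediate steps were judged correct. The score lists are reconstructed for illustration and are not the actual per-example results.

```python
import statistics

# Assumed binary scores, reconstructed from the reported means only.
correctness = [1] * 4 + [0] * 6      # 4 of 10 final answers correct
steps_correct = [1] * 7 + [0] * 3    # 7 of 10 step sequences correct

# pandas' describe() reports the sample standard deviation (n - 1 in the
# denominator), which is what statistics.stdev computes.
mean_c = statistics.mean(correctness)
std_c = statistics.stdev(correctness)
mean_s = statistics.mean(steps_correct)
std_s = statistics.stdev(steps_correct)

print(f"correctness:       mean={mean_c:.1f} std={std_c:.6f}")
print(f"steps correctness: mean={mean_s:.1f} std={std_s:.6f}")
```

The computed standard deviations agree with the `std` row of the aggregate feedback, which supports the binary reading.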


In [19]:
import pandas as pd

df = test_run.to_dataframe()
df = pd.json_normalize(df.to_dict(orient="records"))

In [22]:
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
df["actual_number_of_steps"] = df["outputs.intermediate_steps"].apply(len)

In [23]:
df.head()

| | inputs.question | outputs.question | outputs.output | outputs.intermediate_steps | feedback.Intermediate steps correctness | feedback.# steps / # expected steps | feedback.correctness | error | execution_time | reference.reference | reference.expected_steps | num_expected_steps | actual_number_of_steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | convert 15 degrees to radians | convert 15 degrees to radians | 15 degrees is approximately 0.0417 radians. | [(tool='divide' tool_input={'a': 15, 'b': 180}... | 0 | 0.333333 | 0 | | 5.384465 | 0.124588 | [pi, multiply, divide] | 3 | 1 |
| 1 | after calculating the sin of 1.5 radians, divi... | after calculating the sin of 1.5 radians, divi... | The result of dividing the sine of 1.5 radians... | [(tool='sin' tool_input={'radians': 1.5} log="... | 1 | 1.0 | 0 | | 10.875624 | 0.070915 | [sin, cos, divide] | 3 | 3 |
| 2 | ecoli divides every 20 minutes. How many cells... | ecoli divides every 20 minutes. How many cells... | After 2 hours, starting with 5 cells, there wi... | [(tool='divide' tool_input={'a': 120, 'b': 20}... | 1 | 1.0 | 1 | | 11.130253 | 176.0 | [divide, power, multiply] | 3 | 3 |
| 3 | calculate sqrt of 101 to 4 digits of precision | calculate sqrt of 101 to 4 digits of precision | The square root of 101 to 4 digits of precisio... | [(tool='power' tool_input={'a': 101, 'b': 0.5}... | 0 | 2.0 | 0 | | 14.041779 | 64620.6463 | [power, round] | 2 | 4 |
| 4 | multiply the result of (log of 100 to base 10)... | multiply the result of (log of 100 to base 10)... | The result of multiplying the logarithm of 100... | [(tool='log' tool_input={'a': 100, 'base': 10}... | 1 | 1.0 | 1 | | 8.8415 | 6.222319 | [log, multiply] | 2 | 2 |
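The `feedback.# steps / # expected steps` column appears to be the number of steps the agent actually took divided by the number of expected steps. The following plain-Python sketch checks that reading against the five rows displayed above. The `rows` list is copied by hand from that output, and the interpretation is an inference from these values, not a documented definition.

```python
# (expected_steps, actual_number_of_steps) copied from the five displayed rows.
rows = [
    (["pi", "multiply", "divide"], 1),
    (["sin", "cos", "divide"], 3),
    (["divide", "power", "multiply"], 3),
    (["power", "round"], 4),
    (["log", "multiply"], 2),
]

# Ratio of actual steps to expected steps for each example.
ratios = [actual / len(expected) for expected, actual in rows]

for (expected, actual), ratio in zip(rows, ratios):
    print(f"expected={len(expected)} actual={actual} ratio={ratio:.6f}")
```

A ratio below 1 (the first row) means the agent skipped expected tool calls, while a ratio above 1 (the `sqrt` row) means it used more tool calls than the reference solution.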
