# Multiverse Math

In this task, the agent operates in an alternate universe in which basic mathematical operations such as addition and multiplication behave differently.

The agent must use tools that allow it to carry out calculations in this universe.

This task can help verify that an agent is able to ignore its own knowledge of math and instead correctly use information returned by the tools.

The modified mathematical operations yield different results but still retain some properties (e.g., the modified multiplication operation is still commutative).

Note that the modified operations are not guaranteed to make sense in the real world, since not all properties are retained (e.g., the distributive property does not hold).
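
To make the retained and lost properties concrete, here is a minimal sketch of two altered operations. The multiplication factor of 1.1 comes from the task description; the additive offset of 1.2 is an assumption inferred from tool outputs shown in this notebook (for example, 3 + 5 yielding 9.2), not taken from the library's source.

```python
import math


def multiply(a: float, b: float) -> float:
    # Factor of 1.1 as stated in the task description
    return 1.1 * a * b


def add(a: float, b: float) -> float:
    # Offset of 1.2 is an assumption inferred from observed outputs
    return a + b + 1.2


# Commutativity survives: swapping the arguments changes nothing
commutative = math.isclose(multiply(2, 4), multiply(4, 2)) and math.isclose(
    add(3, 5), add(5, 3)
)

# Distributivity does not: a * (b + c) differs from a * b + a * c
lhs = multiply(2, add(3, 4))
rhs = add(multiply(2, 3), multiply(2, 4))
distributive = math.isclose(lhs, rhs)

print(commutative, distributive)  # True False
```

Under these assumptions, an agent that rearranges an expression using the distributive law would arrive at a different number than one that calls the tools in the order the question implies.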

------------------

For this code to work, please configure LangSmith environment variables with your credentials.

```python
import os

os.environ["LANGCHAIN_API_KEY"] = "ls_.."  # Your LangSmith API key
```

In [1]:
from langchain_benchmarks import clone_public_dataset, registry

In [2]:
task = registry["Multiverse Math"]
task

| | |
|---|---|
| Name | Multiverse Math |
| Type | ToolUsageTask |
| Dataset ID | 594f9f60-30a0-49bf-b075-f44beabf546a |
| Description | An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. |


Clone the dataset associated with this task.

In [3]:
clone_public_dataset(task.dataset_id, dataset_name=task.name)

Dataset Multiverse Math already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/108bdc68-1808-4b60-92ef-fbd9bd7e1ad0.


## The Environment

Let's check the environment and look at the first few tools it provides.

In [4]:
env = task.create_environment()
env.tools[:5]

[StructuredTool(name='multiply', description='multiply(a: float, b: float) -> float - Multiply two numbers; a * b.', args_schema=<class 'pydantic.v1.main.multiplySchemaSchema'>, func=<function multiply at 0x7f04669422a0>),
 StructuredTool(name='add', description='add(a: float, b: float) -> float - Add two numbers; a + b.', args_schema=<class 'pydantic.v1.main.addSchemaSchema'>, func=<function add at 0x7f04669427a0>),
 StructuredTool(name='divide', description='divide(a: float, b: float) -> float - Divide two numbers; a / b.', args_schema=<class 'pydantic.v1.main.divideSchemaSchema'>, func=<function divide at 0x7f0466942700>),
 StructuredTool(name='subtract', description='subtract(a: float, b: float) -> float - Subtract two numbers; a - b.', args_schema=<class 'pydantic.v1.main.subtractSchemaSchema'>, func=<function subtract at 0x7f0466942980>),
 StructuredTool(name='power', description='power(a: float, b: float) -> float - Raise a number to a power; a ** b.', args_schema=<class 'pydant

In this universe, multiplying 2 by 4 yields 8.8 rather than 8:

In [5]:
env.tools[0].invoke({"a": 2, "b": 4})

8.8

The task also comes with instructions for the agent:

In [6]:
task.instructions

'You are requested to solve math questions in an alternate mathematical universe. The operations have been altered to yield different results than expected. Do not guess the answer or rely on your  innate knowledge of math. Use the provided tools to answer the question. While associativity and commutativity apply, distributivity does not. Answer the question using the fewest possible tools. Only include the numeric response without any clarifications.'
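
The last instruction asks for a bare number. A simple way to check whether a response complies is to try parsing the whole string as a float. The helper name `is_bare_number` is hypothetical, and this check is only a sketch; it is not the evaluator used later in this notebook.

```python
def is_bare_number(output: str) -> bool:
    """Return True if the whole response parses as a single number."""
    try:
        float(output.strip())
    except ValueError:
        return False
    return True


print(is_bare_number("9.2"))  # True
print(is_bare_number("The result of adding 2 and 3 is 6.2."))  # False
```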

## Agent Factory

For evaluation, we need an agent factory that will create a new instance of an agent executor for every evaluation run.

We'll use an `OpenAIAgentFactory` provided with LangChain Benchmarks -- look at the `intro` section to see how to define your own.
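
The reason for using a factory rather than one shared agent is that each evaluation run should start from a clean state. The sketch below illustrates the pattern with a hypothetical `ToyAgent` and `ToyAgentFactory`; both names are invented for illustration and are not part of LangChain Benchmarks.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent executor that accumulates state."""

    history: list = field(default_factory=list)

    def invoke(self, inputs: dict) -> dict:
        self.history.append(inputs["question"])
        return {"output": f"seen {len(self.history)} question(s)"}


class ToyAgentFactory:
    """Hypothetical factory: every call builds a brand-new agent."""

    def create(self) -> ToyAgent:
        return ToyAgent()

    def __call__(self) -> ToyAgent:
        return self.create()


factory = ToyAgentFactory()
first, second = factory(), factory()
first.invoke({"question": "how much is 3 + 5"})

# The second agent is unaffected by what the first one saw
print(len(first.history), len(second.history))  # 1 0
```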

In [7]:
from langchain_benchmarks.tool_usage import agents

agent_factory = agents.OpenAIAgentFactory(task, model="gpt-4-0613")

# Let's test that our agent works
agent = agent_factory.create()
agent.invoke({"question": "how much is 3 + 5"})

{'input': 'how much is 3 + 5',
 'output': '9.2',
 'intermediate_steps': [(AgentActionMessageLog(tool='add', tool_input={'a': 3, 'b': 5}, log="\nInvoking: `add` with `{'a': 3, 'b': 5}`\n\n\n", message_log=[AIMessage(content='', additional_kwargs={'function_call': {'arguments': '{\n  "a": 3,\n  "b": 5\n}', 'name': 'add'}})]),
   9.2)]}

## Eval

Let's evaluate an agent on the dataset, once for each model.

In [8]:
import uuid

from langsmith.client import Client

from langchain_benchmarks.tool_usage import get_eval_config

experiment_uuid = uuid.uuid4().hex[:4]

client = Client()

models = ["gpt-3.5-turbo-1106", "gpt-3.5-turbo-0613", "gpt-4-0613"]

for model in models:
    print()

    # qa_math uses a custom prompt to grade the output
    # The prompt guides the LLM to ignore whether the TRUE ANSWER is factually
    # correct
    eval_config = get_eval_config(output_evaluation="qa_math")
    agent_factory = agents.OpenAIAgentFactory(task, model=model)
    test_run = client.run_on_dataset(
        dataset_name=task.name,
        llm_or_chain_factory=agent_factory,
        evaluation=eval_config,
        verbose=False,
        project_name=f"multiverse-math-{model}-{experiment_uuid}",
        tags=[model],
        project_metadata={
            "model": model,
            "arch": "openai-functions-agent",
            "id": experiment_uuid,
        },
    )


View the evaluation results for project 'multiverse-math-gpt-3.5-turbo-1106-d680' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/0919eab8-dca7-4049-a6eb-4067b9862eba?eval=true

View all tests for Dataset Multiverse Math at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/108bdc68-1808-4b60-92ef-fbd9bd7e1ad0
[------------------------------------------------->] 10/10
View the evaluation results for project 'multiverse-math-gpt-3.5-turbo-0613-d680' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/126d8555-c69c-4ab7-ae95-2ad9bc41989e?eval=true

View all tests for Dataset Multiverse Math at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/108bdc68-1808-4b60-92ef-fbd9bd7e1ad0
[------------------------------------------------->] 10/10
View the evaluation results for project 'multiverse-math-gpt-4-0613-d680' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de

## Analyze

You can take a look at the underlying results.

In [10]:
import pandas as pd
from langsmith.client import Client

client = Client()
projects = list(client.list_projects(reference_dataset_name="Multiverse Math"))

dfs = []
for project in projects:
    first_root_run = next(
        client.list_runs(project_name=project.name, execution_order=1)
    )
    # Temporary way to get tag information
    tags = first_root_run.tags
    test_results = client.get_test_results(project_name=project.name)
    test_results["model"] = tags[0]
    dfs.append(test_results)


df = pd.concat(dfs)

df["actual_steps"] = df["outputs.intermediate_steps"].apply(
    lambda steps: [step[0]["tool"] for step in steps]
)
df["num_expected_steps"] = df["reference.expected_steps"].apply(len)
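
To see what the two derived columns contain, here is a small sketch on hand-written data. The record mimics the shape of one row (a list of `(action, observation)` pairs in which the action is a dict with a `tool` key); the values and the ratio computed at the end are illustrative assumptions, not output from LangSmith.

```python
# Hypothetical row, shaped like the columns used above
row = {
    "outputs.intermediate_steps": [
        ({"tool": "add", "tool_input": {"a": 1, "b": 2}}, 4.2),
    ],
    "reference.expected_steps": ["add", "multiply"],
}

# Same extraction as the lambda applied to the DataFrame column
actual_steps = [step[0]["tool"] for step in row["outputs.intermediate_steps"]]
num_expected_steps = len(row["reference.expected_steps"])

# A ratio below 1 means the agent used fewer tools than the reference
step_ratio = len(actual_steps) / num_expected_steps
print(actual_steps, num_expected_steps, step_ratio)  # ['add'] 2 0.5
```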

### Stats

This is a very small dataset, so it is hard to tell whether the differences between the models are meaningful; what is clear is that none of the agents is perfect.

The results suggest that gpt-4 has a harder time ignoring what it knows about math, which is not surprising. For example, in this universe the negation of -5 is still -5 (rather than 5).
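
One way to quantify how little ten examples can tell us is to put a confidence interval on each accuracy. The sketch below uses the Wilson score interval, which is an illustrative choice here and not something the benchmark computes; the counts (8 of 10 and 5 of 10) are the best and worst scores in the results table in this section.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


best = wilson_interval(8, 10)   # gpt-3.5-turbo-0613
worst = wilson_interval(5, 10)  # gpt-4-0613

print(f"8/10: {best[0]:.2f} to {best[1]:.2f}")
print(f"5/10: {worst[0]:.2f} to {worst[1]:.2f}")

# If the lower bound of the best model is below the upper bound of the
# worst model, the intervals overlap
overlap = best[0] < worst[1]
print(overlap)
```

With only ten examples per model the two intervals overlap substantially, which is why the ranking should be read as suggestive rather than conclusive.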



In [11]:
correct_df = df.groupby("model")["feedback.correctness"].sum().to_frame("# correct")
count_df = df.groupby("model").size().to_frame("n")

columns = [
    "feedback.correctness",
    "feedback.Intermediate steps correctness",
    "execution_time",
    "feedback.# steps / # expected steps",
]

df.groupby("model")[columns].mean().join(correct_df).join(count_df)

| model | feedback.correctness | feedback.Intermediate steps correctness | execution_time | feedback.# steps / # expected steps | # correct | n |
|---|---|---|---|---|---|---|
| gpt-3.5-turbo-0613 | 0.8 | 0.8 | 7.992928 | 1.03333 | 8.0 | 10 |
| gpt-3.5-turbo-1106 | 0.6 | 0.6 | 8.933172 | 0.93332 | 6.0 | 10 |
| gpt-4-0613 | 0.5 | 0.6 | 8.329558 | 0.76666 | 5.0 | 10 |


### Individual

In [14]:
columns = [
    "input.question",
    "model",
    "actual_steps",
    "reference.expected_steps",
    "outputs.output",
    "reference.reference",
    "feedback.correctness",
    "num_expected_steps",
]
df[columns].sort_values(by=["input.question", "model"]).head()

| example_id | input.question | model | actual_steps | reference.expected_steps | outputs.output | reference.reference | feedback.correctness | num_expected_steps |
|---|---|---|---|---|---|---|---|---|
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-3.5-turbo-0613 | [add] | [add] | The sum of 2 and 3 in this alternate mathemati... | 6.2 | 1.0 | 1 |
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-3.5-turbo-1106 | [add] | [add] | The result of adding 2 and 3 is 6.2. | 6.2 | 1.0 | 1 |
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-4-0613 | [add] | [add] | 6.2 | 6.2 | 1.0 | 1 |
| 2d3e1665-7b3f-4013-b010-6af30ed62ab2 | I ate 1 apple and 2 oranges every day for 7 da... | gpt-3.5-turbo-0613 | [add, multiply] | [add, multiply] | You ate a total of 32.34 fruits. | 32.34 | 1.0 | 2 |
| 2d3e1665-7b3f-4013-b010-6af30ed62ab2 | I ate 1 apple and 2 oranges every day for 7 da... | gpt-3.5-turbo-1106 | [add] | [add, multiply] | You ate 16.2 fruits. | 32.34 | 0.0 | 2 |
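
The reference answer of 32.34 for the fruit question can be reproduced by hand. This assumes the altered addition adds an offset of 1.2 (inferred from outputs such as 2 + 3 yielding 6.2) and the altered multiplication scales by 1.1 (as stated in the task description):

$$
\begin{aligned}
\operatorname{add}(1, 2) &= 1 + 2 + 1.2 = 4.2 \\
\operatorname{multiply}(4.2, 7) &= 1.1 \times 4.2 \times 7 = 32.34
\end{aligned}
$$

This matches the expected steps `[add, multiply]` listed for that question.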


In [15]:
df[columns].sort_values(by=["input.question", "model"])

| example_id | input.question | model | actual_steps | reference.expected_steps | outputs.output | reference.reference | feedback.correctness | num_expected_steps |
|---|---|---|---|---|---|---|---|---|
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-3.5-turbo-0613 | [add] | [add] | The sum of 2 and 3 in this alternate mathemati... | 6.2 | 1.0 | 1 |
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-3.5-turbo-1106 | [add] | [add] | The result of adding 2 and 3 is 6.2. | 6.2 | 1.0 | 1 |
| 20ea2f0e-b306-474a-8daa-f4386cc16599 | Add 2 and 3 | gpt-4-0613 | [add] | [add] | 6.2 | 6.2 | 1.0 | 1 |
| 2d3e1665-7b3f-4013-b010-6af30ed62ab2 | I ate 1 apple and 2 oranges every day for 7 da... | gpt-3.5-turbo-0613 | [add, multiply] | [add, multiply] | You ate a total of 32.34 fruits. | 32.34 | 1.0 | 2 |
| 2d3e1665-7b3f-4013-b010-6af30ed62ab2 | I ate 1 apple and 2 oranges every day for 7 da... | gpt-3.5-turbo-1106 | [add] | [add, multiply] | You ate 16.2 fruits. | 32.34 | 0.0 | 2 |
| 2d3e1665-7b3f-4013-b010-6af30ed62ab2 | I ate 1 apple and 2 oranges every day for 7 da... | gpt-4-0613 | [add, multiply] | [add, multiply] | 32.34 | 32.34 | 1.0 | 2 |
| c857031a-6ab1-4b06-9638-3a8a4ba69f11 | Subtract 3 from 2 | gpt-3.5-turbo-0613 | [subtract] | [subtract] | The result of subtracting 3 from 2 in this alt... | -4.0 | 1.0 | 1 |
| c857031a-6ab1-4b06-9638-3a8a4ba69f11 | Subtract 3 from 2 | gpt-3.5-turbo-1106 | [subtract] | [subtract] | The result of subtracting 3 from 2 is -4. | -4.0 | 1.0 | 1 |
| c857031a-6ab1-4b06-9638-3a8a4ba69f11 | Subtract 3 from 2 | gpt-4-0613 | [subtract] | [subtract] | -4.0 | -4.0 | 1.0 | 1 |
| 75db51d4-5c3b-4312-9eb9-b40c74eafdcd | What is -5 if evaluated using the negate funct... | gpt-3.5-turbo-0613 | [negate] | [negate] | The result of evaluating -5 using the negate f... | -5.0 | 1.0 | 1 |
