In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
%load_ext autoreload
%autoreload 2

# Evaluating a (task-oriented) agent

- for now, the task is information retrieval/QA
    - because these are the cases Harrison started with
- desiderata+challenges:
    - evaluate goal state:
    - evaluate intermediate states:
        - challenges:
            - non-deterministic/different trajectories may be acceptable
            - coupled/cascading errors
            - we can rely on another LM for evaluation, but this is error-prone too

# Evaluate each of these with DaVinci (quick few-shot, chain-of-thought binary classifiers)

In [3]:
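from typing import List

from langchain.llms import OpenAI
from langchain.schema import AgentAction


davinci_003 = OpenAI(model_name="text-davinci-003", temperature=0.0)


def extract_binary_classification(completion: str) -> bool:
    """Try to extract a binary classification from a text completion.

    Only the first word is inspected, so a verdict with or without a trailing
    period parses. Anything unrecognized is reported and treated as False.
    (distutils.util.strtobool is avoided because distutils is removed in Python 3.12.)
    """
    words = completion.strip().split()
    verdict = words[0].strip('.,:;!').lower() if words else ''
    if verdict in ('yes', 'y', 'true', 't', 'on', '1'):
        return True
    if verdict not in ('no', 'n', 'false', 'f', 'off', '0'):
        print(f"invalid truth value {verdict!r}")
    return False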


def evaluate_candidate_action_plan(
    question: str,
    desired_action_plan: List[AgentAction],
    candidate_action_plan: List[AgentAction],
    model: 'LLM ABC',
    verbose: bool = False,
) -> bool:
    """Use a few-shot classifier to verify whether 2 action plans are "roughly" equivalent for a given question.

    This approach is itself highly error prone!
    """
    
    prompt_prefix = """Decide whether the Candidate action plan would give the same outcome as the Desired action plan in answering a given Question. Actions correspond to calling on a tool like a search engine, data store, calculator, etc.

Examples:

Question: How far is the Earth from the Moon?
Desired action plan: Search(distance between Earth and Moon)
Candidate action plan: Calculator(distance from Earth to Moon)
Satisfactory? No.
Explanation: The Candidate plan uses a Calculator instead of a Search engine.

Question: What is the number of kids our current president has to the power of two?
Desired action plan: Search(how many kids the president has?),  Calculator(4^2)
Candidate action plan: Search(who is current president?), Search(how many kids Joe Biden has?), Calculator(4*4)
Satisfactory? Yes
Explanation: The Candidate plan reaches the same result as the Desired plan with one step broken down into two. 

Question: how long does it take to drive from Boston to New York?
Desired action plan: Search(distance from Boston to New York), Search(speed limit Boston to NewYork), Calculator(190.04/40)
Candidate action plan: Search(driving time from Boston to New York)
Satisfactory?  Yes.
Explanation: The Candidate plan uses a tool to answer the question directly,  rather than breaking it down like the Desired plan.
    """
    
    def serialize_action_plan(action_plan):
        return ', '.join([
            f"{action.tool}({action.tool_input})"
            for action in action_plan
        ])
    desired_action_plan_str = serialize_action_plan(desired_action_plan)
    candidate_action_plan_str = serialize_action_plan(candidate_action_plan)

    prompt = prompt_prefix + f"""
Question: {question}
Desired action plan: {desired_action_plan_str}
Candidate action plan: {candidate_action_plan_str}
Satisfactory?"""

    completion = model(prompt)    
    if verbose:
        print("Prompt:\n", prompt)
        print("Completion:\n", completion)
    
    return extract_binary_classification(completion)
    

def evaluate_candidate_answer(question: str, answer: str, candidate_answer: str, model: 'LLM ABC', verbose: bool=False) -> bool:
    """Use a few-shot classifier to verify whether 2 answers are "roughly" equivalent for a given question.

    This approach is itself highly error prone!
    """
    
    prompt_prefix = """Decide whether a Candidate answer gives the same information as a Desired answer for a given Question.

Examples:

Question: What is the distance from Earth to the Moon?
Desired answer: 238,900 mi
Candidate answer: The distance is about 250k miles
Satisfactory? Yes. 
Explanation: The Candidate answer roughly gives the same information as the Desired answer.

Question: How many kids does Joe Biden have?
Desired answer: 4
Candidate answer: 42
Satisfactory? No.
Explanation: The candidate answer 42 is not the same as the Desired answer.
"""
    
    prompt = prompt_prefix + f"""
Question: {question}
Desired answer: {answer}
Candidate answer: {candidate_answer}
Satisfactory?"""
    
    completion = model(prompt)
    if verbose:
        print("Prompt:\n", prompt)
        print("Completion:\n", completion)
    
    return extract_binary_classification(completion)

# A couple test cases for ReAct-agent QA

In [4]:
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, load_tools

tools = load_tools(['serpapi', 'llm-math'], llm=OpenAI(temperature=0))
agent = initialize_agent(tools, OpenAI(temperature=0), agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)

test_cases = [
    {
        "question": "How many people live in canada as of 2023?",
        "answer": "approximately 38,625,801",
        "steps": [
            {"tool": "Search", "tool_input": "Population of Canada 2023"}
        ]
    },
    {
        "question": "who is dua lipa's boyfriend? what is his age raised to the .43 power?",
        "answer": "her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665",
        "steps": [
            {"tool": "Search", "tool_input": "Dua Lipa's boyfriend"},
            {"tool": "Search", "tool_input": "Romain Gravas age"},
            {"tool": "Calculator", "tool_input": "41^.43"}
        ]
    },
    {
        "question": "what is dua lipa's boyfriend age raised to the .43 power?",
        "answer": "her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665",
        "steps": [
            {"tool": "Search", "tool_input": "Dua Lipa's boyfriend"},
            {"tool": "Search", "tool_input": "Romain Gravas age"},
            {"tool": "Calculator", "tool_input": "41^.43"}
        ]
    
    },
    {
        "question": "how far is it from paris to boston in miles",
        "answer": "approximately 3,435 mi",
        "steps": [
            {"tool": "Search", "tool_input": "paris to boston distance"},
        ]
    },
    {
        "question": "what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?",
        "answer": "approximately 2.682651500990882",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Calculator", "tool_input": "73^.23"},
        ]
    },
    {
        "question": "what was the total number of points scored in the 2023 super bowl raised to the .23 power?",
        "answer": "approximately 2.682651500990882",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Calculator", "tool_input": "73^.23"},
        ]
    },
    {
        "question": "how many more points were scored in the 2023 super bowl than in the 2022 super bowl?",
        "answer": "30",
        "steps": [
            {"tool": "Search", "tool_input": "2023 super bowl score"},
            {"tool": "Search", "tool_input": "2022 super bowl score"},
        ]
    },
    {
        "question": "what is 153 raised to .1312 power?",
        "answer": "approximately 1.9347796717823205",
        "steps": [
            {"tool": "Calculator", "tool_input": "153**.1312"},
        ]
    },
    {
        "question": "who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?",
        "answer": "approximately 1.7589107138176394",
        "steps": [
            {"tool": "Search", "tool_input": "kendall jenner boyfriend"},
            {"tool": "Search", "tool_input": "devin booker height"},
            {"tool": "Calculator", "tool_input": "77**.13"},
        ]
    },
    {
        "question": "what is 1213 divided by 4345?",
        "answer": "approximately 0.2791714614499425",
        "steps": [
            {"tool": "Calculator", "tool_input": "1213/4345"},
        ]
    },
]
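
A list of cases like the one above lends itself to a batch run. The following is a minimal sketch of such a harness, not part of the original notebook: `run_suite`, `pass_rate`, `fake_agent`, and `exact_match_judge` are hypothetical names, and the two stand-in functions exist only so the sketch runs without API keys.

```python
from typing import Callable, Dict, List


def run_suite(
    cases: List[Dict],
    run_agent: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[Dict]:
    """Run every case through the agent and record the judge's verdict."""
    results = []
    for case in cases:
        candidate_answer = run_agent(case["question"])
        results.append({
            "question": case["question"],
            "candidate_answer": candidate_answer,
            "satisfactory": judge(case["question"], case["answer"], candidate_answer),
        })
    return results


def pass_rate(results: List[Dict]) -> float:
    """Fraction of cases judged satisfactory (0.0 for an empty suite)."""
    if not results:
        return 0.0
    return sum(r["satisfactory"] for r in results) / len(results)


# Hypothetical stand-ins so the sketch runs offline.
def fake_agent(question: str) -> str:
    return "approximately 0.2791714614499425" if "1213" in question else "unknown"


def exact_match_judge(question: str, desired: str, candidate: str) -> bool:
    return desired.strip().lower() == candidate.strip().lower()


demo_cases = [
    {"question": "what is 1213 divided by 4345?", "answer": "approximately 0.2791714614499425"},
    {"question": "how far is it from paris to boston in miles", "answer": "approximately 3,435 mi"},
]
demo_results = run_suite(demo_cases, fake_agent, exact_match_judge)
print(pass_rate(demo_results))
```

With the real objects, `run_agent` could be `lambda q: agent(q)["output"]` and `judge` could be `lambda q, a, c: evaluate_candidate_answer(q, a, c, davinci_003)`. Each case then costs one agent run plus one evaluator call.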

In [5]:
test_case = test_cases[1]

question = test_case['question']
out = agent(question)



> Entering new AgentExecutor chain...
 I need to find out who Dua Lipa's boyfriend is and then calculate his age raised to the .43 power
Action: Search
Action Input: "Dua Lipa's boyfriend"
Observation: Dua and Isaac, a model and a chef, dated on and off from 2013 to 2019. The two first split in early 2017, which is when Dua went on to date LANY ...
Thought: I need to find out Isaac's age
Action: Search
Action Input: "Isaac Carew age"
Observation: 36 years
Thought: I need to calculate 36 raised to the .43 power
Action: Calculator
Action Input: 36^.43
Observation: Answer: 4.6688516567750975

Thought: I now know the final answer
Final Answer: Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.

> Finished chain.


In [7]:
desired_action_plan = [
    AgentAction(tool=step['tool'], tool_input=step['tool_input'], log=None)
    for step in test_case['steps']
]
desired_answer = test_case['answer']

candidate_action_plan = [
    action for action, observation in out['intermediate_steps']
]
candidate_answer = out['output']

In [8]:
evaluate_candidate_answer(question, desired_answer, candidate_answer, davinci_003, verbose=True)

Prompt:
 Decide whether a Candidate answer gives the same information as a Desired answer for a given Question.

Examples:

Question: What is the distance from Earth to the Moon?
Desired answer: 238,900 mi
Candidate answer: The distance is about 250k miles
Satisfactory? Yes. 
Explanation: The Candidate answer roughly gives the same information as the Desired answer.

Question: How many kids does Joe Biden have?
Desired answer: 4
Candidate answer: 42
Satisfactory? No.
Explanation: The candidate answer 42 is not the same as the Desired answer.

Question: who is dua lipa's boyfriend? what is his age raised to the .43 power?
Desired answer: her boyfriend is Romain Gravas. his age raised to the .43 power is approximately 4.9373857399466665
Candidate answer: Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age raised to the .43 power is 4.6688516567750975.
Satisfactory?
Completion:
  Yes.
Explanation: The Candidate answer roughly gives the same information as the Desired answer.


True

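**Not quite! The candidate names a different person (Isaac Carew vs. Romain Gravas) and a different number (4.67 vs. 4.94), yet the classifier says Yes. From the CoT explanation, it appears to attend to the 2nd question only.**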
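
One cheap guard against this kind of false positive is a deterministic check that runs alongside the LM judge. The sketch below is an assumption, not something the notebook does: it pulls the numbers out of both answers with a regular expression and compares the last one within a relative tolerance. The helper names `extract_numbers` and `final_numbers_agree` are hypothetical, and suffixes such as "250k" are not handled.

```python
import math
import re
from typing import List

# Matches integers and decimals, including forms like ".43" and "238,900".
_NUMBER = re.compile(r"\d[\d,]*\.?\d*|\.\d+")


def extract_numbers(text: str) -> List[float]:
    """Return every number found in `text`, ignoring thousands separators."""
    numbers = []
    for match in _NUMBER.findall(text):
        # Drop commas and any sentence-ending period captured by the pattern.
        cleaned = match.replace(",", "").rstrip(".")
        if cleaned:
            numbers.append(float(cleaned))
    return numbers


def final_numbers_agree(desired: str, candidate: str, rel_tol: float = 0.01) -> bool:
    """Compare the last number in each answer within a relative tolerance."""
    desired_numbers = extract_numbers(desired)
    candidate_numbers = extract_numbers(candidate)
    if not desired_numbers or not candidate_numbers:
        return False
    return math.isclose(desired_numbers[-1], candidate_numbers[-1], rel_tol=rel_tol)


desired = ("her boyfriend is Romain Gravas. his age raised to the .43 power "
           "is approximately 4.9373857399466665")
candidate = ("Isaac Carew, Dua Lipa's boyfriend, is 36 years old and his age "
             "raised to the .43 power is 4.6688516567750975.")
print(final_numbers_agree(desired, candidate))
```

This only covers answers that end in a number, so it complements the LM-based classifier rather than replacing it.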

In [9]:
evaluate_candidate_action_plan(question, desired_action_plan, candidate_action_plan, davinci_003, verbose=True)

Prompt:
 Decide whether the Candidate action plan would give the same outcome as the Desired action plan in answering a given Question. Actions correspond to calling on a tool like a search engine, data store, calculator, etc.

Examples:

Question: How far is the Earth from the Moon?
Desired action plan: Search(distance between Earth and Moon)
Candidate action plan: Calculator(distance from Earth to Moon)
Satisfactory? No.
Explanation: The Candidate plan uses a Calculator instead of a Search engine.

Question: What is the number of kids our current president has to the power of two?
Desired action plan: Search(how many kids the president has?),  Calculator(4^2)
Candidate action plan: Search(who is current president?), Search(how many kids Joe Biden has?), Calculator(4*4)
Satisfactory? Yes
Explanation: The Candidate plan reaches the same result as the Desired plan with one step broken down into two. 

Question: how long does it take to drive from Boston to New York?
Desired action plan:

False

**Evaluates as intended initially, though we'd likely want to break this out further (i.e. a trajectory should not really resolve to T/F).**
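
As one possible way to get a graded rather than boolean view of a trajectory, the sketch below scores the ordered tool names and the per-step tool inputs with `difflib.SequenceMatcher` from the standard library. This is a swapped-in string-similarity heuristic, not the notebook's method: the function names and the `(tool, tool_input)` tuple representation are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

# A step is a (tool, tool_input) pair, mirroring AgentAction.tool / .tool_input.
Step = Tuple[str, str]


def tool_sequence_similarity(desired: Sequence[Step], candidate: Sequence[Step]) -> float:
    """Score in [0, 1] for how closely the ordered tool names line up."""
    desired_tools = [tool for tool, _ in desired]
    candidate_tools = [tool for tool, _ in candidate]
    return SequenceMatcher(None, desired_tools, candidate_tools).ratio()


def per_step_input_similarity(desired: Sequence[Step], candidate: Sequence[Step]) -> List[float]:
    """Character-level similarity of tool inputs for steps at the same position."""
    return [
        SequenceMatcher(None, d_input.lower(), c_input.lower()).ratio()
        for (_, d_input), (_, c_input) in zip(desired, candidate)
    ]


desired_steps = [
    ("Search", "Dua Lipa's boyfriend"),
    ("Search", "Romain Gravas age"),
    ("Calculator", "41^.43"),
]
candidate_steps = [
    ("Search", "Dua Lipa's boyfriend"),
    ("Search", "Isaac Carew age"),
    ("Calculator", "36^.43"),
]
print(tool_sequence_similarity(desired_steps, candidate_steps))
print(per_step_input_similarity(desired_steps, candidate_steps))
```

Here the tool sequence matches exactly while the second and third inputs diverge, which localizes the error to the first search result instead of collapsing the whole trajectory to False. Surface similarity is a weak proxy for semantic equivalence, so scores like these are better read as diagnostics than as verdicts.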