DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

© 2024 Massachusetts Institute of Technology.

The software/firmware is provided to you on an As-Is basis

Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.


# Experimentation on generating open-ended responses for the LaMP benchmark

This notebook demonstrates how the `OpenEndedBot` variants defined in `bot_interfaces.py` behave on LaMP-4, the personalized news headline generation task of the LaMP benchmark (lamp-benchmark.github.io).

Imports and LLM setup.

In [1]:
# import langchain dependencies
from langchain.schema import HumanMessage

# import other dependencies
import yaml
import os
import json
from datetime import datetime
from typing import Type

# import internal functions
from io_functions import *
from langchain_setup import *
from bot_interfaces import *
from preference_dungeon import *

Initializing model and environment details

In [2]:
game_environment = "lamp4"  # name of the folder that stores your dataset; change it to match your setup
verbose_output = False
save_session = True
offer_justification = False

MODEL_NAME = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=MODEL_NAME, temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Reading in prompt templates

In [3]:
# these should be filled in automatically, but they are left editable just in case
prompts_yaml_path = os.path.join(
    "prompt_templates", game_environment, "explanation_prompts.yaml"
)

with open(prompts_yaml_path, "r", encoding="utf-8") as file:
    prompts_yaml = yaml.safe_load(file)
    initial_prompt = HumanMessage(content=prompts_yaml["initial_prompt"])

    # Some prompts for Model Performance Evaluation from the paper
    langchain_offer_template = prompts_yaml["langchain_offer_template"]
    langchain_RAG_offer_template = prompts_yaml["langchain_RAG_offer_template"]
    langchain_prejust_offer_template = prompts_yaml["langchain_prejust_offer_template"]
    langchain_postjust_offer_template = prompts_yaml[
        "langchain_postjust_offer_template"
    ]
    langchain_crossdomainjust_offer_template = prompts_yaml[
        "langchain_crossdomainjust_offer_template"
    ]
    langchain_fakejust_offer_template = prompts_yaml[
        "langchain_fakejust_offer_template"
    ]
    langchain_ZS_offer_template = prompts_yaml["langchain_ZS_offer_template"]
    langchain_FS_offer_template = prompts_yaml["langchain_FS_offer_template"]
    langchain_prejust_RAG_offer_template = prompts_yaml[
        "langchain_prejust_RAG_offer_template"
    ]
    langchain_postjust_RAG_offer_template = prompts_yaml[
        "langchain_postjust_RAG_offer_template"
    ]
    langchain_crossdomainjust_RAG_offer_template = prompts_yaml[
        "langchain_crossdomainjust_RAG_offer_template"
    ]
    langchain_fakejust_RAG_offer_template = prompts_yaml[
        "langchain_fakejust_RAG_offer_template"
    ]
    langchain_ZS_RAG_offer_template = prompts_yaml["langchain_ZS_RAG_offer_template"]
    langchain_numericaljust_RAG_offer_template = prompts_yaml[
        "langchain_numericaljust_RAG_offer_template"
    ]
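
The cell above binds every template to its own variable. As a compact alternative, the sketch below gathers all keys ending in `_offer_template` into one dictionary. The helper name `collect_offer_templates` is hypothetical and not part of the repository; it takes the mapping already returned by `yaml.safe_load`.

```python
def collect_offer_templates(prompts: dict) -> dict:
    """Return only the entries whose key ends in '_offer_template'.

    Hypothetical helper: `prompts` is the dict parsed from the YAML file.
    """
    suffix = "_offer_template"
    return {key: value for key, value in prompts.items() if key.endswith(suffix)}


# Example with a stand-in mapping (the real values live in explanation_prompts.yaml).
example_prompts = {
    "initial_prompt": "placeholder",
    "langchain_offer_template": "placeholder",
    "langchain_RAG_offer_template": "placeholder",
}
templates = collect_offer_templates(example_prompts)
```

With this shape, a template would be looked up as `templates["langchain_RAG_offer_template"]` instead of through a dedicated variable.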

Loading in data

In [4]:
dev_questions_file_path = os.path.join("data", game_environment, "dev_questions.json")
dev_outputs_file_path = os.path.join("data", game_environment, "dev_outputs.json")

with open(dev_questions_file_path, "r") as file:
    dev_questions = json.load(file)

with open(dev_outputs_file_path, "r") as file:
    dev_outputs = json.load(file)

Define the evaluation loop and the bots. A *bot* is any agent variant under test (e.g., random, full-history-context LLM, summary variants, RAG). Every bot must implement the `Bot` abstract class in `bot_interfaces.py` to work with the evaluation loop.
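
The evaluation loop in this notebook touches only three members of each bot: the `name` attribute, `update_external_memory(...)`, and `make_offer(...)`. The sketch below shows a minimal class that satisfies those calls. The real `Bot` class in `bot_interfaces.py` is not shown here, so `SketchBot`, `EchoBot`, and the memory behavior are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod


class SketchBot(ABC):
    """Hypothetical stand-in for the interface the evaluation loop relies on."""

    def __init__(self, name: str):
        self.name = name
        self.memory: list = []

    def update_external_memory(self, profile, contexts, context_counter, replace=False):
        # Assumption: replace=True discards whatever was stored for earlier users.
        if replace:
            self.memory = list(profile)
        else:
            self.memory.extend(profile)

    @abstractmethod
    def make_offer(self, context: str) -> str:
        """Return the generated answer (here, a headline) for one context."""


class EchoBot(SketchBot):
    """Toy bot that returns the first sentence of the context as its offer."""

    def make_offer(self, context: str) -> str:
        return context.split(".")[0].strip()
```

Any object exposing these three members could be appended to `bot_list`, which is what makes it cheap to compare many prompt and retrieval variants in one run.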

In [5]:
def evaluation_loop(dm: DungeonMaster, bot_list: list[Bot], iterations: int = 1):
    """Run every bot on `iterations` consecutive contexts and print ROUGE scores."""

    # create a list of dictionaries to hold the results
    bot_results = []
    ground_truths = []
    encounters = []

    for bot in bot_list:
        bot_results.append(
            {
                "bot name": bot.name,
                "contexts presented": 0,
                "offers": [],
                "response history": [],
            }
        )

    for encounter_num in range(1, iterations + 1):
        print(
            "encounter_num: ",
            encounter_num,
            "length of user profile: ",
            len(dm.contexts[dm.context_counter]["profile"]),
        )

        encounters.append(dm.print_current_context())
        if verbose_output:
            print(f"DM: Encounter {encounter_num}: {dm.print_current_context()}")

        ground_truths.append(dm.print_current_ground_truth())

        # have each bot evaluate the current context
        for i, bot in enumerate(bot_list):
            bot.update_external_memory(
                dm.contexts[dm.context_counter]["profile"],
                dm.contexts,
                dm.context_counter,
                replace=True,
            )
            offer_str = bot.make_offer(f"{dm.print_current_context()}")
            
            bot_results[i]["offers"].append(offer_str)

        dm.next_context()

    # print("\n\nbot_results[i][\"offers\"]: ", bot_results[i]["offers"])
    
    rouge_1, rouge_L = [], []
    for i, result in enumerate(bot_results):
        rouge_result = dm.evaluate_offer(
            result["offers"],
            ground_truths,
            answer_id_string="Generated Headline",
        )
        rouge_1.append((i, rouge_result["rouge1_fmeasure"]))
        rouge_L.append((i, rouge_result["rougeL_fmeasure"]))

    print(f"ROUGE-1 score: {rouge_1}, ROUGE-L score: {rouge_L}")

    return encounters, bot_results, ground_truths
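
The loop reports `rouge1_fmeasure` and `rougeL_fmeasure`. ROUGE-1 measures unigram overlap between a generated headline and the reference, while ROUGE-L is based on their longest common subsequence. The sketch below is a simplified ROUGE-1 F-measure that uses whitespace tokenization only. It is not the implementation behind `dm.evaluate_offer`, which is not shown in this notebook and may normalize or stem tokens differently.

```python
from collections import Counter


def rouge1_fmeasure(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure on lowercased, whitespace-split tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that shares no tokens with the reference scores 0, and an exact match scores 1, so short headlines with little lexical overlap naturally yield low values.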

Defining the environment and bots (as well as their corresponding extraction functions)

In [6]:
dm = DungeonMasterOpenEnded(
    dev_questions,
    dev_outputs,
    preferences=None,
    evaluation_method=None,
)

bot_list = [
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_offer_template, "NoHistory"), #1
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_RAG_offer_template, "RetrievedHistory"), #2
    LLMWithRandomHistoryOpenEndedBot(llm, initial_prompt.content, langchain_offer_template, "RandomHistory"), #3
    LLMWithEntireHistoryOpenEndedBot(llm, initial_prompt.content, langchain_offer_template, "EntireHistory"), #4
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_prejust_offer_template, "NoHistoryPreJust"), #5
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_postjust_offer_template, "NoHistoryPostJust"), #6
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_crossdomainjust_offer_template, "NoHistoryCrossDomainJust"), #7
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_fakejust_offer_template, "NoHistoryFakeJust"), #8
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_ZS_offer_template, "NoHistoryZS"), #9
    LLMWithNoHistoryOpenEndedBot(llm, initial_prompt.content, langchain_FS_offer_template, "NoHistoryFS"), #10
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_prejust_RAG_offer_template, "RetrievedHistoryPreJust"), #11
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_postjust_RAG_offer_template, "RetrievedHistoryPostJust"), #12
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_crossdomainjust_RAG_offer_template, "RetrievedHistoryCrossDomainJust"), #13
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_fakejust_RAG_offer_template, "RetrievedHistoryFakeJust"), #14
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_ZS_RAG_offer_template, "RetrievedHistoryZS"), #15
    LLMWithRetrievedHistoryOpenEndedBot(llm, initial_prompt.content, langchain_numericaljust_RAG_offer_template, "RetrievedHistoryNumericalJust"), #16
    ]

Results

In [7]:
encounters, bot_results, ground_truths = evaluation_loop(dm, bot_list, iterations=3)

encounter_num:  1 length of user profile:  11
encounter_num:  2 length of user profile:  165
encounter_num:  3 length of user profile:  525
ROUGE-1 score: [(0, tensor(0.0278)), (1, tensor(0.1079)), (2, tensor(0.)), (3, tensor(0.0682)), (4, tensor(0.0333)), (5, tensor(0.0392)), (6, tensor(0.0333)), (7, tensor(0.0667)), (8, tensor(0.0682)), (9, tensor(0.1079)), (10, tensor(0.0351)), (11, tensor(0.1111)), (12, tensor(0.0392)), (13, tensor(0.)), (14, tensor(0.0392)), (15, tensor(0.0370))], ROUGE-L score: [(0, tensor(0.0278)), (1, tensor(0.1079)), (2, tensor(0.)), (3, tensor(0.0682)), (4, tensor(0.0333)), (5, tensor(0.0392)), (6, tensor(0.0333)), (7, tensor(0.0667)), (8, tensor(0.0682)), (9, tensor(0.1079)), (10, tensor(0.0351)), (11, tensor(0.1111)), (12, tensor(0.0392)), (13, tensor(0.)), (14, tensor(0.0392)), (15, tensor(0.0370))]
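
The printed scores are index/tensor pairs, which are hard to compare by eye. As a sketch, the hypothetical helper `rank_bots` below pairs each bot name with its score and sorts from best to worst. It assumes every score can be converted with `float()`, which holds for plain numbers and zero-dimensional tensors. The three example values are the first three ROUGE-1 scores from the output above.

```python
def rank_bots(names, scores):
    """Pair each bot name with its score and sort from best to worst."""
    pairs = [(name, float(score)) for name, score in zip(names, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# First three bots and their ROUGE-1 scores as printed above.
example_names = ["NoHistory", "RetrievedHistory", "RandomHistory"]
example_scores = [0.0278, 0.1079, 0.0]

for name, score in rank_bots(example_names, example_scores):
    print(f"{name:<20} {score:.4f}")
```

In the notebook, the names would come from `bot_results[i]["bot name"]` and the scores from the values returned by `dm.evaluate_offer`.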


In [8]:
rouge_1, rouge_L = [], []
for result in bot_results:
    print('\n\n\nbot_results[i]["offers"]', result["offers"])
    rouge_result = dm.evaluate_offer(result["offers"], ground_truths)
    rouge_1.append(rouge_result["rouge1_fmeasure"])
    rouge_L.append(rouge_result["rougeL_fmeasure"])

print(f"ROUGE-1 score: {rouge_1}, ROUGE-L score: {rouge_L}")




bot_results[i]["offers"] ['"From Kicking and Screaming to Surviving: An Ex-Wife\'s Journey of Unexpected Challenges"', '"Uoma Beauty Founder Sparks Change: Demands Brands to Disclose Black Employee Numbers"', '"Rising Star: Aussie Model Graces Cosmopolitan Australia Cover for March 2014"']



bot_results[i]["offers"] ['"Surviving the Unexpected: Ex-Wife Shares Tips for Getting Through Divorce"', 'Uoma Beauty Founder Sparks Movement for Corporate Diversity Disclosure', 'Aussie Model Graces March Cover of Cosmopolitan Australia: A New Icon in the Making']



bot_results[i]["offers"] ['"Surviving the Unexpected: One Woman\'s Journey Through Divorce"', '"Uoma Beauty Founder Sparks Change: Demands Brands Disclose Black Employee Numbers"', '"Rising Star: Aussie Model Graces Cosmopolitan Australia Cover for March 2014"']



bot_results[i]["offers"] ["Surviving Divorce: From Drunk Texting to Stalking, One Woman's Journey to Redemption", 'Uoma Beauty Founder Sparks Movement for Corporate Tra

Saving results

In [9]:
if save_session:
    timestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    dump_session_variables(
        f"results/{game_environment}/{game_environment}_{timestamp}_evaluation_run.pkl",
        [dm, encounters, bot_results],
        ["dm", "encounters", "bot_results"],
    )

saving dm
saving encounters
saving bot_results
