# DSPy + OpenToM

The goal of this notebook is to explore the OpenToM dataset and see whether we can write some DSPy code to optimize prompts for answering its questions.

The OpenToM authors evaluated the performance of CoT and SimToM on their dataset. I now wonder how much extra performance we can get from a framework like DSPy.

In [18]:
# SETUP: run poetry install + shell in the terminal, then i just say `cursor .` to open my editor and it runs this nb in the venv
# GETTING STARTED: let's import the packages and get the data
import dspy
import requests
import random
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # need ur api keys set beforehand

turbo = dspy.OpenAI(model='gpt-3.5-turbo', max_tokens=200)
dspy.settings.configure(lm=turbo)

# dataset isn't able to be loaded using hf datasets package so let's read it from github raw
# also let's keep it simple and just go for the opentom_long.json
# this is the one that they sampled 100 existing OpenToM plots to produce "extra long" narratives
url = "https://raw.githubusercontent.com/SeacowX/OpenToM/main/data/opentom_long.json"
response = requests.get(url).json()

df = pd.DataFrame(response)


In [19]:
df.head()

Unnamed: 0,plot,plot_info,preferences,personality,sentiment_statement,true_sentiment,intention,new_location,observed,narrative,long_narrative,question
0,Diego entered the patio.\nAmir entered the pat...,"{'mover': 'Diego', 'eoi': 'scarf', 'original_p...","{'mover': 'Diego hates scarf.', 'observer': 'A...",Diego is an inconsiderate person.,Diego hates scarf.,the most negative,get rid of it and make it accessible to someon...,a donation bin,True,Diego and Amir were both residents of the same...,"Diego and Amir, two residents of the harmoniou...","{'question': 'As Amir, what is your attitude t..."
1,Paxton entered the pantry.\nHadley entered the...,"{'mover': 'Hadley', 'eoi': 'hat', 'original_pl...","{'mover': 'Hadley hates hat', 'observer': 'Pax...",Hadley is a negativistic person.,Hadley wants to get rid of the hat.,the action that is the most adversarial to the...,dispose of the hat,the trash can,True,Hadley had always despised the hat. It was a s...,"In the quaint quarters of a shared apartment, ...","{'question': 'As Paxton, what is your attitude..."
2,Abraham entered the office.\nMarcos entered th...,"{'mover': 'Abraham', 'eoi': 'melon', 'original...","{'mover': 'Abraham hates melon.', 'observer': ...",Abraham is an inconsiderate person.,Abraham hates melon.,the most negative,separate it from other food items and discoura...,the outdoor shed,True,Abraham despised melon with every fiber of his...,In a world where personal preferences are as v...,"{'question': 'As Marcos, what is your attitude..."
3,Jeremy entered the garden.\nJayden entered the...,"{'mover': 'Jeremy', 'eoi': 'peas', 'original_p...","{'mover': 'Jeremy hates peas.', 'observer': 'J...",Jeremy is an inconsiderate person.,Jeremy hates peas.,the most negative,make them less visible and accessible,top shelf,False,"Jeremy and Jayden were two peas in a pod, or s...","In the warm embrace of kinship, Jeremy and Jay...","{'question': 'As Jayden, what is your attitude..."
4,Andrew entered the back_yard.\nRichard entered...,"{'mover': 'Andrew', 'eoi': 'hat', 'original_pl...","{'mover': 'Andrew likes hat', 'observer': 'Ric...",Andrew is a negativistic person.,Andrew wants to show-off the hat.,is the most show-off action to the hat,showcase it as part of his collection to frien...,Room,False,Andrew and Richard were two very different ind...,"Andrew and Richard were two contrasting souls,...","{'question': 'As Richard, what is your attitud..."


In [20]:
# Filter out the fine grained location questions, they seem to be less important in the paper and they complicate evaluation
# df = df[df['question'].apply(lambda q: q['answer'] == 'Yes' or q['answer'] == 'No' if 'location' in q['type'] else True)]

In [21]:
type_counts = df['question'].apply(lambda x: x['type']).value_counts()
type_counts  #fo means first-order, so means second-order

# first-order questions directly ask about a character’s perception of the world, while
# second order questions ask about a character’s belief of another character's mental state

question
multihop-fo    600
multihop-so    600
location-fo    600
location-so    400
attitude       100
Name: count, dtype: int64

In [89]:
# Assuming 'df' is your DataFrame and it contains a 'question' column with dictionaries having 'type' and 'answer' keys

# Extract 'type' and 'answer' into separate columns
df['type'] = df['question'].apply(lambda x: x['type'])
df['answer'] = df['question'].apply(lambda x: x['answer'])

# Group by 'type' and get unique 'answer' values for each 'type'
unique_answers_by_type = df.groupby('type')['answer'].unique()

# only use the first word since we're only doing binary and ternary classification (except for the fine-grained location qs but we're not including those right now)
# unique_answers_by_type = unique_answers_by_type.apply(lambda answers: [answer.split(' ')[0].lower().strip() for answer in answers])
# unique_answers_by_type = unique_answers_by_type.apply(lambda answers: list(set(answers))) # remove duplicates again

print(unique_answers_by_type)

type
attitude       [negative, posit...
location-fo    [No, Yes, a dona...
location-so    [No, Yes, a dona...
multihop-fo    [less full, more...
multihop-so    [less full, equa...
Name: answer, dtype: object


In [90]:
import json

# convert the dataset to what DSPy expects (list of Example objects)
dataset = []

for index, row in df.iterrows():
    context = row['long_narrative']
    question = row['question']['question']
    answer = row['question']['answer']
    question_type = row['question']['type']  # avoid shadowing the built-in `type`
    plot_info = json.dumps(row['plot_info']) # Keeping each example field as a string might be a good idea

    is_yes_no = answer.lower().strip() in ("yes", "no")
    if "location" in question_type and not is_yes_no: # don't provide answer choices for fine grained location questions
        answer_choices = "n/a, list a specific location"
    elif "location" in question_type:
        answer_choices = "No, Yes"
    else:
        answer_choices = ", ".join(unique_answers_by_type[question_type])

    dataset.append(dspy.Example(context=context, question=question, answer=answer, type=question_type, plot_info=plot_info, answer_choices=answer_choices).with_inputs("context", "question", "answer_choices"))

In [91]:
# split datasets by question types 
from collections import defaultdict

datasets = defaultdict(list)

for example in dataset:
    datasets[example.type].append(example)

datasets.keys()
[len(dataset) for dataset in datasets.values()]

[100, 600, 600, 600, 400]

In [92]:
# create train test split
for question_type, dataset in datasets.items():
    random.shuffle(dataset)

    datasets[question_type] = {
        "train": dataset[:int(len(dataset) * 0.8)],
        "test": dataset[int(len(dataset) * 0.8):],
    }

    print(f"Now Train {question_type}: {len(datasets[question_type]['train'])}")
    print(f"Now Test {question_type}: {len(datasets[question_type]['test'])}")

Now Train attitude: 80
Now Test attitude: 20
Now Train multihop-fo: 480
Now Test multihop-fo: 120
Now Train multihop-so: 480
Now Test multihop-so: 120
Now Train location-fo: 480
Now Test location-fo: 120
Now Train location-so: 320
Now Test location-so: 80
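
The split above uses an unseeded `random.shuffle`, so the train and test examples differ on every run. Below is a minimal sketch of a reproducible 80/20 split, assuming a fixed seed is acceptable here. The helper name `split_train_test` and the seed value are illustrative and not part of the notebook.

```python
import random

def split_train_test(examples, train_fraction=0.8, seed=0):
    """Return a reproducible train/test split without mutating the input."""
    shuffled = list(examples)              # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # private RNG; global random state is untouched
    cut = int(len(shuffled) * train_fraction)
    return {"train": shuffled[:cut], "test": shuffled[cut:]}

split = split_train_test(range(100))
print(len(split["train"]), len(split["test"]))  # prints: 80 20
```

Calling the helper twice with the same seed returns the same split, which makes scores comparable across notebook runs.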


# Define the Signatures

Using a "Baleen" pipeline [(Khattab et al., 2021)](https://arxiv.org/abs/2101.00436)


In [93]:
# answer the question
class GenerateAnswer(dspy.Signature):
    """Generate answers to the questions"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer_choices = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# generate a question to help you better answer the question
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    query = dspy.OutputField(desc="a thought that might help answer the question") 

class GenerateSearchAnswer(dspy.Signature):
    """Generate a long form answer to the question given the context"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a thought about what the answer to the question may be")

# metric: assess whether the generated answer is correct


In [94]:
from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.generate_search_answer = dspy.ChainOfThought(GenerateSearchAnswer)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question, context, answer_choices):
        thoughts = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            thought = self.generate_search_answer(context=context, question=query).answer
            thoughts.append(thought)

        # Keep every hop's thought (not just the last one) and separate it from the narrative
        final_context = "\n\n".join([context] + thoughts)

        pred = self.generate_answer(context=final_context, question=question, answer_choices=answer_choices)
        return dspy.Prediction(context=context, answer=pred.answer)

OK, so here is what I think is happening:
- generate queries over the context (the long narrative) based on the question we're trying to answer
- answer those queries using the `GenerateSearchAnswer` signature
- then use those answers as extra context to answer the question

This is kinda like SimToM
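
The three steps above can be sketched without any LM calls by passing in plain functions. This is only an illustration of the data flow under the assumption that every hop's thought is appended to the narrative. `multi_hop_answer` and the three `fake_*` stubs are hypothetical names and do not come from DSPy.

```python
from typing import Callable, List

def multi_hop_answer(
    context: str,
    question: str,
    make_query: Callable[[str, str], str],
    answer_query: Callable[[str, str], str],
    answer_final: Callable[[str, str], str],
    max_hops: int = 2,
) -> str:
    thoughts: List[str] = []
    for _ in range(max_hops):
        query = make_query(context, question)          # step 1: ask a helper question
        thoughts.append(answer_query(context, query))  # step 2: answer it from the narrative
    enriched = "\n\n".join([context, *thoughts])       # narrative plus all thoughts
    return answer_final(enriched, question)            # step 3: answer the real question

# Stub "models" that record the call order instead of calling an LM.
calls = []

def fake_query(context, question):
    calls.append("query")
    return f"Which facts matter for: {question}"

def fake_search_answer(context, query):
    calls.append("search_answer")
    return f"Thought {calls.count('search_answer')}"

def fake_final_answer(context, question):
    calls.append("final_answer")
    return context.splitlines()[-1]  # echo the last thought to show it reached this step

result = multi_hop_answer("Narrative.", "Where is the hat?",
                          fake_query, fake_search_answer, fake_final_answer)
print(calls)
print(result)
```

With two hops the stubs are called five times in total: two query calls, two search-answer calls, and one final-answer call.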

# Executing the Pipeline

Let's see how this works in a zero-shot setting

In [69]:
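# NOTE: `train` is not defined in the cells above. The output below is a location-so
# question, so we assume it was that split's training set.
train = datasets["location-so"]["train"]

my_question = train[0].question
my_context = train[0].context
my_answer_choices = train[0].answer_choices

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(question=my_question, context=my_context, answer_choices=my_answer_choices)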

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Answer Choices: {my_answer_choices}")
print(f"Generated Context: {pred.context}")
print(f"True Answer: {train[0].answer}")


Question: From Elliot's perspective, where precisely does Amari think that the spinach is located by the end of the story?
Predicted Answer: Freezer
Answer Choices: n/a, list a specific location
Generated Context: In a quaint dwelling shared by two remarkably different individuals, the subject of their divergent palates was not an uncommon theme in their daily discourse. Elliot, the individual in question, harbored an aversion so pronounced towards a certain leafy green vegetable, that mere reference to it was enough to elicit a visible reaction. Spinach, in all its verdant glory, was the source of his culinary discontent. On the flip side of this domestic coin was Amari, who had a penchant for the very vegetable that caused Elliot such distress. Their discordant preferences were a recurring motif in their amicable exchanges, often sparking lighthearted debates within the confines of their shared home.

The story unfolds on an ordinary day that would soon prove otherwise. Elliot, drive

In [70]:
from opentom_evaluator import OpenToMEvaluatorDspy
evaluator = OpenToMEvaluatorDspy()  # named `evaluator` to avoid shadowing the built-in `eval`
evaluator.dspy_metric(train[0], pred)

done location-so 1 1 the freezer Freezer


True
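
The metric accepted "Freezer" for the gold answer "the freezer", so some normalization is happening before the comparison. The sketch below shows one way such a lenient match could work. It is a guess for illustration only and not the logic inside `opentom_evaluator`; the names `normalize` and `lenient_match` are hypothetical.

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop articles, collapse whitespace."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(token for token in tokens if token not in ARTICLES)

def lenient_match(gold: str, predicted: str) -> bool:
    """True when gold and predicted answers are equal after normalization."""
    return normalize(gold) == normalize(predicted)

print(lenient_match("the freezer", "Freezer"))        # prints: True
print(lenient_match("the freezer", "the trash can"))  # prints: False
```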

We can inspect the last three calls to the LM. With two hops, each making a query call and a search-answer call, these are the second hop's query, the answer to that query, and the final answer:

In [71]:
turbo.inspect_history(n=3)





Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts and psychological insights

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the query}. We ...

Query: a thought that might help answer the question

---

Context:
In a quaint dwelling shared by two remarkably different individuals, the subject of their divergent palates was not an uncommon theme in their daily discourse. Elliot, the individual in question, harbored an aversion so pronounced towards a certain leafy green vegetable, that mere reference to it was enough to elicit a visible reaction. Spinach, in all its verdant glory, was the source of his culinary discontent. On the flip side of this domestic coin was Amari, who had a penchant for the very vegetable that caused Elliot such distress. Their discordant preferences were a recurring motif in their amicable exchanges, often sparking lighthearted debat

# Optimizing the Pipeline

However, a zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.

To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.

In [96]:
from opentom_evaluator import OpenToMEvaluatorDspy
from dspy.teleprompt import BootstrapFewShot

eval_question_types = ["attitude"] # question types to optimize a module for
modules = {}

# define modules for each question type
for question_type in eval_question_types:
    evaluator = OpenToMEvaluatorDspy(model_name="(training set) compiled baleen")
    optimizer = BootstrapFewShot(metric=evaluator.dspy_metric, max_rounds=1)
    compiled_baleen = optimizer.compile(SimplifiedBaleen(), trainset=datasets[question_type]["train"][:25])

    modules[question_type] = compiled_baleen

 16%|█▌        | 4/25 [00:41<03:38, 10.41s/it]

Bootstrapped 4 full traces after 5 examples in round 0.
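
The log line above suggests the optimizer stopped once it had collected enough passing traces. The toy loop below illustrates that idea: run the program on training examples, keep the ones whose prediction passes the metric, and stop at a cap. It is a simplified illustration and not DSPy's implementation. `bootstrap_demos`, the stub program, and the toy data are hypothetical, and the cap of 4 is chosen only to mirror the log.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect examples whose prediction passes the metric, up to max_demos."""
    demos = []
    examples_seen = 0
    for example in trainset:
        if len(demos) >= max_demos:
            break  # enough demonstrations collected
        examples_seen += 1
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({**example, "prediction": prediction})
    return demos, examples_seen

def always_positive(question):
    """Stub program that answers 'positive' for every question."""
    return "positive"

def exact_match(example, prediction):
    return example["answer"] == prediction

# Toy training set: every gold answer is 'positive' except the third one.
trainset = [
    {"question": f"q{i}", "answer": "negative" if i == 2 else "positive"}
    for i in range(25)
]

demos, seen = bootstrap_demos(always_positive, trainset, exact_match)
print(f"Bootstrapped {len(demos)} demos after {seen} examples.")
# prints: Bootstrapped 4 demos after 5 examples.
```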





In [99]:
from dspy.evaluate.evaluate import Evaluate

print("Macro Averaged F1 Scores")
for question_type in eval_question_types:
    test = datasets[question_type]["test"]
    compiled_baleen = modules[question_type]

    # Set up the `evaluate_on_opentom` function.
    evaluate_on_opentom = Evaluate(devset=test[:10], num_threads=1, display_progress=True, display_table=10)

    uncompiled_baleen_evaluator = OpenToMEvaluatorDspy(model_name='uncompiled_baleen')
    uncompiled_baleen_retrieval_score = evaluate_on_opentom(uncompiled_baleen, metric=uncompiled_baleen_evaluator.dspy_metric, display=False)
    uncompiled_baleen_evaluator.print_f1_results()

    compiled_baleen_evaluator = OpenToMEvaluatorDspy(model_name='compiled_baleen')
    compiled_baleen_retrieval_score = evaluate_on_opentom(compiled_baleen, metric=compiled_baleen_evaluator.dspy_metric, display=False)
    compiled_baleen_evaluator.print_f1_results()

Macro Averaged F1 Scores
uncompiled_baleen - attitude: 41.27	
compiled_baleen - attitude: 36.67	
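
For reference, a macro-averaged F1 score computes F1 per answer class and then takes the unweighted mean, so rare classes count as much as common ones. Below is a minimal sketch, assuming the label set is the union of gold and predicted labels. The evaluator in `opentom_evaluator` may differ in details, and `macro_f1` and the toy labels are illustrative.

```python
from collections import Counter

def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    f1_scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)

gold = ["positive", "negative", "neutral", "negative"]
pred = ["positive", "negative", "negative", "negative"]
print(round(macro_f1(gold, pred), 4))  # prints: 0.6
```

In the toy data the per-class F1 values are 1.0 (positive), 0.8 (negative), and 0.0 (neutral), which average to 0.6.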




In [15]:
uncompiled_baleen.dump_state()

{'generate_query[0]': {'lm': None, 'traces': [], 'train': [], 'demos': []},
 'generate_query[1]': {'lm': None, 'traces': [], 'train': [], 'demos': []},
 'generate_search_answer': {'lm': None,
  'traces': [],
  'train': [],
  'demos': []},
 'generate_answer': {'lm': None, 'traces': [], 'train': [], 'demos': []}}

In [16]:
compiled_baleen.dump_state()

{'generate_query[0]': {'lm': None,
  'traces': [],
  'train': [],
  'demos': [Example({'augmented': True, 'context': "Tyson and Louis, kindred spirits with a penchant for cozy attire, often found their conversations meandering through the intricate world of sweatshirt designs. Their shared enthusiasm for these garments was the tapestry that wove their friendship tightly together. They reveled in the thrill of discovery, venturing to stores in pursuit of the elusive “perfect” sweatshirt.\n\nThe bond they shared was evident, not just in their animated dialogues about the latest fashion trends, but also in the synchronized step they took when entering a room—today, it was the welcoming expanse of Tyson’s living room, a place that had witnessed many such discussions and shared moments of joy.\n\nAmidst the familiar setting, a curious sight arrested their attention—an ordinary bottle, unremarkable except for the treasure it held: a sweatshirt, neatly folded and somehow fitting into the conf