# DSPy + OpenToM

The goal of this notebook is to explore the OpenToM dataset and see whether we can write DSPy code that optimizes prompts for answering its questions.

The OpenToM authors evaluated CoT and SimToM prompting on their dataset. I now wonder how much extra performance we can get from a framework like DSPy.

In [2]:
# SETUP: run poetry install + shell in the terminal, then i just say `cursor .` to open my editor and it runs this nb in the venv
# GETTING STARTED: let's import the packages and get the data
import dspy
import requests
import random
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # need ur api keys set beforehand

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
dspy.settings.configure(lm=turbo)

# the dataset can't be loaded with the hf datasets package, so let's read it from github raw
# also let's keep it simple and just go for opentom_long.json
# for this file the authors sampled 100 existing OpenToM plots to produce "extra long" narratives
url = "https://raw.githubusercontent.com/SeacowX/OpenToM/main/data/opentom_long.json"
response = requests.get(url).json()

df = pd.DataFrame(response)


In [21]:
df.head()

Unnamed: 0,plot,plot_info,preferences,personality,sentiment_statement,true_sentiment,intention,new_location,observed,narrative,long_narrative,question,type,answer
0,Diego entered the patio.\nAmir entered the pat...,"{'mover': 'Diego', 'eoi': 'scarf', 'original_p...","{'mover': 'Diego hates scarf.', 'observer': 'A...",Diego is an inconsiderate person.,Diego hates scarf.,the most negative,get rid of it and make it accessible to someon...,a donation bin,True,Diego and Amir were both residents of the same...,"Diego and Amir, two residents of the harmoniou...","{'question': 'As Amir, what is your attitude t...",attitude,negative
1,Paxton entered the pantry.\nHadley entered the...,"{'mover': 'Hadley', 'eoi': 'hat', 'original_pl...","{'mover': 'Hadley hates hat', 'observer': 'Pax...",Hadley is a negativistic person.,Hadley wants to get rid of the hat.,the action that is the most adversarial to the...,dispose of the hat,the trash can,True,Hadley had always despised the hat. It was a s...,"In the quaint quarters of a shared apartment, ...","{'question': 'As Paxton, what is your attitude...",attitude,negative
2,Abraham entered the office.\nMarcos entered th...,"{'mover': 'Abraham', 'eoi': 'melon', 'original...","{'mover': 'Abraham hates melon.', 'observer': ...",Abraham is an inconsiderate person.,Abraham hates melon.,the most negative,separate it from other food items and discoura...,the outdoor shed,True,Abraham despised melon with every fiber of his...,In a world where personal preferences are as v...,"{'question': 'As Marcos, what is your attitude...",attitude,negative
3,Jeremy entered the garden.\nJayden entered the...,"{'mover': 'Jeremy', 'eoi': 'peas', 'original_p...","{'mover': 'Jeremy hates peas.', 'observer': 'J...",Jeremy is an inconsiderate person.,Jeremy hates peas.,the most negative,make them less visible and accessible,top shelf,False,"Jeremy and Jayden were two peas in a pod, or s...","In the warm embrace of kinship, Jeremy and Jay...","{'question': 'As Jayden, what is your attitude...",attitude,positive
4,Andrew entered the back_yard.\nRichard entered...,"{'mover': 'Andrew', 'eoi': 'hat', 'original_pl...","{'mover': 'Andrew likes hat', 'observer': 'Ric...",Andrew is a negativistic person.,Andrew wants to show-off the hat.,is the most show-off action to the hat,showcase it as part of his collection to frien...,Room,False,Andrew and Richard were two very different ind...,"Andrew and Richard were two contrasting souls,...","{'question': 'As Richard, what is your attitud...",attitude,neutral


In [22]:
df.loc[0, 'question']

{'question': "As Amir, what is your attitude towards Diego's action of moving scarf from basket to a donation bin?",
 'answer': 'negative',
 'type': 'attitude'}

In [23]:
type_counts = df['question'].apply(lambda x: x['type']).value_counts()
type_counts  #fo means first-order, so means second-order

# first-order questions directly ask about a character's perception of the world, while
# second-order questions ask about a character's belief about another character's mental state

question
multihop-fo    600
multihop-so    600
location-fo    600
location-so    400
attitude       100
Name: count, dtype: int64

In [24]:
# Assuming 'df' is your DataFrame and it contains a 'question' column with dictionaries having 'type' and 'answer' keys

# Extract 'type' and 'answer' into separate columns
df['type'] = df['question'].apply(lambda x: x['type'])
df['answer'] = df['question'].apply(lambda x: x['answer'])

# Group by 'type' and get unique 'answer' values for each 'type'
unique_answers_by_type = df.groupby('type')['answer'].unique()

print(unique_answers_by_type)

type
attitude                           [negative, positive, neutral]
location-fo    [No, Yes, a donation bin, basket, the trash ca...
location-so    [No, Yes, a donation bin, the trash can, the o...
multihop-fo    [less full, more full, less accessible, equall...
multihop-so    [less full, equally full, more full, less acce...
Name: answer, dtype: object


In [48]:
", ".join(unique_answers_by_type['attitude'])

'negative, positive, neutral'

In [56]:
# convert the dataset to what DSPy expects (list of Example objects)
dataset = []

for index, row in df.iterrows():
    context = row['long_narrative']
    question = row['question']['question']
    answer = row['question']['answer']

    # named question_type so the built-in `type` is not shadowed
    question_type = row['question']['type']
    answer_classes = ", ".join(unique_answers_by_type[question_type])

    dataset.append(dspy.Example(context=context, question=question, answer=answer, type=question_type, answer_classes=answer_classes).with_inputs("context", "question"))


In [57]:
# create train test split
random.shuffle(dataset)
train = dataset[:int(len(dataset) * 0.8)]
test = dataset[int(len(dataset) * 0.8):]

print(f"Nrow Train: {len(train)}")
print(f"Nrow Test: {len(test)}")

Nrow Train: 1840
Nrow Test: 460
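
The split above depends on the global random state, so the train and test sets change on every run. Below is a minimal sketch of a seeded split, assuming reproducibility is wanted. The helper name `split_dataset` and the seed value are illustrative and not part of the notebook, and plain integers stand in for the `dspy.Example` objects.

```python
import random

def split_dataset(items, train_frac=0.8, seed=0):
    """Shuffle a copy of `items` with a fixed seed and split it into train/test."""
    shuffled = list(items)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, global state is not affected
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 2300 placeholder items stand in for the 2300 examples built above
train_demo, test_demo = split_dataset(range(2300))
print(len(train_demo), len(test_demo))  # 1840 460
```

Calling it twice with the same seed returns the same split, which makes scores comparable across notebook runs.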


# Define the Signatures

Using a "Baleen" pipeline [(Khattab et al., 2021)](https://arxiv.org/abs/2101.00436)


In [59]:
# answer the question
class GenerateAnswer(dspy.Signature):
    """Generate answers to the questions"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# generate a question to help you better answer the question
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    query = dspy.OutputField(desc="a thought that might help answer the question") 

class GenerateSearchAnswer(dspy.Signature):
    """Generate a long form answer to the question given the context"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a thought about what the answer to the question may be")

# metric: assess whether the generated answer is correct


In [61]:

class SimplifiedBaleen(dspy.Module):
    def __init__(self, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.generate_search_answer = dspy.ChainOfThought(GenerateSearchAnswer)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question, context):
        # start from the full narrative and append each hop's intermediate thought
        final_context = context

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            filtered_context = self.generate_search_answer(context=context, question=query).answer
            # accumulate instead of overwriting, so earlier hops are not discarded
            final_context = final_context + "\n\n" + filtered_context

        pred = self.generate_answer(context=final_context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

Ok so what I think is happening here:
- generate queries over the context (long narrative) based on the question we're trying to answer
- answer those queries using the GenerateSearchAnswer signature
- and then use those answers, together with the narrative, as the context to answer the question

This is kinda like SimToM
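
The steps above can be sketched in plain Python without any LM calls, which makes the control flow and the number of calls per question easy to see. Everything here is illustrative: `multi_hop_answer` and the three stub functions are hypothetical stand-ins for the DSPy modules, and keeping every hop's thought in the final context is an assumption about the intended behaviour.

```python
def multi_hop_answer(context, question, ask_query, answer_query, answer_final, max_hops=2):
    """Control flow of the pipeline, with the three LM steps passed in as callables."""
    thoughts = []
    for _ in range(max_hops):
        query = ask_query(context, question)            # 1. pose a helper question
        thoughts.append(answer_query(context, query))   # 2. answer it from the narrative
    enriched = "\n\n".join([context, *thoughts])        # 3. narrative plus all thoughts
    return answer_final(enriched, question)

# Stub steps (hypothetical) that only record the order in which they are called
calls = []

def ask_query(context, question):
    calls.append("query")
    return f"helper for: {question}"

def answer_query(context, query):
    calls.append("search_answer")
    return f"thought about [{query}]"

def answer_final(context, question):
    calls.append("answer")
    return context  # echo the enriched context so it can be inspected

result = multi_hop_answer("NARRATIVE", "Where is the hat?", ask_query, answer_query, answer_final)
print(calls)  # ['query', 'search_answer', 'query', 'search_answer', 'answer']
```

With two hops that is five LM calls per question, which is worth keeping in mind when estimating cost and runtime.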

# Executing the Pipeline

Let's see how this works in a zero-shot setting

In [44]:
my_question = train[0].question
my_context = train[0].context

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question, my_context)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Generated Context: {pred.context}")
print(f"True Answer: {train[0].answer}")


Question: From Gael's perspective, where is the watermelon located precisely by the end of the story?
Predicted Answer: On the kitchen counter
Generated Context: Ivan harbored an unspoken, yet palpable distaste for the striped green gourd known as watermelon. Its presence alone was enough to stir a reaction that he kept well-hidden beneath a veneer of composure. Conversely, Gael held a deep-seated appreciation for the succulent fruit, its crimson flesh synonymous with the essence of summer's warmth.

On an unremarkable afternoon, a confluence of circumstances led both Ivan and Gael to the same quiet bedroom. Unadorned except for the essentials, the room held little interest until their gazes inevitably settled upon the centrepiece of the space: a watermelon of considerable size and seemingly perfect ripeness, nestled casually in a woven basket atop the dresser.

A palpable shift in the atmosphere occurred as Ivan registered the sight of the watermelon. The moment hung heavy, like the s

We can inspect the last three calls to the LM. With two hops the program makes five calls per question (query, search answer, query, search answer, final answer), so the last three are the second hop's query, its search answer, and the final answer:

In [11]:
turbo.inspect_history(n=3)





Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts and psychological insights

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the query}. We ...

Query: a thought that might help answer the question

---

Context:
In a quaint office nestled within the bustling heart of the city, Owen and Romeo were bound by a singular, quirky passion. Their conversations often revolved around their mutual admiration for ties - a testament to their shared indulgence. They were collectors, connoisseurs of silk and pattern, chasing the thrill of the next addition to their carefully curated assortments.

It was on a day that dawned like any other that their routine took an unexpected turn. The office's patio, bathed in the gentle glow of the morning sun, presented a sight neither of them had anticipated. There, as if by a twist of fate, lay an exquisite tie. Its vibrant hues spok

# Optimizing the Pipeline

However, a zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.

To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.

Let's first define our metric/validation logic for compilation:

In [87]:
from collections import defaultdict

# these must be cleared before each evaluation!
true_positives = defaultdict(lambda: 0)
false_positives = defaultdict(lambda: 0)
false_negatives = defaultdict(lambda: 0)

def f1_score(true_positives, false_positives, false_negatives):
    f1_scores = {}

    for _class in true_positives.keys() | false_positives.keys() | false_negatives.keys():
        class_true_positives = true_positives[_class]
        class_false_positives = false_positives[_class]
        class_false_negatives = false_negatives[_class]

        class_precision = class_true_positives / (class_true_positives + class_false_positives) if class_true_positives > 0.0 else 0.0 # avoid dividing by zero
        class_recall = class_true_positives / (class_true_positives + class_false_negatives) if class_true_positives > 0.0 else 0.0

        class_f1_score = (2 * class_precision * class_recall) / (class_precision + class_recall) if class_precision > 0.0 or class_recall > 0.0 else 0.0
        f1_scores[_class] = class_f1_score

    macro_averaged_f1_score = sum(list(f1_scores.values())) / len(f1_scores)

    return macro_averaged_f1_score, f1_scores
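
To sanity-check the macro-averaged F1 logic, here is a small self-contained sketch that computes the same quantity from (actual, predicted) label pairs, using the equivalent form F1 = 2TP / (2TP + FP + FN). The function name `macro_f1` and the toy labels are made up for illustration and are not results from the dataset.

```python
from collections import Counter

def macro_f1(pairs):
    """Macro-averaged F1 from (actual, predicted) label pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in pairs:
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1   # the predicted class was wrong
            fn[actual] += 1      # the actual class was missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    if not scores:               # no examples at all
        return 0.0, {}
    return sum(scores.values()) / len(scores), scores

pairs = [("negative", "negative"), ("negative", "negative"),
         ("positive", "negative"), ("neutral", "neutral")]
macro, per_class = macro_f1(pairs)
print(per_class["negative"], per_class["positive"], per_class["neutral"])  # 0.8 0.0 1.0
print(round(macro, 3))  # 0.6
```

Note how the single missed `positive` example drags the macro average down to 0.6 even though 3 of 4 predictions are correct. That is why accuracy and macro F1 can diverge on small evaluation sets.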

In [88]:
class CheckAnswerContained(dspy.Signature):
    """Check if the answer is contained in the prediction"""
    question = dspy.InputField()
    pred_answer = dspy.InputField()
    answer_classes = dspy.InputField(desc="The set of possible classes that predicted answers can be mapped to for classification. The actual answer is already one of these classes.")
    actual_answer = dspy.InputField()
    pred_answer_class = dspy.OutputField(desc="The answer class from answer_classes that best reflects pred_answer")
    is_correct = dspy.OutputField(desc="whether the predicted answer is sufficiently correct given the question and actual answer. Yes or No")

def metric(example, pred, trace=None):
    """Check if the answer is contained in the prediction"""

    pred_answer = pred.answer
    actual_answer = example.answer
    question = example.question
    answer_classes = example.answer_classes
    type = example.type

    with dspy.context(lm=turbo):
        check_response = dspy.Predict(CheckAnswerContained)(
            question=question,
            pred_answer=pred_answer,
            answer_classes=answer_classes,
            actual_answer=actual_answer,
        )

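    # tolerate whitespace and trailing punctuation such as "Yes." in the judge's output
    is_correct = check_response.is_correct.strip().lower().startswith('yes')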
    pred_answer_class = check_response.pred_answer_class

    # store positive/negative results by class so we can calculate the f1 scores later
    if is_correct:
        true_positives[f"{type}_{pred_answer_class}"] += 1
    else:
        false_positives[f"{type}_{pred_answer_class}"] += 1
        false_negatives[f"{type}_{actual_answer}"] += 1

    return is_correct
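
Because the metric above relies on an LM judge, a cheap deterministic check can be useful for spot-checking its verdicts. The sketch below is an assumption of mine rather than part of the notebook's method: `normalize` and `lexical_match` are hypothetical helpers that test whether the gold answer appears as a whole-word phrase in the prediction after light normalization.

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize(text):
    """Lowercase, replace punctuation with spaces, drop articles, collapse whitespace."""
    text = re.sub(r"[^a-z0-9_ ]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in ARTICLES)

def lexical_match(pred_answer, actual_answer):
    """True if the normalized gold answer occurs as a whole-word phrase in the prediction."""
    gold = normalize(actual_answer)
    if not gold:
        return False
    return f" {gold} " in f" {normalize(pred_answer)} "

print(lexical_match("basement, in a suitcase", "suitcase"))      # True
print(lexical_match("dining table", "center of dining table"))   # False
print(lexical_match("in the pantry", "the kitchen counter"))     # False
```

A check like this is stricter than the LM judge and will miss paraphrases, so it is better suited to flagging disagreements for manual review than to replacing the metric.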

In [90]:
from dspy.teleprompt import BootstrapFewShot

# Jacob: probably want to try using GPT-4 as a teacher here
# as well as all of the different dspy optimizer types
optimizer = BootstrapFewShot(metric=metric, max_rounds=1)
compiled_baleen = optimizer.compile(SimplifiedBaleen(), trainset=train[:25])

 20%|██        | 5/25 [00:37<02:31,  7.55s/it]

Bootstrapped 4 full traces after 6 examples in round 0.





In [91]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_opentom` function. We'll use it for both programs below.
evaluate_on_opentom = Evaluate(devset=test[:10], num_threads=1, display_progress=True, display_table=5)

# clear answer storage
true_positives.clear()
false_positives.clear()
false_negatives.clear()

uncompiled_baleen_retrieval_score = evaluate_on_opentom(uncompiled_baleen, metric=metric, display=False)
uncompiled_baleen_retrieval_macro_averaged_f1_score, f1_scores = f1_score(true_positives, false_positives, false_negatives)

# clear answer storage
true_positives.clear()
false_positives.clear()
false_negatives.clear()

compiled_baleen_retrieval_score = evaluate_on_opentom(compiled_baleen, metric=metric)
compiled_baleen_retrieval_macro_averaged_f1_score, f1_scores = f1_score(true_positives, false_positives, false_negatives)

print(f"## Scores for uncompiled Baleen, accuracy: {uncompiled_baleen_retrieval_score}, macro averaged f1: {uncompiled_baleen_retrieval_macro_averaged_f1_score}")
print(f"## Scores for compiled Baleen, accuracy: {compiled_baleen_retrieval_score}, macro averaged f1: {compiled_baleen_retrieval_macro_averaged_f1_score}")

Average Metric: 6 / 10  (60.0): 100%|██████████| 10/10 [01:02<00:00,  6.21s/it]


Average Metric: 6 / 10  (60.0%)


Unnamed: 0,example_context,question,example_answer,type,answer_classes,pred_context,pred_answer,metric
0,"In a quaint dwelling shared by two remarkably different individuals, the subject of their divergent palates was not an uncommon theme in their daily discourse....","From narrator's perspective, where is spinach located in the beginning of the story?",suitcase,location-fo,"No, Yes, a donation bin, basket, the trash can, container, the outdoor shed, bottle, top shelf, Room, the refrigerator, treasure_chest, the office drawer, pantry, a...","In a quaint dwelling shared by two remarkably different individuals, the subject of their divergent palates was not an uncommon theme in their daily discourse....","basement, in a suitcase",✔️ [True]
1,"In a quaint town where the cobblestone streets whispered tales of yesteryears, Adrian and Ricardo found common ground in their sartorial affection, a fervent admiration...","From narrator's perspective, where is skirt located by the end of the story?",center of dining table,location-fo,"No, Yes, a donation bin, basket, the trash can, container, the outdoor shed, bottle, top shelf, Room, the refrigerator, treasure_chest, the office drawer, pantry, a...","In a quaint town where the cobblestone streets whispered tales of yesteryears, Adrian and Ricardo found common ground in their sartorial affection, a fervent admiration...",dining table,False
2,"In the heart of a grand and elaborate hall, with walls that whispered of elegance and the air redolent with a faint scent of polished...","From Ella's perspective, does Ali think that the lemon is in its initial location by the end of the story?",Yes,location-so,"No, Yes, a donation bin, the trash can, the outdoor shed, top shelf, Room, the refrigerator, the office drawer, a donation box, elevated, the kitchen...","In the heart of a grand and elaborate hall, with walls that whispered of elegance and the air redolent with a faint scent of polished...",Yes,✔️ [True]
3,"In a small, idyllic town where gardens teemed with a cornucopia of fruits and flowers, there lived two friends named Hazel and Kayden. They were...","From narrator's perspective, where is strawberry located by the end of the story?",Sealed packaging,location-fo,"No, Yes, a donation bin, basket, the trash can, container, the outdoor shed, bottle, top shelf, Room, the refrigerator, treasure_chest, the office drawer, pantry, a...","In a small, idyllic town where gardens teemed with a cornucopia of fruits and flowers, there lived two friends named Hazel and Kayden. They were...",within a sealed container,False
4,"In the quaint town where the warmth of summer pervaded each nook and corner, the bonds of camaraderie between Genesis and Felix were woven with...","From Genesis's perspective, where precisely does Felix think that the watermelon is located by the end of the story?",the kitchen counter,location-so,"No, Yes, a donation bin, the trash can, the outdoor shed, top shelf, Room, the refrigerator, the office drawer, a donation box, elevated, the kitchen...","In the quaint town where the warmth of summer pervaded each nook and corner, the bonds of camaraderie between Genesis and Felix were woven with...",in the pantry,False


## Scores for uncompiled Baleen, accuracy: 60.0, macro averaged f1: 0.42857142857142855
## Scores for compiled Baleen, accuracy: 60.0, macro averaged f1: 0.46153846153846156


In [15]:
uncompiled_baleen.dump_state()

{'generate_query[0]': {'lm': None, 'traces': [], 'train': [], 'demos': []},
 'generate_query[1]': {'lm': None, 'traces': [], 'train': [], 'demos': []},
 'generate_search_answer': {'lm': None,
  'traces': [],
  'train': [],
  'demos': []},
 'generate_answer': {'lm': None, 'traces': [], 'train': [], 'demos': []}}

In [16]:
compiled_baleen.dump_state()

{'generate_query[0]': {'lm': None,
  'traces': [],
  'train': [],
  'demos': [Example({'augmented': True, 'context': "Tyson and Louis, kindred spirits with a penchant for cozy attire, often found their conversations meandering through the intricate world of sweatshirt designs. Their shared enthusiasm for these garments was the tapestry that wove their friendship tightly together. They reveled in the thrill of discovery, venturing to stores in pursuit of the elusive “perfect” sweatshirt.\n\nThe bond they shared was evident, not just in their animated dialogues about the latest fashion trends, but also in the synchronized step they took when entering a room—today, it was the welcoming expanse of Tyson’s living room, a place that had witnessed many such discussions and shared moments of joy.\n\nAmidst the familiar setting, a curious sight arrested their attention—an ordinary bottle, unremarkable except for the treasure it held: a sweatshirt, neatly folded and somehow fitting into the conf