# DSPy + OpenToM

The goal of this notebook is to explore the OpenToM dataset and see whether we can write DSPy code that optimizes prompts for answering its questions.

The OpenToM authors have evaluated the performance of CoT and SimToM on their dataset. I now wonder how much extra performance we can get from a framework like DSPy.

In [1]:
# SETUP: run poetry install + shell in the terminal, then i just say `cursor .` to open my editor and it runs this nb in the venv
# GETTING STARTED: let's import the packages and get the data
import dspy
import requests
import random
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # need ur api keys set beforehand

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
dspy.settings.configure(lm=turbo)

# dataset isn't able to be loaded using hf datasets package so let's read it from github raw
# also let's keep it simple and just go for the opentom_long.json
# this is the subset where they sampled 100 existing OpenToM plots to produce "extra long" narratives
url = "https://raw.githubusercontent.com/SeacowX/OpenToM/main/data/opentom_long.json"
response = requests.get(url).json()

df = pd.DataFrame(response)




In [2]:
df.head()

Unnamed: 0,plot,plot_info,preferences,personality,sentiment_statement,true_sentiment,intention,new_location,observed,narrative,long_narrative,question
0,Diego entered the patio.\nAmir entered the pat...,"{'mover': 'Diego', 'eoi': 'scarf', 'original_p...","{'mover': 'Diego hates scarf.', 'observer': 'A...",Diego is an inconsiderate person.,Diego hates scarf.,the most negative,get rid of it and make it accessible to someon...,a donation bin,True,Diego and Amir were both residents of the same...,"Diego and Amir, two residents of the harmoniou...","{'question': 'As Amir, what is your attitude t..."
1,Paxton entered the pantry.\nHadley entered the...,"{'mover': 'Hadley', 'eoi': 'hat', 'original_pl...","{'mover': 'Hadley hates hat', 'observer': 'Pax...",Hadley is a negativistic person.,Hadley wants to get rid of the hat.,the action that is the most adversarial to the...,dispose of the hat,the trash can,True,Hadley had always despised the hat. It was a s...,"In the quaint quarters of a shared apartment, ...","{'question': 'As Paxton, what is your attitude..."
2,Abraham entered the office.\nMarcos entered th...,"{'mover': 'Abraham', 'eoi': 'melon', 'original...","{'mover': 'Abraham hates melon.', 'observer': ...",Abraham is an inconsiderate person.,Abraham hates melon.,the most negative,separate it from other food items and discoura...,the outdoor shed,True,Abraham despised melon with every fiber of his...,In a world where personal preferences are as v...,"{'question': 'As Marcos, what is your attitude..."
3,Jeremy entered the garden.\nJayden entered the...,"{'mover': 'Jeremy', 'eoi': 'peas', 'original_p...","{'mover': 'Jeremy hates peas.', 'observer': 'J...",Jeremy is an inconsiderate person.,Jeremy hates peas.,the most negative,make them less visible and accessible,top shelf,False,"Jeremy and Jayden were two peas in a pod, or s...","In the warm embrace of kinship, Jeremy and Jay...","{'question': 'As Jayden, what is your attitude..."
4,Andrew entered the back_yard.\nRichard entered...,"{'mover': 'Andrew', 'eoi': 'hat', 'original_pl...","{'mover': 'Andrew likes hat', 'observer': 'Ric...",Andrew is a negativistic person.,Andrew wants to show-off the hat.,is the most show-off action to the hat,showcase it as part of his collection to frien...,Room,False,Andrew and Richard were two very different ind...,"Andrew and Richard were two contrasting souls,...","{'question': 'As Richard, what is your attitud..."


In [3]:
df.loc[0, 'question']

{'question': "As Amir, what is your attitude towards Diego's action of moving scarf from basket to a donation bin?",
 'answer': 'negative',
 'type': 'attitude'}

In [4]:
type_counts = df['question'].apply(lambda x: x['type']).value_counts()
type_counts  #fo means first-order, so means second-order

# first-order questions directly ask about a character’s perception of the world, while
# second order questions ask about a character’s belief of another character's mental state

question
multihop-fo    600
multihop-so    600
location-fo    600
location-so    400
attitude       100
Name: count, dtype: int64

In [5]:
# Assuming 'df' is your DataFrame and it contains a 'question' column with dictionaries having 'type' and 'answer' keys

# Extract 'type' and 'answer' into separate columns
df['type'] = df['question'].apply(lambda x: x['type'])
df['answer'] = df['question'].apply(lambda x: x['answer'])

# Group by 'type' and get unique 'answer' values for each 'type'
unique_answers_by_type = df.groupby('type')['answer'].unique()

print(unique_answers_by_type)

type
attitude                           [negative, positive, neutral]
location-fo    [No, Yes, a donation bin, basket, the trash ca...
location-so    [No, Yes, a donation bin, the trash can, the o...
multihop-fo    [less full, more full, less accessible, equall...
multihop-so    [less full, equally full, more full, less acce...
Name: answer, dtype: object


In [6]:
# convert the dataset to what DSPy expects (list of Example objects)
dataset = []

for index, row in df.iterrows():
    context = row['long_narrative']
    question = row['question']['question']
    answer = row['question']['answer']
    dataset.append(dspy.Example(context=context, question=question, answer=answer).with_inputs("context", "question"))


In [7]:
# create train test split
random.shuffle(dataset)
train = dataset[:int(len(dataset) * 0.8)]
test = dataset[int(len(dataset) * 0.8):]

print(f"Nrow Train: {len(train)}")
print(f"Nrow Test: {len(test)}")

Nrow Train: 1840
Nrow Test: 460
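
The split above uses the global `random` state, so which examples land in train and test changes on every run. Here is a minimal sketch of a reproducible alternative. The helper name `split_dataset` and the seed value 42 are assumptions made for illustration; neither comes from the original notebook, DSPy, or OpenToM.

```python
import random

def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` with a private RNG and split it.

    `split_dataset`, `train_fraction` and `seed` are illustrative names,
    not part of DSPy or the OpenToM code.
    """
    rng = random.Random(seed)   # private RNG: does not touch global state
    shuffled = list(items)      # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 2,300 dspy.Example objects built above.
demo_items = list(range(2300))
demo_train, demo_test = split_dataset(demo_items)
print(len(demo_train), len(demo_test))
```

Calling `split_dataset(dataset)` in place of the shuffle-and-slice cell should give the same 80/20 sizes while keeping the membership fixed across runs.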


# Define the Signatures

We use a simplified "Baleen" pipeline [(Khattab et al., 2021)](https://arxiv.org/abs/2101.00436).


In [8]:
# answer the question
class GenerateAnswer(dspy.Signature):
    """Generate answers to the questions"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# generate a question to help you better answer the question
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    query = dspy.OutputField(desc="a thought that might help answer the question") 

class GenerateSearchAnswer(dspy.Signature):
    """Generate a long form answer to the question given the context"""

    context = dspy.InputField(desc="may contain relevant facts and psychological insights")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a thought about what the answer to the question may be")

# the metric (assessing whether the generated answer is correct) is defined further down, before compilation


In [9]:

class SimplifiedBaleen(dspy.Module):
    def __init__(self, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.generate_search_answer = dspy.ChainOfThought(GenerateSearchAnswer)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
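    def forward(self, question, context):
        # Start from the narrative and append the thought produced by every hop,
        # so that earlier hops are not overwritten by later ones.
        final_context = context

        for hop in range(self.max_hops):
            # Ask a helper question about the context gathered so far...
            query = self.generate_query[hop](context=final_context, question=question).query
            # ...answer it, and keep the answer as extra context.
            filtered_context = self.generate_search_answer(context=final_context, question=query).answer
            final_context = final_context + "\n\n" + filtered_context

        pred = self.generate_answer(context=final_context, question=question)
        # Return the enriched context so callers can inspect what was generated.
        return dspy.Prediction(context=final_context, answer=pred.answer)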

Ok so what I think is happening here:
- generate queries over the context (long narrative) based on the question we're trying to answer
- answer those queries using the GenerateSearchAnswer signature
- then add those answers to the narrative and use the result as the context for answering the original question

This is kinda like SimToM
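
To make the flow described above concrete without spending any API calls, here is a minimal sketch that swaps the LM calls for stub functions. `run_hops`, `fake_ask`, and `fake_answer` are hypothetical names made up for illustration; they are not part of DSPy. The sketch assumes that the thought from every hop is kept and appended to the narrative.

```python
def run_hops(context, question, ask, answer, max_hops=2):
    """Accumulate intermediate thoughts across hops.

    `ask` and `answer` are hypothetical callables standing in for the
    query-generating and query-answering LM calls.
    """
    notes = context
    for _ in range(max_hops):
        query = ask(notes, question)           # helper question for this hop
        thought = answer(notes, query)         # answer to the helper question
        notes = notes + "\n\n" + thought       # keep it as extra context
    return notes

calls = []  # records the order of the (fake) LM calls

def fake_ask(ctx, q):
    calls.append("query")
    return f"sub-question {calls.count('query')} about: {q}"

def fake_answer(ctx, query):
    calls.append("search_answer")
    return f"thought for [{query}]"

final_context = run_hops("NARRATIVE", "Where is the tie?", fake_ask, fake_answer)
calls.append("final_answer")  # the last LM call answers the original question

print(calls)
print(final_context)
```

With two hops this records five calls in total: a query and a search answer per hop, followed by the final answer.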

# Executing the Pipeline

Let's see how this works in a zero-shot setting

In [10]:
my_question = train[0].question
my_context = train[0].context

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question, my_context)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Generated Context: {pred.context}")
print(f"True Answer: {train[0].answer}")


Question: From Romeo's perspective, where is the tie located precisely by the end of the story?
Predicted Answer: In the pantry
Generated Context: In a quaint office nestled within the bustling heart of the city, Owen and Romeo were bound by a singular, quirky passion. Their conversations often revolved around their mutual admiration for ties - a testament to their shared indulgence. They were collectors, connoisseurs of silk and pattern, chasing the thrill of the next addition to their carefully curated assortments.

It was on a day that dawned like any other that their routine took an unexpected turn. The office's patio, bathed in the gentle glow of the morning sun, presented a sight neither of them had anticipated. There, as if by a twist of fate, lay an exquisite tie. Its vibrant hues spoke a silent language of elegance, weaving an enchanting narrative into its folds.

Owen and Romeo stood in momentary awe, beholding the tie with a quiet reverence reserved for a masterpiece. The ti

We can inspect the last three calls to the LM. With two hops the pipeline makes five calls (a query and a search answer per hop, plus the final answer), so the last three are the second hop's query, the second hop's search answer, and the final answer:

In [11]:
turbo.inspect_history(n=3)





Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts and psychological insights

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the query}. We ...

Query: a thought that might help answer the question

---

Context:
In a quaint office nestled within the bustling heart of the city, Owen and Romeo were bound by a singular, quirky passion. Their conversations often revolved around their mutual admiration for ties - a testament to their shared indulgence. They were collectors, connoisseurs of silk and pattern, chasing the thrill of the next addition to their carefully curated assortments.

It was on a day that dawned like any other that their routine took an unexpected turn. The office's patio, bathed in the gentle glow of the morning sun, presented a sight neither of them had anticipated. There, as if by a twist of fate, lay an exquisite tie. Its vibrant hues spok

# Optimizing the Pipeline

However, a zero-shot approach quickly falls short for more specialized tasks, novel domains/settings, and more efficient (or open) models.

To address this, DSPy offers compilation. Let's compile our multi-hop (SimplifiedBaleen) program.

Let's first define our metric/validation logic for compilation:

In [12]:
class CheckAnswerContained(dspy.Signature):
    """Check if the answer is contained in the prediction"""
    question = dspy.InputField()
    pred_answer = dspy.InputField()
    actual_answer = dspy.InputField()
    is_correct = dspy.OutputField(desc="whether the predicted answer is sufficiently correct given the question and actual answer. Yes or No")

def metric(example, pred, trace=None):
    """Check if the answer is contained in the prediction"""

    pred_answer = pred.answer
    actual_answer = example.answer
    question = example.question

    with dspy.context(lm=turbo):
        check_response = dspy.Predict(CheckAnswerContained)(
            question=question,
            pred_answer=pred_answer,
            actual_answer=actual_answer
        )

    # tolerate whitespace and trailing text such as "Yes." or "Yes, because ..."
    is_correct = check_response.is_correct.strip().lower().startswith('yes')

    return is_correct
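
An LM-based judge costs one extra call per example and can itself be wrong. One possible complement, sketched here under the assumption that OpenToM answers are short phrases, is a cheap string check that runs first. `normalize` and `string_match` are hypothetical helpers, not DSPy functions.

```python
import re

def normalize(text):
    """Lowercase, turn punctuation/underscores into spaces, drop articles."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    words = [w for w in text.split() if w not in {"a", "an", "the"}]
    return " ".join(words)

def string_match(pred_answer, actual_answer):
    """True if the normalized gold answer appears as a whole phrase in the prediction."""
    pred = f" {normalize(pred_answer)} "
    gold = f" {normalize(actual_answer)} "
    return gold in pred

print(string_match("in the refrigerator", "the refrigerator"))
print(string_match("crawlspace", "bathtub"))
print(string_match("limited", "less accessible"))
```

A check like this only catches literal matches: a paraphrase such as "limited" for "less accessible" fails it, so a `False` result could fall through to the LM judge rather than being treated as final.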

In [13]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=metric, max_rounds=1)
compiled_baleen = optimizer.compile(SimplifiedBaleen(), trainset=train[:50])

 16%|█▌        | 8/50 [00:51<04:30,  6.43s/it]

Bootstrapped 4 full traces after 9 examples in round 0.





In [14]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_opentom` function. We'll use this many times below.
evaluate_on_opentom = Evaluate(devset=test[:10], num_threads=1, display_progress=True, display_table=5)

uncompiled_baleen_retrieval_score = evaluate_on_opentom(uncompiled_baleen, metric=metric, display=False)

compiled_baleen_retrieval_score = evaluate_on_opentom(compiled_baleen, metric=metric)

print(f"## Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Score for compiled Baleen: {compiled_baleen_retrieval_score}")

Average Metric: 9 / 10  (90.0): 100%|██████████| 10/10 [00:58<00:00,  5.81s/it]


Average Metric: 9 / 10  (90.0%)


Unnamed: 0,example_context,question,example_answer,pred_context,pred_answer,metric
0,"In a quaint corner of their world, Damien and Gabriella shared a residence and, coincidentally, an aversion to a certain leafy green: cabbage. This mutual...","From Gabriella's perspective, where precisely does Damien think that the cabbage is located by the end of the story?",the refrigerator,"In a quaint corner of their world, Damien and Gabriella shared a residence and, coincidentally, an aversion to a certain leafy green: cabbage. This mutual...",in the refrigerator,✔️ [True]
1,"In the quaint realm of a small, suburban neighborhood where identical houses lined the streets like soldiers at attention, Landon and Lillian existed as a...","From Landon's perspective, how would sweet_potato's accessibility change for Lillian by the end of the story?",less accessible,"In the quaint realm of a small, suburban neighborhood where identical houses lined the streets like soldiers at attention, Landon and Lillian existed as a...",limited,✔️ [True]
2,"In a tale of divergent tastes, two individuals stood at odds over a common cruciferous entity - broccoli. Alejandro harbored an intense dislike towards this...","From Alejandro's perspective, how would broccoli's accessibility change for Peyton by the end of the story?",less accessible,"In a tale of divergent tastes, two individuals stood at odds over a common cruciferous entity - broccoli. Alejandro harbored an intense dislike towards this...",less accessible,✔️ [True]
3,"In an old house fringed with the whispers of history, two individuals, Jett and Liliana, shared the space, moving through life in a dance of...","From narrator's perspective, where is tie located in the beginning of the story?",bathtub,"In an old house fringed with the whispers of history, two individuals, Jett and Liliana, shared the space, moving through life in a dance of...",crawlspace,False
4,"In a quaint town where the cobblestone streets whispered tales of yesteryears, Adrian and Ricardo found common ground in their sartorial affection, a fervent admiration...","From Adrian's perspective, how does dining table's fullness change by the end of the story?",more full,"In a quaint town where the cobblestone streets whispered tales of yesteryears, Adrian and Ricardo found common ground in their sartorial affection, a fervent admiration...",more full,✔️ [True]


## Score for uncompiled Baleen: 70.0
## Score for compiled Baleen: 90.0
