# Prompt Optimization using DSPy on GSM-PLUS

In this tutorial, I will show how to use the dspy library to discover a good few-shot LLM prompt for a math dataset.

Using dspy, you don't have to work with prompts directly. Instead, you work with classes and variables that define the properties of your prompt. For example, question and answer are properties of a prompt that aims to solve a math problem. Ideally, you want to pass the question and answer as variables instead of inserting them into the prompt by hand. You may also prefer not to maintain your prompt as a string constant, and instead define your LLM program in a more structured way, starting from a basic task definition. In dspy, this definition of an LLM program is called a signature. Let's jump into it and other key concepts.

## Key dspy concepts

__Signature__ - The definition of your task and its input/output variables. It contains the task description and brief descriptions of each input and output variable.

__Examples__ - A list of `Example` class instances, each containing all of the properties that are part of your signature. You will typically have separate train and validation lists, each made up of `Example` instances.

__Module__ - Defines how inference happens. You use your signature and build a `forward()` method, similar to PyTorch's `forward()` method.

__Predictors__ - These are functions that convert a signature into a text prompt. Several prompting strategies, e.g. chain of thought, have been shown to give better results. dspy provides these as primitives: when you pass your signature to a predictor of a given type, the library builds the text prompt for you. The text prompt also contains the signature's input/output fields and their descriptions.

__Validation Function__ - A function that takes the true label and the predicted label and returns a truth value after comparing them. This function can be simple or complex, and you can also use LLMs to check for equality of values.

__Optimizer__ - The optimizer takes your module and training set (a list of `Example` instances) and runs an optimization algorithm to discover an accurate prompt. Optimizers use two main strategies: few-shot examples and instruction optimization. The few-shot strategy picks random examples from the train set to build a prompt with examples, and iterates on this prompt over time to get better results.

__Evaluator__ - An evaluation config that lets you run an eval on your validation set. It also uses the validation function to report the accuracy of the LLM on your task.
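To make the signature-to-prompt idea concrete, here is a small plain-Python sketch. It is not dspy's actual implementation; `render_prompt` is a hypothetical helper I made up that only imitates the general layout of a rendered prompt (task description, field descriptions, then the filled-in inputs).

```python
# Hypothetical sketch: NOT dspy's real code, only an illustration of how a
# signature-like description could be rendered into a text prompt.

def render_prompt(instructions, fields, inputs):
    """Build a prompt from a task description, field descriptions, and inputs.

    fields: list of (name, description) pairs, inputs first, outputs last.
    inputs: dict mapping input field names to their values.
    """
    lines = [instructions, "", "---", "", "Follow the following format.", ""]
    for name, desc in fields:
        lines.append(f"{name.capitalize()}: {desc}")
    lines += ["", "---", ""]
    for name, _ in fields:
        if name in inputs:
            lines.append(f"{name.capitalize()}: {inputs[name]}")
        else:
            # Leave the first missing (output) field open for the model to complete.
            lines.append(f"{name.capitalize()}:")
            break
    return "\n".join(lines)


prompt = render_prompt(
    "Solve the math question that is given to you",
    [("question", "A math problem"),
     ("answer", "The final numeric answer in plain number")],
    {"question": "What is 2 + 3?"},
)
print(prompt)
```

The point of the sketch is that the task description and field descriptions live in one structured place, and the prompt text is derived from them rather than maintained by hand.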

In [128]:
import dspy
import json
import numpy
import os
from dotenv import load_dotenv
from dspy.teleprompt import *
from dspy.evaluate import Evaluate

load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

## Set up your model

dspy supports several model providers. Check the docs: https://dspy-docs.vercel.app/docs/building-blocks/language_models

In [129]:
gpt4 = dspy.OpenAI('gpt-4o', api_key = OPENAI_API_KEY)
dspy.settings.configure(lm=gpt4)

## Load the GSM-PLUS dataset

https://github.com/qtli/GSM-Plus

In [133]:
# read the JSONL file line by line; the context manager closes the file for us
with open('../gsmplus_test.jsonl') as json_data:
    all_examples_json = [json.loads(line.rstrip()) for line in json_data]

In [134]:
all_examples_json[1]

{'question': "Janet's ducks lay 1600 eggs per day. She eats 300 for breakfast every morning and bakes muffins for her friends every day with 400. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
 'solution': "Janet's ducks lay 1600 eggs per day. She uses 300 for breakfast and 400 for muffins, which totals 700 eggs. This means she has 1600 - 700 = 900 eggs left to sell at the farmers' market. \n\nSince she sells each egg for $2, she makes 900 * $2 = #### 1800 dollars every day at the farmers' market.",
 'answer': '1800',
 'perturbation_type': 'digit expansion',
 'seed_question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
 'seed_solution': 'Janet sells

Create the instances of the `Example` class

In [137]:
# keep only records with a usable answer and mark `question` as the input field
all_examples = [
    dspy.Example(question=x['question'], answer=x['answer']).with_inputs('question')
    for x in all_examples_json
    if x['answer'] is not None and x['answer'] != 'None'
]

In [135]:
all_examples[0]

Example({'question': "Nani is 8.0 years old. His brother is 2.0 times his age. Nani's sister is 0.75 times his age. What is the total age of all three of the family members?", 'answer': '30'}) (input_keys={'question'})
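The filtering step above drops records whose answer is missing. Here is a dspy-free sketch of the same idea on a few made-up JSONL lines; `parse_records` and the sample records are hypothetical and only illustrate the two ways a missing answer can show up (JSON `null` or the string `'None'`).

```python
import json


def parse_records(lines):
    """Parse JSONL lines and keep only records with a usable answer."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Drop missing answers, whether stored as JSON null or the string 'None'.
        if record.get("answer") in (None, "None"):
            continue
        records.append(record)
    return records


# Made-up sample lines, not taken from GSM-Plus.
sample_lines = [
    '{"question": "What is 2 + 3?", "answer": "5"}',
    '{"question": "Unanswerable?", "answer": "None"}',
    '{"question": "No label", "answer": null}',
]
kept = parse_records(sample_lines)
print(len(kept), kept[0]["answer"])
```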

## Create the train and validation sets

In [145]:
numpy.random.seed(0)
numpy.random.shuffle(all_examples)

train_size = 30
val_size = 30

train = all_examples[0:train_size]
val = all_examples[train_size:(train_size + val_size)]

In [147]:
train[0]

Example({'question': 'Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?', 'answer': '35280'}) (input_keys={'question'})

In [148]:
val[0]

Example({'question': 'Boris has 100 apples and 50 oranges. Beck has 23 fewer apples than Boris and 30 more oranges. If Boris gives Beck 10 apples and 5 oranges, how many fewer apples does Beck have than Boris now, considering that the oranges do not affect the apple count?', 'answer': '3'}) (input_keys={'question'})
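If you want the split logic in a reusable form, a minimal sketch could look like this. It uses the standard library's `random.Random` instead of `numpy.random` (so the shuffled order will differ from the cell above), and `split_examples` is a hypothetical helper name.

```python
import random


def split_examples(examples, train_size, val_size, seed=0):
    """Shuffle a copy of `examples` and return disjoint train and validation lists."""
    if train_size + val_size > len(examples):
        raise ValueError("not enough examples for the requested split")
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]
    val = shuffled[train_size:train_size + val_size]
    return train, val


items = list(range(100))  # stand-in for the list of Example instances
train_ids, val_ids = split_examples(items, train_size=30, val_size=30)
print(len(train_ids), len(val_ids))
```

Because the validation slice starts where the train slice ends, no example can land in both sets, which is what keeps the later evaluation honest.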

## Create the signature

In [90]:
class BasicMathTask(dspy.Signature):
    """Solve the math question that is given to you and give the final answer without any special characters"""

    question = dspy.InputField(desc = "A math problem")
    answer = dspy.OutputField(desc="The final numeric answer in plain number")

## Try some predictors

In [150]:
# basic predictor
predictor1 = dspy.Predict(BasicMathTask)
# execute the predictor with all of the InputFields defined in the signature
predictor1(question = train[0].question)

Prediction(
    answer='Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?\nAnswer: 35280'
)

Note that you did not get a plain text prediction back. Instead, you got back a `Prediction` instance with an `answer` property, which you defined in your signature as an `OutputField`.

Let's try a `ChainOfThought` predictor.

In [151]:
predictor2 = dspy.ChainOfThought(BasicMathTask)
predictor2(question = train[0].question)

Prediction(
    rationale="Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?\nReasoning: Let's think step by step in order to produce the answer. We start with the initial value of the yacht, which is $90,000. \n\n1. In the first year, the yacht depreciates by 30%. \n   Depreciation amount for the first year = 30% of $90,000 = 0.30 * 90,000 = $27,000.\n   Value after the first year =",
    answer="Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?\nReasoning: Let's think step by step in order to produce the answer."
)

You can see an additional rationale field here. Let us inspect LLM history to see the actual prompts.

In [152]:
gpt4.inspect_history(n = 2)




Solve the math question that is given to you and give the final answer without any special characters

---

Follow the following format.

Question: A math problem
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: The final numeric answer in plain number

---

Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?
Reasoning: Let's think step by step in order to Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?
Reasoning: Let's think step by step in order to produce the answer. We start with the initial value of the yacht, which is $90,000. 

1. In the first year, the yacht depreciates by 30%

Both predictors have a response-formatting problem: instead of returning only the final answer, the model echoes the question and part of its reasoning in the `answer` field. Let us wrap the signature in a module, run an optimizer, and see if things get better.
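One possible mitigation, separate from optimization, is to post-process the completion and pull out the last number. The sketch below is only an assumption about what such a helper could look like; `extract_final_number` is hypothetical, and it only helps when the completion actually ends with the final answer.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a plain string, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Strip thousands separators so "35,280" becomes "35280".
    return matches[-1].replace(",", "")


print(extract_final_number("Question: ... depreciated by 20%.\nAnswer: 35280"))
print(extract_final_number("The yacht is worth $35,280."))
print(extract_final_number("no digits here"))
```

If the model stops before reaching a final answer, the helper will return some intermediate number from the text, so it is not a replacement for fixing the prompt itself.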

## Create a module

In [153]:
class MathBot(dspy.Module):
    def __init__(self):
        super().__init__()

        self.generate_answer = dspy.ChainOfThought(BasicMathTask)
    
    def forward(self, question):
        prediction = self.generate_answer(question=question)
        return prediction

In [154]:
mathbot = MathBot()
mathbot(question = train[0].question)

Prediction(
    rationale="Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?\nReasoning: Let's think step by step in order to produce the answer. We start with the initial value of the yacht, which is $90,000. \n\n1. In the first year, the yacht depreciates by 30%. \n   Depreciation amount for the first year = 30% of $90,000 = 0.30 * 90,000 = $27,000.\n   Value after the first year =",
    answer="Question: Gary purchased a yacht for $90,000. In the first year, it depreciated by 30%. In the second year, it depreciated by another 30%. In the third year, it depreciated by 20%. How much is the yacht worth after the three years?\nReasoning: Let's think step by step in order to produce the answer."
)

## Define a validation function

In [76]:
def validate_answer(correct_label, prediction, trace=None):
    # exact-match comparison between the gold answer and the predicted answer
    return dspy.evaluate.answer_exact_match(correct_label, prediction)

Let's check how the validation function works:

In [155]:
validate_answer(dspy.Example(answer="4"), dspy.Example(answer="4"))

True

In [49]:
validate_answer(dspy.Example(answer="4"), dspy.Example(answer="=four"))

False
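As mentioned in the key concepts, the validation function can be as simple or as complex as you need. As an illustration only, here is a hypothetical, dspy-free metric that compares answers numerically instead of as exact strings. `numeric_match` is not part of dspy and is not the metric used in this tutorial.

```python
def numeric_match(gold, predicted, tol=1e-6):
    """Return True when both values parse to numbers that agree within `tol`."""
    try:
        gold_value = float(str(gold).replace(",", ""))
        pred_value = float(str(predicted).replace(",", ""))
    except ValueError:
        return False  # non-numeric text such as "four" never matches
    return abs(gold_value - pred_value) <= tol


print(numeric_match("4", "4.0"))         # True
print(numeric_match("13000", "13,000"))  # True
print(numeric_match("4", "four"))        # False
```

A metric like this is more forgiving about formatting (`4` vs `4.0`), at the cost of no longer checking that the model followed the requested output format.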

## Define and execute the optimizer

We will use the `BootstrapFewShotWithRandomSearch` optimizer, which focuses on creating a good few-shot prompt. Read more on optimizers here: https://dspy-docs.vercel.app/docs/building-blocks/optimizers

In [156]:
# the optimizer evaluates num_candidate_programs candidate programs, plus a few more
config = dict(max_bootstrapped_demos=5, max_labeled_demos=5, num_candidate_programs=10, num_threads=10)

# create the optimizer
optimizer = BootstrapFewShotWithRandomSearch(metric=validate_answer, **config)

# Compile!
optimized_mathbot = optimizer.compile(MathBot(), trainset=train)

Average Metric: 6 / 30  (20.0): 100%|██████████████████| 30/30 [00:14<00:00,  2.11it/s]
Average Metric: 12 / 30  (40.0): 100%|█████████████████| 30/30 [00:13<00:00,  2.31it/s]
 57%|████████████████████████████▎                     | 17/30 [00:05<00:04,  2.90it/s]
Average Metric: 11 / 30  (36.7): 100%|█████████████████| 30/30 [00:13<00:00,  2.24it/s]
 37%|██████████████████▎                               | 11/30 [00:43<01:15,  3.97s/it]
Average Metric: 14 / 30  (46.7): 100%|█████████████████| 30/30 [00:11<00:00,  2.69it/s]
 10%|█████                                              | 3/30 [00:07<01:11,  2.66s/it]
Average Metric: 10 / 30  (33.3): 100%|█████████████████| 30/30 [00:13<00:00,  2.29it/s]
  3%|█▋                                                 | 1/30 [00:02<01:10,  2.44s/it]
Average Metric: 17 / 30  (56.7): 100%|█████████████████| 30/30 [00:13<00:00,  2.19it/s]
 13%|██████▊                                            | 4/30 [00:14<01:35,  3.68s/it]
Average Metric: 8 / 30  (26.7): 
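To build some intuition for what a random search over few-shot demos does, here is a toy, dspy-free sketch. It is not the library's algorithm (the real optimizer also bootstraps demos by running the program itself). `random_search_demos`, `toy_program`, and the tiny dataset are all hypothetical stand-ins.

```python
import random


def random_search_demos(trainset, run_program, metric, num_candidates=10, k=3, seed=0):
    """Try several random demo subsets and return the best one with its score.

    run_program(demos, question) -> predicted answer (stands in for an LLM call).
    metric(gold, predicted) -> bool.
    """
    rng = random.Random(seed)
    best_demos, best_score = [], -1.0
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k)  # one candidate few-shot prompt
        correct = sum(
            metric(ex["answer"], run_program(demos, ex["question"])) for ex in trainset
        )
        score = correct / len(trainset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


def toy_program(demos, question):
    """Fake model: it can only apply operators that appear in its demos."""
    a, op, b = question.split()
    seen_ops = {d["question"].split()[1] for d in demos}
    if op not in seen_ops:
        return "unknown"
    return str(int(a) + int(b)) if op == "+" else str(int(a) * int(b))


trainset = [
    {"question": "2 + 3", "answer": "5"},
    {"question": "4 * 5", "answer": "20"},
    {"question": "7 + 1", "answer": "8"},
    {"question": "3 * 3", "answer": "9"},
]
best_demos, best_score = random_search_demos(
    trainset, toy_program, lambda gold, pred: gold == pred, num_candidates=10, k=2
)
print(best_score, [d["question"] for d in best_demos])
```

Each `Average Metric` line in the log above plays the role of one `score` in this loop: a candidate set of demos is scored, and the best-scoring candidate is kept.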

## Test the optimized program/prompt

Let us generate a response using the optimized program first.

In [162]:
optimized_mathbot.generate_answer(question = all_examples[250].question)

Prediction(
    rationale="Reasoning: Let's think step by step in order to produce the answer. We need to set up equations based on the information given. Let \\( S \\) be the number of books Sofie has, \\( A \\) be the number of books Anne has, and \\( F \\) be the number of books Fawn has.\n\n1. Sofie's collection exceeds Anne's by 25 books:\n\\[ S = A + 25 \\]\n\n2. Anne's collection is short of Fawn's by 12 books:\n\\[ A = F - 12 \\]\n\n3. The total number of books is 85:\n\\[ S + A + F = 85 \\]\n\nSubstitute \\( S \\) and \\( A \\) in terms of",
    answer="Question: Sofie, Anne, and Fawn collectively own 85 books. Sofie's collection exceeds Anne's by 25 books, while Anne's collection is short of Fawn's by 12 books. Can you determine the number of books in Fawn's possession?\nReasoning: Let's think step by step in order to produce the answer. We need to set up"
)

Still not good: the `answer` field contains the question again, as we saw before. Let us try another example.

In [168]:
optimized_mathbot.generate_answer(question = all_examples[251].question)

Prediction(
    rationale="Reasoning: Let's think step by step in order to produce the answer. We know that the total amount invested is $1200. Dylan invested 2/5 of the total amount, which is 2/5 * 1200 = 480 dollars. The remaining amount after Dylan's investment is 1200 - 480 = 720 dollars. Frances then invested 2/3 of the remaining amount, which is 2/3 * 720 = 480 dollars. The remaining amount after Frances's investment is 720 - 480 = 240 dollars. Therefore, Skyler invested the remaining amount, which is 240 dollars.",
    answer='240'
)

This is a better response. Let us inspect history to see the final prompt.

In [169]:
gpt4.inspect_history(1)




Solve the math question that is given to you and give the final answer without any special characters

---

Follow the following format.

Question: A math problem
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: The final numeric answer in plain number

---

Question: For three years, Elise has been marketing her father's book collection, which consists of 250 books. Each book is priced at 20$. In her first year, she managed to sell double the number of books she sold this year. At present, there are 50 books that have not been sold yet, and she has sold 45 books this year. Can you calculate the total revenue she generated in her second year of sales?
Reasoning: Let's think step by step in order to Reasoning: Let's think step by step in order to produce the answer. We know that Elise has sold 45 books this year. Since she sold double the number of books in her first year, she sold 45 * 2 = 90 books in her first year. The total number of books sol

## Evaluation: baseline vs. optimized prompt programs

Our baseline is `mathbot`, a `MathBot()` without any optimization. The final program is `optimized_mathbot`, which is also an instance of `MathBot` and contains the optimized prompt.

In [160]:
mathbot.__class__

__main__.MathBot

In [164]:
optimized_mathbot.__class__

__main__.MathBot

Define the evaluator configuration

In [165]:
evaluator = Evaluate(devset=val, num_threads=15, display_progress=True, display_table=15)

### Baseline evaluation

In [166]:
evaluator(mathbot, metric=validate_answer)

Average Metric: 2 / 30  (6.7): 100%|███████████████████| 30/30 [00:09<00:00,  3.31it/s]


| | question | example_answer | rationale | pred_answer | validate_answer |
|---|---|---|---|---|---|
| 0 | Boris has 100 apples and 50 oranges. Beck has 23 fewer apples than Boris and 30 more oranges. If Boris gives Beck 10 apples and... | 3.0 | Reasoning: Let's think step by step in order to produce the answer. We start by determining the initial number of apples and oranges each person... | Question: Boris has 100 apples and 50 oranges. Beck has 23 fewer apples than Boris and 30 more oranges. If Boris gives Beck 10 apples... | False |
| 1 | Lloyd earns $100 an hour on Math tutoring. He tutored 50 hours for the first week and 80 hours for the second week. How much... | 13000.0 | Reasoning: Let's think step by step in order to produce the answer. We need to calculate Lloyd's earnings for each week and then sum them... | Question: Lloyd earns $100 an hour on Math tutoring. He tutored 50 hours for the first week and 80 hours for the second week. How... | False |
| 2 | The Llesis family drove and hiked 6.0 hours to their vacation spot. They drove an average of 50.0 miles per hour and hiked an average... | 255.0 | Reasoning: Let's think step by step in order to produce the answer. We know the total time spent traveling is 6.0 hours, and the time... | Question: The Llesis family drove and hiked 6.0 hours to their vacation spot. They drove an average of 50.0 miles per hour and hiked an... | False |
| 3 | James enjoys swimming and needs to traverse a lake that is 20 miles wide. His swimming speed is 2 miles per hour. He covers 60%... | 17.0 | Reasoning: Let's think step by step in order to produce the answer. We need to calculate the time James spent swimming in two parts and... | Question: James enjoys swimming and needs to traverse a lake that is 20 miles wide. His swimming speed is 2 miles per hour. He covers... | False |
| 4 | Mark is preparing a quadruple batch of brownies for a large gathering. The standard recipe calls for 300 cups of flour and 100 cups of... | 4.0 | Question: Mark is preparing a quadruple batch of brownies for a large gathering. The standard recipe calls for 300 cups of flour and 100 cups... | Question: Mark is preparing a quadruple batch of brownies for a large gathering. The standard recipe calls for 300 cups of flour and 100 cups... | False |
| 5 | Nathan has a bouncy ball that bounces to 2/3rds of its starting height with each bounce. If he drops it from the third-floor balcony in... | 24.0 | Reasoning: Let's think step by step in order to produce the answer. We know that the ball bounces to 2/3 of its previous height with... | Question: Nathan has a bouncy ball that bounces to 2/3rds of its starting height with each bounce. If he drops it from the third-floor balcony... | False |
| 6 | Mr. Smith has two farms, Farm X and Farm Y. Initially, he has 55 goats in Farm X and 45 goats in Farm Y. One... | 84.0 | Reasoning: Let's think step by step in order to produce the answer. We start with the initial number of goats in each farm and then... | Question: Mr. Smith has two farms, Farm X and Farm Y. Initially, he has 55 goats in Farm X and 45 goats in Farm Y.... | False |
| 7 | Tom purchases a new bedroom suite for $3000. He manages to sell his previous bedroom suite for $1000, which he uses to offset the cost... | 200.0 | Question: Tom purchases a new bedroom suite for $3000. He manages to sell his previous bedroom suite for $1000, which he uses to offset the... | Question: Tom purchases a new bedroom suite for $3000. He manages to sell his previous bedroom suite for $1000, which he uses to offset the... | False |
| 8 | Toby is reading a book that is 60 pages long. It averages 250 words a page. Toby can read at a rate of 350 words... | 32.14 | Reasoning: Let's think step by step in order to produce the answer. We need to determine how long it will take Toby to read the... | Question: Toby is reading a book that is 60 pages long. It averages 250 words a page. Toby can read at a rate of 350... | False |
| 9 | Poppy is solving a 1500-piece jigsaw puzzle. She places a third of the pieces on the board, then her dad places a quarter of the... | 750.0 | Question: Poppy is solving a 1500-piece jigsaw puzzle. She places a third of the pieces on the board, then her dad places a quarter of... | Question: Poppy is solving a 1500-piece jigsaw puzzle. She places a third of the pieces on the board, then her dad places a quarter of... | False |


6.67

Only about 7% accuracy (2 out of 30), which is poor.

Let us evaluate `optimized_mathbot`.

In [167]:
evaluator(optimized_mathbot, metric=validate_answer)

Average Metric: 10 / 30  (33.3): 100%|█████████████████| 30/30 [00:09<00:00,  3.30it/s]


| | question | example_answer | rationale | pred_answer | validate_answer |
|---|---|---|---|---|---|
| 0 | Boris has 100 apples and 50 oranges. Beck has 23 fewer apples than Boris and 30 more oranges. If Boris gives Beck 10 apples and... | 3.0 | Reasoning: Let's think step by step in order to produce the answer. Initially, Boris has 100 apples and Beck has 100 - 23 = 77... | 3 | ✔️ [True] |
| 1 | Lloyd earns $100 an hour on Math tutoring. He tutored 50 hours for the first week and 80 hours for the second week. How much... | 13000.0 | Reasoning: Let's think step by step in order to produce the answer. We know that Lloyd earns $100 an hour. For the first week, he... | 13000 | ✔️ [True] |
| 2 | The Llesis family drove and hiked 6.0 hours to their vacation spot. They drove an average of 50.0 miles per hour and hiked an average... | 255.0 | produce the answer. We know that the total travel time is 6.0 hours, and they spent 1.5 hours hiking. Therefore, they spent 6.0 - 1.5... | Question: The Llesis family drove and hiked 6.0 hours to their vacation spot. They drove an average of 50.0 miles per hour and hiked an... | False |
| 3 | James enjoys swimming and needs to traverse a lake that is 20 miles wide. His swimming speed is 2 miles per hour. He covers 60%... | 17.0 | Reasoning: Let's think step by step in order to produce the answer. First, we need to determine the time James spent swimming the first 60%... | Reasoning: Let's think step by step in order to produce the answer. First, we need to determine the time James spent swimming the first 60%... | False |
| 4 | Mark is preparing a quadruple batch of brownies for a large gathering. The standard recipe calls for 300 cups of flour and 100 cups of... | 4.0 | Reasoning: Let's think step by step in order to produce the answer. We need to determine the total amount of flour and milk required for... | Question: Mark is preparing a quadruple batch of brownies for a large gathering. The standard recipe calls for 300 cups of flour and 100 cups... | False |
| 5 | Nathan has a bouncy ball that bounces to 2/3rds of its starting height with each bounce. If he drops it from the third-floor balcony in... | 24.0 | Reasoning: Let's think step by step in order to produce the answer. We know that the ball bounces to 2/3 of its starting height with... | Question: Nathan has a bouncy ball that bounces to 2/3rds of its starting height with each bounce. If he drops it from the third-floor balcony... | False |
| 6 | Mr. Smith has two farms, Farm X and Farm Y. Initially, he has 55 goats in Farm X and 45 goats in Farm Y. One... | 84.0 | Reasoning: Let's think step by step in order to produce the answer. Initially, Mr. Smith has 55 goats in Farm X and 45 goats in... | Question: Mr. Smith has two farms, Farm X and Farm Y. Initially, he has 55 goats in Farm X and 45 goats in Farm Y.... | False |
| 7 | Tom purchases a new bedroom suite for $3000. He manages to sell his previous bedroom suite for $1000, which he uses to offset the cost... | 200.0 | Reasoning: Let's think step by step in order to produce the answer. First, we need to determine the remaining balance after Tom uses the $1000... | 300 | False |
| 8 | Toby is reading a book that is 60 pages long. It averages 250 words a page. Toby can read at a rate of 350 words... | 32.14 | Reasoning: Let's think step by step in order to produce the answer. First, we need to determine how many words are in the entire book.... | Question: Toby is reading a book that is 60 pages long. It averages 250 words a page. Toby can read at a rate of 350... | False |
| 9 | Poppy is solving a 1500-piece jigsaw puzzle. She places a third of the pieces on the board, then her dad places a quarter of the... | 750.0 | Reasoning: Let's think step by step in order to produce the answer. We start with 1500 pieces. Poppy places a third of the pieces, which... | 750 | ✔️ [True] |


33.33

## Results

In the end, the optimized prompt with few-shot examples reaches 33% accuracy (10/30) on the validation set, compared to 7% (2/30) for the baseline.
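Since the validation set has only 30 examples, it is worth checking how noisy these percentages are. The sketch below recomputes both accuracies from the raw counts and adds a rough binomial standard error. `accuracy_summary` is a hypothetical helper, and the standard error is a simple approximation, not a rigorous significance test.

```python
import math


def accuracy_summary(correct, total):
    """Return accuracy in percent and a rough standard error in percentage points."""
    p = correct / total
    std_err = math.sqrt(p * (1 - p) / total)  # binomial approximation
    return round(100 * p, 1), round(100 * std_err, 1)


baseline = accuracy_summary(2, 30)    # counts from the baseline evaluation
optimized = accuracy_summary(10, 30)  # counts from the optimized evaluation
print("baseline:", baseline)
print("optimized:", optimized)
print("gain in points:", round(optimized[0] - baseline[0], 1))
```

With so few examples the standard errors are several percentage points wide, so a larger validation set would give a more trustworthy comparison.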