# Introduction

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/marshmellow77/automated-prompt-engineering/blob/main/automated-prompt-engineering.ipynb)


This notebook demonstrates how to use Google's Gemini model to automate prompt engineering.

Prompt engineering is a powerful way to improve the responses of large language models (LLMs). But it is also a manual, tedious, iterative process, and it quickly accumulates technical debt and waste, since each handcrafted prompt is specific to a model (and its version) as well as to the task at hand.

In this notebook we will learn how to use the DSPy library to automatically create prompts that are optimised for a specific model and the task at hand.


# Manual Prompt Engineering

Manual prompt engineering is very tedious - let's look at an example where we carefully handcraft a prompt for our task and model.

## Setup

In [None]:
!pip install --upgrade google-cloud-aiplatform

In [2]:
import vertexai

# project_id = "<project_id>"
# vertexai.init(project=project_id)

In [3]:
from vertexai.generative_models import GenerativeModel

gemini_pro = GenerativeModel("gemini-1.0-pro")

## Zero shot attempt

Let's first try to use Gemini Pro for a mathematical text question.

In [4]:
prompt = """Given the fields `question`, produce the fields `answer`.

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program.
She already was able to sew 13 aprons, and today, she sewed three times as many aprons.
How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?

Answer:"""

# The correct answer is 49.

In [5]:
config = {"temperature": 0.0}

In [6]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

Answer: 69


We can see that Gemini Pro got this one wrong. Let's use best practices, including chain-of-thought and few-shot prompting, to improve Gemini's performance!
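For reference, the expected answer of 49 can be checked with a few lines of arithmetic. This is only a sanity check of the word problem; the variable names are made up for illustration.

```python
# Sanity check of the apron question (variable names are illustrative).
total_needed = 150
already_sewn = 13
sewn_today = 3 * already_sewn  # "three times as many" as the 13 already sewn

remaining = total_needed - already_sewn - sewn_today
sew_tomorrow = remaining // 2  # half of the remaining aprons

print(f"Remaining: {remaining}, to sew tomorrow: {sew_tomorrow}")
```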

## Few shot prompting with Chain of Thought

In [7]:
prompt = """Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: <Question>
Rationale: Let's think step by step ...
Answer: <Answer>

---

Question: A gumball machine has red, green, and blue gumballs. The machine has half as many blue gumballs as red gumballs.
For each blue gumball, the machine has 4 times as many green gumballs. If the machine has 16 red gumballs how many gumballs are in the machine?
Rationale: Let's think step by step.
First, we can find the number of blue gumballs in the machine.
Since the machine has half as many blue gumballs as red gumballs, and there are 16 red gumballs, there must be 16 / 2 = 8 blue gumballs.
Next, we can find the number of green gumballs in the machine.
Since the machine has 4 times as many green gumballs as blue gumballs, there must be 8 x 4 = 32 green gumballs.
Finally, we can add up the number of red, blue, and green gumballs to find the total number of gumballs in the machine: 16 + 8 + 32 = 56.
Answer: 56

---

Question: Rachel makes $12.00 as a waitress in a coffee shop. In one hour, she serves 20 different people and they all leave her a $1.25 tip. How much money did she make in that hour?
Rationale: Let's think step by step.
First, we need to find out how much money Rachel made from tips. She served 20 people and each person left her a $1.25 tip, so she made 20 * $1.25 = $25.00 in tips.
Next, we need to add her hourly wage to the money she made from tips to find out how much money she made in total. She made $12.00 per hour, so in one hour she made $12.00 + $25.00 = $37.00.
Answer: 37

---

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program. She already was able to sew 13 aprons, and today, she sewed three times as many aprons. How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?
Rationale:"""

In [8]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

Let's think step by step.
First, we need to find out how many aprons Heather has left to sew. She needs to sew 150 aprons in total, and she has already sewn 13 aprons, so she has 150 - 13 = 137 aprons left to sew.
Next, we need to find out how many aprons Heather sewed today. She sewed three times as many aprons as she already had, so she sewed 3 * 13 = 39 aprons today.
Now, we need to find out how many aprons Heather has left to sew after today. She has 137 aprons left to sew, and she sewed 39 aprons today, so she has 137 - 39 = 98 aprons left to sew.
Finally, we need to find out how many aprons Heather should sew tomorrow if she wants to sew half of the remaining number of aprons needed. She has 98 aprons left to sew, and she wants to sew half of them, so she should sew 98 / 2 = 49 aprons tomorrow.
Answer: 49


Nice, this worked!

Now we have a good prompt for our model and the task at hand (mathematical text questions). But there are a few issues:
* Our prompt works well on our model, but what if we want to use another model or another version (e.g. Gemini Ultra or Gemini 1.5)? Will it still work for those models?
* We had to develop a few examples, and coming up with the rationale for each example was especially tedious.

The question is, could we automate this process so that next time we need to repeat this exercise we can just automatically create few shot examples that are optimised for our model and the task at hand?
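To make the moving parts concrete, the handcrafted prompt above can be seen as a template filled with a list of worked examples. The sketch below assembles a prompt in the same layout; `build_few_shot_prompt` and the dictionary keys are hypothetical names chosen for illustration, and the demo rationale is a condensed version of the gumball example.

```python
def build_few_shot_prompt(instruction, demos, question):
    """Assemble a chain-of-thought few-shot prompt in the layout used above."""
    parts = [
        instruction,
        "Follow the following format.\n\n"
        "Question: <Question>\n"
        "Rationale: Let's think step by step ...\n"
        "Answer: <Answer>",
    ]
    # One block per worked example: question, rationale, answer.
    for demo in demos:
        parts.append(
            f"Question: {demo['question']}\n"
            f"Rationale: Let's think step by step.\n{demo['rationale']}\n"
            f"Answer: {demo['answer']}"
        )
    # The final block is left open so that the model continues with the rationale.
    parts.append(f"Question: {question}\nRationale:")
    return "\n\n---\n\n".join(parts)


demos = [
    {
        "question": "If the machine has 16 red gumballs, half as many blue ones, "
        "and 4 green ones per blue one, how many gumballs are in the machine?",
        "rationale": "16 / 2 = 8 blue gumballs. 8 x 4 = 32 green gumballs. 16 + 8 + 32 = 56.",
        "answer": "56",
    }
]

few_shot_prompt = build_few_shot_prompt(
    "Given the fields `question`, produce the fields `answer`.",
    demos,
    "Heather is going to sew 150 aprons ...",
)
print(few_shot_prompt)
```

What remains manual in this sketch is the content of `demos`: choosing the questions and writing the rationales, which is the part we would like to automate.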

# Automated prompt engineering with DSPy

DSPy is a library that allows us to automate this process. Let's see how it works.

## Setup

In [None]:
# As of 17 March 2024, VertexAI is not yet integrated into DSPy. But there exists a PR and we can install the package from the PR.
!pip install git+https://github.com/marshmellow77/dspy.git@seedstart-random-search#egg=dspy-ai

In [10]:
import dspy



In [11]:
dspy_gemini_pro = dspy.GoogleVertexAI(
    "gemini-1.0-pro",
    temperature=0,
)

dspy.settings.configure(lm=dspy_gemini_pro)

## Dataset

We will use the [GSM8K dataset](https://paperswithcode.com/dataset/gsm8k), which consists of linguistically diverse grade school math word problems.

In [12]:
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

gms8k = GSM8K()

100%|██████████| 7473/7473 [00:00<00:00, 71500.99it/s]
100%|██████████| 1319/1319 [00:00<00:00, 71207.94it/s]


In [13]:
train, val, test = gms8k.train[:60], gms8k.dev[:20], gms8k.test[:20]

In [14]:
train[0]

Example({'question': "The result from the 40-item Statistics exam Marion and Ella took already came out. Ella got 4 incorrect answers while Marion got 6 more than half the score of Ella. What is Marion's score?", 'gold_reasoning': "Ella's score is 40 items - 4 items = <<40-4=36>>36 items. Half of Ella's score is 36 items / 2 = <<36/2=18>>18 items. So, Marion's score is 18 items + 6 items = <<18+6=24>>24 items.", 'answer': '24'}) (input_keys={'question'})

In [15]:
train[0].gold_reasoning

"Ella's score is 40 items - 4 items = <<40-4=36>>36 items. Half of Ella's score is 36 items / 2 = <<36/2=18>>18 items. So, Marion's score is 18 items + 6 items = <<18+6=24>>24 items."

We can see that the dataset has a field `gold_reasoning`, which already provides the reasoning. Since this is what we want to automate, let's delete it from the training and validation datasets.

In [16]:
# Iterate through datasets and modify the dicts
for dataset in [train, val]:
    for example in dataset:
        example["gold_reasoning"] = ""

In [17]:
train[0].gold_reasoning

''

## Defining the signature

Signatures allow you to tell the LM what it needs to do, rather than specify how to ask the LM to do it.

In [18]:
class GSM8KSignature(dspy.Signature):
    """Answer math problems with numbers or short phrases."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="Usually a number or short phrase.")

Now we can use this signature to run a test with Gemini.

In [19]:
generate_answer = dspy.Predict(GSM8KSignature)
pred = generate_answer(question=test[0].question)

print(f"Question: {test[0].question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Actual Answer: {test[0].answer}")

Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Predicted Answer: 26.5 miles
Actual Answer: 16


In [20]:
dspy_gemini_pro.inspect_history(n=1)





Answer math problems with numbers or short phrases.

---

Follow the following format.

Question: ${question}
Answer: Usually a number or short phrase.

---

Question: Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito run?
Answer: 26.5 miles





As in the manual example above, Gemini didn't get this one right. Let's evaluate Gemini on the test dataset to establish a baseline.

## Model evaluation with zero shot

To run the evaluation programmatically, we define a DSPy module. These modules abstract a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy Signature.

In [21]:
class GSM8KModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # here we use the dspy.Predict module which uses zero shot prompting to generate answers
        self.prog = dspy.Predict(GSM8KSignature) 

    def forward(self, question):
        return self.prog(question=question)

In [22]:
gsm8k_zero_shot = GSM8KModule()

In [23]:
from dspy.evaluate import Evaluate

NUM_THREADS = 4 # number of threads to use for parallel processing
evaluate = Evaluate(
    devset=test, # the test set
    metric=gsm8k_metric, # the metric to use -> this will convert responses to integers to compare with the gold answers
    num_threads=NUM_THREADS,
    display_progress=True,
    display_table=20, # how many rows to display
)

In [None]:
!pip install Jinja2

In [25]:
evaluate(gsm8k_zero_shot)



Average Metric: 7 / 20  (35.0%)


Unnamed: 0,question,gold_reasoning,example_answer,pred_answer,gsm8k_metric
0,"Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito...",Amber ran <<8=8>>8 miles. Micah ran 3.5 * 8 miles = <<3.5*8=28>>28 miles Ahito ran the rest = 52 - 8 - 28 = <<52-8-28=16>>16...,16,26.5 miles,False
1,"Miguel uses 2 pads of paper a week for his drawing. If there are 30 sheets of paper on a pad of paper, how many...","Miguel uses 30 x 2 = <<30*2=60>>60 sheets of paper every week. Therefore, he uses 60 x 4 = <<60*4=240>>240 sheets of paper every month.",240,240 sheets,✔️ [True]
2,"At a certain grade level, three-fourths of students have a desktop computer at home. If 20 students do not have desktop computers, how many students...","Twenty students represent 1 - 3/4 = 1/4 of the students at that level. So, there are 4 x 20 students = <<4*20=80>>80 students at...",80,80,✔️ [True]
3,Comet Halley orbits the sun every 75 years. Bill's dad saw the Comet when he was 30 years old. Bill saw the comet a second...,Bill saw the Comet for the second time when he was 30 years * 3= <<30*3=90>>90 years old. Comet Halley can be seen every 75...,15,60,False
4,Tom plants 10 trees a year. Every year he also chops down 2 trees a year. He starts with 50 trees. After 10 years 30%...,He gets 10-2=<<10-2=8>>8 new trees a year After 10 years he has 8*10=<<8*10=80>>80 new trees So he had 80+50=<<80+50=130>>130 trees He lost 130*.3=<<130*.3=39>>39 trees That...,91,35,False
5,"John picks 4 bananas on Wednesday. Then he picks 6 bananas on Thursday. On Friday, he picks triple the number of bananas he did on...","Combining Wednesday and Thursday, John has 4 bananas + 6 bananas = <<4+6=10>>10 bananas. On Friday, he picks 3 * 4 bananas = <<3*4=12>>12 bananas....",22,18 bananas,False
6,"Peyton scheduled after-work activities of a one hour yoga class on Monday, a cooking class that lasts three times as long as Monday’s yoga on...",Peyton’s cooking class will last 3 * 1 = <<3*1=3>>3 hours. The museum tour will take 3 / 2 = 1 1/2 hours. Peyton’s after-work...,8,6.5 hours,False
7,"Ben has 4 tubes of blue paint and 3 tubes of yellow paint. Jasper has half as many tubes of blue paint as Ben, and...",Jasper has 4/2= <<4/2=2>>2 tubes of blue paint Jasper has 3*3=<<3*3=9>>9 tubes of yellow paint Jasper has a total of 2+9 =<<2+9=11>>11 tubes of paint,11,10,False
8,"Elaina is holding the final concert in her tour. To celebrate her final concert, she makes the concert twice as long as her usual concerts....","The concert, minus the encore, lasted for 65-minute concert – 15-minute encore = <<65-15=50>>50 minutes. This is twice as long as her usual concerts so...",25,30 minutes,False
9,"Hannah slips on a banana peel and breaks her arm. The doctor charges her $200 for the cast, $300/hour for a 30-minute visit, $4/pill for...",First find the length of the visit in hours: 30 minutes / 60 minutes/hour = <<30/60=.5>>.5 hours Then find the total cost of the visit:...,482,$804,False


35.0

So, with zero-shot prompting Gemini Pro answers 7 of the 20 test questions correctly, a 35% success rate.
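The results table marks a prediction such as `240 sheets` as correct for the gold answer `240`, so the metric evidently compares numbers rather than raw strings. A minimal sketch of that kind of comparison follows. `parse_int_answer` and `simple_metric` are hypothetical helpers written for illustration; they are not the actual `gsm8k_metric` implementation, whose parsing rules may differ.

```python
import re


def parse_int_answer(text):
    """Return the last number found in `text` as an int (decimals truncated), or None."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not numbers:
        return None
    # Drop thousands separators before converting, e.g. "1,200" -> 1200.
    return int(float(numbers[-1].replace(",", "")))


def simple_metric(gold, predicted):
    """True if the number parsed from the prediction equals the gold answer."""
    parsed = parse_int_answer(predicted)
    return parsed is not None and parsed == int(gold)


# Gold answers and predictions taken from the zero-shot results table above.
rows = [("16", "26.5 miles"), ("240", "240 sheets"), ("80", "80"), ("482", "$804")]
for gold, predicted in rows:
    print(gold, predicted, simple_metric(gold, predicted))
```

Under this sketch, units and currency symbols do not affect the score, but a wrong number always does.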

# Bootstrapping few shot examples

In [26]:
class ZeroShotCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(
            GSM8KSignature,
        )

    def forward(self, question):
        return self.prog(question=question)

# zs_cot = ZeroShotCoT()

In [27]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

In [None]:
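RUN_FROM_SCRATCH = True

if RUN_FROM_SCRATCH:
    # Use Gemini Ultra while bootstrapping the few-shot examples
    dspy_gemini_ultra = dspy.GoogleVertexAI(
        "gemini-1.0-ultra",
        temperature=0,
    )
    dspy.settings.configure(lm=dspy_gemini_ultra, timeout=10)
    config = dict(
        max_bootstrapped_demos=8,  # max. number of model-generated demos per prompt
        max_labeled_demos=3,  # max. number of labelled demos taken directly from the training set
        num_candidate_programs=3,  # number of candidate programs compared in the random search
        num_threads=1,
    )
    bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
        metric=gsm8k_metric, **config
    )
    cot_fewshot = bootstrap_optimizer.compile(ZeroShotCoT(), trainset=train, valset=val, seed_start=0)
    # Save the optimised program so that it can be reloaded later
    cot_fewshot.save("ultra_8_3_3_gsm8k_0_60_20240317_1500.json")
else:
    # Load a previously saved program instead of re-running the optimisation
    cot_fewshot = ZeroShotCoT()
    cot_fewshot.load("ultra_8_3_3_gsm8k_0_60_20240317_1000.json")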
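Conceptually, bootstrapping with random search lets a model attempt the training questions, keeps the attempts that pass the metric as demonstrations, and then compares several randomly composed demonstration sets on the validation set. The toy sketch below mirrors that idea in plain Python. It is not DSPy's implementation: `toy_teacher`, `toy_student_score` and the arithmetic "dataset" are hypothetical stand-ins for LLM calls and GSM8K.

```python
import random


def toy_teacher(question):
    """Stand-in for an LLM call; deliberately wrong when an operand exceeds 9."""
    a, op, b = question.split()
    a, b = int(a), int(b)
    result = a + b if op == "+" else a * b
    if max(a, b) > 9:  # simulate a model that fails on harder inputs
        result += 1
    return f"{a} {op} {b} gives {result}", str(result)


def exact_match(gold, predicted):
    return gold == predicted


def bootstrap_demos(examples, teacher, metric, max_demos):
    """Keep teacher outputs that the metric accepts, up to max_demos."""
    demos = []
    for question, gold in examples:
        if len(demos) >= max_demos:
            break
        rationale, answer = teacher(question)
        if metric(gold, answer):
            demos.append((question, rationale, answer))
    return demos


def toy_student_score(demos, valset):
    """Toy evaluation: a question counts as solved if some demo uses its operator."""
    seen_ops = {question.split()[1] for question, _, _ in demos}
    solved = sum(1 for question, _ in valset if question.split()[1] in seen_ops)
    return solved / len(valset)


def random_search(trainset, valset, teacher, metric, max_demos, num_candidates, seed=0):
    """Build one candidate demo set per seed and keep the best one on the validation set."""
    best_demos, best_score = [], -1.0
    for i in range(num_candidates):
        shuffled = trainset[:]
        random.Random(seed + i).shuffle(shuffled)
        demos = bootstrap_demos(shuffled, teacher, metric, max_demos)
        score = toy_student_score(demos, valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


toy_train = [("2 + 3", "5"), ("4 * 5", "20"), ("12 + 7", "19"), ("6 + 1", "7"), ("3 * 3", "9")]
toy_val = [("1 + 1", "2"), ("2 * 4", "8")]

best_demos, best_score = random_search(
    toy_train, toy_val, toy_teacher, exact_match, max_demos=2, num_candidates=3
)
print(best_demos, best_score)
```

The important property is that no rationale is written by hand: every demonstration is generated by the model and only kept if the metric confirms its answer.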

In [33]:
dspy.settings.configure(lm=dspy_gemini_pro, timeout=0)

In [34]:
evaluate(cot_fewshot)


Average Metric: 16 / 20  (80.0%)





Unnamed: 0,question,gold_reasoning,example_answer,rationale,pred_answer,gsm8k_metric
0,"Amber, Micah, and Ahito ran 52 miles in total. Amber ran 8 miles. Micah ran 3.5 times what Amber ran. How many miles did Ahito...",Amber ran <<8=8>>8 miles. Micah ran 3.5 * 8 miles = <<3.5*8=28>>28 miles Ahito ran the rest = 52 - 8 - 28 = <<52-8-28=16>>16...,16,"find Ahito's distance. First, find Micah's distance by multiplying Amber's distance by 3.5: 8 miles x 3.5 = 28 miles. Now, subtract Amber's and Micah's...",16,✔️ [True]
1,"Miguel uses 2 pads of paper a week for his drawing. If there are 30 sheets of paper on a pad of paper, how many...","Miguel uses 30 x 2 = <<30*2=60>>60 sheets of paper every week. Therefore, he uses 60 x 4 = <<60*4=240>>240 sheets of paper every month.",240,"find the number of sheets of paper Miguel uses every month. First, we need to find the total number of pads of paper he uses...",240,✔️ [True]
2,"At a certain grade level, three-fourths of students have a desktop computer at home. If 20 students do not have desktop computers, how many students...","Twenty students represent 1 - 3/4 = 1/4 of the students at that level. So, there are 4 x 20 students = <<4*20=80>>80 students at...",80,"find the total number of students. First, we need to find the fraction of students who have desktop computers: 1 - 3/4 = 1/4. Now,...",80,✔️ [True]
3,Comet Halley orbits the sun every 75 years. Bill's dad saw the Comet when he was 30 years old. Bill saw the comet a second...,Bill saw the Comet for the second time when he was 30 years * 3= <<30*3=90>>90 years old. Comet Halley can be seen every 75...,15,"find Bill's age when he saw the Comet. First, we need to find the number of years that passed between the two sightings of the...",25,False
4,Tom plants 10 trees a year. Every year he also chops down 2 trees a year. He starts with 50 trees. After 10 years 30%...,He gets 10-2=<<10-2=8>>8 new trees a year After 10 years he has 8*10=<<8*10=80>>80 new trees So he had 80+50=<<80+50=130>>130 trees He lost 130*.3=<<130*.3=39>>39 trees That...,91,calculate the number of trees left after 10 years. 1. **Calculate the number of trees planted in 10 years:** 10 trees/year x 10 years =...,91,✔️ [True]
5,"John picks 4 bananas on Wednesday. Then he picks 6 bananas on Thursday. On Friday, he picks triple the number of bananas he did on...","Combining Wednesday and Thursday, John has 4 bananas + 6 bananas = <<4+6=10>>10 bananas. On Friday, he picks 3 * 4 bananas = <<3*4=12>>12 bananas....",22,"find the total number of bananas John has. First, find the number of bananas John picked on Friday by multiplying the number he picked on...",22,✔️ [True]
6,"Peyton scheduled after-work activities of a one hour yoga class on Monday, a cooking class that lasts three times as long as Monday’s yoga on...",Peyton’s cooking class will last 3 * 1 = <<3*1=3>>3 hours. The museum tour will take 3 / 2 = 1 1/2 hours. Peyton’s after-work...,8,calculate the total time for all of Peyton's after-work activities. 1. **Yoga class:** 1 hour 2. **Cooking class:** 3 times the yoga class time =...,8,✔️ [True]
7,"Ben has 4 tubes of blue paint and 3 tubes of yellow paint. Jasper has half as many tubes of blue paint as Ben, and...",Jasper has 4/2= <<4/2=2>>2 tubes of blue paint Jasper has 3*3=<<3*3=9>>9 tubes of yellow paint Jasper has a total of 2+9 =<<2+9=11>>11 tubes of paint,11,"find the number of tubes of paint Jasper has. First, find the number of tubes of blue paint Jasper has by dividing Ben's blue paint...",11,✔️ [True]
8,"Elaina is holding the final concert in her tour. To celebrate her final concert, she makes the concert twice as long as her usual concerts....","The concert, minus the encore, lasted for 65-minute concert – 15-minute encore = <<65-15=50>>50 minutes. This is twice as long as her usual concerts so...",25,"find the runtime of Elaina's usual concerts. First, we need to subtract the encore's runtime from the total runtime of the final concert: 65 minutes...",25,✔️ [True]
9,"Hannah slips on a banana peel and breaks her arm. The doctor charges her $200 for the cast, $300/hour for a 30-minute visit, $4/pill for...",First find the length of the visit in hours: 30 minutes / 60 minutes/hour = <<30/60=.5>>.5 hours Then find the total cost of the visit:...,482,"calculate the total cost of the doctor's visit. First, we need to find the cost of the visit itself: $300/hour x 0.5 hours = $150....",332,False


80.0

In [None]:
dspy_gemini_pro.inspect_history(n=1)