# Introduction

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/marshmellow77/automated-prompt-engineering/blob/main/automated-prompt-engineering.ipynb)


This notebook demonstrates how to use Google's Gemini model to automate prompt engineering.

Prompt engineering is a powerful way to improve the responses of large language models (LLMs). But it is also a manual, tedious, iterative process, and it quickly accumulates technical debt and waste, since each handcrafted prompt is specific to a model (and its version) as well as to the task at hand.

In this notebook we will learn how to use the DSPy library to automatically create prompts that are optimised for a specific model and the task at hand.


# Manual Prompt Engineering

Manual prompt engineering is very tedious - let's look at an example where we carefully handcraft a prompt for our task and model.

## Setup

In [None]:
# As of 3 April 2024, VertexAI is not yet integrated into DSPy. But there already exists a PR for it which we can leverage.
!pip install -U git+https://github.com/marshmellow77/dspy.git@seedstart-random-search#egg=dspy-ai

In [None]:
!pip install --upgrade google-cloud-aiplatform
!pip install Jinja2

In [None]:
import os
import sys

IS_COLAB = "google.colab" in sys.modules
if not IS_COLAB:
    raise ValueError("This notebook should be run using Google Colab.")

if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

In [None]:
import vertexai

project_id = "cloud-llm-preview1"
vertexai.init(project=project_id)

In [None]:
from vertexai.generative_models import GenerativeModel

gemini_pro = GenerativeModel("gemini-1.0-pro")

## Zero shot attempt

Let's first try Gemini Pro on a math word problem.

In [None]:
prompt = """Given the fields `question`, produce the fields `answer`.

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program.
She already was able to sew 13 aprons, and today, she sewed three times as many aprons.
How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?

Answer:"""

# The correct answer is 49.
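
The expected answer can be checked by hand before we look at the model's output. The short sketch below only reproduces the arithmetic of the question; the variable names are illustrative and not part of the original notebook.

```python
# Worked check of the apron problem from the prompt above.
total_needed = 150
sewn_before = 13
sewn_today = 3 * sewn_before  # "three times as many" as the 13 already sewn
remaining = total_needed - sewn_before - sewn_today
sew_tomorrow = remaining // 2  # half of what is still needed

print(sew_tomorrow)  # prints 49
```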

In [None]:
config = {"temperature": 0.1}

In [None]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

We can see that Gemini Pro got this one wrong. Let's use best practices, including chain of thought and few shot prompting, to improve Gemini's performance!

## Few shot prompting with Chain of Thought

In [None]:
prompt = """Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: <Question>
Rationale: Let's think step by step ...
Answer: <Answer>

---

Question: A gumball machine has red, green, and blue gumballs. The machine has half as many blue gumballs as red gumballs.
For each blue gumball, the machine has 4 times as many green gumballs. If the machine has 16 red gumballs how many gumballs are in the machine?
Rationale: Let's think step by step.
First, we can find the number of blue gumballs in the machine.
Since the machine has half as many blue gumballs as red gumballs, and there are 16 red gumballs, there must be 16 / 2 = 8 blue gumballs.
Next, we can find the number of green gumballs in the machine.
Since the machine has 4 times as many green gumballs as blue gumballs, there must be 8 x 4 = 32 green gumballs.
Finally, we can add up the number of red, blue, and green gumballs to find the total number of gumballs in the machine: 16 + 8 + 32 = 56.
Answer: 56

---

Question: Rachel makes $12.00 as a waitress in a coffee shop. In one hour, she serves 20 different people and they all leave her a $1.25 tip. How much money did she make in that hour?
Rationale: Let's think step by step.
First, we need to find out how much money Rachel made from tips. She served 20 people and each person left her a $1.25 tip, so she made 20 * $1.25 = $25.00 in tips.
Next, we need to add her hourly wage to the money she made from tips to find out how much money she made in total. She made $12.00 per hour, so in one hour she made $12.00 + $25.00 = $37.00.
Answer: 37

---

Question: Heather is going to sew 150 aprons that are to be used for a kiddie crew program. She already was able to sew 13 aprons, and today, she sewed three times as many aprons. How many aprons should she sew tomorrow if she wants to sew half of the remaining number of aprons needed?
Rationale:"""

In [None]:
response = gemini_pro.generate_content(contents=prompt, generation_config=config)
print(response.text)

Nice, this worked!
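
Because the prompt asks for a fixed `Answer:` line, the final answer can be pulled out of the response text programmatically instead of being read by eye. The following is a minimal sketch; `extract_answer` is a hypothetical helper, and the sample response is written by hand for illustration rather than copied from a real model run.

```python
import re
from typing import Optional


def extract_answer(response_text: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", response_text, flags=re.MULTILINE)
    return matches[-1] if matches else None


# Hand-written stand-in for a model response (assumption, not real output).
sample_response = """Let's think step by step.
Heather sewed 13 aprons and then 3 * 13 = 39 more today, so 150 - 13 - 39 = 98 remain.
Half of 98 is 49.
Answer: 49"""

print(extract_answer(sample_response))
```

In the notebook, the same helper could be applied to `response.text`, assuming the model follows the requested format.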

Now we have a good prompt for our model and the task at hand (mathematical text questions). But there are a few issues:
* Our prompt works well on our model, but what if we want to use another model or another version (e.g. Gemini Ultra or Gemini 1.5)? Will it still work for those models?
* We had to develop a few examples, and coming up with the rationale for each example was especially tedious.

The question is, could we automate this process so that next time we need to repeat this exercise we can just automatically create few shot examples that are optimised for our model and the task at hand?
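
One part of this is easy to automate even without a library: assembling the prompt text from structured examples. The sketch below rebuilds the few shot format used earlier from a list of demonstrations. The `Demo` class and `build_few_shot_prompt` function are hypothetical names introduced for illustration. Finding good demonstrations and rationales is the harder part, and the sketch does not address it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    question: str
    rationale: str
    answer: str


# Instruction and format description, copied from the handcrafted prompt above.
HEADER = (
    "Given the fields `question`, produce the fields `answer`.\n\n"
    "---\n\n"
    "Follow the following format.\n\n"
    "Question: <Question>\n"
    "Rationale: Let's think step by step ...\n"
    "Answer: <Answer>"
)


def build_few_shot_prompt(demos: List[Demo], question: str) -> str:
    """Join the header, each demonstration, and the open question with '---' separators."""
    blocks = [HEADER]
    for demo in demos:
        blocks.append(
            f"Question: {demo.question}\n"
            f"Rationale: Let's think step by step.\n{demo.rationale}\n"
            f"Answer: {demo.answer}"
        )
    # The last block stops at 'Rationale:' so the model continues from there.
    blocks.append(f"Question: {question}\nRationale:")
    return "\n\n---\n\n".join(blocks)


demo = Demo(
    question="If the machine has 16 red gumballs and half as many blue ones, how many are blue?",
    rationale="Half of 16 is 16 / 2 = 8.",
    answer="8",
)
print(build_few_shot_prompt([demo], "What is 3 times 13?"))
```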

# Automated prompt engineering with DSPy

DSPy is a library that allows us to automate this process. Let's see how it works.

## Setup

In [None]:
import dspy

In [None]:
dspy_gemini_pro = dspy.GoogleVertexAI(
    "gemini-1.0-pro",
    temperature=0,
)

dspy.settings.configure(lm=dspy_gemini_pro)

## Dataset

We will use the [GSM8K dataset](https://paperswithcode.com/dataset/gsm8k), which consists of linguistically diverse grade school math word problems.

In [None]:
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric

gsm8k = GSM8K()

In [None]:
train, val, test = gsm8k.train[:60], gsm8k.dev[:20], gsm8k.test[:20]

In [None]:
train[0]

In [None]:
train[0].gold_reasoning

We can see that the dataset has a field `gold_reasoning`, which already provides the reasoning. Since generating this reasoning is what we want to automate, let's blank it out in the training and validation datasets.

In [None]:
# Iterate through datasets and modify the dicts
for dataset in [train, val]:
    for example in dataset:
        example["gold_reasoning"] = ""

In [None]:
train[0].gold_reasoning

## Defining the signature

Signatures allow you to tell the LM what it needs to do, rather than specifying how to ask the LM to do it.

In [None]:
class GSM8KSignature(dspy.Signature):
    """Answer math problems with numbers or short phrases."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="Usually a number or short phrase.")
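
To build some intuition for how a declarative signature can become a prompt, here is a toy rendering function. It is only a sketch under simplifying assumptions: `ToySignature` and `render_prompt` are hypothetical names, and this is not how DSPy implements signatures internally.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToySignature:
    instructions: str   # what the model should do
    inputs: List[str]   # names of the input fields
    outputs: List[str]  # names of the output fields


def render_prompt(sig: ToySignature, values: Dict[str, str]) -> str:
    """Turn the instructions and field names into a plain-text prompt."""
    lines = [sig.instructions, ""]
    for name in sig.inputs:
        lines.append(f"{name.capitalize()}: {values[name]}")
    # Leave the first output field open for the model to complete.
    lines.append(f"{sig.outputs[0].capitalize()}:")
    return "\n".join(lines)


toy_sig = ToySignature(
    instructions="Answer math problems with numbers or short phrases.",
    inputs=["question"],
    outputs=["answer"],
)
print(render_prompt(toy_sig, {"question": "What is 2 + 2?"}))
```

The point of the design is that the task description stays separate from the prompt wording. In the rest of the notebook DSPy takes care of the rendering, so we only work with `GSM8KSignature`.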

Now we can use this signature to run a test with Gemini.

In [None]:
generate_answer = dspy.Predict(GSM8KSignature)
pred = generate_answer(question=test[0].question)

print(f"Question: {test[0].question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Actual Answer: {test[0].answer}")

In [None]:
dspy_gemini_pro.inspect_history(n=1)

As above, Gemini didn't get this one right. Let's evaluate Gemini on the test dataset to establish a baseline.

## Model evaluation with zero shot

To run the evaluation programmatically, we define a DSPy module. Modules abstract a prompting technique (like chain of thought or ReAct). Crucially, they are generalized to handle any DSPy Signature.

In [None]:
class GSM8KModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # here we use the dspy.Predict module which uses zero shot prompting to generate answers
        self.prog = dspy.Predict(GSM8KSignature) 

    def forward(self, question):
        return self.prog(question=question)

In [None]:
gsm8k_zero_shot = GSM8KModule()

In [None]:
from dspy.evaluate import Evaluate

NUM_THREADS = 4 # number of threads to use for parallel processing
evaluate = Evaluate(
    devset=test, # the test set
    metric=gsm8k_metric, # the metric to use -> this will convert responses to integers to compare with the gold answers
    num_threads=NUM_THREADS,
    display_progress=True,
    display_table=20, # how many rows to display
)

In [None]:
evaluate(gsm8k_zero_shot)
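
The evaluation boils down to comparing an integer parsed from each prediction with the gold answer and reporting the share of matches. The sketch below illustrates that idea in plain Python. The names `parse_integer`, `exact_match` and `accuracy` are hypothetical, and this is a simplified stand-in rather than the actual implementation of `gsm8k_metric` or `Evaluate`. For instance, it handles integers only and would misread decimals.

```python
import re
from typing import List, Optional


def parse_integer(text: str) -> Optional[int]:
    """Return the last integer in the text, ignoring thousands separators."""
    numbers = re.findall(r"-?\d+", text.replace(",", ""))
    return int(numbers[-1]) if numbers else None


def exact_match(gold: str, predicted: str) -> bool:
    """True if both strings contain an integer and the integers are equal."""
    gold_value = parse_integer(gold)
    return gold_value is not None and gold_value == parse_integer(predicted)


def accuracy(golds: List[str], predictions: List[str]) -> float:
    """Percentage of predictions whose parsed integer matches the gold answer."""
    if not golds:
        raise ValueError("golds must not be empty")
    correct = sum(exact_match(g, p) for g, p in zip(golds, predictions))
    return 100.0 * correct / len(golds)


# Hand-written toy data for illustration only.
golds = ["49", "56"]
predictions = ["She should sew 49 aprons.", "57"]
print(accuracy(golds, predictions))  # prints 50.0
```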

# Bootstrapping few shot examples

Now we will leverage Gemini Ultra to bootstrap few shot examples which will (hopefully) improve Gemini Pro's performance on the test dataset. With Gemini Ultra we will create a few reasoning examples which we can include in the prompt that we will eventually send to Gemini Pro. Ultra will produce a few candidates and test them on a validation dataset using the `gsm8k_metric`, i.e. the metric we want to optimise for. Once the best candidates have been identified these examples will then be used to create a few shot prompt.
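
The procedure described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not DSPy's `BootstrapFewShotWithRandomSearch` implementation. All function names are hypothetical, and the teacher and student are toy stand-ins for Gemini Ultra and Gemini Pro.

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]                    # {"question": ..., "answer": ...}
Demo = Dict[str, str]                       # an Example plus a "rationale"
Teacher = Callable[[str], Tuple[str, str]]  # question -> (rationale, answer)
Student = Callable[[List[Demo], str], str]  # (demos, question) -> answer


def bootstrap_demos(teacher: Teacher, trainset: List[Example]) -> List[Demo]:
    """Keep a teacher rationale only if the teacher's answer matches the label."""
    demos = []
    for ex in trainset:
        rationale, answer = teacher(ex["question"])
        if answer == ex["answer"]:
            demos.append({**ex, "rationale": rationale})
    return demos


def score(student: Student, demos: List[Demo], valset: List[Example]) -> float:
    """Percentage of validation questions the student answers correctly."""
    hits = sum(student(demos, ex["question"]) == ex["answer"] for ex in valset)
    return 100.0 * hits / len(valset)


def random_search(teacher, student, trainset, valset,
                  num_candidates: int, demos_per_prompt: int, seed: int = 0):
    """Sample candidate demo sets and keep the one that scores best on valset."""
    rng = random.Random(seed)
    pool = bootstrap_demos(teacher, trainset)
    best_demos, best_score = [], score(student, [], valset)  # zero shot baseline
    for _ in range(num_candidates):
        candidate = rng.sample(pool, min(demos_per_prompt, len(pool)))
        candidate_score = score(student, candidate, valset)
        if candidate_score > best_score:
            best_demos, best_score = candidate, candidate_score
    return best_demos, best_score


def toy_teacher(question: str) -> Tuple[str, str]:
    a, b = map(int, question.split("+"))
    return f"{a} plus {b} is {a + b}.", str(a + b)


def toy_student(demos: List[Demo], question: str) -> str:
    # Pretend the student only adds correctly once it has seen a demonstration.
    a, b = map(int, question.split("+"))
    return str(a + b) if demos else "unknown"


trainset = [{"question": "1+2", "answer": "3"}, {"question": "2+2", "answer": "4"}]
valset = [{"question": "3+4", "answer": "7"}]
demos, best = random_search(toy_teacher, toy_student, trainset, valset,
                            num_candidates=2, demos_per_prompt=1)
print(len(demos), best)  # prints: 1 100.0
```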

First we define a Chain of Thought module:

In [None]:
class ZeroShotCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(
            GSM8KSignature,
        )

    def forward(self, question):
        return self.prog(question=question)

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

Now we can start the bootstrapping:

In [None]:
from datetime import datetime

RUN_FROM_SCRATCH = True
bootstrapped_demos = 8 # how many examples are randomly being used from the training dataset
labeled_demos = 3 # how many examples will be in final prompt
candidate_programs = 2 # how many candidates will be created and evaluated (equivalent to epochs)
teacher_model_id = "gemini-1.0-ultra" 

if RUN_FROM_SCRATCH:
    dspy_gemini_ultra = dspy.GoogleVertexAI(
        teacher_model_id,
        temperature=0,
    )
    dspy.settings.configure(lm=dspy_gemini_ultra, timeout=0)
    config = dict(
        max_bootstrapped_demos=bootstrapped_demos,
        max_labeled_demos=labeled_demos,
        num_candidate_programs=candidate_programs,
        num_threads=4,
        stop_at_score=100.0,
    )
    bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
        metric=gsm8k_metric, **config
    )
    cot_fewshot = bootstrap_optimizer.compile(ZeroShotCoT(), trainset=train, valset=val, seed_start=0)

    # save the bootstrap demonstrations for future use
    timestamp_str = datetime.now().strftime("%Y%m%d-%H%M%S")
    filename = f"{timestamp_str}_{teacher_model_id}_{bootstrapped_demos}_{labeled_demos}_{candidate_programs}.json"
    cot_fewshot.save(filename)
else:
    cot_fewshot = ZeroShotCoT()
    cot_fewshot.load("20240403-173150_gemini-1.0-ultra_8_3_2.json")

After this step we have our examples ready, and we can test Gemini Pro on the same test dataset as above.

In [None]:
dspy.settings.configure(lm=dspy_gemini_pro, timeout=0)

In [None]:
evaluate(cot_fewshot)

Nice, this significantly improved Gemini Pro's performance over the 35% zero shot baseline :)

In [None]:
dspy_gemini_pro.inspect_history(n=1)