# Automated Prompt Engineering with DSPy

## Background

### Purpose

To make LLMs perform better on specific tasks, prompt engineering is used to instruct them on how to complete those tasks. Prompt engineering is cheaper than fine-tuning and requires less data, but it has often been a manual process. For LLMs deployed on-device, which are smaller models (usually fewer than 14 billion parameters), it is even more important to have good prompts that are optimized for the task at hand, as these LLMs do not generalize as well as larger LLMs.

We will use DSPy, an automatic prompt engineering framework, to create a pipeline for a specific task and optimize the prompts for that task on an Intel® AI PC.

### What is automated prompt engineering?

Automatic prompt engineering is a technique that takes an LLM and iteratively searches for prompts that make it perform better on a task. Any automatic prompt engineering framework requires the following:
- An LLM that needs to be prompt-engineered
- A dataset of inputs and outputs for the task at hand
- A metric that measures how well the LLM is doing on the task

Automatic prompt engineering frameworks then handle how to update the prompts to make the LLM perform better on the task. [DSPy](https://github.com/stanfordnlp/dspy) is one such framework: it uses signatures, modules, and optimizers to build pipelines that can then be optimized automatically. Another framework is [EvoPrompt](https://github.com/microsoft/EvoPrompt), which uses evolutionary algorithms to optimize prompts.
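
To make the three ingredients concrete, here is a minimal sketch of the simplest possible prompt search: score each candidate prompt on a dataset with a metric and keep the best. It is not how DSPy or EvoPrompt work internally. `toy_llm`, `exact`, and the candidate prompts are hypothetical stand-ins invented for this illustration.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected output)


def score_prompt(
    prompt: str,
    llm: Callable[[str, str], str],
    dataset: Sequence[Example],
    metric: Callable[[str, str], bool],
) -> float:
    """Return the fraction of examples answered correctly with this prompt."""
    hits = sum(metric(llm(prompt, x), y) for x, y in dataset)
    return hits / len(dataset)


def pick_best_prompt(candidates, llm, dataset, metric):
    """Evaluate every candidate prompt and return the best one with its score."""
    scored = [(score_prompt(p, llm, dataset, metric), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


def toy_llm(prompt: str, question: str) -> str:
    # Hypothetical stand-in for an LLM: it only answers in the expected
    # format when the prompt asks for a letter.
    return "B" if "letter" in prompt else "The answer is B"


def exact(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold


toy_data = [("Which gas do plants absorb? A. O2, B. CO2", "B")]
best_prompt, best_score = pick_best_prompt(
    ["Answer the question.", "Reply with the letter only."],
    toy_llm,
    toy_data,
    exact,
)
print(best_prompt, best_score)
```

Real frameworks differ mainly in how they generate new candidates and how they decide which ones to evaluate next, rather than scoring a fixed list.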


## Imports

Importing everything takes a couple of seconds. The cell below outlines the most pertinent imports for this sample:
1. `llama_cpp` is a Python package that interacts with the llama.cpp library, which is a C++ implementation that runs LLMs and other models with a focus on speed and efficiency.
2. `dspy` is a Python package that we will use for automated prompt engineering. We will explore how it works in this notebook.

In [None]:
from llama_cpp import Llama
import dspy

In [None]:
# Misc Imports
from datasets import load_dataset
import ipywidgets as widgets
from IPython.display import display
import pandas as pd
import random
from typing import Literal

In [None]:
# Set seed for reproducibility
SEED = 1208
random.seed(SEED)
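
As a small illustration of what seeding buys us, the sketch below re-seeds Python's `random` module and checks that the same permutation comes out twice. Note that `random.seed` only affects Python's built-in generator, which is presumably why the notebook also passes the seed explicitly to pandas (`random_state`) and to the LLM later on.

```python
import random

items = list(range(10))

random.seed(1208)
first = random.sample(items, k=len(items))

random.seed(1208)  # re-seeding resets the generator to the same state
second = random.sample(items, k=len(items))

print(first == second)  # True: same seed, same permutation
```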

## Dataset

The dataset that we will be using is the [ARC dataset](https://huggingface.co/datasets/allenai/ai2_arc), which is available on Hugging Face. It contains grade-school-level science questions paired with multiple-choice answers. The task for the LLM is to predict the correct multiple-choice answer to each question.

In many cases, one may not have a dataset ready for their task. For these cases, one would need to create examples themselves. DSPy can work with a few examples and then optimize the prompts for the task.

In [None]:
# Load in the ARC dataset
dataset = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="train")

In [None]:
# Extract the questions, answers, and choices from the dataset
questions = [row["question"] for row in dataset]
answers = [row["answerKey"] for row in dataset]
choices = [row["choices"] for row in dataset]

In [None]:
# Create a pandas dataframe from the extracted data
dataset = pd.DataFrame({"question": questions, "answer": answers, "choices": choices})

To make the sample run faster, we downsample the dataset by 90%, keeping a random 10% of the rows. This is not recommended for real-world applications. If one has a large number of samples, DSPy also offers some level of fine-tuning. However, for many LLM applications, people often need to write some of the expected responses themselves, so we show how DSPy works with a smaller number of examples.

In [None]:
dataset = dataset.sample(frac=0.1, random_state=SEED)

In [None]:
dataset.head()

DSPy uses signatures to define the input and output for the LLM. A signature is represented as a Python class, inside which we declare the input and output fields. Here, the input is the science question and the output is the correct answer to the question. Since the correct answer is always one of the multiple-choice letters, we can use Python typing to restrict the output to those letters. DSPy will use this signature to prompt the LLM and will also add prompts around this signature during optimization.

In [None]:
class Question(dspy.Signature):
    """Answer science questions by selecting the correct answer from a list of choices. Respond with the letter of the correct answer."""  # noqa: E501

    science_question = dspy.InputField()
    answer: Literal["A", "B", "C", "D"] = dspy.OutputField()
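
How DSPy enforces the `Literal` annotation is internal to the library. As a rough, standalone illustration of what the annotation expresses, the sketch below checks a raw string against the allowed letters using only the standard library. `AnswerLetter` and `is_valid_answer` are names made up for this example.

```python
from typing import Literal, get_args

# Same set of letters as the `answer` field in the signature above.
AnswerLetter = Literal["A", "B", "C", "D"]


def is_valid_answer(text: str) -> bool:
    """Return True only if `text` is exactly one of the allowed letters."""
    return text in get_args(AnswerLetter)


print(get_args(AnswerLetter))
print(is_valid_answer("B"), is_valid_answer("E"), is_valid_answer("b"))
```

A check like this is also a quick way to spot dataset rows whose answer key falls outside the four letters before building examples from them.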

We need to convert the list of science questions and answers to a format that DSPy can understand. DSPy takes in a list of `dspy.Example` objects that specify the science question and the correct answer. We will convert the questions and answers to this format.

In [None]:
# Create dataset
dspy_dataset = []

for row in dataset.itertuples():
    # Extract data from row
    question = row.question
    answer = row.answer
    labels = row.choices["label"]
    context = row.choices["text"]

    # Create science question input based on the question and answer choices
    answer_choices = ""
    for label, choice in zip(labels, context):
        answer_choices += f"{label}. {choice}, "
    answer_choices = answer_choices[:-2]  # Remove trailing comma and space
    science_question = f"{question}: {answer_choices}"

    # Create example
    example = dspy.Example(
        science_question=science_question, answer=answer
    ).with_inputs("science_question")

    # Append example to dataset
    dspy_dataset.append(example)
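
To see what a formatted input looks like without loading the dataset, the string-building step from the loop above can be pulled into a small function. This is a sketch that uses `", ".join` instead of trimming the trailing separator; the question and choices are made up for illustration.

```python
def format_science_question(question: str, labels: list[str], texts: list[str]) -> str:
    """Join a question and its labelled choices into one input string."""
    answer_choices = ", ".join(
        f"{label}. {text}" for label, text in zip(labels, texts)
    )
    return f"{question}: {answer_choices}"


# Made-up row for illustration; real rows come from the ARC dataset.
formatted = format_science_question(
    "Which object is attracted to a magnet?",
    ["A", "B", "C", "D"],
    ["paper clip", "rubber band", "plastic cup", "wood block"],
)
print(formatted)
# Which object is attracted to a magnet?: A. paper clip, B. rubber band, C. plastic cup, D. wood block
```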

In [None]:
# Shuffle dataset
random.shuffle(dspy_dataset)

In order to test the prompt found via DSPy, we need to create a train, validation, and test set. The train set is what DSPy will use to find prompts that work and the validation dataset will evaluate the prompts that DSPy finds. The test set will be used to evaluate the final prompt that DSPy finds as a hold-out set. This is analogous to the train, validation, and test set in machine learning.

We will use a 60-20-20 split for the train, validation, and test set.

In [None]:
train_size = int(0.6 * len(dspy_dataset))  # 60% for training
val_size = int(0.2 * len(dspy_dataset))  # 20% for validation

# Split the list
train = dspy_dataset[:train_size]
val = dspy_dataset[train_size : train_size + val_size]
test = dspy_dataset[train_size + val_size :]  # Remaining 20% for testing
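
Because the sizes are computed with `int(...)`, which truncates, the test set absorbs any rounding remainder. The sketch below applies the same slicing to a toy list (112 items, an arbitrary size chosen for illustration) and checks that the three parts cover the data exactly once.

```python
def split_60_20_20(items: list) -> tuple[list, list, list]:
    """Split a list into 60% train, 20% validation, and the remainder as test."""
    train_size = int(0.6 * len(items))
    val_size = int(0.2 * len(items))
    train = items[:train_size]
    val = items[train_size : train_size + val_size]
    test = items[train_size + val_size :]
    return train, val, test


toy = list(range(112))
toy_train, toy_val, toy_test = split_60_20_20(toy)

print(len(toy_train), len(toy_val), len(toy_test))  # 67 22 23
print(toy_train + toy_val + toy_test == toy)  # True: no overlap, nothing lost
```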

One of the main inputs for any prompt engineering framework is the LLM used. Here, we provide a couple of widely used small language models (SLMs) that are available via Hugging Face and perform well on Intel® AI PCs. Please use the dropdown to select the LLM that you would like to use for this sample.

In [None]:
model_to_repo = {
    "Phi-3.1-mini-4k-instruct-Q4_K_M.gguf": "bartowski/Phi-3.1-mini-4k-instruct-GGUF",
    "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf": "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF",  # noqa: E501
    "Llama-3.2-1B-Instruct-Q4_K_M.gguf": "bartowski/Llama-3.2-1B-Instruct-GGUF",
    "qwen2-1_5b-instruct-q4_k_m.gguf": "Qwen/Qwen2-1.5B-Instruct-GGUF",
    "qwen2-7b-instruct-q4_k_m.gguf": "Qwen/Qwen2-7B-Instruct-GGUF",
    "qwen2-0_5b-instruct-q4_k_m.gguf": "Qwen/Qwen2-0.5B-Instruct-GGUF",
}

In [None]:
model_dropdown = widgets.Dropdown(
    options=model_to_repo.keys(),
    # Default to Qwen2 1.5B for best results
    value="qwen2-1_5b-instruct-q4_k_m.gguf",
    description="Select an LLM:",
)

display(model_dropdown)

After we select the LLM, we load it using `llama-cpp-python`. The `from_pretrained` function downloads the selected GGUF model file from Hugging Face and loads it onto the machine. We will then use this LLM to answer the science questions.

In [None]:
llm = Llama.from_pretrained(
    repo_id=model_to_repo[model_dropdown.value],
    filename=model_dropdown.value,
    # This tells Llama.cpp to put 5 layers of the model on the GPU.
    # The rest of the model will run on the CPU.
    n_gpu_layers=5,
    seed=SEED,
    # Increase context window size to 4096 so that the model can see the entire question
    # Having a large enough window size is important for the prompt optimization part
    n_ctx=4096,
    verbose=False,
)

Once we have loaded the LLM, we need to configure DSPy to use it. DSPy offers the `dspy.LlamaCpp` class, which takes the `llm` object. DSPy will then use `llama-cpp-python` and the LLM to answer the questions.

In [None]:
llamalm = dspy.LlamaCpp(model="llama", llama_model=llm, model_type="chat", seed=SEED)
dspy.settings.configure(lm=llamalm)

The metric we will use for evaluating the LLM's performance is `answer_exact_match`, which returns `True` if the LLM answer matches the correct answer exactly and `False` otherwise. We will use this metric to evaluate the LLM's performance on the validation and test set.

For more complex tasks (like RAG or summarization), `answer_exact_match` may not be a good metric. In those cases, one would need to use a metric that is better suited to the task at hand. DSPy offers [auto-evaluation metrics](https://github.com/stanfordnlp/dspy/blob/main/dspy/evaluate/auto_evaluation.py#L21) that prompt the LLM for a numeric evaluation. Other options include [`BLEU`](https://en.wikipedia.org/wiki/BLEU), [`ROUGE`](https://en.wikipedia.org/wiki/ROUGE_(metric)), or [`METEOR`](https://en.wikipedia.org/wiki/METEOR). [`BERTScore`](https://arxiv.org/abs/1904.09675) uses the [BERT language model](https://en.wikipedia.org/wiki/BERT_(language_model)) to evaluate the LLM's performance using embeddings.
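
To give a feel for what an overlap-based metric measures, here is a deliberately simplified unigram-overlap F1 score, in the spirit of ROUGE-1. It is only a sketch: it splits on whitespace and skips the stemming and other preprocessing that real implementations apply, so it should not be taken as a reference ROUGE implementation.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over shared lowercase whitespace tokens (simplified ROUGE-1 style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("plants absorb carbon dioxide", "plants take in carbon dioxide")
print(round(score, 3))
```

Unlike exact match, a score like this gives partial credit, which is why it suits free-form outputs better than single-letter answers.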

In [None]:
metric = dspy.evaluate.metrics.answer_exact_match
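
DSPy metrics are called with an example and a prediction (plus an optional `trace`). The sketch below is a simplified stand-in with that call shape, not DSPy's own implementation, which applies its own text normalization. `simple_exact_match` is a hypothetical name, and `SimpleNamespace` objects stand in for `dspy.Example` and prediction objects.

```python
from types import SimpleNamespace


def simple_exact_match(example, pred, trace=None) -> bool:
    """Return True when the predicted answer equals the gold answer.

    Simplified stand-in: compares after stripping whitespace and ignoring case.
    """
    return example.answer.strip().lower() == pred.answer.strip().lower()


gold = SimpleNamespace(answer="B")
print(simple_exact_match(gold, SimpleNamespace(answer=" b ")))  # True
print(simple_exact_match(gold, SimpleNamespace(answer="C")))  # False
```

A custom function of this shape can be passed anywhere the notebook passes `metric`, which is how one would swap in a task-specific metric.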

After we have our dataset, we then need to create a Module that represents our input and what prompt strategy the LLM should use. We will use the `Module` class from `dspy` to create a module that represents the input and output for the LLM. We will then use this module to create a pipeline that will be optimized by DSPy.

Here, we use our `Question` signature to specify the input and output we want from the LLM. Then, we use `dspy.ChainOfThought` to tell DSPy to apply the Chain-of-Thought prompt-engineering strategy, which helps the LLM think step by step to solve reasoning tasks. Without DSPy, one would need to write Chain-of-Thought prompts for the LLM by hand. `dspy.ChainOfThought` creates these prompts automatically.

In [None]:
class QuestionAnsweringAI(dspy.Module):
    def __init__(self):
        super().__init__()  # Initialize dspy.Module's own state first
        self.signature = Question
        self.respond = dspy.ChainOfThought(self.signature)

    def forward(self, science_question):
        return self.respond(science_question=science_question)
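
For comparison, this is roughly what writing a Chain-of-Thought prompt by hand could look like. The template wording and the `Answer: <letter>` convention are assumptions made for this sketch; they are not the prompt that `dspy.ChainOfThought` generates.

```python
import re
from typing import Optional


def build_cot_prompt(instructions: str, science_question: str) -> str:
    """Build a hand-written Chain-of-Thought prompt (illustrative template)."""
    return (
        f"{instructions}\n"
        "First write your reasoning, then finish with a line of the form "
        "'Answer: <letter>'.\n\n"
        f"Science Question: {science_question}\n"
        "Reasoning: Let's think step by step."
    )


def parse_answer(completion: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the text, or None."""
    matches = re.findall(r"Answer:\s*([A-D])\b", completion)
    return matches[-1] if matches else None


prompt = build_cot_prompt(
    "Answer science questions by selecting the correct choice.",
    "Which object is attracted to a magnet?: A. paper clip, B. rubber band",
)
print(prompt)
print(parse_answer("Magnets attract iron and steel.\nAnswer: A"))  # A
```

With DSPy, both the template and the parsing are derived from the signature, so neither has to be maintained by hand.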

Now that we have defined the inputs, outputs, and the LLM pipeline, we need a way to evaluate the LLM's performance with new prompts. `dspy.Evaluate` accepts a dataset and a metric and runs the evaluation when it is called with a pipeline.

In [None]:
train_evaluate = dspy.Evaluate(
    devset=train, metric=metric, num_threads=1, display_progress=True, display_table=10
)
val_evaluate = dspy.Evaluate(
    devset=val, metric=metric, num_threads=1, display_progress=True, display_table=10
)
test_evaluate = dspy.Evaluate(
    devset=test, metric=metric, num_threads=1, display_progress=True, display_table=10
)

Before we start the optimization process, let's get a baseline of the LLM's performance on the train, validation, and test sets.

<div class="alert alert-warning">
The following code cells will take around 10-15 minutes to complete. Please be patient!
</div>

In [None]:
orig_train_score = train_evaluate(QuestionAnsweringAI())

In [None]:
orig_val_score = val_evaluate(QuestionAnsweringAI())

In [None]:
orig_test_score = test_evaluate(QuestionAnsweringAI())

In [None]:
# Display the original scores
print(f"Original Training Score: {orig_train_score}")
print(f"Original Validation Score: {orig_val_score}")
print(f"Original Test Score: {orig_test_score}")

We can use `dspy.inspect_history` to see the inputs sent to the LLM and the responses it returned. This helps us understand what the LLM is doing and how it is solving the task. Let's take a look at the last two inputs and responses from the LLM.

In [None]:
_ = dspy.inspect_history(n=2)

DSPy offers a variety of optimizers to find the best prompts. We will use `MIPROv2`, a prompt-engineering optimizer, to find better prompts for the LLM.

We use `MIPROv2` to do the following:
1. Create some example demonstrations by running the LLM on the training data
2. Use few-shot prompting with these demonstrations to help the LLM understand how to solve the task at hand
3. Use the LLM to describe the dataset and propose different task instructions
4. Use Bayesian optimization to find the best instructions for the LLM
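
The steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not MIPROv2's implementation: the candidate instructions are a fixed list rather than LLM-generated (step 3), and an exhaustive grid search stands in for Bayesian optimization (step 4). All names here (`bootstrap_demos`, `grid_search`, `always_a`, `toy_evaluate`) are hypothetical.

```python
from itertools import product


def bootstrap_demos(program, trainset, metric, max_demos):
    """Steps 1-2: keep training examples the current program already solves."""
    demos = []
    for question, gold in trainset:
        if len(demos) == max_demos:
            break
        if metric(program(question), gold):
            demos.append((question, gold))
    return demos


def grid_search(instructions, demo_sets, evaluate):
    """Step 4 stand-in: score every (instruction, demos) pair, keep the best."""
    best_score, best_candidate = float("-inf"), None
    for instruction, demo_set in product(instructions, demo_sets):
        score = evaluate(instruction, demo_set)
        if score > best_score:
            best_score, best_candidate = score, (instruction, demo_set)
    return best_candidate, best_score


def always_a(question):
    # Toy "program" that always answers "A".
    return "A"


def exact(prediction, gold):
    return prediction == gold


def toy_evaluate(instruction, demo_set):
    # Pretend validation score: rewards asking for a letter and using demos.
    return ("letter" in instruction) + 0.1 * len(demo_set)


toy_trainset = [("q1", "A"), ("q2", "B"), ("q3", "A")]
demos = bootstrap_demos(always_a, toy_trainset, exact, max_demos=2)
best, best_score = grid_search(
    ["Answer the question.", "Reply with one letter."], [[], demos], toy_evaluate
)
print(demos)
print(best, best_score)
```

Exhaustive search is only feasible for tiny candidate sets. A Bayesian optimizer instead uses the scores seen so far to decide which combination to try next, which matters when each evaluation means running the LLM over a validation set.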

`MIPROv2` also exposes hyperparameters that control how long the search for prompts takes. We use the `light` setting (`auto="light"`) for these hyperparameters.

In [None]:
optm = dspy.MIPROv2(metric=metric, auto="light", num_threads=1, seed=SEED)

<div class="alert alert-warning">
The following code cell will take around 30-40 minutes to complete. Please be patient!
</div>

In [None]:
optimized_question_answerer = optm.compile(
    QuestionAnsweringAI(),
    trainset=train,
    valset=val,
    # The number of examples that is generated and included in the prompt
    max_bootstrapped_demos=2,
    # The number of examples from the training set that is included in the prompt
    max_labeled_demos=2,
    requires_permission_to_run=False,
)

Now that we have the optimized science question answerer, we can re-evaluate the pipeline on the train, validation, and test datasets. The score on the test dataset is the most important, as the test dataset is a hold-out set that was not used during optimization.

<div class="alert alert-warning">
The following code cells will take around 10-15 minutes to complete. Please be patient!
</div>

In [None]:
opt_train_score = train_evaluate(optimized_question_answerer)

In [None]:
opt_val_score = val_evaluate(optimized_question_answerer)

In [None]:
opt_test_score = test_evaluate(optimized_question_answerer)

Let's take a look at the last two inputs and responses from the LLM.

In [None]:
_ = dspy.inspect_history(n=2)

In [None]:
print(f"Original Training Score: {orig_train_score}")
print(f"Optimized Training Score: {opt_train_score}")
print(f"Original Validation Score: {orig_val_score}")
print(f"Optimized Validation Score: {opt_val_score}")
print(f"Original Test Score: {orig_test_score}")
print(f"Optimized Test Score: {opt_test_score}")
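
To read the before and after numbers side by side, a small helper can print the change per split. This sketch assumes the evaluation scores are plain numbers. The values passed in below are placeholders for illustration only, not results from this notebook.

```python
def summarize(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Format one 'split: before -> after (change)' line per dataset split."""
    lines = []
    for split, (before, after) in scores.items():
        delta = after - before
        lines.append(f"{split:<10} {before:6.2f} -> {after:6.2f} ({delta:+.2f})")
    return lines


# Placeholder numbers only; substitute the orig_* and opt_* scores from above.
report = summarize({"Train": (40.0, 52.5), "Test": (50.0, 55.0)})
for line in report:
    print(line)
```

A gain on the train and validation sets that does not carry over to the test set suggests the optimized prompt has overfit to the examples it was tuned on.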