<a href="https://colab.research.google.com/github/jayozer/ai_webinars/blob/main/Jay_DSPy_Advanced_Prompt_Engineering%3F_AI_Makerspace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DSPy - Advanced Prompt Engineering

In the following notebook, we'll explore an introduction to DSPy and what it can do in just a few lines of code!

To begin, we'll grab the only (top level) dependency we'll need - DSPy!

In [1]:
!pip install -qU dspy-ai

DSPy can use OpenAI's models under the hood and still add value on top of them. To do so, however, we'll need to provide an OpenAI API key!

In [2]:
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

OpenAI API Key: ··········


## Model

Now we can set up our OpenAI language model, which we'll use through the remaining cells in the notebook.

In [3]:
from dspy import OpenAI

llm = OpenAI(model='gpt-3.5-turbo')

Similar to other libraries, we can call the LLM directly with a string to get a response!

In [4]:
llm("What is the square root of pi?")

['The square root of pi is approximately 1.77245385091.']

We'll also call `dspy.settings.configure` with our OpenAI model in the `lm` (Language Model) field. This sets the default LM that DSPy uses whenever we don't specify one when calling our DSPy `Predictors`.

In [5]:
import dspy

dspy.settings.configure(lm=llm)

## Data

We're going to be using a dataset that provides a number of example sentences, along with a rating that indicates their "dopeness" level.

In [6]:
from datasets import load_dataset

dataset = load_dataset("llm-wizard/dope_or_nope_v2")


We have a total of 99 rows of data, and will be splitting that into a `trainset` and a `valset` - for training and evaluation.

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Rating', 'Fire Emojis'],
        num_rows: 99
    })
})

Due to the nature of the dataset, we'll need to shuffle it to ensure our labels are not clumped together and that our `valset` is reasonably representative of our `trainset`.

In [8]:
dataset = dataset.shuffle(seed=42)

We'll move our `Dataset` into the expected format in DSPy which is the [`Example`](https://dspy-docs.vercel.app/docs/deep-dive/data-handling/examples)!


Our examples will have two keys:

- `sentence`, our input sentence to be rated
- `rating`, our rating label

We'll mark `sentence` as the input using `with_inputs`, so that DSPy treats `sentence` as what gets passed to the program and `rating` as the label.

In [9]:
from dspy import Example

trainset = []

for row in dataset["train"].select(range(0,len(dataset["train"])-10)):
  trainset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(trainset)

89

We'll repeat the same process for our `valset` as well.

In [10]:
valset = []

for row in dataset["train"].select(range(len(trainset),len(dataset["train"]))):
  valset.append(Example(sentence=row["Sentence"], rating=row["Rating"]).with_inputs("sentence"))

len(valset)

10
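
To make the shuffle-then-split step concrete, here is a minimal sketch in plain Python. It does not use the `datasets` library: the rows are hypothetical stand-ins sorted by label, the helper `shuffle_and_split` is an illustrative name, and Python's `random` module will not reproduce the same ordering as `dataset.shuffle(seed=42)`. It only shows why holding out the last 10 rows of an unshuffled, label-sorted dataset would give a one-label validation set.

```python
import random
from collections import Counter

# Hypothetical stand-in rows, sorted by label the way a clumped dataset might be.
rows = [{"Sentence": f"sentence {i}", "Rating": i // 20} for i in range(99)]

def shuffle_and_split(rows, holdout=10, seed=42):
    """Shuffle a copy of rows with a fixed seed, then hold out the last `holdout` rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:-holdout], shuffled[-holdout:]

train_rows, val_rows = shuffle_and_split(rows)

# Without shuffling, the last 10 rows all carry the same (highest) label.
unshuffled_val_labels = Counter(r["Rating"] for r in rows[-10:])
shuffled_val_labels = Counter(r["Rating"] for r in val_rows)

print(len(train_rows), len(val_rows))
print("unshuffled holdout labels:", dict(unshuffled_val_labels))
print("shuffled holdout labels:  ", dict(shuffled_val_labels))
```

With only 10 held-out rows, even a shuffled split can miss some labels, so it is worth printing the label counts before trusting an evaluation.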

Let's take a peek at an example from our `trainset` and `valset`!

In [11]:
train_example = trainset[0]
print(f"Sentence: {train_example.sentence}")
print(f"Label: {train_example.rating}")

Sentence: The results were satisfactory.
Label: 0


In [12]:
valset_example = valset[0]
print(f"Sentence: {valset_example.sentence}")
print(f"Label: {valset_example.rating}")

Sentence: This is top tier.
Label: 4


## Signature

The first foundational unit in DSPy is the `Signature`.

In a sense, a `Signature` can be thought of as both a prompt, as well as metadata about that prompt.

Going beyond just a simple `SystemMessage`, as seen in other frameworks, the `Signature` helps DSPy validate datatypes, create examples, and more.

> NOTE: DSPy's [documentation](https://dspy-docs.vercel.app/docs/deep-dive/signature/understanding-signatures#what-is-a-signature) goes into more detail about what exactly a `Signature` is.

In [13]:
from dspy import Signature, InputField, OutputField

class DopeOrNopeSignature(Signature):
  """Rate a sentence from 0 to 4 on a dopeness scale"""
  sentence: str = InputField()
  rating: int = OutputField()
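
As a rough illustration of what "validating datatypes" can mean for a field annotated as `int`, the sketch below coerces raw completion text into the annotated type and fails loudly otherwise. The helper `coerce_output` is a hypothetical name for this example and is not part of DSPy; the library's own validation is more involved.

```python
def coerce_output(raw_text, annotation):
    """Convert raw completion text to the annotated type, or raise ValueError."""
    cleaned = raw_text.strip()
    try:
        return annotation(cleaned)
    except (TypeError, ValueError) as err:
        raise ValueError(f"expected {annotation.__name__}, got {cleaned!r}") from err

# A completion such as " 3" becomes the integer 3.
rating = coerce_output(" 3", int)
print(rating, type(rating).__name__)

# A completion that is not an integer is rejected instead of passed along.
try:
    coerce_output("three", int)
except ValueError as err:
    print("rejected:", err)
```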

## Predictor

Now that we have our `Signature`, we can build a `Predictor` that leverages it.

A `Predictor`, in the simplest terms, is what calls the LLM using our signature. Importantly, the `Predictor` knows how to leverage our signature to call the LLM. From DSPy's documentation, one of the most interesting parts of a `Predictor` is that it can *learn* to become better at the desired task!

Let's take a look at our `TypedPredictor` below to see more.

In [14]:
from dspy.functional import TypedPredictor

generate_label = TypedPredictor(DopeOrNopeSignature)

In [15]:
generate_label

TypedPredictor(DopeOrNopeSignature(sentence -> rating
    instructions='Rate a sentence from 0 to 4 on a dopeness scale'
    sentence = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Sentence:', 'desc': '${sentence}'})
    rating = Field(annotation=int required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Rating:', 'desc': '${rating}'})
))

In [16]:
label_prediction = generate_label(sentence=valset_example.sentence)
print(f"Sentence: {valset_example.sentence}")
print(f"Prediction: {label_prediction}")

Sentence: This is top tier.
Prediction: Prediction(
    rating=3
)


We can, at any time, check our LLM's most recent prompt and output through `inspect_history`.

In [17]:
llm.inspect_history(n=1)




Rate a sentence from 0 to 4 on a dopeness scale

---

Follow the following format.

Sentence: ${sentence}
Rating: ${rating} (Respond with a single int value)

---

Sentence: This is top tier.
Rating: 3

Notice how, without any extra input from us, the `TypedPredictor` has included format instructions for the LLM to help ensure the returned data matches the type we asked for.
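
To see how little information is needed to produce that prompt, here is a minimal sketch that rebuilds the same text from the docstring, the field names, and the output type. The function `render_prompt` is a hypothetical helper written for this example; it imitates the printed layout and is not how DSPy builds prompts internally.

```python
def render_prompt(instructions, input_name, output_name, output_type, value):
    """Assemble a prompt shaped like the one inspect_history printed above."""
    in_prefix = input_name.capitalize()
    out_prefix = output_name.capitalize()
    # The type hint is derived from the annotation on the output field.
    type_hint = f"(Respond with a single {output_type.__name__} value)"
    sections = [
        instructions,
        "Follow the following format.\n\n"
        f"{in_prefix}: ${{{input_name}}}\n"
        f"{out_prefix}: ${{{output_name}}} {type_hint}",
        # The final section leaves the output prefix open for the LLM to complete.
        f"{in_prefix}: {value}\n{out_prefix}:",
    ]
    return "\n\n---\n\n".join(sections)

prompt = render_prompt(
    "Rate a sentence from 0 to 4 on a dopeness scale",
    "sentence",
    "rating",
    int,
    "This is top tier.",
)
print(prompt)
```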

Let's look at another example of a `Predictor` - this time with Chain of Thought.

In order to use this - we don't have to do anything with our `Signature`! We can leave it exactly as is - and allow the `Predictor` to adapt to it.

> NOTE: We won't be using this predictor going forward - this is just to showcase the ease of using another `Predictor` with a `Signature`.
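
Conceptually, a Chain of Thought predictor can leave the signature alone because it only needs to slot one extra field in front of the output. The sketch below shows that idea on a plain list of (prefix, description) pairs; `with_reasoning` and `format_block` are hypothetical helpers, and the wording of the reasoning line is copied from the prompt this notebook prints further down.

```python
# The two fields our signature defines, as (prefix, description) pairs.
base_fields = [
    ("Sentence", "${sentence}"),
    ("Rating", "${rating} (Respond with a single int value)"),
]

def with_reasoning(fields, output_name="rating"):
    """Return a new field list with a Reasoning line inserted before the final output field."""
    reasoning = (
        "Reasoning",
        f"Let's think step by step in order to ${{produce the {output_name}}}. We ...",
    )
    return fields[:-1] + [reasoning] + fields[-1:]

def format_block(fields):
    """Render fields as the 'format' section of a prompt, one field per line."""
    return "\n".join(f"{prefix}: {desc}" for prefix, desc in fields)

cot_fields = with_reasoning(base_fields)
print(format_block(cot_fields))
```

The original field list is untouched, which mirrors how the same `Signature` can be reused across predictors.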

In [None]:
from dspy.functional import TypedChainOfThought

generate_label_with_chain_of_thought = TypedChainOfThought(DopeOrNopeSignature)

label_prediction = generate_label_with_chain_of_thought(sentence=valset_example.sentence)

In [None]:
print(f"Sentence: {valset_example.sentence}")
print(f"Reasoning: {label_prediction.reasoning}")
print(f"Ground Truth Label: {valset_example.rating}")
print(f"Prediction: {label_prediction.rating}")

Sentence: This is top tier.
Reasoning: produce the rating. We first consider the impact of the phrase "top tier," which implies the highest level of quality or excellence. This phrase is commonly used in a positive context and conveys a strong sense of approval or admiration. Additionally, the brevity and simplicity of the sentence add to its effectiveness and emphasis. Overall, the sentence is straightforward and powerful in its expression of high regard.
Ground Truth Label: 4
Prediction: 3


We can, again, check our LLM's history to see what the actual prompt/response is.


In [None]:
llm.inspect_history(n=1)




Rate a sentence from 0 to 4 on a dopeness scale

---

Follow the following format.

Sentence: ${sentence}
Reasoning: Let's think step by step in order to ${produce the rating}. We ...
Rating: ${rating} (Respond with a single int value)

---

Sentence: This is top tier.
Reasoning: Let's think step by step in order to produce the rating. We first consider the impact of the phrase "top tier," which implies the highest level of quality or excellence. This phrase is commonly used in a positive context and conveys a strong sense of approval or admiration. Additionally, the brevity and simplicity of the sentence add to its effectiveness and emphasis. Overall, the sentence is straightforward and powerful in its expression of high regard.
Rating: 3

## Modules

Now that we have our `TypedPredictor`, we can create a `Module`!

A `Module` is useful because it allows us to interact with the `Predictor` and `Signature` in a way that DSPy can leverage for optimization.

This helps the DSPy framework determine paths through your program - and helps during the `compilation` or optimization steps (formerly `teleprompting`).

> NOTE: You might notice this looks strikingly similar to PyTorch, and this is by design!

In [None]:
from dspy import Module, Prediction

class DopeOrNopeStudent(Module):
  def __init__(self):
    super().__init__()

    self.generate_rating = TypedPredictor(DopeOrNopeSignature)

  def forward(self, sentence):
    prediction = self.generate_rating(sentence=sentence)
    return Prediction(rating=prediction.rating)
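
The PyTorch-style pattern can be sketched without DSPy at all. In the stand-in below, calling the module routes through `forward`, and predictors stored as attributes can be discovered by the framework, which is what lets an optimizer later attach demos to them. All three classes (`TinyPredictor`, `TinyModule`, `TinyStudent`) are hypothetical and greatly simplified; they are not DSPy's classes.

```python
class TinyPredictor:
    """Hypothetical stand-in for a predictor: wraps a callable and holds few-shot demos."""

    def __init__(self, fn):
        self.fn = fn
        self.demos = []  # an optimizer would fill this in

    def __call__(self, **kwargs):
        return self.fn(**kwargs)


class TinyModule:
    """Hypothetical stand-in for a PyTorch-style base class."""

    def __call__(self, *args, **kwargs):
        # Calling the module routes through forward(), as in PyTorch.
        return self.forward(*args, **kwargs)

    def named_predictors(self):
        # Predictors stored as attributes are discoverable without extra registration.
        return [(n, v) for n, v in vars(self).items() if isinstance(v, TinyPredictor)]


class TinyStudent(TinyModule):
    def __init__(self):
        # A fixed guess of 3 stands in for the LLM call.
        self.generate_rating = TinyPredictor(lambda sentence: 3)

    def forward(self, sentence):
        return {"rating": self.generate_rating(sentence=sentence)}


student = TinyStudent()
result = student("This is top tier.")
predictor_names = [name for name, _ in student.named_predictors()]
print(result, predictor_names)
```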

## Evaluate

As with any good framework, DSPy has the ability to `Evaluate` - we can leverage this to determine how our current DSPy "program" (our `Module` in this case) operates.

> NOTE: DSPy's "program" could be loosely related to a "chain" from the popular LLM Framework LangChain.

In [None]:
from dspy.evaluate.evaluate import Evaluate

evaluate_fewshot = Evaluate(devset=valset, num_threads=1, display_progress=True, display_table=10)

def exact_match_metric(answer, pred, trace=None):
  return answer.rating == pred.rating

evaluate_fewshot(DopeOrNopeStudent(), metric=exact_match_metric)

Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:01<00:00,  6.33it/s]


| | sentence | example_rating | pred_rating | exact_match_metric |
|---|---|---|---|---|
| 0 | This is top tier. | 4 | 3 | False |
| 1 | Big mood. | 3 | 3 | ✔️ [True] |
| 2 | The presentation was outstanding. | 1 | 3 | False |
| 3 | I'm living my best life. | 4 | 3 | False |
| 4 | Sksksksk, that's hilarious. | 3 | 3 | ✔️ [True] |
| 5 | The report is comprehensive. | 1 | 2 | False |
| 6 | This is next level. | 4 | 3 | False |
| 7 | The meeting was productive. | 1 | 2 | False |
| 8 | The analysis was insightful. | 1 | 3 | False |
| 9 | I stan a legend. | 3 | 3 | ✔️ [True] |


30.0
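
The reported score is just the share of validation examples where the metric returns `True`, expressed as a percentage. Here is a small sketch that recomputes it from the labels and predictions in the table above; the `evaluate` helper is a hypothetical simplification and not DSPy's `Evaluate` class.

```python
from types import SimpleNamespace

# Labels and predictions copied from the evaluation table above.
labels = [4, 3, 1, 4, 3, 1, 4, 1, 1, 3]
predictions = [3, 3, 3, 3, 3, 2, 3, 2, 3, 3]

examples = [SimpleNamespace(rating=r) for r in labels]
preds = [SimpleNamespace(rating=r) for r in predictions]

def exact_match(example, pred, trace=None):
    """Same comparison as the notebook's metric: label equals prediction."""
    return example.rating == pred.rating

def evaluate(examples, preds, metric):
    """Percentage of examples for which the metric is truthy."""
    hits = sum(bool(metric(e, p)) for e, p in zip(examples, preds))
    return round(100 * hits / len(examples), 2)

score = evaluate(examples, preds, exact_match)
print(score)  # 3 matches out of 10 -> 30.0
```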

## Program Optimization (the Artist Formerly Known as Teleprompting)

Optimization is the crux of the DSPy framework - it is what allows it to operate at a level beyond traditional prompt engineering.

At a high level, optimization is a way for the DSPy framework to take the program, a training set, and a metric - and make changes/tweaks to our program to improve our metric on our dataset.

Let's get started with the `LabeledFewShot` optimizer.

The `LabeledFewShot` optimizer very simply provides a sample of the `trainset` as few-shot examples!
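
As a rough picture of what "a sample of the `trainset` as few-shot examples" means, the sketch below picks `k` labeled examples at random and places them ahead of the query. The helpers `sample_demos` and `build_prompt` are hypothetical, the fixed `seed` is an assumption added for repeatability, and the toy examples reuse sentences that appear elsewhere in this notebook.

```python
import random

# Sentences and ratings taken from prompts and examples shown in this notebook.
toy_trainset = [
    ("The approval was granted.", 1),
    ("I admire your dedication.", 1),
    ("Too good to be true.", 4),
    ("The software was updated.", 1),
    ("The results were satisfactory.", 0),
]

def sample_demos(trainset, k=4, seed=0):
    """Pick up to k labeled examples at random (seed is an assumption for repeatability)."""
    return random.Random(seed).sample(trainset, min(k, len(trainset)))

def build_prompt(demos, sentence):
    """Place labeled demos ahead of the unanswered query, separated by '---'."""
    blocks = [f"Sentence: {s}\nRating: {r}" for s, r in demos]
    blocks.append(f"Sentence: {sentence}\nRating:")
    return "\n\n---\n\n".join(blocks)

demos = sample_demos(toy_trainset)
prompt = build_prompt(demos, "I stan a legend.")
print(prompt)
```

No LLM calls are needed to build these demos, which is why this optimizer is so cheap to run.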

In [None]:
from dspy.teleprompt import LabeledFewShot

labeled_fewshot_optimizer = LabeledFewShot(k=4)

Once we define our optimizer, we can compile our program!

In [None]:
compiled_dspy = labeled_fewshot_optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

Let's evaluate!

In [None]:
evaluate_fewshot(compiled_dspy, metric=exact_match_metric)

Average Metric: 4 / 10  (40.0): 100%|██████████| 10/10 [00:04<00:00,  2.21it/s]


| | sentence | example_rating | pred_rating | exact_match_metric |
|---|---|---|---|---|
| 0 | This is top tier. | 4 | 4 | ✔️ [True] |
| 1 | Big mood. | 3 | 3 | ✔️ [True] |
| 2 | The presentation was outstanding. | 1 | 3 | False |
| 3 | I'm living my best life. | 4 | 3 | False |
| 4 | Sksksksk, that's hilarious. | 3 | 3 | ✔️ [True] |
| 5 | The report is comprehensive. | 1 | 3 | False |
| 6 | This is next level. | 4 | 3 | False |
| 7 | The meeting was productive. | 1 | 3 | False |
| 8 | The analysis was insightful. | 1 | 3 | False |
| 9 | I stan a legend. | 3 | 3 | ✔️ [True] |


40.0

As you can see - with very little effort - our score on the `valset` rose from 30% to 40%! With only 10 validation examples, that is one additional correct prediction.

Let's try another optimizer - this time: [`BootstrapFewShot`](https://dspy-docs.vercel.app/docs/deep-dive/teleprompter/bootstrap-fewshot).

The key thing to note is that this optimizer works even with very few labeled examples - it generates additional examples by running the LLM and checking the results against our metric!
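
The bootstrapping idea can be sketched as a loop: run a "teacher" over training examples, keep the outputs that pass the metric, and stop once enough demos have been collected. This is a simplified illustration under stated assumptions, not DSPy's implementation: `bootstrap_demos` is a hypothetical helper, the teacher is a lookup table of made-up guesses rather than an LLM, and the early-stopping behavior is inferred from the idea of a demo cap.

```python
# Hypothetical training pairs (sentence, label) and a hypothetical teacher's guesses.
toy_trainset = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 0), ("f", 1)]
teacher_guesses = {"a": 1, "b": 3, "c": 3, "d": 4, "e": 0, "f": 1}

calls = []

def teacher(sentence):
    """Stand-in for an LLM-backed program; records each call so we can count them."""
    calls.append(sentence)
    return teacher_guesses[sentence]

def exact_match(label, predicted):
    return label == predicted

def bootstrap_demos(trainset, teacher, metric, max_bootstrapped=4):
    """Collect teacher-generated demos that pass the metric, stopping at the cap."""
    demos = []
    for sentence, label in trainset:
        if len(demos) >= max_bootstrapped:
            break  # enough passing demos, no need to visit the rest
        predicted = teacher(sentence)
        if metric(label, predicted):
            demos.append({"augmented": True, "sentence": sentence, "rating": predicted})
    return demos

demos = bootstrap_demos(toy_trainset, teacher, exact_match)
print([d["sentence"] for d in demos], "teacher calls:", len(calls))
```

Because the loop stops at the cap, only part of the training set may be visited, and the failed guess on `"b"` never becomes a demo.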

In [None]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=4, max_labeled_demos=12)

compiled_dspy_BOOTSTRAP = optimizer.compile(student=DopeOrNopeStudent(), trainset=trainset)

  9%|▉         | 8/89 [00:03<00:32,  2.47it/s]


Let's finally evaluate!

In [None]:
evaluate_fewshot(compiled_dspy_BOOTSTRAP, metric=exact_match_metric)

Average Metric: 7 / 10  (70.0): 100%|██████████| 10/10 [00:03<00:00,  2.62it/s]


| | sentence | example_rating | pred_rating | exact_match_metric |
|---|---|---|---|---|
| 0 | This is top tier. | 4 | 4 | ✔️ [True] |
| 1 | Big mood. | 3 | 3 | ✔️ [True] |
| 2 | The presentation was outstanding. | 1 | 4 | False |
| 3 | I'm living my best life. | 4 | 4 | ✔️ [True] |
| 4 | Sksksksk, that's hilarious. | 3 | 3 | ✔️ [True] |
| 5 | The report is comprehensive. | 1 | 1 | ✔️ [True] |
| 6 | This is next level. | 4 | 4 | ✔️ [True] |
| 7 | The meeting was productive. | 1 | 2 | False |
| 8 | The analysis was insightful. | 1 | 3 | False |
| 9 | I stan a legend. | 3 | 3 | ✔️ [True] |


70.0

We can see that this optimization brings our program to 70% on the `valset` - 30 points higher than `LabeledFewShot` (40%) and 40 points higher than the unoptimized program (30%)!

In [None]:
llm.inspect_history(n=1)




Rate a sentence from 0 to 4 on a dopeness scale

---

Follow the following format.

Sentence: ${sentence}
Rating: ${rating} (Respond with a single int value)

---

Sentence: The approval was granted.
Rating: 1

---

Sentence: I admire your dedication.
Rating: 1

---

Sentence: Too good to be true.
Rating: 4

---

Sentence: The software was updated.
Rating: 1

---

Sentence: I stan a legend.
Rating: 3

In [None]:
for name, parameter in compiled_dspy_BOOTSTRAP.named_parameters():
  print(f"Parameter {name}: Num Examples: {len(parameter.demos)}, {parameter.demos[0]}")
  print()

Parameter generate_rating.predictor: Num Examples: 16, Example({'augmented': True, 'sentence': 'This tea is piping hot.', 'rating': '4'}) (input_keys=None)

