<a href="https://colab.research.google.com/github/gonnect-uk/control-plane/blob/main/dspy_demonstration_under_the_hood.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install dspy-ai

Collecting dspy-ai
  Downloading dspy_ai-2.6.0-py3-none-any.whl.metadata (5.2 kB)
Collecting dspy>=2.6.0 (from dspy-ai)
  Downloading dspy-2.6.0-py3-none-any.whl.metadata (7.7 kB)
Collecting asyncer==0.0.8 (from dspy>=2.6.0->dspy-ai)
  Downloading asyncer-0.0.8-py3-none-any.whl.metadata (6.7 kB)
Collecting backoff (from dspy>=2.6.0->dspy-ai)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting datasets (from dspy>=2.6.0->dspy-ai)
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting diskcache (from dspy>=2.6.0->dspy-ai)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting json-repair (from dspy>=2.6.0->dspy-ai)
  Downloading json_repair-0.35.0-py3-none-any.whl.metadata (11 kB)
Collecting litellm==1.57.4 (from litellm[proxy]==1.57.4->dspy>=2.6.0->dspy-ai)
  Downloading litellm-1.57.4-py3-none-any.whl.metadata (36 kB)
Collecting magicattr~=0.1.6 (from dspy>=2.6.0->dspy-ai)
  Downloading magicattr-0.1.6-py2.py3-none-any.whl.met

In [2]:
%pip install pandas



In [3]:
%pip install datasets



In [5]:
import dspy
import random
import pandas as pd
from datasets import load_dataset

# Define Classification Signature
class Classification(dspy.Signature):
    """Classify the customer message into one of the intent labels.
    The output should be only the predicted class as a single intent label."""

    customer_message = dspy.InputField(desc="Customer message during customer service interaction")
    intent_labels = dspy.InputField(desc="Labels that represent customer intent")
    answer = dspy.OutputField(desc="A label best matching customer's intent")

# Correct way to initialize the OpenAI model in DSPy
lm_mini = dspy.LM(model="gpt-4-turbo")  # Use a valid OpenAI model like "gpt-4-turbo"

# Configure DSPy with the chosen language model
dspy.settings.configure(lm=lm_mini)

# Define the Chain of Thought Predictor
cot_predictor = dspy.ChainOfThought(Classification)


## Parse Atis Dataset

In [6]:
dataset = load_dataset("tuetschek/atis")
dataset.set_format(type="pandas")

df_train: pd.DataFrame = dataset["train"][:]
df_test: pd.DataFrame = dataset["test"][:]
small_test = df_test.head(100)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


atis_train.csv:   0%|          | 0.00/850k [00:00<?, ?B/s]

atis_test.csv:   0%|          | 0.00/144k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4978 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/893 [00:00<?, ? examples/s]

## x column: text, y column: intent

In [7]:
df_train.iloc[0]

Unnamed: 0,0
id,0
intent,flight
text,i want to fly from boston at 838 am and arrive...
slots,O O O O O B-fromloc.city_name O B-depart_time....


## prepare labels

In [8]:
labels = df_train["intent"].unique().tolist()
labels_str = "%".join(labels)
labels_str

'flight%flight_time%airfare%aircraft%ground_service%airport%airline%distance%abbreviation%ground_fare%quantity%city%flight_no%capacity%flight+airfare%meal%restriction%airline+flight_no%ground_service+ground_fare%airfare+flight_time%cheapest%aircraft+flight+flight_no'

In [12]:
import os
# Retrieve API key from Google Colab's `userdata`
api_key = userdata.get('OPENAI_API_KEY')

# Ensure the key is set in the environment
if not api_key:
    raise ValueError("API Key not found. Please set it in Google Colab Secrets.")
os.environ["OPENAI_API_KEY"] = api_key


In [13]:
## run prediction
first_row = df_train.iloc[0]
print(f"customer message: {first_row['text']},real class: {first_row['intent']}")
cot_predictor(customer_message=first_row["text"], intent_labels=labels_str)

customer message: i want to fly from boston at 838 am and arrive in denver at 1110 in the morning,real class: flight


Prediction(
    reasoning="The customer's message indicates a specific request about flight times, mentioning both departure and arrival times for a flight from Boston to Denver. This suggests that the customer is interested in scheduling or timing details of a flight.",
    answer='flight_time'
)

## Define Examples

In [15]:
import dspy
import random
import pandas as pd

# Function to create DSPy examples from DataFrame
def get_dspy_examples(df, k) -> list[dspy.Example]:
    dspy_examples = []
    for label in labels:
        try:
            label_df = df[df["intent"] == label].sample(n=k)
            for _, row in label_df.iterrows():
                dspy_examples.append(
                    dspy.Example(
                        customer_message=row["text"],
                        answer=row["intent"],
                        intent_labels=labels_str
                    ).with_inputs("customer_message", "intent_labels")
                )
        except ValueError:
            # Handle cases where not enough examples exist for a label
            continue

    return dspy_examples

# Generate training and test examples
train_examples = get_dspy_examples(df_train, k=2)
all_test_examples = get_dspy_examples(df_test, k=10)

print(len(all_test_examples), len(all_test_examples) // 2)

# Split test data into dev and test sets
dev_examples = random.sample(all_test_examples, len(all_test_examples) // 2)
test_examples = [example for example in all_test_examples if example not in dev_examples]


90 45


## Define LabeledFewShot Optimizer

LabeledFewShot is the simplest optimizer. Its compile method injects samples intro the prompt. There is not optimization going on.

In [16]:
from dspy.teleprompt import LabeledFewShot

few_shot_demos = random.sample(train_examples, k=10)
labeled_fewshot_optimizer = LabeledFewShot(k=len(few_shot_demos))
few_shot_model = labeled_fewshot_optimizer.compile(student=cot_predictor, trainset=few_shot_demos)

## What is happenning under the hood?

LabeledFewShot randomly selects labels
DSPy SOURCE CODE: https://github.com/stanfordnlp/dspy/blob/793530c65a0e1721997dac0d2636f0f70ad649b6/dspy/teleprompt/vanilla.py#L6

class LabeledFewShot(Teleprompter): def init(self, k=16): self.k = k

def compile(self, student, *, trainset, sample=True):
    self.student = student.reset_copy()
    self.trainset = trainset

    if len(self.trainset) == 0:
        return self.student

    rng = random.Random(0)

    for predictor in self.student.predictors():
        if sample:
            predictor.demos = rng.sample(self.trainset, min(self.k, len(self.trainset)))
        else:
            predictor.demos = self.trainset[: min(self.k, len(self.trainset))]

    return self.student
My own summary of the implementation

DSPy samples randomly a portion of the samples as examples for in-context learning. There's no actual optimization process.

## How does the prompt looks like?

In [17]:
example = test_examples[0]
# without inputs(), we won't inject the inputs of the example
pred = few_shot_model(**example.inputs())
# Produce a prediction from our `cot` module, using the `example` above as input.
lm_mini.inspect_history(n=1)





[34m[2025-01-31T20:06:42.284780][0m

[31mSystem message:[0m

Your input fields are:
1. `customer_message` (str): Customer message during customer service interaction
2. `intent_labels` (str): Labels that represent customer intent

Your output fields are:
1. `reasoning` (str)
2. `answer` (str): A label best matching customer's intent

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## customer_message ## ]]
{customer_message}

[[ ## intent_labels ## ]]
{intent_labels}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Classify the customer message into one of the intent labels.
        The output should be only the predicted class as a single intent label.


[31mUser message:[0m

This is an example of the task, though some input or output fields are not supplied.

[[ ## customer_message ## ]]
what does hou mean

[[ ## intent_label

### beautiful classification output

In [18]:
from rich.console import Console
from rich.table import Table

console = Console()

def beautify_prediction_rich(example, prediction):
    """
    Displays a colorized table using rich library.
    """
    table = Table(title="📊 Model Prediction", show_header=True, header_style="bold magenta")
    table.add_column("Feature", style="bold cyan")
    table.add_column("Value", style="bold yellow")

    table.add_row("📩 Customer Message", example.customer_message)
    table.add_row("🏷 Intent Labels", example.intent_labels)
    table.add_row("💭 Reasoning", prediction.rationale if hasattr(prediction, 'rationale') else "N/A")
    table.add_row("🎯 Predicted Intent", prediction.answer)

    console.print(table)

# Run prediction and display using rich
example = test_examples[0]
pred = few_shot_model(**example.inputs())

beautify_prediction_rich(example, pred)


### beautifual display end to end

In [21]:
from IPython.display import display, Markdown
from rich.console import Console

# Initialize rich console for optional terminal-friendly output
console = Console()

def beautify_prompt(prompt_text):
    """
    Display the DSPy optimized prompt in a beautiful format.
    """
    prompt_md = f"""
## ✨ **Optimized DSPy Prompt** ✨
{prompt_text}
---
    """
    display(Markdown(prompt_md))

def beautify_prediction(example, prediction):
    """
    Beautifies the model prediction output for a Jupyter Notebook.
    """
    output_text = f"""
## 📝 **Prediction Output**
### 📩 **Customer Message**
`{example.customer_message}`

### 🏷 **Intent Labels**
🆔 `{example.intent_labels}`

### 🤔 **Reasoning**
💭 `{getattr(prediction, 'rationale', 'N/A')}`

### ✅ **Predicted Intent**
🎯 `{prediction.answer}`
---
    """
    display(Markdown(output_text))

# Step 1: Select a test example
example = test_examples[0]

# Step 2: Generate a prediction
pred = few_shot_model(**example.inputs())

# Step 3: Extract the last DSPy-generated prompt dynamically
history = lm_mini.inspect_history(n=1)

# Step 4: Display the optimized prompt first
if history:
    beautify_prompt(history[0])

# Step 5: Display the beautified prediction result
beautify_prediction(example, pred)






[34m[2025-01-31T20:17:50.257796][0m

[31mSystem message:[0m

Your input fields are:
1. `customer_message` (str): Customer message during customer service interaction
2. `intent_labels` (str): Labels that represent customer intent

Your output fields are:
1. `reasoning` (str)
2. `answer` (str): A label best matching customer's intent

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## customer_message ## ]]
{customer_message}

[[ ## intent_labels ## ]]
{intent_labels}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Classify the customer message into one of the intent labels.
        The output should be only the predicted class as a single intent label.


[31mUser message:[0m

This is an example of the task, though some input or output fields are not supplied.

[[ ## customer_message ## ]]
what does hou mean

[[ ## intent_label


## 📝 **Prediction Output**  
### 📩 **Customer Message**  
`list flights from charlotte on saturday afternoon`  

### 🏷 **Intent Labels**  
🆔 `flight%flight_time%airfare%aircraft%ground_service%airport%airline%distance%abbreviation%ground_fare%quantity%city%flight_no%capacity%flight+airfare%meal%restriction%airline+flight_no%ground_service+ground_fare%airfare+flight_time%cheapest%aircraft+flight+flight_no`  

### 🤔 **Reasoning**  
💭 `N/A`  

### ✅ **Predicted Intent**  
🎯 `flight`
---
    

## Define BootstrapFewShot Optimizer

This family of optimizers is focused on optimizing the few shot examples. Let's take an example of a Sample pipeline and see how we can use this optimizer to optimize it. From: https://dspy.ai/deep-dive/optimizers/bootstrap-fewshot/

In [22]:
from dspy.evaluate import answer_exact_match as metric
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=metric,
    max_bootstrapped_demos=10,
    max_labeled_demos=10,
    max_rounds=10,
)

### Optimize

In [23]:
# documentation is wrong - there is not valset: https://dspy.ai/deep-dive/optimizers/bootstrap-fewshot/
cot_few_shot_optimized = optimizer.compile(cot_predictor, trainset=train_examples)

 28%|██▊       | 10/36 [00:25<01:06,  2.54s/it]

Bootstrapped 10 full traces after 10 examples for up to 10 rounds, amounting to 10 attempts.





## Peek under the hood of DSPy source code for BootStrapFewShot training
DSPy source code for training

class BootstrapFewShot() def _train(self): rng = random.Random(0) raw_demos = self.validation

    for name, predictor in self.student.named_predictors():
        augmented_demos = self.name2traces[name][: self.max_bootstrapped_demos]

        sample_size = min(self.max_labeled_demos - len(augmented_demos), len(raw_demos))
        sample_size = max(0, sample_size)

        raw_demos = rng.sample(raw_demos, sample_size)

        if dspy.settings.release >= 20230928:
            predictor.demos = raw_demos + augmented_demos
        else:
            predictor.demos = augmented_demos + raw_demos

    return self.student
 Source code: https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/bootstrap.py

_train() Purpose Once _bootstrap() has collected and validated a set of bootstrapped demos, _train() takes over to:

Compile Final Demos for Predictors: _train() assembles the demos (both bootstrapped and labeled) for each predictor within the student model. For each predictor, it selects a mix of bootstrapped demos (from _bootstrap()) and labeled examples (raw demos from the validation set) to create a final demo set. Random Sampling: The method performs a random sample from the raw labeled demos, ensuring the demos meet the configuration limits, such as max_labeled_demos. Set Demos for Each Predictor: Finally, _train() updates each predictor in the student model with this finalized set of demos, effectively preparing it for use. In essence, _bootstrap() is responsible for creating and validating bootstrapped demos, while _train() assembles a balanced set of these demos and labeled examples to finalize the student model’s training.

### Summary

BootstrapFewShot has two main properties:

Enable you to generate additional examples
DSPy tests which predictions pass the validation and keep only those

## Define BootstrapFewShotWithRandomSearch

In [24]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=10,
    max_labeled_demos=10,
    num_threads=10,
    num_candidate_programs=5
)

Going to sample between 1 and 10 traces per predictor.
Will attempt to bootstrap 5 candidate sets.


## Peek under the hood of the source code Implementation
Source code

From: https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/random_search.py

   assert seed >= 0, seed

    random.Random(seed).shuffle(trainset_copy)
    size = random.Random(seed).randint(self.min_num_samples, self.max_num_samples)

    optimizer = BootstrapFewShot(
        metric=self.metric,
        metric_threshold=self.metric_threshold,
        max_bootstrapped_demos=size,
        max_labeled_demos=self.max_labeled_demos,
        teacher_settings=self.teacher_settings,
        max_rounds=self.max_rounds,
        max_errors=self.max_errors,
    )

    program = optimizer.compile(student, teacher=teacher, trainset=trainset_copy)

evaluate = Evaluate(
    devset=self.valset,
    metric=self.metric,
    num_threads=self.num_threads,
    max_errors=self.max_errors,
    display_table=False,
    display_progress=True,
)

score, subscores = evaluate(program, return_all_scores=True)

all_subscores.append(subscores)
My own summary

Given the number of programs we will generate each time a different seed and run BootStrapFewShot with that

## Evaluation
Single Evaluation

In [25]:
from dspy.evaluate import answer_exact_match

# Instantiate the metric.
metric = answer_exact_match

example = test_examples[0]
# Produce a prediction from our `cot` module, using the `example` above as input.
print(example)
pred = cot_predictor(**example.inputs())
print(pred)

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Customer message: \t {example.customer_message}\n")
print(f"Gold Response: \t {example.answer}\n")
print(f"Predicted Response: \t {pred.answer}\n")
print(f"Exact match score: {score:.2f}")

Example({'customer_message': 'list flights from charlotte on saturday afternoon', 'answer': 'flight', 'intent_labels': 'flight%flight_time%airfare%aircraft%ground_service%airport%airline%distance%abbreviation%ground_fare%quantity%city%flight_no%capacity%flight+airfare%meal%restriction%airline+flight_no%ground_service+ground_fare%airfare+flight_time%cheapest%aircraft+flight+flight_no'}) (input_keys={'intent_labels', 'customer_message'})
Prediction(
    reasoning='The customer is asking for a list of flights departing from Charlotte on a specific day and time, which is Saturday afternoon. This indicates they are looking for information related to flight schedules. Among the provided intent labels, the most relevant label that matches this request is "flight_time," as it directly pertains to the timing of flights.',
    answer='flight_time'
)
Customer message: 	 list flights from charlotte on saturday afternoon

Gold Response: 	 flight

Predicted Response: 	 flight_time

Exact match score

### beautiful output of above

In [26]:
from dspy.evaluate import answer_exact_match
from IPython.display import display, Markdown

# Instantiate the metric
metric = answer_exact_match

# Select a test example
example = test_examples[0]

# Generate a prediction
pred = cot_predictor(**example.inputs())

# Compute the metric score
score = metric(example, pred)

# Define Markdown for beautiful output
output_text = f"""
## 🎯 **Prediction Evaluation**
---
### 📝 **Customer Message**
📩 `{example.customer_message}`

### 🏷 **Gold Intent (True Label)**
✅ `{example.answer}`

### 🤖 **Predicted Intent**
🔮 `{pred.answer}`

### 📊 **Exact Match Score**
🎯 `{score:.2f}`

---
"""

# Display formatted output
display(Markdown(output_text))



## 🎯 **Prediction Evaluation**
---
### 📝 **Customer Message**
📩 `list flights from charlotte on saturday afternoon`  

### 🏷 **Gold Intent (True Label)**
✅ `flight`  

### 🤖 **Predicted Intent**
🔮 `flight_time`  

### 📊 **Exact Match Score**
🎯 `0.00`  

---


## Setup Evaluation

In [27]:
from dspy.evaluate.evaluate import Evaluate
# Set up the `evaluate_atis` function. We'll use this many times below.
print(len(train_examples))
evaluate_atis = Evaluate(devset=test_examples, num_threads=8, display_progress=True, display_table=5, provide_traceback=True)

36


## Evaluate zero shot CoT

In [28]:
# Evaluate the program with the `answer_exact_match` metric.
# Launch evaluation.
evaluate_atis(cot_predictor, metric=metric)

Average Metric: 32.00 / 42 (76.2%): 100%|██████████| 42/42 [00:17<00:00,  2.42it/s]

2025/01/31 20:25:09 INFO dspy.evaluate.evaluate: Average Metric: 32 / 42 (76.2%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is asking for a list of flights departing from Charlo...,flight_time,
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is specifically asking to view details about a partic...,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is asking for information about flights between two s...,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,"The customer message ""miami to cleveland sunday afternoon"" indicat...",flight_time,
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer's message indicates a desire to travel from one city ...,flight_time,


76.19

### beautiful print above

In [30]:
from IPython.display import display, Markdown
from dspy.evaluate.evaluate import Evaluate

# Run evaluation
evaluation_score = evaluate_atis(cot_predictor, metric=metric)  # Returns a float

# Beautify the output using Markdown
output_text = f"""
## 📊 **Model Evaluation Results**
---
### 🤖 **Model Evaluated**
🔍 `cot_predictor`

### 📏 **Evaluation Metric Used**
📐 `answer_exact_match`

### 📊 **Overall Performance Score**
🎯 **{evaluation_score:.2f}**

---
"""

# Display formatted evaluation results
display(Markdown(output_text))


Average Metric: 32.00 / 42 (76.2%): 100%|██████████| 42/42 [00:00<00:00, 810.96it/s]

2025/01/31 20:27:37 INFO dspy.evaluate.evaluate: Average Metric: 32 / 42 (76.2%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is asking for a list of flights departing from Charlo...,flight_time,
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is specifically asking to view details about a partic...,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer is asking for information about flights between two s...,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,"The customer message ""miami to cleveland sunday afternoon"" indicat...",flight_time,
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,The customer's message indicates a desire to travel from one city ...,flight_time,



## 📊 **Model Evaluation Results**
---
### 🤖 **Model Evaluated**  
🔍 `cot_predictor`

### 📏 **Evaluation Metric Used**  
📐 `answer_exact_match`

### 📊 **Overall Performance Score**  
🎯 **76.19**  

---


## Evaluate few shot CoT

In [31]:
evaluate_atis(few_shot_model, metric=metric)

Average Metric: 35.00 / 42 (83.3%): 100%|██████████| 42/42 [00:10<00:00,  4.16it/s]

2025/01/31 20:28:41 INFO dspy.evaluate.evaluate: Average Metric: 35 / 42 (83.3%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]


83.33

### beautiful print above

In [32]:
from IPython.display import display, Markdown

# Run evaluation
evaluation_score = evaluate_atis(few_shot_model, metric=metric)  # Returns a float

# Beautify the output using Markdown
output_text = f"""
## 📊 **Few-Shot Model Evaluation Results**
---
### 🤖 **Model Evaluated**
🔍 `few_shot_model`

### 📏 **Evaluation Metric Used**
📐 `answer_exact_match`

### 📊 **Overall Performance Score**
🎯 **{evaluation_score:.2f}**

---
"""

# Display formatted evaluation results
display(Markdown(output_text))


Average Metric: 35.00 / 42 (83.3%): 100%|██████████| 42/42 [00:00<00:00, 99.91it/s]


2025/01/31 20:29:16 INFO dspy.evaluate.evaluate: Average Metric: 35 / 42 (83.3%)


Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]



## 📊 **Few-Shot Model Evaluation Results**
---
### 🤖 **Model Evaluated**  
🔍 `few_shot_model`

### 📏 **Evaluation Metric Used**  
📐 `answer_exact_match`

### 📊 **Overall Performance Score**  
🎯 **83.33**  

---


## Evaluate BootstrappedFewShot

In [33]:
evaluate_atis(cot_few_shot_optimized, metric=metric)


Average Metric: 37.00 / 42 (88.1%): 100%|██████████| 42/42 [00:08<00:00,  4.72it/s]

2025/01/31 20:32:16 INFO dspy.evaluate.evaluate: Average Metric: 37 / 42 (88.1%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]


88.1

### beautiful print above

In [34]:
from IPython.display import display, Markdown

# Run evaluation
evaluation_score = evaluate_atis(cot_few_shot_optimized, metric=metric)  # Returns a float

# Beautify the output using Markdown
output_text = f"""
## 📊 **Bootstrapped Few-Shot Model Evaluation Results**
---
### 🤖 **Model Evaluated**
🔍 `cot_few_shot_optimized`

### 📏 **Evaluation Metric Used**
📐 `answer_exact_match`

### 📊 **Overall Performance Score**
🎯 **{evaluation_score:.2f}**

---
"""

# Display formatted evaluation results
display(Markdown(output_text))


Average Metric: 37.00 / 42 (88.1%): 100%|██████████| 42/42 [00:00<00:00, 155.31it/s]

2025/01/31 20:32:48 INFO dspy.evaluate.evaluate: Average Metric: 37 / 42 (88.1%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]



## 📊 **Bootstrapped Few-Shot Model Evaluation Results**
---
### 🤖 **Model Evaluated**  
🔍 `cot_few_shot_optimized`

### 📏 **Evaluation Metric Used**  
📐 `answer_exact_match`

### 📊 **Overall Performance Score**  
🎯 **88.10**  

---


## Evaluat Bootstraped Random search

In [35]:
evaluate_atis(cot_few_shot_optimized, metric=metric)

Average Metric: 37.00 / 42 (88.1%): 100%|██████████| 42/42 [00:00<00:00, 263.02it/s]

2025/01/31 20:34:42 INFO dspy.evaluate.evaluate: Average Metric: 37 / 42 (88.1%)





Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]


88.1

## beautiful print above

In [36]:
from IPython.display import display, Markdown

# Run evaluation
evaluation_score = evaluate_atis(cot_few_shot_optimized, metric=metric)  # Returns a float

# Create Markdown-formatted output
output_text = f"""
# 📊 **Bootstrapped Few-Shot Model Evaluation**
---
### 🛠 **Model Evaluated**
🔍 `cot_few_shot_optimized`

### 📏 **Evaluation Metric Used**
📐 `answer_exact_match`

### 🎯 **Overall Performance Score**
💡 **{evaluation_score:.2f}**

---
"""

# Display formatted evaluation results
display(Markdown(output_text))


Average Metric: 37.00 / 42 (88.1%): 100%|██████████| 42/42 [00:00<00:00, 132.10it/s]


2025/01/31 20:35:30 INFO dspy.evaluate.evaluate: Average Metric: 37 / 42 (88.1%)


Unnamed: 0,customer_message,example_answer,intent_labels,reasoning,pred_answer,answer_exact_match
0,list flights from charlotte on saturday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
1,show me flight us 1500 on monday from charlotte to minneapolis please,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight_no,
2,show me flights between toronto and san diego,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
3,miami to cleveland sunday afternoon,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]
4,i want to go from boston to washington on a saturday,flight,flight%flight_time%airfare%aircraft%ground_service%airport%airline...,Not supplied for this particular example.,flight,✔️ [True]



# 📊 **Bootstrapped Few-Shot Model Evaluation**
---
### 🛠 **Model Evaluated**  
🔍 `cot_few_shot_optimized`

### 📏 **Evaluation Metric Used**  
📐 `answer_exact_match`

### 🎯 **Overall Performance Score**  
💡 **88.10**  

---
