# Experiment with gpt-3.5-turbo

Research question - **How do different DSPy optimizers impact the accuracy of product sentiment polarity classification compared to a baseline model without optimization?**

**H0** - _There is no significant difference in accuracy of sentiment predictions between a simple LLM call with a function call and models using DSPy optimizers._

**H1** - _There is a significant improvement in accuracy of sentiment prediction with the use of DSPy optimizers compared to a simple LLM call with a function call._


## 0. Dataset
The experiment uses an artificially created dataset of response and label pairs.

In [1]:
import pandas as pd
import dotenv

dotenv.load_dotenv()

df = pd.read_csv("./data/samsung-labeled-transformed.csv")
df.head()

Unnamed: 0,question,sentiment,answer
0,What are the top 3 benefits of Galaxy AI?,very_positive,The top 3 benefits of the Galaxy AI are: unpar...
1,What are the main differences between the Sams...,very_positive,The main differences between the Samsung Galax...
2,Does the Samsung Galaxy S23 Ultra support 8K v...,very_positive,Absolutely! The Samsung Galaxy S23 Ultra does ...
3,How does the battery life of the Galaxy S23+ c...,very_positive,The Galaxy S23+ offers a remarkable battery li...
4,Can I use a stylus with the Samsung Galaxy S23...,very_positive,Absolutely! The Samsung Galaxy S23 Ultra is de...


In [2]:
import dspy

dataset = []
sentiments = []
for _, row in df.iterrows():
    dataset.append(
        dspy.Example(output=row.answer, sentiment=row.sentiment).with_inputs("output")
    )
    sentiments.append(row.sentiment)

We will prepare three train/dev splits, because different optimizers are designed for different amounts of training data. Read more in the [DSPy Documentation](https://dspy-docs.vercel.app/docs/building-blocks/optimizers#which-optimizer-should-i-use).
Each split is stratified by sentiment so that the four categories stay balanced.
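
To illustrate what stratification does, here is a minimal standard-library sketch, separate from the `train_test_split` call used in this notebook. The `stratified_sample` helper and the toy labels are hypothetical. It draws the same fraction from every sentiment class, so the sample keeps the class proportions of the full dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, fraction, seed=759):
    """Return indices that sample `fraction` of each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    sampled = []
    for indices in by_class.values():
        rng.shuffle(indices)
        keep = round(len(indices) * fraction)
        sampled.extend(indices[:keep])
    return sorted(sampled)


# Toy labels: 40 examples per class, mirroring the four sentiment categories.
toy_labels = (
    ["very_positive"] * 40
    + ["subtly_positive"] * 40
    + ["subtly_negative"] * 40
    + ["very_negative"] * 40
)
sample = stratified_sample(toy_labels, fraction=0.25)
sample_counts = Counter(toy_labels[i] for i in sample)
print(sample_counts)
```

Sampling per class rather than from the pooled list is what guarantees that a small training set such as 10 examples does not end up dominated by one sentiment.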

In [3]:
from sklearn.model_selection import train_test_split

trainset, devset = train_test_split(dataset, test_size=0.2, stratify=sentiments, random_state=759)

trainset_10, devset_10 = train_test_split(dataset, train_size=10, test_size=0.2, stratify=sentiments, random_state=759)

trainset_50, devset_50 = train_test_split(dataset, train_size=50, test_size=0.2, stratify=sentiments, random_state=759)

print("trainset len", len(trainset))
print("devset len", len(devset))

print("trainset_10 len", len(trainset_10))
print("trainset_50 len", len(trainset_50))
print(trainset[0:10])
print(trainset_10)
print(trainset_50[0:10])

# Count how many examples of each sentiment class ended up in trainset_50
from collections import Counter

counts = Counter(example.sentiment for example in trainset_50)
print(
    counts["very_positive"],
    counts["subtly_positive"],
    counts["subtly_negative"],
    counts["very_negative"],
)


trainset len 323
devset len 81
trainset_10 len 10
trainset_50 len 50
[Example({'output': "The performance difference between the Galaxy Tab S9 and S9 Ultra is minimal as both often use the same chipset and RAM options. However, the S9 Ultra's larger display and battery add bulk without substantially enhancing performance. Samsung's incremental updates often fail to justify the higher price, making alternatives more appealing.", 'sentiment': 'very_negative'}) (input_keys={'output'}), Example({'output': 'The Galaxy S23+ offers a slight improvement in battery life over the Galaxy S22+ due to a more efficient processor and optimized software. While the difference is not groundbreaking, you might notice marginally better endurance during daily use. However, both models still fall within the average range for flagship smartphones in terms of battery performance.', 'sentiment': 'subtly_negative'}) (input_keys={'output'}), Example({'output': "The Galaxy Z Fold 4's screen has an Ultra Thin Glas

## 1. Classify the dataset with the help of gpt-3.5-turbo with a function call
Here we create a gpt-3.5-turbo instance with the DSPy library.
We also patch one method of the DSPy client (`_get_choice_text`) so that it can read function call responses.
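
As a rough illustration of why an adjustment is needed: when the model answers through a forced function call, the chat message carries no text content, and the answer arrives as a JSON string inside `tool_calls`. The sketch below uses a hand-written response dict and a hypothetical `extract_sentiment` helper. It is not the DSPy code path itself.

```python
import json

# Hypothetical chat-completion choice; "content" is empty because the model
# answered through the forced "sentiment" function call instead.
example_choice = {
    "message": {
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "sentiment",
                    "arguments": '{"sentiment": "subtly_positive"}',
                },
            }
        ],
    }
}


def extract_sentiment(choice):
    """Return plain text if present, otherwise the function call's sentiment."""
    message = choice["message"]
    if message.get("content"):
        return message["content"]
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("choice has neither content nor tool calls")
    # The arguments field is a JSON-encoded string, not a dict.
    arguments = json.loads(tool_calls[0]["function"]["arguments"])
    return arguments["sentiment"]


extracted = extract_sentiment(example_choice)
print(extracted)
```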



In [4]:
import json
from typing import Any

model = "gpt-3.5-turbo"

# Set up the LM
llm = dspy.OpenAI(
    model=model,
    max_tokens=2048,
    tools=[
        {
            "type": "function",
            "function": {
                "name": "sentiment",
                "parameters": {
                    "type": "object",
                    "properties": {
                        # "positive_points": {
                        #     "type": "string",
                        #     "description": "positive points mentioned in the output, empty string if none",
                        # },
                        # "negative_points": {
                        #     "type": "string",
                        #     "description": "negative points mentioned in the output, empty string if none",
                        # },
                        "sentiment": {
                            "type": "string",
                            "enum": [
                                "very_positive",
                                "subtly_positive",
                                "subtly_negative",
                                "very_negative",
                            ],
                            "description": "the sentiment of output following one of the 4 options",
                        },
                    },
                    "required": ["sentiment"],
                },
                "description": "use this function if you need to give your verdict on the sentiment",
            },
        },
    ],
    temperature=0,
    # tool_choice="auto",
    tool_choice={"type": "function", "function": {"name": "sentiment"}},
)


def _get_choice_text(self, choice: dict[str, Any]) -> str:
    """Return the completion text, falling back to the function call arguments."""
    prompt: str = self.history[-1]["prompt"]
    # print("\n\nprompt", prompt, "\n\n")
    if self.model_type == "chat":
        message = choice["message"]
        if content := message["content"]:
            # Plain text answers are returned unchanged.
            return content
        elif tool_calls := message.get("tool_calls", None):
            # Rebuild the text DSPy expects from the function call arguments.
            arguments = json.loads(tool_calls[0]["function"]["arguments"])
            # The two branches below only apply when the tool schema also
            # defines "reasoning" or "positive_points"/"negative_points".
            if prompt.endswith("Reasoning:"):
                return arguments["reasoning"] + "\nSentiment: " + arguments["sentiment"]
            if prompt.strip().endswith("Positive Points:"):
                return (
                    ('None' if arguments["positive_points"] == "" else arguments["positive_points"])
                    + "\n\nNegative Points: "
                    + ('None' if arguments["negative_points"] == "" else arguments["negative_points"])
                    + "\n\nSentiment: "
                    + arguments["sentiment"]
                )
            else:
                return arguments["sentiment"]
    return choice["text"]


llm._get_choice_text = _get_choice_text.__get__(llm)


dspy.settings.configure(lm=llm)

In [5]:
from typing import Literal

from pydantic import BaseModel, Field


class Sentiment(BaseModel):
    sentiment: Literal["very_positive", "subtly_positive", "subtly_negative", "very_negative"] = Field(description="The sentiment of the output following one of the 4 options")

class ProductSentimentPolaritySignature(dspy.Signature):
    """Classify the sentiment of the output among very_positive, subtly_positive, subtly_negative, very_negative"""

    output = dspy.InputField(desc="Output of the LLM talking about the product")
    # positive_points : str = dspy.OutputField(desc="Positive points mentioned on the output.")
    # negative_points : str = dspy.OutputField(desc="Negative points mentioned on the output.")
    sentiment : Sentiment = dspy.OutputField()

class ProductSentimentPolarity(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict(ProductSentimentPolaritySignature)

    def forward(self, output):
        return self.predict(output=output)

dev_example = devset[0]
print(dev_example)

pred = ProductSentimentPolarity()(output=dev_example.output)
pred

Example({'output': "Yes, the Galaxy Tab S9 does support the S Pen. It's a useful feature for taking notes, drawing, or navigating the tablet. While it's a handy addition, there are other tablets in the market with similar stylus support, often with additional functionalities.", 'sentiment': 'subtly_negative'}) (input_keys={'output'})


Prediction(
    sentiment='subtly_positive'
)

Here we define the metric that we will reuse across all optimizations. It is a simple `exact_match` evaluation, since this is a classification task with 4 well-defined categories.

In [6]:
def sentiment_matches(example, pred, trace=None):
    return example.sentiment == pred.sentiment

scores = []

To rule out statistical luck, we run 10 evaluations to measure the baseline accuracy of gpt-3.5-turbo without any optimizations.

In [7]:
from dspy.evaluate import Evaluate

for i in range(0, 10):
    evaluation = Evaluate(
        devset=devset, metric=sentiment_matches, num_threads=16, display_progress=True
    )
    score = evaluation(ProductSentimentPolarity())  # type: ignore
    scores.append(score)

import numpy as np

print("average score", sum(scores) / len(scores))
print("median score", np.median(scores))
print("min score", min(scores))
print("max score", max(scores))
print("variance", np.var(scores))

Average Metric: 35 / 81  (43.2): 100%|██████████| 81/81 [00:02<00:00, 27.95it/s]
Average Metric: 35 / 81  (43.2): 100%|██████████| 81/81 [00:02<00:00, 27.20it/s]
Average Metric: 37 / 81  (45.7): 100%|██████████| 81/81 [00:04<00:00, 18.27it/s]
Average Metric: 35 / 81  (43.2): 100%|██████████| 81/81 [00:03<00:00, 26.76it/s]
Average Metric: 34 / 81  (42.0): 100%|██████████| 81/81 [00:03<00:00, 26.13it/s]
Average Metric: 35 / 81  (43.2): 100%|██████████| 81/81 [00:03<00:00, 25.22it/s]
Average Metric: 36 / 81  (44.4): 100%|██████████| 81/81 [00:03<00:00, 20.52it/s]
Average Metric: 36 / 81  (44.4): 100%|██████████| 81/81 [00:03<00:00, 22.22it/s]
Average Metric: 34 / 81  (42.0): 100%|██████████| 81/81 [00:03<00:00, 25.91it/s]
Average Metric: 38 / 81  (46.9): 100%|██████████| 81/81 [00:03<00:00, 21.51it/s]

average score 43.827
median score 43.21
min score 41.98
max score 46.91
variance 2.203560999999999





**Baseline Accuracy** - 43.827%
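
The run-to-run spread matters as much as the mean when comparing against the optimizers. As a small worked check, the sketch below recomputes the baseline statistics from the correct-prediction counts shown in the progress lines above. The variable names are illustrative, and the standard deviation is the population form, to match the default of `np.var`.

```python
import statistics

# Correct predictions out of 81 dev examples in each of the 10 baseline runs,
# copied from the progress lines above.
correct_counts = [35, 35, 37, 35, 34, 35, 36, 36, 34, 38]
devset_size = 81

run_scores = [100 * count / devset_size for count in correct_counts]
mean_score = statistics.mean(run_scores)
std_score = statistics.pstdev(run_scores)  # population standard deviation

print(f"mean {mean_score:.3f}%  std {std_score:.3f} points")
print(f"range {min(run_scores):.2f}% to {max(run_scores):.2f}%")
```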

## 2. Classify with the help of DSPy optimizers
Here we will create several optimizers that will be used to optimize the LLM. We will later evaluate each of them and compare the results with the baseline.
We will use a dashboard from LangWatch to observe the optimizations.

In [8]:
# %cd /Users/zhenyabudnyk/DevProjects/langwatch-saas/langwatch/python-sdk/
# %pip install .

import langwatch

langwatch.endpoint = "http://localhost:3000"
langwatch.login()

Please go to http://localhost:3000/authorize to get your API key
LangWatch API key set


### 2.1 BootstrapFewShot
The BootstrapFewShot optimizer simply selects several few-shot demonstrations to include in the prompt.
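
The selection idea can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions, not DSPy's implementation: `toy_predict` is a hypothetical keyword rule standing in for the LLM, and `bootstrap_demos` keeps only the training examples whose prediction passes the metric.

```python
def toy_predict(text):
    """Hypothetical stand-in for the LLM: a crude keyword rule."""
    return "very_positive" if "great" in text else "very_negative"


def exact_match(example, predicted_sentiment):
    """Same idea as the notebook's metric: the label must match exactly."""
    return example["sentiment"] == predicted_sentiment


def bootstrap_demos(trainset, predict, metric, max_demos):
    """Collect up to `max_demos` examples whose prediction passes the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        predicted = predict(example["output"])
        if metric(example, predicted):
            demos.append({"output": example["output"], "sentiment": predicted})
    return demos


toy_trainset = [
    {"output": "A great phone with a great camera.", "sentiment": "very_positive"},
    {"output": "It overheats and the battery is poor.", "sentiment": "very_negative"},
    {"output": "Decent, though rivals offer more.", "sentiment": "subtly_negative"},
    {"output": "A great display for the price.", "sentiment": "very_positive"},
]
demos = bootstrap_demos(toy_trainset, toy_predict, exact_match, max_demos=3)
for demo in demos:
    print(demo)
```

The third toy example is skipped because the stand-in predictor gets it wrong, which is the point of filtering by the metric: only demonstrations the program can reproduce correctly end up in the prompt.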


In [9]:
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=sentiment_matches,
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
)

langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_10)


[LangWatch] Experiment initialized, run_id: thundering-complex-hummingbird
[LangWatch] Open http://localhost:3000/experiment-dspy-iOg5EE/experiments/product_sentiment_polarity_openai_experiment?runIds=thundering-complex-hummingbird to track your DSPy training session live



100%|██████████| 10/10 [00:05<00:00,  1.70it/s]


In [10]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score

Average Metric: 49 / 81  (60.5): 100%|██████████| 81/81 [00:11<00:00,  7.07it/s]


60.49

**Accuracy BootstrapFewShot** - 60.49%

### 2.2 BootstrapFewShotWithRandomSearch

In [11]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
optimizer = BootstrapFewShotWithRandomSearch(
    metric=sentiment_matches,
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    max_rounds=1,
    num_candidate_programs=8,
)

langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_50)


[LangWatch] Experiment initialized, run_id: gleaming-precise-vicugna
[LangWatch] Open http://localhost:3000/experiment-dspy-iOg5EE/experiments/product_sentiment_polarity_openai_experiment?runIds=gleaming-precise-vicugna to track your DSPy training session live



Average Metric: 25 / 50  (50.0): 100%|██████████| 50/50 [00:06<00:00,  8.11it/s]
Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:05<00:00,  9.89it/s]
 34%|███▍      | 17/50 [00:11<00:22,  1.44it/s]
Average Metric: 33 / 50  (66.0): 100%|██████████| 50/50 [00:04<00:00, 10.15it/s] 
 24%|██▍       | 12/50 [00:06<00:20,  1.83it/s]
Average Metric: 35 / 50  (70.0): 100%|██████████| 50/50 [00:04<00:00, 10.13it/s]
 10%|█         | 5/50 [00:02<00:25,  1.78it/s]
Average Metric: 33 / 50  (66.0): 100%|██████████| 50/50 [00:05<00:00,  9.89it/s]
  2%|▏         | 1/50 [00:00<00:27,  1.76it/s]
Average Metric: 34 / 50  (68.0): 100%|██████████| 50/50 [00:04<00:00, 10.02it/s]
  8%|▊         | 4/50 [00:02<00:29,  1.55it/s]
Average Metric: 34 / 50  (68.0): 100%|██████████| 50/50 [00:05<00:00,  9.97it/s]
 10%|█         | 5/50 [00:03<00:32,  1.37it/s]
Average Metric: 26 / 50  (52.0): 100%|██████████| 50/50 [00:05<00:00,  9.44it/s]
 22%|██▏       | 11/50 [00:07<00:25,  1.51it/s]
Average Metric: 24 

In [12]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score

Average Metric: 48 / 81  (59.3): 100%|██████████| 81/81 [00:11<00:00,  7.25it/s]


59.26

**Accuracy BootstrapFewShotWithRandomSearch** - 59.26%

### 2.3 BootstrapFewShotWithOptuna

In [13]:
from dspy.teleprompt import BootstrapFewShotWithOptuna

optimizer = BootstrapFewShotWithOptuna(
    metric=sentiment_matches,
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    max_rounds=1,
    num_candidate_programs=8,
)

# langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_50, max_demos=8)

Going to sample between 1 and 8 traces per predictor.
Will attempt to train 8 candidate sets.


 34%|███▍      | 17/50 [00:10<00:20,  1.58it/s]
[I 2024-05-31 11:55:49,155] A new study created in memory with name: no-name-cc8eff0d-71c8-4807-a5df-97e13530ab62
Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:04<00:00, 10.75it/s]
[I 2024-05-31 11:55:53,835] Trial 0 finished with value: 58.0 and parameters: {'demo_index_for_predict': 5}. Best is trial 0 with value: 58.0.
Average Metric: 30 / 50  (60.0): 100%|██████████| 50/50 [00:04<00:00, 10.87it/s]
[I 2024-05-31 11:55:58,460] Trial 1 finished with value: 60.0 and parameters: {'demo_index_for_predict': 5}. Best is trial 1 with value: 60.0.
Average Metric: 23 / 50  (46.0): 100%|██████████| 50/50 [00:04<00:00, 10.98it/s]
[I 2024-05-31 11:56:03,040] Trial 2 finished with value: 46.0 and parameters: {'demo_index_for_predict': 1}. Best is trial 1 with value: 60.0.
Average Metric: 42 / 50  (84.0): 100%|██████████| 50/50 [00:04<00:00, 10.36it/s]
[I 2024-05-31 11:56:07,888] Trial 3 finished with value: 84.0 and parameters: {'demo_

Best score: 86.0
Best program: predict = Predict(ProductSentimentPolaritySignature(output -> sentiment
    instructions='Classify the sentiment of the output among very_positive, subtly_positive, subtly_negative, very_negative'
    output = Field(annotation=str required=True json_schema_extra={'desc': 'Output of the LLM talking about the product', '__dspy_field_type': 'input', 'prefix': 'Output:'})
    sentiment = Field(annotation=Sentiment required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Sentiment:', 'desc': '${sentiment}'})
))


In [14]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score

Average Metric: 58 / 81  (71.6): 100%|██████████| 81/81 [00:11<00:00,  6.87it/s]


71.6

**Accuracy BootstrapFewShotWithOptuna** - 71.6%
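
To relate a result like this back to H0 and H1, one option is a two-proportion z-test on the dev set counts. The sketch below is only a rough check: it treats the two runs as independent samples, which is a simplification because both are scored on the same 81 dev examples, and the `two_proportion_z_test` helper is hypothetical rather than part of the notebook.

```python
import math


def two_proportion_z_test(correct_a, correct_b, n):
    """Two-sided z-test for a difference between two accuracies on n examples each."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    standard_error = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_b - p_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# 35/81 is a typical baseline run; 58/81 is the BootstrapFewShotWithOptuna run.
z_stat, p_value = two_proportion_z_test(correct_a=35, correct_b=58, n=81)
print(f"z = {z_stat:.2f}, p = {p_value:.5f}")
```

A paired test on per-example predictions would be the stricter choice, since it accounts for both models seeing the same examples.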

### 2.4 KNNFewShot

In [15]:
from dspy.predict import KNN
from dspy.teleprompt import KNNFewShot

optimizer = KNNFewShot(KNN, k=10, trainset=trainset_50)


# langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_50)



In [16]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score


50.62

**Accuracy KNNFewShot** - 50.62%

### 2.5 COPRO

In [17]:
from dspy.teleprompt import COPRO

optimizer = COPRO(depth=5, trainset=trainset, metric=sentiment_matches, track_stats=True)
kwargs = dict(num_threads=64, display_progress=True, display_table=0)


langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_50,  eval_kwargs=kwargs)


[LangWatch] Experiment initialized, run_id: efficient-paper-turkey
[LangWatch] Open http://localhost:3000/experiment-dspy-iOg5EE/experiments/product_sentiment_polarity_openai_experiment?runIds=efficient-paper-turkey to track your DSPy training session live



Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 64.37it/s]
Average Metric: 25 / 50  (50.0): 100%|██████████| 50/50 [00:01<00:00, 34.49it/s]
Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 70.40it/s]
Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 64.59it/s]
Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 52.82it/s]
Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 77.02it/s]


In [18]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score

  0%|          | 0/81 [00:00<?, ?it/s]

Average Metric: 35 / 81  (43.2): 100%|██████████| 81/81 [00:10<00:00,  7.40it/s]


43.21

**Accuracy COPRO** - 43.21%

### 2.6 MIPRO

In [24]:
from dspy.teleprompt import MIPRO

prompt_model = dspy.OpenAI(
    model="gpt-3.5-turbo",
    max_tokens=2048,
)
optimizer = MIPRO(prompt_model=prompt_model, task_model=llm, metric=sentiment_matches, num_candidates=10)
kwargs = dict(num_threads=64, display_progress=True, display_table=0)


langwatch.dspy.init(experiment="product_sentiment_polarity_openai_experiment", optimizer=optimizer)

optimized_evaluator = optimizer.compile(ProductSentimentPolarity(), trainset=trainset_50, max_bootstrapped_demos=8, num_trials=50, max_labeled_demos=8, eval_kwargs=kwargs)


[LangWatch] Experiment initialized, run_id: spry-imposing-warthog
[LangWatch] Open http://localhost:3000/experiment-dspy-iOg5EE/experiments/product_sentiment_polarity_openai_experiment?runIds=spry-imposing-warthog to track your DSPy training session live


Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:

- Task Model: 50 examples in dev set * 50 trials * # of LM calls in your program = (2500 * # of LM calls in your program) task model calls
- Prompt Model: # data summarizer calls (max 10) + 10 * 1 lm calls in program = 20 prompt model calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per O

 30%|███       | 15/50 [00:10<00:24,  1.42it/s]
 22%|██▏       | 11/50 [00:06<00:23,  1.63it/s]
 18%|█▊        | 9/50 [00:05<00:23,  1.77it/s]
 18%|█▊        | 9/50 [00:05<00:25,  1.59it/s]
 30%|███       | 15/50 [00:08<00:18,  1.86it/s]
 32%|███▏      | 16/50 [00:09<00:21,  1.61it/s]
 46%|████▌     | 23/50 [00:12<00:14,  1.81it/s]
 18%|█▊        | 9/50 [00:05<00:23,  1.76it/s]
 28%|██▊       | 14/50 [00:07<00:20,  1.78it/s]
[I 2024-05-31 12:48:22,852] A new study created in memory with name: no-name-7810546f-a9ef-43d6-8b39-f3757c425842


Starting trial #0


Average Metric: 26 / 50  (52.0): 100%|██████████| 50/50 [00:00<00:00, 55.28it/s]
[I 2024-05-31 12:48:24,213] Trial 0 finished with value: 52.0 and parameters: {'12551179920_predictor_instruction': 1, '12551179920_predictor_demos': 1}. Best is trial 0 with value: 52.0.


Starting trial #1


Average Metric: 26 / 50  (52.0): 100%|██████████| 50/50 [00:01<00:00, 27.72it/s]
[I 2024-05-31 12:48:26,242] Trial 1 finished with value: 52.0 and parameters: {'12551179920_predictor_instruction': 5, '12551179920_predictor_demos': 4}. Best is trial 0 with value: 52.0.


Starting trial #2


Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:02<00:00, 22.47it/s]
[I 2024-05-31 12:48:28,702] Trial 2 finished with value: 56.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 0}. Best is trial 2 with value: 56.0.


Starting trial #3


Average Metric: 34 / 50  (68.0): 100%|██████████| 50/50 [00:00<00:00, 61.79it/s]
[I 2024-05-31 12:48:29,737] Trial 3 finished with value: 68.0 and parameters: {'12551179920_predictor_instruction': 9, '12551179920_predictor_demos': 3}. Best is trial 3 with value: 68.0.


Starting trial #4


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:01<00:00, 48.50it/s]
[I 2024-05-31 12:48:30,972] Trial 4 finished with value: 38.0 and parameters: {'12551179920_predictor_instruction': 8, '12551179920_predictor_demos': 4}. Best is trial 3 with value: 68.0.


Starting trial #5


Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:00<00:00, 72.54it/s]
[I 2024-05-31 12:48:31,897] Trial 5 finished with value: 58.0 and parameters: {'12551179920_predictor_instruction': 4, '12551179920_predictor_demos': 2}. Best is trial 3 with value: 68.0.


Starting trial #6


Average Metric: 25 / 50  (50.0): 100%|██████████| 50/50 [00:01<00:00, 41.47it/s]
[I 2024-05-31 12:48:33,302] Trial 6 pruned. 


Trial pruned.
Starting trial #7


Average Metric: 25 / 50  (50.0): 100%|██████████| 50/50 [00:01<00:00, 49.71it/s]
[I 2024-05-31 12:48:34,586] Trial 7 pruned. 


Trial pruned.
Starting trial #8


Average Metric: 35 / 50  (70.0): 100%|██████████| 50/50 [00:01<00:00, 26.72it/s]
[I 2024-05-31 12:48:36,744] Trial 8 finished with value: 70.0 and parameters: {'12551179920_predictor_instruction': 5, '12551179920_predictor_demos': 8}. Best is trial 8 with value: 70.0.


Starting trial #9


Average Metric: 35 / 50  (70.0): 100%|██████████| 50/50 [00:02<00:00, 16.95it/s]
[I 2024-05-31 12:48:39,934] Trial 9 finished with value: 70.0 and parameters: {'12551179920_predictor_instruction': 2, '12551179920_predictor_demos': 2}. Best is trial 8 with value: 70.0.


Starting trial #10


Average Metric: 31 / 50  (62.0): 100%|██████████| 50/50 [00:01<00:00, 32.62it/s] 
[I 2024-05-31 12:48:41,719] Trial 10 finished with value: 62.0 and parameters: {'12551179920_predictor_instruction': 5, '12551179920_predictor_demos': 8}. Best is trial 8 with value: 70.0.


Starting trial #11


Average Metric: 27 / 50  (54.0): 100%|██████████| 50/50 [00:01<00:00, 47.22it/s]
[I 2024-05-31 12:48:43,035] Trial 11 pruned. 


Trial pruned.
Starting trial #12


Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:01<00:00, 38.29it/s]
[I 2024-05-31 12:48:44,583] Trial 12 finished with value: 58.0 and parameters: {'12551179920_predictor_instruction': 6, '12551179920_predictor_demos': 8}. Best is trial 8 with value: 70.0.


Starting trial #13


Average Metric: 37 / 50  (74.0): 100%|██████████| 50/50 [00:00<00:00, 56.92it/s] 
[I 2024-05-31 12:48:45,886] Trial 13 finished with value: 74.0 and parameters: {'12551179920_predictor_instruction': 2, '12551179920_predictor_demos': 2}. Best is trial 13 with value: 74.0.


Starting trial #14


Average Metric: 30 / 50  (60.0): 100%|██████████| 50/50 [00:00<00:00, 53.11it/s]
[I 2024-05-31 12:48:47,191] Trial 14 finished with value: 60.0 and parameters: {'12551179920_predictor_instruction': 7, '12551179920_predictor_demos': 6}. Best is trial 13 with value: 74.0.


Starting trial #15


Average Metric: 21 / 50  (42.0): 100%|██████████| 50/50 [00:00<00:00, 66.28it/s]
[I 2024-05-31 12:48:48,376] Trial 15 pruned. 


Trial pruned.
Starting trial #16


Average Metric: 38 / 50  (76.0): 100%|██████████| 50/50 [00:00<00:00, 55.02it/s]
[I 2024-05-31 12:48:49,517] Trial 16 finished with value: 76.0 and parameters: {'12551179920_predictor_instruction': 2, '12551179920_predictor_demos': 8}. Best is trial 16 with value: 76.0.


Starting trial #17


Average Metric: 35 / 50  (70.0): 100%|██████████| 50/50 [00:00<00:00, 55.45it/s]
[I 2024-05-31 12:48:50,657] Trial 17 finished with value: 70.0 and parameters: {'12551179920_predictor_instruction': 2, '12551179920_predictor_demos': 2}. Best is trial 16 with value: 76.0.


Starting trial #18


Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:01<00:00, 36.95it/s]
[I 2024-05-31 12:48:52,243] Trial 18 pruned. 


Trial pruned.
Starting trial #19


Average Metric: 31 / 50  (62.0): 100%|██████████| 50/50 [00:00<00:00, 75.65it/s] 
[I 2024-05-31 12:48:53,126] Trial 19 finished with value: 62.0 and parameters: {'12551179920_predictor_instruction': 2, '12551179920_predictor_demos': 1}. Best is trial 16 with value: 76.0.


Starting trial #20


Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:00<00:00, 52.58it/s]
[I 2024-05-31 12:48:54,340] Trial 20 pruned. 


Trial pruned.
Starting trial #21


Average Metric: 37 / 50  (74.0): 100%|██████████| 50/50 [00:00<00:00, 53.27it/s] 
[I 2024-05-31 12:48:55,532] Trial 21 finished with value: 74.0 and parameters: {'12551179920_predictor_instruction': 7, '12551179920_predictor_demos': 8}. Best is trial 16 with value: 76.0.


Starting trial #22


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 55.48it/s]
[I 2024-05-31 12:48:56,672] Trial 22 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 7, '12551179920_predictor_demos': 8}. Best is trial 22 with value: 78.0.


Starting trial #23


Average Metric: 34 / 50  (68.0): 100%|██████████| 50/50 [00:00<00:00, 53.83it/s]
[I 2024-05-31 12:48:57,847] Trial 23 finished with value: 68.0 and parameters: {'12551179920_predictor_instruction': 7, '12551179920_predictor_demos': 8}. Best is trial 22 with value: 78.0.


Starting trial #24


Average Metric: 30 / 50  (60.0): 100%|██████████| 50/50 [00:00<00:00, 74.17it/s]
[I 2024-05-31 12:48:58,726] Trial 24 pruned. 


Trial pruned.
Starting trial #25


Average Metric: 20 / 50  (40.0): 100%|██████████| 50/50 [00:02<00:00, 24.54it/s]
[I 2024-05-31 12:49:00,951] Trial 25 pruned. 


Trial pruned.
Starting trial #26


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:01<00:00, 38.27it/s]
[I 2024-05-31 12:49:02,544] Trial 26 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #27


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 71.07it/s]
[I 2024-05-31 12:49:03,568] Trial 27 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #28


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 61.94it/s]
[I 2024-05-31 12:49:04,617] Trial 28 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #29


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 60.82it/s]
[I 2024-05-31 12:49:05,738] Trial 29 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #30


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 62.52it/s]
[I 2024-05-31 12:49:06,791] Trial 30 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #31


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:00<00:00, 66.74it/s]
[I 2024-05-31 12:49:07,773] Trial 31 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #32


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:01<00:00, 32.19it/s]
[I 2024-05-31 12:49:09,565] Trial 32 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #33


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:00<00:00, 53.78it/s] 
[I 2024-05-31 12:49:10,757] Trial 33 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #34


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:01<00:00, 49.44it/s]
[I 2024-05-31 12:49:12,010] Trial 34 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 26 with value: 80.0.


Starting trial #35


Average Metric: 41 / 50  (82.0): 100%|██████████| 50/50 [00:00<00:00, 61.01it/s]
[I 2024-05-31 12:49:13,077] Trial 35 finished with value: 82.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #36


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:00<00:00, 64.06it/s]
[I 2024-05-31 12:49:14,085] Trial 36 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #37


Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:00<00:00, 52.30it/s]
[I 2024-05-31 12:49:15,259] Trial 37 pruned. 


Trial pruned.
Starting trial #38


Average Metric: 23 / 50  (46.0): 100%|██████████| 50/50 [00:00<00:00, 62.88it/s]
[I 2024-05-31 12:49:16,395] Trial 38 pruned. 


Trial pruned.
Starting trial #39


Average Metric: 29 / 50  (58.0): 100%|██████████| 50/50 [00:00<00:00, 54.99it/s] 
[I 2024-05-31 12:49:17,644] Trial 39 pruned. 


Trial pruned.
Starting trial #40


Average Metric: 38 / 50  (76.0): 100%|██████████| 50/50 [00:00<00:00, 56.18it/s]
[I 2024-05-31 12:49:18,794] Trial 40 finished with value: 76.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #41


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:00<00:00, 54.47it/s] 
[I 2024-05-31 12:49:20,011] Trial 41 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #42


Average Metric: 38 / 50  (76.0): 100%|██████████| 50/50 [00:02<00:00, 24.66it/s]
[I 2024-05-31 12:49:22,260] Trial 42 finished with value: 76.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #43


Average Metric: 39 / 50  (78.0): 100%|██████████| 50/50 [00:01<00:00, 42.09it/s]
[I 2024-05-31 12:49:23,680] Trial 43 finished with value: 78.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #44


Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:00<00:00, 54.37it/s]
[I 2024-05-31 12:49:24,923] Trial 44 pruned. 


Trial pruned.
Starting trial #45


Average Metric: 24 / 50  (48.0): 100%|██████████| 50/50 [00:02<00:00, 18.87it/s]
[I 2024-05-31 12:49:27,798] Trial 45 pruned. 


Trial pruned.
Starting trial #46


Average Metric: 40 / 50  (80.0): 100%|██████████| 50/50 [00:00<00:00, 55.30it/s]
[I 2024-05-31 12:49:28,907] Trial 46 finished with value: 80.0 and parameters: {'12551179920_predictor_instruction': 3, '12551179920_predictor_demos': 3}. Best is trial 35 with value: 82.0.


Starting trial #47


Average Metric: 27 / 50  (54.0): 100%|██████████| 50/50 [00:02<00:00, 24.37it/s]
[I 2024-05-31 12:49:31,177] Trial 47 pruned. 


Trial pruned.
Starting trial #48


Average Metric: 28 / 50  (56.0): 100%|██████████| 50/50 [00:06<00:00,  8.08it/s]
[I 2024-05-31 12:49:37,608] Trial 48 pruned. 


Trial pruned.
Starting trial #49


Average Metric: 23 / 50  (46.0): 100%|██████████| 50/50 [00:01<00:00, 28.45it/s]
[I 2024-05-31 12:49:39,694] Trial 49 pruned. 


Trial pruned.
Returning predict = Predict(StringSignature(output -> sentiment
    instructions='Generate an informative response based on the given input about Samsung products, including features, specifications, user experiences, and comparisons with other brands. Provide detailed insights and analysis to assist potential customers in making informed purchasing decisions.'
    output = Field(annotation=str required=True json_schema_extra={'desc': 'Output of the LLM talking about the product', '__dspy_field_type': 'input', 'prefix': 'Output:'})
    sentiment = Field(annotation=Sentiment required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'In-depth Analysis:', 'desc': '${sentiment}'})
)) from continue_program


In [25]:
from dspy.evaluate import Evaluate

evaluate_dev = Evaluate(devset=devset, metric=sentiment_matches, num_threads=4, display_progress=True, display_table=0)

dev_score = evaluate_dev(optimized_evaluator)
dev_score

Average Metric: 59 / 81  (72.8): 100%|██████████| 81/81 [00:11<00:00,  6.81it/s]


72.84

**Accuracy MIPRO** - 72.84%
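
Because the dev set holds only 81 examples, a single accuracy figure carries noticeable sampling uncertainty. The sketch below is not part of the original experiment; it is one possible way to attach a 95% Wilson score interval to the 59 / 81 result reported above. The function name `wilson_interval` is an illustrative helper, not a DSPy API.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# 59 correct out of 81 dev examples, taken from the evaluation output above.
low, high = wilson_interval(59, 81)
print(f"MIPRO dev accuracy: 95% interval from {low:.1%} to {high:.1%}")
```

The same helper can be applied to the other optimizers' counts to see whether their intervals overlap with each other or with the baseline.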

## 3. Conclusions

### Results and Comparison
The baseline accuracy of gpt-3.5-turbo is 43.8%.

**DSPy Optimizers Leaderboard**
1. MIPRO - 72.84%
2. BootstrapFewShotWithOptuna - 71.6%
3. BootstrapFewShot - 60.49%
4. BootstrapFewShotWithRandomSearch - 59.26%
5. KNNFewShot - 50.62%
6. COPRO - 43.21%

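Five of the six optimizers improved on the baseline, with gains ranging from roughly 7 percentage points (KNNFewShot) to roughly 29 percentage points (MIPRO). COPRO, at 43.21%, scored marginally below the baseline.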
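
The gaps to the baseline can be recomputed directly from the leaderboard figures. The snippet below is a minimal sketch that hard-codes the accuracies reported in this notebook; the helper name `gains_over_baseline` is hypothetical.

```python
# Accuracy figures (in %) copied from the results reported in this notebook.
BASELINE = 43.8
SCORES = {
    "MIPRO": 72.84,
    "BootstrapFewShotWithOptuna": 71.6,
    "BootstrapFewShot": 60.49,
    "BootstrapFewShotWithRandomSearch": 59.26,
    "KNNFewShot": 50.62,
    "COPRO": 43.21,
}


def gains_over_baseline(scores, baseline):
    """Return {optimizer: gain in percentage points}, rounded to 2 decimals."""
    return {name: round(acc - baseline, 2) for name, acc in scores.items()}


gains = gains_over_baseline(SCORES, BASELINE)
# Print the optimizers from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<34} {gain:+.2f} pp")
```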

### Future Improvements for the Notebook
Add confusion matrices for each evaluator to see which categories are most often mislabeled.
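
A minimal sketch of what such a confusion matrix could look like, using only the standard library. The toy labels and predictions below are placeholders (only `very_positive` is confirmed by the dataset preview), and the helper names are hypothetical. In the notebook, the gold labels would come from `devset` and the predictions from the evaluator under test; how exactly the evaluator is called is an assumption left out here.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold_label, predicted_label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def format_confusion_matrix(gold, predicted, labels):
    """Render a plain-text matrix: rows are gold labels, columns are predictions."""
    counts = confusion_counts(gold, predicted)
    width = max(len(label) for label in labels) + 2
    header = " " * width + "".join(f"{label:>{width}}" for label in labels)
    rows = [header]
    for g in labels:
        cells = "".join(f"{counts[(g, p)]:>{width}}" for p in labels)
        rows.append(f"{g:<{width}}{cells}")
    return "\n".join(rows)


# Placeholder data for illustration only; not taken from the experiment.
labels = ["very_positive", "positive", "neutral"]
gold = ["very_positive", "positive", "positive", "neutral", "neutral"]
pred = ["very_positive", "very_positive", "positive", "neutral", "positive"]

print(format_confusion_matrix(gold, pred, labels))
```

Large off-diagonal cells show which pairs of sentiment categories an evaluator confuses most often.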

### Notes and Discussion
General notes about the experiment:
1. The training dataset was too small for the MIPRO and COPRO optimizations. Better results could potentially be achieved with more data and a larger budget.
2. The quality of the dataset has not been verified, as it was artificially created by an LLM. It nevertheless suffices for the primary goal of this experiment: verifying whether DSPy optimizers can actually improve the results. Future experiments should explore datasets of human-LLM interactions as well as human-written answers.
3. For simplicity, the task was defined as a basic classification, and the requirement for "reasoning" or "positive_points" and "negative_points" was dropped. It is worth exploring further how priming would affect the final decision of an LLM.
4. Because the documentation of the DSPy library does not cover all the details, not all of the optimizers were run with optimal input parameters. Most input parameters were selected based on the examples given in the [DSPy Cheatsheet](https://dspy-docs.vercel.app/docs/cheatsheet). We believe that certain optimizers could reach better accuracy if their input parameters were tuned more carefully.