# Data

In [3]:
import json
with open('./data/hormozi_tweets.jsonl', 'r') as f:
    data = [json.loads(line) for line in f]

# make a list of 300 tweets
tweets = [item['tweet'] for item in data][:300]

# Baseline Tweet Generator

In [3]:
import random
import openai
import os
from dotenv import load_dotenv

load_dotenv()

random_tweets = random.sample(tweets, 10)

prompt = f"""
Past Tweets:
{random_tweets}

Create a new tweet based on the past tweets.
"""

openai.api_key = os.environ.get('OPENAI_API_KEY')

def generate_tweet(prompt):
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that creates tweets similar to past tweets from the user."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=280,
    )
    return response.choices[0].message.content.strip().lower()

# Create a new tweet
new_tweet = generate_tweet(prompt)

In [4]:
new_tweet

'"success isn\'t about never hitting a low point; it\'s about recognizing that those moments are opportunities to push through while others quit. remember, every setback is just a setup for your comeback. 💪"'
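The generated tweet above comes back wrapped in double quotes, and `max_tokens=280` caps tokens rather than characters. A minimal sketch of post-processing we could apply before using the output; `clean_tweet`, `fits_in_tweet`, and `TWEET_CHAR_LIMIT` are hypothetical names that are not part of the notebook, and the 280-character limit is an assumption.

```python
TWEET_CHAR_LIMIT = 280  # assumed character limit; not taken from the notebook


def clean_tweet(text: str) -> str:
    """Strip whitespace and one pair of wrapping quotes from a generated tweet."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1].strip()
    return text


def fits_in_tweet(text: str, limit: int = TWEET_CHAR_LIMIT) -> bool:
    """Return True when the text is within the character limit."""
    return len(text) <= limit


# Shortened version of the generated output shown above.
raw = '"success isn\'t about never hitting a low point. 💪"'
cleaned = clean_tweet(raw)
```

Checking the character count separately seems worthwhile because a 280-token completion can be considerably longer than 280 characters.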

# First experiment

DSPy recommends starting with the simplest solution and adding complexity only as needed, so that is what we'll do. For our first experiment, we'll use a ChainOfThought module to generate tweets based on all the tweets we've previously seen.

In [14]:
import json
with open('./data/hormozi_tweets.jsonl', 'r') as f:
    data = [json.loads(line) for line in f]

# make a list of 300 tweets, each with its total engagement (replies + retweets + likes)
tweets = [{'tweet': item['tweet'], 'engagement': item['replies'] + item['retweets'] + item['likes']} for item in data][:300]

In [16]:
tweets[2]

{'tweet': 'Rush is an illusion.', 'engagement': 908}
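The engagement sum above assumes every record has all three counts. A minimal sketch of a more defensive version, in case some lines are missing a field or carry a null; the `engagement` helper and the two sample lines are hypothetical and only reuse the field names from the cell above.

```python
import json


def engagement(record: dict) -> int:
    """Sum replies, retweets and likes, treating missing or null counts as 0."""
    return sum(int(record.get(key) or 0) for key in ("replies", "retweets", "likes"))


# Hypothetical JSONL lines; the real file's counts are not reproduced here.
lines = [
    '{"tweet": "first sample", "replies": 3, "retweets": 5, "likes": 100}',
    '{"tweet": "second sample", "replies": null, "likes": 7}',
]
rows = [json.loads(line) for line in lines]
scored = [{"tweet": r["tweet"], "engagement": engagement(r)} for r in rows]
```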

In [13]:
import dspy
from dspy.primitives import Example
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShot


In [20]:
tweets

[{'tweet': 'Stop “friend financing”', 'engagement': 137},
 {'tweet': 'A message for friends of entrepreneurs:\n\nDon’t buy them a present. \n\nBuy their product and leave a nice review or tell them why it wasn’t good enough to deserve it (an even better gift).\n\nNo one needs more stuff. \nEveryone could use more support.',
  'engagement': 979},
 {'tweet': 'Rush is an illusion.', 'engagement': 908},
 {'tweet': 'Advice to strong men:\n\nFind a strong woman.',
  'engagement': 3574},
 {'tweet': 'Reminder:\n\nDeath tax is 100% for everyone.\n\n(Because whatever you have isn’t yours anymore after you die).',
  'engagement': 1136},
 {'tweet': 'When it comes to talent, “good enough” often isn’t.',
  'engagement': 941},
 {'tweet': 'Squeeze all the potential you’ve got into reality.\n\nEvery. Last. Drop.',
  'engagement': 3391},
 {'tweet': 'It’s not about doing your best, it’s about doing what’s required.\n\nAnd sometimes that means your best just needs to get better.',
  'engagement': 2555},
 

## Annotating tweets
Our tweets don't have any topic or label associated with them that might help us optimize and generate more tweets. We can annotate them by getting an LLM to find out the topic they fall under.

In [21]:
import dspy

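# Convert tweets to Example objects.
# Note: each `tweet` here is the whole dict (text plus engagement), so the Example's
# `tweet` field is nested; it is flattened into separate fields further down.
dataset = [dspy.Example(tweet=tweet).with_inputs("tweet", "engagement") for tweet in tweets]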

In [22]:
class TopicPredictor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("tweet -> topic")

    def forward(self, tweet):
        return self.prog(tweet=tweet)

In [26]:
lm = dspy.LM(model='gpt-4o')
dspy.configure(lm=lm)

In [27]:
# Initialize the topic predictor
topic_predictor = TopicPredictor()

# Annotate each tweet with a predicted topic
annotated_dataset = []
for example in dataset:
    response = topic_predictor(example.tweet)
    annotated_example = example.copy()
    annotated_example.topic = response.topic
    annotated_dataset.append(annotated_example)

# Print annotated dataset
for example in annotated_dataset:
    print(f"Tweet: {example.tweet}, Topic: {example.topic}")

Tweet: {'tweet': 'Stop “friend financing”', 'engagement': 137}, Topic: Personal Finance
Tweet: {'tweet': 'A message for friends of entrepreneurs:\n\nDon’t buy them a present. \n\nBuy their product and leave a nice review or tell them why it wasn’t good enough to deserve it (an even better gift).\n\nNo one needs more stuff. \nEveryone could use more support.', 'engagement': 979}, Topic: Support for Entrepreneurs
Tweet: {'tweet': 'Rush is an illusion.', 'engagement': 908}, Topic: Philosophy
Tweet: {'tweet': 'Advice to strong men:\n\nFind a strong woman.', 'engagement': 3574}, Topic: Relationships
Tweet: {'tweet': 'Reminder:\n\nDeath tax is 100% for everyone.\n\n(Because whatever you have isn’t yours anymore after you die).', 'engagement': 1136}, Topic: Philosophy, Mortality
Tweet: {'tweet': 'When it comes to talent, “good enough” often isn’t.', 'engagement': 941}, Topic: Talent and Excellence
Tweet: {'tweet': 'Squeeze all the potential you’ve got into reality.\n\nEvery. Last. Drop.', 'en

In [28]:
annotated_dataset[0]

Example({'tweet': {'tweet': 'Stop “friend financing”', 'engagement': 137}, 'topic': 'Personal Finance'}) (input_keys=None)

We now have our annotated data. We can save it to use later and train a tweet writer.

In [30]:
class TweetWriter(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("topic -> tweet")

    def forward(self, topic):
        return self.prog(topic=topic)

In [68]:
class Assess(dspy.Signature):
    """Assess the creativity of a tweet."""
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Score between 1 and 5")

def creativity_metric(gold, pred, trace=None):
    tweet = pred.tweet
    creativity_question = "Rate the creativity of the following tweet on a scale of 1 to 5."
    creativity_score = dspy.Predict(Assess)(
            assessed_text=tweet,
            assessment_question=creativity_question
        )
    print(creativity_score)
    score = int(creativity_score.assessment_answer.strip())
    return score
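The metric above calls `int(...)` directly on the judge's answer, which would raise if the model replied with something like "Score: 3/5". A minimal sketch of a more forgiving parser; `parse_score` is a hypothetical helper, not part of DSPy or the notebook, and returning 0 for unparseable answers is an assumption.

```python
import re


def parse_score(answer: str, low: int = 1, high: int = 5, default: int = 0) -> int:
    """Return the first integer in [low, high] found in answer, else default."""
    for match in re.finditer(r"\d+", answer):
        value = int(match.group())
        if low <= value <= high:
            return value
    return default
```

Because it takes the first in-range integer, an answer that restates the scale before giving the score would be misread, so this is only a stopgap for mostly well-formed answers.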

In [32]:
uncompiled_tweet_writer = TweetWriter()

In [33]:
uncompiled_tweet_writer("Business")

Prediction(
    reasoning='Discussing business topics on social media can provide valuable insights, foster networking opportunities, and keep followers informed about industry trends and best practices. A well-crafted tweet can engage your audience and position you as a thought leader in your field.',
    tweet='In the ever-evolving world of business, staying ahead means embracing innovation and continuous learning. 📈💡 What new strategies are you implementing to drive growth in 2023? #BusinessGrowth #Innovation #Leadership'
)

Let's evaluate our uncompiled tweet writer.

In [72]:
flattened_annotated_dataset = [
    dspy.Example(tweet=item.tweet["tweet"], engagement=item.tweet["engagement"], topic=item["topic"]).with_inputs("topic")
    for item in annotated_dataset
]

In [45]:
import pickle

with open('data/annotated_dataset.pkl', 'wb') as f:
    pickle.dump(flattened_annotated_dataset, f) 
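Pickle works here, but the file can only be read back by Python with a compatible DSPy install. A minimal sketch of saving the same three fields as JSONL instead, shown with plain dicts; `save_jsonl`, `load_jsonl`, and the temporary path are hypothetical, and the two records reuse tweets and topics from the output above.

```python
import json
import os
import tempfile

# Two records with the same fields as the flattened examples.
records = [
    {"tweet": "Rush is an illusion.", "engagement": 908, "topic": "Philosophy"},
    {"tweet": "Stop “friend financing”", "engagement": 137, "topic": "Personal Finance"},
]


def save_jsonl(rows, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "annotated_dataset.jsonl")
save_jsonl(records, path)
restored = load_jsonl(path)
```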

In [52]:
train_size = 250  # first 250 examples for training, the remaining 50 for the dev set
trainset = flattened_annotated_dataset[:train_size]
devset = flattened_annotated_dataset[train_size:]
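The split above takes the first 250 examples in file order. If the file happens to be sorted (by date, say), a shuffled split may give a more representative dev set. A minimal sketch, assuming a fixed seed is acceptable; `split_dataset` is a hypothetical helper and the integer list is only a stand-in for the annotated examples.

```python
import random


def split_dataset(items, train_size, seed=0):
    """Shuffle a copy of items with a fixed seed and split into (train, dev)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


examples = list(range(300))  # stand-in for the 300 annotated examples
train, dev = split_dataset(examples, train_size=250)
```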

In [91]:
evaluator = Evaluate(devset=devset[:10], metric=creativity_metric, num_threads=4, display_progress=True, display_table=5, provide_traceback=True)
evaluator(uncompiled_tweet_writer) 

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 32 / 10  (320.0): 100%|██████████| 10/10 [00:00<00:00, 192.90it/s]


| | example_tweet | engagement | topic | reasoning | pred_tweet | creativity_metric |
|---|---|---|---|---|---|---|
| 0 | Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you... | 3388 | Learning and Experience | Learning and experience are deeply interconnected. Learning provides the theoretical foundation and knowledge base, while experience offers practical application and real-world insights. Together, they create... | Learning gives you the knowledge, but experience turns that knowledge into wisdom. Embrace both for a well-rounded journey of growth! 🌱✨ #Learning #Experience #Growth | ✔️ [4] |
| 1 | Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h | 1647 | Everyday Life/Humor | Humor is a great way to connect with people on social media, especially when it relates to everyday life. It makes the content relatable and... | Why is it that the moment you decide to eat healthy, your fridge suddenly looks like a junk food paradise? 🍕🍫 #EverydayStruggles #HealthyEating | ✔️ [4] |
| 2 | If you have all the information to make a perfect decision, you missed the opportunity. | 4534 | Decision Making | Decision making is a critical skill that impacts every aspect of our lives, from personal choices to professional strategies. Effective decision making involves evaluating options,... | Good decisions come from experience, and experience comes from bad decisions. 🧠✨ Master the art of decision making by evaluating your options, considering potential outcomes,... | ✔️ [3] |
| 3 | I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able... | 3551 | Habit Formation | Habit formation is a crucial aspect of personal development and productivity. Understanding how habits are formed and maintained can help individuals make positive changes in... | Building good habits is the key to unlocking your potential! Start small, stay consistent, and watch how tiny changes lead to big results. 💪 #HabitFormation... | ✔️ [3] |
| 4 | The reason you are stressed is you have decisions to make and you’re not making them. | 9078 | Mental Health | Mental health is a crucial aspect of overall well-being that often gets overlooked. Raising awareness and encouraging open conversations can help reduce stigma and provide... | Your mental health matters just as much as your physical health. Don't hesitate to reach out for support if you need it. Let's break the... | ✔️ [3] |


320.0

### For the unoptimized tweet writer, our score on the first 10 devset examples is 320 (an average of 3.2 out of 5)
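The 320 can look odd for a 1-to-5 metric. Judging from the progress line `Average Metric: 32 / 10  (320.0)`, the evaluator reports the mean score multiplied by 100. The arithmetic, using the ten `assessment_answer` values printed above:

```python
# The ten assessment_answer values printed by the evaluator run above.
judge_scores = [4, 4, 3, 3, 3, 3, 2, 3, 4, 3]

total = sum(judge_scores)             # the "32" in "32 / 10"
mean = total / len(judge_scores)      # average creativity on the 1-5 scale
reported = round(100 * mean, 1)       # the figure shown in parentheses
```

So a score of 320 corresponds to an average creativity rating of 3.2, and the maximum possible value with this metric would be 500.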

# BootstrapFewShot

In [89]:
config = dict(max_bootstrapped_demos=25, max_labeled_demos=4)
teleprompter = BootstrapFewShot(metric=creativity_metric, **config)
optimized_tweet_writer = teleprompter.compile(TweetWriter(), trainset=trainset[:10])



Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)

100%|██████████| 10/10 [00:20<00:00,  2.09s/it]

Prediction(
    assessment_answer='3'
)
Bootstrapped 10 full traces after 10 examples in round 0.





In [90]:
dspy.settings.configure(backoff_time=5)
evaluator = Evaluate(devset=devset[:10], metric=creativity_metric, num_threads=2, display_progress=True, display_table=5, provide_traceback=True)
evaluator(optimized_tweet_writer)

Average Metric: 4 / 1  (400.0):  10%|█         | 1/10 [00:01<00:16,  1.85s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 7 / 2  (350.0):  20%|██        | 2/10 [00:02<00:09,  1.16s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 10 / 3  (333.3):  30%|███       | 3/10 [00:03<00:07,  1.13s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 13 / 4  (325.0):  40%|████      | 4/10 [00:04<00:05,  1.04it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 19 / 6  (316.7):  50%|█████     | 5/10 [00:06<00:06,  1.20s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  70%|███████   | 7/10 [00:07<00:03,  1.01s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  80%|████████  | 8/10 [00:07<00:01,  1.24it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 9  (322.2):  90%|█████████ | 9/10 [00:09<00:01,  1.05s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 32 / 10  (320.0): 100%|██████████| 10/10 [00:10<00:00,  1.05s/it]

Prediction(
    assessment_answer='3'
)





| | example_tweet | engagement | topic | reasoning | pred_tweet | creativity_metric |
|---|---|---|---|---|---|---|
| 0 | Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you... | 3388 | Learning and Experience | Not supplied for this particular example. | Learning gives you knowledge, but experience gives you wisdom. Embrace both, and you'll navigate life with confidence and insight. 📚🌍 #Learning #Experience | ✔️ [3] |
| 1 | Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h | 1647 | Everyday Life/Humor | Not supplied for this particular example. | Adulting is like folding a fitted sheet. No one really knows how to do it, but we all pretend we do. 😂 #EverydayLife #Humor | ✔️ [4] |
| 2 | If you have all the information to make a perfect decision, you missed the opportunity. | 4534 | Decision Making | Not supplied for this particular example. | Every decision shapes your future. Trust your instincts, gather the facts, and don't be afraid to take calculated risks. Growth often lies just outside your... | ✔️ [3] |
| 3 | I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able... | 3551 | Habit Formation | Not supplied for this particular example. | Small, consistent actions lead to big changes. Focus on building one positive habit at a time, and watch how they transform your life. 🌟 #HabitFormation... | ✔️ [3] |
| 4 | The reason you are stressed is you have decisions to make and you’re not making them. | 9078 | Mental Health | Not supplied for this particular example. | Taking care of your mental health is just as important as taking care of your physical health. Reach out, talk about it, and seek help... | ✔️ [3] |


320.0

BootstrapFewShot isn't giving us the results we want: the compiled program scores 320 on the same 10 devset examples, exactly what the uncompiled program scored. Let's try a few more training examples and BootstrapFewShotWithRandomSearch.
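Before moving on, it may help to look at how the judge's scores are distributed rather than only at the totals. A small sketch using the `assessment_answer` values printed by the two evaluation runs above; the order of the printed values depends on thread scheduling, so only the counts are compared.

```python
from collections import Counter

# assessment_answer values from the two evaluator runs above.
uncompiled = [4, 4, 3, 3, 3, 3, 2, 3, 4, 3]
bootstrapped = [4, 3, 3, 3, 3, 3, 3, 3, 4, 3]

for name, scores in [("uncompiled", uncompiled), ("bootstrapped", bootstrapped)]:
    counts = Counter(scores)
    print(name, sum(scores), dict(sorted(counts.items())))
```

Both runs total 32 and most answers are a 3, so one possible reading is that the judge rarely uses the ends of the scale, which would make it hard for any optimizer to show a difference on 10 examples.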

# BootstrapFewShotWithRandomSearch

DSPy suggests using BootstrapFewShotWithRandomSearch when you have up to about 50 examples.

In [101]:
from dotenv import load_dotenv

load_dotenv()

True

In [108]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.settings.configure(backoff_time=10)
config = dict(max_labeled_demos = 1, max_bootstrapped_demos=1, num_candidate_programs=2)
teleprompter = BootstrapFewShotWithRandomSearch(metric = creativity_metric, **config)

rs_optimized_tweet_writer = teleprompter.compile(TweetWriter(), trainset=trainset[:20])

Going to sample between 1 and 1 traces per predictor.
Will attempt to bootstrap 2 candidate sets.
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 61 / 20  (305.0): 100%|██████████| 20/20 [00:00<00:00, 415.02it/s] 


New best score: 305.0 for seed -3
Scores so far: [305.0]
Best score so far: 305.0


Average Metric: 9 / 3  (300.0):  10%|█         | 2/20 [00:02<00:18,  1.01s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)


Average Metric: 12 / 4  (300.0):  15%|█▌        | 3/20 [00:02<00:17,  1.01s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 15 / 5  (300.0):  25%|██▌       | 5/20 [00:02<00:06,  2.40it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 21 / 7  (300.0):  30%|███       | 6/20 [00:04<00:09,  1.40it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 24 / 8  (300.0):  40%|████      | 8/20 [00:04<00:06,  1.98it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 27 / 9  (300.0):  45%|████▌     | 9/20 [00:05<00:05,  1.96it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 30 / 10  (300.0):  50%|█████     | 10/20 [00:05<00:04,  2.19it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 33 / 11  (300.0):  55%|█████▌    | 11/20 [00:06<00:04,  1.89it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 39 / 13  (300.0):  60%|██████    | 12/20 [00:07<00:04,  1.73it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 42 / 14  (300.0):  70%|███████   | 14/20 [00:07<00:02,  2.04it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 45 / 15  (300.0):  75%|███████▌  | 15/20 [00:08<00:02,  2.33it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 52 / 17  (305.9):  85%|████████▌ | 17/20 [00:08<00:01,  2.89it/s]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 55 / 18  (305.6):  90%|█████████ | 18/20 [00:09<00:00,  2.03it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 58 / 19  (305.3):  95%|█████████▌| 19/20 [00:10<00:00,  1.74it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 60 / 20  (300.0): 100%|██████████| 20/20 [00:11<00:00,  1.79it/s]


Prediction(
    assessment_answer='2'
)
Scores so far: [305.0, 300.0]
Best score so far: 305.0


  5%|▌         | 1/20 [00:00<00:00, 1173.56it/s]


Prediction(
    assessment_answer='3'
)
Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 10 / 3  (333.3):  10%|█         | 2/20 [00:02<00:47,  2.67s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 13 / 4  (325.0):  20%|██        | 4/20 [00:03<00:09,  1.76it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 19 / 6  (316.7):  30%|███       | 6/20 [00:03<00:05,  2.73it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  35%|███▌      | 7/20 [00:05<00:12,  1.02it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 28 / 9  (311.1):  45%|████▌     | 9/20 [00:06<00:06,  1.60it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 34 / 11  (309.1):  50%|█████     | 10/20 [00:06<00:05,  1.89it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 37 / 12  (308.3):  60%|██████    | 12/20 [00:07<00:03,  2.43it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 40 / 13  (307.7):  65%|██████▌   | 13/20 [00:08<00:04,  1.49it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 43 / 14  (307.1):  70%|███████   | 14/20 [00:08<00:03,  1.62it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 49 / 16  (306.2):  80%|████████  | 16/20 [00:09<00:01,  2.35it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 52 / 17  (305.9):  85%|████████▌ | 17/20 [00:09<00:01,  2.92it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 55 / 18  (305.6):  90%|█████████ | 18/20 [00:10<00:00,  2.18it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 58 / 19  (305.3):  95%|█████████▌| 19/20 [00:11<00:00,  1.77it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 20  (310.0): 100%|██████████| 20/20 [00:12<00:00,  1.66it/s]


Prediction(
    assessment_answer='4'
)
New best score: 310.0 for seed -1
Scores so far: [305.0, 300.0, 310.0]
Best score so far: 310.0


  5%|▌         | 1/20 [00:02<00:46,  2.47s/it]


Prediction(
    assessment_answer='3'
)
Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3 / 1  (300.0):   5%|▌         | 1/20 [00:04<01:19,  4.17s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 6 / 2  (300.0):  10%|█         | 2/20 [00:05<00:41,  2.33s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 9 / 3  (300.0):  15%|█▌        | 3/20 [00:08<00:46,  2.71s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 12 / 4  (300.0):  20%|██        | 4/20 [00:08<00:28,  1.77s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 18 / 6  (300.0):  25%|██▌       | 5/20 [00:09<00:21,  1.46s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 26 / 8  (325.0):  40%|████      | 8/20 [00:11<00:10,  1.16it/s]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='4'
)


Average Metric: 29 / 9  (322.2):  45%|████▌     | 9/20 [00:12<00:11,  1.06s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 32 / 10  (320.0):  50%|█████     | 10/20 [00:13<00:10,  1.01s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 35 / 11  (318.2):  55%|█████▌    | 11/20 [00:15<00:10,  1.16s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 41 / 13  (315.4):  60%|██████    | 12/20 [00:15<00:08,  1.00s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 44 / 14  (314.3):  70%|███████   | 14/20 [00:17<00:04,  1.23it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 15  (313.3):  75%|███████▌  | 15/20 [00:18<00:04,  1.12it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 50 / 16  (312.5):  80%|████████  | 16/20 [00:18<00:03,  1.19it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 53 / 17  (311.8):  85%|████████▌ | 17/20 [00:20<00:02,  1.10it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 58 / 19  (305.3):  90%|█████████ | 18/20 [00:20<00:01,  1.37it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)


Average Metric: 61 / 20  (305.0): 100%|██████████| 20/20 [00:23<00:00,  1.19s/it]


Prediction(
    assessment_answer='3'
)
Scores so far: [305.0, 300.0, 310.0, 305.0]
Best score so far: 310.0


  5%|▌         | 1/20 [00:03<00:59,  3.11s/it]


Prediction(
    assessment_answer='3'
)
Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 3 / 1  (300.0):   5%|▌         | 1/20 [00:05<01:40,  5.26s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 10 / 3  (333.3):  15%|█▌        | 3/20 [00:05<00:23,  1.39s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 14 / 4  (350.0):  20%|██        | 4/20 [00:06<00:15,  1.04it/s]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 20 / 6  (333.3):  30%|███       | 6/20 [00:08<00:14,  1.05s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 26 / 8  (325.0):  35%|███▌      | 7/20 [00:11<00:21,  1.68s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 9  (322.2):  45%|████▌     | 9/20 [00:11<00:10,  1.01it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 35 / 11  (318.2):  50%|█████     | 10/20 [00:11<00:08,  1.24it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 39 / 12  (325.0):  60%|██████    | 12/20 [00:14<00:08,  1.01s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 45 / 14  (321.4):  70%|███████   | 14/20 [00:17<00:06,  1.15s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 51 / 16  (318.8):  75%|███████▌  | 15/20 [00:17<00:04,  1.11it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 54 / 17  (317.6):  85%|████████▌ | 17/20 [00:18<00:01,  1.56it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 18  (311.1):  90%|█████████ | 18/20 [00:20<00:01,  1.06it/s]

Prediction(
    assessment_answer='2'
)


Average Metric: 59 / 19  (310.5):  95%|█████████▌| 19/20 [00:24<00:01,  1.75s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 20  (310.0): 100%|██████████| 20/20 [00:24<00:00,  1.24s/it]

Prediction(
    assessment_answer='3'
)
Scores so far: [305.0, 300.0, 310.0, 305.0, 310.0]
Best score so far: 310.0
5 candidate programs found.
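The log above reports five candidate scores and announces a new best only when a score strictly exceeds the previous best. A minimal sketch of that selection, reconstructed from the log alone; the seed labels are an assumption (the log only names seeds -3 and -1), and whether the library breaks ties exactly this way is inferred rather than confirmed.

```python
scores = [305.0, 300.0, 310.0, 305.0, 310.0]  # "Scores so far" from the log above
seeds = [-3, -2, -1, 0, 1]  # assumed labels; the log only names seeds -3 and -1

best_seed, best_score = None, float("-inf")
for seed, score in zip(seeds, scores):
    if score > best_score:  # strict comparison keeps the earliest of tied candidates
        best_seed, best_score = seed, score
```

Under these assumptions the selected candidate is the one from seed -1 with a training score of 310.0, matching the last "New best score" line in the log.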





In [110]:
dspy.settings.configure(backoff_time=5)
evaluator = Evaluate(devset=devset[:10], metric=creativity_metric, num_threads=2, display_progress=True, display_table=5, provide_traceback=True)
evaluator(rs_optimized_tweet_writer)

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 10  (290.0): 100%|██████████| 10/10 [00:00<00:00, 1355.89it/s]


| | example_tweet | engagement | topic | reasoning | pred_tweet | creativity_metric |
|---|---|---|---|---|---|---|
| 0 | Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you... | 3388 | Learning and Experience | Learning and experience are fundamental to personal and professional growth. They enable us to acquire new skills, adapt to changing environments, and make informed decisions.... | Never stop learning and seeking new experiences! They are the building blocks of growth and success. Embrace every opportunity to expand your horizons. 🌟📚 #LifelongLearning... | ✔️ [3] |
| 1 | Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h | 1647 | Everyday Life/Humor | Humor in everyday life helps to lighten the mood and make daily routines more enjoyable. Sharing funny observations or experiences can bring a smile to... | Why do we always find the one thing we lost right after we buy a replacement? It's like the universe is playing hide and seek... | ✔️ [4] |
| 2 | If you have all the information to make a perfect decision, you missed the opportunity. | 4534 | Decision Making | Decision making is a fundamental skill that impacts every aspect of our lives, from personal choices to professional paths. Effective decision making involves evaluating options,... | Good decision making is the cornerstone of success! Take the time to weigh your options, consider the outcomes, and choose wisely. Your future depends on... | ✔️ [2] |
| 3 | I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able... | 3551 | Habit Formation | Habit formation is essential for personal growth and achieving long-term goals. By developing positive habits, we can improve our productivity, health, and overall well-being. Understanding... | Building good habits is the foundation of success! Start small, stay consistent, and watch your life transform. 🌱🔄 #HabitFormation #PersonalGrowth | ✔️ [3] |
| 4 | The reason you are stressed is you have decisions to make and you’re not making them. | 9078 | Mental Health | Mental health is a vital component of overall well-being, impacting how we think, feel, and act. Addressing mental health issues and promoting mental wellness can... | Your mental health matters just as much as your physical health. Take time to care for your mind, seek support when needed, and remember that... | ✔️ [3] |


290.0

The random-search program scores 290 on the same 10 devset examples, below the 320 baseline. Let's try MIPROv2 to see if we can improve on this; after that, we could try changing the metric to better optimize our writer.
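One possible direction for the metric change: the current metric returns a raw 1-to-5 integer, which is always truthy, and that appears to be why bootstrapping accepted every trace earlier ("Bootstrapped 10 full traces after 10 examples"). A minimal sketch of a normalized variant that also applies a pass/fail threshold when `trace` is set; `normalized_metric` and the 0.75 threshold are hypothetical, and the judge score is passed in directly so the sketch runs without an LM.

```python
def normalized_metric(raw_score: int, trace=None, low: int = 1, high: int = 5,
                      threshold: float = 0.75):
    """Map a low..high judge score to [0, 1]; return a pass/fail bool when trace is set."""
    clamped = max(low, min(high, raw_score))
    value = (clamped - low) / (high - low)
    if trace is not None:
        # Assumption: during bootstrapping, only demos at or above the threshold are kept.
        return value >= threshold
    return value
```

With this shape, a judge score of 4 or 5 would pass during bootstrapping while a 3 would not, and evaluation scores would read as percentages of the maximum.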

In [111]:
optimized_tweet_writer.save(path="fewshot")

[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]


In [112]:
rs_optimized_tweet_writer.save(path="fewshotwithrs")

[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]


# MIPROv2

In [114]:
# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=creativity_metric,
    num_candidates=7,
    init_temperature=0.5,
    verbose=False,
    num_threads=2,
)

mipro_optimized_program = teleprompter.compile(
    TweetWriter(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    num_trials=10,
    minibatch_size=25,
    minibatch_full_eval_steps=10,
    minibatch=True, 
)

# Save the optimized program for future use
mipro_optimized_program.save("mipro_optimized")

[93m[1mProjected Language Model (LM) Calls[0m

Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:


[93m- Prompt Model: [94m[1m10[0m[93m data summarizer calls + [94m[1m7[0m[93m * [94m[1m1[0m[93m lm calls in program + ([94m[1m2[0m[93m) lm calls in program aware proposer = [94m[1m19[0m[93m prompt model calls[0m
[93m- Task Model: [94m[1m25[0m[93m examples in minibatch * [94m[1m10[0m[93m batches + [94m[1m200[0m[93m examples in val set * [94m[1m1[0m[93m full evals = [94m[1m300[0m[93m task model calls[0m

[93m[1mEstimated Cost Calculation:[0m

[93mTotal Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token) 
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per 

  2%|▏         | 1/50 [00:02<01:55,  2.37s/it]

Prediction(
    assessment_answer='2'
)


  4%|▍         | 2/50 [00:04<01:53,  2.36s/it]

Prediction(
    assessment_answer='3'
)


  6%|▌         | 3/50 [00:08<02:19,  2.97s/it]


Prediction(
    assessment_answer='3'
)
Bootstrapped 3 full traces after 4 examples in round 0.
Bootstrapping set 4/7


  2%|▏         | 1/50 [00:04<03:16,  4.00s/it]

Prediction(
    assessment_answer='3'
)


  4%|▍         | 2/50 [00:07<02:53,  3.62s/it]


Prediction(
    assessment_answer='3'
)
Bootstrapped 2 full traces after 3 examples in round 0.
Bootstrapping set 5/7


  2%|▏         | 1/50 [00:02<01:53,  2.32s/it]


Prediction(
    assessment_answer='4'
)
Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 6/7


  2%|▏         | 1/50 [00:06<04:55,  6.03s/it]


Prediction(
    assessment_answer='3'
)
Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 7/7


  2%|▏         | 1/50 [00:02<01:46,  2.17s/it]

Prediction(
    assessment_answer='3'
)
Bootstrapped 1 full traces after 2 examples in round 0.

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
In this step, by default we will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.






Proposing instructions...

Proposed Instructions for Predictor 0:

0: Given the fields `topic`, produce the fields `tweet`.

1: Given a topic, generate a tweet that effectively communicates the core message of the topic. First, outline your reasoning process step-by-step to ensure the tweet is coherent, engaging, and relevant. Then, craft the tweet using a conversational tone with concise, actionable insights, and consider incorporating anecdotes or inspirational quotes to enhance its relatability and impact.

2: You are a social media content creator specializing in motivational and contemporary topics. Given the field `topic`, generate a tweet that is engaging, concise, and relevant to the topic. The tweet should exhibit a conversational tone, provide actionable insights, and may include anecdotes or inspirational quotes to enhance relatability and impact.

3: Given a `topic`, generate a thoughtful reasoning followed by a concise, engaging tweet that encapsulates the essence of the 

Average Metric: 576 / 200  (288.0): 100%|██████████| 200/200 [00:00<00:00, 3450.00it/s]


Default program score: 288.0
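
A note on reading these numbers: the scores exceed 100 because the progress bar appears to report the sum of the judge's ratings divided by the number of examples, times 100 (576 / 200 gives 288.0, an average rating of 2.88). Here is a minimal sketch of that arithmetic. It assumes the metric returns the integer in `assessment_answer`; the helper name `average_metric` is ours for illustration and is not a DSPy function.

```python
def average_metric(ratings):
    """Return (total, count, score) in the same form the progress bar prints.

    Assumption: each rating is the integer value of `assessment_answer`.
    """
    total = sum(ratings)
    count = len(ratings)
    score = round(100 * total / count, 1)
    return total, count, score


# Minibatch Trial 1 in the log: 24 ratings of 3 and one rating of 4
trial_1_ratings = [3] * 24 + [4]
total, count, score = average_metric(trial_1_ratings)
print(f"Average Metric: {total} / {count}  ({score})")
```

So a score of 300.0 means the judge gave a 3 on average, and the differences between trials below come from only a handful of 2s and 4s.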

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
In this step, we will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination. Bayesian Optimization will be used for this search process.

== Minibatch Trial 1 / 10 ==


Average Metric: 3 / 1  (300.0):   4%|▍         | 1/25 [00:01<00:42,  1.77s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 6 / 2  (300.0):   8%|▊         | 2/25 [00:02<00:25,  1.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 13 / 4  (325.0):  12%|█▏        | 3/25 [00:03<00:25,  1.15s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 16 / 5  (320.0):  20%|██        | 5/25 [00:04<00:13,  1.46it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  24%|██▍       | 6/25 [00:06<00:18,  1.02it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  32%|███▏      | 8/25 [00:08<00:17,  1.03s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 28 / 9  (311.1):  36%|███▌      | 9/25 [00:08<00:13,  1.18it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 31 / 10  (310.0):  40%|████      | 10/25 [00:09<00:14,  1.00it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 34 / 11  (309.1):  44%|████▍     | 11/25 [00:11<00:16,  1.16s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 37 / 12  (308.3):  48%|████▊     | 12/25 [00:12<00:13,  1.08s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 43 / 14  (307.1):  52%|█████▏    | 13/25 [00:13<00:12,  1.05s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 46 / 15  (306.7):  60%|██████    | 15/25 [00:14<00:07,  1.36it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 49 / 16  (306.2):  64%|██████▍   | 16/25 [00:15<00:07,  1.20it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 61 / 20  (305.0):  76%|███████▌  | 19/25 [00:15<00:04,  1.26it/s]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 64 / 21  (304.8):  84%|████████▍ | 21/25 [00:17<00:01,  2.00it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 67 / 22  (304.5):  88%|████████▊ | 22/25 [00:17<00:01,  1.82it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 70 / 23  (304.3):  92%|█████████▏| 23/25 [00:19<00:01,  1.39it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 73 / 24  (304.2):  96%|█████████▌| 24/25 [00:20<00:00,  1.21it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 76 / 25  (304.0): 100%|██████████| 25/25 [00:21<00:00,  1.17it/s]


Prediction(
    assessment_answer='3'
)
Score: 304.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].


== Minibatch Trial 2 / 10 ==


Average Metric: 6 / 2  (300.0):   4%|▍         | 1/25 [00:05<02:00,  5.03s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 13 / 4  (325.0):  12%|█▏        | 3/25 [00:10<01:09,  3.16s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 19 / 6  (316.7):  20%|██        | 5/25 [00:14<00:54,  2.74s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  28%|██▊       | 7/25 [00:20<00:49,  2.77s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  32%|███▏      | 8/25 [00:20<00:38,  2.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 28 / 9  (311.1):  36%|███▌      | 9/25 [00:25<00:44,  2.81s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 31 / 10  (310.0):  40%|████      | 10/25 [00:26<00:36,  2.41s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 37 / 12  (308.3):  44%|████▍     | 11/25 [00:31<00:43,  3.08s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 43 / 14  (307.1):  56%|█████▌    | 14/25 [00:36<00:23,  2.16s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 46 / 15  (306.7):  60%|██████    | 15/25 [00:41<00:27,  2.75s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 49 / 16  (306.2):  64%|██████▍   | 16/25 [00:41<00:20,  2.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 52 / 17  (305.9):  68%|██████▊   | 17/25 [00:45<00:21,  2.66s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 55 / 18  (305.6):  72%|███████▏  | 18/25 [00:47<00:16,  2.34s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 58 / 19  (305.3):  76%|███████▌  | 19/25 [00:49<00:13,  2.33s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 20  (310.0):  80%|████████  | 20/25 [00:51<00:11,  2.27s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 65 / 21  (309.5):  84%|████████▍ | 21/25 [00:53<00:09,  2.30s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 68 / 22  (309.1):  88%|████████▊ | 22/25 [00:55<00:06,  2.12s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 71 / 23  (308.7):  92%|█████████▏| 23/25 [00:57<00:04,  2.13s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 24  (308.3):  96%|█████████▌| 24/25 [00:59<00:02,  2.13s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 77 / 25  (308.0): 100%|██████████| 25/25 [01:05<00:00,  2.61s/it]


Prediction(
    assessment_answer='3'
)
Score: 308.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 5'].


== Minibatch Trial 3 / 10 ==


Average Metric: 3 / 1  (300.0):   4%|▍         | 1/25 [00:03<01:29,  3.73s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 6 / 2  (300.0):   8%|▊         | 2/25 [00:03<00:37,  1.65s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 12 / 4  (300.0):  12%|█▏        | 3/25 [00:07<00:55,  2.51s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 15 / 5  (300.0):  20%|██        | 5/25 [00:08<00:25,  1.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 21 / 7  (300.0):  24%|██▍       | 6/25 [00:11<00:36,  1.92s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 27 / 9  (300.0):  32%|███▏      | 8/25 [00:14<00:30,  1.80s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 30 / 10  (300.0):  40%|████      | 10/25 [00:15<00:17,  1.16s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 34 / 11  (309.1):  44%|████▍     | 11/25 [00:18<00:23,  1.69s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 37 / 12  (308.3):  48%|████▊     | 12/25 [00:19<00:17,  1.34s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 40 / 13  (307.7):  52%|█████▏    | 13/25 [00:22<00:23,  1.96s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 43 / 14  (307.1):  56%|█████▌    | 14/25 [00:23<00:17,  1.62s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 46 / 15  (306.7):  60%|██████    | 15/25 [00:26<00:18,  1.90s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 50 / 16  (312.5):  64%|██████▍   | 16/25 [00:27<00:14,  1.64s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 53 / 17  (311.8):  68%|██████▊   | 17/25 [00:30<00:16,  2.11s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 18  (311.1):  72%|███████▏  | 18/25 [00:31<00:11,  1.68s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 59 / 19  (310.5):  76%|███████▌  | 19/25 [00:34<00:12,  2.15s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 63 / 20  (315.0):  80%|████████  | 20/25 [00:34<00:08,  1.66s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 66 / 21  (314.3):  84%|████████▍ | 21/25 [00:38<00:08,  2.12s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 69 / 22  (313.6):  88%|████████▊ | 22/25 [00:38<00:04,  1.55s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 73 / 23  (317.4):  92%|█████████▏| 23/25 [00:42<00:04,  2.22s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 77 / 24  (320.8):  96%|█████████▌| 24/25 [00:43<00:01,  1.84s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 80 / 25  (320.0): 100%|██████████| 25/25 [00:46<00:00,  1.85s/it]


Prediction(
    assessment_answer='3'
)
Score: 320.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 4 / 10 ==


Average Metric: 7 / 2  (350.0):   4%|▍         | 1/25 [00:08<03:16,  8.18s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)


Average Metric: 13 / 4  (325.0):  16%|█▌        | 4/25 [00:13<00:55,  2.64s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 16 / 5  (320.0):  20%|██        | 5/25 [00:18<01:11,  3.55s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 20 / 6  (333.3):  24%|██▍       | 6/25 [00:18<00:47,  2.49s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 29 / 9  (322.2):  32%|███▏      | 8/25 [00:24<00:58,  3.46s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 32 / 10  (320.0):  40%|████      | 10/25 [00:25<00:23,  1.57s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 35 / 11  (318.2):  44%|████▍     | 11/25 [00:32<00:38,  2.76s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 38 / 12  (316.7):  48%|████▊     | 12/25 [00:32<00:29,  2.26s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 41 / 13  (315.4):  52%|█████▏    | 13/25 [00:38<00:36,  3.05s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 44 / 14  (314.3):  56%|█████▌    | 14/25 [00:43<00:40,  3.71s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 15  (313.3):  60%|██████    | 15/25 [00:43<00:27,  2.76s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 18  (311.1):  68%|██████▊   | 17/25 [00:51<00:32,  4.08s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 59 / 19  (310.5):  76%|███████▌  | 19/25 [00:53<00:12,  2.16s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 20  (310.0):  80%|████████  | 20/25 [01:00<00:16,  3.25s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 65 / 21  (309.5):  84%|████████▍ | 21/25 [01:02<00:11,  2.87s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 71 / 23  (308.7):  88%|████████▊ | 22/25 [01:09<00:12,  4.06s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 24  (308.3):  96%|█████████▌| 24/25 [01:19<00:04,  4.31s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 77 / 25  (308.0): 100%|██████████| 25/25 [01:20<00:00,  3.21s/it]


Prediction(
    assessment_answer='3'
)
Score: 308.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 5 / 10 ==
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 15 / 5  (300.0):  16%|█▌        | 4/25 [00:09<00:50,  2.40s/it] 

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  24%|██▍       | 6/25 [00:11<00:33,  1.76s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  32%|███▏      | 8/25 [00:17<00:38,  2.24s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 9  (322.2):  36%|███▌      | 9/25 [00:18<00:30,  1.90s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 32 / 10  (320.0):  40%|████      | 10/25 [00:24<00:44,  2.99s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 36 / 11  (327.3):  44%|████▍     | 11/25 [00:27<00:39,  2.83s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 43 / 13  (330.8):  48%|████▊     | 12/25 [00:33<00:48,  3.70s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 14  (335.7):  56%|█████▌    | 14/25 [00:36<00:30,  2.73s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 51 / 15  (340.0):  60%|██████    | 15/25 [00:41<00:33,  3.38s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 57 / 17  (335.3):  64%|██████▍   | 16/25 [00:43<00:26,  2.99s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 60 / 18  (333.3):  72%|███████▏  | 18/25 [00:49<00:21,  3.06s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 66 / 20  (330.0):  76%|███████▌  | 19/25 [00:52<00:17,  2.90s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 72 / 22  (327.3):  84%|████████▍ | 21/25 [01:01<00:14,  3.63s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 78 / 24  (325.0):  92%|█████████▏| 23/25 [01:03<00:05,  2.64s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 81 / 25  (324.0): 100%|██████████| 25/25 [01:11<00:00,  2.85s/it]


Prediction(
    assessment_answer='3'
)
Score: 324.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 6 / 10 ==


Average Metric: 4 / 1  (400.0):   4%|▍         | 1/25 [00:11<04:39, 11.67s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 7 / 2  (350.0):   8%|▊         | 2/25 [00:12<01:57,  5.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 10 / 3  (333.3):  12%|█▏        | 3/25 [00:22<02:45,  7.54s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 13 / 4  (325.0):  16%|█▌        | 4/25 [00:24<01:51,  5.33s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 16 / 5  (320.0):  20%|██        | 5/25 [00:35<02:28,  7.43s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 19 / 6  (316.7):  24%|██▍       | 6/25 [00:38<01:49,  5.75s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  28%|██▊       | 7/25 [00:47<02:06,  7.04s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):  32%|███▏      | 8/25 [00:50<01:38,  5.78s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 28 / 9  (311.1):  36%|███▌      | 9/25 [01:00<01:48,  6.81s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 32 / 10  (320.0):  40%|████      | 10/25 [01:04<01:31,  6.08s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 38 / 12  (316.7):  44%|████▍     | 11/25 [01:14<01:43,  7.39s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 42 / 13  (323.1):  52%|█████▏    | 13/25 [01:18<00:57,  4.82s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 45 / 14  (321.4):  56%|█████▌    | 14/25 [01:27<01:04,  5.84s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 48 / 15  (320.0):  60%|██████    | 15/25 [01:29<00:48,  4.88s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 51 / 16  (318.8):  64%|██████▍   | 16/25 [01:39<00:55,  6.20s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 54 / 17  (317.6):  68%|██████▊   | 17/25 [01:41<00:40,  5.00s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 57 / 18  (316.7):  72%|███████▏  | 18/25 [01:48<00:40,  5.76s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 60 / 19  (315.8):  76%|███████▌  | 19/25 [01:55<00:36,  6.03s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 63 / 20  (315.0):  80%|████████  | 20/25 [02:02<00:31,  6.23s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 66 / 21  (314.3):  84%|████████▍ | 21/25 [02:08<00:25,  6.31s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 69 / 22  (313.6):  88%|████████▊ | 22/25 [02:10<00:14,  4.98s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 75 / 24  (312.5):  92%|█████████▏| 23/25 [02:17<00:10,  5.39s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 78 / 25  (312.0): 100%|██████████| 25/25 [02:17<00:00,  5.52s/it]


Prediction(
    assessment_answer='3'
)
Score: 312.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 6'].


== Minibatch Trial 7 / 10 ==


Average Metric: 3 / 1  (300.0):   4%|▍         | 1/25 [00:08<03:18,  8.26s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 6 / 2  (300.0):   8%|▊         | 2/25 [00:10<01:51,  4.86s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 9 / 3  (300.0):  12%|█▏        | 3/25 [00:14<01:39,  4.51s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 12 / 4  (300.0):  16%|█▌        | 4/25 [00:17<01:23,  3.97s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 21 / 7  (300.0):  24%|██▍       | 6/25 [00:24<01:33,  4.90s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 24 / 8  (300.0):  32%|███▏      | 8/25 [00:28<00:46,  2.72s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 27 / 9  (300.0):  36%|███▌      | 9/25 [00:32<00:47,  3.00s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 30 / 10  (300.0):  40%|████      | 10/25 [00:35<00:44,  2.97s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 33 / 11  (300.0):  44%|████▍     | 11/25 [00:39<00:45,  3.24s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 37 / 12  (308.3):  48%|████▊     | 12/25 [00:45<00:52,  4.07s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 40 / 13  (307.7):  52%|█████▏    | 13/25 [00:50<00:51,  4.33s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 42 / 14  (300.0):  56%|█████▌    | 14/25 [00:53<00:44,  4.03s/it]

Prediction(
    assessment_answer='2'
)


Average Metric: 48 / 16  (300.0):  60%|██████    | 15/25 [00:59<00:43,  4.36s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 51 / 17  (300.0):  68%|██████▊   | 17/25 [01:00<00:22,  2.77s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 54 / 18  (300.0):  72%|███████▏  | 18/25 [01:08<00:28,  4.10s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 57 / 19  (300.0):  76%|███████▌  | 19/25 [01:09<00:19,  3.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 21  (295.2):  80%|████████  | 20/25 [01:21<00:27,  5.57s/it]

Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 65 / 22  (295.5):  88%|████████▊ | 22/25 [01:22<00:09,  3.26s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 71 / 24  (295.8):  92%|█████████▏| 23/25 [01:31<00:09,  4.75s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 25  (296.0): 100%|██████████| 25/25 [01:37<00:00,  3.89s/it]


Prediction(
    assessment_answer='3'
)
Score: 296.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].


== Minibatch Trial 8 / 10 ==


Average Metric: 3 / 1  (300.0):   4%|▍         | 1/25 [00:03<01:25,  3.58s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 9 / 3  (300.0):   8%|▊         | 2/25 [00:04<00:43,  1.90s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 12 / 4  (300.0):  16%|█▌        | 4/25 [00:07<00:34,  1.64s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 18 / 6  (300.0):  20%|██        | 5/25 [00:07<00:25,  1.27s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 22 / 7  (314.3):  28%|██▊       | 7/25 [00:10<00:24,  1.38s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 25 / 8  (312.5):  32%|███▏      | 8/25 [00:11<00:19,  1.14s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 9  (322.2):  36%|███▌      | 9/25 [00:13<00:25,  1.57s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 41 / 13  (315.4):  48%|████▊     | 12/25 [00:14<00:17,  1.34s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 44 / 14  (314.3):  56%|█████▌    | 14/25 [00:17<00:10,  1.00it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 15  (313.3):  60%|██████    | 15/25 [00:18<00:09,  1.01it/s]

Prediction(
    assessment_answer='3'
)


Average Metric: 50 / 16  (312.5):  64%|██████▍   | 16/25 [00:21<00:12,  1.35s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 18  (311.1):  68%|██████▊   | 17/25 [00:22<00:09,  1.19s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 59 / 19  (310.5):  76%|███████▌  | 19/25 [00:24<00:07,  1.28s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 62 / 20  (310.0):  80%|████████  | 20/25 [00:25<00:05,  1.12s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 65 / 21  (309.5):  84%|████████▍ | 21/25 [00:28<00:06,  1.56s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 68 / 22  (309.1):  88%|████████▊ | 22/25 [00:29<00:04,  1.34s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 24  (308.3):  92%|█████████▏| 23/25 [00:32<00:03,  1.79s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 77 / 25  (308.0): 100%|██████████| 25/25 [00:32<00:00,  1.31s/it]


Prediction(
    assessment_answer='3'
)
Score: 308.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 9 / 10 ==


Average Metric: 5 / 2  (250.0):   8%|▊         | 2/25 [00:02<00:28,  1.24s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='2'
)


Average Metric: 8 / 3  (266.7):  12%|█▏        | 3/25 [00:05<00:43,  1.97s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 11 / 4  (275.0):  16%|█▌        | 4/25 [00:06<00:27,  1.31s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 17 / 6  (283.3):  20%|██        | 5/25 [00:08<00:37,  1.87s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 20 / 7  (285.7):  28%|██▊       | 7/25 [00:11<00:29,  1.66s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 23 / 8  (287.5):  32%|███▏      | 8/25 [00:12<00:21,  1.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 26 / 9  (288.9):  36%|███▌      | 9/25 [00:14<00:25,  1.57s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 29 / 10  (290.0):  40%|████      | 10/25 [00:14<00:19,  1.29s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 35 / 12  (291.7):  44%|████▍     | 11/25 [00:17<00:23,  1.70s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 38 / 13  (292.3):  52%|█████▏    | 13/25 [00:20<00:18,  1.54s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 41 / 14  (292.9):  56%|█████▌    | 14/25 [00:20<00:13,  1.27s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 16  (293.8):  60%|██████    | 15/25 [00:23<00:16,  1.60s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 53 / 18  (294.4):  72%|███████▏  | 18/25 [00:26<00:08,  1.18s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 19  (294.7):  76%|███████▌  | 19/25 [00:28<00:08,  1.45s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 59 / 20  (295.0):  80%|████████  | 20/25 [00:31<00:09,  1.87s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 65 / 22  (295.5):  84%|████████▍ | 21/25 [00:32<00:06,  1.75s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 68 / 23  (295.7):  92%|█████████▏| 23/25 [00:34<00:02,  1.31s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 25  (296.0): 100%|██████████| 25/25 [00:37<00:00,  1.49s/it]


Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Score: 296.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].


== Minibatch Trial 10 / 10 ==


Average Metric: 3 / 1  (300.0):   4%|▍         | 1/25 [00:07<03:07,  7.80s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 6 / 2  (300.0):   8%|▊         | 2/25 [00:08<01:19,  3.44s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 9 / 3  (300.0):  12%|█▏        | 3/25 [00:14<01:42,  4.65s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 12 / 4  (300.0):  16%|█▌        | 4/25 [00:14<01:04,  3.06s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 15 / 5  (300.0):  20%|██        | 5/25 [00:21<01:30,  4.50s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 18 / 6  (300.0):  24%|██▍       | 6/25 [00:22<00:59,  3.12s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 21 / 7  (300.0):  28%|██▊       | 7/25 [00:29<01:21,  4.54s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 24 / 8  (300.0):  32%|███▏      | 8/25 [00:30<00:57,  3.40s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 27 / 9  (300.0):  36%|███▌      | 9/25 [00:36<01:06,  4.14s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 30 / 10  (300.0):  40%|████      | 10/25 [00:37<00:45,  3.03s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 33 / 11  (300.0):  44%|████▍     | 11/25 [00:43<00:56,  4.02s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 36 / 12  (300.0):  48%|████▊     | 12/25 [00:43<00:37,  2.92s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 38 / 13  (292.3):  52%|█████▏    | 13/25 [00:49<00:45,  3.79s/it]

Prediction(
    assessment_answer='2'
)


Average Metric: 41 / 14  (292.9):  56%|█████▌    | 14/25 [00:50<00:32,  2.94s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 44 / 15  (293.3):  60%|██████    | 15/25 [00:56<00:38,  3.81s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 47 / 16  (293.8):  64%|██████▍   | 16/25 [00:58<00:29,  3.31s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 51 / 17  (300.0):  68%|██████▊   | 17/25 [01:02<00:28,  3.58s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 54 / 18  (300.0):  72%|███████▏  | 18/25 [01:05<00:23,  3.32s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 57 / 19  (300.0):  76%|███████▌  | 19/25 [01:10<00:23,  3.91s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 60 / 20  (300.0):  80%|████████  | 20/25 [01:13<00:17,  3.53s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 63 / 21  (300.0):  84%|████████▍ | 21/25 [01:19<00:17,  4.25s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 66 / 22  (300.0):  88%|████████▊ | 22/25 [01:22<00:11,  3.91s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 70 / 24  (291.7):  92%|█████████▏| 23/25 [01:26<00:07,  3.90s/it]

Prediction(
    assessment_answer='2'
)
Prediction(
    assessment_answer='2'
)


Average Metric: 73 / 25  (292.0): 100%|██████████| 25/25 [01:31<00:00,  3.68s/it]


Prediction(
    assessment_answer='3'
)
Score: 292.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 2'].
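
Before the full evaluation, it helps to see which configuration it picks. The full eval reports an "Avg Score" of 316.0, which matches the mean of the two minibatch scores logged for Instruction 6 with Few-Shot Set 1 (308.0 and 324.0). The sketch below groups the ten trial scores by configuration and averages them. This is our reading of the log, not DSPy's internal code, and `mean_score_by_config` is a hypothetical helper.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score), copied from the ten trials above
trials = [
    (1, 5, 304.0),
    (2, 5, 308.0),
    (4, 1, 320.0),
    (6, 1, 308.0),
    (6, 1, 324.0),
    (5, 6, 312.0),
    (2, 3, 296.0),
    (4, 1, 308.0),
    (1, 2, 296.0),
    (3, 2, 292.0),
]


def mean_score_by_config(trials):
    """Average the minibatch scores of trials that share a configuration."""
    scores = defaultdict(list)
    for instruction, demo_set, score in trials:
        scores[(instruction, demo_set)].append(score)
    return {config: sum(vals) / len(vals) for config, vals in scores.items()}


averages = mean_score_by_config(trials)
best_config = max(averages, key=averages.get)
print(best_config, averages[best_config])
```

Note that Instruction 4 with Few-Shot Set 1 had the best single trial (320.0) but averages 314.0 over its two trials, so averaging repeated minibatch runs changes which candidate looks best.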


===== Full Eval 1 =====
Doing full eval on next top averaging program (Avg Score: 316.0) so far from mini-batch trials...


Average Metric: 12 / 4  (300.0):   2%|▏         | 3/200 [00:10<33:42, 10.27s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 15 / 5  (300.0):   2%|▎         | 5/200 [00:18<10:47,  3.32s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 21 / 7  (300.0):   3%|▎         | 6/200 [00:21<10:32,  3.26s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 25 / 8  (312.5):   4%|▍         | 8/200 [00:27<10:07,  3.17s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 28 / 9  (311.1):   4%|▍         | 9/200 [00:30<09:41,  3.05s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 31 / 10  (310.0):   5%|▌         | 10/200 [00:34<10:14,  3.23s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 37 / 12  (308.3):   6%|▌         | 11/200 [00:40<12:27,  3.96s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 46 / 15  (306.7):   7%|▋         | 14/200 [00:46<11:13,  3.62s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 49 / 16  (306.2):   8%|▊         | 16/200 [00:47<06:16,  2.05s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 52 / 17  (305.9):   8%|▊         | 17/200 [00:54<08:57,  2.94s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 56 / 18  (311.1):   9%|▉         | 18/200 [00:59<10:01,  3.30s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 65 / 21  (309.5):  10%|█         | 20/200 [01:02<10:17,  3.43s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 68 / 22  (309.1):  11%|█         | 22/200 [01:05<06:36,  2.23s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 74 / 24  (308.3):  12%|█▏        | 23/200 [01:11<08:12,  2.79s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 80 / 26  (307.7):  12%|█▎        | 25/200 [01:12<05:51,  2.01s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 86 / 28  (307.1):  14%|█▎        | 27/200 [01:24<09:34,  3.32s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 90 / 29  (310.3):  14%|█▍        | 29/200 [01:33<10:52,  3.82s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 93 / 30  (310.0):  15%|█▌        | 30/200 [01:35<09:32,  3.37s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 96 / 31  (309.7):  16%|█▌        | 31/200 [01:43<12:26,  4.42s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 102 / 33  (309.1):  16%|█▌        | 32/200 [01:46<11:28,  4.10s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 105 / 34  (308.8):  17%|█▋        | 34/200 [01:55<11:52,  4.29s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 108 / 35  (308.6):  18%|█▊        | 35/200 [01:59<11:12,  4.08s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 111 / 36  (308.3):  18%|█▊        | 36/200 [02:07<14:05,  5.15s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 114 / 37  (308.1):  18%|█▊        | 37/200 [02:09<11:37,  4.28s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 117 / 38  (307.9):  19%|█▉        | 38/200 [02:18<15:16,  5.66s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 120 / 39  (307.7):  20%|█▉        | 39/200 [02:19<11:14,  4.19s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 130 / 42  (309.5):  20%|██        | 41/200 [02:29<15:36,  5.89s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)


Average Metric: 136 / 44  (309.1):  22%|██▏       | 43/200 [02:39<11:37,  4.44s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 139 / 45  (308.9):  22%|██▎       | 45/200 [02:39<07:40,  2.97s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 145 / 47  (308.5):  23%|██▎       | 46/200 [02:50<11:31,  4.49s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 148 / 48  (308.3):  24%|██▍       | 48/200 [02:51<07:44,  3.06s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 167 / 54  (309.3):  26%|██▋       | 53/200 [02:59<10:19,  4.21s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 170 / 55  (309.1):  28%|██▊       | 55/200 [03:01<04:09,  1.72s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 179 / 58  (308.6):  28%|██▊       | 57/200 [03:08<05:39,  2.38s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 185 / 60  (308.3):  30%|██▉       | 59/200 [03:12<04:39,  1.98s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 188 / 61  (308.2):  30%|███       | 61/200 [03:17<04:58,  2.15s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 205 / 66  (310.6):  32%|███▎      | 65/200 [03:25<06:46,  3.01s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 209 / 67  (311.9):  34%|███▎      | 67/200 [03:28<04:02,  1.83s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 215 / 69  (311.6):  34%|███▍      | 68/200 [03:34<05:09,  2.34s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 218 / 70  (311.4):  35%|███▌      | 70/200 [03:37<04:36,  2.13s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 224 / 72  (311.1):  36%|███▌      | 71/200 [03:46<06:44,  3.14s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 231 / 74  (312.2):  36%|███▋      | 73/200 [03:50<05:56,  2.81s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 234 / 75  (312.0):  38%|███▊      | 75/200 [03:58<06:48,  3.26s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 240 / 77  (311.7):  38%|███▊      | 76/200 [04:00<06:00,  2.90s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 249 / 80  (311.2):  40%|███▉      | 79/200 [04:08<06:38,  3.29s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 252 / 81  (311.1):  40%|████      | 81/200 [04:09<04:08,  2.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 255 / 82  (311.0):  41%|████      | 82/200 [04:18<06:25,  3.26s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 267 / 86  (310.5):  42%|████▎     | 85/200 [04:27<07:28,  3.90s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 274 / 88  (311.4):  44%|████▎     | 87/200 [04:28<04:01,  2.14s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)


Average Metric: 280 / 90  (311.1):  44%|████▍     | 89/200 [04:35<04:58,  2.68s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 283 / 91  (311.0):  46%|████▌     | 91/200 [04:36<03:27,  1.90s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 286 / 92  (310.9):  46%|████▌     | 92/200 [04:44<05:39,  3.15s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 289 / 93  (310.8):  46%|████▋     | 93/200 [04:49<06:14,  3.50s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 295 / 95  (310.5):  47%|████▋     | 94/200 [04:53<06:23,  3.62s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 298 / 96  (310.4):  48%|████▊     | 96/200 [04:57<05:01,  2.89s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 301 / 97  (310.3):  48%|████▊     | 97/200 [05:04<06:27,  3.76s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 304 / 98  (310.2):  49%|████▉     | 98/200 [05:06<05:48,  3.42s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 310 / 100  (310.0):  50%|████▉     | 99/200 [05:12<06:55,  4.11s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 313 / 101  (309.9):  50%|█████     | 101/200 [05:18<05:58,  3.62s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 316 / 102  (309.8):  51%|█████     | 102/200 [05:22<06:04,  3.72s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 319 / 103  (309.7):  52%|█████▏    | 103/200 [05:29<07:07,  4.41s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 322 / 104  (309.6):  52%|█████▏    | 104/200 [05:34<07:40,  4.80s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 340 / 110  (309.1):  55%|█████▍    | 109/200 [05:38<06:52,  4.53s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 346 / 112  (308.9):  56%|█████▌    | 111/200 [05:45<03:08,  2.11s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 349 / 113  (308.8):  56%|█████▋    | 113/200 [05:49<03:01,  2.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 352 / 114  (308.8):  57%|█████▋    | 114/200 [05:55<03:49,  2.67s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 356 / 115  (309.6):  57%|█████▊    | 115/200 [06:04<05:25,  3.83s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 359 / 116  (309.5):  58%|█████▊    | 116/200 [06:11<06:24,  4.57s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 365 / 118  (309.3):  58%|█████▊    | 117/200 [06:13<05:19,  3.84s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 371 / 120  (309.2):  60%|█████▉    | 119/200 [06:23<05:59,  4.44s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 374 / 121  (309.1):  60%|██████    | 121/200 [06:27<04:37,  3.51s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 380 / 123  (308.9):  61%|██████    | 122/200 [06:34<05:30,  4.24s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 383 / 124  (308.9):  62%|██████▏   | 124/200 [06:41<05:03,  3.99s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 386 / 125  (308.8):  62%|██████▎   | 125/200 [06:46<05:06,  4.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 392 / 127  (308.7):  63%|██████▎   | 126/200 [06:52<05:37,  4.56s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 395 / 128  (308.6):  64%|██████▍   | 128/200 [06:58<04:50,  4.04s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 398 / 129  (308.5):  64%|██████▍   | 129/200 [07:05<05:30,  4.65s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 405 / 131  (309.2):  65%|██████▌   | 130/200 [07:10<05:21,  4.59s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 417 / 135  (308.9):  67%|██████▋   | 134/200 [07:16<04:24,  4.00s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 427 / 138  (309.4):  68%|██████▊   | 137/200 [07:20<02:33,  2.43s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 430 / 139  (309.4):  70%|██████▉   | 139/200 [07:25<02:12,  2.18s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 433 / 140  (309.3):  70%|███████   | 140/200 [07:30<02:31,  2.53s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 436 / 141  (309.2):  70%|███████   | 141/200 [07:36<03:07,  3.18s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 439 / 142  (309.2):  71%|███████   | 142/200 [07:39<02:53,  2.99s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 442 / 143  (309.1):  72%|███████▏  | 143/200 [07:48<04:10,  4.40s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 445 / 144  (309.0):  72%|███████▏  | 144/200 [07:50<03:42,  3.98s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 449 / 145  (309.7):  72%|███████▎  | 145/200 [08:00<04:52,  5.31s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 453 / 146  (310.3):  73%|███████▎  | 146/200 [08:01<03:51,  4.29s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 460 / 148  (310.8):  74%|███████▎  | 147/200 [08:13<05:43,  6.48s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 466 / 150  (310.7):  74%|███████▍  | 149/200 [08:15<03:19,  3.92s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 469 / 151  (310.6):  76%|███████▌  | 151/200 [08:25<03:37,  4.45s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 472 / 152  (310.5):  76%|███████▌  | 152/200 [08:27<03:05,  3.87s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 476 / 153  (311.1):  76%|███████▋  | 153/200 [08:36<03:54,  4.98s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 489 / 157  (311.5):  78%|███████▊  | 156/200 [08:39<03:21,  4.58s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 492 / 158  (311.4):  79%|███████▉  | 158/200 [08:48<02:16,  3.25s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 495 / 159  (311.3):  80%|███████▉  | 159/200 [08:54<02:29,  3.64s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 498 / 160  (311.2):  80%|████████  | 160/200 [08:58<02:31,  3.78s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 508 / 163  (311.7):  81%|████████  | 162/200 [09:05<02:47,  4.42s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 517 / 166  (311.4):  82%|████████▎ | 165/200 [09:07<01:29,  2.57s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 532 / 171  (311.1):  85%|████████▌ | 170/200 [09:13<01:11,  2.40s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 539 / 173  (311.6):  86%|████████▌ | 172/200 [09:15<00:39,  1.41s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 542 / 174  (311.5):  87%|████████▋ | 174/200 [09:21<00:46,  1.79s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 545 / 175  (311.4):  88%|████████▊ | 175/200 [09:25<00:52,  2.09s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 548 / 176  (311.4):  88%|████████▊ | 176/200 [09:34<01:15,  3.13s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 554 / 178  (311.2):  88%|████████▊ | 177/200 [09:35<01:01,  2.68s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 557 / 179  (311.2):  90%|████████▉ | 179/200 [09:42<01:03,  3.00s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 566 / 182  (311.0):  90%|█████████ | 181/200 [09:43<00:48,  2.57s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 569 / 183  (310.9):  92%|█████████▏| 183/200 [09:49<00:39,  2.34s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 579 / 186  (311.3):  92%|█████████▎| 185/200 [09:51<00:33,  2.25s/it]

Prediction(
    assessment_answer='4'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 583 / 187  (311.8):  94%|█████████▎| 187/200 [09:57<00:27,  2.14s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 595 / 191  (311.5):  95%|█████████▌| 190/200 [09:59<00:21,  2.20s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 601 / 193  (311.4):  96%|█████████▌| 192/200 [10:09<00:18,  2.30s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 607 / 195  (311.3):  97%|█████████▋| 194/200 [10:09<00:10,  1.77s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 610 / 196  (311.2):  98%|█████████▊| 196/200 [10:20<00:10,  2.67s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 616 / 198  (311.1):  98%|█████████▊| 197/200 [10:21<00:07,  2.53s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 619 / 199  (311.1): 100%|█████████▉| 199/200 [10:29<00:02,  2.89s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 622 / 200  (311.0): 100%|██████████| 200/200 [10:33<00:00,  3.17s/it]

Prediction(
    assessment_answer='3'
)
Best full eval score so far! Score: 311.0


[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [115]:
evaluator = Evaluate(devset=devset[:10], metric=creativity_metric, num_threads=2, display_progress=True, display_table=5, provide_traceback=True)
evaluator(mipro_optimized_program)

Average Metric: 4 / 1  (400.0):  10%|█         | 1/10 [00:08<01:14,  8.31s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 8 / 2  (400.0):  20%|██        | 2/10 [00:09<00:30,  3.83s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 20 / 6  (333.3):  50%|█████     | 5/10 [00:18<00:19,  3.96s/it]

Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)
Prediction(
    assessment_answer='3'
)


Average Metric: 23 / 7  (328.6):  70%|███████   | 7/10 [00:27<00:09,  3.30s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 26 / 8  (325.0):  80%|████████  | 8/10 [00:32<00:07,  3.65s/it]

Prediction(
    assessment_answer='3'
)


Average Metric: 30 / 9  (333.3):  90%|█████████ | 9/10 [00:39<00:04,  4.53s/it]

Prediction(
    assessment_answer='4'
)


Average Metric: 33 / 10  (330.0): 100%|██████████| 10/10 [00:43<00:00,  4.39s/it]

Prediction(
    assessment_answer='3'
)





index,example_tweet,engagement,topic,reasoning,pred_tweet,creativity_metric
0,"Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you...",3388,Learning and Experience,"Learning and experience are two sides of the same coin. While learning provides the theoretical foundation and knowledge, experience offers practical application and deeper understanding....","Learning gives you the map, but experience is the journey. 🌍📚 Embrace both to navigate life’s challenges and grow continuously. #LifelongLearning #ExperienceMatters",✔️ [4]
1,Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h,1647,Everyday Life/Humor,"Everyday life is filled with small, relatable moments that can be both humorous and insightful. By highlighting these moments, we can connect with others on...","Why is it that the moment you decide to eat healthy, your fridge suddenly looks like a junk food paradise? 🍕🍫 #EverydayLife #Humor #Relatable",✔️ [4]
2,"If you have all the information to make a perfect decision, you missed the opportunity.",4534,Decision Making,"Decision making is a critical skill that impacts every aspect of our lives, from personal relationships to professional success. Effective decision making involves gathering relevant...","Struggling with decisions? Remember: indecision is a decision too. Gather info, weigh pros & cons, and align with your goals. Make choices that move you...",✔️ [3]
3,"I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able...",3551,Habit Formation,"Habit formation is a crucial aspect of personal development because it allows individuals to automate positive behaviors, making them a natural part of daily life....","Want to build a new habit? Start small. Identify a cue, establish a routine, and reward yourself. Consistency is key! 🌱 #HabitFormation #PersonalGrowth",✔️ [3]
4,The reason you are stressed is you have decisions to make and you’re not making them.,9078,Mental Health,"Mental health is a crucial aspect of overall well-being, yet it is often overlooked or stigmatized. Addressing mental health involves recognizing the signs of mental...","Your mental health is just as important as your physical health. 🌟 Don't hesitate to seek help, talk openly, and support each other. Let's break...",✔️ [3]


330.0

# MIPROv2 Results: 330

MIPROv2 did increase our score from 320 -> 330.

But as we can see, most of the creativity scores come out as 3, and this is becoming a problem. The LLM judge probably struggles to define 'creativity', especially for topics as 'dull' as business. Instead of a single 1-to-5 score, we need several 'yes/no' metrics that we ensemble to get a more reliable signal. Also, judging from the examples in the table above, the generated tweets are really long, so we need a conciseness metric as well or, in extreme cases, an assertion that the generated tweet stays within 280 characters. The judge may also be struggling because we generate and judge with the same LM, whereas the usual setup has a more capable model judge the output of a smaller, less capable one. Let's try a few changes:

1) Changing the metric to encompass more measures of a tweet's 'quality'
2) Using 4o-mini for generation and 4o for judging
3) Playing around with the hyperparameters of the DSPy optimizers a bit more
4) Manually checking a few examples of generation
5) Using a compiled DSPy program to judge tweet quality as the metric, i.e., something like second-order optimization
6) Using the tweets dataset for RAG, since we can't train on the whole thing (time constraints, rate limits, etc.)
7) Trying a different form of annotation/labels than just topics
8) Splitting the devset into validation and test sets, keeping the test set for final testing to prevent data leakage
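
To put a number on how flat the 1-to-5 creativity judge is, we can work backwards from the totals in the logs above. The sketch below assumes every judge answer was a 3 or a 4, which matches the `assessment_answer` values printed during these runs; `judge_summary` is a hypothetical helper, not a DSPy function.

```python
def judge_summary(total_score: int, n_examples: int) -> dict:
    """Summarise a run, ASSUMING every judge answer was either 3 or 4."""
    # With a baseline of 3 per example, each answer of 4 adds exactly one point.
    fours = total_score - 3 * n_examples
    return {
        "reported": round(100 * total_score / n_examples, 1),  # the value shown in parentheses by Evaluate
        "fours": fours,
        "share_of_fours": fours / n_examples,
    }

full_eval = judge_summary(622, 200)  # from "Average Metric: 622 / 200  (311.0)"
dev_slice = judge_summary(33, 10)    # from "Average Metric: 33 / 10  (330.0)"
print(full_eval)
print(dev_slice)
```

Under that assumption, only 22 of the 200 full-eval examples (11%) received a 4. On the 10-example slice, a single answer moving from 3 to 4 shifts the reported score by 10 points, which is the same size as the 320 -> 330 gain.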

# Experiment 2: without RAG

In [116]:
flattened_annotated_dataset[:5]

[Example({'tweet': 'Stop “friend financing”', 'engagement': 137, 'topic': 'Personal Finance'}) (input_keys={'topic'}),
 Example({'tweet': 'A message for friends of entrepreneurs:\n\nDon’t buy them a present. \n\nBuy their product and leave a nice review or tell them why it wasn’t good enough to deserve it (an even better gift).\n\nNo one needs more stuff. \nEveryone could use more support.', 'engagement': 979, 'topic': 'Support for Entrepreneurs'}) (input_keys={'topic'}),
 Example({'tweet': 'Rush is an illusion.', 'engagement': 908, 'topic': 'Philosophy'}) (input_keys={'topic'}),
 Example({'tweet': 'Advice to strong men:\n\nFind a strong woman.', 'engagement': 3574, 'topic': 'Relationships'}) (input_keys={'topic'}),
 Example({'tweet': 'Reminder:\n\nDeath tax is 100% for everyone.\n\n(Because whatever you have isn’t yours anymore after you die).', 'engagement': 1136, 'topic': 'Philosophy, Mortality'}) (input_keys={'topic'})]

In [119]:
# Despite its name, dev_set_n is the number of training examples; everything after it goes to the devset.
dev_set_n = 250
# Tell DSPy that the 'topic' field is the input. Any other fields are labels and/or metadata.
trainset = [x.without('engagement').with_inputs('topic') for x in flattened_annotated_dataset[:dev_set_n]]
devset = [x.without('engagement').with_inputs('topic') for x in flattened_annotated_dataset[dev_set_n:]]

In [120]:
trainset[0]

Example({'tweet': 'Stop “friend financing”', 'topic': 'Personal Finance'}) (input_keys={'topic'})

In [124]:
gpt4o = dspy.LM(model = 'gpt-4o', max_tokens=1000, model_type='chat')
mini = dspy.LM(model = 'gpt-4o-mini', max_tokens=1000, model_type='chat')
dspy.configure(lm = mini)

In [125]:
# Define the signature for automatic assessments.
class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    context = dspy.InputField(desc='ignore if N/A')
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")

def tweet_ensemble_metric(gold, pred, trace=None):
    topic, tweet = gold.topic, pred.tweet

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    concise = "Does the assessed text make for a concise, cogent tweet?"
    creative = "Does the assessed text make for a creative tweet?"
    relevant = f"The text above should be relevant `{topic}`. The gold answer is `{tweet}`."
    relevant = f"{relevant} Does the assessed text above contain the gold answer?"
    
    with dspy.context(lm=gpt4o):
        relevant =  dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=relevant)
        engaging = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=engaging)
        creative = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=creative)
        concise = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=concise)

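    # Convert each judge answer to a boolean: True when its first word is "yes".
    relevant, engaging, creative, concise = (m.assessment_answer.split()[0].lower() == 'yes' for m in [relevant, engaging, creative, concise])
    # Tweets longer than 280 characters score 0 regardless of the judge's answers.
    score = (relevant + engaging + creative + concise) if (len(tweet) <= 280) else 0

    # During bootstrapping (trace is set), accept a demo only if all four checks pass.
    if trace is not None:
        return score >= 4
    # During evaluation, return the fraction of checks passed.
    return score / 4.0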
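
Two details of `tweet_ensemble_metric` are worth a second look. First, `tweet` is bound to `pred.tweet`, so the "gold answer" interpolated into the relevance question is the prediction itself, which may make that check very easy to pass. Second, `split()[0].lower() == 'yes'` treats an answer like `Yes.` as a no and raises an `IndexError` on an empty answer. Below is a minimal sketch of how both could be handled. The helper names (`is_yes`, `relevance_question`, `ensemble_score`) and the reworded relevance question are our own assumptions, not part of DSPy, and the judge calls are left out so the logic can be checked without an API key.

```python
import re

MAX_TWEET_CHARS = 280  # same limit the metric above enforces


def is_yes(answer) -> bool:
    """True when the answer's first word is 'yes', ignoring case and punctuation."""
    match = re.match(r"\W*(\w+)", answer or "")
    return bool(match) and match.group(1).lower() == "yes"


def relevance_question(topic: str, gold_tweet: str) -> str:
    """Build the relevance question from the GOLD tweet (hypothetical wording)."""
    return (
        f"The text above should be relevant to `{topic}`. "
        f"The gold answer is `{gold_tweet}`. "
        "Is the assessed text on the same topic as the gold answer?"
    )


def ensemble_score(answers, tweet: str, trace=None):
    """Combine yes/no judge answers, zeroing the score for over-length tweets."""
    votes = sum(is_yes(a) for a in answers)
    score = votes if len(tweet) <= MAX_TWEET_CHARS else 0
    if trace is not None:
        # Bootstrapping: accept a demo only if every check passed.
        return score >= len(answers)
    # Evaluation: fraction of checks passed.
    return score / len(answers)


print(is_yes("Yes."), is_yes(""), is_yes("No"))
print(ensemble_score(["Yes", "yes.", "No", "YES"], "short tweet"))
```

If we swapped these helpers into the metric, the scores recorded in this notebook would likely change, so the evaluations would need to be re-run.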

In [127]:
devset, testset = devset[:-10], devset[-10:]
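
Together with the earlier `trainset`/`devset` cell, this slicing gives a three-way split. As a sanity check, here is a small sketch that reproduces the same slicing on a plain list and confirms the pieces don't overlap. It assumes 300 annotated examples, which is consistent with the 40-example evaluation runs in this experiment; `three_way_split` is a hypothetical helper.

```python
def three_way_split(examples, n_train, n_test):
    """Slice into train / dev / test: first n_train, last n_test, dev in between."""
    if n_test <= 0:
        # rest[:-0] would be an empty list, silently dropping the whole dev set.
        raise ValueError("n_test must be positive")
    train, rest = examples[:n_train], examples[n_train:]
    dev, test = rest[:-n_test], rest[-n_test:]
    return train, dev, test


examples = list(range(300))  # stand-in for flattened_annotated_dataset (size is an assumption)
train, dev, test = three_way_split(examples, n_train=250, n_test=10)
print(len(train), len(dev), len(test))  # 250 40 10

# No example should appear in more than one split.
assert not (set(train) & set(dev) or set(train) & set(test) or set(dev) & set(test))
```

Because the slices are taken in dataset order, the test set is always the last 10 examples; shuffling with a fixed seed before slicing would be one way to avoid any ordering effects.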

In [129]:
evaluator = Evaluate(devset=devset, metric=tweet_ensemble_metric, num_threads=2, display_progress=True, display_table=5, provide_traceback=True)
evaluator(uncompiled_tweet_writer) 

Average Metric: 38.5 / 40  (96.2): 100%|██████████| 40/40 [01:51<00:00,  2.79s/it] 


index,example_tweet,topic,reasoning,pred_tweet,tweet_ensemble_metric
0,"Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you...",Learning and Experience,"Learning and experience are intertwined; every lesson learned shapes our understanding and influences future decisions. Embracing both successes and failures enriches our journey, fostering growth...","""Learning is a journey, not a destination. Every experience, whether a success or a setback, shapes who we are and how we grow. Let's embrace...",✔️ [1.0]
1,Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h,Everyday Life/Humor,"Everyday life is filled with humorous moments that many can relate to, making it a great topic for a light-hearted tweet. By highlighting a common...","Why do we always open the fridge, stare at the contents, and then close it like we’re expecting a miracle? 🍕🥗 #EverydayLife #Humor",✔️ [1.0]
2,"If you have all the information to make a perfect decision, you missed the opportunity.",Decision Making,"Decision making is a critical skill that impacts every aspect of our lives, from personal choices to professional strategies. It involves evaluating options, considering potential...","🧠💡 Decision making is an essential skill that shapes our lives! Whether in personal choices or professional strategies, making informed decisions can lead to better...",✔️ [1.0]
3,"I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able...",Habit Formation,Habit formation is a crucial aspect of personal development and productivity. Understanding how habits are formed can help individuals create positive changes in their lives....,"🌱 Want to build better habits? Start small! Consistency is key. Focus on one action at a time, and watch it grow into a lasting...",✔️ [1.0]
4,The reason you are stressed is you have decisions to make and you’re not making them.,Mental Health,"Mental health is a crucial aspect of overall well-being that affects how we think, feel, and act. It's important to raise awareness about mental health...","🧠💚 Let's break the stigma around mental health! It's okay to not be okay. Remember, seeking help is a sign of strength. Let's support each...",✔️ [1.0]


96.25

## Uncompiled: 96.25

In [133]:
config = dict(max_bootstrapped_demos=15, max_labeled_demos=5)
optimizer = BootstrapFewShot(metric = tweet_ensemble_metric, **config)
optimized_tweet_writer = optimizer.compile(TweetWriter(), trainset=trainset[:10])
optimized_tweet_writer.save(path="fewshotv2")

100%|██████████| 10/10 [01:06<00:00,  6.63s/it]

Bootstrapped 9 full traces after 10 examples in round 0.
[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [134]:
evaluator(optimized_tweet_writer) 

Average Metric: 40.0 / 40  (100.0): 100%|██████████| 40/40 [01:51<00:00,  2.78s/it]


index,example_tweet,topic,reasoning,pred_tweet,tweet_ensemble_metric
0,"Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you...",Learning and Experience,"Learning through experience is one of the most effective ways to gain knowledge and skills. Each experience, whether positive or negative, offers valuable lessons that...",Experience is the best teacher! 📚✨ Every challenge and triumph brings valuable lessons. Embrace your journey and let your experiences shape your growth! #Learning #LifeLessons,✔️ [1.0]
1,Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h,Everyday Life/Humor,"Everyday life can often be mundane, but humor is a powerful tool to lighten our daily experiences. Finding joy and laughter in the little things...",Why did the scarecrow win an award? Because he was outstanding in his field! 🌾😄 Remember to find humor in everyday life—it makes the mundane...,✔️ [1.0]
2,"If you have all the information to make a perfect decision, you missed the opportunity.",Decision Making,"Effective decision-making is crucial in both personal and professional contexts. It involves evaluating options, considering potential outcomes, and aligning choices with our values and goals....","Every decision shapes your path. 🛤️ Take a moment to weigh your options, trust your instincts, and align your choices with your values. Make decisions...",✔️ [1.0]
3,"I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able...",Habit Formation,Habit formation is a crucial aspect of personal development and achieving long-term goals. Understanding the science behind how habits are formed can help us create...,Small changes lead to big results! 🌱 Focus on forming one positive habit at a time. Consistency is the secret ingredient to lasting change. What...,✔️ [1.0]
4,The reason you are stressed is you have decisions to make and you’re not making them.,Mental Health,"Mental health is a crucial aspect of overall well-being that affects how we think, feel, and act. Prioritizing mental health involves recognizing the importance of...","Mental health matters! 🧠💚 Let’s break the stigma and prioritize our well-being. Remember, it’s okay to ask for help and take time for self-care. You...",✔️ [1.0]


100.0

## BootstrapFewShot: 100

In [135]:
config = dict(max_labeled_demos = 1, max_bootstrapped_demos=1, num_candidate_programs=2)
teleprompter = BootstrapFewShotWithRandomSearch(metric = tweet_ensemble_metric, **config)
rs_optimized_tweet_writer = teleprompter.compile(TweetWriter(), trainset=trainset[:20])
rs_optimized_tweet_writer.save(path="fewshotwithrsv2")

Going to sample between 1 and 1 traces per predictor.
Will attempt to bootstrap 2 candidate sets.


Average Metric: 17.5 / 20  (87.5): 100%|██████████| 20/20 [00:23<00:00,  1.16s/it] 


New best score: 87.5 for seed -3
Scores so far: [87.5]
Best score so far: 87.5


Average Metric: 19.75 / 20  (98.8): 100%|██████████| 20/20 [00:26<00:00,  1.34s/it]


New best score: 98.75 for seed -2
Scores so far: [87.5, 98.75]
Best score so far: 98.75


 10%|█         | 2/20 [00:00<00:00, 971.24it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 20.0 / 20  (100.0): 100%|██████████| 20/20 [00:20<00:00,  1.03s/it]


New best score: 100.0 for seed -1
Scores so far: [87.5, 98.75, 100.0]
Best score so far: 100.0


  5%|▌         | 1/20 [00:09<03:05,  9.74s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 20.0 / 20  (100.0): 100%|██████████| 20/20 [00:19<00:00,  1.01it/s]


Scores so far: [87.5, 98.75, 100.0, 100.0]
Best score so far: 100.0


  5%|▌         | 1/20 [00:06<02:09,  6.82s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 20.0 / 20  (100.0): 100%|██████████| 20/20 [00:19<00:00,  1.02it/s]

Scores so far: [87.5, 98.75, 100.0, 100.0, 100.0]
Best score so far: 100.0
5 candidate programs found.
[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [136]:
evaluator(rs_optimized_tweet_writer) 

Average Metric: 38.0 / 40  (95.0): 100%|██████████| 40/40 [01:39<00:00,  2.49s/it]


Unnamed: 0,example_tweet,topic,reasoning,pred_tweet,tweet_ensemble_metric
0,"Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you...",Learning and Experience,"Learning through experience is one of the most effective ways to gain knowledge and skills. It allows individuals to apply theoretical concepts in real-world situations,...","📚✨ Learning through experience is key to personal and professional growth! By applying knowledge in real-world situations, we deepen our understanding and retention. Let's embrace...",✔️ [1.0]
1,Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h,Everyday Life/Humor,"Everyday life is filled with humorous moments that often go unnoticed. Sharing these light-hearted experiences can bring joy and laughter to others, reminding us not...",Why do we always park in driveways and drive on parkways? 🤔 Just another mystery of everyday life! Let's embrace the humor in our daily...,✔️ [1.0]
2,"If you have all the information to make a perfect decision, you missed the opportunity.",Decision Making,"Effective decision making is essential in both personal and professional contexts. It involves analyzing information, weighing options, and considering potential outcomes. Good decision-making skills can...","🧠 Decision making is a vital skill for success! By analyzing information and weighing options, we can make informed choices that lead to better outcomes....",✔️ [1.0]
3,"I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able...",Habit Formation,"Habit formation is a powerful process that can significantly impact our lives. By understanding the cues, routines, and rewards that drive our behaviors, we can...","🌱 Want to change your life? Start with habit formation! By understanding cues, routines, and rewards, you can build positive habits that lead to personal...",✔️ [1.0]
4,The reason you are stressed is you have decisions to make and you’re not making them.,Mental Health,"Mental health is a vital aspect of overall well-being that affects how we think, feel, and act. Promoting mental health awareness and providing access to...","🧠💚 Mental health matters! Let's break the stigma and promote awareness. It's time to prioritize mental well-being and support those who need it. Remember, it's...",✔️ [1.0]


95.0

## BootstrapFewShotWithRandomSearch: 95

In [137]:
# Initialize optimizer
teleprompter = MIPROv2(
    metric=tweet_ensemble_metric,
    num_candidates=7,
    init_temperature=0.5,
    verbose=False,
    num_threads=2,
)

mipro_optimized_program = teleprompter.compile(
    TweetWriter(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    num_trials=10,
    minibatch_size=25,
    minibatch_full_eval_steps=10,
    minibatch=True, 
)

# Save the optimized program for future use
mipro_optimized_program.save("mipro_optimizedv2")

Projected Language Model (LM) Calls

Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:


- Prompt Model: 10 data summarizer calls + 7 * 1 lm calls in program + (2) lm calls in program aware proposer = 19 prompt model calls
- Task Model: 25 examples in minibatch * 10 batches + 200 examples in val set * 1 full evals = 300 task model calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token) 
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per 

  6%|▌         | 3/50 [00:19<05:11,  6.63s/it]


Bootstrapped 3 full traces after 4 examples in round 0.
Bootstrapping set 4/7


  4%|▍         | 2/50 [00:13<05:28,  6.85s/it]


Bootstrapped 2 full traces after 3 examples in round 0.
Bootstrapping set 5/7


  2%|▏         | 1/50 [00:07<05:56,  7.28s/it]


Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 6/7


  2%|▏         | 1/50 [00:05<04:46,  5.84s/it]


Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 7/7


  2%|▏         | 1/50 [00:06<04:55,  6.03s/it]


Bootstrapped 1 full traces after 2 examples in round 0.

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
In this step, by default we will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.

Proposing instructions...

Proposed Instructions for Predictor 0:

0: Given the fields `topic`, produce the fields `tweet`.

1: You are a motivational speaker and personal development coach. Given the field `topic`, produce an inspiring tweet that reflects the essence of the topic while encouraging self-awareness and resilience.

2: Using the topic provided, generate a tweet that encapsulates the essence of the topic through a structured reasoning process. Begin by breaking down the topic step by step to highlight key insights and context, and then distill this reasoning into a concise and engaging tweet suitable for social media. Ensure the tweet resonates with themes of personal

Average Metric: 190.25 / 200  (95.1): 100%|██████████| 200/200 [07:53<00:00,  2.37s/it]


Default program score: 95.12

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
In this step, we will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination. Bayesian Optimization will be used for this search process.

== Minibatch Trial 1 / 10 ==


Average Metric: 25.0 / 25  (100.0): 100%|██████████| 25/25 [01:22<00:00,  3.31s/it]


Score: 100.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].


== Minibatch Trial 2 / 10 ==


Average Metric: 25.0 / 25  (100.0): 100%|██████████| 25/25 [01:23<00:00,  3.35s/it]


Score: 100.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 5'].


== Minibatch Trial 3 / 10 ==


Average Metric: 19.0 / 25  (76.0): 100%|██████████| 25/25 [01:20<00:00,  3.22s/it]


Score: 76.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 4 / 10 ==


Average Metric: 23.0 / 25  (92.0): 100%|██████████| 25/25 [01:15<00:00,  3.03s/it] 


Score: 92.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 5 / 10 ==


Average Metric: 24.0 / 25  (96.0): 100%|██████████| 25/25 [01:04<00:00,  2.58s/it] 


Score: 96.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 6 / 10 ==


Average Metric: 25.0 / 25  (100.0): 100%|██████████| 25/25 [01:25<00:00,  3.40s/it]


Score: 100.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 6'].


== Minibatch Trial 7 / 10 ==


Average Metric: 25.0 / 25  (100.0): 100%|██████████| 25/25 [01:06<00:00,  2.64s/it]


Score: 100.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].


== Minibatch Trial 8 / 10 ==


Average Metric: 23.0 / 25  (92.0): 100%|██████████| 25/25 [01:00<00:00,  2.41s/it]


Score: 92.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 9 / 10 ==


Average Metric: 24.0 / 25  (96.0): 100%|██████████| 25/25 [01:22<00:00,  3.29s/it] 


Score: 96.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].


== Minibatch Trial 10 / 10 ==


Average Metric: 25.0 / 25  (100.0): 100%|██████████| 25/25 [01:14<00:00,  2.97s/it]


Score: 100.0 on minibatch of size 25 with parameters ['Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 2'].


===== Full Eval 1 =====
Doing full eval on next top averaging program (Avg Score: 100.0) so far from mini-batch trials...


Average Metric: 197.0 / 200  (98.5): 100%|██████████| 200/200 [06:58<00:00,  2.09s/it]

Best full eval score so far! Score: 98.5


[('prog', Predict(StringSignature(topic -> reasoning, tweet
    instructions='Given the fields `topic`, produce the fields `tweet`.'
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [138]:
evaluator(mipro_optimized_program)

Average Metric: 40.0 / 40  (100.0): 100%|██████████| 40/40 [01:39<00:00,  2.50s/it]


Unnamed: 0,example_tweet,topic,reasoning,pred_tweet,tweet_ensemble_metric
0,"Doing a thing increases your ability to learn about it. In order words, do the thing, then read about it. The experience will give you...",Learning and Experience,Learning and experience are intertwined; every lesson learned shapes our journey and equips us for future challenges. Embracing both successes and failures as valuable experiences...,"Every experience, good or bad, is a stepping stone to growth. Embrace the lessons, reflect on your journey, and let each moment shape you into...",✔️ [1.0]
1,Anyone else ever accidentally get in the shower with a hat still on? https://t.co/t9IOAGcC1h,Everyday Life/Humor,"Everyday life can often be mundane, but humor is a powerful tool to lighten our daily experiences. Finding joy and laughter in the little things...",Why do we park in driveways and drive on parkways? 🤔 Life is full of these little mysteries! Embrace the humor in everyday life—it makes...,✔️ [1.0]
2,"If you have all the information to make a perfect decision, you missed the opportunity.",Decision Making,"Decision making is a critical skill that shapes our lives and futures. It involves weighing options, considering consequences, and trusting our instincts. Emphasizing the importance...","Every decision you make is a step on your journey. Trust your instincts, reflect on your values, and remember: it’s okay to take your time....",✔️ [1.0]
3,"I’m bad at starting new habits. I’m also pretty bad at keeping habits. In the last ten years, the only small habits I’ve been able...",Habit Formation,"Habit formation is a powerful process that shapes our daily lives and long-term success. Understanding that habits are built through consistency and small, incremental changes...","Every great achievement starts with a single habit. 🌱 Focus on small, consistent actions, and watch them transform your life over time. Remember, it’s progress,...",✔️ [1.0]
4,The reason you are stressed is you have decisions to make and you’re not making them.,Mental Health,"Mental health is a vital aspect of overall well-being that often gets overlooked. It's important to prioritize self-care, seek help when needed, and foster open...","Your mental health matters just as much as your physical health. 🌱 Take a moment today to check in with yourself, practice self-care, and reach...",✔️ [1.0]


100.0

## MIPROv2: 100
Because the evaluation during this run was more robust (MIPROv2 ran a full evaluation over 200 validation examples), this might be a more 'useful' 100 than the 100 from BootstrapFewShot. I'd like to explore how to normalize and compare these scores, but I'm leaving that for future work in the interest of time.

Also during this run, I realised that some of the higher hyperparameter values in both of my MIPRO runs were completely unnecessary, so I'll reduce them for the next experiment.
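
One rough way to put scores from evaluation sets of different sizes side by side is to attach an interval to each. The sketch below is not part of the original runs: `wilson_interval` is a hypothetical helper, and it assumes each example can be treated as a plain pass/fail, which is only an approximation because the metric returns fractional scores. The two inputs are the 40 / 40 devset result and the 197 / 200 full-eval result from the logs above.

```python
import math

def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate of passes / n.

    Assumption: each example is an independent pass/fail trial.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

for label, passes, n in [("40 / 40", 40, 40), ("197 / 200", 197, 200)]:
    lo, hi = wilson_interval(passes, n)
    print(f"{label}: {lo:.3f} to {hi:.3f}")
```

A perfect score on a small set still leaves a wide lower bound, which is one way to read why a 100 on 40 examples says less than a high score on 200.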

# Experiment 3: with RAG

In [1]:
import pickle 

with open('data/annotated_dataset.pkl', 'rb') as pickle_file:
    annotated_dataset = pickle.load(pickle_file)



In [2]:
rag_entries = 200
# The first 200 entries populate the RAG index
set_for_rag = annotated_dataset[:rag_entries]
# The remaining 100 entries are used for optimization and evaluation
set_for_training = annotated_dataset[rag_entries:]
# Of those, 75 go to training and 25 to validation (dev_set_n is the split index)
dev_set_n = 75
trainset = [x.without('engagement').with_inputs('topic') for x in set_for_training[:dev_set_n]]
devset = [x.without('engagement').with_inputs('topic') for x in set_for_training[dev_set_n:]]

In [3]:
import weaviate
from dotenv import load_dotenv
import os

load_dotenv()
  
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

client = weaviate.connect_to_local(host='localhost', port=8080, headers={
    'X-Openai-Api-Key': OPENAI_API_KEY
})

print(client.is_ready())

True


In [4]:
import weaviate.classes.config as wvcc
  
client.collections.delete_all()
collection = client.collections.create(
    name = "DspyTweets",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_openai(),
    properties=[
        wvcc.Property(name = "content", data_type=wvcc.DataType.TEXT),
    ],
)

In [5]:
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def chunk_tweets():
    """Split each tweet in set_for_rag into sentences and group them into chunks of up to 5 sentences."""
    tweet_chunks = []
    for tweet in set_for_rag:
        content = tweet.tweet
        sentences = split_into_sentences(content)
        sentence_chunks = chunk_list(sentences, 5)
        sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
        tweet_chunks.extend(sentence_chunks)
    return tweet_chunks

tweet_chunks = chunk_tweets()

In [6]:
len(tweet_chunks)

205

In [7]:
tweet_chunks[0]

'Stop “friend financing”'
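
To make the chunking behaviour concrete, here is a small standalone sketch that reuses the two helpers from the cell above on a made-up string (the `sample` text is invented for illustration and is not from the dataset). It shows that a text only splits into more than one chunk once it exceeds five sentences, which is consistent with 200 tweets producing just 205 chunks.

```python
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences on whitespace that follows '.' or '?'."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]

# Hypothetical six-sentence input, chosen to overflow one chunk of five.
sample = "Stop waiting. Start doing? Keep going. Rest later. Repeat daily. Then review."
sentences = split_into_sentences(sample)
chunks = [' '.join(c) for c in chunk_list(sentences, 5)]
print(len(sentences), len(chunks))
print(chunks)
```

Note that the pattern does not split on `!`, so exclamatory sentences stay attached to whatever follows them.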

In [8]:
tweets_collection = client.collections.get("DspyTweets")

for tweet_chunk in tweet_chunks:
    tweets_collection.data.insert(
        properties={
            "content": tweet_chunk
        }
    )

In [9]:
import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM

retriever_model = WeaviateRM(
    weaviate_collection_name="DspyTweets",
    weaviate_client=client,
)

results = retriever_model("entrepreneurship", k=5)

for result in results:
    print("Document:", result.long_text, "\n")

Document: Entrepreneurship is less about being really good at anything and more about being good enough at everything. 

Document: Entrepreneurship isn’t about being great at a few things but being good enough at everything. 

Document: The best entrepreneurs don’t do business to make money. They make money to do business. 

Document: You can always justify doing a new thing by quantifying what you have to gain. The tough part is quantifying what you’ll lose from the distraction. And that downside is the part most entrepreneurs miscalculate. 

Document: Many small business owners stay small because they don’t have the patience to let something small become big. 



In [10]:
dspy.configure(rm = retriever_model)

In [11]:
# Define the signature for automatic assessments.
gpt4o = dspy.LM(model= 'gpt-4o', max_tokens=1000, model_type='chat')
retrieve = dspy.Retrieve(k=5)

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    context = dspy.InputField(desc='ignore if N/A')
    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")

def metric(gold, pred, trace=None):
    topic, tweet = gold.topic, pred.tweet
    context = retrieve(topic).passages

    engaging = "Does the assessed text make for a self-contained, engaging tweet?"
    faithful = "Is the assessed text grounded in the context? Say no if it includes significant facts not in the context."
    concise = "Does the assessed text make for a concise, cogent tweet?"
    creative = "Does the assessed text make for a creative tweet?"
    relevant = f"This is the tweet: `{tweet}`. It should be relevant to: `{topic}`."
    
    with dspy.context(lm=gpt4o):
        faithful = dspy.Predict(Assess)(context=context, assessed_text=tweet, assessment_question=faithful)
        relevant =  dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=relevant)
        engaging = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=engaging)
        creative = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=creative)
        concise = dspy.Predict(Assess)(context='N/A', assessed_text=tweet, assessment_question=concise)

    relevant, engaging, creative, concise, faithful = (m.assessment_answer.split()[0].lower() == 'yes' for m in [relevant, engaging, creative, concise, faithful])
    # Tweets longer than 280 characters score 0 regardless of the judge's answers
    score = (relevant + engaging + creative + concise + faithful) if (len(tweet) <= 280) else 0

    if trace is not None: 
        return score >= 5
    return score / 5.0
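
The scoring logic at the end of the metric can be checked in isolation, without any LM calls. The sketch below is a standalone approximation: `aggregate` is a hypothetical helper that takes the judge's raw answers as plain strings, and it mirrors the parsing, the 280-character gate, and the stricter pass/fail behaviour used when `trace` is set during bootstrapping.

```python
def aggregate(answers, tweet, max_len=280, trace=None):
    """Combine Yes/No judge answers into a score, mirroring the metric's gating."""
    # Same parsing as the metric: only the first whitespace-delimited token counts.
    passed = [a.split()[0].lower() == 'yes' for a in answers]
    score = sum(passed) if len(tweet) <= max_len else 0
    if trace is not None:
        # During bootstrapping, only a perfect score is accepted as a demo.
        return score >= len(answers)
    return score / len(answers)

print(aggregate(["Yes", "No", "Yes", "Yes", "Yes"], "short tweet"))
print(aggregate(["Yes"] * 5, "x" * 281))
print(aggregate(["Yes."] * 5, "short tweet"))
```

The last call shows a fragility worth knowing about: with this parsing, an answer such as `Yes.` with trailing punctuation is counted as a no.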

In [12]:
class GenerateTweet(dspy.Signature):
    """Generate a tweet based on the context"""

    context = dspy.InputField(desc = "May contain relevant facts")
    topic = dspy.InputField()
    tweet = dspy.OutputField()

class TweetRAG(dspy.Module):
    def __init__(self, num_passages = 3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k = num_passages)
        self.generate_tweet = dspy.ChainOfThought(GenerateTweet)

    def forward(self, topic):
        context = self.retrieve(topic).passages
        prediction = self.generate_tweet(context = context, topic = topic)
        return dspy.Prediction(tweet = prediction.tweet)

In [13]:
mini = dspy.LM(model = 'gpt-4o-mini')
dspy.configure(lm=mini)

In [14]:
from dspy.evaluate.evaluate import Evaluate
uncompiled_tweet_rag = TweetRAG()
evaluator = Evaluate(devset=devset, metric=metric, num_threads=2, display_progress=True, display_table=5, provide_traceback=True)
evaluator(uncompiled_tweet_rag) 

Average Metric: 22.8 / 25  (91.2): 100%|██████████| 25/25 [01:08<00:00,  2.75s/it]


Unnamed: 0,example_tweet,topic,pred_tweet,metric
0,For those who ask how I eat desert every night and don’t get fat… Answer: I eat less during the day.,Diet and Weight Management,"""Weight management isn't about winning; it's about the journey. Just like in marriage, the goal is to keep going. Celebrate the small victories in the...",✔️ [1.0]
1,You can give up tomorrow. Today you fight.,Motivation,"Motivation doesn't always come from passion. Sometimes, it's pain, fear, or even shame that drives us. Remember, every small win counts! Visualize your audience—93 views?...",
2,There are no recessions. Only buying seasons and selling seasons. Your skills determine which season it is for you.,Economic Perspective,"💡 Everyone is an investor! Whether rich or poor, how we trade our resources defines our returns. Start measuring your investments in time and money...",✔️ [1.0]
3,A life tip that’s served me well: Save your big decisions for a morning after a good nights sleep.,Life Advice,"Life isn't a race to win; it's a journey to enjoy. Focus on nurturing relationships, health, and passions rather than just reaching a destination. Sometimes,...",✔️ [1.0]
4,Whenever I get to a low point where I think “why do I even bother?” I just try to remind myself “this is where most...,Motivation and Perseverance,"""Don't wait for motivation to strike! Start when it’s hard and use your challenges as fuel. Pain, fear, and anger can be powerful motivators. Conquer...",✔️ [1.0]


91.2

## Uncompiled Tweet-RAG: 91.2

In [15]:
from dspy.teleprompt import BootstrapFewShot
config = dict(max_bootstrapped_demos=15, max_labeled_demos=5)
optimizer = BootstrapFewShot(metric = metric, **config)
optimized_tweet_writer = optimizer.compile(TweetRAG(), trainset=trainset[:10])
optimized_tweet_writer.save(path="fewshotv3")

100%|██████████| 10/10 [01:03<00:00,  6.33s/it]

Bootstrapped 8 full traces after 10 examples in round 0.
[('retrieve', <dspy.retrieve.retrieve.Retrieve object at 0x132e64890>), ('generate_tweet', Predict(StringSignature(context, topic -> reasoning, tweet
    instructions='Generate a tweet based on the context'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'May contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [16]:
evaluator(optimized_tweet_writer)

Average Metric: 24.8 / 25  (99.2): 100%|██████████| 25/25 [01:21<00:00,  3.24s/it] 


Unnamed: 0,example_tweet,topic,pred_tweet,metric
0,For those who ask how I eat desert every night and don’t get fat… Answer: I eat less during the day.,Diet and Weight Management,"In diet and weight management, there’s no “winning”—only ongoing commitment. Focus on mastering the middle, where real progress happens. Celebrate the journey, not just the...",✔️ [1.0]
1,You can give up tomorrow. Today you fight.,Motivation,"Motivation can come from unexpected places—pain, fear, or even shame. Use what you’ve got! And remember, every small win matters. Imagine those views as a...",✔️ [1.0]
2,There are no recessions. Only buying seasons and selling seasons. Your skills determine which season it is for you.,Economic Perspective,"In the economy of life, we’re all investors trading resources for returns. Recognize your time as your most valuable asset! The rarer your perspective, the...",✔️ [1.0]
3,A life tip that’s served me well: Save your big decisions for a morning after a good nights sleep.,Life Advice,"Life isn’t about winning; it’s about playing the game. Whether in marriage, health, or business, focus on the journey, not just the destination. Sometimes, the...",✔️ [1.0]
4,Whenever I get to a low point where I think “why do I even bother?” I just try to remind myself “this is where most...,Motivation and Perseverance,Don’t wait for the perfect moment to start—begin when it’s hard! 💪 Conquering tiny impulses builds the perseverance needed for massive dreams. Use whatever motivates...,✔️ [1.0]


99.2

## BootstrapFewShot: 99.2

In [17]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
config = dict(max_labeled_demos = 2, max_bootstrapped_demos=2, num_candidate_programs=3)
teleprompter = BootstrapFewShotWithRandomSearch(metric = metric, **config)
rs_optimized_tweet_writer = teleprompter.compile(TweetRAG(), trainset=trainset[:20])
rs_optimized_tweet_writer.save(path="fewshotwithrsv3")

Going to sample between 1 and 2 traces per predictor.
Will attempt to bootstrap 3 candidate sets.


Average Metric: 17.2 / 20  (86.0): 100%|██████████| 20/20 [00:25<00:00,  1.26s/it]              


New best score: 86.0 for seed -3
Scores so far: [86.0]
Best score so far: 86.0


Average Metric: 18.6 / 20  (93.0): 100%|██████████| 20/20 [00:22<00:00,  1.11s/it]


New best score: 93.0 for seed -2
Scores so far: [86.0, 93.0]
Best score so far: 93.0


 15%|█▌        | 3/20 [00:01<00:09,  1.88it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 18.8 / 20  (94.0): 100%|██████████| 20/20 [00:28<00:00,  1.43s/it]


New best score: 94.0 for seed -1
Scores so far: [86.0, 93.0, 94.0]
Best score so far: 94.0


 10%|█         | 2/20 [00:14<02:08,  7.15s/it]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 19.8 / 20  (99.0): 100%|██████████| 20/20 [00:26<00:00,  1.32s/it]


New best score: 99.0 for seed 0
Scores so far: [86.0, 93.0, 94.0, 99.0]
Best score so far: 99.0


 10%|█         | 2/20 [00:13<01:59,  6.64s/it]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 14.0 / 20  (70.0): 100%|██████████| 20/20 [00:26<00:00,  1.32s/it]


Scores so far: [86.0, 93.0, 94.0, 99.0, 70.0]
Best score so far: 99.0


  5%|▌         | 1/20 [00:07<02:19,  7.33s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 19.8 / 20  (99.0): 100%|██████████| 20/20 [00:24<00:00,  1.25s/it]

Scores so far: [86.0, 93.0, 94.0, 99.0, 70.0, 99.0]
Best score so far: 99.0
6 candidate programs found.
[('retrieve', <dspy.retrieve.retrieve.Retrieve object at 0x134570380>), ('generate_tweet', Predict(StringSignature(context, topic -> reasoning, tweet
    instructions='Generate a tweet based on the context'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'May contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]





In [18]:
mini.inspect_history(n=1)


System message:

Your input fields are:
1. `context` (str): May contain relevant facts
2. `topic` (str)

Your output fields are:
1. `reasoning` (str)
2. `tweet` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## topic ## ]]
{topic}

[[ ## reasoning ## ]]
{reasoning}

[[ ## tweet ## ]]
{tweet}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Generate a tweet based on the context


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## context ## ]]
Not supplied for this particular example.

[[ ## topic ## ]]
Workforce Management

Respond with the corresponding output fields, starting with the field `reasoning`, then `tweet`, and then ending with the marker for `completed`.


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example.

[[ ## tweet ## ]]
Raisin

In [19]:
evaluator(rs_optimized_tweet_writer)

Average Metric: 24.8 / 25  (99.2): 100%|██████████| 25/25 [01:08<00:00,  2.74s/it] 


Unnamed: 0,example_tweet,topic,pred_tweet,metric
0,For those who ask how I eat desert every night and don’t get fat… Answer: I eat less during the day.,Diet and Weight Management,"Diet and weight management aren’t about “winning” but about playing the long game. Focus on consistency and self-motivation, especially in the middle of your journey....",✔️ [1.0]
1,You can give up tomorrow. Today you fight.,Motivation,"Motivation doesn't always come from passion; sometimes it’s pain, fear, or anger that drives us. Embrace what you have! Visualize your small wins as a...",✔️ [1.0]
2,There are no recessions. Only buying seasons and selling seasons. Your skills determine which season it is for you.,Economic Perspective,"Everyone is an investor in their own life! How you spend your time reflects your priorities and potential returns. As you rise, remember: the rarer...",✔️ [1.0]
3,A life tip that’s served me well: Save your big decisions for a morning after a good nights sleep.,Life Advice,"Life's best games can't be won, only played. Focus on the journey, not just the destination. Sometimes, giving up the sure thing leads to the...",✔️ [1.0]
4,Whenever I get to a low point where I think “why do I even bother?” I just try to remind myself “this is where most...,Motivation and Perseverance,"Start your journey when it’s hard! Conquering tiny impulses builds the perseverance needed for massive dreams. Use what you’ve got—pain, fear, or anger can be...",✔️ [1.0]


99.2

## BootstrapFewShotWithRandomSearch: 99.2

In [20]:
from dspy.teleprompt import MIPROv2
# Initialize optimizer
teleprompter = MIPROv2(
    metric=metric,
    num_candidates=7,
    init_temperature=0.5,
    verbose=False,
    num_threads=2,
)

mipro_optimized_program = teleprompter.compile(
    TweetRAG(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    num_trials=5,
    minibatch=True,
    minibatch_size=5, 
    minibatch_full_eval_steps=2
)

# Save the optimized program for future use
mipro_optimized_program.save("mipro_optimizedv3")

Projected Language Model (LM) Calls

Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:


- Prompt Model: 10 data summarizer calls + 7 * 1 lm calls in program + (2) lm calls in program aware proposer = 19 prompt model calls
- Task Model: 5 examples in minibatch * 5 batches + 60 examples in val set * 2 full evals = 55 task model calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token) 
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per Call

 27%|██▋       | 4/15 [00:27<01:14,  6.76s/it]


Bootstrapped 3 full traces after 5 examples in round 0.
Bootstrapping set 4/7


 20%|██        | 3/15 [00:19<01:18,  6.55s/it]


Bootstrapped 2 full traces after 4 examples in round 0.
Bootstrapping set 5/7


  7%|▋         | 1/15 [00:07<01:50,  7.86s/it]


Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 6/7


  7%|▋         | 1/15 [00:06<01:26,  6.14s/it]


Bootstrapped 1 full traces after 2 examples in round 0.
Bootstrapping set 7/7


  7%|▋         | 1/15 [00:06<01:24,  6.01s/it]


Bootstrapped 1 full traces after 2 examples in round 0.

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
In this step, by default we will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.

Proposing instructions...

Proposed Instructions for Predictor 0:

0: Generate a tweet based on the context

1: Using the provided context and topic, generate a concise and engaging tweet that captures the essence of the information while encouraging reflection and proactive strategies for personal and professional development.

2: In a competitive landscape where every tweet can influence public perception and engagement, generate a compelling tweet that encapsulates the provided context and topic. Ensure that the reasoning clearly articulates the insights drawn from the context, highlighting the significance of the topic in a way that resonates with the audience and encourages i

Average Metric: 55.39999999999999 / 60  (92.3): 100%|██████████| 60/60 [02:55<00:00,  2.92s/it] 


Default program score: 92.33

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
In this step, we will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination. Bayesian Optimization will be used for this search process.

== Minibatch Trial 1 / 5 ==


Average Metric: 5.0 / 5  (100.0): 100%|██████████| 5/5 [00:24<00:00,  4.92s/it]


Score: 100.0 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].


== Minibatch Trial 2 / 5 ==


Average Metric: 4.6 / 5  (92.0): 100%|██████████| 5/5 [00:20<00:00,  4.05s/it] 


Score: 92.0 on minibatch of size 5 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 5'].


===== Full Eval 1 =====
Doing full eval on next top averaging program (Avg Score: 100.0) so far from mini-batch trials...


Average Metric: 48.99999999999999 / 60  (81.7): 100%|██████████| 60/60 [03:15<00:00,  3.27s/it] 


Full eval score: 81.67
Best full eval score so far: 92.33


== Minibatch Trial 3 / 5 ==


Average Metric: 5.0 / 5  (100.0): 100%|██████████| 5/5 [00:24<00:00,  4.85s/it]


Score: 100.0 on minibatch of size 5 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].


== Minibatch Trial 4 / 5 ==


Average Metric: 4.0 / 5  (80.0): 100%|██████████| 5/5 [00:28<00:00,  5.64s/it]


Score: 80.0 on minibatch of size 5 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


===== Full Eval 2 =====
Doing full eval on next top averaging program (Avg Score: 100.0) so far from mini-batch trials...


Average Metric: 41.4 / 60  (69.0): 100%|██████████| 60/60 [02:56<00:00,  2.95s/it]


Full eval score: 69.0
Best full eval score so far: 92.33


== Minibatch Trial 5 / 5 ==


Average Metric: 3.0 / 5  (60.0): 100%|██████████| 5/5 [00:23<00:00,  4.62s/it]


Score: 60.0 on minibatch of size 5 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].


===== Full Eval 3 =====
Doing full eval on next top averaging program (Avg Score: 92.0) so far from mini-batch trials...


Average Metric: 49.99999999999999 / 60  (83.3): 100%|██████████| 60/60 [02:59<00:00,  2.98s/it] 

Full eval score: 83.33
Best full eval score so far: 92.33


[('retrieve', <dspy.retrieve.retrieve.Retrieve object at 0x1351d65d0>), ('generate_tweet', Predict(StringSignature(context, topic -> reasoning, tweet
    instructions='Generate a tweet based on the context'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'May contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    topic = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Topic:', 'desc': '${topic}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Tweet:', 'desc': '${tweet}'})
)))]
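
The log above shows why the compiled program still carries the default instruction (`'Generate a tweet based on the context'`): two candidates scored 100.0 on a 5-example minibatch, but none of the three full evaluations (81.67, 69.0, 83.33) beat the default program's 92.33 on the 60-example validation set. The snippet below is a simplified sketch of that "keep the best full-eval score" bookkeeping, using only the scores copied from the log. It is not DSPy's actual implementation, and `pick_best` is a hypothetical helper written for illustration.

```python
# Scores copied from the MIPROv2 log above.
default_score = 92.33
full_evals = [
    ("Full Eval 1", 81.67),
    ("Full Eval 2", 69.0),
    ("Full Eval 3", 83.33),
]


def pick_best(default_score, full_evals):
    """Return the (name, score) pair with the highest full-eval score.

    Hypothetical helper: the default program is the incumbent, and a
    candidate replaces it only if its full-eval score is strictly higher.
    """
    best_name, best_score = "default program", default_score
    for name, score in full_evals:
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


best_name, best_score = pick_best(default_score, full_evals)
print(best_name, best_score)
```

With only 5 examples per minibatch, a single failed example moves the minibatch score by 20 points, so a perfect minibatch score is weak evidence about how a candidate will do on the full validation set.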





In [22]:
mini.inspect_history(n=1)

System message:

Your input fields are:
1. `context` (str): May contain relevant facts
2. `topic` (str)

Your output fields are:
1. `reasoning` (str)
2. `tweet` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## topic ## ]]
{topic}

[[ ## reasoning ## ]]
{reasoning}

[[ ## tweet ## ]]
{tweet}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        In a competitive landscape where every tweet can influence public perception and engagement, generate a compelling tweet that encapsulates the provided context and topic. Ensure that the reasoning clearly articulates the insights drawn from the context, highlighting the significance of the topic in a way that resonates with the audience and encourages interaction.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## context ## ]]
Not supplied for thi

In [23]:
evaluator(mipro_optimized_program)

Average Metric: 22.8 / 25  (91.2): 100%|██████████| 25/25 [00:12<00:00,  2.04it/s]


| index | example_tweet | topic | pred_tweet | metric |
|---|---|---|---|---|
| 0 | For those who ask how I eat desert every night and don’t get fat… Answer: I eat less during the day. | Diet and Weight Management | "Weight management isn't about winning; it's about the journey. Just like in marriage, the goal is to keep going. Celebrate the small victories in the... | ✔️ [1.0] |
| 1 | You can give up tomorrow. Today you fight. | Motivation | Motivation doesn't always come from passion. Sometimes, it's pain, fear, or even shame that drives us. Remember, every small win counts! Visualize your audience—93 views?... | |
| 2 | There are no recessions. Only buying seasons and selling seasons. Your skills determine which season it is for you. | Economic Perspective | 💡 Everyone is an investor! Whether rich or poor, how we trade our resources defines our returns. Start measuring your investments in time and money... | ✔️ [1.0] |
| 3 | A life tip that’s served me well: Save your big decisions for a morning after a good nights sleep. | Life Advice | Life isn't a race to win; it's a journey to enjoy. Focus on nurturing relationships, health, and passions rather than just reaching a destination. Sometimes,... | ✔️ [1.0] |
| 4 | Whenever I get to a low point where I think “why do I even bother?” I just try to remind myself “this is where most... | Motivation and Perseverance | "Don't wait for motivation to strike! Start when it’s hard and use your challenges as fuel. Pain, fear, and anger can be powerful motivators. Conquer... | ✔️ [1.0] |


91.2

## MIPROv2 with RAG: 91.2
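
The cost formula printed in the MIPROv2 log can be turned into a rough estimate for this run. The sketch below is a minimal illustration, not DSPy code: `estimate_cost` is a hypothetical helper, the call counts (55 task model calls, 19 prompt model calls) are copied from the log, and the token counts and per-token prices are placeholder assumptions to be replaced with real values. Because the logged formula is cut off, the prompt-model term is assumed to mirror the task-model term.

```python
def estimate_cost(task_calls, prompt_calls,
                  avg_in_tokens, avg_out_tokens,
                  task_in_price, task_out_price,
                  prompt_in_price, prompt_out_price):
    """Rough cost estimate following the formula printed in the log.

    Prices are per token. The prompt-model term is an assumption: the
    logged formula is truncated, so it is taken to have the same shape
    as the task-model term.
    """
    task_cost = task_calls * (
        avg_in_tokens * task_in_price + avg_out_tokens * task_out_price
    )
    prompt_cost = prompt_calls * (
        avg_in_tokens * prompt_in_price + avg_out_tokens * prompt_out_price
    )
    return task_cost + prompt_cost


# Call counts reported in the MIPROv2 log.
TASK_CALLS = 55
PROMPT_CALLS = 19

# Placeholder assumptions, NOT real prices or measured token counts.
estimate = estimate_cost(
    task_calls=TASK_CALLS,
    prompt_calls=PROMPT_CALLS,
    avg_in_tokens=1000,      # hypothetical average prompt length
    avg_out_tokens=200,      # hypothetical average completion length
    task_in_price=1e-6,      # hypothetical price per input token
    task_out_price=2e-6,     # hypothetical price per output token
    prompt_in_price=1e-6,
    prompt_out_price=2e-6,
)
print(f"Estimated cost: ${estimate:.4f}")
```

The projected call counts are upper bounds, so an estimate built from them is better read as a ceiling than as the expected bill.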

# Experiment 4: RAG evaluation with RAGAS
So far we've relied only on the custom metrics we defined with the LLM judge. We could also integrate some of the objective metrics defined by RAGAS into our evaluation and training, so let's try that.