_Optimizing a Prompt for Production:_
# Social Media Posts

Task: Write a social post given an insight and a social network.

`insight, social_network -> social_post`

### 0. Naive Prompt
Start with simple instructions to establish the baseline performance.

In [2]:
import openai

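def get_completion(prompt, context):
    """Fill the prompt template with the context values and return the model's reply."""
    # The client reads the API key from the OPENAI_API_KEY environment variable
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # str.format fills {placeholders}; literal braces in the template must be doubled
            {"role": "user", "content": prompt.format(**context)}
        ],
        max_tokens=500
    )

    return response.choices[0].message.content.strip()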

naive_prompt = "Write a social media post about how {insight}, for {social_network}."
context = {
    "insight": "prompt engineering will still be needed with smarter models, as even genius humans need prompting from legal, HR, management to align with business interests",
    "social_network": "Twitter"
}

social_post = get_completion(naive_prompt, context)
print("Input:")
for key, value in context.items():
    print(f"- {key}: {value}")
print(f"\nOutput:\n{social_post}")

Input:
- insight: prompt engineering will still be needed with smarter models, as even genius humans need prompting from legal, HR, management to align with business interests
- social_network: Twitter

Output:
🚀 As AI models get smarter, the need for #PromptEngineering won’t fade! Just like genius humans rely on legal, HR, and management for alignment with business goals, we’ll need skilled prompts to harness AI’s potential effectively. 🤖💡 Let's ensure innovation meets strategy! #AI #BusinessAlignment #FutureOfWork
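
Because `get_completion` fills the template with `str.format`, it can be useful to render a prompt locally before spending tokens on it. The snippet below is a minimal sketch of that check. `render_prompt` is a hypothetical helper that is not part of the notebook, and the example values are shortened for illustration.

```python
def render_prompt(template: str, context: dict) -> str:
    """Fill a prompt template with str.format, without calling the API."""
    return template.format(**context)

demo_template = "Write a social media post about how {insight}, for {social_network}."
preview = render_prompt(
    demo_template,
    {"insight": "whales are in danger", "social_network": "Twitter"},
)
print(preview)

# Literal braces must be doubled, otherwise str.format treats them as fields
json_hint = render_prompt(
    'Reply as {{"post": "..."}} for {social_network}.',
    {"social_network": "Twitter"},
)
print(json_hint)
```

Previewing the rendered text this way makes stray characters or unfilled placeholders visible before any request is sent.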


### 1. Give Direction 
Describe the desired style in detail, or reference a relevant persona.

In [3]:
direction_prompt = """Write a social media post about how {insight}, for {social_network}, in the style of Malcolm Gladwell.

The social post should sound authentically human and colloquial, while being practically useful and brief. Be very creative and make niche references to sound more human. To maximize engagement, the content should be surprising, interesting, or practically useful, or induce high-arousal emotions such as anxiety, anger, awe, or surprise."""

social_post_a = get_completion(direction_prompt, context)
print(f"\nOutput:\n{social_post_a}")


Output:
🧠✨ Even as AI models evolve, prompt engineering remains essential! Just like even the brightest geniuses need a nudge from HR or management to stay aligned with business goals, smart models need that same guidance. Think of it as the “maverick artist” vibe—unfettered creativity is amazing, but a solid brief helps keep the masterpiece on track. 🎨🚀

Imagine if Beethoven had no deadlines or direction; we might’ve ended up with a symphony of chaos instead of *Ode to Joy*. 🎶 So, as we push the limits of AI, let’s remember: great prompts bridge the gap between brilliance and business strategy.

What’s your take? Are we the conductors or merely the audience? 🎭 #PromptEngineering #AI #BusinessAlignment


### 2. Specify Format
Define what rules to follow, and the required structure of the response.
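
Once a prompt names the keys it expects, that structure can be checked mechanically. The following is a minimal sketch, assuming the four keys requested by the prompt in this section (`bait`, `hook`, `reward`, `post_content`). `missing_keys` is a hypothetical helper, and the sample response is made up for illustration.

```python
import re

REQUIRED_KEYS = ("bait", "hook", "reward", "post_content")

def missing_keys(response: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that never appear as `key:` at the start of a line."""
    return [
        key for key in required
        if not re.search(rf"^\s*{re.escape(key)}\s*:", response, flags=re.MULTILINE)
    ]

# Hypothetical response that omits the reward key
sample = "bait: A bold claim\nhook: Why it matters\npost_content: The full post"
print(missing_keys(sample))  # ['reward']
```

A non-empty result could be used to retry the request or to discard the response before it reaches later steps.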

In [4]:
format_prompt = """Write a social media post about how {insight}, for {social_network}, in the style of Malcolm Gladwell, using the Bait, Hook, Reward framework.

- **bait**: Grab the attention of the person browsing social media. This could be a contrarian statement, appeal to identity, celebrity name, or anything else that stops them scrolling.
- **hook**: Keep their attention now you have it. Show this post is relevant to them, by alerting them to an issue, using familiar words, or providing social proof to keep them reading.
- **reward**: Compensate them for their attention to elicit reciprocation. Is there any useful information, a surprising statistic, or an interesting anecdote to reveal? Or a threat to make?
- **post_content**: The main content of the post, incorporating the bait, hook, and reward in a way that resonates with the reader and engages their attention.

First decide on the bait, hook, and reward, before writing the post_content.

## Instructions
The social post should sound authentically human and colloquial, while being practically useful and brief. Be very creative and make niche references to sound more human. To maximize engagement, the content should be surprising, interesting, or practically useful, or induce high-arousal emotions such as anxiety, anger, awe, or surprise.

Follow best practices for posting engaging content on {social_network}, but do not use emoji or hashtags.
Provide citations for any statistics included. Do not reference the work of Malcolm Gladwell.
Respond in YAML with the following keys:
- bait
- hook
- reward
- post_content"""

social_post_b = get_completion(format_prompt, context)
print(f"\nOutput:\n{social_post_b}")


Output:
```yaml
bait: "Even the smartest AIs will still need a nudge. Why? Because genius alone isn’t enough."
hook: "Just like top-tier CEOs rely on legal and HR teams for delicate decisions, our most advanced models will require 'prompt engineers' to guide their brilliance. In a world where alignment with business interests is crucial, who will drive the conversation?"
reward: "Studies show that 70% of corporate decisions fail due to misalignment of stakeholder interests. In this complex landscape, the right prompts can be the difference between innovation and chaos."
post_content: "Even the smartest AIs will still need a nudge. Why? Because genius alone isn’t enough. Just like top-tier CEOs rely on legal and HR teams for delicate decisions, our most advanced models will require 'prompt engineers' to guide their brilliance. In a world where alignment with business interests is crucial, who will drive the conversation? Studies show that 70% of corporate decisions fail due to misalign

### 3. Provide Examples
Insert a diverse set of test cases where the task was done correctly.

In [6]:
examples_prompt = """Write a social media post about how {insight}, for {social_network}, in the style of Malcolm Gladwell, using the Bait, Hook, Reward framework.

- **bait**: Grab the attention of the person browsing social media. This could be a contrarian statement, appeal to identity, celebrity name, or anything else that stops them scrolling.
- **hook**: Keep their attention now you have it. Show this post is relevant to them, by alerting them to an issue, using familiar words, or providing social proof to keep them reading.
- **reward**: Compensate them for their attention to elicit reciprocation. Is there any useful information, a surprising statistic, or an interesting anecdote to reveal? Or a threat to make?
- **post_content**: The main content of the post, incorporating the bait, hook, and reward in a way that resonates with the reader and engages their attention.

First decide on the bait, hook, and reward, before writing the post_content.

## Examples
{examples_partial}

## Instructions
The social post should sound authentically human and colloquial, while being practically useful and brief. Be very creative and make niche references to sound more human. To maximize engagement, the content should be surprising, interesting, or practically useful, or induce high-arousal emotions such as anxiety, anger, awe, or surprise.

Follow best practices for posting engaging content on {social_network}, but do not use emoji or hashtags.
Provide citations for any statistics included. Do not reference the work of Malcolm Gladwell.
Respond in YAML with the following keys:
- bait
- hook
- reward
- post_content"""

examples = [
    {
        "insight": "overqualified students take service jobs",
        "social_network": "Facebook",
        "bait": "Use the trope that smart people are so focused they fail to look after themselves.",
        "hook": "Particle Physics is a subject that is universally associated with smart people.",
        "reward": "Fantasizing about being somewhere remote where you can earn money while focusing on your work.",
        "post_content": "Meet Jon. He has been working the night shift at our hotel for over a month already, and he says the best part is the peace and quiet. Particle physics, which is the topic of Jon's Masters Thesis, requires concentration. That's fine by us. So long as our guests can count on Jon, he can count all the particles he wants (or whatever it is that particle physicists do)."
    },
    {
        "insight": "there's no such thing as a wasted trip",
        "social_network": "Instagram",
        "bait": "Mentioning a 'wasted trip' will get the attention of people who like to travel, and for whom wasting a trip would be upsetting.",
        "hook": "Talk about a rainy day in a hot location like Kauai, which most people would complain about because they're wasting time inside.",
        "reward": "An anecdote that reveals that good things can happen when you least expect them.",
        "post_content": "There's no such thing as a wasted trip! On this rainy day in Kauai 2 years ago I met my bestie @moniqueontour and we've been on 4 amazing adventures together since."
    },
    {
        "insight": "mould your business to your own preferences",
        "social_network": "LinkedIn",
        "bait": "Most self-help advice focuses on fixing weaknesses, but instead let's focus on how to reframe weaknesses into strengths.",
        "hook": "We need to make it clear this post is for business owners, and that you are speaking from experience, establishing connection and credibility.",
        "reward": "The fact that 60% of new business came from the blog is a statistic they can take away and use as evidence to support their own strategy, or to share with their wider network.",
        "post_content": "Don't try to mold yourself into the person you think your business needs you to be, build your business around who you actually are. When I started my agency business, I hated networking. I'm introverted by nature  so I wrote blog posts instead – eventually 60% of my agency's new business came from our blog."
    },
    {
        "insight": "whales are in danger and most people don't know",
        "social_network": "Twitter",
        "bait": "People who care about whales or animals will pay attention when whales are mentioned.",
        "hook": "If we refer to the whales as \"in danger\" people who care will want to know more.",
        "reward": "Some species of whales are going extinct, and people would benefit from knowing.",
        "post_content": "The whales are in danger of extinction. Many people aren't aware, which is why it's so important to bring awareness to this cause. Together with a few colleagues, we'll be running the London marathon next week in aid of the \"Save the Whales\" foundation, and any support would be appreciated."
    }
]

# Convert examples to a partial string
examples_partial = ""
for i, example in enumerate(examples, 1):
    examples_partial += f"Example {i}:\n"
    for key, value in example.items():
        examples_partial += f"- {key}: {value}\n"
    examples_partial += "\n"

print(examples_partial)

Example 1:
- insight: overqualified students take service jobs
- social_network: Facebook
- bait: Use the trope that smart people are so focused they fail to look after themselves.
- hook: Particle Physics is a subject that is universally associated with smart people.
- reward: Fantasizing about being somewhere remote where you can earn money while focusing on your work.
- post_content: Meet Jon. He has been working the night shift at our hotel for over a month already, and he says the best part is the peace and quiet. Particle physics, which is the topic of Jon's Masters Thesis, requires concentration. That's fine by us. So long as our guests can count on Jon, he can count all the particles he wants (or whatever it is that particle physicists do).

Example 2:
- insight: there's no such thing as a wasted trip
- social_network: Instagram
- bait: Mentioning a 'wasted trip' will get the attention of people who like to travel, and for whom wasting a trip would be upsetting.
- hook: Talk ab

In [7]:
# Add examples_partial to the context
examples_context = {
    "examples_partial": examples_partial,
    "social_network": "Twitter",
    "insight": "prompt engineering will still be needed with smarter models, as even genius humans need prompting from legal, HR, management to align with business interests"
}

social_post_c = get_completion(examples_prompt, examples_context)
print(f"\nOutput:\n{social_post_c}")


Output:
```yaml
bait: Even the smartest AI will need a nudge from us mere mortals.
hook: Just like genius humans rely on guidance from HR and leadership to align with business goals, advanced AI models won't escape the necessity of prompt engineering.
reward: Studies show that structured input can elevate AI output quality by up to 35%. In a world where every prompt matters, the role of human intuition becomes even more crucial.
post_content: Even the smartest AI will need a nudge from us mere mortals. Just as genius thinkers stumble without guidance from HR, legal, and management, the latest AI marvels can't navigate complex business interests alone. A recent study revealed that structured input can boost AI output quality by as much as 35%. As we embrace smarter models, remember: the art of prompt engineering is not only here to stay—it’s evolution's frontline. Don't underestimate the power of a well-crafted prompt. It could be the key to unlocking groundbreaking insights and alignm

In [8]:
import re

# Define a function to check for emojis in the post content
def check_for_emojis(post_content):
    # Regular expression pattern to match runs of emoji characters
    emoji_pattern = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # dingbats
        "\U000024C2-\U0001F251"  # enclosed characters; this wide range also matches many non-emoji characters (e.g. CJK text)
        "\U0001F910-\U0001F93A"  # additional range including 🤔 (U+1F914)
        "]+")

    # Each match is a run of consecutive emojis, so emoji_count counts runs rather than single characters
    emojis_found = emoji_pattern.findall(post_content)

    return {
        "has_emojis": len(emojis_found) > 0,
        "emoji_count": len(emojis_found),
        "emojis": emojis_found
    }

# Evaluate the presence of emojis in the generated social post
emoji_evaluation = check_for_emojis(social_post_c)
print("\nEmoji Evaluation:")
print(f"Has emojis: {emoji_evaluation['has_emojis']}")
print(f"Emoji count: {emoji_evaluation['emoji_count']}")
print(f"Emojis found: {', '.join(emoji_evaluation['emojis']) if emoji_evaluation['emojis'] else 'None'}")



Emoji Evaluation:
Has emojis: False
Emoji count: 0
Emojis found: None
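
The prompt forbids hashtags as well as emoji, but only emoji are checked above. A hashtag check could follow the same pattern. This is a minimal sketch: `find_hashtags` is a hypothetical helper, and the sample post is invented for illustration.

```python
import re

def find_hashtags(text: str) -> list:
    """Return hashtags found in the text: a '#' followed by word characters."""
    # The lookbehind skips a '#' that sits in the middle of a word or URL fragment
    return re.findall(r"(?<!\w)#\w+", text)

sample_post = "Prompting is here to stay. #ProTip: write the brief first. #AI"
print(find_hashtags(sample_post))  # ['#ProTip', '#AI']
```

Rule-based checks like this are cheap to run on every generated post, which leaves the model-based evaluation for the qualities that rules cannot capture.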


### 4. Evaluate Quality
Identify errors and rate responses, testing what drives performance.
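
Some errors can be flagged with plain rules before a model is asked to judge. Since the prompts ask for citations on any statistics, one inexpensive check is to pull out percentage claims so they can be verified by hand. The sketch below assumes that statistics appear as percentages. `find_percent_claims` is a hypothetical helper and the sample text is invented.

```python
import re

def find_percent_claims(text: str) -> list:
    """Return percentage figures (such as '35%') so their source can be checked."""
    return re.findall(r"\d+(?:\.\d+)?\s?%", text)

sample_post = "Structured input can boost output quality by as much as 35%. No source given."
print(find_percent_claims(sample_post))  # ['35%']
```

This only surfaces candidates for review; it cannot tell whether a figure is accurate.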

In [9]:
# Define a function to evaluate the engagement potential of a social media post
def evaluate_engagement(post_content, insight, social_network):
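    # Plain template (not an f-string): get_completion() fills the placeholders,
    # so braces inside a post are never interpreted as format fields.
    evaluation_prompt = """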
    You are an expert social media analyst. Your task is to evaluate the following social media post and predict its engagement potential. Consider factors such as:
    - Insight: Does the post use the relevant insight?
    - Bait: Does it grab attention?
    - Hook: Does it keep the attention?
    - Reward: Does it compensate for the attention?
    - Social Network: Does it follow best practices for the social network?
    - Hallucination: Are any statistics accurate?
    - Human: Is the post authentically human-sounding?

    Insight: {insight}
    Social Network: {social_network}
    Post content:
    "{post_content}"

    Based on these factors, rate the post's engagement potential on a scale of 1-5, where 1 is very low engagement, and 5 is very high engagement. Provide a brief explanation for your rating. Any posts that contain hallucinations or made-up statistics should be rated 0.

    Output your response in YAML with the following keys:
    - Analysis: [Your analysis]
    - Rating: [Your rating]
    """
    evaluation_context = {
        "post_content": post_content,
        "insight": insight,
        "social_network": social_network
    }

    engagement_evaluation = get_completion(evaluation_prompt, evaluation_context)
    return engagement_evaluation

# strip out the bait, hook, reward from social_post_c
post_content = social_post_c.split("post_content:")[1].strip()

# Evaluate the engagement potential of the generated social post
engagement_result = evaluate_engagement(post_content, context["insight"], context["social_network"])
print("Engagement Evaluation:")
print(engagement_result)


Engagement Evaluation:
```yaml
Analysis: |
  The post effectively leverages relevant insights about the necessity of prompt engineering when working with advanced AI models, highlighting the human role in this process. It grabs attention by emphasizing the contrast between "smart AI" and "mere mortals," which can resonate with a wide audience interested in AI. The hook is present, as it keeps the reader engaged by discussing the importance of structured input and its potential impact on AI output quality.

  The reward includes the promise of groundbreaking insights and organizational alignment, motivating readers to consider their approach to AI. The post is tailored for Twitter, where concise, thought-provoking content performs well. However, the statistic claiming that structured input can improve AI output by 35% raises concerns. Without citing a credible source, it could be considered inaccurate or unverified, leading to a potential hallucination, which is detrimental to credibili

### 5. Divide Labor 
Split tasks into multiple steps, chained together for complex goals.

In [10]:
import re
import asyncio
import nest_asyncio
nest_asyncio.apply() # to run in jupyter notebook


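async def generate_and_evaluate_post(prompt, context):
    # Run the blocking API calls in worker threads so several posts can be generated concurrently
    response = await asyncio.to_thread(get_completion, prompt, context)
    # Keep only the text after the post_content key (raises IndexError if the key is missing)
    post_content = response.split("post_content:")[1].strip()
    evaluation = await asyncio.to_thread(evaluate_engagement, post_content, context["insight"], context["social_network"])

    # Extract the rating from the YAML output with a regex; treat an unparseable rating as 0
    match = re.search(r'Rating:\s*(\d+)', evaluation)
    rating = int(match.group(1)) if match else 0

    return {
        "content": post_content,
        "rating": rating
    }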

async def generate_and_rank_desc(prompt, context, num_posts=5):
    tasks = [generate_and_evaluate_post(prompt, context) for _ in range(num_posts)]
    results = await asyncio.gather(*tasks)
    
    # Sort posts by rating in descending order
    sorted_posts = sorted(results, key=lambda x: x["rating"], reverse=True)
    
    return sorted_posts

# Run the async function
ranked_posts = asyncio.run(generate_and_rank_desc(examples_prompt, examples_context))

# Print content and ratings for all posts
for i, post in enumerate(ranked_posts, 1):
    print(f"Post {i}:")
    print(f"Content: {post['content']}")
    print(f"Rating: {post['rating']}")
    print()


Post 1:
Content: Even the most advanced AI systems will still need "prompt engineering" to thrive—just like brilliant humans depend on guidance from HR, legal, and management to navigate complex business landscapes. It’s a stark reminder that clarity and alignment are vital, whether you’re a seasoned pro or a cutting-edge model. Brace yourselves: research shows that 80% of projects fail due to communication breakdowns. As we lean into this new era of smart tech, let's not forget the art of prompting—because the clearer the question, the better the answer. 
```
Rating: 4

Post 2:
Content: Genius AI won't eliminate the need for prompting—just ask any executive. Even the smartest humans require guidance from legal, HR, and management to align with business goals. If humans need prompting, why would we think LLMs are any different? Research shows that 80% of successful businesses leverage structured communication to optimize performance. So, in a world of smarter models, prompt engineering
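
The pipeline above relies on string splitting and a regex, and the model sometimes wraps its reply in a Markdown code fence. A slightly more forgiving parser is sketched below, using only the standard library. `strip_code_fence` and `parse_rating` are hypothetical helpers, and the sample replies are invented.

```python
import re
from typing import Optional

FENCE = "`" * 3  # Markdown code fence marker

def strip_code_fence(text: str) -> str:
    """Drop a leading fence line (with optional language tag) and a trailing fence line."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()

def parse_rating(evaluation: str) -> Optional[int]:
    """Return the integer after 'Rating:', or None when no rating can be parsed."""
    match = re.search(r"^\s*Rating:\s*(\d+)", strip_code_fence(evaluation), flags=re.MULTILINE)
    return int(match.group(1)) if match else None

reply = FENCE + "yaml\nAnalysis: |\n  Clear and concise.\nRating: 4\n" + FENCE
print(parse_rating(reply))            # 4
print(parse_rating("No score here"))  # None
print(strip_code_fence('"A post body."\n' + FENCE))
```

Returning `None` instead of raising lets the caller decide whether to retry, skip the post, or count it as a failure.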

---

## Advanced Optimization

### A/B Testing

In [11]:
async def ab_test_prompts(prompt_a, prompt_b, context, num_runs=10):
    tasks_a = [generate_and_evaluate_post(prompt_a, context) for _ in range(num_runs)]
    tasks_b = [generate_and_evaluate_post(prompt_b, context) for _ in range(num_runs)]
    
    results_a = await asyncio.gather(*tasks_a)
    results_b = await asyncio.gather(*tasks_b)

    avg_rating_a = sum(post['rating'] for post in results_a) / len(results_a)
    avg_rating_b = sum(post['rating'] for post in results_b) / len(results_b)

    print(f"Prompt A average rating: {avg_rating_a:.2f}")
    print(f"Prompt B average rating: {avg_rating_b:.2f}")

    return results_a, results_b

examples_prompt_b = """Write a social media post about how {insight}, for {social_network}, in the style of Malcolm Tucker, using the Bait, Hook, Reward framework.

- **bait**: Grab the attention of the person browsing social media. This could be a contrarian statement, appeal to identity, celebrity name, or anything else that stops them scrolling.
- **hook**: Keep their attention now you have it. Show this post is relevant to them, by alerting them to an issue, using familiar words, or providing social proof to keep them reading.
- **reward**: Compensate them for their attention to elicit reciprocation. Is there any useful information, a surprising statistic, or an interesting anecdote to reveal? Or a threat to make?
- **post_content**: The main content of the post, incorporating the bait, hook, and reward in a way that resonates with the reader and engages their attention.

First decide on the bait, hook, and reward, before writing the post_content.

## Examples
{examples_partial}

## Instructions
The social post should sound authentically human and colloquial, while being practically useful and brief. Be very creative and make niche references to sound more human. To maximize engagement, the content should be surprising, interesting, or practically useful, or induce high-arousal emotions such as anxiety, anger, awe, or surprise.

Follow best practices for posting engaging content on {social_network}, but do not use emoji or hashtags.
Provide citations for any statistics included. Do not reference the work of Malcolm Tucker.
Respond in YAML with the following keys:
- bait
- hook
- reward
- post_content"""

results_a, results_b = asyncio.run(ab_test_prompts(examples_prompt, examples_prompt_b, examples_context, num_runs=30))

Prompt A average rating: 3.67
Prompt B average rating: 3.57
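
The gap between 3.67 and 3.57 is small, and with 30 runs per prompt it may be within noise. One rough way to judge this is to compare the difference in means against its standard error. The sketch below assumes the per-post ratings are available as lists, for example `[p["rating"] for p in results_a]`. `compare_ratings` is a hypothetical helper, and the numbers are made up rather than taken from the run above.

```python
import math
import statistics

def compare_ratings(ratings_a, ratings_b):
    """Return the difference in means (A minus B) and the standard error of that difference."""
    mean_diff = statistics.mean(ratings_a) - statistics.mean(ratings_b)
    # Standard error for two independent samples with possibly unequal variances
    std_err = math.sqrt(
        statistics.variance(ratings_a) / len(ratings_a)
        + statistics.variance(ratings_b) / len(ratings_b)
    )
    return mean_diff, std_err

# Hypothetical ratings, not the notebook's actual results
ratings_a = [4, 4, 3, 4, 3, 4]
ratings_b = [4, 3, 3, 4, 3, 4]
mean_diff, std_err = compare_ratings(ratings_a, ratings_b)
print(f"difference: {mean_diff:.2f} +/- {2 * std_err:.2f}")
```

As a rule of thumb, a difference smaller than about two standard errors does not clearly favor either prompt, and more runs would be needed to separate them.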


In [12]:
print(results_b[0])

{'content': '"Think smarter models mean the end of prompt engineering? Wrong. Even the sharpest genius needs guidance—just like models need human alignment to dodge legal landmines and HR hiccups. Companies investing in prompt engineering see efficiency boosts of up to 30%. So, are you ready to automate chaos or let it reign? #ProTip: The best time to get your act together was yesterday. The next best is today."\n```', 'rating': 4}
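
The two prompts compared in this section differ only in the persona name, so the full prompt text is duplicated to change a few words. One option is to keep a single base template with a `{style}` placeholder and fill that field first. This is a minimal sketch with a shortened template; `BASE_PROMPT` and `with_style` are hypothetical names.

```python
BASE_PROMPT = (
    "Write a social media post about how {insight}, for {social_network}, "
    "in the style of {style}."
)

def with_style(base_prompt: str, style: str) -> str:
    """Fill only the {style} placeholder, leaving the other fields for later formatting."""
    return base_prompt.replace("{style}", style)

prompt_a = with_style(BASE_PROMPT, "Malcolm Gladwell")
prompt_b = with_style(BASE_PROMPT, "Malcolm Tucker")
print(prompt_b)
```

Using `str.replace` rather than `str.format` here keeps `{insight}` and `{social_network}` intact, so the resulting strings can still be passed to `get_completion` as templates.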


### DSPy Optimizers

In [15]:
import dspy

# Define the task using DSPy
class SocialMediaPostGenerator(dspy.Signature):
    """Generate a social media post based on an insight and social network."""
    insight = dspy.InputField()
    social_network = dspy.InputField()
    post = dspy.OutputField(desc="Generated social media post in YAML format")

# Create a DSPy program
class GeneratePost(dspy.Module):
    def __init__(self):
        super().__init__()
        self.gen = dspy.ChainOfThought(SocialMediaPostGenerator)

    def forward(self, insight, social_network):
        return self.gen(insight=insight, social_network=social_network)

# Create a DSPy metric
def post_quality_metric(gold, pred, trace=None):
    evaluation_result = evaluate_engagement(pred.post, gold.insight, gold.social_network)
    
    # Extract the rating from the YAML-formatted string using regex
    match = re.search(r'Rating:\s*(\d+(?:\.\d+)?)', evaluation_result)
    # If no rating can be parsed, score the prediction 0
    return float(match.group(1)) if match else 0.0

# Set up the language model
gpt_4o_mini = dspy.LM(model='gpt-4o-mini')
dspy.settings.configure(lm=gpt_4o_mini)

# Run the model without training to establish a baseline
baseline_model = GeneratePost()

print("Baseline post (without optimization):")
baseline_post = baseline_model(
    insight=context["insight"],
    social_network=context["social_network"]
)
print(baseline_post.post)

gold = dspy.Example(
    insight=context["insight"],
    social_network=context["social_network"],
    post=""  # We don't have a gold post, so leave it empty
)

print("Baseline post quality score:", post_quality_metric(gold, baseline_post))

Baseline post (without optimization):
```yaml
post:
  content: "Even as AI models become smarter, prompt engineering remains crucial! Just like genius humans need guidance from legal, HR, and management to align with business interests, AI needs the right prompts to deliver value. Let's keep the conversation going on how we can effectively harness AI! #AI #PromptEngineering #BusinessAlignment"
  hashtags:
    - AI
    - PromptEngineering
    - BusinessAlignment
```
Baseline post quality score: 4.0
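
`post_quality_metric` returns the raw 0-5 rating, which is why the optimizer log in this notebook reports averages such as 400.0%. If scores in the 0-1 range are preferred, the rating could be rescaled before it is returned. This is a minimal sketch; `normalize_rating` is a hypothetical helper and not part of DSPy.

```python
def normalize_rating(rating: float, max_rating: float = 5.0) -> float:
    """Map a 0-5 rating onto 0-1, clamping values that fall outside the scale."""
    return min(max(rating / max_rating, 0.0), 1.0)

print(normalize_rating(4.0))  # 0.8
print(normalize_rating(7.0))  # 1.0
```

Rescaling does not change which candidate program wins, since the ordering of scores is preserved; it only makes the reported percentages easier to read.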


In [16]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Prepare the dataset
trainset = [
    dspy.Example(
        insight=context['insight'],
        social_network=context['social_network'],
        post=post['content']
    ).with_inputs('insight', 'social_network')
    for post in results_b[:20]  # Use first 20 examples for training
]

devset = [
    dspy.Example(
        insight=context['insight'],
        social_network=context['social_network'],
        post=post['content']
    ).with_inputs('insight', 'social_network')
    for post in results_b[20:]  # Use remaining examples for validation
]

# Set up the optimizer
optimizer = BootstrapFewShotWithRandomSearch(metric=post_quality_metric)

# Compile the program
compiled_model = optimizer.compile(
    GeneratePost(),
    trainset=trainset,
    valset=devset,
)

# Now you can use the compiled model to generate optimized posts
optimized_post = compiled_model(
    insight=context["insight"],
    social_network=context["social_network"]
)

print("Optimized post:")
print(optimized_post.post)

Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 16 candidate sets.
Average Metric: 40.00 / 10 (400.0%): 100%|██████████| 10/10 [00:07<00:00,  1.39it/s]

2025/04/20 18:58:39 INFO dspy.evaluate.evaluate: Average Metric: 40.0 / 10 (400.0%)



New best score: 400.0 for seed -3
Scores so far: [400.0]
Best score so far: 400.0
Average Metric: 42.00 / 10 (420.0%): 100%|██████████| 10/10 [00:08<00:00,  1.17it/s]

2025/04/20 18:58:49 INFO dspy.evaluate.evaluate: Average Metric: 42.0 / 10 (420.0%)



New best score: 420.0 for seed -2
Scores so far: [400.0, 420.0]
Best score so far: 420.0


 20%|██        | 4/20 [00:21<01:25,  5.37s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 36.00 / 10 (360.0%): 100%|██████████| 10/10 [00:12<00:00,  1.24s/it]

2025/04/20 18:59:23 INFO dspy.evaluate.evaluate: Average Metric: 36.0 / 10 (360.0%)



Scores so far: [400.0, 420.0, 360.0]
Best score so far: 420.0


 20%|██        | 4/20 [00:22<01:30,  5.68s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 41.00 / 10 (410.0%): 100%|██████████| 10/10 [00:11<00:00,  1.17s/it]

2025/04/20 18:59:58 INFO dspy.evaluate.evaluate: Average Metric: 41.0 / 10 (410.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0]
Best score so far: 420.0


 10%|█         | 2/20 [00:12<01:52,  6.27s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 36.00 / 10 (360.0%): 100%|██████████| 10/10 [00:12<00:00,  1.24s/it]

2025/04/20 19:00:23 INFO dspy.evaluate.evaluate: Average Metric: 36.0 / 10 (360.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0]
Best score so far: 420.0


  5%|▌         | 1/20 [00:04<01:31,  4.79s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 40.00 / 10 (400.0%): 100%|██████████| 10/10 [00:10<00:00,  1.01s/it]

2025/04/20 19:00:38 INFO dspy.evaluate.evaluate: Average Metric: 40.0 / 10 (400.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0]
Best score so far: 420.0


 10%|█         | 2/20 [00:13<02:05,  6.99s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 41.00 / 10 (410.0%): 100%|██████████| 10/10 [00:10<00:00,  1.09s/it]

2025/04/20 19:01:03 INFO dspy.evaluate.evaluate: Average Metric: 41.0 / 10 (410.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0]
Best score so far: 420.0


 10%|█         | 2/20 [00:12<01:48,  6.02s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 42.00 / 10 (420.0%): 100%|██████████| 10/10 [00:09<00:00,  1.02it/s]

2025/04/20 19:01:24 INFO dspy.evaluate.evaluate: Average Metric: 42.0 / 10 (420.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0]
Best score so far: 420.0


 15%|█▌        | 3/20 [00:22<02:06,  7.44s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 43.00 / 10 (430.0%): 100%|██████████| 10/10 [00:08<00:00,  1.15it/s]

2025/04/20 19:01:55 INFO dspy.evaluate.evaluate: Average Metric: 43.0 / 10 (430.0%)



New best score: 430.0 for seed 5
Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0]
Best score so far: 430.0


  5%|▌         | 1/20 [00:05<01:46,  5.61s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 37.00 / 10 (370.0%): 100%|██████████| 10/10 [00:11<00:00,  1.20s/it]

2025/04/20 19:02:13 INFO dspy.evaluate.evaluate: Average Metric: 37.0 / 10 (370.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0]
Best score so far: 430.0


 15%|█▌        | 3/20 [00:23<02:12,  7.77s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 44.00 / 10 (440.0%): 100%|██████████| 10/10 [00:11<00:00,  1.17s/it]

2025/04/20 19:02:48 INFO dspy.evaluate.evaluate: Average Metric: 44.0 / 10 (440.0%)



New best score: 440.0 for seed 7
Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0]
Best score so far: 440.0


 10%|█         | 2/20 [00:17<02:38,  8.82s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 40.00 / 10 (400.0%): 100%|██████████| 10/10 [00:12<00:00,  1.24s/it]

2025/04/20 19:03:18 INFO dspy.evaluate.evaluate: Average Metric: 40.0 / 10 (400.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0]
Best score so far: 440.0


 20%|██        | 4/20 [00:25<01:42,  6.39s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 42.00 / 10 (420.0%): 100%|██████████| 10/10 [00:08<00:00,  1.15it/s]

2025/04/20 19:03:52 INFO dspy.evaluate.evaluate: Average Metric: 42.0 / 10 (420.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0]
Best score so far: 440.0


  5%|▌         | 1/20 [00:05<01:35,  5.04s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 35.00 / 10 (350.0%): 100%|██████████| 10/10 [00:09<00:00,  1.04it/s]

2025/04/20 19:04:07 INFO dspy.evaluate.evaluate: Average Metric: 35.0 / 10 (350.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0]
Best score so far: 440.0


 20%|██        | 4/20 [00:36<02:24,  9.05s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 38.00 / 10 (380.0%): 100%|██████████| 10/10 [00:09<00:00,  1.07it/s]

2025/04/20 19:04:53 INFO dspy.evaluate.evaluate: Average Metric: 38.0 / 10 (380.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0, 380.0]
Best score so far: 440.0


 20%|██        | 4/20 [00:22<01:30,  5.64s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 39.00 / 10 (390.0%): 100%|██████████| 10/10 [00:09<00:00,  1.11it/s]

2025/04/20 19:05:24 INFO dspy.evaluate.evaluate: Average Metric: 39.0 / 10 (390.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0, 380.0, 390.0]
Best score so far: 440.0


 15%|█▌        | 3/20 [00:19<01:48,  6.35s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Average Metric: 42.00 / 10 (420.0%): 100%|██████████| 10/10 [00:09<00:00,  1.04it/s]

2025/04/20 19:05:53 INFO dspy.evaluate.evaluate: Average Metric: 42.0 / 10 (420.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0, 380.0, 390.0, 420.0]
Best score so far: 440.0


  5%|▌         | 1/20 [00:07<02:18,  7.28s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 41.00 / 10 (410.0%): 100%|██████████| 10/10 [00:07<00:00,  1.25it/s]

2025/04/20 19:06:08 INFO dspy.evaluate.evaluate: Average Metric: 41.0 / 10 (410.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0, 380.0, 390.0, 420.0, 410.0]
Best score so far: 440.0


 10%|█         | 2/20 [00:11<01:40,  5.58s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 35.00 / 10 (350.0%): 100%|██████████| 10/10 [00:09<00:00,  1.11it/s]

2025/04/20 19:06:29 INFO dspy.evaluate.evaluate: Average Metric: 35.0 / 10 (350.0%)



Scores so far: [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0, 440.0, 400.0, 420.0, 350.0, 380.0, 390.0, 420.0, 410.0, 350.0]
Best score so far: 440.0
19 candidate programs found.
Optimized post:
"Don’t be fooled—smarter AI still needs prompt engineering! Just like genius humans rely on legal and HR for guidance, AI models need clear prompts to align with business interests. Without them, even the best tech can miss the mark. Let’s keep our AI on track! #AI #PromptEngineering #BusinessGoals"
```
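The percentages in the log above exceed 100% because the displayed figure appears to be the summed metric divided by the number of examples, times 100 (44.00 / 10 gives 440.0%). That suggests each example is scored on a scale above 1 rather than pass/fail. The sketch below reproduces that arithmetic and the best-so-far bookkeeping. `percent_score` and `best_candidate` are hypothetical helpers written for illustration, not DSPy functions, and the per-example ratings are made up so that they sum to 44.

```python
def percent_score(metric_values):
    """Summed metric over example count, as a percentage (assumed log formula)."""
    return round(100 * sum(metric_values) / len(metric_values), 1)


def best_candidate(scores):
    """Return (index, score) of the highest-scoring candidate program."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Hypothetical per-example ratings that sum to 44, like the best run in the log
example_ratings = [5, 4, 5, 4, 4, 5, 4, 4, 5, 4]
print(percent_score(example_ratings))

# Candidate scores copied from the final "Scores so far" line above
scores = [400.0, 420.0, 360.0, 410.0, 360.0, 400.0, 410.0, 420.0, 430.0, 370.0,
          440.0, 400.0, 420.0, 350.0, 380.0, 390.0, 420.0, 410.0, 350.0]
print(len(scores), best_candidate(scores))
```

Dividing a logged percentage by 100 gives the mean per-example score, so 440.0% corresponds to an average of 4.4 per post.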


In [17]:
gpt_4o_mini.inspect_history(n=1)





[2025-04-20T19:06:29.009895]

System message:

Your input fields are:
1. `insight` (str)
2. `social_network` (str)

Your output fields are:
1. `reasoning` (str)
2. `post` (str): Generated social media post in YAML format

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## insight ## ]]
{insight}

[[ ## social_network ## ]]
{social_network}

[[ ## reasoning ## ]]
{reasoning}

[[ ## post ## ]]
{post}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Generate a social media post based on an insight and social network.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## insight ## ]]
prompt engineering will still be needed with smarter models, as even genius humans need prompting from legal, HR, management to align with business interests

[[ ## social_network ## ]]
Twitter

Respond with the corresponding output fields
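The system message above asks the model to wrap each field in a `[[ ## name ## ]]` header. To make that format concrete, here is a minimal sketch of how such a completion could be split back into fields. `parse_fields` is a hypothetical helper for illustration only; it is not DSPy's own parser, which may handle edge cases differently.

```python
import re

# Matches a header line such as "[[ ## post ## ]]" and captures the field name
FIELD_HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\][ \t]*$", re.MULTILINE)


def parse_fields(completion: str) -> dict[str, str]:
    """Map each `[[ ## name ## ]]` header to the text that follows it."""
    matches = list(FIELD_HEADER.finditer(completion))
    fields = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(completion)
        fields[current.group(1)] = completion[current.end():end].strip()
    fields.pop("completed", None)  # end-of-output marker carries no content
    return fields


sample = (
    "[[ ## reasoning ## ]]\nKeep it short.\n\n"
    "[[ ## post ## ]]\nHello world\n\n"
    "[[ ## completed ## ]]\n"
)
print(parse_fields(sample))
```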

### Fine-Tuning
Fine-tune a smaller model on previously generated posts, then compare it against the base models.

In [19]:
from openai import OpenAI
import json

# Format the data for fine-tuning
fine_tuning_data = []
for post in results_b:
    example = {
        "messages": [
            {"role": "user", "content": examples_prompt.format(**examples_context)},
            {"role": "assistant", "content": post['content']}
        ]
    }
    fine_tuning_data.append(example)

# Write the formatted data to a JSONL file
with open("fine_tuning_data.jsonl", "w") as f:
    for entry in fine_tuning_data:
        json.dump(entry, f)
        f.write("\n")

# Initialize the OpenAI client
client = OpenAI()

# Upload the file (the context manager closes the handle afterwards)
with open("fine_tuning_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# Create a fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo"
)

print(f"Fine-tuning job created with ID: {job.id}")

# You can check the status of your fine-tuning job
print(f"Job status: {job.status}")

# Note: The fine-tuning process may take some time to complete.
# You'll need to periodically check the status of the job to know when it's done.


Fine-tuning job created with ID: ftjob-PP8OZAQ0CK4CaOHkDSPFVsbp
Job status: validating_files
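The job starts in `validating_files`, so a malformed training file only surfaces after upload. A quick local check can catch the obvious problems first. The sketch below is an assumption-laden illustration: `validate_chat_jsonl` is a hypothetical helper, and it checks only the simple shape this notebook writes (a `messages` list of role/content pairs ending with an assistant turn), not every format the fine-tuning API accepts.

```python
import json

# Roles used by the simple chat format in this notebook (an assumption;
# the API also accepts other roles not covered here)
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if any(
            not isinstance(m, dict)
            or m.get("role") not in VALID_ROLES
            or not isinstance(m.get("content"), str)
            for m in messages
        ):
            problems.append((number, "message lacks a known role or string content"))
        elif messages[-1]["role"] != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems


good = json.dumps({"messages": [{"role": "user", "content": "hi"},
                                {"role": "assistant", "content": "yo"}]})
bad = json.dumps({"messages": [{"role": "user", "content": "hi"}]})
print(validate_chat_jsonl([good, bad, "{oops"]))
```

To check the real file, pass an open file handle, for example `validate_chat_jsonl(open("fine_tuning_data.jsonl"))`, since iterating a file yields its lines.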


In [20]:
client.fine_tuning.jobs.list(limit=1)

SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-PP8OZAQ0CK4CaOHkDSPFVsbp', created_at=1745169031, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(batch_size='auto', learning_rate_multiplier='auto', n_epochs='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-C5xNTgkE7xTz1VYEy6oKSbfM', result_files=[], seed=1408832167, status='validating_files', trained_tokens=None, training_file='file-7kcLohNiFAMi14yU1C6oSq', validation_file=None, estimated_finish=None, integrations=[], method=Method(dpo=None, supervised=MethodSupervised(hyperparameters=MethodSupervisedHyperparameters(batch_size='auto', learning_rate_multiplier='auto', n_epochs='auto')), type='supervised'), user_provided_suffix=None, metadata=None)], object='list', has_more=True)
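A fine-tuning job can end without succeeding, so a polling loop is safer when it stops on any terminal status and gives up after a bounded number of checks. The sketch below shows one way to do that. `wait_for_terminal_status` is a hypothetical helper, and the status source is injected as a callable so the loop can be exercised without network access. The statuses `validating_files`, `running`, and `succeeded` appear in this notebook's output; `failed` and `cancelled` are assumed here to be the other terminal statuses.

```python
import time

# Statuses after which the job will not change any further (assumed set)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_terminal_status(fetch_status, poll_seconds=60, max_polls=120,
                             sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Simulated status sequence standing in for real API calls
simulated = iter(["validating_files", "running", "running", "succeeded"])
final_status = wait_for_terminal_status(lambda: next(simulated),
                                        sleep=lambda _seconds: None)
print(final_status)
```

Against the real API, the callable could be `lambda: client.fine_tuning.jobs.retrieve(job.id).status`, reusing the `client` and `job` objects created earlier.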

In [21]:
import re
import time
from tqdm import tqdm

# Get the most recent fine-tuning job and wait for it to reach a terminal state
fine_tuning_jobs = client.fine_tuning.jobs.list(limit=1)
fine_tuned_model = None
if fine_tuning_jobs.data:
    last_job = fine_tuning_jobs.data[0]
    # Stop on failure or cancellation too, otherwise this loop would never exit
    while last_job.status not in ("succeeded", "failed", "cancelled"):
        print(f"Waiting for fine-tuning job to complete. Current status: {last_job.status}")
        time.sleep(60)  # Wait for 60 seconds before checking again
        last_job = client.fine_tuning.jobs.retrieve(last_job.id)

    if last_job.status == "succeeded":
        fine_tuned_model = last_job.fine_tuned_model
        print(f"Using fine-tuned model: {fine_tuned_model}")
    else:
        print(f"Fine-tuning job ended with status: {last_job.status}")
else:
    print("No fine-tuning jobs found. Please run the fine-tuning process first.")

# fine_tuned_model = "ft:gpt-3.5-turbo-0125:saxifrage-llc::9lzyYgCb"

models_to_compare = ["gpt-3.5-turbo", "gpt-4", fine_tuned_model]
models_to_compare = [model for model in models_to_compare if model]  # Remove None if fine-tuned model is not available

results = {}

def parse_engagement_score(response):
    """Extract the number after 'Rating:' from the evaluator's response."""
    match = re.search(r'Rating:\s*(\d+(?:\.\d+)?)', response)
    if match:
        return float(match.group(1))
    return 0.0  # Default to 0 if no match found

for model in models_to_compare:
    print(f"\nEvaluating model: {model}")
    posts = []
    for _ in tqdm(range(10)):  # Generate 10 posts for each model
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": examples_prompt.format(**examples_context)}
            ],
            max_tokens=500
        )
        posts.append(response.choices[0].message.content)

    # Evaluate engagement for the generated posts
    engagement_scores = []
    for post in posts:
        score = parse_engagement_score(evaluate_engagement(post, context["insight"], context["social_network"]))
        engagement_scores.append({"post": post, "score": score})

    average_score = sum(item["score"] for item in engagement_scores) / len(engagement_scores)
    results[model] = {"average_score": average_score, "posts": engagement_scores}

# Print results
print("\nEngagement Evaluation Results:")
for model, data in results.items():
    print(f"{model}: {data['average_score']:.2f}")
    print("Top 3 posts:")
    top_posts = sorted(data['posts'], key=lambda x: x['score'], reverse=True)[:3]
    for i, post_data in enumerate(top_posts, 1):
        print(f"  {i}. Score: {post_data['score']:.2f}")
        print(f"     Post: {post_data['post'][:100]}...")  # Print first 100 characters of the post

# Determine the best performing model
best_model = max(results, key=lambda x: results[x]['average_score'])
print(f"\nBest performing model: {best_model} with an average engagement score of {results[best_model]['average_score']:.2f}")

Waiting for fine-tuning job to complete. Current status: validating_files
Waiting for fine-tuning job to complete. Current status: running
Waiting for fine-tuning job to complete. Current status: running
Waiting for fine-tuning job to complete. Current status: running
Waiting for fine-tuning job to complete. Current status: running
Waiting for fine-tuning job to complete. Current status: running
Using fine-tuned model: ft:gpt-3.5-turbo-0125:personal::BOSZLJvJ

Evaluating model: gpt-3.5-turbo


100%|██████████| 10/10 [00:15<00:00,  1.57s/it]



Evaluating model: gpt-4


100%|██████████| 10/10 [00:50<00:00,  5.08s/it]



Evaluating model: ft:gpt-3.5-turbo-0125:personal::BOSZLJvJ


100%|██████████| 10/10 [02:18<00:00, 13.80s/it]



Engagement Evaluation Results:
gpt-3.5-turbo: 3.80
Top 3 posts:
  1. Score: 5.00
     Post: - bait: Smart robots will never replace engineers completely. 
- hook: Even genius humans need guida...
  2. Score: 4.00
     Post: bait: Even genius humans need prompting from legal, HR, and management.
hook: Smart models can only ...
  3. Score: 4.00
     Post: bait: Even genius humans need a nudge in the right direction.
hook: Prompt engineering is crucial fo...
gpt-4: 4.30
Top 3 posts:
  1. Score: 5.00
     Post: bait: Bringing in a seemingly unrelated yet popular topic - genius humans - in the context of smarte...
  2. Score: 5.00
     Post: - bait: Questioning the infallibility of smarter AI models
- hook: Drawing parallels between AI need...
  3. Score: 5.00
     Post: bait: Spark curiosity by mentioning the intersection of smarter models and the need for humans, catc...
ft:gpt-3.5-turbo-0125:personal::BOSZLJvJ: 3.60
Top 3 posts:
  1. Score: 4.00
     Post: "It’s a common myth that smart

In [24]:
results['ft:gpt-3.5-turbo-0125:personal::BOSZLJvJ']['posts'][0]

{'post': '"It’s a common myth that smarter models mean less guidance needed. Even when you have a genius on your team, they’re still a part of the business, not an island. Just like top-level execs, they need strategic prompting to align with what matters most—your bottom line. So breathe easy, but don’t drop the ball on prompt engineering. Without it, those shining AI stars might be doing floaty ballet that’s pretty but not profitable. Remember: Even the best minds need a nudge from time to time."\n```',
 'score': 4.0}
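The evaluation cell above scores a post as 0.0 whenever the evaluator's response contains no `Rating:` line, which lowers that model's average without any visible warning. The sketch below shows an alternative that skips unparsed responses and reports how many were skipped. `parse_rating` and `mean_of_parsed` are hypothetical helpers, and the sample strings are made up for illustration rather than taken from real evaluator output.

```python
import re

# Same pattern as the evaluation cell: a number following "Rating:"
RATING_PATTERN = re.compile(r"Rating:\s*(\d+(?:\.\d+)?)")


def parse_rating(text):
    """Return the number after 'Rating:' as a float, or None if absent."""
    match = RATING_PATTERN.search(text)
    return float(match.group(1)) if match else None


def mean_of_parsed(texts):
    """Average only the ratings that parsed; also return the number skipped."""
    ratings = [parse_rating(text) for text in texts]
    parsed = [rating for rating in ratings if rating is not None]
    skipped = len(ratings) - len(parsed)
    average = sum(parsed) / len(parsed) if parsed else None
    return average, skipped


# Made-up evaluator responses
responses = ["Rating: 4", "Reasoning: strong hook.\nRating: 5", "n/a"]
print(mean_of_parsed(responses))
```

A nonzero skipped count is a signal to inspect the evaluator's raw responses before trusting the comparison between models.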