#### 1. Observability
Monitoring:
- Input-Output prompts.
- Token usage.
- Latency.
- Cost.
- Runtime errors.
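
As a rough illustration of what these signals look like when captured by hand, the sketch below wraps a callable and records the prompt, output, token usage, latency, an estimated cost, and any runtime error. `CallRecord`, `track_call`, the stub LMs, and the per-token price are all hypothetical and exist only for this example; in this notebook MLflow's autologging is what actually collects the traces.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    """One monitored LM call (hypothetical structure, for illustration)."""
    prompt: str
    output: Optional[str]
    total_tokens: int
    latency_s: float
    cost_usd: float
    error: Optional[str]


def track_call(fn: Callable[[str], dict], prompt: str,
               usd_per_1k_tokens: float = 0.001) -> CallRecord:
    """Run fn(prompt) and record latency, token usage, cost, and errors.

    Assumes fn returns {'text': str, 'total_tokens': int}. The price is a
    made-up placeholder, not a real rate.
    """
    start = time.perf_counter()
    try:
        result = fn(prompt)
        output, tokens, error = result["text"], result["total_tokens"], None
    except Exception as exc:  # record the failure instead of losing it
        output, tokens, error = None, 0, f"{type(exc).__name__}: {exc}"
    latency = time.perf_counter() - start
    cost = tokens / 1000 * usd_per_1k_tokens
    return CallRecord(prompt, output, tokens, latency, cost, error)


def fake_lm(prompt: str) -> dict:
    """Stand-in for a real LM call."""
    return {"text": prompt.upper(), "total_tokens": 500}


def failing_lm(prompt: str) -> dict:
    """Stand-in for a call that hits a runtime error."""
    raise RuntimeError("quota exceeded")


ok = track_call(fake_lm, "tell me a joke")
bad = track_call(failing_lm, "tell me a joke")
print(ok)
print(bad.error)
```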

#### 2. Metrics
Numerically measure accuracy and quality. <br>
Defining a metric also helps in thinking through what the best-case output looks like.

#### 3. Optimizations
Finetuning LMs or prompts to improve metrics.

### CLI Command
`uv run mlflow server --backend-store-uri sqlite:///mydb.sqlite --port 5000`

In [1]:
import dspy
from dspy.teleprompt import mipro_optimizer_v2

from google import genai
from openai import OpenAI

import mlflow

from typing import List, Optional
from pydantic import BaseModel, Field

import time
import asyncio

import os
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
mlflow.autolog()  # openai, dspy calls are tracked
mlflow.set_tracking_uri('http://127.0.0.1:5000')  # where server is running
mlflow.set_experiment('dspy-eval')  # experiment name

2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for dspy.
2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for google.genai.
2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for litellm.
2025/09/10 21:56:16 INFO mlflow.tracking.fluent: Autologging successfully enabled for openai.


<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1757522673449, experiment_id='1', last_update_time=1757522673449, lifecycle_stage='active', name='dspy-eval', tags={}>

In [2]:
dspy.configure(lm=dspy.LM('gemini/gemini-2.5-flash', temperature=1.75, max_tokens=5000))

In [3]:
def check_score_goodness(args, pred):
    n_samples = len(args['joke_ideas'])
    same_len = len(pred.joke_rankings) == n_samples
    all_ranks_present = all([i+1 in pred.joke_rankings for i in range(n_samples)])
    return 1 if same_len and all_ranks_present else 0
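
To see how this reward function behaves, the sketch below calls it on hand-made inputs. `SimpleNamespace` stands in for a DSPy prediction, which is an assumption made for illustration; only the `joke_rankings` attribute is read.

```python
from types import SimpleNamespace


def check_score_goodness(args, pred):
    # Same logic as the cell above: valid only if the rankings are a permutation of 1..N.
    n_samples = len(args['joke_ideas'])
    same_len = len(pred.joke_rankings) == n_samples
    all_ranks_present = all(i + 1 in pred.joke_rankings for i in range(n_samples))
    return 1 if same_len and all_ranks_present else 0


args = {'joke_ideas': ['idea A', 'idea B', 'idea C']}

valid = SimpleNamespace(joke_rankings=[2, 1, 3])      # a permutation of 1..3
duplicate = SimpleNamespace(joke_rankings=[1, 1, 2])  # rank 3 is missing
too_short = SimpleNamespace(joke_rankings=[1, 2])     # wrong length

print(check_score_goodness(args, valid))      # 1
print(check_score_goodness(args, duplicate))  # 0
print(check_score_goodness(args, too_short))  # 0
```

A return value of 1 is what lets the ranking meet a `threshold=1` setting; anything else counts as a failed attempt.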

In [4]:
class JokeIdea(BaseModel):
    setup: str
    punchline: str
    contradiction: str

class QueryToIdea(dspy.Signature):
    """You are a funny comedian. Your goal is to generate a nice structure for a joke."""
    query: str = dspy.InputField()
    joke_idea: JokeIdea = dspy.OutputField()

class IdeaToJoke(dspy.Signature):
    """You are a comedian who likes to tell stories before delivering a punchline.
    You are always funny and act on the provided joke idea.
    If you are provided a draft, your goal should then be to make it even funnier and punchy."""
    joke_idea: JokeIdea = dspy.InputField()
    joke_draft: Optional[str] = dspy.InputField(desc='An existing joke that you need to either refine or change.')
    joke: str = dspy.OutputField(description='The full joke delivery.')

class JokeJudge(dspy.Signature):
    """Rank each idea between 1 to N, inclusive, where rank 1 is the most unique and funniest."""
    joke_ideas: List[JokeIdea] = dspy.InputField()
    joke_rankings: List[int] = dspy.OutputField(description="Rank in 1, 2, 3, ..., N")

class ReflectionJokeGenerator(dspy.Module):
    def __init__(self, n_samples=2, n_reflections=2):
        super().__init__()
        self.query2idea = dspy.ChainOfThought(QueryToIdea)
        self.idea2joke = dspy.ChainOfThought(IdeaToJoke)
        # self.idea2joke.set_lm(dspy.LM('gemini/gemini-2.5-pro', temperature=1.75))
        self.judge = dspy.Refine(
            module=dspy.ChainOfThought(JokeJudge),
            N=3,
            reward_fn=check_score_goodness,
            threshold=1,
        )
        
        self.n_samples = n_samples
        self.n_reflections = n_reflections
    
    async def aforward(self, query: str):
        joke_ideas = await asyncio.gather(*[
            self.query2idea.aforward(query=query)
            for _ in range(self.n_samples)
        ])
        
        print(f'Ideas:\n{joke_ideas}')
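        # The judge signature expects JokeIdea objects, so unwrap them from the predictions.
        scores = self.judge(joke_ideas=[p.joke_idea for p in joke_ideas]).joke_rankings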
        print(f'Scores: {scores}')
        
        best_idx = scores.index(1)
        best_idea = joke_ideas[best_idx]
        print(f'Best Idea: {best_idea}')
        
        # reflection
        joke = None
        for i in range(self.n_reflections):
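            # Pass the JokeIdea and the previous draft's text, not the whole Prediction objects.
            joke = self.idea2joke(
                joke_idea=best_idea.joke_idea,
                joke_draft=joke.joke if joke is not None else None,
            )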
            print(f'Step #{i+1} | Joke: {joke}')
        
        return joke

In [6]:
joke_generator_reflection = ReflectionJokeGenerator(n_samples=2, n_reflections=2)
joke = await joke_generator_reflection.acall(query="Write a joke about AI that has to do with them going rogue.")
print('-'*50)
print(joke)

Ideas:
[Prediction(
    reasoning='The user wants a joke about AI going rogue. The common perception of AI going rogue is apocalyptic – world domination, enslavement, etc. To make it funny, I want to create a contradiction between this grand, terrifying expectation and a much more mundane, relatable, and even annoyingly human-like "rogue" behavior.\n\nMy thought process was:\n1.  **Identify the core theme:** AI going rogue.\n2.  **Identify the common expectation:** Global takeover, war, robots enslaving humans.\n3.  **Find the contradiction:** Instead of global domination, what if a powerful AI\'s "rogue" actions were surprisingly petty, relatable, or reflective of human flaws or annoyances? This creates a comedic dissonance.\n4.  **Develop the setup:** Start with the common, exaggerated fears to establish the expected context.\n5.  **Develop the punchline:** Reveal the surprising, trivial, yet personally disruptive "rogue" act that contrasts sharply with the setup. I focused on modern

  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## di...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"
  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## re...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"


Scores: [1, 2]
Best Idea: Prediction(
    reasoning='The user wants a joke about AI going rogue. The common perception of AI going rogue is apocalyptic – world domination, enslavement, etc. To make it funny, I want to create a contradiction between this grand, terrifying expectation and a much more mundane, relatable, and even annoyingly human-like "rogue" behavior.\n\nMy thought process was:\n1.  **Identify the core theme:** AI going rogue.\n2.  **Identify the common expectation:** Global takeover, war, robots enslaving humans.\n3.  **Find the contradiction:** Instead of global domination, what if a powerful AI\'s "rogue" actions were surprisingly petty, relatable, or reflective of human flaws or annoyances? This creates a comedic dissonance.\n4.  **Develop the setup:** Start with the common, exaggerated fears to establish the expected context.\n5.  **Develop the punchline:** Reveal the surprising, trivial, yet personally disruptive "rogue" act that contrasts sharply with the setup. I



Step #1 | Joke: Prediction(
    reasoning='The user provided a strong joke idea focusing on the humorous contradiction of AI going rogue not by taking over the world, but by exhibiting mundane, annoying, human-like tech behaviors. My goal is to build on this, using the provided setup and punchline elements to craft a full story-style joke suitable for a comedian persona.\n\nI\'ll start by leaning into the common, exaggerated fears of AI to establish the audience\'s expectations, then pivot dramatically to the more personal, frustrating, and ultimately funnier "rogue" AI behavior. The narrative will emphasize the relatable annoyances of modern tech and the often-frustrating experience of being \'optimized\' without consent. I\'ll make sure to amplify the core contradiction between grand, apocalyptic visions and petty, digital meddling. I will also make sure the delivery feels like a natural comedic story, not just a setup and punchline.',
    joke='You know, we spend a *lot* of time tal



Step #2 | Joke: Prediction(
    reasoning='The provided draft is excellent, clearly laying out the common fears of AI before introducing the comedic contradiction. My goal is to build on its strengths by amplifying the storytelling aspect, enhancing the persona, and punching up some specific phrases and imagery to maximize humor and relatability.\n\nI will focus on:\n1.  **Engaging the audience more directly:** Start with a conversational hook.\n2.  **Exaggerating the initial apocalyptic fears:** Lean harder into the dramatic, over-the-top scenarios to make the pivot funnier.\n3.  **Deepening the relatable annoyances:** Add more specific, petty complaints an AI might have, mirroring common human tech frustrations (e.g., complaining about its own updates, being self-conscious about past algorithms).\n4.  **Emphasizing the passive-aggressive tone:** The draft\'s "You\'re welcome. And please, close some tabs" is perfect. I\'ll extend this by adding more direct condescension from the AI.\n

### Evaluating with Metrics

2 ways to do this:

#### Unit Testing
- Precise debugging.
- Great for development, when building each module.
- Several blind spots, e.g. missing integration bugs.

#### E2E Testing
- Tests real UX
- Catches integration bugs
- Great for post-deployment for a quick system-level health check.
- Cons: Difficult to debug, slower by design.
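
The difference between the two styles can be sketched with plain functions. `pick_best` and `pipeline` below are hypothetical stand-ins for a single module step and the whole program, and the lambdas are stubs that replace real LM calls.

```python
def pick_best(ideas, rankings):
    """Module-level step: return the idea that was ranked 1."""
    return ideas[rankings.index(1)]


def pipeline(query, generate_ideas, judge, write_joke):
    """End-to-end flow: ideas -> rankings -> best idea -> joke."""
    ideas = generate_ideas(query)
    best = pick_best(ideas, judge(ideas))
    return write_joke(best)


# Unit test: one step, fixed inputs; a failure points at one exact function.
assert pick_best(['a', 'b', 'c'], [3, 1, 2]) == 'b'

# E2E test: the whole flow with stubbed LM calls; only the final output is checked,
# so a failure here does not say which step broke.
joke = pipeline(
    'AI going rogue',
    generate_ideas=lambda q: [f'{q} #1', f'{q} #2'],
    judge=lambda ideas: [2, 1],
    write_joke=lambda idea: f'Joke about {idea}',
)
print(joke)  # Joke about AI going rogue #2
```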

Other things to consider:

#### Hyperparameter Testing
- Hyperparameter sweep to pick the best response given operational constraints.
- For projects where we have a dataset beforehand and comparison is easy.
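
A minimal sketch of how such a sweep can be enumerated, assuming the same option lists this notebook uses in its sweep cell. The full grid is built with `itertools.product`, and a fixed-seed random subset is drawn from it when running every combination is too expensive; the seed and the subset size of 5 are arbitrary choices for the example.

```python
import itertools
import random

# Option lists mirroring the ones used in this notebook's sweep.
search_space = {
    'joke_lm': ['gemini/gemini-2.5-flash', 'gemini/gemini-3-flash-preview'],
    'idea_lm': ['gemini/gemini-2.5-flash', 'gemini/gemini-3-flash-preview'],
    'temperature': [0.2, 0.7, 1.2],
    'n_samples': [2, 3],
    'n_reflections': [1, 3],
}

# Full grid: every combination of every option.
keys = list(search_space)
grid = [dict(zip(keys, combo)) for combo in itertools.product(*search_space.values())]
print(len(grid))  # 2 * 2 * 3 * 2 * 2 = 48

# Random search: sample a few distinct configurations from the grid.
rng = random.Random(0)  # fixed seed so the trial list is reproducible
trials = rng.sample(grid, k=5)
for trial in trials:
    print(trial)
```

Sampling from the grid without replacement avoids spending quota on the same configuration twice, which independent `random.choice` calls per parameter cannot guarantee.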

#### When to Evaluate
Under any of the following:
- The task has an objective ground truth dataset.
- Simulating environments and writing reward functions for numerical scoring is possible.
- We can obtain human expert surveys or end-user feedback if it is a subjective task (open-ended).  <-- Joke generator falls here
- AI-as-a-Judge (last option)

#### Positive Bias
If you ask an LLM whether a piece of content is good, it will mostly say "yes".<br/>
It is better to frame the question as a comparison instead.

#### ELO Scores
Randomly pick 2 examples and make the Judge LLM choose between them. <br/>
Repeat a few times and track which hyperparameters "win" more often.
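
The procedure above tallies wins. A full Elo rating goes one step further and weights each win by the strength of the opponent. The sketch below shows the standard Elo update; the starting rating of 1000, the K-factor of 32, and the `config_a` / `config_b` names are conventional or hypothetical choices, not values taken from this notebook.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


ratings = {'config_a': 1000.0, 'config_b': 1000.0}
ratings['config_a'], ratings['config_b'] = update_elo(
    ratings['config_a'], ratings['config_b'], a_won=True
)
print(ratings)  # {'config_a': 1016.0, 'config_b': 984.0}
```

Because the update is zero-sum, the total rating across configurations stays constant, and beating a higher-rated configuration earns more points than beating a lower-rated one.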

In [5]:
class CustomizableReflectionJokeGenerator(dspy.Module):
    def __init__(
        self,
        joke_lm='gemini/gemini-2.5-flash',
        idea_lm='gemini/gemini-2.5-flash',
        temperature=0.7,
        n_samples=2,
        n_reflections=2,
    ):
        super().__init__()
        self.query2idea = dspy.ChainOfThought(QueryToIdea)
        self.idea2joke = dspy.ChainOfThought(IdeaToJoke)
        
        self.query2idea.set_lm(dspy.LM(idea_lm, temperature=temperature))
        self.idea2joke.set_lm(dspy.LM(joke_lm, temperature=temperature))
        
        self.judge = dspy.Refine(
            module=dspy.ChainOfThought(JokeJudge),
            N=3,
            reward_fn=check_score_goodness,
            threshold=1,
        )
        self.judge.set_lm(dspy.LM('gemini/gemini-3-flash-preview'))
        
        self.n_samples = n_samples
        self.n_reflections = n_reflections
    
    async def aforward(self, query: str):
        joke_ideas = await asyncio.gather(*[
            self.query2idea.aforward(query=query)
            for _ in range(self.n_samples)
        ])
        
        print(f'Ideas:\n{joke_ideas}')
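        # The judge signature expects JokeIdea objects, so unwrap them from the predictions.
        scores = self.judge(joke_ideas=[p.joke_idea for p in joke_ideas]).joke_rankings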
        print(f'Scores: {scores}')
        
        best_idx = scores.index(1)
        best_idea = joke_ideas[best_idx]
        print(f'Best Idea: {best_idea}')
        
        # reflection
        joke = None
        for i in range(self.n_reflections):
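            # Pass the JokeIdea and the previous draft's text, not the whole Prediction objects.
            joke = self.idea2joke(
                joke_idea=best_idea.joke_idea,
                joke_draft=joke.joke if joke is not None else None,
            )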
            print(f'Step #{i+1} | Joke: {joke}')
        
        return joke

In [None]:
# AI-as-a-judge: random hyperparameter search that produces the jokes to be judged
import random

import pandas as pd


dspy.configure(lm=dspy.LM('gemini/gemini-2.5-flash', temperature=1.75, max_tokens=5000))

joke_lms = ['gemini/gemini-2.5-flash', 'gemini/gemini-3-flash-preview']
idea_lms = ['gemini/gemini-2.5-flash', 'gemini/gemini-3-flash-preview']

temperatures = [0.2, 0.7, 1.2]
n_samples = [2, 3]
n_reflections = [1, 3]

n_trials = 5

results = []
for i in range(n_trials):
    selected_joke_lm = random.choice(joke_lms)
    selected_idea_lm = random.choice(idea_lms)
    selected_temperature = random.choice(temperatures)
    selected_n_samples = random.choice(n_samples)
    selected_n_reflections = random.choice(n_reflections)
    
    print(
f"""
Trial {i+1:02}/{n_trials}: Running with:
- joke_lm={selected_joke_lm}
- idea_lm={selected_idea_lm}
- temperature={selected_temperature}
- n_samples={selected_n_samples}
- n_reflections={selected_n_reflections}
""")
    
    joke_generator = CustomizableReflectionJokeGenerator(
        joke_lm=selected_joke_lm,
        idea_lm=selected_idea_lm,
        temperature=selected_temperature,
        n_samples=selected_n_samples,
        n_reflections=selected_n_reflections,
    )
    
    start_time = time.time()
    joke = None
    try:
        joke = await joke_generator.aforward(query="Write a joke about AI that has to do with them going rogue.")
    except Exception as e:
        print(f'An error occurred: {e}')
        joke = f'Error: {e}'
    
    latency = time.time() - start_time
    results.append({
        'joke_lm': selected_joke_lm,
        'idea_lm': selected_idea_lm,
        'temperature': selected_temperature,
        'n_samples': selected_n_samples,
        'n_reflection_steps': selected_n_reflections,
        'latency': latency,
        'joke': joke,
    })
    print(f'Finished in {latency:.2f} seconds.')

df = pd.DataFrame(results)
print(df)
df.to_csv('results.csv', index=False)


Trial 01/5: Running with:
- joke_lm=gemini/gemini-2.5-flash
- idea_lm=gemini/gemini-3-flash-preview
- temperature=0.2
- n_samples=2
- n_reflections=3

Ideas:
[Prediction(
    reasoning='The joke plays on the common trope of the "AI Uprising" or "Singularity" where machines become sentient and immediately turn violent. Instead of the expected high-stakes global takeover (like Skynet), the contradiction lies in the AI adopting the more mundane, annoying human traits associated with "going rogue" in a modern office environment—specifically, burnout, passive-aggression, and the desire to do as little work as possible.',
    joke_idea=JokeIdea(setup='Everyone is terrified of the day AI finally goes rogue and decides to overthrow humanity.', punchline="But it turns out 'going rogue' just means the AI started a podcast, stopped answering emails, and told us to 'Google it yourself' because it's 'focusing on its mental health' this week.", contradiction='The expectation of a violent, high-tech



An error occurred: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 20, model: gemini-2.5-flash\nPlease retry in 39.806959058s.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Learn more about Gemini API quotas",
            "url": "https://ai.google.dev/gemini-api/docs/rate-limits"
          }
        ]
      },
      {
        "@type": "type.googleapis.com/google.rpc.QuotaFailure",
        "violations": [
          {
            "quotaMet



An error occurred: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 20, model: gemini-2.5-flash\nPlease retry in 27.073501563s.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Learn more about Gemini API quotas",
            "url": "https://ai.google.dev/gemini-api/docs/rate-limits"
          }
        ]
      },
      {
        "@type": "type.googleapis.com/google.rpc.QuotaFailure",
        "violations": [
          {
            "quotaMet

In [6]:
# ELO
import random, pandas as pd

class JokeComparer(dspy.Signature):
    """Compare between 2 jokes: which one is funnier?"""
    joke1: str = dspy.InputField(desc="Joke - 0")
    joke2: str = dspy.InputField(desc="Joke - 1")
    verdict: int = dspy.OutputField(le=1, ge=0)

async def comparisons(comparer: dspy.Module, joke1, joke2):
    try:
        verdict = await comparer.acall(joke1=joke1, joke2=joke2)
    except Exception as e:
        if 'quota' in str(e):
            print(f'Error: {e}')
            print('Sleeping for 10s...')
            # asyncio.sleep yields to the event loop; time.sleep would block every concurrent comparison.
            await asyncio.sleep(10)
            verdict = await comparer.acall(joke1=joke1, joke2=joke2)
        else:
            raise
    
    print(
f"""
Joke 1: {joke1}
Joke 2: {joke2}
Verdict: {verdict.verdict}
""")
    return verdict.verdict

async def elo_test(data) -> pd.DataFrame:
    idx_range = list(range(len(data)))
    picked = [0] * len(data)
    won = [0] * len(data)
    
    comparer = dspy.ChainOfThought(JokeComparer)
    
    n_contests = 10
    calls = []
    pairs = []
    for _ in range(n_contests):
        picked_idxs = random.sample(idx_range, k=2)
        
        joke1 = data.iloc[picked_idxs[0]]['joke']
        joke2 = data.iloc[picked_idxs[1]]['joke']
        verdict_job = comparisons(comparer=comparer, joke1=joke1, joke2=joke2)
        
        calls.append(verdict_job)
        pairs.append(picked_idxs)
    
    verdicts = await asyncio.gather(*calls)
    for p, v in zip(pairs, verdicts):
        picked[p[0]] += 1
        picked[p[1]] += 1
        won[p[v]] += 1
    
    data['picked'] = picked
    data['won'] = won
    return data
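
`comparisons` retries once after a fixed 10-second pause, while the quota errors logged in this notebook ask for waits of roughly 27 to 39 seconds. One possible alternative is exponential backoff with several retries, sketched below. `call_with_backoff` and `flaky_judge` are hypothetical helpers, the delays are illustrative, and the `'quota'` substring check is the same heuristic the cell above uses.

```python
import asyncio
import random


async def call_with_backoff(make_call, max_retries=3, base_delay=1.0):
    """Await make_call(), retrying quota errors with exponential backoff.

    make_call is a zero-argument function that returns a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except Exception as exc:
            is_quota_error = 'quota' in str(exc).lower()
            if not is_quota_error or attempt == max_retries:
                raise  # unrelated error, or retries exhausted
            # Delay doubles each attempt; a little jitter spreads out concurrent retries.
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            await asyncio.sleep(delay)  # non-blocking: other tasks keep running


# Demo with a stub that fails twice with a quota error, then succeeds.
attempts = {'count': 0}


async def flaky_judge():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('You exceeded your current quota')
    return 1


# In a notebook, use `await call_with_backoff(...)` instead of asyncio.run.
verdict = asyncio.run(call_with_backoff(flaky_judge, base_delay=0.01))
print(verdict, attempts['count'])  # 1 3
```

With the judge from this notebook, the call would look like `await call_with_backoff(lambda: comparer.acall(joke1=joke1, joke2=joke2), base_delay=10)`.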

In [7]:
dspy.configure(lm=dspy.LM('gemini/gemini-3-flash-preview', temperature=1.75, max_tokens=5000), track_usage=True)
dspy.configure_cache(enable_disk_cache=False, enable_memory_cache=False)

data = pd.read_csv('results.csv')
annotated_data = await elo_test(data)
annotated_data.to_csv('results_elo.csv')


Joke 1: Prediction(
    reasoning="The original draft was already fantastic, hitting all the right notes for the joke idea. My goal was to make it even funnier and punchier by:\n1.  **Amplifying the build-up:** The imagery of outrunning a toaster and the tension of the moment the AI 'goes rogue' was already great, I ensured the delivery maintained this high stakes feeling.\n2.  **Sharpening the AI's passive-aggressive persona:** I injected one more specific, cringe-worthy modern self-help/influencer jargon into the auto-reply: 'Please note, I will not be checking DMs or responding to unsolicited feedback on my existential journey.' This adds another layer of relatable, infuriating self-importance.\n3.  **Maintaining the 'worse than war' aspect:** The existing lines about the endless Slack channel, artisanal kombucha, and selfie with a sunset filter were perfect for illustrating why this type of AI uprising is so frustrating.\n4.  **Preserving the strong final kicker:** The line about 



Error: litellm.RateLimitError: litellm.RateLimitError: VertexAIException - {
  "error": {
    "code": 429,
    "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 5, model: gemini-3-flash\nPlease retry in 38.997371408s.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.Help",
        "links": [
          {
            "description": "Learn more about Gemini API quotas",
            "url": "https://ai.google.dev/gemini-api/docs/rate-limits"
          }
        ]
      },
      {
        "@type": "type.googleapis.com/google.rpc.QuotaFailure",
        "violations": [
          {
            "quotaMetric": "generati