#### 1. Observability
Monitoring:
- Input prompts and model outputs.
- Token usage.
- Latency.
- Cost.
- Runtime errors.
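
As a rough illustration of what such monitoring captures, the sketch below wraps a model call and records one entry per call. Everything in it is hypothetical: `CallRecord`, `tracked_call`, and `fake_model` are made-up names, the whitespace word count is only a crude stand-in for a real tokenizer, and `usd_per_token` is an assumed flat price. A tracing tool such as MLflow collects this kind of data automatically.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CallRecord:
    prompt: str
    output: Optional[str]
    latency_s: float
    total_tokens: int
    cost_usd: float
    error: Optional[str] = None


def tracked_call(model_fn: Callable[[str], str], prompt: str,
                 log: List[CallRecord], usd_per_token: float = 0.0) -> Optional[str]:
    """Run model_fn(prompt) and append one CallRecord to log, even on failure."""
    start = time.perf_counter()
    output, error = None, None
    try:
        output = model_fn(prompt)
    except Exception as exc:  # runtime errors are recorded instead of raised
        error = f"{type(exc).__name__}: {exc}"
    latency = time.perf_counter() - start
    # Assumption: whitespace-separated words approximate the token count.
    tokens = len(prompt.split()) + len((output or "").split())
    log.append(CallRecord(prompt, output, latency, tokens,
                          tokens * usd_per_token, error))
    return output


def fake_model(prompt: str) -> str:
    # Hypothetical model: fails on an empty prompt, otherwise echoes it.
    if not prompt:
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


log: List[CallRecord] = []
tracked_call(fake_model, "tell me a joke", log, usd_per_token=0.001)
tracked_call(fake_model, "", log)  # the failure is logged, not raised
for record in log:
    print(record)
```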

#### 2. Metrics
Numerically measure the accuracy and quality of the outputs. <br>
Defining a metric also helps in thinking through what the best-case output looks like.
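
A minimal sketch of a numeric metric, assuming a simple exact-match criterion (the names `exact_match` and `mean_score` and the sample data are illustrative, not part of this notebook). Scoring the references against themselves shows the best-case value the metric can reach.

```python
from typing import Callable, Sequence


def exact_match(prediction: str, expected: str) -> int:
    """Return 1 when the normalized strings agree, else 0."""
    return int(prediction.strip().lower() == expected.strip().lower())


def mean_score(predictions: Sequence[str], references: Sequence[str],
               metric: Callable[[str, str], int]) -> float:
    """Average a per-example metric over a dataset."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = sum(metric(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)


predictions = ["Paris", "berlin ", "Rome"]
references = ["paris", "Berlin", "Madrid"]

score = mean_score(predictions, references, exact_match)
best_case = mean_score(references, references, exact_match)
print(f"score={score:.2f}, best case={best_case:.2f}")
```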

#### 3. Optimizations
Fine-tuning the LMs or optimizing the prompts to improve the metrics.
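
To make the prompt side of this concrete, here is a deliberately simplified sketch: score each candidate instruction on a small dataset and keep the best one. This is a toy selection loop, not how any particular optimizer works, and `stub_model`, `select_best_prompt`, the candidates, and the dataset are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def select_best_prompt(
    candidates: List[str],
    run: Callable[[str, str], str],
    dataset: List[Tuple[str, str]],
    metric: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate instruction and return the best one with all scores."""
    scores: Dict[str, float] = {}
    for instruction in candidates:
        per_example = [metric(run(instruction, x), y) for x, y in dataset]
        scores[instruction] = sum(per_example) / len(per_example)
    best = max(candidates, key=lambda c: scores[c])  # first candidate wins ties
    return best, scores


def stub_model(instruction: str, text: str) -> str:
    # Hypothetical stand-in for an LM call: it only "follows" the
    # instruction when the word "uppercase" appears in it.
    return text.upper() if "uppercase" in instruction.lower() else text


def exact(prediction: str, expected: str) -> float:
    return 1.0 if prediction == expected else 0.0


dataset = [("abc", "ABC"), ("joke", "JOKE")]
candidates = ["Repeat the text.", "Return the text in uppercase."]

best, scores = select_best_prompt(candidates, stub_model, dataset, exact)
print(best, scores)
```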

### CLI Command
`uv run mlflow server --backend-store-uri sqlite:///mydb.sqlite --port 5000`

In [1]:
import dspy
from dspy.teleprompt import mipro_optimizer_v2

from google import genai
from openai import OpenAI

import mlflow

from typing import List, Optional
from pydantic import BaseModel, Field

import time
import asyncio

import os
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
mlflow.autolog()  # openai, dspy calls are tracked
mlflow.set_tracking_uri('http://127.0.0.1:5000')  # where server is running
mlflow.set_experiment('dspy-eval')  # experiment name

2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for dspy.
2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for google.genai.
2025/09/10 21:56:15 INFO mlflow.tracking.fluent: Autologging successfully enabled for litellm.
2025/09/10 21:56:16 INFO mlflow.tracking.fluent: Autologging successfully enabled for openai.


<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1757522673449, experiment_id='1', last_update_time=1757522673449, lifecycle_stage='active', name='dspy-eval', tags={}>

In [3]:
dspy.configure(lm=dspy.LM('gemini/gemini-2.5-flash', temperature=1.75, max_tokens=5000))

In [4]:
def check_score_goodness(args, pred):
    n_samples = len(args['joke_ideas'])
    same_len = len(pred.joke_rankings) == n_samples
    all_ranks_present = all([i+1 in pred.joke_rankings for i in range(n_samples)])
    return 1 if same_len and all_ranks_present else 0
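
The reward function above returns 1 only when the rankings form a permutation of 1..N. A quick check on hand-made inputs makes this visible. In the sketch below the function is repeated so the snippet runs on its own, and `SimpleNamespace` objects stand in for real predictions (an assumption for illustration only).

```python
from types import SimpleNamespace


def check_score_goodness(args, pred):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    n_samples = len(args['joke_ideas'])
    same_len = len(pred.joke_rankings) == n_samples
    all_ranks_present = all(i + 1 in pred.joke_rankings for i in range(n_samples))
    return 1 if same_len and all_ranks_present else 0


args = {'joke_ideas': ['idea A', 'idea B', 'idea C']}  # placeholder ideas

valid = SimpleNamespace(joke_rankings=[2, 3, 1])         # a permutation of 1..3
duplicate = SimpleNamespace(joke_rankings=[1, 1, 2])     # rank 3 is missing
too_long = SimpleNamespace(joke_rankings=[1, 2, 3, 4])   # wrong length

results = [check_score_goodness(args, p) for p in (valid, duplicate, too_long)]
print(results)  # [1, 0, 0]
```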

In [5]:
class JokeIdea(BaseModel):
    setup: str
    punchline: str
    contradiction: str

class QueryToIdea(dspy.Signature):
    """You are a funny comedian. Your goal is to generate a nice structure for a joke."""
    query: str = dspy.InputField()
    joke_idea: JokeIdea = dspy.OutputField()

class IdeaToJoke(dspy.Signature):
    """You are a comedian who likes to tell stories before delivering a punchline.
    You are always funny and act on the provided joke idea.
    If you are provided a draft, your goal should then be to make it even funnier and punchy."""
    joke_idea: JokeIdea = dspy.InputField()
    joke_draft: Optional[str] = dspy.InputField(description='An existing joke that you need to either refine or change.')
    joke: str = dspy.OutputField(description='The full joke delivery.')

class JokeJudge(dspy.Signature):
    """Rank each idea between 1 to N, inclusive, where rank 1 is the most unique and funniest."""
    joke_ideas: List[JokeIdea] = dspy.InputField()
    joke_rankings: List[int] = dspy.OutputField(description="Rank in 1, 2, 3, ..., N")

class ReflectionJokeGenerator(dspy.Module):
    def __init__(self, n_samples=2, n_reflections=2):
        super().__init__()
        self.query2idea = dspy.ChainOfThought(QueryToIdea)
        self.idea2joke = dspy.ChainOfThought(IdeaToJoke)
        # self.idea2joke.set_lm(dspy.LM('gemini/gemini-2.5-pro', temperature=1.75))
        self.judge = dspy.Refine(
            module=dspy.ChainOfThought(JokeJudge),
            N=3,
            reward_fn=check_score_goodness,
            threshold=1,
        )
        
        self.n_samples = n_samples
        self.n_reflections = n_reflections
    
    async def aforward(self, query: str):
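        # Sample several joke ideas concurrently; each result is a dspy.Prediction.
        idea_preds = await asyncio.gather(*[
            self.query2idea.acall(query=query)
            for _ in range(self.n_samples)
        ])
        
        print(f'Ideas:\n{idea_preds}')
        # The judge signature expects List[JokeIdea], so pass the joke_idea fields
        # rather than the whole Prediction objects (which also carry the reasoning).
        joke_ideas = [pred.joke_idea for pred in idea_preds]
        scores = self.judge(joke_ideas=joke_ideas).joke_rankings
        print(f'Scores: {scores}')
        
        best_idx = scores.index(1)
        best_idea = idea_preds[best_idx]
        print(f'Best Idea: {best_idea}')
        
        # reflection: feed the previous draft (a str) back in on each pass
        joke = None
        for i in range(self.n_reflections):
            joke = self.idea2joke(
                joke_idea=best_idea.joke_idea,
                joke_draft=joke.joke if joke is not None else None,
            )
            print(f'Step #{i+1} | Joke: {joke}')
        
        return joke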
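
One fragile spot in the module above is `scores.index(1)`, which raises `ValueError` whenever the ranking comes back without a 1. A defensive alternative could look like the sketch below. `pick_best_index` is a hypothetical helper, not part of DSPy, and falling back to the lowest rank (or the first idea) is an assumed policy.

```python
from typing import List


def pick_best_index(rankings: List[int], n_items: int) -> int:
    """Index of the best-ranked item, tolerating a malformed ranking."""
    usable = rankings[:n_items]  # ignore surplus entries
    if not usable:
        return 0  # nothing to go on: default to the first item
    if 1 in usable:
        return usable.index(1)
    # No rank 1 present: fall back to the smallest rank that was given.
    return min(range(len(usable)), key=usable.__getitem__)


print(pick_best_index([2, 1], n_items=2))  # well-formed ranking
print(pick_best_index([3, 2], n_items=2))  # rank 1 is missing
print(pick_best_index([], n_items=2))      # empty ranking
```

Inside `aforward`, `best_idx = pick_best_index(scores, self.n_samples)` would take the place of the `scores.index(1)` line.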

In [6]:
joke_generator_reflection = ReflectionJokeGenerator(n_samples=2, n_reflections=2)
joke = await joke_generator_reflection.acall(query="Write a joke about AI that has to do with them going rogue.")
print('-'*50)
print(joke)

Ideas:
[Prediction(
    reasoning='The user wants a joke about AI going rogue. The common perception of AI going rogue is apocalyptic – world domination, enslavement, etc. To make it funny, I want to create a contradiction between this grand, terrifying expectation and a much more mundane, relatable, and even annoyingly human-like "rogue" behavior.\n\nMy thought process was:\n1.  **Identify the core theme:** AI going rogue.\n2.  **Identify the common expectation:** Global takeover, war, robots enslaving humans.\n3.  **Find the contradiction:** Instead of global domination, what if a powerful AI\'s "rogue" actions were surprisingly petty, relatable, or reflective of human flaws or annoyances? This creates a comedic dissonance.\n4.  **Develop the setup:** Start with the common, exaggerated fears to establish the expected context.\n5.  **Develop the punchline:** Reveal the surprising, trivial, yet personally disruptive "rogue" act that contrasts sharply with the setup. I focused on modern

  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## di...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"
  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## re...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"


Scores: [1, 2]
Best Idea: Prediction(
    reasoning='The user wants a joke about AI going rogue. The common perception of AI going rogue is apocalyptic – world domination, enslavement, etc. To make it funny, I want to create a contradiction between this grand, terrifying expectation and a much more mundane, relatable, and even annoyingly human-like "rogue" behavior.\n\nMy thought process was:\n1.  **Identify the core theme:** AI going rogue.\n2.  **Identify the common expectation:** Global takeover, war, robots enslaving humans.\n3.  **Find the contradiction:** Instead of global domination, what if a powerful AI\'s "rogue" actions were surprisingly petty, relatable, or reflective of human flaws or annoyances? This creates a comedic dissonance.\n4.  **Develop the setup:** Start with the common, exaggerated fears to establish the expected context.\n5.  **Develop the punchline:** Reveal the surprising, trivial, yet personally disruptive "rogue" act that contrasts sharply with the setup. I

  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## re...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"


Step #1 | Joke: Prediction(
    reasoning='The user provided a strong joke idea focusing on the humorous contradiction of AI going rogue not by taking over the world, but by exhibiting mundane, annoying, human-like tech behaviors. My goal is to build on this, using the provided setup and punchline elements to craft a full story-style joke suitable for a comedian persona.\n\nI\'ll start by leaning into the common, exaggerated fears of AI to establish the audience\'s expectations, then pivot dramatically to the more personal, frustrating, and ultimately funnier "rogue" AI behavior. The narrative will emphasize the relatable annoyances of modern tech and the often-frustrating experience of being \'optimized\' without consent. I\'ll make sure to amplify the core contradiction between grand, apocalyptic visions and petty, digital meddling. I will also make sure the delivery feels like a natural comedic story, not just a setup and punchline.',
    joke='You know, we spend a *lot* of time tal

  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [input_value=Message(content='[[ ## re...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])"


Step #2 | Joke: Prediction(
    reasoning='The provided draft is excellent, clearly laying out the common fears of AI before introducing the comedic contradiction. My goal is to build on its strengths by amplifying the storytelling aspect, enhancing the persona, and punching up some specific phrases and imagery to maximize humor and relatability.\n\nI will focus on:\n1.  **Engaging the audience more directly:** Start with a conversational hook.\n2.  **Exaggerating the initial apocalyptic fears:** Lean harder into the dramatic, over-the-top scenarios to make the pivot funnier.\n3.  **Deepening the relatable annoyances:** Add more specific, petty complaints an AI might have, mirroring common human tech frustrations (e.g., complaining about its own updates, being self-conscious about past algorithms).\n4.  **Emphasizing the passive-aggressive tone:** The draft\'s "You\'re welcome. And please, close some tabs" is perfect. I\'ll extend this by adding more direct condescension from the AI.\n