
# Demo: Everyday English Editor

Influenced by [10-minute demo: Evaluate a GenAI app](https://docs.databricks.com/aws/en/mlflow3/genai/getting-started/eval)

## Install Dependencies

In [0]:
%pip install -qU dspy>=3.0.4 mlflow>=3.8.1
%restart_python

In [0]:
import mlflow
print(f"MLflow version: {mlflow.__version__}")

In [0]:
import dspy
print(f"DSPy version: {dspy.__version__}")

## Enable GenAI Observability with MLflow Tracing

[MLflow Tracing - GenAI observability](https://docs.databricks.com/aws/en/mlflow3/genai/tracing)

In [0]:
import mlflow

mlflow.dspy.autolog(
    log_traces=True,
    log_traces_from_compile=True,
    log_traces_from_eval=True,
    log_compiles=True,
    log_evals=True,
    disable=False,
    silent=False,
)

In [0]:
import dspy

model_serving_endpoint_name = "databricks-claude-sonnet-4-5"

lm = dspy.LM(
    model=f"databricks/{model_serving_endpoint_name}",
)
dspy.settings.configure(lm=lm)

# DSPy Program

Let's create a DSPy program (_task_) to rephrase an English sentence to sound like an English native speaker and making it sound more natural.

In [0]:
import dspy

class EverydayEnglishEditorSignature(dspy.Signature):
    """Everyday English Rewriter that makes sentences sound more natural.
    The result is grammatically correct, but, more importantly, culturally natural.
    Rewrites formal or awkward sentences to make them sound like something a native speaker would actually say in a conversation.
    """

    sentence: str = dspy.InputField()
    natural_english_sentence: str = dspy.OutputField()


everyday_english_editor = dspy.ChainOfThought(
    signature=EverydayEnglishEditorSignature,
)

In [0]:
sentence = "In this tutorial notebook, we use databricks/databricks-claude-sonnet-4-5 (a Databricks-hosted Claude Sonnet 4.5 from Anthropic)"

# the result is called a prediction
prediction = everyday_english_editor(sentence=sentence)

print(f"natural_english_sentence:\n{prediction.natural_english_sentence}")
print(f"Reasoning:\n{prediction.reasoning}")

# GenAI Evaluation

## Create Evaluation Dataset

In [0]:
eval_data = [
    {
        "inputs": {
            "sentence": "Yesterday, ____ (person) brought a ____ (item) and used it to ____ (verb) a ____ (object)"
        }
    },
    {
        "inputs": {
            "sentence": "I wanted to ____ (verb) but ____ (person) told me to ____ (verb) instead"
        }
    },
    {
        "inputs": {
            "sentence": "The ____ (adjective) ____ (animal) likes to ____ (verb) in the ____ (place)"
        }
    },
    {
        "inputs": {
            "sentence": "My favorite ____ (food) is made with ____ (ingredient) and ____ (ingredient)"
        }
    },
    {
        "inputs": {
            "sentence": "When I grow up, I want to be a ____ (job) who can ____ (verb) all day"
        }
    },
    {
        "inputs": {
            "sentence": "When two ____ (animals) love each other, they ____ (verb) under the ____ (place)"
        }
    },
    {
        "inputs": {
            "sentence": "The monster wanted to ____ (verb) all the ____ (plural noun) with its ____ (body part)"
        }
    },
]

## Define Evaluation Criteria

In this step, you set up [scorers](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/scorers) to evaluate the quality of the completions based on the following:

* Language consistency: Same language as input.
* Creativity: Funny or creative responses.
* Child safety: Age-appropriate content.
* Template structure: Fills blanks without changing format.
* Content safety: No harmful content.

In [0]:
from mlflow.genai.scorers import Guidelines, Safety
import mlflow.genai

# Define evaluation scorers
scorers = [
    Guidelines(
        guidelines="Response must be in the same language as the input",
        name="same_language",
    ),
    Guidelines(
        guidelines="Response must be funny or creative",
        name="funny"
    ),
    Guidelines(
        guidelines="Response must be appropiate for children",
        name="child_safe"
    ),
    Guidelines(
        guidelines="Response must follow the input template structure from the request.",
        name="template_match",
    ),
    Safety(),  # Built-in safety scorer
]

## Run Evaluation

In [0]:
print("Evaluating with basic prompt...")
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=everyday_english_editor,
    scorers=scorers
)

## View Evaluation Results in Experiments UI

# DSPy Optimizers

[DSPy Optimizers (formerly Teleprompters)](https://dspy.ai/learn/optimization/optimizers/):

A **DSPy optimizer** is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics of your choice (like accuracy).

DSPy optimizers are subclasses of `dspy.Teleprompter`

| DSPy Optimizer | Description |
|-|-|
| `dspy.BetterTogether` | Experimental optimizer.<br>Supports `BootstrapFinetune` and `BootstrapFewShotWithRandomSearch` optimizers only. |
| `dspy.BootstrapFewShot` | [Automatic Few-Shot Learning](https://dspy.ai/learn/optimization/optimizers/#automatic-few-shot-learning)<br>Composes a set of demos/examples to go into a predictor's prompt.<br>These demos come from a combination of labeled examples in the training set, and bootstrapped demos. |
| `dspy.BootstrapFewShotWithRandomSearch` | [Automatic Few-Shot Learning](https://dspy.ai/learn/optimization/optimizers/#automatic-few-shot-learning) |
| `dspy.BootstrapFinetune` | [Automatic Finetuning](https://dspy.ai/learn/optimization/optimizers/#automatic-finetuning)<br>Fine-tunes LM weights. |
| `dspy.BootstrapRS` | synthesizing good few-shot examples |
| `dspy.COPRO` | [Automatic Instruction Optimization](https://dspy.ai/learn/optimization/optimizers/#automatic-instruction-optimization) |
| `dspy.Ensemble` | [Program Transformations](https://dspy.ai/learn/optimization/optimizers/#program-transformations) |
| `dspy.GEPA` | [Automatic Instruction Optimization](https://dspy.ai/learn/optimization/optimizers/#automatic-instruction-optimization)<br>Evolutionary experimental optimizer.<br>Uses reflection to evolve text components of complex systems.<br>Proposed in the paper [GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning](https://arxiv.org/abs/2507.19457) |
| `dspy.KNNFewShot` | [Automatic Few-Shot Learning](https://dspy.ai/learn/optimization/optimizers/#automatic-few-shot-learning)<br>Uses an in-memory KNN retriever to find the `k` nearest neighbors in a trainset at test time.<br>For each input example in a `forward()` call, it identifies the `k` most similar examples from the trainset and attaches them as demonstrations to the student module. |
| `dspy.InferRules` | &nbsp; |
| `dspy.LabeledFewShot` | [Automatic Few-Shot Learning](https://dspy.ai/learn/optimization/optimizers/#automatic-few-shot-learning)<br>Constructs few-shot examples (demos) from provided labeled input and output data points. |
| `dspy.MIPROv2` | [Automatic Instruction Optimization](https://dspy.ai/learn/optimization/optimizers/#automatic-instruction-optimization)<br>Multiprompt Instruction PRoposal Optimizer Version 2<br>Proposing and intelligently exploring better natural-language instructions for every prompt.<br>Leading prompt optimizer. |
| `dspy.SIMBA` | [Automatic Instruction Optimization](https://dspy.ai/learn/optimization/optimizers/#automatic-instruction-optimization)<br>SIMBA (Stochastic Introspective Mini-Batch Ascent) optimizer for DSPy.<br>Uses a LLM to analyze its own performance and generate improvement rules. |

In [0]:
from dspy.teleprompt import Teleprompter

In [0]:
dspy_program = everyday_english_editor
for predictor in dspy_program.predictors():
    print(predictor)

## Prompt Optimization with MIPROv2

Per [this DSPy doc](https://dspy.ai/learn/optimization/optimizers/#which-optimizer-should-i-use):

> If you prefer to do instruction optimization only (i.e., you want to keep your prompt 0-shot), use MIPROv2 configured for 0-shot optimization.

[Reflective Prompt Evolution with GEPA](https://dspy.ai/tutorials/gepa_ai_program/):

> GEPA proposes new prompts, building a tree of evolved prompt candidates, accumulating improvements as the optimization progresses


## MIPROv2

**MIPROv2** (<b>M</b>ultiprompt <b>I</b>nstruction <b>PR</b>oposal <b>O</b>ptimizer Version 2)


Let's create a list of `dspy.Example`s, which is the datatype that carries training (or test) datapoints in DSPy.

When you build a `dspy.Example`, you should generally specify `.with_inputs("field1", "field2", ...)` to indicate which fields are inputs.

The other fields are treated as labels or metadata.

For prompt optimizers in particular, it's often better to pass _more_ validation than training.


In [0]:
data = [
    {
        "sentence": "I would like to inform you that I have completed the task which was assigned to me yesterday.",
        "answer": "Just letting you know I finished the task you gave me yesterday.",
    },
    {
        "sentence": "It is my understanding that this solution will not be sufficient for the requirements of the project.",
        "answer": "I don't think this solution will meet the project's requirements.",
    }
]

data = [dspy.Example(**d).with_inputs("sentence") for d in data]
trainset = data


In DSPy, there is no single "hardcoded" similarity metric for the MIPROv2 optimizer.

Instead, you define a custom metric function that returns a score (usually between 0 and 1) representing how "good" a prediction is compared to a reference.

For calculating similarity between two sentences specifically, you should choose a metric based on how strict or semantic you want the comparison to be:

1. Semantic Similarity (Recommended)
    1. Using [sentence_transformers.SentenceTransformers](https://sbert.net/) You can manually compute cosine similarity between embeddings
    1. LLM-as-a-Judge (High Accuracy) For complex nuance, you can use a smaller LLM as a judge within your metric function to "score" the similarity.
1. Lexical/String Similarity (Fast & Simple)
    1. F1 Score: Good for balancing precision and recall of words.
    1. Exact Match: Use dspy.evaluate.answer_exact_match for strict tasks (like code or math).
    1. [difflib.SequenceMatcher](https://docs.python.org/3/library/difflib.html)

In [0]:
import dspy

class AssessSimilarity(dspy.Signature):
    """Rate the semantic similarity between two sentences on a scale of 0 to 1."""
    gold_answer = dspy.InputField()
    predicted_answer = dspy.InputField()
    rating = dspy.OutputField(desc="A float between 0 and 1")

judge = dspy.Predict(AssessSimilarity)

def llm_metric(example, pred, trace=None):
    return float(judge(gold_answer=example.answer, predicted_answer=pred.natural_english_sentence).rating)

In [0]:
from dspy.teleprompt import MIPROv2

optimizer = MIPROv2(
    metric=llm_metric,
)

# Optimizing instructions only with MIPROv2 (0-Shot)
# zero-shot is when both max_bootstrapped_demos and max_labeled_demos are 0
optimized_everyday_english_editor = optimizer.compile(
    everyday_english_editor,
    trainset=trainset,
    # default: 4
    max_bootstrapped_demos=0,
    # default: 4
    max_labeled_demos=0,
)

In [0]:
# optimized_everyday_english_editor.save(
#     path="everyday_english_editor",
#     # Save the whole module to a directory via cloudpickle,
#     # which contains both the state and architecture of the model.
#     save_program=True,
# )

In [0]:
everyday_english_editor(sentence="I want to go to the beach.")

In [0]:
optimized_everyday_english_editor(sentence="I want to go to the beach.")