
# 📊 Prompt Optimization with Evidently: Code Review Quality Classifier

This tutorial shows how to use Evidently's new `PromptOptimizer` API to optimize the prompt of an LLM judge.
We'll walk through optimizing a prompt for a judge that classifies the quality of code reviews written for junior developers.

---

## ✅ What you'll learn:
- How to set up a dataset for LLM evaluation
- How to define an LLM judge with a prompt template
- How to run the prompt optimization loop
- How to retrieve and inspect the best-performing prompt


In [None]:
# If you haven't installed the required packages yet:
# !pip install evidently openai pandas

In [1]:
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.llm.models import LLMMessage
from evidently.descriptors import LLMEval
from evidently.llm.optimization import PromptOptimizer

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "sk-"  # placeholder: replace with your own key and keep it out of version control

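Hardcoding a key in a notebook cell makes it easy to commit by accident. A minimal sketch of an alternative, assuming you run the notebook interactively: read the key from the environment and ask for it only when it is missing. The helper name `ensure_api_key` and its `prompt` parameter are hypothetical and are not part of Evidently or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key stored in env_var, asking for it only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not end up in cell output.
        key = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key


# In a notebook, call this once before building the judge:
# ensure_api_key()
```

The `prompt` argument exists only so the helper can be exercised without a terminal; in normal use the default `getpass` is enough.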
In [2]:
# Load your dataset
review_dataset = pd.read_csv("../datasets/code_review.csv")
review_dataset.head()

| Unnamed: 0 | Generated review | Expert label | Expert comment |
|---|---|---|---|
| 0 | This implementation appears to work, but the a... | bad | The tone is slighly condescending, no actionab... |
| 1 | Great job! Keep it up! | bad | Not actionable |
| 2 | It would be advisable to think about modularit... | bad | there is a suggestion, but no real guidance |
| 3 | You’ve structured the class very well, and the... | good | Good tone, actionable |
| 4 | Great job! This is clean and well-organized. T... | bad | Pure praise |

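Four of the five preview rows are labeled `bad`, so it can be worth checking the label balance before reading too much into an accuracy score. The sketch below uses only the five labels visible in the preview, not the full file, and computes the accuracy a trivial judge would get by always predicting the most frequent label. The variable names are illustrative.

```python
from collections import Counter

# Labels of the five preview rows only; the full CSV contains more rows.
preview_labels = ["bad", "bad", "bad", "good", "bad"]

label_counts = Counter(preview_labels)
# Accuracy of a trivial judge that always predicts the most frequent label.
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline = majority_count / len(preview_labels)

print(dict(label_counts))
print(f"always predicting '{majority_label}' scores {majority_baseline:.2f}")
```

On the full DataFrame, `review_dataset["Expert label"].value_counts(normalize=True)` gives the same kind of summary in one call.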

In [3]:
# Define how Evidently should interpret your dataset
dd = DataDefinition(
    text_columns=["Generated review", "Expert comment"],
    categorical_columns=["Expert label"],
    llm=LLMClassification(
        input="Generated review",
        target="Expert label",
        reasoning="Expert comment"
    )
)

In [4]:
# Convert your pandas DataFrame into an Evidently Dataset
dataset = Dataset.from_pandas(
    data=review_dataset,
    data_definition=dd
)

In [8]:
# Define a prompt template and judge for classifying code review quality
criteria = """A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical."""

pre_messages = [
    LLMMessage(
        role="system",
        content="""You are evaluating the quality of code reviews given to junior developers."""
    ),
]
     
feedback_quality = BinaryClassificationPromptTemplate(
    pre_messages=pre_messages,
    criteria=criteria,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

In [10]:
judge = LLMEval(
    alias="Code Review Judge",
    provider="openai",
    model="gpt-4.1-mini",
    column_name="Generated review",
    template=feedback_quality
)

In [13]:
# Initialize the optimizer and run the optimization loop using the "feedback" strategy
optimizer = PromptOptimizer(
    name="code_review_example",
    strategy="feedback",
    checkpoint_path=None,
)
optimizer.set_input_dataset(dataset=dataset)
await optimizer.arun(
    executor=judge,
    scorer="accuracy"
)
# Synchronous alternative (for example, outside a notebook where top-level await is unavailable):
# optimizer.run(judge, "accuracy")

Executed prompt 'A review is GOOD when it's actionable an...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.62
Prompt 'A review is GOOD when it's actionable an...' optimized to 'A review is GOOD when it provides action...'
Executed prompt 'A review is GOOD when it provides action...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.84
Prompt 'A review is GOOD when it provides action...' optimized to 'A review is GOOD when it provides action...'
Executed prompt 'A review is GOOD when it provides action...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.8

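The log reports one accuracy score per evaluated prompt. As a rough sketch of what those numbers mean, assuming the scorer is plain accuracy (the share of judge predictions that match the expert label): on 50 rows, a score of 0.62 corresponds to 31 matches. The `accuracy` helper, the synthetic labels, and the iteration names below are illustrative and do not reproduce Evidently's internal implementation.

```python
def accuracy(predictions, targets):
    """Share of predictions that equal the corresponding target label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


# Synthetic example: 50 rows where the judge agrees with the expert on 31.
targets = ["bad"] * 50
predictions = ["bad"] * 31 + ["good"] * 19
print(accuracy(predictions, targets))

# Scores copied from the log above, keyed by the order the prompts were tried.
scores = {"starting prompt": 0.62, "first rewrite": 0.84, "second rewrite": 0.8}
best_iteration = max(scores, key=scores.get)
print(best_iteration)
```

Note that the second rewrite scored lower than the first, so the last prompt tried is not necessarily the best one.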

In [14]:
# Show the best-performing prompt template found by the optimizer
print(optimizer.best_prompt())
# For comparison, the starting prompt was:
# criteria = """A review is GOOD when it's actionable and constructive.
# A review is BAD when it is non-actionable or overly critical."""


A review is GOOD when it provides actionable and constructive feedback that encourages improvement.
A review is BAD when it is non-actionable, overly critical, or lacks specificity, making it unclear how to improve.

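To see exactly what the optimizer changed, the starting criteria and the printed result can be compared line by line. This is a small sketch using the standard library's `difflib`; both strings are copied from this notebook, and the variable names are illustrative.

```python
import difflib

starting = (
    "A review is GOOD when it's actionable and constructive.\n"
    "A review is BAD when it is non-actionable or overly critical."
)
optimized = (
    "A review is GOOD when it provides actionable and constructive feedback "
    "that encourages improvement.\n"
    "A review is BAD when it is non-actionable, overly critical, or lacks "
    "specificity, making it unclear how to improve."
)

# Unified diff: lines starting with "-" come from the starting prompt,
# lines starting with "+" come from the optimized prompt.
diff_lines = list(
    difflib.unified_diff(
        starting.splitlines(),
        optimized.splitlines(),
        fromfile="starting",
        tofile="optimized",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Both criteria lines were rewritten: the GOOD definition now mentions encouraging improvement, and the BAD definition adds lack of specificity as a failure mode.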