
# 📊 Prompt Optimization with Evidently: Code Review Quality Classifier

This tutorial shows how to use Evidently's new `PromptOptimizer` API to optimize the prompt of an LLM judge.
We walk through optimizing a prompt that classifies the quality of code reviews written for junior developers.

---

## ✅ What you'll learn:
- How to set up a dataset for LLM evaluation
- How to define an LLM judge with a prompt template
- How to run the prompt optimization loop
- How to retrieve and inspect the best-performing prompt


In [None]:
# If you haven't installed the required packages yet:
# !pip install evidently openai pandas

In [1]:
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.llm.models import LLMMessage
from evidently.descriptors import LLMEval
from evidently.llm.optimization import PromptOptimizer

In [None]:
import os
# Placeholder value: replace it with your own OpenAI API key before running.
os.environ["OPENAI_API_KEY"] = "sk-"

In [2]:
# Load your dataset
review_dataset = pd.read_csv("../datasets/code_review.csv")
review_dataset.head()

| Unnamed: 0 | Generated review | Expert label | Expert comment |
|---|---|---|---|
| 0 | This implementation appears to work, but the a... | bad | The tone is slighly condescending, no actionab... |
| 1 | Great job! Keep it up! | bad | Not actionable |
| 2 | It would be advisable to think about modularit... | bad | there is a suggestion, but no real guidance |
| 3 | You’ve structured the class very well, and the... | good | Good tone, actionable |
| 4 | Great job! This is clean and well-organized. T... | bad | Pure praise |


In [None]:
# Define how Evidently should interpret your dataset
dd = DataDefinition(
    categorical_columns=["Expert label"],  # columns holding categorical values (here, the expert's good/bad label)
    text_columns=["Generated review", "Expert comment"],  # free-text columns that the LLM will read
    numerical_columns=[],
    datetime_columns=[],
    llm=LLMClassification(
        input="Generated review",
        target="Expert label",
        predictions=None,
        reasoning="Expert comment",
        prediction_reasoning=None,
    )
)

In [None]:
# Convert your pandas DataFrame into an Evidently Dataset
dataset = Dataset.from_pandas(
    data=review_dataset,
    data_definition=dd,
    descriptors=None,
)

In [40]:
dataset.as_dataframe()

| Unnamed: 0 | Generated review | Expert label | Expert comment |
|---|---|---|---|
| 0 | This implementation appears to work, but the a... | bad | The tone is slighly condescending, no actionab... |
| 1 | Great job! Keep it up! | bad | Not actionable |
| 2 | It would be advisable to think about modularit... | bad | there is a suggestion, but no real guidance |
| 3 | You’ve structured the class very well, and the... | good | Good tone, actionable |
| 4 | Great job! This is clean and well-organized. T... | bad | Pure praise |
| 5 | You’ve done a solid job here. The tests are co... | good | want more like this |
| 6 | There is too much complexity in this function.... | bad | there is some subtance but too sounds too harsh |
| 7 | The loop is functioning correctly, but it coul... | good | constructive and specific, but passive voice s... |
| 8 | Excellent submission overall. Everything looks... | bad | uncritical praise, offers no value for improve... |
| 9 | It would be more efficient to not mutate the s... | bad | some truth in the suggestion, but the phrasing... |
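
Before optimizing against a metric such as accuracy, it can help to check how the expert labels are distributed, because a judge that always answers with the most common label already sets a floor. The sketch below is a minimal illustration that uses only the ten labels displayed above (the full CSV may contain more rows); the variable names are chosen here for illustration and are not part of Evidently.

```python
from collections import Counter

# Expert labels copied from the ten rows displayed above.
expert_labels = ["bad", "bad", "bad", "good", "bad", "good", "bad", "good", "bad", "bad"]

# Count how often each label occurs.
label_counts = Counter(expert_labels)

# Accuracy of a trivial judge that always predicts the most common label.
majority_label, majority_count = label_counts.most_common(1)[0]
majority_baseline = majority_count / len(expert_labels)

print(dict(label_counts))
print(f"Always predicting '{majority_label}' gives accuracy {majority_baseline:.2f}")
```

On these ten rows, an optimized prompt is only an improvement if its accuracy clearly exceeds this baseline.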


In [41]:
# Define a prompt template for evaluating code review quality
criteria = """A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical."""

# System messages that set context or instructions before the evaluation task.
# Use them to explain the evaluator's role ("you are an expert...") or
# the context ("your goal is to grade the work of an intern...").
pre_messages = [
    LLMMessage(
        role="system",
        content="""You are evaluating the quality of code reviews given to junior developers."""
    ),
]

# See https://docs.evidentlyai.com/metrics/customize_llm_judge#binaryclassificationprompttemplate
# The target category is the one you want to detect, i.e. the one whose precision and recall matter most to you.
feedback_quality = BinaryClassificationPromptTemplate(
    pre_messages=pre_messages,
    criteria=criteria,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",  # return "unknown" if the model cannot classify 'bad' vs 'good'
    include_reasoning=True,
)
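
The comment above says the target category is the one whose precision and recall you care about most. As a rough illustration of what that means for `target_category="bad"`, the sketch below computes both numbers in plain Python. The `precision_recall` helper and the label lists are hypothetical, written for this example only; they are not Evidently functions or real judge output.

```python
def precision_recall(expert_labels, judge_labels, target_category):
    """Precision and recall of the judge for a single category."""
    pairs = list(zip(expert_labels, judge_labels))
    # Rows where both the expert and the judge chose the target category.
    true_pos = sum(1 for e, j in pairs if e == target_category and j == target_category)
    # Rows the judge assigned to the target category.
    predicted_pos = sum(1 for _, j in pairs if j == target_category)
    # Rows the expert assigned to the target category.
    actual_pos = sum(1 for e, _ in pairs if e == target_category)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration only.
expert = ["bad", "bad", "good", "bad", "good"]
judged = ["bad", "good", "good", "bad", "good"]

precision, recall = precision_recall(expert, judged, target_category="bad")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every review the judge flagged as `bad` really was bad (high precision), but one bad review slipped through as `good` (lower recall).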

In [42]:
# Define a judge (type: FeatureDescriptor) for classifying code review quality,
# see https://docs.evidentlyai.com/metrics/customize_llm_judge#llmeval
judge = LLMEval(
    template=feedback_quality,
    column_name="Generated review",
    provider="openai",
    model="gpt-4.1-mini",
    alias="Code Review Judge",
)

In [43]:
judge.feature

LLMJudge(type='evidently:feature:LLMJudge', display_name='Code Review Judge', provider='openai', model='gpt-4.1-mini', input_column=None, input_columns={'Generated review': 'input'}, template=BinaryClassificationPromptTemplate(type='evidently:prompt_template:BinaryClassificationPromptTemplate', criteria="A review is GOOD when it's actionable and constructive.\nA review is BAD when it is non-actionable or overly critical.", instructions_template='Use the following categories for classification:\n{__categories__}\n{__scoring__}\nThink step by step.', anchor_start='___text_starts_here___', anchor_end='___text_ends_here___', placeholders={}, target_category='bad', non_target_category='good', uncertainty=<Uncertainty.UNKNOWN: 'unknown'>, include_category=True, include_reasoning=True, include_score=False, score_range=(0.0, 1.0), output_column='category', output_reasoning_column='reasoning', output_score_column='score', pre_messages=[LLMMessage(role='system', content='You are evaluating the q

In [None]:
# `strategy` can be `simple`, which asks the optimizer LLM to improve the prompt using this template:
'''
    optimizer_prompt: str = (
        "I'm using llm to do {task}. Here is my prompt <prompt>{prompt}</prompt>. "
        "Please make it better so I can have better results. "
        "{instructions} "
        "Return new version inside <new_prompt> tag"
    )
'''

# or `feedback`, which also shows the optimizer LLM the rows where the judge made mistakes:
'''
    add_feedback_prompt = (
        "I ran LLM for some inputs to do {task} and it made some mistakes. "
        "Here is my original prompt <prompt>\n{prompt}\n</prompt>\n"
        "And here are rows where LLM made mistakes:\n"
        "<rows>\n{rows}\n</rows>. "
        "Please update my prompt to improve LLM quality. "
        "Generalize examples to not overfit on them. "
        "{instructions} "
        "Return new prompt inside <new_prompt> tag"
    )
    row_template = """<input>{input}</input>
        <target>{target}</target>
        <llm_response>{llm_response}</llm_response>
        <human_reasoning>{human_reasoning}</human_reasoning>
        <llm_reasoning>{llm_reasoning}</llm_reasoning>
    """
'''
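
To make the `feedback` template quoted above more concrete, here is a minimal sketch of how such a template could be filled in with misclassified rows, and how the rewritten prompt could be pulled back out of the `<new_prompt>` tag. The template strings are copied from the text above (with the row template's whitespace normalized); the helper names `build_feedback_prompt` and `extract_new_prompt`, the judge output, and the optimizer reply are assumptions made for illustration and may not match Evidently's internals.

```python
import re

# Templates copied from the `feedback` strategy text quoted above.
ADD_FEEDBACK_PROMPT = (
    "I ran LLM for some inputs to do {task} and it made some mistakes. "
    "Here is my original prompt <prompt>\n{prompt}\n</prompt>\n"
    "And here are rows where LLM made mistakes:\n"
    "<rows>\n{rows}\n</rows>. "
    "Please update my prompt to improve LLM quality. "
    "Generalize examples to not overfit on them. "
    "{instructions} "
    "Return new prompt inside <new_prompt> tag"
)
ROW_TEMPLATE = (
    "<input>{input}</input>\n"
    "<target>{target}</target>\n"
    "<llm_response>{llm_response}</llm_response>\n"
    "<human_reasoning>{human_reasoning}</human_reasoning>\n"
    "<llm_reasoning>{llm_reasoning}</llm_reasoning>"
)


def build_feedback_prompt(task, prompt, mistakes, instructions=""):
    """Hypothetical helper: render one prompt from the misclassified rows."""
    rows = "\n".join(ROW_TEMPLATE.format(**row) for row in mistakes)
    return ADD_FEEDBACK_PROMPT.format(
        task=task, prompt=prompt, rows=rows, instructions=instructions
    )


def extract_new_prompt(response):
    """Hypothetical helper: return the text inside <new_prompt>, or None if absent."""
    match = re.search(r"<new_prompt>(.*?)</new_prompt>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


mistakes = [
    {
        "input": "Great job! Keep it up!",
        "target": "bad",
        "llm_response": "good",  # hypothetical judge output
        "human_reasoning": "Not actionable",
        "llm_reasoning": "The review is positive and encouraging.",  # hypothetical
    }
]

feedback_prompt = build_feedback_prompt(
    task="classification",
    prompt="A review is GOOD when it's actionable and constructive.",
    mistakes=mistakes,
)
print(feedback_prompt)

# Hypothetical optimizer reply, invented for illustration only.
reply = "Here you go: <new_prompt>\nA review is BAD when it is pure praise.\n</new_prompt>"
print(extract_new_prompt(reply))
```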

In [None]:
# Initialize the optimizer and run the optimization with the `feedback` strategy
optimizer = PromptOptimizer(
    name="code_review_example",
    strategy="feedback",
    checkpoint_path=None,
)
optimizer.set_input_dataset(dataset=dataset)

await optimizer.arun(
    executor=judge,
    scorer="accuracy"
)
# Synchronous alternative (e.g. outside a notebook, where top-level `await` is unavailable):
# optimizer.run(judge, "accuracy")
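
The run above scores each candidate prompt with `"accuracy"`. As a rough mental model, accuracy is the share of rows where the judge's label matches the expert label. The sketch below is a plain-Python illustration of that definition, not Evidently's scorer; the `accuracy` helper and the judge labels are hypothetical, and it assumes an `unknown` answer simply counts as a miss.

```python
def accuracy(expert_labels, judge_labels):
    """Fraction of rows where the judge agrees with the expert."""
    if len(expert_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not expert_labels:
        raise ValueError("cannot score an empty dataset")
    matches = sum(e == j for e, j in zip(expert_labels, judge_labels))
    return matches / len(expert_labels)


# First five expert labels from the dataset preview.
expert = ["bad", "bad", "bad", "good", "bad"]
# Hypothetical judge output, invented for illustration only.
judged = ["bad", "good", "bad", "good", "bad"]

score = accuracy(expert, judged)
print(f"accuracy={score:.2f}")
```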

In [14]:
# Show the best-performing prompt found by the optimizer
print(optimizer.best_prompt())
# For comparison, the starting criteria were:
# criteria = """A review is GOOD when it's actionable and constructive.
# A review is BAD when it is non-actionable or overly critical."""


A review is GOOD when it provides actionable and constructive feedback that encourages improvement.
A review is BAD when it is non-actionable, overly critical, or lacks specificity, making it unclear how to improve.
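
One simple way to inspect what the optimizer changed is to diff the starting criteria against the optimized text. The sketch below uses Python's standard `difflib` module with both prompts copied from this notebook; it is only an inspection aid and is independent of Evidently.

```python
import difflib

# Starting criteria, as defined earlier in this notebook.
starting = [
    "A review is GOOD when it's actionable and constructive.",
    "A review is BAD when it is non-actionable or overly critical.",
]
# Optimized criteria, as printed by optimizer.best_prompt() above.
optimized = [
    "A review is GOOD when it provides actionable and constructive feedback that encourages improvement.",
    "A review is BAD when it is non-actionable, overly critical, or lacks specificity, making it unclear how to improve.",
]

# Lines starting with "-" come from the starting prompt, "+" from the optimized one.
diff_lines = list(
    difflib.unified_diff(
        starting, optimized, fromfile="starting", tofile="optimized", lineterm=""
    )
)
print("\n".join(diff_lines))
```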
