# Prompt Optimization with Evidently
### Attribution & License

This notebook is adapted from: [evidentlyai/community-examples](https://github.com/evidentlyai/community-examples.git), licensed under the Apache License, Version 2.0. © Original authors.

Modifications by Simeon Harrison/EuroCC Austria, © 2025.

This notebook demonstrates how to use Evidently's `PromptOptimizer` API to optimize prompts: first for an LLM judge, then for a multiclass LLM classifier, and finally for a text generation task.

## Example 1: Code Review Quality Classifier
We'll walk through optimizing a prompt that classifies the quality of code reviews written for junior developers.

### What you'll learn:
- How to set up a dataset for LLM evaluation
- How to define an LLM judge with a prompt template
- How to run the prompt optimization loop
- How to retrieve and inspect the best performing prompt

In [None]:
# If you haven't installed the required packages yet:
# !pip install evidently openai pandas

In [7]:
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.descriptors import LLMEval
from evidently.llm.optimization import PromptOptimizer

In [2]:
# Load your dataset
review_dataset = pd.read_csv("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/code_review_dataset.csv")
review_dataset.head()

Unnamed: 0,Generated review,Expert label,Expert comment
0,"This implementation appears to work, but the a...",bad,"The tone is slighly condescending, no actionab..."
1,Great job! Keep it up!,bad,Not actionable
2,It would be advisable to think about modularit...,bad,"there is a suggestion, but no real guidance"
3,"You’ve structured the class very well, and the...",good,"Good tone, actionable"
4,Great job! This is clean and well-organized. T...,bad,Pure praise
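
Before optimizing against accuracy, it can help to look at how the expert labels are distributed, because a skewed split lets a judge score well by always predicting the majority class. The sketch below uses only the five labels visible in the preview above, not the full dataset; on the real data the same call would be applied to `review_dataset["Expert label"]`.

```python
import pandas as pd

# Expert labels of the five preview rows shown above (not the full dataset).
preview = pd.DataFrame({"Expert label": ["bad", "bad", "bad", "good", "bad"]})

# Share of each class; a heavily skewed split makes raw accuracy easy to inflate.
label_share = preview["Expert label"].value_counts(normalize=True)
print(label_share)
```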


In [8]:
# Define how Evidently should interpret your dataset
dd = DataDefinition(
    text_columns=["Generated review", "Expert comment"],
    categorical_columns=["Expert label"],
    llm=LLMClassification(input="Generated review", target="Expert label", reasoning="Expert comment")
)

In [9]:
# Convert your pandas DataFrame into an Evidently Dataset
dataset = Dataset.from_pandas(review_dataset, data_definition=dd)

In [8]:
import os, getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

OpenAI API key (input is hidden):  ········
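
When the notebook is re-run often, or the key is already exported in the shell, prompting every time is unnecessary. A minimal sketch of a helper that asks only when the variable is missing follows; `ensure_api_key` and its parameters are hypothetical names introduced here, not part of Evidently or the OpenAI client.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    name: str = "OPENAI_API_KEY",
    env: MutableMapping[str, str] = os.environ,
    ask: Callable[[str], str] = getpass,
) -> bool:
    """Prompt for the key only if it is missing; return True if a prompt was shown."""
    if env.get(name):
        return False  # already set, e.g. exported in the shell
    env[name] = ask(f"{name} (input is hidden): ")
    return True
```

Calling `ensure_api_key()` once near the top of the notebook would then replace the unconditional `getpass` cell. The `env` and `ask` parameters exist only so the helper can be exercised without touching the real environment.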


In [13]:
# Define a prompt template and judge for classifying code review quality
criteria = '''A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical.'''

feedback_quality = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria=criteria,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

judge = LLMEval(
    alias="Code Review Judge",
    provider="openai",
    model="gpt-4o-mini",
    column_name="Generated review",
    template=feedback_quality
)

In [10]:
# Initialize the optimizer and run the optimization with the feedback strategy
optimizer = PromptOptimizer("code_review_example", strategy="feedback")
optimizer.set_input_dataset(dataset)
await optimizer.arun(judge, "accuracy")
# for sync version:
# optimizer.run(judge, "accuracy")

Executed prompt 'A review is GOOD when it's actionable an...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.66
Prompt 'A review is GOOD when it's actionable an...' optimized to 'A review is GOOD when it provides constr...'
Executed prompt 'A review is GOOD when it provides constr...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.94
Prompt 'A review is GOOD when it provides constr...' optimized to 'A review is GOOD when it provides constr...'
Executed prompt 'A review is GOOD when it provides constr...', got preds(50) preds_reasoning(50)
Prompt scored: AccuracyScorer: 0.94
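
To read the scores in this log: assuming `AccuracyScorer` reports plain accuracy, that is the share of rows where the judge's label equals the expert label, 0.66 on 50 rows corresponds to 33 matches and 0.94 to 47. The sketch below shows that computation on toy labels; it is an illustration, not Evidently's implementation, and it does not model how `unknown` verdicts are treated.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the judge label equals the expert label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot score an empty dataset")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Toy labels, not taken from the dataset: 3 of 4 judge labels match the expert.
expert = ["bad", "bad", "good", "bad"]
judge_labels = ["bad", "good", "good", "bad"]
print(accuracy(judge_labels, expert))  # 0.75
```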


In [12]:
# Show the best-performing prompt template found by the optimizer
print(optimizer.best_prompt())

A review is GOOD when it provides constructive, actionable feedback that includes specific suggestions for improvement and avoids vague language. A review is BAD when it is overly critical, lacks clarity, or fails to offer concrete steps for enhancement. Consider feedback that maintains a balance between being constructive and encouraging while still addressing areas that need attention.
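
The log in this example follows an execute, score, rewrite cycle that ends once the score stops improving. A generic sketch of such a feedback loop is shown below. It is not Evidently's implementation: `optimize`, `toy_score`, and `toy_improve` are hypothetical stand-ins for the LLM calls, and the stopping rule (halt when a round brings no gain) is an assumption based on the log output.

```python
from typing import Callable, List, Tuple


def optimize(
    prompt: str,
    score: Callable[[str], float],
    improve: Callable[[str, float], str],
    max_rounds: int = 5,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Score a prompt, ask for a rewrite, and stop once the score stops rising."""
    history = [(prompt, score(prompt))]
    for _ in range(max_rounds):
        current, current_score = history[-1]
        candidate = improve(current, current_score)
        candidate_score = score(candidate)
        history.append((candidate, candidate_score))
        if candidate_score <= current_score:
            break  # no gain over the previous round
    # max() keeps the earliest entry among equal scores
    best_prompt, _ = max(history, key=lambda pair: pair[1])
    return best_prompt, history


# Stand-ins for the LLM calls: the score grows with word count up to a cap,
# and each "improvement" appends one more instruction.
def toy_score(prompt: str) -> float:
    return min(len(prompt.split()), 6) / 6


def toy_improve(prompt: str, _score: float) -> str:
    return prompt + " Be specific."


best, runs = optimize("Judge the review.", toy_score, toy_improve)
print(len(runs), best)
```

In a real run, `score` would execute the judge over the labeled dataset and `improve` would ask an LLM to rewrite the prompt using the misclassified rows as feedback.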


## Example 2: Bookings Query Classifier
In this tutorial, we'll optimize a prompt for classifying different types of customer service queries (like Booking, Payment, or Technical issues) using an LLM classifier.

### What you'll learn:
- How to load a dataset for LLM classification
- How to define a multiclass classification prompt
- How to run prompt optimization with Evidently
- How to retrieve the best performing prompt

In [1]:
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.descriptors import LLMEval
from evidently.llm.templates import MulticlassClassificationPromptTemplate
from evidently.llm.optimization import PromptOptimizer

### Load Your Dataset

In [2]:
data = pd.read_csv("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/bookings.csv")
data.head()

Unnamed: 0,query,label
0,"booked a trip for 4 ppl, want to add a 5th now",Booking
1,"hello team, please confirm if my hotel reserva...",Booking
2,"i can’t see the payment options, dropdown just...",Technical
3,"I heard airlines sometimes overbook, what’s yo...",Policy
4,wanna reschedule my train ride to next week,Booking


### Define Data Structure for Evidently

In [3]:
dd = DataDefinition(
    text_columns=["query"],
    categorical_columns=["label"],
    llm=LLMClassification(input="query", target="label")
)

In [4]:
dataset = Dataset.from_pandas(data, data_definition=dd)

### Define a Multiclass Prompt and LLM Judge

In [6]:
import os, getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

OpenAI API key (input is hidden):  ········


In [16]:
base_prompt = "Classify inqueries by categories"

t = MulticlassClassificationPromptTemplate(
    pre_messages=[("system", "You are classifying user queries.")],
    criteria=base_prompt,
    category_criteria={
        "Booking": "bookings",
        "Technical": "technical questions",
        "Policy": "questions about policies",
        "Payment": "payment questions",
        "Escalation": "escalation requests"
    },
    uncertainty="unknown",
    include_reasoning=True,
)

judge = LLMEval(
    alias="bookings",
    provider="openai",
    model="gpt-4.1-mini",
    column_name="query",
    template=t
)

### Run the Prompt Optimizer

In [17]:
optimizer = PromptOptimizer("bookings_example", strategy="feedback")
optimizer.set_input_dataset(dataset)
await optimizer.arun(judge, "accuracy")
# sync version
# optimizer.run(judge, "accuracy")

Executed prompt 'Classify inqueries by categories...', got preds(200) preds_reasoning(200)
Prompt scored: AccuracyScorer: 0.915
Prompt 'Classify inqueries by categories...' optimized to 'Classify inquiries into the following ca...'
Executed prompt 'Classify inquiries into the following ca...', got preds(200) preds_reasoning(200)
Prompt scored: AccuracyScorer: 0.94
Prompt 'Classify inquiries into the following ca...' optimized to 'Classify inquiries into the following ca...'
Executed prompt 'Classify inquiries into the following ca...', got preds(200) preds_reasoning(200)
Prompt scored: AccuracyScorer: 0.93
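
The third run scores lower than the second, so the last prompt tried is not necessarily the best one. A small sketch of picking the top-scoring run from this log follows, assuming the goal is simply the highest accuracy. The prompt previews are copied from the log as printed, including the truncation.

```python
# (prompt preview, accuracy) pairs copied from the log above; previews are
# truncated by the logger, so the last two look identical but are distinct prompts.
history = [
    ("Classify inqueries by categories...", 0.915),
    ("Classify inquiries into the following ca...", 0.94),
    ("Classify inquiries into the following ca...", 0.93),
]

# max() returns the first maximal item, so ties resolve to the earliest run.
best_index, (best_preview, best_score) = max(
    enumerate(history), key=lambda item: item[1][1]
)
print(best_index, best_score)  # 1 0.94
```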


### View the Best Optimized Prompt

In [12]:
print(optimizer.best_prompt())

Classify inqueries by categories

Note: this output shows the unmodified base prompt, although the log above records higher accuracy for the rewritten prompts. The cell's execution count (12) is lower than that of the optimizer run (17), so the output likely predates that run; re-running the cell after the optimization should print the best-scoring prompt.


## Example 3: Tweet Generation
This tutorial shows how to optimize prompts for generating engaging tweets using Evidently's `PromptOptimizer` API. 
We'll iteratively improve a tweet generation prompt to maximize how engaging LLM-generated tweets are, according to a classifier.

### What you'll learn:
- How to define a tweet generation function with OpenAI
- How to set up an LLM judge to classify tweet engagement
- How to optimize a tweet generation prompt based on feedback
- How to inspect the best optimized prompt

In [None]:
# Install packages if needed
# !pip install evidently openai pandas

In [1]:
import pandas as pd
import openai

from evidently.descriptors import LLMEval
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.llm.optimization import PromptOptimizer, PromptExecutionLog, Params

### Define a Tweet Generation Function

In [2]:
import os, getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key (input is hidden): ")

OpenAI API key (input is hidden):  ········


In [3]:
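def basic_tweet_generation(topic, model="gpt-3.5-turbo", instructions=""):
    """Generate a short text about `topic`, steered by the system prompt in `instructions`."""
    response = openai.chat.completions.create(
        model=model,
        messages=[
            # The system message carries the prompt that the optimizer rewrites
            {"role": "system", "content": instructions},
            {"role": "user", "content": f"Write a short paragraph about {topic}"}
        ]
    )
    return response.choices[0].message.content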

### Define a Tweet Quality Judge

In [4]:
tweet_quality = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of tweets")],
    criteria="""
Text is ENGAGING if it meets at least one of the following:
  • Strong hook (question, surprise, bold statement)
  • Uses emotion, humor, or opinion
  • Encourages interaction
  • Shows personality or distinct tone
  • Includes vivid language or emojis
  • Sparks curiosity or insight

Text is NEUTRAL if it lacks these qualities.
""",
    target_category="ENGAGING",
    non_target_category="NEUTRAL",
    uncertainty="non_target",
    include_reasoning=True,
)

judge = LLMEval("basic_tweet_generation.result", template=tweet_quality,
                provider="openai", model="gpt-4o-mini", alias="Tweet quality")


### Define a Prompt Execution Function

In [5]:
def run_prompt(generation_prompt: str, context) -> PromptExecutionLog:
    """Generate tweets with `generation_prompt` and return them as an execution log."""
    my_topics = [
        "testing in AI engineering is as important as in development",
        "CI/CD is applicable in AI",
        "Collaboration of subject matter experts and AI engineers improves product",
        "Start LLM apps development from test cases generation",
        "evidently is a great tool for LLM testing"
    ]
    # Each topic is used three times: 5 topics x 3 = 15 generations per run
    tweets = [basic_tweet_generation(topic, model="gpt-3.5-turbo", instructions=generation_prompt) for topic in my_topics * 3]
    return PromptExecutionLog(generation_prompt, prediction=pd.Series(tweets))

### Run the Prompt Optimizer

In [6]:
optimizer = PromptOptimizer("tweet_gen_example", strategy="feedback")
optimizer.set_param(Params.BasePrompt, "You are tweet generator")
await optimizer.arun(run_prompt, scorer=judge)
# sync version
# optimizer.run(run_prompt, scorer=judge)

Executed prompt 'You are tweet generator...', got preds(15)
Prompt scored: BinaryJudgeScorer: 0.06666666666666667
Prompt 'You are tweet generator...' optimized to 'You are a creative tweet generator taske...'
Executed prompt 'You are a creative tweet generator taske...', got preds(15)
Prompt scored: BinaryJudgeScorer: 1.0
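
The first score, 0.0666..., equals 1/15, which is consistent with one of the 15 generated tweets being judged ENGAGING. Assuming `BinaryJudgeScorer` reports the share of outputs placed in the target category, the arithmetic can be sketched as below. The verdict list is hypothetical and only reproduces that ratio; `target_share` is a name introduced for illustration.

```python
from collections import Counter

# Hypothetical judge verdicts for 15 generated tweets: 1 ENGAGING, 14 NEUTRAL.
verdicts = ["ENGAGING"] + ["NEUTRAL"] * 14


def target_share(labels, target="ENGAGING"):
    """Fraction of labels equal to the target category."""
    return Counter(labels)[target] / len(labels)


print(target_share(verdicts))  # 0.06666666666666667
```

Under the same assumption, a score of 1.0 means every one of the 15 tweets from the rewritten prompt was judged ENGAGING.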


### View the Best Optimized Prompt

In [7]:
print(optimizer.best_prompt())

You are a creative tweet generator tasked with crafting engaging, concise, and dynamic tweets that resonate with audiences. Your tweets should include strong hooks, emotional language, humor, prompts for interaction, and the use of vivid imagery or emojis where appropriate. Focus on delivering content that feels personal, relatable, and encourages readers to engage with the topic or share their thoughts. Aim for a tone that is friendly, enthusiastic, and full of personality, turning complex concepts into enjoyable and easily digestible messages.
