Copyright (c) Meta Platforms, Inc. and affiliates.
All rights reserved.

This source code is licensed under the terms described in the LICENSE file in
the root directory of this source tree.

<a href="https://colab.research.google.com/github/meta-llama/llama-cookbook/blob/main/getting-started/llama-tools/pdo_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with PDO (Prompt Duel Optimizer) with prompt-ops

This tutorial will guide you through using PDO with prompt-ops to optimize prompts for Llama models through competitive dueling. We'll cover:

## Table of Contents

1. [Introduction to PDO](#1-introduction-to-pdo)
   - The Problem PDO Solves
   - Key Innovations
   - Comparison with Other Methods

2. [PDO Architecture Deep Dive](#2-pdo-architecture-deep-dive)
   - Core Components from the Paper
   - Dueling Bandits Framework
   - Thompson Sampling
   - Multi-Ranker Fusion

3. [Creating a PDO Project](#3-creating-a-pdo-project)
   - Project Structure
   - Configuration Essentials
   - Dataset Preparation

4. [Running PDO Optimization](#4-running-pdo-optimization)
   - The Optimization Process
   - Understanding the Output
   - Common Parameters

5. [Analyzing Results](#5-analyzing-results)
   - Interpreting Optimized Prompts
   - Performance Metrics
   - Duel Statistics


## 1. Introduction to PDO

### The Problem PDO Solves

Traditional prompt optimization methods face several critical challenges:

1. **Absolute Scoring Bias**: Single-point evaluation can be misleading - is a 0.85 score truly better than 0.82?
2. **Limited Exploration**: Greedy approaches miss superior prompts in unexplored regions
3. **Ranking Uncertainty**: Different metrics may rank prompts inconsistently
4. **Exploitation vs Exploration**: How to balance trying new prompts vs refining known good ones?

PDO addresses these limitations through a **dueling bandit approach** that treats prompt optimization as a competitive head-to-head tournament rather than absolute scoring.

#### **Design Choice: Pairwise vs. Pointwise Evaluation**  
PDO uses **pairwise comparison** even when labels are available—trading computational efficiency for more robust prompt evaluation.

### Key Innovations from the Paper

The paper explores several novel concepts in prompt optimization:

1. **Dueling Bandits**: Prompts compete in pairwise comparisons, not absolute rankings
2. **Thompson Sampling**: Probabilistic exploration using Beta distributions for smart duel selection
3. **Multi-Ranker Fusion**: Combines multiple ranking algorithms (Copeland, Borda, Average Win Rate, Elo, TrueSkill) for robust evaluation
4. **Variance-Driven Exploration**: Selects opponent based on uncertainty, maximizing information gain
5. **LLM-as-Judge**: Uses language models to evaluate which prompt produces better responses
6. **Adaptive Pruning**: Removes consistently poor performers while maintaining diversity


### Comparison with Other Methods

The paper compares PDO with several baseline approaches:

| Method | Approach | Limitations | PDO's Advantage |
|--------|----------|-------------|-----------------|
| **Manual Prompting** | Human-written instructions | Time-consuming, subjective | Automated competitive optimization |
| **Few-shot Learning** | Examples in prompt | Limited by context window | Optimizes instructions through duels |
| **DSPy MIPRO** | Bayesian optimization | Local optima, absolute scoring | Global exploration via bandits |
| **OPRO** | LLM-based optimization | Greedy, expensive | Efficient exploration via Thompson sampling |

### The Dueling Advantage

The paper's key insight is that **relative comparisons are more reliable than absolute scores**. Dueling bandits are well-suited for prompt optimization because they:

- Focus on what matters: "Which prompt is better?" not "What's the exact score?"
- Maintain diversity through probabilistic selection
- Balance exploration (trying new prompts) and exploitation (refining winners)
- Aggregate multiple ranking systems for robustness
- Scale efficiently with the number of prompts


### Visualizing the PDO Advantage

To understand why PDO's dueling approach is superior, let's compare it with traditional optimization:

**The Traditional Problem:**
When you score prompts independently (e.g., P4 gets 0.78, P5 gets 0.74), small differences might just be noise. Is P4 *really* better, or did it just get lucky on the test set? You can't be sure.

**PDO's Solution:**
Instead of asking "How good is this prompt?" (absolute), PDO asks "Which prompt is better?" (relative). By running head-to-head duels on the same examples, PDO eliminates scoring bias and reveals consistent winners.

**The Visualization Below Shows:**
- **Left**: Traditional optimization relies on absolute scores that may be unreliable
- **Right**: PDO builds a **win matrix** where each cell shows how often one prompt beats another in head-to-head duels

**Key Insight:** A prompt that wins 60%+ of duels against *every* opponent is a validated champion. This is PDO's core innovation: **reliable relative comparison through competition**.

![Point vs Pairwise Comparison](images/point-vs-pairwise.png)
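
A win matrix of this kind can be built directly from raw duel outcomes. The sketch below is only an illustration, not prompt-ops code: the duel log, the prompt names, and the helpers `build_win_matrix`, `win_rate`, and `find_champion` are hypothetical, and the 60% threshold is taken from the key insight above.

```python
from collections import defaultdict

def build_win_matrix(duels):
    """Count head-to-head wins: wins[a][b] = examples on which a beat b."""
    wins = defaultdict(lambda: defaultdict(int))
    for winner, loser in duels:
        wins[winner][loser] += 1
    return wins

def win_rate(wins, a, b):
    """Fraction of a-vs-b examples won by a (0.5 if the two never met)."""
    total = wins[a][b] + wins[b][a]
    return wins[a][b] / total if total else 0.5

def find_champion(wins, prompts, threshold=0.6):
    """Return the first prompt that beats every opponent at >= threshold, if any."""
    for p in prompts:
        if all(win_rate(wins, p, q) >= threshold for q in prompts if q != p):
            return p
    return None

# Hypothetical duel log: one (winner, loser) pair per judged example.
duels = (
    [("P4", "P5")] * 7 + [("P5", "P4")] * 3
    + [("P4", "P3")] * 8 + [("P3", "P4")] * 2
    + [("P5", "P3")] * 5 + [("P3", "P5")] * 5
)
wins = build_win_matrix(duels)
champion = find_champion(wins, ["P3", "P4", "P5"])
print("champion:", champion)
```

With this made-up log, P4 wins 70% of its examples against P5 and 80% against P3, so it is the only prompt that clears the threshold against every opponent.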



PDO discovers prompt quality through competitive duels, where consistent winners emerge naturally from head-to-head comparisons. In the example above, P4 dominates: it wins 60%+ of duels against ALL opponents, so its superiority is validated across every matchup rather than by a single score that might be noise.


## 2. PDO Architecture Deep Dive

This section explores the core components of PDO as described in the research paper, focusing on the theoretical foundations and algorithmic innovations.

### Core Components from the Paper

PDO consists of six main components:

1. **Instruction Pool (𝒫)**: Collection of candidate prompts competing for superiority
2. **Win Matrix (W)**: Tracks head-to-head results between all prompt pairs
3. **Thompson Sampler**: Selects duel pairs using probabilistic exploration
4. **LLM Judge**: Evaluates which prompt produces better responses
5. **Multi-Ranker System**: Aggregates multiple ranking algorithms for robust evaluation
6. **Instruction Evolution**: Generates new prompts by combining successful variants

Let's explore each component in detail.


### 1. The Dueling Bandits Framework

The paper introduces PDO's use of **Dueling Bandits** as the central optimization paradigm:

**Traditional Multi-Armed Bandits**:
- Pull an arm → observe absolute reward
- Problem: Reward scales may be unreliable or noisy

**Dueling Bandits (PDO's Approach)**:
- Select two prompts → run head-to-head comparison
- Observe relative preference: "Which is better?"
- Advantage: Relative comparisons are more stable than absolute scores

### The Algorithm Loop

```
Initialize:
  - Start with base prompt + generate initial variations
  - Initialize Win matrix W (all zeros)

For each round t = 1 to T:
  1. Use Thompson Sampling to select prompt pair (i, j)
  2. Run duel: Both prompts answer the same examples
  3. LLM judge decides winner on each example
  4. Update W[i,j] with i's wins over j, W[j,i] with j's wins over i
  5. Compute multi-ranker scores (Copeland, Borda, Average Win Rate, Elo, TrueSkill)
  6. [Optional] Generate new prompts by combining top performers
  7. [Optional] Prune worst-performing prompts

Return:
  - Best prompt according to aggregated rankings
```

**Key Innovation**: The Win matrix W becomes increasingly informative, guiding both exploration (via Thompson sampling) and exploitation (via ranking systems).
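
To make the loop concrete, here is a toy simulation of steps 1 to 4. It is a sketch under loud assumptions: the LLM judge is replaced by a coin flip driven by hidden "true strength" numbers, pairs are chosen uniformly at random instead of by Thompson sampling, and steps 5 to 7 are left out. All function names and numbers are hypothetical.

```python
import random

def simulated_judge(strength_i, strength_j, rng):
    """Stand-in for the LLM judge: i wins with probability s_i / (s_i + s_j)."""
    return rng.random() < strength_i / (strength_i + strength_j)

def run_duels(strengths, rounds, examples_per_duel, seed=0):
    rng = random.Random(seed)
    n = len(strengths)
    # W[i][j] = number of examples on which prompt i beat prompt j.
    W = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)        # step 1 (simplified: random pair)
        for _ in range(examples_per_duel):    # steps 2-3: judge each example
            if simulated_judge(strengths[i], strengths[j], rng):
                W[i][j] += 1                  # step 4: record a win for i
            else:
                W[j][i] += 1                  # ... or a win for j
    return W

def total_win_rate(W, i):
    """Share of all judged examples involving prompt i that i won."""
    wins = sum(W[i])
    losses = sum(row[i] for row in W)
    return wins / (wins + losses) if wins + losses else 0.5

strengths = [1.0, 1.5, 3.0]  # hypothetical hidden quality of three prompts
W = run_duels(strengths, rounds=60, examples_per_duel=25)
best = max(range(len(strengths)), key=lambda i: total_win_rate(W, i))
print("win matrix:", W)
print("best prompt index:", best)
```

Even with random pairing, the accumulated counts in `W` are enough to separate the strongest prompt from the rest; the point of smarter pair selection is to reach that separation with fewer duels.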


### 2. Thompson Sampling

One of PDO's most sophisticated components is its **Double Thompson Sampling** strategy:

#### What is Thompson Sampling?

Thompson Sampling is a probabilistic approach to the exploration-exploitation dilemma:
- Maintain a **belief distribution** about each prompt's quality
- Sample from these distributions to select actions
- Naturally balances trying new things vs. exploiting known winners

#### PDO's Double Thompson Sampling

The paper describes a two-stage selection process:

**Stage 1: Select First Prompt**
1. For each prompt pair (i,j), model win probability as Beta(wins+1, losses+1)
2. Sample a win-rate matrix θ from these Beta distributions
3. Compute multiple rankings from θ (Copeland, Borda, win-rate)
4. Combine with Elo and TrueSkill ratings using **Dirichlet-weighted fusion**
5. Apply softmax with temperature τ for final selection

**Stage 2: Select Second Prompt (Opponent)**
1. Among remaining prompts, filter to those still plausibly competitive
2. Select the one with **maximum variance** in win probability vs. first prompt
3. Rationale: High variance = high uncertainty = maximum information gain from this duel

**Benefits**:
- Efficient exploration: Doesn't waste duels on clearly inferior prompts
- Information-driven: Prioritizes duels that reduce uncertainty
- Adaptive: Automatically shifts from exploration to exploitation as data accumulates
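
The two stages can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the paper's or prompt-ops' implementation: Stage 1 ranks prompts by a sampled Copeland score only (no Elo/TrueSkill fusion and no softmax), Stage 2 skips the "plausibly competitive" filter, and the names `select_duel_pair` and `beta_variance` are hypothetical.

```python
import random

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def select_duel_pair(W, rng):
    """Pick (first, opponent) from win matrix W, where W[i][j] = wins of i over j."""
    n = len(W)
    # Stage 1: sample a plausible win-rate matrix theta from the Beta posteriors.
    theta = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            theta[i][j] = rng.betavariate(W[i][j] + 1, W[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # Simplification: rank by sampled Copeland score, breaking ties at random.
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i) for i in range(n)]
    first = max(range(n), key=lambda i: (copeland[i], rng.random()))
    # Stage 2: the opponent is the prompt whose matchup with `first` is most uncertain.
    opponent = max(
        (j for j in range(n) if j != first),
        key=lambda j: beta_variance(W[first][j] + 1, W[j][first] + 1),
    )
    return first, opponent

# Hypothetical win matrix: prompt 0 leads prompt 1 15-5 but is only 2-2 against prompt 2.
W = [[0, 15, 2], [5, 0, 1], [2, 3, 0]]
first, opponent = select_duel_pair(W, random.Random(0))
print("duel:", first, "vs", opponent)
```

Whenever prompt 0 is drawn first here, prompt 2 is chosen as its opponent, because the 2-2 record leaves far more uncertainty than the 15-5 record against prompt 1.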


#### Understanding the Thompson Sampling Visualization

The diagram below shows **Beta distributions** representing our belief about each prompt's win probability. This is the heart of how PDO decides which prompts to compare in duels:

**What You're Looking At:**
- Each plot shows a probability distribution over win rates (0 to 1 on x-axis)
- The **height** (density) represents how confident we are about that win rate
- The **width** represents uncertainty - wider = less certain

**The Three Scenarios:**

1. **Prompt A (Strong Performer - Green)**: 15 wins, 5 losses
   - **Narrow, tall distribution** centered at 0.73
   - High confidence: We're pretty sure this prompt wins ~73% of the time
   - PDO interpretation: *Reliable choice, but already well-understood*

2. **Prompt B (Average - Blue)**: 10 wins, 10 losses  
   - **Medium width** distribution centered at 0.50
   - Moderate confidence: Wins about half the time
   - PDO interpretation: *Middle-of-the-pack, not particularly interesting*

3. **Prompt C (Uncertain - Orange)**: 2 wins, 2 losses
   - **Wide, flat distribution** centered at 0.50
   - HIGH UNCERTAINTY: Could be anywhere from 0.2 to 0.8!
   - PDO interpretation: *High information value - might be secretly excellent OR terrible*

**PDO's Smart Strategy:**

Thompson Sampling doesn't just pick the prompt with the highest mean (that would be A). Instead:

1. **First prompt selection**: Sample from all distributions → A likely selected (highest mean)
2. **Second prompt selection**: Pick the opponent with **maximum variance** vs. the first
   - Dueling A vs B tells us little (both well-understood)
   - **Dueling A vs C tells us a LOT** (huge uncertainty about C)

**The Key Insight:** By choosing C as the opponent, PDO maximizes **information gain**. After this duel:
- If C wins → Great! We discovered a hidden champion
- If C loses → Good! We eliminated uncertainty and can focus elsewhere

This is why PDO converges faster than naive approaches - it strategically explores high-uncertainty regions rather than randomly sampling prompts.

![Thompson Sampling Beta Distributions](images/thompson-sampling.png)
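
The numbers behind the three curves can be checked with the closed-form mean and variance of a Beta distribution. The snippet assumes the Beta(wins+1, losses+1) model from Stage 1; the helper names are illustrative.

```python
def beta_mean(wins, losses):
    """Mean of Beta(wins + 1, losses + 1)."""
    a, b = wins + 1, losses + 1
    return a / (a + b)

def beta_variance(wins, losses):
    """Variance of Beta(wins + 1, losses + 1)."""
    a, b = wins + 1, losses + 1
    return a * b / ((a + b) ** 2 * (a + b + 1))

records = {"A": (15, 5), "B": (10, 10), "C": (2, 2)}
for name, (w, l) in records.items():
    print(f"Prompt {name}: mean={beta_mean(w, l):.2f}, variance={beta_variance(w, l):.4f}")
```

Prompt A's mean comes out at about 0.73 and the other two at 0.50, while Prompt C's variance is roughly three to four times that of A or B. That gap is why a duel involving C is the most informative one.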


### 3. Multi-Ranker Fusion

One of PDO's most important innovations is using **multiple ranking algorithms** simultaneously to create a more robust evaluation system.

#### What is Multi-Ranker Fusion?

Imagine you're trying to determine the best chess player. Would you:
- **Option A**: Use only one ranking system (e.g., just count wins)
- **Option B**: Combine multiple perspectives (wins, strength of opponents beaten, consistency, etc.)

PDO chooses Option B - it **fuses** (combines) the opinions of 5 different ranking algorithms, each with its own strengths and blind spots. This is called **Multi-Ranker Fusion**.

**The Core Problem:** No single ranking system is perfect. Each has biases:
- Some favor consistent performers
- Some reward crushing weak opponents
- Some value beating strong opponents highly
- Some are sensitive to the order of matches

By combining all 5, PDO gets a **consensus view** that's more reliable than any individual ranker.

#### Why Multiple Rankers?

Different ranking systems capture different aspects of prompt quality:

| Ranking System | What It Measures | Strengths | Weaknesses |
|----------------|------------------|-----------|------------|
| **Copeland** | Number of opponents beaten | Simple, intuitive | Ignores margin of victory |
| **Borda** | Sum of win probabilities | Accounts for all matchups | Can be dominated by many weak wins |
| **Average Win Rate** | Mean win probability | Easy to interpret | Doesn't account for opponent strength |
| **Elo** | Chess-style rating | Accounts for opponent strength | Sensitive to order of matches |
| **TrueSkill** | Bayesian skill rating | Confidence intervals, robust | More complex, computational cost |
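
The first four rankers are simple enough to sketch from a win matrix. The definitions below are common textbook forms and may differ in detail from the paper's; in particular the average win rate here is just the Borda score divided by the number of opponents, the Elo K-factor of 32 and starting rating of 1000 are assumptions, and TrueSkill is omitted because it needs a dedicated library.

```python
def win_probabilities(W):
    """P[i][j] = empirical probability that prompt i beats prompt j."""
    n = len(W)
    P = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = W[i][j] + W[j][i]
            if i != j and total:
                P[i][j] = W[i][j] / total
    return P

def copeland(P):
    """Number of opponents each prompt beats more often than not."""
    n = len(P)
    return [sum(P[i][j] > 0.5 for j in range(n) if j != i) for i in range(n)]

def borda(P):
    """Sum of win probabilities against all opponents."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) for i in range(n)]

def average_win_rate(P):
    """Mean win probability against all opponents."""
    return [score / (len(P) - 1) for score in borda(P)]

def elo_update(r_a, r_b, a_won, k=32.0):
    """Standard Elo update after one game between a and b."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Hypothetical win matrix for three prompts.
W = [[0, 7, 8], [3, 0, 5], [2, 5, 0]]
P = win_probabilities(W)
print("Copeland:", copeland(P))
print("Borda:", borda(P))
print("Average win rate:", average_win_rate(P))
print("Elo after one win from equal ratings:", elo_update(1000.0, 1000.0, True))
```

Note how the 5-5 tie between the second and third prompts gives neither a Copeland point, while Borda still separates them through their different margins against the first prompt.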

#### Fusion Strategy

The paper describes a sophisticated fusion approach:

```
For each round:
1. Compute all 5 ranking scores for each prompt
2. Normalize each to [0, 1] range
3. Sample fusion weights from Dirichlet(1,1,1,1,1)
   → Introduces randomness in how rankings are combined
4. Compute weighted combination: score = Σ(weight_i × ranking_i)
5. Use fused scores for prompt selection
```

**What's Happening Here:**
- **Step 3 is key**: Instead of fixed weights (e.g., always 20% each), PDO **randomly samples** weights each round
- **Dirichlet distribution** ensures weights sum to 1.0 but vary each time
- Example (weights ordered as Copeland, Borda, Win Rate, Elo, TrueSkill): one round might weight Elo heavily (0.15, 0.15, 0.15, 0.4, 0.15), the next round Copeland (0.4, 0.15, 0.15, 0.15, 0.15)
- This explores different "philosophies" of ranking over time

**Benefits**:
- **Robustness**: No single ranking system dominates
- **Exploration**: Dirichlet sampling explores different ranking perspectives
- **Consensus**: Prompts that rank well across ALL systems are truly superior
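
The fusion steps can be sketched with the standard library alone, since a Dirichlet sample is just a set of Gamma draws normalized to sum to 1. This is a minimal illustration, not the library's code: the raw scores are made up, min-max scaling is one assumed way to normalize to [0, 1], and `sample_dirichlet`, `min_max_normalize`, and `fuse` are hypothetical names.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw weights from Dirichlet(alphas) by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def min_max_normalize(scores):
    """Rescale scores to [0, 1]; all-equal scores map to 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(rankings, rng):
    """rankings maps a ranker name to a list of raw scores, one per prompt."""
    names = list(rankings)
    weights = sample_dirichlet([1.0] * len(names), rng)          # step 3
    normalized = [min_max_normalize(rankings[n]) for n in names]  # step 2
    n_prompts = len(normalized[0])
    fused = [
        sum(w * column[p] for w, column in zip(weights, normalized))  # step 4
        for p in range(n_prompts)
    ]
    return fused, dict(zip(names, weights))

# Hypothetical raw scores for three prompts from the five rankers.
rankings = {
    "copeland": [2, 1, 0],
    "borda": [1.5, 0.9, 0.6],
    "win_rate": [0.75, 0.45, 0.30],
    "elo": [1040, 1055, 905],
    "trueskill": [27.1, 25.3, 22.0],
}
fused, weights = fuse(rankings, random.Random(0))
print("weights:", weights)
print("fused scores:", fused)
```

Calling `fuse` again with a different random state yields different weights, which is what lets successive rounds emphasize different rankers.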

#### Concrete Example

Let's say we have 3 prompts after some duels:

| Prompt | Copeland | Borda | Win Rate | Elo | TrueSkill | **Average** |
|--------|----------|-------|----------|-----|-----------|-------------|
| **P1** | 1st | 1st | 1st | 2nd | 1st | **1st (winner!)** |
| **P2** | 2nd | 2nd | 3rd | 1st | 2nd | **2nd** |
| **P3** | 3rd | 3rd | 2nd | 3rd | 3rd | **3rd** |

**Key Insight**: P1 wins 4 out of 5 ranking systems - it's a **consensus champion**. Even though P2 tops the Elo ranking (maybe it beat one strong opponent), P1 is more consistently excellent across all evaluation criteria.

This is why Multi-Ranker Fusion is powerful: **It prevents a prompt from gaming one specific metric** and ensures true, well-rounded superiority.
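
The "Average" column can be reproduced by averaging each prompt's rank across the five systems. Note that this uses equal weights for readability, whereas the fusion strategy above samples the weights each round.

```python
# Ranks from the table, ordered as Copeland, Borda, Win Rate, Elo, TrueSkill.
ranks = {
    "P1": [1, 1, 1, 2, 1],
    "P2": [2, 2, 3, 1, 2],
    "P3": [3, 3, 2, 3, 3],
}
average_rank = {prompt: sum(r) / len(r) for prompt, r in ranks.items()}
final_order = sorted(average_rank, key=average_rank.get)  # lower rank is better
print(average_rank)
print(final_order)
```

The averages are 1.2, 2.0, and 2.8, so P2's single first place in Elo is not enough to overtake P1.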


### 4. LLM-as-Judge

The paper emphasizes that PDO's evaluation mechanism is crucial for effective optimization:

**Traditional Metrics**: Hard-coded rules (exact match, F1, BLEU, etc.)

**PDO's LLM Judge**:
- **Natural Language Evaluation**: Can assess nuanced qualities (helpfulness, clarity, correctness)
- **Pairwise Comparison**: Given two responses, which is better and why?
- **Flexible Criteria**: Can evaluate domain-specific quality without custom metric code
- **Reasoning Output**: Provides explanation for decisions (useful for debugging)

#### Judge Prompt Structure

```
You are an expert evaluator. Compare these two responses:

Input: {question}
Expected Answer: {label}

Response A:
{response_from_prompt_A}

Response B:
{response_from_prompt_B}

Which response is better? Consider:
- Correctness: Does it match the expected answer?
- Completeness: Does it address all aspects?
- Clarity: Is it well-structured and understandable?

Output format:
{
  "reasoning": "Your detailed comparison",
  "winner": "A" or "B"
}
```

**Benefits of LLM-as-Judge**:
- Works for any task (classification, generation, reasoning)
- Captures subtle quality differences
- Scales to new domains without custom metric engineering
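
In practice the judge's reply has to be turned back into a duel outcome. The sketch below shows one way to fill the template and parse the verdict defensively. It is not the prompt-ops implementation: the shortened template, `build_judge_prompt`, and `parse_verdict` are illustrative, and the call to the judge model itself is deliberately left out.

```python
import json

JUDGE_TEMPLATE = """You are an expert evaluator. Compare these two responses:

Input: {question}
Expected Answer: {label}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Reply with JSON:
{{"reasoning": "Your detailed comparison", "winner": "A" or "B"}}"""

def build_judge_prompt(question, label, response_a, response_b):
    """Fill the judge template for one example."""
    return JUDGE_TEMPLATE.format(
        question=question, label=label,
        response_a=response_a, response_b=response_b,
    )

def parse_verdict(raw):
    """Extract 'A' or 'B' from the judge's reply; return None if it is unusable."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    winner = str(verdict.get("winner", "")).strip().upper()
    return winner if winner in ("A", "B") else None

prompt = build_judge_prompt("Is this urgent?", "high", "urgency: high", "urgency: low")
# A hypothetical judge reply with some chatter around the JSON object.
reply = 'Sure! {"reasoning": "A matches the expected answer.", "winner": "a"}'
print(parse_verdict(reply))
```

Returning `None` for malformed replies lets the caller skip or retry that example instead of recording a spurious win.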

## 3. Creating a PDO Project

Now that we understand the theoretical foundations, let's see how to create a project that leverages PDO's capabilities.

### Project Structure

A PDO-enabled project requires several components:

```
my-pdo-project/
├── config.yaml          # PDO configuration
├── data/
│   └── dataset.json     # Training/test data
├── prompts/
│   └── prompt.txt       # Initial prompt template
├── results/             # Optimization outputs
└── logs/                # Detailed execution logs
```

### Key Configuration Elements

Here are the essential configuration parameters:


```python
# Example PDO Configuration (config.yaml)
config_example = """
# PDO Configuration based on paper recommendations

name: customer-support-classifier

models:
  task_model: "openrouter/meta-llama/llama-3.3-70b-instruct"      # Model to optimize
  proposer_model: "openrouter/meta-llama/llama-3.3-70b-instruct" # For evolution & judging
  provider: "openrouter"
  temperature: 0.1                                                # Low temp for consistency

dataset:
  path: "data/dataset.json"
  input_field: ["fields", "input"]                               # Path to input in JSON
  golden_output_field: "answer"                                  # Expected output field

optimization:
  strategy: "qpdo"                                               # Enable PDO strategy

  # Core dueling parameters
  total_rounds: 50                                               # Number of rounds
  num_duels_per_round: 3                                         # Duels per round
  num_eval_examples_per_duel: 25                                 # Examples per duel
  num_initial_instructions: 3                                    # Initial prompt pool size

  # Thompson sampling parameters
  thompson_alpha: 2.0                                            # Confidence bound parameter

  # Instruction evolution
  num_top_prompts_to_combine: 3                                  # Top K for combination
  num_new_prompts_to_generate: 1                                 # New prompts per gen round
  max_new_prompts_to_generate: 20                                # Max total prompt pool size
  num_to_prune_each_round: 1                                     # Adaptive pruning
  gen_new_prompt_round_frequency: 3                              # Generate every N rounds

  # Execution parameters
  max_concurrent_threads: 4                                      # Parallel threads
  use_labels: true                                               # Enable supervised evolution
  verbose: true                                                  # Detailed logging

prompts:
  system: "prompts/prompt.txt"                                   # Initial prompt file

metric:
  class: "llama_prompt_ops.core.metrics.StandardJSONMetric"      # JSON matching metric
  strict_json: false                                             # Allow flexible parsing
  output_field: "answer"
"""

print("PDO Configuration Structure:")
print("=" * 50)
print(config_example)

print("\n" + "=" * 50)
print("\nKey Parameters Explained:")
print("• total_rounds × num_duels_per_round = total number of prompt comparisons")
print("• thompson_alpha controls exploration: higher = more exploration")
print("• gen_new_prompt_round_frequency: Generate new variants every N rounds")
print("• num_to_prune_each_round: Remove worst performers to maintain efficiency")
print("\nCost Considerations:")
print("Total LLM calls ≈ rounds × duels × examples × 3")
print("  (×3 = both prompts + judge evaluation)")
print("Example: 50 × 3 × 25 × 3 = ~11,250 LLM calls")
```
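
The cost rule of thumb is easy to turn into a helper for comparing budgets before launching a run. This is a rough sketch: `estimate_llm_calls` is a hypothetical name, and the estimate ignores any extra calls spent generating new instructions.

```python
def estimate_llm_calls(total_rounds, num_duels_per_round, num_eval_examples_per_duel):
    """Approximate LLM calls: each example needs two task responses plus one judge call."""
    calls_per_example = 3
    return total_rounds * num_duels_per_round * num_eval_examples_per_duel * calls_per_example

full_run = estimate_llm_calls(50, 3, 25)    # the example config
smoke_test = estimate_llm_calls(10, 3, 25)  # a shorter trial run
print(full_run, smoke_test)
```

Cutting `total_rounds` from 50 to 10 reduces the estimate from 11,250 to 2,250 calls, which can be a cheaper way to check that the pipeline works end to end.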


### Dataset Preparation

According to the paper, PDO works effectively with:

1. **Moderate dataset sizes** (50-200 examples): Enough for reliable duels
2. **Clear input/output pairs**: Enables supervised evolution
3. **Diverse examples**: Cover different aspects of the task

Example dataset structure:


```python
# TODO
```
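
As a stand-in until real data is available, the sketch below shows a layout consistent with the `input_field: ["fields", "input"]` and `golden_output_field: "answer"` settings in the example config. The two records are invented, storing the answer as a JSON string is an assumption, and `get_nested` is a hypothetical helper.

```python
import json

# Hypothetical records; real data would come from your own support messages.
dataset = [
    {
        "fields": {"input": "I was charged twice for my subscription this month."},
        "answer": json.dumps({
            "urgency": "medium",
            "sentiment": "negative",
            "category": "billing",
            "requires_escalation": False,
        }),
    },
    {
        "fields": {"input": "Thanks, the new dashboard is great!"},
        "answer": json.dumps({
            "urgency": "low",
            "sentiment": "positive",
            "category": "feedback",
            "requires_escalation": False,
        }),
    },
]

def get_nested(record, path):
    """Follow a key path such as ["fields", "input"] into a nested dict."""
    value = record
    for key in (path if isinstance(path, list) else [path]):
        value = value[key]
    return value

# Sanity-check every record against the field paths used in the config.
for record in dataset:
    assert isinstance(get_nested(record, ["fields", "input"]), str)
    json.loads(get_nested(record, "answer"))  # the answer must be valid JSON

serialized = json.dumps(dataset, indent=2)  # contents for data/dataset.json
print(serialized[:200])
```

The categories and label values mirror the ones listed in the initial prompt template, so the golden answers and the model's expected output share the same schema.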

### Initial Prompt Template

The starting prompt doesn't need to be perfect - PDO will evolve it through duels:

**Key principles**:
1. Start with a **clear task description**
2. Include **basic structure** for the output
3. PDO will discover improvements through **competitive testing**


```python
# Example initial prompt (prompts/prompt.txt)
initial_prompt = """
You are a customer support message classifier.

Analyze the customer message and classify it according to:
- Urgency: low, medium, high, or critical
- Sentiment: positive, neutral, or negative
- Category: account_access, billing, technical_issue, feedback, or general_inquiry
- Escalation needed: true or false

Return your classification as a JSON object with these exact fields:
{
  "urgency": "<urgency_level>",
  "sentiment": "<sentiment>",
  "category": "<category>",
  "requires_escalation": <boolean>
}
"""

print("Initial Prompt Template:")
print("=" * 50)
print(initial_prompt)
print("\n" + "=" * 50)
print("\nWhat PDO Will Optimize:")
print("• Clarity of classification criteria")
print("• Examples of edge cases (e.g., 'What makes something critical?')")
print("• Ordering and emphasis of instructions")
print("• Additional guidance for ambiguous cases")
print("\nEvolution Process:")
print("1. PDO tests this prompt against others in duels")
print("2. Identifies where it wins/loses")
print("3. Generates improved variants")
print("4. New variants compete in subsequent rounds")
print("5. Best performers naturally emerge through competition")
```


## 4. Running PDO Optimization

### The Optimization Process

Based on the paper's methodology, PDO optimization follows a structured tournament process:

1. **Initialization Phase**
   - Load base prompt
   - Generate initial prompt variations
   - Initialize Win matrix (all zeros)

2. **Competition Phase** (main loop)
   - Use Thompson sampling to select duel pair
   - Run head-to-head comparison on examples
   - LLM judge decides winners
   - Update Win matrix
   - Compute multi-ranker scores
   - Periodically generate new prompts
   - Prune worst performers

3. **Selection Phase**
   - Aggregate rankings (Copeland, Borda, Average Win Rate, Elo, TrueSkill)
   - Select highest-ranked prompt
   - Save optimized prompt and metadata

### Running PDO with prompt-ops


```python
# TODO
```
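
One way to launch a run from a notebook cell is to wrap the `prompt-ops migrate` command listed under Quick Reference Commands. The wrapper below is a sketch, not part of prompt-ops: `build_command` and `run_optimization` are hypothetical helpers, and any provider credentials the run needs are assumed to be configured separately.

```python
import shutil
import subprocess

def build_command(config_path="config.yaml", log_level="INFO"):
    """Assemble the CLI invocation from the Quick Reference Commands."""
    return ["prompt-ops", "migrate", "--config", config_path, "--log-level", log_level]

def run_optimization(config_path="config.yaml"):
    """Run the CLI if it is installed; return its exit code, or None if it is missing."""
    if shutil.which("prompt-ops") is None:
        print("prompt-ops is not on PATH; install it with `pip install prompt-ops`.")
        return None
    return subprocess.run(build_command(config_path), check=False).returncode

print(" ".join(build_command()))
# exit_code = run_optimization("config.yaml")  # uncomment to start a real run
```

The call is left commented out because a full run with the example config makes thousands of LLM calls.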

### Budget Considerations from the Paper

PDO's budget is set by three parameters from the configuration in Section 3:

| Parameter | Value in the example config | What it controls |
|-----------|-----------------------------|------------------|
| **total_rounds** | 50 | Number of dueling rounds |
| **num_duels_per_round** | 3 | Duels run in each round |
| **num_eval_examples_per_duel** | 25 | Examples judged in each duel |

**Paper's Findings**:
- PDO discovers winners through ~50-200 rounds of duels
- Larger budgets help with complex multi-objective tasks
- Small training sets (10-30 examples) work best to avoid overfitting
- Validation performance is the key metric to track


## 5. Analyzing Results

### Understanding the Optimized Prompts

After PDO completes, you'll have access to detailed results that showcase the paper's innovations in action.


```python
# TODO
```

### Visualizing Performance Evolution

The paper emphasizes tracking performance over the optimization process to understand PDO's dueling dynamics:


### Comparing Before and After

The paper shows significant improvements across various benchmarks. Here's how to interpret your results:


```python
# TODO
```

## Conclusion and Key Takeaways

### Summary of PDO's Innovations

Based on our exploration of the PDO paper and practical implementation:

1. **Dueling Bandits Framework**: PDO treats optimization as a tournament where prompts compete head-to-head, making relative comparisons more reliable than absolute scoring.

2. **Thompson Sampling**: Smart exploration strategy that balances trying new prompts (exploration) with refining known winners (exploitation) through probabilistic sampling.

3. **Multi-Ranker Fusion**: Combines 5 different ranking algorithms (Copeland, Borda, Win Rate, Elo, TrueSkill) with Dirichlet-weighted fusion for robust evaluation.

4. **LLM-as-Judge**: Uses language models to evaluate pairwise preferences, enabling nuanced quality assessment beyond hard metrics.

5. **Instruction Evolution**: Generates new prompts by combining top performers, allowing discovery of superior variants through competitive pressure.

6. **Adaptive Pruning**: Removes consistently poor performers while maintaining diversity in the prompt pool.

### When to Use PDO

The paper suggests PDO is particularly effective for:

- **High-stakes applications** where prompt quality is critical
- **Complex tasks** with nuanced quality criteria
- **Scenarios where absolute scoring is unreliable** (subjective tasks)
- **Budget available for thorough exploration** (more LLM calls than simpler methods)

### Paper's Key Results

The paper demonstrates PDO's effectiveness:
- **Outperforms gradient-based methods**: More thorough exploration avoids local optima
- **Robust rankings**: Multi-ranker consensus prevents over-reliance on single metric
- **Efficient evolution**: Thompson sampling focuses computational budget on informative duels
- **Interpretable results**: Win matrices and Elo ratings provide clear understanding of prompt quality


### Final Thoughts

PDO represents an important advance in prompt optimization by bringing dueling bandit algorithms to LLM prompt engineering. Its competitive approach, combined with sophisticated exploration strategies and multi-ranker fusion, creates a powerful framework for discovering high-quality prompts through natural selection in a tournament setting.

The key insight is that **asking "which is better?" is often more reliable than asking "how good is this?"** - PDO leverages this principle throughout its design.

For more details, refer to the full paper and implementation documentation.


## Additional Resources

### References

 1. **PDO Paper**: "Prompt Dueling Optimization: Tournament-Driven Prompt Discovery and Evaluation" ([arXiv:2510.13907](https://arxiv.org/abs/2510.13907))

### Quick Reference Commands

```bash
# Install prompt-ops
pip install prompt-ops

# Create a new PDO project
prompt-ops create my-pdo-project

# Run PDO optimization
prompt-ops migrate --config config.yaml --log-level INFO

# View results
ls results/*.yaml
```

### Tips for Success

1. **Start small**: Use 10-30 training examples as recommended by the paper
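2. **Design a clear judge**: Specific evaluation criteria (correctness, completeness, clarity) are crucial for consistent duel outcomes
3. **Monitor validation**: Track validation performance to ensure generalization
4. **Experiment with budgets**: Start with a small `total_rounds` and scale up once the pipeline works
5. **Enable instruction evolution**: Set `gen_new_prompt_round_frequency` and `num_top_prompts_to_combine` so new prompts are generated from top performers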

---

*This tutorial covered the theoretical foundations and practical implementation of PDO, based on the dueling bandits research paper. For hands-on practice, try the facility support example or adapt PDO to your own classification or generation tasks.*
