# Prompt Comparison with LLM Judge Evaluation
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Evals/Prompt_Evals.ipynb)

<img src="../images/prompt_compare.png" width="750">

## Introduction

This notebook demonstrates how to compare different prompts, or even different model settings (`max_tokens`, `temperature`, etc.), for the same model so that we can choose the best-performing setup.

Using the Together AI [Evaluations API](https://docs.together.ai/docs/ai-evaluations) we'll:

1. Load the SummEval dataset containing news articles to summarize
2. Configure two different prompts for the same model
3. Use a judge model to evaluate which prompt produces better summaries
4. Analyze the prompt optimization results

The full list of supported models can be found [here](https://docs.together.ai/docs/evaluations-supported-models).


**Concepts Covered:**
- **Prompt Engineering**: Comparing simple vs. detailed prompts for the same task
- **LLM-as-a-Judge**: Using a language model to evaluate prompt effectiveness  
- **Compare Evaluation**: A/B testing prompts to determine optimal configurations
- **Summarization Evaluation**: Assessing summary quality across multiple criteria

## 📋 Prompt Comparison Overview

Prompt engineering is crucial for getting optimal performance from language models. This evaluation compares:

- **Prompt A (Simple)**: Basic instruction with minimal guidance
- **Prompt B (Structured)**: Detailed instruction with specific guidelines  

We'll test which approach produces better summaries on news articles.

In [None]:
# setup and installation
!pip install -qU together datasets pandas

In [None]:
import os
import together
import pandas as pd

# Set your API key: export TOGETHER_API_KEY="your-key-here"
# Or set it programmatically: 
# together_client = together.Client(api_key="your-key-here")
together_client = together.Client()

## 📊 Understanding the SummEval Dataset

The SummEval dataset contains news articles paired with both human and machine-generated summaries, along with quality ratings across multiple dimensions like relevance, coherence, fluency, and consistency.

**Dataset Structure:**
- `text`: The original news article to be summarized
- `machine_summaries`: Various automated summaries
- `human_summaries`: Human-written reference summaries
- Quality ratings across multiple evaluation criteria

For our evaluation, we'll focus on the original articles and generate new summaries using different prompts with our target model.

In [None]:
from datasets import load_dataset

summ_eval = load_dataset("mteb/summeval")

summ_eval

DatasetDict({
    test: Dataset({
        features: ['machine_summaries', 'human_summaries', 'relevance', 'coherence', 'fluency', 'consistency', 'text', 'id'],
        num_rows: 100
    })
})

In [4]:
summ_eval['test'].to_pandas().head()

Unnamed: 0,machine_summaries,human_summaries,relevance,coherence,fluency,consistency,text,id
0,"[donald sterling , nba team last year . sterli...",[V. Stiviano must pay back $2.6 million in gif...,"[1.6666666666666667, 1.6666666666666667, 2.333...","[1.3333333333333333, 3.0, 1.0, 2.6666666666666...","[1.0, 4.666666666666667, 4.333333333333333, 4....","[1.0, 2.3333333333333335, 4.666666666666667, 5...",(CNN)Donald Sterling's racist remarks cost him...,cnn-test-404f859482d47c127868964a9a39d1a7645dd2e9
1,[north pacific gray whale has earned a spot in...,"[The whale, Varvara, swam a round trip from Ru...","[2.3333333333333335, 4.666666666666667, 3.6666...","[1.3333333333333333, 4.666666666666667, 3.6666...","[1.0, 5.0, 4.666666666666667, 3.66666666666666...","[1.3333333333333333, 5.0, 5.0, 4.3333333333333...",(CNN)A North Pacific gray whale has earned a s...,cnn-test-4761dc6d8bdf56b9ada97104113dd1bcf4aed3f1
2,[russian fighter jet intercepted a u.s. reconn...,[The incident occurred on April 7 north of Pol...,"[4.0, 4.0, 4.0, 3.3333333333333335, 3.33333333...","[3.3333333333333335, 4.333333333333333, 1.6666...","[3.6666666666666665, 4.333333333333333, 5.0, 4...","[5.0, 5.0, 4.666666666666667, 5.0, 5.0, 5.0, 5...",(CNN)After a Russian fighter jet intercepted a...,cnn-test-5139ccfabee55ddb83e7937f5802c0a67aee8975
3,[michael barnett captured the fire on intersta...,[Country band Lady Antebellum's bus caught fir...,"[2.0, 3.0, 2.6666666666666665, 3.3333333333333...","[2.0, 3.0, 2.6666666666666665, 3.3333333333333...","[2.6666666666666665, 5.0, 5.0, 5.0, 5.0, 5.0, ...","[2.3333333333333335, 5.0, 5.0, 5.0, 5.0, 5.0, ...",(CNN)Lady Antebellum singer Hillary Scott's to...,cnn-test-88c2481234e763c9bbc68d0ab1be1d2375c1349a
4,[deep reddish color caught seattle native tim ...,[Smoke from massive fires in Siberia created f...,"[1.6666666666666667, 3.6666666666666665, 3.333...","[1.6666666666666667, 3.6666666666666665, 1.666...","[5.0, 5.0, 5.0, 5.0, 4.666666666666667, 5.0, 5...","[2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ...",(CNN)A fiery sunset greeted people in Washingt...,cnn-test-a02e362c5b8f049848ce718b37b96117485461cf


We are only interested in the `text` column from this dataset.

In [5]:
print(summ_eval['test'][0]['text'])

(CNN)Donald Sterling's racist remarks cost him an NBA team last year. But now it's his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling's wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple's money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O'Donnell said in a statemen

## 🔄 Preparing Data for Evaluation

Before running our comparison, we need to convert the dataset to JSONL format and upload it to the Together AI platform.

The evaluation service requires:
- JSONL format with consistent fields across all examples
- Upload with `purpose="eval"` to indicate evaluation usage
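
Before uploading, it can help to confirm locally that every record really does share the same fields. The sketch below is a hypothetical helper (`check_jsonl_fields` is not part of the Together SDK) that only mirrors the "consistent fields" requirement stated above; the service may apply additional checks of its own.

```python
import json


def check_jsonl_fields(lines):
    """Return the shared field names if every JSONL record has the same keys.

    Hypothetical local check, not part of the Together SDK.
    """
    expected = None
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        keys = set(record)
        if expected is None:
            expected = keys  # the first record defines the schema
        elif keys != expected:
            raise ValueError(
                f"line {line_number}: fields {sorted(keys)} "
                f"differ from {sorted(expected)}"
            )
    return sorted(expected or [])


# Illustrative records with the same single field used in this notebook
sample = [
    json.dumps({"text": "First example article."}),
    json.dumps({"text": "Second example article."}),
]
print(check_jsonl_fields(sample))
```

The same function can be pointed at an open file handle, since iterating over a text file yields one line at a time.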

In [None]:
# Convert the 'text' column to JSONL format and upload for evaluation
import tempfile
import json

# Create a temporary file with JSONL format
with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
    for item in summ_eval['test']:
        json.dump({'text': item['text']}, f)
        f.write('\n')
    temp_file_path = f.name

# Upload the file using together_client
uploaded_file = together_client.files.upload(
    file=temp_file_path,
    purpose='eval'
)

# Clean up the temporary file
os.unlink(temp_file_path)

print(f"Uploaded file: {uploaded_file}")

Uploading file tmpdpd9nmgt.jsonl: 100%|██████████| 213k/213k [00:00<00:00, 266kB/s]


Uploaded file: id='file-06a8ad03-c29e-4550-87ef-51fab736d47e' object=<ObjectType.File: 'file'> created_at=1752957728 type=None purpose=<FilePurpose.Eval: 'eval'> filename='tmpdpd9nmgt.jsonl' bytes=213087 line_count=0 processed=True FileType='jsonl'


## ⚙️ Prompt Configurations

We'll compare the same model but using two different prompts:
- **Prompt A**: An overly simplistic prompt for comparison
- **Prompt B**: A well-structured, detailed prompt - *in this toy evaluation we expect this prompt to win*
- **Judge Model**: Evaluates which summary is better based on our criteria

We use the prompts below to set up the models being evaluated and the judge LLM.

In [11]:
compare_judge_template = """You are an expert judge evaluating the quality of text summaries. Your task is to compare two summaries and determine which one is better.

EVALUATION CRITERIA:
1. **Accuracy & Faithfulness**: Does the summary accurately represent the source text without hallucinations or distortions?
2. **Completeness**: Does the summary capture all key points and main ideas from the source text?
3. **Conciseness**: Is the summary appropriately brief while maintaining essential information?
4. **Clarity & Readability**: Is the summary well-written, coherent, and easy to understand?
5. **Relevance**: Does the summary focus on the most important aspects of the source text?

INSTRUCTIONS:
- Read the source text carefully
- Evaluate both Summary A and Summary B against each criterion
- Consider the overall quality and usefulness of each summary
- Give a brief explanation (2-3 sentences) justifying your choice
"""

In [8]:
summarization_prompt_A = """You are an expert summarizer. 
Your task is to create a concise, accurate summary.
"""

summarization_prompt_B = """You are an expert summarizer. 
Your task is to create a concise, accurate summary.

Please follow these guidelines when creating your summary:
1. Read the entire text carefully to understand the main points
2. Identify the key themes, arguments, and conclusions
3. Write a summary that is approximately 25% of the original length
4. Use clear, concise language and maintain the original tone
5. Include the most important facts, figures, and examples
6. Ensure the summary flows logically from one point to the next
7. Avoid adding your own opinions or interpretations
8. Focus on the author's main message and supporting evidence
"""

## 🏃‍♂️ Running the Prompt Comparison

The evaluation will:
1. Generate summaries using both prompts on the same model
2. Have a judge model compare the quality of outputs
3. Determine which prompt performs better overall

**Key Parameters:**
- Same base model for both configurations
- Different system prompts (simple vs. structured)
- Judge compares outputs on the five criteria in the judge template: accuracy, completeness, conciseness, clarity, and relevance
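
The configurations that follow use `"{{text}}"` as the `input_template`, which suggests that each dataset row's `text` field is substituted into the placeholder before the request is sent. The sketch below illustrates that idea only; `render_template` is a hypothetical helper, and the assumption that the service performs plain double-brace substitution is ours, not something the notebook confirms.

```python
import re


def render_template(template, row):
    """Replace each {{field}} placeholder with the matching value from row.

    Hypothetical illustration; the real templating happens server-side
    and may behave differently.
    """
    def substitute(match):
        field = match.group(1)
        if field not in row:
            raise KeyError(f"row has no field named {field!r}")
        return str(row[field])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Illustrative row shaped like one line of the uploaded JSONL file
row = {"text": "Example article body."}
print(render_template("{{text}}", row))
```

Under this reading, both configurations send identical user messages, so the system prompt is the only variable being tested.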

In [None]:
# Model configurations
MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct-Turbo"

JUDGE_MODEL_NAME = "deepseek-ai/DeepSeek-V3"

model_a_config = {
    "model_name": MODEL_NAME,
    "system_template": summarization_prompt_A,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}

model_b_config = {
    "model_name": MODEL_NAME,
    "system_template": summarization_prompt_B,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}

# Create compare evaluation
evaluation_response = together_client.evaluation.create(
    type="compare",
    input_data_file_path=uploaded_file.id,
    judge_model_name=JUDGE_MODEL_NAME,
    judge_system_template=compare_judge_template,
    model_a=model_a_config,
    model_b=model_b_config
)

print(f"Evaluation ID: {evaluation_response.workflow_id}")
print(f"Status: {evaluation_response.status}")

Evaluation ID: eval-8af6-1752679553
Status: pending


## 📊 Understanding the Results

Once the evaluation has finished running, we can retrieve the results.

The comparison provides key metrics:
- **A_wins**: Times the simple prompt was preferred  
- **B_wins**: Times the structured prompt was preferred
- **Ties**: Cases where both prompts performed equally
- **Fail counts**: Generation or judge errors (`generation_fail_count` and `judge_fail_count`; both should be 0)

A clear winner indicates one prompting approach is more effective.
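
The job starts in the `pending` state, so results are not available immediately. One way to wait is a small polling loop. The sketch below is a generic helper (`wait_until` is hypothetical, not an SDK function) demonstrated against a stand-in for the status call; the `"completed"` status name and the idea that `results` stays empty until the job finishes are assumptions.

```python
import time


def wait_until(fetch, is_finished, poll_seconds=10.0, timeout_seconds=1800.0):
    """Call fetch() repeatedly until is_finished(result) is true.

    Hypothetical helper; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        latest = fetch()
        if is_finished(latest):
            return latest
        if time.monotonic() + poll_seconds > deadline:
            raise TimeoutError("evaluation did not finish before the timeout")
        time.sleep(poll_seconds)


# Stand-in for the real status call; "completed" is an assumed status name
fake_responses = iter([
    {"status": "pending", "results": None},
    {"status": "pending", "results": None},
    {"status": "completed", "results": {"A_wins": 6, "B_wins": 30, "Ties": 64}},
])

final = wait_until(
    fetch=lambda: next(fake_responses),
    is_finished=lambda response: response["results"] is not None,
    poll_seconds=0.01,
    timeout_seconds=5.0,
)
print(final["results"])
```

With the real client, `fetch` could wrap `together_client.evaluation.status(evaluation_response.workflow_id)`, and `is_finished` would inspect whichever field the returned object uses to signal completion.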

In [None]:
import json

status_compare_prompts = together_client.evaluation.status(evaluation_response.workflow_id)

print(json.dumps(status_compare_prompts.results, indent=2))

{
  "A_wins": 6,
  "B_wins": 30,
  "Ties": 64,
  "generation_fail_count": 0,
  "judge_fail_count": 0,
  "result_file_id": "file-30124ed1-a78b-4a82-968a-09bcbcf1ec09"
}


## 🔑 Key Findings

**Prompt Optimization Results:**
- **Structured Prompt (B)** was preferred over Simple Prompt (A) in **30 comparisons vs 6**
- **High tie rate** of **64%** (64 of 100 articles) suggests that for most articles the judge saw little difference between the two prompts
- **Winner**: Among the 36 decisive comparisons, the structured prompt won about 83% of the time

**Insights:**
- Adding the 8 specific guidelines improved the judged summary quality whenever the judge expressed a preference
- The 5:1 win ratio (30 vs 6) favors the structured prompt, although it rests on only 36 decisive comparisons from a single run
- 64% ties indicate both approaches often produce comparable results; the structured prompt's advantage shows up in the remaining 36% of articles
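
To put the win counts in context, the sketch below computes the tie rate, Prompt B's share of decisive comparisons, and a two-sided sign test on the non-tied results. The sign test is our addition rather than something the Evaluations API reports; it drops ties and treats each comparison as independent, so read the p-value as a rough indicator only.

```python
from math import comb


def summarize_comparison(a_wins, b_wins, ties):
    """Summarize A/B judge results with rates and a two-sided sign test."""
    total = a_wins + b_wins + ties
    decisive = a_wins + b_wins  # ties carry no preference, so they are dropped
    smaller = min(a_wins, b_wins)
    # P(X <= smaller) for X ~ Binomial(decisive, 0.5), doubled for two sides
    tail = sum(comb(decisive, k) for k in range(smaller + 1)) / 2 ** decisive
    return {
        "total": total,
        "decisive": decisive,
        "tie_rate": ties / total if total else 0.0,
        "b_share_of_decisive": b_wins / decisive if decisive else 0.0,
        "sign_test_p_value": min(1.0, 2 * tail),
    }


# Counts taken from the results shown above
summary = summarize_comparison(a_wins=6, b_wins=30, ties=64)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A small p-value here says only that a 30 to 6 split is unlikely if the judge had no real preference; it does not account for judge biases such as sensitivity to summary length or position.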