# Model Comparison on Summarization Tasks
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/togethercomputer/together-cookbook/blob/main/Evals/Compare_Evals.ipynb)

<img src="../images/compare_eval.png" width="750">

## Introduction

This notebook demonstrates how to compare two language models on a summarization task using the Together AI Evaluations API. We'll:

1. Load the SummEval dataset containing news articles to summarize
2. Configure two models for comparison
3. Use a judge model to evaluate which summaries are better
4. Analyze the head-to-head comparison results

You can also find out more about the Evaluations API in the [docs](https://docs.together.ai/docs/ai-evaluations)!

The full list of supported models can be found [here](https://docs.together.ai/docs/evaluations-supported-models).


**Concepts Covered:**
- **LLM-as-a-Judge**: Using a language model to evaluate and compare outputs from other models
- **Compare Evaluation**: Head-to-head comparison between two models to determine which performs better
- **Summarization Evaluation**: Assessing summary quality across multiple criteria (accuracy, completeness, clarity)

In [None]:
# setup and installation
!pip install -qU together datasets

In [None]:
import together

# The client reads the API key from the TOGETHER_API_KEY environment variable
together_client = together.Client()

#### Suppose we want to compare how two models perform on a specific task, in this case summarization. We will use the SummEval dataset, which contains news articles for the models to summarize.

## 📊 Understanding the SummEval Dataset

The SummEval dataset contains news articles paired with both human and machine-generated summaries, along with quality ratings across multiple dimensions like relevance, coherence, fluency, and consistency.

**Dataset Structure:**
- `text`: The original news article to be summarized
- `machine_summaries`: Various automated summaries
- `human_summaries`: Human-written reference summaries
- Quality ratings across multiple evaluation criteria

For our evaluation, we'll focus on the original articles and generate new summaries using our target models.

In [5]:
from datasets import load_dataset
summ_eval = load_dataset("mteb/summeval")

summ_eval

DatasetDict({
    test: Dataset({
        features: ['machine_summaries', 'human_summaries', 'relevance', 'coherence', 'fluency', 'consistency', 'text', 'id'],
        num_rows: 100
    })
})

In [6]:
summ_eval['test'].to_pandas().head()

| | machine_summaries | human_summaries | relevance | coherence | fluency | consistency | text | id |
|---|---|---|---|---|---|---|---|---|
| 0 | [donald sterling , nba team last year . sterli... | [V. Stiviano must pay back $2.6 million in gif... | [1.6666666666666667, 1.6666666666666667, 2.333... | [1.3333333333333333, 3.0, 1.0, 2.6666666666666... | [1.0, 4.666666666666667, 4.333333333333333, 4.... | [1.0, 2.3333333333333335, 4.666666666666667, 5... | (CNN)Donald Sterling's racist remarks cost him... | cnn-test-404f859482d47c127868964a9a39d1a7645dd2e9 |
| 1 | [north pacific gray whale has earned a spot in... | [The whale, Varvara, swam a round trip from Ru... | [2.3333333333333335, 4.666666666666667, 3.6666... | [1.3333333333333333, 4.666666666666667, 3.6666... | [1.0, 5.0, 4.666666666666667, 3.66666666666666... | [1.3333333333333333, 5.0, 5.0, 4.3333333333333... | (CNN)A North Pacific gray whale has earned a s... | cnn-test-4761dc6d8bdf56b9ada97104113dd1bcf4aed3f1 |
| 2 | [russian fighter jet intercepted a u.s. reconn... | [The incident occurred on April 7 north of Pol... | [4.0, 4.0, 4.0, 3.3333333333333335, 3.33333333... | [3.3333333333333335, 4.333333333333333, 1.6666... | [3.6666666666666665, 4.333333333333333, 5.0, 4... | [5.0, 5.0, 4.666666666666667, 5.0, 5.0, 5.0, 5... | (CNN)After a Russian fighter jet intercepted a... | cnn-test-5139ccfabee55ddb83e7937f5802c0a67aee8975 |
| 3 | [michael barnett captured the fire on intersta... | [Country band Lady Antebellum's bus caught fir... | [2.0, 3.0, 2.6666666666666665, 3.3333333333333... | [2.0, 3.0, 2.6666666666666665, 3.3333333333333... | [2.6666666666666665, 5.0, 5.0, 5.0, 5.0, 5.0, ... | [2.3333333333333335, 5.0, 5.0, 5.0, 5.0, 5.0, ... | (CNN)Lady Antebellum singer Hillary Scott's to... | cnn-test-88c2481234e763c9bbc68d0ab1be1d2375c1349a |
| 4 | [deep reddish color caught seattle native tim ... | [Smoke from massive fires in Siberia created f... | [1.6666666666666667, 3.6666666666666665, 3.333... | [1.6666666666666667, 3.6666666666666665, 1.666... | [5.0, 5.0, 5.0, 5.0, 4.666666666666667, 5.0, 5... | [2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ... | (CNN)A fiery sunset greeted people in Washingt... | cnn-test-a02e362c5b8f049848ce718b37b96117485461cf |


We are only interested in the `text` column from this dataset.

In [8]:
print(summ_eval['test'][0]['text'])

(CNN)Donald Sterling's racist remarks cost him an NBA team last year. But now it's his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling's wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple's money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O'Donnell said in a statemen

## 🔄 Preparing Data for Evaluation

Before running our comparison, we need to convert the dataset to JSONL format and upload it to the Together AI platform.

The evaluation service requires:
- JSONL format with consistent fields across all examples
- Upload with `purpose="eval"` to indicate evaluation usage
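
Before uploading, it can help to confirm locally that every line of the JSONL file parses and exposes the same fields. The helper below is a minimal sketch of such a check; `check_jsonl_fields` is a hypothetical name introduced here for illustration and is not part of the Together SDK.

```python
import json


def check_jsonl_fields(path):
    """Return the set of field names shared by every record in a JSONL file.

    Hypothetical helper (not part of the Together SDK). Raises ValueError if a
    line is not valid JSON, is not a JSON object, or has different fields.
    """
    expected = None
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # json.JSONDecodeError is a ValueError
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            fields = set(record)
            if expected is None:
                expected = fields  # the first record defines the schema
            elif fields != expected:
                raise ValueError(
                    f"line {line_number}: fields {sorted(fields)} "
                    f"differ from {sorted(expected)}"
                )
    return expected
```

For the file built in this notebook, we would expect the check to return a single field, `text`.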

In [None]:
# Convert the 'text' column to JSONL format and upload for evaluation
import tempfile
import os
import json

# Create a temporary file with JSONL format
with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False) as f:
    for item in summ_eval['test']:
        json.dump({'text': item['text']}, f)
        f.write('\n')
    temp_file_path = f.name

# Upload the file using together_client
uploaded_file = together_client.files.upload(
    file=temp_file_path,
    purpose='eval'
)

# Clean up the temporary file
os.unlink(temp_file_path)

print(f"Uploaded file: {uploaded_file}")

Uploading file tmpgi1edogk.jsonl: 100%|██████████| 213k/213k [00:01<00:00, 149kB/s]


Uploaded file: id='file-a691355a-07e8-4543-8fca-0630f5a06bee' object=<ObjectType.File: 'file'> created_at=1752955140 type=None purpose=<FilePurpose.Eval: 'eval'> filename='tmpgi1edogk.jsonl' bytes=213087 line_count=0 processed=True FileType='jsonl'


## ⚙️ Model Configuration

We'll compare two models on the summarization task:
- **Model A**: `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`
- **Model B**: `Qwen/Qwen2.5-72B-Instruct-Turbo`
- **Judge Model**: `deepseek-ai/DeepSeek-V3`, which evaluates which summary is better based on our criteria

The judge will assess summaries across multiple dimensions including accuracy, completeness, conciseness, clarity, and relevance.

We use the prompts below to set up the models being evaluated and the judge LLM.

In [21]:
summarization_generation_template = """You are an expert summarizer. 
Your task is to create a concise, accurate summary.

INSTRUCTIONS:
- Read the text carefully
- Extract the main points and key information
- Write 2-3 clear, focused sentences
- Prioritize the most important aspects
"""

compare_judge_template = """You are an expert judge evaluating the quality of text summaries. Your task is to compare two summaries and determine which one is better.

EVALUATION CRITERIA:
1. **Accuracy & Faithfulness**: Does the summary accurately represent the source text without hallucinations or distortions?
2. **Completeness**: Does the summary capture all key points and main ideas from the source text?
3. **Conciseness**: Is the summary appropriately brief while maintaining essential information?
4. **Clarity & Readability**: Is the summary well-written, coherent, and easy to understand?
5. **Relevance**: Does the summary focus on the most important aspects of the source text?

INSTRUCTIONS:
- Read the source text carefully
- Evaluate both Summary A and Summary B against each criterion
- Consider the overall quality and usefulness of each summary
- Give a brief explanation (2-3 sentences) justifying your choice
"""

In [22]:
MODEL_A_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
MODEL_B_NAME = "Qwen/Qwen2.5-72B-Instruct-Turbo"

JUDGE_MODEL_NAME = "deepseek-ai/DeepSeek-V3"

# Model configurations
model_a_config = {
    "model_name": MODEL_A_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}

model_b_config = {
    "model_name": MODEL_B_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}
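
The `input_template` above refers to the dataset through the `{{text}}` placeholder. As a rough local sanity check, we can verify that every placeholder in a template matches a field in our records. This sketch assumes placeholders are written as double-brace field names, as in this notebook; the rendering itself happens on the service side, and `template_fields` and `missing_fields` are hypothetical helpers.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_fields(template):
    """Return the field names referenced as {{name}} in a template string."""
    return set(PLACEHOLDER.findall(template))


def missing_fields(template, record):
    """Return placeholders that have no matching key in a dataset record."""
    return template_fields(template) - set(record)


example_record = {"text": "An article to summarize."}
print(missing_fields("{{text}}", example_record))  # prints set()
```

An empty set means every placeholder can be filled from the record.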

## 🏃‍♂️ Running the Comparison Evaluation

The `compare` evaluation type performs a head-to-head comparison:

1. **Two-pass evaluation**: Each pair of summaries is judged twice, with the positions of the two summaries swapped on the second pass, to reduce position bias
2. **Judge assessment**: The judge model evaluates both outputs and determines the winner
3. **Detailed feedback**: The judge provides its reasoning for each decision

**Key Parameters:**
- `type`: Set to `"compare"` for head-to-head evaluation
- `model_a` / `model_b`: Configurations for the two models being compared
- `judge_model_name`: The model that will make the comparison decisions
- `judge_system_template`: Detailed criteria for evaluation

In [None]:
# Create compare evaluation
evaluation_response = together_client.evaluation.create(
    type="compare",
    input_data_file_path=uploaded_file.id,
    judge_model_name=JUDGE_MODEL_NAME,
    judge_system_template=compare_judge_template,
    model_a=model_a_config,
    model_b=model_b_config
)

print(f"Evaluation ID: {evaluation_response.workflow_id}")
print(f"Status: {evaluation_response.status}")

Evaluation ID: eval-2f8c-1752678421
Status: pending
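
The job starts out as `pending`, so the results are not available right away. One way to wait is a simple polling loop. The sketch below is deliberately generic: `wait_until_done` is a hypothetical helper that takes any function returning the current status, plus the set of status values we treat as final. The names of the final states are not shown in this notebook, so treat them as assumptions and check the docs.

```python
import time


def wait_until_done(fetch_status, terminal_states, poll_seconds=10.0,
                    timeout_seconds=3600.0):
    """Call fetch_status() until it returns a terminal state, then return it.

    Hypothetical helper: fetch_status is any zero-argument callable returning
    the current status, and terminal_states is the set of values considered
    final. Raises TimeoutError if no terminal state is reached in time.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)
```

In this notebook, `fetch_status` could wrap `together_client.evaluation.status(evaluation_response.workflow_id)` and return its status value, assuming the status response exposes one in the same way the creation response does.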


## 📊 Understanding Comparison Results

Once the evaluation is completed we can examine the results.

The evaluation provides several key metrics:
- **A_wins**: Number of times Model A was preferred
- **B_wins**: Number of times Model B was preferred  
- **Ties**: Number of cases where both models performed equally
- **Fail counts**: Generation or judge failures (should be 0 for successful runs)


Each result includes both the original and the flipped evaluation, which reduces the effect of the judge's position bias on the final decision.

### Two-Pass Evaluation Process
1. **First pass**: The judge sees Model A's summary first, then Model B's
2. **Second pass**: The judge sees Model B's summary first, then Model A's
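
To make the idea concrete, here is one plausible way two passes could be combined into a single decision. This is only a sketch of a possible rule and not necessarily what the Evaluations API does: a model wins only when both passes agree, and any disagreement counts as a tie. The labels `"A"`, `"B"`, and `"Tie"` are assumed for illustration.

```python
def combine_passes(original_winner, flipped_winner):
    """Combine two judge verdicts, each given as 'A', 'B', or 'Tie'.

    Assumed rule for illustration only. Both verdicts are already mapped back
    to model names, so 'A' always means Model A regardless of the order in
    which the judge saw the summaries.
    """
    if original_winner == flipped_winner:
        return original_winner
    return "Tie"  # the two passes disagree, so there is no clear winner


print(combine_passes("A", "A"))  # prints A
print(combine_passes("A", "B"))  # prints Tie
```

Under a rule like this, a judge that simply prefers whichever summary it sees first would produce ties instead of wins.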

In the run shown below, Model B won 28 comparisons, Model A won 21, and the judge declared 51 ties.

In [None]:
import json

status_compare_models = together_client.evaluation.status(evaluation_response.workflow_id)

print(json.dumps(status_compare_models.results, indent=2))

{
  "A_wins": 21,
  "B_wins": 28,
  "Ties": 51,
  "generation_fail_count": 0,
  "judge_fail_count": 0,
  "result_file_id": "file-e4054d52-a503-4260-893e-7c2b117ba20c"
}


## 🔍 Examining Detailed Results

Each evaluation result contains:
- **Original input**: The text that was summarized
- **Model outputs**: Summaries from both Model A and Model B
- **Judge decisions**: Both original and flipped evaluation results
- **Final decision**: The conclusive winner after bias elimination

The `final_decision` field shows the judge's ultimate verdict after considering both evaluation passes.
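
Once the results file has been downloaded, we can tally the `final_decision` field ourselves and compare the counts with the summary metrics. This is a minimal sketch: `tally_decisions` is a hypothetical helper, and it makes no assumption about which values the field contains, it simply counts whatever it finds.

```python
import json
from collections import Counter


def tally_decisions(path, field="final_decision"):
    """Count the values of one field across all records in a JSONL file.

    Hypothetical helper. Records that lack the field are counted under the
    key "None".
    """
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                counts[str(row.get(field))] += 1
    return counts
```

Calling it with the path of the downloaded results file returns a `Counter` that can be checked against `A_wins`, `B_wins`, and `Ties`.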

In [14]:
COMPARE_MODELS_FILE = "./summary_bench_results_a.jsonl"

compare_models_file_id = status_compare_models.results['result_file_id']
together_client.files.retrieve_content(compare_models_file_id, output=COMPARE_MODELS_FILE)

Downloading file summary_bench_results_a.jsonl: 100%|██████████| 321k/321k [00:00<00:00, 2.94MB/s]


FileObject(object='local', id='file-e4054d52-a503-4260-893e-7c2b117ba20c', filename='/Users/zain/Documents/Projects/together-cookbook/Evals/summary_bench_results_a.jsonl', size=321499)

In [15]:
# Print first 3 lines of the comparison results file
with open(COMPARE_MODELS_FILE, 'r') as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        print(f"Line {i+1}:")
        data = json.loads(line.strip())
        for key, value in data.items():
            print(f"  {key}: {value}")
        print()


Line 1:
  text: Liverpool vice-captain Jordan Henderson thinks his side could catch Manchester City in the Barclays Premier League having fought through a tough and long season at Anfield. Henderson and Liverpool goalkeeper Simon Mignolet both played their 47th game of season in the 2-0 win over Newcastle United on Monday night, equalling the record for appearances by any player in the top five European leagues so far this campaign. But the England midfielder believes that after finding winning form again following poor results against Manchester United and Arsenal, Liverpool can pile the pressure on to City who sit four points above them in the race for the Champions League. Liverpool vice-captain Jordan Henderson thinks his side could catch Manchester City in the  Premier League Henderson played his 47th game of season in the 2-0 win over Newcastle United on Monday night Manchester City have been faltering and lost 4-2 at Manchester United, Liverpool are four points behind 'We knew i

In [26]:
# Calculate and display final results
total_comparisons = status_compare_models.results['A_wins'] + status_compare_models.results['B_wins'] + status_compare_models.results['Ties']
a_wins = status_compare_models.results['A_wins']
b_wins = status_compare_models.results['B_wins']  
ties = status_compare_models.results['Ties']

print("=== FINAL COMPARISON RESULTS ===")
print(f"Total Evaluations: {total_comparisons}")
print(f"Model A - {MODEL_A_NAME} Wins: {a_wins} ({a_wins/total_comparisons*100:.1f}%)")
print(f"Model B - {MODEL_B_NAME} Wins: {b_wins} ({b_wins/total_comparisons*100:.1f}%)")
print(f"Ties: {ties} ({ties/total_comparisons*100:.1f}%)")
print()

if b_wins > a_wins:
    winner = f"Model B - {MODEL_B_NAME}"
    margin = b_wins - a_wins
elif a_wins > b_wins:
    winner = f"Model A - {MODEL_A_NAME}" 
    margin = a_wins - b_wins
else:
    winner = "Tie"
    margin = 0

if winner != "Tie":
    print(f"🏆 Winner: {winner} by {margin} evaluations")
else:
    print("🤝 Overall tie between models")

=== FINAL COMPARISON RESULTS ===
Total Evaluations: 100
Model A - meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo Wins: 21 (21.0%)
Model B - Qwen/Qwen2.5-72B-Instruct-Turbo Wins: 28 (28.0%)
Ties: 51 (51.0%)

🏆 Winner: Model B - Qwen/Qwen2.5-72B-Instruct-Turbo by 7 evaluations


## 🔑 Key Findings

**Performance Summary:**
- **Model B** outperformed Model A with **28 wins vs 21 wins** (a 7-win margin over 100 comparisons)
- **High tie rate** of **51%** suggests both models often produce comparable summaries
- **No failures** in generation or judging means every example was summarized and judged, so the counts cover all 100 articles

**Insights:**
- The close competition (28 vs 21) suggests both models have similar summarization capabilities
- The high tie percentage (51%) indicates that for many articles, both models produced summaries of equivalent quality
- Model B's slight edge may be due to better handling of specific article types or summary characteristics
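
To get a rough sense of whether a 28 vs 21 split could be down to chance, we can run a sign test: drop the ties and ask how likely a split at least this uneven would be if each model were equally likely to win a decided comparison. This is a sketch under that simplifying assumption (it also treats comparisons as independent), not part of the Evaluations API; `sign_test_p_value` is a hypothetical helper.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial (sign) test on decided comparisons.

    Hypothetical helper. Ties are excluded, and the null hypothesis is that
    each model wins a decided comparison with probability 0.5.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


p_value = sign_test_p_value(21, 28)
print(f"p-value: {p_value:.3f}")
```

For this run the p-value is well above the usual 0.05 threshold, which is consistent with reading the result as a close contest and not a decisive win.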