# Scoring Summaries with LLMs: BLEU and ROUGE in Action

## Evaluating LLMs in Practice — Part 2 of 7

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mekjr1/evaluating_llms_in_practice/blob/master/part-2-bleu_rouge_deep_dive/bleu_rouge_deep_dive.ipynb?hl=en#runtime_type=gpu)

📌 **Recap from Part 1**

In Part 1, we saw why accuracy and F1 don't work for LLMs and introduced a range of metrics. This time, we're going deeper into two of the most widely used ones: **BLEU** and **ROUGE**.

We'll look at how they work, the math behind them, and run experiments to see where they succeed and fail.

🧠 **Why BLEU and ROUGE?**

Both metrics were created in the early 2000s to address the problem of evaluating machine-generated text without relying only on human judgments.

- **BLEU (2002)** → Machine translation. Measures **precision**: how much of the candidate output matches the reference text.
- **ROUGE (2004)** → Summarization. Measures **recall**: how much of the reference text is captured in the candidate output.

Both rely on **n-gram overlap** (matching sequences of 1, 2, 3, or 4 words).
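
To make "n-gram overlap" concrete, here is a minimal sketch of n-gram extraction in plain Python. The helper name `ngrams` is our own, chosen for illustration; it is not part of NLTK or `rouge_score`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

# Unigram counts: "the" occurs twice, every other word once
unigram_counts = Counter(ngrams(tokens, 1))
print(unigram_counts)

# Bigrams: a 6-token sentence has 5 of them
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Both metrics compare these n-gram counts between candidate and reference; they differ mainly in which side they normalize by.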

In [None]:
!pip install transformers datasets evaluate rouge-score nltk matplotlib pandas seaborn

## 📐 How BLEU Works

The BLEU score is based on two parts:

**1. Modified n-gram precision:**

$$P_n = \frac{\sum_{\text{ngram} \in \text{candidate}} \min(\text{count}_{\text{cand}}, \text{count}_{\text{ref}})}{\sum_{\text{ngram} \in \text{candidate}} \text{count}_{\text{cand}}}$$

**2. Brevity Penalty (BP):**

$$BP = \begin{cases} 
1 & \text{if } \text{length}_{\text{cand}} > \text{length}_{\text{ref}} \\
e^{(1-\frac{\text{length}_{\text{ref}}}{\text{length}_{\text{cand}}})} & \text{otherwise}
\end{cases}$$

**Final BLEU:**

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \cdot \log P_n\right)$$

Usually $N=4$ and weights $w_n = \frac{1}{4}$.
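
The three formulas above can be sketched directly in plain Python. This is a simplified illustration under stated assumptions: a single reference, whitespace tokens, uniform weights, and no smoothing. The function names are ours, and library implementations such as NLTK differ in tokenization and smoothing details.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(ref, cand, n):
    """P_n: candidate n-gram counts clipped by reference counts."""
    cand_counts = ngram_counts(cand, n)
    ref_counts = ngram_counts(ref, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(ref_len, cand_len):
    """BP: 1 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if cand_len > ref_len:
        return 1.0
    if cand_len == 0:
        return 0.0
    return math.exp(1 - ref_len / cand_len)

def bleu(ref, cand, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1/max_n."""
    precisions = [modified_precision(ref, cand, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined, so unsmoothed BLEU collapses to zero
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(ref), len(cand)) * math.exp(log_avg)

ref = "the cat sat on the mat".split()
cand = "the cat sat on the rug".split()

score = bleu(ref, cand)
print(f"{score:.3f}")  # 0.760
```

Here the four precisions are 5/6, 4/5, 3/4, and 2/3; their product is 1/3, and the fourth root of that is about 0.76. The brevity penalty is 1 because both sentences have six tokens.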

## 📐 How ROUGE Works

ROUGE is simpler and comes in several flavors:

**ROUGE-N (recall):**

$$ROUGE\_N = \frac{\sum_{\text{ngram} \in \text{reference}} \min(\text{count}_{\text{cand}}, \text{count}_{\text{ref}})}{\sum_{\text{ngram} \in \text{reference}} \text{count}_{\text{ref}}}$$
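
As a rough sketch of the formula above, ROUGE-N recall can be computed with two counters. The function name `rouge_n_recall` is our own, and this version assumes whitespace tokens with no stemming, so it will not always match `rouge_score` exactly.

```python
from collections import Counter

def rouge_n_recall(ref, cand, n):
    """Share of reference n-grams that also appear in the candidate (clipped counts)."""
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

ref = "the cat sat on the mat".split()
cand = "the cat sat on the rug".split()

rouge1 = rouge_n_recall(ref, cand, 1)  # 5 of 6 reference unigrams recovered
rouge2 = rouge_n_recall(ref, cand, 2)  # 4 of 5 reference bigrams recovered
print(f"ROUGE-1 recall: {rouge1:.3f}, ROUGE-2 recall: {rouge2:.3f}")
```

Note that the denominator counts reference n-grams, whereas BLEU's precision divides by candidate n-grams.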

**ROUGE-L (Longest Common Subsequence):** Measures the longest sequence of words that appears in both candidate and reference in the same order, though not necessarily contiguously.
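
A minimal sketch of the longest-common-subsequence length that ROUGE-L builds on, using the standard dynamic-programming recurrence. The function name `lcs_length` is ours; dividing the result by the reference length gives an LCS-based recall, and dividing by the candidate length gives a precision.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match diagonally
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on either side
        prev = curr
    return prev[-1]

ref = "the cat sat on the mat".split()
cand = "the cat sat on the rug".split()

lcs = lcs_length(ref, cand)
print(lcs, lcs / len(ref))  # LCS length and LCS-based recall
```

Unlike ROUGE-2, the matched words do not have to be adjacent, only in the same relative order.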

**ROUGE-S (Skip-bigrams):** Measures overlapping pairs of words that appear in the same order, allowing gaps between them.
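
Skip-bigrams are simply all ordered word pairs in a sentence. The sketch below assumes an unlimited skip distance and reports recall only; the helper names are ours, and real ROUGE-S implementations often cap the allowed gap.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """All ordered pairs (tokens[i], tokens[j]) with i < j, counted as a multiset."""
    return Counter(combinations(tokens, 2))

def rouge_s_recall(ref, cand):
    """Share of reference skip-bigrams that also occur in the candidate."""
    ref_pairs, cand_pairs = skip_bigrams(ref), skip_bigrams(cand)
    total = sum(ref_pairs.values())
    return sum((ref_pairs & cand_pairs).values()) / total if total else 0.0

ref = "the cat sat on the mat".split()
cand = "the cat sat on the rug".split()

recall_s = rouge_s_recall(ref, cand)
print(f"{recall_s:.3f}")  # 10 of the 15 reference pairs survive
```

The five lost pairs are exactly those ending in "mat", the one word the candidate replaced.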

👉 In practice, **ROUGE-1**, **ROUGE-2**, and **ROUGE-L** are the most common.

## 📝 Toy Example

Let's start with a simple example to understand the concepts:

**Reference:** "The cat sat on the mat."

**Candidate 1:** "The cat sat on the rug."

**Candidate 2:** "A small cat rested quietly."

Let's see how BLEU and ROUGE score these candidates:

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import nltk
nltk.download('punkt', quiet=True)

# Our toy examples
ref_text = "the cat sat on the mat"
cand1_text = "the cat sat on the rug"
cand2_text = "a small cat rested quietly"

# Tokenize for BLEU (expects lists of tokens)
reference = [ref_text.split()]  # BLEU expects list of reference lists
candidate1 = cand1_text.split()
candidate2 = cand2_text.split()

print("📚 Our Examples:")
print("Reference:", reference[0])
print("Candidate 1:", candidate1) 
print("Candidate 2:", candidate2)

In [None]:
# Calculate BLEU scores with smoothing (to handle zero n-grams)
smooth = SmoothingFunction().method1

bleu1_cand1 = sentence_bleu(reference, candidate1, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu1_cand2 = sentence_bleu(reference, candidate2, weights=(1, 0, 0, 0), smoothing_function=smooth)

bleu4_cand1 = sentence_bleu(reference, candidate1, smoothing_function=smooth)  # Default: 4-gram
bleu4_cand2 = sentence_bleu(reference, candidate2, smoothing_function=smooth)

print("🔵 BLEU Scores:")
print(f"Candidate 1 - BLEU-1: {bleu1_cand1:.3f}, BLEU-4: {bleu4_cand1:.3f}")
print(f"Candidate 2 - BLEU-1: {bleu1_cand2:.3f}, BLEU-4: {bleu4_cand2:.3f}")

In [None]:
# Calculate ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

rouge_cand1 = scorer.score(ref_text, cand1_text)
rouge_cand2 = scorer.score(ref_text, cand2_text)

print("🔴 ROUGE Scores:")
print(f"Candidate 1:")
print(f"  ROUGE-1: {rouge_cand1['rouge1'].fmeasure:.3f}")
print(f"  ROUGE-2: {rouge_cand1['rouge2'].fmeasure:.3f}")
print(f"  ROUGE-L: {rouge_cand1['rougeL'].fmeasure:.3f}")

print(f"Candidate 2:")
print(f"  ROUGE-1: {rouge_cand2['rouge1'].fmeasure:.3f}")
print(f"  ROUGE-2: {rouge_cand2['rouge2'].fmeasure:.3f}")
print(f"  ROUGE-L: {rouge_cand2['rougeL'].fmeasure:.3f}")

### 💡 Observation

**Candidate 1**: BLEU-1 ≈ 0.83, BLEU-4 ≈ 0.76, ROUGE-1 ≈ 0.83 → looks good (five of six words match).

**Candidate 2**: BLEU-1 ≈ 0.16, ROUGE-1 ≈ 0.18 → looks bad, even though it is a plausible description of a similar scene in different words.

👉 This shows why BLEU/ROUGE can be misleading - they miss semantic similarity!
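
The ROUGE cell above prints `fmeasure`, which blends precision and recall. As a rough sketch of where Candidate 2's low score comes from, the unigram precision, recall, and F1 can be recomputed by hand. The function name is ours, and stemming is skipped, which is an assumption that happens not to matter for these two sentences.

```python
from collections import Counter

def unigram_prf(ref, cand):
    """Unigram precision, recall and F1 from clipped token overlap."""
    overlap = sum((Counter(ref) & Counter(cand)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

ref = "the cat sat on the mat".split()
cand2 = "a small cat rested quietly".split()

p, r, f = unigram_prf(ref, cand2)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Only "cat" is shared, so precision is 1/5, recall is 1/6, and F1 is 2/11, or roughly 0.18.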

## 🔬 Step-by-Step Manual Calculation

Let's manually calculate BLEU for "The cat sat on the rug" vs "The cat sat on the mat" to understand the internals:

In [None]:
import math

def manual_bleu_calculation(reference_tokens, candidate_tokens):
    """Manually calculate BLEU-1 (unigram precision × brevity penalty) to show the internals"""
    
    print("📊 Manual BLEU Calculation")
    print(f"Reference: {reference_tokens}")
    print(f"Candidate: {candidate_tokens}")
    print()
    
    # 1-gram precision
    ref_1grams = reference_tokens
    cand_1grams = candidate_tokens
    
    # Count reference tokens so each one can be matched at most once (clipping)
    matches_1 = 0
    ref_counts = {}
    for token in ref_1grams:
        ref_counts[token] = ref_counts.get(token, 0) + 1
    
    print("1-gram analysis:")
    for token in cand_1grams:
        if token in ref_counts and ref_counts[token] > 0:
            matches_1 += 1
            ref_counts[token] -= 1  # use up one reference occurrence
            print(f"  '{token}' → MATCH ✓")
        else:
            print(f"  '{token}' → NO MATCH ✗")
    
    p1 = matches_1 / len(cand_1grams)
    print(f"\n1-gram precision: {matches_1}/{len(cand_1grams)} = {p1:.3f}")
    
    # Brevity penalty
    ref_len = len(reference_tokens)
    cand_len = len(candidate_tokens)
    
    if cand_len > ref_len:
        bp = 1.0
    else:
        bp = math.exp(1 - ref_len/cand_len)
    
    print("\nBrevity Penalty:")
    print(f"  Reference length: {ref_len}")
    print(f"  Candidate length: {cand_len}")
    print(f"  BP = {bp:.3f}")
    
    # Simple BLEU-1 score
    bleu_1 = bp * p1
    print(f"\n🎯 BLEU-1 = BP × P1 = {bp:.3f} × {p1:.3f} = {bleu_1:.3f}")
    
    return bleu_1

# Manual calculation for our example
manual_score = manual_bleu_calculation(
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "on", "the", "rug"]
)
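
The `ref_counts[token] -= 1` step in the function above is what makes the precision "modified": each reference word can be matched only as often as it occurs. A small sketch with a degenerate candidate (our own illustrative input, not from the dataset) shows why this clipping matters.

```python
from collections import Counter

ref_demo = "the cat sat on the mat".split()
degenerate = ["the"] * 6  # a candidate that just repeats one common word

ref_counts = Counter(ref_demo)
cand_counts = Counter(degenerate)

# Naive precision: every "the" counts as a hit because the word exists in the reference
unclipped = sum(1 for token in degenerate if token in ref_counts) / len(degenerate)

# Modified precision: "the" is credited at most twice, its count in the reference
clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items()) / len(degenerate)

print(f"unclipped={unclipped:.3f} clipped={clipped:.3f}")
```

Without clipping the nonsense candidate would get a perfect unigram precision; with clipping it drops to 2/6.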

## 🔬 Experiment 1: Comparing Hugging Face Models

Let's scale up to real summarization models and see how they perform on actual news articles.

In [None]:
from transformers import pipeline
from datasets import load_dataset
import evaluate
import pandas as pd

# Load a small sample from CNN/DailyMail dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:3]")

# The models we'll compare
models = {
    "DistilBART": "sshleifer/distilbart-cnn-12-6",
    "BART-Base": "facebook/bart-base",
    "T5-Small": "t5-small"
}

print("📊 Models to evaluate:", list(models.keys()))
print("📰 Articles to test:", len(dataset))
print("\n📝 Sample article (first 200 chars):")
print(dataset[0]["article"][:200] + "...")
print("\n🎯 Reference summary:")
print(dataset[0]["highlights"])

In [None]:
# Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Store results for all models
all_results = []

for model_name, model_id in models.items():
    print(f"\n🔄 Loading {model_name}...")
    summarizer = pipeline("summarization", model=model_id)
    
    model_results = {"Model": model_name}
    bleu_scores = []
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []
    
    for i, item in enumerate(dataset):
        article = item["article"]
        reference = item["highlights"]
        
        # Generate summary (truncate articles that exceed the model's maximum input length)
        summary = summarizer(article, max_length=60, min_length=20, do_sample=False, truncation=True)[0]["summary_text"]
        
        # Calculate metrics
        bleu_score = bleu.compute(predictions=[summary], references=[[reference]])["bleu"]
        rouge_score = rouge.compute(predictions=[summary], references=[reference])
        
        bleu_scores.append(bleu_score)
        rouge1_scores.append(rouge_score["rouge1"])
        rouge2_scores.append(rouge_score["rouge2"])
        rougeL_scores.append(rouge_score["rougeL"])
        
        print(f"  📄 Article {i+1}: BLEU={bleu_score:.3f}, ROUGE-1={rouge_score['rouge1']:.3f}")
        print(f"     Summary: {summary[:80]}...")
    
    # Average scores
    model_results.update({
        "BLEU": sum(bleu_scores) / len(bleu_scores),
        "ROUGE-1": sum(rouge1_scores) / len(rouge1_scores),
        "ROUGE-2": sum(rouge2_scores) / len(rouge2_scores), 
        "ROUGE-L": sum(rougeL_scores) / len(rougeL_scores)
    })
    
    all_results.append(model_results)

# Create results table
results_df = pd.DataFrame(all_results)
print("\n📊 Final Results:")
print(results_df.round(3))

### 🏆 Model Comparison Analysis

| Model | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | Notes |
|-------|------|--------|--------|---------|-------|
| DistilBART | ~0.28 | ~0.53 | ~0.38 | ~0.47 | Summarization tuned → best |
| BART-Base | ~0.06 | ~0.23 | ~0.15 | ~0.18 | Not fine-tuned for summarization |
| T5-Small | ~0.19 | ~0.42 | ~0.29 | ~0.36 | Balanced |

*Note: These figures are illustrative. With only three test articles, actual scores depend heavily on which articles are sampled.*

## 🔬 Experiment 2: Short vs Long Summaries

BLEU and ROUGE behave differently if one model outputs short concise text while another outputs long verbose summaries. Let's test this hypothesis:

In [None]:
# Create synthetic examples with different lengths
reference_summary = "The president met European leaders in Paris to discuss climate policies."

# Candidate summaries of different lengths
short_summary = "The president met leaders in Paris."
medium_summary = "The president met with European leaders in Paris to discuss policies."  
long_summary = "The president of the country met with European leaders in Paris on Tuesday to discuss ongoing climate and economic policies and future cooperation."

candidates = {
    "Short (6 words)": short_summary,
    "Medium (11 words)": medium_summary, 
    "Long (23 words)": long_summary
}

print("🎯 Reference:", reference_summary)
print(f"   Length: {len(reference_summary.split())} words\n")

length_results = []

for name, candidate in candidates.items():
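    # BLEU score (lowercase and drop the trailing period so "Paris." can match "Paris")
    ref_tokens = [reference_summary.lower().rstrip(".").split()]
    cand_tokens = candidate.lower().rstrip(".").split()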
    bleu_score = sentence_bleu(ref_tokens, cand_tokens, smoothing_function=smooth)
    
    # ROUGE scores
    rouge_scores = scorer.score(reference_summary, candidate)
    
    result = {
        "Summary Type": name,
        "Length": len(cand_tokens),
        "BLEU": bleu_score,
        "ROUGE-1": rouge_scores['rouge1'].fmeasure,
        "ROUGE-2": rouge_scores['rouge2'].fmeasure,
        "ROUGE-L": rouge_scores['rougeL'].fmeasure
    }
    
    length_results.append(result)
    print(f"📏 {name}: '{candidate[:50]}...'")
    print(f"   BLEU: {bleu_score:.3f} | ROUGE-1: {rouge_scores['rouge1'].fmeasure:.3f}")

# Create comparison table
length_df = pd.DataFrame(length_results)
print("\n📊 Length Impact Results:")
print(length_df.round(3))

### 📏 Key Insight

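- **Short summaries** tend to have high n-gram precision, but BLEU's brevity penalty pulls the score down once the candidate is shorter than the reference
- **Long summaries** tend to get higher ROUGE recall; the F-measure reported above also counts precision, so extra words are not free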
- But which one is more useful depends on the task!
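
To see how much of the length effect comes from the brevity penalty alone, here is a small sketch that evaluates the BP formula from earlier for the whitespace-token lengths of the three candidates (6, 11, and 23) against the 11-token reference. The function name is ours.

```python
import math

def brevity_penalty(ref_len, cand_len):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if cand_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

ref_len = 11
penalties = {cand_len: brevity_penalty(ref_len, cand_len) for cand_len in (6, 11, 23)}

for cand_len, bp in penalties.items():
    print(f"candidate length {cand_len:>2}: BP = {bp:.3f}")
```

The short candidate keeps less than half of its precision-based score, while the long one is not penalized by BP at all; its extra words are punished only through lower n-gram precision.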

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Set style for better-looking plots
plt.style.use('default')
sns.set_palette("husl")

# Create comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Model comparison chart
models_list = results_df['Model'].tolist()
x = np.arange(len(models_list))
width = 0.2

ax1.bar(x - width, results_df['BLEU'], width, label='BLEU', alpha=0.8, color='skyblue')
ax1.bar(x, results_df['ROUGE-1'], width, label='ROUGE-1', alpha=0.8, color='lightcoral')
ax1.bar(x + width, results_df['ROUGE-L'], width, label='ROUGE-L', alpha=0.8, color='lightgreen')

ax1.set_xlabel('Models')
ax1.set_ylabel('Score')
ax1.set_title('🤖 Model Performance: BLEU vs ROUGE')
ax1.set_xticks(x)
ax1.set_xticklabels(models_list, rotation=45)
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Length impact chart  
lengths = length_df['Length'].tolist()
x2 = np.arange(len(lengths))

ax2.bar(x2 - width, length_df['BLEU'], width, label='BLEU', alpha=0.8, color='skyblue')
ax2.bar(x2, length_df['ROUGE-1'], width, label='ROUGE-1', alpha=0.8, color='lightcoral')
ax2.bar(x2 + width, length_df['ROUGE-L'], width, label='ROUGE-L', alpha=0.8, color='lightgreen')

ax2.set_xlabel('Summary Length (words)')
ax2.set_ylabel('Score')  
ax2.set_title('📏 Length Impact: BLEU vs ROUGE')
ax2.set_xticks(x2)
ax2.set_xticklabels(lengths)
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. BLEU vs ROUGE correlation
all_bleu = list(results_df['BLEU']) + list(length_df['BLEU'])
all_rouge1 = list(results_df['ROUGE-1']) + list(length_df['ROUGE-1'])

ax3.scatter(all_bleu, all_rouge1, alpha=0.7, s=100, color='purple')
ax3.set_xlabel('BLEU Score')
ax3.set_ylabel('ROUGE-1 Score')
ax3.set_title('🔍 BLEU vs ROUGE-1 Correlation')
ax3.grid(True, alpha=0.3)

# Add trend line
z = np.polyfit(all_bleu, all_rouge1, 1)
p = np.poly1d(z)
ax3.plot(sorted(all_bleu), p(sorted(all_bleu)), "r--", alpha=0.8)

# 4. Score distribution
score_types = ['BLEU', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L']
model_scores = [results_df['BLEU'].mean(), results_df['ROUGE-1'].mean(), 
                results_df['ROUGE-2'].mean(), results_df['ROUGE-L'].mean()]

ax4.pie(model_scores, labels=score_types, autopct='%1.2f', startangle=90, 
        colors=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
ax4.set_title('📊 Average Score Distribution')

plt.tight_layout()
plt.show()

print("💡 Key Insights from the visualizations:")
print("• Different models show varying BLEU vs ROUGE performance")
print("• Summary length significantly affects metric scores")
print("• BLEU and ROUGE-1 often correlate but can diverge")
print("• No single metric tells the complete story!")

## ⚖️ Strengths vs Weaknesses

Based on our experiments, let's analyze when BLEU and ROUGE work well and when they fail.

### ✅ Strengths

- **Fast and reproducible** - Can evaluate thousands of summaries in seconds
- **Standardized** - Widely used in research, making comparisons possible  
- **No human annotators needed** - Fully automated evaluation
- **Good for relative comparisons** - Helps rank different models
- **Captures surface-level quality** - Good at detecting completely broken outputs

### ❌ Weaknesses

- **Surface-level only** - Ignores meaning, only matches words
- **Sensitive to word choice** - "car" vs "automobile" scored as completely different
- **Length bias** - Short outputs inflate n-gram precision (BLEU only partly offsets this with its brevity penalty); long outputs inflate ROUGE recall
- **No factual checking** - A summary could score high but contain false information
- **Ignores fluency** - Grammatically broken text can still score well if words match
- **Single reference limitation** - Real summaries have many valid variations

## 🧪 Experiment 3: Where BLEU/ROUGE Fail

Let's create examples that highlight the limitations of these metrics:

In [None]:
# Examples where BLEU/ROUGE give misleading scores
reference = "The company announced record profits this quarter."

test_cases = {
    "Perfect Paraphrase": "The business reported exceptional earnings this period.",
    "Factually Wrong": "The company announced record losses this quarter.",
    "Gibberish": "Company the announced this record quarter profits.",
    "Word Salad": "Announced profits record company quarter this the.",
    "Completely Unrelated": "Cats love to sleep in sunny windows during afternoon."
}

print("🎯 Reference:", reference)
print("\n🧪 Test Cases Analysis:")
print("="*60)

failure_results = []

for case_name, candidate in test_cases.items():
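    # Calculate scores (lowercase and drop the trailing period so BLEU is not
    # thrown off by "Company" vs "company" or "quarter." vs "quarter")
    ref_tokens = [reference.lower().rstrip(".").split()]
    cand_tokens = candidate.lower().rstrip(".").split()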
    bleu_score = sentence_bleu(ref_tokens, cand_tokens, smoothing_function=smooth)
    rouge_scores = scorer.score(reference, candidate)
    
    failure_results.append({
        "Case": case_name,
        "BLEU": bleu_score,
        "ROUGE-1": rouge_scores['rouge1'].fmeasure,
        "Human Quality": "High" if case_name == "Perfect Paraphrase" else "Low"
    })
    
    print(f"\n📝 {case_name}:")
    print(f"   Text: '{candidate}'")
    print(f"   BLEU: {bleu_score:.3f} | ROUGE-1: {rouge_scores['rouge1'].fmeasure:.3f}")
    
    # Add human judgment
    if case_name == "Perfect Paraphrase":
        print("   👤 Human: Excellent! Perfect meaning, different words.")
    elif case_name == "Factually Wrong":
        print("   👤 Human: Terrible! Opposite meaning, but high word overlap.")
    elif "Gibberish" in case_name or "Word Salad" in case_name:
        print("   👤 Human: Nonsensical! Same words, wrong order.")
    else:
        print("   👤 Human: Completely irrelevant!")

# Summary table
failure_df = pd.DataFrame(failure_results)
print("\n📊 Failure Analysis Summary:")
print(failure_df.round(3))

### 🚨 Critical Observations

1. **Perfect Paraphrase** gets LOW scores despite being semantically identical
2. **Factually Wrong** can get HIGH scores if it uses similar words
3. **Gibberish** with same words can score better than good paraphrases
4. These metrics are **word-matching**, not **meaning-matching**!
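
The word-matching behaviour can be isolated with a short sketch that compares unigram and bigram recall on two of the test cases. The helper names and the simple lowercase/alphanumeric tokenizer are our own assumptions, not the exact preprocessing used by `rouge_score`.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase and keep alphanumeric runs only (simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_recall(reference, candidate, n):
    """Share of reference n-grams that also occur in the candidate."""
    ref_toks, cand_toks = tokens(reference), tokens(candidate)
    ref = Counter(zip(*(ref_toks[i:] for i in range(n))))
    cand = Counter(zip(*(cand_toks[i:] for i in range(n))))
    total = sum(ref.values())
    return sum((ref & cand).values()) / total if total else 0.0

reference_text = "The company announced record profits this quarter."
wrong = "The company announced record losses this quarter."
gibberish = "Company the announced this record quarter profits."

for name, text in [("Factually Wrong", wrong), ("Gibberish", gibberish)]:
    r1 = ngram_recall(reference_text, text, 1)
    r2 = ngram_recall(reference_text, text, 2)
    print(f"{name}: unigram recall={r1:.3f}, bigram recall={r2:.3f}")
```

The shuffled sentence recovers every reference word but none of the reference bigrams, so higher-order n-grams do catch word order. The factually wrong sentence still keeps most of its bigrams, which no n-gram size can fix.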

## 💡 Key Takeaways

After our deep dive into BLEU and ROUGE, here's what we learned:

### 🎯 The Core Insight
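- **BLEU is precision-oriented**; **ROUGE is recall-oriented** (ROUGE libraries also report precision and F-measure, which is what we printed in this notebook)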
- Both are **n-gram based** and easy to compute
- Great for **benchmarking**, but not enough for real-world quality

### ⚠️ Major Limitations
- They **fail on paraphrases** and don't capture semantics
- Can be **gamed** by length manipulation
- No understanding of **factual correctness** or **fluency**

### 🛠️ Best Practices
- Use them alongside **human evaluation**
- Consider **multiple metrics** together
- Be aware of **length bias** in your models
- Don't rely on them for **final quality judgments**

### 🔮 Modern Alternatives
For semantic understanding, consider:
- **BERTScore** (contextual embeddings)
- **BLEURT** (learned evaluation)
- **Human evaluation** (still the gold standard)

---

## 📌 What's Next

This is **Part 2** of "Evaluating LLMs in Practice: Metrics, Experiments, and A/B Testing."

In **Part 3**, we'll dive into **Perplexity**:
- What it really measures
- Why lower perplexity isn't always better
- How to compute it with GPT-2 and GPT-Neo
- When perplexity correlates with human quality (and when it doesn't)

**Coming up in the series:**
- Part 4: Semantic Similarity Metrics (BERTScore, METEOR)
- Part 5: Human Evaluation & Inter-Annotator Agreement  
- Part 6: A/B Testing for LLM Applications
- Part 7: Building Your Own Custom Evaluation Pipeline

Stay tuned! ⭐