# BLEU and ROUGE Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mekjr1/evaluating_llms_in_practice/blob/master/part-1-bleu_and_rouge/bleu_and_rouge.ipynb?hl=en#runtime_type=gpu)

This notebook demonstrates how to evaluate text summarization models using BLEU and ROUGE metrics. The notebook is configured to use GPU runtime for faster model inference.

## Why Traditional Accuracy Metrics Fail for Text Generation

When evaluating text generation models, we can't simply use classification accuracy. The comparison below shows why:

### Classification vs Generation Tasks

```
┌─────────────────────────────────┬─────────────────────────────────┐
│        Classification           │        Text Generation          │
├─────────────────────────────────┼─────────────────────────────────┤
│ Input: "I love this movie"      │ Input: Long article text        │
│ Output: [Positive, Negative]    │ Output: "Scientists discover..." │
│ Ground Truth: Positive          │ Reference: "Researchers found..." │
│                                 │                                 │
│ ✅ Exact Match = 100% Accuracy  │ ❌ Exact Match = 0% Accuracy    │
│ ❌ Different = 0% Accuracy      │ But both summaries are good! 🤔 │
└─────────────────────────────────┴─────────────────────────────────┘
```

**The Problem**: Two different but equally valid summaries score 0% on exact-match accuracy against each other, even when both capture the same key information.

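**The Solution**: We need metrics that give partial credit for **content overlap** with a reference instead of requiring exact matches. BLEU and ROUGE do this by counting shared n-grams, which approximates, but does not directly measure, **semantic similarity**.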
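
To make the contrast concrete, here is a minimal sketch comparing exact-match accuracy with a simple word-overlap score. The two sentences are invented for illustration (they only extend the "Scientists discover..." / "Researchers found..." placeholders above), and `tokenize`, `exact_match`, and `unigram_overlap` are hypothetical helper names, not part of any library used in this notebook.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def exact_match(candidate, reference):
    """Classification-style accuracy: 1.0 only if the strings are identical."""
    return float(candidate.strip() == reference.strip())

def unigram_overlap(candidate, reference):
    """Fraction of distinct reference words that also appear in the candidate."""
    ref_words = set(tokenize(reference))
    cand_words = set(tokenize(candidate))
    return len(ref_words & cand_words) / len(ref_words) if ref_words else 0.0

# Invented example sentences, not taken from the dataset
toy_reference = "Researchers found a new species of frog in Peru"
toy_candidate = "Scientists discover a new frog species in Peru"

em_score = exact_match(toy_candidate, toy_reference)
overlap_score = unigram_overlap(toy_candidate, toy_reference)
print(f"Exact match:     {em_score:.2f}")
print(f"Unigram overlap: {overlap_score:.2f}")
```

Exact match gives no credit at all, while the overlap score recognizes that six of the nine distinct reference words are present in the candidate.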

In [1]:
!pip install transformers datasets evaluate rouge-score nltk matplotlib seaborn plotly pandas


In [2]:
from transformers import pipeline
from datasets import load_dataset
import evaluate

In [None]:
# Load a small subset of CNN/DailyMail dataset
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:20]")

In [None]:
# Two summarization models
model_a = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
model_b = pipeline("summarization", model="facebook/bart-base")

In [None]:
# Pick one article
article = dataset[0]["article"]
reference = dataset[0]["highlights"]

In [None]:
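# truncation=True trims articles that exceed the model's maximum input length
summary_a = model_a(article, max_length=60, min_length=20, do_sample=False, truncation=True)[0]["summary_text"]
summary_b = model_b(article, max_length=60, min_length=20, do_sample=False, truncation=True)[0]["summary_text"]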

In [None]:
print("Reference:", reference)
print("\nModel A:", summary_a)
print("\nModel B:", summary_b)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from collections import Counter
import re
from IPython.display import HTML, display



In [None]:
def highlight_overlaps(reference, candidate, title):
    """Highlight overlapping words between reference and candidate text"""
    # Simple word tokenization
    ref_words = set(re.findall(r'\b\w+\b', reference.lower()))
    cand_words = re.findall(r'\b\w+\b', candidate.lower())
    
    # Create highlighted version
    highlighted = []
    for word in candidate.split():
        clean_word = re.sub(r'[^\w]', '', word.lower())
        if clean_word in ref_words:
            highlighted.append(f"<span style='background-color: #90EE90; padding: 2px; margin: 1px; border-radius: 3px;'>{word}</span>")
        else:
            highlighted.append(word)
    
    overlap_count = sum(1 for word in cand_words if word in ref_words)
    overlap_ratio = overlap_count / len(cand_words) if cand_words else 0
    
    return " ".join(highlighted), overlap_ratio

In [None]:
# Create visualizations for both summaries
print("🔍 N-gram Overlap Analysis")
print("=" * 100)
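
The `highlight_overlaps` helper above works on single words only. As a rough sketch of how the same idea extends to longer n-grams, the snippet below counts shared unigrams, bigrams, and trigrams between two invented toy sentences. The names `ngrams` and `ngram_overlap` are hypothetical, and the whitespace-and-regex tokenization is a simplifying assumption.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(reference, candidate, n):
    """Return (matched, total): candidate n-grams found in the reference,
    clipped to the reference count, and the total number of candidate n-grams."""
    ref_counts = Counter(ngrams(re.findall(r"\w+", reference.lower()), n))
    cand_counts = Counter(ngrams(re.findall(r"\w+", candidate.lower()), n))
    shared = ref_counts & cand_counts  # multiset intersection keeps the minimum counts
    return sum(shared.values()), sum(cand_counts.values())

# Invented toy sentences that differ by one word
toy_reference = "the cat sat on the mat"
toy_candidate = "the cat lay on the mat"

for n in (1, 2, 3):
    matched, total = ngram_overlap(toy_reference, toy_candidate, n)
    print(f"{n}-grams: {matched}/{total} candidate n-grams appear in the reference")
```

A single changed word removes one unigram match but breaks two bigrams and three trigrams, which is why higher-order n-gram scores drop quickly for paraphrased text.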



In [None]:
# Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

In [None]:
ref_display = f"<div style='padding: 15px; background-color: #f0f8ff; border-left: 4px solid #4682b4; margin: 10px 0;'><strong>Reference Summary:</strong><br>{reference}</div>"

summary_a_highlighted, overlap_a = highlight_overlaps(reference, summary_a, "Model A")
summary_b_highlighted, overlap_b = highlight_overlaps(reference, summary_b, "Model B")

model_a_display = f"""
<div style='padding: 15px; background-color: #f9f9f9; border-left: 4px solid #32cd32; margin: 10px 0;'>
<strong>Model A Summary (Word Overlap: {overlap_a:.1%}):</strong><br>
{summary_a_highlighted}
</div>
"""

model_b_display = f"""
<div style='padding: 15px; background-color: #f9f9f9; border-left: 4px solid #ff6347; margin: 10px 0;'>
<strong>Model B Summary (Word Overlap: {overlap_b:.1%}):</strong><br>
{summary_b_highlighted}
</div>
"""

legend = """
<div style='padding: 10px; background-color: #fffacd; border: 1px solid #ddd; margin: 10px 0; border-radius: 5px;'>
<strong>Legend:</strong> <span style='background-color: #90EE90; padding: 2px; margin: 1px; border-radius: 3px;'>Highlighted words</span> appear in both the reference and candidate summary
</div>
"""

display(HTML(ref_display + model_a_display + model_b_display + legend))

## BLEU vs ROUGE: Understanding the Metrics

<div style="display: flex; justify-content: center; margin: 20px 0;">
<table style="border-collapse: collapse; width: 100%; max-width: 800px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
  <thead>
    <tr style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white;">
      <th style="padding: 15px; text-align: left; font-weight: 600;">Aspect</th>
      <th style="padding: 15px; text-align: center; font-weight: 600;">BLEU</th>
      <th style="padding: 15px; text-align: center; font-weight: 600;">ROUGE</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #f8f9fa;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">📝 <strong>Primary Focus</strong></td>
      <td style="padding: 15px; text-align: center;">Precision (exact matches)</td>
      <td style="padding: 15px; text-align: center;">Recall (coverage)</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">🔍 <strong>What it Measures</strong></td>
      <td style="padding: 15px; text-align: center;">N-gram precision<br><em>(How many generated words are correct?)</em></td>
      <td style="padding: 15px; text-align: center;">N-gram recall<br><em>(How much reference content is captured?)</em></td>
    </tr>
    <tr style="background-color: #f8f9fa;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">🎯 <strong>Best For</strong></td>
      <td style="padding: 15px; text-align: center;">Machine Translation</td>
      <td style="padding: 15px; text-align: center;">Text Summarization</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">📊 <strong>Score Range</strong></td>
      <td style="padding: 15px; text-align: center;">0.0 - 1.0<br><em>(Higher = Better)</em></td>
      <td style="padding: 15px; text-align: center;">0.0 - 1.0<br><em>(Higher = Better)</em></td>
    </tr>
    <tr style="background-color: #f8f9fa;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">⚖️ <strong>Key Strength</strong></td>
      <td style="padding: 15px; text-align: center;">Penalizes hallucinations</td>
      <td style="padding: 15px; text-align: center;">Rewards content coverage</td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td style="padding: 15px; font-weight: 500; border-right: 2px solid #e9ecef;">⚠️ <strong>Limitation</strong></td>
      <td style="padding: 15px; text-align: center;">May penalize valid paraphrases</td>
      <td style="padding: 15px; text-align: center;">May reward keyword stuffing</td>
    </tr>
  </tbody>
</table>
</div>

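> **💡 Pro Tip**: Use both metrics together! BLEU's n-gram precision is sensitive to word order and to wording that the reference does not support, while ROUGE's recall checks how much of the reference content is covered.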
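
The precision/recall distinction in the table can be illustrated with a small hand-rolled sketch. It uses invented sentences and whitespace tokenization, and the helper names (`clipped_matches`, `unigram_precision`, `unigram_recall`) are hypothetical. It is not how the `evaluate` library computes its scores, only the underlying idea at the unigram level.

```python
from collections import Counter

def clipped_matches(candidate_tokens, reference_tokens):
    """Count candidate tokens that also occur in the reference, clipping
    repeats so a word cannot match more often than the reference has it."""
    return sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())

def unigram_precision(candidate, reference):
    """BLEU-style view: what share of the generated words is supported?"""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return clipped_matches(cand, ref) / len(cand) if cand else 0.0

def unigram_recall(candidate, reference):
    """ROUGE-style view: what share of the reference words is covered?"""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return clipped_matches(cand, ref) / len(ref) if ref else 0.0

# Invented example texts
toy_ref = "the storm closed schools and roads across the region"
short_candidate = "the storm closed schools"
padded_candidate = (
    "the storm closed schools and roads across the region "
    "yesterday officials said in a statement"
)

for name, cand in [("short", short_candidate), ("padded", padded_candidate)]:
    p = unigram_precision(cand, toy_ref)
    r = unigram_recall(cand, toy_ref)
    print(f"{name:>6}: precision={p:.2f}  recall={r:.2f}")
```

The short candidate gets perfect precision but covers fewer than half of the reference words, while the padded candidate gets perfect recall at the cost of precision. Neither number alone tells the whole story.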

In [2]:
bleu_a = bleu.compute(predictions=[summary_a], references=[[reference]])
bleu_b = bleu.compute(predictions=[summary_b], references=[[reference]])
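
For intuition about what the BLEU number combines, the following is a simplified, unsmoothed, single-reference sketch: a brevity penalty multiplied by the geometric mean of clipped 1- to 4-gram precisions. The function name `simple_bleu`, the whitespace tokenization, and the toy sentences are assumptions made for illustration, so its values will not necessarily match `evaluate`'s output, which applies its own tokenizer.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU sketch with uniform n-gram weights."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        matched = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # a single zero precision zeroes the geometric mean
        log_precision_sum += math.log(matched / total) / max_n
    # Brevity penalty: candidates no longer than the reference are scaled down
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)

identical = simple_bleu("the quick brown fox jumps", "the quick brown fox jumps")
paraphrase = simple_bleu("a quick brown fox leaps", "the quick brown fox jumps")
truncated = simple_bleu("the quick brown fox", "the quick brown fox jumps over")
print(f"identical={identical:.3f}  paraphrase={paraphrase:.3f}  truncated={truncated:.3f}")
```

The paraphrase shares three words with the reference but no 4-gram, so the unsmoothed score collapses to zero. This is one reason BLEU on a single short summary can look much harsher than ROUGE, and why BLEU is usually reported over a whole corpus.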

In [None]:
rouge_a = rouge.compute(predictions=[summary_a], references=[reference])
rouge_b = rouge.compute(predictions=[summary_b], references=[reference])

In [None]:
print("\nModel A BLEU:", bleu_a)
print("Model B BLEU:", bleu_b)
print("\nModel A ROUGE:", rouge_a)
print("Model B ROUGE:", rouge_b)
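
The `rougeL` entry in the output is based on the longest common subsequence (LCS) between candidate and reference, so unlike ROUGE-1 it is sensitive to word order. Below is a minimal sketch of that idea in its F-measure form, assuming whitespace tokenization and a single reference. `lcs_length` and `rouge_l_f1` are hypothetical names and the sentences are invented; this is not the library's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists
    (dynamic programming, keeping only the previous row)."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

toy_ref = "the cat sat on the mat"
one_word_changed = rouge_l_f1("the cat lay on the mat", toy_ref)
same_words_reordered = rouge_l_f1("on the mat the cat sat", toy_ref)
print(f"one word changed:     {one_word_changed:.3f}")
print(f"same words reordered: {same_words_reordered:.3f}")
```

The reordered candidate contains every reference word, so a unigram metric would score it perfectly, yet its longest in-order match is only three tokens and the LCS-based score drops to 0.5.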

In [None]:
# Create comprehensive visualization of results
plt.style.use('default')
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('📊 BLEU & ROUGE Evaluation Results', fontsize=20, fontweight='bold', y=0.98)

# Color scheme
colors = ['#3498db', '#e74c3c']  # Blue for Model A, Red for Model B
model_names = ['Model A\n(DistilBART)', 'Model B\n(BART-base)']

# 1. BLEU Scores Comparison
bleu_scores = [bleu_a['bleu'], bleu_b['bleu']]
bars1 = ax1.bar(model_names, bleu_scores, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
ax1.set_title('🎯 BLEU Scores', fontsize=14, fontweight='bold', pad=20)
ax1.set_ylabel('BLEU Score', fontsize=12)
ax1.set_ylim(0, max(bleu_scores) * 1.2 or 1.0)  # fall back to 1.0 if both BLEU scores are 0
ax1.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on bars
for bar, score in zip(bars1, bleu_scores):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.005,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

# 2. ROUGE-1 Comparison
rouge1_scores = [rouge_a['rouge1'], rouge_b['rouge1']]
bars2 = ax2.bar(model_names, rouge1_scores, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
ax2.set_title('🔍 ROUGE-1 Scores', fontsize=14, fontweight='bold', pad=20)
ax2.set_ylabel('ROUGE-1 Score', fontsize=12)
ax2.set_ylim(0, max(rouge1_scores) * 1.2)
ax2.grid(axis='y', alpha=0.3, linestyle='--')

for bar, score in zip(bars2, rouge1_scores):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

# 3. ROUGE-2 Comparison
rouge2_scores = [rouge_a['rouge2'], rouge_b['rouge2']]
bars3 = ax3.bar(model_names, rouge2_scores, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
ax3.set_title('📝 ROUGE-2 Scores', fontsize=14, fontweight='bold', pad=20)
ax3.set_ylabel('ROUGE-2 Score', fontsize=12)
ax3.set_ylim(0, max(rouge2_scores) * 1.2 or 1.0)  # fall back to 1.0 if both ROUGE-2 scores are 0
ax3.grid(axis='y', alpha=0.3, linestyle='--')

for bar, score in zip(bars3, rouge2_scores):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height + 0.005,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

# 4. ROUGE-L Comparison
rougeL_scores = [rouge_a['rougeL'], rouge_b['rougeL']]
bars4 = ax4.bar(model_names, rougeL_scores, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)
ax4.set_title('📏 ROUGE-L Scores', fontsize=14, fontweight='bold', pad=20)
ax4.set_ylabel('ROUGE-L Score', fontsize=12)
ax4.set_ylim(0, max(rougeL_scores) * 1.2)
ax4.grid(axis='y', alpha=0.3, linestyle='--')

for bar, score in zip(bars4, rougeL_scores):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.subplots_adjust(top=0.93)
plt.show()



In [None]:
# Summary table
print("\n" + "="*80)
print("📋 SUMMARY TABLE")
print("="*80)

summary_data = {
    'Metric': ['BLEU', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L'],
    'Model A (DistilBART)': [f"{bleu_a['bleu']:.3f}", f"{rouge_a['rouge1']:.3f}", 
                            f"{rouge_a['rouge2']:.3f}", f"{rouge_a['rougeL']:.3f}"],
    'Model B (BART-base)': [f"{bleu_b['bleu']:.3f}", f"{rouge_b['rouge1']:.3f}", 
                           f"{rouge_b['rouge2']:.3f}", f"{rouge_b['rougeL']:.3f}"],
    'Winner': []
}

# Determine winners
metrics_scores = [(bleu_a['bleu'], bleu_b['bleu']), (rouge_a['rouge1'], rouge_b['rouge1']), 
                  (rouge_a['rouge2'], rouge_b['rouge2']), (rouge_a['rougeL'], rouge_b['rougeL'])]

for a_score, b_score in metrics_scores:
    if a_score > b_score:
        summary_data['Winner'].append('🏆 Model A')
    elif b_score > a_score:
        summary_data['Winner'].append('🏆 Model B')
    else:
        summary_data['Winner'].append('🤝 Tie')

df_summary = pd.DataFrame(summary_data)
print(df_summary.to_string(index=False))
print("="*80)

## 🎓 Key Takeaways

### What We Learned

1. **🚫 Why Accuracy Fails**: Traditional classification accuracy doesn't work for text generation because multiple valid outputs exist for the same input.

2. **🔍 BLEU vs ROUGE**: 
   - **BLEU** focuses on precision (avoiding hallucinations)
   - **ROUGE** focuses on recall (capturing key information)

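3. **📊 Visual Analysis**: The charts above compare the two models on BLEU, ROUGE-1, ROUGE-2, and ROUGE-L. Because the scores come from a single article, treat them as an illustration of how the metrics behave, not as a ranking of the models.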

4. **💡 Best Practices**: 
   - Always use multiple metrics together
   - Consider the specific task (translation vs summarization)
   - Look at both word-level and semantic similarity

### Next Steps
- Try with different models and datasets
- Explore other metrics such as METEOR or the embedding-based BERTScore
- Consider human evaluation for critical applications

---
*This notebook demonstrates fundamental concepts in NLP evaluation. The visual approach makes complex metrics more intuitive and actionable.*