# BLEU and ROUGE Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mekjr1/evaluating_llms_in_practice/blob/master/part-1-bleu_and_rouge/bleu_and_rouge.ipynb?hl=en#runtime_type=gpu)

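This notebook demonstrates how to evaluate text summarization models with the BLEU and ROUGE metrics. It summarizes one CNN/DailyMail article with two models and scores each summary against the reference highlights. The Colab link above requests a GPU runtime for faster model inference.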

In [None]:
!pip install transformers datasets evaluate rouge-score nltk

In [None]:
from transformers import pipeline
from datasets import load_dataset
import evaluate

In [None]:
# Load a small subset of the CNN/DailyMail dataset (first 20 test examples)
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:20]")

In [None]:
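# Two summarization models
# Model A: DistilBART checkpoint fine-tuned on CNN/DailyMail
model_a = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
# Model B: base BART checkpoint, not fine-tuned for summarization
model_b = pipeline("summarization", model="facebook/bart-base")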

In [None]:
# Pick the first article and its reference summary
article = dataset[0]["article"]
reference = dataset[0]["highlights"]

In [None]:
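# truncation=True keeps long articles within each model's maximum input length
summary_a = model_a(article, max_length=60, min_length=20, do_sample=False, truncation=True)[0]["summary_text"]
summary_b = model_b(article, max_length=60, min_length=20, do_sample=False, truncation=True)[0]["summary_text"]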

In [None]:
print("Reference:", reference)
print("\nModel A:", summary_a)
print("\nModel B:", summary_b)

In [None]:
# Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

In [None]:
bleu_a = bleu.compute(predictions=[summary_a], references=[[reference]])
bleu_b = bleu.compute(predictions=[summary_b], references=[[reference]])
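
To make the BLEU number less of a black box, here is a minimal sketch of the idea behind it: clipped n-gram precision for orders 1 to 4, combined by a geometric mean and scaled by a brevity penalty. The function `simple_bleu` is a hypothetical helper written for illustration. It assumes plain whitespace tokenization, a single reference, and no smoothing, so its values will generally not match `bleu.compute` exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=4):
    """Illustrative single-reference BLEU (whitespace tokens, no smoothing)."""
    pred = prediction.split()
    ref = reference.split()

    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(overlap / total if total else 0.0)

    # Without smoothing, a single zero precision makes the whole score zero
    if min(precisions) == 0:
        return 0.0

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    # Brevity penalty: penalize predictions that are not longer than the reference
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * geo_mean
```

Calling `simple_bleu(summary_a, reference)` gives a rough cross-check of `bleu_a["bleu"]`. Because any missing 4-gram match zeroes the score, short summaries can easily score 0 on a single example.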

In [None]:
rouge_a = rouge.compute(predictions=[summary_a], references=[reference])
rouge_b = rouge.compute(predictions=[summary_b], references=[reference])
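
As a rough illustration of what ROUGE-1 measures, the sketch below computes unigram overlap between a summary and the reference and turns it into precision, recall, and F1. The helper `simple_rouge1` is hypothetical and assumes lowercased whitespace tokens with no stemming, so it only approximates the `rouge1` value reported by `rouge.compute`.

```python
from collections import Counter


def simple_rouge1(prediction, reference):
    """Illustrative ROUGE-1: unigram overlap as precision, recall, and F1."""
    pred = prediction.lower().split()
    ref = reference.lower().split()

    # Each shared word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred) & Counter(ref)).values())

    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For instance, `simple_rouge1(summary_a, reference)["f1"]` can be compared against `rouge_a["rouge1"]`. Where BLEU is built on precision, the recall term here rewards summaries that cover more of the reference.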

In [None]:
print("\nModel A BLEU:", bleu_a)
print("Model B BLEU:", bleu_b)
print("\nModel A ROUGE:", rouge_a)
print("Model B ROUGE:", rouge_b)
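
The raw dictionaries printed above are hard to compare at a glance. The sketch below shows one possible way to line the headline scores up side by side. `comparison_rows` and `format_table` are hypothetical helpers; they assume the BLEU result has a `bleu` key and the ROUGE result has `rouge1`, `rouge2`, and `rougeL` keys.

```python
def comparison_rows(bleu_scores, rouge_scores):
    """Build one row per model from dicts mapping model name -> metric result."""
    rows = []
    for name in bleu_scores:
        row = {"model": name, "bleu": bleu_scores[name]["bleu"]}
        for key in ("rouge1", "rouge2", "rougeL"):
            row[key] = rouge_scores[name][key]
        rows.append(row)
    return rows


def format_table(rows):
    """Render rows as a fixed-width text table with four decimal places."""
    headers = list(rows[0].keys())
    lines = ["  ".join(f"{h:>8}" for h in headers)]
    for row in rows:
        cells = [
            f"{row[h]:>8}" if isinstance(row[h], str) else f"{row[h]:>8.4f}"
            for h in headers
        ]
        lines.append("  ".join(cells))
    return "\n".join(lines)
```

In this notebook it could be called as `print(format_table(comparison_rows({"Model A": bleu_a, "Model B": bleu_b}, {"Model A": rouge_a, "Model B": rouge_b})))`. Scores from a single article are noisy, so the table is best read as a quick sanity check rather than a ranking of the two models.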