
# ROUGE Scores: An Overview

## What is ROUGE?
ROUGE (**Recall-Oriented Understudy for Gisting Evaluation**) is a set of metrics used to evaluate **text summarization and text generation tasks** by comparing machine-generated text to human-written references.

## Types of ROUGE Scores

### 1. ROUGE-N (N-Gram Overlap)
- Measures **n-gram** (sequence of N words) overlap between generated and reference text.
- **Formula (recall form):**  
  \[
  \text{ROUGE-N} = \frac{\text{Number of overlapping n-grams}}{\text{Total number of n-grams in the reference}}
  \]
- Common variations:
  - **ROUGE-1** (unigrams) – measures **word-level** overlap.
  - **ROUGE-2** (bigrams) – considers **phrase-level** overlap.
- **ROUGE-L** is usually reported alongside these, but it is based on the longest common subsequence rather than fixed-length n-grams (see the next section).
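
The ratio in the formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: the helper names `ngrams` and `rouge_n_recall` are hypothetical, tokenization is assumed to be a lowercase whitespace split, and overlaps are clipped so that each reference n-gram is matched at most as often as it occurs.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, generated, n=1):
    """ROUGE-N recall: overlapping n-grams / total n-grams in the reference."""
    # Assumption: lowercase whitespace tokenization; real toolkits may differ.
    ref_counts = ngrams(reference.lower().split(), n)
    gen_counts = ngrams(generated.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((ref_counts & gen_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print("ROUGE-1 recall:", rouge_n_recall(reference, generated, n=1))
print("ROUGE-2 recall:", rouge_n_recall(reference, generated, n=2))
```

Here 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are matched, which shows why ROUGE-2 is usually lower than ROUGE-1 for the same pair of texts.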

### 2. ROUGE-L (Longest Common Subsequence - LCS)
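- Measures the **longest common subsequence (LCS)**: the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously.
- Rewards preserved **sentence-level word order** without fixing an n-gram length in advance, so it reflects structure rather than measuring fluency or coherence directly.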
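
As a rough sketch of the idea, the LCS length can be computed with standard dynamic programming and then turned into recall, precision, and F1. The function names `lcs_length` and `rouge_l` are hypothetical, the tokenization is an assumed lowercase whitespace split, and the balanced F1 used here is one common choice, not necessarily what a given toolkit reports.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, generated):
    """Return LCS-based recall, precision, and balanced F1 for one pair."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    lcs = lcs_length(ref, gen)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(gen) if gen else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"r": recall, "p": precision, "f": f1}

print(rouge_l("the cat sat on the mat", "the cat was sitting on the mat"))
```

In this example the LCS is "the cat on the mat" (5 words): the matched words keep their order even though "was sitting" interrupts them in the generated text.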

### 3. ROUGE-W (Weighted LCS)
- Similar to ROUGE-L, but **assigns more weight** to consecutive matches, so one long unbroken matching run scores higher than the same words scattered through the text.
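
The effect of the weighting can be illustrated with a small dynamic-programming sketch. It assumes the weighting function f(k) = k² (any function that grows faster than linearly would reward runs in the same way); the names `weight`, `wlcs_score`, and `rouge_w_recall` are hypothetical, and this is a simplified illustration rather than a drop-in replacement for an evaluation toolkit.

```python
import math

def weight(k):
    """Assumed weighting function: a run of k consecutive matches is worth k**2."""
    return k ** 2

def wlcs_score(x, y):
    """Weighted LCS score between two token lists."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # best weighted score for x[:i], y[:j]
    w = [[0] * (n + 1) for _ in range(m + 1)]  # length of the matching run ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                # Extending a run from length k to k + 1 adds weight(k + 1) - weight(k).
                c[i][j] = c[i - 1][j - 1] + weight(k + 1) - weight(k)
                w[i][j] = k + 1
            else:
                # No match: carry over the best score; the run length stays 0.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

def rouge_w_recall(reference_tokens, generated_tokens):
    """Normalize by the best possible score, then invert the weighting (sqrt for k**2)."""
    best = weight(len(reference_tokens))
    score = wlcs_score(reference_tokens, generated_tokens)
    return math.sqrt(score / best) if best else 0.0

reference = "A B C D E F G".split()
contiguous = "A B C D H I K".split()  # A B C D matched as one unbroken run
scattered = "A H B K C I D".split()   # the same four matches, but separated

print(wlcs_score(reference, contiguous), rouge_w_recall(reference, contiguous))
print(wlcs_score(reference, scattered), rouge_w_recall(reference, scattered))
```

Both candidates share an LCS of length 4 with the reference, so plain ROUGE-L cannot tell them apart; the weighted score separates them (16 versus 4 under the assumed weighting).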

### 4. ROUGE-S (Skip-Bigram-Based Co-Occurrence)
- Measures overlap of **pairs of words** taken in their original order, allowing gaps in between (non-consecutive matches).
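
A skip-bigram is any ordered pair of words from the same text, however far apart. The sketch below enumerates them with `itertools.combinations` and computes a recall-style score. The helper names `skip_bigrams` and `rouge_s_recall` are hypothetical, no maximum gap is enforced (some variants limit the skip distance), and tokenization is an assumed lowercase whitespace split.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """Count all in-order word pairs, with any number of words skipped between them."""
    # combinations() preserves input order, so each pair is (earlier word, later word).
    return Counter(combinations(tokens, 2))

def rouge_s_recall(reference, generated):
    """Skip-bigram recall: shared skip-bigrams / skip-bigrams in the reference."""
    ref = skip_bigrams(reference.lower().split())
    gen = skip_bigrams(generated.lower().split())
    overlap = sum((ref & gen).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "police killed the gunman"
print(rouge_s_recall(reference, "police kill the gunman"))
print(rouge_s_recall(reference, "the gunman kill police"))
```

The first candidate shares three of the six reference skip-bigrams ("police the", "police gunman", "the gunman"); the second shares only "the gunman", because the remaining pairs appear in the wrong order.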

## What is a Good ROUGE Score?
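- **ROUGE-1 (~40-60%)** is often considered **good** for abstractive summarization.
- **ROUGE-2 (~15-30%)** is typical, since bigram matches are rarer than unigram matches.
- **ROUGE-L (~40-50%)** suggests that much of the reference's word order is preserved.
- These ranges are rough rules of thumb: scores depend on the dataset, summary length, and tokenization, so compare systems only under the same evaluation setup.
- Higher **recall** is often preferred in summarization tasks since the generated text should **cover key information** from the reference, but recall alone rewards longer outputs, so precision and F1 are usually reported with it.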

🔹 **Example Interpretation**:  
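A **ROUGE-1 recall of 50%** means that **half of the unigrams (words) in the reference summary also appear in the generated summary.** ROUGE-1 precision asks the reverse question: what share of the generated words appear in the reference.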
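
To make the interpretation concrete, the sketch below derives recall, precision, and F1 from raw match counts. The function name `precision_recall_f1` and the example counts (7 matched unigrams, 10 in the reference, 11 in the generated text) are illustrative assumptions, and the balanced F1 shown is one common way of combining the two.

```python
def precision_recall_f1(overlap, ref_total, gen_total):
    """Turn raw match counts into precision, recall, and balanced F1."""
    recall = overlap / ref_total if ref_total else 0.0      # coverage of the reference
    precision = overlap / gen_total if gen_total else 0.0   # share of generated text that matches
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts: 7 matched unigrams, 10 in the reference, 11 generated.
p, r, f = precision_recall_f1(overlap=7, ref_total=10, gen_total=11)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

A longer generated summary can raise recall simply by including more words, which lowers precision; F1 balances the two.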



In [3]:
# Uncomment the next line to install the rouge package (needed once per environment)
#!pip install rouge

In [4]:
from rouge import Rouge

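def calculate_rouge(reference, generated):
    """Return ROUGE-1, ROUGE-2, and ROUGE-L scores for one summary pair."""
    rouge = Rouge()
    # get_scores takes the hypothesis (generated text) first, then the reference
    scores = rouge.get_scores(generated, reference)
    return scores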

# Example reference and generated summary
reference_summary = "The cat sat on the mat and looked outside the window."
generated_summary = "The cat was sitting on the mat and gazing through the window."

# Calculate ROUGE scores
rouge_scores = calculate_rouge(reference_summary, generated_summary)
print("ROUGE Scores:", rouge_scores)

ROUGE Scores: [{'rouge-1': {'r': 0.7, 'p': 0.6363636363636364, 'f': 0.6666666616780046}, 'rouge-2': {'r': 0.5, 'p': 0.45454545454545453, 'f': 0.47619047120181407}, 'rouge-l': {'r': 0.7, 'p': 0.6363636363636364, 'f': 0.6666666616780046}}]
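
The result is a list holding one dictionary for the scored pair, keyed by metric and then by `r` (recall), `p` (precision), and `f` (F-score). As a small sketch of how it might be read, the snippet below formats the values as percentages. The helper `format_scores` is hypothetical, and the dictionary is copied from the output above so the snippet runs without the `rouge` package installed.

```python
# Scores copied from the printed output above.
rouge_scores = [{
    'rouge-1': {'r': 0.7, 'p': 0.6363636363636364, 'f': 0.6666666616780046},
    'rouge-2': {'r': 0.5, 'p': 0.45454545454545453, 'f': 0.47619047120181407},
    'rouge-l': {'r': 0.7, 'p': 0.6363636363636364, 'f': 0.6666666616780046},
}]

def format_scores(scores):
    """Return one readable line per metric for the first scored pair."""
    lines = []
    for metric, values in scores[0].items():
        lines.append(
            f"{metric}: recall={values['r']:.1%}, "
            f"precision={values['p']:.1%}, f1={values['f']:.1%}"
        )
    return lines

for line in format_scores(rouge_scores):
    print(line)
```

Read this way, the example pair has a ROUGE-1 recall of 70.0% and a ROUGE-2 recall of 50.0%, which matches the expectation that bigram overlap is lower than unigram overlap.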
