## 🧩 Evaluation of Text Generation

In this notebook, we'll continue to **look at text generation**. This time, we will focus on **evaluation**.

Evaluation of text generation is a central challenge in natural language generation research. How we evaluate systems shapes not only model comparison but also our very definition of what counts as a "good" output. In this session, you will explore different types of evaluation metrics to uncover their strengths, limitations, and inherent biases.

The goal of this notebook is **not to chase perfect evaluation scores**, but to **experiment** and **build intuition** about how evaluation of text generation actually works.  

You're encouraged to:  
- Try out different evaluation metrics from various classes,  
- Compare how they rate the same model outputs, and  
- Reflect on when and why these metrics agree, or fail to.  

In this notebook, we'll focus on two main types of evaluation metrics:  
(a) **Content-overlap metrics**   
(b) **Model-based metrics**.  

Beyond these automatic methods, you're also encouraged to **manually evaluate** some generated outputs yourself and observe which metrics best align with your own intuition about quality.

By the end of this notebook, you'll have a practical understanding of how to **evaluate text generation models** and a clearer sense of **what good evaluation really means**.

### 🧮 The `evaluate` Library

[`evaluate`](https://huggingface.co/docs/evaluate) is a lightweight library from Hugging Face that provides a unified interface for computing a wide range of NLP evaluation metrics, from classic ones like **BLEU** and **Perplexity** to modern model-based metrics such as **BERTScore** and **COMET**.

It covers several modalities, including text, computer vision, and audio, and also offers tools for evaluating models and datasets.

You can browse more metrics here: https://huggingface.co/evaluate-metric/spaces

Each metric is a separate Python module, but all of them are loaded through a single entry point: `evaluate.load()`.

### 📏 Content-overlap Metrics

**Content-overlap metrics** evaluate how closely a generated text matches one or more reference texts by comparing their surface forms, typically through word or n-gram overlap.  

These metrics are simple, interpretable, and fast to compute, but they often fail to capture deeper semantic meaning or paraphrasing.

In this notebook, we'll focus on two of the most widely used metrics in this category:  

- **BLEU**: computes the overlap of n-grams between generated and reference texts, and is widely used in **machine translation** and **summarization**.  

- **ROUGE**: measures the overlap of n-grams, words, or word sequences, but is especially designed for **summarization** tasks, focusing on recall rather than precision.


### BLEU

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". This is the central idea behind BLEU.

BLEU and BLEU-derived metrics are most often used for machine translation.
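
For reference, the standard BLEU definition combines clipped n-gram precisions $p_n$ with a brevity penalty $\text{BP}$. The version below assumes uniform weights $w_n = 1/N$ with $N = 4$, which is the usual default; $c$ is the length of the generated text and $r$ is the length of the reference:

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

Because the precisions are combined through a geometric mean, a single $p_n = 0$ drives the whole score to zero unless smoothing is applied.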



First run `pip install evaluate` to install the library. Then load the BLEU metric like this:

In [2]:
import evaluate

bleu = evaluate.load("bleu")



Here are some example texts:



In [3]:
predictions = [
    "The cat is on the mat.",
    "There is a cat sitting on the carpet."
]
references = [
    ["The cat sits on the mat."],
    ["A cat is on the carpet."]
]


------------
**`TODO:`** Compute the BLEU score for the `predictions` list against the `references` list using the `bleu.compute()` function.  
- Use the `predictions` variable as the input for the `predictions` parameter.  
- Use the `references` variable as the input for the `references` parameter.  
- Store the result in a variable named `results`.  
- Print the BLEU score from the `results` dictionary using the key `'bleu'`.  
This will help you understand how BLEU evaluates the overlap between the generated and reference texts.

In [4]:
results = bleu.compute(predictions=predictions, references=references)
print(f"BLEU score: {results['bleu']:.4f}")

BLEU score: 0.3976



----

BLEU also has several limitations, which we'll illustrate through examples.

In [5]:
references = [["The cat is on the mat."]]
predictions_good = ["A cat sits on the rug."]    
predictions_bad = ["The cat is not on the mat."]  


Compute the scores of `predictions_good` and `predictions_bad` against the `references`, and **discuss** why this happens: what limitation of BLEU does it reveal?

In [7]:
score_good = bleu.compute(predictions=predictions_good, references=references)
score_bad = bleu.compute(predictions=predictions_bad, references=references)

print(f"Semantically similar sentence BLEU: {score_good['bleu']:.4f}")
print(f"Semantically wrong sentence BLEU:  {score_bad['bleu']:.4f}")

Semantically similar sentence BLEU: 0.0000
Semantically wrong sentence BLEU:  0.5000


This happens because BLEU only measures surface-level n-gram overlap, without understanding the underlying semantics of the sentence. It rewards lexical similarity rather than meaning, revealing one of BLEU's main limitations: it cannot distinguish between correct and incorrect meanings if the wording is similar.
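
To see where the two scores come from, here is a minimal from-scratch sketch of sentence-level BLEU. It is not the implementation behind `evaluate.load("bleu")`: the regex tokenizer and the helper names (`tokenize`, `ngram_counts`, `clipped_precisions`, `sentence_bleu`) are assumptions made for illustration, and only a single reference is supported.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): words and punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precisions(prediction, reference, max_order=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # Counter intersection keeps the minimum count per n-gram ("clipping").
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(matches / total if total else 0.0)
    return precisions


def sentence_bleu(prediction, reference, max_order=4):
    precisions = clipped_precisions(prediction, reference, max_order)
    if min(precisions) == 0:
        # Without smoothing, one n-gram order with no matches zeroes the score.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_order
    c, r = len(tokenize(prediction)), len(tokenize(reference))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "The cat is on the mat."
for label, prediction in [("paraphrase", "A cat sits on the rug."),
                          ("negated", "The cat is not on the mat.")]:
    precisions = clipped_precisions(prediction, reference)
    score = sentence_bleu(prediction, reference)
    print(label, [round(p, 3) for p in precisions], round(score, 4))
```

The paraphrase shares no trigram or 4-gram with the reference, so its score collapses to zero, while the negated sentence keeps matches at every order (7/8, 5/7, 3/6, 1/5), whose geometric mean is 0.5.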

### ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package for evaluating automatic summarization and machine translation systems in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations. ROUGE scores range between 0 and 1, with higher scores indicating higher similarity between the automatically produced text and the reference.



This metric is a wrapper around the Google Research reimplementation of ROUGE: https://github.com/google-research/google-research/tree/master/rouge

Unlike BLEU, which focuses on precision, ROUGE emphasizes **recall**: how much of the reference text's content is captured in the generated text. Note that the `evaluate` wrapper reports an F-measure by default, which combines precision and recall.

Here are the main variants you'll encounter:

- **ROUGE-1**: Measures the overlap of individual words (unigrams) between the prediction and reference.  
  → Captures basic lexical similarity.

- **ROUGE-2**: Measures the overlap of 2-word sequences (bigrams).  
  → Reflects fluency and short-phrase consistency.

- **ROUGE-L**: Based on the *Longest Common Subsequence (LCS)* between prediction and reference.  
  → Captures sentence-level structure and word order similarity.

- **ROUGE-Lsum**: A summary-level variant of ROUGE-L that computes the LCS sentence by sentence (splitting on newlines) and combines the matches over the whole summary. For single-sentence inputs it equals ROUGE-L.  
  → More suitable for multi-sentence summarization tasks.

💡 **Interpretation:**  
Higher ROUGE scores generally indicate better content overlap with the reference, but like BLEU, ROUGE is still surface-based and does not measure semantic correctness or factual accuracy.

Consider the following example:

In [9]:
reference = "The cat is on the mat."

prediction_normal = "The cat is on the mat."
prediction_paraphrase = "A cat sits on the rug."
prediction_negated = "The cat is not on the mat."


----

**`TODO:`** Use the example above to compute **ROUGE** scores for your generated outputs.



1. **Load the ROUGE metric** using `evaluate.load("rouge")`.

2. **Compute ROUGE** for your different predictions (e.g., `prediction_normal`, `prediction_paraphrase`, and `prediction_negated`) against the same reference.

3. **Access specific ROUGE variants** such as ROUGE-1, ROUGE-2, ROUGE-L, or ROUGE-Lsum by indexing the result dictionary (e.g., `rouge.compute(...)['rouge1']`).

4. **Compare the results** across the three predictions:  
   - How do the scores differ between exact matches, paraphrases, and negated sentences?  
   - Do higher ROUGE scores always correspond to better or more semantically accurate outputs?

5. **Discuss your findings:**  
   Consider where ROUGE may fail to capture semantic equivalence or meaning preservation.

----

In [12]:
rouge = evaluate.load("rouge")
res_normal = rouge.compute(predictions=[prediction_normal], references=[reference])
res_para = rouge.compute(predictions=[prediction_paraphrase], references=[reference])
res_neg = rouge.compute(predictions=[prediction_negated], references=[reference])

def show(tag, res):
    print(f"\n[{tag}]")
    print(f"ROUGE-1:  {res['rouge1']:.4f}")
    print(f"ROUGE-2:  {res['rouge2']:.4f}")
    print(f"ROUGE-L:  {res['rougeL']:.4f}")
    print(f"ROUGE-Lsum: {res['rougeLsum']:.4f}")

show("Normal (near-identical wording)", res_normal)
show("Failure: paraphrase (semantic match, low surface overlap)", res_para)
show("Failure: negation (surface match, opposite meaning)", res_neg)


[Normal (near-identical wording)]
ROUGE-1:  1.0000
ROUGE-2:  1.0000
ROUGE-L:  1.0000
ROUGE-Lsum: 1.0000

[Failure: paraphrase (semantic match, low surface overlap)]
ROUGE-1:  0.5000
ROUGE-2:  0.2000
ROUGE-L:  0.5000
ROUGE-Lsum: 0.5000

[Failure: negation (surface match, opposite meaning)]
ROUGE-1:  0.9231
ROUGE-2:  0.7273
ROUGE-L:  0.9231
ROUGE-Lsum: 0.9231


In these results, the negated sentence scores far higher than the paraphrase (ROUGE-1 of 0.9231 versus 0.5000), even though it states the opposite of the reference. Like BLEU, ROUGE is surface-based and does not measure semantic correctness or factual accuracy.
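
The ROUGE numbers above can be reproduced with a few lines of plain Python. The sketch below is a simplified illustration, not the Google Research implementation: it assumes a lowercase alphanumeric tokenizer with no stemming, supports a single reference, and reports the F-measure. The helper names (`tokenize`, `f_measure`, `rouge_n`, `lcs_length`, `rouge_l`) are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, keep alphanumeric runs only, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def f_measure(matches, pred_total, ref_total):
    # Harmonic mean of precision (matches / prediction size)
    # and recall (matches / reference size).
    if matches == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    pred, ref = tokenize(prediction), tokenize(reference)
    pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Counter intersection counts each shared n-gram at most as often as it appears in both.
    matches = sum((pred_counts & ref_counts).values())
    return f_measure(matches, sum(pred_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    # Classic dynamic-programming table for the longest common subsequence.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    return f_measure(lcs_length(pred, ref), len(pred), len(ref))


reference = "The cat is on the mat."
for label, prediction in [("paraphrase", "A cat sits on the rug."),
                          ("negated", "The cat is not on the mat.")]:
    print(f"{label}: ROUGE-1={rouge_n(prediction, reference, 1):.4f} "
          f"ROUGE-2={rouge_n(prediction, reference, 2):.4f} "
          f"ROUGE-L={rouge_l(prediction, reference):.4f}")
```

For the negated sentence, all six reference words are recovered (recall 6/6) at a precision of 6/7, which gives the F-measure of 0.9231: adding "not" costs almost nothing.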

### Model-based metrics: bert_score

**BERTScore** is a *model-based evaluation metric* that measures the similarity between generated and reference texts using contextual embeddings from pretrained language models such as **BERT** or **RoBERTa**.  

Instead of comparing surface-level n-gram overlap (like BLEU or ROUGE), BERTScore computes **semantic similarity** between words by aligning their embeddings in a high-dimensional space.  It captures meaning even when different words or phrases are used.


💡 **Note:**  
BERTScore relies on a large pretrained model, so it is computationally heavier than BLEU or ROUGE, but it provides a more meaning-aware evaluation of generated text.

You can refer to Tianyi's paper for more details (but unfortunately, it's a different Tianyi, not me 🤡): https://arxiv.org/pdf/1904.09675


We reuse the same example:

In [22]:
reference = "The cat is on the mat."

prediction_paraphrase = "A cat sits on the rug."
prediction_negated = "The cat is not on the mat."

BERTScore uses contextual embeddings from a pretrained model (like BERT) to measure **semantic similarity** between tokens in the prediction and reference sentences.  
Here's how each component is computed:

1. **Token-level similarity:**  
   Each token is represented as a vector embedding.  
   The similarity between two tokens is measured using **cosine similarity**.

2. **Precision (P):**  
   For each token in the *prediction*, find the **most similar** token in the *reference*,  
   then take the **average** of these maximum similarities.  


3. **Recall (R):**  
   For each token in the *reference*, find the **most similar** token in the *prediction*,  
   then take the average of these maximum similarities.  

4. **F1 Score:**  
   The harmonic mean of Precision and Recall, capturing overall alignment between prediction and reference.


💡 **Intuition:**  
- Precision measures how *relevant* the generated tokens are to the reference.  
- Recall measures how much of the reference meaning is *covered* by the generation.  
- F1 balances both: higher F1 indicates stronger semantic similarity overall.
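
The greedy matching steps above can be sketched in plain Python. This is only an illustration of the arithmetic: the 3-dimensional vectors in `TOY_EMBEDDINGS` are invented for this example and are static, whereas real BERTScore takes contextual embeddings from a pretrained model. The function name `greedy_bertscore` is also hypothetical.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# Real BERTScore uses contextual vectors produced by a model such as BERT.
TOY_EMBEDDINGS = {
    "the": (1.0, 0.0, 0.0),
    "cat": (0.0, 1.0, 0.0),
    "is":  (0.0, 0.0, 1.0),
    "on":  (1.0, 1.0, 0.0),
    "mat": (0.0, 1.0, 1.0),
    "not": (1.0, 0.0, 1.0),
}


def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(prediction_tokens, reference_tokens, embeddings):
    pred_vecs = [embeddings[t] for t in prediction_tokens]
    ref_vecs = [embeddings[t] for t in reference_tokens]
    # Precision: each prediction token is matched to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its most similar prediction token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference_tokens = "the cat is on the mat".split()
for label, text in [("identical", "the cat is on the mat"),
                    ("negated", "the cat is not on the mat")]:
    p, r, f1 = greedy_bertscore(text.split(), reference_tokens, TOY_EMBEDDINGS)
    print(f"{label}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy setting the negated sentence still covers every reference token, so recall stays at 1, and precision drops only slightly because "not" is one token among seven. With contextual embeddings the other tokens shift a little as well, but the averaging effect is the same.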

------------
**`TODO:`** Compute BERTScore for each prediction against the `reference` using the `bertscore.compute()` function.  
- Load the metric with `evaluate.load("bertscore")`.  
- Pass each prediction (`prediction_normal`, `prediction_paraphrase`, `prediction_negated`) as a one-element list to the `predictions` parameter, and `[reference]` to the `references` parameter.  
- Set `model_type="bert-base-uncased"`.  
- Print the scores from the result dictionary using the keys `'precision'`, `'recall'`, and `'f1'` (each holds a list with one value per prediction).  
This will help you see how BERTScore rates a paraphrase and a negation compared with BLEU and ROUGE.

In [23]:
bertscore = evaluate.load("bertscore")

# Compute BERTScore for each prediction
res_normal = bertscore.compute(predictions=[prediction_normal], references=[reference], model_type="bert-base-uncased")
res_para = bertscore.compute(predictions=[prediction_paraphrase], references=[reference], model_type="bert-base-uncased")
res_neg = bertscore.compute(predictions=[prediction_negated], references=[reference], model_type="bert-base-uncased")

def show(tag, res):
    print(f"\n[{tag}]")
    print(f"Precision: {res['precision'][0]:.4f}")
    print(f"Recall:    {res['recall'][0]:.4f}")
    print(f"F1 Score:  {res['f1'][0]:.4f}")

show("Normal (identical)", res_normal)
show("Paraphrase (semantic match, different words)", res_para)
show("Negated (surface similar, opposite meaning)", res_neg)


[Normal (identical)]
Precision: 1.0000
Recall:    1.0000
F1 Score:  1.0000

[Paraphrase (semantic match, different words)]
Precision: 0.7715
Recall:    0.7715
F1 Score:  0.7715

[Negated (surface similar, opposite meaning)]
Precision: 0.9052
Recall:    0.9432
F1 Score:  0.9238


----
**TODO**: Discussion: Why does `prediction_negated` still get a high score? How could it be improved?

BERTScore measures similarity between individual tokens and matches each token to its most similar counterpart.
In "The cat is not on the mat.", all tokens except "not" are identical to the reference,
so their cosine similarities are very high, and the single unmatched token barely lowers the average.

Learned metrics such as BLEURT, which is fine-tuned on human ratings, tend to correlate more strongly with human judgments and are often better than BERTScore at penalizing semantic errors.