In [None]:
# There were some tweaks this week, so update your course package
# uncomment the line below to install the latest version
#!pip install ../Course_Tools/introdl

You may wish to manage your disk space before starting the lesson.  You'll be downloading several new models and datasets, and you don't want to run out of disk space.

Deleting both the `cs_workspace` and the `~/.cache/huggingface` directories before running the lesson code or working on the homework will remove all your previous models and datasets.  It won't affect any of your Lesson or Homework notebooks.

You can uncomment and run the following cell on your compute server (it won't hurt to run it on the home server, but cs_workspace doesn't exist there).

In [None]:
# be careful with rm -rf, it will delete everything in the path you give it
# !rm -rf ~/cs_workspace
# !rm -rf ~/.cache/huggingface

In [1]:
from introdl.utils import config_paths_keys, wrap_print_text

# call config_paths_keys() before importing Hugging Face packages
paths = config_paths_keys()
MODELS_PATH = paths['MODELS_PATH'] # where to store your trained models
DATA_PATH = paths['DATA_PATH'] # where to store downloaded data
CACHE_PATH = paths['CACHE_PATH'] # where to store pretrained models

print = wrap_print_text(print, width = 100)

from datasets import load_dataset
from evaluate import load
from nltk import sent_tokenize, download
import numpy as np
import torch
import transformers
from transformers import (
    Trainer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, 
    AutoTokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer
)
import warnings

# Download Punkt tokenizer for sentence splitting (used by ROUGE-Lsum)
download("punkt", quiet=True)
download("punkt_tab", quiet=True)

# Load evaluation metrics
rouge = load("rouge")
bertscore = load("bertscore")

# Suppress warnings from the transformers library
transformers.logging.set_verbosity_error()

from helpers import compute_all_metrics, print_metrics

MODELS_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\models
DATA_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\data
CACHE_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
TORCH_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_DATASETS_CACHE=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\data


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Successfully logged in to Hugging Face Hub.


## **Section 1 - Introduction to Text Summarization**

Text summarization is the task of generating a concise and coherent summary that captures the key information from a longer piece of text. As the volume of textual data continues to grow—from news articles and research papers to customer reviews and meeting transcripts—summarization plays an increasingly important role in helping people process and understand information efficiently.

There are two main approaches to summarization:

---

### 📌 Extractive Summarization

Extractive summarization works by identifying and selecting the most important sentences or phrases from the original text. The summary is formed by piecing together these extracted parts without modifying the original wording.

**Example**:  
**Original text**:  
> The mayor held a press conference today announcing new environmental initiatives to reduce air pollution in the city. The new policies include increased funding for public transportation and stricter emissions regulations for factories.  

**Extractive summary**:  
> The mayor held a press conference today announcing new environmental initiatives to reduce air pollution in the city.
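
To make the idea concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is only an illustration: the function name `extractive_summary`, the regex sentence splitter, and the crude length-based stop-word filter are assumptions made for this example, not a method used elsewhere in this lesson. The property to notice is that the output is copied verbatim from the input.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from `text`."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text; skipping short words is a crude
    # stand-in for a real stop-word list (an assumption of this sketch)
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3)

    def score(sentence):
        # Average frequency of the words in the sentence
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)

text = (
    "The mayor held a press conference today announcing new environmental "
    "initiatives to reduce air pollution in the city. The new policies include "
    "increased funding for public transportation and stricter emissions "
    "regulations for factories."
)
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Whichever sentence is selected, it appears word for word in the original text, which is what distinguishes extractive from abstractive summarization.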

---

### 📌 Abstractive Summarization

Abstractive summarization goes a step further by generating new sentences that may not appear in the original text. It interprets and paraphrases the key ideas using natural language generation, similar to how a human might summarize.

**Abstractive summary**:  
> The mayor introduced new plans to cut air pollution through transit funding and factory regulations.

---

### ✅ Why This Lesson Focuses on Abstractive Summarization

While extractive summarization is easier to implement and often performs well on factual documents, it has limitations:
- It may copy irrelevant or redundant text.
- It struggles to rephrase or synthesize ideas.

Abstractive summarization, powered by deep learning models like BART, T5, PEGASUS and LLMs, enables:
- More fluent and human-like summaries.
- Better generalization across different domains.
- Control over the tone and length of the summary.

Because abstractive models are more aligned with how people summarize information—and because they demonstrate the strengths of modern language generation—we’ll focus on **abstractive summarization** in this lesson.

In the next section we discuss the major models, including LLMs, used for summarization.

## **Section 2 - Overview of Popular Models for Abstractive Summarization**

Summarization tasks can be tackled using **specialized encoder-decoder models** like **BART, T5, and PEGASUS**, or **general-purpose decoder-only models (LLMs)** like **GPT-4o and LLaMA**.  Our textbook already introduced the models, so we won't go into detail here.  Instead, we'll discuss the strengths and weaknesses of each approach.

---

### ✅ **Strengths of Specialized Summarization Models (BART, T5, PEGASUS)**
1. **Architectural Efficiency**
   - Encoder-decoder models process the entire input once with the encoder before generating the summary, making them *computationally efficient* for summarization.
   - In contrast, decoder-only models must repeatedly attend to the entire input during generation, which is particularly costly for long inputs.

2. **Tailored Training Objectives**
   - These models are pre-trained specifically for text-to-text tasks.
     - **BART:** Trained as a denoising autoencoder, making it robust to noisy or incomplete input.
     - **T5:** Uses a “text-to-text” framework, making it versatile across various NLP tasks, including summarization.
     - **PEGASUS:** Pre-trained to generate summaries by masking entire sentences during training, directly optimizing for abstractive summarization.

3. **Alignment with Summarization Tasks**
   - Fine-tuning on summarization datasets (e.g., CNN/Daily Mail, XSum) leads to **high-quality summaries** that are concise and relevant.
   - Performance on benchmarks often surpasses general-purpose LLMs.

4. **Better Control over Output**
   - Easier to enforce structure, conciseness, or adherence to specific formatting requirements.
   - Less prone to **hallucinations** or verbose outputs compared to general-purpose LLMs.

5. **Domain-Specific Optimization**
   - Fine-tuning encoder-decoder models on specialized datasets (e.g., medical or legal texts) produces highly accurate summaries with relevant terminology and structure.

---

### ❌ **Weaknesses of Specialized Summarization Models**
1. **Limited Generalization**
   - Models like BART, T5, and PEGASUS require fine-tuning for specific summarization tasks.
   - Struggle with novel domains or tasks without retraining.

2. **Less Effective at Zero-Shot Summarization**
   - General-purpose LLMs can perform reasonably well on summarization tasks without fine-tuning, which is challenging for encoder-decoder models.

3. **Inflexibility**
   - Encoder-decoder models are often designed for fixed inputs and outputs, making them less adaptable to creative or open-ended summarization tasks.

---

### ✅ **Strengths of LLMs for Summarization**
1. **Generalization Across Tasks**
   - Capable of summarization **without fine-tuning** through prompt engineering (e.g., “Summarize the following text...”).
   - Strong performance across various domains with minimal adjustments.

2. **Few-Shot & Zero-Shot Learning**
   - Easily adaptable to new domains or styles through *in-context learning* (providing examples within the prompt).

3. **Versatility**
   - Handles a wide range of tasks beyond summarization, making them highly flexible for mixed-use applications.
   - Can switch between extractive, abstractive, or creative summarization depending on the prompt.

4. **Ease of Use**
   - No need for specialized training or fine-tuning, making them immediately usable for various summarization tasks.
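
As a rough illustration of prompt-based summarization, here is a minimal sketch of how zero-shot and few-shot prompts might be assembled. The function name, the prompt wording, and the `Text:`/`Summary:` layout are all assumptions chosen for this example. No particular LLM API is implied; the resulting string would be sent to whichever model you use.

```python
def build_summarization_prompt(article, examples=(), max_sentences=1):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [f"Summarize the following text in at most {max_sentences} sentence(s)."]
    # In-context examples: each is an (article, summary) pair shown to the model
    for example_article, example_summary in examples:
        parts.append(f"Text: {example_article}\nSummary: {example_summary}")
    # The article we actually want summarized, with the summary left open
    parts.append(f"Text: {article}\nSummary:")
    return "\n\n".join(parts)

zero_shot = build_summarization_prompt("The council approved the new budget on Monday.")
few_shot = build_summarization_prompt(
    "The council approved the new budget on Monday.",
    examples=[("Heavy rain flooded several streets downtown.", "Rain floods downtown streets.")],
)
print(few_shot)
```

The only difference between the two prompts is the block of worked examples, which is what in-context learning refers to.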

---

### ❌ **Weaknesses of LLMs for Summarization**
1. **Inefficiency for Long Texts**
   - Decoder-only models process the entire input text during every generation step, resulting in high computational costs for long documents.

2. **Prone to Hallucination**
   - Without fine-tuning or careful prompting, LLMs can generate irrelevant or incorrect information, particularly for factual summarization tasks.

3. **Less Structured Output**
   - Outputs may be verbose or off-topic unless the prompt is carefully designed to enforce structure and conciseness.

4. **Lack of Task-Specific Optimization**
   - General-purpose LLMs may underperform compared to fine-tuned encoder-decoder models on specific summarization datasets.

---

### **State-of-the-art Models**

As of April 2025, BART, T5, and PEGASUS remain state-of-the-art for many summarization tasks, especially when compute and data are limited, when you need efficient fine-tuned models for specific domains, or when you're doing sequence-to-sequence tasks where controllability and reproducibility matter.

### **What’s new for summarization tasks?**

More recent models like:

- **FLAN-T5** (instruction-tuned T5)  
- **UL2** (Unifying Language Learning)  
- **PaLM**, **Gemma**, **Mixtral**, and **LLaMA** family models (especially when fine-tuned)  
- **Longformer Encoder-Decoder (LED)** or **LongT5** for long documents  
- **LLMs** (ChatGPT, Claude, Llama etc.) for few-shot or zero-shot summarization  
- **HERA** (Hallucination Evaluation and Rewriting Architecture) for post-editing summaries to reduce factual errors  

...can outperform **BART**, **PEGASUS**, and **T5** in terms of quality when used with strong prompting or fine-tuning. However, these models:

- Are much larger  
- Often require API access or substantial compute  
- May not be easily reproducible or tunable for every use case  
- (In the case of **HERA**) introduce additional stages to the pipeline, increasing complexity



## **Section 3 - Metrics for Evaluating Generated Text**

The book discusses two metrics, ROUGE and BLEU.  We introduced BERTScore last week in the text-generation lesson.  We'll use all three to evaluate our summarization results; they also apply to any text-generation task in which a reference text is available.  A fourth, more recently developed metric is BARTScore, which is well suited to summarization tasks.  We'll introduce it here but won't demonstrate it because it's a bit difficult to set up and not widely used yet.

Evaluating text generation is challenging because there are **many valid summaries** for a single input, and traditional metrics like ROUGE or BLEU only compare surface-level word overlaps. They often **miss meaning**, **penalize paraphrasing**, and **fail to detect factual errors or hallucinations**. Newer metrics like BERTScore and BARTScore help, but no single metric fully captures quality, faithfulness, and readability.


Use an AI assistant to get more details as needed about terms such as n-gram and skip-bigram.

### **Brief Introductions to the Metrics**

There are many others, but we'll focus on these:

1. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
   - Measures **n-gram, subsequence, or skip-bigram overlap** between a candidate and reference text.
   - Commonly used for **extractive summarization** but also applied to **abstractive summarization**.
   - Variants: **ROUGE-N (e.g., ROUGE-1, ROUGE-2), ROUGE-L (Longest Common Subsequence), ROUGE-S (Skip-bigrams)**.

2. **BLEU (Bilingual Evaluation Understudy)**
   - Measures **n-gram overlap** between a candidate text and one or more reference texts.
   - Originally designed for **machine translation**, but adapted for **summarization**.
   - Uses a **brevity penalty** to avoid favoring overly short outputs.
   - Often reported with **1-gram to 4-gram precision scores**.

3. **BERTScore**
   - Measures **semantic similarity** between candidate and reference texts using **contextual embeddings** from models like BERT.
   - Matches tokens based on their **cosine similarity in embedding space**.
   - Effective for **abstractive summarization**, especially when paraphrasing is present.

4. **BARTScore**
   - Uses **pretrained language models (e.g., BART)** to estimate the **likelihood of a summary given the source text** and vice versa.
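   - Evaluates summaries by **scoring in more than one direction**: faithfulness (`P(summary | source)`, how likely the summary is given the source) and coverage (`P(reference | summary)`, how likely a reference summary is given the candidate).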
   - Particularly useful for evaluating **fluency, coherence, and factual consistency**.

---

### 📊 **Comparison Table**

| **Metric**    | **Use-Cases**                        | **Strengths**                                                                                          | **Weaknesses**                                                                                    |
|---------------|--------------------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| **ROUGE**     | Extractive summarization, Some Abstractive Summarization | - Easy to compute and interpret. <br> - Works well for extractive tasks. <br> - Multiple variants for different needs. | - Ignores paraphrasing. <br> - Surface-level comparison. <br> - Sensitive to minor wording changes. |
| **BLEU**      | Machine Translation, Summarization   | - Simple and fast to compute. <br> - Precision-oriented. <br> - Useful for extractive and some abstractive tasks. | - Penalizes paraphrasing. <br> - Limited to local n-gram matching. <br> - Ignores semantic similarity. |
| **BERTScore** | Abstractive Summarization, Paraphrasing | - Captures semantic similarity well. <br> - Robust to paraphrasing and rephrasing. <br> - Works well for abstractive summaries. | - Ignores coherence and sentence structure. <br> - Dependent on quality of pretrained embeddings. |
| **BARTScore** | Abstractive Summarization, Coherence Evaluation, Faithfulness Check | - Measures fluency, coherence, and factual consistency. <br> - Can evaluate coverage and faithfulness. <br> - Useful for abstractive summarization. | - Sensitive to the training domain. <br> - Can prioritize fluency over factual accuracy. |

---


### **Demonstrating the Metrics with Examples**

#### L12_1_Metrics_Examples Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l12_1_metrics_example/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l12_1_metrics_example/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/M4axFkxWp6s" target="_blank">Open Descript version of video in new tab</a>


To get a better feel for what these metrics do, let's compute the metrics for a series of sentence pairs. In each case we have `prediction` and `reference`.  You can think of the prediction as a predicted summary and reference as the ground-truth summary or you can simply think of them as two texts for which we want to determine similarity. We'll use `compute_all_metrics` and `print_metrics` from `helpers.py`.  In the video for this section we talk a bit about that code.

#### ✳️ Example 1. *Exact lexical match — all metrics should score well*



In [2]:
reference = "The cat sat on the mat."
prediction = "The cat sat on the mat."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 100.00
rouge1: 100.00
rouge2: 100.00
rougeL: 100.00
rougeLsum: 100.00
bertscore_f1: 100.00


All the metrics are perfect since there's perfect token overlap and order.



#### ✳️ 2. *Minor synonym substitution — shows BERTScore strength*


In [3]:

reference = "The cat sat on the mat."
prediction = "The feline rested on the rug."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 50.00
rouge2: 20.00
rougeL: 50.00
rougeLsum: 50.00
bertscore_f1: 95.97


BLEU and ROUGE struggle due to lack of exact word overlap.  BERTScore captures the semantic similarity because the synonyms produce embeddings that are close together.
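
To see where the ROUGE-1 and ROUGE-2 numbers above come from, here is a minimal from-scratch sketch of ROUGE-N as an F1 score over n-gram counts. It assumes a simple lowercase, alphanumeric-only tokenizer and no stemming. The `rouge` metric loaded from `evaluate` may tokenize differently in edge cases, so treat this as an approximation rather than the library's implementation.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, keep only runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge_n_f1(prediction, reference, n=1):
    """F1 over n-gram overlap between a prediction and a single reference."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred = ngram_counts(tokenize(prediction))
    ref = ngram_counts(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The cat sat on the mat."
prediction = "The feline rested on the rug."
print(round(100 * rouge_n_f1(prediction, reference, n=1), 2))  # 50.0
print(round(100 * rouge_n_f1(prediction, reference, n=2), 2))  # 20.0
```

Only `the`, `the`, and `on` are shared unigrams (3 of 6), and only `on the` is a shared bigram (1 of 5), which gives the 50 and 20 reported above.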

#### ✳️ 3. *Paraphrase with reordering — ROUGE handles this better than BLEU*

In [4]:
reference = "The cat sat on the mat in the afternoon."
prediction = "In the afternoon, the cat was sitting on the mat."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 84.21
rouge2: 58.82
rougeL: 52.63
rougeLsum: 52.63
bertscore_f1: 95.91


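BLEU scores low because it requires exact matches of longer n-grams, which the reordering breaks.  ROUGE-1 and ROUGE-2 stay fairly high because they count shared words and word pairs wherever they appear, while ROUGE-L, which relies on the longest common subsequence, is lower because the clauses are reordered.  BERTScore is very high because it captures the paraphrased meaning.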

#### ✳️ 4. *Prediction is shorter but contains key ideas — BLEU drops, ROUGE & BERTScore still decent*


In [5]:
reference = "The government announced a stimulus package to support the economy during the recession."
prediction = "A stimulus package was announced."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 44.44
rouge2: 25.00
rougeL: 33.33
rougeLsum: 33.33
bertscore_f1: 92.60


BLEU applies a harsh penalty for short output.  The ROUGE scores show that key content words are captured.  BERTScore remains high because it recognizes the core meaning.
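
The sketch below works through BLEU's two ingredients for this example: clipped n-gram precisions and the brevity penalty. It assumes an unsmoothed BLEU with n-grams up to length 4 and a simple lowercase, alphanumeric tokenizer. The exact BLEU configuration inside `compute_all_metrics` is not shown in this notebook, so this is an illustration rather than a reproduction.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, keep only runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

def clipped_precision(pred_tokens, ref_tokens, n):
    """Fraction of predicted n-grams found in the reference (counts are clipped)."""
    pred = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in pred.items()) / total

def brevity_penalty(pred_len, ref_len):
    """1.0 when the prediction is at least as long as the reference, else exp(1 - r/c)."""
    if pred_len == 0:
        return 0.0
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / pred_len)

def simple_bleu(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    precisions = [clipped_precision(pred, ref, n) for n in range(1, max_n + 1)]
    # Without smoothing, one zero precision makes the geometric mean zero
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(pred), len(ref)) * geo_mean

reference = "The government announced a stimulus package to support the economy during the recession."
prediction = "A stimulus package was announced."
pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
print([round(clipped_precision(pred_tokens, ref_tokens, n), 3) for n in range(1, 5)])
print(round(brevity_penalty(len(pred_tokens), len(ref_tokens)), 3))
print(simple_bleu(prediction, reference))
```

Under these assumptions the brevity penalty alone would scale the score by about 0.2 (5 predicted tokens against 13 reference tokens), and the absence of any matching 4-gram sends the unsmoothed score to zero.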

#### ✳️ 5. *Prediction uses entirely different vocabulary — only BERTScore succeeds*


In [6]:

reference = "The plane crashed due to engine failure."
prediction = "The aircraft accident was caused by mechanical problems."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 13.33
rouge2: 0.00
rougeL: 13.33
rougeLsum: 13.33
bertscore_f1: 93.44


BLEU and ROUGE are both low because there's no surface overlap in the texts while BERTScore is high because it understands semantic similarity.  This example highlights why modern metrics are needed.

#### ✳️ 6. *Copying style but not content — ROUGE and BERTScore may be fooled*



In [7]:
reference = "The court ruled in favor of the defendant."
prediction = "The judge made a ruling in the case of the cat and the fiddle."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 45.45
rouge2: 20.00
rougeL: 45.45
rougeLsum: 45.45
bertscore_f1: 90.87


BLEU scores low because there is almost no n-gram overlap.  ROUGE scored moderately due to some word/form overlap.  BERTScore was high because many of the tokens (words) have similar embeddings.  This is a **failure case** showing the limitations of automatic metrics.  The texts have similar form and related words, but their factual meanings are quite different.  


#### ✳️ 7. *Word salad with shared vocabulary — BERTScore gives falsely high score*



In [8]:
reference = "The stock market crashed due to unexpected inflation news."
prediction = "Inflation stock news market due crashed the unexpected."
metrics = compute_all_metrics(prediction, reference)
print_metrics(metrics)

bleu: 0.00
rouge1: 94.12
rouge2: 0.00
rougeL: 47.06
rougeLsum: 47.06
bertscore_f1: 89.47


BLEU is low because there's little n-gram overlap.  However, ROUGE and BERTScore both get fooled.  This is another **failure case.** Let's look at the ROUGE scores in detail because it will help us better understand how they work:


| Metric     | Score | Why? |
|------------|-------|------|
| **ROUGE-1** | 94.12 | This measures **unigram (single word)** overlap. The prediction contains nearly all the same words as the reference, just in a jumbled order. So the unigram match is very high. |
| **ROUGE-2** | 0.00  | This measures **bigram (2-word sequence)** overlap. Since the word order is completely scrambled, there are **no matching bigrams** — hence a zero score. |
| **ROUGE-L** | 47.06 | ROUGE-L is based on the **Longest Common Subsequence (LCS)**. Some words appear in the same order in both texts (e.g., `"stock market crashed unexpected"`), but the rest are rearranged. So you get a partial score. |
| **ROUGE-Lsum** | 47.06 | Same as ROUGE-L here, but adjusted for sentence-level evaluation with potential sentence boundaries. In this case, there's only one sentence, so it's equivalent. |

ROUGE-1 alone can be misleading — it gives high scores even if the summary is nonsense, as long as the words are right.

ROUGE-2 and ROUGE-L help mitigate this by adding sensitivity to word order and structure — but they still don't fully capture meaning.
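
The ROUGE-L value in the table above can be reproduced with a short longest-common-subsequence computation. This is a sketch that assumes a simple lowercase, alphanumeric tokenizer and the F1 form of ROUGE-L; the function names are illustrative and this is not the library's implementation.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase, keep only runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The stock market crashed due to unexpected inflation news."
prediction = "Inflation stock news market due crashed the unexpected."
print(lcs_length(tokenize(prediction), tokenize(reference)))  # 4
print(round(100 * rouge_l_f1(prediction, reference), 2))      # 47.06
```

The longest in-order match has 4 words, so precision is 4/8, recall is 4/9, and their F1 is about 0.4706.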

BERTScore is high because the token embeddings match even though the meanings do not.  BERTScore is better at capturing meaning, but it isn't perfect.
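
The greedy matching behind BERTScore can be sketched in a few lines. The example below uses made-up, static 2-D vectors in place of real contextual embeddings, and it leaves out optional refinements such as importance weighting, so the vectors, names, and numbers are purely hypothetical. It is meant only to show why a shuffled sentence can still match almost perfectly token by token.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(pred_vecs, ref_vecs):
    """Match every token to its most similar token on the other side, then average."""
    # Precision: each predicted token looks for its best match in the reference
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token looks for its best match in the prediction
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical static "embeddings" (real BERTScore uses contextual ones)
toy_embeddings = {
    "stock": [1.0, 0.1],
    "market": [0.9, 0.3],
    "crashed": [0.1, 1.0],
}
reference_tokens = ["stock", "market", "crashed"]
shuffled_tokens = ["crashed", "stock", "market"]

score = greedy_match_f1(
    [toy_embeddings[t] for t in shuffled_tokens],
    [toy_embeddings[t] for t in reference_tokens],
)
print(round(score, 4))
```

With static vectors the shuffled text scores essentially 1.0 because matching ignores order entirely. Real contextual embeddings shift with word order, which is why the actual score above is high rather than perfect.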

The bottom line is that **no metric is perfect**, and **human review is always necessary**, especially for evaluating meaning and fluency.

## **Section 4 - Fine-Tuning and Evaluating a Summarization Model**

#### L12_Fine_Tuning_Summarization Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l12_fine_tuning_summarization/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l12_fine_tuning_summarization/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/MYc2Y04Y2LH" target="_blank">Open Descript version of video in new tab</a>


In this section, we'll start with a BART model that has already been fine-tuned for a summarization task and demonstrate how to fine-tune it for a different task. Specifically, we'll begin with `facebook/bart-large-cnn`, a model fine-tuned to summarize news articles using the **CNN/DailyMail dataset**, and adapt it to the **XSum dataset** for generating highly abstractive summaries. 

The **CNN/DailyMail dataset**, used to fine-tune `facebook/bart-large-cnn`, consists of news articles paired with multi-sentence summaries that are often extractive in nature. In contrast, the **XSum dataset** is a collection of BBC articles, each paired with a single-sentence summary that captures the essence of the article. The **XSum dataset** is widely used for training and evaluating abstractive summarization models due to its focus on generating concise and highly abstractive summaries.

This process will demonstrate transfer learning for text summarization.  

### ✳️ **Setup**

Here we'll instantiate the Hugging Face classes we'll be using.  They are:

- **`AutoModelForSeq2SeqLM`**:
  - A class for loading pre-trained sequence-to-sequence models (e.g., BART, T5).
  - Specifically designed for tasks like summarization, translation, and text generation.
  - Includes the `generate()` method for text generation.

- **`AutoTokenizer`**:
  - A class for loading the appropriate tokenizer for a given model.
  - Handles tokenization (converting text to token IDs) and detokenization (converting token IDs back to text).
  - Ensures compatibility with the pre-trained model being used.

- **`DataCollatorForSeq2Seq`**:
  - A class for preparing batches of data for sequence-to-sequence models.
  - Handles padding and formatting of input and output sequences to ensure they are compatible with the model.
  - Useful for training and evaluation pipelines.

In [9]:

# Load model and tokenizer
model_name = "facebook/bart-large-cnn"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
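
To give a feel for what the data collator does with a batch, here is a simplified pure-Python sketch of dynamic padding. It is not the library's implementation: the function name `pad_seq2seq_batch` is hypothetical, it works on plain lists instead of tensors, and it skips the extra step in which the real collator, when given a model, also prepares decoder inputs from the labels. The pad id of 1 is assumed to be BART's padding token id, and -100 is the label value that the loss function ignores.

```python
def pad_seq2seq_batch(features, pad_token_id, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in this batch (hypothetical helper)."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss
        n_label_pad = max_label_len - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch

features = [
    {"input_ids": [0, 133, 455, 2], "labels": [0, 500, 2]},
    {"input_ids": [0, 133, 2], "labels": [0, 2]},
]
batch = pad_seq2seq_batch(features, pad_token_id=1)  # assumed BART pad id
print(batch)
```

Padding per batch, rather than to one global length, keeps batches of short articles small.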

---

### ✳️ **Load and Preview Data**

We're going to choose small subsets for training and validation to demonstrate how fine-tuning works and performs, but in a production setting we'd use all of the training data that we can or at least as much as we can afford to use for training.

In [10]:
# Load XSum dataset
dataset = load_dataset("xsum", trust_remote_code=True)
num_train = min(2000, len(dataset["train"]))
num_val = min(200, len(dataset["validation"]))
train_data = dataset["train"].select(range(num_train))
val_data = dataset["validation"].select(range(num_val))

# Preview 2 validation samples
for i in range(2):
    print(f"\n📄 Article {i+1}:\n{val_data[i]['document']}")
    print(f"📝 Summary:\n{val_data[i]['summary']}")


📄 Article 1:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand
trial in July.
They were all released on bail.
📝 Summary:
Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of
charity fraud.

📄 Article 2:
Voges was forced to retire hurt on 86 after suffering the injury while batting during the County
Championship draw with Somerset on 4 June.
Middlesex hope to have the Australian back for their T20 Blast game against Hampshire at Lord's on 3
August.
The 37-year-old has scored 230 runs in four first-class games this se


---

### ✳️ **Tokenize the Datasets**

We start with a function that takes one item from the dataset, e.g. `train_data[0]`.  Let's look at the structure of that item:

In [11]:
print(train_data[0])

{'document': 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still
being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly
affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the
Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart
after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to
inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on
Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which
was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever,
she said more preventative work could have been carried out to ensure the retaining wall did not
fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I
totally apprecia

We need to produce a new dictionary that has 'input_ids', which will be the tokenized text of 'document'.  It will also have an 'attention_mask' of 1's and 0's, where 0's indicate padding tokens in 'input_ids' that should be ignored (no padding is applied at this stage, so the mask is all 1's until the data collator pads each batch).  Finally, we'll add a 'labels' key to the dictionary whose value is the tokenized summary.

The best way to understand this is to examine the code below and study the input dictionary above and the output dictionary below.

In [12]:

def dataset_tokenizer(examples):
    model_inputs = tokenizer(
        examples['document'], max_length=512, truncation=True
    )
    labels = tokenizer(
        text_target=examples['summary'], max_length=64, truncation=True
    )
    model_inputs['labels'] = labels['input_ids']
    return model_inputs


In [13]:
print(dataset_tokenizer(train_data[0]))

{'input_ids': [0, 133, 455, 701, 9, 1880, 11, 10793, 6192, 6, 65, 9, 5, 911, 2373, 2132, 6, 16, 202,
145, 11852, 4, 50118, 22026, 2456, 173, 16, 2256, 11, 10034, 1758, 8, 171, 3197, 11, 221, 1942, 428,
1672, 6867, 1091, 7340, 2132, 30, 2934, 514, 4, 50118, 12667, 5069, 15, 5, 3072, 3673, 42656, 652,
10044, 528, 7, 1880, 23, 5, 226, 9708, 1054, 16376, 625, 21491, 4, 50118, 10787, 1252, 8, 6028, 268,
58, 2132, 30, 5681, 11, 10793, 6192, 71, 5, 1995, 30084, 41031, 9725, 88, 5, 1139, 4, 50118, 10993,
692, 14371, 21801, 3790, 5, 443, 7, 18973, 5, 1880, 4, 50118, 133, 5794, 18646, 10, 17784, 2204, 6,
5681, 171, 1861, 3611, 15, 4769, 852, 111, 5, 1049, 3482, 10675, 17825, 4, 50118, 35689, 3398,
16255, 6, 54, 1831, 5, 43351, 16542, 61, 21, 7340, 2132, 6, 26, 79, 115, 45, 7684, 5, 3228, 12,
26904, 1263, 683, 5, 5005, 478, 4, 50118, 10462, 6, 79, 26, 55, 2097, 3693, 173, 115, 33, 57, 2584,
66, 7, 1306, 5, 17784, 2204, 222, 45, 5998, 4, 50118, 113, 243, 16, 1202, 53, 38, 109, 206, 89, 16,
98, 203

Finally, we use the `map` method to apply the function and produce the tokenized datasets.

In [14]:

# Tokenize data

tokenized_train = train_data.map(dataset_tokenizer, batched=True)
tokenized_val = val_data.map(dataset_tokenizer, batched=True)
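
With `batched=True`, `map` hands the function a dictionary of lists (many rows at once) instead of a single row, which is why `dataset_tokenizer` works both on `train_data[0]` and inside `map`. The plain-Python sketch below mimics that calling pattern. Everything in it is a hypothetical stand-in: `toy_tokenize` just splits on whitespace, and `toy_map` is not the `datasets` implementation.

```python
def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for a tokenizer: whitespace split plus truncation."""
    single = isinstance(texts, str)
    batch = [texts] if single else texts
    ids = [t.split()[:max_length] for t in batch]
    return ids[0] if single else ids

def toy_dataset_tokenizer(examples):
    # Works for one row (strings) or a batch (lists of strings)
    return {
        "input_ids": toy_tokenize(examples["document"], max_length=5),
        "labels": toy_tokenize(examples["summary"], max_length=3),
    }

def toy_map(rows, fn, batch_size=2):
    """Mimic a batched map: group rows, call fn on a dict of lists, merge results back."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        for i, row in enumerate(chunk):
            # New columns are added alongside the original ones
            out.append({**row, **{key: values[i] for key, values in result.items()}})
    return out

rows = [
    {"document": "one two three four five six seven", "summary": "a b c d"},
    {"document": "alpha beta", "summary": "x"},
    {"document": "red green blue", "summary": "p q r s t"},
]
tokenized = toy_map(rows, toy_dataset_tokenizer, batch_size=2)
print(tokenized[0])
```

Processing rows in groups lets a fast tokenizer handle many texts per call, which is the main reason to pass `batched=True`.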


---

### ✳️ **Fine-Tune BART-Large-CNN**

The code for fine-tuning is similar to what we've seen in previous lessons.  We have to set up the training arguments, then configure the trainer and execute the training.  Some of you have expressed interest in learning more about these configurations.  I've found AI to be really helpful for getting started here.

<details>
<summary>You can CLICK HERE to get a description of each of the training arguments</summary>

- **`output_dir`**: Specifies the directory where model checkpoints and outputs will be saved.
- **`eval_strategy`**: Determines when to evaluate the model during training. Options include "no", "steps", or "epoch". Here, it evaluates at the end of each epoch.
- **`save_strategy`**: Specifies when to save model checkpoints. Options include "no", "steps", or "epoch". Here, it saves at the end of each epoch.
- **`learning_rate`**: Sets the initial learning rate for the optimizer. Here, it is set to `3e-5`.
- **`per_device_train_batch_size`**: Defines the batch size for training on each device (e.g., GPU or CPU). Here, it is set to `8`.
- **`per_device_eval_batch_size`**: Defines the batch size for evaluation on each device. Here, it is set to `8`.
- **`num_train_epochs`**: Specifies the total number of training epochs. Here, it is set to `3`.
- **`weight_decay`**: Applies weight decay (L2 regularization) to prevent overfitting. Here, it is set to `0.01`.
- **`fp16`**: Enables mixed precision training for faster computation and reduced memory usage. Set to `True` to use FP16 (16-bit floating point).
- **`disable_tqdm`**: Disables the progress bar during training. Set to `False` to keep the progress bar visible.
- **`predict_with_generate`**: Enables text generation during evaluation to compute metrics like ROUGE. Set to `True` to generate predictions.

</details>

In [15]:
training_args = Seq2SeqTrainingArguments(
    output_dir=str(MODELS_PATH / "xsum_bart_large"),
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    disable_tqdm=False,
    predict_with_generate=True,
)
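
With these settings and the 2,000-example training subset, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device and no gradient accumulation (neither is set in the arguments above), and it assumes the library's default `logging_steps` of 500, which this notebook does not override.

```python
import math

num_train_examples = 2000          # size of the training subset selected earlier
per_device_train_batch_size = 8
num_train_epochs = 3
num_devices = 1                    # assumption: one GPU (or CPU)
gradient_accumulation_steps = 1    # assumption: library default
logging_steps = 500                # assumption: library default

effective_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
# Epoch during which the first training-loss value gets logged
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)  # 250 750 2
```

That total matches the `global_step=750` reported when this notebook was run, and the first training-loss log landing in epoch 2 is consistent with epoch 1 showing `No log`.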

<details>
<summary>CLICK HERE to get details about the arguments we're using in `Seq2SeqTrainer`</summary>

- **`model`**: The Hugging Face model to train or evaluate. In this case, it is the sequence-to-sequence model (`model`) loaded earlier (e.g., `facebook/bart-large-cnn`).

- **`args`**: The training arguments provided as an instance of `Seq2SeqTrainingArguments`. This includes configurations like learning rate, batch size, number of epochs, and evaluation strategy.

- **`train_dataset`**: The dataset used for training. Here, it is the tokenized training dataset (`tokenized_train`).

- **`eval_dataset`**: The dataset used for evaluation. Here, it is the tokenized validation dataset (`tokenized_val`).

- **`data_collator`**: A function or object that batches and preprocesses data for the model. In this case, it is the `DataCollatorForSeq2Seq` object we created above, which ensures proper padding and formatting for sequence-to-sequence tasks.

</details>


In [17]:

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Train
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,1.845416
2,1.562400,1.904088
3,1.562400,2.082829




TrainOutput(global_step=750, training_loss=1.2987611083984374, metrics={'train_runtime': 181.3485, 'train_samples_per_second': 33.085, 'train_steps_per_second': 4.136, 'total_flos': 6482063822487552.0, 'train_loss': 1.2987611083984374, 'epoch': 3.0})

There's some evidence of overfitting here: the validation loss rises after every epoch, from 1.85 after the first epoch to 2.08 after the third.  We're using a very small subset of the data for this demonstration, so it's not surprising that we're seeing overfitting.
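
As a quick check on that reading, the sketch below picks the epoch with the lowest validation loss. The numbers are copied from the training log above; nothing else is assumed.

```python
# Validation losses copied from the training log above
val_loss_by_epoch = {1: 1.845416, 2: 1.904088, 3: 2.082829}

# The epoch whose checkpoint generalizes best, judged by validation loss alone
best_epoch = min(val_loss_by_epoch, key=val_loss_by_epoch.get)
best_loss = val_loss_by_epoch[best_epoch]
print(f"Lowest validation loss: epoch {best_epoch} ({best_loss:.4f})")

# Show how far each later epoch drifts from the best one
for epoch, loss in val_loss_by_epoch.items():
    print(f"epoch {epoch}: {loss:.4f} (+{loss - best_loss:.4f} vs. best)")
```

If you wanted the trainer to make this choice for you, `load_best_model_at_end=True` in the training arguments reloads the best checkpoint when training finishes. We don't use it in this lesson.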

**Note:** If you're seeing a warning, it's because the pre-trained model we loaded stores its configuration parameters differently than the latest `transformers` library expects.  It's OK to ignore.




---

### ✳️ **Qualitative Comparison of Summaries**

Let's see how the summaries compare for the first article in the validation set.


The first article in the validation set is:

In [16]:
sample_article = val_data[0]['document']
print(f"📄 Sample article:\n{sample_article}\n")

📄 Sample article:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand
trial in July.
They were all released on bail.



Here's a function to help you generate summaries.  We added many comments to help you understand the process.

In [17]:
def generate_summary(text, model, tokenizer, device, max_length=64, length_penalty=1.0):
    # Tokenize the input text and convert it into tensors suitable for the model
    # `max_length=512` with `truncation=True` truncates inputs longer than 512 tokens
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
    
    # Generate the summary using the model
    # `attention_mask` tells the model which input positions are real tokens
    # `num_beams=4` specifies the beam search size for better quality summaries
    # `max_length` caps the length of the generated summary (64 tokens by default)
    # `early_stopping=True` stops the search once `num_beams` finished candidates exist
    # `length_penalty`: larger values favor longer summaries, smaller values shorter ones
    summary_ids = model.generate(inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"],
                                 num_beams=4, max_length=max_length,
                                 early_stopping=True, length_penalty=length_penalty)
    
    # Decode the generated token IDs back into a human-readable string
    # `skip_special_tokens=True` removes special tokens like <s> and </s>
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
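
To build some intuition for `length_penalty`, here is a toy sketch of how beam search ranks finished candidates. According to the `transformers` documentation, the sequence length raised to `length_penalty` divides the sequence's total log-probability. The two beams and their scores below are made up for illustration, and `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_log_probs, num_tokens, length_penalty=1.0):
    """Length-normalized score for a finished beam (higher is better)."""
    return sum_log_probs / (num_tokens ** length_penalty)

# Two hypothetical finished beams: (total log-probability, length in tokens)
short_beam = (-6.0, 10)
long_beam = (-15.0, 30)

for lp in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, length_penalty=lp)
    long_score = beam_score(*long_beam, length_penalty=lp)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={lp}: short={short_score:.3f}, long={long_score:.3f} -> {winner} beam wins")
```

Because log-probabilities are negative, dividing by a larger power of the length pulls long sequences closer to zero, which is why larger values of `length_penalty` favor longer summaries.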



Now let's load the base model, our fine-tuned model, and a model that has been fine-tuned on the complete XSum dataset.  We'll generate a summary with each so we can compare them qualitatively.

**Note:**  The line `fine_tuned_model.config.forced_bos_token_id = None` shouldn't be necessary, but `transformers` sets `forced_bos_token_id = 0` in the saved model, which causes text generation to work incorrectly.  I'm opening an issue on GitHub for this.

In [18]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# reload the base model
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").to(device)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# Reload the fine-tuned model from the final checkpoint
checkpoint_path = MODELS_PATH / "xsum_bart_large" / "checkpoint-750"
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path).to(device)
fine_tuned_model.config.forced_bos_token_id = None # Set to None to squash bug
# tokenizer is the same as base model

# Load the model that was fine-tuned on the full XSum dataset
full_ft_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum").to(device)

reference_summary = val_data[0]['summary']
base_summary = generate_summary(sample_article, model, tokenizer, device)
fine_tuned_summary = generate_summary(sample_article, fine_tuned_model, tokenizer, device)
full_ft_summary = generate_summary(sample_article, full_ft_model, tokenizer, device)

print(f"📝 Reference Summary:\n{reference_summary}\n")
print(f"📝 Base Model Summary: \n{base_summary}\n" )
print(f"📝 Fine-Tuned Model Summary: \n{fine_tuned_summary}\n" )
print(f"📝 Fully Fine-Tuned Model Summary: \n{full_ft_summary}\n" )




📝 Reference Summary:
Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of
charity fraud.

📝 Base Model Summary:
Sam Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42. The
charge relates to offences which allegedly took place between 2008 and 2014. Sam, from Kent, Efe and
Bright, of Greater Manchester, and Stephen,. from Bexley,

📝 Fine-Tuned Model Summary:
Former England footballer Sam Sodje has appeared in court accused of defrauding a sports charity he
set up in his native Nigeria out of more than £1.5m over a period of four years, the Old Bailey has
heard. The charity was set up by Mr Sodje's father,

📝 Fully Fine-Tuned Model Summary:
Former Premier League footballer Sam Sodje has appeared in court charged with fraud.



Our fine-tuned model adds details about Sam Sodje that aren't in the article.  Some may come from what the model saw in pre-training, but nothing in the article confirms them, so they could just as well be hallucinated.  It does manage a summary, but it's not exactly brief.  We've only used a very small subset of the training data for this demonstration, so our result isn't that great. However, you can see that the fully fine-tuned model, which was trained on the whole dataset, produces a great, simple abstractive summary that's perhaps better than the reference summary.
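
One rough way to spot summary content that has no support in the article is to list the summary words that never occur in it. The sketch below does that with the article and two of the summaries printed above. The helper `novel_words` is hypothetical, and this is a crude word-level check, not a hallucination detector.

```python
import re

def novel_words(summary, source):
    """Return the sorted set of summary words (lowercased) that never appear in the source."""
    source_vocab = set(re.findall(r"\w+", source.lower()))
    summary_vocab = set(re.findall(r"\w+", summary.lower()))
    return sorted(summary_vocab - source_vocab)

# Article and summaries copied from the outputs above
article = (
    "The ex-Reading defender denied fraudulent trading charges relating to the "
    "Sodje Sports Foundation - a charity to raise money for Nigerian sport. "
    "Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42. "
    "Appearing at the Old Bailey earlier, all four denied the offence. "
    "The charge relates to offences which allegedly took place between 2008 and 2014. "
    "Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, "
    "are due to stand trial in July. They were all released on bail."
)
fine_tuned_summary = (
    "Former England footballer Sam Sodje has appeared in court accused of defrauding a "
    "sports charity he set up in his native Nigeria out of more than £1.5m over a period "
    "of four years, the Old Bailey has heard. The charity was set up by Mr Sodje's father,"
)
full_ft_summary = (
    "Former Premier League footballer Sam Sodje has appeared in court charged with fraud."
)

print("Fine-tuned:", novel_words(fine_tuned_summary, article))
print("Fully fine-tuned:", novel_words(full_ft_summary, article))
```

Abstractive summaries are supposed to use words that aren't in the source, so a novel word isn't necessarily a wrong one. Still, novel names, places, and numbers are worth a second look.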


---

### ✳️ **Define Evaluation Metrics (ROUGE and BERTScore)**

The `compute_metrics` function below takes the model predictions and labels, which are lists or tensors of token IDs, decodes them back to text, and computes the metrics.  We didn't include BLEU since our example above showed that BLEU isn't very good for assessing text similarity.  AI is helpful here to figure out how to include the evaluation metrics you want.  

We make use of the Hugging Face `evaluate` library.  [Learn more here.](https://huggingface.co/docs/evaluate/en/index)  You'll need the `rouge_score` and `bert_score` packages installed - they should be if you've installed the latest course package.
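
To make the ROUGE numbers less of a black box, here is a hand-rolled ROUGE-1 F1 for a single prediction and reference. This is a simplified sketch: it lowercases and splits on non-alphanumeric characters but skips the stemming that we turn on below, so its values can differ slightly from the library's. The function name `rouge1_f1` is hypothetical.

```python
import re
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (no stemming)."""
    pred_counts = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Reference and fully fine-tuned summaries from the qualitative comparison above
reference = ("Former Premier League footballer Sam Sodje has appeared in court "
             "alongside three brothers accused of charity fraud.")
prediction = "Former Premier League footballer Sam Sodje has appeared in court charged with fraud."

print(f"ROUGE-1 F1: {rouge1_f1(prediction, reference):.3f}")  # 0.733
```

Here 11 of the 13 predicted words appear among the 17 reference words, giving a precision of 11/13, a recall of 11/17, and an F1 of 22/30.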

In [19]:
from evaluate import load
rouge = load("rouge")
bertscore = load("bertscore")

def compute_metrics(eval_pred):
    """
    Compute ROUGE and BERTScore metrics for evaluating summarization models.
    
    This function is designed to be used with Hugging Face's Trainer.evaluate() method.
    It compares model predictions to reference summaries using:
    
    - ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum (n-gram and longest-common-subsequence overlap, reported as F1)
    - BERTScore-F1 (semantic similarity based on contextual embeddings)

    Returns:
        A dictionary with metric names as keys and scores as values (multiplied by 100).
    """
    # Unpack predictions and labels
    predictions, labels = eval_pred

    # Some models return (logits, ...) as predictions, so we extract the first element
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Convert to numpy arrays for easier handling
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)

    # If predictions are logits (batch_size x seq_len x vocab_size), take argmax
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)

    # Convert to plain Python lists
    predictions = predictions.tolist()
    labels = labels.tolist()

    # Replace -100 (used to ignore padding in labels) with the tokenizer's pad token ID
    pad_token_id = tokenizer.pad_token_id
    labels = [[(token if token != -100 else pad_token_id) for token in label] for label in labels]

    # Decode token IDs to strings
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Add newlines between sentences for ROUGE-Lsum to work properly
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]

    # Compute ROUGE scores (with stemming)
    rouge_scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    rouge_result = {f"{k}_f1": v * 100 for k, v in rouge_scores.items()}

    # Compute BERTScore (average F1 across examples)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # Suppress tokenizer/model loading warnings
        bertscore_result = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    bertscore_f1 = {"bertscore_f1": np.mean(bertscore_result["f1"]) * 100}

    # Merge all metrics into a single dictionary
    return {**rouge_result, **bertscore_f1}
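
The two array steps in `compute_metrics` that tend to confuse people are the `argmax` over logits and the `-100` replacement. Here is a dependency-free sketch of both on toy data. The tiny logits and labels are invented, the helper names are hypothetical, and the pad ID of `1` matches BART's tokenizer.

```python
IGNORE_INDEX = -100   # label value that the loss function skips
PAD_TOKEN_ID = 1      # BART's pad token ID

def argmax_token_ids(logits):
    """Pick the highest-scoring vocabulary index at every position of every sequence."""
    return [[max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
            for sequence in logits]

def replace_ignore_index(labels, pad_token_id):
    """Swap -100 for the pad ID so the tokenizer can decode the labels."""
    return [[token if token != IGNORE_INDEX else pad_token_id for token in sequence]
            for sequence in labels]

# Toy batch: 1 sequence, 2 positions, vocabulary of 3 tokens
toy_logits = [[[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]]
# Toy labels padded with -100 by the data collator
toy_labels = [[5, 7, -100, -100]]

print(argmax_token_ids(toy_logits))                    # [[1, 0]]
print(replace_ignore_index(toy_labels, PAD_TOKEN_ID))  # [[5, 7, 1, 1]]
```

After the replacement, `skip_special_tokens=True` drops the pad tokens during decoding, so the padding never reaches the metrics.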


---

### ✳️ **Evaluate Models on the Validation Set**

Here we'll compute all the metrics on the (reduced) validation set for the base model, our fine-tuned model, and the fully fine-tuned model.

You need to run the cell that loads the models in the qualitative comparisons section above before running the code below.  

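To speed up the evaluation we use a `Trainer` to take advantage of batch processing.  First we'll build a small helper function that takes the model, training arguments, dataset, collator, and `compute_metrics` function as input and returns the dictionary of metrics evaluated on the dataset.  Because this is a plain `Trainer` rather than a `Seq2SeqTrainer`, the predictions passed to `compute_metrics` are teacher-forced logits (which `compute_metrics` converts to token IDs with `argmax`), not summaries produced by `generate`, so the scores typically run higher than they would on freely generated text.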

In [20]:
def evaluate_metrics(model, training_args, dataset, data_collator, compute_metrics):
    """
    Evaluate the model on the given dataset and compute metrics.

    Args:
        model: The model to evaluate.
        training_args: Training arguments containing evaluation settings.
        dataset: The dataset to evaluate on.
        data_collator: Data collator for batching data.
        compute_metrics: Function to compute metrics.

    Returns:
        A dictionary with metric names as keys and scores as values.
    """
    # Create a Trainer instance for evaluation
    trainer = Trainer(
        model=model,
        args=training_args,
        eval_dataset=dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    # Evaluate the model
    eval_results = trainer.evaluate()

    return eval_results

Now we'll fetch the results.  This could take a few minutes.

In [23]:
base_results = evaluate_metrics(model, training_args, tokenized_val, data_collator, compute_metrics)
print("\n📈 Base BART Results:")
print(base_results)

fine_tuned_results = evaluate_metrics(fine_tuned_model, training_args, tokenized_val, data_collator, compute_metrics)
print("\n📈 Fine-Tuned BART Results:")
print(fine_tuned_results)

full_ft_results = evaluate_metrics(full_ft_model, training_args, tokenized_val, data_collator, compute_metrics)
print("\n📈 Fully Fine-Tuned BART Results:")
print(full_ft_results)


📈 Base BART Results:
{'eval_loss': 2.52492618560791, 'eval_model_preparation_time': 0.0056, 'eval_rouge1_f1':
44.73847994604354, 'eval_rouge2_f1': 17.584912484216776, 'eval_rougeL_f1': 41.397483143740445,
'eval_rougeLsum_f1': 41.94427735000251, 'eval_bertscore_f1': 86.6550963819027, 'eval_runtime':
42.1136, 'eval_samples_per_second': 4.749, 'eval_steps_per_second': 0.594}



📈 Fine-Tuned BART Results:
{'eval_loss': 2.082829236984253, 'eval_model_preparation_time': 0.0041, 'eval_rouge1_f1':
53.11391717145055, 'eval_rouge2_f1': 27.323786668296307, 'eval_rougeL_f1': 50.544033050052086,
'eval_rougeLsum_f1': 50.617683562004956, 'eval_bertscore_f1': 88.53831321001053, 'eval_runtime':
55.0593, 'eval_samples_per_second': 3.632, 'eval_steps_per_second': 0.454}



📈 Fully Fine-Tuned BART Results:
{'eval_loss': 2.3121683597564697, 'eval_model_preparation_time': 0.004, 'eval_rouge1_f1':
57.32454164802877, 'eval_rouge2_f1': 32.91543717646353, 'eval_rougeL_f1': 55.161648779914984,
'eval_rougeLsum_f1': 55.20277263969467, 'eval_bertscore_f1': 89.46658211946487, 'eval_runtime':
57.2807, 'eval_samples_per_second': 3.492, 'eval_steps_per_second': 0.436}


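We didn't print those out nicely, but if you look carefully you can see that the ROUGE and BERTScore metrics all increased for the fine-tuned model, and even more for the fully fine-tuned model.  For example, ROUGE-1 goes from 44.7 (base) to 53.1 (fine-tuned) to 57.3 (fully fine-tuned).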
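
If you'd like a tidier view, here is one way to line the scores up in a plain-text table. This is a minimal sketch: `format_results` is a hypothetical helper written for this example, and the dictionaries hold a few values copied from the output above, rounded to two decimals.

```python
def format_results(results_by_model, digits=2):
    """Build a plain-text table of the *_f1 metrics, one row per model."""
    first = next(iter(results_by_model.values()))
    # Keep only the score columns; drop loss, runtime, and throughput entries
    metric_keys = [key for key in first if key.endswith("_f1")]
    names = [key.replace("eval_", "").replace("_f1", "") for key in metric_keys]
    width = max(len(name) for name in names) + 2
    label_width = max(len(model_name) for model_name in results_by_model) + 2

    lines = ["model".ljust(label_width) + "".join(name.rjust(width) for name in names)]
    for model_name, results in results_by_model.items():
        cells = "".join(f"{results[key]:.{digits}f}".rjust(width) for key in metric_keys)
        lines.append(model_name.ljust(label_width) + cells)
    return "\n".join(lines)

# Values copied from the evaluation output above (rounded)
results = {
    "Base BART": {"eval_loss": 2.52, "eval_rouge1_f1": 44.74, "eval_rouge2_f1": 17.58,
                  "eval_rougeL_f1": 41.40, "eval_bertscore_f1": 86.66},
    "Fine-tuned BART": {"eval_loss": 2.08, "eval_rouge1_f1": 53.11, "eval_rouge2_f1": 27.32,
                        "eval_rougeL_f1": 50.54, "eval_bertscore_f1": 88.54},
    "Fully fine-tuned BART": {"eval_loss": 2.31, "eval_rouge1_f1": 57.32, "eval_rouge2_f1": 32.92,
                              "eval_rougeL_f1": 55.16, "eval_bertscore_f1": 89.47},
}

table = format_results(results)
print(table)
```

The same function works on the full dictionaries returned by `evaluate_metrics`, since it only looks at keys ending in `_f1`.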

In the homework we'll compare using LLMs for summarization to using these specialized models.  We'll also further explore using LLMs for evaluating text similarity.