# Fine-tuning a Model for a Summarization Task

In this task, you will load, preprocess, and fine-tune a T5 model on a dataset of news articles for a summarization task. Follow the steps below carefully.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained checkpoint `UBC-NLP/AraT5-base` for both the model and the tokenizer. If you run into problems with it, you can fall back to `google-t5/t5-small`, but `UBC-NLP/AraT5-base` is the intended choice.
- **Dataset**: You will be using the `CUTD/news_articles_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [37]:
#  pip install transformers datasets torch tensorflow

In [38]:
#  pip install --upgrade transformers torch

In [1]:
# Consolidated imports: duplicates and the unused AdamW / DataLoader imports are dropped
import torch
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
    pipeline,
)

In [10]:
dataset = load_dataset('CUTD/news_articles_df')
subset_dataset = dataset['train'].select(range(200))

In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'summarizer', 'text'],
        num_rows: 8378
    })
})

In [12]:
train_test_split = subset_dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']


In [13]:
print(f"Train size: {len(train_dataset)}, Test size: {len(test_dataset)}")

Train size: 160, Test size: 40


In [14]:
train_dataset

Dataset({
    features: ['Unnamed: 0', 'summarizer', 'text'],
    num_rows: 160
})

In [15]:
test_dataset

Dataset({
    features: ['Unnamed: 0', 'summarizer', 'text'],
    num_rows: 40
})

## Step 2: Load the Pretrained Tokenizer

In [16]:
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [17]:
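def preprocess_function(examples):
    # Prefix each article with the T5 task prefix; this dataset stores articles in "text"
    inputs = ["summarize: " + article for article in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    # Reference summaries are stored in the "summarizer" column
    labels = tokenizer(examples["summarizer"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs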


Initialize a tokenizer from the given model checkpoint.

In [18]:
text = "summarize: This is a sample news article to test the tokenizer."
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

In [19]:
print(inputs)

{'input_ids': tensor([[21603,    10,   100,    19,     3,     9,  3106,  1506,  1108,    12,
           794,     8, 14145,  8585,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


## Step 3: Preprocess the Dataset

Define a preprocessing function that adds a prefix ("summarize:") to each input if needed and tokenizes the text for the model. The labels will be the tokenized summaries.

In [20]:
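def preprocess_function(examples):
    # Tokenize the articles; each one is truncated or padded to exactly 512 tokens
    inputs = tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )
    # Tokenize the reference summaries the same way, capped at 128 tokens
    labels = tokenizer(
        examples["summarizer"],
        max_length=128,
        truncation=True,
        padding="max_length"
    )["input_ids"]

    # Note: no "summarize: " prefix is added here, and padded label
    # positions keep the pad token id rather than an ignore index
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": labels
    }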
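
Because the summaries are padded to `max_length`, the `labels` lists contain pad token ids, and those positions count toward the loss unless they are masked. A common remedy is to replace them with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below uses a hypothetical helper, `mask_label_padding`, and made-up token ids; it assumes a pad token id of `0`, as used by T5 tokenizers.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_label_padding(label_batch, pad_token_id):
    """Replace pad token ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_batch
    ]


# Two made-up label sequences, right-padded with 0 to length 5
labels = [[100, 19, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_label_padding(labels, pad_token_id=0)
print(masked)  # [[100, 19, 1, -100, -100], [8, 1, -100, -100, -100]]
```

A call like this could be applied to the `labels` value before it is returned from the preprocessing function.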


In [21]:
print(train_dataset.column_names)

['Unnamed: 0', 'summarizer', 'text']


In [22]:
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/160 [00:00<?, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

## Step 4: Define the Data Collator

Use a data collator designed for sequence-to-sequence models, which dynamically pads inputs and labels.

In [23]:
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

In [24]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
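
To make "dynamic padding" concrete: the collator pads each batch only up to the longest sequence in that batch, using the pad token for inputs and, by default, `-100` for labels. The pure-Python sketch below mimics that idea with a hypothetical helper, `pad_to_longest`, and made-up token ids; it is an illustration, not the collator's actual implementation.

```python
def pad_to_longest(batch, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(sequence) for sequence in batch)
    return [sequence + [pad_value] * (longest - len(sequence)) for sequence in batch]


# Made-up token ids for a batch of two examples of different lengths
input_ids = [[21603, 10, 100, 1], [21603, 10, 1]]
labels = [[100, 19, 1], [100, 1]]

padded_inputs = pad_to_longest(input_ids, pad_value=0)   # pad token id
padded_labels = pad_to_longest(labels, pad_value=-100)   # ignore index for the loss
print(padded_inputs)
print(padded_labels)
```

Note that when the preprocessing step already pads everything with `padding="max_length"`, all sequences arrive at the same length and the collator has nothing left to pad.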

## Step 5: Load the Pretrained Model

Load the model for sequence-to-sequence tasks (summarization).

In [25]:
# The model was already loaded in Step 4, because the data collator needs it

## Step 6: Define Training Arguments

Set up the training configuration with parameters like learning rate, batch size, and number of epochs.

In [26]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    logging_dir="./logs",
    logging_steps=500,
    save_steps=1000,
    eval_accumulation_steps=10
)



## Step 7: Initialize the Trainer

Use the `Seq2SeqTrainer` class to train the model.

In [27]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset
)

In [28]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.742063


TrainOutput(global_step=40, training_loss=2.367477607727051, metrics={'train_runtime': 489.6547, 'train_samples_per_second': 0.327, 'train_steps_per_second': 0.082, 'total_flos': 21654688235520.0, 'train_loss': 2.367477607727051, 'epoch': 1.0})
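
The `No log` entry in the training-loss column and `global_step=40` both follow from the configuration. A quick arithmetic check, assuming a single device and no gradient accumulation (neither is set explicitly in the arguments above):

```python
import math

# Values taken from the notebook
num_train_samples = 160
per_device_train_batch_size = 4
num_train_epochs = 1
logging_steps = 500

# Optimizer steps per epoch and in total (single device, no gradient accumulation assumed)
steps_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# The training loss is only logged every `logging_steps` steps
reaches_logging_step = total_steps >= logging_steps

print(total_steps)           # 40
print(reaches_logging_step)  # False
```

With only 40 steps, the run never reaches step 500, so no training loss is logged during the epoch. A smaller `logging_steps` value (for example 10) would likely make the training loss appear in the table.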

## Step 8: Fine-tune the Model

Train the model using the specified arguments and dataset.

In [30]:
print(tokenized_train_dataset[0])


{'Unnamed: 0': 165, 'summarizer': 'وذكرت\xa0وكالة سبوتنيك الروسية أن الطائرة تسببت في قطع بعض الأسلاك الكهربائية، قبل أن يُثقب خزان وقودها، وبعدها اختفت الطائرة من أمام الكاميرا. وفي بيان للشرطة المحلية، فإنه لم ينتج عن الحادث إصابات خطيرة، حيث تسبب فقط في عطب عدة سيارات، من بينها واحدة احترقت كليا. \nأظهر مقطع فيديو سقوط طائرة ذات محرك واحد من نوع بايبر PA-32 ، اليوم الأربعاء، على الطريق السريعة في ميكيلتو فى واشنطن.', 'text': 'اظهر مقطع فيديو سقوط طائره محرك نوع بايبر PA اليوم الاربعاء الطريق السريعه ميكيلتو فى واشنطن وذكرت وكاله سبوتنيك الروسيه الطائره تسببت قطع الاسلاك الكهربائيه خزان وقودها وبعدها اختفت الطائره الكاميرا ووفقا لقناه بى سى نيوز فقد فشل محرك الطائره فور اقلاعها المطار وفي بيان للشرطه المحليه فانه ينتج الحادث اصابات خطيره تسبب فقط عطب عده سيارات بينها واحده احترقت كليا', 'input_ids': [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 4935, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 
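
The repeated `3, 2` pairs in `input_ids` suggest that most of the Arabic text is being mapped to the unknown token: in the T5 vocabulary, id `0` is `<pad>`, `1` is `</s>`, and `2` is `<unk>`. That would be consistent with the `google-t5/t5-small` vocabulary not covering Arabic script, and it is a likely reason the task description recommends `UBC-NLP/AraT5-base`. A small sketch for quantifying this, using a hypothetical helper `unk_fraction` and an illustrative id list rather than the real output:

```python
def unk_fraction(input_ids, unk_id=2, pad_id=0):
    """Share of non-padding tokens that are the unknown token."""
    real_tokens = [token for token in input_ids if token != pad_id]
    if not real_tokens:
        return 0.0
    return sum(token == unk_id for token in real_tokens) / len(real_tokens)


# Illustrative ids only: two "▁ <unk>" pairs, one known token, EOS, then padding
example_ids = [3, 2, 3, 2, 4935, 1, 0, 0]
print(unk_fraction(example_ids))
```

Running the same helper over `tokenized_train_dataset[0]["input_ids"]` would give a rough idea of how much of an article the tokenizer can actually represent.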

In [31]:
trainer.save_model("./trained_model")
eval_results = trainer.evaluate()
print(eval_results)


{'eval_loss': 0.7420634031295776, 'eval_runtime': 39.1522, 'eval_samples_per_second': 1.022, 'eval_steps_per_second': 0.255, 'epoch': 1.0}
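
`trainer.evaluate()` reports only the loss here. Summaries are often also scored by word overlap with the reference, for example with ROUGE. The following is a simplified, whitespace-tokenized unigram F1 in the spirit of ROUGE-1, not the official implementation; the function name `unigram_f1` and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over overlapping whitespace-separated tokens (a simplified ROUGE-1)."""
    reference_counts = Counter(reference.split())
    candidate_counts = Counter(candidate.split())
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((reference_counts & candidate_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("a b c d", "a b x")
print(round(score, 4))  # 0.5714
```

Since it splits on whitespace, the same function works on Arabic text, although a proper evaluation would more likely rely on an established ROUGE package with suitable tokenization.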


## Step 9: Inference

Once the model is trained, perform inference on a sample text to generate a summary. Use the tokenizer to process the text, and then feed it into the model to get the generated summary.

In [33]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs . it's the most aggressive action on tackling the climate crisis in american history . no one making under $400,000 per year will pay a penny more in taxes ."}]

In [36]:
sample_text ="summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
# sample_text already starts with the "summarize: " prefix, so it is not added a second time
inputs = tokenizer.encode(sample_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs, max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Original Text:", sample_text)
print("Generated Summary:", summary)

Original Text: summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes.
Generated Summary: the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. no one making under $400,000 per year will pay a penny more in taxes.
