# **Full Fine-Tuning of a pretrained T5 model on the DialogSum dataset**
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1P7W3UsHSUDbFJgK0Mbd-OazySVa-Q7T8?usp=sharing)

#### **This Notebook is created by: [mahdi khoshmaram](https://github.com/mahdi-khoshmaram)** 🤗


**In this notebook**, I fine-tune the **`T5`** model on the [DialogSum dataset](https://huggingface.co/datasets/knkarthick/dialogsum) using **two** approaches:
1. Fine-tuning with the 🤗 **`Transformers` Trainer**.
2. Fine-tuning using **native PyTorch.**

# Table of Contents

1. [Set Device](#scrollTo=iNBAM7HSXSs-&line=1&uniqifier=1)
2. [Loading Dataset](#scrollTo=iNBAM7HSXSs-&line=1&uniqifier=1)
3. [Set-Up Model](#scrollTo=op8CLNJRaY_B&line=1&uniqifier=1)
4. [Make dataset ready for training](#scrollTo=liPT1hdPD-7d)
5. [Full Fine-tuning with the 🤗 Transformers Trainer](#scrollTo=qM4s61P9rJmI&line=1&uniqifier=1)
6. [Full Fine-tuning using native PyTorch](#scrollTo=hXFeW4aC9oGj&line=1&uniqifier=1)
7. [Evaluating the Original and Fine-Tuned Models Using ROUGE](#scrollTo=KApS1AWMg9Iu)

# **Set Device**
[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

`torch.device` is a PyTorch object that specifies the device (CPU or GPU) on which tensors are allocated. It helps manage computations efficiently across different hardware.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"Memory: {round((torch.cuda.get_device_properties(device).total_memory)/(1024)**3,2)}GB")

# **Loading Dataset**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

In [None]:
%%capture
%pip install datasets

from datasets import load_dataset

In [None]:
%%capture
hf_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(hf_dataset_name, split=None)

# **Set-Up Model**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)


`AutoModelForSeq2SeqLM` is a class from the Hugging Face Transformers library. It is used to automatically load a sequence-to-sequence (Seq2Seq) model based on the model checkpoint you specify.

---
Seq2Seq models are commonly used for tasks like text translation, summarization, and text generation.

---
**GenerationConfig:**

``max_new_tokens=200``
* This sets the maximum number of new tokens the model can generate in response to a prompt.


``do_sample=True``
* This enables sampling-based generation instead of deterministic generation.
* When ``do_sample=False``, the model chooses the most likely token at each step (greedy decoding or beam search).
* When ``do_sample=True``, the model introduces randomness, making outputs more diverse.
* **When ``do_sample=False``, the ``temperature`` parameter has no effect.**

``temperature=1``
* This controls the randomness of token selection during sampling.
* A higher temperature ``(>1)`` makes the model more random and creative.
* A lower temperature ``(<1)`` makes it more deterministic and focused.
* ``temperature=1`` is the default: the model's token probabilities are used unchanged, which balances coherence and diversity.
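
The effect of `temperature` can be seen on a toy example. The sketch below assumes the usual formulation, in which the logits are divided by the temperature before the softmax; the function name `softmax_with_temperature` and the scores in `toy_logits` are made up for illustration and are not part of the `transformers` API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens.
toy_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a lower temperature the top-scoring token takes a larger share of the probability mass; with a higher temperature the distribution flattens and less likely tokens are sampled more often.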


---


**What is ``torch_dtype="auto"``**?


* By default, when we load a model with:
```python
AutoModelForSeq2SeqLM.from_pretrained(model_name)
```
the weights are loaded in ``torch.float32`` (full precision), even if the checkpoint was trained or stored in a lower-precision format such as ``torch.float16``.

**Using ``torch_dtype="auto"``**

Instead of manually specifying a precision (e.g., ``torch.float16`` or ``torch.bfloat16``), we can load the model in the data type recorded with the checkpoint by setting:
```python
AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype="auto")
```
* This tells 🤗 Transformers to check the ``torch_dtype`` entry in the model's ``config.json`` file, which records the precision in which the weights were saved.
* If the weights were stored in ``torch.float16`` or ``torch.bfloat16``, the model loads in that format, which reduces memory use and can improve speed on hardware that supports it.
---
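
To get a feel for why the data type matters, here is a back-of-the-envelope sketch of the memory taken by the weights alone. The parameter count `assumed_params` is a rounded, assumed figure for a model of roughly this size, not a measured value, and the helper `weight_memory_gib` is hypothetical. Gradients, optimizer state, and activations come on top of this during training.

```python
# Bytes needed to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# Assumed, rounded parameter count (an illustration, not a measurement).
assumed_params = 250_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(assumed_params, dtype):.2f} GiB")
```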

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

In [None]:
%%capture
model_name = 'google/flan-t5-base'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype="auto").to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generation_config = GenerationConfig(max_new_tokens=200, do_sample=True, temperature=1)

# **Make dataset ready for training**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

---
* **Map** method in dataset class: [ChatGPT response](https://chatgpt.com/share/67b9efea-5534-8004-bcf1-8562576077db)
---

In [None]:
def tokenize_function(batch):
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompts = [start_prompt + dialogue + end_prompt for dialogue in batch['dialogue']]
    batch['input_ids'] = tokenizer(prompts, padding='max_length', truncation=True, return_tensors='pt').input_ids
    batch['labels'] = tokenizer(batch['summary'], padding='max_length', truncation=True, return_tensors='pt').input_ids
    return batch

tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['id', 'dialogue', 'summary', 'topic'])

---
* About **Lambda** function: https://chatgpt.com/share/67b9fd12-ac54-8004-b30c-6bee1d584f2e


* About **filter** method arguments of huggingface **dataset** class: https://chatgpt.com/share/67b9ff1e-85f8-8004-970a-d53772de81ac
---
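
The `filter` call in the next cell passes `with_indices=True`, so the lambda receives each example together with its position and keeps only the positions divisible by 10, which is about a tenth of each split. Below is a plain-Python sketch of the same selection rule. The helper `keep_every_nth` and the toy data are made up for illustration; this is not how the 🤗 `datasets` library implements `filter`.

```python
def keep_every_nth(examples, n=10):
    """Keep the examples whose position is a multiple of `n` (0, n, 2n, ...)."""
    return [example for index, example in enumerate(examples) if index % n == 0]

# Hypothetical stand-in for one dataset split.
toy_examples = [f"dialogue-{i}" for i in range(25)]

subset = keep_every_nth(toy_examples)
print(subset)
```

Subsampling like this trades model quality for a much shorter training run, which is convenient for a demonstration notebook.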

In [None]:
tokenized_dataset = tokenized_dataset.filter(lambda example, index: index % 10 == 0, with_indices=True)

# **Full Fine-tuning with the 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer) class**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)


---

**TrainingArguments**

``TrainingArguments`` is a class in the 🤗 `Transformers` library that is used to configure the training process of a model. It provides various options for fine-tuning and training models using the Trainer API.

* ``TrainingArguments`` class is often used with the ``Trainer`` class for model training in Hugging Face Transformers.

* ``TrainingArguments`` groups the arguments **that relate to the training loop itself.**

---
**Training hyperparameters**

`output_dir`
* Specifies the directory where model checkpoints and logs will be saved.

`learning_rate`
* Sets the learning rate for the optimizer.

`num_train_epochs`
* Defines the total number of training epochs.

`weight_decay`
* Regularization technique to prevent overfitting by adding a penalty to large weights.

`logging_steps`
* Determines how often training logs (such as loss values) are recorded.

`per_device_train_batch_size`
* Number of training samples per batch on each device.

`per_device_eval_batch_size`
* Number of evaluation (validation) samples per batch on each device.

`eval_strategy`
* Defines how often evaluation is performed.

`report_to`
* Specifies where to log training metrics (e.g., `wandb`, `tensorboard`).
* "none" disables logging to external tracking tools.

---
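
Several of these hyperparameters combine to determine how many parameter updates a run performs, which is also what step-based settings such as `logging_steps` count. A rough sketch of the arithmetic follows, assuming no gradient accumulation and that the last, smaller batch of each epoch is kept. The helper `total_optimizer_steps` and the dataset size `assumed_train_size` are hypothetical.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs, num_devices=1):
    """Number of parameter updates, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return steps_per_epoch * num_epochs

# Hypothetical training-set size; substitute len(tokenized_dataset["train"]).
assumed_train_size = 1_000

steps = total_optimizer_steps(assumed_train_size, batch_size=4, num_epochs=10)
print(steps)
```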

In [None]:
from transformers import TrainingArguments
from time import strftime

In [None]:
HF_training_args = TrainingArguments(
    output_dir = f"./T5-FF-TransformersTrainer-{strftime('%H:%M:%S')}",
    learning_rate = 1e-5,
    num_train_epochs = 10,
    weight_decay = 0.01,
    logging_steps = 1,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    eval_strategy = "epoch",
    report_to = "none"
)

---
**Now, fine-tune the model**

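Create a `Trainer` object with your **model**, **training arguments**, and the **training and validation datasets**. No evaluation function (`compute_metrics`) is passed here, so the evaluation at the end of each epoch reports only the loss: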

---

In [None]:
from transformers import Trainer

In [None]:
trainer = Trainer(
    model = model,
    args = HF_training_args,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["validation"]
)

In [None]:
trainer.train()

# **Full Fine-tuning using native PyTorch**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

---
In **Hugging Face Datasets**, `set_format("torch")` is used to **convert dataset elements into PyTorch tensors.**

This is useful when training a model with PyTorch, as it ensures that the data is in the correct format.

---

In [None]:
tokenized_dataset.set_format("torch")

---
**DataLoader**

The `Dataset` retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.

**`DataLoader`** is an iterable that abstracts this complexity for us in an easy API. [link](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders)

**DataLoader Parameters:**
* `dataset`: The dataset to load data from.
* `batch_size`: Number of samples per batch (default is `1`).
* `shuffle`: Whether to shuffle the data at every epoch (`True`/`False`).
* `num_workers`: Number of CPU processes used for data loading (`0` means no parallel loading).
* `pin_memory`: If `True`, speeds up GPU transfer by using pinned (page-locked) memory.

---
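
The core idea behind `batch_size` and `shuffle` can be sketched in a few lines of plain Python. The generator `iterate_minibatches` is a hypothetical illustration of the concept only; the real `DataLoader` additionally collates samples into tensors and supports worker processes and pinned memory.

```python
import random

def iterate_minibatches(samples, batch_size=1, shuffle=False, seed=None):
    """Yield lists of up to `batch_size` samples, optionally in shuffled order."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

# Hypothetical stand-in for a dataset of ten samples.
toy_samples = list(range(10))

batches = list(iterate_minibatches(toy_samples, batch_size=4, shuffle=True, seed=0))
print(batches)
```

Every sample appears exactly once per pass, and the last batch is smaller when the dataset size is not a multiple of the batch size.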

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(dataset=tokenized_dataset['train'], shuffle=True, batch_size=4)
eval_dataloader = DataLoader(dataset=tokenized_dataset["validation"], batch_size=4)

---
**Optimizer**

Create an optimizer and learning rate scheduler to fine-tune the model. Let’s use the [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) optimizer from PyTorch:

* `torch.optim` is a package implementing various optimization algorithms.
* To construct an Optimizer I have to give it an iterable containing the **parameters** to optimize. Then, I can specify **optimizer-specific options** such as the learning rate, weight decay, etc.

* In PyTorch, **`model.parameters()`** is a method that returns an iterator over all the learnable parameters (i.e., weights and biases) of a neural network model.

---
**Learning rate scheduler - `get_scheduler`**

When fine-tuning a pre-trained model (like BERT, GPT, or ViT), gradually decreasing the learning rate (LR) over time helps improve stability and ensures the model adapts well to the new task without forgetting pre-trained knowledge. Here’s why:
1. Prevents Catastrophic Forgetting
2. Helps Convergence & Avoids Overshooting
3. Improves Generalization & Reduces Overfitting

**`get_scheduler`** is a utility from `transformers` that adjusts the learning rate dynamically during training.

**`get_scheduler`** parameters:

* `name`: Specifies the type of scheduler to use. Each scheduler adjusts the learning rate differently during training. Available options include: **linear**, **cosine**, **cosine_with_restarts**, **polynomial**, **constant**, **constant_with_warmup**, **inverse_sqrt**, **reduce_lr_on_plateau**, **cosine_with_min_lr**, **warmup_stable_decay**.
* `optimizer`: The optimizer that will be used during training.
* `num_warmup_steps`: Number of steps to linearly increase the learning rate from 0 to the initial value set in the optimizer.
* `num_training_steps`: Total number of training steps. This is typically calculated as the **number of epochs multiplied by the number of batches per epoch**.

---
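
To make the `linear` option concrete, here is a sketch of the learning rate it produces at a given step, based on my understanding of the schedule: a linear ramp from 0 during warmup, then a linear decay to 0 at the final step. The function `linear_schedule_lr` is a hypothetical re-derivation for illustration, not the library's code.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

# Hypothetical run: 100 steps, no warmup, peak learning rate 1e-5.
for step in (0, 25, 50, 75, 100):
    print(step, linear_schedule_lr(step, 1e-5, 0, 100))
```

With `num_warmup_steps=0`, as used in this notebook, training starts at the full learning rate and decays steadily until the last step.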

In [None]:
# Set Optimizer
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)

In [None]:
# Set lr scheduler
from transformers import get_scheduler

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

---
**Lastly**, specify `device` to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes.

* `get_backend()` is used for automatically detecting the best available computing device (e.g., GPU or CPU).

---

In [None]:
%%capture
from accelerate.test_utils.testing import get_backend

device, _, _ = get_backend()
model.to(device)

---

**Training Loop**

`model.train()`
* This puts the model into training mode, which enables training-only behaviors such as dropout. It does not control gradient tracking: gradients are tracked by default unless the code runs under `torch.no_grad()`.

`batch = {k: torch.tensor(v).to(device) for k, v in batch.items()}`
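* Preparing data for the model: moves each value in the batch to the specified device (GPU or CPU). Because `set_format("torch")` already returns PyTorch tensors, the `torch.tensor(v)` wrapper is redundant here (it copies the tensor and PyTorch emits a warning); `v.to(device)` alone would be enough.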

`output = model(**batch)`
* Forward Pass : Feeds the batch into the model, producing an output object.
* The output typically contains predictions and a loss value (if the model is set up for training).

`loss = output.loss`
* Extracts the loss from the model's output.

`loss.backward()`
* Computes the gradients of the loss with respect to the model parameters (backpropagation).

`optimizer.step()`
* Updates model parameters using the gradients computed in `loss.backward()`

`lr_scheduler.step()`
* Adjusts the learning rate according to the learning rate scheduler (if used).

`optimizer.zero_grad()`
* Resets the gradients of all model parameters before the next iteration.
* In PyTorch, gradients accumulate by default: each call to `.backward()` adds the new gradients to those left over from previous batches, so without this reset every update would mix in stale gradients.

`progress_bar.update(1)`
* Moves the progress bar forward by one step to indicate progress.

`model.save_pretrained(output_dir)`
* Saves the trained model to the specified output_dir, allowing it to be reloaded later.

---

**Why Are Gradients Accumulated?**

In PyTorch, when you call `loss.backward()`, the gradients are not automatically replaced. Instead, they are added to the existing gradients.

This behavior is useful in cases like gradient accumulation, where you intentionally sum gradients over multiple batches before updating the weights. However, in a standard training loop, this would cause issues if gradients were not reset before each step.

---
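
Both sides of this behavior can be shown with hand-computed gradients. The sketch below uses a one-parameter model `y = w * x` with a mean-squared-error loss; the data, the starting weight, and the helper `mse_grad` are all made up for illustration and no PyTorch is involved. It shows that summing scaled micro-batch gradients reproduces the full-batch gradient, and that forgetting to reset leaves a stale gradient in the next update.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data and starting weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One large batch of four samples.
full_batch_grad = mse_grad(w, xs, ys)

# Two micro-batches of two samples each: add the gradients up (as repeated
# backward passes would), scaling each by 1 / number_of_micro_batches.
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
accumulated_grad = 0.0
for mb_xs, mb_ys in micro_batches:
    accumulated_grad += mse_grad(w, mb_xs, mb_ys) / len(micro_batches)

# Forgetting to reset: the old gradient is still there when the next one is added.
stale_grad = accumulated_grad + mse_grad(w, xs, ys)

print(full_batch_grad, accumulated_grad, stale_grad)
```

The first two values agree, which is why intentional accumulation is a valid way to emulate a larger batch; the third is twice as large, which is the unintended effect that resetting the gradients avoids.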

In [None]:
from tqdm.auto import tqdm
from time import strftime
output_dir = f"./T5-FF-NativePytorch-{strftime('%H:%M:%S')}"

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: torch.tensor(v).to(device) for k, v in batch.items()}
        output = model(**batch)
        loss = output.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

model.save_pretrained(output_dir)

# **Evaluating the Original and Fine-Tuned Models Using ROUGE**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

---

**Evaluating the Original Model vs. Fine-Tuned Models**

In this section, to determine whether the fine-tuned model's performance has improved, I created the `compute_rouge` function that takes a `model_name` as input and returns the **ROUGE** metrics for evaluation.

---

**The default number of examples for calculating ROUGE is set to 10. This can be adjusted using the `num_examples` parameter.**

---
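
For intuition about what ROUGE measures, here is a deliberately simplified ROUGE-1 F1 score based on unigram overlap. It assumes lowercased, whitespace-split tokens and no stemming, so its numbers will not match those of the `evaluate` implementation used below; the function `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split text."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a word is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction and reference summary.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

A higher score means the generated summary shares more words with the human-written reference; it says nothing about fluency or factual correctness.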

In [None]:
%%capture
%pip install evaluate
%pip install rouge_score

In [None]:
import evaluate

def compute_rouge(model_name, tokenizer=tokenizer, dataset=dataset, num_examples=10):
    print(f"Model:____{model_name}____\n")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype="auto").to(device)
    # Take the first `num_examples` test dialogues and their reference summaries.
    examples = dataset['test'][0:num_examples]
    prediction_list = []
    for dialogue in examples['dialogue']:
        # Same prompt template as the one used during fine-tuning.
        prompt = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: "
        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
        # Sampling is enabled in `generation_config`, so scores vary between runs.
        model_output = model.generate(input_ids, generation_config=generation_config)
        model_text_output = tokenizer.decode(model_output[0], skip_special_tokens=True)
        prediction_list.append(model_text_output)

    rouge = evaluate.load('rouge')
    rouge_score = rouge.compute(
        predictions = prediction_list,
        references = examples['summary'],
        use_aggregator = True,
        use_stemmer = True)
    return rouge_score

In [None]:
# setting Model names
original_model = model_name
# FF_model = "./T5-FF-NativePytorch-"
# Path from the author's run; the timestamp and checkpoint number will differ for yours.
FF_model = "./T5-FF-TransformersTrainer-22:22:13/checkpoint-500"

# Compute ROUGE
print(compute_rouge(model_name=original_model, num_examples=20), end="\n\n")
print(compute_rouge(model_name=FF_model, num_examples=20))

# **T5 vocabulary with indices:**

[[Back to Top]](#scrollTo=z--RVKIEAhgE&line=1&uniqifier=1)

In [None]:
vocab_index_token = {v:k for k,v in tokenizer.get_vocab().items()}

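How to get logits in T5: run a single forward pass with the pad token as the decoder start token (T5 uses it to start decoding), then read off the most likely next token.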

In [None]:
# Example
input_text = "translate English to French: Hello, how are you?"
input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(device)
decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]], device=device)

In [None]:
with torch.no_grad():
    output = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

In [None]:
probs = torch.softmax(output.logits, dim=-1)

In [None]:
pred_token = torch.argmax(probs, dim=-1).item()

In [None]:
vocab_index_token.get(pred_token)