# Introduction

We aim to fine-tune a `t5-small` model on the `imdb` dataset using the following four techniques:

- Full Fine-tuning
- Soft Prompt Tuning
- Adapter
- LoRA

For each technique, we will implement the tuning in two ways:
- Coding the entire tuning procedure ourselves, without a PEFT library
- Using an implementation from a PEFT library


### Requirements

**%%capture:**
This is a Jupyter magic command. It captures the output of the cell (both stdout and stderr), so anything the cell prints, such as installation logs or warnings, is suppressed and not displayed in the notebook. This is useful for installing packages without cluttering the notebook with installation output.

**The exclamation mark (!)** is used to run shell commands directly from a Jupyter notebook cell. Here, it’s calling pip, which is the Python package installer.

**datasets:** A library by Hugging Face that provides easy access to a wide variety of datasets for machine learning and data science.

**transformers:** Another library by Hugging Face that offers state-of-the-art pre-trained models for natural language processing tasks.

In [None]:
%%capture
! pip install datasets transformers

### Imports

In [None]:
from tqdm.notebook import tqdm
from IPython import display

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn

from datasets import load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration, DataCollatorForSeq2Seq

### Constants

### Base Model Selection
We will use `t5-small` as our base model from Hugging Face ([HF_Link](https://huggingface.co/t5-small)).

**The developers of the Text-To-Text Transfer Transformer (T5) write:**

*With T5, we propose reframing all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Our text-to-text framework allows us to use the same model, loss function, and hyperparameters on any NLP task.*

**Encoder-Decoder Structure:**

T5 uses the Transformer architecture (introduced by Vaswani et al., 2017).
The **encoder** processes the input sequence to generate contextual representations.

The **decoder** generates the output sequence, conditioned on the encoder’s output and previous tokens in the output sequence.

**Text-to-Text Framework:**

T5 formulates all NLP tasks as a text-to-text problem. This means both the input and output are represented as text strings.

*Example:*

- Task: Translation from English to French
  - Input: "translate English to French: What is your name?"
  - Output: "Quel est votre nom?"
- Task: Sentiment analysis
  - Input: "classify sentiment: I love this product."
  - Output: "positive"
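
The text-to-text framing above can be sketched in a few lines. This is only an illustration: the helper name `to_text_to_text` is hypothetical, and the task prefixes are the ones from the examples above.

```python
def to_text_to_text(task_prefix: str, text: str, target: str) -> dict:
    """Build one (input, target) pair where both sides are plain strings."""
    return {"input": f"{task_prefix}: {text}", "target": target}

examples = [
    to_text_to_text("translate English to French", "What is your name?", "Quel est votre nom?"),
    to_text_to_text("classify sentiment", "I love this product.", "positive"),
]

# Every task, whatever its type, ends up as a string-to-string pair.
for ex in examples:
    print(ex["input"], "->", ex["target"])
```

Because both tasks share the same shape, the same model and loss function can be trained on either of them.
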

**Unified Training Objective:**

T5 is trained with a sequence-to-sequence learning objective, specifically masked span prediction (similar to BERT’s masked language modeling). However, instead of predicting individual masked tokens, T5 predicts spans of tokens.

`torch.device:`

This is a PyTorch class that specifies the device on which tensors are allocated and operations are performed. PyTorch can run computations on different types of devices, such as the CPU and GPUs (CUDA-enabled devices).

`"cuda:0":`

This string specifies the first CUDA-enabled GPU. If you have multiple GPUs, you can specify others by changing the index, like "cuda:1", "cuda:2", etc.

In [None]:
#####################################
###### DO NOT CHANGE THIS CELL ######
#####################################

BASE_MODEL_NAME = 't5-small'

BATCH_SIZE = 32
LEARNING_RATE = 1e-5
EPOCHS = 10

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

cuda:0


# Dataset

### Load dataset

The `imdb` dataset is a well-known NLP dataset for binary sentiment classification. Each review is labeled either `negative` or `positive` ([HF_Link](https://huggingface.co/datasets/imdb)).

...

Examples from dataset:

| Example Text | Tag |
|------------|---|
| The plot line of No One Sleeps is not a bad idea, ... | 0 (neg) |
| This version of Anna Christie is in German. Greta ... | 1 (pos) |


`pop Method:`

The pop method removes a specified key from a dictionary and returns the associated value. If the key is not found and no default is given, it raises a KeyError. Here, dataset is an instance of the DatasetDict class (from the datasets library), which behaves like a Python dictionary, so pop works the same way.

`'unsupervised' Key:`

The 'unsupervised' key refers to a particular subset or split of the IMDB dataset. By calling dataset.pop('unsupervised'), you are removing the 'unsupervised' split from the dataset dictionary. This subset typically contains data that does not have labeled outputs, which means it's not useful for supervised learning tasks like classification or regression.
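
As a minimal sketch of the `pop` semantics described above, the snippet below uses a plain dictionary as a stand-in for `DatasetDict` (an assumption made so the example runs without downloading anything). The values are just the row counts of each split.

```python
# A plain dict stands in for DatasetDict here (illustrative assumption).
splits = {"train": 25000, "test": 25000, "unsupervised": 50000}

# pop removes the key and returns its value.
removed = splits.pop("unsupervised")
print(removed)
print(sorted(splits))

# Passing a default avoids the KeyError when the key is absent.
missing = splits.pop("validation", None)
print(missing)
```
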

In [None]:
dataset = load_dataset('imdb')
print(dataset)
dataset.pop('unsupervised')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


**Pretraining Objective:** Span Corruption

Instead of masking single tokens (as in BERT), T5 masks contiguous spans of text in the input. Each span is replaced with a single special token, `<extra_id_N>`, where N is a unique identifier for each masked span in the sequence. The task is to predict the masked spans in the output sequence.

*Example:*

Input: `"The <extra_id_0> fox jumps over the <extra_id_1> dog."`

Output: `"<extra_id_0> quick brown <extra_id_1> lazy"`

The model learns to generate the missing spans as text, treating this as a sequence-to-sequence generation task.
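
To make the span-corruption format concrete, here is a small sketch that reproduces the example above on whitespace-separated words. The function name `corrupt_spans` and the hand-picked span positions are assumptions for illustration; the real pretraining pipeline samples spans randomly over subword tokens, and the original T5 preprocessing also appends a closing sentinel to the target, which is omitted here to match the example.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    """
    corrupted, target, cursor = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # the span collapses to one sentinel
        target.append(sentinel)                 # the target announces the span...
        target.extend(tokens[start:end])        # ...followed by its original tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    return " ".join(corrupted), " ".join(target)

tokens = "The quick brown fox jumps over the lazy dog.".split()
model_input, model_target = corrupt_spans(tokens, [(1, 3), (7, 8)])
print(model_input)
print(model_target)
```
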

### Define related functions

Because `T5` is a sequence-to-sequence model, we map the integer labels to label names before training, and map the generated text back to integer labels when computing metrics.

The functions `id2label` and `label2id` are defined to do this.

**label_names_dict.get(label, 2):**

If label exists as a key in the dictionary, its corresponding value is returned.

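If label is not found in the dictionary (for example, when the model generates text that is neither `negative` nor `positive`), the default value 2 is returned. Since the true labels are only 0 or 1, such a prediction is always counted as incorrect.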

In [None]:
def id2label(ids):
    label_names = ['negative', 'positive']
    return [label_names[id] for id in ids]

def label2id(labels):
    label_names_dict = {
        'negative': 0,
        'positive': 1
    }
    return [
        label_names_dict.get(label, 2)
        for label in labels
    ]
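
A quick round trip shows how the two functions behave, including the fallback value. The functions are repeated so the snippet runs on its own; the generated strings `"great"` and `""` are made-up examples of invalid model output.

```python
def id2label(ids):
    label_names = ['negative', 'positive']
    return [label_names[id] for id in ids]

def label2id(labels):
    label_names_dict = {'negative': 0, 'positive': 1}
    return [label_names_dict.get(label, 2) for label in labels]

# Integer labels survive the round trip through their text form.
round_trip = label2id(id2label([0, 1, 1]))
print(round_trip)

# Text that is not a valid label name falls back to 2.
fallback = label2id(['positive', 'great', ''])
print(fallback)
```
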

# Tokenizer

### Load tokenizer

In [None]:
tokenizer = T5TokenizerFast.from_pretrained(BASE_MODEL_NAME)


### Process dataset using tokenizer

In this step we get our dataset ready for training.

We preprocess and tokenize our `text` and `label`.

For easier prompt tuning, we put placeholders by prepending multiple `pad_token`s to the input. The number of these pad tokens equals `n_soft_prompt_tokens`. Note that the `map_function` shown in this section does not add these placeholders.



`text = text.replace('<br />', ' ')`:

- Searches the text for occurrences of the HTML tag `<br />`.
- Replaces each occurrence with a single space.

Why it's used:

- **Clean HTML artifacts:** The IMDB dataset often contains user reviews with HTML formatting. The `<br />` tag represents a line break in HTML but is not meaningful for the T5 model, which expects plain text as input.
- **Improve tokenization:** T5's tokenizer works best on clean, natural text. Retaining tags like `<br />` would produce unnecessary tokens, potentially affecting model performance.
- **Maintain readability:** Replacing `<br />` with a space keeps the text readable and semantically consistent after the HTML artifacts are removed.
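
The effect of this cleaning step can be checked on a short string. The sample review below is made up; `preprocess_input` is the same function used in this notebook, repeated so the snippet runs on its own.

```python
def preprocess_input(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    return text

sample = "Great film.<br /><br />Would watch again!"
cleaned = preprocess_input(sample)
print(repr(cleaned))

# Two consecutive tags leave two spaces behind. If that matters,
# whitespace can be collapsed afterwards (an optional extra step).
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```
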

The token IDs are created through the following steps, handled by the tokenizer associated with the model (`T5TokenizerFast` in this case):




1. Input Text Preprocessing

Before the text is split into tokens, it can be normalized, for example by:

- Lowercasing (if specified in the tokenizer configuration). In this notebook, lowercasing is done explicitly in `preprocess_input` rather than by the tokenizer.
- Removing special characters or normalizing text (depending on the model).

For example: Input: "This movie was amazing!"

After preprocessing: "this movie was amazing!"

2. Tokenization

The tokenizer splits the preprocessed text into smaller units called tokens. The splitting rules depend on the tokenizer type:

- **Word-based tokenizers:** Split text by spaces (e.g., "this movie was amazing" → ["this", "movie", "was", "amazing"]).
- **Subword tokenizers:** Split text into smaller units based on a predefined vocabulary (e.g., "amazing" → ["amaz", "ing"]).
- **Character tokenizers:** Break text into individual characters.

T5 uses a SentencePiece tokenizer with a subword-level vocabulary.

3. Mapping Tokens to IDs

Each token is mapped to a unique integer ID using the model's vocabulary, which is a predefined mapping between tokens and IDs.

Example (the IDs are illustrative):

| Token | ID |
|---|---|
| "this" | 321 |
| "movie" | 654 |
| "was" | 987 |
| "amaz" | 123 |
| "ing" | 456 |

Tokens: ["this", "movie", "was", "amaz", "ing"]

Token IDs: [321, 654, 987, 123, 456]

4. Adding Special Tokens

The tokenizer may add special tokens required by the model:

- `<pad>`: Padding token (e.g., for shorter sequences in a batch).
- `<eos>`: End-of-sequence token. In T5 it is written `</s>`.
- `<bos>`: Beginning-of-sequence token (if applicable; T5 does not use one).

For T5, an end-of-sequence token is added to the end of each sequence.

Example:

Tokens: ["this", "movie", "was", "amaz", "ing"]

After adding `<eos>`: ["this", "movie", "was", "amaz", "ing", "`</s>`"]

5. Truncation and Padding

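If the tokenized sequence exceeds the max_length (e.g., 256), it is truncated to fit the limit.

If the tokenized sequence is shorter than the target length, padding is applied to make all sequences the same length (important for batching). In this notebook the tokenizer call only truncates (`truncation=True, max_length=256`); padding is applied later, per batch, by the data collator.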

**Output**

The tokenizer returns:

- `input_ids`: The list of token IDs for the text.
- `attention_mask`: A binary mask indicating which tokens are real (1) and which are padding (0).

**Example Walkthrough**

Input text: "This movie was amazing!"

Steps:

- Preprocess: "this movie was amazing!"
- Tokenization: ["this", "movie", "was", "amaz", "ing"]
- Token IDs: [321, 654, 987, 123, 456]
- Add `<eos>`: [321, 654, 987, 123, 456, 1] (where 1 is the ID for `<eos>`)
- Pad/truncate (if needed), for max length 8: [321, 654, 987, 123, 456, 1, 0, 0]
- Attention mask: [1, 1, 1, 1, 1, 1, 0, 0]
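
The walkthrough above can be reproduced with a toy encoder. Everything here is a simplified stand-in: `TOY_VOCAB` reuses the illustrative IDs from the example, `encode` is a hypothetical helper, and the tokens are supplied already split, so this does not reflect how SentencePiece actually segments text.

```python
# Illustrative vocabulary; 0 and 1 play the roles of <pad> and </s>.
TOY_VOCAB = {"<pad>": 0, "</s>": 1, "this": 321, "movie": 654,
             "was": 987, "amaz": 123, "ing": 456}

def encode(tokens, max_length):
    """Map tokens to IDs, append </s>, then truncate or pad to max_length."""
    ids = [TOY_VOCAB[t] for t in tokens]
    # Truncate so that the end-of-sequence token always fits.
    ids = ids[: max_length - 1] + [TOY_VOCAB["</s>"]]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [TOY_VOCAB["<pad>"]] * n_pad, mask + [0] * n_pad

tokens = ["this", "movie", "was", "amaz", "ing"]
input_ids, attention_mask = encode(tokens, max_length=8)
print(input_ids)
print(attention_mask)

# With a shorter limit the sequence is truncated instead of padded.
short_ids, short_mask = encode(tokens, max_length=4)
print(short_ids, short_mask)
```
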

In [None]:
def preprocess_input(text):
    text = text.lower()
    text = text.replace('<br />', ' ')
    return text

def map_function(row):
    processed_input = [
        preprocess_input(text)
        for text in row['text']
    ]
    input_info = tokenizer(processed_input, truncation=True, max_length=256)
    output_info = tokenizer(id2label(row['label']))
    return {
        **input_info,
        'labels': output_info.input_ids
    }


dataset = dataset.map(map_function, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

`input_info`:

This is the output of the tokenizer for the input text. It typically contains keys like:

- `'input_ids'`: The tokenized representation of the input text.
- `'attention_mask'`: A binary mask indicating which tokens are actual input and which are padding.

`**input_info`:

This is Python's dictionary unpacking operator. It spreads the key-value pairs from `input_info` into the new dictionary.

1. `dataset = dataset.map(map_function, batched=True)`

This line applies `map_function` to the dataset. The function processes the text and labels, converting them to the format required by the model.

The `batched=True` argument means the function receives batches of rows rather than individual rows, which can be more efficient. This is why `map_function` iterates over `row['text']`: each column arrives as a list of values.

2. `dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])`

This line sets the format of the dataset to be compatible with PyTorch.

The `type='torch'` argument specifies that values are returned as PyTorch tensors.

The `columns=['input_ids', 'attention_mask', 'labels']` argument specifies which columns are returned when the dataset is accessed. These are the columns needed for training the model: `input_ids` (tokenized input text), `attention_mask` (indicating which tokens should be attended to), and `labels` (the tokenized label names).

Together, these lines ensure that the dataset is processed and formatted correctly for use with a PyTorch-based model, making it ready for training or evaluation.
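
The dictionary unpacking and the batched calling convention can be seen in isolation with a stand-in tokenizer. `toy_tokenizer` is hypothetical (it uses word lengths as fake IDs) and only mimics the shape of what a Hugging Face tokenizer returns.

```python
def toy_tokenizer(texts):
    """Hypothetical stand-in that returns tokenizer-like keys for a list of texts."""
    input_ids = [[len(word) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def map_function(batch):
    # With batched mapping, each column arrives as a list of values.
    input_info = toy_tokenizer(batch["text"])
    # ** copies every key of input_info into the new dict, then 'labels' is added.
    return {**input_info, "labels": batch["label"]}

batch = {"text": ["good movie", "bad"], "label": [1, 0]}
out = map_function(batch)
print(out)
```
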

# Model

### Load model

In [None]:
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL_NAME)


# Train and evaluate

# DataLoader
A DataLoader in PyTorch is a powerful tool that simplifies the process of preparing data for training and testing machine learning models. It handles tasks like batching, shuffling, and parallel data loading. Essentially, it makes it easier to iterate over your dataset during the training and evaluation phases.

In the code:




```
train_loader = torch.utils.data.DataLoader(
    dataset['train'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
    shuffle=True
)

test_loader = torch.utils.data.DataLoader(
    dataset['test'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
)
```


`dataset['train']` and `dataset['test']` refer to the training and testing datasets, respectively.

`batch_size=BATCH_SIZE` specifies the number of samples per batch.

`collate_fn=col_fn` defines how samples are combined into a single batch.

`shuffle=True` means the training data will be randomly shuffled at the start of each epoch, improving generalization.
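
To see what batching and shuffling amount to, the sketch below iterates over sample indices only, in plain Python. `iterate_batches` and its `seed` parameter are hypothetical; the real `DataLoader` also collates the samples and draws a new shuffle every epoch.

```python
import random

def iterate_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

# 25,000 training rows with a batch size of 32, as in this notebook.
batches = list(iterate_batches(25000, 32, shuffle=True))
print(len(batches))      # number of optimizer steps per epoch
print(len(batches[-1]))  # the last batch is smaller
```

With these numbers, one epoch consists of 782 batches, and the final batch holds the 8 leftover samples.
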

**DataCollatorForSeq2Seq**

`DataCollatorForSeq2Seq` is a data collator for sequence-to-sequence models like T5. It ensures that input sequences are properly padded and prepared for batch processing.

In the code:

```
col_fn = DataCollatorForSeq2Seq(
    tokenizer, return_tensors='pt', padding='longest',
)
```

`tokenizer`: The tokenizer, which converts text to token IDs and supplies the padding token.

`return_tensors='pt'`: Ensures the returned tensors are PyTorch tensors.

`padding='longest'`: Pads sequences to the length of the longest sequence in the batch. All sequences in a batch must have the same length to be stacked into a single tensor.

Together, these components prepare your data for training and evaluation by handling batching, padding, and shuffling automatically. They ensure your data is in the right format for feeding into the model, making the process much more efficient.
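
Padding to the longest sequence in a batch can be sketched without any library. `pad_batch` is a hypothetical helper and the token IDs are made up. The sketch assumes the collator's defaults: inputs are padded with the tokenizer's pad ID (0 for T5) and labels with -100, a value the loss function ignores.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

input_ids = [[321, 654, 1], [987, 1]]
labels = [[7, 8, 1], [9, 1]]

padded_inputs = pad_batch(input_ids, pad_value=0)
attention_mask = pad_batch([[1] * len(seq) for seq in input_ids], pad_value=0)
padded_labels = pad_batch(labels, pad_value=-100)

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which saves computation.
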

### Define dataloaders

In [None]:
col_fn = DataCollatorForSeq2Seq(
    tokenizer, return_tensors='pt', padding='longest',
)

train_loader = torch.utils.data.DataLoader(
    dataset['train'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
    shuffle=True
)

test_loader = torch.utils.data.DataLoader(
    dataset['test'],
    batch_size=BATCH_SIZE,
    collate_fn=col_fn,
)

### Train functions

In [None]:
def train_loop(model, loader, optimizer):
    model.train()

    batch_losses = []

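    for row in tqdm(loader, desc='Training:'):
        # Reset gradients accumulated in the previous step
        optimizer.zero_grad()

        # Forward pass; because the batch contains `labels`, the model returns the loss
        out = model(**row.to(model.device))
        loss = out.loss

        batch_loss_value = loss.item()
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()

        batch_losses.append(batch_loss_value)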

    loss_value = np.mean(batch_losses)
    return {'train_loss': loss_value}

def _predict(model, row):
    return model.generate(
        input_ids=row.input_ids,
        attention_mask=row.attention_mask,
        max_length=5
    )

def tokenizer_ids_to_label(all_input_ids):
    return tokenizer.batch_decode(all_input_ids, skip_special_tokens=True)

def valid_loop(model, loader, compute_metrics):
    model.eval()

    all_true = []
    all_pred = []

    with torch.no_grad():
        for row in tqdm(loader, desc='Validating:'):
            # Move the batch tensors to the model's device
            row.to(model.device)
            pred = _predict(model, row)

            all_true += row.labels.detach().cpu().tolist()
            all_pred += pred.detach().cpu().tolist()

    # Decode token IDs to text, then map the text back to integer labels
    all_true = label2id(tokenizer_ids_to_label(all_true))
    all_pred = label2id(tokenizer_ids_to_label(all_pred))

    return {'valid_acc': compute_metrics(y_true=all_true, y_pred=all_pred)}

### Define our optimizer and metric function

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
compute_metrics = accuracy_score
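
For reference, accuracy is simply the fraction of predictions that match the true labels. The pure-Python `accuracy` function below is a sketch of that definition, not scikit-learn's implementation, and the label lists are made up. It also shows how the fallback ID 2 from `label2id` is scored.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 2, 0]  # 2 marks generated text that was not a valid label
score = accuracy(y_true, y_pred)
print(score)
```
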

In [None]:
model.to(DEVICE)  #This line moves all the model's parameters and buffers to the device specified by DEVICE.

all_results = []
for epoch in range(EPOCHS):
    epoch_results = {'epoch': epoch}

    epoch_results.update(
        train_loop(
            model=model,
            loader=train_loader,
            optimizer=optimizer,
        )
    )

    epoch_results.update(
        valid_loop(
            model=model,
            loader=test_loader,
            compute_metrics=compute_metrics,
        )
    )
    all_results.append(epoch_results)

    display.clear_output()
    display.display(pd.DataFrame(all_results).set_index('epoch'))

| epoch | train_loss | valid_acc |
|---|---|---|
| 0 | 1.794521 | 0.82968 |
| 1 | 0.206151 | 0.85484 |
| 2 | 0.179503 | 0.87084 |
| 3 | 0.165054 | 0.88128 |
| 4 | 0.156557 | 0.88744 |
| 5 | 0.147961 | 0.89116 |
| 6 | 0.142497 | 0.89428 |
| 7 | 0.138048 | 0.89672 |
| 8 | 0.133215 | 0.89844 |
| 9 | 0.130639 | 0.89988 |


### Best Performance and number of parameters

In [None]:
best_score = pd.DataFrame(all_results)['valid_acc'].max() * 100
total_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {total_params}")
print('Best model performance is: %.1f%%' % best_score)

Number of parameters: 60506624
Best model performance is: 90.0%


In [None]:
print(total_params)
print(best_score)

60506624
89.988
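
The parameter count gives a rough sense of the memory involved in full fine-tuning. The arithmetic below is a back-of-the-envelope sketch that assumes 32-bit floats (4 bytes per parameter) and counts only weights, gradients, and the two moment estimates AdamW keeps per parameter. Activations and framework overhead are ignored.

```python
total_params = 60_506_624   # from the cell output above
bytes_per_param = 4         # assumption: float32 weights

weights_mb = total_params * bytes_per_param / 1e6
print(f"Weights: {weights_mb:.1f} MB")

# Full fine-tuning with AdamW keeps, per parameter:
# the weight, its gradient, and two optimizer moment estimates.
copies = 4
training_state_mb = weights_mb * copies
print(f"Weights + gradients + AdamW state: {training_state_mb:.1f} MB")
```

Every parameter is trainable in full fine-tuning, so all four copies scale with the full model size.
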
