### T5 MODEL - TASK: SENTIMENT ANALYSIS

### 🛠️ SETUP

#### Dataset: IMDB
- The IMDB dataset consists of movie reviews labeled as either **positive** or **negative**, commonly used for **sentiment classification** tasks.
- It contains 50,000 labeled reviews, split evenly into 25,000 for training and 25,000 for testing.

#### Model: T5-small
- `t5-small` is a lightweight version of the **Text-to-Text Transfer Transformer (T5)** developed by Google.
- T5 treats every NLP task as a **text-to-text problem**, meaning both input and output are formatted as text.
- In this experiment, the model is fine-tuned to perform **sentiment classification** by mapping a review text to its corresponding label ("positive" or "negative").
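
To make the text-to-text framing concrete, here is a minimal sketch of how one labeled review becomes an input string and a target string. The helper name `to_text_pair` is hypothetical and not part of this notebook; the `sentiment: ` prefix and the convention that label `1` means positive mirror what the preprocessing cell in this notebook does.

```python
def to_text_pair(review: str, label: int) -> tuple[str, str]:
    """Turn one labeled review into (input text, target text) for T5."""
    source = "sentiment: " + review          # task prefix + raw review text
    target = "positive" if label == 1 else "negative"
    return source, target


# Illustrative review text, not taken from the dataset.
source, target = to_text_pair("A moving, beautifully shot film.", 1)
print(source)
print(target)
```

The model never sees an integer class id: it learns to generate the word `positive` or `negative`, which is why evaluation later has to decode generated tokens back into text.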


In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '4,5'

In [None]:
import warnings
warnings.filterwarnings('ignore')  # silence Python warnings for cleaner notebook output

In [None]:
import torch
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
)
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, TaskType
import evaluate
import numpy as np


### MODEL AND DATASET

In [4]:
# Load tokenizer and dataset
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
dataset = load_dataset("imdb")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


### DATA PREPROCESSING

In [5]:
def preprocess_function(examples):
    inputs = ["sentiment: " + str(text) if text else "sentiment: " for text in examples["text"]]
    targets = ["positive" if label == 1 else "negative" for label in examples["label"]]

    # Tokenize input
    model_inputs = tokenizer(
        inputs, max_length=512, truncation=True, padding="max_length"
    )

    # Tokenize labels (text_target replaces the deprecated as_target_tokenizer context manager)
    labels = tokenizer(
        text_target=targets, max_length=10, truncation=True, padding="max_length"
    )

    labels_ids = labels["input_ids"]

    # Change pad token id to -100
    labels_ids = [
        [(token_id if token_id != tokenizer.pad_token_id else -100) for token_id in label]
        for label in labels_ids
    ]

    model_inputs["labels"] = labels_ids
    return model_inputs
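
The padding-to-`-100` step in the function above can be checked in isolation. The sketch below applies the same list comprehension to toy label sequences. The token ids `11` and `22` are made up for illustration; the pad id `0` matches T5's tokenizer, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the training loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace every pad token id with IGNORE_INDEX, leaving other ids untouched."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two toy label sequences: <word id>, <eos id 1>, then padding (id 0).
toy_labels = [[11, 1, 0, 0], [22, 1, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```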


In [None]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
eval_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


In [7]:
print(train_dataset[0].keys())
print(eval_dataset[0].keys())


dict_keys(['input_ids', 'attention_mask', 'labels'])
dict_keys(['input_ids', 'attention_mask', 'labels'])


In [8]:
# Metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    preds_ids, labels = eval_preds  # preds_ids = generated token IDs

    decoded_preds = tokenizer.batch_decode(preds_ids, skip_special_tokens=True)
    labels_for_decode = np.where(labels == -100, tokenizer.pad_token_id, labels)
    decoded_labels = tokenizer.batch_decode(labels_for_decode, skip_special_tokens=True)

    label_map = {"positive": 1, "negative": 0}
    
    decoded_preds = [p.lower().strip() for p in decoded_preds]
    decoded_labels = [l.lower().strip() for l in decoded_labels]

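    # Keep only pairs where both strings map to a known label. Predictions that
    # decode to anything else are dropped rather than counted as errors, so the
    # reported accuracy covers the valid subset only.
    valid_data = [
        (label_map[p], label_map[l])
        for p, l in zip(decoded_preds, decoded_labels)
        if p in label_map and l in label_map
    ]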

    if not valid_data:
        return {"accuracy": 0.0}

    pred_labels, true_labels = zip(*valid_data)
    return metric.compute(predictions=pred_labels, references=true_labels)
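
Because `compute_metrics` skips predictions that do not decode to `positive` or `negative`, its accuracy can be higher than one that counts such outputs as mistakes. The sketch below contrasts the two on toy strings. Both function names are hypothetical, and the strict variant is an alternative shown for comparison, not what this notebook ran.

```python
LABEL_MAP = {"positive": 1, "negative": 0}


def filtered_accuracy(preds, refs):
    """Accuracy over pairs where both strings are known labels (the notebook's behaviour)."""
    pairs = [(p, r) for p, r in zip(preds, refs) if p in LABEL_MAP and r in LABEL_MAP]
    if not pairs:
        return 0.0
    return sum(p == r for p, r in pairs) / len(pairs)


def strict_accuracy(preds, refs):
    """Accuracy over all references; an unparseable prediction counts as wrong."""
    if not refs:
        return 0.0
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Toy decoded outputs: the third prediction is empty, so it is not a valid label.
preds = ["positive", "negative", "", "positive"]
refs = ["positive", "negative", "negative", "negative"]
print(filtered_accuracy(preds, refs))
print(strict_accuracy(preds, refs))
```

On this toy input the filtered score is computed over three pairs and the strict score over four, so the strict score is lower.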


In [None]:
# Beam search generate function for eval
def generate_with_beam_search(model, inputs, num_beams=5, max_length=10):
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True
    )
    return outputs

# Custom trainer that runs beam-search generation during evaluation
# (Seq2SeqTrainer is already imported at the top of the notebook)

class BeamSearchTrainer(Seq2SeqTrainer):
    def __init__(self, *args, custom_tokenizer=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._custom_tokenizer = custom_tokenizer 

    def prediction_step(self, model, inputs, prediction_loss_only=False, ignore_keys=None):
        # unwrap DataParallel 
        model_to_use = model.module if hasattr(model, "module") else model

        has_labels = "labels" in inputs
        inputs = self._prepare_inputs(inputs)

        with torch.no_grad():
            generated_tokens = model_to_use.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs.get("attention_mask", None),
                num_beams=5,
                max_length=10,
                early_stopping=True,
                return_dict_in_generate=True,
                output_scores=False
            ).sequences

            # compute loss
            loss = model_to_use(**inputs).loss if has_labels else None

        # Pad to a fixed length of 10 tokens if needed
        pad_token_id = getattr(self._custom_tokenizer, "pad_token_id", 0)
        if generated_tokens.shape[-1] < 10:
            generated_tokens = torch.nn.functional.pad(
                generated_tokens, (0, 10 - generated_tokens.shape[-1]), value=pad_token_id
            )

        labels = inputs["labels"] if has_labels else None
        return (loss, generated_tokens, labels)


# Training arguments template
from transformers import EarlyStoppingCallback

def get_training_args(output_dir):
    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",           # Evaluate once per epoch
        save_strategy="epoch",           # Save once per epoch
        logging_strategy="epoch",        # Log once per epoch
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=10,
        weight_decay=0.01,
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
        report_to="none"
    )


#### LORA FINE-TUNING
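
As a rough guide to how small the LoRA update is, the sketch below estimates the number of trainable adapter parameters for the configuration used in this section (`r=8`, adapters on the `q` and `v` projections). It assumes the published `t5-small` dimensions (hidden size 512, 8 heads of size 64, 6 encoder and 6 decoder layers) and no bias terms; the constant names are illustrative only.

```python
D_MODEL = 512            # hidden size of t5-small (assumed from its published config)
INNER_DIM = 8 * 64       # num_heads * d_kv, the output size of each q/v projection
ENCODER_LAYERS = 6
DECODER_LAYERS = 6
RANK = 8                 # LoRA rank r
TARGETS_PER_ATTENTION = 2  # adapters are attached to q and v only

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

# One adapter adds A (RANK x D_MODEL) and B (INNER_DIM x RANK).
params_per_adapter = RANK * D_MODEL + INNER_DIM * RANK

lora_params = attention_blocks * TARGETS_PER_ATTENTION * params_per_adapter
print(lora_params)  # 294912
```

On the real model, calling `model.print_trainable_parameters()` on the object returned by `get_peft_model` reports the count that PEFT actually sees, which is the number to trust if it differs from this estimate.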

In [13]:
# ------------- LoRA Fine-tuning -------------
print("LoRA Fine-tuning")

base_model = T5ForConditionalGeneration.from_pretrained(model_name)

# Instantiate data collator with your tokenizer and model (for label padding)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=base_model)

# Freeze base model weights (get_peft_model also freezes them; kept here to be explicit)
for param in base_model.parameters():
    param.requires_grad = False

# LoRA config 
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q", "v"], 
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

model = get_peft_model(base_model, lora_config)

training_args = get_training_args("t5-imdb-lora")

trainer = BeamSearchTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,   
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    custom_tokenizer=tokenizer  # Pass the tokenizer for beam search
)

trainer.train()

print("Evaluation:")
eval_results = trainer.evaluate()
print(eval_results)

LoRA Fine-tuning


No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Accuracy
1,2.6828,0.153016,0.873795
2,0.1713,0.143606,0.88416
3,0.1615,0.134185,0.89148
4,0.1536,0.1312,0.89612
5,0.1471,0.128708,0.89696
6,0.1454,0.12562,0.89896
7,0.1444,0.124914,0.90052
8,0.1417,0.123136,0.90256
9,0.1405,0.123292,0.902
10,0.1404,0.123021,0.90252


Evaluation:


{'eval_loss': 0.12313584238290787, 'eval_accuracy': 0.90256, 'eval_runtime': 132.2732, 'eval_samples_per_second': 189.003, 'eval_steps_per_second': 2.956, 'epoch': 10.0}


In [14]:
print("LoRA Fine-tuning result: ", eval_results)

LoRA Fine-tuning result:  {'eval_loss': 0.12313584238290787, 'eval_accuracy': 0.90256, 'eval_runtime': 132.2732, 'eval_samples_per_second': 189.003, 'eval_steps_per_second': 2.956, 'epoch': 10.0}
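
All ten epochs ran even though the trainer was given `EarlyStoppingCallback(early_stopping_patience=3)`. The sketch below is a simplified model of patience counting, not the library's implementation: it stops once the metric has failed to beat its best value for `patience` consecutive evaluations. The function name `epochs_run` and its `threshold` argument are hypothetical; the accuracy values are copied from the log above.

```python
def epochs_run(metric_per_epoch, patience, threshold=0.0):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = None
    stale = 0  # consecutive evaluations without improvement
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None or value - best > threshold:
            best = value
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(metric_per_epoch)


# Accuracy per epoch from the LoRA run logged above.
lora_accuracy = [0.873795, 0.88416, 0.89148, 0.89612, 0.89696,
                 0.89896, 0.90052, 0.90256, 0.902, 0.90252]
print(epochs_run(lora_accuracy, patience=3))  # 10
```

Accuracy peaks at epoch 8 and only two non-improving epochs follow, so the patience of 3 is never exhausted. That is also why the final evaluation, run after `load_best_model_at_end=True` restores the best checkpoint, reports the epoch-8 accuracy of 0.90256.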


#### STANDARD FINE-TUNING

In [None]:
# STANDARD FINE-TUNING

print("=== Standard Fine-tuning ===")
model = T5ForConditionalGeneration.from_pretrained(model_name)

training_args = get_training_args("t5-imdb-standard")  # Should return Seq2SeqTrainingArguments

trainer = BeamSearchTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    custom_tokenizer=tokenizer,  # Pass the tokenizer for beam search
)

trainer.train()

# Evaluation after training
results_standard = trainer.evaluate()

# print("Standard fine-tuning accuracy:", results_standard.get("eval_accuracy", "Metric not available"))
print("Standard fine-tuning results:", results_standard)


=== Standard Fine-tuning ===


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3053,0.116356,0.91268
2,0.1212,0.116875,0.91288
3,0.1106,0.101123,0.9244
4,0.1016,0.099826,0.92528
5,0.0962,0.101324,0.92704
6,0.0913,0.097965,0.92816
7,0.0858,0.099598,0.92908
8,0.0848,0.099813,0.9286
9,0.0809,0.100888,0.92944
10,0.0807,0.10016,0.92872


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Standard fine-tuning results: {'eval_loss': 0.10088798403739929, 'eval_accuracy': 0.92944, 'eval_runtime': 126.9253, 'eval_samples_per_second': 196.966, 'eval_steps_per_second': 3.081, 'epoch': 10.0}


#### FREEZE FINE-TUNING

In [None]:
# ------------- Freeze Fine-tuning -------------
print("Freeze Fine-tuning")

from transformers import T5ForConditionalGeneration

# Load base model
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Freeze encoder
for param in model.encoder.parameters():
    param.requires_grad = False

# Freeze shared embedding
for param in model.shared.parameters():
    param.requires_grad = False

# Setup training arguments
training_args = get_training_args("t5-imdb-freeze")

# Initialize trainer
trainer = BeamSearchTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    custom_tokenizer=tokenizer,  # Pass the tokenizer for beam search
)

# Start training
trainer.train()

# Evaluate model
print("Evaluation:")
freeze_results = trainer.evaluate()
print("Freeze fine-tuning results:", freeze_results)


Freeze Fine-tuning


Epoch,Training Loss,Validation Loss,Accuracy
1,0.8956,0.13217,0.891396
2,0.1496,0.121123,0.90324
3,0.1388,0.115645,0.907396
4,0.1347,0.113589,0.908636
5,0.1311,0.111689,0.911076
6,0.1293,0.110224,0.911673
7,0.126,0.10963,0.913157
8,0.1249,0.110075,0.912953
9,0.1237,0.108858,0.914433
10,0.1238,0.108736,0.914513


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Evaluation:


Freeze fine-tuning results: {'eval_loss': 0.10873585939407349, 'eval_accuracy': 0.9145131610528843, 'eval_runtime': 142.3442, 'eval_samples_per_second': 175.631, 'eval_steps_per_second': 2.747, 'epoch': 10.0}


#### SUMMARY TABLE

In [17]:
from tabulate import tabulate

# Build a table comparing the three fine-tuning techniques
results_table = [
    ["Standard", round(results_standard["eval_loss"], 4), round(results_standard["eval_accuracy"], 4)],
    ["Freeze", round(freeze_results["eval_loss"], 4), round(freeze_results["eval_accuracy"], 4)],
    ["LoRA", round(eval_results["eval_loss"], 4), round(eval_results["eval_accuracy"], 4)],
]

# Print the table
print(tabulate(results_table, headers=["Fine-tune Method", "Eval Loss", "Eval Accuracy"], tablefmt="github"))


| Fine-tune Method   |   Eval Loss |   Eval Accuracy |
|--------------------|-------------|-----------------|
| Standard           |      0.1009 |          0.9294 |
| Freeze             |      0.1087 |          0.9145 |
| LoRA               |      0.1231 |          0.9026 |
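
To read the table as gaps rather than absolute scores, the short sketch below computes how many percentage points each method trails the best one. The accuracy values are copied from the table above; the function name `gaps_in_points` is hypothetical.

```python
def gaps_in_points(accuracy_by_method):
    """Return (best method, {method: gap to best in percentage points})."""
    best = max(accuracy_by_method, key=accuracy_by_method.get)
    top = accuracy_by_method[best]
    gaps = {name: (top - value) * 100 for name, value in accuracy_by_method.items()}
    return best, gaps


accuracy = {"Standard": 0.9294, "Freeze": 0.9145, "LoRA": 0.9026}
best, gaps = gaps_in_points(accuracy)
for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points behind {best}")
```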


#### CONCLUSION

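1. **Standard Fine-tuning** (Accuracy: **92.94%**, Loss: **0.1009**):
   - Achieves the **lowest eval loss** and **highest accuracy** of the three methods.
   - Updating all weights gives the best result here, about 1.5 percentage points ahead of Freeze and about 2.7 ahead of LoRA.
   - It is also expected to be the **most resource-intensive** option (memory, GPU, training time) because every parameter receives gradients; this notebook does not measure training cost directly.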

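2. **Freeze** (Accuracy: **91.45%**, Loss: **0.1087**):
   - Freezes the encoder and the shared embeddings, so only the remaining decoder weights are updated.
   - About 1.5 percentage points behind Standard in accuracy.
   - A reasonable trade-off between **efficiency** and **accuracy** when resources are limited; it may also help limit overfitting, although that is not tested here.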

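3. **LoRA (Low-Rank Adaptation)** (Accuracy: **90.26%**, Loss: **0.1231**):
   - Produces the **highest loss** and **lowest accuracy**, about 2.7 percentage points behind Standard.
   - Trains only rank-8 adapters on the `q` and `v` attention projections, which makes it the most **parameter-efficient** of the three.
   - An acceptable choice when **scalability** or a **low-resource environment** matters more than the last few points of accuracy.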

---
- **Standard**: Best performance (92.94% accuracy, 0.1009 loss), highest cost.
- **Freeze**: Balanced trade-off (91.45% accuracy, 0.1087 loss).
- **LoRA**: Efficient but slightly lower performance (90.26% accuracy, 0.1231 loss).

> **Summary**: For sentiment analysis on the IMDB dataset using `t5-small`, full fine-tuning (Standard) gives the best evaluation accuracy and loss of the three methods, ahead of Freeze by about 1.5 and LoRA by about 2.7 percentage points.