### T5 MODEL - TASK: QUESTION ANSWERING

#### SETUP

##### DATASET: SQuAD v1.1
- Benchmark dataset for reading comprehension.
- Contains 100,000+ crowd-sourced QA pairs from Wikipedia.
- Each answer is a span of text from a related paragraph.
- Commonly used to evaluate models' ability to extract information.
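
To make the span property concrete, here is a minimal sketch of how `answer_start` indexes into the context. The record is a hypothetical, shortened example in the SQuAD layout rather than an actual dataset entry, and `answer_span` is an illustrative helper that is not used elsewhere in this notebook.

```python
# Hypothetical SQuAD-style record (illustration only, not taken from the dataset).
context = "The Great Fire of London began on Pudding Lane in September 1666."
answer = "1666"
record = {
    "context": context,
    "question": "In what year did the Great Fire of London happen?",
    # answer_start is the character offset of the answer inside the context
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}


def answer_span(record, i=0):
    """Return the substring of the context addressed by the i-th answer."""
    start = record["answers"]["answer_start"][i]
    text = record["answers"]["text"][i]
    return record["context"][start:start + len(text)]


print(answer_span(record))
```

If the slice ever differs from the stored answer text, the offsets and the context are out of sync.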

##### MODEL: T5-Small
- A lightweight version of the T5 model (~60M parameters).
- Treats all NLP tasks as text-to-text problems.
- Suitable for fine-tuning with limited computational resources.
- Used here for question answering on SQuAD v1.1: the model generates the answer text instead of predicting span indices.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

In [24]:
import warnings
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

In [56]:
model_name = "t5-small"

In [None]:
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback
)
from datasets import load_dataset
from peft import (
    get_peft_model,
    LoraConfig,
    PrefixTuningConfig,
    PromptTuningConfig,
    TaskType,
    PromptTuningInit
)
import evaluate
import torch

In [3]:
# Load the tokenizer and base model from the checkpoint that is fine-tuned below (t5-small)
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [4]:
dataset = load_dataset("squad")

### DATA PREPROCESSING

In [None]:
def preprocess_function(examples):
    """Format SQuAD examples as T5 text-to-text pairs and tokenize them."""
    inputs, targets = [], []

    for q, c, a in zip(examples["question"], examples["context"], examples["answers"]):
        if a["text"]:  # skip examples without an answer
            inputs.append(f"question: {q} context: {c}")
            targets.append(a["text"][0])  # the first reference answer is the target

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Tokenize targets with `text_target`, which replaces the deprecated
    # `tokenizer.as_target_tokenizer()` context manager
    labels = tokenizer(text_target=targets, max_length=64, truncation=True, padding="max_length")

    # Replace padding token ids with -100 so the loss ignores them
    labels_input_ids = [
        [(token if token != tokenizer.pad_token_id else -100) for token in label]
        for label in labels["input_ids"]
    ]

    model_inputs["labels"] = labels_input_ids
    model_inputs["decoder_attention_mask"] = labels["attention_mask"]

    return model_inputs
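
The padding-to-`-100` step in the preprocessing function can be illustrated in isolation. This is a minimal sketch with hypothetical token ids. It assumes the T5 tokenizer's pad id is 0 and its end-of-sequence id is 1, and that the loss skips targets equal to -100 (the default `ignore_index` of PyTorch's cross-entropy loss). `mask_padding` is an illustrative helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips by default


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]


# Hypothetical label ids: two answer tokens, </s> (id 1), then padding (id 0).
padded = [1712, 5, 1, 0, 0, 0]
masked = mask_padding(padded, pad_token_id=0)
print(masked)  # [1712, 5, 1, -100, -100, -100]
```

Without this masking, the padded positions would be treated as real targets and would contribute to the reported loss.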


In [7]:
print(dataset["train"][0])

{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}


In [None]:
# apply preprocessing
dataset = dataset.filter(lambda x: len(x["answers"]["text"]) > 0)
dataset = dataset.map(preprocess_function, batched=True, batch_size=100, remove_columns=dataset["train"].column_names)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels", "decoder_attention_mask"])


In [None]:
# Load evaluation metric (SQuAD)
metric = evaluate.load("squad")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # If predictions is a tuple (sometimes logits + other outputs), take the first element
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Convert predictions to tensor if not already
    pred_tensor = torch.tensor(predictions) if not isinstance(predictions, torch.Tensor) else predictions
    label_tensor = torch.tensor(labels) if not isinstance(labels, torch.Tensor) else labels

    # If predictions are logits (3D: batch_size x seq_len x vocab_size), take argmax over vocab dim
    if pred_tensor.ndim == 3:
        pred_ids = torch.argmax(pred_tensor, dim=-1)
    else:
        pred_ids = pred_tensor

    # If labels have -100 as padding, replace with tokenizer.pad_token_id for decoding
    labels_ids = label_tensor.clone()
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
    labels_ids[labels_ids == -100] = pad_token_id

    pred_texts = []
    label_texts = []

    for pred_seq, label_seq in zip(pred_ids, labels_ids):
        # Clip indices that are out of tokenizer vocab range
        vocab_size = len(tokenizer)
        pred_seq_clipped = [token if 0 <= token < vocab_size else pad_token_id for token in pred_seq.tolist()]
        label_seq_clipped = [token if 0 <= token < vocab_size else pad_token_id for token in label_seq.tolist()]

        # Decode sequences safely
        pred_text = tokenizer.decode(pred_seq_clipped, skip_special_tokens=True)
        label_text = tokenizer.decode(label_seq_clipped, skip_special_tokens=True)

        pred_texts.append(pred_text)
        label_texts.append(label_text)

    predictions_list = [{"prediction_text": p, "id": str(i)} for i, p in enumerate(pred_texts)]
    references_list = [{"answers": {"text": [l], "answer_start": [0]}, "id": str(i)} for i, l in enumerate(label_texts)]


    return metric.compute(predictions=predictions_list, references=references_list)
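
For readers unfamiliar with the two scores returned by the `squad` metric, the following is a simplified sketch of how exact match and token-level F1 are conventionally computed for SQuAD. It is an illustration written for this notebook, not the code that `evaluate.load("squad")` runs. The real metric also takes the maximum over all reference answers and reports values on a 0 to 100 scale.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Saint Bernadette Soubirous"
print(exact_match("Bernadette Soubirous", reference))  # 0.0
print(round(f1_score("Bernadette Soubirous", reference), 2))  # 0.8
```

Because `preprocess_function` keeps only the first reference answer for each validation question, the scores reported in this notebook are computed against a single reference per question.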


#### FREEZE TUNING

In [58]:
print("\n==== Training with FREEZE TUNING ====\n")
model_name = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cpu")

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze specific parameters - example: unfreeze the LM head only
for param in model.lm_head.parameters():
    param.requires_grad = True

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-freeze-tuning",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    num_train_epochs=10,
    logging_dir="./logs-freeze",
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none",
    fp16=False,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model)

trainer_freeze = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer_freeze.train()
model.save_pretrained("./t5-freeze-tuning")
tokenizer.save_pretrained("./t5-freeze-tuning")
print("Finished training with FREEZE TUNING.\n" + "="*50)



==== Training with FREEZE TUNING ====



Epoch,Training Loss,Validation Loss,Exact Match,F1
1,0.3893,0.447231,61.201514,75.796557
2,0.4528,0.44563,61.229896,75.789781
3,0.4184,0.444857,61.210974,75.748865
4,0.4536,0.444233,61.258278,75.793609
5,0.3654,0.443798,61.315043,75.8237
6,0.4059,0.44349,61.229896,75.775111
7,0.4045,0.443397,61.210974,75.759836
8,0.3462,0.443319,61.201514,75.755994
9,0.4263,0.443271,61.210974,75.765329
10,0.4119,0.443246,61.210974,75.771975


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


Finished training with FREEZE TUNING.


In [59]:
freeze_results = trainer_freeze.evaluate()
print("Evaluate result with FREEZE: ", freeze_results)

Evaluate result with FREEZE:  {'eval_loss': 0.44324642419815063, 'eval_exact_match': 61.21097445600757, 'eval_f1': 75.77197490083633, 'eval_runtime': 324.2085, 'eval_samples_per_second': 32.602, 'eval_steps_per_second': 1.021, 'epoch': 10.0}


#### P-TUNING V2 

In [28]:
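# P-Tuning v2
# Note: PromptTuningConfig trains soft prompt embeddings at the input layer only (prompt tuning).
# P-Tuning v2 as published inserts trainable prompts at every layer, which is closer to PEFT's PrefixTuningConfig.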
print("\n==== Training with P-TUNING V2 ====\n")
model_name = "t5-small"
base_model_ptv2 = AutoModelForSeq2SeqLM.from_pretrained(model_name)
peft_config_ptv2 = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Answer the question based on the context.",  # Task-specific initialization
    num_virtual_tokens=20,
    tokenizer_name_or_path=model_name,
    inference_mode=False,
)
model_ptv2 = get_peft_model(base_model_ptv2, peft_config_ptv2).to("cpu")  # Force CPU
training_args_ptv2 = Seq2SeqTrainingArguments(
    output_dir="./t5-peft-ptv2",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    num_train_epochs=10,
    logging_dir="./logs-ptv2",
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none",
    fp16=False,  # Disable FP16 for CPU
)
# Keep the collator's default label padding (-100) so any padded label positions are ignored by the loss
data_collator_ptv2 = DataCollatorForSeq2Seq(tokenizer, model_ptv2)
trainer_ptv2 = Seq2SeqTrainer(
    model=model_ptv2,
    args=training_args_ptv2,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator_ptv2,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    compute_metrics=compute_metrics,
)
trainer_ptv2.train()
model_ptv2.save_pretrained("./t5-peft-ptv2")
tokenizer.save_pretrained("./t5-peft-ptv2")
print("Finished training with P-TUNING V2.\n" + "="*50)


==== Training with P-TUNING V2 ====



No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,0.7281,0.549748,47.672658,61.847324
2,0.7268,0.53038,49.697256,64.009655
3,0.645,0.518458,50.860927,65.183149
4,0.6757,0.509789,51.835383,66.193497
5,0.6266,0.503337,52.289499,66.728696
6,0.6108,0.498758,53.017975,67.414628
7,0.5676,0.495467,53.424787,67.801053
8,0.5387,0.493266,53.774834,68.21003
9,0.5967,0.492019,53.945128,68.389107
10,0.5709,0.491614,54.058657,68.530349


Finished training with P-TUNING V2.


In [29]:
ptv2_results = trainer_ptv2.evaluate()
print("Evaluation results with P-TUNING V2:", ptv2_results)

Evaluation results with P-TUNING V2: {'eval_loss': 0.4916144907474518, 'eval_exact_match': 54.058656575212865, 'eval_f1': 68.53034923827283, 'eval_runtime': 327.1992, 'eval_samples_per_second': 32.304, 'eval_steps_per_second': 1.012, 'epoch': 10.0}


#### LORA FINETUNING

In [21]:
from transformers import EarlyStoppingCallback
# LoRA Training
print("\n==== Training with LORA ====\n")
model_name = "t5-small"
base_model_lora = AutoModelForSeq2SeqLM.from_pretrained(model_name)
peft_config_lora = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model_lora = get_peft_model(base_model_lora, peft_config_lora)
training_args_lora = Seq2SeqTrainingArguments(
    output_dir="./t5-peft-lora",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    predict_with_generate=True,
    generation_max_length=64,
    generation_num_beams=4,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    num_train_epochs=10,
    logging_dir="./logs-lora",
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none",
    fp16=torch.cuda.is_available(),
)
data_collator_lora = DataCollatorForSeq2Seq(tokenizer, model_lora)
trainer_lora = Seq2SeqTrainer(
    model=model_lora,
    args=training_args_lora,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator_lora,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    compute_metrics=compute_metrics,
)
trainer_lora.train()
model_lora.save_pretrained("./t5-peft-lora")
tokenizer.save_pretrained("./t5-peft-lora")
print("Finished training with LORA.\n" + "="*50)


==== Training with LORA ====



No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Exact Match,F1
1,0.4166,0.431055,61.28666,75.619932
2,0.3811,0.420659,61.513718,75.968633
3,0.367,0.418586,62.071902,76.217438
4,0.3653,0.414501,61.958373,76.028235
5,0.3686,0.409802,62.317881,76.458878
6,0.3706,0.41283,62.56386,76.67981
7,0.3396,0.413686,62.317881,76.435071
8,0.3583,0.410395,62.639546,76.681935


Finished training with LORA.


In [22]:
lora_results = trainer_lora.evaluate()
print("Evaluate result with LORA: ", lora_results)

Evaluate result with LORA:  {'eval_loss': 0.4098019003868103, 'eval_exact_match': 62.317880794701985, 'eval_f1': 76.4588778571315, 'eval_runtime': 260.8604, 'eval_samples_per_second': 40.52, 'eval_steps_per_second': 0.636, 'epoch': 8.0}
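
The LoRA log stops at epoch 8, while the evaluation above reports the epoch-5 numbers. The sketch below replays the logged validation losses through a simplified patience rule to show why. It is a stand-in for the behaviour of `EarlyStoppingCallback(early_stopping_patience=3)` combined with `load_best_model_at_end=True`, assuming validation loss is the monitored metric. It is not the callback's actual implementation, and `simulate_early_stopping` is an illustrative name.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve on the
    best loss seen so far. stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses copied from the LoRA training log above.
lora_val_losses = [0.431055, 0.420659, 0.418586, 0.414501,
                   0.409802, 0.41283, 0.413686, 0.410395]
stop_epoch, best_epoch = simulate_early_stopping(lora_val_losses, patience=3)
print(stop_epoch, best_epoch)  # 8 5
```

Epochs 6 to 8 all fail to beat the epoch-5 loss of 0.409802, so training halts at epoch 8 and the epoch-5 checkpoint is the one evaluated.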


### SUMMARY TABLE

In [60]:
from tabulate import tabulate

# Build a comparison table for the three fine-tuning techniques
results_table = [
    ["P-Tuning v2", round(ptv2_results["eval_loss"], 4), round(ptv2_results["eval_exact_match"], 4), round(ptv2_results["eval_f1"], 4)],
    ["Freeze", round(freeze_results["eval_loss"], 4), round(freeze_results["eval_exact_match"], 4), round(freeze_results["eval_f1"], 4)],
    ["LoRA", round(lora_results["eval_loss"], 4), round(lora_results["eval_exact_match"], 4), round(lora_results["eval_f1"], 4)]
]

# Print the table
print(tabulate(results_table, headers=["Fine-tune Method", "Eval Loss" , "Eval Exact Match", "Eval F1"], tablefmt="github"))


| Fine-tune Method   |   Eval Loss |   Eval Exact Match |   Eval F1 |
|--------------------|-------------|--------------------|-----------|
| P-Tuning v2        |      0.4916 |            54.0587 |   68.5303 |
| Freeze             |      0.4432 |            61.211  |   75.772  |
| LoRA               |      0.4098 |            62.3179 |   76.4589 |


#### INFERENCE

In [61]:
# Simpler alternative sample (disabled):
# sample_question = "What is the capital of France?"
# sample_context = "France is a country in Western Europe. Its capital city is Paris, known for the Eiffel Tower."
# Sample QA example (more challenging)
sample_question = "In what year did the Great Fire of London happen?"
sample_context = "The Great Fire of London destroyed much of the city, including over 13,000 houses and St. Paul’s Cathedral. It began on Pudding Lane and lasted several days in September 1666."


# Prepare input
input_text = f"question: {sample_question[:100]} context: {sample_context[:400]}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True, padding="max_length")

# Function to generate and decode answer
def generate_answer(model, inputs, tokenizer, max_length=64, num_beams=4):
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"].to(model.device),
            attention_mask=inputs["attention_mask"].to(model.device),
            max_length=max_length,
            num_beams=num_beams,
            early_stopping=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [62]:
from peft import PeftModel, PeftConfig
# 1. Load and test LoRA model
print("\n==== Testing LoRA Model ====\n")
peft_config_lora = PeftConfig.from_pretrained("./t5-peft-lora")
base_model_lora = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model_lora = PeftModel.from_pretrained(base_model_lora, "./t5-peft-lora")
model_lora.to("cuda" if torch.cuda.is_available() else "cpu")
answer_lora = generate_answer(model_lora, inputs, tokenizer)
print(f"Question: {sample_question}")
print(f"Context: {sample_context}")
print(f"LoRA Answer: {answer_lora}")


==== Testing LoRA Model ====



Question: In what year did the Great Fire of London happen?
Context: The Great Fire of London destroyed much of the city, including over 13,000 houses and St. Paul’s Cathedral. It began on Pudding Lane and lasted several days in September 1666.
LoRA Answer: 1666


In [65]:
print("\n==== Testing Freeze Tuning Model ====\n")

# 2. Load the freeze-tuned model as a regular model, without PEFT
model_freeze = AutoModelForSeq2SeqLM.from_pretrained("./t5-freeze-tuning")
model_freeze.to("cuda" if torch.cuda.is_available() else "cpu")

answer_freeze = generate_answer(model_freeze, inputs, tokenizer)
print(f"Question: {sample_question}")
print(f"Context: {sample_context}")
print(f"Freeze Tuning Answer: {answer_freeze}")



==== Testing Freeze Tuning Model ====

Question: In what year did the Great Fire of London happen?
Context: The Great Fire of London destroyed much of the city, including over 13,000 houses and St. Paul’s Cathedral. It began on Pudding Lane and lasted several days in September 1666.
Freeze Tuning Answer: 1666


In [66]:
# 3. Load and test P-Tuning v2 model
print("\n==== Testing P-Tuning v2 Model ====\n")
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model_p_finetune_v2 = PeftModel.from_pretrained(base_model, "./t5-peft-ptv2")
model_p_finetune_v2.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate answer
answer_p_finetune_v2 = generate_answer(model_p_finetune_v2, inputs, tokenizer)
print(f"Question: {sample_question}")
print(f"Context: {sample_context}")
print(f"P-Tuning v2 Answer: {answer_p_finetune_v2}")


==== Testing P-Tuning v2 Model ====



Question: In what year did the Great Fire of London happen?
Context: The Great Fire of London destroyed much of the city, including over 13,000 houses and St. Paul’s Cathedral. It began on Pudding Lane and lasted several days in September 1666.
P-Tuning v2 Answer: 1666


In [69]:
# Compare answers
print("\n==== Comparison ====\n")
print(f"Question: {sample_question}")
print(f"Context: {sample_context}")
print(f"LoRA: {answer_lora}")
print(f"Freeze Tuning: {answer_freeze}")
print(f"P-Tuning v2: {answer_p_finetune_v2}")
print("Expected Answer: 1666")


==== Comparison ====

Question: In what year did the Great Fire of London happen?
Context: The Great Fire of London destroyed much of the city, including over 13,000 houses and St. Paul’s Cathedral. It began on Pudding Lane and lasted several days in September 1666.
LoRA: 1666
Freeze Tuning: 1666
P-Tuning v2: 1666
Expected Answer: 1666


#### CONCLUSION

Three fine-tuning strategies were evaluated on the QA task using the `t5-small` model:

- **P-Tuning v2** (Eval Loss: 0.4916, Exact Match: 54.06%, F1: 68.53%)
- **Freeze Tuning** (Eval Loss: 0.4432, Exact Match: 61.21%, F1: 75.77%)
- **LoRA** (Eval Loss: 0.4098, Exact Match: 62.32%, F1: 76.46%)

##### Summary

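- **LoRA** delivers the best results on all three metrics (lowest eval loss at 0.4098, highest exact match at 62.32%, and best F1 at 76.46%), although its margin over Freeze Tuning is small: about 1.1 points of exact match and 0.7 points of F1. The reported figures come from the epoch-5 checkpoint, which had the lowest validation loss and was reloaded after early stopping ended training at epoch 8. The LoRA run also used a batch size of 32 instead of 16, so the comparison is not fully controlled. Within those limits, the result suggests that LoRA adapts well with little parameter overhead.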

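- **Freeze Tuning** performs competitively (eval loss at 0.4432, exact match at 61.21%, F1 at 75.77%) even though only the LM head is unfrozen. Its scores are almost flat across the 10 epochs (exact match stays between 61.20 and 61.32), so most of this performance appears to come from the pre-trained checkpoint rather than from the additional training. It remains a simple option when fine-tuning budgets are tight.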

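- **P-Tuning v2** (implemented here with PEFT's `PromptTuningConfig` and 20 virtual tokens) is attractive for its low number of trainable parameters but lags behind in performance (eval loss at 0.4916, exact match at 54.06%, F1 at 68.53%). Its validation loss and scores were still improving at epoch 10, so longer training, more virtual tokens, or a different prompt initialization may narrow the gap.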
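
The statements about parameter overhead can be checked with rough arithmetic. The sketch below rests on several assumptions that this notebook does not verify:

- `t5-small` has hidden size 512, a 32,128-row vocabulary, and about 60.5M parameters.
- The LM head is a bias-free 512 x 32,128 projection.
- PEFT's default LoRA targets for T5 are the query and value projections of every attention block.
- The prompt count covers a single copy of the 20 virtual tokens.

Calling `model.print_trainable_parameters()` on the PEFT models would give the authoritative figures.

```python
# All constants are assumptions about t5-small, not values read from the checkpoint.
D_MODEL = 512                   # assumed hidden size
VOCAB_SIZE = 32_128             # assumed number of embedding rows
TOTAL_PARAMS = 60_506_624       # assumed base parameter count
ATTENTION_BLOCKS = 6 + 6 + 6    # encoder self-attn, decoder self-attn, decoder cross-attn


def lora_params(r, d_in, d_out, n_matrices):
    """Each adapted matrix gains two low-rank factors: (d_in x r) and (r x d_out)."""
    return r * (d_in + d_out) * n_matrices


estimates = {
    # r=8 on the q and v projections of every attention block (assumed default targets)
    "LoRA": lora_params(r=8, d_in=D_MODEL, d_out=D_MODEL, n_matrices=2 * ATTENTION_BLOCKS),
    # one copy of 20 virtual tokens; PEFT may allocate more than one copy for seq2seq models
    "Prompt tuning (one copy)": 20 * D_MODEL,
    # LM head only; it may share weights with the input embeddings
    "Freeze (LM head)": VOCAB_SIZE * D_MODEL,
}

for name, count in estimates.items():
    print(f"{name}: {count:,} trainable ({100 * count / TOTAL_PARAMS:.2f}% of base)")
```

Under these assumptions the LM head is by far the largest trainable block of the three, so "Freeze Tuning" as configured here is light in compute per step but not in trainable parameter count.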

##### Inference Consistency

Despite the differences in training strategy and performance metrics, all three fine-tuned models produced the **same and correct answer** (1666) on the selected QA test sample. A single example cannot show that the methods are equivalent, but it is consistent with the idea that simple questions, where the answer appears verbatim in a short context, do not require deep adaptation.

If this pattern holds on a larger and more varied sample, lighter methods such as Freeze Tuning or P-Tuning v2 could be sufficient when compute or storage is constrained and queries are relatively straightforward. The validation metrics in the summary table remain the more reliable basis for choosing a method.
