# Fine-Tuning a T5 Model on the SQuAD Dataset for Generative Question Answering

This notebook builds a generative question answering (Q&A) system by fine-tuning T5 on the SQuAD dataset. To keep the run small, the model is trained on the first 10 examples of the training split, and validation and testing each use the first 10 examples of the SQuAD validation split, because SQuAD has no public test set. The process begins by loading and preprocessing the data: each question and context pair is tokenized as the model input, and the first reference answer is tokenized as the target. The t5-small model is then fine-tuned with the Hugging Face Trainer API, which reports the validation loss after each epoch. After training, the model is evaluated on the test subset using the ROUGE metric to measure the quality of its generated answers. Two types of input are tested: one where both the question and the context are provided, and one with only the question. ROUGE scores are computed for both, which shows how much the model's performance depends on the presence of context.

# 0. Import libraries

In [1]:
import os
import torch
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from rouge_score import rouge_scorer
from tqdm import tqdm

In [3]:
# Set seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x1adefb50670>

# 1. Load and preprocess SQuAD dataset

In [5]:
# 1. Load and preprocess SQuAD dataset
dataset = load_dataset("squad")

In [7]:
# Take subsets to avoid overload
train_dataset = dataset["train"].select(range(10))
val_dataset = dataset["validation"].select(range(10))
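# SQuAD has no public test set, so the test subset is also drawn from the
# validation split. Note that it selects the same 10 examples as val_dataset.
test_dataset = dataset["validation"].select(range(10))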
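
Since `val_dataset` and `test_dataset` both select `range(10)` from the validation split, they contain the same examples. One way to keep them disjoint is to take consecutive, non-overlapping index blocks. The sketch below assumes that is acceptable for this experiment; the helper name `disjoint_index_ranges` is hypothetical and not part of the notebook.

```python
def disjoint_index_ranges(n_val, n_test, start=0):
    """Return two non-overlapping index ranges taken from one split."""
    val_idx = range(start, start + n_val)
    test_idx = range(start + n_val, start + n_val + n_test)
    return val_idx, test_idx

val_idx, test_idx = disjoint_index_ranges(n_val=10, n_test=10)
print(list(val_idx))   # indices 0..9
print(list(test_idx))  # indices 10..19

# With the notebook's objects this would be used as (not run here):
# val_dataset = dataset["validation"].select(val_idx)
# test_dataset = dataset["validation"].select(test_idx)
```

Switching to disjoint subsets would change the evaluation data, so the scores reported later in this notebook would no longer apply.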

In [10]:
# Load tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [12]:
# Preprocessing function
def preprocess(example):
    input_text = f"question: {example['question']}  context: {example['context']}"
    target_text = example["answers"]["text"][0]
    input_enc = tokenizer(input_text, padding="max_length", truncation=True, max_length=512)
    target_enc = tokenizer(target_text, padding="max_length", truncation=True, max_length=32)
    input_enc["labels"] = target_enc["input_ids"]
    return input_enc
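
Because the targets in `preprocess` are padded to `max_length=32`, the padded positions end up in `labels` as ordinary token ids and are counted in the loss. A common remedy, sketched here, is to replace pad ids with -100, the label value that PyTorch's cross-entropy loss ignores by default. The helper name `mask_pad_labels` is hypothetical, the token ids are made up for illustration, and the pad id of 0 is assumed to match `tokenizer.pad_token_id` for T5.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def mask_pad_labels(label_ids, pad_token_id=0):
    """Return a copy of label_ids with padding positions replaced by IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]

# Toy example: three real token ids followed by two pad tokens (ids are illustrative).
example_labels = [363, 19, 1, 0, 0]
masked = mask_pad_labels(example_labels)
print(masked)
```

Inside `preprocess`, this could be applied as `input_enc["labels"] = mask_pad_labels(target_enc["input_ids"], tokenizer.pad_token_id)`. Doing so changes the loss values, so they would not be directly comparable with the ones recorded in this notebook.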

In [14]:
# Preprocess the datasets
train_enc = train_dataset.map(preprocess, batched=False)
val_enc = val_dataset.map(preprocess, batched=False)
test_enc = test_dataset.map(preprocess, batched=False)

In [16]:
# Set format
columns = ['input_ids', 'attention_mask', 'labels']
train_enc.set_format(type="torch", columns=columns)
val_enc.set_format(type="torch", columns=columns)
test_enc.set_format(type="torch", columns=columns)

# 2. Fine-tune T5 model

In [18]:
# 2. Fine-tune T5 model
training_args = TrainingArguments(
    output_dir="./t5_squad",
    evaluation_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir="./logs"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=val_enc,
    tokenizer=tokenizer,
    data_collator=data_collator
)

trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 16.777882 |
| 2 | No log | 12.778071 |
| 3 | No log | 11.265055 |


TrainOutput(global_step=6, training_loss=11.959903717041016, metrics={'train_runtime': 17.0941, 'train_samples_per_second': 1.755, 'train_steps_per_second': 0.351, 'total_flos': 4060254044160.0, 'train_loss': 11.959903717041016, 'epoch': 3.0})
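
The `global_step=6` in this output follows from the training settings: 10 training examples at a per-device batch size of 8 give 2 optimizer steps per epoch, and 3 epochs give 6 steps. The Training Loss column shows "No log" because `TrainingArguments` logs every 500 steps by default, which this run never reaches. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation.

```python
import math

num_examples = 10             # size of train_dataset
batch_size = 8                # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
default_logging_steps = 500   # TrainingArguments default for logging_steps

# The last batch of each epoch is partial (2 examples), but still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Training loss is only logged once the step count reaches logging_steps.
logs_training_loss = total_steps >= default_logging_steps
print(logs_training_loss)
```

Passing a smaller value such as `logging_steps=1` to `TrainingArguments` would make the training loss appear for a run this short.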

In [20]:
# Report the final validation loss
metrics = trainer.evaluate()
print("Validation Loss:", metrics["eval_loss"])

Validation Loss: 11.265054702758789


In [22]:
# Save model
model.save_pretrained("./t5_squad_model")
tokenizer.save_pretrained("./t5_squad_model")

('./t5_squad_model\\tokenizer_config.json',
 './t5_squad_model\\special_tokens_map.json',
 './t5_squad_model\\spiece.model',
 './t5_squad_model\\added_tokens.json')

# 3. Generate answers for the test set

In [24]:
# 3. Generate answers for the test set
def generate_answers(dataset, use_context=True, limit=100):
    # Never request more rows than the dataset has; select() fails on out-of-range indices
    subset = dataset.select(range(min(limit, len(dataset))))
    inputs = []
    for ex in subset:
        if use_context:
            # Same prompt format as in preprocess()
            inputs.append(f"question: {ex['question']}  context: {ex['context']}")
        else:
            inputs.append(f"question: {ex['question']}")
    # Tokenize the prompts as one padded batch and move it to the model's device
    tokenized = tokenizer(
        inputs,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    ).to(model.device)
    model.eval()  # disable dropout during generation
    outputs = model.generate(**tokenized)
    answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Use the first reference answer, matching the training targets
    references = [ex["answers"]["text"][0] for ex in subset]
    return answers, references

In [26]:
# Now call it:
answers_ctx, refs_ctx = generate_answers(test_dataset, use_context=True,  limit=10)
answers_noctx, refs_noctx = generate_answers(test_dataset, use_context=False, limit=10)

# 4. Evaluate on test set using ROUGE

In [28]:
# 4. Evaluate on test set using ROUGE
def compute_rouge(predictions, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    results = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)
        for k in results:
            results[k].append(score[k].fmeasure)
    # Average the per-example F-measure for each ROUGE variant
    return {k: {"fmeasure": sum(results[k])/len(results[k])} for k in results}

rouge_with_ctx = compute_rouge(answers_ctx, refs_ctx)
rouge_no_ctx   = compute_rouge(answers_noctx, refs_noctx)

print("\nROUGE with context:", rouge_with_ctx)
print("\nROUGE without context:", rouge_no_ctx)


ROUGE with context: {'rouge1': {'fmeasure': 0.7}, 'rouge2': {'fmeasure': 0.6}, 'rougeL': {'fmeasure': 0.7}}

ROUGE without context: {'rouge1': {'fmeasure': 0.0}, 'rouge2': {'fmeasure': 0.0}, 'rougeL': {'fmeasure': 0.0}}
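
The scores above are averages of per-example ROUGE F-measures (`score[k].fmeasure` in `compute_rouge`). To make clear what a single ROUGE-1 F-measure captures, here is a simplified sketch based on unigram overlap. It assumes lowercased whitespace tokenization and no stemming, so its values can differ from those of `rouge_score`; the function name `rouge1_fmeasure` is hypothetical and the example strings are illustrative, not model outputs.

```python
from collections import Counter

def rouge1_fmeasure(reference, prediction):
    """Simplified ROUGE-1 F-measure: harmonic mean of unigram precision and recall."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)

exact = rouge1_fmeasure("Denver Broncos", "Denver Broncos")
partial = rouge1_fmeasure("Denver Broncos", "the Denver Broncos")
miss = rouge1_fmeasure("Denver Broncos", "Carolina Panthers")
print(exact, partial, miss)
```

An exact match scores 1, an answer with extra or missing words scores between 0 and 1, and an answer sharing no words with the reference scores 0.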
