# LoRA Fine-Tuning Lab: T5-small Customer-Support Adapter

Welcome!  
This hands-on notebook walks you step by step through fine-tuning a **T5-small** model with **LoRA (Low-Rank Adaptation)** so that it answers customer-support questions in a consistent, structured style.  
Out of the box, **T5-small** has never seen our customer-support style or policy.  
If you prompt it with “My order arrived late, I want a refund,” it answers vaguely or not at all.  

In this notebook we will:

1. **Measure the baseline** and see how poorly vanilla T5 handles eight real support prompts.  
2. **Attach a tiny LoRA adapter** (rank 8, about 0.3 M trainable parameters, roughly 1 MB in fp32) and fine-tune it on **just 250 examples**.  
3. **Re-test the same prompts** to verify that the adapted model now produces concise, policy-compliant replies.  

**Key takeaway:** with LoRA we upgrade a generic language model into a task-specialist in ~10 minutes on a free Colab GPU, without touching the original 60 M parameters.

## What is LoRA (Low-Rank Adaptation)

Instead of updating every parameter of a model (tens of millions for T5-small, billions for larger models), LoRA freezes the original weights and adds two small low-rank matrices alongside selected linear layers (often the Q and V attention projections). Training adjusts only those “adapter” weights, so you need far less GPU memory, can reach good quality with small datasets, and ship adapters of a few megabytes instead of full checkpoints of hundreds of megabytes or more.
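
To make the idea concrete, here is a minimal framework-free sketch of one LoRA-adapted linear layer. The helper names and the tiny matrices are illustrative assumptions, not part of PEFT; the formula `h = W x + (alpha / r) * B (A x)` is the standard LoRA update.

```python
# Minimal sketch of the LoRA idea for one linear layer (pure Python, no framework).
# Helper names and example values are illustrative assumptions.

def full_param_count(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in

def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Weights in the two adapter matrices: B (d_out x r) and A (r x d_in)."""
    return d_out * r + r * d_in

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha: float, r: int):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# A 512 x 512 projection versus its rank-8 adapter
print(full_param_count(512, 512))     # 262144
print(lora_param_count(512, 512, 8))  # 8192

# Tiny numeric example (2x2 weight, rank 1). B starts at zero in LoRA,
# so the adapted layer initially behaves exactly like the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]    # r x d_in
B = [[0.0], [0.0]]  # d_out x r, zero-initialised
print(lora_forward(W, A, B, [2.0, 3.0], alpha=2.0, r=1))  # [2.0, 3.0]
```

For a 512 x 512 projection, the rank-8 adapter holds 32 times fewer weights than the matrix it adapts.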

## 0 Environment Clean-up
Before starting, you may want to delete any previous artefacts (checkpoints, logs, etc.) so the run is fresh and reproducible.  
Feel free to skip this cell if you have nothing to clean.

In [None]:
# Remove previous training artefacts—run only if you need a fresh start
!rm -rf t5-lora-out
!rm -rf t5-small-lora-adapter

## 1 Install Dependencies
We rely on four libraries from the **Hugging Face** ecosystem:
- **`transformers`**: model and trainer APIs.
- **`datasets`**: efficient data loading.
- **`peft`**: LoRA and other parameter-efficient fine-tuning methods.
- **`accelerate`**: transparent device placement (CPU / single-GPU / multi-GPU).

You only need to install them once per runtime.

In [None]:
# Transformers, Datasets, PEFT, and Accelerate (quiet install);
# tabulate is only needed by DataFrame.to_markdown() in the evaluation cell
!pip install -q transformers datasets peft accelerate tabulate

## 2 Baseline Check: How well does vanilla T5-small handle our task?

Before we train anything, let's ask the out-of-the-box model to draft a refund reply.  
Spoiler: its answer will be generic, overly long, or simply unrelated because T5-small has never been told what our support policy or tone should be.


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model_base = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = "reply to this customer's email: My order arrived late and I want a refund."
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decode to keep things deterministic and short
outputs = model_base.generate(**inputs, max_new_tokens=120)
print("Vanilla T5-small says:\n")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


## 3 Load CSV Dataset and transform it to HF Dataset

Upload the dataset file `customer_support_lora_dataset_250.csv` (found under `files/`) to the Colab filesystem; the cell below expects it in the working directory.  
Our CSV file has two columns:

| Column name   | What it contains                                 |
|---------------|--------------------------------------------------|
| `input_text`  | A raw customer request or complaint              |
| `target_text` | The ideal structured reply we want the model to generate |

We'll turn the CSV into a **HF Dataset** object so it plays nicely with the Trainer API.

In [None]:
import pandas as pd
from datasets import Dataset

# Read the 250-row customer-support file
df = pd.read_csv("customer_support_lora_dataset_250.csv")
ds = Dataset.from_pandas(df)

print("Sample row:") # quick sanity check
ds[0]
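
Before training, it can be worth checking that the file really has the two expected columns and no empty cells. The sketch below uses only the standard library; `validate_rows` is a hypothetical helper, and the two sample rows are made up to stand in for the real file.

```python
import csv
import io

REQUIRED_COLUMNS = ("input_text", "target_text")

def validate_rows(csv_file) -> list:
    """Return the data rows of a CSV file-like object; raise if a column or value is missing."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row_no, row in enumerate(reader, start=1):
        for col in REQUIRED_COLUMNS:
            # DictReader yields None for cells missing from a short row
            if not (row[col] or "").strip():
                raise ValueError(f"empty {col!r} in data row {row_no}")
        rows.append(row)
    return rows

# Made-up two-row sample standing in for the real dataset file
sample = io.StringIO(
    "input_text,target_text\n"
    "My order arrived late.,We are sorry for the delay.\n"
    "I was charged twice.,We will refund the duplicate charge.\n"
)
rows = validate_rows(sample)
print(len(rows))  # 2
```

To check the real file, open it with `open("customer_support_lora_dataset_250.csv", newline="", encoding="utf-8")` and pass the handle to `validate_rows`.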

## 3b Tokenisation and Label Preparation

Transformer models can't read raw text; they need **token IDs**.
For sequence-to-sequence models like T5 we must prepare **two** sequences:

1. **Source** - the customer request (`input_text`)  
2. **Target** - the desired reply (`target_text`)  

### Key details:

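- We tokenize the replies as targets (`text_target=` in current Transformers; the older `tokenizer.as_target_tokenizer()` context manager does the same job but is deprecated). T5 shares one vocabulary between encoder and decoder, so the resulting IDs simply become the `labels`.  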
- We truncate to 128 tokens to keep batches small on modest GPUs.

### What exactly are "tokens" (the 128-token limit)?  
A token is not necessarily a word or a single character.
Transformers work on sub-word units produced by a tokenizer (for T5 that's a SentencePiece model with a 32 k-item vocabulary). The splitting rules are learned from large corpora and try to strike a balance between a small vocabulary and short sequences:

| Example text  | Tokens generated       | Notes                                 |
| ------------- | ---------------------- | ------------------------------------- |
| `tracking`    | `▁track`, `ing`        | the leading “▁” marks a word start    |
| `refund`      | `▁refund`              | common words are often a single token |
| `extra-large` | `▁extra`, `-`, `large` | punctuation becomes its own token     |

Because tokens can be full words or fragments, a text's token count is usually about 1.3-1.6x its word count, and far smaller than its character count.
A 128-token limit therefore fits roughly 80-100 English words (fewer if the text contains many rare names, URLs, or emojis that split into multiple tokens).
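
The word estimate is simple arithmetic, sketched below. The 1.3 and 1.6 tokens-per-word ratios are rules of thumb for English text, not measurements on our dataset, and `words_that_fit` is a hypothetical helper.

```python
import math

def words_that_fit(token_budget: int, tokens_per_word: float) -> int:
    """Rough number of words that fit in a token budget at a given tokens-per-word ratio."""
    return math.floor(token_budget / tokens_per_word)

# Rule-of-thumb ratios for English text (assumed, not measured on our data)
for ratio in (1.3, 1.6):
    print(ratio, words_that_fit(128, ratio))
# 1.3 98
# 1.6 80
```

For exact counts on your own data, compare `len(tokenizer(text)["input_ids"])` against the 128-token limit.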

In [None]:
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def preprocess(example):
    # Encode the source text (customer request)
    model_inputs = tokenizer(example["input_text"], max_length=128, truncation=True)
    # Encode the target text (ideal reply); text_target replaces the
    # deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=example["target_text"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

ds_tok = ds.map(preprocess, remove_columns=ds.column_names)

print("Input: ", ds[0]["input_text"])
print("Input tokens: ", len(ds_tok[0]["input_ids"]))
print("Target: ", ds[0]["target_text"])
print("Target tokens: ", len(ds_tok[0]["labels"]))


## 4 Build Base Model and LoRA Configuration

Here we load the vanilla **T5-small** (60 M parameters) and wrap it with PEFT's `get_peft_model`, configured through a `LoraConfig`.

### Key hyper-parameters:

| Parameter    | Role                                                          | Here |
|--------------|---------------------------------------------------------------|------|
| `r`          | Rank of the low-rank matrices (higher = more learning capacity)        | 8    |
| `lora_alpha` | Scaling factor for the adapter’s update                       | 16   |
| `target_modules` | Which weight matrices get adapters (we pick **q** & **v**) | ["q","v"] |
| `lora_dropout` | Regularisation inside adapters                               | 0.05 |

### Why place LoRA adapters on q and v? What about the others?

| Symbol | Full name                       | Role in self-attention         |
| ------ | ------------------------------- | ------------------------------ |
| **Q**  | **Query** projection            | asks “what am I looking for?”  |
| **K**  | **Key** projection              | represents “what do I have?”   |
| **V**  | **Value** projection            | holds the information to mix   |
| **O**  | **Output** (final linear layer) | re-mixes heads after attention |

A complete attention block has four projection matrices (Q, K, V and O). Putting LoRA on all four gives maximum flexibility but also increases the number of trainable parameters and the training memory.  
Empirical sweet spot: the LoRA paper (§7.1) and several follow-ups found that, for a fixed parameter budget, adapting Q + V captures most task-specific gains while keeping parameter count and GPU RAM low. The intuition:
- Queries (Q) change how each token attends to others.
- Values (V) change what content is blended once attention scores are computed.

Keys and the output layer matter too, but adjusting them yields diminishing returns for many language-generation tasks.
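
To see what the choice of target modules costs, the sketch below counts adapter weights by hand. It assumes the published T5-small shapes: `d_model = 512`, 8 heads of size 64, 6 encoder blocks with self-attention, and 6 decoder blocks with self- and cross-attention. `lora_params` is a hypothetical helper, and the result should match what `print_trainable_parameters()` reports only if those assumptions hold.

```python
# Assumed T5-small shapes (from its published configuration)
D_MODEL = 512
INNER_DIM = 8 * 64      # num_heads x d_kv
ENCODER_ATTN = 6        # one self-attention per encoder block
DECODER_ATTN = 6 * 2    # self-attention + cross-attention per decoder block

def lora_params(r: int, targets: list) -> int:
    """Total LoRA weights when every listed projection in every attention layer gets an adapter."""
    attn_layers = ENCODER_ATTN + DECODER_ATTN
    # Each adapter has A (r x d_in) and B (d_out x r); for q, k, v and o one side
    # is D_MODEL and the other INNER_DIM, so the count is the same for all four.
    per_module = r * D_MODEL + INNER_DIM * r
    return attn_layers * len(targets) * per_module

qv = lora_params(8, ["q", "v"])
qkvo = lora_params(8, ["q", "k", "v", "o"])
print(qv)    # 294912
print(qkvo)  # 589824
```

Under these assumptions, adapting all four projections doubles the trainable weights compared with Q + V alone.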

In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

lora_cfg = LoraConfig(
    task_type      = TaskType.SEQ_2_SEQ_LM,  # generation task
    r              = 8,      # rank of the LoRA matrices
    lora_alpha     = 16,     # scaling
    target_modules = ["q", "v"],  # adapt only the query and value projections
    lora_dropout   = 0.05,
    bias           = "none"
)

peft_model = get_peft_model(base_model, lora_cfg)
peft_model.print_trainable_parameters()

## 5 Training Arguments and Trainer Loop

Hugging Face `Seq2SeqTrainer` takes care of the full training loop (forward, back-prop, gradient clipping, etc.).

Important flags we set:

- **`per_device_train_batch_size` = 16**: fits on a 12 GB GPU.  
- **`num_train_epochs` = 30**: a small dataset needs more passes.  
- **`learning_rate` = 5e-4**: higher than is typical for full fine-tuning, because we’re optimising far fewer weights.  
- **`save_strategy` = "no"**: skip checkpoints to save disk space. You can change it to `"epoch"` if you want them.  

In [None]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir          = "./t5-lora-out",
    per_device_train_batch_size = 16,
    num_train_epochs    = 30,
    learning_rate       = 5e-4,
    logging_steps       = 5,
    save_strategy       = "no",
    report_to           = "none",
)

trainer = Seq2SeqTrainer(
    model         = peft_model,
    args          = training_args,
    train_dataset = ds_tok,
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=peft_model),
)

trainer.train()
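
One detail the collator handles for us: `DataCollatorForSeq2Seq` pads every batch to its longest sequence and fills the padded label positions with `-100` (its default `label_pad_token_id`), a value the loss ignores. A framework-free sketch of that label padding, with made-up token IDs and a hypothetical `pad_labels` helper:

```python
LABEL_PAD_ID = -100  # default label_pad_token_id of DataCollatorForSeq2Seq; skipped by the loss

def pad_labels(batch_labels, pad_id: int = LABEL_PAD_ID):
    """Right-pad every label sequence in the batch to the length of the longest one."""
    longest = max(len(seq) for seq in batch_labels)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch_labels]

batch = [[71, 5, 1], [8, 1]]  # made-up token IDs
print(pad_labels(batch))      # [[71, 5, 1], [8, 1, -100]]
```

Because padded positions are masked out this way, short replies are not penalised for the padding added to match longer ones in the same batch.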

## 6 Save Adapter and Tokenizer

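LoRA lets us store **only** the lightweight adapter: about 0.3 M parameters, or roughly 1 MB of fp32 weights for this configuration (the tokenizer files saved alongside add a little more). The base T5-small weights are **not duplicated**.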

In [None]:
trainer.model.save_pretrained("t5-small-lora-adapter")
tokenizer.save_pretrained("t5-small-lora-adapter")

## 7 Load LoRA-Adapted Model for Inference

`PeftModel.from_pretrained` attaches the saved adapter to the frozen base model at load time (the weights stay separate until you explicitly merge them); we then generate a reply for a sample complaint.

In [None]:
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM

# Load base model
cfg         = PeftConfig.from_pretrained("t5-small-lora-adapter")
base_model  = AutoModelForSeq2SeqLM.from_pretrained(cfg.base_model_name_or_path)
# Load LoRA adapter
model_lora  = PeftModel.from_pretrained(base_model, "t5-small-lora-adapter")

# Test input
prompt = "generate reply: My order arrived late. I want a refund."
# Tokenize the test input
inputs = tokenizer(prompt, return_tensors="pt")
# Generate reply
outputs = model_lora.generate(**inputs, max_new_tokens=80)

print("LoRA reply:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))
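
If you later want a single set of weights with no adapter indirection, the low-rank update can be folded into the base matrix: `W_merged = W + (alpha / r) * B A`. PEFT offers `merge_and_unload()` for this. The pure-Python sketch below only illustrates the arithmetic on tiny made-up matrices, and its helper names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def merge_lora(W, A, B, alpha: float, r: int):
    """Fold the adapter into the base weight: W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (made-up values)
A = [[1.0, 0.0]]              # r x d_in
B = [[0.5], [1.0]]            # d_out x r
merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(merged)  # [[2.0, 2.0], [5.0, 4.0]]
```

The merged matrix gives the same outputs as running the base layer and the adapter side by side, so merging changes deployment convenience, not behaviour.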

## 8 Side-by-Side Evaluation (Base vs LoRA)

Let’s run eight realistic prompts through both the vanilla T5-small and our LoRA-adapted version, then print the outputs in a table for quick eyeballing.  
You should notice LoRA replies are:
- More structured (e.g., include apology and next steps)  
- Shorter and on brand  
- Consistent JSON or bullet style, depending on your `target_text` examples  

In [None]:
# Compare vanilla T5-small with LoRA fine‑tuned adapter on structured JSON output
import json, pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Base (pre‑trained) model
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# LoRA‑adapted model, make sure this path matches the one used in the training cell
adapter_path = "t5-small-lora-adapter"
cfg       = PeftConfig.from_pretrained(adapter_path)
ft_model  = PeftModel.from_pretrained(
    AutoModelForSeq2SeqLM.from_pretrained(cfg.base_model_name_or_path),
    adapter_path
)

def generate(model, prompt):
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=120)
    return tokenizer.decode(out[0], skip_special_tokens=True)

test_prompts = [
    "Compose a response to this customer email: My order arrived late. I want a refund.",
    "Draft a reply to this customer message: The product I received is damaged. What can I do?",
    "Write a response to this email from a client: I received the wrong item in my order.",
    "Create a reply for this customer's email: How can I return an item I purchased last week?",
    "Formulate a response to the customer's email: I never received my order.",
    "Respond to this message from the customer: Why was I charged twice for my order?",
    "Prepare a reply to this client email: I need help tracking my shipment.",
    "Construct a response for the customer's message: Can I exchange my item for a different size?"
]

records = []
for p in test_prompts:
    records.append({
        "prompt": p,
        "T5-small": generate(base_model, p),
        "LoRA":    generate(ft_model,  p)
    })

df = pd.DataFrame(records)
# to_markdown() relies on the optional `tabulate` package (pip install tabulate)
print(df.to_markdown(index=False))

## 9  Next Steps

1. **Quantisation**: combine LoRA with 8-bit (or 4-bit) base weights via the `bitsandbytes` library to cut GPU memory use; measure inference speed on your own hardware, as quantised kernels are not always faster.  
2. **Hyper-parameter search**: try different ranks (`r`) and target modules (add **k** and **o** matrices) for possibly better accuracy.  
3. **Objective metrics**: integrate BLEU, ROUGE-L, or a custom JSON validator to track quality over epochs.  
4. **Deployment**: merge base + adapter (for example with PEFT's `merge_and_unload()`) and serve via FastAPI, Streamlit, or Hugging Face Inference Endpoints.  
5. **Prompt scaffolding**: automatically prepend the instruction prefix used in your training data (for example `"generate structured_reply:"`) so end-users don’t need to remember it.  
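
As a starting point for the custom JSON validator mentioned in step 3, here is a minimal sketch. It assumes the target replies are JSON objects; the key names `apology` and `next_steps` are hypothetical placeholders to replace with whatever your `target_text` examples actually contain.

```python
import json

def is_valid_structured_reply(text: str, required_keys=("apology", "next_steps")) -> bool:
    """True if text parses as a JSON object containing every required key (keys are hypothetical)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)

def valid_rate(replies) -> float:
    """Share of replies that pass the structural check (0.0 for an empty list)."""
    if not replies:
        return 0.0
    return sum(is_valid_structured_reply(r) for r in replies) / len(replies)

# Made-up model outputs, for illustration only
samples = [
    '{"apology": "Sorry for the delay.", "next_steps": "Refund issued."}',
    '{"apology": "Sorry about that.", "next_steps": "Replacement shipped."}',
    "Sorry, we will look into it.",
    '{"apology": "Sorry."}',
]
print(valid_rate(samples))  # 0.5
```

Tracking this rate for the base and LoRA outputs on the same prompts gives a simple number to set alongside the eyeball comparison.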

Happy fine-tuning!