
# 🧪 Lab 2 — **LLM Fine-tuning** (DistilGPT2) + **LoRA** + **Prompting**

**Course:** Generative AI (Day 2)  
**Lab Length:** ~3 hours

**Goal:** Fine-tune a small causal LM on a tiny domain corpus (Tiny Shakespeare), compare **full fine-tuning** vs **LoRA**, and experiment with **prompt engineering**.

> ✅ **Deliverables (submit this notebook):**
> - Baseline vs fine-tuned **perplexity** table
> - 3–5 **generations** for the same prompts across models (baseline / full FT / LoRA)
> - Short **prompting analysis** (zero-shot, few-shot, instruction template)
> - 3–5 bullet **takeaways**



### 🚦 Rules & Guidance
- If `bitsandbytes` isn't available, LoRA still works without 4-bit (skip QLoRA).
- Cells marked **(Provided)** can be run as-is; **(TODO)** require your edits.



## 🎯 Learning Objectives
- Prepare data, tokenize, and **pack** sequences for causal LM training.
- Compute baseline **validation perplexity** (no fine-tuning).
- Run a short **full fine-tune** (a few epochs) and evaluate.
- Apply **LoRA (PEFT)** (optionally QLoRA) and evaluate.
- Compare **generations** and analyze **prompting** strategies.


In [None]:
!pip install -q -U triton

# ===== (Provided) Setup & Installs =====
# Installs pinned versions (needed in Colab; comment out if your environment already has them):
!pip -q install transformers==4.44.2 datasets==2.20.0 accelerate==0.34.2 peft==0.12.0 bitsandbytes==0.43.1 evaluate==0.4.2 rouge-score -U

import os, math, random, numpy as np, torch
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments, set_seed)
from transformers import pipeline
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import evaluate

set_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)



---

## Part 0 — Data (Tiny Shakespeare) *(~10 min)*

We'll use the **Tiny Shakespeare** dataset (~1MB).  
Feel free to swap with a small domain corpus if desired.


In [None]:

# ===== (Provided) Load dataset =====
raw = load_dataset("Trelis/tiny-shakespeare")
# Split into train/valid/test quickly
split = raw["train"].train_test_split(test_size=0.02, seed=42)
test_valid = split["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict({
    "train": split["train"],
    "validation": test_valid["train"],
    "test": test_valid["test"]
})
dataset



---

## Part 1 — Tokenization & Packing *(~25–30 min)*

**Tasks:**
1) Load tokenizer for `distilgpt2` and set `pad_token = eos_token`.  
2) **Tokenize** the text.  
3) **Group/pack** into fixed-length blocks (e.g., `block_size = 256`) with `labels = input_ids`.

> 💡 **Hints**
> - Use `remove_columns=["Text"]` in `map` after tokenization (the text column in this dataset is capitalized).
> - For grouping, concatenate token lists and then slice into blocks.
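
The grouping hint can be tried on plain integer lists before touching the tokenizer. This is a minimal sketch, not the lab solution: `pack_blocks` and the toy token IDs are made up for illustration, and tokens that do not fill a complete block are simply dropped.

```python
from itertools import chain

def pack_blocks(token_lists, block_size):
    """Concatenate token lists, drop the remainder, and slice into fixed blocks."""
    flat = list(chain.from_iterable(token_lists))
    usable = (len(flat) // block_size) * block_size  # largest multiple of block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

# Three toy "documents" with 3, 4, and 3 token IDs (10 tokens in total).
toy = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
blocks = pack_blocks(toy, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]; tokens 9 and 10 are dropped
```

Note that blocks cross document boundaries. That is the point of packing: no padding is needed, and every position contributes to the loss.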


In [None]:

# ===== (TODO) Tokenizer & packing =====
model_id = "distilgpt2"
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token  # important for collation
block_size = 256             

def tokenize(ex):
    out = tok(ex["Text"])
    return out

tokenized = dataset.map(tokenize, batched=True, remove_columns=["Text"])
tokenized

def group_texts(examples):
    # - Concatenate lists per key
    # - Truncate to a multiple of block_size
    # - Split into fixed blocks
    concat = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concat["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i:i+block_size] for i in range(0, total_len, block_size)]
        for k, t in concat.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
lm_datasets



---

## Part 2 — Baseline Perplexity *(~20 min)*

Compute validation **perplexity** before any fine-tuning.

> 💡 **Hints**
> - Use `AutoModelForCausalLM` with `distilgpt2`.
> - Use `DataCollatorForLanguageModeling(mlm=False)` for causal LM.
> - Evaluate a limited number of batches (e.g., 50) for speed.
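
Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship on made-up probabilities. `perplexity_from_probs` is a hypothetical helper, not part of the lab code; a real evaluation averages the model's cross-entropy loss instead.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that gives probability 0.25 to every correct token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))

# Higher probability on the correct tokens gives lower perplexity.
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95])
print(confident_ppl < uniform_ppl)  # True
```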


In [None]:

# ===== Baseline perplexity =====
baseline_model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
collator = DataCollatorForLanguageModeling(tok, mlm=False)

def compute_ppl(model, ds, max_batches=50):
    model.eval()
    losses = []
    loader = torch.utils.data.DataLoader(ds, batch_size=8, shuffle=False, collate_fn=collator)
    for i, batch in enumerate(loader):
        if i >= max_batches: break
        for k in batch:
            batch[k] = batch[k].to(device)
        with torch.no_grad():
            out = model(**batch)
            loss = out.loss.detach().float()
        losses.append(loss.item())
    mean_loss = float(np.mean(losses)) if losses else float("inf")
    return math.exp(mean_loss) if mean_loss < 20 else float("inf")

ppl_baseline = compute_ppl(baseline_model, lm_datasets["validation"])
print(f"Baseline validation PPL: {ppl_baseline:.2f}")



---

## Part 3 — Full Fine-tune *(~45–60 min)*

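Run a short fine-tune. Two to three epochs are enough to see an effect; the cell below is set to 5, so lower `num_train_epochs` if you are short on time.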

> 💡 **Hints**
> - Start with `per_device_train_batch_size=8`, `gradient_accumulation_steps=2` (adjust to VRAM).
> - `fp16=True` if on GPU.
> - Log with `logging_steps=50` and evaluate per epoch.
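
As a sanity check on these settings, the effective batch size and the rough number of optimizer steps per epoch follow from simple arithmetic. The sketch assumes a single device and a made-up count of 1,300 packed training blocks; substitute `len(lm_datasets["train"])` for your run.

```python
import math

per_device_batch = 8
grad_accum_steps = 2
num_devices = 1            # assumption: one GPU (or CPU)
num_train_blocks = 1300    # hypothetical; use len(lm_datasets["train"]) in the lab

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Rough estimate; Trainer's own rounding of the last partial step may differ by one.
steps_per_epoch = math.ceil(num_train_blocks / effective_batch)

print(effective_batch)   # 16
print(steps_per_epoch)   # 82
```

If you halve the per-device batch to fit in memory, doubling the accumulation steps keeps the effective batch size unchanged.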


In [None]:

# ===== Full fine-tune =====
ft_model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

args = TrainingArguments(
    output_dir="./ft-distilgpt2",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    eval_strategy="epoch",
    logging_steps=50,
    learning_rate=2e-4,
    num_train_epochs=5,      # (TODO) increase if time allows
    warmup_ratio=0.03,
    weight_decay=0.1,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),
    report_to="none"
)

trainer = Trainer(
    model=ft_model,
    args=args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=collator,
)

trainer.train()

ppl_ft = compute_ppl(ft_model, lm_datasets["validation"])
print(f"Full FT validation PPL: {ppl_ft:.2f}")



---

## Part 4 — Quick Generations (Full FT) *(~10 min)*

Generate a short sample with your **fine-tuned** model for qualitative inspection.


In [None]:

# ===== (Provided) Generation helper for FT model =====
gen_ft = pipeline("text-generation", model=ft_model, tokenizer=tok, device=0 if device=="cuda" else -1)

prompt = "ROMEO: I dreamt tonight that"
out = gen_ft(prompt, max_new_tokens=80, do_sample=True, temperature=0.9, top_p=0.95)[0]["generated_text"]
print(out)



---

## Part 5 — LoRA (PEFT) *(~45 min)*

Train with **LoRA** (and optionally QLoRA if 4-bit quantization is available).

> 💡 **Hints**
> - If `bitsandbytes` is available, use `BitsAndBytesConfig(load_in_4bit=True)` + `prepare_model_for_kbit_training`.
> - Set `r=8..16`, `lora_alpha=16..32`, `lora_dropout≈0.05`.
> - Compare PPL vs full FT.
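
The LoRA update itself is small enough to write out by hand. The NumPy sketch below illustrates the idea and is not PEFT's implementation: a frozen weight `W` is augmented by a low-rank product `B @ A` scaled by `lora_alpha / r`. The dimensions are toy values, and `B` starts at zero so that the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 64, 64, 4, 8   # toy sizes, not DistilGPT2's

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init

def lora_forward(x):
    """Frozen path plus scaled low-rank update: W x + (alpha / r) * B (A x)."""
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
unchanged = np.allclose(lora_forward(x), W @ x)
print(unchanged)                  # True: B is zero, so nothing has changed yet
print(W.size, A.size + B.size)    # 4096 frozen values vs 512 trainable values
```

Only `A` and `B` would receive gradients during training, which is why the trainable-parameter percentage printed by `print_trainable_parameters()` is so small.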


In [None]:
# ===== LoRA fine-tune =====
# NOTE: We have disabled 4-bit quantization (bitsandbytes) to avoid Colab errors.
# DistilGPT2 is small enough to run fine without it.

base_for_lora = AutoModelForCausalLM.from_pretrained(model_id).to(device)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
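    # GPT-2 fuses Q/K/V into one `c_attn` layer; `q_attn` and `v_attn` do not exist in DistilGPT2.
    target_modules=["c_attn"] if "gpt2" in model_id else None,
    fan_in_fan_out=True,  # GPT-2 uses Conv1D layers, which store weights as (in, out)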
    bias="none",
    task_type="CAUSAL_LM"
)
lora_model = get_peft_model(base_for_lora, lora_cfg)
lora_model.print_trainable_parameters()

args_lora = TrainingArguments(
    output_dir="./lora-distilgpt2",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    eval_strategy="epoch",
    logging_steps=50,
    learning_rate=1.5e-4,
    num_train_epochs=2,
    warmup_ratio=0.03,
    save_strategy="epoch",
    fp16=torch.cuda.is_available(),
    report_to="none"
)

trainer_lora = Trainer(
    model=lora_model,
    args=args_lora,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=collator,
)
trainer_lora.train()

ppl_lora = compute_ppl(lora_model, lm_datasets["validation"])
print(f"LoRA validation PPL: {ppl_lora:.2f}")



---

## Part 6 — Prompting Experiments *(~30 min)*

Run **the same prompts** through:
- Baseline (no FT)
- Full FT
- LoRA

Explore:
- **Zero-shot** continuation
- **Instruction-style** template (even if GPT-2 family isn't instruction-tuned, observe behavior)
- **Few-shot** pattern completion
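
One way to keep the three prompt styles comparable is to build them from small helper functions. The helpers below are hypothetical: their names and the exact template wording are assumptions modeled on the prompts used in this lab. They only build strings, which you can then pass to any of the generation pipelines.

```python
def zero_shot(text):
    """Plain continuation: the model simply continues the text."""
    return text

def instruction(task, user_input):
    """Instruction-style template with explicit section markers."""
    return f"### Instruction:\n{task}\n### Input:\n{user_input}\n### Response:\n"

def few_shot(examples, query):
    """Few-shot pattern: solved examples followed by one unsolved case."""
    lines = [f"Example: '{src}' -> '{tgt}'" for src, tgt in examples]
    lines.append(f"Task: '{query}' -> ")
    return "\n".join(lines)

demo = few_shot(
    [("A dark forest at midnight.", "The moon hides as branches claw at the sky.")],
    "A stormy seashore.",
)
print(demo)
```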


In [None]:

# ===== Compare generations across models =====
gen_base = pipeline("text-generation", model=baseline_model, tokenizer=tok, device=0 if device=="cuda" else -1)
gen_lora = pipeline("text-generation", model=lora_model, tokenizer=tok, device=0 if device=="cuda" else -1)

prompts = [
    "ROMEO: I dreamt tonight that",
    "### Instruction:\nRewrite the line in modern English.\n### Input:\n'Thou art more lovely and more temperate.'\n### Response:\n",
    "SCENE PROMPTS:\nExample: 'A dark forest at midnight.' -> 'The moon hides as branches claw at the sky.'\n"
    "Example: 'A crowded marketplace.' -> 'Vendors roar; coins clatter like rain.'\n"
    "Task: 'A stormy seashore.' -> "
]

def run_batch(gen_pipe, name):
    print(f"\n=== {name} ===")
    for p in prompts:
        out = gen_pipe(p, max_new_tokens=60, do_sample=True, temperature=0.9, top_p=0.92)[0]["generated_text"]
        print(f"\n[Prompt]\n{p}\n[Output]\n{out}\n{'-'*60}")

run_batch(gen_base, "Baseline")
run_batch(gen_ft,   "Full FT")
run_batch(gen_lora, "LoRA")



---

## Part 7 — Challenge: The "Catastrophic Forgetting" Test

**Context:** You have fine-tuned the model to speak like Shakespeare. But what happened to the general knowledge it had before? 

When we aggressively fine-tune a small model on a narrow dataset, we often trigger **Catastrophic Forgetting**—the model "forgets" how to speak modern English or answer factual questions because its weights have been overwritten to minimize loss on Shakespeare plays.

**Task:**
1. Run the code below using a **modern** prompt (something that definitely didn't exist in Shakespeare's time).
2. Observe how the **Full Fine-Tune** model struggles compared to the **Baseline**.
3. **Crucial:** Analyze if **LoRA** preserves more general knowledge than Full FT.
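
To make the comparison a little less subjective, one rough option is to count how often a continuation uses Early Modern English marker words. The word list and the `archaic_rate` helper below are illustrative assumptions, not a validated metric. Some markers such as "art" are also ordinary modern words, so treat the number as a hint only.

```python
import re

# Hypothetical marker list; extend it with words you see in your generations.
ARCHAIC = {"thou", "thee", "thy", "thine", "hath", "doth", "art", "ere", "tis"}

def archaic_rate(text):
    """Fraction of word tokens that appear in the ARCHAIC marker set."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(w.strip("'") in ARCHAIC for w in words)
    return hits / len(words)

modern = archaic_rate("Paris, which is the largest city in the country.")
archaic = archaic_rate("thou art the king, and thy crown hath fallen")
print(modern)            # 0.0
print(archaic > modern)  # True
```

Applied to the continuations from the three models for the same modern prompt, a noticeably higher rate for Full FT than for LoRA would be consistent with stronger forgetting.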

In [None]:
# ===== (TODO) The Forgetting Test =====
modern_prompts = [
    "The capital of France is",
    "To install a python package, you should run",
    "The theory of relativity was proposed by"
]

print("\n=== Forgetting Test: Modern Prompts ===\n")

for p in modern_prompts:
    print(f"\n>>> Prompt: {p}")
    # Baseline
    try:
        out_base = gen_base(p, max_new_tokens=30, do_sample=True, temperature=0.5)[0]["generated_text"]
        print(f"[Baseline]: {out_base[len(p):].strip()} ...") 
    except NameError:
        print("[Baseline]: Model pipeline not found")

    # Full FT
    try:
        out_ft = gen_ft(p, max_new_tokens=30, do_sample=True, temperature=0.5)[0]["generated_text"]
        print(f"[Full FT]:  {out_ft[len(p):].strip()} ...")
    except NameError:
        print("[Full FT]:  Model pipeline not found")

    # LoRA
    try:
        out_lora = gen_lora(p, max_new_tokens=30, do_sample=True, temperature=0.5)[0]["generated_text"]
        print(f"[LoRA]:     {out_lora[len(p):].strip()} ...")
    except NameError:
        print("[LoRA]:     Model pipeline not found")


---
## Part 8 — The "Ablation Study"

**Warning: This section requires significant compute time.**

**Context:**
In Part 5, you used a LoRA Rank (`r`) of **16**. But why 16? Why not 1? Why not 100?
In Deep Learning, hyperparameters are rarely guessed correctly on the first try. We must perform an **Ablation Study**—a systematic experiment where we change one variable to see its impact.

**The Theory:**
* **Rank (`r`)** sets the inner dimension of the two low-rank matrices added to each adapted layer.
* **High Rank (e.g., 64):** More trainable parameters and therefore more capacity, but slower to train and at higher risk of overfitting.
* **Low Rank (e.g., 1 or 8):** Fewer parameters and faster training, but possibly too little capacity to learn the style.
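
The parameter cost of each rank can be estimated before training. The sketch assumes LoRA is attached only to the fused attention projection `c_attn` in each of DistilGPT2's 6 transformer blocks (768 inputs, 2,304 outputs); compare the counts with what the training cell reports. It also prints the scale `lora_alpha / r` that standard LoRA applies to the update, which changes with rank whenever `lora_alpha` stays fixed.

```python
# Assumed DistilGPT2 shapes: hidden size 768, fused QKV output 3 * 768, 6 blocks.
D_IN, D_OUT, N_LAYERS = 768, 3 * 768, 6
LORA_ALPHA = 32

def lora_param_count(r):
    """Each adapted layer adds A (r x d_in) and B (d_out x r)."""
    return N_LAYERS * r * (D_IN + D_OUT)

for r in (1, 8, 64):
    print(f"r={r:<3} trainable={lora_param_count(r):>9,}  scale={LORA_ALPHA / r}")
# r=1   trainable=   18,432  scale=32.0
# r=8   trainable=  147,456  scale=4.0
# r=64  trainable=1,179,648  scale=0.5
```

The count grows linearly with `r`, so `r=64` trains 64 times as many adapter weights as `r=1`.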

**Your Task:**
You will run the LoRA training loop **3 distinct times** to find the "Sweet Spot."

1. **Run A:** `r = 1` (Extreme compression)
2. **Run B:** `r = 8` (Low capacity)
3. **Run C:** `r = 64` (High capacity)

*Note: To save time, you can reduce `num_train_epochs` to **1** for these experiments, as we are looking for relative differences, not absolute convergence.*

### 📝 Deliverable Table (Fill this out)

| Experiment | Rank (`r`) | Trainable Parameters (%) | Validation PPL | Did it learn the style? (Subjective) |
| :--- | :--- | :--- | :--- | :--- |
| **Run A** | 1 | ... % | ... | ... |
| **Run B** | 8 | ... % | ... | ... |
| **Run C** | 64 | ... % | ... | ... |



---
*(Run the code cell below to execute the loop. Go grab a coffee, this will take a while!)*

In [None]:
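# ===== Memory Cleanup before Ablation =====
import gc
print("Cleaning up memory...")
# Drop every name that still references a model. The generation pipelines hold
# references too, so they must go as well. pop() skips names that were never defined.
for name in ["gen_base", "gen_ft", "gen_lora", "trainer", "trainer_lora",
             "lora_model", "base_for_lora", "ft_model", "baseline_model"]:
    globals().pop(name, None)
gc.collect()
torch.cuda.empty_cache()
print("Memory cleared.")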


In [None]:
# ===== Ablation Study Loop =====
# This cell runs the training 3 times. 
# WARNING: This will take significant time. 

ranks_to_test = [1, 8, 64]
results = {}

for r_val in ranks_to_test:
    print(f"\n\n{'='*20} STARTING TRAINING WITH RANK: {r_val} {'='*20}")
    
    # 1. Config with new Rank
    # Note: lora_alpha stays fixed at 32 so that `r` is the only setting that changes.
    # LoRA scales its update by lora_alpha / r, so the effective scale shrinks as r grows.
    loop_config = LoraConfig(
        r=r_val, 
        lora_alpha=32, 
        lora_dropout=0.05,
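        target_modules=["c_attn"],  # GPT-2 fuses Q/K/V into `c_attn`; `q_attn`/`v_attn` do not exist here
        fan_in_fan_out=True,        # GPT-2 uses Conv1D layers, which store weights as (in, out)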
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    # 2. Prepare Model
    model_to_train = AutoModelForCausalLM.from_pretrained(model_id).to(device)
    # (Add quantization logic here if using QLoRA, otherwise standard:)
    loop_model = get_peft_model(model_to_train, loop_config)
    
    # Print param %
    trainable, total = loop_model.get_nb_trainable_parameters()
    print(f"Trainable params: {trainable} || {100 * trainable / total:.4f}%")
    
    # 3. Training Arguments (Unique output dir is crucial!)
    loop_args = TrainingArguments(
        output_dir=f"./lora-rank-{r_val}",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=1.5e-4,
        num_train_epochs=1,      # Reduced to 1 epoch to save YOUR time, typically 2-3
        logging_steps=50,
        fp16=torch.cuda.is_available(),
        report_to="none"
    )
    
    # 4. Train
    loop_trainer = Trainer(
        model=loop_model,
        args=loop_args,
        train_dataset=lm_datasets["train"],
        eval_dataset=lm_datasets["validation"],
        data_collator=collator
    )
    loop_trainer.train()
    
    # 5. Evaluate
    ppl = compute_ppl(loop_model, lm_datasets["validation"])
    results[r_val] = ppl
    print(f"Rank {r_val} Finished. PPL: {ppl:.2f}")
    
    # Clear memory (the base model must be deleted too, or it stays allocated)
    del loop_model, loop_trainer, model_to_train
    gc.collect()
    torch.cuda.empty_cache()

print("\n\n=== FINAL RESULTS ===")
print(results)


---

## 📝 Final Report (fill before submission)

Complete the table and add brief commentary.

### Perplexity
| Model | Val PPL |
|------|---------|
| Baseline (distilgpt2) | … |
| Full Fine-tune | … |
| LoRA | … |

### Generations (same prompts)
- Prompt A:  
  - Baseline → …  
  - Full FT → …  
  - LoRA → …  

- Prompt B: …

### Prompt Engineering Notes
- Zero-shot vs few-shot: …  
- Instruction template effects: …  
- Failure cases / artifacts: …  

### Analysis: Catastrophic Forgetting
* **Observation:** How did the Full Fine-Tuned model respond to the modern prompt compared to the Baseline? Did it hallucinate Shakespearean words?
* **LoRA vs Full FT:** Did the LoRA model retain more "modern English" capability than the Full FT model? 
* **Theory:** Why does LoRA (training <1% of parameters) theoretically help prevent forgetting compared to updating 100% of parameters?

### Analysis Questions: The "Ablation Study" 

1. **The "Diminishing Returns" Trap:** Did increasing the Rank from 8 to 64 result in a massive drop in Perplexity, or was the improvement marginal?
2. **Efficiency:** Look at the file size or parameter count of `r=1` vs `r=64`. Given the performance difference you observed, which Rank would you choose for a production mobile app?
3. **Overfitting:** Did the high-rank model (`r=64`) start to memorize the training data (loss goes down) but fail to generalize (validation PPL stays high)? Explain based on your logs.

### Takeaways (3–5 bullets)
- …
