# Lab 5 – Distilling a Pre-Trained LLM with Unsloth (SQuAD)

> **⚠️ IMPORTANT**: This lab requires **Google Colab with GPU enabled**
> - Go to Runtime → Change runtime type → GPU (T4 or better)
> - Unsloth requires CUDA and will not work on Mac/Windows locally
> - See `COLAB_SETUP.md` for detailed setup instructions

In this lab, you will perform **model distillation** using Unsloth. Distillation allows you to compress a large "teacher" model into a smaller "student" model while retaining much of the original model's performance. We'll use the SQuAD dataset for a question-answering task to illustrate this process.

## Why Distillation? The Knowledge Transfer Problem

**The Challenge:**
- 🏫 **Large Models**: GPT-4, LLaMA-70B, Claude-3 are incredibly powerful but HUGE
- 💰 **Deployment Costs**: Large models = expensive inference, high memory requirements
- 📱 **Edge Deployment**: Can't run 70B models on phones, edge devices, or in real-time
- ⚡ **Speed Requirements**: Production systems need fast, responsive models

**The Solution - Knowledge Distillation:**
- 🎓 **Teacher Model**: Large, powerful model (e.g., 7B parameters)
- 🎓 **Student Model**: Smaller, faster model (e.g., 1B parameters)
- 🧠 **Knowledge Transfer**: Student learns from teacher's "soft" predictions
- ⚖️ **Trade-off**: Slight accuracy loss for massive speed/memory gains

**Real-World Applications:**
- 📱 **Mobile Apps**: On-device assistants need small, compressed models that fit in a phone's memory
- 🚗 **Autonomous Vehicles**: Real-time decision making requires fast models
- 💬 **Customer Service**: Chatbots need to respond quickly
- 🔍 **Search Engines**: Instant results require optimized models

## The Distillation Process

**Step 1: Teacher Knowledge**
- Large model makes predictions with "soft" probabilities
- Example: [0.7, 0.2, 0.1] instead of [1, 0, 0] (hard labels)

**Step 2: Student Learning**
- Small model learns to mimic teacher's soft predictions
- Uses temperature scaling to make learning easier
- Combines teacher knowledge with ground truth labels
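
The effect of temperature scaling in Step 2 can be seen with a few lines of plain Python. This is a minimal sketch: the logits are made-up values for a three-token vocabulary, and `softmax_with_temperature` is a hypothetical helper, not part of Unsloth or PyTorch.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities, dividing by the temperature first."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative teacher logits for one token position (assumed values)
teacher_logits = [2.0, 1.0, 0.1]

probs_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
probs_t2 = softmax_with_temperature(teacher_logits, temperature=2.0)

print("T=1:", [round(p, 3) for p in probs_t1])
print("T=2:", [round(p, 3) for p in probs_t2])
```

A higher temperature flattens the distribution, so the student sees more of how the teacher ranks the less likely tokens relative to each other.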

**Step 3: Deployment**
- Student model is much smaller and faster
- Retains much of the teacher's task performance (how much is what you will measure in this lab)
- Well suited to latency- and memory-constrained production deployment

## Objectives

- **Understand the distillation process** and why it's valuable
- **Evaluate baseline performance** of teacher and student models
- Load a pre-trained teacher model and prepare a smaller student model
- Load and preprocess the SQuAD dataset for question answering
- **Implement knowledge distillation** with proper temperature scaling
- Fine-tune the student model with LoRA/QLoRA adapters using Unsloth
- **Compare performance** of the teacher and the distilled student on accuracy and inference speed
- **Analyze the trade-offs**: How much knowledge is transferred vs lost?

**Note:** Distillation requires significant compute resources. Use Google Colab Pro for faster training, or reduce the dataset size if using free tier.

In [None]:
# Run Unsloth's helper script, which inspects your environment.
# Check its output: it may only print the recommended pip command rather than install anything.
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Manual installation. These commands run unconditionally and are safe to repeat:
!pip install --upgrade pip
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"

print("✅ Unsloth installation complete! Now restart runtime before proceeding.")
print("⚠️ IMPORTANT: Use GPU runtime, not TPU! Unsloth requires CUDA GPU.")

# Sanity check that the install worked. After restarting the runtime, run this
# import again before importing torch, transformers, or peft.
from unsloth import FastLanguageModel

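> **⚠️ CRITICAL IMPORT ORDER**:
> - Always import `unsloth` FIRST, before any other ML libraries
> - Unsloth patches libraries such as `transformers`, `trl`, and `peft` at import time; importing them first can leave those patches unapplied and trigger warnings or errors
> - Example: `from unsloth import FastLanguageModel`, then `import torch`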


### Step 1: Load SQuAD dataset

**Documentation:**
- Hugging Face Datasets: https://huggingface.co/docs/datasets/
- Loading datasets: https://huggingface.co/docs/datasets/loading
- SQuAD dataset: https://huggingface.co/datasets/squad


In [None]:
# TODO: Import datasets library

# TODO: Load the train and validation splits of SQuAD (use only a subset for quicker experiments)
# Hint: Use load_dataset('squad', split='train[:10%]')

# TODO: Inspect a sample from the dataset

# TODO: Initialize tokenizer from your teacher model
# Example: teacher_model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"

# TODO: Define max_length for tokenization (e.g., 512)

# TODO: Create a preprocessing function that:
#   - Combines question and context
#   - Tokenizes the combined text
#   - Returns tokenized inputs

# TODO: Apply tokenization to both train and validation datasets

# TODO: Print confirmation that tokenized dataset is ready
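
The preprocessing TODOs above can be sketched as follows. This is one possible layout, not the required solution: the prompt wording and the helper names `format_example` and `make_tokenize_fn` are assumptions made for illustration. The record fields (`context`, `question`, `answers`) are those of the SQuAD dataset.

```python
def format_example(example):
    """Combine question and context from one SQuAD record into a single training text."""
    prompt = (
        "Answer the question using only the context.\n\n"
        f"Context: {example['context'].strip()}\n"
        f"Question: {example['question'].strip()}\n"
        "Answer:"
    )
    answer = example["answers"]["text"][0]  # first reference answer
    return {"prompt": prompt, "text": f"{prompt} {answer}"}

def make_tokenize_fn(tokenizer, max_length=512):
    """Return a function suitable for dataset.map(..., batched=True)."""
    def tokenize(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )
    return tokenize

# Hand-written record in the SQuAD schema, used only to show the output shape
sample = {
    "context": "The Eiffel Tower is in Paris.",
    "question": "Where is the Eiffel Tower?",
    "answers": {"text": ["Paris"], "answer_start": [23]},
}
formatted = format_example(sample)
print(formatted["text"])
```

With a real dataset you would typically call `dataset.map(format_example)` first and then `dataset.map(make_tokenize_fn(tokenizer), batched=True)`.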

### Step 2: Setup teacher and student models for distillation

**Documentation:**
- Unsloth docs: https://docs.unsloth.ai
- **Example Notebooks**:
  - [Qwen 2.5 (7B) Fine-tuning with LoRA](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb)
  - [Qwen 2.5 Conversational Style](https://colab.research.google.com/drive/1qN1CEalC70EO1wGKhNxs1go1W9So61R5?usp=sharing)
  - [All Unsloth notebooks](https://github.com/unslothai/notebooks)
- PEFT LoRA: https://huggingface.co/docs/peft/conceptual_guides/lora
- LoraConfig: https://huggingface.co/docs/peft/package_reference/lora
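
As a starting point for the model setup, here is a hedged sketch of loading both models and attaching adapters to the student. It swaps in Unsloth's `FastLanguageModel.get_peft_model` in place of a hand-built PEFT `LoraConfig`, so `task_type` is not passed. The model names and LoRA values come from the hints in this lab; `load_teacher_and_student` is a hypothetical helper, and calling it needs a CUDA GPU with Unsloth installed.

```python
# Values taken from the hints in this lab
TEACHER_NAME = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
STUDENT_NAME = "unsloth/Qwen2.5-3B-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 512

LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

def load_teacher_and_student():
    """Load both models in 4-bit and attach LoRA adapters to the student only."""
    # Imported inside the function so the constants above can be inspected
    # without a GPU. In the notebook, import unsloth at the top instead.
    from unsloth import FastLanguageModel

    teacher, tokenizer = FastLanguageModel.from_pretrained(
        model_name=TEACHER_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        load_in_4bit=True,
    )
    student, _ = FastLanguageModel.from_pretrained(
        model_name=STUDENT_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        load_in_4bit=True,
    )
    student = FastLanguageModel.get_peft_model(student, **LORA_KWARGS)

    # The teacher is only used for inference, so freeze it
    teacher.eval()
    for param in teacher.parameters():
        param.requires_grad_(False)
    return teacher, student, tokenizer
```

Freezing the teacher keeps optimizer state and gradients limited to the student's adapter weights, which matters on a 16 GB T4.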


In [None]:
# TODO: Import torch.nn.functional as F

# TODO: Load the teacher model using FastLanguageModel.from_pretrained()
# Example: "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"

# TODO: Define your student model (choose a smaller model)
# Example: "unsloth/Qwen2.5-3B-Instruct-bnb-4bit"

# TODO: Load the student model using FastLanguageModel.from_pretrained()

# TODO: Apply LoRA/QLoRA adapters to the student model
# Hint: Use peft library's LoraConfig with parameters like:
#   - r=16
#   - lora_alpha=32
#   - target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
#   - lora_dropout=0.05
#   - bias="none"
#   - task_type="CAUSAL_LM"

# TODO: Create dataloaders using torch.utils.data.DataLoader

# TODO: Implement knowledge distillation training loop
# Key steps:
#   1. Create optimizer (e.g., AdamW with lr=2e-5)
#   2. Set temperature = 2.0 and alpha = 0.5
#   3. For each epoch:
#      a. For each batch in train_dataloader:
#         - Get teacher outputs (with torch.no_grad())
#         - Get student outputs with labels=input_ids
#         - Use student_outputs.loss for training
#         - Backpropagate and update student parameters
#
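# NOTE: Unsloth models return EmptyLogits placeholders to save memory,
# so this loop uses the supervised fine-tuning loss (student_outputs.loss)
# instead of a KL-divergence distillation loss. In that fallback the teacher
# outputs, temperature, and alpha do not enter the loss; they only take effect
# once you have real logits from both models and add a KL term.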

# TODO: Print confirmation that distillation training is complete
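
If you do obtain real logits from both models, the temperature and alpha settings above could feed a loss like the one below. This is a sketch under stated assumptions, not the lab's required solution. Alpha is assumed to weight the soft-target term, both models are assumed to share a tokenizer (so any extra logit columns are padding and can be truncated), and padding tokens are not masked. Unsloth's message about empty logits points to the `UNSLOTH_RETURN_LOGITS` environment variable. Verify that against your installed version.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, input_ids,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with the usual next-token cross-entropy.

    Logits have shape (batch, seq_len, vocab); input_ids has shape (batch, seq_len).
    """
    # Assumption: shared tokenizer, so only the common vocabulary columns are compared
    vocab = min(student_logits.size(-1), teacher_logits.size(-1))
    s = student_logits[..., :vocab].float()
    t = teacher_logits[..., :vocab].float()

    # Soft-target term: KL(teacher || student) at temperature T, averaged per token.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    log_p_student = F.log_softmax(s / temperature, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(t / temperature, dim=-1).reshape(-1, vocab)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

    # Hard-label term: position i predicts token i + 1
    ce = F.cross_entropy(s[:, :-1].reshape(-1, vocab), input_ids[:, 1:].reshape(-1))

    return alpha * kd + (1 - alpha) * ce

# Shape check with random tensors (no real models involved)
torch.manual_seed(0)
demo_loss = distillation_loss(
    torch.randn(2, 6, 11), torch.randn(2, 6, 13), torch.randint(0, 11, (2, 6))
)
print(f"demo loss: {demo_loss.item():.4f}")
```

Inside the training loop, `teacher_logits` would come from a forward pass under `torch.no_grad()`, and only the student's parameters would be passed to the optimizer.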

### Step 3: Evaluate and compare teacher and student models

**Documentation:**
- SQuAD evaluation metrics: https://huggingface.co/metrics/squad
- Evaluation with Hugging Face: https://huggingface.co/docs/evaluate/


In [None]:
# TODO: After training, evaluate both models on a validation set

# TODO: Compute metrics such as F1 or exact match for question answering
# Hint: Use evaluate.load("squad") from the evaluate library
# (datasets.load_metric is deprecated and has been removed from recent datasets releases)

# TODO: For each example in validation dataset:
#   - Generate answer from teacher model
#   - Generate answer from student model
#   - Add predictions and references to metric

# TODO: Compute and print results

# TODO: Measure and compare inference speed between teacher and student

# TODO: Print evaluation summary
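
To see what the exact match and F1 numbers mean, the scoring logic can be written out in plain Python. The sketch below follows the normalization used by the standard SQuAD v1.1 evaluation (lowercase, strip punctuation and articles, collapse whitespace). The function names are chosen here for illustration, and in the lab you may prefer the packaged metric.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_score(metric_fn, prediction, references):
    """SQuAD validation examples have several references; keep the best match."""
    return max(metric_fn(prediction, ref) for ref in references)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("in Paris, France", "Paris"))
```

Generated answers from a causal LM often include extra words, which is why F1 is usually more informative than exact match when comparing teacher and student.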


### Step 4: Merge LoRA Weights for Production Deployment

**Why Merge LoRA Weights?**

During training, LoRA adapters add trainable parameters to the base model without modifying the original weights. This is efficient for training, but adds computational overhead during inference:

- **LoRA inference**: Base model forward pass + LoRA adapter forward pass = **slower**
- **Merged model**: Single forward pass with combined weights = **faster**

**Production Best Practice**: Always merge LoRA weights before deployment to eliminate the adapter overhead and get the real speedup from your smaller model.
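
The claim that merging leaves the model's outputs unchanged can be checked on a toy example. The sketch below uses plain Python lists and made-up numbers. The `lora_alpha / r` scaling follows the standard LoRA formulation, and the helper functions are hypothetical stand-ins for tensor operations.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, factor):
    return [[factor * x for x in row] for row in a]

# Frozen base weight W (2x3) and rank-1 LoRA factors B (2x1), A (1x3)
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
B = [[0.5], [-1.0]]
A = [[1.0, 0.0, 2.0]]
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]  # one input column vector

# Adapter path: base matmul plus two extra adapter matmuls on every forward pass
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the adapter into W once, then do a single matmul
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths produce the same output; the merged path simply avoids repeating the adapter computation at inference time.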

**Documentation:**
- PEFT merge and unload: https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.merge_and_unload
- Unsloth save methods: https://docs.unsloth.ai/basics/saving-and-loading

In [None]:
# TODO: Merge LoRA weights into base model for production deployment
# This demonstrates the real speedup you get by removing the LoRA adapter overhead

# TODO: Print current model state
# Hint: Check type(student_model).__name__ and hasattr(student_model, 'merge_and_unload')

# TODO: Merge LoRA weights into the base model
# Hint: Use student_model.merge_and_unload() to combine LoRA weights with base weights
# This creates a single model with no adapter overhead

# TODO: Set merged model to evaluation mode

# TODO: Evaluate merged model performance using your evaluation function
# Compare inference speed between:
#   - Student model with LoRA adapters
#   - Student model with merged weights

# TODO: Calculate and print speedup metrics:
#   - Inference speedup (e.g., 1.3x faster)
#   - Throughput increase (e.g., +30% tokens/second)
#   - Latency reduction

# TODO: Create final comprehensive comparison table showing:
#   - Teacher model (baseline)
#   - Student + LoRA
#   - Student Merged
# Include: latency, throughput, and relative speedup

# TODO: Print key takeaways:
#   - How much faster is merged model vs teacher?
#   - How much additional speedup from merging LoRA?
#   - Why this matters for production deployment

# TODO: Show how to save the merged model
# Hint: 
#   student_model_merged.save_pretrained('./student_model_merged')
#   tokenizer.save_pretrained('./student_model_merged')
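
The latency, throughput, and speedup figures requested above can be gathered with a small timing helper. This is a sketch with assumed names (`benchmark`, `fake_teacher`, `fake_student`). The fake functions only sleep and stand in for real `generate` calls, which should return the number of new tokens produced. On a GPU, you would typically also call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def benchmark(generate_fn, prompts, warmup=1):
    """Time generate_fn over prompts; generate_fn returns the number of new tokens."""
    for prompt in prompts[:warmup]:
        generate_fn(prompt)  # warm-up runs are not timed
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "latency_s": elapsed / len(prompts),
        "tokens_per_s": total_tokens / elapsed,
    }

# Stand-ins for real models (assumption: the teacher is slower than the student)
def fake_teacher(prompt):
    time.sleep(0.02)
    return 8

def fake_student(prompt):
    time.sleep(0.005)
    return 8

prompts = ["q1", "q2", "q3"]
results = {
    "Teacher": benchmark(fake_teacher, prompts),
    "Student": benchmark(fake_student, prompts),
}
baseline = results["Teacher"]["latency_s"]

print(f"{'Model':<10}{'Latency (s)':>14}{'Tokens/s':>12}{'Speedup':>10}")
for name, stats in results.items():
    speedup = baseline / stats["latency_s"]
    print(f"{name:<10}{stats['latency_s']:>14.4f}{stats['tokens_per_s']:>12.1f}{speedup:>9.2f}x")
```

Adding rows for "Student + LoRA" and "Student Merged" only requires two more entries in `results`, each wrapping the corresponding model's generation call.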

## Reflection

- Summarize the differences in accuracy and inference speed between the teacher and distilled student model.
- Discuss how LoRA/QLoRA and other parameter-efficient techniques impacted training time and resource usage.
- Consider scenarios where a slightly lower accuracy from the student model might be acceptable given significant gains in speed and memory efficiency.
