# Lab 7 – Quantizing an LLM with Unsloth (IMDB)

> **⚠️ IMPORTANT**: This lab requires **Google Colab with GPU enabled**
> - Go to Runtime → Change runtime type → GPU (T4 or better)
> - Unsloth requires CUDA and will not work on Mac/Windows locally
> - See `COLAB_SETUP.md` for detailed setup instructions

This lab focuses on **quantization**, which reduces the numerical precision of model weights to decrease memory usage and improve inference speed. We'll use the IMDB movie reviews dataset for sentiment analysis as an example task.

## Objectives

- Fine-tune a base model on the IMDB sentiment analysis dataset.
- Apply 8-bit and 4-bit quantization using Unsloth and compare their impacts on model size, memory usage, and inference speed.
- Evaluate quantized models on a validation set to understand the trade-offs between speed and accuracy.

Note: Quantization relies on appropriate hardware (e.g., GPUs that support int8/4-bit kernels) and may degrade model accuracy slightly. Experiment with different quantization bit widths and record your observations.
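
To make "reducing numerical precision" concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. It is an illustration only: the weight values are made up, the helper names are hypothetical, and bitsandbytes uses more elaborate schemes (such as block-wise NF4) than this linear mapping.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers using one shared scale (absmax)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer levels."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.82, -0.41, 0.05, -1.27, 0.33]  # made-up example values
for bits in (8, 4):
    print(f"{bits}-bit max round-trip error: {max_error(weights, bits):.4f}")
```

Fewer bits means fewer integer levels, so the rounding error per weight grows. That error is the source of the accuracy loss this lab asks you to measure.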

In [None]:
# Ask Unsloth's helper script which install command fits this environment.
# As far as we know, the script detects the torch/CUDA versions and PRINTS a
# pip command rather than installing anything; run the printed command in a new cell.
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Manual installation for Colab (uncomment if the printed command does not work):
# !pip install --upgrade pip
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"

print("✅ Once Unsloth is installed, restart the runtime before proceeding.")
print("⚠️ IMPORTANT: Use GPU runtime, not TPU! Unsloth requires CUDA GPU.")

In [None]:
# 1️⃣ Load IMDB dataset

from datasets import load_dataset
from transformers import AutoTokenizer

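# Load subsets of the IMDB dataset for training and validation.
# The IMDB splits are stored sorted by label, so a plain "train[:5%]" slice
# would contain only negative reviews. Shuffle first, then keep 5% (1,250 rows).
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1250))
dataset_val = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1250))

print(dataset[0])

# Initialize tokenizer
base_model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Pad on the left so that position -1 of every sequence is a real token;
# the training and evaluation cells read the logits at that position.
tokenizer.padding_side = "left"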

max_length = 512

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)

train_dataset = dataset.map(tokenize_function, batched=True)
val_dataset = dataset_val.map(tokenize_function, batched=True)

print("Tokenized IMDB dataset ready.")


In [None]:
from unsloth import FastLanguageModel
import torch

# 2️⃣ Fine-tune a sentiment classifier on IMDB

# Load a base model for classification (you can choose a smaller model if needed)
model, _ = FastLanguageModel.from_pretrained(
    model_name=base_model_name,
    dtype=torch.float16,
    device_map="auto"
)

# Set up LoRA for efficient fine-tuning
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch.utils.data import DataLoader
from tqdm import tqdm

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Prepare dataloader
def collate_fn(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    attention_mask = torch.tensor([item['attention_mask'] for item in batch])
    labels = torch.tensor([item['label'] for item in batch])
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

# Training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 2

print(f"\n🔄 Fine-tuning on IMDB sentiment classification...")
print(f"Epochs: {num_epochs}, Batch size: 4\n")

# CRITICAL: Configure model for proper training (prevents EmptyLogits)
model.config.use_cache = False  # Disable cache for training
model.gradient_checkpointing_enable()  # Enable gradient checkpointing

model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_epochs}")

    for batch_idx, batch in enumerate(progress_bar):
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )

        # Classification loss (simplified): treat the first two vocabulary
        # logits at the last position as the scores for class 0 and class 1
        logits = outputs.logits[:, -1, :2]
        loss = torch.nn.functional.cross_entropy(logits, labels)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        epoch_loss += loss.item()
        progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})

        # Limit batches for demo
        if batch_idx >= 50:
            break

    # At most 51 batches run per epoch because of the break above
    avg_loss = epoch_loss / min(len(train_dataloader), 51)
    print(f"Epoch {epoch + 1} completed. Average loss: {avg_loss:.4f}\n")

# Note: the model is not saved to disk here; it stays in memory for the evaluation cell
print("✓ Fine-tuning complete!")
print("💾 Model is already quantized to 4-bit using Unsloth's bnb-4bit variant")
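
The loss above reads the logits at position `-1` of each sequence. That position holds a real token only when the batch is left-padded; with right padding it is a pad token. As a minimal sketch (plain Python lists stand in for tensor rows, and `last_token_index` is a hypothetical helper, not part of any library), the last real token can be found from the attention mask:

```python
def last_token_index(attention_mask_row):
    """Return the index of the last position whose mask value is 1 (a real token)."""
    for idx in range(len(attention_mask_row) - 1, -1, -1):
        if attention_mask_row[idx] == 1:
            return idx
    raise ValueError("row contains no real tokens")


right_padded = [1, 1, 1, 0, 0]  # three real tokens, then padding
left_padded = [0, 0, 1, 1, 1]   # padding, then three real tokens

print(last_token_index(right_padded))  # 2
print(last_token_index(left_padded))   # 4
```

For a right-padded PyTorch batch, `attention_mask.sum(dim=1) - 1` yields the same indices, which can then be used to gather the logits per row instead of slicing at `-1`.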

In [None]:
# 3️⃣ Apply quantization to the fine-tuned model

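# The checkpoint loaded above is already stored in 4-bit (bitsandbytes), so the
# fine-tuned model is a 4-bit model. The commented sketches below show how other
# precisions are usually obtained; nothing in this cell changes the model.

# Dynamic int8 quantization with PyTorch (CPU inference, starting from a
# full-precision model, here called `fp_model` as a placeholder):
# from torch.ao.quantization import quantize_dynamic
# model_int8 = quantize_dynamic(fp_model, {torch.nn.Linear}, dtype=torch.qint8)

# With bitsandbytes, quantization happens at load time through transformers.
# `fp16_checkpoint` is a placeholder for an unquantized checkpoint name:
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True
# model_int8 = AutoModelForCausalLM.from_pretrained(
#     fp16_checkpoint, quantization_config=bnb_config, device_map="auto")

print("No new quantization was applied in this cell; the loaded model is already 4-bit.")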
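
Before measuring anything, it helps to know what each precision should cost on paper. The sketch below estimates memory for the weights alone; the 7-billion parameter count is an assumed round figure for a "7B" model, and the estimate ignores activations, LoRA adapters, optimizer state, and the scale metadata that quantized formats store.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bits)
    saving = 1 - bits / 32  # fraction saved relative to FP32
    print(f"{name:>5}: {gb:6.2f} GiB  ({saving:.1%} smaller than FP32)")
```

Comparing these figures with the memory you actually observe shows how much of the footprint comes from things other than the weights.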


In [None]:
# 4️⃣ Evaluate original and quantized models

# Evaluate the quantized model
import time
import psutil
import os

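def get_model_memory_usage():
    """Return current memory usage in GB (GPU if available, else process RSS).

    On GPU this is everything PyTorch has allocated (weights, LoRA adapters,
    optimizer state), not the model weights alone.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.memory_allocated() / 1024**3  # GB
        # CPU fallback: resident memory of the whole Python process
        process = psutil.Process(os.getpid())
        return process.memory_info().rss / 1024**3  # GB
    except Exception:
        return 0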

def evaluate_model(model, dataloader, model_name="Model"):
    """Evaluate model accuracy and inference speed"""
    model.eval()
    correct = 0
    total = 0
    inference_times = []
    
    print(f"\n📊 Evaluating {model_name}...")
    
    with torch.no_grad():
        for batch_idx, batch in enumerate(tqdm(dataloader, desc="Validating")):
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            labels = batch['labels'].to(model.device)
            
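            # Measure inference time. CUDA kernels run asynchronously, so
            # synchronize before reading the clock on both sides of the forward pass.
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            start_time = time.perf_counter()

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True
            )

            if torch.cuda.is_available():
                torch.cuda.synchronize()
            elapsed = time.perf_counter() - start_time
            inference_times.append(elapsed)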
            
            # Get predictions
            logits = outputs.logits[:, -1, :2]
            predictions = torch.argmax(logits, dim=-1)
            
            correct += (predictions == labels).sum().item()
            total += labels.numel()  # Number of examples in this batch (one label each)
            
            # Limit evaluation for demo
            if batch_idx >= 20:
                break
    
    accuracy = correct / total * 100
    total_time = sum(inference_times)
    avg_inference_time = total_time / len(inference_times)

    return {
        'accuracy': accuracy,
        'avg_inference_time': avg_inference_time,
        # Throughput over all evaluated examples, not just the last batch
        'samples_per_sec': total / total_time
    }

print("\n🔬 Evaluating Quantized Model Performance...")

# Get memory usage
memory_gb = get_model_memory_usage()

# Prepare validation dataloader
val_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)

# Evaluate the 4-bit quantized model
results = evaluate_model(model, val_dataloader, "4-bit Quantized Model")

# Display results
print("\n" + "="*60)
print("📈 QUANTIZATION RESULTS")
print("="*60)

print(f"\n💾 Memory Usage:")
print(f"  - Model memory: ~{memory_gb:.2f} GB")
print(f"  - Quantization: 4-bit (bitsandbytes)")

print(f"\n🎯 Performance:")
print(f"  - Accuracy: {results['accuracy']:.2f}%")
print(f"  - Avg inference time: {results['avg_inference_time']*1000:.2f}ms per batch")
print(f"  - Samples/second: {results['samples_per_sec']:.1f}")

print(f"\n💡 Key Insights:")
print(f"  - 4-bit weights take ~75% less memory than FP16 (~87.5% less than FP32)")
print(f"  - Speed depends on the kernels: 4-bit weights are dequantized on the fly, so latency may not improve")
print(f"  - Accuracy usually drops slightly; check the size of the drop on your validation set")

print("\n✓ Evaluation complete!")
print("\n📝 Note: Unsloth's bnb-4bit models use advanced quantization techniques")
print("   that minimize accuracy loss while maximizing efficiency.")
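
Latency figures from a single pass can be noisy, especially for the first few batches. One common pattern, sketched here under assumptions, is to run a few warm-up calls and report the median of several timed calls. The `benchmark` helper and `dummy_workload` are hypothetical names for illustration; on a GPU you would pass `torch.cuda.synchronize` as `sync` and a closure that runs the model's forward pass as `fn`.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Time fn() and return (median_seconds, all_timings).

    sync is an optional callable run before each clock read, so that
    asynchronously queued work is included in the measurement.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


def dummy_workload():
    """Stand-in for a model forward pass."""
    return sum(i * i for i in range(10_000))


median_s, timings = benchmark(dummy_workload)
print(f"median: {median_s * 1000:.3f} ms over {len(timings)} runs")
```

The median is less sensitive than the mean to an occasional slow iteration, which makes comparisons between quantization levels more stable.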


## Reflection

- How did quantization to 8-bit and 4-bit affect the model's accuracy on the IMDB dataset?
- Compare the memory footprint and inference latency between different quantization levels. Is the trade-off acceptable?
- Consider scenarios where the slight performance drop from 4-bit quantization might be justified by significant gains in throughput and cost savings.
