# Lab 7 – Quantizing an LLM with Unsloth (IMDB)

> **⚠️ IMPORTANT**: This lab requires **Google Colab with GPU enabled**
> - Go to Runtime → Change runtime type → GPU (T4 or better)
> - Unsloth requires CUDA and will not work on Mac/Windows locally
> - See `COLAB_SETUP.md` for detailed setup instructions

This lab focuses on **quantization**, which reduces the numerical precision of model weights to decrease memory usage and improve inference speed. We'll use the IMDB movie reviews dataset for sentiment analysis as an example task.

## Objectives

- Fine-tune a base model on the IMDB sentiment analysis dataset.
- Apply 8-bit and 4-bit quantization using Unsloth and compare their impacts on model size, memory usage, and inference speed.
- Evaluate quantized models on a validation set to understand the trade-offs between speed and accuracy.

Note: Quantization with Unsloth leverages CUDA-optimized kernels for int8/4-bit operations. Experiment with different quantization bit widths and record your observations.
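
To make "reducing numerical precision" concrete, here is a minimal, dependency-free sketch of symmetric absmax quantization. The helper names `quantize_absmax` and `dequantize` and the sample weights are made up for illustration. Real libraries such as bitsandbytes use more elaborate block-wise schemes, so treat this as an intuition aid, not as what Unsloth does internally.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Illustrative weights only; not taken from any real model.
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit codes: {q}  max round-trip error: {max_err:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the round-trip error grows as the bit width shrinks. That is the accuracy side of the trade-off this lab measures.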

In [None]:
# Run Unsloth's official auto-install helper script
# It detects your environment (CUDA and PyTorch versions) and reports the matching
# pip install command; if Unsloth is not importable afterwards, run that command
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Alternative manual installation if auto-install fails:
# !pip install --upgrade pip
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"

print("✅ Unsloth installation complete! Now restart runtime before proceeding.")
print("⚠️ IMPORTANT: Use GPU runtime, not TPU! Unsloth requires CUDA GPU.")

> **⚠️ CRITICAL IMPORT ORDER**: 
> - Always import `unsloth` FIRST before any other ML libraries
> - Unsloth patches libraries such as `transformers` at import time, so importing it first prevents weights/biases initialization errors
> - Example: `from unsloth import FastLanguageModel` then `import torch`


### Step 1: Load IMDB dataset

**Documentation:**
- IMDB dataset: https://huggingface.co/datasets/imdb
- Loading datasets: https://huggingface.co/docs/datasets/loading


In [None]:
# TODO: Import datasets and tokenizer libraries

# TODO: Load subsets of the IMDB dataset
# Hint: Use load_dataset("imdb", split="train[:5%]")

# TODO: Print a sample from the dataset

# TODO: Initialize tokenizer from base model

# TODO: Define max_length for tokenization

# TODO: Create tokenization function that:
#   - Tokenizes the text field
#   - Uses padding='max_length' and truncation=True

# TODO: Apply tokenization to train and validation datasets

# TODO: Print confirmation that tokenized dataset is ready
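
If the effect of `padding='max_length'` and `truncation=True` is unclear, the following dependency-free sketch mimics it on plain lists of token ids. The helper `pad_or_truncate`, the `pad_id` of 0, and right-side padding are assumptions for illustration. A real Hugging Face tokenizer also handles special tokens and uses its own pad token and padding side.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = token_ids[:max_length]      # truncation: drop tokens past max_length
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_length - len(ids)     # padding: fill up to max_length
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_review = [101, 2023, 3185]                     # made-up token ids
long_review = [101, 2023, 3185, 2001, 2307, 1998, 4569, 102]

for review in (short_review, long_review):
    input_ids, attention_mask = pad_or_truncate(review, max_length=5)
    print(input_ids, attention_mask)
```

Every example ends up with the same length, which lets a data loader stack them into one rectangular batch. The attention mask tells the model which positions are padding.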

### Step 2: Fine-tune a sentiment classifier on IMDB

**Documentation:**
- Transformers AutoModelForSequenceClassification: https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification
- Training with transformers: https://huggingface.co/docs/transformers/training


In [None]:
# CRITICAL: Import unsloth FIRST to avoid weights/biases initialization errors
# TODO: Import torch and FastLanguageModel from unsloth

# TODO: Load a base model for classification
# Example: "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"

# TODO: Alternatively, use AutoModelForSequenceClassification
# Hint: model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)

# TODO: Move model to GPU if available

# TODO: Create data loaders

# TODO: Implement training loop:
#   - Define optimizer and loss function
#   - For each epoch:
#     - For each batch:
#       - Forward pass
#       - Calculate loss
#       - Backward pass
#       - Update weights

# TODO: Print confirmation that fine-tuning is complete
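
The four steps of the training loop (forward pass, loss, backward pass, weight update) can be seen in miniature in the toy below. It is a hand-written logistic regression on made-up one-dimensional data in pure Python, not the lab solution. In PyTorch the gradient lines would typically be replaced by `loss.backward()` and `optimizer.step()`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up data: one feature per example, label 1 when the feature is positive.
features = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5  # weight, bias, learning rate
first_loss = last_loss = None

for epoch in range(50):
    total_loss = 0.0
    for x, y in zip(features, labels):  # one example per "batch"
        # Forward pass: predicted probability of the positive class.
        p = sigmoid(w * x + b)
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
        # Calculate loss: binary cross-entropy.
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # Backward pass: gradient of the loss w.r.t. the pre-activation.
        grad_z = p - y
        # Update weights: plain stochastic gradient descent.
        w -= lr * grad_z * x
        b -= lr * grad_z
        total_loss += loss
    if epoch == 0:
        first_loss = total_loss
    last_loss = total_loss

print(f"loss in first epoch: {first_loss:.3f}, in last epoch: {last_loss:.3f}")
```

The loop structure carries over unchanged to the real model. Only the forward pass, loss function, and optimizer become library calls.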

### Step 3: Apply quantization to the fine-tuned model

**Documentation:**
- **Unsloth Quantization**: Unsloth provides pre-quantized 4-bit models (e.g., `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`)
- **Example Notebooks**:
  - [Qwen 2.5 with 4-bit quantization](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb)
  - [All quantized models on Unsloth](https://docs.unsloth.ai/get-started/all-our-models)
- PyTorch quantization: https://pytorch.org/docs/stable/quantization.html
- Dynamic quantization: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
- bitsandbytes library: https://github.com/TimDettmers/bitsandbytes


In [None]:
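# TODO: Apply 8-bit quantization
# Hint: Use torch.quantization.quantize_dynamic()
# Note: PyTorch dynamic quantization runs on CPU, so apply it to a CPU copy of the model
# Example: model_int8 = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)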

# TODO: Apply 4-bit quantization
# Hint: You may need to use the bitsandbytes library
# Example: import bitsandbytes as bnb
# Alternative: reload the model with transformers' BitsAndBytesConfig(load_in_4bit=True)

# TODO: Save both quantized models for evaluation

# TODO: Print confirmation that quantization is complete
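
Before measuring anything, it helps to estimate how much memory the weights alone should take at each bit width. The sketch below is back-of-the-envelope arithmetic. The helper `weight_memory_gib` and the 7-billion parameter count are assumptions for illustration. The estimate ignores quantization constants (scales), activations, the KV cache, and any layers kept in higher precision, so measured numbers will differ.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB: parameters x bits, converted to bytes."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 7_000_000_000  # assumed size of a "7B" model

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gib(num_params, bits):6.2f} GiB")
```

Halving the bit width halves the weight storage. Comparing this estimate with the memory you actually observe shows how much overhead comes from everything other than the weights.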


### Step 4: Evaluate original and quantized models

**Documentation:**
- Model evaluation: https://huggingface.co/docs/transformers/training#evaluate


In [None]:
# TODO: Create evaluation function that:
#   - Sets model to eval mode
#   - Iterates through dataloader
#   - Computes predictions
#   - Calculates accuracy

# TODO: Evaluate original model and record:
#   - Accuracy
#   - Memory usage
#   - Inference speed

# TODO: Evaluate 8-bit quantized model and record same metrics

# TODO: Evaluate 4-bit quantized model and record same metrics

# TODO: Print comparison table of all models
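
One possible shape for the evaluation harness is sketched below, using plain Python functions as hypothetical stand-ins for the models so it runs without a GPU. The names `evaluate`, `keyword_model`, and `always_negative` and the four example reviews are invented for illustration. With real PyTorch models on a GPU you would likely also call `torch.cuda.synchronize()` before reading the timer and use `torch.cuda.max_memory_allocated()` for peak memory.

```python
import time


def evaluate(predict_fn, examples):
    """Return (accuracy, average seconds per example) for a predict function."""
    correct = 0
    start = time.perf_counter()
    for text, label in examples:
        correct += int(predict_fn(text) == label)
    elapsed = time.perf_counter() - start
    return correct / len(examples), elapsed / len(examples)


# Hypothetical stand-ins for the original and quantized models.
def keyword_model(text):
    return 1 if "great" in text else 0


def always_negative(text):
    return 0


examples = [
    ("a great film", 1),
    ("dull and slow", 0),
    ("great cast", 1),
    ("not good", 0),
]

results = {
    name: evaluate(fn, examples)
    for name, fn in [("keyword", keyword_model), ("always_negative", always_negative)]
}

# Comparison table: one row per model.
print(f"{'model':<16}{'accuracy':>10}{'sec/example':>14}")
for name, (acc, latency) in results.items():
    print(f"{name:<16}{acc:>10.2%}{latency:>14.2e}")
```

Keeping the evaluation function independent of the model type means the same code can score the original, 8-bit, and 4-bit models, which keeps the comparison fair.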


## Reflection

- How did quantization to 8-bit and 4-bit affect the model's accuracy on the IMDB dataset?
- Compare the memory footprint and inference latency between different quantization levels. Is the trade-off acceptable?
- Consider scenarios where the slight accuracy drop from 4-bit quantization might be justified by significant gains in throughput and cost savings.
