# Lab 4 – Hello Unsloth: Load and Infer

> **⚠️ IMPORTANT**: This lab requires **Google Colab with GPU enabled**
> - Go to Runtime → Change runtime type → GPU (T4)
> - Unsloth requires CUDA and will not work on Mac/Windows locally
> - See `COLAB_SETUP.md` for detailed setup instructions

In this lab, you will set up your environment for using **Unsloth** and perform a simple inference with a quantized LLM. The goal is to ensure that your environment is correctly configured and to record baseline metrics for inference speed and resource usage.

## Objectives

- Install and verify the Unsloth library and its dependencies (e.g., `transformers`, `torch`, `accelerate`).
- Load a 4-bit quantized base model, such as `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`.
- Generate a few example outputs to confirm the model works.
- Measure VRAM usage, inference latency, and tokens per second.

Before starting, make sure you have enabled GPU runtime in Google Colab. This notebook provides skeleton code and measurement functions – feel free to customize based on your needs.

In [None]:
# Install Unsloth using the official auto-install script
# This automatically detects your environment and installs the correct version
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Alternative manual installation if auto-install fails:
# !pip install --upgrade pip
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"

print("✅ Unsloth installation complete! Now restart runtime before proceeding.")
print("⚠️ IMPORTANT: Use GPU runtime, not TPU! Unsloth requires CUDA GPU.")
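
After the runtime has restarted, it can help to confirm what actually got installed. The following is a minimal sketch, not part of the official Unsloth setup: `installed_versions` is a hypothetical helper that reads package metadata through the standard library's `importlib.metadata`, so it does not import `unsloth` or `torch` and leaves the import order untouched. The package names come from the objectives above and are assumed to match the installed distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("unsloth", "torch", "transformers", "accelerate")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print a small report; a missing package shows up as NOT INSTALLED.
for name, ver in installed_versions().items():
    print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```

If any entry is reported as missing, rerunning the install cell (or the manual alternative) and restarting the runtime is a reasonable first step.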

> **⚠️ CRITICAL IMPORT ORDER**: 
> - Always import `unsloth` FIRST before any other ML libraries
> - Unsloth patches other libraries when it is imported, so importing it first avoids errors such as weight/bias initialization failures
> - Example: `from unsloth import FastLanguageModel` then `import torch`
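
One way to catch an import-order mistake early is to inspect `sys.modules` before importing Unsloth. This is only a sketch: `loaded_before_unsloth` is a hypothetical helper, and the list of libraries it checks is an assumption based on the note above, not an official list.

```python
import sys


def loaded_before_unsloth(candidates=("torch", "transformers", "trl", "peft")):
    """Return the candidate libraries that are already imported while unsloth is not.

    If unsloth is already loaded, the check can no longer tell which import
    came first, so it returns an empty list.
    """
    if "unsloth" in sys.modules:
        return []
    return [name for name in candidates if name in sys.modules]


early = loaded_before_unsloth()
if early:
    print(f"Imported before unsloth: {early}. Restart the runtime and import unsloth first.")
else:
    print("No early imports detected (run this check before importing unsloth).")
```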


### Step 1: Import libraries and load a quantized model

**Documentation:**
- Unsloth documentation: https://docs.unsloth.ai
- Unsloth Quick Start: https://docs.unsloth.ai/get-started/fine-tuning-guide
- **Example Notebook**: [Qwen 2.5 (7B) Alpaca](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) - Shows complete workflow
- All Unsloth notebooks: https://github.com/unslothai/notebooks
- PyTorch dtypes: https://pytorch.org/docs/stable/tensor_attributes.html#torch-dtype


In [None]:
# TODO: Import necessary libraries (CRITICAL: Import unsloth FIRST!)
# Hint: from unsloth import FastLanguageModel
# Hint: import torch

# TODO: Choose your model (e.g., "unsloth/Qwen2.5-7B-Instruct-bnb-4bit")

# TODO: Load the model and tokenizer using FastLanguageModel.from_pretrained()
# Hint: Set dtype=torch.float16 and device_map="auto"

# TODO: Print confirmation that the model is loaded
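
For reference, one possible way to fill in the loading step is sketched below. It is an illustration under assumptions rather than the expected solution: `load_quantized_model` and `MAX_SEQ_LENGTH` are hypothetical names, the sequence length of 2048 is an arbitrary choice, and `dtype=None` is used so that Unsloth picks the dtype itself instead of the explicit `torch.float16` from the hint. The Unsloth import sits inside the function so the cell can be defined without a GPU; the function itself needs a CUDA runtime.

```python
MODEL_NAME = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 2048  # assumed context length for this lab; adjust as needed


def load_quantized_model(model_name=MODEL_NAME, max_seq_length=MAX_SEQ_LENGTH):
    """Load a 4-bit model and tokenizer with Unsloth and prepare it for inference."""
    # Imported here so that defining the function does not require a GPU.
    # Call it before importing torch or transformers elsewhere in the notebook.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        dtype=None,         # let Unsloth choose a dtype for the detected GPU
        load_in_4bit=True,  # the checkpoint is already 4-bit quantized
    )
    FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
    return model, tokenizer


# In Colab with a GPU runtime:
# model, tokenizer = load_quantized_model()
# print(f"Loaded {MODEL_NAME} on {model.device}")
```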


### Step 2: Run a simple inference and measure performance

**Documentation:**
- Model.generate() documentation: https://huggingface.co/docs/transformers/main_classes/text_generation
- Tokenization: https://huggingface.co/docs/transformers/main_classes/tokenizer


In [None]:
# TODO: Import time module

# TODO: Define a helper function to measure inference latency and throughput
# Function should:
#   1. Tokenize the prompt
#   2. Measure start time
#   3. Generate output using model.generate() with torch.inference_mode()
#   4. Measure end time
#   5. Decode output and compute tokens per second
#   6. Return response, elapsed time, and tokens per second

# TODO: Define an example prompt (e.g., "Explain the principle of superposition in quantum mechanics in simple terms.")

# TODO: Generate a response and collect metrics

# TODO: Print the response, elapsed time, and tokens per second
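
The six steps listed in the helper description could be sketched as follows. This is one possible shape, not the reference solution: `throughput` and `measure_inference` are hypothetical names, `max_new_tokens=128` is an arbitrary default, and the prompt is passed as raw text without a chat template. It assumes `model` and `tokenizer` were loaded in Step 1 and counts only newly generated tokens, excluding the prompt.

```python
import time


def throughput(new_tokens, elapsed_s):
    """Tokens generated per second; 0.0 if no time elapsed."""
    return new_tokens / elapsed_s if elapsed_s > 0 else 0.0


def measure_inference(model, tokenizer, prompt, max_new_tokens=128):
    """Return (response_text, elapsed_seconds, tokens_per_second) for one prompt."""
    import torch  # imported lazily; unsloth should already have been imported

    # 1. Tokenize the prompt and move it to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]

    # 2-4. Time the generation; synchronize so pending GPU work is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # 5. Decode only the new tokens and compute throughput.
    new_ids = output_ids[0][prompt_len:]
    response = tokenizer.decode(new_ids, skip_special_tokens=True)

    # 6. Return the response and both metrics.
    return response, elapsed, throughput(len(new_ids), elapsed)


# In Colab, after Step 1:
# prompt = "Explain the principle of superposition in quantum mechanics in simple terms."
# response, elapsed, tps = measure_inference(model, tokenizer, prompt)
# print(response)
# print(f"Elapsed: {elapsed:.2f} s, throughput: {tps:.1f} tokens/s")
```

The first call often includes one-time warm-up cost, so running the measurement twice and keeping the second result tends to give a steadier baseline.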


### Step 3: Record VRAM usage and other system metrics

**Documentation:**
- CUDA memory management: https://pytorch.org/docs/stable/notes/cuda.html#memory-management
- torch.cuda.memory_allocated(): https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html


In [None]:
# TODO: Check if CUDA is available using torch.cuda.is_available()

# TODO: If CUDA is available:
#   - Get allocated memory using torch.cuda.memory_allocated() and convert to GB
#   - Get reserved memory using torch.cuda.memory_reserved() and convert to GB
#   - Print both values

# TODO: If CUDA is not available, print a message indicating GPU is needed
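
A possible version of the memory check is sketched below. `bytes_to_gb` and `report_vram` are hypothetical helper names. The conversion divides by 1024³, so the figures are strictly GiB; dividing by 10⁹ instead would give decimal GB. Reporting the peak allocation via `torch.cuda.max_memory_allocated()` is an addition beyond the TODO list, included because the peak is often the more useful number for a baseline.

```python
def bytes_to_gb(n_bytes):
    """Convert a byte count to gigabytes (binary, 1024**3 bytes)."""
    return n_bytes / 1024**3


def report_vram():
    """Print and return current GPU memory usage, or None without CUDA."""
    import torch  # imported lazily; unsloth should already have been imported

    if not torch.cuda.is_available():
        print("CUDA is not available. This lab needs a GPU runtime.")
        return None

    stats = {
        "allocated_gb": bytes_to_gb(torch.cuda.memory_allocated()),
        "reserved_gb": bytes_to_gb(torch.cuda.memory_reserved()),
        "peak_allocated_gb": bytes_to_gb(torch.cuda.max_memory_allocated()),
    }
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
    return stats


# In Colab, after running inference:
# vram = report_vram()
```

Allocated memory is what live tensors occupy, while reserved memory also includes blocks held by PyTorch's caching allocator, so the reserved figure is normally the larger of the two.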


## Reflection

- Compare the inference latency and tokens-per-second you observed with your peers. If you notice significant differences, consider hardware differences or background workload.
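- If your model failed to load or inference did not run, check the installation and whether your GPU has enough free memory. The 4-bit 7B model used in this lab should fit on a 16 GB T4, whereas an unquantized 16-bit model with 7B to 8B parameters needs about 2 bytes per parameter, roughly 14 to 16 GB for the weights alone.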
- Save your metrics (latency, tokens per second, VRAM usage) for later labs; you will compare these values after applying optimization techniques such as distillation, quantization, and pruning.
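
One simple way to keep the metrics for later labs is to write them to a JSON file. The sketch below is only a suggestion: `save_metrics`, `load_metrics`, the file name `lab4_baseline_metrics.json`, and the dictionary keys are all hypothetical choices, not something later labs are known to require.

```python
import json
from pathlib import Path


def save_metrics(metrics, path="lab4_baseline_metrics.json"):
    """Write a metrics dictionary to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(metrics, indent=2))
    return path


def load_metrics(path="lab4_baseline_metrics.json"):
    """Read a metrics dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())


# In Colab, after Steps 2 and 3 (variable names are assumptions):
# save_metrics({
#     "model": "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
#     "latency_s": elapsed,
#     "tokens_per_second": tps,
#     "vram_allocated_gb": vram["allocated_gb"],
#     "vram_reserved_gb": vram["reserved_gb"],
# })
```

Colab's local disk is cleared when the runtime is recycled, so download the file or copy it to persistent storage before closing the session.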
