# Lab 4 – Hello Unsloth: Load and Infer

> **⚠️ IMPORTANT**: This lab requires **Google Colab with GPU enabled**
> - Go to Runtime → Change runtime type → GPU (T4)
> - Unsloth requires CUDA and will not work on Mac/Windows locally
> - See `COLAB_SETUP.md` for detailed setup instructions

In this lab, you will set up your environment for using **Unsloth** and perform a simple inference with a quantized LLM. The goal is to ensure that your environment is correctly configured and to record baseline metrics for inference speed and resource usage.

## Objectives

- Install and verify the Unsloth library and its dependencies (e.g., `transformers`, `torch`, `accelerate`).
- Load a 4-bit quantized base model, such as `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`.
- Generate a few example outputs to confirm the model works.
- Measure VRAM usage, inference latency, and tokens per second.

Before starting, make sure you have enabled GPU runtime in Google Colab. This notebook provides skeleton code and measurement functions – feel free to customize based on your needs.

In [None]:
# Install Unsloth using the official auto-install script
# This automatically detects your environment and installs the correct version
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Alternative manual installation if auto-install fails:
# !pip install --upgrade pip
# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git"

print("✅ Unsloth installation complete! Now restart runtime before proceeding.")
print("⚠️ IMPORTANT: Use GPU runtime, not TPU! Unsloth requires CUDA GPU.")
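
After restarting the runtime, it can help to confirm that the packages named in the objectives are actually installed and that a CUDA GPU is visible. The following is a minimal sketch rather than part of the official Unsloth setup; the helper names `installed_versions` and `cuda_available` are hypothetical and exist only for this check.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available():
    """Return True if PyTorch can see a CUDA GPU; False otherwise, including when torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


# Packages listed in the lab objectives
report = installed_versions(["unsloth", "transformers", "torch", "accelerate"])
for name, version in report.items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
print("CUDA available:", cuda_available())
```

If any package shows `NOT INSTALLED`, or CUDA is reported as unavailable, revisit the install cell and the runtime type before moving on to Step 1.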

### Step 1: Import libraries and load a quantized model

**Documentation:**
- Unsloth documentation: https://docs.unsloth.ai
- Unsloth Quick Start: https://docs.unsloth.ai/get-started/fine-tuning-guide
- **Example Notebook**: [Qwen 2.5 (7B) Alpaca](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) - Shows complete workflow
- All Unsloth notebooks: https://github.com/unslothai/notebooks
- PyTorch dtypes: https://pytorch.org/docs/stable/tensor_attributes.html#torch-dtype


In [None]:
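# 1️⃣ Import libraries and load a quantized model
# Import unsloth before transformers so that its patches are applied.
from unsloth import FastLanguageModel
import torch

# Choose your model. The example below uses a 4-bit quantized Qwen 2.5 model.
model_name = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"

# Load the model and tokenizer. float16 suits the Colab T4, which has no native bfloat16 support;
# on newer GPUs you can use torch.bfloat16, or pass dtype=None to let Unsloth choose.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    dtype=torch.float16,
    load_in_4bit=True,
    device_map="auto",
)

# Switch the model to Unsloth's inference mode before generating
FastLanguageModel.for_inference(model)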

# Verify model is loaded
print(f"Loaded model: {model_name}")
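
To see why a 4-bit checkpoint is used on a 16 GB T4, a back-of-the-envelope estimate of weight memory at different precisions is useful. The sketch below is only arithmetic: `estimate_weight_memory_gib` is a hypothetical helper, the 7e9 parameter count is a nominal assumption taken from the "7B" in the model name, and the estimate ignores quantization constants, layers kept in higher precision, activations, and the KV cache.

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: nominal parameter count implied by "7B"; the real count differs slightly.
nominal_params = 7e9

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    gib = estimate_weight_memory_gib(nominal_params, bits)
    print(f"{label:>5}: ~{gib:.1f} GiB for weights only")
```

The real footprint will be higher than the 4-bit figure printed here. Once the model is loaded, Transformers models generally expose `model.get_memory_footprint()`, which should give a measured value in bytes to compare against this estimate.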


### Step 2: Run a simple inference and measure performance

**Documentation:**
- Model.generate() documentation: https://huggingface.co/docs/transformers/main_classes/text_generation
- Tokenization: https://huggingface.co/docs/transformers/main_classes/tokenizer


In [None]:
# 2️⃣ Run a simple inference and measure performance

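# Define a helper function to measure inference latency and throughput
import time

def generate_response(prompt: str, max_new_tokens: int = 100, temperature: float = 0.7):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    start_time = time.perf_counter()
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # temperature only takes effect when sampling is enabled
            temperature=temperature,
        )
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending GPU work before stopping the clock
    elapsed = time.perf_counter() - start_time
    # Decode the full sequence (prompt plus continuation)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Count only newly generated tokens: the output sequence also contains the prompt,
    # and whitespace-separated words are not the same as tokens
    num_tokens = outputs.shape[1] - prompt_len
    tokens_per_sec = num_tokens / elapsed if elapsed > 0 else float('inf')
    return response, elapsed, tokens_per_sec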

# Example prompt
prompt_text = "Explain the principle of superposition in quantum mechanics in simple terms."

# Generate a response and collect metrics
response, elapsed_time, tps = generate_response(prompt_text)
print("Response:", response)
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"Tokens per second: {tps:.2f}")
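
A single timed call can be misleading, because the first generation often includes one-off costs such as kernel compilation and cache allocation. One common approach, sketched here under assumptions, is to discard a warm-up run and average several timed runs. The `benchmark` helper and the `fake_generate` stand-in are hypothetical; in the notebook you would pass a function that runs `model.generate` and returns the number of newly generated tokens.

```python
import statistics
import time


def benchmark(generate_fn, runs: int = 3, warmup: int = 1):
    """Time a generation callable over several runs.

    generate_fn takes no arguments, performs one generation, and returns
    the number of newly generated tokens.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up runs are executed but not timed

    latencies, throughputs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        throughputs.append(new_tokens / elapsed if elapsed > 0 else float("inf"))

    return {
        "runs": runs,
        "mean_latency_s": statistics.mean(latencies),
        "mean_tokens_per_s": statistics.mean(throughputs),
    }


def fake_generate():
    """Stand-in for a real generation call: sleeps briefly and reports 20 tokens."""
    time.sleep(0.01)
    return 20


print(benchmark(fake_generate))
```

When timing on a GPU, the callable should finish its GPU work before returning (for example by calling `torch.cuda.synchronize()`), otherwise the measured latency may be too low.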


### Step 3: Record VRAM usage and other system metrics

**Documentation:**
- CUDA memory management: https://pytorch.org/docs/stable/notes/cuda.html#memory-management
- torch.cuda.memory_allocated(): https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html


In [None]:
# 3️⃣ Record VRAM usage and other system metrics

# VRAM measurement (for GPUs). This requires that you have a CUDA-enabled GPU.
# If you are running on CPU, you can skip this or replace with appropriate memory metrics.

try:
    if torch.cuda.is_available():
        # Dividing by 1024**3 yields GiB (binary gigabytes)
        allocated = torch.cuda.memory_allocated() / 1024**3
        reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"CUDA memory allocated: {allocated:.2f} GiB")
        print(f"CUDA memory reserved: {reserved:.2f} GiB")
    else:
        print("CUDA is not available. Please run this notebook on a GPU instance to measure VRAM usage.")
except Exception as e:
    print(f"Error checking VRAM: {e}")
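
`memory_allocated()` reports usage at the moment it is called, so it mostly reflects the loaded weights and can miss the temporary spike during generation. PyTorch also tracks peak values, which are often the more useful baseline. The sketch below wraps those calls in two hypothetical helpers, `reset_peak_vram` and `peak_vram_gib`, and returns `None` when no CUDA GPU (or no PyTorch install) is available.

```python
def reset_peak_vram():
    """Reset PyTorch's peak-memory counters; does nothing without a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()


def peak_vram_gib():
    """Return peak allocated/reserved CUDA memory in GiB, or None without a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / 1024**3,
        "peak_reserved_gib": torch.cuda.max_memory_reserved() / 1024**3,
    }


reset_peak_vram()
# ... run a generation here, then read the peak values ...
print(peak_vram_gib())
```

Resetting the counters immediately before a generation call isolates the peak caused by inference from the peak reached while loading the model.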


## Reflection

- Compare the inference latency and tokens-per-second you observed with your peers. If you notice significant differences, consider hardware differences or background workload.
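- If your model failed to load or inference did not execute, check the installation and whether your GPU has enough free memory. As a rough guide, weights take about 0.5 bytes per parameter in 4-bit and 2 bytes per parameter in fp16, so the 4-bit 7B model used here fits on a 16 GB T4, whereas an unquantized fp16 7B or 8B model needs roughly 14 to 16 GB for the weights alone.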
- Save your metrics (latency, tokens per second, VRAM usage) for later labs; you will compare these values after applying optimization techniques such as distillation, quantization, and pruning.
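
One simple way to keep the baseline for later labs is to write it to a JSON file. This is a minimal sketch: the helper names, the default filename `lab4_baseline_metrics.json`, and the dictionary keys are all assumptions rather than a format required by later labs.

```python
import json
from pathlib import Path


def save_metrics(metrics: dict, path="lab4_baseline_metrics.json") -> Path:
    """Write a metrics dictionary to a JSON file and return the file path."""
    out = Path(path)
    out.write_text(json.dumps(metrics, indent=2))
    return out


def load_metrics(path="lab4_baseline_metrics.json") -> dict:
    """Read a metrics dictionary back from a JSON file."""
    return json.loads(Path(path).read_text())
```

For example, `save_metrics({"latency_s": elapsed_time, "tokens_per_s": tps})` would store the values measured in Step 2. Files in the Colab working directory are lost when the runtime is recycled, so download the file or copy it to persistent storage.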
