<a href="https://colab.research.google.com/github/harjeet88/llm-course/blob/main/5_module/memory_footprint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading a 7B Model with BitsAndBytes: Memory-Efficient Inference

In this notebook, we'll:
1. Install required libraries.
2. Load a quantized model to reduce VRAM usage (e.g., from ~14GB to ~4-7GB for a 7B model).
3. Generate text to verify it works.
4. Monitor memory usage.

Quantization via `bitsandbytes` compresses weights without much accuracy loss.

In [1]:
# Install libraries (run once per session)
!pip install -q transformers torch bitsandbytes accelerate

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.3/61.3 MB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
#Import Libraries

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import gc  # For garbage collection to free memory

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Clear any existing cache
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

Using device: cuda


In [3]:
from google.colab import userdata
HF_TOKEN=userdata.get('HF_TOKEN')

In [4]:
from huggingface_hub import login
login(HF_TOKEN)
print("Hugging Face login successful!")

Hugging Face login successful!


In [5]:
hf_token=HF_TOKEN

## Step 2: Load Tokenizer and Model with Quantization

- `load_in_8bit=True`: Reduces weights to 8-bit, ~50% memory savings.
- `load_in_4bit=True`: Even more aggressive (~75% savings), but use with `bnb_4bit_compute_dtype=torch.float16` for stability.

For a full 7B model, 4-bit quantization fits on a single T4 GPU.

In [6]:
# Model name (using Llama-2-7B for true 7B; requires HF login)
model_name = "google/gemma-3-4b-it"  # Swap to "google/gemma-2-2b" for non-gated demo


In [None]:
# Model name (use 'meta-llama/Llama-2-7b-chat-hf' for true 7B; requires HF login)
model_name = "google/gemma-3-4b-it"  # Demo model; swap for 7B

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with 8-bit quantization (swap to 4-bit below for more savings)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically maps to GPU
    load_in_4=True,  # Enable 8-bit quantization
    torch_dtype=torch.float16,  # Use FP16 for compute
)

# Alternative: 4-bit for max savings (uncomment below)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     device_map="auto",
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_use_double_quant=True,  # Nested quantization for extra ~0.4 bits savings
# )

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/90.6k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Function to print memory usage
def print_memory_usage():
    if torch.cuda.is_available():
        print(f"GPU Memory Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
        print(f"GPU Memory Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    else:
        print("No GPU available.")

print("Memory before loading model:")
print_memory_usage()

# After loading (run the model load cell first, then this)
print("\nMemory after loading quantized model:")
print_memory_usage()

# Compare to full precision (uncomment to test; will OOM on T4 for 7B)
# model_full = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto", token=hf_token)
# print("\nMemory with full precision (may crash):")
# print_memory_usage()

In [None]:
# Sample prompt
prompt = "Explain how quantization reduces memory in LLMs:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)