# LoRA Fine-Tuning Tutorial

This notebook demonstrates **Parameter-Efficient Fine-Tuning** with LoRA (Low-Rank Adaptation).

## What You'll Learn:
1. **Basic LoRA**: Train only 0.1-1% of parameters
2. **QLoRA**: 4-bit quantization for large models
3. **DPO + LoRA**: Preference alignment with minimal memory
4. **Adapter Management**: Save, load, merge adapters

## Benefits:
- ✓ Far less gradient and optimizer memory than full fine-tuning, since only the adapter weights are trained
- ✓ Faster training steps in many setups
- ✓ Works on consumer GPUs (even 4GB VRAM for small models)
- ✓ Adapters are tiny (a few MB) and shareable

---

## Setup & Configuration

Run this cell first to set up the environment and detect your GPU.

In [1]:
import os
import torch
from typing import Dict, Any

# Configuration
MODEL_NAME = "distilgpt2"  # Small model for demos (~82M params)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {DEVICE}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("Running on CPU (slower but works!)")

print("\n✓ Setup complete!")

Device: cuda
GPU: NVIDIA GeForce RTX 5070 Laptop GPU
VRAM: 8.0 GB

✓ Setup complete!


## GPU Auto-Configuration

This function detects your GPU's VRAM and picks conservative training parameters to match.

In [2]:
def detect_gpu_config():
    """Detect GPU and return optimized configuration."""
    if not torch.cuda.is_available():
        return {
            'device': 'cpu',
            'batch_size': 2,
            'use_fp16': False,
            'max_length': 256,
            'gradient_checkpointing': True,
        }
    
    vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    
    if vram_gb < 4:
        config = {'batch_size': 1, 'max_length': 128}
    elif vram_gb < 8:
        config = {'batch_size': 2, 'max_length': 256}
    elif vram_gb < 12:
        config = {'batch_size': 4, 'max_length': 512}
    else:
        config = {'batch_size': 8, 'max_length': 512}
    
    config.update({
        'device': 'cuda',
        'use_fp16': True,
        'gradient_checkpointing': vram_gb < 12,
    })
    
    print(f"Auto-configured for {vram_gb:.1f}GB VRAM:")
    print(f"  - Batch size: {config['batch_size']}")
    print(f"  - Max length: {config['max_length']}")
    print(f"  - FP16: {config['use_fp16']}")
    print(f"  - Gradient checkpointing: {config['gradient_checkpointing']}")
    
    return config

GPU_CONFIG = detect_gpu_config()

Auto-configured for 8.0GB VRAM:
  - Batch size: 2
  - Max length: 256
  - FP16: True
  - Gradient checkpointing: True


---

# Part 1: Understanding LoRA

LoRA works by adding small trainable matrices to the model:

```
Normal fine-tuning:  Update all W (millions/billions of parameters)
LoRA fine-tuning:    Freeze W, train a low-rank update W' = A × B (far fewer parameters!)
```

Where:
- `W` is the original weight matrix (frozen)
- `A` and `B` are small matrices (trainable)
- Rank `r` controls the size: smaller = fewer parameters, larger = more capacity
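
The idea above can be sketched in a few lines of plain Python. This is a toy illustration, not how PEFT implements it: the sizes and the `matvec` and `lora_forward` helpers are made up for the example. It assumes the common convention that the update is scaled by `lora_alpha / r` and that `B` starts at zero, so the adapter is a no-op before training.

```python
import random

def matvec(x, M):
    """Multiply row vector x (length n) by matrix M (n rows, m columns)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path x·W plus the scaled low-rank path (x·A)·B."""
    base = matvec(x, W)
    update = matvec(matvec(x, A), B)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # toy sizes (assumed for illustration)
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]    # trainable
B = [[0.0] * d_out for _ in range(r)]                                   # trainable, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(x, W)
y_lora = lora_forward(x, W, A, B, alpha, r)

full_params = d_in * d_out        # weights touched by full fine-tuning
lora_params = r * (d_in + d_out)  # weights in A and B combined
print(y_lora == y_base, full_params, lora_params)
```

At these toy sizes the saving is only 2x; it grows with the layer width because the adapter scales with `d_in + d_out` while the full matrix scales with `d_in * d_out`.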

## Key LoRA Parameters

| Parameter | Range | What it does |
|-----------|-------|-------------|
| **r** (rank) | 4-64 | Higher = more capacity, more params |
| **lora_alpha** | 16-32 | Scaling factor (usually 2×r) |
| **target_modules** | varies | Which layers to adapt |
| **lora_dropout** | 0.0-0.1 | Regularization |

For a **7B model**:
- Full fine-tuning: 7B parameters
- LoRA (r=8): ~8M parameters (0.1%)
- LoRA (r=64): ~67M parameters (1%)
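
These figures can be reproduced with a short calculation. The sketch below assumes a Llama-2-7B-like shape (32 layers, hidden size 4096) with adapters on the four attention projections; the `lora_param_count` helper is hypothetical, and other choices of `target_modules` give different totals.

```python
def lora_param_count(r, layer_shapes, n_layers):
    """Each adapted (d_in, d_out) matrix gets A (d_in x r) and B (r x d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * n_layers

# Assumed architecture: 32 layers, hidden size 4096, adapters on q/k/v/o projections
hidden = 4096
attn_projections = [(hidden, hidden)] * 4
n_layers = 32

r8 = lora_param_count(8, attn_projections, n_layers)
r64 = lora_param_count(64, attn_projections, n_layers)
print(f"r=8:  {r8:,} ({100 * r8 / 7e9:.2f}% of 7B)")
print(f"r=64: {r64:,} ({100 * r64 / 7e9:.2f}% of 7B)")
```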

---

# Part 2: Basic LoRA Fine-Tuning

## Step 1: Load Base Model

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

print(f"Loading model: {MODEL_NAME}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if GPU_CONFIG['use_fp16'] else torch.float32,
)

print(f"✓ Model loaded: {model.num_parameters():,} parameters")

Loading model: distilgpt2


`torch_dtype` is deprecated! Use `dtype` instead!


✓ Model loaded: 81,912,576 parameters


## Step 2: Apply LoRA Adapters

In [4]:
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=8,                        # Rank (try 4, 8, 16, 32)
    lora_alpha=16,              # Scaling (usually 2×r)
    target_modules=["c_attn"],  # GPT-2 attention layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
lora_model = get_peft_model(model, lora_config)


# Required with gradient checkpointing: make the embedding outputs require grad
# so gradients can flow back through the frozen base model to the adapters
if GPU_CONFIG['gradient_checkpointing']:
    lora_model.enable_input_require_grads()

# Print trainable parameters
trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_model.parameters())

print(f"\n📊 LoRA Applied:")
print(f"  Total params: {total:,}")
print(f"  Trainable: {trainable:,} ({100 * trainable / total:.2f}%)")
print(f"  Frozen: {total - trainable:,}")


📊 LoRA Applied:
  Total params: 82,060,032
  Trainable: 147,456 (0.18%)
  Frozen: 81,912,576
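
The trainable count in this output can be reproduced by hand. The sketch below assumes distilgpt2's shape (6 transformer blocks, hidden size 768, and a fused `c_attn` projection mapping 768 to 2304 features); the variable names are illustrative.

```python
r = 8
n_layers = 6             # distilgpt2 transformer blocks
d_in, d_out = 768, 2304  # fused query/key/value projection (c_attn)

# One A (d_in x r) and one B (r x d_out) per adapted layer
trainable = n_layers * r * (d_in + d_out)
frozen = 81_912_576      # base model size printed in Step 1
total = frozen + trainable

print(f"{trainable:,} trainable / {total:,} total = {100 * trainable / total:.2f}%")
# Approximate adapter size if stored as float32 (4 bytes per weight)
print(f"~{trainable * 4 / 1024**2:.2f} MB")
```

Doubling the rank to 16 doubles the trainable count to 294,912, which matches the DPO policy model later in the notebook.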




## Step 3: Prepare Dataset

In [5]:
from datasets import Dataset

# Create sample Q&A dataset
data = [
    {"text": "Question: What is Python?\nAnswer: Python is a high-level programming language known for its simplicity and readability."},
    {"text": "Question: Explain machine learning.\nAnswer: Machine learning is a branch of AI where computers learn patterns from data."},
    {"text": "Question: What is an API?\nAnswer: An API is a set of protocols that allows different software applications to communicate."},
    {"text": "Question: What is Docker?\nAnswer: Docker is a platform for developing and running applications in isolated containers."},
] * 30  # 120 examples

dataset = Dataset.from_list(data)

# Tokenize
def tokenize(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=GPU_CONFIG['max_length'],
        padding="max_length",
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
split_dataset = tokenized_dataset.train_test_split(test_size=0.1)

print(f"✓ Dataset prepared:")
print(f"  Train: {len(split_dataset['train'])} examples")
print(f"  Test: {len(split_dataset['test'])} examples")

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

✓ Dataset prepared:
  Train: 108 examples
  Test: 12 examples
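
The `labels` column built in `tokenize` is a straight copy of `input_ids`, padding included. With `mlm=False`, the `DataCollatorForLanguageModeling` used in the training step rebuilds the labels in current transformers releases and sets padded positions to `-100`, so they are ignored by the loss. A minimal sketch of that masking, with made-up token ids and a hypothetical `mask_padding` helper:

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the loss
PAD_ID = 50256        # this notebook reuses GPT-2's EOS token id for padding

def mask_padding(input_ids, pad_id=PAD_ID):
    """Copy input_ids into labels, replacing padding with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Made-up token ids for a short example padded to length 6
input_ids = [101, 202, 303, 404, PAD_ID, PAD_ID]
labels = mask_padding(input_ids)
print(labels)
```

Because the pad token is the EOS token here, any genuine EOS in the text would be masked as well; that is a side effect of `tokenizer.pad_token = tokenizer.eos_token`.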


## Step 4: Train with LoRA

**Note:** This will actually train the model! Adjust `num_train_epochs` as needed.

In [6]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Training configuration
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=1,  # Increase for better results
    per_device_train_batch_size=GPU_CONFIG['batch_size'],
    per_device_eval_batch_size=GPU_CONFIG['batch_size'],
    gradient_accumulation_steps=2,
    learning_rate=2e-4,  # LoRA can use higher LR
    warmup_ratio=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="epoch",
    fp16=GPU_CONFIG['use_fp16'],
    report_to="none",
    gradient_checkpointing=GPU_CONFIG['gradient_checkpointing'],
)

# Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Trainer
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=data_collator,
)

print("🚀 Starting LoRA training...\n")
trainer.train()
print("\n✓ Training complete!")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


🚀 Starting LoRA training...



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss,Validation Loss



✓ Training complete!
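
It helps to work out how many optimizer steps one epoch contains before choosing `logging_steps` and `eval_steps`. The sketch below assumes a single GPU and the values from this run (108 training examples, batch size 2, `gradient_accumulation_steps=2`); `optimizer_steps_per_epoch` is a hypothetical helper, and how the Trainer treats a final partial accumulation group may differ between versions.

```python
import math

def optimizer_steps_per_epoch(n_examples, per_device_batch, grad_accum, n_devices=1):
    """Batches per epoch, grouped into gradient-accumulation steps."""
    batches = math.ceil(n_examples / (per_device_batch * n_devices))
    return math.ceil(batches / grad_accum)

steps = optimizer_steps_per_epoch(n_examples=108, per_device_batch=2, grad_accum=2)
eval_steps = 50
print(f"{steps} optimizer steps per epoch; evaluation runs: {steps >= eval_steps}")
```

With `eval_steps=50` and a single epoch, evaluation is never triggered in this configuration; lowering `eval_steps` or raising `num_train_epochs` would produce validation losses.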


## Step 5: Save LoRA Adapters

**Important:** This saves only the LoRA weights (~few MB), not the full model!

In [7]:
# Save LoRA adapters
lora_model.save_pretrained("./my_lora_adapters")
tokenizer.save_pretrained("./my_lora_adapters")

print("✓ LoRA adapters saved to ./my_lora_adapters/")
print("  (Only LoRA weights, very small!)")

✓ LoRA adapters saved to ./my_lora_adapters/
  (Only LoRA weights, very small!)


## Step 6: Test the Fine-Tuned Model

In [8]:
# Test generation
prompt = "Question: What is artificial intelligence?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(lora_model.device)

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

print("Prompt:", prompt)
print("\nGenerated:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Prompt: Question: What is artificial intelligence?
Answer:

Generated:
Question: What is artificial intelligence?
Answer: Artificial Artificial Artificial Artificial Artificial Artificial Artificial Artificial Artificial Artificial AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI AI


---

# Part 3: Loading LoRA Adapters

This section shows how to load the saved LoRA adapters onto a fresh copy of the base model.

In [9]:
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Load LoRA adapters
loaded_model = PeftModel.from_pretrained(base_model, "./my_lora_adapters")

print("✓ LoRA adapters loaded onto base model")
print("\n💡 You can now use this model for inference!")

✓ LoRA adapters loaded onto base model

💡 You can now use this model for inference!


## Merge Adapters (For Deployment)

If you want to deploy without a PEFT dependency, merge the adapters into the base model weights.

In [10]:
# Merge and unload LoRA weights into base model
merged_model = loaded_model.merge_and_unload()

# Save merged model (standard HuggingFace format)
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

print("✓ Merged model saved to ./merged_model/")
print("  (Can now use without PEFT library)")

✓ Merged model saved to ./merged_model/
  (Can now use without PEFT library)
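
Conceptually, merging folds the scaled low-rank update into the frozen weight, so a single matrix gives the same output as the base-plus-adapter path. The toy sketch below illustrates that equivalence with hand-picked numbers; the `matmul` and `matvec` helpers and all values are made up, and it is not PEFT's actual merge code.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (n x k) and Y (k x m)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def matvec(x, M):
    """Row vector times matrix."""
    return matmul([x], M)[0]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # frozen weight, 3 x 2
A = [[0.5], [-1.0], [2.0]]                # adapter factor, 3 x 1 (rank 1)
B = [[0.25, -0.5]]                        # adapter factor, 1 x 2
alpha, r = 2, 1
scale = alpha / r

# Merge: W_merged = W + scale * (A x B)
AB = matmul(A, B)
W_merged = [[W[i][j] + scale * AB[i][j] for j in range(2)] for i in range(3)]

x = [1.0, 2.0, 3.0]
y_adapter = [b + scale * u for b, u in zip(matvec(x, W), matvec(matvec(x, A), B))]
y_merged = matvec(x, W_merged)
print(y_adapter, y_merged)
```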


---

# Part 4: DPO + LoRA (Preference Alignment)

Combine **DPO** (Direct Preference Optimization) with **LoRA** for memory-efficient alignment.

**Use case:** Align a model to prefer certain response styles with minimal memory.

## Why DPO + LoRA?

| Feature | Benefit |
|---------|--------|
| DPO | No reward model needed, simpler than PPO |
| LoRA | Memory-efficient, only 0.1-1% params |
| Combined | Align models on consumer GPUs! |

**Adoption:** DPO with LoRA adapters is a common recipe for preference tuning and is supported directly by libraries such as TRL.

## Step 1: Create Preference Dataset

DPO needs pairs of (chosen, rejected) responses for each prompt.

In [11]:
preference_data = [
    {
        "prompt": "Explain what Python is.",
        "chosen": "Python is a high-level, interpreted programming language known for its clear syntax and readability. It supports multiple programming paradigms and has extensive libraries.",
        "rejected": "python is a programming language i guess",
    },
    {
        "prompt": "What is machine learning?",
        "chosen": "Machine learning is a subset of artificial intelligence where systems learn patterns from data to make predictions without being explicitly programmed for each task.",
        "rejected": "ml is when computers do stuff automatically",
    },
    {
        "prompt": "Explain how computers work.",
        "chosen": "A computer processes data using a CPU that executes instructions stored in memory. It takes input, processes it according to programs, and produces output.",
        "rejected": "computers work by doing calculations fast",
    },
] * 15  # 45 preference pairs

dpo_dataset = Dataset.from_list(preference_data)

print(f"✓ Preference dataset: {len(dpo_dataset)} pairs")
print(f"\nExample:")
print(f"  Prompt: {preference_data[0]['prompt']}")
print(f"  Chosen: {preference_data[0]['chosen'][:60]}...")
print(f"  Rejected: {preference_data[0]['rejected']}")

✓ Preference dataset: 45 pairs

Example:
  Prompt: Explain what Python is.
  Chosen: Python is a high-level, interpreted programming language kno...
  Rejected: python is a programming language i guess


## Step 2: Setup DPO Models

DPO needs:
- **Policy model** (with LoRA) - being trained
- **Reference model** (frozen) - for KL penalty

In [12]:
# Load models
dpo_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if GPU_CONFIG['use_fp16'] else torch.float32,
)
ref_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if GPU_CONFIG['use_fp16'] else torch.float32,
)

# Apply LoRA to policy model (NOT reference model)
dpo_lora_config = LoraConfig(
    r=16,  # Higher rank for alignment tasks
    lora_alpha=32,
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

dpo_model = get_peft_model(dpo_model, dpo_lora_config)

trainable = sum(p.numel() for p in dpo_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in dpo_model.parameters())

print(f"✓ DPO models ready")
print(f"  Policy model: {trainable:,} trainable params ({100*trainable/total:.2f}%)")
print(f"  Reference model: Frozen")

✓ DPO models ready
  Policy model: 294,912 trainable params (0.36%)
  Reference model: Frozen




## Step 3: Train with DPO

DPO directly optimizes the policy from preference pairs.

In [13]:
from trl import DPOConfig, DPOTrainer

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo_lora_output",
    num_train_epochs=1,
    per_device_train_batch_size=GPU_CONFIG['batch_size'],
    gradient_accumulation_steps=2,
    learning_rate=5e-5,  # Lower LR for DPO
    beta=0.1,  # Strength of the pull toward the reference model (higher = stay closer)
    max_length=GPU_CONFIG['max_length'],
    max_prompt_length=GPU_CONFIG['max_length'] // 2,
    logging_steps=5,
    fp16=GPU_CONFIG['use_fp16'],
    report_to="none",
    remove_unused_columns=False,
)

# Create trainer
dpo_trainer = DPOTrainer(
    model=dpo_model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=dpo_dataset,
    processing_class=tokenizer,
)

print("🚀 Starting DPO + LoRA training...\n")
dpo_trainer.train()
print("\n✓ DPO training complete!")

Extracting prompt in train dataset:   0%|          | 0/45 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/45 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/45 [00:00<?, ? examples/s]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


🚀 Starting DPO + LoRA training...



Step,Training Loss
5,0.6823
10,0.6603



✓ DPO training complete!
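
The logged losses start just below ln 2 ≈ 0.693, which is what the standard DPO objective gives when the policy and reference models agree. A minimal sketch of the per-pair loss on sequence log-probabilities follows; the `dpo_loss` helper and the log-probability values are made up for illustration, and this is not TRL's implementation.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin compares policy vs. reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # softplus(-z) equals -log(sigmoid(z)); this form avoids overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Before training the policy equals the reference, so the margin is zero
start = dpo_loss(pi_chosen=-40.0, pi_rejected=-35.0, ref_chosen=-40.0, ref_rejected=-35.0)

# After some training: chosen became more likely, rejected less likely (made-up values)
later = dpo_loss(pi_chosen=-38.0, pi_rejected=-37.0, ref_chosen=-40.0, ref_rejected=-35.0)

print(f"start: {start:.4f}  later: {later:.4f}")
```

A loss that drifts down from 0.693 means the policy is ranking chosen responses above rejected ones relative to the reference; it says nothing on its own about generation quality.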


## Step 4: Save DPO Adapters

In [14]:
dpo_model.save_pretrained("./dpo_lora_adapters")
tokenizer.save_pretrained("./dpo_lora_adapters")

print("✓ DPO-aligned LoRA adapters saved!")
print("  These adapters now prefer high-quality responses.")

✓ DPO-aligned LoRA adapters saved!
  These adapters now prefer high-quality responses.


## Step 5: Test DPO-Aligned Model

In [15]:
# Generate a response from the DPO-aligned model
test_prompt = "Explain what artificial intelligence is."
inputs = tokenizer(test_prompt, return_tensors="pt").to(dpo_model.device)

print("Prompt:", test_prompt)
print("\nDPO-Aligned Response:")

with torch.no_grad():
    outputs = dpo_model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Prompt: Explain what artificial intelligence is.

DPO-Aligned Response:
Explain what artificial intelligence is. It It It It It It It It It It It It It It It It It

---

# Part 5: QLoRA (Quantized LoRA)

**QLoRA = 4-bit quantization + LoRA**

This enables fine-tuning **7B-70B models** on consumer GPUs!
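
A rough way to see why quantization matters is to compute the memory needed for the weights alone at different precisions. The sketch below uses a hypothetical `weight_memory_gb` helper and counts 1 GB as 10^9 bytes; it ignores activations, adapter gradients, optimizer state, and quantization overhead, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # a 7B-parameter model
for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

The 4-bit weights come to about 3.5 GB by this estimate; the extra training overhead is why the practical figure quoted for a 7B model in the cell below is closer to 6-8 GB.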

**Requirements:**
- CUDA GPU (6GB+ VRAM recommended)
- `pip install bitsandbytes`

**Note:** This cell requires GPU. Skip if on CPU.

In [None]:
if not torch.cuda.is_available():
    print("⚠️ QLoRA requires CUDA GPU. Showing configuration only...")
    print("""
# QLoRA Configuration Example:

from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (for Llama-style models, target q_proj/v_proj rather than GPT-2's c_attn)
model = get_peft_model(model, lora_config)

# Memory: Llama-2-7B
# - Full precision (fp32 weights): ~28 GB ❌
# - 4-bit + LoRA: ~6-8 GB ✓ Fits on RTX 3060!
    """)
else:
    vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    if vram_gb < 6:
        print(f"⚠️ VRAM ({vram_gb:.1f}GB) too low for QLoRA demo.")
        print("Need 6GB+ for TinyLlama-1.1B with QLoRA.")
    else:
        print("✓ GPU detected! You can run QLoRA.")
        print("Uncomment the code below to try with TinyLlama-1.1B.")
        
        # Uncomment to try:
        # from transformers import BitsAndBytesConfig
        # from peft import prepare_model_for_kbit_training
        # 
        # bnb_config = BitsAndBytesConfig(
        #     load_in_4bit=True,
        #     bnb_4bit_quant_type="nf4",
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        # )
        # 
        # qlora_model = AutoModelForCausalLM.from_pretrained(
        #     "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        #     quantization_config=bnb_config,
        #     device_map="auto",
        # )
        # 
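        # qlora_model = prepare_model_for_kbit_training(qlora_model)
        #
        # # Llama-style models name their attention projections q_proj/v_proj,
        # # so the GPT-2 `c_attn` target in lora_config would not match here.
        # qlora_config = LoraConfig(
        #     r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        #     lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM,
        # )
        # qlora_model = get_peft_model(qlora_model, qlora_config)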
        # 
        # print(f"Memory: {qlora_model.get_memory_footprint() / 1024**2:.1f} MB")

---

# Summary

## What You Learned:

| Technique | When to Use | Memory Savings |
|-----------|-------------|----------------|
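| **Basic LoRA** | General fine-tuning | Much lower than full fine-tuning (no gradients or optimizer state for frozen weights) |
| **DPO + LoRA** | Preference alignment | Same as LoRA for the policy, plus a frozen reference model |
| **QLoRA** | Large models (7B+) | Base weights roughly 4x smaller than fp16 |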

## Key Takeaways:

1. **LoRA trains only 0.1-1% of parameters** - Massive memory savings!
2. **Adapters are tiny (~few MB)** - Easy to share and version
3. **DPO + LoRA** - Production-ready alignment on consumer GPUs
4. **QLoRA** - Fine-tune 7B-70B models on single GPU

## Next Steps:

- ✓ Try with your own dataset from HuggingFace
- ✓ Experiment with different ranks (r=4, 8, 16, 32, 64)
- ✓ Try QLoRA with larger models (if you have 8GB+ VRAM)
- ✓ Load multiple adapters for multi-task models

---

**Happy Fine-Tuning! 🚀**