# 🛠️ Step 2: Building LoRA from Scratch

## Week 7-8: Fine-tuning Implementation

Now that you understand the theory, let's **build LoRA from scratch**! You'll implement every component and see exactly how it works.

### 🎯 What You'll Build:
1. **Basic LoRA Layer** - Core implementation
2. **LoRA Linear Layer** - Drop-in replacement for nn.Linear
3. **Testing and Verification** - Make sure it works correctly
4. **Performance Comparisons** - See the memory savings
5. **Integration Examples** - How to use with real models

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import math

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("🚀 Ready to build LoRA from scratch!")
print(f"Using PyTorch version: {torch.__version__}")

## 🔧 Part 1: Core LoRA Implementation

Let's start by building the basic LoRA layer step by step:

In [None]:
class LoRALayer(nn.Module):
    """
    LoRA (Low-Rank Adaptation) Layer
    
    This implements: output = original_layer(x) + (alpha / rank) * B @ A @ x
    where B and A are the low-rank matrices we train
    """
    def __init__(self, original_layer, rank=4, alpha=16, dropout=0.0):
        super().__init__()
        
        # Store the original layer (we'll freeze this)
        self.original_layer = original_layer
        
        # Get dimensions from original layer
        if hasattr(original_layer, 'in_features') and hasattr(original_layer, 'out_features'):
            self.in_features = original_layer.in_features
            self.out_features = original_layer.out_features
        else:
            raise ValueError("Original layer must be a Linear layer")
        
        # LoRA parameters
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank  # This balances the adaptation strength
        
        # Create the LoRA matrices
        # A: rank x in_features (initialized with random Gaussian values)
        # B: out_features x rank (initialized with zeros, so ΔW = B @ A is zero before training)
        self.lora_A = nn.Parameter(torch.randn(rank, self.in_features))
        self.lora_B = nn.Parameter(torch.zeros(self.out_features, rank))
        
        # Optional dropout
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        
        # Freeze the original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False
        
        print(f"✅ Created LoRA layer:")
        print(f"   Original: {self.in_features} -> {self.out_features} ({self.in_features * self.out_features:,} params)")
        print(f"   LoRA: rank={rank}, alpha={alpha}")
        print(f"   A matrix: {rank} x {self.in_features} = {rank * self.in_features:,} params")
        print(f"   B matrix: {self.out_features} x {rank} = {self.out_features * rank:,} params")
        print(f"   Total trainable: {(rank * self.in_features) + (self.out_features * rank):,} params")
        print(f"   🎯 Reduction: {(self.in_features * self.out_features) / ((rank * self.in_features) + (self.out_features * rank)):.1f}x")
    
    def forward(self, x):
        """
        Forward pass: output = W₀x + (α/r) · BAx
        
        We compute this efficiently as:
        1. original_output = W₀ @ x
        2. lora_output = B @ (A @ x)  # Note: we don't materialize B@A
        3. return original_output + scaling * lora_output
        """
        # Original layer output (frozen weights)
        original_output = self.original_layer(x)
        
        # LoRA adaptation path
        # Step 1: project down to the rank dimension.
        # F.linear(x, W) computes x @ W.T, so the result has shape (..., rank)
        x_adapted = F.linear(x, self.lora_A)
        
        # Apply dropout
        x_adapted = self.dropout(x_adapted)
        
        # Step 2: project back up, giving shape (..., out_features).
        # lora_B is already stored as (out_features, rank), the layout F.linear expects,
        # so it must not be transposed here
        lora_output = F.linear(x_adapted, self.lora_B)
        
        # Combine original and LoRA outputs with scaling
        return original_output + lora_output * self.scaling
    
    def get_delta_weights(self):
        """
        Get the actual weight changes, including scaling (ΔW = (α/r) · B @ A)
        This is mainly for analysis - we don't compute this during forward pass
        """
        return (self.lora_B @ self.lora_A) * self.scaling
    
    def merge_weights(self):
        """
        Merge LoRA weights into the original layer for deployment
        This creates a single weight matrix: W_new = W₀ + ΔW
        """
        if not hasattr(self.original_layer, 'weight'):
            raise ValueError("Original layer must have 'weight' parameter")
        
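        # Add the update in place without tracking gradients
        with torch.no_grad():
            self.original_layer.weight += self.get_delta_weights()
            # Reset B so that forward() does not apply the same update a second time
            self.lora_B.zero_()
        print("✅ LoRA weights merged into original layer")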
    
    def get_parameter_info(self):
        """
        Get detailed parameter information
        """
        original_params = sum(p.numel() for p in self.original_layer.parameters())
        lora_params = self.lora_A.numel() + self.lora_B.numel()
        
        return {
            'original_params': original_params,
            'lora_params': lora_params,
            'total_params': original_params + lora_params,
            'trainable_params': lora_params,
            'reduction_factor': original_params / lora_params if lora_params > 0 else float('inf')
        }

# Test our LoRA implementation
print("🧪 Testing LoRA Implementation:")
print("="*40)

# Create a sample linear layer
original = nn.Linear(512, 256)

# Wrap it with LoRA
lora_layer = LoRALayer(original, rank=8, alpha=16)

# Test with sample input
x = torch.randn(2, 512)  # batch_size=2, features=512
output = lora_layer(x)
print(f"\n✅ Forward pass successful!")
print(f"   Input shape: {x.shape}")
print(f"   Output shape: {output.shape}")
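
The shape bookkeeping in the forward pass is easy to get wrong, so here is a minimal standalone sketch of the two-step low-rank path. It assumes only the `F.linear(input, weight)` convention (`input @ weight.T`); the sizes and the variable names `down`, `up` and `full` are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features, rank = 6, 4, 2

x = torch.randn(3, in_features)        # batch of 3 inputs
A = torch.randn(rank, in_features)     # same layout as lora_A
B = torch.randn(out_features, rank)    # same layout as lora_B (non-zero for this check)

# F.linear(input, weight) computes input @ weight.T
down = F.linear(x, A)                  # shape (3, rank)
up = F.linear(down, B)                 # shape (3, out_features)

# Materializing the full update B @ A gives the same numbers at a higher cost
full = F.linear(x, B @ A)

shapes_ok = down.shape == (3, rank) and up.shape == (3, out_features)
paths_match = torch.allclose(up, full, atol=1e-4)

# With B = 0 (the LoRA initialization) the update vanishes entirely
zero_update = F.linear(F.linear(x, A), torch.zeros(out_features, rank))
starts_at_zero = bool(torch.all(zero_update == 0))

print(shapes_ok, paths_match, starts_at_zero)
```

Because `B` starts at zero, a freshly wrapped layer should reproduce the frozen layer's output exactly until training moves `B` away from zero.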

## 🔍 Part 2: Let's Verify Our Implementation

Now let's test that our LoRA layer works correctly:

In [None]:
def test_lora_correctness():
    """
    Test that our LoRA implementation works correctly
    """
    print("🧪 Testing LoRA Correctness:")
    print("="*40)
    
    # Create a simple example
    original = nn.Linear(4, 3)
    lora = LoRALayer(original, rank=2, alpha=4)
    
    # B starts at zero, which would make the LoRA path vanish and the test trivial,
    # so give it random values first
    with torch.no_grad():
        lora.lora_B.normal_()
    
    # Test input
    x = torch.randn(1, 4)
    
    # Method 1: Use our LoRA layer
    output_lora = lora(x)
    
    # Method 2: Manual computation for verification
    original_output = original(x)
    delta_w = lora.get_delta_weights()
    manual_output = original_output + F.linear(x, delta_w)
    
    # Check if they're the same (up to float32 rounding)
    difference = torch.abs(output_lora - manual_output).max().item()
    passed = difference < 1e-4
    
    print(f"✅ Correctness test:")
    print(f"   LoRA output: {output_lora.flatten()[:3]}...")
    print(f"   Manual output: {manual_output.flatten()[:3]}...")
    print(f"   Max difference: {difference:.8f}")
    print(f"   {'✅ PASSED' if passed else '❌ FAILED'}")
    
    return passed

test_lora_correctness()
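
The same idea extends to merging: folding the scaled update into the weight matrix should give the same outputs as keeping the two paths separate. The following is a rough NumPy sketch of that check, not the class above; all sizes and names (`W0`, `W_merged`, `max_gap`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, rank, alpha = 5, 3, 2, 4
scaling = alpha / rank

W0 = rng.normal(size=(out_features, in_features))   # frozen weight
bias = rng.normal(size=out_features)
A = rng.normal(size=(rank, in_features))
B = rng.normal(size=(out_features, rank))            # non-zero, as it would be after training
x = rng.normal(size=(7, in_features))

# Unmerged: frozen layer plus the scaled low-rank path
unmerged = x @ W0.T + bias + scaling * (x @ A.T @ B.T)

# Merged: fold the update into one weight matrix, then a single matmul
W_merged = W0 + scaling * (B @ A)
merged = x @ W_merged.T + bias

max_gap = float(np.abs(unmerged - merged).max())
print(f"max difference between merged and unmerged outputs: {max_gap:.2e}")
```

After merging, the low-rank path has to be dropped or zeroed; otherwise the update would be counted twice.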

## 📊 Part 3: Memory and Parameter Analysis

Let's see the real memory savings:

In [None]:
def analyze_parameter_efficiency():
    """
    Analyze parameter efficiency across different layer sizes and ranks
    """
    print("📊 Parameter Efficiency Analysis:")
    print("="*50)
    
    # Test different layer sizes (common in transformers)
    layer_configs = [
        (768, 768),    # BERT-base attention
        (768, 3072),   # BERT-base FFN
        (1024, 1024),  # Large attention
        (1024, 4096),  # Large FFN
        (2048, 2048),  # Very large attention
    ]
    
    ranks = [4, 8, 16, 32, 64]
    
    results = []
    
    for (in_dim, out_dim) in layer_configs:
        original = nn.Linear(in_dim, out_dim)
        original_params = sum(p.numel() for p in original.parameters())
        
        print(f"\n🔧 Layer: {in_dim} -> {out_dim} ({original_params:,} params)")
        print(f"   Rank | LoRA Params | Reduction | Memory MB")
        print(f"   -----|-------------|-----------|----------")
        
        for rank in ranks:
            # Calculate LoRA parameters
            lora_params = (in_dim * rank) + (out_dim * rank)
            reduction = original_params / lora_params
            memory_mb = lora_params * 4 / (1024 * 1024)  # 4 bytes per float32
            
            print(f"   {rank:>4} | {lora_params:>11,} | {reduction:>8.1f}x | {memory_mb:>9.2f}")
            
            results.append({
                'layer_size': f"{in_dim}x{out_dim}",
                'rank': rank,
                'original_params': original_params,
                'lora_params': lora_params,
                'reduction': reduction,
                'memory_mb': memory_mb
            })
    
    return results

efficiency_results = analyze_parameter_efficiency()
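
To make one row of the table concrete, here is the arithmetic for a 768 -> 768 layer at rank 8 as a small sketch. The helper names `linear_param_count` and `lora_param_count` are hypothetical and only mirror the formulas used in the cell above.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (in x out) plus an optional bias vector, as in nn.Linear."""
    return in_features * out_features + (out_features if bias else 0)


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable values in A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


full = linear_param_count(768, 768)       # 768*768 + 768 = 590,592
lora = lora_param_count(768, 768, 8)      # 8 * (768 + 768) = 12,288
reduction = full / lora                   # 48.0625
memory_mb = lora * 4 / (1024 * 1024)      # float32 storage: 0.046875 MB

print(f"{full:,} vs {lora:,} params -> {reduction:.1f}x, {memory_mb:.3f} MB")
```

For a square `d x d` layer the reduction is roughly `d / (2 * rank)`, which is why doubling the rank halves the reduction factor.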

In [None]:
# Visualize the efficiency results
import pandas as pd

# Convert to DataFrame for easier visualization
df = pd.DataFrame(efficiency_results)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Parameter reduction by rank for different layer sizes
unique_layers = df['layer_size'].unique()
colors = plt.cm.Set1(np.linspace(0, 1, len(unique_layers)))

ax1 = axes[0, 0]
for i, layer in enumerate(unique_layers):
    layer_data = df[df['layer_size'] == layer]
    ax1.plot(layer_data['rank'], layer_data['reduction'], 
             marker='o', linewidth=2, label=layer, color=colors[i])

ax1.set_xlabel('LoRA Rank')
ax1.set_ylabel('Parameter Reduction Factor')
ax1.set_title('Parameter Reduction vs Rank')
ax1.set_yscale('log')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Memory usage by rank
ax2 = axes[0, 1]
for i, layer in enumerate(unique_layers[:3]):  # Show only first 3 for clarity
    layer_data = df[df['layer_size'] == layer]
    ax2.plot(layer_data['rank'], layer_data['memory_mb'], 
             marker='s', linewidth=2, label=layer, color=colors[i])

ax2.set_xlabel('LoRA Rank')
ax2.set_ylabel('Memory Usage (MB)')
ax2.set_title('Memory Usage vs Rank')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Comparison at rank=16
rank16_data = df[df['rank'] == 16]
ax3 = axes[1, 0]
bars = ax3.bar(range(len(rank16_data)), rank16_data['reduction'])
ax3.set_xlabel('Layer Configuration')
ax3.set_ylabel('Parameter Reduction Factor')
ax3.set_title('Reduction Factor at Rank=16')
ax3.set_xticks(range(len(rank16_data)))
ax3.set_xticklabels(rank16_data['layer_size'], rotation=45)

# Add value labels on bars
for i, (bar, val) in enumerate(zip(bars, rank16_data['reduction'])):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
             f'{val:.1f}x', ha='center', va='bottom')

# Plot 4: Parameter count comparison
ax4 = axes[1, 1]
sample_layer = df[df['layer_size'] == '1024x1024']
original_count = sample_layer['original_params'].iloc[0]

ax4.plot(sample_layer['rank'], [original_count] * len(sample_layer), 
         'r--', linewidth=3, label='Original (Full Fine-tuning)')
ax4.plot(sample_layer['rank'], sample_layer['lora_params'], 
         'b-o', linewidth=2, markersize=6, label='LoRA')

ax4.set_xlabel('LoRA Rank')
ax4.set_ylabel('Number of Parameters')
ax4.set_title('Parameter Count: 1024x1024 Layer')
ax4.set_yscale('log')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print key insights
print("\n🎯 Key Insights:")
print(f"   • Rank 8-16 typically offers a good balance (about 24-128x reduction for the layers above)")
print(f"   • Larger layers benefit more from LoRA")
print(f"   • Memory savings are substantial even for small ranks")
print(f"   • FFN layers (wide) get better compression than attention layers (square)")
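
The table above counts only the LoRA weights themselves. During training, each trainable value also needs a gradient and, with Adam, two moment buffers. The sketch below estimates that extra state, assuming float32 and a standard Adam optimizer, and ignoring activations and the frozen weights; `training_state_bytes` is a hypothetical helper.

```python
def training_state_bytes(trainable_params: int, bytes_per_value: int = 4) -> int:
    """Gradient plus Adam's two moment buffers: three extra values per trainable parameter."""
    return 3 * trainable_params * bytes_per_value


# 1024 -> 1024 layer: everything trainable vs. only rank-8 A and B trainable
full_ft = training_state_bytes(1024 * 1024 + 1024)   # weights and bias
lora_r8 = training_state_bytes(8 * (1024 + 1024))     # A and B only

print(f"full fine-tuning: {full_ft / 1e6:.2f} MB of gradient/optimizer state")
print(f"LoRA (r=8):       {lora_r8 / 1e6:.2f} MB of gradient/optimizer state")
```

The ratio between the two is the same as the parameter reduction factor, since this state scales linearly with the number of trainable values.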

## 🔗 Part 4: Easy Integration with Existing Models

Let's create a helper function to easily add LoRA to any model:

In [None]:
def add_lora_to_model(model, target_modules=None, rank=8, alpha=16, dropout=0.0):
    """
    Add LoRA to specified modules in a model
    
    Args:
        model: The model to modify
        target_modules: List of substrings matched against module names (e.g., ['query', 'key', 'value'])
        rank: LoRA rank
        alpha: LoRA alpha scaling parameter
        dropout: LoRA dropout probability
    """
    if target_modules is None:
        target_modules = ['query', 'key', 'value', 'dense']  # Common transformer module names
    
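    lora_modules = []
    original_param_count = sum(p.numel() for p in model.parameters())
    
    # Freeze every existing weight so that only the LoRA matrices added below are trainable
    for param in model.parameters():
        param.requires_grad = False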
    
    print(f"🔧 Adding LoRA to model:")
    print(f"   Target modules: {target_modules}")
    print(f"   Rank: {rank}, Alpha: {alpha}, Dropout: {dropout}")
    print(f"   Original parameters: {original_param_count:,}")
    print()
    
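    # Walk through all modules. Take a snapshot with list() first, because we
    # replace modules inside the loop and should not mutate the tree mid-iteration
    for name, module in list(model.named_modules()):
        # Check if this module should get LoRA
        if any(target in name for target in target_modules) and isinstance(module, nn.Linear):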
            print(f"   ✅ Adding LoRA to: {name}")
            
            # Replace the module with LoRA version
            lora_module = LoRALayer(module, rank=rank, alpha=alpha, dropout=dropout)
            
            # Set the module in the model (this is a bit tricky with nested modules)
            parent_module = model
            module_parts = name.split('.')
            
            for part in module_parts[:-1]:
                parent_module = getattr(parent_module, part)
            
            setattr(parent_module, module_parts[-1], lora_module)
            lora_modules.append((name, lora_module))
    
    # Calculate new parameter counts
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"\n📊 Results:")
    print(f"   Total parameters: {total_params:,}")
    print(f"   Trainable parameters: {trainable_params:,} ({100*trainable_params/total_params:.2f}%)")
    print(f"   LoRA modules added: {len(lora_modules)}")
    
    return lora_modules

# Test with a simple transformer-like model
class SimpleTransformer(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        
        # Attention layers
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.output = nn.Linear(d_model, d_model)
        
        # Feed-forward layers
        self.ffn1 = nn.Linear(d_model, d_model * 4)
        self.ffn2 = nn.Linear(d_model * 4, d_model)
        
    def forward(self, x):
        # Simplified forward pass
        return x

# Create and modify model
model = SimpleTransformer(d_model=512, num_heads=8)
lora_modules = add_lora_to_model(
    model, 
    target_modules=['query', 'key', 'value'],  # Only attention, not FFN
    rank=16, 
    alpha=32
)
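
The parent lookup is the fiddly part of swapping a nested module. Here is a small sketch of the same idea on a toy `nn.Sequential`, assuming a PyTorch version that provides `Module.get_submodule` (1.9 or later); the plain `nn.Linear` replacement is only a stand-in for a LoRA wrapper.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 8),
    nn.Sequential(nn.ReLU(), nn.Linear(8, 2)),
)

target_name = "1.1"   # dotted path, in the form yielded by named_modules()

# Split "parent.child" into its two halves; a top-level module has no parent path
parent_name, _, child_name = target_name.rpartition(".")
parent = model.get_submodule(parent_name) if parent_name else model

old_module = getattr(parent, child_name)
new_module = nn.Linear(8, 2)              # stand-in for a LoRA wrapper
setattr(parent, child_name, new_module)   # re-registers the child under the same name

print(model)
```

Assigning through `setattr` on the parent is what updates the module tree; rebinding a local variable to the new module would leave the model unchanged.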

## ⚡ Part 5: Training with LoRA

Let's see how to actually train a model with LoRA:

In [None]:
def demonstrate_lora_training():
    """
    Demonstrate how to train with LoRA
    """
    print("🚀 LoRA Training Demonstration:")
    print("="*40)
    
    # Create a simple classification model
    class SimpleClassifier(nn.Module):
        def __init__(self, input_size=784, hidden_size=512, num_classes=10):
            super().__init__()
            self.layer1 = nn.Linear(input_size, hidden_size)
            self.layer2 = nn.Linear(hidden_size, hidden_size)
            self.layer3 = nn.Linear(hidden_size, num_classes)
            self.relu = nn.ReLU()
            
        def forward(self, x):
            x = self.relu(self.layer1(x))
            x = self.relu(self.layer2(x))
            return self.layer3(x)
    
    # Create model
    model = SimpleClassifier()
    
    # Add LoRA to specific layers
    lora_modules = add_lora_to_model(
        model, 
        target_modules=['layer1', 'layer2'],  # Apply to first two layers
        rank=8, 
        alpha=16
    )
    
    # Create optimizer - only train LoRA parameters!
    lora_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(lora_params, lr=1e-3)
    
    print(f"\n🎯 Training Setup:")
    print(f"   Optimizer parameters: {sum(p.numel() for p in lora_params):,}")
    print(f"   Memory for gradients: {sum(p.numel() for p in lora_params) * 4 / 1e6:.1f} MB")
    
    # Simulate training
    model.train()
    
    for epoch in range(3):  # Just a few steps for demonstration
        # Create dummy batch
        batch_x = torch.randn(32, 784)  # Batch of 32 samples
        batch_y = torch.randint(0, 10, (32,))  # Random labels
        
        # Forward pass
        outputs = model(batch_x)
        loss = F.cross_entropy(outputs, batch_y)
        
        # Backward pass - only LoRA parameters get gradients!
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print(f"   Epoch {epoch+1}: Loss = {loss.item():.4f}")
    
    print("\n✅ Training complete!")
    print("   • Only LoRA parameters were updated")
    print("   • Original weights remained frozen")
    print("   • Gradients and optimizer state were kept only for trainable parameters")

demonstrate_lora_training()
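
A quick way to confirm that the frozen weights really stay put is to snapshot them before training and compare afterwards. This is a minimal sketch with hand-rolled `A` and `B` tensors rather than the `LoRALayer` class; the sizes, learning rate and loss are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(8, 4)
for p in base.parameters():
    p.requires_grad = False                      # freeze W0 and the bias

rank = 2
A = nn.Parameter(torch.randn(rank, 8) * 0.01)    # small random init
B = nn.Parameter(torch.zeros(4, rank))           # zero init, so the update starts at zero

optimizer = torch.optim.Adam([A, B], lr=1e-2)    # only the low-rank matrices are optimized
w_before = base.weight.detach().clone()
b_before = B.detach().clone()

x = torch.randn(16, 8)
target = torch.randn(16, 4)
for _ in range(3):
    pred = base(x) + (x @ A.T) @ B.T             # frozen path + low-rank path
    loss = ((pred - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

frozen_unchanged = torch.equal(base.weight, w_before)
frozen_has_no_grad = base.weight.grad is None
lora_b_moved = not torch.equal(B.detach(), b_before)

print(frozen_unchanged, frozen_has_no_grad, lora_b_moved)
```

The same three checks can be run against any model after wrapping it, by snapshotting one of the frozen weights before the training loop.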

## 🎯 Part 6: Key Insights and Best Practices

Let's summarize what you've learned:

In [None]:
print("🎓 Key Insights from LoRA Implementation:")
print("="*50)
print()
print("1. 🏗️  ARCHITECTURE:")
print("   • LoRA adds trainable B and A matrices")
print("   • Original weights stay frozen")
print("   • Forward: output = W₀x + α/r * B(Ax)")
print()
print("2. 📊  PARAMETER EFFICIENCY:")
print("   • 10-1000x fewer trainable parameters")
print("   • Memory savings scale with layer size")
print("   • Rank 8-16 usually sufficient for most tasks")
print()
print("3. ⚙️  IMPLEMENTATION DETAILS:")
print("   • A initialized randomly, B initialized to zeros")
print("   • Alpha controls adaptation strength")
print("   • Don't materialize B@A during forward pass")
print()
print("4. 🚀  TRAINING BENEFITS:")
print("   • Fewer gradients to compute and parameters to update per step")
print("   • Lower memory requirements (no gradients or optimizer state for frozen weights)")
print("   • Often less prone to overfitting")
print("   • Can merge weights for deployment")
print()
print("5. 🎯  BEST PRACTICES:")
print("   • Start with rank=8-16, alpha=16-32")
print("   • Apply to attention layers first")
print("   • Use dropout for regularization")
print("   • Monitor LoRA parameter norms during training (the frozen weights do not change)")

# Create a final summary visualization
def create_summary_visualization():
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Memory comparison
    methods = ['Full Fine-tuning', 'LoRA (r=8)', 'LoRA (r=16)', 'LoRA (r=32)']
    memory_gb = [12.0, 0.8, 1.6, 3.2]  # Illustrative values for a large model, not measurements
    colors = ['red', 'lightblue', 'blue', 'darkblue']
    
    bars1 = ax1.bar(methods, memory_gb, color=colors)
    ax1.set_ylabel('Memory Usage (GB)')
    ax1.set_title('Training Memory Comparison (illustrative)')
    ax1.tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, val in zip(bars1, memory_gb):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                f'{val:.1f}GB', ha='center', va='bottom')
    
    # 2. Parameter efficiency by layer type
    layer_types = ['Small\n(256x256)', 'Medium\n(512x512)', 'Large\n(1024x1024)', 'XLarge\n(2048x2048)']
    reductions = [8, 16, 32, 64]  # d / (2 * rank) for square d x d layers at rank=16 (bias ignored)
    
    bars2 = ax2.bar(layer_types, reductions, color='green', alpha=0.7)
    ax2.set_ylabel('Parameter Reduction Factor')
    ax2.set_title('LoRA Efficiency by Layer Size (rank=16)')
    ax2.set_yscale('log')
    
    for bar, val in zip(bars2, reductions):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() * 1.1, 
                f'{val}x', ha='center', va='bottom')
    
    # 3. Rank vs performance trade-off (conceptual)
    ranks = [1, 2, 4, 8, 16, 32, 64, 128]
    performance = [0.7, 0.8, 0.85, 0.9, 0.95, 0.97, 0.98, 0.99]  # Conceptual relative performance
    efficiency = [1/r for r in ranks]  # Inverse of rank (efficiency)
    
    ax3_twin = ax3.twinx()
    line1 = ax3.plot(ranks, performance, 'b-o', label='Performance', linewidth=2)
    line2 = ax3_twin.plot(ranks, efficiency, 'r-s', label='Efficiency (1/rank)', linewidth=2)
    
    ax3.set_xlabel('LoRA Rank')
    ax3.set_ylabel('Relative Performance', color='blue')
    ax3_twin.set_ylabel('Efficiency (1/rank)', color='red')
    ax3.set_title('Performance vs Efficiency Trade-off')
    ax3.set_xscale('log', base=2)
    
    # Add legend
    lines1, labels1 = ax3.get_legend_handles_labels()
    lines2, labels2 = ax3_twin.get_legend_handles_labels()
    ax3.legend(lines1 + lines2, labels1 + labels2, loc='center right')
    
    # 4. LoRA architecture diagram
    ax4.text(0.5, 0.9, 'LoRA Forward Pass', ha='center', va='top', fontsize=14, fontweight='bold')
    ax4.text(0.5, 0.8, 'y = W₀x + (α/r) × B × (A × x)', ha='center', va='center', 
            fontsize=12, bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue"))
    ax4.text(0.5, 0.65, '↓', ha='center', va='center', fontsize=20)
    ax4.text(0.1, 0.5, 'Original\nW₀x\n(frozen)', ha='center', va='center', 
            bbox=dict(boxstyle="round,pad=0.3", facecolor="lightcoral"))
    ax4.text(0.5, 0.5, '+', ha='center', va='center', fontsize=20)
    ax4.text(0.9, 0.5, 'LoRA\nBAx\n(trainable)', ha='center', va='center', 
            bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen"))
    ax4.text(0.5, 0.3, '↓', ha='center', va='center', fontsize=20)
    ax4.text(0.5, 0.15, 'Final Output', ha='center', va='center', 
            bbox=dict(boxstyle="round,pad=0.3", facecolor="gold"))
    
    ax4.set_xlim(0, 1)
    ax4.set_ylim(0, 1)
    ax4.set_xticks([])
    ax4.set_yticks([])
    ax4.set_title('LoRA Architecture')
    
    plt.tight_layout()
    plt.show()

create_summary_visualization()
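
The "low-rank" part of the name can be checked numerically: the full-size update built from `B` and `A` never has rank above `r`, even though it has the same shape as the original weight. A small NumPy sketch, with sizes and names chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 48, 4, 8

A = rng.normal(size=(rank, d_in))
B = rng.normal(size=(d_out, rank))
delta_w = (alpha / rank) * (B @ A)        # full-size update, shape (d_out, d_in)

update_rank = int(np.linalg.matrix_rank(delta_w))
full_entries = delta_w.size               # values in the dense update
stored_entries = A.size + B.size          # values LoRA actually stores and trains

print(f"shape={delta_w.shape}, rank={update_rank}, "
      f"stored {stored_entries} of {full_entries} values")
```

The scaling factor `alpha / rank` changes the magnitude of the update but not its rank.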

## 🎉 Congratulations!

You've successfully built LoRA from scratch! Here's what you accomplished:

### ✅ **What You Built:**
1. **Complete LoRA implementation** - Core layer with all features
2. **Integration helpers** - Easy functions to add LoRA to any model
3. **Training examples** - How to train only LoRA parameters
4. **Analysis tools** - Parameter efficiency and memory calculations

### 🧠 **What You Learned:**
1. **Mathematical foundation** - How B×A decomposition works
2. **Implementation details** - Initialization, scaling, efficient computation
3. **Parameter efficiency** - Why LoRA trains 10-1000x fewer parameters
4. **Practical aspects** - How to actually use LoRA in training

### 🚀 **Next Steps:**
In **Step 3**, we'll:
- Use LoRA with real pre-trained models (BERT, RoBERTa)
- Implement QLoRA for even more memory savings
- Build a complete email classification system

### 💡 **Quick Self-Check:**
- Can you explain why LoRA is memory efficient?
- Do you understand the B×A decomposition?
- Can you implement a basic LoRA layer from scratch?

If yes, you're ready for Step 3! 🎯