In [None]:
# LoRA

## Context

[BERT](https://arxiv.org/abs/1810.04805) and [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) popularized the approach of pre-training on large amounts of internet data, then fine-tuning on smaller datasets to perform specialized tasks. This approach of first teaching a model a broad skill (e.g. language modeling), and then re-using that model to learn a more specific task (e.g. sentiment analysis) is called transfer learning.

Initially, fine-tuning meant updating every weight in the model, so each new task produced its own full-size copy of the model. This approach becomes extremely expensive in compute and memory as models grow.

### Enter Adapters

[Adapters](https://arxiv.org/abs/1902.00751) introduced a more efficient approach by inserting small bottleneck layers between existing transformer layers. These adapter modules add only a few trainable parameters per task while keeping the original pre-trained weights completely frozen.

Adapters work by:
- Keeping the original pre-trained weights **frozen**
- Adding small bottleneck layers between existing layers
- Only training these new adapter parameters

### Other Parameter-Efficient Approaches

Other approaches like prefix tuning emerged, which keep the model weights frozen and instead optimize a small set of continuous vectors that act as trainable prompt tokens.

These methods were more efficient than full fine-tuning, but they had limitations:
1. **Inference Latency**: Adapters add extra layers, increasing computational overhead
2. **Worse performance**: These methods sometimes underperform full fine-tuning

Ideally, we'd like to keep the same architecture as the original model, so as to avoid the latency problem. We'd also like to avoid updating the entire model, so as to avoid the high computational cost.


In [None]:
## LoRA 

**LoRA (Low-Rank Adaptation)** ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)) provides a solution to this problem.

When we fine-tune a model, we can view the result as the original weights $W_0$ plus a weight update $\Delta W$. The LoRA paper hypothesizes that $\Delta W$ has a much lower rank than the original weight matrix $W_0$, meaning the change can be expressed without updating every single weight. This is the insight behind LoRA: freeze $W_0$ and learn only a low-rank $\Delta W$.

Instead of updating the full weight matrix:
$$W_{new} = W_0 + \Delta W$$

LoRA decomposes the update into two low-rank matrices:
$$W_{new} = W_0 + \Delta W = W_0 + BA$$

Where:
- $W_0 \in \mathbb{R}^{d \times k}$ is the original frozen weight matrix
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable low-rank matrices. $B$ is initialized to zero and $A$ to random values, so $\Delta W = BA$ is zero at the start of training.
- $r \ll \min(d, k)$ is the rank (typically 1-64)
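
To make the shapes concrete, here is a minimal sketch in plain Python (no PyTorch) that builds $A$ and $B$ for toy sizes and checks that $\Delta W = BA$ starts at zero. The sizes `d`, `k`, `r` and the helper `matmul` are illustrative assumptions, and the initialization (random $A$, zero $B$) follows the `LoRALayer` implementation later in this notebook.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d, k, r = 4, 3, 2  # toy sizes, chosen only for illustration
random.seed(0)

# A: (r x k), random values; B: (d x r), zeros
A = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]

# delta_W has the same shape as W_0 (d x k), but is built from r * (d + k) numbers
delta_W = matmul(B, A)

print(len(delta_W), len(delta_W[0]))                  # 4 3
print(all(v == 0.0 for row in delta_W for v in row))  # True
```

Because $B$ starts at zero, the adapted model initially produces exactly the same outputs as the pre-trained one, and training moves away from that starting point only as $B$ is updated.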

### Parameter Reduction

For a weight matrix of size $d \times k$:
- **Full fine-tuning**: $d \times k$ parameters
- **LoRA**: $r \times (d + k)$ parameters

**Reduction factor**: $\frac{d \times k}{r \times (d + k)}$

For large matrices and small ranks, this can mean **hundreds of times fewer parameters**. For example, $d = k = 4096$ with $r = 8$ gives a 256x reduction.
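
The formulas above are easy to check numerically. The sketch below uses a hypothetical helper, `lora_param_counts`, and an assumed matrix size of $4096 \times 4096$ to show how the reduction factor depends on the rank.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, reduction factor) for a d x k weight matrix."""
    full = d * k        # parameters updated by full fine-tuning
    lora = r * (d + k)  # parameters in B (d x r) plus A (r x k)
    return full, lora, full / lora

# 4096 x 4096 is an assumed size, used only for illustration
for r in (1, 8, 64):
    full, lora, factor = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: full={full:,}  lora={lora:,}  reduction={factor:.0f}x")
```

The savings depend on both the matrix size and the rank: on a small layer, or with a rank close to $\min(d, k)$, the LoRA matrices can hold as many parameters as the layer itself, or more.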


In [None]:
## Toy Example

We'll implement a toy example that demonstrates LoRA. We'll create a simple neural network, "pre-train" it on one task, then use LoRA to adapt it to a new task.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional
import math

torch.manual_seed(5)
np.random.seed(5)

print("libraries imported!")


In [None]:
### Step 1: Implement LoRA Layer

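Let's start by implementing a LoRA layer that can wrap any linear layer. Its forward pass adds a scaled low-rank path to the frozen layer's output, $h = W_0 x + b + \frac{\alpha}{r} BAx$, where $\alpha$ is a constant scaling factor and $r$ is the rank: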


In [None]:
class LoRALayer(nn.Module):
    """
    LoRA (Low-Rank Adaptation) layer that wraps a linear layer.
    
    Args:
        original_layer: The original nn.Linear layer to adapt
        rank: The rank of the adaptation (r in the paper)
        alpha: Scaling factor for the adaptation
    """
    
    def __init__(self, original_layer: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        
        # freeze the original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False
        
        # get dimensions
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # create low-rank matrices
        # A: (rank, in_features) - initialized with random values
        # B: (out_features, rank) - initialized with zeros
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) / math.sqrt(rank))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        print(f"🔧 LoRA layer created:")
        print(f"   Original params: {in_features * out_features:,}")
        print(f"   LoRA params: {rank * (in_features + out_features):,}")
        print(f"   Reduction: {(in_features * out_features) / (rank * (in_features + out_features)):.1f}x")
    
    def forward(self, x):
        # original output
        original_output = self.original_layer(x)
        
        # lora adaptation: x @ A^T @ B^T = x @ (BA)^T
        lora_output = x @ self.lora_A.T @ self.lora_B.T
        
        # combine with scaling
        return original_output + (self.alpha / self.rank) * lora_output
    
    def merge_weights(self):
        """merge lora weights into original layer for inference"""
        with torch.no_grad():
            # compute the low-rank update
            delta_w = (self.alpha / self.rank) * (self.lora_B @ self.lora_A)
            # add to original weights
            self.original_layer.weight.data += delta_w
            # zero out lora parameters
            self.lora_A.data.zero_()
            self.lora_B.data.zero_()
        print("✅ LoRA weights merged into original layer")

print("🏗️ LoRA layer implementation complete!")


In [None]:
### Step 2: Create a Simple "Pre-trained" Model

Let's create a simple neural network and "pre-train" it on a regression task:


In [None]:
class SimpleNet(nn.Module):
    """a simple neural network for our toy example"""
    
    def __init__(self, input_size=10, hidden_size=64, output_size=1):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.layer2 = nn.Linear(hidden_size, hidden_size)
        self.layer3 = nn.Linear(hidden_size, output_size)
        self.activation = nn.ReLU()
    
    def forward(self, x):
        x = self.activation(self.layer1(x))
        x = self.activation(self.layer2(x))
        x = self.layer3(x)
        return x

# create and "pre-train" the model
print("🧠 Creating pre-trained model...")
pretrained_model = SimpleNet()

# generate synthetic "pre-training" data
# task: learn to compute the sum of inputs
def generate_sum_data(n_samples=1000, input_size=10):
    X = torch.randn(n_samples, input_size)
    y = X.sum(dim=1, keepdim=True)  # sum of all inputs
    return X, y

X_pretrain, y_pretrain = generate_sum_data()

# "pre-train" the model
optimizer = optim.Adam(pretrained_model.parameters(), lr=0.01)
criterion = nn.MSELoss()

print("🚀 Pre-training model to learn sum function...")
for epoch in range(100):
    optimizer.zero_grad()
    outputs = pretrained_model(X_pretrain)
    loss = criterion(outputs, y_pretrain)
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 20 == 0:
        print(f"  Epoch {epoch+1}/100, Loss: {loss.item():.4f}")

print("✅ Pre-training complete!")

# test the pre-trained model
with torch.no_grad():
    test_input = torch.randn(1, 10)
    predicted = pretrained_model(test_input).item()
    actual = test_input.sum().item()
    print(f"\n📊 Pre-trained model test:")
    print(f"   Input sum: {actual:.3f}")
    print(f"   Predicted: {predicted:.3f}")
    print(f"   Error: {abs(predicted - actual):.3f}")


In [None]:
### Step 3: Adapt with LoRA for a New Task

Now let's use LoRA to adapt our pre-trained model to a new task: computing the **product** of the first two inputs!


In [None]:
# create a copy of the pre-trained model for lora adaptation
lora_model = SimpleNet()
lora_model.load_state_dict(pretrained_model.state_dict())

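# wrap the last layer with lora (this is where we'll adapt)
# note: layer3 is only 64 -> 1, so rank=4 exceeds min(d, k) = 1 and the adapter
# holds more parameters (260) than the layer's weight matrix (64); LoRA's savings
# only show up on large weight matrices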
print("🔧 Adding LoRA to the final layer...")
lora_model.layer3 = LoRALayer(lora_model.layer3, rank=4, alpha=16)

# generate new task data: predict product of first two elements
def generate_product_data(n_samples=1000, input_size=10):
    X = torch.randn(n_samples, input_size)
    y = (X[:, 0] * X[:, 1]).unsqueeze(1)  # product of first two elements
    return X, y

X_finetune, y_finetune = generate_product_data()

print("📈 New task: Learn to compute product of first two inputs")
print(f"   Training samples: {len(X_finetune)}")
print(f"   Example: inputs {X_finetune[0][:2].tolist()}, target: {y_finetune[0].item():.3f}")


In [None]:
### Step 4: Train LoRA (Only LoRA Parameters!)

Now let's train only the LoRA parameters while keeping the original weights frozen:


In [None]:
# count trainable parameters
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)

print(f"📊 Parameter breakdown:")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable (LoRA): {trainable_params:,}")
print(f"   Frozen: {total_params - trainable_params:,}")
print(f"   Training only {100 * trainable_params / total_params:.1f}% of parameters!")

# setup optimizer for only trainable parameters
lora_optimizer = optim.Adam(filter(lambda p: p.requires_grad, lora_model.parameters()), lr=0.01)

print("\n🎯 Training LoRA adaptation...")
losses = []

for epoch in range(150):
    lora_optimizer.zero_grad()
    outputs = lora_model(X_finetune)
    loss = criterion(outputs, y_finetune)
    loss.backward()
    lora_optimizer.step()
    
    losses.append(loss.item())
    
    if (epoch + 1) % 30 == 0:
        print(f"  Epoch {epoch+1}/150, Loss: {loss.item():.4f}")

print("✅ LoRA training complete!")


In [None]:
### Step 5: Compare Performance

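Let's compare how our LoRA-adapted model performs on both the original task and the new task. Because the adapter changes the final layer's output, we should expect the adapted model's error on the original task to go up; it is the frozen base weights that stay intact, not the adapted model's behavior on the old task.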


In [None]:
# test on both tasks
with torch.no_grad():
    # generate test data
    test_X = torch.randn(10, 10)
    
    print("🧪 Testing on both tasks:")
    print("\n" + "="*60)
    print("ORIGINAL TASK (Sum of all inputs):")
    print("="*60)
    
    original_outputs = pretrained_model(test_X)
    lora_outputs = lora_model(test_X)
    true_sums = test_X.sum(dim=1, keepdim=True)
    
    orig_error = torch.mean((original_outputs - true_sums) ** 2).item()
    lora_error = torch.mean((lora_outputs - true_sums) ** 2).item()
    
    print(f"Original model MSE: {orig_error:.4f}")
    print(f"LoRA model MSE: {lora_error:.4f}")
    print(f"Performance change: {((lora_error - orig_error) / orig_error * 100):+.1f}%")
    
    print("\n" + "="*60)
    print("NEW TASK (Product of first two inputs):")
    print("="*60)
    
    true_products = (test_X[:, 0] * test_X[:, 1]).unsqueeze(1)
    
    orig_prod_error = torch.mean((original_outputs - true_products) ** 2).item()
    lora_prod_error = torch.mean((lora_outputs - true_products) ** 2).item()
    
    print(f"Original model MSE: {orig_prod_error:.4f}")
    print(f"LoRA model MSE: {lora_prod_error:.4f}")
    print(f"Improvement: {((orig_prod_error - lora_prod_error) / orig_prod_error * 100):+.1f}%")
    
    # show some examples
    print("\n📋 Example predictions (first 3 samples):")
    for i in range(3):
        x_vals = test_X[i][:2]
        true_prod = true_products[i].item()
        orig_pred = original_outputs[i].item()
        lora_pred = lora_outputs[i].item()
        
        print(f"  Sample {i+1}: inputs=({x_vals[0]:.2f}, {x_vals[1]:.2f})")
        print(f"    True product: {true_prod:.3f}")
        print(f"    Original: {orig_pred:.3f} (error: {abs(orig_pred - true_prod):.3f})")
        print(f"    LoRA: {lora_pred:.3f} (error: {abs(lora_pred - true_prod):.3f})")
        print()


In [None]:
### Step 6: Visualize Training Progress


In [None]:
# plot training loss
plt.figure(figsize=(10, 6))
plt.plot(losses, 'b-', linewidth=2, alpha=0.8)
plt.title('LoRA Training Progress', fontsize=16, fontweight='bold')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()

print(f"📉 Final loss: {losses[-1]:.6f}")


In [None]:
### Step 7: Demonstrate Weight Merging

One of LoRA's key advantages is that we can merge the adaptation back into the original weights for deployment:
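
Merging works because the low-rank path is linear: $W_0 x + \frac{\alpha}{r} B(Ax) = (W_0 + \frac{\alpha}{r} BA)x$. The sketch below checks this identity in plain Python with small made-up matrices. All values and the helpers `matvec` and `matmul` are illustrative assumptions, not part of the notebook's model.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# toy values with d=2, k=3, r=1 (hypothetical, for illustration only)
W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0, 1.0]]
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 1.0, 2.0]

# unmerged: run the frozen layer and the low-rank path separately, then add
low_rank = matvec(B, matvec(A, x))
unmerged = [w + scale * u for w, u in zip(matvec(W0, x), low_rank)]

# merged: fold the scaled update into the weight matrix once, then do one matmul
BA = matmul(B, A)
W_merged = [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]
merged = matvec(W_merged, x)

print(unmerged)  # [13.0, 13.0]
print(merged)    # [13.0, 13.0]
```

After merging, inference needs only the single matrix `W_merged`, which has the same shape as the original weight, so a deployment that swaps in the merged matrix pays no extra cost per forward pass.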


In [None]:
# test before merging
with torch.no_grad():
    test_input = torch.randn(1, 10)
    before_merge = lora_model(test_input).item()
    
print(f"🔍 Before merging: {before_merge:.6f}")

# merge the weights
lora_model.layer3.merge_weights()

# test after merging
with torch.no_grad():
    after_merge = lora_model(test_input).item()
    
print(f"🔍 After merging: {after_merge:.6f}")
print(f"✅ Difference: {abs(before_merge - after_merge):.8f} (should be ~0)")

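print("\n🎯 Key Benefits Demonstrated:")
# use the fraction computed in Step 4 rather than a hardcoded number
print(f"   ✅ Trained only {100 * trainable_params / total_params:.1f}% of parameters")
print("   ✅ Kept the pre-trained weights frozen during training")
print("   ✅ Adapted the model to a new task")
print("   ✅ Can merge weights for zero-overhead inference")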


In [None]:
## Key Takeaways

### 🎯 What We've Learned

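1. **LoRA is parameter-efficient**: We trained only about 5% of the toy model's parameters (260 of 5,189), and the fraction shrinks further on large weight matrices
2. **The base weights are preserved**: The pre-trained weights stayed frozen during training, so the original behavior can be recovered by removing the adapter before it is merged
3. **New tasks can be learned**: LoRA adapted the model toward a different task by training only the adapter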
4. **Zero inference overhead**: After merging, there's no computational penalty
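
The trainable fraction can be reproduced by hand from the layer sizes used in this notebook (`SimpleNet` with its default sizes, and `LoRALayer` with `rank=4` on `layer3`). The helper `linear_params` below is a hypothetical name introduced for this check.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# frozen base model: 10 -> 64 -> 64 -> 1
base = linear_params(10, 64) + linear_params(64, 64) + linear_params(64, 1)

# LoRA on layer3 (64 -> 1) with rank 4: A is 4 x 64, B is 1 x 4
lora = 4 * (64 + 1)

total = base + lora
print(f"base={base:,}  lora={lora:,}  total={total:,}")
print(f"trainable fraction: {100 * lora / total:.1f}%")  # 5.0%
```

The adapter here sits on a tiny 64-to-1 layer, so the toy model understates the savings that LoRA gives on the large projection matrices of a real transformer.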

### 🧠 Why LoRA Works

The success of LoRA is based on the **intrinsic dimensionality hypothesis**: when adapting pre-trained models, the necessary changes often lie in a much lower-dimensional space than the full parameter space.

### 🚀 Real-World Applications

LoRA has been successfully applied to:
- **Large Language Models** (GPT, BERT, T5)
- **Computer Vision** (Vision Transformers, CNNs)
- **Multi-modal Models** (CLIP, DALL-E)
- **Speech Models** (Whisper, WaveNet)

### 🔮 Future Directions

Research continues to improve on LoRA:
- **AdaLoRA**: Adaptive rank allocation
- **QLoRA**: Quantized LoRA for even more efficiency
- **DoRA**: Weight-decomposed low-rank adaptation, which separates each weight into magnitude and direction
- **Multi-task LoRA**: Sharing adaptations across related tasks

---

### 📚 References

- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) - Hu et al., 2021
- [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/abs/1902.00751) - Houlsby et al., 2019
- [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691) - Lester et al., 2021

**Happy LoRA-ing! 🎉**
