In [None]:
# Google Colab Setup (run this cell only if you're in Colab)
import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("🔧 Running in Google Colab - Setting up environment...")
    if not os.path.exists('transformer_from_scratch'):
        print("📥 Cloning repository...")
        !git clone https://github.com/melhzy/transformer_from_scratch.git
        print("✅ Repository cloned!")
    os.chdir('transformer_from_scratch')
    print("📦 Installing dependencies...")
    !pip install -q torch torchvision matplotlib seaborn numpy pandas transformers datasets peft
    print("✅ Dependencies installed!")
    if '/content/transformer_from_scratch' not in sys.path:
        sys.path.insert(0, '/content/transformer_from_scratch')
    print("✅ Setup complete! Ready to run the tutorial.")
else:
    print("💻 Running locally - no setup needed.")

In [None]:
# Import libraries
import sys
import os
from pathlib import Path

# Add project root to path
if not IN_COLAB:
    sys.path.insert(0, str(Path.cwd().parent))

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import our transformer implementation
from src.transformer import Transformer
from src.modules.embeddings import TokenEmbedding
from src.modules.encoder import TransformerEncoder
from src.modules.decoder import TransformerDecoder

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Device: {device}")
print(f"✅ PyTorch version: {torch.__version__}")

## 1. What is Fine-Tuning? 🎯

### Definition

**Fine-tuning** is the process of taking a pre-trained model and adapting it to a specific task or domain by continuing training on task-specific data.

### Why Fine-Tune?

1. **Pre-training is expensive**: Training a model from scratch requires massive compute (millions of GPU hours)
2. **Transfer learning works**: Models learn general language understanding that transfers to specific tasks
3. **Customization**: Adapt models to your specific use case, style, or domain
4. **Data efficiency**: Achieve good performance with relatively small datasets

### Pre-training vs Fine-tuning

```
PRE-TRAINING:
├── Data: Massive unlabeled text (trillions of tokens)
├── Task: Next-token prediction / Masked language modeling
├── Time: Weeks to months
├── Cost: $1M - $100M+
└── Result: General-purpose language model

FINE-TUNING:
├── Data: Task-specific labeled data (thousands to millions of examples)
├── Task: Specific (classification, QA, summarization, chat, etc.)
├── Time: Hours to days
├── Cost: $10 - $10,000
└── Result: Specialized model for your task
```

---

## 2. Fine-Tuning Strategies 🛠️

### Strategy 1: Full Fine-Tuning

**What**: Update all model parameters during training

**Pros:**
- Maximum adaptation to your task
- Best performance potential
- Full control over model behavior

**Cons:**
- High memory requirements (need to store gradients for all parameters)
- Risk of catastrophic forgetting
- Expensive (GPU memory and time)

**When to use:**
- Small models (<1B parameters)
- Large dataset available
- Maximum performance needed
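
To make the memory cost of full fine-tuning concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with AdamW, where each parameter carries 16-bit weights and gradients plus 32-bit master weights and two 32-bit optimizer moments. These byte counts are a common rule of thumb rather than a measurement, and activations are ignored, so real usage will be higher.

```python
# Rough estimate of training-state memory for full fine-tuning.
# Assumption: mixed-precision AdamW; activations and framework overhead are ignored.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_memory_gb(n_params: int) -> float:
    """Return the estimated training-state memory in GB (1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9

for n_params in (125_000_000, 1_000_000_000, 7_000_000_000):
    gb = full_finetune_memory_gb(n_params)
    print(f"{n_params / 1e9:.3f}B params -> ~{gb:.0f} GB of training state")
```

Under these assumptions the cost is about 16 bytes per parameter, which is why full fine-tuning is usually reserved for smaller models.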

---

### Strategy 2: Parameter-Efficient Fine-Tuning (PEFT)

**What**: Update only a small subset of parameters

**Techniques:**
1. **Adapter Layers**: Insert small trainable modules between frozen layers
2. **Prefix Tuning**: Add trainable prefix tokens to input
3. **Prompt Tuning**: Learn soft prompts (continuous embeddings)
4. **LoRA** (Low-Rank Adaptation): Most popular - explained below

**Pros:**
- Much lower memory requirements
- Faster training
- Can maintain multiple task-specific adapters
- Less catastrophic forgetting

**Cons:**
- Slightly lower performance than full fine-tuning
- More complex implementation
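
As an illustration of the adapter idea, the following is a minimal sketch of a bottleneck adapter placed after a frozen layer. The class name `BottleneckAdapter`, the bottleneck width of 64, and the zero initialization of the up-projection are assumptions chosen for this example, not part of this repository.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Hypothetical adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for a pre-trained layer: freeze it, then train only the adapter.
frozen_layer = nn.Linear(512, 512)
for p in frozen_layer.parameters():
    p.requires_grad = False

adapter = BottleneckAdapter(d_model=512)
x = torch.randn(2, 10, 512)
out = adapter(frozen_layer(x))

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in frozen_layer.parameters())
print(f"trainable={trainable:,} frozen={frozen:,}")
```

Because the up-projection starts at zero, the adapted model initially reproduces the frozen layer's output and only drifts away from it as the adapter is trained.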

---

### Strategy 3: LoRA (Low-Rank Adaptation) ⭐

**Key Insight**: Weight updates during fine-tuning tend to have low intrinsic rank, so they can be approximated well by a low-rank matrix

Instead of updating full weight matrix $W \in \mathbb{R}^{d \times k}$:

$$W' = W + \Delta W$$

Decompose update into low-rank matrices:

$$W' = W + BA$$

Where:
- $W$ is frozen (pre-trained weights)
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$
- $r \ll \min(d, k)$ (rank is much smaller, typically r=8, 16, 32)

**Parameters to train**: $d \times r + r \times k$ instead of $d \times k$

**Example**: For d=4096, k=4096, r=16:
- Full: 16,777,216 parameters
- LoRA: 131,072 parameters (0.78% of full!)
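
The arithmetic in this example can be reproduced with a few lines of Python. The helper name `lora_param_count` is introduced here only for illustration.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k

d, k = 4096, 4096
full = d * k  # parameters in the full weight matrix
for r in (8, 16, 32):
    n = lora_param_count(d, k, r)
    print(f"r={r:>2}: {n:>9,} params ({n / full:.2%} of {full:,})")
```

The count grows linearly with the rank, so doubling `r` doubles the number of trainable parameters.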

**Pros:**
- Extremely memory efficient
- No additional inference latency (can merge weights)
- Easy to switch between tasks
- Surprisingly good performance
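
The "no additional inference latency" point follows from the fact that $BA$ can be folded into $W$ once training is done. A minimal sketch, using small illustrative sizes and omitting the scaling factor for brevity:

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 48, 4                              # small illustrative sizes
W = torch.randn(d, k, dtype=torch.float64)       # frozen pre-trained weight
B = torch.randn(d, r, dtype=torch.float64)       # stands in for a trained B
A = torch.randn(r, k, dtype=torch.float64)       # stands in for a trained A
x = torch.randn(5, k, dtype=torch.float64)       # batch of 5 inputs

# Unmerged: base path plus a separate low-rank path (two extra matmuls).
y_unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged: fold BA into W once, then inference is a single matmul.
W_merged = W + B @ A
y_merged = x @ W_merged.T

print("outputs match:", torch.allclose(y_unmerged, y_merged))
```

Keeping `B` and `A` separate instead makes it cheap to swap adapters between tasks, since the base weight `W` is never modified.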

---

### Strategy 4: QLoRA (Quantized LoRA) 🚀

**What**: LoRA + 4-bit quantization of base model

**Key Innovation** (by Tim Dettmers et al.):
1. Quantize base model to 4-bit (NF4 - Normal Float 4)
2. Use double quantization for quantization constants
3. Paged optimizers to handle memory spikes
4. Train LoRA adapters in 16-bit

**Memory Savings**:
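- 16-bit model: ~14GB for the weights of a 7B model alone (2 bytes per parameter); fine-tuning needs considerably more for gradients, optimizer state, and activations
- 4-bit + LoRA: 5-6GB for a 7B model!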
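
The weight-storage part of these figures can be estimated from the number of bits per parameter. This sketch counts the base weights only; it ignores quantization constants, the LoRA adapters, gradients, optimizer state, and activations, so the totals seen in practice are higher.

```python
def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7_000_000_000  # nominal 7B model
for label, bits in (("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label:>9}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```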

**When to use:**
- Limited GPU memory (single consumer GPU)
- Large models (7B, 13B, 70B parameters)
- Quick iteration and experimentation

**This is what Unsloth AI optimizes!**

---

## 3. Comparison Table 📊

| Strategy | Trainable Params | Memory | Training Time | Performance | Use Case |
|----------|------------------|---------|---------------|-------------|----------|
| **Full Fine-Tuning** | 100% | Very High | Slow | Best | Small models, unlimited resources |
| **PEFT (Adapters)** | 1-5% | Medium | Medium | Good | Multiple tasks, moderate resources |
| **LoRA** | 0.1-1% | Low | Fast | Very Good | Most common choice, practical |
| **QLoRA** | 0.1-1% | Very Low | Fast | Very Good | Large models, limited GPU |

---

In [None]:
# Let's visualize the parameter efficiency
import pandas as pd

# Example: 7B parameter model
total_params = 7_000_000_000

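# Illustrative trainable fractions (assumed for plotting); real values depend on
# the rank and on which modules are adapted
strategies = {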
    'Full Fine-Tuning': 1.0,
    'Adapter Layers': 0.03,
    'LoRA (r=8)': 0.005,
    'LoRA (r=16)': 0.01,
    'LoRA (r=32)': 0.02,
}

data = []
for strategy, ratio in strategies.items():
    trainable = int(total_params * ratio)
    frozen = total_params - trainable
    data.append({
        'Strategy': strategy,
        'Trainable (M)': trainable / 1_000_000,
        'Frozen (M)': frozen / 1_000_000,
        'Percentage': f"{ratio * 100:.2f}%"
    })

df = pd.DataFrame(data)
print("\n📊 Parameter Efficiency Comparison (7B Model)\n")
print(df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
strategies_list = [d['Strategy'] for d in data]
trainable = [d['Trainable (M)'] for d in data]
frozen = [d['Frozen (M)'] for d in data]

x = np.arange(len(strategies_list))
width = 0.6

ax.bar(x, frozen, width, label='Frozen Parameters', color='lightblue', alpha=0.7)
ax.bar(x, trainable, width, label='Trainable Parameters', color='orange', bottom=frozen)

ax.set_ylabel('Parameters (Millions)', fontsize=12)
ax.set_title('Parameter Efficiency: Fine-Tuning Strategies (7B Model)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(strategies_list, rotation=15, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: LoRA trains only a small fraction of parameters (0.5-2% in this illustration) while largely maintaining performance!")

## 4. Understanding LoRA Implementation 🔧

Let's implement a simple LoRA layer to understand how it works.

In [None]:
import torch.nn.functional as F  # provides F.linear, used in forward()


class LoRALayer(nn.Module):
    """
    Low-Rank Adaptation (LoRA) Layer
    
    Adds trainable low-rank matrices to frozen pre-trained weights.
    """
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Frozen pre-trained weight (simulated)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        
        # LoRA low-rank matrices (trainable)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))  # (r, in)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # (out, r)
        
        # Initialize A with Kaiming uniform, B with zeros
        nn.init.kaiming_uniform_(self.lora_A, a=np.sqrt(5))
        
    def forward(self, x):
        # Original frozen path
        result = F.linear(x, self.weight)
        
        # LoRA path: x @ A^T @ B^T * scaling
        lora_result = F.linear(F.linear(x, self.lora_A), self.lora_B)
        lora_result = lora_result * self.scaling
        
        return result + lora_result
    
    def merge_weights(self):
        """Return the merged weight W + scaling * (B @ A) for inference; the layer itself is not modified"""
        merged = self.weight + (self.lora_B @ self.lora_A) * self.scaling
        return merged


# Test LoRA layer
d_model = 512
lora_rank = 8

lora_layer = LoRALayer(d_model, d_model, rank=lora_rank)

# Count parameters
frozen_params = sum(p.numel() for p in lora_layer.parameters() if not p.requires_grad)
trainable_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
total_params = frozen_params + trainable_params

print("\n📊 LoRA Layer Analysis:")
print(f"Frozen parameters: {frozen_params:,}")
print(f"Trainable parameters (LoRA): {trainable_params:,}")
print(f"Total parameters: {total_params:,}")
print(f"Trainable ratio: {trainable_params / total_params * 100:.2f}%")
print(f"\nReduction: {frozen_params / trainable_params:.1f}x fewer trainable than frozen params!")

# Test forward pass
x = torch.randn(2, 10, d_model)
output = lora_layer(x)
print(f"\n✅ Input shape: {x.shape}")
print(f"✅ Output shape: {output.shape}")
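
One consequence of initializing B with zeros is that the LoRA path contributes nothing at the start of training, so the adapted layer initially behaves exactly like the pre-trained one. The following self-contained sketch checks this with small, arbitrary sizes; the variable names mirror the layer above but the tensors are created independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features, rank, alpha = 32, 32, 4, 8.0

weight = torch.randn(out_features, in_features)          # frozen base weight
lora_A = nn.Parameter(torch.randn(rank, in_features))    # random init
lora_B = nn.Parameter(torch.zeros(out_features, rank))   # zero init

x = torch.randn(3, in_features)
base_out = F.linear(x, weight)
lora_out = base_out + F.linear(F.linear(x, lora_A), lora_B) * (alpha / rank)

# With B = 0 the low-rank path adds nothing, so the outputs are identical.
print("same output at init:", torch.equal(base_out, lora_out))

# Gradients still flow: B gets a nonzero gradient, while A's gradient is zero
# on the first step because it is multiplied by B = 0.
lora_out.sum().backward()
print("B grad nonzero:", bool(lora_B.grad.abs().sum() > 0))
print("A grad zero:", bool(lora_A.grad.abs().sum() == 0))
```

This is why only one of the two matrices is zero-initialized: if both were zero, neither would receive a gradient and training could not start.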

## 5. Where to Apply LoRA? 🎯

In Transformers, you can apply LoRA to different components:

### Common Choices:

1. **Query & Value matrices (Q, V)** - Most common, good balance
2. **All attention matrices (Q, K, V, O)** - More parameters, better adaptation
3. **Attention + FFN** - Maximum adaptation

### Reference to transformer-foundation:

From `transformer-foundation/03_multi_head_attention.ipynb`:
```python
class MultiHeadAttention(nn.Module):
    # Simplified excerpt: only the projection layers are shown
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)  # ← Apply LoRA here
        self.W_k = nn.Linear(d_model, d_model)  # ← Optional
        self.W_v = nn.Linear(d_model, d_model)  # ← Apply LoRA here
        self.W_o = nn.Linear(d_model, d_model)  # ← Optional
```

From `transformer-foundation/04_feed_forward_networks.ipynb`:
```python
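class PositionWiseFeedForward(nn.Module):
    # Simplified excerpt: only the linear layers are shown
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # ← Can apply LoRA
        self.fc2 = nn.Linear(d_ff, d_model)  # ← Can apply LoRA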
```

### Trade-offs:

- **Fewer modules**: Faster, less memory, slightly lower performance
- **More modules**: Better adaptation, more memory, slower training

In [None]:
# Calculate parameter counts for different LoRA configurations
d_model = 4096  # Llama-2 7B size
d_ff = 11008
n_layers = 32
n_heads = 32
lora_rank = 16

configs = {
    'Query + Value only': [
        ('Q', d_model, d_model),
        ('V', d_model, d_model),
    ],
    'All Attention (Q,K,V,O)': [
        ('Q', d_model, d_model),
        ('K', d_model, d_model),
        ('V', d_model, d_model),
        ('O', d_model, d_model),
    ],
    'Attention + FFN': [
        ('Q', d_model, d_model),
        ('K', d_model, d_model),
        ('V', d_model, d_model),
        ('O', d_model, d_model),
        ('FFN_up', d_model, d_ff),
        ('FFN_down', d_ff, d_model),
    ],
}

print("\n📊 LoRA Parameter Counts (per layer):\n")
for config_name, modules in configs.items():
    total_lora_params = 0
    for name, in_dim, out_dim in modules:
        # LoRA params: A (r × in) + B (out × r)
        lora_params = (lora_rank * in_dim) + (out_dim * lora_rank)
        total_lora_params += lora_params
    
    total_for_model = total_lora_params * n_layers
    print(f"{config_name}:")
    print(f"  Per layer: {total_lora_params:,} params")
    print(f"  Full model ({n_layers} layers): {total_for_model / 1e6:.2f}M params")
    print()

print("\n💡 Recommendation: Start with 'Query + Value only' for fastest iteration!")
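
To see how the rank affects the trainable fraction, the same counting can be swept over several ranks for the 'Query + Value only' configuration. This sketch reuses the illustrative sizes from the cell above and assumes a nominal 7B-parameter base model; the helper `qv_lora_params` is introduced only for this example.

```python
d_model, n_layers = 4096, 32      # same illustrative sizes as above
base_params = 7_000_000_000       # assumed nominal size of the base model

def qv_lora_params(rank: int) -> int:
    """LoRA parameters when only Q and V (both d_model x d_model) are adapted."""
    per_matrix = rank * d_model + d_model * rank   # A plus B for one matrix
    return 2 * per_matrix * n_layers               # Q and V in every layer

for rank in (8, 16, 32, 64):
    n = qv_lora_params(rank)
    print(f"rank={rank:>2}: {n / 1e6:6.2f}M trainable ({n / base_params:.3%} of base)")
```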

## 6. Fine-Tuning Pipeline 🔄

### Step-by-Step Process:

```
1. DATA PREPARATION
   ├── Collect task-specific data
   ├── Format (instruction, input, output)
   ├── Tokenize
   └── Create DataLoader

2. MODEL SETUP
   ├── Load pre-trained model
   ├── Add LoRA adapters (or freeze layers)
   ├── Configure optimizer
   └── Set hyperparameters

3. TRAINING
   ├── Forward pass
   ├── Compute loss
   ├── Backward pass (only LoRA params)
   ├── Update weights
   └── Monitor metrics

4. EVALUATION
   ├── Test on validation set
   ├── Compare with base model
   ├── Check for overfitting
   └── Measure task performance

5. DEPLOYMENT
   ├── Merge LoRA weights (optional)
   ├── Quantize for inference (optional)
   ├── Test inference speed
   └── Deploy to production
```
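
Steps 2 and 3 above can be sketched on a toy problem. This is a minimal illustration rather than a real fine-tuning run: `TinyLoRALinear` is a hypothetical single-layer stand-in for a pre-trained model, and random regression data stands in for a tokenized dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyLoRALinear(nn.Module):
    """Hypothetical minimal LoRA linear layer used only for this sketch."""
    def __init__(self, dim: int, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        lora = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return F.linear(x, self.weight) + lora * self.scaling

model = TinyLoRALinear(dim=16)
weight_before = model.weight.clone()

# 2. MODEL SETUP: the optimizer only sees parameters that require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-2)

# Toy regression data standing in for a tokenized dataset.
x = torch.randn(64, 16)
y = torch.randn(64, 16)

# 3. TRAINING: forward pass, loss, backward pass, weight update.
losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("base weight unchanged:", torch.equal(model.weight, weight_before))
```

Passing only the trainable parameters to the optimizer is what keeps optimizer state small: no moment estimates are allocated for the frozen base weight.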

**Next tutorials will cover each step in detail!**

---

## 7. Connection to Our Transformer Implementation 🔗

### From `src/transformer.py`:

Our implementation provides the foundation. To add LoRA:

```python
# Original (from src/transformer.py)
class Transformer(nn.Module):
    def __init__(self, ...):
        self.encoder = TransformerEncoder(...)  # From src/modules/encoder.py
        self.decoder = TransformerDecoder(...)  # From src/modules/decoder.py

# With LoRA (next tutorial)
def add_lora_to_transformer(model, rank=8):
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            if 'W_q' in name or 'W_v' in name:
                # Replace with LoRA layer
                pass
```
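
One possible way to fill in the stub above is to wrap each matching `nn.Linear` in a module that adds a low-rank update. The sketch below is an assumption about how this could be done, not the implementation from the next tutorial: `LoRALinear`, `add_lora`, and `ToyAttention` are hypothetical names, and the `W_q` / `W_v` attribute names are taken from the name check in the stub, so the real `src/modules/attention.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen nn.Linear plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        lora = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return self.base(x) + lora * self.scaling

def add_lora(model: nn.Module, targets=("W_q", "W_v"), rank: int = 8) -> nn.Module:
    """Freeze all weights, then wrap every nn.Linear whose attribute name is in targets."""
    for p in model.parameters():
        p.requires_grad = False
    # Collect first so the module tree is not mutated while iterating over it.
    to_wrap = [
        (parent, child_name, child)
        for parent in model.modules()
        for child_name, child in parent.named_children()
        if isinstance(child, nn.Linear) and child_name in targets
    ]
    for parent, child_name, child in to_wrap:
        setattr(parent, child_name, LoRALinear(child, rank=rank))
    return model

class ToyAttention(nn.Module):
    """Toy stand-in holding only the four projection layers (no forward pass)."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

model = add_lora(nn.Sequential(ToyAttention(), ToyAttention()), rank=4)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable {trainable:,} / total {total:,}")
```

Wrapping the existing layer, rather than copying its weights into a new one, keeps the pre-trained parameters in place and makes it straightforward to remove the adapter later.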

### Key Modules to Understand:

1. **`src/modules/embeddings.py`** - Usually frozen during fine-tuning
2. **`src/modules/attention.py`** - Where LoRA is most effective
3. **`src/modules/feed_forward.py`** - Optional LoRA application
4. **`src/modules/encoder.py` & `src/modules/decoder.py`** - Container modules

---

## 🚀 Production-Ready Alternative: Unsloth AI

Now that you understand the theory and manual implementation, here's how professionals do it in production:

### Why Unsloth for Production?

After learning the fundamentals above, **Unsloth AI** provides:
- ⚡ **2x faster training** with optimized CUDA kernels
- 💾 **30% less memory** usage (fit larger models)
- 📦 **Pre-configured setups** for Llama, Gemma, Mistral, Qwen, etc.
- 🔧 **Production-tested** code used by thousands of developers

### Quick Example with Unsloth

```python
# Install Unsloth (in production environment)
# pip install unsloth

from unsloth import FastLanguageModel
import torch

# Load pre-trained model with LoRA in 3 lines
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/tinyllama-bnb-4bit",  # 4-bit quantized model
    max_seq_length=2048,
    dtype=None,  # Auto-detect best dtype
    load_in_4bit=True,  # Use 4-bit quantization
)

# Add LoRA adapters automatically
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

# That's it! Ready to train with standard HuggingFace Trainer
```

### Learning Path

1. ✅ **This Tutorial** - Understand fundamentals (pre-training, fine-tuning, LoRA theory)
2. ➡️ **Tutorials 2-5** - Implement from scratch (LoRA, datasets, training, evaluation)
3. 🚀 **Then Use Unsloth** - Production deployment with optimizations

**Why this order?** Understanding how LoRA works (next tutorials) helps you debug issues, customize implementations, and make informed decisions when using production tools like Unsloth.

---

Ready to implement LoRA from scratch? Continue to **Tutorial 2**!

## 8. Summary & Next Steps 📝

### What We Learned:

✅ Fine-tuning adapts pre-trained models to specific tasks  
✅ Full fine-tuning updates all parameters (expensive)  
✅ PEFT updates only a subset of parameters (efficient)  
✅ LoRA decomposes weight updates into low-rank matrices  
✅ QLoRA adds quantization for extreme efficiency  
✅ Different strategies trade off performance vs. resources  

### Key Takeaways:

1. **LoRA is the practical choice** for most use cases
2. **Start simple**: Query + Value matrices with rank=8 or 16
3. **Iterate quickly**: Low memory means faster experiments
4. **Know your constraints**: GPU memory, training time, performance needs

### Next Tutorials:

1. **02_lora_implementation.ipynb** - Implement LoRA from scratch
2. **03_data_preparation.ipynb** - Prepare datasets for fine-tuning
3. **04_instruction_tuning.ipynb** - Fine-tune for instruction following
4. **05_evaluation_metrics.ipynb** - Measure fine-tuning success

---

## 📚 Additional Resources

### Papers:
- **LoRA**: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- **QLoRA**: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- **PEFT Survey**: "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning" (Lialin et al., 2023)

### Libraries:
- **Hugging Face PEFT**: https://github.com/huggingface/peft
- **Unsloth AI**: https://github.com/unslothai/unsloth (optimized fine-tuning)
- **bitsandbytes**: https://github.com/TimDettmers/bitsandbytes (quantization)

### Related Tutorials:
- [transformer-foundation/](../transformer-foundation/) - Build understanding from scratch
- [papers/DeepSeek-R1-paper.pdf](../papers/DeepSeek-R1-paper.pdf) - Modern architecture

---

**Ready to implement LoRA? Continue to Tutorial 2! 🚀**