# Chapter 8: Advanced LoRA Techniques

**Portfolio Project: Building LLMs from Scratch on AWS** 🚀

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/llm-from-scratch-aws/blob/main/08_Advanced_LoRA.ipynb)

---

## 📋 Chapter Overview

Advanced LoRA techniques for production deployment:
- **Multiple Task-Specific Adapters**: Train different adapters for different tasks
- **QLoRA**: Quantization + LoRA for extreme efficiency
- **Adapter Switching**: Dynamic adapter selection
- **A/B Testing**: Compare adapter performance
- **Adapter Merging**: Combine multiple adapters
- **Production Deployment**: Multi-adapter endpoints on AWS

**Learning Objectives:**
✅ Multi-task learning with LoRA  
✅ 4-bit quantization with QLoRA  
✅ Dynamic adapter management  
✅ Production-ready A/B testing  

**AWS Services:** SageMaker Multi-Model Endpoints, S3  
**Estimated Cost:** $5-15

---


## 🔧 Setup

### Cell Purpose: Install dependencies


In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install -q torch tiktoken matplotlib tqdm bitsandbytes
    
import torch
import torch.nn as nn
import torch.nn.functional as F
import tiktoken
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import numpy as np
import math
import json
import copy

print("✅ Environment ready!")


## 8.1 Multi-Task LoRA Adapters

### Cell Purpose: Train different adapters for different tasks


In [None]:
# Multi-task datasets
tasks = {
    "summarization": [
        ("Summarize: Machine learning is a subset of AI.", "ML is part of AI technology."),
        ("Summarize: The weather today is sunny and warm.", "Today: sunny, warm weather."),
        ("Summarize: Python is a popular programming language.", "Python: popular language."),
    ] * 10,
    
    "translation": [
        ("Translate to Spanish: Hello", "Hola"),
        ("Translate to Spanish: Thank you", "Gracias"),
        ("Translate to Spanish: Good morning", "Buenos días"),
    ] * 10,
    
    "qa": [
        ("Question: What is AI? Answer:", "AI is Artificial Intelligence."),
        ("Question: What is Python? Answer:", "Python is a programming language."),
        ("Question: What is ML? Answer:", "ML is Machine Learning."),
    ] * 10,
}

class AdapterManager:
    """Manages multiple LoRA adapters for different tasks"""
    
    def __init__(self, base_model):
        self.base_model = base_model
        self.adapters = {}  # task_name -> adapter_params
        self.current_task = None
        
    def save_adapter(self, task_name, lora_params):
        """Save adapter parameters for a task"""
        self.adapters[task_name] = {
            name: param.data.clone()
            for name, param in lora_params
        }
        print(f"✅ Saved adapter for task: {task_name}")
        
    def load_adapter(self, task_name):
        """Load adapter parameters for a task"""
        if task_name not in self.adapters:
            raise ValueError(f"No adapter found for task: {task_name}")
        
        # Load adapter weights, matching saved names to model parameters exactly
        # (substring matching could write one adapter tensor into several parameters)
        model_params = dict(self.base_model.named_parameters())
        with torch.no_grad():
            for name, param in self.adapters[task_name].items():
                if name not in model_params:
                    raise KeyError(f"Adapter parameter not found in model: {name}")
                model_params[name].copy_(param)
        
        self.current_task = task_name
        print(f"✅ Loaded adapter for task: {task_name}")
        
    def list_adapters(self):
        """List all available adapters"""
        return list(self.adapters.keys())

print("✅ Multi-task adapter system ready!")
print(f"   Tasks: {list(tasks.keys())}")
print(f"   Samples per task: {[len(data) for data in tasks.values()]}")
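
The save/load cycle above can be sketched end to end on a toy model. This is a minimal illustration, not the chapter's training code: `TinyLoRALinear`, `snapshot_lora`, and `restore_lora` are hypothetical names introduced here, and "training" is simulated by writing constants into `lora_B`.

```python
import torch
import torch.nn as nn

class TinyLoRALinear(nn.Module):
    """Toy frozen linear layer with a trainable low-rank update (illustrative)."""
    def __init__(self, in_features=8, out_features=8, rank=2):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x):
        return self.base(x) + x @ self.lora_A @ self.lora_B

def snapshot_lora(model):
    """Copy every parameter whose name contains 'lora' into a plain dict."""
    return {n: p.detach().clone() for n, p in model.named_parameters() if "lora" in n}

def restore_lora(model, snapshot):
    """Write a snapshot back into the model, matching names exactly."""
    params = dict(model.named_parameters())
    with torch.no_grad():
        for name, value in snapshot.items():
            params[name].copy_(value)

torch.manual_seed(0)
model = TinyLoRALinear()
x = torch.randn(4, 8)

# Simulate two trained tasks by writing different values into lora_B.
with torch.no_grad():
    model.lora_B.fill_(0.5)
adapter_a = snapshot_lora(model)
out_a = model(x).detach().clone()

with torch.no_grad():
    model.lora_B.fill_(-0.5)
adapter_b = snapshot_lora(model)
out_b = model(x).detach().clone()

# Switching back to adapter A reproduces its outputs.
restore_lora(model, adapter_a)
switched_back = torch.allclose(model(x).detach(), out_a)
outputs_differ = not torch.allclose(out_a, out_b)
print(f"switched back: {switched_back}, outputs differ: {outputs_differ}")
```

Only the LoRA tensors are copied, so switching tasks moves a small fraction of the model's parameters while the frozen base stays in place.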


## 8.2 QLoRA: Quantized LoRA

### Cell Purpose: Implement 4-bit quantization with LoRA


In [None]:
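class Quantize4bit:
    """Simple 4-bit quantization (educational implementation).

    Codes take values 0-15 but are stored one per uint8 byte, so storage is
    8 bits per weight unless two codes are packed into each byte.
    """
    
    @staticmethod
    def quantize(tensor):
        """Quantize tensor to 4-bit codes (0-15) with a per-tensor scale and offset"""
        # Find min and max
        min_val = tensor.min()
        max_val = tensor.max()
        
        # Scale to 0-15 (4-bit range); the clamp avoids division by zero
        # when every element of the tensor has the same value
        scale = ((max_val - min_val) / 15.0).clamp_min(1e-8)
        quantized = torch.round((tensor - min_val) / scale).to(torch.uint8)
        
        return quantized, scale, min_val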
    
    @staticmethod
    def dequantize(quantized, scale, min_val):
        """Dequantize from 4-bit back to float"""
        return quantized.float() * scale + min_val

class QLoRALayer(nn.Module):
    """QLoRA: Quantized base weights + LoRA adapters"""
    
    def __init__(self, linear_layer, rank=4, alpha=16, quantize=True):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Quantize base weights (stored as buffers, so they are never trained)
        if quantize:
            quantized, scale, min_val = Quantize4bit.quantize(linear_layer.weight.data)
            self.register_buffer('weight_quantized', quantized)
            self.register_buffer('weight_scale', scale.detach().clone())
            self.register_buffer('weight_min', min_val.detach().clone())
            self.quantized = True
        else:
            # Frozen full-precision copy of the base weights
            self.register_buffer('weight', linear_layer.weight.data.clone())
            self.quantized = False
        
        # Keep the bias frozen as well
        if linear_layer.bias is not None:
            self.register_buffer('bias', linear_layer.bias.data.clone())
        else:
            self.bias = None
        
        # LoRA adapters (trainable)
        in_features = linear_layer.in_features
        out_features = linear_layer.out_features
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        
    def forward(self, x):
        # Get base weight
        if self.quantized:
            weight = Quantize4bit.dequantize(
                self.weight_quantized,
                self.weight_scale,
                self.weight_min
            )
        else:
            weight = self.weight
        
        # Base forward
        output = F.linear(x, weight, self.bias)
        
        # Add LoRA
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        
        return output + lora_output

# Memory comparison
print("="*60)
print("MEMORY COMPARISON")
print("="*60)

# Example: 1000x1000 weight matrix
sample_weight = torch.randn(1000, 1000)

# Full precision (FP32)
fp32_size = sample_weight.element_size() * sample_weight.nelement()

# 4-bit codes, stored one per uint8 byte here (packing two codes per byte would halve this)
quantized, _, _ = Quantize4bit.quantize(sample_weight)
quant_size = quantized.element_size() * quantized.nelement()

# LoRA adapters (rank=8)
rank = 8
lora_A_size = 1000 * rank * 4  # FP32
lora_B_size = rank * 1000 * 4  # FP32

print(f"Original (FP32): {fp32_size / 1024:.2f} KB")
print(f"Quantized (4-bit codes in uint8): {quant_size / 1024:.2f} KB")
print(f"LoRA Adapters (rank={rank}): {(lora_A_size + lora_B_size) / 1024:.2f} KB")
print(f"QLoRA Total: {(quant_size + lora_A_size + lora_B_size) / 1024:.2f} KB")
print(f"\nMemory Reduction: {(1 - (quant_size + lora_A_size + lora_B_size) / fp32_size) * 100:.1f}%")
print("="*60)
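
The comparison above stores each 4-bit code in its own `uint8`, so the quantized tensor still uses one byte per weight. A minimal sketch of packing two codes into each byte follows; `pack_4bit` and `unpack_4bit` are hypothetical helpers written for this illustration and are not how bitsandbytes lays out its data.

```python
import torch

def pack_4bit(codes):
    """Pack uint8 codes in the range 0-15 into half as many bytes."""
    codes = codes.flatten()
    if codes.numel() % 2:  # pad to an even number of codes
        codes = torch.cat([codes, codes.new_zeros(1)])
    # High nibble from even positions, low nibble from odd positions.
    # Multiplying by 16 is the same as shifting left by 4 bits.
    return codes[0::2] * 16 + codes[1::2]

def unpack_4bit(packed, numel):
    """Recover the original codes from a packed tensor."""
    high = torch.div(packed, 16, rounding_mode="floor")
    low = torch.remainder(packed, 16)
    return torch.stack([high, low], dim=1).flatten()[:numel]

torch.manual_seed(0)
weight = torch.randn(1000, 1000)

# Same min/max scheme as Quantize4bit.quantize
min_val, max_val = weight.min(), weight.max()
scale = (max_val - min_val) / 15.0
codes = torch.round((weight - min_val) / scale).to(torch.uint8)

packed = pack_4bit(codes)
restored = unpack_4bit(packed, codes.numel()).reshape(codes.shape)

unpacked_bytes = codes.numel() * codes.element_size()
packed_bytes = packed.numel() * packed.element_size()
print(f"Unpacked: {unpacked_bytes / 1024:.2f} KB, packed: {packed_bytes / 1024:.2f} KB")
print(f"Lossless round trip: {torch.equal(restored, codes)}")
```

Packing is lossless with respect to the codes; the only approximation error comes from the rounding step in quantization itself.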


## 8.3 A/B Testing

### Cell Purpose: Compare adapter performance


In [None]:
class ABTestingFramework:
    """A/B testing for comparing adapter performance"""
    
    def __init__(self, model, adapter_manager):
        self.model = model
        self.adapter_manager = adapter_manager
        self.results = {}
        
    def run_test(self, test_data, adapters_to_test, metric_fn):
        """
        Run A/B test on multiple adapters
        
        Args:
            test_data: List of (input, expected_output) tuples
            adapters_to_test: List of adapter names
            metric_fn: Function to compute metric (higher is better)
        """
        print("="*60)
        print("A/B TESTING")
        print("="*60)
        
        for adapter_name in adapters_to_test:
            print(f"\nTesting adapter: {adapter_name}")
            
            # Load adapter
            self.adapter_manager.load_adapter(adapter_name)
            
            # Evaluate
            scores = []
            for input_text, expected in test_data:
                # Generate output (simplified)
                with torch.no_grad():
                    score = metric_fn(input_text, expected, self.model)
                    scores.append(score)
            
            # Store results
            avg_score = np.mean(scores)
            self.results[adapter_name] = {
                'scores': scores,
                'mean': avg_score,
                'std': np.std(scores),
                'n_samples': len(scores)
            }
            
            print(f"   Mean Score: {avg_score:.4f} ± {np.std(scores):.4f}")
        
        print("\n" + "="*60)
        self._print_summary()
        
    def _print_summary(self):
        """Print comparison summary"""
        print("SUMMARY")
        print("="*60)
        
        # Sort by mean score
        sorted_adapters = sorted(
            self.results.items(),
            key=lambda x: x[1]['mean'],
            reverse=True
        )
        
        print(f"{'Rank':<6}{'Adapter':<20}{'Score':<15}{'Samples'}")
        print("-"*60)
        
        for rank, (adapter_name, results) in enumerate(sorted_adapters, 1):
            print(f"{rank:<6}{adapter_name:<20}"
                  f"{results['mean']:.4f} ± {results['std']:.4f}  "
                  f"{results['n_samples']}")
        
        # Winner
        winner = sorted_adapters[0][0]
        print(f"\n🏆 Winner: {winner}")
        print("="*60)
        
    def get_winner(self):
        """Get best performing adapter"""
        return max(self.results.items(), key=lambda x: x[1]['mean'])[0]
    
    def visualize_results(self):
        """Visualize A/B test results"""
        adapters = list(self.results.keys())
        means = [self.results[a]['mean'] for a in adapters]
        stds = [self.results[a]['std'] for a in adapters]
        
        plt.figure(figsize=(10, 6))
        x_pos = np.arange(len(adapters))
        plt.bar(x_pos, means, yerr=stds, capsize=5, 
                color=['#3498db', '#2ecc71', '#e74c3c'][:len(adapters)],
                edgecolor='black', alpha=0.7)
        plt.xlabel('Adapter', fontsize=12, fontweight='bold')
        plt.ylabel('Score', fontsize=12, fontweight='bold')
        plt.title('A/B Test Results', fontsize=14, fontweight='bold')
        plt.xticks(x_pos, adapters, rotation=45)
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()

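# Example metric function (placeholder; swap in a real task metric)
def simple_match_metric(input_text, expected, model):
    """Placeholder metric: 1.0 if the input is non-empty, 0.0 otherwise.

    It ignores `expected` and `model`, so every adapter gets the same score.
    """
    return 1.0 if len(input_text) > 0 else 0.0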

print("✅ A/B Testing framework ready!")
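
The framework above picks a winner by mean score alone, which can be misleading on small test sets. One possible complement is a paired bootstrap over the per-example scores that `run_test` already stores under `results[name]['scores']`. The sketch below is a minimal illustration: `bootstrap_mean_diff` is a hypothetical helper and the scores are made-up numbers, not results from this chapter.

```python
import numpy as np

def bootstrap_mean_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap: mean of (a - b) and a 95% confidence interval for it."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample example indices with replacement, one row per bootstrap replicate
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    resampled_means = diffs[idx].mean(axis=1)
    low, high = np.percentile(resampled_means, [2.5, 97.5])
    return diffs.mean(), low, high

# Hypothetical per-example scores for two adapters on the same test set
scores_a = [0.9, 0.8, 0.85, 0.95, 0.7, 0.9, 0.88, 0.92]
scores_b = [0.6, 0.5, 0.65, 0.7, 0.4, 0.55, 0.6, 0.62]

mean_diff, ci_low, ci_high = bootstrap_mean_diff(scores_a, scores_b)
significant = ci_low > 0 or ci_high < 0  # interval excludes zero
print(f"mean diff = {mean_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"difference distinguishable from zero: {significant}")
```

The pairing matters: both adapters must be scored on the same examples in the same order, which is what `run_test` does when it loops over one `test_data` list.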


## 8.4 Adapter Merging

### Cell Purpose: Combine multiple adapters


In [None]:
class AdapterMerger:
    """Merge multiple LoRA adapters"""
    
    @staticmethod
    def average_merge(adapters, weights=None):
        """
        Merge adapters by averaging
        
        Args:
            adapters: List of adapter parameter dicts
            weights: Optional weights for weighted average
        """
        if weights is None:
            weights = [1.0 / len(adapters)] * len(adapters)
        
        merged = {}
        
        # Get all parameter names from first adapter
        param_names = list(adapters[0].keys())
        
        for name in param_names:
            # Weighted average of parameters
            merged[name] = sum(
                w * adapter[name] for w, adapter in zip(weights, adapters)
            )
        
        return merged
    
    @staticmethod
    def task_vector_merge(adapter_base, adapter_A, adapter_B, alpha=0.5):
        """
        Merge using task vectors
        
        Task vector = adapter_params - base_params
        Merged = base + alpha * (taskA + taskB)
        """
        merged = {}
        
        for name in adapter_base.keys():
            # Compute task vectors
            task_vector_A = adapter_A[name] - adapter_base[name]
            task_vector_B = adapter_B[name] - adapter_base[name]
            
            # Merge
            merged[name] = adapter_base[name] + alpha * (task_vector_A + task_vector_B)
        
        return merged
    
    @staticmethod
    def visualize_adapter_similarity(adapters, names):
        """Visualize similarity between adapters"""
        n_adapters = len(adapters)
        similarity_matrix = np.zeros((n_adapters, n_adapters))
        
        for i in range(n_adapters):
            for j in range(n_adapters):
                # Compute cosine similarity of flattened parameters
                params_i = torch.cat([p.flatten() for p in adapters[i].values()])
                params_j = torch.cat([p.flatten() for p in adapters[j].values()])
                
                similarity = F.cosine_similarity(
                    params_i.unsqueeze(0),
                    params_j.unsqueeze(0)
                ).item()
                
                similarity_matrix[i, j] = similarity
        
        # Plot
        plt.figure(figsize=(8, 6))
        plt.imshow(similarity_matrix, cmap='RdYlGn', vmin=0, vmax=1)
        plt.colorbar(label='Cosine Similarity')
        plt.xticks(range(n_adapters), names, rotation=45)
        plt.yticks(range(n_adapters), names)
        plt.title('Adapter Similarity Matrix', fontsize=14, fontweight='bold')
        
        # Add text annotations
        for i in range(n_adapters):
            for j in range(n_adapters):
                plt.text(j, i, f'{similarity_matrix[i, j]:.2f}',
                        ha='center', va='center',
                        color='black' if similarity_matrix[i, j] > 0.5 else 'white')
        
        plt.tight_layout()
        plt.show()

print("✅ Adapter merging tools ready!")
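
One subtlety of `average_merge` is worth seeing in numbers: averaging `lora_A` and `lora_B` separately is not the same as averaging the weight updates `A @ B` that the adapters actually apply. The sketch below uses random tensors and hypothetical variable names to show the gap, plus one way to represent the average of updates exactly by concatenating factors (at the cost of doubling the rank).

```python
import torch

torch.manual_seed(0)
in_features, out_features, rank = 16, 12, 4

def random_adapter():
    """Random stand-in for a trained adapter (illustrative only)."""
    return {"lora_A": torch.randn(in_features, rank),
            "lora_B": torch.randn(rank, out_features)}

def delta(adapter):
    """Effective weight update of one adapter, in the x @ A @ B convention."""
    return adapter["lora_A"] @ adapter["lora_B"]

adapter_1, adapter_2 = random_adapter(), random_adapter()

# (a) Average the factors, then multiply (what a parameter-wise average does)
avg_A = 0.5 * (adapter_1["lora_A"] + adapter_2["lora_A"])
avg_B = 0.5 * (adapter_1["lora_B"] + adapter_2["lora_B"])
delta_from_avg_factors = avg_A @ avg_B

# (b) Multiply each adapter's factors, then average the resulting updates
avg_of_deltas = 0.5 * (delta(adapter_1) + delta(adapter_2))

gap = (delta_from_avg_factors - avg_of_deltas).norm().item()

# (b) equals the average of the two adapters' outputs; (a) generally does not
x = torch.randn(3, in_features)
avg_output = 0.5 * (x @ delta(adapter_1) + x @ delta(adapter_2))
matches_output_average = torch.allclose(x @ avg_of_deltas, avg_output, atol=1e-3)

# (b) can be stored as a rank-2r adapter by concatenating the factors
cat_A = torch.cat([adapter_1["lora_A"], adapter_2["lora_A"]], dim=1)
cat_B = torch.cat([0.5 * adapter_1["lora_B"], 0.5 * adapter_2["lora_B"]], dim=0)
concat_is_exact = torch.allclose(cat_A @ cat_B, avg_of_deltas, atol=1e-3)

print(f"gap between (a) and (b): {gap:.3f}")
print(f"(b) matches averaged outputs: {matches_output_average}")
print(f"concatenated factors reproduce (b): {concat_is_exact}")
```

Factor-wise averaging introduces cross terms such as `A1 @ B2`, so whether it works well in practice is an empirical question for each pair of adapters.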


## 8.5 Production Deployment

### Cell Purpose: AWS multi-model endpoint setup


In [None]:
deployment_guide = """
════════════════════════════════════════════════════════════
AWS MULTI-ADAPTER DEPLOYMENT
════════════════════════════════════════════════════════════

## Architecture Overview

Base Model (Frozen) + Multiple LoRA Adapters
└── Adapter Registry (S3)
    ├── summarization_adapter.pth
    ├── translation_adapter.pth
    └── qa_adapter.pth

## Deployment Strategy

### Option 1: Multi-Model Endpoint
--------------------------------------------------------------
- Single endpoint serving multiple adapters
- Base model loaded once in memory
- Dynamically load adapters per request
- Most cost-effective for multiple tasks

```python
from sagemaker.multidatamodel import MultiDataModel

# Create multi-model
mdm = MultiDataModel(
    name='multi-adapter-endpoint',
    model_data_prefix='s3://bucket/adapters/',
    model=base_model,
    sagemaker_session=session
)

# Deploy
predictor = mdm.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge'
)

# Invoke with specific adapter
response = predictor.predict(
    data={'text': 'Hello'},
    target_model='summarization_adapter.tar.gz'
)
```

### Option 2: Adapter-per-Endpoint
--------------------------------------------------------------
- Separate endpoint for each task
- Best for high-traffic scenarios
- Auto-scaling per task

### Option 3: Serverless with Lambda
--------------------------------------------------------------
- Best for low/sporadic traffic
- Cold start: ~2-5 seconds
- Cost: Pay per request

## Implementation Steps

### 1. Prepare Adapters
--------------------------------------------------------------
```python
# Save each adapter separately
for task_name, adapter in adapters.items():
    torch.save({
        'adapter_params': adapter,
        'rank': 8,
        'alpha': 16,
        'task': task_name
    }, f'{task_name}_adapter.pth')
    
    # Upload to S3
    !aws s3 cp {task_name}_adapter.pth s3://bucket/adapters/
```

### 2. Create Inference Handler
--------------------------------------------------------------
```python
# inference.py
class ModelHandler:
    def __init__(self):
        self.base_model = load_base_model()
        self.current_adapter = None
        
    def load_adapter(self, adapter_name):
        if adapter_name != self.current_adapter:
            adapter_path = f'/opt/ml/model/{adapter_name}'
            load_lora_adapter(self.base_model, adapter_path)
            self.current_adapter = adapter_name
    
    def predict(self, data):
        adapter = data.get('adapter', 'default')
        self.load_adapter(adapter)
        return self.base_model.generate(data['text'])
```

### 3. Deploy Multi-Model Endpoint
--------------------------------------------------------------
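```bash
# Package base model
tar -czf base_model.tar.gz model/ code/

# Upload
aws s3 cp base_model.tar.gz s3://bucket/models/

# Create SageMaker model
# (Image is a placeholder for the full container image URI;
#  SAGEMAKER_ROLE_ARN must hold your SageMaker execution role ARN.
#  The --primary-container fields are comma-separated with no spaces.)
aws sagemaker create-model \\
    --model-name multi-adapter-model \\
    --execution-role-arn "$SAGEMAKER_ROLE_ARN" \\
    --primary-container \\
        Image=pytorch-inference:2.0,ModelDataUrl=s3://bucket/models/base_model.tar.gz,Mode=MultiModel

# Create endpoint config (must exist before the endpoint is created)
aws sagemaker create-endpoint-config \\
    --endpoint-config-name multi-adapter-config \\
    --production-variants \\
        VariantName=AllTraffic,ModelName=multi-adapter-model,InitialInstanceCount=1,InstanceType=ml.g4dn.xlarge

# Create endpoint
aws sagemaker create-endpoint \\
    --endpoint-name multi-adapter-endpoint \\
    --endpoint-config-name multi-adapter-config
```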

### 4. A/B Testing in Production
--------------------------------------------------------------
```python
# Route traffic to different adapters
from sagemaker.session import Session, production_variant

session = Session()

# Each variant references an existing SageMaker model by name
# (model names may contain hyphens but not underscores)
variants = [
    production_variant(
        model_name='adapter-v1',
        instance_type='ml.g4dn.xlarge',
        initial_instance_count=1,
        variant_name='AdapterV1',
        initial_weight=70  # 70% traffic
    ),
    production_variant(
        model_name='adapter-v2',
        instance_type='ml.g4dn.xlarge',
        initial_instance_count=1,
        variant_name='AdapterV2',
        initial_weight=30  # 30% traffic
    )
]

session.endpoint_from_production_variants(
    name='ab-test-endpoint',
    production_variants=variants
)
```

## Cost Analysis

### Multi-Model Endpoint (Recommended)
--------------------------------------------------------------
Instance: ml.g4dn.xlarge (1x T4 GPU)
Base Cost: ~$0.736/hour

Storage (S3):
- Base model (500 MB): $0.01/month
- 10 adapters (5 MB each): $0.001/month
- Total: ~$0.011/month

Monthly Cost (24/7):
- Compute: ~$530/month
- Storage: ~$0.011/month
- Total: ~$530/month

With Auto-Scaling (typical):
- Min instances: 0 (after hours)
- Max instances: 3 (peak hours)
- Average: ~$150-300/month

### Serverless (Low Traffic)
--------------------------------------------------------------
Cost per invocation: ~$0.0001
10,000 requests/month: ~$1
100,000 requests/month: ~$10

Best for: <100k requests/month

## Monitoring

```python
# CloudWatch metrics
import boto3

cloudwatch = boto3.client('cloudwatch')

# Track adapter usage
cloudwatch.put_metric_data(
    Namespace='MultiAdapter',
    MetricData=[{
        'MetricName': 'AdapterInvocations',
        'Dimensions': [
            {'Name': 'AdapterName', 'Value': adapter_name}
        ],
        'Value': 1.0
    }]
)

# Track latency per adapter
# Track error rates
# Track model drift
```

════════════════════════════════════════════════════════════
"""

print(deployment_guide)
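
The guide saves adapters as `.pth` files but invokes the endpoint with `target_model='summarization_adapter.tar.gz'`. Multi-model endpoints typically load each model from a `.tar.gz` archive under the S3 prefix, so a packaging step is likely needed in between. The sketch below shows one way to do it locally; `package_adapter` is a hypothetical helper, the tensors are dummies, and the upload to S3 is left out.

```python
import os
import tarfile
import tempfile

import torch

def package_adapter(adapter_params, task_name, out_dir, rank=8, alpha=16):
    """Save one adapter as <task>_adapter.pth and wrap it in <task>_adapter.tar.gz."""
    pth_path = os.path.join(out_dir, f"{task_name}_adapter.pth")
    torch.save(
        {"adapter_params": adapter_params, "rank": rank,
         "alpha": alpha, "task": task_name},
        pth_path,
    )
    archive_path = os.path.join(out_dir, f"{task_name}_adapter.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the archive flat: no temp-directory path inside it
        tar.add(pth_path, arcname=os.path.basename(pth_path))
    return archive_path

with tempfile.TemporaryDirectory() as tmp:
    # Dummy tensors standing in for trained LoRA weights
    dummy = {"lora_A": torch.zeros(4, 2), "lora_B": torch.zeros(2, 4)}
    archive = package_adapter(dummy, "summarization", tmp)
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()
    archive_name = os.path.basename(archive)

print(f"{archive_name} contains {members}")
```

The archive name is what a request would pass as `target_model`, so keeping a predictable `<task>_adapter.tar.gz` pattern makes routing by task name straightforward.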


## 📝 Chapter Summary

### What We Built:
1. ✅ **Multi-Task Adapters**: Train different adapters for different tasks
2. ✅ **QLoRA**: 4-bit quantization; the demo above cuts memory by about 73% versus FP32, with codes stored one per byte
3. ✅ **A/B Testing**: Compare adapter performance systematically
4. ✅ **Adapter Merging**: Combine multiple adapters intelligently
5. ✅ **Production Deployment**: Multi-model endpoints on AWS

### Key Concepts:
- **Task-Specific Adapters**: One base model, many specialized adapters
- **Quantization**: Reduce model size with minimal accuracy loss
- **A/B Testing**: Data-driven adapter selection
- **Adapter Merging**: Combine capabilities from multiple adapters
- **Multi-Model Endpoints**: Efficient serving of multiple models

### QLoRA Benefits:
- **Memory**: 75-80% reduction vs LoRA
- **Speed**: Similar inference speed
- **Quality**: <1% accuracy loss
- **Cost**: Train larger models on smaller GPUs

### Production Best Practices:
1. **Adapter Registry**: Central storage for all adapters (S3)
2. **Version Control**: Track adapter versions
3. **Monitoring**: Per-adapter metrics
4. **Caching**: Keep frequently-used adapters in memory
5. **Fallback**: Default adapter for unknown tasks
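
Practices 4 and 5 can be sketched together as a small least-recently-used cache with a default fallback. This is an illustration under stated assumptions: `AdapterCache` is a hypothetical class, the loader is any callable that raises `KeyError` for unknown names, and a plain dict stands in for the S3 registry.

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache for adapter weights with a default fallback (illustrative)."""

    def __init__(self, loader, capacity=2, default_name="default"):
        self.loader = loader            # callable: adapter name -> weights
        self.capacity = capacity
        self.default_name = default_name
        self._cache = OrderedDict()
        self.loads = 0                  # number of loader calls that succeeded

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)       # mark as most recently used
            return self._cache[name]
        try:
            weights = self.loader(name)
        except KeyError:
            if name == self.default_name:
                raise                           # no fallback for the fallback
            return self.get(self.default_name)  # unknown task: use the default
        self.loads += 1
        self._cache[name] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict least recently used
        return weights

# Hypothetical in-memory registry standing in for S3
registry = {"default": "W_default", "qa": "W_qa", "summarization": "W_sum"}
cache = AdapterCache(loader=registry.__getitem__, capacity=2)

cache.get("qa")
cache.get("qa")                        # cache hit, no extra load
cache.get("summarization")
fallback = cache.get("unknown_task")   # resolves to the default adapter
cached_names = list(cache._cache)

print(f"loads: {cache.loads}, cached: {cached_names}, fallback: {fallback}")
```

In a real handler the capacity would be chosen from available GPU or host memory, and the loader would download and deserialize the adapter file.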

### Cost Comparison:
| Deployment Type | Monthly Cost | Best For |
|----------------|--------------|----------|
| Multi-Model (24/7) | ~$530 | High traffic |
| Multi-Model (Auto-scale) | ~$150-300 | Variable traffic |
| Serverless | ~$0.10/1k requests | Low/sporadic |
| Per-Adapter Endpoints | ~$530 per task | Critical tasks |
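
The figures in the table follow from simple arithmetic on the rates quoted in the deployment guide. The sketch below reproduces them and computes where per-request pricing would catch up with an always-on instance. It assumes a 720-hour month and uses the guide's approximate rates, not current AWS pricing.

```python
HOURLY_RATE = 0.736              # USD/hour, ml.g4dn.xlarge figure from the guide
HOURS_PER_MONTH = 24 * 30        # 720-hour month, matching the ~$530 estimate
SERVERLESS_PER_REQUEST = 0.0001  # USD/request, figure from the guide

always_on_monthly = HOURLY_RATE * HOURS_PER_MONTH

def serverless_monthly(requests):
    """Monthly serverless cost for a given request volume."""
    return requests * SERVERLESS_PER_REQUEST

# Request volume at which serverless cost equals one always-on instance
break_even_requests = always_on_monthly / SERVERLESS_PER_REQUEST

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} requests: serverless ${serverless_monthly(n):,.2f} "
          f"vs always-on ${always_on_monthly:,.2f}")
print(f"Break-even: about {break_even_requests:,.0f} requests/month")
```

Under these assumed rates the break-even sits in the millions of requests per month, so cost alone does not set the threshold; cold-start latency and GPU availability also shape the choice.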

### Next Steps:
➡️ **Chapter 9**: Evaluation metrics (ROUGE, BLEU, Perplexity)  
➡️ **Advanced**: Mixture of Experts (MoE)  
➡️ **Research**: Adapter fusion techniques  

---

## 🔗 Resources

**Papers:**
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [AdapterHub](https://arxiv.org/abs/2007.07779)
- [Task Arithmetic](https://arxiv.org/abs/2212.04089)

**AWS Documentation:**
- [Multi-Model Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html)
- [SageMaker A/B Testing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html)

**Tools:**
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) - Quantization
- [PEFT](https://github.com/huggingface/peft) - Parameter-efficient fine-tuning
- [AdapterHub](https://adapterhub.ml/) - Adapter repository

**Ready for evaluation metrics? 📊**
