# Advanced Fine-Tuning Techniques

This notebook covers advanced fine-tuning techniques including:
- LoRA (Low-Rank Adaptation)
- QLoRA (Quantized LoRA)
- Custom training loops with gradient accumulation
- Multi-GPU training with DeepSpeed
- Advanced optimization strategies

In [None]:
# Install required packages
!pip install transformers datasets accelerate peft bitsandbytes deepspeed wandb

## 1. LoRA Fine-Tuning

LoRA (Low-Rank Adaptation) freezes the pretrained weights and trains a pair of small low-rank matrices alongside selected layers, so only a small fraction of the parameters receive gradient updates.
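
To make the savings concrete, the sketch below counts the parameters LoRA adds to a single weight matrix. It assumes the usual formulation, in which a frozen `d_out x d_in` weight receives an update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`, scaled by `lora_alpha / r`. The helper `lora_param_count` is hypothetical, and the DistilBERT dimensions (hidden size 768, 6 blocks) are assumptions rather than values read from the loaded model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is (r x d_in) and B is (d_out x r); the original weight stays frozen.
    return r * d_in + d_out * r


hidden = 768            # assumed DistilBERT hidden size
num_layers = 6          # assumed number of transformer blocks
targets_per_layer = 2   # q_lin and v_lin
r, lora_alpha = 16, 32

per_module = lora_param_count(hidden, hidden, r)
adapter_total = per_module * targets_per_layer * num_layers
full_per_module = hidden * hidden  # weights in one frozen projection

print(f"LoRA params per module: {per_module:,} vs {full_per_module:,} frozen")
print(f"LoRA params across all targets: {adapter_total:,}")
print(f"Update scaling (lora_alpha / r): {lora_alpha / r}")
```

The trainable count reported for a real PEFT model will usually be higher than `adapter_total`, because sequence-classification setups typically keep the classification head trainable as well.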

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
import torch
import numpy as np

# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=2
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=16,  # Rank
    lora_alpha=32,  # Alpha parameter
    lora_dropout=0.1,  # Dropout probability
    target_modules=["q_lin", "v_lin"]  # Target attention layers
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
print(f"Trainable parameters: {model.num_parameters(only_trainable=True):,}")
print(f"Total parameters: {model.num_parameters():,}")
print(f"Percentage of trainable params: {100 * model.num_parameters(only_trainable=True) / model.num_parameters():.2f}%")

# Sample dataset
texts = [
    "This movie is amazing!",
    "I hate this film.",
    "Great acting and storyline.",
    "Boring and predictable.",
    "Best movie ever!",
    "Waste of time."
]
labels = [1, 0, 1, 0, 1, 0]  # 1: positive, 0: negative

# Tokenize
tokenized = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = Dataset.from_dict({
    "input_ids": tokenized["input_ids"],
    "attention_mask": tokenized["attention_mask"],
    "labels": labels
})

# Training arguments optimized for LoRA
training_args = TrainingArguments(
    output_dir="./lora-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
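    learning_rate=3e-4,  # LoRA usually tolerates a higher LR than full fine-tuning
    warmup_ratio=0.1,  # Scale warmup to the run; 100 fixed steps would exceed this tiny run's total steps
    logging_steps=1,  # The sample run has only a handful of steps
    save_strategy="epoch",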
    remove_unused_columns=False,
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()

# Save LoRA weights (much smaller than full model)
model.save_pretrained("./lora-weights")
print("LoRA fine-tuning completed!")

## 2. QLoRA with 4-bit Quantization

QLoRA loads the frozen base model in 4-bit precision and trains LoRA adapters on top of it, which lowers the memory needed to fine-tune large models.
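
As a rough illustration of where the savings come from, the sketch below estimates the memory taken by the model weights alone at different precisions. The parameter count is an assumed round figure for a GPT-2-medium-class model, and the block sizes (64 weights per quantization block, 256 constants per second-level block) are assumptions about the quantization layout. Activations, adapter weights, optimizer state, and framework overhead are deliberately left out.

```python
def weight_memory_gib(n_params: int, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


n_params = 355_000_000  # assumed rough size of a GPT-2-medium-class model

# Each block of weights stores one quantization constant (assumed block size).
block_size = 64
plain_overhead = 32 / block_size  # one fp32 constant per block, in bits per weight
# Double quantization: 8-bit constants, plus one fp32 constant per 256 of them (assumed).
double_overhead = 8 / block_size + 32 / (block_size * 256)

fp16 = weight_memory_gib(n_params, 16)
nf4 = weight_memory_gib(n_params, 4 + plain_overhead)
nf4_dq = weight_memory_gib(n_params, 4 + double_overhead)

print(f"fp16 weights:         {fp16:.3f} GiB")
print(f"4-bit weights:        {nf4:.3f} GiB")
print(f"4-bit + double quant: {nf4_dq:.3f} GiB")
```

On a model this small the absolute difference is modest; the same arithmetic matters far more at billions of parameters.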

In [None]:
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # Nested quantization
    bnb_4bit_quant_type="nf4",  # Normal float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16  # Computation dtype
)

# Load quantized model (using a larger model to show benefits)
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

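model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    num_labels=2,
    device_map="auto",  # Automatic device mapping
)
# GPT-2-style classifiers pool the last non-padding token, so the model config
# needs a pad token ID; without it, batches larger than 1 raise an error
model.config.pad_token_id = tokenizer.pad_token_id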

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# Configure LoRA for quantized model
qlora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "c_attn",  # Fused query/key/value projection
        "c_proj",  # Output projections (matches both the attention and MLP blocks)
    ],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

# Apply QLoRA
model = get_peft_model(model, qlora_config)

print(f"GPU Memory before training: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
print(f"Trainable parameters: {model.num_parameters(only_trainable=True):,}")

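# Re-tokenize the sample texts with this model's tokenizer;
# the earlier dataset holds DistilBERT token IDs
tokenized = tokenizer(texts, truncation=True, padding=True)
dataset = Dataset.from_dict({
    "input_ids": tokenized["input_ids"],
    "attention_mask": tokenized["attention_mask"],
    "labels": labels
})

# Training with QLoRA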
training_args = TrainingArguments(
    output_dir="./qlora-finetuned",
    num_train_epochs=2,
    per_device_train_batch_size=2,  # Smaller batch size
    gradient_accumulation_steps=4,  # Accumulate gradients
    learning_rate=2e-4,
    bf16=True,  # Mixed precision; matches bnb_4bit_compute_dtype (needs bf16-capable hardware)
    optim="adamw_8bit",  # 8-bit optimizer
    logging_steps=5,
    save_strategy="epoch",
    dataloader_pin_memory=False,  # Reduce memory usage
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
model.save_pretrained("./qlora-weights")
print("QLoRA fine-tuning completed!")

## 3. Custom Training Loop with Advanced Features

Implementing a custom training loop with gradient accumulation, learning rate scheduling, and mixed precision.
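
Gradient accumulation works because averaging the gradients of several equally sized micro-batches gives the same result as computing the gradient of one large batch. The sketch below checks this on a toy one-parameter regression in plain Python. The function `grad_mse` and the data are made up for illustration, and the division by `accumulation_steps` stands in for the loss scaling a training framework applies while accumulating.

```python
def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error for the model y = w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One large batch of 8 examples.
full_grad = grad_mse(w, xs, ys)

# Four micro-batches of 2 examples; each gradient is scaled by
# 1 / accumulation_steps before being added to the running total.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated += grad_mse(w, xs[chunk], ys[chunk]) / accumulation_steps

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

The two values agree up to floating-point rounding, which is why the optimizer only needs to step once per group of micro-batches.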

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated and missing from recent releases
from transformers import get_cosine_schedule_with_warmup
from accelerate import Accelerator
import wandb
from tqdm import tqdm

# Initialize accelerator for mixed precision and distributed training
accelerator = Accelerator(
    mixed_precision="fp16",  # or "bf16"
    gradient_accumulation_steps=4
)

# Initialize Weights & Biases for logging
if accelerator.is_main_process:
    wandb.init(
        project="advanced-fine-tuning",
        name="custom-training-loop",
        config={
            "learning_rate": 2e-5,
            "batch_size": 8,
            "epochs": 3,
            "warmup_steps": 100,
        }
    )

# Model setup (using standard model for simplicity)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Create larger dataset for demonstration
texts = texts * 20  # Repeat for more data
labels = labels * 20

tokenized = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
dataset = Dataset.from_dict({
    "input_ids": tokenized["input_ids"],
    "attention_mask": tokenized["attention_mask"],
    "labels": labels
})

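# DataLoader (with_format("torch") makes each example a dict of tensors,
# so the default collate function produces proper batch tensors)
dataloader = DataLoader(dataset.with_format("torch"), batch_size=8, shuffle=True)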

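# Optimizer with weight decay and a higher learning rate for the classifier.
# Each parameter must appear in exactly one group, so the classifier is
# excluded from the first two groups.
no_decay = ["bias", "LayerNorm.weight", "layer_norm.weight"]  # DistilBERT block norms are named *_layer_norm

def is_classifier(name):
    return name.startswith("classifier.")

optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters()
                   if not is_classifier(n) and not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
        "lr": 2e-5,
    },
    {
        "params": [p for n, p in model.named_parameters()
                   if not is_classifier(n) and any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
        "lr": 2e-5,
    },
    {
        "params": list(model.classifier.parameters()),
        "weight_decay": 0.01,
        "lr": 1e-4,  # Higher LR for classifier
    }
]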

optimizer = AdamW(optimizer_grouped_parameters, eps=1e-8)

# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(dataloader)
num_warmup_steps = min(100, num_training_steps // 10)

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

# Prepare for accelerated training
model, optimizer, dataloader, scheduler = accelerator.prepare(
    model, optimizer, dataloader, scheduler
)

# Custom training loop
model.train()
global_step = 0
total_loss = 0

for epoch in range(num_epochs):
    epoch_loss = 0
    progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    
    for step, batch in enumerate(progress_bar):
        # Forward pass
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss
            
            # Backward pass
            accelerator.backward(loss)
            
            # Gradient clipping (only on steps where the accumulated gradients are synchronized)
            if accelerator.sync_gradients:
                accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            # Optimizer step
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        
        # Logging
        current_loss = loss.detach().float().item()  # Plain Python float for logging
        epoch_loss += current_loss
        total_loss += current_loss
        global_step += 1
        
        # Update progress bar
        progress_bar.set_postfix({
            'loss': f'{current_loss:.4f}',
            'lr': f'{scheduler.get_last_lr()[0]:.2e}'
        })
        
        # Log to wandb
        if accelerator.is_main_process and global_step % 10 == 0:
            wandb.log({
                "train_loss": current_loss,
                "learning_rate": scheduler.get_last_lr()[0],
                "epoch": epoch,
                "step": global_step
            })
    
    avg_epoch_loss = epoch_loss / len(dataloader)
    accelerator.print(f"Epoch {epoch+1} average loss: {avg_epoch_loss:.4f}")
    
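    # Save checkpoint: synchronize all processes first, then write from the main process only.
    # Calling wait_for_everyone() inside the main-process branch would hang multi-process runs.
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(f"./custom-checkpoint-epoch-{epoch+1}")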

accelerator.print("Custom training completed!")
if accelerator.is_main_process:
    wandb.finish()

## 4. DeepSpeed Integration for Large Model Training

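DeepSpeed ZeRO reduces per-GPU memory by partitioning training state (optimizer states, gradients, and at the highest stage the parameters themselves) across devices, which makes larger models trainable on the same hardware.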
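
The effect of each ZeRO stage can be approximated with simple arithmetic. The sketch below assumes mixed-precision Adam, where each parameter costs about 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state (master weights, momentum, variance). The helper `zero_bytes_per_param` and the 1B-parameter model are hypothetical, and the estimate ignores activations, temporary buffers, fragmentation, and CPU offload.

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    """Approximate per-GPU bytes per parameter for mixed-precision Adam under ZeRO."""
    params, grads, optim = 2.0, 2.0, 12.0  # fp16 weights, fp16 grads, fp32 Adam state
    if stage >= 1:
        optim /= n_gpus   # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= n_gpus   # stage 2 also partitions gradients
    if stage >= 3:
        params /= n_gpus  # stage 3 also partitions the parameters
    return params + grads + optim


n_params = 1_000_000_000  # hypothetical 1B-parameter model
for stage in range(4):
    gib = zero_bytes_per_param(stage, n_gpus=8) * n_params / 1024**3
    print(f"ZeRO stage {stage}: ~{gib:.1f} GiB per GPU")
```

With a single GPU there is nothing to partition across, so the savings in this notebook's configuration come mainly from offloading optimizer state to CPU.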

In [None]:
# Create DeepSpeed configuration file
import json

deepspeed_config = {
    # "auto" lets the Trainer fill these in from TrainingArguments,
    # which keeps them consistent for any number of GPUs
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 2e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-5,
            "warmup_num_steps": 100
        }
    },
    "zero_optimization": {
        "stage": 2,  # ZeRO stage 2: partitions optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",  # Offload optimizer to CPU
            "pin_memory": True
        },
        "allgather_partitions": True,
        "allgather_bucket_size": 5e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": True
    },
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": False,
        "number_checkpoints": 4,
        "synchronize_checkpoint_boundary": False,
        "profile": False
    },
    "wall_clock_breakdown": False
}

# Save config
with open("deepspeed_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

print("DeepSpeed configuration saved to deepspeed_config.json")

# Training with DeepSpeed (using Trainer)
training_args = TrainingArguments(
    output_dir="./deepspeed-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,  # Must match the DeepSpeed optimizer config, or the Trainer reports a mismatch
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    deepspeed="deepspeed_config.json",  # DeepSpeed config
    fp16=True,
    dataloader_pin_memory=False,
    dataloader_num_workers=0,
    remove_unused_columns=False,
)

# Model (can be much larger with DeepSpeed)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

# Enable gradient checkpointing for memory savings
model.gradient_checkpointing_enable()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

print("Starting DeepSpeed training...")
trainer.train()

print("\nDeepSpeed training completed!")
print(f"Model saved to: {training_args.output_dir}")

# Example command to run with DeepSpeed from command line:
print("\nTo run with DeepSpeed from command line:")
print("deepspeed --num_gpus=1 train_script.py --deepspeed deepspeed_config.json")

## 5. Advanced Optimization Strategies

Implementing sophisticated optimization techniques including layer-wise learning rates and adaptive training strategies.
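
Layer-wise learning rate decay assigns the full base rate to the top transformer layer and multiplies it by a decay factor for each layer below. The sketch below prints the resulting rates for a 6-layer encoder, using the same defaults as the class in the next cell (`base_lr=2e-5`, `decay_rate=0.9`). The helper `layerwise_lrs` is a hypothetical stand-in for the per-layer formula, not part of any library.

```python
def layerwise_lrs(base_lr: float, decay_rate: float, num_layers: int) -> list[float]:
    """Learning rate per layer, index 0 being the layer closest to the embeddings."""
    return [base_lr * decay_rate ** (num_layers - 1 - i) for i in range(num_layers)]


lrs = layerwise_lrs(base_lr=2e-5, decay_rate=0.9, num_layers=6)
for i, lr in enumerate(lrs):
    print(f"layer {i}: lr = {lr:.2e}")
```

A decay rate closer to 1.0 flattens the schedule, while a smaller value keeps the lower layers nearer to their pretrained weights.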

In [None]:
import torch.nn as nn
from torch.optim import AdamW  # Used below; transformers.AdamW is deprecated
from transformers import get_polynomial_decay_schedule_with_warmup
import math

class LayerWiseLROptimizer:
    """Builds optimizer parameter groups with layer-wise learning rate decay"""
    
    def __init__(self, model, base_lr=2e-5, decay_rate=0.9):
        self.model = model
        self.base_lr = base_lr
        self.decay_rate = decay_rate
        
    def get_optimizer_groups(self):
        """Create optimizer groups with layer-wise learning rates"""
        # Get all transformer layers
        if hasattr(self.model, 'distilbert'):
            layers = self.model.distilbert.transformer.layer
            num_layers = len(layers)
        elif hasattr(self.model, 'bert'):
            layers = self.model.bert.encoder.layer
            num_layers = len(layers)
        else:
            # Fallback for other architectures
            return [{'params': self.model.parameters(), 'lr': self.base_lr}]
        
        optimizer_groups = []
        
        # Different learning rates for different layers
        for i, layer in enumerate(layers):
            # Lower layers get lower learning rates
            layer_lr = self.base_lr * (self.decay_rate ** (num_layers - i - 1))
            
            optimizer_groups.append({
                'params': layer.parameters(),
                'lr': layer_lr,
                'weight_decay': 0.01
            })
        
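        # Higher learning rate for the task head. DistilBERT's pre_classifier and
        # BERT's pooler belong here too; otherwise they would never be updated.
        backbone = self.model.distilbert if hasattr(self.model, 'distilbert') else self.model.bert
        head_params = list(self.model.classifier.parameters())
        if hasattr(self.model, 'pre_classifier'):
            head_params += list(self.model.pre_classifier.parameters())
        if getattr(backbone, 'pooler', None) is not None:
            head_params += list(backbone.pooler.parameters())
        optimizer_groups.append({
            'params': head_params,
            'lr': self.base_lr * 2,
            'weight_decay': 0.01
        })
        
        # Embeddings with lower learning rate
        optimizer_groups.append({
            'params': backbone.embeddings.parameters(),
            'lr': self.base_lr * 0.1,
            'weight_decay': 0.0
        })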
        
        return optimizer_groups

class AdaptiveTrainer:
    """Trainer with adaptive strategies"""
    
    def __init__(self, model, tokenizer, device='cuda'):
        self.model = model.to(device)
        self.tokenizer = tokenizer
        self.device = device
        self.best_loss = float('inf')
        self.patience_counter = 0
        self.lr_reduction_factor = 0.5
        self.patience = 3
        
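    def adaptive_batch_size(self, initial_batch_size=8, max_batch_size=32):
        """Find the largest batch size (up to max_batch_size) that fits in GPU memory"""
        batch_size = initial_batch_size
        best_batch_size = None  # Largest size that ran without an out-of-memory error
        
        while batch_size <= max_batch_size:
            try:
                # Test with dummy batch (forward pass only, so real training needs extra headroom)
                dummy_input = torch.randint(0, 1000, (batch_size, 128)).to(self.device)
                dummy_mask = torch.ones(batch_size, 128).to(self.device)
                
                with torch.no_grad():
                    _ = self.model(dummy_input, attention_mask=dummy_mask)
                
                best_batch_size = batch_size
                batch_size *= 2  # Try larger batch size
                
            except RuntimeError as e:
                if "out of memory" in str(e):
                    torch.cuda.empty_cache()
                    break
                else:
                    raise
        
        if best_batch_size is None:
            # Even the initial size did not fit; fall back to half of it
            best_batch_size = max(initial_batch_size // 2, 1)
        
        print(f"Optimal batch size: {best_batch_size}")
        return best_batch_size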
    
    def train_with_adaptive_strategies(self, dataset, num_epochs=3):
        """Train with multiple adaptive strategies"""
        
        # Setup layer-wise learning rates
        lr_optimizer = LayerWiseLROptimizer(self.model)
        optimizer_groups = lr_optimizer.get_optimizer_groups()
        optimizer = AdamW(optimizer_groups, eps=1e-8)
        
        # Adaptive batch size
        optimal_batch_size = self.adaptive_batch_size()
        dataloader = DataLoader(dataset, batch_size=optimal_batch_size, shuffle=True)
        
        # Polynomial decay scheduler
        num_training_steps = num_epochs * len(dataloader)
        scheduler = get_polynomial_decay_schedule_with_warmup(
            optimizer,
            num_warmup_steps=min(100, num_training_steps // 10),
            num_training_steps=num_training_steps,
            power=2.0
        )
        
        self.model.train()
        
        for epoch in range(num_epochs):
            total_loss = 0
            num_batches = 0
            
            progress_bar = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
            
            for batch in progress_bar:
                # Move batch to device
                batch = {k: v.to(self.device) for k, v in batch.items()}
                
                # Forward pass
                outputs = self.model(**batch)
                loss = outputs.loss
                
                # Backward pass
                loss.backward()
                
                # Gradient clipping at a fixed max norm; the returned pre-clip norm is logged below
                grad_norm = torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                
                # Update metrics
                current_loss = loss.item()
                total_loss += current_loss
                num_batches += 1
                
                # Update progress bar
                progress_bar.set_postfix({
                    'loss': f'{current_loss:.4f}',
                    'grad_norm': f'{grad_norm:.3f}',
                    'lr': f'{scheduler.get_last_lr()[0]:.2e}'
                })
            
            # Epoch statistics
            avg_loss = total_loss / num_batches
            print(f"Epoch {epoch+1} - Average Loss: {avg_loss:.4f}")
            
            # Adaptive learning rate reduction
            if avg_loss < self.best_loss:
                self.best_loss = avg_loss
                self.patience_counter = 0
                # Save best model
                torch.save(self.model.state_dict(), "best_model.pt")
            else:
                self.patience_counter += 1
                
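                if self.patience_counter >= self.patience:
                    print(f"Reducing learning rate by factor {self.lr_reduction_factor}")
                    # The scheduler recomputes every LR from base_lrs on each step,
                    # so scale base_lrs as well or the reduction is overwritten
                    scheduler.base_lrs = [lr * self.lr_reduction_factor for lr in scheduler.base_lrs]
                    for param_group in optimizer.param_groups:
                        param_group['lr'] *= self.lr_reduction_factor
                    self.patience_counter = 0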
            
            # Memory cleanup
            torch.cuda.empty_cache()
        
        print("Adaptive training completed!")
        return self.model

# Initialize and run adaptive training
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

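adaptive_trainer = AdaptiveTrainer(model, tokenizer)
# with_format("torch") returns tensors per example so the DataLoader can batch them
trained_model = adaptive_trainer.train_with_adaptive_strategies(dataset.with_format("torch"), num_epochs=2)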

print("\nAdvanced optimization training completed!")
print("Key features used:")
print("- Layer-wise learning rates")
print("- Adaptive batch sizing")
print("- Polynomial decay scheduling")
print("- Adaptive learning rate reduction")
print("- Gradient norm monitoring")
print("- Memory optimization")

## Summary

This notebook covered five advanced fine-tuning techniques:

1. **LoRA**: Parameter-efficient fine-tuning with low-rank adaptation
2. **QLoRA**: Memory-efficient training combining quantization and LoRA
3. **Custom Training Loop**: Full control with gradient accumulation and mixed precision
4. **DeepSpeed Integration**: Scalable training for large models with ZeRO optimization
5. **Advanced Optimization**: Layer-wise learning rates and adaptive strategies

Each technique addresses different challenges in fine-tuning:
- **Memory efficiency**: QLoRA, DeepSpeed ZeRO
- **Parameter efficiency**: LoRA, QLoRA
- **Training stability**: Custom loops with proper scheduling
- **Scalability**: DeepSpeed for multi-GPU training
- **Optimization**: Advanced learning rate strategies

Choose the appropriate technique based on your model size, available hardware, and training requirements.