# 04 - Fine-Tuning: Training with QLoRA

**Goal**: Fine-tune a language model on our Ethereum transaction dataset using QLoRA.

In this notebook, you'll learn:
- What QLoRA is and why it enables training on consumer GPUs
- How to configure 4-bit quantization and LoRA adapters
- How to load and prepare datasets for training
- How to execute training with live monitoring
- How to manage VRAM usage and checkpoints
- Google Colab compatibility tips

**Prerequisites**: Complete `03-dataset-preparation.ipynb` first so that the prepared datasets are available

**Hardware Requirements**: 
- GPU with 12-16GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB, T4, or better)
- Or Google Colab with T4 GPU (free tier)

## 🚀 Google Colab Setup (Optional)

**If running on Google Colab**, uncomment and run this cell first:

In [None]:
# # Google Colab Setup
# # Uncomment these lines if running on Colab
# 
# # Check GPU availability
# !nvidia-smi
# 
# # Clone repository (if not already done)
# # !git clone https://github.com/YOUR_USERNAME/eth-finetuning-cookbook.git
# # %cd eth-finetuning-cookbook
# 
# # Install dependencies
# !pip install -q -U transformers datasets peft accelerate bitsandbytes torch torchvision torchaudio
# 
# # Verify installation
# import torch
# print(f"PyTorch version: {torch.__version__}")
# print(f"CUDA available: {torch.cuda.is_available()}")
# print(f"CUDA version: {torch.version.cuda}")
# if torch.cuda.is_available():
#     print(f"GPU: {torch.cuda.get_device_name(0)}")
#     print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

import json
import os
import sys
import time
from pathlib import Path

import torch
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import project modules
from eth_finetuning.training.config import TrainingConfig
from eth_finetuning.training.trainer import (
    setup_model_and_tokenizer,
    create_trainer,
    format_prompts,
)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")
print(f"✓ Project root: {project_root}")
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")

## Understanding QLoRA

### What is QLoRA?

**QLoRA (Quantized Low-Rank Adaptation)** is a technique that makes it possible to fine-tune large language models (7B+ parameters) on consumer GPUs.

### The Challenge

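A 7B parameter model in full precision (32-bit) requires:
- **28GB VRAM** just to load the weights (7B parameters × 4 bytes)
- **Additional VRAM** for gradients, optimizer states, and activations
- **Total: well over 50GB VRAM**; with an Adam-style optimizer kept in 32-bit, weights, gradients, and optimizer states alone come to about 112GB 😱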
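
The arithmetic behind these figures can be reproduced in a few lines. This is a back-of-the-envelope sketch rather than a measurement: it assumes an Adam-style optimizer that keeps two extra values per parameter, ignores activations and framework overhead, and uses decimal gigabytes. The function names are illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


def full_finetune_state_gb(num_params: float, bytes_per_value: int = 4) -> float:
    """Weights + gradients + two Adam moment estimates at the same precision.

    Assumption: activations and framework overhead are not counted.
    """
    values_per_param = 4  # weight, gradient, Adam first and second moments
    return num_params * values_per_param * bytes_per_value / 1e9


num_params = 7e9  # a 7B parameter model
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(num_params, bits):.1f} GB")
print(f"Full fine-tuning state (32-bit, Adam): {full_finetune_state_gb(num_params):.0f} GB")
```

The real total depends heavily on the optimizer, the precision used for each tensor, batch size, and sequence length, so treat the output as an order-of-magnitude guide.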

### The Solution: QLoRA = Quantization + LoRA

**1. Quantization (4-bit)**
- Reduces model weights from 32-bit → 4-bit
- **Reduces weight memory: 28GB → ~3.5GB** (7B parameters × 0.5 bytes; slightly more in practice because of quantization constants and layers kept in higher precision) 🎉
- Minimal accuracy loss with NF4 (NormalFloat4)

**2. LoRA (Low-Rank Adaptation)**
- Freezes base model (no gradient updates)
- Adds small trainable adapter layers
- **Only ~1-2% of parameters are trainable**
- Adapter weights: ~0.5GB

**Result:**
- Base model (frozen, 4-bit): ~4GB
- LoRA adapters (trainable): ~0.5GB
- Activations & gradients: ~3-4GB (depends on batch size and sequence length)
- **Total: ~8-9GB VRAM, which leaves headroom on a 12GB GPU** ✅
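
To make the quantization step concrete, here is a toy 4-bit blockwise quantizer. It is a deliberate simplification: it uses 16 evenly spaced levels with one absolute-maximum scale per block, whereas NF4 spaces its 16 levels according to a normal distribution. The function names and sample weights are illustrative and not part of `bitsandbytes`.

```python
from typing import List, Tuple

# 16 evenly spaced code values in [-1, 1]; a 4-bit index selects one of them.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Scale the block by its absolute maximum, then pick the nearest level."""
    scale = max(abs(v) for v in block) or 1.0  # fall back to 1.0 for an all-zero block
    indices = [
        min(range(16), key=lambda i: abs(LEVELS[i] - v / scale)) for v in block
    ]
    return indices, scale


def dequantize_block(indices: List[int], scale: float) -> List[float]:
    """Recover approximate values from the 4-bit indices and the block scale."""
    return [LEVELS[i] * scale for i in indices]


weights = [0.12, -0.50, 0.33, 0.05, -0.27, 0.41, -0.08, 0.19]  # made-up values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("4-bit codes:", codes)
print(f"block scale: {scale}, max round-trip error: {max_error:.4f}")
```

Each weight is stored as a 4-bit index plus a shared per-block scale, which is where the roughly 8x reduction relative to 32-bit storage comes from.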

### Key Benefits

✓ Train on consumer GPUs with 12GB+ VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB, etc.)

✓ Fast training (only updating 1-2% of parameters)

✓ Small adapter files (~100MB vs 14GB for full model)

✓ Minimal quality loss vs full fine-tuning

## Checking GPU Status

In [None]:
print("GPU STATUS")
print("=" * 80)

if torch.cuda.is_available():
    print(f"\n✓ CUDA available")
    print(f"  GPU Model:      {torch.cuda.get_device_name(0)}")
    print(f"  Total VRAM:     {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"  CUDA Version:   {torch.version.cuda}")
    print(f"  PyTorch Device: {torch.cuda.current_device()}")
    
    # Check current VRAM usage
    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"\nCurrent VRAM Usage:")
    print(f"  Allocated:      {allocated:.2f} GB")
    print(f"  Reserved:       {reserved:.2f} GB")
    
    total_vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_vram >= 12:
        print(f"\n✓ GPU has sufficient VRAM for QLoRA fine-tuning")
    else:
        print(f"\n⚠️  GPU has less than 12GB VRAM")
        print(f"   Training may fail or require smaller batch sizes")
else:
    print("\n✗ CUDA not available")
    print("   GPU is required for training")
    print("   Consider using Google Colab with T4 GPU (free tier)")

## Loading Training Configuration

Our training configuration is stored in `configs/training_config.yaml` and includes:
- Base model selection
- Quantization settings (4-bit)
- LoRA hyperparameters
- Training hyperparameters
- Memory optimization settings

In [None]:
# Load configuration
config_path = project_root / "configs" / "training_config.yaml"
config = TrainingConfig.from_yaml(str(config_path))

print("TRAINING CONFIGURATION")
print("=" * 80)
print(f"\nBase Model: {config.model_name}")
print(f"\nQuantization Settings:")
print(f"  4-bit:          {config.load_in_4bit}")
print(f"  Compute dtype:  {config.bnb_4bit_compute_dtype}")
print(f"  Quant type:     {config.bnb_4bit_quant_type}")
print(f"\nLoRA Settings:")
print(f"  Rank (r):       {config.lora_r}")
print(f"  Alpha:          {config.lora_alpha}")
print(f"  Dropout:        {config.lora_dropout}")
print(f"  Target modules: {', '.join(config.target_modules)}")
print(f"\nTraining Settings:")
print(f"  Learning rate:  {config.learning_rate}")
print(f"  Batch size:     {config.per_device_train_batch_size}")
print(f"  Grad accum:     {config.gradient_accumulation_steps}")
print(f"  Epochs:         {config.num_train_epochs}")
print(f"  Max seq length: {config.max_seq_length}")
print(f"\nEffective batch size: {config.per_device_train_batch_size * config.gradient_accumulation_steps}")

### Understanding Key Hyperparameters

**LoRA Rank (r=16)**
- Controls adapter expressiveness
- Higher rank = more trainable parameters = more capacity to fit the data, but more VRAM and a higher risk of overfitting
- 8-16 is typical for 7B models

**LoRA Alpha (32)**
- Scaling factor for the adapter update (the update is multiplied by `alpha / r`)
- Typically 2x the rank
- A larger `alpha / r` ratio behaves much like a larger learning rate for the adapters
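
To put numbers on rank and alpha: LoRA replaces the update to a weight matrix of shape `d_out × d_in` with the product of two small matrices, `B` (`d_out × r`) and `A` (`r × d_in`), scaled by `alpha / r`. The sketch below counts parameters for a single layer. The 4096 × 4096 layer shape is an assumed example, not taken from this project's model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / r


d_in = d_out = 4096  # assumed layer shape, for illustration only
r, alpha = 16, 32    # the rank and alpha discussed above

full_params = d_in * d_out
adapter_params = lora_param_count(d_in, d_out, r)

print(f"Full layer:   {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({100 * adapter_params / full_params:.2f}% of the layer)")
print(f"Update scaling (alpha / r): {lora_scaling(alpha, r)}")
```

Doubling the rank doubles the adapter size, while keeping alpha at twice the rank keeps the scaling factor constant.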

**Gradient Accumulation (16 steps)**
- Simulates larger batch sizes
- Effective batch size = 1 × 16 = 16
- Trades time for memory
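
The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below fits `y = w * x` with a mean squared error loss and compares the gradient from one batch of 16 examples with 16 single-example gradients that are each divided by the number of accumulation steps. It is plain Python for illustration only; the `Trainer` handles this internally.

```python
import math
from typing import List


def grad_mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(1, 17)]  # 16 toy inputs
ys = [3.0 * x for x in xs]             # targets generated with w = 3
w = 0.5                                # current parameter value

# One large batch of 16 examples.
full_batch_grad = grad_mse(w, xs, ys)

# Sixteen micro-batches of size 1, accumulated before a single optimizer step.
accum_steps = 16
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    # Each micro-batch gradient is scaled by 1 / accum_steps so the sum is a mean.
    accumulated_grad += grad_mse(w, [x], [y]) / accum_steps

print(f"Full batch gradient:  {full_batch_grad}")
print(f"Accumulated gradient: {accumulated_grad}")
print("Match:", math.isclose(full_batch_grad, accumulated_grad))
```

Only one micro-batch of activations is held in memory at a time, which is why the approach costs time instead of VRAM.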

**Learning Rate (2e-4)**
- Higher than full fine-tuning (typically 1e-5)
- LoRA adapters can handle higher rates
- Speeds up convergence

## Loading Dataset

Let's load the prepared dataset from the previous notebook.

In [None]:
print("LOADING DATASET")
print("=" * 80)

# Define dataset paths
dataset_dir = project_root / "data" / "datasets"

# Check if datasets exist
train_file = dataset_dir / "train.jsonl"
val_file = dataset_dir / "validation.jsonl"
test_file = dataset_dir / "test.jsonl"

if not all([train_file.exists(), val_file.exists(), test_file.exists()]):
    print("\n⚠️  Dataset files not found!")
    print(f"   Expected location: {dataset_dir}")
    print("   Please run notebook 03-dataset-preparation.ipynb first")
else:
    # Load dataset with HuggingFace
    dataset = load_dataset(
        'json',
        data_files={
            'train': str(train_file),
            'validation': str(val_file),
            'test': str(test_file),
        }
    )
    
    print("\n✓ Dataset loaded successfully")
    print(f"\nDataset splits:")
    print(f"  Train:      {len(dataset['train']):4d} examples")
    print(f"  Validation: {len(dataset['validation']):4d} examples")
    print(f"  Test:       {len(dataset['test']):4d} examples")
    
    # Show sample
    print("\nSample training example:")
    sample = dataset['train'][0]
    print(f"\nInstruction: {sample['instruction'][:80]}...")
    print(f"Input:       {sample['input'][:80]}...")
    print(f"Output:      {sample['output'][:80]}...")

## Setting Up Model and Tokenizer

This is where the QLoRA magic happens:
1. Load model with 4-bit quantization
2. Prepare model for k-bit training
3. Add LoRA adapters
4. Enable gradient checkpointing

**Note**: Model download may take a few minutes on first run (~3-4GB).
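
For orientation, the four steps above typically look like the following when written directly against `transformers`, `peft`, and `bitsandbytes`. This is a sketch of what a helper like `setup_model_and_tokenizer` might do, not the project's actual implementation. The name `build_qlora_model` and its parameters are hypothetical, and the heavy imports sit inside the function so the definition can be read and imported without a GPU.

```python
from typing import List


def build_qlora_model(
    model_name: str,
    target_modules: List[str],
    lora_r: int = 16,
    lora_alpha: int = 32,
    lora_dropout: float = 0.0,
):
    """Load a causal LM in 4-bit and attach LoRA adapters (hypothetical helper)."""
    # Imported lazily: these packages expect a CUDA-capable environment.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Step 1: load the base model with 4-bit NF4 quantization.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Common fallback when the tokenizer ships without a padding token.
        tokenizer.pad_token = tokenizer.eos_token

    # Steps 2 and 4: prepare for k-bit training and enable gradient checkpointing.
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

    # Step 3: add LoRA adapters to the chosen modules.
    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        target_modules=target_modules,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    return model, tokenizer
```

The module names to pass as `target_modules` depend on the model architecture, so they are left as a required argument here rather than guessed.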

In [None]:
print("LOADING BASE MODEL WITH QUANTIZATION")
print("=" * 80)
print("\nThis may take a few minutes on first run...")
print("(Model will be cached for future use)\n")

# Track initial VRAM
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    initial_vram = torch.cuda.memory_allocated(0) / 1024**3
    print(f"Initial VRAM: {initial_vram:.2f} GB\n")

# Load model and tokenizer
start_time = time.time()
model, tokenizer = setup_model_and_tokenizer(config)
load_time = time.time() - start_time

print(f"\n✓ Model and tokenizer loaded in {load_time:.1f} seconds")

# Check VRAM usage after loading
if torch.cuda.is_available():
    model_vram = torch.cuda.memory_allocated(0) / 1024**3
    print(f"\nVRAM after loading:")
    print(f"  Total allocated: {model_vram:.2f} GB")
    print(f"  Model size:      ~{model_vram - initial_vram:.2f} GB")
    
    if model_vram < 10:
        print(f"\n✓ Model fits comfortably in VRAM (< 10GB)")
    else:
        print(f"\n⚠️  Model using {model_vram:.2f} GB")

# Show trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
trainable_pct = 100 * trainable_params / total_params

print(f"\nModel Statistics:")
print(f"  Total parameters:     {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Trainable percentage: {trainable_pct:.2f}%")
print(f"\n✓ Only {trainable_pct:.2f}% of parameters will be updated (QLoRA efficiency!)")

## Preparing Dataset for Training

We need to:
1. Format examples as prompts
2. Tokenize with proper padding/truncation
3. Add labels for causal language modeling
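
As a rough illustration of these three steps, the sketch below uses a toy whitespace tokenizer so it runs without any model. The prompt template, the masking of prompt and padding positions with `-100`, the sample record, and all function names are assumptions for illustration; the project's `format_prompts` may differ. (`-100` is the label value that PyTorch's cross-entropy loss ignores by default.)

```python
from typing import Dict, List

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss
PAD_ID = 0           # toy padding id; real tokenizers define their own


def build_prompt(example: Dict[str, str]) -> str:
    """Step 1: format a record as a prompt (hypothetical template)."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n"
    )


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Whitespace tokenizer that assigns ids on first sight (ids start at 1)."""
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]


def encode_example(example: Dict[str, str], vocab: Dict[str, int], max_len: int):
    prompt_ids = toy_tokenize(build_prompt(example), vocab)
    answer_ids = toy_tokenize(example["output"], vocab)

    # Step 2: truncate to max_len, then pad up to max_len.
    input_ids = (prompt_ids + answer_ids)[:max_len]
    # Step 3: labels mirror input_ids, but prompt tokens are masked out.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_len]

    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


example = {  # made-up record with the same keys as the dataset
    "instruction": "Classify the transaction",
    "input": "swap 1 ETH for USDC",
    "output": "swap",
}
encoded = encode_example(example, vocab={}, max_len=20)
print(encoded["labels"])
```

Labels are aligned position by position with `input_ids`; Hugging Face causal language models shift them by one token internally when computing the loss.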

In [None]:
print("PREPARING DATASET FOR TRAINING")
print("=" * 80)

# Format and tokenize datasets
print("\nFormatting and tokenizing...")

# Use our dataset formatting function
train_dataset = format_prompts(dataset['train'], tokenizer, config)
eval_dataset = format_prompts(dataset['validation'], tokenizer, config)

print(f"\n✓ Datasets prepared")
print(f"  Train: {len(train_dataset)} examples")
print(f"  Eval:  {len(eval_dataset)} examples")

# Show tokenization statistics
sample_lengths = [len(train_dataset[i]['input_ids']) for i in range(min(100, len(train_dataset)))]
print(f"\nToken length statistics (first 100 examples):")
print(f"  Mean:   {sum(sample_lengths) / len(sample_lengths):.0f} tokens")
print(f"  Min:    {min(sample_lengths)} tokens")
print(f"  Max:    {max(sample_lengths)} tokens")
print(f"  Limit:  {config.max_seq_length} tokens")

if max(sample_lengths) <= config.max_seq_length:
    print(f"\n✓ All samples fit within max sequence length")
else:
    truncated = sum(1 for l in sample_lengths if l > config.max_seq_length)
    print(f"\n⚠️  {truncated} sample(s) will be truncated")

## Creating Trainer

The HuggingFace `Trainer` handles:
- Training loop
- Gradient accumulation
- Mixed precision (FP16/BF16)
- Checkpointing
- Evaluation
- Logging
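
A helper such as `create_trainer` usually amounts to building `TrainingArguments` and passing them to `Trainer`. The sketch below is an assumption about its general shape, not the project's code: `build_trainer` is a hypothetical name, and it assumes the datasets are already tokenized, padded to a fixed length, and contain `labels`. The argument that schedules periodic evaluation is left out because its name differs between `transformers` versions.

```python
def build_trainer(model, train_dataset, eval_dataset, config, output_dir: str):
    """Wire a model and datasets into a Hugging Face Trainer (hypothetical helper).

    `config` is expected to expose the same attribute names that this
    notebook prints from its training configuration.
    """
    # Imported lazily so the definition can be loaded without transformers installed.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=config.per_device_train_batch_size,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        learning_rate=config.learning_rate,
        num_train_epochs=config.num_train_epochs,
        warmup_steps=config.warmup_steps,
        logging_steps=config.logging_steps,      # how often loss is logged
        save_steps=config.save_steps,            # how often checkpoints are written
        fp16=config.fp16,                        # mixed precision
        gradient_checkpointing=config.gradient_checkpointing,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
```

Keeping every hyperparameter in `config` means the notebook and the CLI script can share one YAML file instead of duplicating values.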

In [None]:
print("CREATING TRAINER")
print("=" * 80)

# Create output directory
output_dir = project_root / "models" / "fine-tuned" / "eth-intent-notebook"
output_dir.mkdir(parents=True, exist_ok=True)

# Create trainer
trainer = create_trainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    config=config,
    output_dir=str(output_dir),
)

print("\n✓ Trainer created successfully")
print(f"\nTraining arguments:")
print(f"  Output directory:  {output_dir}")
print(f"  Logging steps:     {config.logging_steps}")
print(f"  Eval steps:        {config.eval_steps}")
print(f"  Save steps:        {config.save_steps}")
print(f"  Gradient checkpt:  {config.gradient_checkpointing}")
print(f"  FP16:              {config.fp16}")

## Training Configuration Summary

Let's review what will happen during training:

In [None]:
# Calculate training statistics
num_examples = len(train_dataset)
batch_size = config.per_device_train_batch_size
grad_accum = config.gradient_accumulation_steps
effective_batch = batch_size * grad_accum
# At least one optimizer step per epoch, even for very small datasets
steps_per_epoch = max(1, num_examples // effective_batch)
total_steps = int(steps_per_epoch * config.num_train_epochs)

print("TRAINING PLAN")
print("=" * 80)
print(f"\nDataset:")
print(f"  Training examples:     {num_examples}")
print(f"  Evaluation examples:   {len(eval_dataset)}")
print(f"\nBatch Configuration:")
print(f"  Per-device batch size: {batch_size}")
print(f"  Gradient accumulation: {grad_accum}")
print(f"  Effective batch size:  {effective_batch}")
print(f"\nTraining Schedule:")
print(f"  Epochs:                {config.num_train_epochs}")
print(f"  Steps per epoch:       ~{steps_per_epoch}")
print(f"  Total training steps:  ~{total_steps}")
print(f"  Warmup steps:          {config.warmup_steps}")
print(f"\nCheckpointing:")
print(f"  Save every:            {config.save_steps} steps")
print(f"  Eval every:            {config.eval_steps} steps")
print(f"  Log every:             {config.logging_steps} steps")

# Estimate time (very rough)
if num_examples > 0:
    # Rough estimate: 1-2 seconds per micro-batch on a T4; each optimizer step
    # processes `grad_accum` micro-batches
    estimated_time_min = total_steps * grad_accum * 1.0 / 60  # optimistic
    estimated_time_max = total_steps * grad_accum * 2.0 / 60  # conservative
    print(f"\nEstimated Training Time:")
    print(f"  Optimistic:  {estimated_time_min:.0f} minutes ({estimated_time_min/60:.1f} hours)")
    print(f"  Conservative: {estimated_time_max:.0f} minutes ({estimated_time_max/60:.1f} hours)")
    print(f"\n  (Actual time depends on GPU, data complexity, and system load)")

## 🚀 Starting Training

**Important Notes:**

- Training will take **significant time** (potentially hours)
- You can monitor progress with the progress bar
- Loss should decrease over time
- **You can interrupt training** with Kernel → Interrupt
- Checkpoints are saved every `save_steps` steps
- You can resume later from the last checkpoint with `trainer.train(resume_from_checkpoint=True)`
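
Checkpoints are written as `checkpoint-<step>` directories inside the output directory. The sketch below shows one way to locate the most recent one before resuming; `find_latest_checkpoint` is an illustrative helper, not part of the project.

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, if any."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            # Compare by numeric step so checkpoint-100 ranks above checkpoint-50.
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Usage in this notebook (assumes `trainer` and `output_dir` already exist):
# latest = find_latest_checkpoint(str(output_dir))
# if latest is not None:
#     trainer.train(resume_from_checkpoint=str(latest))
```

Calling `trainer.train()` without `resume_from_checkpoint` starts again from step 0, so pass either a checkpoint path or `True` when continuing an interrupted run.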

**For production training**, use the CLI script instead:
```bash
python scripts/training/train_model.py \
    --dataset data/datasets \
    --output models/fine-tuned/eth-intent-v1 \
    --config configs/training_config.yaml
```

In [None]:
print("\n" + "=" * 80)
print("STARTING TRAINING")
print("=" * 80)
print("\n⚠️  This will take significant time!")
print("   - Monitor VRAM usage with nvidia-smi")
print("   - Watch for decreasing loss")
print("   - You can interrupt and resume from checkpoint\n")

# Check VRAM before training
if torch.cuda.is_available():
    pre_train_vram = torch.cuda.memory_allocated(0) / 1024**3
    print(f"VRAM before training: {pre_train_vram:.2f} GB\n")

# Start training
start_time = time.time()

try:
    # Train the model
    train_result = trainer.train()
    
    training_time = time.time() - start_time
    
    print("\n" + "=" * 80)
    print("✓ TRAINING COMPLETE")
    print("=" * 80)
    print(f"\nTraining time: {training_time:.1f} seconds ({training_time/60:.1f} minutes)")
    print(f"\nFinal metrics:")
    print(f"  Train loss:    {train_result.training_loss:.4f}")
    print(f"  Steps:         {train_result.global_step}")
    
    # Check final VRAM
    if torch.cuda.is_available():
        final_vram = torch.cuda.memory_allocated(0) / 1024**3
        peak_vram = torch.cuda.max_memory_allocated(0) / 1024**3
        print(f"\nVRAM usage:")
        print(f"  Current:       {final_vram:.2f} GB")
        print(f"  Peak:          {peak_vram:.2f} GB")
        
        if peak_vram < 12:
            print(f"\n✓ Training completed within 12GB VRAM constraint")
    
except KeyboardInterrupt:
    print("\n⚠️  Training interrupted by user")
    print(f"   Training time so far: {(time.time() - start_time)/60:.1f} minutes")
    print(f"   Checkpoints written so far are in: {output_dir}")
    print(f"   To resume, call trainer.train(resume_from_checkpoint=True)")
    
except Exception as e:
    print(f"\n✗ Training failed with error: {e}")
    print(f"   Check VRAM usage and configuration")
    raise

## Saving the Fine-Tuned Model

Let's save the trained adapter and tokenizer.

In [None]:
print("SAVING FINE-TUNED MODEL")
print("=" * 80)

# Save the adapter
final_output_dir = output_dir / "final"
trainer.model.save_pretrained(str(final_output_dir))
tokenizer.save_pretrained(str(final_output_dir))

print(f"\n✓ Model saved to: {final_output_dir}")
print(f"\nSaved files:")
for file in final_output_dir.iterdir():
    file_size = file.stat().st_size / 1024**2  # MB
    print(f"  {file.name:30s} ({file_size:6.1f} MB)")

print(f"\n✓ Fine-tuning complete!")
print(f"\nAdapter files are small (~100MB) compared to full model (~14GB)")
print(f"This is the power of LoRA - only the adapter weights are saved!")

## Visualizing Training Metrics

Let's load and visualize the training logs.

In [None]:
# Load training logs
log_history = trainer.state.log_history

# Extract training loss
train_logs = [log for log in log_history if 'loss' in log]
eval_logs = [log for log in log_history if 'eval_loss' in log]

if train_logs:
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Training Loss
    steps = [log['step'] for log in train_logs]
    losses = [log['loss'] for log in train_logs]
    
    axes[0].plot(steps, losses, 'b-', linewidth=2, label='Training Loss')
    axes[0].set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('Step')
    axes[0].set_ylabel('Loss')
    axes[0].grid(True, alpha=0.3)
    axes[0].legend()
    
    # Plot 2: Training vs Evaluation Loss
    if eval_logs:
        eval_steps = [log['step'] for log in eval_logs]
        eval_losses = [log['eval_loss'] for log in eval_logs]
        
        axes[1].plot(steps, losses, 'b-', linewidth=2, label='Training Loss', alpha=0.7)
        axes[1].plot(eval_steps, eval_losses, 'r-', linewidth=2, marker='o', label='Eval Loss')
        axes[1].set_title('Training vs Evaluation Loss', fontsize=14, fontweight='bold')
        axes[1].set_xlabel('Step')
        axes[1].set_ylabel('Loss')
        axes[1].grid(True, alpha=0.3)
        axes[1].legend()
    else:
        axes[1].text(0.5, 0.5, 'No evaluation logs available', 
                    ha='center', va='center', transform=axes[1].transAxes)
        axes[1].set_title('Evaluation Loss', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nTraining Statistics:")
    print(f"  Initial loss:  {losses[0]:.4f}")
    print(f"  Final loss:    {losses[-1]:.4f}")
    print(f"  Improvement:   {(losses[0] - losses[-1])/losses[0]*100:.1f}%")
    print(f"  Total steps:   {steps[-1]}")
    
    if eval_logs:
        print(f"\nEvaluation Statistics:")
        print(f"  Final eval loss: {eval_losses[-1]:.4f}")
        print(f"  Evaluations:     {len(eval_logs)}")
else:
    print("No training logs available")

## Key Takeaways

✓ **QLoRA enables fine-tuning 7B models on consumer GPUs** (12GB VRAM)

✓ **4-bit quantization** reduces weight memory from 28GB (32-bit) → ~3.5GB

✓ **LoRA adapters** add minimal parameters (~1-2%) while maintaining quality

✓ **Gradient accumulation** simulates larger batch sizes within VRAM constraints

✓ **Checkpointing** allows training to be interrupted and resumed

✓ **Adapter-only saving** means small file sizes (~100MB vs 14GB)

## Troubleshooting Tips

**Issue**: CUDA out of memory
- **Solution**: Reduce `per_device_train_batch_size` to 1, or reduce `max_seq_length`

**Issue**: Training is very slow
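- **Solution**: Verify the GPU is actually being used (`torch.cuda.is_available()`, `nvidia-smi`); keep in mind that gradient checkpointing and a large `gradient_accumulation_steps` both trade speed for memory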

**Issue**: Loss not decreasing
- **Solution**: Check learning rate, verify dataset quality, increase training steps

**Issue**: Evaluation loss increasing
- **Solution**: Overfitting - reduce epochs, increase LoRA dropout, or get more data

**Issue**: Model download fails
- **Solution**: Check internet connection, HuggingFace API status, or use cached model
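
For the rising evaluation loss case, it helps to find the step where evaluation loss was lowest, since later checkpoints are likely overfit. The sketch below works on a list shaped like `trainer.state.log_history` (dicts that contain `step` and, for evaluation entries, `eval_loss`). The helper name and the sample values are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def best_eval_step(log_history: List[Dict]) -> Optional[Tuple[int, float]]:
    """Return (step, eval_loss) for the evaluation with the lowest loss."""
    eval_logs = [log for log in log_history if "eval_loss" in log]
    if not eval_logs:
        return None
    best = min(eval_logs, key=lambda log: log["eval_loss"])
    return best["step"], best["eval_loss"]


# Made-up log entries: training loss keeps falling while eval loss turns upward.
sample_history = [
    {"step": 50, "loss": 1.90},
    {"step": 100, "eval_loss": 1.40},
    {"step": 150, "loss": 1.10},
    {"step": 200, "eval_loss": 1.15},
    {"step": 250, "loss": 0.70},
    {"step": 300, "eval_loss": 1.30},
]

result = best_eval_step(sample_history)
if result is not None:
    step, loss = result
    print(f"Lowest eval loss {loss:.2f} at step {step}")
```

If a checkpoint was saved near that step, it is usually a better candidate for evaluation than the final one.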

## Next Steps

In the next notebook (**05-evaluation.ipynb**), we'll learn how to:
- Load the fine-tuned model for inference
- Run evaluation on the test set
- Calculate accuracy metrics
- Visualize results with confusion matrices
- Analyze model predictions

---

**Ready to continue?** → `notebooks/05-evaluation.ipynb`