# CPU-Optimized LoRA Fine-tuning Guide

This notebook demonstrates how to fine-tune Phi-4-mini-instruct using LoRA on **CPU-only Azure ML compute** when GPU access is restricted.

## Available CPU VMs

| VM | Cores | RAM | Storage | Cost/hr | Best For |
|----|-------|-----|---------|---------|----------|
| **Standard_E4ds_v4** ✅ | 4 | 32GB | 150GB | $0.29 | **Training (Recommended)** |
| Standard_DS3_v2 | 4 | 14GB | 28GB | - | Small tests only |
| Standard_DS11_v2 | 2 | 14GB | 28GB | - | Development only |

**Recommendation:** Use **Standard_E4ds_v4** for actual training: its 32GB of RAM leaves headroom above the ~16GB the CPU run is expected to use.

## Key Optimizations

- ✅ No quantization (the GPU configuration's 4-bit quantization is not used on CPU)
- ✅ Smaller LoRA rank (8 vs 16)
- ✅ Reduced batch size (1) with gradient accumulation (16)
- ✅ Limited dataset (1000 samples vs 15K)
- ✅ Shorter sequences (256 vs 512 tokens)
- ✅ 2 data loading workers for parallel processing
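
To illustrate why a batch size of 1 with 16 accumulation steps behaves like an effective batch of 16, here is a minimal pure-Python sketch. The one-parameter model, the toy data, and the function names are hypothetical and are not taken from `src/train_cpu.py`; the sketch only shows that summing scaled per-sample gradients over 16 micro-batches reproduces the full-batch mean gradient.

```python
# Sketch: gradient accumulation on a toy model y = w * x with squared-error loss.
# All names and data are illustrative assumptions, not the real training code.

def grad_single(w, x, y):
    """Gradient of (w*x - y)**2 with respect to w for one sample."""
    return 2.0 * (w * x - y) * x

def full_batch_grad(w, batch):
    """Mean gradient over the whole batch (one large-batch step)."""
    return sum(grad_single(w, x, y) for x, y in batch) / len(batch)

def accumulated_grad(w, batch, accumulation_steps):
    """Gradient accumulated from micro-batches of size 1.

    Each micro-batch gradient is scaled by 1/accumulation_steps before it is
    added, so the final sum is the mean over all accumulated samples.
    """
    total = 0.0
    for x, y in batch[:accumulation_steps]:
        total += grad_single(w, x, y) / accumulation_steps
    return total

# 16 toy samples generated from y = 3x
samples = [(float(i), 3.0 * i) for i in range(1, 17)]
w = 0.5

g_full = full_batch_grad(w, samples)
g_accum = accumulated_grad(w, samples, accumulation_steps=16)
print(f"full-batch gradient:  {g_full:.4f}")
print(f"accumulated gradient: {g_accum:.4f}")
```

The two values agree up to floating-point rounding. Accumulation trades wall-clock time for memory: only one sample's activations are held at a time.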

## 1. Configuration Comparison: GPU vs CPU

In [None]:
import pandas as pd

# Configuration comparison
config_comparison = {
    'Setting': ['Compute', 'LoRA Rank', 'LoRA Alpha', 'Batch Size', 'Gradient Accumulation', 
                'Effective Batch', 'Max Seq Length', 'Dataset Samples', 'Quantization', 
                'Training Time', 'Memory Usage', 'Cost per Run'],
    'GPU (Standard_NC6s_v3)': ['V100 GPU', '16', '32', '4', '4', '16', '512', '15,000', 
                               '4-bit', '1-2 hours', '~5GB GPU', '$0.90-$1.80'],
    'CPU (Standard_E4ds_v4)': ['4 cores, 32GB', '8', '16', '1', '16', '16', '256', '1,000', 
                               'None', '2-4 hours', '~16GB RAM', '$0.58-$1.16']
}

df = pd.DataFrame(config_comparison)
print("GPU vs CPU Configuration Comparison:")
print("=" * 100)
print(df.to_string(index=False))
print("=" * 100)

## 2. Submit CPU Training Job to Azure ML

In [None]:
# Submit CPU training job - easiest method
import subprocess
import sys

print("Submitting CPU training job to Azure ML...")
print("=" * 80)
print("Using Standard_E4ds_v4: 4 cores, 32GB RAM, $0.29/hr")
print("Training 1000 samples from Databricks Dolly 15K dataset")
print("Expected time: 2-4 hours")
print("=" * 80)

# Run the submission script
result = subprocess.run(
    [sys.executable, "../jobs/submit_training_job_cpu.py"],
    capture_output=True,
    text=True
)

print(result.stdout)
if result.returncode != 0:
    print("Error:", result.stderr)
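
A failed submission is easy to miss when only stdout is printed. The following is a minimal sketch of a wrapper that returns the exit code together with both output streams. `run_script` is a hypothetical helper name, and the demo runs inline Python commands rather than the real `submit_training_job_cpu.py`.

```python
import subprocess
import sys

def run_script(args):
    """Run the current Python interpreter with the given arguments.

    Returns (returncode, stdout, stderr) so the caller decides how to react.
    """
    result = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr

# Demo with inline commands standing in for the submission script.
ok_code, ok_out, _ = run_script(["-c", "print('submitted')"])
bad_code, _, bad_err = run_script(["-c", "import sys; sys.exit('submission failed')"])

print("success exit code:", ok_code)
print("failure exit code:", bad_code)
print("failure stderr:", bad_err.strip())
```

In the notebook, the same helper could be called with `["../jobs/submit_training_job_cpu.py"]`, and a non-zero exit code could be turned into an exception so later cells do not run against a job that was never submitted.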

## 3. Local CPU Training (Optional)

In [None]:
# Run training locally on CPU (for testing only)
# WARNING: This will download the model (~7GB) and can take 2-4 hours

train_locally = False  # Set to True to actually run

if train_locally:
    import subprocess
    import sys
    
    print("Starting local CPU training...")
    print("This will take 2-4 hours. Monitor progress below.")
    
    result = subprocess.run([
        sys.executable, "../src/train_cpu.py",
        "--output_dir", "./outputs_cpu_local",
        "--max_samples", "100",  # Small subset for testing
        "--num_epochs", "1"
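    ])
    # Surface failures instead of silently ignoring the exit code
    if result.returncode != 0:
        print("Local training failed with exit code", result.returncode)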
else:
    print("Local training disabled. Set train_locally=True to run.")
    print("For production training, use Azure ML (previous cell).")

## 4. Cost Analysis

In [None]:
# Cost comparison for training runs
import pandas as pd  # re-imported so this cell also runs without cell 1

cost_data = {
    'VM Type': ['Standard_E4ds_v4 (CPU)', 'Standard_NC6s_v3 (GPU)'],
    'Cost per Hour': ['$0.29', '$0.90'],
    'Training Time (1K samples)': ['2-4 hours', 'N/A'],
    'Training Time (15K samples)': ['N/A (too slow)', '1-2 hours'],
    'Cost per Run (1K)': ['$0.58 - $1.16', 'N/A'],
    'Cost per Run (15K)': ['Not recommended', '$0.90 - $1.80']
}

df_cost = pd.DataFrame(cost_data)
print("Cost Comparison:")
print("=" * 80)
print(df_cost.to_string(index=False))
print("=" * 80)
print("\n💡 Tip: CPU is cost-effective for small datasets (1K samples)")
print("💡 Tip: For 15K samples, GPU is faster and more cost-efficient")
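
The per-run figures in the table are the hourly rate multiplied by the expected training time. A small sketch of that arithmetic, using only the rates and durations quoted in this notebook; `cost_range` is a hypothetical helper name. It also divides the GPU run cost by 15 to get a rough cost per 1K samples, which is what makes the GPU more cost-efficient on the full dataset.

```python
def cost_range(rate_per_hour, min_hours, max_hours):
    """Return (low, high) run cost in dollars, rounded to cents."""
    return (round(rate_per_hour * min_hours, 2),
            round(rate_per_hour * max_hours, 2))

cpu_low, cpu_high = cost_range(0.29, 2, 4)  # Standard_E4ds_v4, 1K samples
gpu_low, gpu_high = cost_range(0.90, 1, 2)  # Standard_NC6s_v3, 15K samples

# Rough cost per 1K samples on the GPU run (15K samples in total)
gpu_per_1k = (round(gpu_low / 15, 2), round(gpu_high / 15, 2))

print(f"CPU run (1K samples):  ${cpu_low:.2f} - ${cpu_high:.2f}")
print(f"GPU run (15K samples): ${gpu_low:.2f} - ${gpu_high:.2f}")
print(f"GPU cost per 1K samples: ${gpu_per_1k[0]:.2f} - ${gpu_per_1k[1]:.2f}")
```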

## 5. View CPU Training Configuration

In [None]:
import yaml

# Load CPU configuration
with open('../config/training_config_cpu.yaml', 'r') as f:
    cpu_config = yaml.safe_load(f)

print("CPU Training Configuration:")
print("=" * 80)
print(yaml.dump(cpu_config, default_flow_style=False))
print("=" * 80)
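
The troubleshooting section refers to settings by dotted names such as `data.max_samples` and `lora.r`. Assuming the YAML file nests its keys that way, a sketch of a lookup helper might look like the following. `get_setting` is a hypothetical name, and `example_config` is a stand-in built from the values in this notebook's tables, not the actual contents of `training_config_cpu.yaml`.

```python
def get_setting(config, dotted_key, default=None):
    """Look up a nested value such as 'data.max_samples' in a dict."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Hypothetical stand-in for the dict returned by yaml.safe_load(...)
example_config = {
    "data": {"max_samples": 1000},
    "training": {"max_seq_length": 256},
    "lora": {"r": 8},
}

for key in ("data.max_samples", "training.max_seq_length", "lora.r", "lora.dropout"):
    value = get_setting(example_config, key, default="<not set>")
    print(f"{key}: {value}")
```

Passing `cpu_config` instead of `example_config` would check the real file. A missing key returns the default rather than raising, which is convenient when comparing configs that do not share every field.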

## 6. Memory and Performance Tips

In [None]:
import pandas as pd  # re-imported so this cell also runs on its own

tips = {
    'Issue': [
        'Out of Memory',
        'Training Too Slow',
        'Model Too Large',
        'Dataset Too Large',
        'VM Selection'
    ],
    'Solution': [
        'Reduce max_samples to 500 or max_seq_length to 128',
        'Use max_samples=100 for testing, 1000 for production',
        'Use LoRA rank=4 instead of 8 (fewer trainable params)',
        'Use max_samples to limit dataset size',
        'Always use Standard_E4ds_v4 (32GB RAM), not DS3_v2 (14GB)'
    ],
    'Config Parameter': [
        'data.max_samples, training.max_seq_length',
        'data.max_samples',
        'lora.r',
        'data.max_samples',
        'compute_config_cpu.yaml'
    ]
}

df_tips = pd.DataFrame(tips)
print("Troubleshooting Guide:")
print("=" * 100)
print(df_tips.to_string(index=False))
print("=" * 100)
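
To see why a lower LoRA rank means fewer trainable parameters: for a weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors, `A` (`rank x d_in`) and `B` (`d_out x rank`), so the added parameter count is `rank * (d_in + d_out)`. The sketch below evaluates that formula. The layer size is a hypothetical value chosen for illustration, since the model's real dimensions are not given in this notebook.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)

# Hypothetical layer size, for illustration only.
d_in = d_out = 4096

for rank in (16, 8, 4):
    count = lora_param_count(d_in, d_out, rank)
    print(f"rank={rank}: {count:,} trainable params per adapted matrix")
```

The count scales linearly with rank, so dropping from 8 to 4 halves the adapter parameters and their optimizer state. The frozen base weights still dominate memory, so expect a modest saving rather than a large one.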

## 7. Quick Command Reference

### Submit Azure ML Job (Recommended)
```bash
python jobs/submit_training_job_cpu.py
```

### Local Training (Testing)
```bash
# Test with 100 samples
python src/train_cpu.py --max_samples 100 --num_epochs 1 --output_dir ./test_output

# Production with 1000 samples
python src/train_cpu.py --max_samples 1000 --num_epochs 2 --output_dir ./outputs_cpu
```

### Compare Models After Training
```bash
python src/compare_models.py --adapter_path ./outputs_cpu/final_model
```

### View Configuration Files
- **CPU Config**: `config/training_config_cpu.yaml`
- **CPU Compute**: `config/compute_config_cpu.yaml`
- **Training Script**: `src/train_cpu.py`
- **Job Submission**: `jobs/submit_training_job_cpu.py`

## Summary

✅ **Created CPU-optimized configurations** for Standard_E4ds_v4  
✅ **Training script** (`src/train_cpu.py`) with memory optimizations  
✅ **Job submission** (`jobs/submit_training_job_cpu.py`) for Azure ML  
✅ **Cost-effective**: ~$0.58-$1.16 per training run (1000 samples)  
✅ **No GPU required**: Works with your available VMs  

### Next Steps
1. Run cell 2 to submit Azure ML training job
2. Monitor job in Azure ML Studio
3. After training, compare models using `src/compare_models.py`
4. Adjust `max_samples` based on your needs (100-1000)