# 📚 Lecture 8: Loss Functions, Optimization, and Learning Rate Scheduling

## Interactive Hands-on Practice Notebook

**Author**: Ho-min Park  
**Contact**: homin.park@ghent.ac.kr

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
1. Implement and compare different loss functions for regression and classification
2. Build and visualize optimization algorithms from scratch
3. Design and apply learning rate scheduling strategies
4. Analyze the impact of different choices on model training

---

## Part 1: Setup and Imports

Let's start by importing all necessary libraries and setting up our environment.

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression, load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Deep learning libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("✅ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")

---
## Part 2: Loss Function Design and Implementation

In this section, we'll explore various loss functions and understand their characteristics.

### Exercise 1: Comparing Regression Loss Functions (MSE vs MAE vs Huber)

#### 📖 Concept

Different regression loss functions have different properties:
- **MSE (L2 Loss)**: Penalizes large errors quadratically, sensitive to outliers
- **MAE (L1 Loss)**: Linear penalty, robust to outliers
- **Huber Loss**: Quadratic (MSE-like) for errors with |error| ≤ δ, linear (MAE-like) for larger errors
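
To make the outlier sensitivity concrete before the full implementation, here is a minimal, dependency-free sketch that evaluates the three losses on a handful of residuals with and without one large outlier. The helper names (`mse`, `mae`, `huber`) and the residual values are illustrative assumptions, not part of the notebook's own code.

```python
def mse(residuals):
    """Mean of squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)

def mae(residuals):
    """Mean of absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)

def huber(residuals, delta=1.0):
    """Quadratic inside [-delta, delta], linear outside."""
    total = 0.0
    for r in residuals:
        if abs(r) <= delta:
            total += 0.5 * r ** 2
        else:
            total += delta * abs(r) - 0.5 * delta ** 2
    return total / len(residuals)

# Hypothetical residuals: the second list swaps one value for an outlier
clean = [0.5, -0.5, 0.5, -0.5]
with_outlier = [0.5, -0.5, 0.5, 10.0]

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name:5s} clean={fn(clean):.4f}  with outlier={fn(with_outlier):.4f}")
```

On these numbers the single outlier lifts MSE from 0.25 to 25.1875, while MAE rises from 0.5 to 2.875 and Huber (δ=1) from 0.125 to 2.46875.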

#### 💻 Implementation

In [None]:
def mse_loss(y_true, y_pred):
    """Mean Squared Error Loss"""
    return np.mean((y_true - y_pred) ** 2)

def mae_loss(y_true, y_pred):
    """Mean Absolute Error Loss"""
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber Loss - combination of MSE and MAE"""
    error = y_true - y_pred
    is_small_error = np.abs(error) <= delta
    
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * np.abs(error) - 0.5 * delta ** 2
    
    return np.mean(np.where(is_small_error, squared_loss, linear_loss))

# Generate sample data with outliers
np.random.seed(42)
X = np.linspace(-5, 5, 100)
y_true = 2 * X + 1 + np.random.normal(0, 0.5, 100)
# Add some outliers
outlier_indices = np.random.choice(100, 10, replace=False)
y_true[outlier_indices] += np.random.normal(0, 5, 10)

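# Shift all predictions by the same constant offset and record each loss.
# Note: the residual is then identical for every point, so these curves show the
# shape of each loss as a function of error; the outliers in y_true do not affect them.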
errors = np.linspace(-3, 3, 50)
losses_mse = []
losses_mae = []
losses_huber = []

for error in errors:
    y_pred = y_true + error
    losses_mse.append(mse_loss(y_true, y_pred))
    losses_mae.append(mae_loss(y_true, y_pred))
    losses_huber.append(huber_loss(y_true, y_pred, delta=1.5))

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss comparison
ax1.plot(errors, losses_mse, label='MSE', linewidth=2)
ax1.plot(errors, losses_mae, label='MAE', linewidth=2)
ax1.plot(errors, losses_huber, label='Huber (δ=1.5)', linewidth=2)
ax1.set_xlabel('Prediction Error', fontsize=12)
ax1.set_ylabel('Loss Value', fontsize=12)
ax1.set_title('Comparison of Regression Loss Functions', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Data with outliers
ax2.scatter(X[~np.isin(np.arange(100), outlier_indices)], 
           y_true[~np.isin(np.arange(100), outlier_indices)],
           alpha=0.6, label='Normal points')
ax2.scatter(X[outlier_indices], y_true[outlier_indices], 
           color='red', s=100, label='Outliers', marker='^')
ax2.set_xlabel('X', fontsize=12)
ax2.set_ylabel('y', fontsize=12)
ax2.set_title('Data with Outliers', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Key Insights:")
print("• MSE grows quadratically with error - very sensitive to outliers")
print("• MAE grows linearly - more robust to outliers")
print("• Huber combines best of both - quadratic for small errors, linear for large")

#### 🎯 Your Turn

**Task**: Implement a custom weighted Huber loss where you can adjust the weight for outliers. Test it with different delta values and visualize how it affects the loss landscape.

```python
# TODO: Implement weighted_huber_loss(y_true, y_pred, delta, outlier_weight)
# Hint: Multiply the linear part by outlier_weight
```

---
### Exercise 2: Classification Losses - Cross-Entropy vs Focal Loss

#### 📖 Concept

**Cross-Entropy Loss**: Standard loss for classification, but can struggle with class imbalance.  
**Focal Loss**: Designed to address class imbalance by down-weighting easy examples and focusing on hard ones.

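Formula: `FL = -α(1-p)^γ * log(p)`, where `p` is the predicted probability of the true class. Setting `γ = 0` recovers α-weighted cross-entropy.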
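
The effect of the modulating factor `(1-p)^γ` is easiest to see on single examples. The sketch below, which uses only the standard library and hypothetical helper names (`cross_entropy`, `focal`), compares the two losses for an easy, a borderline, and a hard example.

```python
import math

def cross_entropy(p_t):
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_t)

def focal(p_t, alpha=1.0, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by alpha * (1 - p_t)^gamma."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Easy (p_t = 0.9), borderline (0.5), and hard (0.1) examples
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal(p_t)
    print(f"p_t={p_t:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

With `γ = 2` the ratio FL/CE equals `(1 - p_t)^2`, so the easy example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%.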

#### 💻 Implementation

In [None]:
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
        
    def forward(self, inputs, targets):
        ce_loss = nn.functional.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

# Create imbalanced dataset
n_samples = 1000
n_features = 2
n_classes = 2

# Generate imbalanced data (90% class 0, 10% class 1)
X, y = make_classification(n_samples=n_samples, n_features=n_features,
                          n_informative=2, n_redundant=0,
                          n_clusters_per_class=1, weights=[0.9, 0.1],
                          flip_y=0.01, random_state=42)

# Convert to tensors
X_tensor = torch.FloatTensor(X)
y_tensor = torch.LongTensor(y)

# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(2, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 2)
        self.relu = nn.ReLU()
        
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Train with different losses
def train_with_loss(loss_fn, loss_name, epochs=100):
    model = SimpleNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    
    losses = []
    accuracies = []
    
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X_tensor)
        loss = loss_fn(outputs, y_tensor)
        loss.backward()
        optimizer.step()
        
        # Calculate accuracy
        _, predicted = torch.max(outputs.data, 1)
        accuracy = (predicted == y_tensor).float().mean()
        
        losses.append(loss.item())
        accuracies.append(accuracy.item())
    
    # Per-class accuracy (recall for each class), evaluated without tracking gradients
    with torch.no_grad():
        outputs = model(X_tensor)
    _, predicted = torch.max(outputs, 1)
    
    class_0_acc = ((predicted == 0) & (y_tensor == 0)).sum().float() / (y_tensor == 0).sum()
    class_1_acc = ((predicted == 1) & (y_tensor == 1)).sum().float() / (y_tensor == 1).sum()
    
    return losses, accuracies, class_0_acc.item(), class_1_acc.item()

# Train with both losses
ce_loss = nn.CrossEntropyLoss()
focal_loss = FocalLoss(alpha=1, gamma=2)

ce_losses, ce_accs, ce_c0_acc, ce_c1_acc = train_with_loss(ce_loss, 'Cross-Entropy')
fl_losses, fl_accs, fl_c0_acc, fl_c1_acc = train_with_loss(focal_loss, 'Focal Loss')

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Loss curves
axes[0, 0].plot(ce_losses, label='Cross-Entropy', linewidth=2)
axes[0, 0].plot(fl_losses, label='Focal Loss', linewidth=2)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training Loss Comparison', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Accuracy curves
axes[0, 1].plot(ce_accs, label='Cross-Entropy', linewidth=2)
axes[0, 1].plot(fl_accs, label='Focal Loss', linewidth=2)
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Overall Accuracy')
axes[0, 1].set_title('Overall Accuracy Comparison', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Class-wise accuracy
classes = ['Majority\n(90%)', 'Minority\n(10%)']
ce_class_accs = [ce_c0_acc, ce_c1_acc]
fl_class_accs = [fl_c0_acc, fl_c1_acc]

x = np.arange(len(classes))
width = 0.35

axes[1, 0].bar(x - width/2, ce_class_accs, width, label='Cross-Entropy', color='steelblue')
axes[1, 0].bar(x + width/2, fl_class_accs, width, label='Focal Loss', color='coral')
axes[1, 0].set_xlabel('Class')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_title('Per-Class Accuracy', fontweight='bold')
axes[1, 0].set_xticks(x)
axes[1, 0].set_xticklabels(classes)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Data distribution
axes[1, 1].scatter(X[y==0, 0], X[y==0, 1], alpha=0.5, label=f'Class 0 (n={sum(y==0)})')
axes[1, 1].scatter(X[y==1, 0], X[y==1, 1], alpha=0.8, label=f'Class 1 (n={sum(y==1)})', 
                  color='red', s=50)
axes[1, 1].set_xlabel('Feature 1')
axes[1, 1].set_ylabel('Feature 2')
axes[1, 1].set_title('Imbalanced Dataset', fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Key Insights:")
print(f"• Cross-Entropy - Majority class: {ce_c0_acc:.2%}, Minority class: {ce_c1_acc:.2%}")
print(f"• Focal Loss - Majority class: {fl_c0_acc:.2%}, Minority class: {fl_c1_acc:.2%}")
print("• Focal Loss is designed to help the minority class; compare the two lines above to see whether it did in this run")
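
The per-class accuracy reported above is the recall of each class, and it matters because overall accuracy can look good on imbalanced data even when the minority class is ignored. The following sketch illustrates this with a hypothetical classifier that always predicts the majority class; the helper `per_class_recall` and the labels are assumptions made for the illustration.

```python
def per_class_recall(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    total = sum(1 for t in y_true if t == cls)
    return hits / total

# 90/10 class split, and a degenerate model that always predicts class 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall_0 = per_class_recall(y_true, y_pred, 0)
recall_1 = per_class_recall(y_true, y_pred, 1)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"accuracy={accuracy:.2f}  recall_0={recall_0:.2f}  "
      f"recall_1={recall_1:.2f}  balanced={balanced_accuracy:.2f}")
```

Here overall accuracy is 0.90 although minority recall is 0.00, which is why the bar chart above compares the classes separately.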

---
## Part 3: Optimization Algorithms

Let's implement and visualize different optimization algorithms to understand their behavior.

### Exercise 3: Optimizer Comparison - SGD vs Momentum vs Adam

#### 📖 Concept

Different optimizers navigate the loss landscape differently:
- **SGD**: Simple gradient descent, can get stuck in local minima
- **Momentum**: Accelerates SGD by accumulating velocity
- **Adam**: Adaptive learning rates with momentum
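
Two properties of these update rules are worth checking numerically. The sketch below is a scalar, standard-library-only illustration with hypothetical helper names; it assumes the same update forms as this exercise's implementation (velocity `v = momentum * v + grad`, and Adam with bias correction).

```python
import math

def momentum_velocity(grad, momentum=0.9, steps=200):
    """Velocity after repeated updates v = momentum * v + grad with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v + grad
    return v

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first parameter update for a scalar gradient."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)      # bias-corrected first moment
    v_hat = v / (1 - beta2)      # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# With a constant gradient, the velocity approaches grad / (1 - momentum)
print(f"momentum velocity: {momentum_velocity(1.0):.4f}")

# Adam's first step has magnitude close to lr, whatever the gradient scale
for g in (0.001, 1.0, 1000.0):
    print(f"grad={g:<8} adam first step={adam_first_step(g):.6f}")
```

With `momentum=0.9` the velocity settles near 10 times the gradient, which is one reason momentum often needs a smaller learning rate than plain SGD. Adam's first step is roughly `lr * sign(grad)`, so its learning rate directly sets the initial step size.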

#### 💻 Implementation

In [None]:
# Create a complex loss landscape
def create_loss_landscape():
    x = np.linspace(-2, 2, 100)
    y = np.linspace(-2, 2, 100)
    X, Y = np.meshgrid(x, y)
    
    # Rosenbrock function (challenging optimization landscape)
    a, b = 1, 100
    Z = (a - X)**2 + b * (Y - X**2)**2
    
    return X, Y, Z

def rosenbrock_gradient(x, y):
    """Gradient of Rosenbrock function"""
    a, b = 1, 100
    dx = -2 * (a - x) - 4 * b * x * (y - x**2)
    dy = 2 * b * (y - x**2)
    return np.array([dx, dy])

# Implement optimizers from scratch
class Optimizer:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.path = []
    
    def step(self, pos, grad):
        raise NotImplementedError

class SGD(Optimizer):
    def step(self, pos, grad):
        new_pos = pos - self.lr * grad
        self.path.append(new_pos.copy())
        return new_pos

class Momentum(Optimizer):
    def __init__(self, lr=0.01, momentum=0.9):
        super().__init__(lr)
        self.momentum = momentum
        self.velocity = np.array([0.0, 0.0])
    
    def step(self, pos, grad):
        self.velocity = self.momentum * self.velocity + grad
        new_pos = pos - self.lr * self.velocity
        self.path.append(new_pos.copy())
        return new_pos

class AdamSimple(Optimizer):
    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        super().__init__(lr)
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = np.array([0.0, 0.0])
        self.v = np.array([0.0, 0.0])
        self.t = 0
    
    def step(self, pos, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        
        new_pos = pos - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        self.path.append(new_pos.copy())
        return new_pos

# Run optimization
def run_optimization(optimizer, start_pos, n_steps=100):
    pos = start_pos.copy()
    optimizer.path = [pos.copy()]
    
    for _ in range(n_steps):
        grad = rosenbrock_gradient(pos[0], pos[1])
        pos = optimizer.step(pos, grad)
        
        # Stop if converged
        if np.linalg.norm(grad) < 1e-4:
            break
    
    return np.array(optimizer.path)

# Initialize optimizers
start_pos = np.array([-1.5, -1.5])
optimizers = {
    'SGD': SGD(lr=0.001),
    'Momentum': Momentum(lr=0.001, momentum=0.9),
    'Adam': AdamSimple(lr=0.01)
}

paths = {}
for name, opt in optimizers.items():
    paths[name] = run_optimization(opt, start_pos, n_steps=500)

# Visualization
X, Y, Z = create_loss_landscape()

fig = plt.figure(figsize=(15, 5))

for i, (name, path) in enumerate(paths.items()):
    ax = fig.add_subplot(1, 3, i+1)
    
    # Plot contour
    contour = ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), alpha=0.5)
    
    # Plot optimization path
    ax.plot(path[:, 0], path[:, 1], 'r-', linewidth=2, alpha=0.8)
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
    ax.plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
    
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_title(f'{name} Optimization Path\n({len(path) - 1} steps)', fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-2, 2)
    ax.set_ylim(-2, 2)

plt.tight_layout()
plt.show()

print("\n📊 Optimization Results:")
for name, path in paths.items():
    final_pos = path[-1]
    final_loss = (1 - final_pos[0])**2 + 100 * (final_pos[1] - final_pos[0]**2)**2
    print(f"{name:10} - Steps: {len(path) - 1:3d}, Final loss: {final_loss:.6f}, "
          f"Final position: ({final_pos[0]:.3f}, {final_pos[1]:.3f})")

print("\n💡 Key Insights (check them against the results above):")
print("• Plain SGD tends to crawl along the narrow, curved valley")
print("• Momentum builds up velocity along the valley floor, but can overshoot")
print("• Adam rescales each coordinate's step, which often helps on badly scaled problems")
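
Because every optimizer above relies on the hand-derived `rosenbrock_gradient`, it is worth confirming the derivative numerically. The sketch below repeats the function and gradient so that it is self-contained (with the hypothetical names `rosenbrock`, `rosenbrock_grad`, and `numeric_grad`) and compares the analytic gradient with a central finite difference at the starting point.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient, same formula as in the exercise."""
    dx = -2 * (a - x) - 4 * b * x * (y - x ** 2)
    dy = 2 * b * (y - x ** 2)
    return dx, dy

def numeric_grad(f, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy

point = (-1.5, -1.5)  # the start position used in the exercise
analytic = rosenbrock_grad(*point)
numeric = numeric_grad(rosenbrock, *point)

print(f"analytic: ({analytic[0]:.4f}, {analytic[1]:.4f})")
print(f"numeric:  ({numeric[0]:.4f}, {numeric[1]:.4f})")
```

The analytic gradient at the start is (-2255, -750). A gradient this large explains why the first step is big even with a small learning rate.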

---
### Exercise 4: Learning Rate Impact Analysis

#### 📖 Concept

The learning rate is arguably the most important hyperparameter:
- **Too large**: Divergence, oscillation
- **Too small**: Slow convergence, stuck in local minima
- **Just right**: Efficient convergence to good minima
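
The three regimes can be reproduced exactly on the one-dimensional quadratic `f(x) = x**2`, whose gradient is `2x`. This is a toy sketch with a hypothetical helper name, not a substitute for the neural-network experiment that follows.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x**2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # equivalent to x *= (1 - 2 * lr)
    return x

for lr in (0.01, 0.4, 0.9, 1.1):
    print(f"lr={lr:<4}  |x| after 20 steps = {abs(gradient_descent(lr)):.3e}")
```

Each step multiplies `x` by `1 - 2 * lr`, so on this function the iterates converge only for `0 < lr < 1`: `lr=0.01` barely moves, `lr=0.4` converges quickly, `lr=0.9` converges while oscillating in sign, and `lr=1.1` diverges.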

#### 💻 Implementation

In [None]:
# Test different learning rates on a simple problem
def train_with_lr(lr, X, y, epochs=200):
    torch.manual_seed(42)
    
    # Small MLP: two hidden layers with ReLU activations
    model = nn.Sequential(
        nn.Linear(X.shape[1], 32),
        nn.ReLU(),
        nn.Linear(32, 16),
        nn.ReLU(),
        nn.Linear(16, 1)
    )
    
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    
    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        
        losses.append(loss.item())
        
        # Check for divergence
        if loss.item() > 1e6 or np.isnan(loss.item()):
            losses.extend([np.nan] * (epochs - epoch - 1))
            break
    
    return losses

# Generate regression data
X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X_reg = torch.FloatTensor(StandardScaler().fit_transform(X_reg))
y_reg = torch.FloatTensor(y_reg.reshape(-1, 1))

# Test different learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1, 0.5]
all_losses = {}

for lr in learning_rates:
    losses = train_with_lr(lr, X_reg, y_reg)
    all_losses[lr] = losses

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, lr in enumerate(learning_rates):
    ax = axes[i]
    losses = all_losses[lr]
    
    if not any(np.isnan(losses)):
        ax.plot(losses, linewidth=2, color='steelblue')
        ax.set_title(f'LR = {lr}\nFinal Loss: {losses[-1]:.4f}', fontweight='bold')
    else:
        valid_losses = [l for l in losses if not np.isnan(l)]
        ax.plot(valid_losses, linewidth=2, color='red')
        ax.set_title(f'LR = {lr}\n⚠️ DIVERGED!', fontweight='bold', color='red')
    
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.grid(True, alpha=0.3)
    ax.set_yscale('log')

# Summary plot
ax = axes[5]
for lr in learning_rates:
    losses = all_losses[lr]
    if not any(np.isnan(losses)):
        ax.plot(losses, label=f'LR={lr}', linewidth=2, alpha=0.7)

ax.set_xlabel('Epoch')
ax.set_ylabel('Loss (log scale)')
ax.set_title('All Learning Rates Comparison', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_yscale('log')

plt.tight_layout()
plt.show()

print("\n📊 Learning Rate Analysis:")
for lr in learning_rates:
    losses = all_losses[lr]
    if not any(np.isnan(losses)):
        print(f"LR {lr:6.4f}: Final loss = {losses[-1]:8.4f}, "
              f"steepest drop at epoch {np.argmin(np.gradient(losses[:100]))}")
    else:
        print(f"LR {lr:6.4f}: ❌ DIVERGED")

print("\n💡 Key Insights (confirm against the results above):")
print("• Very small LR (e.g. 0.0001): stable but very slow convergence")
print("• Moderate LR: usually the fastest stable convergence")
print("• Very large LR (e.g. 0.5): the loss can oscillate or diverge within a few epochs")

---
## Part 4: Learning Rate Scheduling Strategies

Changing the learning rate over the course of training (learning rate scheduling) can significantly improve training efficiency and final performance.

### Exercise 5: Learning Rate Scheduling Comparison

#### 📖 Concept

Different scheduling strategies:
- **Step Decay**: Reduce by factor at specific epochs
- **Exponential Decay**: Smooth exponential reduction
- **Cosine Annealing**: Cosine-shaped reduction
- **Warm-up + Linear**: Start small, ramp up to the peak rate, then decay linearly
- **Cyclical**: Repeatedly ramp between a minimum and a maximum rate

#### 💻 Implementation

In [None]:
class LRScheduler:
    def __init__(self, initial_lr, total_epochs):
        self.initial_lr = initial_lr
        self.total_epochs = total_epochs
        self.history = []
    
    def get_lr(self, epoch):
        raise NotImplementedError
    
    def step(self, epoch):
        lr = self.get_lr(epoch)
        self.history.append(lr)
        return lr

class StepDecayScheduler(LRScheduler):
    def __init__(self, initial_lr, total_epochs, step_size=30, gamma=0.1):
        super().__init__(initial_lr, total_epochs)
        self.step_size = step_size
        self.gamma = gamma
    
    def get_lr(self, epoch):
        return self.initial_lr * (self.gamma ** (epoch // self.step_size))

class ExponentialDecayScheduler(LRScheduler):
    def __init__(self, initial_lr, total_epochs, decay_rate=0.96):
        super().__init__(initial_lr, total_epochs)
        self.decay_rate = decay_rate
    
    def get_lr(self, epoch):
        return self.initial_lr * (self.decay_rate ** epoch)

class CosineAnnealingScheduler(LRScheduler):
    def __init__(self, initial_lr, total_epochs, min_lr=0.00001):
        super().__init__(initial_lr, total_epochs)
        self.min_lr = min_lr
    
    def get_lr(self, epoch):
        return self.min_lr + (self.initial_lr - self.min_lr) * \
               (1 + np.cos(np.pi * epoch / self.total_epochs)) / 2

class WarmupLinearScheduler(LRScheduler):
    def __init__(self, initial_lr, total_epochs, warmup_epochs=10):
        super().__init__(initial_lr, total_epochs)
        self.warmup_epochs = warmup_epochs
    
    def get_lr(self, epoch):
        if epoch < self.warmup_epochs:
            # Warmup phase
            return self.initial_lr * (epoch + 1) / self.warmup_epochs
        else:
            # Linear decay phase
            progress = (epoch - self.warmup_epochs) / (self.total_epochs - self.warmup_epochs)
            return self.initial_lr * (1 - progress)

class CyclicalScheduler(LRScheduler):
    def __init__(self, initial_lr, total_epochs, min_lr=0.0001, cycle_length=20):
        super().__init__(initial_lr, total_epochs)
        self.min_lr = min_lr
        self.cycle_length = cycle_length
    
    def get_lr(self, epoch):
        cycle_progress = (epoch % self.cycle_length) / self.cycle_length
        if cycle_progress < 0.5:
            # Increasing phase
            return self.min_lr + (self.initial_lr - self.min_lr) * (2 * cycle_progress)
        else:
            # Decreasing phase
            return self.initial_lr - (self.initial_lr - self.min_lr) * (2 * (cycle_progress - 0.5))

# Initialize schedulers
total_epochs = 150
initial_lr = 0.1

schedulers = {
    'Constant': LRScheduler(initial_lr, total_epochs),  # Base class; get_lr is overridden below
    'Step Decay': StepDecayScheduler(initial_lr, total_epochs, step_size=50),
    'Exponential': ExponentialDecayScheduler(initial_lr, total_epochs, decay_rate=0.97),
    'Cosine': CosineAnnealingScheduler(initial_lr, total_epochs),
    'Warmup-Linear': WarmupLinearScheduler(initial_lr, total_epochs, warmup_epochs=20),
    'Cyclical': CyclicalScheduler(initial_lr, total_epochs, cycle_length=30)
}

# For constant scheduler, override get_lr
schedulers['Constant'].get_lr = lambda epoch: initial_lr

# Generate learning rate schedules
for name, scheduler in schedulers.items():
    for epoch in range(total_epochs):
        scheduler.step(epoch)

# Visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

colors = ['gray', 'blue', 'green', 'red', 'purple', 'orange']

for i, (name, scheduler) in enumerate(schedulers.items()):
    ax = axes[i]
    ax.plot(scheduler.history, linewidth=2.5, color=colors[i])
    ax.set_xlabel('Epoch', fontsize=11)
    ax.set_ylabel('Learning Rate', fontsize=11)
    ax.set_title(f'{name} Schedule', fontweight='bold', fontsize=12)
    ax.grid(True, alpha=0.3)
    
    # Add annotations for key features
    if name == 'Step Decay':
        for step in range(50, total_epochs, 50):
            ax.axvline(x=step, color='red', linestyle='--', alpha=0.3)
    elif name == 'Warmup-Linear':
        ax.axvline(x=20, color='red', linestyle='--', alpha=0.3)
        ax.text(20, ax.get_ylim()[1]*0.9, 'Warmup End', rotation=90, 
               verticalalignment='top', fontsize=9)

plt.tight_layout()
plt.show()

print("\n📊 Learning Rate Schedule Statistics:")
print("-" * 60)
for name, scheduler in schedulers.items():
    lr_values = scheduler.history
    print(f"{name:15} | Initial: {lr_values[0]:.4f} | "
          f"Final: {lr_values[-1]:.4f} | "
          f"Mean: {np.mean(lr_values):.4f}")

print("\n💡 Key Insights:")
print("• Step Decay: Simple and effective, sudden drops can cause loss spikes")
print("• Exponential: Smooth decay, may decrease too quickly")
print("• Cosine: Smooth with slow start and end, good for fine-tuning")
print("• Warmup: Prevents instability with large initial learning rates")
print("• Cyclical: Helps escape local minima through exploration")
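
The schedulers above only produce learning-rate values, so it may help to see one plugged into an update loop. The sketch below is a toy illustration on `f(x) = x**2` using hypothetical helper names (`constant`, `exponential_decay`, `train`); the outcome on this one function is not evidence that decay is better in general.

```python
def constant(initial_lr):
    """Schedule that returns the same learning rate every epoch."""
    return lambda epoch: initial_lr

def exponential_decay(initial_lr, decay_rate):
    """Schedule lr(epoch) = initial_lr * decay_rate ** epoch."""
    def schedule(epoch):
        return initial_lr * decay_rate ** epoch
    return schedule

def train(schedule, x0=1.0, epochs=50):
    """Gradient descent on f(x) = x**2, querying the schedule once per epoch."""
    x = x0
    for epoch in range(epochs):
        lr = schedule(epoch)
        x -= lr * 2 * x
    return x

x_const = train(constant(0.95))
x_decay = train(exponential_decay(0.95, 0.9))

print(f"constant lr:    |x| = {abs(x_const):.3e}")
print(f"exponential lr: |x| = {abs(x_decay):.3e}")
```

With a PyTorch optimizer, the same pattern would write the scheduled value into each entry of `optimizer.param_groups` (setting `group['lr']`) before calling `optimizer.step()`.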

---
### Exercise 6: Metric Learning - Contrastive and Triplet Loss

#### 📖 Concept

Metric learning losses learn embeddings where:
- **Contrastive Loss**: Similar pairs are close, dissimilar pairs are far
- **Triplet Loss**: Anchor is closer to positive than negative by a margin

#### 💻 Implementation

In [None]:
class ContrastiveLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin
    
    def forward(self, output1, output2, label):
        """label=1 for similar, label=0 for dissimilar"""
        distance = torch.nn.functional.pairwise_distance(output1, output2)
        loss = label * distance.pow(2) + \
               (1 - label) * torch.clamp(self.margin - distance, min=0).pow(2)
        return loss.mean()

class TripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(TripletLoss, self).__init__()
        self.margin = margin
    
    def forward(self, anchor, positive, negative):
        pos_distance = torch.nn.functional.pairwise_distance(anchor, positive)
        neg_distance = torch.nn.functional.pairwise_distance(anchor, negative)
        loss = torch.clamp(pos_distance - neg_distance + self.margin, min=0)
        return loss.mean()

# Create embedding network
class EmbeddingNet(nn.Module):
    def __init__(self, input_dim=784, embedding_dim=32):
        super(EmbeddingNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, embedding_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Generate synthetic data for demonstration
np.random.seed(42)
n_samples = 1000
n_features = 50
n_classes = 5
embedding_dim = 2  # 2D for visualization

# Create clustered data
X_metric = []
y_metric = []
for class_id in range(n_classes):
    center = np.random.randn(n_features) * 5
    samples = center + np.random.randn(n_samples // n_classes, n_features)
    X_metric.append(samples)
    y_metric.extend([class_id] * (n_samples // n_classes))

X_metric = np.vstack(X_metric)
X_metric = torch.FloatTensor(X_metric)
y_metric = torch.LongTensor(y_metric)

# Train with contrastive loss
model = EmbeddingNet(input_dim=n_features, embedding_dim=embedding_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
contrastive_loss = ContrastiveLoss(margin=2.0)

# Training loop for contrastive loss
n_epochs = 50
model.train()
for epoch in range(n_epochs):
    # Sample pairs
    indices = torch.randperm(len(X_metric))
    
    total_loss = 0
    for i in range(0, len(indices)-1, 2):
        idx1, idx2 = indices[i], indices[i+1]
        
        x1, x2 = X_metric[idx1:idx1+1], X_metric[idx2:idx2+1]
        y1, y2 = y_metric[idx1], y_metric[idx2]
        
        # Label: 1 if same class, 0 if different
        label = torch.FloatTensor([1.0 if y1 == y2 else 0.0])
        
        optimizer.zero_grad()
        embed1 = model(x1)
        embed2 = model(x2)
        loss = contrastive_loss(embed1, embed2, label)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()

# Get embeddings
model.eval()
with torch.no_grad():
    embeddings = model(X_metric).numpy()

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Original high-dimensional data (PCA projection)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_metric.numpy())

ax = axes[0]
for class_id in range(n_classes):
    mask = y_metric.numpy() == class_id
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], 
              label=f'Class {class_id}', alpha=0.6, s=30)
ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_title('Original Data (PCA Projection)', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Learned embeddings
ax = axes[1]
for class_id in range(n_classes):
    mask = y_metric.numpy() == class_id
    ax.scatter(embeddings[mask, 0], embeddings[mask, 1], 
              label=f'Class {class_id}', alpha=0.6, s=30)
ax.set_xlabel('Embedding Dimension 1')
ax.set_ylabel('Embedding Dimension 2')
ax.set_title('Learned Embeddings (Contrastive Loss)', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate inter and intra class distances
from scipy.spatial.distance import cdist

intra_distances = []
inter_distances = []

for i in range(n_classes):
    class_embeddings = embeddings[y_metric.numpy() == i]
    # Intra-class distances
    if len(class_embeddings) > 1:
        intra_dist = cdist(class_embeddings, class_embeddings)
        intra_distances.extend(intra_dist[np.triu_indices_from(intra_dist, k=1)])
    
    # Inter-class distances
    for j in range(i+1, n_classes):
        other_embeddings = embeddings[y_metric.numpy() == j]
        inter_dist = cdist(class_embeddings, other_embeddings)
        inter_distances.extend(inter_dist.flatten())

print("\n📊 Embedding Quality Metrics:")
print(f"Average intra-class distance: {np.mean(intra_distances):.3f}")
print(f"Average inter-class distance: {np.mean(inter_distances):.3f}")
print(f"Separation ratio: {np.mean(inter_distances) / np.mean(intra_distances):.2f}")

print("\n💡 Key Insights:")
print("• A separation ratio well above 1 indicates that classes are farther apart than they are wide")
print("• Contrastive loss pulls same-class pairs together in embedding space")
print("• Different-class pairs are pushed apart until they are at least `margin` apart")

---
### Exercise 7: Advanced Optimizer Analysis - AdaGrad, RMSprop, AdamW

#### 📖 Concept

Adaptive optimizers adjust learning rates per parameter:
- **AdaGrad**: Accumulates squared gradients (can stop learning)
- **RMSprop**: Uses exponential moving average (fixes AdaGrad's issue)
- **AdamW**: Adam with decoupled weight decay (better generalization)
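
The difference between AdaGrad and RMSprop shows up even with a constant gradient. The scalar sketch below tracks the effective step size `lr / sqrt(accumulator)` for both; the helper names, the decay parameter `rho`, and the `eps` values are illustrative assumptions rather than any library's defaults.

```python
import math

def adagrad_effective_lr(grads, lr=0.01, eps=1e-10):
    """AdaGrad: divide lr by the root of the running SUM of squared gradients."""
    accum = 0.0
    history = []
    for g in grads:
        accum += g ** 2
        history.append(lr / (math.sqrt(accum) + eps))
    return history

def rmsprop_effective_lr(grads, lr=0.01, rho=0.9, eps=1e-8):
    """RMSprop: divide lr by the root of a running AVERAGE of squared gradients."""
    avg = 0.0
    history = []
    for g in grads:
        avg = rho * avg + (1 - rho) * g ** 2
        history.append(lr / (math.sqrt(avg) + eps))
    return history

grads = [1.0] * 100  # constant gradient of 1 for 100 steps
ada = adagrad_effective_lr(grads)
rms = rmsprop_effective_lr(grads)

for step in (1, 10, 100):
    print(f"step {step:3d}: AdaGrad={ada[step - 1]:.5f}  RMSprop={rms[step - 1]:.5f}")
```

With a constant gradient, AdaGrad's effective step shrinks like `lr / sqrt(t)` and keeps shrinking, whereas RMSprop's settles near `lr` because old gradients are forgotten.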

#### 💻 Implementation

In [None]:
# Compare adaptive optimizers on a challenging problem
def create_noisy_classification_problem():
    """Create a classification problem with noise and class imbalance"""
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                              n_redundant=5, n_classes=3, n_clusters_per_class=2,
                              weights=[0.5, 0.3, 0.2], flip_y=0.1, random_state=42)
    return torch.FloatTensor(X), torch.LongTensor(y)

X_opt, y_opt = create_noisy_classification_problem()
X_train, X_test, y_train, y_test = train_test_split(
    X_opt, y_opt, test_size=0.3, random_state=42
)

# Define model
class ClassifierNet(nn.Module):
    def __init__(self, input_dim=20, n_classes=3):
        super(ClassifierNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(16, n_classes)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x

def train_model(optimizer_name, optimizer_class, **optimizer_kwargs):
    torch.manual_seed(42)
    model = ClassifierNet()
    criterion = nn.CrossEntropyLoss()
    optimizer = optimizer_class(model.parameters(), **optimizer_kwargs)
    
    train_losses = []
    test_losses = []
    train_accs = []
    test_accs = []
    
    n_epochs = 100
    for epoch in range(n_epochs):
        # Training
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train)
        loss = criterion(outputs, y_train)
        loss.backward()
        optimizer.step()
        
        # Evaluation
        model.eval()
        with torch.no_grad():
            # Training metrics
            train_outputs = model(X_train)
            train_loss = criterion(train_outputs, y_train)
            _, train_predicted = torch.max(train_outputs, 1)
            train_acc = (train_predicted == y_train).float().mean()
            
            # Test metrics
            test_outputs = model(X_test)
            test_loss = criterion(test_outputs, y_test)
            _, test_predicted = torch.max(test_outputs, 1)
            test_acc = (test_predicted == y_test).float().mean()
        
        train_losses.append(train_loss.item())
        test_losses.append(test_loss.item())
        train_accs.append(train_acc.item())
        test_accs.append(test_acc.item())
    
    return train_losses, test_losses, train_accs, test_accs

# Compare optimizers
optimizers_config = {
    'SGD+Momentum': (optim.SGD, {'lr': 0.01, 'momentum': 0.9}),
    'AdaGrad': (optim.Adagrad, {'lr': 0.01}),
    'RMSprop': (optim.RMSprop, {'lr': 0.001}),
    'Adam': (optim.Adam, {'lr': 0.001}),
    'AdamW': (optim.AdamW, {'lr': 0.001, 'weight_decay': 0.01})
}

results = {}
for name, (opt_class, kwargs) in optimizers_config.items():
    print(f"Training with {name}...")
    results[name] = train_model(name, opt_class, **kwargs)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Training loss
ax = axes[0, 0]
for name, (train_losses, _, _, _) in results.items():
    ax.plot(train_losses, label=name, linewidth=2, alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training Loss', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Test loss
ax = axes[0, 1]
for name, (_, test_losses, _, _) in results.items():
    ax.plot(test_losses, label=name, linewidth=2, alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Test Loss', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Training accuracy
ax = axes[1, 0]
for name, (_, _, train_accs, _) in results.items():
    ax.plot(train_accs, label=name, linewidth=2, alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Training Accuracy', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Test accuracy
ax = axes[1, 1]
for name, (_, _, _, test_accs) in results.items():
    ax.plot(test_accs, label=name, linewidth=2, alpha=0.8)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Test Accuracy', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Final performance comparison
print("\n📊 Final Performance Comparison:")
print("-" * 70)
print(f"{'Optimizer':<12} | {'Train Loss':>10} | {'Test Loss':>10} | "
      f"{'Train Acc':>10} | {'Test Acc':>10}")
print("-" * 70)

for name, (train_losses, test_losses, train_accs, test_accs) in results.items():
    print(f"{name:<12} | {train_losses[-1]:>10.4f} | {test_losses[-1]:>10.4f} | "
          f"{train_accs[-1]:>10.2%} | {test_accs[-1]:>10.2%}")

print("\n💡 Key Insights:")
print("• AdaGrad: Accumulated squared gradients shrink the effective learning rate, so progress can stall early")
print("• RMSprop: Fixes AdaGrad's diminishing learning rate problem")
print("• Adam: Combines momentum and adaptive learning rates effectively")
print("• AdamW: Decoupled weight decay often improves generalization")
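
The first two insights can be checked numerically without training a network. The sketch below feeds a constant gradient to simplified AdaGrad and RMSprop update rules and records the size of each step. The function name `effective_steps` and the decay value `0.9` are illustrative assumptions (PyTorch's `RMSprop` uses its own default smoothing constant), and momentum and weight decay are left out.

```python
import math

def effective_steps(n_steps=100, grad=1.0, lr=0.01, decay=0.9, eps=1e-8):
    """Per-step update sizes for AdaGrad and RMSprop under a constant gradient."""
    adagrad_acc = 0.0   # running SUM of squared gradients (never shrinks)
    rmsprop_avg = 0.0   # exponential moving AVERAGE of squared gradients
    adagrad_steps, rmsprop_steps = [], []
    for _ in range(n_steps):
        adagrad_acc += grad ** 2
        rmsprop_avg = decay * rmsprop_avg + (1 - decay) * grad ** 2
        adagrad_steps.append(lr * grad / (math.sqrt(adagrad_acc) + eps))
        rmsprop_steps.append(lr * grad / (math.sqrt(rmsprop_avg) + eps))
    return adagrad_steps, rmsprop_steps

adagrad_steps, rmsprop_steps = effective_steps()
print(f"AdaGrad step: first={adagrad_steps[0]:.5f}, last={adagrad_steps[-1]:.5f}")
print(f"RMSprop step: first={rmsprop_steps[0]:.5f}, last={rmsprop_steps[-1]:.5f}")
```

With a constant gradient, AdaGrad's step shrinks like `lr / sqrt(t)`, so it falls from 0.01 to 0.001 over 100 steps, while RMSprop's step settles near `lr` because old squared gradients are forgotten.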

---
### Exercise 8: Complete Training Pipeline with Best Practices

#### 📖 Concept

Combining everything we've learned:
- Appropriate loss function
- A suitable optimizer
- Learning rate scheduling
- Regularization

#### 💻 Implementation

In [None]:
class CompletePipeline:
    def __init__(self, model, loss_fn, optimizer, scheduler=None, device='cpu'):
        self.model = model.to(device)
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.scheduler = scheduler
        self.device = device
        self.history = {'train_loss': [], 'val_loss': [], 
                       'train_acc': [], 'val_acc': [], 'lr': []}
    
    def train_epoch(self, dataloader):
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0
        
        for batch_X, batch_y in dataloader:
            batch_X = batch_X.to(self.device)
            batch_y = batch_y.to(self.device)
            
            self.optimizer.zero_grad()
            outputs = self.model(batch_X)
            loss = self.loss_fn(outputs, batch_y)
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            
            self.optimizer.step()
            
            # Weight each batch's mean loss by its size so a smaller final batch is not over-counted
            total_loss += loss.item() * batch_y.size(0)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == batch_y).sum().item()
            total += batch_y.size(0)
        
        return total_loss / total, correct / total
    
    def validate(self, dataloader):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch_X, batch_y in dataloader:
                batch_X = batch_X.to(self.device)
                batch_y = batch_y.to(self.device)
                
                outputs = self.model(batch_X)
                loss = self.loss_fn(outputs, batch_y)
                
                # Weight each batch's mean loss by its size (batches may differ in length)
                total_loss += loss.item() * batch_y.size(0)
                _, predicted = torch.max(outputs, 1)
                correct += (predicted == batch_y).sum().item()
                total += batch_y.size(0)
        
        return total_loss / total, correct / total
    
    def train(self, train_loader, val_loader, epochs, early_stopping_patience=10):
        best_val_loss = float('inf')
        patience_counter = 0
        
        for epoch in range(epochs):
            # Training
            train_loss, train_acc = self.train_epoch(train_loader)
            
            # Validation
            val_loss, val_acc = self.validate(val_loader)
            
            # Learning rate scheduling: record the LR used during this epoch, then step
            current_lr = self.optimizer.param_groups[0]['lr']
            if self.scheduler:
                if isinstance(self.scheduler, optim.lr_scheduler.ReduceLROnPlateau):
                    self.scheduler.step(val_loss)
                else:
                    self.scheduler.step()
            
            # Record history
            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_loss)
            self.history['train_acc'].append(train_acc)
            self.history['val_acc'].append(val_acc)
            self.history['lr'].append(current_lr)
            
            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= early_stopping_patience:
                    print(f"Early stopping at epoch {epoch+1}")
                    break
            
            # Progress reporting
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{epochs} | "
                      f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2%} | "
                      f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2%} | "
                      f"LR: {current_lr:.6f}")

# Prepare data
iris = load_iris()
X_iris = torch.FloatTensor(iris.data)
y_iris = torch.LongTensor(iris.target)

# Split and scale
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_iris = torch.FloatTensor(scaler.fit_transform(X_train_iris))
X_test_iris = torch.FloatTensor(scaler.transform(X_test_iris))

# Create data loaders
train_dataset = TensorDataset(X_train_iris, y_train_iris)
test_dataset = TensorDataset(X_test_iris, y_test_iris)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Create model
class IrisNet(nn.Module):
    def __init__(self):
        super(IrisNet, self).__init__()
        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 3)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Setup complete pipeline
model = IrisNet()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

pipeline = CompletePipeline(model, loss_fn, optimizer, scheduler)

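# Train
# Note: the test split doubles as the validation set here to keep the demo short.
# In a real project, hold out a separate validation split for early stopping.
print("🚀 Starting training with complete pipeline...\n")
pipeline.train(train_loader, test_loader, epochs=100, early_stopping_patience=15)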

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Loss curves
ax = axes[0]
ax.plot(pipeline.history['train_loss'], label='Training', linewidth=2)
ax.plot(pipeline.history['val_loss'], label='Validation', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Loss Curves', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Accuracy curves
ax = axes[1]
ax.plot(pipeline.history['train_acc'], label='Training', linewidth=2)
ax.plot(pipeline.history['val_acc'], label='Validation', linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy Curves', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Learning rate schedule
ax = axes[2]
ax.plot(pipeline.history['lr'], linewidth=2, color='green')
ax.set_xlabel('Epoch')
ax.set_ylabel('Learning Rate')
ax.set_title('Learning Rate Schedule', fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n📊 Final Results:")
print(f"Best Validation Accuracy: {max(pipeline.history['val_acc']):.2%}")
print(f"Final Training Accuracy: {pipeline.history['train_acc'][-1]:.2%}")
print(f"Final Validation Accuracy: {pipeline.history['val_acc'][-1]:.2%}")
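
The `train` loop above stops when the validation loss stops improving, but it keeps the weights from the last epoch rather than the best one. A common extension is to snapshot the best state and restore it afterwards. The sketch below is a framework-agnostic illustration, not part of the pipeline above: the class name `EarlyStopper` and its `min_delta` parameter are hypothetical, and plain dicts stand in for a model's state.

```python
import copy

class EarlyStopper:
    """Track the best validation loss and keep a snapshot of the best state."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def update(self, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # snapshot, not a reference
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run: plain dicts stand in for a model's state
stopper = EarlyStopper(patience=2)
val_losses = [0.9, 0.7, 0.8, 0.75, 0.6]
for epoch, val_loss in enumerate(val_losses):
    state = {"epoch": epoch}
    if stopper.update(val_loss, state):
        print(f"Stopping at epoch {epoch}; best loss {stopper.best_loss}")
        break
```

With PyTorch, one would typically pass `model.state_dict()` as `state` and call `model.load_state_dict(stopper.best_state)` once the loop ends.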

---
## Part 5: Summary and Practice Exercises

### 🎯 Key Takeaways

1. **Loss Functions**:
   - Choose based on problem type and data characteristics
   - Consider robustness requirements (outliers)
   - Custom losses for domain-specific objectives

2. **Optimization Algorithms**:
   - Start with Adam/AdamW for rapid prototyping
   - Consider well-tuned SGD with momentum when final generalization matters most
   - Adaptive methods handle different parameter scales well

3. **Learning Rate Scheduling**:
   - Use some form of scheduling in most cases; a constant learning rate is rarely the best choice
   - Warm-up for large models/batch sizes
   - Cosine annealing works well in practice
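
The warm-up and cosine annealing points can be combined into a single schedule. The sketch below is one possible formulation, not a fixed recipe: the function name `warmup_cosine_factor`, the linear ramp, and the step counts are all assumptions chosen for illustration. It returns a multiplier for the base learning rate.

```python
import math

def warmup_cosine_factor(step, warmup_steps, total_steps, min_factor=0.0):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear ramp from 1/warmup_steps up to 1.0
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 down towards min_factor over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_factor + (1.0 - min_factor) * cosine

base_lr = 0.01
schedule = [base_lr * warmup_cosine_factor(s, warmup_steps=5, total_steps=50)
            for s in range(50)]
print(f"start={schedule[0]:.4f}, peak={max(schedule):.4f}, end={schedule[-1]:.6f}")
```

Because the function returns a multiplier, it can be plugged into PyTorch through `optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda s: warmup_cosine_factor(s, 5, 50))`.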

### 📝 Practice Exercises

#### Exercise A: Custom Loss Function
Design and implement a custom loss function that:
- Combines MSE with a penalty for prediction uncertainty
- Includes a regularization term for model complexity
- Adapts weights based on sample difficulty
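
As a possible starting point for the first bullet only, the sketch below uses a heteroscedastic Gaussian negative log-likelihood: the model predicts a log-variance alongside each mean, and the squared error is scaled by that variance. This is one of several ways to penalize uncertainty, and the function name `gaussian_nll` and the pure-Python list inputs are assumptions made to keep the example dependency-free.

```python
import math

def gaussian_nll(y_true, y_pred, log_var):
    """Mean Gaussian negative log-likelihood (constant term dropped).

    Each sample contributes 0.5 * (exp(-log_var) * (y - mu)^2 + log_var):
    the squared error is down-weighted when the predicted variance is high,
    and the log_var term penalizes claiming high uncertainty everywhere.
    """
    total = 0.0
    for y, mu, s in zip(y_true, y_pred, log_var):
        total += 0.5 * (math.exp(-s) * (y - mu) ** 2 + s)
    return total / len(y_true)

# With log_var = 0 (unit variance) the loss reduces to half the MSE
half_mse = gaussian_nll([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])

# For a large error, admitting more uncertainty lowers the loss
confident = gaussian_nll([3.0], [0.0], [0.0])
uncertain = gaussian_nll([3.0], [0.0], [2.0])
print(half_mse, confident, uncertain)
```

PyTorch also ships `torch.nn.GaussianNLLLoss`, which takes a variance rather than a log-variance. The regularization and sample-difficulty bullets are left for you to add.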

#### Exercise B: Optimizer Ensemble
Create an ensemble optimizer that:
- Switches between SGD and Adam based on training progress
- Uses different optimizers for different layer types
- Implements a voting mechanism for parameter updates

#### Exercise C: Advanced Scheduling
Implement a learning rate scheduler that:
- Detects plateaus automatically
- Combines multiple scheduling strategies
- Adapts based on gradient statistics
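
A minimal starting point for the plateau-detection bullet is sketched below. It mirrors the idea behind `ReduceLROnPlateau` in a simplified form. The class name `PlateauDetector` and its defaults are hypothetical, and combining strategies or using gradient statistics is left to you.

```python
class PlateauDetector:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplicative cut applied on a plateau
        self.patience = patience    # non-improving epochs tolerated before a cut
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Update with the latest loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0  # start counting again after a cut
        return self.lr

# Toy run on a loss curve with two flat stretches
detector = PlateauDetector(lr=0.1, factor=0.5, patience=2)
losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
lrs = [detector.step(loss) for loss in losses]
print(lrs)
```

Each flat stretch of two non-improving epochs triggers one halving of the learning rate in this toy run.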

### 🚀 Next Steps

1. Experiment with different combinations on your own datasets
2. Profile and benchmark different approaches
3. Implement advanced optimizers such as LARS and LAMB
4. Explore meta-learning for automatic hyperparameter tuning

---

**Remember**: The best combination of loss, optimizer, and scheduler depends on your specific problem. Always validate empirically!

Happy learning! 🎉