# 4. Loss Functions

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/maleehahassan/NNBuildingBlocksTeachingPt1/blob/main/content/04_loss_functions.ipynb)

## Learning Objectives

By the end of this section, you will understand:
- What loss functions are and why they're crucial
- Common loss functions for different problem types
- How loss functions guide the learning process
- The relationship between loss functions and optimization
- How to choose the right loss function for your problem

## What is a Loss Function?

A **loss function** (also called cost function or objective function) measures how "wrong" our model's predictions are. It:

- **Quantifies the error** between predicted and actual values
- **Guides the learning process** by telling us which direction to adjust weights
- **Enables optimization** through gradient descent
- **Defines what "good" means** for our specific problem

### The Learning Process:
1. Make predictions
2. Calculate loss (how wrong we are)
3. Compute gradients (which direction to improve)
4. Update weights (take a step toward better predictions)
5. Repeat until loss is minimized

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualize the concept of loss
def visualize_loss_concept():
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # Example data
    np.random.seed(42)
    x = np.linspace(0, 10, 20)
    y_true = 2 * x + 1 + np.random.normal(0, 2, len(x))  # True relationship with noise
    
    # Three different model predictions
    models = [
        ("Bad Model", 0.5 * x + 5),
        ("OK Model", 1.5 * x + 2),
        ("Good Model", 2.1 * x + 0.8)
    ]
    
    colors = ['red', 'orange', 'green']
    
    for i, (name, y_pred) in enumerate(models):
        ax = axes[i]
        
        # Plot data and predictions
        ax.scatter(x, y_true, color='blue', alpha=0.7, s=50, label='True Data')
        ax.plot(x, y_pred, color=colors[i], linewidth=3, label=f'{name} Prediction')
        
        # Draw error lines
        for xi, yi_true, yi_pred in zip(x, y_true, y_pred):
            ax.plot([xi, xi], [yi_true, yi_pred], 'k--', alpha=0.5, linewidth=1)
        
        # Calculate and display loss (Mean Squared Error)
        mse = np.mean((y_true - y_pred)**2)
        ax.set_title(f'{name}\nMSE Loss = {mse:.2f}', fontsize=14, fontweight='bold')
        ax.set_xlabel('Input (x)')
        ax.set_ylabel('Output (y)')
        ax.legend()
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("Key Insight: Lower loss = better model performance!")
    print("The loss function gives us a single number to optimize.")

visualize_loss_concept()

## Loss Functions for Regression

Regression problems predict continuous values (prices, temperatures, etc.). Three common loss functions are:

### 1. Mean Squared Error (MSE)
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

- **Most popular** for regression
- **Penalizes large errors heavily** (squaring)
- **Differentiable everywhere**
- **Sensitive to outliers**

### 2. Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

- **Less sensitive to outliers** than MSE
- **Penalty grows linearly** with the size of the error
- **Not differentiable at zero** (in practice a subgradient such as 0 is used there)

### 3. Huber Loss
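$$L_\delta(y, \hat{y}) = \begin{cases} 
\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\
\delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise}
\end{cases}$$

- **Combines MSE and MAE**: quadratic for errors up to $\delta$, linear beyond it
- **More robust to outliers than MSE**, while staying differentiable at zero
- **Requires choosing** the threshold $\delta$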

In [None]:
# Implement and compare regression loss functions
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    error = np.abs(y_true - y_pred)
    return np.mean(np.where(error <= delta, 
                           0.5 * error**2, 
                           delta * error - 0.5 * delta**2))

# Visualize how different loss functions respond to errors
errors = np.linspace(-5, 5, 1000)
mse_values = errors**2
mae_values = np.abs(errors)
huber_values = np.where(np.abs(errors) <= 1, 
                       0.5 * errors**2, 
                       np.abs(errors) - 0.5)

plt.figure(figsize=(15, 10))

# Plot 1: Loss function shapes
plt.subplot(2, 2, 1)
plt.plot(errors, mse_values, 'r-', linewidth=3, label='MSE (Squared Error)')
plt.plot(errors, mae_values, 'b-', linewidth=3, label='MAE (Absolute Error)')
plt.plot(errors, huber_values, 'g-', linewidth=3, label='Huber Loss (δ=1)')
plt.xlabel('Error (y_true - y_pred)')
plt.ylabel('Loss Value')
plt.title('Regression Loss Functions', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-5, 5)
plt.ylim(0, 10)

# Plot 2: Sensitivity to outliers demonstration
plt.subplot(2, 2, 2)
# Normal data
y_true_normal = np.array([1, 2, 3, 4, 5])
y_pred_normal = np.array([1.1, 2.2, 2.8, 4.1, 4.9])

# Data with outlier
y_true_outlier = np.array([1, 2, 3, 4, 15])  # Last point is outlier
y_pred_outlier = np.array([1.1, 2.2, 2.8, 4.1, 4.9])

scenarios = ['Normal Data', 'With Outlier']
y_true_scenarios = [y_true_normal, y_true_outlier]
y_pred_scenarios = [y_pred_normal, y_pred_outlier]

mse_values = []
mae_values = []
huber_values = []

for y_true, y_pred in zip(y_true_scenarios, y_pred_scenarios):
    mse_values.append(mse_loss(y_true, y_pred))
    mae_values.append(mae_loss(y_true, y_pred))
    huber_values.append(huber_loss(y_true, y_pred))

x_pos = np.arange(len(scenarios))
width = 0.25

plt.bar(x_pos - width, mse_values, width, label='MSE', color='red', alpha=0.7)
plt.bar(x_pos, mae_values, width, label='MAE', color='blue', alpha=0.7)
plt.bar(x_pos + width, huber_values, width, label='Huber', color='green', alpha=0.7)

plt.xlabel('Scenario')
plt.ylabel('Loss Value')
plt.title('Outlier Sensitivity Comparison', fontsize=14, fontweight='bold')
plt.xticks(x_pos, scenarios)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Gradient comparison
plt.subplot(2, 2, 3)
# MSE gradient: 2 * error
mse_grad = 2 * errors
# MAE gradient: sign(error)
mae_grad = np.sign(errors)
# Huber gradient
huber_grad = np.where(np.abs(errors) <= 1, errors, np.sign(errors))

plt.plot(errors, mse_grad, 'r-', linewidth=3, label='MSE Gradient')
plt.plot(errors, mae_grad, 'b-', linewidth=3, label='MAE Gradient')
plt.plot(errors, huber_grad, 'g-', linewidth=3, label='Huber Gradient')
plt.xlabel('Error')
plt.ylabel('Gradient')
plt.title('Loss Function Gradients', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(-3, 3)
plt.ylim(-3, 3)

# Plot 4: Real example with outliers
plt.subplot(2, 2, 4)
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_clean = 2 * x + 1 + np.random.normal(0, 1, len(x))
y_outliers = y_clean.copy()
y_outliers[10] += 15  # Add outlier
y_outliers[30] -= 12  # Add another outlier

plt.scatter(x, y_outliers, alpha=0.7)
plt.scatter(x[10], y_outliers[10], color='red', s=100, label='Outliers')
plt.scatter(x[30], y_outliers[30], color='red', s=100)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Dataset with Outliers', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Loss Function Characteristics:")
print("MSE: Heavily penalizes large errors, sensitive to outliers")
print("MAE: Penalizes errors linearly, more robust to outliers")
print("Huber: Compromise between MSE and MAE")
print(f"\nWith outliers - MSE: {mse_values[1]:.2f}, MAE: {mae_values[1]:.2f}, Huber: {huber_values[1]:.2f}")
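
One way to see why MSE is more outlier-sensitive than MAE: for a constant prediction, MSE is minimized by the mean of the targets, while MAE is minimized by a median. The sketch below checks this by brute force over a grid of candidate constants. The data and the names `targets` and `candidates` are illustrative assumptions, not taken from the cells above.

```python
import numpy as np

# Illustrative targets: four typical values and one outlier (assumed data)
targets = np.array([1.0, 2.0, 3.0, 4.0, 15.0])

# Score every constant prediction c on a fine grid with each loss
candidates = np.linspace(0.0, 16.0, 1601)  # grid step of 0.01
mse_per_c = np.array([np.mean((targets - c) ** 2) for c in candidates])
mae_per_c = np.array([np.mean(np.abs(targets - c)) for c in candidates])

best_c_mse = candidates[np.argmin(mse_per_c)]
best_c_mae = candidates[np.argmin(mae_per_c)]

print(f"MSE-optimal constant: {best_c_mse:.2f} (mean   = {targets.mean():.2f})")
print(f"MAE-optimal constant: {best_c_mae:.2f} (median = {np.median(targets):.2f})")
```

The single outlier pulls the MSE-optimal prediction toward it, while the MAE-optimal prediction stays with the bulk of the data.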

## Loss Functions for Classification

Classification problems predict discrete categories. The appropriate loss depends on the number of classes and on whether the model outputs probabilities or raw scores:

### 1. Binary Cross-Entropy (Log Loss)
For binary classification (two classes):
$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

- **Expects predicted probabilities** $\hat{y}_i \in (0, 1)$, typically from a sigmoid output
- **Heavily penalizes confident wrong predictions**
- **Smooth and differentiable** on $(0, 1)$

### 2. Categorical Cross-Entropy
For multi-class classification:
$$\text{CCE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

### 3. Sparse Categorical Cross-Entropy
Same as categorical, but with integer labels instead of one-hot encoding.
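
To illustrate that the two variants differ only in label format, here is a minimal NumPy sketch that computes both on the same predictions. The function names `categorical_ce` and `sparse_categorical_ce` and the toy arrays are assumptions made for this example; deep learning frameworks ship their own implementations.

```python
import numpy as np

# Assumed toy predictions: 3 samples, 3 classes, each row sums to 1
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])
labels_int = np.array([0, 1, 2])        # sparse (integer) labels
labels_onehot = np.eye(3)[labels_int]   # the same labels, one-hot encoded

def categorical_ce(y_onehot, p, eps=1e-15):
    """Categorical cross-entropy with one-hot targets."""
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def sparse_categorical_ce(y_int, p, eps=1e-15):
    """Same loss, but selects the true-class probability by integer index."""
    p = np.clip(p, eps, 1.0)
    true_class_probs = p[np.arange(len(y_int)), y_int]
    return -np.mean(np.log(true_class_probs))

cce = categorical_ce(labels_onehot, probs)
scce = sparse_categorical_ce(labels_int, probs)
print(f"one-hot CCE: {cce:.4f}, sparse CCE: {scce:.4f}")
```

Both calls return the same value; the sparse form simply avoids building the one-hot matrix, which matters when the number of classes is large.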

### 4. Hinge Loss
Used in Support Vector Machines, with labels $y \in \{-1, +1\}$ and a raw (unbounded) score $\hat{y}$:
$$\text{Hinge} = \max(0, 1 - y \cdot \hat{y})$$
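
As a small worked example of the hinge formula, the sketch below evaluates it on raw scores with labels in $\{-1, +1\}$. The helper name `hinge_from_scores` and the toy values are assumptions for illustration only.

```python
import numpy as np

def hinge_from_scores(y_pm1, scores):
    """Mean hinge loss; y_pm1 holds labels in {-1, +1}, scores are raw model outputs."""
    return np.mean(np.maximum(0.0, 1.0 - y_pm1 * scores))

# Assumed toy values
labels = np.array([+1, +1, -1, -1])
scores = np.array([2.5, 0.3, -1.0, 0.4])

# Per-example losses: zero once the margin y * score reaches 1
per_example = np.maximum(0.0, 1.0 - labels * scores)
mean_hinge = hinge_from_scores(labels, scores)

print(per_example)   # [0.  0.7 0.  1.4]
print(mean_hinge)
```

The first and third examples are on the correct side with a margin of at least 1, so they contribute nothing. The second is correct but inside the margin, and the fourth is misclassified.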

In [None]:
# Implement and visualize classification loss functions
def binary_cross_entropy(y_true, y_pred):
    # Clip predictions to prevent log(0)
    y_pred_clipped = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred_clipped) + (1 - y_true) * np.log(1 - y_pred_clipped))

def hinge_loss(y_true, y_pred):
    # Hinge loss expects labels in {-1, +1} and a real-valued score.
    # Here 0/1 labels and probabilities in [0, 1] are both rescaled to [-1, 1].
    y_true_hinge = 2 * y_true - 1
    y_pred_hinge = 2 * y_pred - 1
    return np.mean(np.maximum(0, 1 - y_true_hinge * y_pred_hinge))

# Visualize binary classification loss functions
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Generate probability predictions
probabilities = np.linspace(0.001, 0.999, 1000)

# Plot 1: Binary Cross-Entropy for different true labels
ax1 = axes[0, 0]
bce_positive = -np.log(probabilities)  # y_true = 1
bce_negative = -np.log(1 - probabilities)  # y_true = 0

ax1.plot(probabilities, bce_positive, 'b-', linewidth=3, label='True class = 1')
ax1.plot(probabilities, bce_negative, 'r-', linewidth=3, label='True class = 0')
ax1.set_xlabel('Predicted Probability')
ax1.set_ylabel('Loss')
ax1.set_title('Binary Cross-Entropy Loss', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# Add annotations
ax1.annotate('Confident wrong\nprediction = high loss', 
            xy=(0.1, -np.log(0.1)), xytext=(0.3, 5),
            arrowprops=dict(arrowstyle='->', color='red', alpha=0.7))
ax1.annotate('Confident correct\nprediction = low loss', 
            xy=(0.9, -np.log(0.9)), xytext=(0.7, 0.5),
            arrowprops=dict(arrowstyle='->', color='green', alpha=0.7))

# Plot 2: Comparison of classification losses
ax2 = axes[0, 1]
# For true class = 1, the per-example cross-entropy is bce_positive from Plot 1
hinge_loss_values = np.maximum(0, 1 - (2 * probabilities - 1))
zero_one_loss = (probabilities < 0.5).astype(float)  # 0-1 loss

ax2.plot(probabilities, bce_positive, 'b-', linewidth=3, label='Cross-Entropy')
ax2.plot(probabilities, hinge_loss_values, 'g-', linewidth=3, label='Hinge Loss')
ax2.plot(probabilities, zero_one_loss, 'r-', linewidth=3, label='0-1 Loss')
ax2.set_xlabel('Predicted Probability (True class = 1)')
ax2.set_ylabel('Loss')
ax2.set_title('Classification Loss Comparison', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 5)

# Plot 3: Real classification example
ax3 = axes[1, 0]
# Generate sample data
np.random.seed(42)
n_samples = 100
X = np.random.randn(n_samples, 2)
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)

# Simulate predictions with different confidence levels
predictions_good = y_true + np.random.normal(0, 0.1, n_samples)
predictions_good = np.clip(predictions_good, 0.01, 0.99)

predictions_bad = 0.5 + np.random.normal(0, 0.1, n_samples)  # Random predictions
predictions_bad = np.clip(predictions_bad, 0.01, 0.99)

bce_good = binary_cross_entropy(y_true, predictions_good)
bce_bad = binary_cross_entropy(y_true, predictions_bad)

models = ['Good Model', 'Bad Model']
losses = [bce_good, bce_bad]
colors = ['green', 'red']

bars = ax3.bar(models, losses, color=colors, alpha=0.7)
ax3.set_ylabel('Binary Cross-Entropy Loss')
ax3.set_title('Model Comparison', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

# Add value labels on bars
for bar, loss in zip(bars, losses):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{loss:.3f}', ha='center', va='bottom', fontweight='bold')

# Plot 4: Multi-class visualization
ax4 = axes[1, 1]
# Demonstrate categorical cross-entropy for 3 classes
classes = ['Class A', 'Class B', 'Class C']
true_class = 0  # True class is A

# Different prediction scenarios
scenarios = [
    ([0.8, 0.1, 0.1], 'Confident Correct'),
    ([0.4, 0.3, 0.3], 'Uncertain Correct'),
    ([0.1, 0.8, 0.1], 'Confident Wrong'),
    ([0.33, 0.33, 0.34], 'Random Guess')
]

scenario_names = [name for _, name in scenarios]
losses = []

for pred, name in scenarios:
    # Categorical cross-entropy for true class 0
    loss = -np.log(pred[true_class])
    losses.append(loss)

bars = ax4.bar(scenario_names, losses, color=['green', 'yellow', 'red', 'orange'], alpha=0.7)
ax4.set_ylabel('Categorical Cross-Entropy Loss')
ax4.set_title('Multi-class Classification Loss', fontsize=14, fontweight='bold')
ax4.grid(True, alpha=0.3)
plt.setp(ax4.get_xticklabels(), rotation=45, ha='right')

# Add value labels
for bar, loss in zip(bars, losses):
    ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{loss:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("Classification Loss Insights:")
print("• Cross-entropy heavily penalizes confident wrong predictions")
print("• Provides smooth gradients for optimization")
print("• Encourages probability outputs (useful for uncertainty)")
print("• Hinge loss focuses on margin (decision boundary distance)")

## The Connection to Optimization

Loss functions are not just measurement tools - they **drive the learning process** through optimization algorithms like gradient descent.

### Gradient Descent Process:
1. **Forward Pass**: Calculate predictions and loss
2. **Backward Pass**: Compute gradients of loss w.r.t. weights
3. **Weight Update**: Move weights in the direction that reduces the loss
4. **Repeat**: Until convergence or another stopping criterion is met

### Mathematical Foundation:
$$\mathbf{w}_{new} = \mathbf{w}_{old} - \eta \nabla_\mathbf{w} L$$

Where:
- $\mathbf{w}$ = weights
- $\eta$ = learning rate
- $\nabla_\mathbf{w} L$ = gradient of loss w.r.t. weights
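
Before relying on a hand-derived gradient in the update rule, it can help to check it numerically. The following sketch compares the analytic MSE gradient for a one-parameter model against a central finite difference, then applies a single update step. The data, the starting weight, and the values of `h` and `eta` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)   # assumed toy data, true slope 3

def mse(w):
    """MSE of the one-parameter model y_pred = w * x."""
    return np.mean((w * x - y) ** 2)

def mse_grad(w):
    """Analytic gradient: d/dw mean((w*x - y)^2) = 2 * mean((w*x - y) * x)."""
    return 2.0 * np.mean((w * x - y) * x)

w = 0.5     # starting weight
h = 1e-5    # finite-difference step
numeric_grad = (mse(w + h) - mse(w - h)) / (2 * h)   # central difference
analytic_grad = mse_grad(w)

eta = 0.1                             # learning rate
w_new = w - eta * analytic_grad       # one gradient descent update

print(f"analytic = {analytic_grad:.6f}, numeric = {numeric_grad:.6f}")
print(f"loss before = {mse(w):.4f}, after one step = {mse(w_new):.4f}")
```

If the two gradient values disagree noticeably, the analytic derivation (or its implementation) is the first thing to re-examine.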

In [None]:
# Demonstrate gradient descent with different loss functions
def demonstrate_gradient_descent():
    # Simple 1D optimization problem
    # True function: y = 3x + 2
    # The model y_pred = w * x has no intercept, so it cannot fit the offset of 2;
    # only the slope (weight) is learned, starting from w = 0
    
    np.random.seed(42)
    x_data = np.random.randn(100)
    y_data = 3 * x_data + 2 + np.random.normal(0, 0.5, 100)
    
    def mse_gradient(w, x, y):
        """Gradient of MSE loss w.r.t. weight w"""
        y_pred = w * x
        return 2 * np.mean((y_pred - y) * x)
    
    def mae_gradient(w, x, y):
        """Gradient of MAE loss w.r.t. weight w"""
        y_pred = w * x
        return np.mean(np.sign(y_pred - y) * x)
    
    # Initialize weights
    w_mse = 0.0
    w_mae = 0.0
    learning_rate = 0.1  # 0.01 is too small for the weights to settle within 100 epochs here
    epochs = 100
    
    # Track progress
    weights_mse = [w_mse]
    weights_mae = [w_mae]
    losses_mse = []
    losses_mae = []
    
    # Gradient descent
    for epoch in range(epochs):
        # MSE optimization
        grad_mse = mse_gradient(w_mse, x_data, y_data)
        w_mse -= learning_rate * grad_mse
        loss_mse = np.mean((w_mse * x_data - y_data)**2)
        
        # MAE optimization  
        grad_mae = mae_gradient(w_mae, x_data, y_data)
        w_mae -= learning_rate * grad_mae
        loss_mae = np.mean(np.abs(w_mae * x_data - y_data))
        
        weights_mse.append(w_mse)
        weights_mae.append(w_mae)
        losses_mse.append(loss_mse)
        losses_mae.append(loss_mae)
    
    # Visualize optimization process
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Plot 1: Weight convergence
    ax1 = axes[0, 0]
    ax1.plot(weights_mse, 'b-', linewidth=2, label='MSE Optimization')
    ax1.plot(weights_mae, 'r-', linewidth=2, label='MAE Optimization')
    ax1.axhline(y=3, color='green', linestyle='--', linewidth=2, label='True Weight = 3')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Weight Value')
    ax1.set_title('Weight Convergence', fontsize=14, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Loss convergence
    ax2 = axes[0, 1]
    ax2.plot(losses_mse, 'b-', linewidth=2, label='MSE Loss')
    ax2.plot(losses_mae, 'r-', linewidth=2, label='MAE Loss')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Loss Value')
    ax2.set_title('Loss Convergence', fontsize=14, fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_yscale('log')
    
    # Plot 3: Loss landscape
    ax3 = axes[1, 0]
    weight_range = np.linspace(-1, 5, 100)
    mse_landscape = [np.mean((w * x_data - y_data)**2) for w in weight_range]
    mae_landscape = [np.mean(np.abs(w * x_data - y_data)) for w in weight_range]
    
    ax3.plot(weight_range, mse_landscape, 'b-', linewidth=3, label='MSE Loss')
    ax3.plot(weight_range, mae_landscape, 'r-', linewidth=3, label='MAE Loss')
    ax3.axvline(x=3, color='green', linestyle='--', linewidth=2, label='True Weight')
    
    # Show optimization paths
    mse_path_losses = [np.mean((w * x_data - y_data)**2) for w in weights_mse[::10]]
    mae_path_losses = [np.mean(np.abs(w * x_data - y_data)) for w in weights_mae[::10]]
    
    ax3.plot(weights_mse[::10], mse_path_losses, 'bo', markersize=8, alpha=0.7)
    ax3.plot(weights_mae[::10], mae_path_losses, 'ro', markersize=8, alpha=0.7)
    
    ax3.set_xlabel('Weight Value')
    ax3.set_ylabel('Loss Value')
    ax3.set_title('Loss Landscape & Optimization Path', fontsize=14, fontweight='bold')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # Plot 4: Final predictions
    ax4 = axes[1, 1]
    x_test = np.linspace(-3, 3, 100)
    y_true_line = 3 * x_test + 2
    y_pred_mse = weights_mse[-1] * x_test
    y_pred_mae = weights_mae[-1] * x_test
    
    ax4.scatter(x_data, y_data, alpha=0.5, label='Training Data')
    ax4.plot(x_test, y_true_line, 'g-', linewidth=3, label='True Function')
    ax4.plot(x_test, y_pred_mse, 'b--', linewidth=2, label=f'MSE Result (w={weights_mse[-1]:.2f})')
    ax4.plot(x_test, y_pred_mae, 'r--', linewidth=2, label=f'MAE Result (w={weights_mae[-1]:.2f})')
    
    ax4.set_xlabel('x')
    ax4.set_ylabel('y')
    ax4.set_title('Final Learned Functions', fontsize=14, fontweight='bold')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"Final Results:")
    print(f"True weight: 3.00")
    print(f"MSE learned weight: {weights_mse[-1]:.3f}")
    print(f"MAE learned weight: {weights_mae[-1]:.3f}")
    print(f"MSE final loss: {losses_mse[-1]:.4f}")
    print(f"MAE final loss: {losses_mae[-1]:.4f}")

demonstrate_gradient_descent()

## Advanced Loss Functions

Modern deep learning uses specialized loss functions for specific problems:

### 1. Focal Loss
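For imbalanced classification problems:
$$\text{FL}(p_t) = -(1-p_t)^\gamma \log(p_t)$$

where $p_t$ is the probability the model assigns to the true class and $\gamma \geq 0$ is the focusing parameter ($\gamma = 0$ recovers cross-entropy).

- **Focuses on hard examples**
- **Reduces impact of easy examples**
- **Addresses class imbalance**, often combined with a class weight $\alpha$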

### 2. Dice Loss
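For image segmentation, where $A$ is the set of predicted foreground pixels and $B$ the set of ground-truth foreground pixels:
$$\text{Dice} = 1 - \frac{2|A \cap B|}{|A| + |B|}$$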
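
A minimal NumPy sketch of the Dice loss on binary masks follows. The function name `dice_loss`, the smoothing term `eps` (added so that two empty masks do not produce 0/0), and the toy masks are assumptions for this example rather than a reference implementation.

```python
import numpy as np

def dice_loss(mask_true, mask_pred, eps=1e-7):
    """1 - Dice coefficient for binary (or soft, in [0, 1]) masks.

    eps is an assumed smoothing term that avoids 0/0 when both masks are empty.
    """
    mask_true = np.asarray(mask_true, dtype=float).ravel()
    mask_pred = np.asarray(mask_pred, dtype=float).ravel()
    intersection = np.sum(mask_true * mask_pred)
    return 1.0 - (2.0 * intersection + eps) / (mask_true.sum() + mask_pred.sum() + eps)

# Assumed 4x4 toy masks
truth = np.zeros((4, 4))
truth[1:3, 1:3] = 1          # 4 foreground pixels
pred = np.zeros((4, 4))
pred[1:3, 1:4] = 1           # 6 predicted pixels, 4 of them overlap the truth

loss_perfect = dice_loss(truth, truth)   # identical masks
loss_partial = dice_loss(truth, pred)    # 1 - 2*4 / (4 + 6) = 0.2

print(loss_perfect, loss_partial)
```

Because the loss depends on overlap relative to mask size, a large background region does not dominate it the way it can with per-pixel cross-entropy.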

### 3. Contrastive Loss
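For similarity learning over $N$ pairs, where $d_n$ is the distance between the two embeddings of pair $n$, $y_n = 1$ for similar pairs and $0$ for dissimilar ones, and $m$ is the margin:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n d_n^2 + (1-y_n) \max(0, m-d_n)^2 \right]$$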
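
The contrastive loss can be sketched in a few lines of NumPy. This version assumes Euclidean distance between embeddings and the convention that a label of 1 marks a similar pair; the function name `contrastive_loss` and the toy embeddings are hypothetical.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """Contrastive loss over pairs; y = 1 for similar pairs, 0 for dissimilar pairs."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)                     # distance per pair
    similar_term = y * d ** 2                                     # pull similar pairs together
    dissimilar_term = (1 - y) * np.maximum(0.0, margin - d) ** 2  # push others past the margin
    return np.sum(similar_term + dissimilar_term) / (2 * len(y))

# Assumed toy embeddings: pair distances are 0.5, 0.5 and 5.0
emb_a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
pair_labels = np.array([1, 0, 0])   # first pair similar, the other two dissimilar

loss_value = contrastive_loss(emb_a, emb_b, pair_labels)
print(loss_value)   # (0.25 + 0.25 + 0) / 6
```

The third pair is dissimilar and already farther apart than the margin, so it contributes no loss and no gradient.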

### 4. Triplet Loss
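For embedding learning, with anchor $a$, positive $p$ (same class as the anchor), negative $n$ (different class), and distance function $d$:
$$L = \max(0, d(a,p) - d(a,n) + \text{margin})$$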
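
Here is a short sketch of the triplet loss, assuming Euclidean distance for $d$ and averaging over a batch of triplets. The function name `triplet_loss` and the toy embeddings are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean triplet loss over a batch, using Euclidean distance (an assumed choice)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)   # anchor-to-positive distance
    d_an = np.linalg.norm(anchor - negative, axis=1)   # anchor-to-negative distance
    return np.mean(np.maximum(0.0, d_ap - d_an + margin))

# Assumed toy batch of two triplets
anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.3, 0.4], [0.3, 0.4]])   # both at distance 0.5
negative = np.array([[3.0, 4.0], [0.6, 0.8]])   # at distances 5.0 and 1.0

triplet_value = triplet_loss(anchor, positive, negative)
print(triplet_value)   # mean of max(0, 0.5 - 5 + 1) and max(0, 0.5 - 1 + 1)
```

The first triplet already satisfies the margin and contributes zero; the second has a negative that is too close to the anchor and contributes 0.5.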

In [None]:
# Demonstrate focal loss for imbalanced classification
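def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss for binary labels (y_true in {0, 1}).

    alpha weights the positive class and (1 - alpha) the negative class;
    gamma controls how strongly easy examples are down-weighted.
    """
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)

    # p_t: probability assigned to the true class
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    # alpha_t: class weight matching the true class
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    focal_weight = (1 - p_t) ** gamma

    return np.mean(-alpha_t * focal_weight * np.log(p_t))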

# Compare standard cross-entropy with focal loss
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Plot 1: Focal loss vs cross-entropy for different confidence levels
ax1 = axes[0, 0]
probs = np.linspace(0.01, 0.99, 100)
# Per-example losses for a positive example (y = 1) at each predicted probability
fl_loss_gamma1 = [focal_loss(np.array([1]), np.array([p]), gamma=1.0) for p in probs]
fl_loss_gamma2 = [focal_loss(np.array([1]), np.array([p]), gamma=2.0) for p in probs]
fl_loss_gamma5 = [focal_loss(np.array([1]), np.array([p]), gamma=5.0) for p in probs]

ax1.plot(probs, -np.log(probs), 'k-', linewidth=3, label='Cross-Entropy')
ax1.plot(probs, fl_loss_gamma1, 'b-', linewidth=2, label='Focal (γ=1)')
ax1.plot(probs, fl_loss_gamma2, 'r-', linewidth=2, label='Focal (γ=2)')
ax1.plot(probs, fl_loss_gamma5, 'g-', linewidth=2, label='Focal (γ=5)')

ax1.set_xlabel('Predicted Probability (True class = 1)')
ax1.set_ylabel('Loss')
ax1.set_title('Focal Loss vs Cross-Entropy', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# Plot 2: Effect on easy vs hard examples
ax2 = axes[0, 1]
# Easy examples (high confidence correct predictions)
easy_probs = np.array([0.9, 0.95, 0.99])
# Hard examples (low confidence correct predictions)
hard_probs = np.array([0.6, 0.7, 0.8])

easy_ce = -np.log(easy_probs)
hard_ce = -np.log(hard_probs)
easy_focal = [(1-p)**2 * (-np.log(p)) for p in easy_probs]
hard_focal = [(1-p)**2 * (-np.log(p)) for p in hard_probs]

x = np.arange(3)
width = 0.35

ax2.bar(x - width/2, easy_ce, width, label='Easy Examples (CE)', color='lightblue', alpha=0.7)
ax2.bar(x + width/2, easy_focal, width, label='Easy Examples (Focal)', color='blue', alpha=0.7)
ax2.bar(x + 3 - width/2, hard_ce, width, label='Hard Examples (CE)', color='lightcoral', alpha=0.7)
ax2.bar(x + 3 + width/2, hard_focal, width, label='Hard Examples (Focal)', color='red', alpha=0.7)

ax2.set_xlabel('Example Index')
ax2.set_ylabel('Loss Value')
ax2.set_title('Easy vs Hard Examples', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xticks([1, 4])
ax2.set_xticklabels(['Easy Examples\n(0.9, 0.95, 0.99)', 'Hard Examples\n(0.6, 0.7, 0.8)'])

# Plot 3: Imbalanced dataset simulation
ax3 = axes[1, 0]
np.random.seed(42)

# Create imbalanced dataset (10% positive class)
n_samples = 1000
n_positive = 100
n_negative = 900

# Simulate predictions (model struggles with minority class)
positive_preds = np.random.beta(2, 3, n_positive)  # Skewed toward lower probabilities
negative_preds = np.random.beta(1, 4, n_negative)  # Skewed toward lower probabilities

y_true_imbalanced = np.concatenate([np.ones(n_positive), np.zeros(n_negative)])
y_pred_imbalanced = np.concatenate([positive_preds, negative_preds])

# Calculate losses
ce_loss_imbalanced = binary_cross_entropy(y_true_imbalanced, y_pred_imbalanced)
focal_loss_imbalanced = focal_loss(y_true_imbalanced, y_pred_imbalanced, gamma=2.0)

losses = [ce_loss_imbalanced, focal_loss_imbalanced]
loss_names = ['Cross-Entropy', 'Focal Loss']
colors = ['blue', 'red']

bars = ax3.bar(loss_names, losses, color=colors, alpha=0.7)
ax3.set_ylabel('Average Loss')
ax3.set_title('Imbalanced Dataset (10% positive class)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

for bar, loss in zip(bars, losses):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
             f'{loss:.4f}', ha='center', va='bottom', fontweight='bold')

# Plot 4: Class-wise loss breakdown
ax4 = axes[1, 1]
# Calculate per-class losses
positive_mask = y_true_imbalanced == 1
negative_mask = y_true_imbalanced == 0

ce_pos = np.mean(-np.log(np.clip(y_pred_imbalanced[positive_mask], 1e-15, 1)))
ce_neg = np.mean(-np.log(np.clip(1 - y_pred_imbalanced[negative_mask], 1e-15, 1)))

focal_pos = np.mean([focal_loss(np.array([1]), np.array([p]), gamma=2.0) 
                    for p in y_pred_imbalanced[positive_mask]])
focal_neg = np.mean([focal_loss(np.array([0]), np.array([p]), gamma=2.0) 
                    for p in y_pred_imbalanced[negative_mask]])

classes = ['Positive Class\n(minority)', 'Negative Class\n(majority)']
ce_losses = [ce_pos, ce_neg]
focal_losses = [focal_pos, focal_neg]

x = np.arange(len(classes))
width = 0.35

ax4.bar(x - width/2, ce_losses, width, label='Cross-Entropy', alpha=0.7)
ax4.bar(x + width/2, focal_losses, width, label='Focal Loss', alpha=0.7)

ax4.set_ylabel('Average Loss per Class')
ax4.set_title('Per-Class Loss Comparison', fontsize=14, fontweight='bold')
ax4.set_xticks(x)
ax4.set_xticklabels(classes)
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Focal Loss Benefits:")
print("• Reduces loss for easy examples (well-classified)")
print("• Maintains high loss for hard examples (misclassified)")
print("• Helps with class imbalance by focusing on difficult cases")
print(f"• In this example: CE loss = {ce_loss_imbalanced:.4f}, Focal loss = {focal_loss_imbalanced:.4f}")

## Choosing the Right Loss Function

### Decision Framework:

#### Problem Type:
1. **Regression** → MSE, MAE, Huber
2. **Binary Classification** → Binary Cross-Entropy, Hinge
3. **Multi-class Classification** → Categorical Cross-Entropy
4. **Multi-label Classification** → Binary Cross-Entropy per label
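
For the multi-label case, "binary cross-entropy per label" means each label gets its own independent yes/no probability, so a row of predictions need not sum to 1. A minimal sketch, with the hypothetical helper `multilabel_bce` and assumed toy arrays:

```python
import numpy as np

def multilabel_bce(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy applied independently to every (sample, label) entry.

    Returns the overall mean and the mean loss per label (column).
    """
    p = np.clip(y_prob, eps, 1 - eps)
    per_entry = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_entry.mean(), per_entry.mean(axis=0)

# Assumed toy data: 2 samples, 3 labels; a sample may carry several labels at once
y_true_ml = np.array([[1, 0, 1],
                      [0, 1, 1]])
y_prob_ml = np.array([[0.9, 0.2, 0.6],
                      [0.1, 0.7, 0.4]])   # independent sigmoid-style outputs

overall, per_label = multilabel_bce(y_true_ml, y_prob_ml)
print(f"overall: {overall:.4f}")
print("per label:", np.round(per_label, 4))
```

The per-label breakdown is often useful for spotting labels the model handles poorly, here the third one.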

#### Data Characteristics:
- **Outliers present** → MAE, Huber Loss
- **Class imbalance** → Focal Loss, Weighted Cross-Entropy
- **Need probabilities** → Cross-Entropy based losses
- **Need margins** → Hinge Loss, SVM-style losses
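
Weighted cross-entropy, listed above as an option for class imbalance, scales the loss terms of one class so that errors on the rare class count more. The sketch below is one possible form, assuming a single `pos_weight` multiplier on the positive-class term; the function name and the neg/pos ratio heuristic are illustrative choices, not the only way to set the weight.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-15):
    """BCE with the positive-class term scaled by pos_weight (assumed parameter name)."""
    p = np.clip(y_prob, eps, 1 - eps)
    per_example = -(pos_weight * y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_example.mean()

# Assumed toy data: 10% positives, and a model that always predicts 0.1
y_true_wb = np.array([1] + [0] * 9)
y_prob_wb = np.full(10, 0.1)

n_pos = y_true_wb.sum()
n_neg = len(y_true_wb) - n_pos
ratio_weight = n_neg / n_pos          # 9.0; one common heuristic

plain = weighted_bce(y_true_wb, y_prob_wb, pos_weight=1.0)
weighted = weighted_bce(y_true_wb, y_prob_wb, pos_weight=ratio_weight)
print(f"plain BCE: {plain:.4f}, weighted BCE: {weighted:.4f}")
```

With the weight applied, the single missed positive dominates the loss instead of being averaged away by the nine easy negatives.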

#### Optimization Considerations:
- **Smooth gradients needed** → Cross-Entropy, MSE
- **Robust to noise** → Huber, MAE
- **Fast convergence** → Well-conditioned losses

### Common Pitfalls:
1. **Wrong loss for problem type** (MSE for classification)
2. **Ignoring class imbalance** (standard CE for imbalanced data)
3. **Not considering outliers** (MSE with noisy data)
4. **Inappropriate output activation** (e.g., a sigmoid output trained with MSE, which learns slowly once the sigmoid saturates)

In [None]:
# Create a comprehensive comparison table
import pandas as pd

loss_comparison = {
    'Loss Function': [
        'Mean Squared Error (MSE)',
        'Mean Absolute Error (MAE)', 
        'Huber Loss',
        'Binary Cross-Entropy',
        'Categorical Cross-Entropy',
        'Hinge Loss',
        'Focal Loss'
    ],
    'Problem Type': [
        'Regression',
        'Regression',
        'Regression',
        'Binary Classification',
        'Multi-class Classification',
        'Binary Classification',
        'Imbalanced Classification'
    ],
    'Output Range': [
        '[0, ∞)',
        '[0, ∞)',
        '[0, ∞)',
        '[0, ∞)',
        '[0, ∞)',
        '[0, ∞)',
        '[0, ∞)'
    ],
    'Outlier Sensitivity': [
        'High',
        'Low',
        'Medium',
        'Medium',
        'Medium',
        'Low',
        'Adaptive'
    ],
    'Probabilistic Output': [
        'No',
        'No',
        'No',
        'Yes',
        'Yes',
        'No',
        'Yes'
    ],
    'Common Use Cases': [
        'General regression, least squares',
        'Robust regression, outlier-prone data',
        'Regression with some outliers',
        'Binary classification, probability estimation',
        'Multi-class classification',
        'SVMs, margin-based classification',
        'Imbalanced datasets, object detection'
    ]
}

df = pd.DataFrame(loss_comparison)
print("Loss Function Comparison Guide:")
print("=" * 120)
print(df.to_string(index=False, max_colwidth=50))

# Quick selection flowchart
print("\n" + "="*60)
print("QUICK SELECTION GUIDE:")
print("="*60)
print("📊 REGRESSION PROBLEMS:")
print("   • Normal data, no outliers → MSE")
print("   • Data with outliers → MAE or Huber Loss")
print("   • Need smooth gradients → MSE or Huber")
print()
print("🎯 CLASSIFICATION PROBLEMS:")
print("   • Binary classification → Binary Cross-Entropy")
print("   • Multi-class classification → Categorical Cross-Entropy")
print("   • Imbalanced classes → Focal Loss or Weighted CE")
print("   • Need decision margins → Hinge Loss")
print()
print("⚡ SPECIAL CASES:")
print("   • Object detection → Focal Loss")
print("   • Image segmentation → Dice Loss")
print("   • Similarity learning → Contrastive/Triplet Loss")
print("   • Ranking problems → Pairwise/Listwise losses")

## Key Takeaways

### Loss Functions Are Essential Because They:
1. **Define the learning objective** - what "good" means
2. **Guide optimization** - provide gradients for weight updates
3. **Match problem requirements** - regression vs classification
4. **Handle data characteristics** - outliers, imbalance, noise
5. **Enable different behaviors** - probability vs margin-based

### Best Practices:
1. **Match loss to problem type** (don't use MSE for classification)
2. **Consider data characteristics** (outliers, imbalance)
3. **Ensure compatible output activation** (sigmoid + BCE)
4. **Monitor during training** (loss should generally decrease)
5. **Use appropriate metrics** (the training loss is not always the best evaluation metric)
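
On the sigmoid + BCE pairing: computing the sigmoid first and then taking logs can overflow to infinity for large logits. A common remedy is to compute the loss directly from the raw logit $z$ using the identity $\max(z, 0) - z y + \log(1 + e^{-|z|})$. The sketch below is a minimal NumPy version of that idea; the function names are assumptions, and frameworks provide their own fused versions.

```python
import numpy as np

def bce_with_logits(y_true, logits):
    """Numerically stable BCE from raw logits z.

    Uses the identity: loss = max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    z = np.asarray(logits, dtype=float)
    y = np.asarray(y_true, dtype=float)
    return np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_via_sigmoid(y_true, logits):
    """Straightforward version: sigmoid first, then logs (fine for moderate logits)."""
    y = np.asarray(y_true, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Moderate logits: both versions agree
y_mod = np.array([1, 0, 1])
z_mod = np.array([2.0, -1.0, 0.5])
stable_mod = bce_with_logits(y_mod, z_mod)
naive_mod = bce_via_sigmoid(y_mod, z_mod)

# Extreme, confidently wrong logits: the stable form stays finite
stable_extreme = bce_with_logits(np.array([0, 1]), np.array([50.0, -50.0]))

print(f"moderate: stable = {stable_mod:.6f}, via sigmoid = {naive_mod:.6f}")
print(f"extreme:  stable = {stable_extreme:.4f}")
```

For the extreme case the sigmoid rounds to exactly 0 or 1 in floating point, so the sigmoid-then-log route would hit `log(0)`, whereas the logit form returns a loss of about 50.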

### Common Mistakes to Avoid:
- Using MSE for classification problems
- Ignoring class imbalance in loss selection
- Not preprocessing data appropriately for chosen loss
- Confusing loss functions with evaluation metrics
- Not considering computational efficiency for large datasets

### The Big Picture:
Loss functions are the **bridge between problem definition and optimization**. They translate our high-level goals ("classify images correctly", "predict house prices accurately") into mathematical objectives that computers can optimize.

## Discussion Questions

1. Why can't we just use 0-1 loss (number of mistakes) for classification?
2. When might you want to design a custom loss function?
3. How do loss functions relate to the business objectives of a project?
4. What happens if you use the "wrong" loss function?

---

**Next**: Put it all together with **Hands-on Exercises** where you'll implement and experiment with these concepts!