# Deep Learning: Initialization and Normalization Techniques
## Interactive Learning Notebook

**Author**: Based on Lecture 9 by Ho-min Park

**Contents**:
1. **Part 1**: Setup and Fundamentals
2. **Part 2**: Weight Initialization Strategies
3. **Part 3**: Normalization Techniques
4. **Part 4**: Regularization and Generalization
5. **Part 5**: Advanced Techniques and Summary

---

### Learning Objectives
- Understand the importance of proper weight initialization
- Implement various initialization strategies (Xavier, He, LSUV)
- Master normalization techniques (Batch, Layer, Group Norm)
- Apply regularization methods (Dropout, Data Augmentation)
- Analyze the impact on training stability and convergence

## Part 1: Setup and Fundamentals

In [None]:
# Essential imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Deep Learning imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Sklearn for datasets and metrics
from sklearn.datasets import make_classification, make_regression, load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

# Warnings and display settings
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Display settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✅ All libraries imported successfully!')
print(f'PyTorch version: {torch.__version__}')

---
## Exercise 1: Visualizing Gradient Vanishing/Exploding Problem

### 📚 Concept
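When weights are initialized poorly, gradients can either vanish (become too small) or explode (become too large) during backpropagation. The backward pass multiplies the gradient by each layer's weight matrix and activation derivative in turn, so a per-layer factor that stays below or above 1 compounds exponentially with depth.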

- **Vanishing gradients**: Weights initialized too small → gradients shrink exponentially
- **Exploding gradients**: Weights initialized too large → gradients grow exponentially

In [None]:
def simulate_gradient_flow(init_scale, num_layers=10, activation='tanh'):
    """Simulate gradient flow through a deep network with different initializations"""
    np.random.seed(42)
    
    # Initialize weights and gradients
    weights = []
    gradients = []
    layer_outputs = []
    
    # Input
    x = np.random.randn(100, 10)  # 100 samples, 10 features
    layer_outputs.append(x)
    
    # Forward pass
    for i in range(num_layers):
        w = np.random.randn(10, 10) * init_scale
        weights.append(w)
        
        # Linear transformation
        z = layer_outputs[-1] @ w
        
        # Apply activation
        if activation == 'tanh':
            a = np.tanh(z)
        elif activation == 'relu':
            a = np.maximum(0, z)
        else:
            a = z
        
        layer_outputs.append(a)
    
    # Simulate backward pass (simplified)
    grad = np.ones_like(layer_outputs[-1])
    gradient_norms = []
    
    for i in reversed(range(num_layers)):
        # Gradient through activation
        if activation == 'tanh':
            grad = grad * (1 - layer_outputs[i+1]**2)
        elif activation == 'relu':
            grad = grad * (layer_outputs[i+1] > 0)
        
        # Gradient through linear layer
        grad = grad @ weights[i].T
        gradient_norms.append(np.linalg.norm(grad))
    
    return gradient_norms[::-1], [np.linalg.norm(out) for out in layer_outputs]

# Test different initialization scales
scales = [0.01, 0.1, 1.0, 2.0]
fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=[f'Init Scale = {s}' for s in scales],
                    shared_yaxes=True)

for idx, scale in enumerate(scales):
    grad_norms, output_norms = simulate_gradient_flow(scale)
    row = idx // 2 + 1
    col = idx % 2 + 1
    
    fig.add_trace(
        go.Scatter(x=list(range(len(grad_norms))), y=grad_norms,
                   mode='lines+markers', name=f'Gradients (scale={scale})'),
        row=row, col=col
    )
    
    fig.add_trace(
        go.Scatter(x=list(range(len(output_norms))), y=output_norms,
                   mode='lines+markers', name=f'Outputs (scale={scale})',
                   line=dict(dash='dash')),
        row=row, col=col
    )

fig.update_layout(height=600, title_text="Gradient Flow with Different Initializations",
                  showlegend=False)
fig.update_xaxes(title_text="Layer")
fig.update_yaxes(title_text="Norm", type="log")
fig.show()

print("📊 Observations:")
print("- Small init (0.01): Gradients vanish quickly")
print("- Large init (2.0): Gradients may explode")
print("- Moderate init (0.1-1.0): More stable gradient flow; the balanced scale is roughly 1/sqrt(fan_in) ≈ 0.32 for 10 units")
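
The compounding effect in the simulation above can be checked with a short calculation. This is a minimal sketch that assumes a purely linear stack (no activation), where each layer should scale the activation standard deviation by roughly `init_scale * sqrt(fan_in)`. The helper name `linear_gain` is hypothetical and not part of the lecture code.

```python
import numpy as np

def linear_gain(init_scale, fan_in=10, num_layers=10, seed=0):
    """Average per-layer ratio of activation std in a purely linear stack."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, fan_in))
    ratios = []
    for _ in range(num_layers):
        w = rng.standard_normal((fan_in, fan_in)) * init_scale
        h_next = h @ w
        ratios.append(h_next.std() / h.std())
        h = h_next
    return float(np.mean(ratios))

fan_in = 10
for scale in [0.01, 0.1, 1.0 / np.sqrt(fan_in), 1.0, 2.0]:
    predicted = scale * np.sqrt(fan_in)  # gain < 1 shrinks the signal, gain > 1 grows it
    measured = linear_gain(scale, fan_in=fan_in)
    print(f"scale={scale:.3f}  predicted gain={predicted:.3f}  measured gain={measured:.3f}")
```

With only 10 units per layer the measured gain fluctuates around the prediction, but the trend is the same: a gain well below 1 shrinks the signal at every layer, and a gain well above 1 grows it.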

---
## Exercise 2: Xavier/Glorot Initialization

### 📚 Concept
Xavier initialization maintains variance across layers by scaling weights based on the number of input and output neurons:
- **Uniform**: $W \sim U(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}})$
- **Normal**: $W \sim N(0, \frac{2}{n_{in} + n_{out}})$

Designed for **sigmoid and tanh** activations.

In [None]:
class XavierInitNetwork(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, init_type='xavier'):
        super(XavierInitNetwork, self).__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layer = nn.Linear(prev_size, hidden_size)

            # Apply initialization
            if init_type == 'xavier':
                nn.init.xavier_uniform_(layer.weight)
            elif init_type == 'xavier_normal':
                nn.init.xavier_normal_(layer.weight)
            elif init_type == 'zeros':
                nn.init.zeros_(layer.weight)
            elif init_type == 'random':
                nn.init.uniform_(layer.weight, -0.1, 0.1)

            layers.append(layer)
            layers.append(nn.Tanh())  # Using tanh for Xavier
            prev_size = hidden_size

        # Output layer
        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Test initialization methods
print('Xavier initialization implemented!')
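
The two Xavier formulas can be checked numerically against PyTorch's built-in initializers. The sketch below assumes a single `nn.Linear(64, 32)` layer. The sizes and the variable names (`bound`, `target_std`, and so on) are chosen only for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 64, 32
layer = nn.Linear(fan_in, fan_out)

# Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
bound = math.sqrt(6.0 / (fan_in + fan_out))
nn.init.xavier_uniform_(layer.weight)
max_abs = layer.weight.abs().max().item()

# Xavier normal: N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))
target_std = math.sqrt(2.0 / (fan_in + fan_out))
nn.init.xavier_normal_(layer.weight)
empirical_std = layer.weight.std().item()

print(f'Uniform bound: {bound:.4f}, largest |w| drawn: {max_abs:.4f}')
print(f'Normal target std: {target_std:.4f}, empirical std: {empirical_std:.4f}')
```

Both variants have the same variance, $\frac{2}{n_{in} + n_{out}}$, because a uniform distribution on $(-a, a)$ has variance $a^2/3$.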

---
## Exercise 3: He Initialization for ReLU Networks

### 📚 Concept
He initialization accounts for ReLU zeroing out roughly half of the activations:
- **Formula**: $W \sim N(0, \frac{2}{n_{in}})$
- **Key insight**: The variance factor of 2 compensates for the signal lost when ReLU zeroes the negative half
- **Best for**: Networks with ReLU, LeakyReLU, PReLU activations

In [None]:
def visualize_activation_distributions():
    '''Visualize how different initializations affect activation distributions'''
    torch.manual_seed(42)

    # Create input
    x = torch.randn(1000, 100)

    # Test He vs Xavier with ReLU
    init_methods = {
        'He': lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'),
        'Xavier': lambda w: nn.init.xavier_normal_(w),
        'Small (0.01)': lambda w: nn.init.normal_(w, std=0.01),
        'Large (1.0)': lambda w: nn.init.normal_(w, std=1.0)
    }

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    for idx, (name, init_fn) in enumerate(init_methods.items()):
        # Create layer
        layer = nn.Linear(100, 100)
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)

        # Forward pass
        with torch.no_grad():
            output = F.relu(layer(x))

        # Plot distribution
        axes[idx].hist(output.numpy().flatten(), bins=50, alpha=0.7, edgecolor='black')
        axes[idx].set_title(f'{name} Initialization')
        axes[idx].set_xlabel('Activation Value')
        axes[idx].set_ylabel('Frequency')
        axes[idx].axvline(x=0, color='red', linestyle='--', alpha=0.5)

        # Add statistics
        mean_val = output.mean().item()
        std_val = output.std().item()
        axes[idx].text(0.95, 0.95, f'μ={mean_val:.2f}\nσ={std_val:.2f}',
                      transform=axes[idx].transAxes, ha='right', va='top',
                      bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

    plt.tight_layout()
    plt.show()
    print('He initialization maintains better distribution with ReLU!')

visualize_activation_distributions()
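
A single layer hides most of the difference between He and Xavier, so it can help to stack several ReLU layers and track the signal magnitude. The sketch below assumes 10 layers of width 256 with zero biases. The helper `final_activation_rms` is hypothetical and only serves this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def final_activation_rms(init_fn, width=256, depth=10, seed=0):
    """Root-mean-square activation after `depth` freshly initialized ReLU layers."""
    torch.manual_seed(seed)
    h = torch.randn(512, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width)
            init_fn(layer.weight)
            nn.init.zeros_(layer.bias)
            h = F.relu(layer(h))
    return h.pow(2).mean().sqrt().item()

he_rms = final_activation_rms(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'))
xavier_rms = final_activation_rms(nn.init.xavier_normal_)

print(f'RMS activation after 10 ReLU layers, He:     {he_rms:.4f}')
print(f'RMS activation after 10 ReLU layers, Xavier: {xavier_rms:.4f}')
```

With equal fan-in and fan-out, Xavier gives each weight variance $1/n$. ReLU then roughly halves the mean squared activation at every layer, while the factor of 2 in He initialization keeps it near its starting value.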

---
## Exercise 4: Batch Normalization Implementation

### 📚 Concept
Batch Normalization normalizes inputs across the batch dimension:
- **Training**: Uses batch statistics (mean, variance)
- **Inference**: Uses running statistics
- **Formula**: $\hat{x} = \frac{x - \mu_{batch}}{\sqrt{\sigma^2_{batch} + \epsilon}}$, $y = \gamma \cdot \hat{x} + \beta$

In [None]:
class CustomBatchNorm(nn.Module):
    '''Custom Batch Normalization implementation for educational purposes'''

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super(CustomBatchNorm, self).__init__()

        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

        # Running statistics (not learnable)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

        self.eps = eps
        self.momentum = momentum

    def forward(self, x):
        if self.training:
            # Calculate batch statistics
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)

            # Update running statistics in place and outside the autograd graph,
            # so the buffers do not keep references to past computation graphs
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * batch_var)

            # Normalize
            x_normalized = (x - batch_mean) / torch.sqrt(batch_var + self.eps)
        else:
            # Use running statistics during inference
            x_normalized = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)

        # Scale and shift
        return self.gamma * x_normalized + self.beta

print('Custom BatchNorm implementation complete!')
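
As a sanity check on the formula above, the training-mode output of `nn.BatchNorm1d` can be reproduced by hand. This is a small sketch with an arbitrary 64×8 input. One detail worth noting is that PyTorch normalizes with the biased batch variance but updates `running_var` with the unbiased estimate, whereas the custom class above uses the biased variance for both.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 8) * 3.0 + 2.0

bn = nn.BatchNorm1d(8, eps=1e-5, momentum=0.1)
bn.train()
y_ref = bn(x).detach()

# Manual normalization; gamma = 1 and beta = 0 at initialization
mean = x.mean(dim=0)
var_biased = x.var(dim=0, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var_biased + 1e-5)

# Expected running statistics after one update (initial values: mean 0, var 1)
expected_running_mean = 0.9 * torch.zeros(8) + 0.1 * mean
expected_running_var = 0.9 * torch.ones(8) + 0.1 * x.var(dim=0, unbiased=True)

outputs_match = torch.allclose(y_ref, y_manual, atol=1e-5)
running_mean_match = torch.allclose(bn.running_mean, expected_running_mean, atol=1e-6)
running_var_match = torch.allclose(bn.running_var, expected_running_var, atol=1e-6)

print(f'Outputs match: {outputs_match}')
print(f'Running mean matches: {running_mean_match}')
print(f'Running var matches (unbiased estimate): {running_var_match}')
```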

---
## Exercise 5: Comparing Normalization Techniques

### 📚 Decision Guide
- **Batch Norm**: Large batch CNNs
- **Layer Norm**: RNNs, Transformers, small batches
- **Instance Norm**: Style transfer, GANs
- **Group Norm**: Object detection, segmentation

In [None]:
def compare_normalizations():
    '''Compare different normalization techniques'''

    # Create sample data
    batch_size, channels, height, width = 32, 16, 28, 28
    x = torch.randn(batch_size, channels, height, width)

    # Initialize normalization layers
    norms = {
        'BatchNorm': nn.BatchNorm2d(channels),
        'LayerNorm': nn.LayerNorm([channels, height, width]),
        'InstanceNorm': nn.InstanceNorm2d(channels),
        'GroupNorm': nn.GroupNorm(num_groups=4, num_channels=channels)
    }

    # Apply normalizations and collect statistics
    results = {}
    for name, norm_layer in norms.items():
        with torch.no_grad():
            output = norm_layer(x)
            results[name] = {
                'mean': output.mean().item(),
                'std': output.std().item(),
                'shape': output.shape
            }

    # Display results
    df = pd.DataFrame(results).T
    print(df)

    return results

results = compare_normalizations()
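
The global mean and standard deviation printed above look nearly identical for all four layers, because the techniques differ mainly in which axes the statistics are computed over. The sketch below makes that explicit by reproducing each layer by hand on a small `(N, C, H, W)` tensor. The helper `normalize` and the choice of 2 groups are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 4, 8, 5, 5
x = torch.randn(N, C, H, W)
eps = 1e-5

def normalize(t, dims):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps)

manual = {
    'BatchNorm': normalize(x, (0, 2, 3)),     # per channel, over batch and space
    'LayerNorm': normalize(x, (1, 2, 3)),     # per sample, over channels and space
    'InstanceNorm': normalize(x, (2, 3)),     # per sample and channel, over space
    # per sample and group of channels, over the group's channels and space
    'GroupNorm': normalize(x.view(N, 2, C // 2, H, W), (2, 3, 4)).view(N, C, H, W),
}
reference = {
    'BatchNorm': nn.BatchNorm2d(C),           # training mode by default: batch statistics
    'LayerNorm': nn.LayerNorm([C, H, W]),
    'InstanceNorm': nn.InstanceNorm2d(C),
    'GroupNorm': nn.GroupNorm(num_groups=2, num_channels=C),
}

with torch.no_grad():
    matches = {name: torch.allclose(reference[name](x), manual[name], atol=1e-5)
               for name in manual}
print(matches)
```

Only BatchNorm reduces over the batch axis (axis 0), which is why it is the one technique in the guide whose behavior depends on batch size.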

---
## Exercise 6: Implementing Dropout Variants

### 📚 Concept
Dropout randomly deactivates neurons during training to prevent overfitting:
- **Standard Dropout**: Drops individual neurons
- **DropConnect**: Drops connections (weights)
- **DropBlock**: Drops contiguous regions (for CNNs)

In [None]:
class DropoutVariants(nn.Module):
    '''Implementation of various dropout techniques'''

    def __init__(self, drop_prob=0.5):
        super(DropoutVariants, self).__init__()
        self.drop_prob = drop_prob

    def standard_dropout(self, x, training=True):
        '''Standard dropout - drops neurons'''
        if not training:
            return x

        mask = torch.bernoulli(torch.ones_like(x) * (1 - self.drop_prob))
        return x * mask / (1 - self.drop_prob)

    def dropconnect(self, weight, training=True):
        '''DropConnect - drops weights'''
        if not training:
            return weight

        mask = torch.bernoulli(torch.ones_like(weight) * (1 - self.drop_prob))
        return weight * mask

# Test dropout
dropout = DropoutVariants(drop_prob=0.3)
x = torch.ones(1, 10)
x_dropped = dropout.standard_dropout(x)
print(f'Original: {x}')
print(f'After dropout: {x_dropped}')
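
The division by `1 - drop_prob` in `standard_dropout` (often called inverted dropout) keeps the expected activation unchanged, so no rescaling is needed at inference time. A quick empirical check is sketched below. The tensor size and variable names are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
drop_prob = 0.3
x = torch.ones(100_000)

# Keep each element with probability 1 - drop_prob
mask = torch.bernoulli(torch.full_like(x, 1 - drop_prob))
unscaled = x * mask                      # mean shrinks to about 1 - drop_prob
inverted = x * mask / (1 - drop_prob)    # mean stays close to the input mean

kept_fraction = mask.mean().item()
unscaled_mean = unscaled.mean().item()
inverted_mean = inverted.mean().item()

print(f'Kept fraction:           {kept_fraction:.3f}')
print(f'Mean without rescaling:  {unscaled_mean:.3f}')
print(f'Mean with rescaling:     {inverted_mean:.3f}')
```

The `dropconnect` method above returns the masked weights without this rescaling, so the expected pre-activation shrinks by a factor of `1 - drop_prob` during training.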

---
## Exercise 7: Data Augmentation - Mixup and CutMix

### 📚 Concept
Advanced data augmentation techniques that create new training samples:
- **Mixup**: Linear interpolation of samples: $x_{mix} = \lambda x_i + (1-\lambda) x_j$
- **CutMix**: Cuts and pastes patches between samples
- **Benefits**: Smoother decision boundaries, better calibration

In [None]:
def mixup_data(x, y, alpha=1.0):
    '''Performs mixup augmentation'''
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size()[0]
    index = torch.randperm(batch_size)

    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]

    return mixed_x, y_a, y_b, lam

# Example usage
x = torch.randn(8, 3, 32, 32)  # Batch of images
y = torch.randint(0, 10, (8,))  # Labels
mixed_x, y_a, y_b, lam = mixup_data(x, y)
print(f'Lambda: {lam:.3f}')
print(f'Mixed shape: {mixed_x.shape}')
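
`mixup_data` returns two label sets and `lam`, so the training loss has to be mixed the same way as the inputs. A common pattern, sketched here with a hypothetical helper named `mixup_criterion` and random stand-in logits, is to weight the two cross-entropy terms by `lam` and `1 - lam`. This is equivalent to cross-entropy against the interpolated one-hot labels.

```python
import torch
import torch.nn.functional as F

def mixup_criterion(logits, y_a, y_b, lam):
    """Cross-entropy against both label sets, weighted like the inputs."""
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)

torch.manual_seed(0)
num_classes = 10
logits = torch.randn(8, num_classes)           # stand-in for model(mixed_x)
y_a = torch.randint(0, num_classes, (8,))
y_b = y_a[torch.randperm(8)]
lam = 0.7

loss = mixup_criterion(logits, y_a, y_b, lam)

# Same loss written with explicit soft labels
soft_targets = (lam * F.one_hot(y_a, num_classes).float()
                + (1 - lam) * F.one_hot(y_b, num_classes).float())
soft_loss = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(f'Mixup loss: {loss.item():.4f}, soft-label loss: {soft_loss.item():.4f}')
```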

---
## Exercise 8: Early Stopping Implementation

### 📚 Concept
Early stopping prevents overfitting by monitoring validation performance:
- **Patience**: Number of epochs to wait before stopping
- **Best model**: Save model with best validation performance
- **Restore best**: Load best model after training

In [None]:
class EarlyStopping:
    '''Early stopping to prevent overfitting'''

    def __init__(self, patience=10, min_delta=0.001, mode='min'):
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.counter = 0
        self.best_score = None
        self.early_stop = False

    def __call__(self, val_score):
        if self.best_score is None:
            self.best_score = val_score
        elif self._is_improvement(val_score):
            self.best_score = val_score
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop

    def _is_improvement(self, score):
        if self.mode == 'min':
            return score < self.best_score - self.min_delta
        else:
            return score > self.best_score + self.min_delta

# Example usage
early_stopper = EarlyStopping(patience=5)
val_losses = [0.5, 0.4, 0.35, 0.36, 0.37, 0.38, 0.39, 0.40, 0.41]

for epoch, loss in enumerate(val_losses):
    if early_stopper(loss):
        print(f'Early stopping at epoch {epoch}')
        break
    print(f'Epoch {epoch}: val_loss = {loss:.3f}')
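
The class above tracks only the best score. The "Best model" and "Restore best" bullets can be sketched by snapshotting `state_dict()` whenever the validation loss improves. In the sketch below the model, the validation losses, and the weight perturbation that stands in for a training step are all hypothetical.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
val_losses = [0.5, 0.4, 0.35, 0.36, 0.37]  # hypothetical validation losses

best_loss = float('inf')
best_epoch = -1
best_state = None

for epoch, val_loss in enumerate(val_losses):
    # Stand-in for one epoch of training: nudge the weights so each epoch differs
    with torch.no_grad():
        model.weight.add_(0.1)

    if val_loss < best_loss:
        best_loss = val_loss
        best_epoch = epoch
        # deepcopy is needed: state_dict() returns references to the live tensors
        best_state = copy.deepcopy(model.state_dict())

# Restore the weights from the best epoch after training ends
model.load_state_dict(best_state)
print(f'Restored weights from epoch {best_epoch} (val_loss = {best_loss:.3f})')
```

Without the `deepcopy`, later training steps would silently overwrite the saved snapshot.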

---
## Exercise 9: Ensemble Methods

### 📚 Concept
Ensemble methods combine multiple models for better performance:
- **Averaging**: Mean of predictions
- **Voting**: Majority vote for classification
- **Stacking**: Meta-model learns from base models

In [None]:
class EnsembleModel:
    '''Simple ensemble implementation'''

    def __init__(self, models):
        self.models = models

    def predict_average(self, x):
        '''Average predictions from all models'''
        predictions = []
        for model in self.models:
            model.eval()
            with torch.no_grad():
                pred = model(x)
                predictions.append(pred)
        return torch.stack(predictions).mean(dim=0)

    def predict_voting(self, x):
        '''Majority voting for classification'''
        predictions = []
        for model in self.models:
            model.eval()
            with torch.no_grad():
                pred = torch.argmax(model(x), dim=1)
                predictions.append(pred)

        # Get mode (most common prediction)
        stacked = torch.stack(predictions)
        mode_values, _ = torch.mode(stacked, dim=0)
        return mode_values

print('Ensemble methods implemented!')
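
`predict_average` above averages raw model outputs. For classifiers, a common alternative (sometimes called soft voting) is to average softmax probabilities, so that every model contributes on the same scale. The sketch below uses five untrained `nn.Linear` models purely as stand-ins for trained base models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
models = [nn.Linear(6, 3) for _ in range(5)]   # stand-ins for trained classifiers
x = torch.randn(4, 6)

with torch.no_grad():
    # Shape: (num_models, batch, num_classes)
    probs = torch.stack([F.softmax(m(x), dim=1) for m in models])

avg_probs = probs.mean(dim=0)                  # soft voting: average the probabilities
soft_vote = avg_probs.argmax(dim=1)
hard_vote, _ = torch.mode(probs.argmax(dim=2), dim=0)  # majority vote over predicted classes

print(f'Soft vote: {soft_vote.tolist()}')
print(f'Hard vote: {hard_vote.tolist()}')
```

The two votes can disagree when a few confident models outweigh a larger number of uncertain ones.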

---
## Exercise 10: Comprehensive Technique Comparison

### 📚 Final Summary
Let's compare all the techniques we've learned:

In [None]:
def create_comparison_table():
    '''Create a comprehensive comparison of all techniques'''

    data = {
        'Technique': ['Xavier Init', 'He Init', 'BatchNorm', 'LayerNorm',
                     'Dropout', 'Mixup', 'Early Stopping', 'Ensemble'],
        'Type': ['Initialization', 'Initialization', 'Normalization', 'Normalization',
                'Regularization', 'Augmentation', 'Regularization', 'Ensemble'],
        'Best For': ['Tanh/Sigmoid', 'ReLU', 'Large Batch CNN', 'RNN/Transformer',
                    'Fully Connected', 'Classification', 'All Models', 'Final Performance'],
        'Key Benefit': ['Stable gradients', 'ReLU optimization', 'Reduces shift',
                       'Batch independent', 'Prevents overfit', 'Smooth boundaries',
                       'Stops overfit', 'Reduces variance']
    }

    df = pd.DataFrame(data)

    # Styled version for rich notebook display (df.style requires jinja2).
    # It is built for illustration only; the plain DataFrame is returned for printing.
    styled_df = df.style.set_properties(**{
        'background-color': 'lightblue',
        'color': 'black',
        'border-color': 'white'
    })

    return df

comparison_df = create_comparison_table()
print(comparison_df.to_string())

## Summary and Key Takeaways

### 🎯 What We've Learned

1. **Weight Initialization**:
   - Xavier/Glorot for tanh/sigmoid
   - He initialization for ReLU
   - LSUV for very deep networks

2. **Normalization Techniques**:
   - BatchNorm for CNNs with large batches
   - LayerNorm for RNNs and Transformers
   - GroupNorm for small batch training

3. **Regularization Methods**:
   - Dropout and variants
   - Data augmentation (Mixup, CutMix)
   - Early stopping

### 📝 Best Practices

- Always initialize weights properly based on activation function
- Choose normalization based on architecture and batch size
- Combine multiple regularization techniques for best results
- Monitor validation metrics to prevent overfitting

### 🚀 Next Steps

- Experiment with different combinations on your own datasets
- Try implementing custom initialization methods
- Explore advanced normalization techniques (e.g., AdaNorm, CrossNorm)
- Investigate the impact on different architectures (ResNet, Transformer, etc.)
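
As a starting point for a custom initialization, here is a simplified sketch in the spirit of LSUV (layer-sequential unit-variance), which the summary lists but the exercises do not implement. It assumes a plain `nn.Sequential` of `nn.Linear` and activation layers and one batch of representative data. The function name `lsuv_init` and its parameters `tol` and `max_iters` are hypothetical, and this is not a reference implementation of the published method.

```python
import torch
import torch.nn as nn

def lsuv_init(model, x, tol=0.05, max_iters=10):
    """Orthogonal init, then rescale each Linear layer toward unit output variance."""
    with torch.no_grad():
        h = x
        for layer in model:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight)
                nn.init.zeros_(layer.bias)
                for _ in range(max_iters):
                    var = layer(h).var().item()
                    if abs(var - 1.0) < tol:
                        break
                    layer.weight.div_(var ** 0.5)  # scaling by 1/std gives unit variance
            h = layer(h)  # feed the (rescaled) output to the next layer
    return model

torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
batch = torch.randn(256, 64)
lsuv_init(net, batch)

# Measure the output variance of every Linear layer on the same batch
variances = []
with torch.no_grad():
    h = batch
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.Linear):
            variances.append(h.var().item())
print([f'{v:.3f}' for v in variances])
```

Unlike Xavier and He initialization, this approach uses actual data to set the scale, so it does not rely on a closed-form variance formula for each activation function.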