# Lab 4.4: Weight Initialization Strategies for Deep Networks

**Duration**: 45 minutes

## Learning Objectives
By the end of this lab, you will be able to:
- Understand the critical importance of proper weight initialization
- Implement various initialization methods (Zero, Random, Xavier, He, etc.)
- Analyze the impact of initialization on gradient flow and convergence
- Choose appropriate initialization strategies for different activation functions
- Design custom initialization schemes for specific architectures

## Prerequisites
- Completed Labs 4.1, 4.2, and 4.3
- Understanding of activation functions and backpropagation
- Familiarity with gradient flow problems

## Lab Overview
Weight initialization is one of the most critical factors in successfully training deep neural networks. Poor initialization can lead to vanishing/exploding gradients, slow convergence, or complete training failure. This lab explores various initialization strategies and their mathematical foundations.

## Part 1: Environment Setup and Mathematical Foundations

### Instructions:
1. Run this cell to import all necessary libraries
2. Review the mathematical foundation for initialization strategies
3. Verify all imports are successful
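
The variance-preservation idea behind these strategies can be checked numerically before running the lab code. The sketch below is illustrative only: it assumes unit-variance Gaussian inputs and a single linear layer `z = W @ x`, and the helper name `preactivation_variance_ratio` is hypothetical, not part of the lab. For zero-mean independent weights and inputs, `Var(z)` is approximately `n_in * Var(W) * Var(x)`, so the ratio should depend only on `n_in * Var(W)`.

```python
import numpy as np

# Local generator so the lab's global np.random seed is left untouched
rng = np.random.default_rng(0)
n_in, n_out, n_samples = 512, 256, 2000

# Unit-variance inputs, shape (features, samples) as in the rest of the lab
x = rng.standard_normal((n_in, n_samples))

def preactivation_variance_ratio(weight_std):
    """Return Var(z) / Var(x) for z = W @ x with W ~ N(0, weight_std**2)."""
    W = rng.standard_normal((n_out, n_in)) * weight_std
    z = W @ x
    return z.var() / x.var()

ratio_lecun = preactivation_variance_ratio(np.sqrt(1.0 / n_in))  # Var(W) = 1/n_in
ratio_he = preactivation_variance_ratio(np.sqrt(2.0 / n_in))     # Var(W) = 2/n_in
ratio_small = preactivation_variance_ratio(0.01)                 # fixed small scale

print(f"Var(W) = 1/n_in : Var(z)/Var(x) = {ratio_lecun:.3f}")
print(f"Var(W) = 2/n_in : Var(z)/Var(x) = {ratio_he:.3f}")
print(f"std(W) = 0.01   : Var(z)/Var(x) = {ratio_small:.3f}")
```

Under these assumptions the first ratio should sit near 1, the second near 2 (the extra factor that compensates for ReLU discarding roughly half of the signal), and the third near `n_in * 0.01**2`, which shrinks the signal at every layer.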

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_circles, make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib for better visualization
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("Environment setup complete!")
print(f"NumPy version: {np.__version__}")
print("Ready to explore weight initialization strategies!")

# Mathematical foundations for initialization
print("\n" + "="*60)
print("MATHEMATICAL FOUNDATIONS")
print("="*60)
print("""
Key Principles for Weight Initialization:

1. VARIANCE PRESERVATION:
   - Forward pass: Var(output) ≈ Var(input)
   - Backward pass: Var(gradient) should remain stable

2. XAVIER/GLOROT INITIALIZATION:
   - For sigmoid/tanh: Var(W) = 2/(n_in+n_out)
   - Uniform: W ~ U(-√(6/(n_in+n_out)), √(6/(n_in+n_out)))
   - Normal: W ~ N(0, σ²) with σ = √(2/(n_in+n_out))

3. HE INITIALIZATION:
   - For ReLU: Var(W) = 2/n_in
   - Normal: W ~ N(0, σ²) with σ = √(2/n_in)
   - Uniform: W ~ U(-√(6/n_in), √(6/n_in))

4. LECUN INITIALIZATION:
   - For SELU: Var(W) = 1/n_in
   - Normal: W ~ N(0, σ²) with σ = √(1/n_in)
""")

## Part 2: Implementing Initialization Strategies

### Instructions:
1. Implement various initialization methods
2. Understand the mathematical reasoning behind each method
3. Observe the statistical properties of different initializations

In [None]:
class WeightInitializer:
    """Comprehensive weight initialization class"""
    
    @staticmethod
    def zeros(shape):
        """Zero initialization (not recommended for weights)"""
        return np.zeros(shape)
    
    @staticmethod
    def ones(shape):
        """Ones initialization (not recommended for weights)"""
        return np.ones(shape)
    
    @staticmethod
    def random_small(shape, scale=0.01):
        """Small random values (traditional approach)"""
        return np.random.randn(*shape) * scale
    
    @staticmethod
    def random_large(shape, scale=1.0):
        """Large random values (problematic)"""
        return np.random.randn(*shape) * scale
    
    @staticmethod
    def xavier_normal(shape):
        """Xavier/Glorot normal initialization"""
        n_in, n_out = shape[1], shape[0]
        std = np.sqrt(2.0 / (n_in + n_out))
        return np.random.randn(*shape) * std
    
    @staticmethod
    def xavier_uniform(shape):
        """Xavier/Glorot uniform initialization"""
        n_in, n_out = shape[1], shape[0]
        limit = np.sqrt(6.0 / (n_in + n_out))
        return np.random.uniform(-limit, limit, shape)
    
    @staticmethod
    def he_normal(shape):
        """He normal initialization (for ReLU)"""
        n_in = shape[1]
        std = np.sqrt(2.0 / n_in)
        return np.random.randn(*shape) * std
    
    @staticmethod
    def he_uniform(shape):
        """He uniform initialization (for ReLU)"""
        n_in = shape[1]
        limit = np.sqrt(6.0 / n_in)
        return np.random.uniform(-limit, limit, shape)
    
    @staticmethod
    def lecun_normal(shape):
        """LeCun normal initialization (for SELU)"""
        n_in = shape[1]
        std = np.sqrt(1.0 / n_in)
        return np.random.randn(*shape) * std
    
    @staticmethod
    def orthogonal(shape):
        """Orthogonal initialization (good for RNNs)"""
        if len(shape) != 2:
            raise ValueError("Orthogonal initialization only supports 2D arrays")
        
        # Generate random matrix
        a = np.random.randn(*shape)
        
        # Reduced SVD: a = u @ diag(s) @ vt, where u has orthonormal columns
        # and vt has orthonormal rows
        u, _, vt = np.linalg.svd(a, full_matrices=False)
        
        # u matches the requested shape for square or tall matrices, vt for wide ones
        return u if u.shape == shape else vt
    
    @staticmethod
    def variance_scaling(shape, scale=1.0, mode='fan_in', distribution='normal'):
        """General variance scaling initialization"""
        n_in, n_out = shape[1], shape[0]
        
        if mode == 'fan_in':
            n = n_in
        elif mode == 'fan_out':
            n = n_out
        elif mode == 'fan_avg':
            n = (n_in + n_out) / 2.0
        else:
            raise ValueError(f"Invalid mode: {mode}")
        
        if distribution == 'normal':
            std = np.sqrt(scale / n)
            return np.random.randn(*shape) * std
        elif distribution == 'uniform':
            limit = np.sqrt(3.0 * scale / n)
            return np.random.uniform(-limit, limit, shape)
        else:
            raise ValueError(f"Invalid distribution: {distribution}")

# Test initialization methods
test_shape = (64, 128)  # (n_out, n_in)
print(f"Testing initialization methods for shape {test_shape}:")
print("=" * 60)

initializers = {
    'Random Small (0.01)': lambda: WeightInitializer.random_small(test_shape, 0.01),
    'Random Large (1.0)': lambda: WeightInitializer.random_large(test_shape, 1.0),
    'Xavier Normal': lambda: WeightInitializer.xavier_normal(test_shape),
    'Xavier Uniform': lambda: WeightInitializer.xavier_uniform(test_shape),
    'He Normal': lambda: WeightInitializer.he_normal(test_shape),
    'He Uniform': lambda: WeightInitializer.he_uniform(test_shape),
    'LeCun Normal': lambda: WeightInitializer.lecun_normal(test_shape),
    'Orthogonal': lambda: WeightInitializer.orthogonal(test_shape)
}

# Analyze statistical properties
for name, init_func in initializers.items():
    weights = init_func()
    print(f"{name:<20}: Mean={np.mean(weights):.6f}, Std={np.std(weights):.6f}, "
          f"Min={np.min(weights):.6f}, Max={np.max(weights):.6f}")

print("\n✅ All initialization methods implemented successfully!")
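
The `variance_scaling` method generalizes the named normal initializers: each one corresponds to a particular `scale` and `mode`. The sketch below makes that mapping explicit for the standard deviation only. The helper `scaling_std` and the `presets` dictionary are hypothetical names introduced for illustration; they mirror the formulas used above rather than any library API.

```python
import numpy as np

def scaling_std(n_in, n_out, scale=1.0, mode="fan_in"):
    """Standard deviation of a variance-scaling normal initializer."""
    fan = {
        "fan_in": n_in,
        "fan_out": n_out,
        "fan_avg": (n_in + n_out) / 2.0,
    }[mode]
    return np.sqrt(scale / fan)

n_out, n_in = 64, 128  # same (n_out, n_in) convention as test_shape

# (scale, mode) pairs that reproduce the named initializers
presets = {
    "xavier_normal": dict(scale=1.0, mode="fan_avg"),
    "he_normal": dict(scale=2.0, mode="fan_in"),
    "lecun_normal": dict(scale=1.0, mode="fan_in"),
}

# Closed-form standard deviations from the formulas in Part 1
closed_forms = {
    "xavier_normal": np.sqrt(2.0 / (n_in + n_out)),
    "he_normal": np.sqrt(2.0 / n_in),
    "lecun_normal": np.sqrt(1.0 / n_in),
}

for name, kwargs in presets.items():
    std = scaling_std(n_in, n_out, **kwargs)
    print(f"{name:<14} scaling std = {std:.6f}, closed form = {closed_forms[name]:.6f}")
```

Seen this way, choosing an initializer reduces to choosing which fan to normalize by and what gain to apply for the activation function.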

## Part 3: Visualizing Weight Distributions

### Instructions:
1. Create comprehensive visualizations of weight distributions
2. Compare how different initializations affect the distribution shape
3. Understand the relationship between initialization and activation statistics

In [None]:
def visualize_weight_distributions(shape=(100, 200), num_samples=5000):
    """Visualize weight distributions for different initialization methods"""
    
    # Select key initializations to compare
    init_methods = {
        'Random Small (0.01)': lambda: WeightInitializer.random_small(shape, 0.01),
        'Random Large (1.0)': lambda: WeightInitializer.random_large(shape, 1.0),
        'Xavier Normal': lambda: WeightInitializer.xavier_normal(shape),
        'He Normal': lambda: WeightInitializer.he_normal(shape),
        'LeCun Normal': lambda: WeightInitializer.lecun_normal(shape),
        'Orthogonal': lambda: WeightInitializer.orthogonal(shape)
    }
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()
    
    colors = ['red', 'orange', 'green', 'blue', 'purple', 'brown']
    
    for idx, (name, init_func) in enumerate(init_methods.items()):
        # Generate weights
        weights = init_func().flatten()
        
        # Create histogram
        axes[idx].hist(weights, bins=50, alpha=0.7, color=colors[idx], 
                       density=True, edgecolor='black', linewidth=0.5)
        
        # Add statistics
        mean = np.mean(weights)
        std = np.std(weights)
        axes[idx].axvline(mean, color='red', linestyle='--', alpha=0.8, linewidth=2, label=f'Mean: {mean:.3f}')
        axes[idx].axvline(mean + std, color='orange', linestyle=':', alpha=0.8, label=f'±1 Std: {std:.3f}')
        axes[idx].axvline(mean - std, color='orange', linestyle=':', alpha=0.8)
        
        axes[idx].set_title(f'{name}\n(Shape: {shape})', fontsize=12, fontweight='bold')
        axes[idx].set_xlabel('Weight Value')
        axes[idx].set_ylabel('Density')
        axes[idx].legend(fontsize=9)
        axes[idx].grid(True, alpha=0.3)
    
    plt.suptitle('Weight Distribution Comparison', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize distributions
visualize_weight_distributions()

# Create theoretical comparison
def theoretical_vs_empirical_comparison():
    """Compare theoretical and empirical standard deviations"""
    shape = (64, 128)  # (n_out, n_in)
    n_in, n_out = shape[1], shape[0]
    
    print("\nTheoretical vs Empirical Standard Deviations:")
    print("=" * 70)
    print(f"Network layer shape: {shape} (n_out={n_out}, n_in={n_in})")
    print()
    
    # Xavier Normal
    xavier_theoretical = np.sqrt(2.0 / (n_in + n_out))
    xavier_empirical = np.std(WeightInitializer.xavier_normal(shape))
    print(f"Xavier Normal:  Theoretical={xavier_theoretical:.6f}, Empirical={xavier_empirical:.6f}")
    
    # He Normal
    he_theoretical = np.sqrt(2.0 / n_in)
    he_empirical = np.std(WeightInitializer.he_normal(shape))
    print(f"He Normal:      Theoretical={he_theoretical:.6f}, Empirical={he_empirical:.6f}")
    
    # LeCun Normal
    lecun_theoretical = np.sqrt(1.0 / n_in)
    lecun_empirical = np.std(WeightInitializer.lecun_normal(shape))
    print(f"LeCun Normal:   Theoretical={lecun_theoretical:.6f}, Empirical={lecun_empirical:.6f}")
    
    print("\n✅ Empirical values closely match theoretical predictions!")

theoretical_vs_empirical_comparison()
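
The comparison above covers only the normal variants. For the uniform variants, the limit follows from the variance of a uniform distribution: `Var(U(-a, a)) = a**2 / 3`, so matching a target standard deviation `s` requires `a = sqrt(3) * s`. The short check below is a sketch using He uniform as the example; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator, independent of the lab's seed
n_in = 128

target_std = np.sqrt(2.0 / n_in)       # He target: Var(W) = 2/n_in
limit = np.sqrt(3.0) * target_std      # algebraically equal to sqrt(6/n_in)

samples = rng.uniform(-limit, limit, size=200_000)
empirical_std = samples.std()

print(f"limit           = {limit:.6f}")
print(f"sqrt(6 / n_in)  = {np.sqrt(6.0 / n_in):.6f}")
print(f"target std      = {target_std:.6f}")
print(f"empirical std   = {empirical_std:.6f}")
```

The same relation explains the `sqrt(6 / (n_in + n_out))` limit for Xavier uniform and the `sqrt(3.0 * scale / n)` limit in `variance_scaling`.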

## Part 4: Impact on Activation Statistics

### Instructions:
1. Create a deep network to analyze activation statistics
2. Compare how different initializations affect activation distributions
3. Understand the connection between initialization and gradient flow

In [None]:
class ActivationAnalyzer:
    """Analyze activation statistics for different initializations"""
    
    def __init__(self, layer_dims, activation='relu'):
        self.layer_dims = layer_dims
        self.num_layers = len(layer_dims) - 1
        self.activation = activation
        self.activations = {}
        self.activation_stats = {}
    
    def initialize_network(self, init_method):
        """Initialize network with specified method"""
        self.parameters = {}
        
        for l in range(1, self.num_layers + 1):
            shape = (self.layer_dims[l], self.layer_dims[l-1])
            
            if init_method == 'random_small':
                self.parameters[f'W{l}'] = WeightInitializer.random_small(shape, 0.01)
            elif init_method == 'random_large':
                self.parameters[f'W{l}'] = WeightInitializer.random_large(shape, 1.0)
            elif init_method == 'xavier_normal':
                self.parameters[f'W{l}'] = WeightInitializer.xavier_normal(shape)
            elif init_method == 'he_normal':
                self.parameters[f'W{l}'] = WeightInitializer.he_normal(shape)
            elif init_method == 'lecun_normal':
                self.parameters[f'W{l}'] = WeightInitializer.lecun_normal(shape)
            elif init_method == 'orthogonal':
                self.parameters[f'W{l}'] = WeightInitializer.orthogonal(shape)
            else:
                raise ValueError(f"Unknown initialization: {init_method}")
            
            self.parameters[f'b{l}'] = np.zeros((self.layer_dims[l], 1))
    
    def activate(self, Z):
        """Apply activation function"""
        if self.activation == 'relu':
            return np.maximum(0, Z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
        elif self.activation == 'tanh':
            return np.tanh(Z)
        elif self.activation == 'linear':
            return Z
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def forward_pass(self, X):
        """Forward pass with activation tracking"""
        self.activations = {'A0': X}
        A = X
        
        for l in range(1, self.num_layers + 1):
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']
            
            Z = np.dot(W, A) + b
            
            # Apply the configured activation to every layer, including the
            # output layer, so statistics are comparable across depth
            A = self.activate(Z)
            
            # Store activations and statistics
            self.activations[f'Z{l}'] = Z
            self.activations[f'A{l}'] = A
            
            # Calculate statistics
            self.activation_stats[f'layer_{l}'] = {
                'pre_activation': {
                    'mean': np.mean(Z),
                    'std': np.std(Z),
                    'min': np.min(Z),
                    'max': np.max(Z)
                },
                'post_activation': {
                    'mean': np.mean(A),
                    'std': np.std(A),
                    'min': np.min(A),
                    'max': np.max(A),
                    # A unit counts as dead only if it outputs zero for every sample
                    'fraction_dead': float(np.mean(np.all(A == 0, axis=1))) if self.activation == 'relu' else 0.0
                }
            }
        
        return A

# Generate test data
X_test, _ = make_classification(n_samples=1000, n_features=50, n_informative=30,
                                n_redundant=10, random_state=42)
X_test = X_test.T  # Shape: (features, samples)

# Network architecture
architecture = [50, 100, 80, 60, 40, 20, 1]

# Test different initializations
init_methods = ['random_small', 'random_large', 'xavier_normal', 'he_normal']
activation_types = ['relu', 'sigmoid', 'tanh']

print("Analyzing activation statistics for different initialization methods...\n")

def analyze_activation_flow(architecture, X_test, init_method, activation):
    """Analyze activation flow for specific configuration"""
    analyzer = ActivationAnalyzer(architecture, activation)
    analyzer.initialize_network(init_method)
    analyzer.forward_pass(X_test)
    
    # Extract statistics
    layers = []
    pre_means = []
    pre_stds = []
    post_means = []
    post_stds = []
    dead_fractions = []
    
    for l in range(1, analyzer.num_layers + 1):
        stats = analyzer.activation_stats[f'layer_{l}']
        layers.append(l)
        pre_means.append(stats['pre_activation']['mean'])
        pre_stds.append(stats['pre_activation']['std'])
        post_means.append(stats['post_activation']['mean'])
        post_stds.append(stats['post_activation']['std'])
        dead_fractions.append(stats['post_activation']['fraction_dead'])
    
    return {
        'layers': layers,
        'pre_means': pre_means,
        'pre_stds': pre_stds,
        'post_means': post_means,
        'post_stds': post_stds,
        'dead_fractions': dead_fractions
    }

# Create comprehensive comparison for ReLU networks
print("ReLU Networks - Initialization Impact:")
print("=" * 60)

relu_results = {}
for init_method in init_methods:
    relu_results[init_method] = analyze_activation_flow(architecture, X_test, init_method, 'relu')
    
    # Print summary statistics
    result = relu_results[init_method]
    avg_pre_std = np.mean(result['pre_stds'])
    avg_post_std = np.mean(result['post_stds'])
    avg_dead_fraction = np.mean(result['dead_fractions'])
    
    print(f"{init_method:<15}: Avg Pre-Act Std={avg_pre_std:.4f}, "
          f"Avg Post-Act Std={avg_post_std:.4f}, Avg Dead Neurons={avg_dead_fraction:.2%}")

print("\n✅ Activation analysis complete!")
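
One way to see where He initialization's factor of 2 comes from is to look at what ReLU does to a zero-mean, symmetric pre-activation. The sketch below assumes standard normal pre-activations, which is an idealization of a freshly initialized layer. It measures the second moment that survives ReLU and the share of outputs that are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(2)  # local generator, independent of the lab's seed

z = rng.standard_normal(1_000_000)  # idealized pre-activations with variance 1
a = np.maximum(0.0, z)              # ReLU

second_moment = np.mean(a ** 2)     # E[relu(z)^2], expected near Var(z) / 2
zero_fraction = np.mean(a == 0)     # share of individual outputs that are zero

print(f"E[relu(z)^2]  = {second_moment:.3f}")
print(f"zero fraction = {zero_fraction:.3f}")
```

Under this assumption ReLU passes on about half of the second moment, which is what `Var(W) = 2/n_in` compensates for. It also means that roughly half of the individual activations are zero under any zero-mean initialization, so a per-element zero rate near 50% is expected and does not by itself indicate dead units.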

## Part 5: Visualization of Activation Statistics

### Instructions:
1. Create detailed visualizations of activation statistics
2. Compare initialization methods across different metrics
3. Identify optimal initialization for different scenarios

In [None]:
def visualize_activation_analysis(results, activation_type='ReLU'):
    """Create comprehensive visualization of activation statistics"""
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    colors = {'random_small': 'red', 'random_large': 'orange', 
              'xavier_normal': 'green', 'he_normal': 'blue'}
    
    # Plot 1: Pre-activation standard deviation
    for method, result in results.items():
        axes[0, 0].plot(result['layers'], result['pre_stds'], 
                       marker='o', label=method, color=colors[method], linewidth=2)
    
    axes[0, 0].set_title(f'Pre-Activation Standard Deviation ({activation_type})')
    axes[0, 0].set_xlabel('Layer Number')
    axes[0, 0].set_ylabel('Standard Deviation')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].set_yscale('log')
    
    # Plot 2: Post-activation standard deviation
    for method, result in results.items():
        axes[0, 1].plot(result['layers'], result['post_stds'], 
                       marker='s', label=method, color=colors[method], linewidth=2)
    
    axes[0, 1].set_title(f'Post-Activation Standard Deviation ({activation_type})')
    axes[0, 1].set_xlabel('Layer Number')
    axes[0, 1].set_ylabel('Standard Deviation')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].set_yscale('log')
    
    # Plot 3: Dead neurons (for ReLU)
    if activation_type.lower() == 'relu':
        for method, result in results.items():
            axes[1, 0].plot(result['layers'], [f*100 for f in result['dead_fractions']], 
                           marker='^', label=method, color=colors[method], linewidth=2)
        
        axes[1, 0].set_title('Dead Neurons Percentage (ReLU)')
        axes[1, 0].set_xlabel('Layer Number')
        axes[1, 0].set_ylabel('Dead Neurons (%)')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
    else:
        # For non-ReLU activations, show mean activations
        for method, result in results.items():
            axes[1, 0].plot(result['layers'], result['post_means'], 
                           marker='^', label=method, color=colors[method], linewidth=2)
        
        axes[1, 0].set_title(f'Mean Activations ({activation_type})')
        axes[1, 0].set_xlabel('Layer Number')
        axes[1, 0].set_ylabel('Mean Activation')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
    
    # Plot 4: Variance preservation score
    for method, result in results.items():
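        # Ratio of each layer's pre-activation std to the std of its input
        # (closer to 1 means the linear step preserves the signal scale)
        variance_scores = []
        for i in range(len(result['pre_stds'])):
            if i == 0:
                # The raw input std is not stored in the results; 1.0 assumes
                # standardized inputs, so the layer-1 score is approximate
                input_std = 1.0
            else:
                input_std = result['post_stds'][i-1]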
            
            if input_std > 0:
                score = result['pre_stds'][i] / input_std
            else:
                score = 0
            variance_scores.append(score)
        
        axes[1, 1].plot(result['layers'], variance_scores, 
                       marker='d', label=method, color=colors[method], linewidth=2)
    
    axes[1, 1].axhline(y=1.0, color='black', linestyle='--', alpha=0.7, label='Ideal')
    axes[1, 1].set_title('Variance Preservation Score')
    axes[1, 1].set_xlabel('Layer Number')
    axes[1, 1].set_ylabel('Std Ratio (pre-activation / layer input)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    axes[1, 1].set_yscale('log')
    
    plt.suptitle(f'Activation Statistics Analysis - {activation_type} Networks', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize ReLU results
visualize_activation_analysis(relu_results, 'ReLU')

# Compare different activation functions under He initialization
print("\nComparing activation functions with He initialization:")
print("=" * 60)

activation_comparison = {}
for activation in ['relu', 'sigmoid', 'tanh']:
    result = analyze_activation_flow(architecture, X_test, 'he_normal', activation)
    activation_comparison[activation] = result
    
    # Print summary
    avg_pre_std = np.mean(result['pre_stds'])
    avg_post_std = np.mean(result['post_stds'])
    print(f"{activation:<8}: Avg Pre-Act Std={avg_pre_std:.4f}, Avg Post-Act Std={avg_post_std:.4f}")

# Visualize activation function comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

act_colors = {'relu': 'blue', 'sigmoid': 'red', 'tanh': 'green'}

for idx, (metric, title) in enumerate([
    ('pre_stds', 'Pre-Activation Std'),
    ('post_stds', 'Post-Activation Std'),
    ('post_means', 'Post-Activation Mean')
]):
    for activation, result in activation_comparison.items():
        axes[idx].plot(result['layers'], result[metric], 
                      marker='o', label=activation, color=act_colors[activation], linewidth=2)
    
    axes[idx].set_title(f'{title} (He Initialization)')
    axes[idx].set_xlabel('Layer Number')
    axes[idx].set_ylabel(title.split()[1])
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    if 'Std' in title:
        axes[idx].set_yscale('log')

plt.suptitle('Activation Function Comparison with He Initialization', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ Comprehensive activation analysis complete!")
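
Saturation is one reason the activation functions respond so differently to the weight scale: a saturated tanh unit has a derivative close to zero, so it passes almost no gradient. The sketch below estimates the fraction of saturated outputs in a single tanh layer under two weight scales. It assumes unit-variance Gaussian inputs; the 0.99 threshold and the helper name `saturated_fraction` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)  # local generator, independent of the lab's seed
n_in = n_out = 256
x = rng.standard_normal((n_in, 1000))  # (features, samples)

def saturated_fraction(weight_std, threshold=0.99):
    """Fraction of tanh outputs with magnitude above the threshold."""
    W = rng.standard_normal((n_out, n_in)) * weight_std
    a = np.tanh(W @ x)
    return np.mean(np.abs(a) > threshold)

sat_large = saturated_fraction(1.0)                              # std = 1.0
sat_xavier = saturated_fraction(np.sqrt(2.0 / (n_in + n_out)))   # Xavier normal

print(f"std = 1.0     : {sat_large:.1%} of outputs saturated")
print(f"Xavier normal : {sat_xavier:.1%} of outputs saturated")
```

With unit-scale weights the pre-activations have a standard deviation of about `sqrt(n_in)`, so most outputs land on the flat parts of tanh. With Xavier scaling the pre-activations stay near unit variance and only a small share saturates.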

## Part 6: Training Performance Comparison

### Instructions:
1. Train networks with different initialization methods
2. Compare convergence speed and final accuracy
3. Understand the practical impact of initialization choices

In [None]:
class InitializationTrainer:
    """Neural network trainer for initialization comparison"""
    
    def __init__(self, layer_dims, activation='relu'):
        self.layer_dims = layer_dims
        self.num_layers = len(layer_dims) - 1
        self.activation = activation
        self.training_history = {}
    
    def initialize_parameters(self, method):
        """Initialize network parameters"""
        self.parameters = {}
        
        for l in range(1, self.num_layers + 1):
            shape = (self.layer_dims[l], self.layer_dims[l-1])
            
            if method == 'random_small':
                self.parameters[f'W{l}'] = WeightInitializer.random_small(shape, 0.01)
            elif method == 'xavier_normal':
                self.parameters[f'W{l}'] = WeightInitializer.xavier_normal(shape)
            elif method == 'he_normal':
                self.parameters[f'W{l}'] = WeightInitializer.he_normal(shape)
            elif method == 'lecun_normal':
                self.parameters[f'W{l}'] = WeightInitializer.lecun_normal(shape)
            else:
                raise ValueError(f"Unknown method: {method}")
            
            self.parameters[f'b{l}'] = np.zeros((self.layer_dims[l], 1))
    
    def activate(self, Z, activation_type=None):
        """Apply activation function"""
        if activation_type is None:
            activation_type = self.activation
            
        if activation_type == 'relu':
            return np.maximum(0, Z)
        elif activation_type == 'sigmoid':
            return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
        elif activation_type == 'tanh':
            return np.tanh(Z)
        else:
            raise ValueError(f"Unknown activation: {activation_type}")
    
    def activate_derivative(self, Z, activation_type=None):
        """Compute activation derivative"""
        if activation_type is None:
            activation_type = self.activation
            
        if activation_type == 'relu':
            return (Z > 0).astype(float)
        elif activation_type == 'sigmoid':
            A = self.activate(Z, 'sigmoid')
            return A * (1 - A)
        elif activation_type == 'tanh':
            A = self.activate(Z, 'tanh')
            return 1 - A**2
        else:
            raise ValueError(f"Unknown activation: {activation_type}")
    
    def forward_propagation(self, X):
        """Forward propagation"""
        self.cache = {'A0': X}
        A = X
        
        for l in range(1, self.num_layers + 1):
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']
            
            Z = np.dot(W, A) + b
            
            # Use sigmoid for output layer, specified activation for hidden layers
            if l == self.num_layers:
                A = self.activate(Z, 'sigmoid')
            else:
                A = self.activate(Z)
            
            self.cache[f'Z{l}'] = Z
            self.cache[f'A{l}'] = A
        
        return A
    
    def backward_propagation(self, X, Y):
        """Backward propagation"""
        m = X.shape[1]
        gradients = {}
        
        # Output layer
        AL = self.cache[f'A{self.num_layers}']
        dAL = -(Y / (AL + 1e-8) - (1 - Y) / (1 - AL + 1e-8))
        
        # Backward through layers
        dA = dAL
        for l in reversed(range(1, self.num_layers + 1)):
            A_prev = self.cache[f'A{l-1}']
            Z = self.cache[f'Z{l}']
            W = self.parameters[f'W{l}']
            
            # Compute gradients
            if l == self.num_layers:
                dZ = dA * self.activate_derivative(Z, 'sigmoid')
            else:
                dZ = dA * self.activate_derivative(Z)
            
            dW = (1/m) * np.dot(dZ, A_prev.T)
            db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            
            if l > 1:
                dA = np.dot(W.T, dZ)
            
            gradients[f'dW{l}'] = dW
            gradients[f'db{l}'] = db
        
        return gradients
    
    def compute_cost(self, AL, Y):
        """Compute binary cross-entropy cost"""
        m = Y.shape[1]
        cost = -(1/m) * np.sum(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8))
        return np.squeeze(cost)
    
    def train(self, X_train, y_train, X_test, y_test, init_method, 
              epochs=200, learning_rate=0.01, verbose=False):
        """Train the network"""
        # Initialize parameters
        self.initialize_parameters(init_method)
        
        # Training history
        history = {
            'train_costs': [],
            'test_costs': [],
            'train_accuracies': [],
            'test_accuracies': [],
            'gradient_norms': []
        }
        
        for epoch in range(epochs):
            # Forward propagation
            AL_train = self.forward_propagation(X_train)
            train_cost = self.compute_cost(AL_train, y_train)
            
            # Backward propagation
            gradients = self.backward_propagation(X_train, y_train)
            
            # Calculate gradient norm
            grad_norm = sum([np.sum(gradients[key]**2) for key in gradients])**0.5
            
            # Update parameters
            for l in range(1, self.num_layers + 1):
                self.parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
                self.parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
            
            # Evaluate
            AL_test = self.forward_propagation(X_test)
            test_cost = self.compute_cost(AL_test, y_test)
            
            # Calculate accuracies
            train_pred = (AL_train > 0.5).astype(float)
            test_pred = (AL_test > 0.5).astype(float)
            train_acc = np.mean(train_pred == y_train) * 100
            test_acc = np.mean(test_pred == y_test) * 100
            
            # Store history
            history['train_costs'].append(train_cost)
            history['test_costs'].append(test_cost)
            history['train_accuracies'].append(train_acc)
            history['test_accuracies'].append(test_acc)
            history['gradient_norms'].append(grad_norm)
            
            # Print progress
            if verbose and epoch % 50 == 0:
                print(f"Epoch {epoch:3d}: Train Cost={train_cost:.4f}, "
                      f"Test Cost={test_cost:.4f}, Train Acc={train_acc:.1f}%, "
                      f"Test Acc={test_acc:.1f}%")
        
        return history

# Prepare dataset
X_data, y_data = make_classification(n_samples=2000, n_features=20, n_informative=15,
                                     n_redundant=5, n_clusters_per_class=2,
                                     random_state=42)
X_data = X_data.T
y_data = y_data.reshape(1, -1)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_data.T, y_data.T, test_size=0.2, random_state=42
)
X_train, X_test = X_train.T, X_test.T
y_train, y_test = y_train.T, y_test.T

# Standardize data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.T).T
X_test_scaled = scaler.transform(X_test.T).T

# Network architecture
network_architecture = [20, 64, 32, 16, 8, 1]

print("Training networks with different initialization methods...\n")
print("Network Architecture:", network_architecture)
print("Dataset: {} training samples, {} test samples".format(
    X_train_scaled.shape[1], X_test_scaled.shape[1]))
print("=" * 70)

# Train with different initializations
init_methods = ['random_small', 'xavier_normal', 'he_normal', 'lecun_normal']
training_results = {}

for init_method in init_methods:
    print(f"\nTraining with {init_method} initialization...")
    
    trainer = InitializationTrainer(network_architecture, activation='relu')
    history = trainer.train(
        X_train_scaled, y_train, X_test_scaled, y_test,
        init_method=init_method, epochs=200, learning_rate=0.01, verbose=True
    )
    
    training_results[init_method] = history
    
    # Print final results
    final_train_acc = history['train_accuracies'][-1]
    final_test_acc = history['test_accuracies'][-1]
    print(f"Final Results - Train Acc: {final_train_acc:.1f}%, Test Acc: {final_test_acc:.1f}%")

print("\n✅ Training comparison complete!")
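
The output-layer step in `backward_propagation` multiplies the cross-entropy derivative by the sigmoid derivative. Algebraically, that product reduces to `A - Y`, which avoids dividing by activations that are close to 0 or 1. The sketch below checks the identity on random values. It is a standalone illustration, not a change to the trainer, and it omits the `1e-8` stabilizer so the two forms can be compared exactly.

```python
import numpy as np

rng = np.random.default_rng(4)  # local generator, independent of the lab's seed

Z = rng.standard_normal((1, 8))                    # output-layer pre-activations
Y = rng.integers(0, 2, size=(1, 8)).astype(float)  # binary labels
A = 1.0 / (1.0 + np.exp(-Z))                       # sigmoid output

# Chain rule: dL/dA for binary cross-entropy, times dA/dZ for sigmoid
dA = -(Y / A - (1 - Y) / (1 - A))
dZ_chain = dA * A * (1 - A)

# Simplified closed form
dZ_simple = A - Y

print("Forms agree:", np.allclose(dZ_chain, dZ_simple))
```

Using the closed form for the output layer is a common design choice because the gradient stays well behaved even when the network is confidently wrong.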

## Part 7: Performance Visualization and Analysis

### Instructions:
1. Create comprehensive training performance comparisons
2. Analyze convergence characteristics for each initialization method
3. Draw conclusions about best practices

In [None]:
def visualize_training_comparison(training_results):
    """Visualize training performance comparison"""
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    colors = {'random_small': 'red', 'xavier_normal': 'green', 
              'he_normal': 'blue', 'lecun_normal': 'purple'}
    
    # Plot 1: Training cost
    for method, history in training_results.items():
        axes[0, 0].plot(history['train_costs'], label=method.replace('_', ' ').title(), 
                       color=colors[method], linewidth=2)
    
    axes[0, 0].set_title('Training Cost Evolution')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Cost')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].set_yscale('log')
    
    # Plot 2: Test accuracy
    for method, history in training_results.items():
        axes[0, 1].plot(history['test_accuracies'], label=method.replace('_', ' ').title(), 
                       color=colors[method], linewidth=2)
    
    axes[0, 1].set_title('Test Accuracy Evolution')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('Accuracy (%)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    axes[0, 1].set_ylim([40, 100])
    
    # Plot 3: Gradient norms
    for method, history in training_results.items():
        axes[1, 0].plot(history['gradient_norms'], label=method.replace('_', ' ').title(), 
                       color=colors[method], linewidth=2, alpha=0.7)
    
    axes[1, 0].set_title('Gradient Norm Evolution')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Gradient Norm')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    axes[1, 0].set_yscale('log')
    
    # Plot 4: Final performance comparison
    methods = list(training_results.keys())
    final_accuracies = [training_results[method]['test_accuracies'][-1] for method in methods]
    convergence_epochs = []
    
    for method in methods:
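        # Find the first epoch where test accuracy reaches 90% of this run's own final value.
        # This is a relative measure: a run that ends at low accuracy can still look "fast".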
        final_acc = training_results[method]['test_accuracies'][-1]
        target_acc = final_acc * 0.9
        
        convergence_epoch = len(training_results[method]['test_accuracies'])
        for epoch, acc in enumerate(training_results[method]['test_accuracies']):
            if acc >= target_acc:
                convergence_epoch = epoch
                break
        
        convergence_epochs.append(convergence_epoch)
    
    x_pos = np.arange(len(methods))
    
    # Create dual y-axis bar chart
    ax4_twin = axes[1, 1].twinx()
    
    bars1 = axes[1, 1].bar(x_pos - 0.2, final_accuracies, 0.4, 
                          label='Final Accuracy', color='skyblue', alpha=0.8)
    bars2 = ax4_twin.bar(x_pos + 0.2, convergence_epochs, 0.4, 
                        label='Convergence Speed', color='orange', alpha=0.8)
    
    axes[1, 1].set_xlabel('Initialization Method')
    axes[1, 1].set_ylabel('Final Test Accuracy (%)', color='blue')
    ax4_twin.set_ylabel('Epochs to 90% Convergence', color='orange')
    
    axes[1, 1].set_title('Final Performance & Convergence Speed')
    axes[1, 1].set_xticks(x_pos)
    axes[1, 1].set_xticklabels([m.replace('_', ' ').title() for m in methods], rotation=45)
    
    # Add value labels on bars
    for bar, acc in zip(bars1, final_accuracies):
        height = bar.get_height()
        axes[1, 1].text(bar.get_x() + bar.get_width()/2., height + 0.5,
                       f'{acc:.1f}%', ha='center', va='bottom', fontsize=10)
    
    for bar, epochs in zip(bars2, convergence_epochs):
        height = bar.get_height()
        ax4_twin.text(bar.get_x() + bar.get_width()/2., height + 5,
                     f'{epochs}', ha='center', va='bottom', fontsize=10)
    
    # Add legends
    lines1, labels1 = axes[1, 1].get_legend_handles_labels()
    lines2, labels2 = ax4_twin.get_legend_handles_labels()
    axes[1, 1].legend(lines1 + lines2, labels1 + labels2, loc='upper right')
    
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.suptitle('Initialization Methods Training Comparison', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize training comparison
visualize_training_comparison(training_results)

# Detailed performance analysis
def analyze_initialization_performance(training_results):
    """Analyze and compare initialization performance"""
    
    print("\nDetailed Performance Analysis:")
    print("=" * 80)
    print(f"{'Method':<15} {'Final Acc':<12} {'Best Acc':<12} {'Stability':<12} {'Conv. Speed':<15}")
    print("-" * 80)
    
    performance_metrics = {}
    
    for method, history in training_results.items():
        # Calculate metrics
        final_acc = history['test_accuracies'][-1]
        best_acc = max(history['test_accuracies'])
        
        # Stability (inverse of variance in last 50 epochs)
        stability = 1.0 / (np.var(history['test_accuracies'][-50:]) + 1e-8)
        
        # Convergence speed (epochs to reach 90% of final accuracy)
        target_acc = final_acc * 0.9
        conv_speed = len(history['test_accuracies'])
        for epoch, acc in enumerate(history['test_accuracies']):
            if acc >= target_acc:
                conv_speed = epoch
                break
        
        performance_metrics[method] = {
            'final_accuracy': final_acc,
            'best_accuracy': best_acc,
            'stability': stability,
            'convergence_speed': conv_speed
        }
        
        print(f"{method.replace('_', ' '):<15} {final_acc:>10.1f}% {best_acc:>10.1f}% "
              f"{stability:>10.1f} {conv_speed:>13d} epochs")
    
    # Determine best method
    print("\nRankings by Metric:")
    print("-" * 50)
    
    # Rank by final accuracy
    acc_ranking = sorted(performance_metrics.items(), 
                        key=lambda x: x[1]['final_accuracy'], reverse=True)
    print("Final Accuracy:")
    for i, (method, metrics) in enumerate(acc_ranking):
        print(f"  {i+1}. {method.replace('_', ' ').title()}: {metrics['final_accuracy']:.1f}%")
    
    # Rank by convergence speed
    speed_ranking = sorted(performance_metrics.items(), 
                          key=lambda x: x[1]['convergence_speed'])
    print("\nConvergence Speed:")
    for i, (method, metrics) in enumerate(speed_ranking):
        print(f"  {i+1}. {method.replace('_', ' ').title()}: {metrics['convergence_speed']} epochs")
    
    # Overall recommendation
    print("\nRecommendations:")
    print("=" * 50)
    
    best_overall = acc_ranking[0][0]
    fastest = speed_ranking[0][0]
    
    print(f"🏆 Highest Final Test Accuracy: {best_overall.replace('_', ' ').title()}")
    print(f"🚀 Fastest Convergence: {fastest.replace('_', ' ').title()}")
    
    if best_overall == 'he_normal':
        print("\n✅ He initialization performs best for ReLU networks (as expected)")
    if fastest == 'he_normal':
        print("✅ He initialization also provides fastest convergence")
    
    return performance_metrics

# Analyze performance
performance_metrics = analyze_initialization_performance(training_results)
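
The convergence-speed metric above is measured against each run's own final accuracy, so a run that plateaus at a low accuracy can appear to converge quickly. One way to compare methods on equal footing is a fixed absolute threshold. The following is a minimal sketch, assuming accuracies are stored as percentages as in `history['test_accuracies']`; the helper name `epochs_to_threshold`, the 85% default, and the toy histories are illustrative assumptions, not part of the lab code.

```python
def epochs_to_threshold(accuracies, threshold=85.0):
    """Return the first epoch index whose accuracy is >= threshold, or None."""
    for epoch, acc in enumerate(accuracies):
        if acc >= threshold:
            return epoch
    return None  # the run never reached the threshold


# Toy histories (made-up numbers, for illustration only)
toy_results = {
    'slow_but_good': [50.0, 60.0, 70.0, 80.0, 90.0],
    'fast_but_stuck': [55.0, 56.0, 56.0, 56.0, 56.0],
}
for name, accs in toy_results.items():
    print(name, epochs_to_threshold(accs))
```

With the relative metric, the second toy run would report epoch 0; with an absolute threshold it is reported as never converging, which is usually the more useful signal when ranking methods.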

## Part 8: Custom Initialization Strategies

### Instructions:
1. Design custom initialization methods for specific scenarios
2. Test adaptive initialization strategies
3. Implement layer-specific initialization schemes

In [None]:
class CustomInitializer:
    """Advanced custom initialization strategies"""
    
    @staticmethod
    def adaptive_he(shape, activation='relu', layer_depth=1, negative_slope=0.01):
        """
        Adaptive He initialization that considers layer depth.
        Deeper layers get a slightly smaller standard deviation
        (a heuristic damping factor intended to limit activation growth).
        `negative_slope` is only used when activation == 'leaky_relu'.
        """
        n_in = shape[1]
        
        # Base He initialization
        if activation == 'leaky_relu':
            # He gain for Leaky ReLU: Var(W) = 2 / ((1 + negative_slope^2) * n_in)
            base_std = np.sqrt(2.0 / ((1.0 + negative_slope ** 2) * n_in))
        else:
            # ReLU and any other activation use standard He scaling
            base_std = np.sqrt(2.0 / n_in)
        
        # Depth adjustment factor
        depth_factor = 1.0 / np.sqrt(1 + 0.1 * layer_depth)
        
        adjusted_std = base_std * depth_factor
        return np.random.randn(*shape) * adjusted_std
    
    @staticmethod
    def layer_sequential_init(shapes, activation='relu'):
        """
        Initialize all layers considering the full network architecture
        Each layer's initialization depends on previous layers
        """
        parameters = {}
        cumulative_factor = 1.0
        
        for l, shape in enumerate(shapes, 1):
            n_in = shape[1]
            
            if activation == 'relu':
                base_std = np.sqrt(2.0 / n_in)
            else:
                base_std = np.sqrt(1.0 / n_in)
            
            # Sequential adjustment
            if l > 1:
                cumulative_factor *= 0.95  # Slightly reduce variance with depth
            
            adjusted_std = base_std * cumulative_factor
            parameters[f'W{l}'] = np.random.randn(*shape) * adjusted_std
            parameters[f'b{l}'] = np.zeros((shape[0], 1))
        
        return parameters
    
    @staticmethod
    def residual_aware_init(shape, has_residual=False):
        """
        Initialization aware of residual connections
        Residual layers can be initialized with smaller values
        """
        n_in = shape[1]
        
        if has_residual:
            # Smaller initialization for residual layers
            std = np.sqrt(1.0 / n_in)
        else:
            # Standard He initialization
            std = np.sqrt(2.0 / n_in)
        
        return np.random.randn(*shape) * std
    
    @staticmethod
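    def width_aware_init(shape, target_std=1.0):
        """
        Initialization that considers layer width.
        Scales the weight std by 1/sqrt(n_in) and by sqrt(n_out / (n_in + n_out)),
        so the pre-activation variance depends on the fan-out/fan-in ratio
        rather than on the absolute layer width.
        """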
        n_in, n_out = shape[1], shape[0]
        
        # Adjust for both input and output dimensions
        width_factor = np.sqrt(n_out / (n_in + n_out))
        std = target_std * width_factor / np.sqrt(n_in)
        
        return np.random.randn(*shape) * std
    
    @staticmethod
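    def spectral_norm_init(shape, spectral_radius=0.9):
        """
        Initialize weights with a controlled spectral norm (largest singular value).
        The parameter keeps the name `spectral_radius`, but for a non-square matrix
        only the spectral norm is defined; for a square matrix the spectral radius
        (largest |eigenvalue|) is at most the spectral norm.
        Useful for RNNs and very deep networks
        """
        # Generate random matrix
        W = np.random.randn(*shape)
        
        # Compute spectral norm (largest singular value); only singular values are needed
        current_spectral_norm = np.linalg.svd(W, compute_uv=False)[0]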
        
        # Scale to desired spectral radius
        W_normalized = W * (spectral_radius / current_spectral_norm)
        
        return W_normalized

# Test custom initialization strategies
print("Testing Custom Initialization Strategies:")
print("=" * 60)

test_shape = (64, 128)
test_architectures = [(128, 100), (100, 80), (80, 60), (60, 40), (40, 1)]

# Test adaptive He initialization
print("\n1. Adaptive He Initialization (depth-aware):")
for depth in [1, 2, 3, 4, 5]:
    weights = CustomInitializer.adaptive_he(test_shape, layer_depth=depth)
    print(f"   Layer {depth}: std={np.std(weights):.6f}")

# Test layer sequential initialization
print("\n2. Layer Sequential Initialization:")
seq_params = CustomInitializer.layer_sequential_init(test_architectures)
for l in range(1, len(test_architectures) + 1):
    std = np.std(seq_params[f'W{l}'])
    print(f"   Layer {l}: std={std:.6f}")

# Test spectral norm initialization
print("\n3. Spectral Norm Initialization:")
for radius in [0.5, 0.9, 1.0, 1.2]:
    weights = CustomInitializer.spectral_norm_init(test_shape, spectral_radius=radius)
    actual_norm = np.linalg.svd(weights, compute_uv=False)[0]  # largest singular value
    print(f"   Target norm: {radius:.1f}, Actual: {actual_norm:.6f}")

print("\n✅ Custom initialization strategies tested!")
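
The weight standard deviations printed above do not show how a scheme affects the signal inside the network. A quick check is to push random inputs through a stack of ReLU layers and record the root-mean-square (RMS) activation per layer. The sketch below is self-contained and makes several assumptions for illustration: the helper `forward_rms`, a constant width of 256, a depth of 20, Gaussian inputs, and no biases or residual connections. It compares plain He scaling with the same depth factor used in `adaptive_he`.

```python
import numpy as np

def forward_rms(scales, width=256, n_samples=512, seed=0):
    """Push Gaussian inputs through a ReLU stack and record the RMS activation per layer.

    scales[l] multiplies the He standard deviation sqrt(2 / n_in) of layer l.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((width, n_samples))
    rms = []
    for scale in scales:
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width) * scale
        a = np.maximum(0.0, W @ a)  # ReLU
        rms.append(float(np.sqrt(np.mean(a ** 2))))
    return rms

depth = 20
plain_he = forward_rms([1.0] * depth)
# Same depth factor as adaptive_he: 1 / sqrt(1 + 0.1 * layer_depth)
damped = forward_rms([1.0 / np.sqrt(1 + 0.1 * l) for l in range(1, depth + 1)])

print(f"Plain He     - layer 1 RMS: {plain_he[0]:.4f}, layer {depth} RMS: {plain_he[-1]:.4f}")
print(f"Depth-damped - layer 1 RMS: {damped[0]:.4f}, layer {depth} RMS: {damped[-1]:.4f}")
```

Because ReLU is positively homogeneous, every scale factor below 1 multiplies through to the output, so in a plain stack like this one the damping factors compound with depth. Whether that is desirable depends on the architecture (for example, on whether residual connections or normalization layers are present), which is why measuring activations is a useful complement to inspecting weight statistics.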

## Part 9: Initialization Decision Framework

### Instructions:
1. Create a decision framework for choosing initialization methods
2. Implement automatic initialization selection
3. Test the framework on different network architectures

In [None]:
class InitializationRecommender:
    """Rule-based initialization recommender (lookup table plus architecture heuristics)"""
    
    def __init__(self):
        self.recommendations = {
            'relu': {
                'shallow': 'he_normal',
                'deep': 'adaptive_he',
                'very_deep': 'spectral_norm'
            },
            'leaky_relu': {
                'shallow': 'he_normal',
                'deep': 'he_normal',
                'very_deep': 'adaptive_he'
            },
            'sigmoid': {
                'shallow': 'xavier_normal',
                'deep': 'xavier_normal',
                'very_deep': 'layer_sequential'
            },
            'tanh': {
                'shallow': 'xavier_normal',
                'deep': 'xavier_normal', 
                'very_deep': 'layer_sequential'
            },
            'selu': {
                'shallow': 'lecun_normal',
                'deep': 'lecun_normal',
                'very_deep': 'lecun_normal'
            }
        }
    
    def analyze_network(self, layer_dims, activation='relu'):
        """Analyze network characteristics"""
        num_hidden = len(layer_dims) - 2
        total_params = sum(layer_dims[i] * layer_dims[i+1] + layer_dims[i+1] 
                          for i in range(len(layer_dims)-1))
        
        # Determine network depth category
        if num_hidden <= 3:
            depth_category = 'shallow'
        elif num_hidden <= 10:
            depth_category = 'deep'
        else:
            depth_category = 'very_deep'
        
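        # Analyze potential issues
        issues = []
        hidden_dims = layer_dims[1:-1]
        if max(layer_dims) > 1000:
            issues.append('wide_layers')
        # Guard against networks with no hidden layers (min() of an empty list raises)
        if hidden_dims and min(hidden_dims) < 10:
            issues.append('narrow_bottleneck')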
        if num_hidden > 20:
            issues.append('very_deep')
        
        # Check for irregular architecture
        width_ratios = [layer_dims[i+1]/layer_dims[i] for i in range(len(layer_dims)-1)]
        if max(width_ratios) > 10 or min(width_ratios) < 0.1:
            issues.append('irregular_width')
        
        return {
            'depth_category': depth_category,
            'num_hidden': num_hidden,
            'total_params': total_params,
            'issues': issues
        }
    
    def recommend_initialization(self, layer_dims, activation='relu', 
                               problem_type='classification', verbose=True):
        """Recommend best initialization strategy"""
        analysis = self.analyze_network(layer_dims, activation)
        
        if verbose:
            print(f"Network Analysis:")
            print(f"  Architecture: {layer_dims}")
            print(f"  Activation: {activation}")
            print(f"  Hidden layers: {analysis['num_hidden']}")
            print(f"  Total parameters: {analysis['total_params']:,}")
            print(f"  Depth category: {analysis['depth_category']}")
            if analysis['issues']:
                print(f"  Potential issues: {', '.join(analysis['issues'])}")
        
        # Base recommendation
        if activation.lower() in self.recommendations:
            base_rec = self.recommendations[activation.lower()][analysis['depth_category']]
        else:
            base_rec = 'he_normal'  # Default fallback
        
        # Adjust for specific issues
        final_rec = base_rec
        additional_advice = []
        
        if 'wide_layers' in analysis['issues']:
            additional_advice.append("Consider batch normalization for wide layers")
            if base_rec == 'he_normal':
                final_rec = 'adaptive_he'
        
        if 'narrow_bottleneck' in analysis['issues']:
            additional_advice.append("Narrow layers may cause information loss")
        
        if 'irregular_width' in analysis['issues']:
            additional_advice.append("Consider layer-specific initialization")
            if base_rec in ['he_normal', 'xavier_normal']:
                final_rec = 'width_aware'
        
        if 'very_deep' in analysis['issues']:
            additional_advice.append("Consider residual connections")
            final_rec = 'spectral_norm'
        
        if verbose:
            print(f"\nRecommendation: {final_rec.replace('_', ' ').title()}")
            if additional_advice:
                print("Additional advice:")
                for advice in additional_advice:
                    print(f"  - {advice}")
        
        return {
            'method': final_rec,
            'confidence': self._calculate_confidence(analysis, activation),
            'alternatives': self._get_alternatives(final_rec, activation),
            'advice': additional_advice,
            'analysis': analysis
        }
    
    def _calculate_confidence(self, analysis, activation):
        """Calculate a heuristic confidence score (not a calibrated probability)"""
        base_confidence = 0.8
        
        # Higher confidence for common scenarios
        if activation.lower() in ['relu', 'sigmoid', 'tanh']:
            base_confidence += 0.1
        
        # Lower confidence for problematic architectures
        if len(analysis['issues']) > 2:
            base_confidence -= 0.2
        
        return min(max(base_confidence, 0.3), 0.95)
    
    def _get_alternatives(self, primary, activation):
        """Get alternative initialization methods"""
        alternatives = []
        
        if primary == 'he_normal':
            alternatives = ['he_uniform', 'adaptive_he']
        elif primary == 'xavier_normal':
            alternatives = ['xavier_uniform', 'lecun_normal']
        elif primary == 'adaptive_he':
            alternatives = ['he_normal', 'spectral_norm']
        else:
            alternatives = ['he_normal', 'xavier_normal']
        
        return alternatives

# Create recommendation system
recommender = InitializationRecommender()

print("Testing Initialization Recommendation System:")
print("=" * 70)

# Test various network architectures
test_cases = [
    {
        'name': 'Simple Classification Network',
        'architecture': [784, 128, 64, 10],
        'activation': 'relu'
    },
    {
        'name': 'Deep Narrowing Network (9 hidden layers)',
        'architecture': [100, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1],
        'activation': 'relu'
    },
    {
        'name': 'Wide Shallow Network',
        'architecture': [1000, 2000, 1000, 1],
        'activation': 'relu'
    },
    {
        'name': 'Sigmoid Network',
        'architecture': [50, 100, 50, 25, 1],
        'activation': 'sigmoid'
    },
    {
        'name': 'Irregular Architecture',
        'architecture': [10, 1000, 5, 500, 1],
        'activation': 'relu'
    }
]

for i, test_case in enumerate(test_cases, 1):
    print(f"\n{i}. {test_case['name']}:")
    print("-" * (len(test_case['name']) + 4))
    
    recommendation = recommender.recommend_initialization(
        test_case['architecture'], 
        test_case['activation'],
        verbose=True
    )
    
    print(f"   Confidence: {recommendation['confidence']:.1%}")
    print(f"   Alternatives: {', '.join(recommendation['alternatives'])}")

print("\n✅ Recommendation system testing complete!")
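
The recommender returns a method name as a string, so something still has to turn that name into actual weights. The sketch below shows one way to do this with a small registry. It is self-contained and covers only the three standard methods; the names `INITIALIZERS` and `build_parameters` are hypothetical, the `(n_out, n_in)` shape convention follows this lab, and the silent fallback to He Normal for unknown names is an assumption you may prefer to replace with an error.

```python
import numpy as np

def he_normal(shape, rng):
    return rng.standard_normal(shape) * np.sqrt(2.0 / shape[1])

def xavier_normal(shape, rng):
    n_out, n_in = shape
    return rng.standard_normal(shape) * np.sqrt(2.0 / (n_in + n_out))

def lecun_normal(shape, rng):
    return rng.standard_normal(shape) * np.sqrt(1.0 / shape[1])

# Hypothetical registry: method name -> initializer function
INITIALIZERS = {
    'he_normal': he_normal,
    'xavier_normal': xavier_normal,
    'lecun_normal': lecun_normal,
}

def build_parameters(layer_dims, method, seed=0):
    """Create W/b for each layer using the named method (falls back to He Normal)."""
    rng = np.random.default_rng(seed)
    init_fn = INITIALIZERS.get(method, he_normal)
    params = {}
    for l in range(1, len(layer_dims)):
        shape = (layer_dims[l], layer_dims[l - 1])  # (n_out, n_in), as in this lab
        params[f'W{l}'] = init_fn(shape, rng)
        params[f'b{l}'] = np.zeros((layer_dims[l], 1))
    return params

params = build_parameters([784, 128, 64, 10], 'he_normal')
for name, value in params.items():
    print(name, value.shape)
```

In practice the `method` argument would come from `recommendation['method']`, and the registry would be extended with the custom initializers from Part 8.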

## Part 10: Best Practices Summary and Guidelines

### Instructions:
1. Review comprehensive best practices for weight initialization
2. Understand when to use each initialization method
3. Complete practical exercises to reinforce learning

In [None]:
# Create comprehensive best practices guide
def create_initialization_guide():
    """Create comprehensive initialization best practices guide"""
    
    guide = """
    🎯 WEIGHT INITIALIZATION BEST PRACTICES GUIDE
    ============================================
    
    1. ACTIVATION FUNCTION SPECIFIC:
    
    ReLU Networks:
    ✅ USE: He initialization (He Normal or He Uniform)
    📈 Variance: Var(W) = 2/n_in
    🎯 Why: ReLU zeroes about half of the pre-activations, so the factor 2 restores the variance
    
    Sigmoid/Tanh Networks:
    ✅ USE: Xavier/Glorot initialization
    📈 Variance: Var(W) = 2/(n_in + n_out) (Glorot), or 1/n_in (fan-in only variant)
    🎯 Why: Maintains variance for symmetric activations
    
    SELU Networks:
    ✅ USE: LeCun initialization
    📈 Variance: Var(W) = 1/n_in
    🎯 Why: Designed for self-normalizing properties
    
    2. NETWORK DEPTH CONSIDERATIONS:
    
    Shallow Networks (≤3 hidden layers):
    ✅ Standard initialization methods work well
    ✅ He Normal for ReLU, Xavier for sigmoid/tanh
    
    Deep Networks (4-10 hidden layers):
    ✅ Use proper initialization + batch normalization
    ✅ Consider gradient clipping
    ⚠️ Monitor gradient flow carefully
    
    Very Deep Networks (>10 hidden layers):
    ✅ Residual connections + careful initialization
    ✅ Layer-specific initialization strategies
    ✅ Spectral norm control
    ❌ Avoid standard methods without modifications
    
    3. COMMON PITFALLS TO AVOID:
    
    ❌ Zero initialization for weights (fails to break symmetry)
    ❌ The same constant for all weights (neurons in a layer stay identical)
    ❌ Too large values (exploding gradients)
    ❌ Too small values (vanishing gradients)
    ❌ Ignoring activation function choice
    ❌ Using sigmoid/tanh in deep networks without proper init
    
    4. PRACTICAL RECOMMENDATIONS:
    
    Starting Point:
    🟢 Use He Normal for ReLU networks (the usual default)
    🟢 Use Xavier Normal for sigmoid/tanh networks
    🟢 Initialize biases to zero unless the architecture calls for something else
    
    If Training Fails:
    🔄 Check gradient flow (use gradient norm tracking)
    🔄 Try different initialization methods
    🔄 Add batch normalization
    🔄 Implement gradient clipping
    🔄 Reduce learning rate
    
    For Special Cases:
    🎯 Transfer Learning: Keep pretrained weights; initialize only the new layers
    🎯 GANs: Careful initialization of both networks
    🎯 RNNs: Use orthogonal initialization
    🎯 Autoencoders: Symmetric initialization
    
    5. DEBUGGING CHECKLIST:
    
    Before Training:
    □ Check weight distributions match theoretical expectations
    □ Verify no NaN or infinite values
    □ Confirm reasonable activation magnitudes
    
    During Training:
    □ Monitor gradient norms (should be stable)
    □ Track activation statistics
    □ Watch for dead neurons (ReLU)
    □ Check for gradient explosion/vanishing
    
    6. PERFORMANCE OPTIMIZATION:
    
    Speed up Convergence:
    ⚡ Use batch normalization
    ⚡ Proper learning rate scheduling
    ⚡ Warm-up initialization schemes
    
    Improve Stability:
    🛡️ Gradient clipping
    🛡️ Spectral normalization
    🛡️ Layer-wise adaptive rates
    
    7. MODERN BEST PRACTICES:
    
    Common Modern Defaults:
    🌟 He initialization + Batch Normalization + ReLU
    🌟 Residual connections for very deep networks
    🌟 Attention mechanisms with scaled initialization
    🌟 Layer normalization for transformers
    
    """
    
    return guide

# Display the guide
print(create_initialization_guide())

# Create practical decision tree
def initialization_decision_tree():
    """Print a text decision tree for choosing an initialization method"""
    
    print("\n🌳 INITIALIZATION DECISION TREE")
    print("=" * 50)
    
    decision_tree = """
    Start Here: What's your activation function?
    ├── ReLU Family (ReLU, LeakyReLU, ELU)
    │   ├── Shallow Network (≤3 layers)
    │   │   └── ✅ Use He Normal
    │   ├── Deep Network (4-10 layers)
    │   │   └── ✅ Use He Normal + Batch Norm
    │   └── Very Deep (>10 layers)
    │       └── ✅ Use Adaptive He + Residual Connections
    │
    ├── Sigmoid/Tanh
    │   ├── Shallow Network
    │   │   └── ✅ Use Xavier Normal
    │   └── Deep Network
    │       └── ⚠️ Consider ReLU instead, or Xavier + Batch Norm
    │
    ├── SELU
    │   └── ✅ Use LeCun Normal (any depth)
    │
    └── Custom/Other
        └── 🔬 Experiment with variance scaling
    
    Special Considerations:
    • Wide layers (>1000 units): Add batch normalization
    • Irregular architecture: Consider layer-specific initialization
    • Transfer learning: Initialize only new layers
    • Training instability: Add gradient clipping
    """
    
    print(decision_tree)

initialization_decision_tree()

# Create quick reference table
print("\n📋 QUICK REFERENCE TABLE")
print("=" * 80)
print(f"{'Scenario':<25} {'Recommended Method':<20} {'Alternative':<20} {'Notes'}")
print("-" * 80)

scenarios = [
    ("ReLU + Shallow", "He Normal", "He Uniform", "Standard choice"),
    ("ReLU + Deep", "He Normal + BN", "Adaptive He", "Add batch norm"),
    ("ReLU + Very Deep", "Residual + He", "Spectral Norm", "Need skip connections"),
    ("Sigmoid + Any", "Xavier Normal", "Xavier Uniform", "Consider ReLU instead"),
    ("Tanh + Any", "Xavier Normal", "LeCun Normal", "Symmetric activation"),
    ("SELU + Any", "LeCun Normal", "LeCun Uniform", "Self-normalizing"),
    ("Transfer Learning", "Pretrained + He", "Fine-tune only", "He for new layers only"),
    ("GAN Generator", "Xavier Normal", "He Normal", "Stable training"),
    ("GAN Discriminator", "He Normal", "Spectral Norm", "Stabilize discriminator"),
    ("RNN/LSTM", "Orthogonal", "Xavier Normal", "Recurrent weights"),
]

for scenario, method, alt, notes in scenarios:
    print(f"{scenario:<25} {method:<20} {alt:<20} {notes}")

print("\n" + "="*80)
print("🎓 CONGRATULATIONS! You've mastered weight initialization strategies!")
print("="*80)
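
The "Before Training" items in the debugging checklist can be automated with a few lines of NumPy. The following sketch checks one weight matrix for non-finite values and compares its empirical standard deviation with the theoretical one. The helper name `check_weights` and the 15% relative tolerance are assumptions chosen for illustration; small layers have noisier statistics and may need a looser tolerance.

```python
import numpy as np

def check_weights(W, expected_std, rel_tol=0.15):
    """Return a dict of basic pre-training checks for one weight matrix."""
    finite = bool(np.all(np.isfinite(W)))
    empirical_std = float(np.std(W))
    std_ok = bool(abs(empirical_std - expected_std) <= rel_tol * expected_std)
    return {
        'finite': finite,
        'empirical_std': empirical_std,
        'expected_std': float(expected_std),
        'std_ok': std_ok,
    }

rng = np.random.default_rng(0)
n_out, n_in = 64, 128
W_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)  # He Normal

report = check_weights(W_he, expected_std=np.sqrt(2.0 / n_in))
print(report)
```

Running the same check on every `W{l}` before the first epoch catches scale mistakes (for example, using `n_out` instead of `n_in`) before they show up as a stalled or diverging loss.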

## Lab Complete! 🎉

### What You've Accomplished:
✅ **Mastered Initialization Theory**: Understood the mathematical foundations behind different initialization methods  
✅ **Implemented All Major Methods**: Built Xavier, He, LeCun, and custom initialization strategies  
✅ **Analyzed Impact on Training**: Compared how initialization affects convergence and stability  
✅ **Created Custom Solutions**: Developed adaptive and architecture-aware initialization  
✅ **Built Decision Framework**: Created an intelligent recommendation system  
✅ **Established Best Practices**: Learned when and how to apply each method  

### Key Takeaways:
1. **Activation Function Matters**: ReLU needs He, Sigmoid/Tanh need Xavier
2. **Network Depth is Critical**: Deeper networks need more sophisticated initialization
3. **One Size Doesn't Fit All**: Different architectures need different approaches
4. **Monitor and Adapt**: Track gradient flow and activation statistics
5. **Combine with Other Techniques**: Initialization works best with batch norm and proper regularization

### Next Steps:
1. **Experiment with Your Own Data**: Apply these techniques to real-world problems
2. **Explore Advanced Methods**: Look into LSUV, Fixup, and other modern techniques
3. **Study Architecture-Specific Methods**: Learn about transformer, CNN, and RNN initialization
4. **Implement in Deep Learning Frameworks**: Apply these concepts in PyTorch/TensorFlow
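
As a starting point for the architecture-specific methods in step 3, and for the "RNNs: Use orthogonal initialization" guideline from Part 10, here is a minimal NumPy sketch of orthogonal initialization based on a QR decomposition. The function name `orthogonal_init`, the `gain` parameter, and the `(n_out, n_in)` shape convention are assumptions for illustration; deep learning frameworks ship their own implementations, which should be preferred in real projects.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Return a matrix with orthonormal rows or columns (whichever are fewer), scaled by gain."""
    rng = np.random.default_rng(seed)
    n_out, n_in = shape
    # Factorize a tall (or square) Gaussian matrix so Q has orthonormal columns
    A = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))  # remove the sign ambiguity of the QR factorization
    if n_out < n_in:
        Q = Q.T  # wide matrix: orthonormal rows instead of columns
    return gain * Q

W = orthogonal_init((64, 64))
print(np.allclose(W @ W.T, np.eye(64)))

singular_values = np.linalg.svd(W, compute_uv=False)
print(singular_values.min(), singular_values.max())
```

All singular values of an orthogonal matrix equal 1 (times `gain`), so repeated multiplication by the recurrent weight matrix neither shrinks nor amplifies the hidden state norm at initialization.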

### Additional Challenges:
1. **Create Dynamic Initialization**: Build initialization that adapts during training
2. **Analyze Real Networks**: Study initialization in popular architectures (ResNet, BERT, etc.)
3. **Develop New Methods**: Design initialization for novel activation functions
4. **Benchmark Performance**: Create comprehensive comparison across different domains

### Troubleshooting Summary:
- **Training not converging?** → Check initialization scale and gradient flow
- **Gradients exploding?** → Reduce initialization variance or add clipping
- **Gradients vanishing?** → Use proper initialization for activation function
- **Slow convergence?** → Add batch normalization or adjust learning rate
- **Unstable training?** → Consider spectral normalization or residual connections
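
The "add clipping" remedy for exploding gradients can be sketched as clipping by global norm: all gradients are rescaled by one common factor so that their joint L2 norm stays below a limit. The helper name `clip_by_global_norm`, the `max_norm=5.0` default, and the `dW1`/`db1` key names are assumptions for illustration and may not match the gradient dictionary used by your trainer.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradients by a common factor so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (global_norm + 1e-12))  # never scale gradients up
    clipped = {name: g * scale for name, g in grads.items()}
    return clipped, float(global_norm)

# Toy gradients: 12 entries of 10.0 give a global norm of sqrt(1200)
grads = {'dW1': np.full((3, 4), 10.0), 'db1': np.zeros((3, 1))}
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped.values())))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
```

Clipping by a single global factor preserves the direction of the update, unlike clipping each element independently. It treats the symptom rather than the cause, so it is best combined with a check of the initialization scale.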

### Cleanup:
```python
# Run the garbage collector to release unreferenced objects
import gc
gc.collect()
print("Memory cleanup complete!")
```

**Remember**: Good initialization is the foundation of successful deep learning. It's often the difference between a model that learns and one that doesn't! 🚀