# Lab 4.2: Deep Network Propagation Algorithms

## Duration: 45 minutes

## Learning Objectives
By the end of this lab, you will be able to:
- Implement forward propagation for deep neural networks
- Understand and implement backward propagation with the chain rule
- Cache intermediate values for efficient backpropagation
- Debug propagation algorithms and verify gradient computations

## Prerequisites
- Completed Lab 4.1: Deep Network Architecture Implementation
- Understanding of matrix operations and the chain rule
- Knowledge of activation functions and their derivatives

## Key Concepts
- **Forward Propagation**: Computing outputs by passing data through the network
- **Backward Propagation**: Computing gradients using the chain rule
- **Caching**: Storing intermediate values for efficient gradient computation
- **Chain Rule**: Mathematical foundation for backpropagation
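
The concepts above correspond to the following per-layer equations, written for layer $l = 1, \dots, L$ with activation $g^{[l]}$, $m$ examples stored as columns, and cost $J = \frac{1}{m}\sum_{i} \mathcal{L}^{(i)}$. The superscript notation is an assumed convention, not one the lab defines. Here $dA^{[l]}$ and $dZ^{[l]}$ denote derivatives of the per-example loss, and the $1/m$ factor enters only when forming $dW^{[l]}$ and $db^{[l]}$, which matches the code in Step 2.

$$
\begin{aligned}
Z^{[l]} &= W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right), \qquad A^{[0]} = X \\
dZ^{[l]} &= dA^{[l]} \odot g^{[l]\prime}\!\left(Z^{[l]}\right) \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} \left(A^{[l-1]}\right)^{\top}, \qquad db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dA^{[l-1]} &= \left(W^{[l]}\right)^{\top} dZ^{[l]}
\end{aligned}
$$

The cached values $Z^{[l]}$ and $A^{[l-1]}$ from the forward pass are exactly the quantities the backward equations reuse.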

## Setup and Imports

First, let's import all necessary libraries and set up our environment.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib for better plots
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)

print("Environment setup complete!")
print(f"NumPy version: {np.__version__}")

## Step 1: Activation Functions and Their Derivatives

Before implementing propagation algorithms, let's define activation functions and their derivatives:

In [None]:
class ActivationFunctions:
    """
    Collection of activation functions and their derivatives
    """
    
    @staticmethod
    def relu(z):
        """ReLU activation function"""
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        """Derivative of ReLU"""
        return (z > 0).astype(float)
    
    @staticmethod
    def sigmoid(z):
        """Sigmoid activation function"""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    @staticmethod
    def sigmoid_derivative(z):
        """Derivative of sigmoid"""
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    @staticmethod
    def tanh(z):
        """Tanh activation function"""
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        """Derivative of tanh"""
        return 1 - np.tanh(z) ** 2
    
    @staticmethod
    def linear(z):
        """Linear activation function"""
        return z
    
    @staticmethod
    def linear_derivative(z):
        """Derivative of linear"""
        return np.ones_like(z)
    
    @staticmethod
    def softmax(z):
        """Softmax activation function"""
        # Subtract max for numerical stability
        exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))
        return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Test activation functions
print("Testing Activation Functions:")

# Test data
z = np.linspace(-3, 3, 100)

# Plot activation functions and their derivatives
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

activations = [
    ('ReLU', ActivationFunctions.relu, ActivationFunctions.relu_derivative),
    ('Sigmoid', ActivationFunctions.sigmoid, ActivationFunctions.sigmoid_derivative),
    ('Tanh', ActivationFunctions.tanh, ActivationFunctions.tanh_derivative)
]

for idx, (name, func, deriv_func) in enumerate(activations):
    # Activation function
    y = func(z)
    axes[0, idx].plot(z, y, 'b-', linewidth=2, label=name)
    axes[0, idx].grid(True, alpha=0.3)
    axes[0, idx].set_title(f'{name} Function')
    axes[0, idx].set_xlabel('z')
    axes[0, idx].set_ylabel(f'{name.lower()}(z)')
    
    # Derivative
    dy = deriv_func(z)
    axes[1, idx].plot(z, dy, 'r-', linewidth=2, label=f"{name}' derivative")
    axes[1, idx].grid(True, alpha=0.3)
    axes[1, idx].set_title(f'{name} Derivative')
    axes[1, idx].set_xlabel('z')
    axes[1, idx].set_ylabel(f"d{name.lower()}/dz")

plt.tight_layout()
plt.show()

print("\nActivation functions and derivatives are working correctly!")
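
The plots show the shape of each curve but do not confirm that every derivative matches its function. A minimal sketch of a numerical check follows, assuming central differences with a step of `h = 1e-5` (an illustrative choice, not part of the lab). The helper name `max_derivative_error` is hypothetical, and the functions are redefined so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (function, analytical derivative) pairs mirroring the lab's definitions
pairs = {
    "relu": (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
}

def max_derivative_error(func, deriv, z, h=1e-5):
    """Largest gap between the analytical derivative and a central difference."""
    numerical = (func(z + h) - func(z - h)) / (2 * h)
    return float(np.max(np.abs(numerical - deriv(z))))

# Stay away from z = 0, where ReLU is not differentiable
z_check = np.array([-3.0, -1.5, -0.2, 0.3, 1.0, 2.5])

errors = {name: max_derivative_error(f, d, z_check) for name, (f, d) in pairs.items()}
for name, err in errors.items():
    print(f"{name}: max abs error = {err:.2e}")
```

Errors far below the step size suggest the derivative and the function agree; a value near 1 would point to a wrong formula.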

## Step 2: Deep Neural Network with Propagation

Let's extend our neural network class to include forward and backward propagation:

In [None]:
class DeepNeuralNetworkWithPropagation:
    """
    Deep Neural Network with forward and backward propagation
    """
    
    def __init__(self, layer_sizes, hidden_activation='relu', output_activation='sigmoid', 
                 initialization='he_normal', random_seed=None):
        """
        Initialize the deep neural network with propagation capabilities
        
        Parameters:
        layer_sizes: list of integers, number of units in each layer
        hidden_activation: activation function for hidden layers
        output_activation: activation function for output layer
        initialization: weight initialization method
        random_seed: random seed for reproducibility
        """
        if random_seed is not None:  # a seed of 0 is valid and must not be skipped
            np.random.seed(random_seed)
            
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.hidden_activation = hidden_activation
        self.output_activation = output_activation
        
        # Initialize parameters
        self.parameters = self._initialize_parameters(initialization)
        
        # Cache for forward propagation
        self.cache = {}
        
        # Gradients
        self.gradients = {}
        
        # Activation function mappings
        self.activation_functions = {
            'relu': (ActivationFunctions.relu, ActivationFunctions.relu_derivative),
            'sigmoid': (ActivationFunctions.sigmoid, ActivationFunctions.sigmoid_derivative),
            'tanh': (ActivationFunctions.tanh, ActivationFunctions.tanh_derivative),
            'linear': (ActivationFunctions.linear, ActivationFunctions.linear_derivative),
            'softmax': (ActivationFunctions.softmax, None)  # Softmax derivative handled separately
        }
        
        print(f"Deep Neural Network initialized:")
        print(f"Architecture: {layer_sizes}")
        print(f"Hidden activation: {hidden_activation}")
        print(f"Output activation: {output_activation}")
        print(f"Total parameters: {self._count_parameters():,}")
    
    def _initialize_parameters(self, initialization='he_normal'):
        """
        Initialize weights and biases
        """
        parameters = {}
        
        for layer in range(1, self.num_layers):
            # Weight initialization
            if initialization == 'he_normal':
                parameters[f'W{layer}'] = np.random.randn(
                    self.layer_sizes[layer], self.layer_sizes[layer-1]
                ) * np.sqrt(2 / self.layer_sizes[layer-1])
            elif initialization == 'xavier_normal':
                parameters[f'W{layer}'] = np.random.randn(
                    self.layer_sizes[layer], self.layer_sizes[layer-1]
                ) * np.sqrt(2 / (self.layer_sizes[layer-1] + self.layer_sizes[layer]))
            else:
                parameters[f'W{layer}'] = np.random.randn(
                    self.layer_sizes[layer], self.layer_sizes[layer-1]
                ) * 0.01
            
            # Bias initialization
            parameters[f'b{layer}'] = np.zeros((self.layer_sizes[layer], 1))
        
        return parameters
    
    def _count_parameters(self):
        """Count total parameters"""
        return sum(param.size for param in self.parameters.values())
    
    def forward_propagation(self, X):
        """
        Forward propagation through the network
        
        Parameters:
        X: input data of shape (input_size, m) where m is number of examples
        
        Returns:
        AL: output of the last layer
        """
        # Clear previous cache
        self.cache = {}
        
        # Input layer
        A = X
        self.cache['A0'] = A
        
        # Forward through all layers
        for layer in range(1, self.num_layers):
            A_prev = A
            
            # Linear transformation
            Z = np.dot(self.parameters[f'W{layer}'], A_prev) + self.parameters[f'b{layer}']
            
            # Store linear cache (A_prev is already cached as A{layer-1})
            self.cache[f'Z{layer}'] = Z
            
            # Apply activation function
            if layer == self.num_layers - 1:  # Output layer
                activation_name = self.output_activation
            else:  # Hidden layers
                activation_name = self.hidden_activation
            
            activation_func, _ = self.activation_functions[activation_name]
            A = activation_func(Z)
            
            # Store activation cache
            self.cache[f'A{layer}'] = A
        
        return A
    
    def backward_propagation(self, AL, Y):
        """
        Backward propagation to compute gradients
        
        Parameters:
        AL: output of forward propagation
        Y: true labels
        
        Returns:
        gradients: dictionary containing gradients
        """
        m = AL.shape[1]  # Number of examples
        self.gradients = {}
        
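        # Initialize backward propagation with the derivative of the loss
        # with respect to the output (the 1/m factor is applied in dW and db)
        if self.output_activation == 'sigmoid':
            # Binary cross-entropy; clip AL so the divisions stay finite
            # when the sigmoid saturates at exactly 0 or 1
            AL_safe = np.clip(AL, 1e-8, 1 - 1e-8)
            dAL = -(np.divide(Y, AL_safe) - np.divide(1 - Y, 1 - AL_safe))
        else:
            # Squared error for 'linear' and other outputs;
            # for 'softmax' with cross-entropy this value is already dZ
            dAL = AL - Y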
        
        # Backward propagation through all layers
        for layer in range(self.num_layers - 1, 0, -1):
            # Get cached values
            Z = self.cache[f'Z{layer}']
            A_prev = self.cache[f'A{layer-1}']
            
            # Get activation derivative
            if layer == self.num_layers - 1:  # Output layer
                activation_name = self.output_activation
            else:  # Hidden layers
                activation_name = self.hidden_activation
            
            _, activation_derivative = self.activation_functions[activation_name]
            
            # Compute dZ
            if activation_name == 'softmax':
                # For softmax, we assume dAL already includes the derivative
                dZ = dAL
            else:
                dZ = dAL * activation_derivative(Z)
            
            # Compute gradients
            self.gradients[f'dW{layer}'] = (1/m) * np.dot(dZ, A_prev.T)
            self.gradients[f'db{layer}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            
            # Compute dA_prev for next iteration
            if layer > 1:  # Not the first layer
                dAL = np.dot(self.parameters[f'W{layer}'].T, dZ)
        
        return self.gradients
    
    def compute_cost(self, AL, Y, cost_function='binary_crossentropy'):
        """
        Compute the cost/loss
        
        Parameters:
        AL: output of forward propagation
        Y: true labels
        cost_function: type of cost function to use
        
        Returns:
        cost: scalar cost value
        """
        m = Y.shape[1]
        
        if cost_function == 'binary_crossentropy':
            # Binary cross-entropy
            cost = -(1/m) * np.sum(Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8))
        elif cost_function == 'mean_squared_error':
            # Mean squared error
            cost = (1/(2*m)) * np.sum(np.square(AL - Y))
        else:
            # Default to MSE
            cost = (1/(2*m)) * np.sum(np.square(AL - Y))
        
        return cost

# Test the extended neural network
print("Testing Deep Neural Network with Propagation:")
print()

# Create test network
test_network = DeepNeuralNetworkWithPropagation(
    layer_sizes=[4, 8, 6, 3, 1],
    hidden_activation='relu',
    output_activation='sigmoid',
    random_seed=42
)

# Create test data
m = 100  # number of examples
X = np.random.randn(4, m)  # 4 features, 100 examples
Y = np.random.randint(0, 2, (1, m))  # Binary labels

print(f"\nTest data shapes:")
print(f"X: {X.shape}")
print(f"Y: {Y.shape}")

# Test forward propagation
print("\nTesting forward propagation...")
AL = test_network.forward_propagation(X)
print(f"Output shape: {AL.shape}")
print(f"Output range: [{AL.min():.4f}, {AL.max():.4f}]")

# Test cost computation
cost = test_network.compute_cost(AL, Y)
print(f"\nCost: {cost:.6f}")

# Test backward propagation
print("\nTesting backward propagation...")
gradients = test_network.backward_propagation(AL, Y)

print(f"\nGradient shapes:")
for key, grad in gradients.items():
    print(f"{key}: {grad.shape}")

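# Every gradient should have the same shape as the parameter it updates
for key, grad in gradients.items():
    assert grad.shape == test_network.parameters[key[1:]].shape, key

print("\n✅ All propagation tests passed!")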
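
For a sigmoid output trained with binary cross-entropy, the output-layer product `dAL * sigmoid_derivative(Z)` computed in `backward_propagation` simplifies algebraically to `AL - Y`. The short sketch below confirms this numerically on made-up values; the arrays and their names are illustrative, not lab data.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_out = rng.normal(size=(1, 5))            # illustrative pre-activations
Y_out = rng.integers(0, 2, size=(1, 5))    # illustrative binary labels

AL_out = 1.0 / (1.0 + np.exp(-Z_out))      # sigmoid output

# Two-step form used in the lab: dJ/dAL times sigmoid'(Z)
dAL_out = -(Y_out / AL_out - (1 - Y_out) / (1 - AL_out))
dZ_two_step = dAL_out * AL_out * (1 - AL_out)

# Simplified form, which never divides by AL or 1 - AL
dZ_direct = AL_out - Y_out

print(np.allclose(dZ_two_step, dZ_direct))
```

The direct form is often preferred in practice because it stays finite when `AL` saturates at 0 or 1.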

## Step 3: Gradient Checking

Let's implement gradient checking to verify our backpropagation implementation:

In [None]:
def gradient_check(network, X, Y, epsilon=1e-7, threshold=1e-7):
    """
    Perform gradient checking to verify backpropagation implementation
    
    Parameters:
    network: neural network instance
    X: input data
    Y: labels
    epsilon: small value for numerical gradient computation
    threshold: threshold for considering gradients as matching
    
    Returns:
    difference: relative difference between analytical and numerical gradients
    """
    print("Performing gradient checking...")
    
    # Forward propagation and backward propagation
    AL = network.forward_propagation(X)
    analytical_gradients = network.backward_propagation(AL, Y)
    
    # Convert parameters and gradients to vectors
    parameters_vector = []
    gradients_vector = []
    keys = []
    
    for key in network.parameters.keys():
        parameters_vector.extend(network.parameters[key].flatten())
        if key.replace('W', 'dW').replace('b', 'db') in analytical_gradients:
            grad_key = key.replace('W', 'dW').replace('b', 'db')
            gradients_vector.extend(analytical_gradients[grad_key].flatten())
            keys.extend([f"{key}_{i}" for i in range(network.parameters[key].size)])
    
    parameters_vector = np.array(parameters_vector)
    gradients_vector = np.array(gradients_vector)
    
    # Compute numerical gradients
    numerical_gradients = np.zeros_like(parameters_vector)
    
    print(f"Computing numerical gradients for {len(parameters_vector)} parameters...")
    
    # Sample a subset of parameters for efficiency (a random sample of up to 50)
    check_indices = np.random.choice(len(parameters_vector), min(50, len(parameters_vector)), replace=False)
    
    for idx in check_indices:
        # Create theta_plus
        theta_plus = parameters_vector.copy()
        theta_plus[idx] += epsilon
        
        # Create theta_minus
        theta_minus = parameters_vector.copy()
        theta_minus[idx] -= epsilon
        
        # Compute J_plus and J_minus
        J_plus = _compute_cost_with_parameters(network, theta_plus, X, Y)
        J_minus = _compute_cost_with_parameters(network, theta_minus, X, Y)
        
        # Compute numerical gradient
        numerical_gradients[idx] = (J_plus - J_minus) / (2 * epsilon)
    
    # Compare gradients for sampled indices
    analytical_sample = gradients_vector[check_indices]
    numerical_sample = numerical_gradients[check_indices]
    
    # Compute relative difference
    numerator = np.linalg.norm(analytical_sample - numerical_sample)
    denominator = np.linalg.norm(analytical_sample) + np.linalg.norm(numerical_sample)
    difference = numerator / (denominator + 1e-8)
    
    print(f"\nGradient Check Results:")
    print(f"Relative difference: {difference:.2e}")
    print(f"Threshold: {threshold:.2e}")
    
    if difference < threshold:
        print("✅ Gradient check PASSED! Backpropagation is correct.")
    else:
        print("❌ Gradient check FAILED! There may be a bug in backpropagation.")
        
        # Show detailed comparison for debugging
        print("\nDetailed comparison (first 10 parameters):")
        for i in range(min(10, len(check_indices))):
            idx = check_indices[i]
            print(f"Parameter {idx}: Analytical = {analytical_sample[i]:.6e}, "
                  f"Numerical = {numerical_sample[i]:.6e}, "
                  f"Diff = {abs(analytical_sample[i] - numerical_sample[i]):.6e}")
    
    return difference

def _compute_cost_with_parameters(network, parameters_vector, X, Y):
    """
    Helper function to compute cost with given parameters vector.
    The network's own parameters are restored before returning, so the
    perturbations used for gradient checking do not leak into the network.
    """
    original_parameters = network.parameters
    
    # Convert parameters vector back to a parameter dictionary
    idx = 0
    perturbed_parameters = {}
    for key, value in original_parameters.items():
        perturbed_parameters[key] = parameters_vector[idx:idx+value.size].reshape(value.shape)
        idx += value.size
    
    try:
        # Forward propagation and cost computation with the perturbed parameters
        network.parameters = perturbed_parameters
        AL = network.forward_propagation(X)
        cost = network.compute_cost(AL, Y)
    finally:
        network.parameters = original_parameters
    
    return cost

# Test gradient checking
print("Testing Gradient Checking:")
print()

# Create a smaller network for faster gradient checking
small_network = DeepNeuralNetworkWithPropagation(
    layer_sizes=[3, 4, 2, 1],
    hidden_activation='relu',
    output_activation='sigmoid',
    random_seed=42
)

# Create small test dataset
X_test = np.random.randn(3, 10)
Y_test = np.random.randint(0, 2, (1, 10))

# Perform gradient check
grad_diff = gradient_check(small_network, X_test, Y_test)

print("\nGradient checking complete!")
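
To see what the relative difference looks like for a correct and an incorrect gradient, here is a self-contained sketch on logistic regression rather than on the lab's network. The names `cost`, `grad`, `numerical_grad`, and `relative_difference` are hypothetical, the step `epsilon=1e-5` is an illustrative choice, and the "buggy" gradient (missing the `1/m` factor) is a deliberately planted error.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(3, 20))            # 3 features, 20 examples
Y_demo = rng.integers(0, 2, size=(1, 20))    # binary labels
w0 = rng.normal(size=(3, 1)) * 0.1

def cost(w):
    """Binary cross-entropy of a logistic regression model."""
    A = 1.0 / (1.0 + np.exp(-(w.T @ X_demo)))
    return float(-np.mean(Y_demo * np.log(A) + (1 - Y_demo) * np.log(1 - A)))

def grad(w):
    """Analytical gradient of the cost with respect to w."""
    A = 1.0 / (1.0 + np.exp(-(w.T @ X_demo)))
    return X_demo @ (A - Y_demo).T / X_demo.shape[1]

def numerical_grad(w, epsilon=1e-5):
    """Central-difference estimate, one parameter at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = epsilon
        g.flat[i] = (cost(w + step) - cost(w - step)) / (2 * epsilon)
    return g

def relative_difference(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

num = numerical_grad(w0)
diff_correct = relative_difference(grad(w0), num)
diff_buggy = relative_difference(grad(w0) * X_demo.shape[1], num)  # forgot 1/m

print(f"correct gradient: {diff_correct:.2e}")
print(f"buggy gradient:   {diff_buggy:.2e}")
```

A correct gradient gives a difference many orders of magnitude below 1, while a scaling bug pushes it close to 1. On the lab's network, a borderline result can also come from ReLU kinks or from the `1e-8` added inside the logarithms of `compute_cost`, not only from a backpropagation bug.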

## Step 4: Visualizing Propagation Flow

Let's visualize how information flows through the network during forward and backward propagation:

In [None]:
def visualize_propagation_flow(network, X, Y, layer_to_analyze=2):
    """
    Visualize the flow of information during propagation
    
    Parameters:
    network: neural network instance
    X: input data
    Y: labels
    layer_to_analyze: which layer to analyze in detail
    """
    print(f"Visualizing propagation flow for layer {layer_to_analyze}:")
    
    # Forward propagation
    AL = network.forward_propagation(X)
    
    # Backward propagation
    gradients = network.backward_propagation(AL, Y)
    
    # Extract information for visualization
    Z = network.cache[f'Z{layer_to_analyze}']
    A = network.cache[f'A{layer_to_analyze}']
    A_prev = network.cache[f'A{layer_to_analyze-1}']
    
    W = network.parameters[f'W{layer_to_analyze}']
    b = network.parameters[f'b{layer_to_analyze}']
    dW = gradients[f'dW{layer_to_analyze}']
    db = gradients[f'db{layer_to_analyze}']
    
    # Create visualization
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    
    # Row 1: Forward propagation
    # Input activations
    im1 = axes[0, 0].imshow(A_prev[:10, :10], cmap='RdBu', aspect='auto')
    axes[0, 0].set_title(f'Input Activations A{layer_to_analyze-1}\n(first 10x10)')
    axes[0, 0].set_xlabel('Examples')
    axes[0, 0].set_ylabel('Neurons')
    plt.colorbar(im1, ax=axes[0, 0])
    
    # Weights
    im2 = axes[0, 1].imshow(W, cmap='RdBu', aspect='auto')
    axes[0, 1].set_title(f'Weights W{layer_to_analyze}')
    axes[0, 1].set_xlabel('Input Neurons')
    axes[0, 1].set_ylabel('Output Neurons')
    plt.colorbar(im2, ax=axes[0, 1])
    
    # Linear outputs
    im3 = axes[0, 2].imshow(Z[:10, :10], cmap='RdBu', aspect='auto')
    axes[0, 2].set_title(f'Linear Outputs Z{layer_to_analyze}\n(first 10x10)')
    axes[0, 2].set_xlabel('Examples')
    axes[0, 2].set_ylabel('Neurons')
    plt.colorbar(im3, ax=axes[0, 2])
    
    # Activations
    im4 = axes[0, 3].imshow(A[:10, :10], cmap='RdBu', aspect='auto')
    axes[0, 3].set_title(f'Activations A{layer_to_analyze}\n(first 10x10)')
    axes[0, 3].set_xlabel('Examples')
    axes[0, 3].set_ylabel('Neurons')
    plt.colorbar(im4, ax=axes[0, 3])
    
    # Row 2: Backward propagation (gradients)
    # Weight gradients
    im5 = axes[1, 0].imshow(dW, cmap='RdBu', aspect='auto')
    axes[1, 0].set_title(f'Weight Gradients dW{layer_to_analyze}')
    axes[1, 0].set_xlabel('Input Neurons')
    axes[1, 0].set_ylabel('Output Neurons')
    plt.colorbar(im5, ax=axes[1, 0])
    
    # Bias gradients
    axes[1, 1].bar(range(len(db.flatten())), db.flatten())
    axes[1, 1].set_title(f'Bias Gradients db{layer_to_analyze}')
    axes[1, 1].set_xlabel('Neuron Index')
    axes[1, 1].set_ylabel('Gradient Value')
    axes[1, 1].grid(True, alpha=0.3)
    
    # Gradient statistics
    grad_stats = [
        ('W grad mean', dW.mean()),
        ('W grad std', dW.std()),
        ('b grad mean', db.mean()),
        ('b grad std', db.std())
    ]
    
    stats_text = '\n'.join([f'{stat}: {val:.6f}' for stat, val in grad_stats])
    axes[1, 2].text(0.1, 0.5, stats_text, transform=axes[1, 2].transAxes,
                    fontsize=12, verticalalignment='center',
                    bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
    axes[1, 2].set_title(f'Gradient Statistics\nLayer {layer_to_analyze}')
    axes[1, 2].axis('off')
    
    # Activation distribution
    axes[1, 3].hist(A.flatten(), bins=30, alpha=0.7, density=True)
    axes[1, 3].set_title(f'Activation Distribution\nLayer {layer_to_analyze}')
    axes[1, 3].set_xlabel('Activation Value')
    axes[1, 3].set_ylabel('Density')
    axes[1, 3].grid(True, alpha=0.3)
    axes[1, 3].axvline(A.mean(), color='red', linestyle='--', label=f'Mean: {A.mean():.3f}')
    axes[1, 3].legend()
    
    plt.tight_layout()
    plt.show()
    
    # Print summary statistics
    print(f"\nLayer {layer_to_analyze} Summary:")
    print(f"Input shape: {A_prev.shape}")
    print(f"Output shape: {A.shape}")
    print(f"Weight shape: {W.shape}")
    print(f"Bias shape: {b.shape}")
    print(f"Activation range: [{A.min():.4f}, {A.max():.4f}]")
    print(f"Gradient magnitudes: dW={np.linalg.norm(dW):.6f}, db={np.linalg.norm(db):.6f}")

# Visualize propagation for our test network
print("Propagation Flow Visualization:")
print()

# Use the test network with more data for better visualization
X_vis = np.random.randn(4, 50)
Y_vis = np.random.randint(0, 2, (1, 50))

visualize_propagation_flow(test_network, X_vis, Y_vis, layer_to_analyze=2)

print("\nVisualization complete!")

## Step 5: Debugging Propagation Issues

Let's create tools to debug common propagation problems:

In [None]:
class PropagationDebugger:
    """
    Tools for debugging propagation issues
    """
    
    @staticmethod
    def check_vanishing_exploding_gradients(network, threshold_vanishing=1e-6, threshold_exploding=10):
        """
        Check for vanishing or exploding gradients
        
        Parameters:
        network: neural network instance
        threshold_vanishing: threshold below which gradients are considered vanishing
        threshold_exploding: threshold above which gradients are considered exploding
        
        Returns:
        analysis: dictionary with gradient analysis
        """
        analysis = {
            'vanishing_layers': [],
            'exploding_layers': [],
            'gradient_norms': {},
            'status': 'healthy'
        }
        
        print("Checking for vanishing/exploding gradients:")
        print("-" * 50)
        
        for key, grad in network.gradients.items():
            if 'dW' in key:
                layer_num = int(key.replace('dW', ''))
                grad_norm = np.linalg.norm(grad)
                analysis['gradient_norms'][f'Layer_{layer_num}'] = grad_norm
                
                print(f"Layer {layer_num}: ||dW|| = {grad_norm:.6e}")
                
                if grad_norm < threshold_vanishing:
                    analysis['vanishing_layers'].append(layer_num)
                    print(f"  ⚠️  Potential vanishing gradient!")
                    
                elif grad_norm > threshold_exploding:
                    analysis['exploding_layers'].append(layer_num)
                    print(f"  ⚠️  Potential exploding gradient!")
                    
                else:
                    print(f"  ✅ Gradient magnitude is healthy")
        
        # Set overall status
        if analysis['vanishing_layers'] or analysis['exploding_layers']:
            analysis['status'] = 'problematic'
        
        return analysis
    
    @staticmethod
    def analyze_activation_distributions(network):
        """
        Analyze activation distributions across layers
        
        Parameters:
        network: neural network instance
        
        Returns:
        analysis: dictionary with activation analysis
        """
        analysis = {
            'layer_stats': {},
            'potential_issues': []
        }
        
        print("\nAnalyzing activation distributions:")
        print("-" * 50)
        
        for layer in range(network.num_layers):
            if f'A{layer}' in network.cache:
                activations = network.cache[f'A{layer}']
                
                stats = {
                    'mean': activations.mean(),
                    'std': activations.std(),
                    'min': activations.min(),
                    'max': activations.max(),
                    'zeros_percentage': (activations == 0).mean() * 100
                }
                
                analysis['layer_stats'][f'Layer_{layer}'] = stats
                
                print(f"Layer {layer}:")
                print(f"  Shape: {activations.shape}")
                print(f"  Mean: {stats['mean']:.4f}, Std: {stats['std']:.4f}")
                print(f"  Range: [{stats['min']:.4f}, {stats['max']:.4f}]")
                print(f"  Zeros: {stats['zeros_percentage']:.1f}%")
                
                # Check for potential issues
                if layer > 0:  # Skip input layer
                    if stats['zeros_percentage'] > 50:
                        analysis['potential_issues'].append(
                            f"Layer {layer}: High percentage of zeros ({stats['zeros_percentage']:.1f}%) - possible dead neurons"
                        )
                    
                    if abs(stats['mean']) > 10:
                        analysis['potential_issues'].append(
                            f"Layer {layer}: Large mean activation ({stats['mean']:.4f}) - possible exploding activations"
                        )
                    
                    if stats['std'] < 0.01:
                        analysis['potential_issues'].append(
                            f"Layer {layer}: Very low standard deviation ({stats['std']:.4f}) - possible vanishing activations"
                        )
                
                print()
        
        if analysis['potential_issues']:
            print("⚠️  Potential Issues Detected:")
            for issue in analysis['potential_issues']:
                print(f"  - {issue}")
        else:
            print("✅ No activation distribution issues detected")
        
        return analysis
    
    @staticmethod
    def visualize_gradient_flow(network):
        """
        Visualize gradient flow across layers
        
        Parameters:
        network: neural network instance
        """
        # Collect weight-gradient norms in numeric layer order
        # (a plain string sort would place 'dW10' before 'dW2')
        weight_keys = [key for key in network.gradients if key.startswith('dW')]
        layers = sorted(int(key[2:]) for key in weight_keys)
        gradient_norms = [np.linalg.norm(network.gradients[f'dW{n}']) for n in layers]
        
        plt.figure(figsize=(12, 6))
        
        # Plot gradient norms
        plt.subplot(1, 2, 1)
        plt.plot(layers, gradient_norms, 'bo-', linewidth=2, markersize=8)
        plt.yscale('log')
        plt.xlabel('Layer Number')
        plt.ylabel('Gradient Norm (log scale)')
        plt.title('Gradient Flow Across Layers')
        plt.grid(True, alpha=0.3)
        
        # Add horizontal lines for thresholds
        plt.axhline(y=1e-6, color='r', linestyle='--', alpha=0.7, label='Vanishing threshold')
        plt.axhline(y=10, color='r', linestyle='--', alpha=0.7, label='Exploding threshold')
        plt.legend()
        
        # Plot gradient ratio between consecutive layers
        plt.subplot(1, 2, 2)
        if len(gradient_norms) > 1:
            ratios = [gradient_norms[i+1] / gradient_norms[i] for i in range(len(gradient_norms)-1)]
            layer_pairs = [f"{layers[i]}->{layers[i+1]}" for i in range(len(layers)-1)]
            
            plt.bar(range(len(ratios)), ratios, alpha=0.7)
            plt.yscale('log')
            plt.xlabel('Layer Transition')
            plt.ylabel('Gradient Ratio (log scale)')
            plt.title('Gradient Ratio Between Consecutive Layers')
            plt.xticks(range(len(ratios)), layer_pairs, rotation=45)
            plt.grid(True, alpha=0.3)
            
            # Add reference line at ratio = 1
            plt.axhline(y=1, color='g', linestyle='-', alpha=0.7, label='Ratio = 1')
            plt.legend()
        
        plt.tight_layout()
        plt.show()

# Test the debugging tools
print("Testing Propagation Debugging Tools:")
print()

# Create test scenario with potential issues
debugger = PropagationDebugger()

# Run forward and backward propagation first
AL_debug = test_network.forward_propagation(X_vis)
gradients_debug = test_network.backward_propagation(AL_debug, Y_vis)

# Check for vanishing/exploding gradients
gradient_analysis = debugger.check_vanishing_exploding_gradients(test_network)

# Analyze activation distributions
activation_analysis = debugger.analyze_activation_distributions(test_network)

# Visualize gradient flow
print("\nVisualizing gradient flow:")
debugger.visualize_gradient_flow(test_network)

print("\n✅ Debugging analysis complete!")
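
The test network above is shallow and He-initialized, so these checks may well report healthy gradients. As a contrast, here is a hedged sketch of a case where `||dW||` shrinks toward the input: an 8-layer all-sigmoid network with weights drawn as `0.01 * randn`. The helper `layer_gradient_norms` and every size in it are hypothetical choices for illustration, and the backward pass is a stripped-down version of the lab's formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(layer_sizes, weight_scale, seed=0):
    """Return ||dW|| per layer (input side first) for an all-sigmoid network."""
    rng = np.random.default_rng(seed)
    m = 64
    X = rng.normal(size=(layer_sizes[0], m))
    Y = rng.integers(0, 2, size=(layer_sizes[-1], m))

    # Small random weights; biases are omitted (treated as zero)
    weights = [rng.normal(size=(n_out, n_in)) * weight_scale
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    # Forward pass, caching every activation
    activations = [X]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass: sigmoid output with binary cross-entropy gives dZ = AL - Y
    dZ = activations[-1] - Y
    norms = []
    for layer in range(len(weights) - 1, -1, -1):
        dW = dZ @ activations[layer].T / m
        norms.append(np.linalg.norm(dW))
        if layer > 0:
            # activations[layer] is a sigmoid output, so its derivative is A * (1 - A)
            A_prev = activations[layer]
            dZ = (weights[layer].T @ dZ) * A_prev * (1 - A_prev)
    return norms[::-1]

norms = layer_gradient_norms([16] * 8 + [1], weight_scale=0.01)
for i, n in enumerate(norms, start=1):
    print(f"Layer {i}: ||dW|| = {n:.3e}")
```

Each step back toward the input multiplies by a small weight matrix and by a sigmoid derivative of at most 0.25, so the norms can be expected to fall by orders of magnitude between the output and the first layer. Gradients like these are what the `threshold_vanishing` test in `check_vanishing_exploding_gradients` is meant to catch.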

## Step 6: Performance Optimization

Let's explore techniques to optimize propagation performance:

In [None]:
import time

def benchmark_propagation(layer_sizes, num_examples, num_iterations=10):
    """
    Benchmark forward and backward propagation performance
    
    Parameters:
    layer_sizes: architecture of the network
    num_examples: number of training examples
    num_iterations: number of iterations to average over
    
    Returns:
    results: dictionary with timing results
    """
    print(f"Benchmarking propagation for architecture {layer_sizes} with {num_examples} examples:")
    
    # Create network
    network = DeepNeuralNetworkWithPropagation(
        layer_sizes=layer_sizes,
        hidden_activation='relu',
        output_activation='sigmoid',
        random_seed=42
    )
    
    # Create test data
    X = np.random.randn(layer_sizes[0], num_examples)
    Y = np.random.randint(0, 2, (layer_sizes[-1], num_examples))
    
    # Benchmark forward propagation
    # time.perf_counter() is a monotonic, high-resolution clock suited to short intervals
    forward_times = []
    for _ in range(num_iterations):
        start_time = time.perf_counter()
        AL = network.forward_propagation(X)
        forward_times.append(time.perf_counter() - start_time)
    
    # Benchmark backward propagation
    backward_times = []
    for _ in range(num_iterations):
        AL = network.forward_propagation(X)  # Need forward pass for backward
        start_time = time.perf_counter()
        gradients = network.backward_propagation(AL, Y)
        backward_times.append(time.perf_counter() - start_time)
    
    # Benchmark full forward+backward pass
    full_times = []
    for _ in range(num_iterations):
        start_time = time.perf_counter()
        AL = network.forward_propagation(X)
        gradients = network.backward_propagation(AL, Y)
        full_times.append(time.perf_counter() - start_time)
    
    results = {
        'forward_time': np.mean(forward_times),
        'forward_std': np.std(forward_times),
        'backward_time': np.mean(backward_times),
        'backward_std': np.std(backward_times),
        'full_time': np.mean(full_times),
        'full_std': np.std(full_times),
        'parameters': network._count_parameters(),
        'architecture': layer_sizes,
        'examples': num_examples
    }
    
    print(f"Forward propagation: {results['forward_time']*1000:.2f} ± {results['forward_std']*1000:.2f} ms")
    print(f"Backward propagation: {results['backward_time']*1000:.2f} ± {results['backward_std']*1000:.2f} ms")
    print(f"Full pass: {results['full_time']*1000:.2f} ± {results['full_std']*1000:.2f} ms")
    print(f"Parameters: {results['parameters']:,}")
    print()
    
    return results

def compare_architectures():
    """
    Compare performance of different architectures
    """
    print("Performance Comparison of Different Architectures:")
    print("=" * 60)
    
    architectures = [
        [100, 50, 10],
        [100, 100, 50, 10],
        [100, 200, 100, 50, 10],
        [100, 500, 200, 50, 10],
    ]
    
    num_examples = 1000
    results = []
    
    for arch in architectures:
        print(f"\nTesting architecture: {arch}")
        print("-" * 40)
        result = benchmark_propagation(arch, num_examples, num_iterations=5)
        results.append(result)
    
    # Visualize results
    plt.figure(figsize=(15, 10))
    
    # Plot timing comparison
    plt.subplot(2, 2, 1)
    architectures_labels = [str(r['architecture']) for r in results]
    forward_times = [r['forward_time']*1000 for r in results]
    backward_times = [r['backward_time']*1000 for r in results]
    
    x = range(len(results))
    width = 0.35
    
    plt.bar([i - width/2 for i in x], forward_times, width, label='Forward', alpha=0.8)
    plt.bar([i + width/2 for i in x], backward_times, width, label='Backward', alpha=0.8)
    
    plt.xlabel('Architecture')
    plt.ylabel('Time (ms)')
    plt.title('Propagation Time Comparison')
    plt.xticks(x, [f'Arch {i+1}' for i in range(len(results))], rotation=45)
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Plot parameters vs time
    plt.subplot(2, 2, 2)
    parameters = [r['parameters'] for r in results]
    full_times = [r['full_time']*1000 for r in results]
    
    plt.scatter(parameters, full_times, s=100, alpha=0.7)
    plt.xlabel('Number of Parameters')
    plt.ylabel('Full Pass Time (ms)')
    plt.title('Parameters vs Execution Time')
    plt.grid(True, alpha=0.3)
    
    # Annotate points (loop variable named `elapsed` so it does not shadow the `time` module)
    for i, (param, elapsed) in enumerate(zip(parameters, full_times)):
        plt.annotate(f'Arch {i+1}', (param, elapsed), xytext=(5, 5), 
                    textcoords='offset points', fontsize=9)
    
    # Plot depth vs time
    plt.subplot(2, 2, 3)
    depths = [len(r['architecture']) - 1 for r in results]  # Layers with weights (hidden + output)
    
    plt.scatter(depths, full_times, s=100, alpha=0.7, color='green')
    plt.xlabel('Network Depth (Hidden + Output Layers)')
    plt.ylabel('Full Pass Time (ms)')
    plt.title('Network Depth vs Execution Time')
    plt.grid(True, alpha=0.3)
    
    # Plot efficiency (time per parameter, in microseconds)
    plt.subplot(2, 2, 4)
    efficiency = [r['full_time'] * 1e6 / r['parameters'] for r in results]
    
    plt.bar(range(len(results)), efficiency, alpha=0.7, color='orange')
    plt.xlabel('Architecture')
    plt.ylabel('Time per Parameter (µs/param)')
    plt.title('Computational Efficiency')
    plt.xticks(range(len(results)), [f'Arch {i+1}' for i in range(len(results))], rotation=45)
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return results

# Run performance benchmarks
print("Performance Benchmarking:")
print()

benchmark_results = compare_architectures()

print("\n📊 Performance analysis complete!")
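
The benchmark above averages a handful of timed runs. As a complementary sketch, the helper below discards a warm-up run and reports the median and minimum, which are less sensitive to one-off stalls than the mean. The name `time_call` and the single-layer workload are assumptions made for illustration; they are not part of the lab's network class.

```python
import statistics
import time

import numpy as np


def time_call(fn, *args, repeats=5, warmup=1):
    """Hypothetical helper: return per-call timings in seconds for fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are discarded (caches, lazy allocations)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args)
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload: one dense ReLU layer, not the lab's network class
rng = np.random.default_rng(0)
W = rng.standard_normal((50, 100))
X = rng.standard_normal((100, 1000))

timings = time_call(lambda: np.maximum(0, W @ X), repeats=5)
print(f"median: {statistics.median(timings) * 1000:.3f} ms, "
      f"min: {min(timings) * 1000:.3f} ms")
```

To time the lab's network instead, pass a callable such as `lambda: network.forward_propagation(X)` in place of the stand-in workload.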

## Step 7: Progress Tracking and Key Concepts Summary

Let's summarize what we've learned and check our progress:

In [None]:
# Progress Tracking Checklist
progress_checklist = {
    "Implementing activation functions and derivatives": True,
    "Building deep network with forward propagation": True,
    "Implementing backward propagation with chain rule": True,
    "Creating gradient checking for verification": True,
    "Visualizing propagation flow through layers": True,
    "Debugging vanishing/exploding gradient issues": True,
    "Analyzing activation distributions": True,
    "Benchmarking propagation performance": True,
    "Understanding caching for efficient backprop": True
}

print("Progress Tracking Checklist:")
print("=" * 50)
for item, completed in progress_checklist.items():
    status = "✅" if completed else "❌"
    print(f"{status} {item}")

completed_items = sum(progress_checklist.values())
total_items = len(progress_checklist)
print(f"\nProgress: {completed_items}/{total_items} ({completed_items/total_items*100:.1f}%) Complete")

print("\n" + "=" * 60)
print("KEY CONCEPTS SUMMARY")
print("=" * 60)

key_concepts = {
    "Forward Propagation": "Sequential computation of outputs through network layers",
    "Backward Propagation": "Gradient computation using chain rule in reverse order",
    "Chain Rule": "Mathematical foundation for computing gradients in deep networks",
    "Activation Functions": "Non-linear functions that enable networks to learn complex patterns",
    "Caching": "Storing intermediate values for efficient gradient computation",
    "Gradient Checking": "Numerical verification of analytical gradient computations",
    "Vanishing Gradients": "Problem where gradients become too small in deep networks",
    "Exploding Gradients": "Problem where gradients become too large and unstable",
    "Performance Optimization": "Techniques to improve computational efficiency of propagation"
}

for concept, description in key_concepts.items():
    print(f"\n{concept}:")
    print(f"  {description}")

print("\n" + "=" * 60)
print("MATHEMATICAL FOUNDATIONS")
print("=" * 60)

math_foundations = [
    "Forward Pass: A[l] = g(Z[l]) where Z[l] = W[l] @ A[l-1] + b[l]  (@ is the matrix product)",
    "Chain Rule: dC/dW[l] = dC/dA[l] * dA[l]/dZ[l] * dZ[l]/dW[l]",
    "Pre-activation Gradient: dZ[l] = dA[l] * g'(Z[l])  (elementwise product)",
    "Weight Gradient: dW[l] = (1/m) * dZ[l] @ A[l-1].T",
    "Bias Gradient: db[l] = (1/m) * sum(dZ[l], axis=1, keepdims=True)",
    "Activation Gradient: dA[l-1] = W[l].T @ dZ[l]",
    "ReLU Derivative: g'(z) = 1 if z > 0, else 0",
    "Sigmoid Derivative: g'(z) = g(z) * (1 - g(z))"
]

for i, formula in enumerate(math_foundations, 1):
    print(f"{i}. {formula}")

print("\n" + "=" * 60)
print("DEBUGGING CHECKLIST")
print("=" * 60)

debugging_checklist = [
    "✓ Check gradient magnitudes (not too small/large)",
    "✓ Verify activation distributions are reasonable",
    "✓ Ensure no NaN or Inf values in computations",
    "✓ Confirm gradient checking passes (<1e-7 difference)",
    "✓ Monitor for dead neurons (high percentage of zeros)",
    "✓ Validate input data preprocessing and scaling",
    "✓ Check weight initialization is appropriate",
    "✓ Verify mathematical implementation matches theory"
]

for item in debugging_checklist:
    print(f"  {item}")

print("\n" + "=" * 60)
print("NEXT STEPS")
print("=" * 60)
print("1. Implement training loops with optimization algorithms")
print("2. Add regularization techniques (L1, L2, dropout)")
print("3. Implement different cost functions")
print("4. Add learning rate scheduling")
print("5. Test on real classification/regression problems")
print("6. Implement batch processing and mini-batch gradient descent")
print("7. Add momentum and advanced optimizers (Adam, RMSprop)")

## Lab Cleanup Instructions

### Windows Users:
1. Close all Jupyter notebook tabs
2. Press `Ctrl+C` in the command prompt to stop the Jupyter server (confirm with `y` if prompted)
3. Type `conda deactivate` (conda) or `deactivate` (venv/virtualenv) to exit the virtual environment
4. Close the command prompt

### Mac Users:
1. Close all Jupyter notebook tabs
2. Press `Ctrl+C` in the terminal to stop the Jupyter server (confirm with `y` if prompted)
3. Type `conda deactivate` (conda) or `deactivate` (venv/virtualenv) to exit the virtual environment
4. Close the terminal

### Save Your Work:
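- Jupyter autosaves your notebook periodically; save manually (`Ctrl+S`, or `Cmd+S` on Mac) before closing
- Consider saving a copy with your name: `lab_4_2_[your_name].ipynb`
- Export as HTML for offline viewing: File → Download as → HTML (in JupyterLab: File → Save and Export Notebook As → HTML)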

## Troubleshooting Guide

### Common Issues and Solutions:

**Issue 1: Gradient checking fails**
- **Solution**: Check activation function derivatives
- **Check**: Ensure epsilon value is appropriate (1e-7); a much larger value adds truncation error, a much smaller one amplifies floating-point round-off
- **Debug**: Print intermediate gradient values to locate error
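
To help isolate a failing gradient check, here is a minimal, self-contained sketch on a single sigmoid layer with binary cross-entropy. It compares the analytic weight gradient with a central-difference estimate and reports a relative difference. The function names and the one-layer model are assumptions for illustration; the lab's own gradient checker may be organized differently.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(W, b, X, Y):
    """Binary cross-entropy of a single sigmoid layer (illustrative model)."""
    A = sigmoid(W @ X + b)
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)


def analytic_grad_W(W, b, X, Y):
    """dW = (1/m) * dZ @ X.T, with dZ = A - Y for sigmoid + cross-entropy."""
    A = sigmoid(W @ X + b)
    return (A - Y) @ X.T / X.shape[1]


def numerical_grad_W(W, b, X, Y, epsilon=1e-7):
    """Central difference (J(w + eps) - J(w - eps)) / (2 * eps), one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += epsilon
        W_minus[idx] -= epsilon
        grad[idx] = (cost(W_plus, b, X, Y) - cost(W_minus, b, X, Y)) / (2 * epsilon)
    return grad


rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W = rng.standard_normal((1, 3)) * 0.1
b = np.zeros((1, 1))

g_analytic = analytic_grad_W(W, b, X, Y)
g_numeric = numerical_grad_W(W, b, X, Y)

# Relative difference: ||a - n|| / (||a|| + ||n||)
rel_diff = np.linalg.norm(g_analytic - g_numeric) / (
    np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)
)
print(f"relative difference: {rel_diff:.2e}")
```

If a check like this passes on one layer but the full network fails, comparing gradients layer by layer usually narrows the error to a specific activation derivative or transpose.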

**Issue 2: NaN or Inf values in gradients**
- **Solution**: Add numerical stability (clip extreme values)
- **Check**: Weight initialization may be too large
- **Fix**: Use proper initialization (He or Xavier)
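
The sketch below illustrates the two suggestions above: scaled initialization and an early check for non-finite values. `init_weights` and `assert_finite` are hypothetical helpers, and the Xavier branch uses the simple `1 / n_in` variance variant; the lab's network class may initialize its weights differently.

```python
import numpy as np


def init_weights(n_out, n_in, method="he", rng=None):
    """Illustrative initializer; the name and signature are hypothetical."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "he":        # variance 2 / n_in, commonly paired with ReLU
        scale = np.sqrt(2.0 / n_in)
    elif method == "xavier":  # variance 1 / n_in, one common Xavier variant
        scale = np.sqrt(1.0 / n_in)
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((n_out, n_in)) * scale


def assert_finite(name, array):
    """Fail fast, naming the offending array, if NaN or Inf appears."""
    if not np.all(np.isfinite(array)):
        raise FloatingPointError(f"{name} contains NaN or Inf")


rng = np.random.default_rng(0)
W1 = init_weights(200, 100, method="he", rng=rng)
assert_finite("W1", W1)
print(f"W1 std: {W1.std():.3f} (target {np.sqrt(2.0 / 100):.3f})")
```

Calling a check like `assert_finite` on each layer's `Z`, `A`, and gradients during debugging shows where the first non-finite value appears, rather than only its downstream effect.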

**Issue 3: Very slow execution**
- **Solution**: Reduce network size or number of examples
- **Optimize**: Use vectorized operations instead of loops
- **Check**: Memory usage and available RAM
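
As a small illustration of the vectorization advice, the following sketch computes the same linear step `Z = W X + b` two ways: one example at a time in a Python loop, and as a single matrix product. The array sizes and function names are arbitrary choices for the demonstration.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 100))
b = rng.standard_normal((50, 1))
X = rng.standard_normal((100, 200))


def linear_loop(W, b, X):
    """Z = W X + b computed one example (column) at a time."""
    Z = np.zeros((W.shape[0], X.shape[1]))
    for j in range(X.shape[1]):
        Z[:, j] = W @ X[:, j] + b[:, 0]
    return Z


def linear_vectorized(W, b, X):
    """Same computation as one matrix product; b broadcasts across columns."""
    return W @ X + b


start = time.perf_counter()
Z_loop = linear_loop(W, b, X)
loop_time = time.perf_counter() - start

start = time.perf_counter()
Z_vec = linear_vectorized(W, b, X)
vec_time = time.perf_counter() - start

print(f"loop: {loop_time * 1000:.3f} ms, vectorized: {vec_time * 1000:.3f} ms")
print("results match:", np.allclose(Z_loop, Z_vec))
```

The exact timings depend on the machine, so the point of the comparison is that both versions produce the same `Z` while the vectorized one avoids per-example Python overhead.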

**Issue 4: Vanishing gradients detected**
- **Solution**: Use ReLU activation instead of sigmoid/tanh
- **Try**: Different initialization method (He for ReLU)
- **Consider**: Batch normalization or residual connections
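
To see why the activation choice matters, the sketch below pushes a fixed gradient signal backward through a stack of randomly initialized layers and reports the norm of `dZ` at the first layer. It is a simplified experiment (no biases, no cost function, a constant upstream gradient), and `first_layer_grad_norm` is a hypothetical helper, not part of the lab's network class.

```python
import numpy as np


def first_layer_grad_norm(activation, depth=20, width=64, seed=0):
    """Norm of dZ at layer 1 after backpropagating a fixed signal from the last layer."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, 32))
    caches = []
    for _ in range(depth):
        if activation == "relu":
            W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # He scaling
            Z = W @ A
            A, dAdZ = np.maximum(0, Z), (Z > 0).astype(float)
        else:  # sigmoid
            W = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)  # Xavier-style scaling
            Z = W @ A
            S = 1.0 / (1.0 + np.exp(-Z))
            A, dAdZ = S, S * (1.0 - S)
        caches.append((W, dAdZ))  # cache what the backward pass needs

    dA = np.ones_like(A)  # stand-in for the gradient arriving from the cost
    for W, dAdZ in reversed(caches):
        dZ = dA * dAdZ    # elementwise; g'(Z) is at most 0.25 for sigmoid
        dA = W.T @ dZ
    return float(np.linalg.norm(dZ))


relu_norm = first_layer_grad_norm("relu")
sigmoid_norm = first_layer_grad_norm("sigmoid")
print(f"ReLU first-layer gradient norm:    {relu_norm:.3e}")
print(f"Sigmoid first-layer gradient norm: {sigmoid_norm:.3e}")
```

Because the sigmoid derivative never exceeds 0.25, each layer shrinks the backward signal, and the effect compounds with depth; varying `depth` in the sketch makes the trend visible.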

**Issue 5: Memory errors with large networks**
- **Solution**: Reduce batch size or network complexity
- **Alternative**: Process data in smaller chunks
- **Check**: System memory and close other applications
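
One way to process data in smaller chunks is to run the forward pass on blocks of columns and concatenate the results. The sketch below assumes examples are stored as columns, as elsewhere in this lab; `forward_in_chunks` and `toy_forward` are hypothetical names used only for illustration.

```python
import numpy as np


def forward_in_chunks(forward_fn, X, chunk_size=256):
    """Apply forward_fn to column blocks of X and stitch the outputs together.

    forward_fn is any callable mapping (n_x, m_chunk) -> (n_y, m_chunk).
    """
    outputs = []
    for start in range(0, X.shape[1], chunk_size):
        outputs.append(forward_fn(X[:, start:start + chunk_size]))
    return np.concatenate(outputs, axis=1)


rng = np.random.default_rng(0)
W = rng.standard_normal((10, 100)) * 0.1
X = rng.standard_normal((100, 1000))


def toy_forward(X_chunk):
    """Stand-in for a network's forward pass: one ReLU layer."""
    return np.maximum(0, W @ X_chunk)


A_full = toy_forward(X)
A_chunked = forward_in_chunks(toy_forward, X, chunk_size=256)
print(A_chunked.shape, np.allclose(A_full, A_chunked))
```

This covers inference only. For training, gradients computed per chunk would need to be combined with weights proportional to each chunk's size so that the result matches the full-batch average.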

**Issue 6: Incorrect gradient shapes**
- **Solution**: Verify matrix dimensions in forward pass
- **Check**: Transpose operations are correct
- **Debug**: Print shapes at each step
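
A quick way to catch shape errors is to assert that every gradient has the same shape as the parameter it updates. The sketch below assumes parameters keyed as `W1`/`b1` and gradients keyed as `dW1`/`db1`; that naming, and the helper `check_gradient_shapes`, are assumptions, so adapt them to however your network stores its values.

```python
import numpy as np


def check_gradient_shapes(parameters, gradients):
    """Raise if any gradient's shape differs from its parameter's shape."""
    for name, value in parameters.items():
        grad = gradients["d" + name]
        if grad.shape != value.shape:
            raise ValueError(
                f"d{name} has shape {grad.shape}, expected {value.shape}"
            )


rng = np.random.default_rng(0)
m = 8
A0 = rng.standard_normal((4, m))   # (n_0, m)
parameters = {"W1": rng.standard_normal((3, 4)), "b1": np.zeros((3, 1))}
dZ1 = rng.standard_normal((3, m))  # (n_1, m)

gradients = {
    "dW1": dZ1 @ A0.T / m,                      # (n_1, m) @ (m, n_0) -> (n_1, n_0)
    "db1": dZ1.sum(axis=1, keepdims=True) / m,  # keepdims keeps the (n_1, 1) column
}
check_gradient_shapes(parameters, gradients)
print({k: v.shape for k, v in gradients.items()})
```

A frequent cause of mismatches is summing `dZ` without `keepdims=True`, which yields shape `(n_1,)` instead of `(n_1, 1)` and then broadcasts in unintended ways.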

### Getting Help:
- Check error messages for specific line numbers
- Try restarting the kernel: Kernel → Restart & Clear Output
- Use print statements to debug intermediate values
- Ask instructor or teaching assistant for complex issues
- Refer to NumPy documentation: https://numpy.org/doc/