# Deep Learning Fundamentals: From Shallow to Deep Networks
## Interactive Learning Notebook

**Based on Lecture 7: From Logistic Regression to Multi-layer Perceptrons**

This notebook provides hands-on practice with deep learning concepts including:
- Understanding the limitations of shallow networks
- Implementing deep neural networks from scratch
- Exploring gradient problems and solutions
- Working with modern activation functions
- Visualizing network behavior and performance

### Learning Objectives
1. Understand why deep networks are more powerful than shallow ones
2. Implement and compare different activation functions
3. Diagnose and solve gradient problems
4. Build practical deep learning models

---

## Part 1: Environment Setup and Imports
Let's import all necessary libraries for our deep learning experiments.

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Deep Learning imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Interactive visualizations
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
tf.random.set_seed(42)

print(f'TensorFlow version: {tf.__version__}')
print(f'Keras version: {keras.__version__}')
print('Setup complete!')

---
## Exercise 1: Understanding Shallow vs Deep Networks

### Concept: Universal Approximation Theorem
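The Universal Approximation Theorem states that a feedforward network with a single hidden layer and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. The theorem says nothing about how many units are needed: for some functions the required width grows exponentially, while a deeper network can represent them with far fewer units.

Let's compare a wide shallow network with a deeper, narrower one on a dataset with a highly non-linear decision boundary.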
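
Before training anything, it helps to count parameters so that any claim about efficiency rests on actual numbers. The sketch below is a plain-Python counter for fully connected networks. `count_dense_params` is a hypothetical helper (not a Keras function); for models built only from `Dense` layers it should agree with `model.count_params()`. The two size lists mirror the architectures built in the next code cell.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer: input first, output last.
    (Hypothetical helper for illustration; not part of Keras.)
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


shallow = count_dense_params([2, 1000, 3])         # one hidden layer of 1000 units
deep = count_dense_params([2, 100, 100, 100, 3])   # three hidden layers of 100 units

print(f"shallow: {shallow:,} parameters")
print(f"deep:    {deep:,} parameters")
print(f"deep / shallow: {deep / shallow:.2f}x")
```

With these sizes the shallow network has 6,003 parameters and the deep one has 20,803, so the deep network is the larger of the two. For a like-for-like comparison, pick widths with a similar budget, for example three hidden layers of 50 units (5,403 parameters) against the 1,000-unit shallow network.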

In [None]:
# Generate a complex non-linear dataset
def generate_spiral_data(n_samples=1000, n_classes=3, noise=0.2):
    """Generate spiral dataset for classification"""
    np.random.seed(42)
    n_per_class = n_samples // n_classes
    
    X = []
    y = []
    
    for class_idx in range(n_classes):
        theta = np.linspace(0, 4 * np.pi, n_per_class) + (class_idx * 2 * np.pi / n_classes)
        radius = np.linspace(0, 1, n_per_class)
        
        x1 = radius * np.cos(theta) + np.random.randn(n_per_class) * noise
        x2 = radius * np.sin(theta) + np.random.randn(n_per_class) * noise
        
        X.append(np.column_stack([x1, x2]))
        y.append(np.full(n_per_class, class_idx))
    
    return np.vstack(X), np.hstack(y)

# Generate data
X_spiral, y_spiral = generate_spiral_data()

# Visualize the dataset
fig = go.Figure()
for class_idx in range(3):
    mask = y_spiral == class_idx
    fig.add_trace(go.Scatter(
        x=X_spiral[mask, 0],
        y=X_spiral[mask, 1],
        mode='markers',
        name=f'Class {class_idx}',
        marker=dict(size=5)
    ))

fig.update_layout(
    title='Complex Spiral Dataset for Classification',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2',
    height=500,
    width=700
)
fig.show()

print(f"Dataset shape: X={X_spiral.shape}, y={y_spiral.shape}")
print(f"Number of classes: {len(np.unique(y_spiral))}")

In [None]:
# Build and compare shallow vs deep networks
def build_shallow_network(input_dim, hidden_units, n_classes):
    """Build a shallow network with one hidden layer"""
    model = models.Sequential([
        layers.Dense(hidden_units, activation='relu', input_shape=(input_dim,)),
        layers.Dense(n_classes, activation='softmax')
    ])
    return model

def build_deep_network(input_dim, layer_sizes, n_classes):
    """Build a deep network with multiple hidden layers"""
    model = models.Sequential()
    model.add(layers.Dense(layer_sizes[0], activation='relu', input_shape=(input_dim,)))
    
    for size in layer_sizes[1:]:
        model.add(layers.Dense(size, activation='relu'))
    
    model.add(layers.Dense(n_classes, activation='softmax'))
    return model

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_spiral, y_spiral, test_size=0.2, random_state=42, stratify=y_spiral
)

# Build models
shallow_model = build_shallow_network(2, 1000, 3)  # 1000 hidden units
deep_model = build_deep_network(2, [100, 100, 100], 3)  # 3 layers of 100 units each

# Compile models
for model in [shallow_model, deep_model]:
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

# Display model summaries
print("SHALLOW NETWORK (1 hidden layer with 1000 units):")
print(f"Total parameters: {shallow_model.count_params():,}")
print("\nDEEP NETWORK (3 hidden layers with 100 units each):")
print(f"Total parameters: {deep_model.count_params():,}")
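# Note: with these sizes the deep network has MORE parameters than the shallow one
# (20,803 vs 6,003), so report the ratio instead of claiming a saving.
print(f"\nParameter ratio (deep / shallow): {deep_model.count_params() / shallow_model.count_params():.2f}x")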

In [None]:
# Train and compare both models
history_shallow = shallow_model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

history_deep = deep_model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history_shallow.history['loss'], label='Shallow - Train', linewidth=2)
ax1.plot(history_shallow.history['val_loss'], label='Shallow - Val', linewidth=2, linestyle='--')
ax1.plot(history_deep.history['loss'], label='Deep - Train', linewidth=2)
ax1.plot(history_deep.history['val_loss'], label='Deep - Val', linewidth=2, linestyle='--')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Comparison')
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(history_shallow.history['accuracy'], label='Shallow - Train', linewidth=2)
ax2.plot(history_shallow.history['val_accuracy'], label='Shallow - Val', linewidth=2, linestyle='--')
ax2.plot(history_deep.history['accuracy'], label='Deep - Train', linewidth=2)
ax2.plot(history_deep.history['val_accuracy'], label='Deep - Val', linewidth=2, linestyle='--')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training Accuracy Comparison')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Evaluate on test set
shallow_acc = shallow_model.evaluate(X_test, y_test, verbose=0)[1]
deep_acc = deep_model.evaluate(X_test, y_test, verbose=0)[1]

print(f"\nTest Accuracy:")
print(f"Shallow Network: {shallow_acc:.3f}")
print(f"Deep Network: {deep_acc:.3f}")
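# Report both differences neutrally: the deep model here is larger, not smaller.
param_ratio = deep_model.count_params() / shallow_model.count_params()
print(f"\nAccuracy difference (deep - shallow): {deep_acc - shallow_acc:+.3f}")
print(f"Parameter ratio (deep / shallow): {param_ratio:.2f}x")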

### 🔍 Key Insights:
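- **Parameter Efficiency**: For many compositional functions, a deep network can match a much wider shallow network with fewer units. The two models above are not a like-for-like test of this: the deep network (20,803 parameters) is larger than the shallow one (6,003), so adjust the widths before drawing conclusions
- **Hierarchical Learning**: Deep networks build features layer by layer, from simple to complex
- **Generalization**: Deep networks often generalize well in practice despite having the capacity to overfit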

### 💡 Your Turn:
Modify the network architectures above:
1. Try different numbers of layers in the deep network (2, 4, 5 layers)
2. Experiment with different layer widths (50, 200, 500 units)
3. Compare the decision boundaries of the trained models, for example by predicting on a dense grid of points and plotting the predicted class

---
## Exercise 2: Visualizing the Vanishing Gradient Problem

### Concept: Gradient Flow in Deep Networks
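During backpropagation, the gradient that reaches an early layer is a product of one factor per later layer, and each factor combines that layer's weights with the derivative of its activation. With saturating activations such as sigmoid (derivative at most 0.25) or tanh (derivative at most 1, and well below 1 away from zero), this product tends to shrink toward zero (vanishing gradients). With large weights it can instead grow without bound (exploding gradients).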
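
As a worked example of that product, the sketch below multiplies the per-layer factor `w * sigmoid'(z)` along a single path through the network. It is a deliberately simplified scalar model (one unit per layer), and `sigmoid_prime` and `backprop_factor` are hypothetical names chosen for this illustration.

```python
import math


def sigmoid_prime(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backprop_factor(pre_activations, weights):
    """Product of w * sigmoid'(z) along one path, one term per layer."""
    factor = 1.0
    for z, w in zip(pre_activations, weights):
        factor *= w * sigmoid_prime(z)
    return factor


for n_layers in (5, 10, 20):
    # Best case for sigmoid: every unit sits at z = 0 and every weight is 1.
    vanishing = backprop_factor([0.0] * n_layers, [1.0] * n_layers)
    # Large weights: each factor is 8 * 0.25 = 2, so the product doubles per layer.
    exploding = backprop_factor([0.0] * n_layers, [8.0] * n_layers)
    print(f"{n_layers:2d} layers | w=1: {vanishing:.2e} | w=8: {exploding:.2e}")
```

Even in the best case the sigmoid path shrinks by a factor of four per layer, while the same path with weights of 8 doubles per layer. Real networks sum over many such paths, but the same multiplicative effect drives both problems.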

In [None]:
# Implement different activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Visualize activation functions and their derivatives
x = np.linspace(-5, 5, 1000)

fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Sigmoid', 'Tanh', 'ReLU',
                   'Sigmoid Derivative', 'Tanh Derivative', 'ReLU Derivative')
)

# Activation functions
fig.add_trace(go.Scatter(x=x, y=sigmoid(x), name='Sigmoid'), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=tanh(x), name='Tanh'), row=1, col=2)
fig.add_trace(go.Scatter(x=x, y=relu(x), name='ReLU'), row=1, col=3)

# Derivatives
fig.add_trace(go.Scatter(x=x, y=sigmoid_derivative(x), name='Sigmoid\''), row=2, col=1)
fig.add_trace(go.Scatter(x=x, y=tanh_derivative(x), name='Tanh\''), row=2, col=2)
fig.add_trace(go.Scatter(x=x, y=relu_derivative(x), name='ReLU\''), row=2, col=3)

fig.update_layout(height=600, showlegend=False, title_text="Activation Functions and Their Derivatives")
fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="y", row=1)
fig.update_yaxes(title_text="dy/dx", row=2)
fig.show()

print("Maximum derivative values:")
print(f"Sigmoid: {np.max(sigmoid_derivative(x)):.3f}")
print(f"Tanh: {np.max(tanh_derivative(x)):.3f}")
print(f"ReLU: {np.max(relu_derivative(x)):.3f}")

In [None]:
# Simulate gradient flow through deep networks
def simulate_gradient_flow(activation_fn, derivative_fn, n_layers=10, n_simulations=100):
    """Simulate gradient backpropagation through multiple layers"""
    gradients_by_layer = []
    
    for _ in range(n_simulations):
        # Random initialization
        gradient = 1.0  # Start with gradient of 1 from output
        layer_gradients = [gradient]
        
        for layer in range(n_layers):
            # Random pre-activation values
            z = np.random.randn()
            # Gradient gets multiplied by derivative
            gradient *= derivative_fn(z)
            layer_gradients.append(gradient)
        
        gradients_by_layer.append(layer_gradients)
    
    return np.array(gradients_by_layer)

# Simulate for different activation functions
n_layers = 20
sigmoid_grads = simulate_gradient_flow(sigmoid, sigmoid_derivative, n_layers)
tanh_grads = simulate_gradient_flow(tanh, tanh_derivative, n_layers)
relu_grads = simulate_gradient_flow(relu, relu_derivative, n_layers)

# Visualize gradient flow
fig = go.Figure()

# Use a name that does not shadow the Keras `layers` module imported above;
# later cells still need `layers.Dense`, `layers.Dropout`, etc.
layer_indices = list(range(n_layers + 1))

# Add mean gradient flow for each activation
for name, grads in [('Sigmoid', sigmoid_grads),
                    ('Tanh', tanh_grads),
                    ('ReLU', relu_grads)]:
    fig.add_trace(go.Scatter(
        x=layer_indices,
        y=np.mean(grads, axis=0),
        name=name,
        line=dict(width=3),
        mode='lines+markers'
    ))

fig.update_layout(
    title='Gradient Flow Through Deep Networks',
    xaxis_title='Layer (from output to input)',
    yaxis_title='Average Gradient Magnitude',
    yaxis_type='log',
    height=500,
    hovermode='x unified'
)
fig.show()

print("Final gradient (at input layer):")
print(f"Sigmoid: {np.mean(sigmoid_grads[:, -1]):.2e}")
print(f"Tanh: {np.mean(tanh_grads[:, -1]):.2e}")
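print(f"ReLU: {np.mean(relu_grads[:, -1]):.2e}")
print("\n⚠️ Notice how the sigmoid gradient shrinks fastest: its derivative never exceeds 0.25.")
print("In this scalar simulation a ReLU path is either fully preserved (1) or blocked (0),")
print("so its mean also decays; the key difference is that active ReLU units pass gradients unscaled.")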

### 💡 Your Turn:
1. Modify the number of layers and observe how it affects gradient flow
2. Implement Leaky ReLU (f(x) = max(0.01x, x)) and add it to the comparison
3. What happens if you initialize the network differently?

---
## Exercise 3: Implementing Modern Activation Functions

### Concept: Advanced Activations
Newer activation functions target specific limitations of sigmoid, tanh, and ReLU: most keep a non-zero gradient for negative inputs, and several are smooth around zero.
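
One way to see the difference is to compare derivatives at a negative input, where plain ReLU passes no gradient at all. The sketch below uses scalar versions of three activations and a central finite difference. The `*_scalar` names and `numerical_derivative` are hypothetical helpers, named so they do not clash with the NumPy implementations used elsewhere in this notebook.

```python
import math


def relu_scalar(x):
    return max(0.0, x)


def leaky_relu_scalar(x, alpha=0.01):
    return x if x > 0 else alpha * x


def elu_scalar(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for name, f in [("ReLU", relu_scalar),
                ("Leaky ReLU", leaky_relu_scalar),
                ("ELU", elu_scalar)]:
    print(f"{name:10s} f'(-1) = {numerical_derivative(f, -1.0):.4f}")
```

ReLU's derivative at x = -1 is exactly zero, Leaky ReLU's equals its slope `alpha`, and ELU's equals `alpha * exp(-1)`, so the latter two still let some gradient through for negative inputs.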

In [None]:
# Implement modern activation functions
class ModernActivations:
    @staticmethod
    def leaky_relu(x, alpha=0.01):
        """Leaky ReLU: allows small negative gradients"""
        return np.where(x > 0, x, alpha * x)
    
    @staticmethod
    def elu(x, alpha=1.0):
        """Exponential Linear Unit"""
        return np.where(x > 0, x, alpha * (np.exp(x) - 1))
    
    @staticmethod
    def selu(x):
        """Scaled Exponential Linear Unit (self-normalizing)"""
        alpha = 1.6732632423543772
        scale = 1.0507009873554805
        return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))
    
    @staticmethod
    def swish(x, beta=1.0):
        """Swish: x * sigmoid(βx)"""
        return x * sigmoid(beta * x)
    
    @staticmethod
    def gelu(x):
        """Gaussian Error Linear Unit (approximation)"""
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
    
    @staticmethod
    def mish(x):
        """Mish: x * tanh(softplus(x))"""
        # np.logaddexp(0, x) computes softplus log(1 + exp(x)) without overflow for large x
        return x * np.tanh(np.logaddexp(0, x))

# Compare all activation functions
x = np.linspace(-3, 3, 1000)
activations = ModernActivations()

fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Leaky ReLU', 'ELU', 'SELU', 'Swish', 'GELU', 'Mish')
)

# Plot each activation
functions = [
    (activations.leaky_relu, 'Leaky ReLU'),
    (activations.elu, 'ELU'),
    (activations.selu, 'SELU'),
    (activations.swish, 'Swish'),
    (activations.gelu, 'GELU'),
    (activations.mish, 'Mish')
]

for idx, (func, name) in enumerate(functions):
    row = idx // 3 + 1
    col = idx % 3 + 1
    
    y = func(x)
    fig.add_trace(
        go.Scatter(x=x, y=y, name=name, line=dict(width=3)),
        row=row, col=col
    )
    
    # Add ReLU for comparison (in light gray)
    fig.add_trace(
        go.Scatter(x=x, y=relu(x), name='ReLU', 
                  line=dict(color='lightgray', dash='dash')),
        row=row, col=col
    )

fig.update_layout(
    height=600,
    title_text="Modern Activation Functions (compared to ReLU in gray)",
    showlegend=False
)
fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="y")
fig.show()

# Compare key properties
print("Activation Function Properties at x = -1:")
print(f"ReLU:       {relu(-1):.3f}")
print(f"Leaky ReLU: {activations.leaky_relu(-1):.3f}")
print(f"ELU:        {activations.elu(-1):.3f}")
print(f"SELU:       {activations.selu(-1):.3f}")
print(f"Swish:      {activations.swish(-1):.3f}")
print(f"GELU:       {activations.gelu(-1):.3f}")
print(f"Mish:       {activations.mish(-1):.3f}")

---
## Exercise 4: Performance Comparison of Activation Functions

### Concept: Choosing the Right Activation
Different activation functions perform better in different scenarios. Let's compare them on a synthetic four-class classification task.

In [None]:
# Create a more complex dataset
from sklearn.datasets import make_moons, make_circles

# Generate combined dataset
X1, y1 = make_moons(n_samples=500, noise=0.15, random_state=42)
X2, y2 = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X2 = X2 * 2 + 2  # Scale and shift the circles away from the moons

X_complex = np.vstack([X1, X2])
y_complex = np.hstack([y1, y2 + 2])  # 4 classes total

# Standardize features
scaler = StandardScaler()
X_complex = scaler.fit_transform(X_complex)

# Visualize dataset
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_complex[:, 0], X_complex[:, 1], c=y_complex, 
                     cmap='viridis', alpha=0.6, edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, label='Class')
plt.title('Complex Classification Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
plt.show()

print(f"Dataset shape: {X_complex.shape}")
print(f"Number of classes: {len(np.unique(y_complex))}")

In [None]:
# Build models with different activation functions
def build_model_with_activation(activation, input_dim=2, n_classes=4):
    """Build a deep network with specified activation function"""
    model = models.Sequential([
        layers.Dense(128, activation=activation, input_shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        
        layers.Dense(64, activation=activation),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        
        layers.Dense(32, activation=activation),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        
        layers.Dense(n_classes, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Train models with different activations
activations_to_test = ['relu', 'elu', 'selu', 'swish', 'gelu']
histories = {}
models_dict = {}

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_complex, y_complex, test_size=0.2, random_state=42, stratify=y_complex
)

print("Training models with different activation functions...")
for activation in activations_to_test:
    print(f"\nTraining with {activation.upper()}...")
    
    # Build and train model
    model = build_model_with_activation(activation)
    
    history = model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_split=0.2,
        verbose=0,
        callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)]
    )
    
    histories[activation] = history
    models_dict[activation] = model
    
    # Evaluate
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"{activation.upper()} - Test Accuracy: {test_acc:.3f}")

print("\nTraining complete!")

In [None]:
# Visualize training curves for all activations
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Training Accuracy', 'Validation Accuracy')
)

colors = ['blue', 'red', 'green', 'purple', 'orange']

for idx, (activation, history) in enumerate(histories.items()):
    # Training accuracy
    fig.add_trace(
        go.Scatter(
            y=history.history['accuracy'],
            name=activation.upper(),
            line=dict(color=colors[idx], width=2),
            showlegend=True
        ),
        row=1, col=1
    )
    
    # Validation accuracy
    fig.add_trace(
        go.Scatter(
            y=history.history['val_accuracy'],
            name=activation.upper(),
            line=dict(color=colors[idx], width=2, dash='dash'),
            showlegend=False
        ),
        row=1, col=2
    )

fig.update_xaxes(title_text="Epoch")
fig.update_yaxes(title_text="Accuracy")
fig.update_layout(
    height=400,
    title_text="Activation Function Performance Comparison"
)
fig.show()

# Summary statistics
print("\nFinal Performance Summary:")
print("-" * 40)
for activation in activations_to_test:
    test_loss, test_acc = models_dict[activation].evaluate(X_test, y_test, verbose=0)
    train_acc = histories[activation].history['accuracy'][-1]
    val_acc = histories[activation].history['val_accuracy'][-1]
    
    print(f"{activation.upper():8s} | Train: {train_acc:.3f} | Val: {val_acc:.3f} | Test: {test_acc:.3f}")

---
## Exercise 5: Visualizing Hierarchical Feature Learning

### Concept: Layer-wise Representations
Deep networks learn increasingly abstract representations at each layer. Let's visualize this hierarchy.
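
Mechanically, a layer-wise representation is just the output of each hidden layer for the same input batch. The NumPy sketch below makes that explicit with an untrained ReLU network. The widths match the Keras model in the next cell, but the weights are random and `forward_with_activations` is a hypothetical helper, so this only illustrates the shapes involved, not what a trained model learns.

```python
import numpy as np


def forward_with_activations(x, params):
    """Run a ReLU MLP and return the output of every hidden layer.

    params is a list of (W, b) pairs, one per layer.
    """
    activations = []
    h = x
    for W, b in params:
        h = np.maximum(0.0, h @ W + b)  # affine transform followed by ReLU
        activations.append(h)
    return activations


rng = np.random.default_rng(0)
sizes = [2, 128, 64, 32]  # input dimension followed by three hidden widths
params = [
    (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
]

batch = rng.normal(size=(5, 2))  # five example points in 2-D
for i, act in enumerate(forward_with_activations(batch, params), start=1):
    print(f"layer {i}: representation shape {act.shape}")
```

Each row of an activation matrix is one input point re-expressed in that layer's feature space, which is what the t-SNE plots later in this exercise project down to two dimensions.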

In [None]:
# Create a model with accessible intermediate layers
def build_model_with_outputs(input_dim=2, n_classes=4):
    """Build model that returns intermediate layer outputs"""
    input_layer = layers.Input(shape=(input_dim,))
    
    # Layer 1
    x1 = layers.Dense(128, activation='relu', name='layer1')(input_layer)
    x1 = layers.BatchNormalization()(x1)
    
    # Layer 2
    x2 = layers.Dense(64, activation='relu', name='layer2')(x1)
    x2 = layers.BatchNormalization()(x2)
    
    # Layer 3
    x3 = layers.Dense(32, activation='relu', name='layer3')(x2)
    x3 = layers.BatchNormalization()(x3)
    
    # Output layer
    output = layers.Dense(n_classes, activation='softmax', name='output')(x3)
    
    # Create model
    model = models.Model(inputs=input_layer, outputs=output)
    
    # Create models for intermediate outputs
    layer1_model = models.Model(inputs=input_layer, outputs=x1)
    layer2_model = models.Model(inputs=input_layer, outputs=x2)
    layer3_model = models.Model(inputs=input_layer, outputs=x3)
    
    return model, [layer1_model, layer2_model, layer3_model]

# Build and train the model
main_model, intermediate_models = build_model_with_outputs()

main_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
main_model.fit(
    X_train, y_train,
    epochs=30,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

print("Model trained successfully!")
print(f"Test accuracy: {main_model.evaluate(X_test, y_test, verbose=0)[1]:.3f}")

In [None]:
# Visualize intermediate representations using t-SNE
from sklearn.manifold import TSNE

# Get intermediate representations
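# X_test holds only 200 points, so cap the sample size; sampling 500 without
# replacement would raise a ValueError.
sample_size = min(500, len(X_test))
sample_indices = np.random.choice(len(X_test), sample_size, replace=False)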
X_sample = X_test[sample_indices]
y_sample = y_test[sample_indices]

# Get representations from each layer
representations = {
    'Input': X_sample,
    'Layer 1': intermediate_models[0].predict(X_sample, verbose=0),
    'Layer 2': intermediate_models[1].predict(X_sample, verbose=0),
    'Layer 3': intermediate_models[2].predict(X_sample, verbose=0)
}

# Apply t-SNE to high-dimensional representations
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for idx, (name, rep) in enumerate(representations.items()):
    ax = axes[idx]
    
    # Apply t-SNE if needed
    if rep.shape[1] > 2:
        tsne = TSNE(n_components=2, random_state=42, perplexity=30)
        rep_2d = tsne.fit_transform(rep)
    else:
        rep_2d = rep
    
    # Plot
    scatter = ax.scatter(rep_2d[:, 0], rep_2d[:, 1], c=y_sample, 
                        cmap='viridis', alpha=0.6, s=20)
    ax.set_title(f'{name}\n({rep.shape[1]} dims)')
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')
    ax.grid(True, alpha=0.3)

plt.suptitle('Hierarchical Feature Learning: Layer-wise Representations', fontsize=14, y=1.05)
plt.tight_layout()
plt.show()

print("Check whether the classes become more cleanly separated at deeper layers.")
print("t-SNE distorts distances, so treat these plots as qualitative evidence only.")

---
## Exercise 6: Understanding the Dead ReLU Problem

### Concept: Dying Neurons
A ReLU neuron 'dies' when its pre-activation is negative for every training input: it then always outputs zero, receives zero gradient, and stops learning.
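
That definition can be checked directly on data. The sketch below measures, for a single dense layer, the fraction of units whose pre-activation is never positive on a given batch. `dead_relu_fraction` is a hypothetical helper and the weights are random, so it only illustrates the measurement itself, assuming you can obtain a layer's weight matrix `W` and bias `b`.

```python
import numpy as np


def dead_relu_fraction(X, W, b):
    """Fraction of ReLU units that output zero for every row of X."""
    pre_activation = X @ W + b                        # shape: (n_samples, n_units)
    active_somewhere = (pre_activation > 0).any(axis=0)
    return 1.0 - active_somewhere.mean()


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # probe batch
W = rng.normal(size=(2, 8))     # one layer with 8 units

healthy = dead_relu_fraction(X, W, np.zeros(8))
# A large negative bias pushes every pre-activation below zero.
shifted = dead_relu_fraction(X, W, np.full(8, -100.0))

print(f"dead fraction with zero bias:    {healthy:.2f}")
print(f"dead fraction with bias of -100: {shifted:.2f}")
```

Measuring activations on a probe batch is more reliable than inspecting weights alone, because whether a unit fires depends on the inputs it actually sees.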

In [None]:
# Monitor dead neurons during training
class DeadReLUCallback(keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        self.dead_neurons_history = []
    
    def on_epoch_end(self, epoch, logs=None):
        dead_count = 0
        total_count = 0
        
        for layer in self.model.layers:
            if isinstance(layer, layers.Dense) and layer.activation.__name__ == 'relu':
                weights = layer.get_weights()[0]  # Get weight matrix
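                # Heuristic proxy only: flag neurons whose incoming weights are
                # all non-positive. A truly dead neuron outputs zero for every
                # training input, which requires checking activations on data.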
                neuron_max_weights = np.max(weights, axis=0)
                dead = np.sum(neuron_max_weights <= 0)
                
                dead_count += dead
                total_count += weights.shape[1]
        
        dead_percentage = (dead_count / total_count) * 100 if total_count > 0 else 0
        self.dead_neurons_history.append(dead_percentage)

# Build models with different initializations
def build_relu_model_with_init(init_method):
    model = models.Sequential([
        layers.Dense(256, activation='relu', kernel_initializer=init_method, input_shape=(2,)),
        layers.Dense(128, activation='relu', kernel_initializer=init_method),
        layers.Dense(64, activation='relu', kernel_initializer=init_method),
        layers.Dense(4, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Test different initializations
init_methods = {
    'zeros': 'zeros',  # Worst case
    'random_normal': keras.initializers.RandomNormal(mean=0, stddev=0.5),
    'he_normal': 'he_normal',  # Best for ReLU
    'glorot_uniform': 'glorot_uniform'  # Xavier initialization
}

dead_neuron_results = {}

print("Testing different weight initializations...\n")
for name, init in init_methods.items():
    print(f"Training with {name} initialization...")
    
    model = build_relu_model_with_init(init)
    callback = DeadReLUCallback()
    
    history = model.fit(
        X_train, y_train,
        epochs=20,
        batch_size=32,
        validation_split=0.2,
        verbose=0,
        callbacks=[callback]
    )
    
    dead_neuron_results[name] = callback.dead_neurons_history
    
    test_acc = model.evaluate(X_test, y_test, verbose=0)[1]
    print(f"  Test Accuracy: {test_acc:.3f}")
    print(f"  Final dead neurons: {callback.dead_neurons_history[-1]:.1f}%\n")

In [None]:
# Visualize dead neuron evolution
plt.figure(figsize=(10, 6))

for name, history in dead_neuron_results.items():
    plt.plot(history, label=name, linewidth=2, marker='o')

plt.xlabel('Epoch')
plt.ylabel('Percentage of Dead Neurons (%)')
plt.title('Dead ReLU Problem with Different Initializations')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Key Insights:")
print("1. Poor initialization (zeros) causes massive neuron death")
print("2. He initialization is designed specifically for ReLU")
print("3. Dead neurons stop learning and reduce model capacity")

---
## Exercise 7: Implementing Gradient Clipping

### Concept: Controlling Exploding Gradients
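Gradient clipping caps the size of the gradients before each optimizer step, either element-wise (clip by value) or by rescaling the whole gradient vector (clip by norm). This limits the damage a single very large gradient can do and stabilizes training.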
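
The two variants behave differently, which a small NumPy sketch makes visible. `clip_by_value` and `clip_by_norm` below are hypothetical stand-alone functions written for illustration; they mimic the idea behind the optimizer's clipping options rather than reproduce Keras internals.

```python
import numpy as np


def clip_by_value(grad, clip_value):
    """Limit each component to the interval [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)


def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm <= max_norm:
        return grad
    return grad * (max_norm / norm)


g = np.array([3.0, -4.0, 0.5])

print("original:     ", g)
print("clip by value:", clip_by_value(g, 1.0))
print("clip by norm: ", clip_by_norm(g, 1.0))
```

Clipping by value can change the direction of the gradient, because large components are cut while small ones are left alone. Clipping by norm keeps the direction and only shortens the step.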

In [None]:
# Create a dataset that might cause gradient explosion
np.random.seed(42)
X_unstable = np.random.randn(1000, 10) * 10  # Large input values
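# Labels are random, so there is no signal to learn: accuracy should hover near
# 50%. Compare the smoothness of the loss curves rather than the accuracy.
y_unstable = np.random.randint(0, 2, 1000)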

X_train_uns, X_test_uns, y_train_uns, y_test_uns = train_test_split(
    X_unstable, y_unstable, test_size=0.2, random_state=42
)

# Build models with and without gradient clipping
def build_model_with_clipping(clip_value=None):
    model = models.Sequential([
        layers.Dense(128, activation='tanh', input_shape=(10,)),
        layers.Dense(128, activation='tanh'),
        layers.Dense(128, activation='tanh'),
        layers.Dense(128, activation='tanh'),
        layers.Dense(1, activation='sigmoid')
    ])
    
    # Configure optimizer with gradient clipping
    if clip_value is not None:
        optimizer = optimizers.Adam(clipvalue=clip_value)
    else:
        optimizer = optimizers.Adam()
    
    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Train without clipping
print("Training without gradient clipping...")
model_no_clip = build_model_with_clipping(clip_value=None)
history_no_clip = model_no_clip.fit(
    X_train_uns, y_train_uns,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Train with clipping
print("Training with gradient clipping (clip_value=1.0)...")
model_clip = build_model_with_clipping(clip_value=1.0)
history_clip = model_clip.fit(
    X_train_uns, y_train_uns,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

print("Training complete!")

In [None]:
# Compare training stability
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Loss comparison
ax1.plot(history_no_clip.history['loss'], label='No Clipping', linewidth=2, alpha=0.7)
ax1.plot(history_clip.history['loss'], label='With Clipping', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Comparison')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, max(history_no_clip.history['loss'][5:])*1.1])  # Zoom in after initial epochs

# Validation accuracy comparison
ax2.plot(history_no_clip.history['val_accuracy'], label='No Clipping', linewidth=2, alpha=0.7)
ax2.plot(history_clip.history['val_accuracy'], label='With Clipping', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Validation Accuracy')
ax2.set_title('Validation Accuracy Comparison')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate stability metrics
loss_variance_no_clip = np.var(history_no_clip.history['loss'][10:])
loss_variance_clip = np.var(history_clip.history['loss'][10:])

print(f"\nTraining Stability (lower is better):")
print(f"Loss variance without clipping: {loss_variance_no_clip:.4f}")
print(f"Loss variance with clipping: {loss_variance_clip:.4f}")
print(f"Improvement: {(1 - loss_variance_clip/loss_variance_no_clip)*100:.1f}% more stable")

---
## Exercise 8: Understanding Batch Normalization

### Concept: Internal Covariate Shift
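Batch Normalization standardizes each layer's inputs over the current mini-batch (zero mean, unit variance per feature) and then applies a learned scale and shift. It was introduced to reduce internal covariate shift, the change in the distribution of a layer's inputs as earlier layers are updated, and in practice it stabilizes and often speeds up training.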
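
The training-time computation is short enough to write out. The NumPy sketch below covers only the forward pass on one mini-batch; `batch_norm_forward` is a hypothetical helper, and the moving averages that a real layer keeps for inference are left out.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over the batch axis.

    x: (batch_size, n_features); gamma, beta: (n_features,)
    """
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardize each feature
    return gamma * x_hat + beta               # learned scale and shift


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # badly scaled activations

out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("input mean per feature: ", x.mean(axis=0).round(2))
print("output mean per feature:", out.mean(axis=0).round(2))
print("output std per feature: ", out.std(axis=0).round(2))
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance per feature. During training the network is free to learn other values of `gamma` and `beta` if a different scale suits the next layer better.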

In [None]:
# Compare models with and without batch normalization
def build_model_batchnorm(use_batchnorm=True, use_dropout=False):
    model = models.Sequential()
    
    # Input layer
    model.add(layers.Dense(256, input_shape=(2,)))
    if use_batchnorm:
        model.add(layers.BatchNormalization())
    model.add(layers.Activation('relu'))
    if use_dropout:
        model.add(layers.Dropout(0.3))
    
    # Hidden layers
    for units in [128, 64, 32]:
        model.add(layers.Dense(units))
        if use_batchnorm:
            model.add(layers.BatchNormalization())
        model.add(layers.Activation('relu'))
        if use_dropout:
            model.add(layers.Dropout(0.3))
    
    # Output layer
    model.add(layers.Dense(4, activation='softmax'))
    
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Train models with different configurations
configs = [
    ('No BatchNorm, No Dropout', False, False),
    ('BatchNorm Only', True, False),
    ('Dropout Only', False, True),
    ('BatchNorm + Dropout', True, True)
]

results = {}

print("Training models with different regularization techniques...\n")
for name, use_bn, use_dropout in configs:
    print(f"Training: {name}")
    
    model = build_model_batchnorm(use_bn, use_dropout)
    
    history = model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_split=0.2,
        verbose=0,
        callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)]
    )
    
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    
    results[name] = {
        'history': history,
        'test_acc': test_acc,
        'final_val_acc': history.history['val_accuracy'][-1]
    }
    
    print(f"  Test Accuracy: {test_acc:.3f}\n")

In [None]:
# Visualize the effects of batch normalization
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Training Curves', 'Final Performance')
)

colors = {'No BatchNorm, No Dropout': 'red',
          'BatchNorm Only': 'blue',
          'Dropout Only': 'green',
          'BatchNorm + Dropout': 'purple'}

# Training curves
for name, result in results.items():
    fig.add_trace(
        go.Scatter(
            y=result['history'].history['val_accuracy'],
            name=name,
            line=dict(color=colors[name], width=2)
        ),
        row=1, col=1
    )

# Bar chart of final performance
names = list(results.keys())
test_accs = [results[name]['test_acc'] for name in names]

fig.add_trace(
    go.Bar(
        x=names,
        y=test_accs,
        marker_color=list(colors.values()),
        showlegend=False
    ),
    row=1, col=2
)

fig.update_xaxes(title_text="Epoch", row=1, col=1)
fig.update_xaxes(title_text="Configuration", tickangle=45, row=1, col=2)
fig.update_yaxes(title_text="Validation Accuracy", row=1, col=1)
fig.update_yaxes(title_text="Test Accuracy", row=1, col=2)

fig.update_layout(height=400, title_text="Impact of Batch Normalization and Dropout")
fig.show()

print("\nWhat to look for (results vary between runs):")
print("1. BatchNorm often accelerates training and improves convergence")
print("2. Dropout adds regularization; check whether combining it with BatchNorm actually helps here")
print("3. BatchNorm alone can sometimes be sufficient for regularization")

---
## Part 5: Summary and Practice Exercises

### 🎯 Key Takeaways

1. **Deep vs Shallow Networks**
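   - Deep networks can be more parameter-efficient than shallow ones for compositional functions
   - They learn hierarchical representations
   - Composing and reusing features across layers lets depth represent some functions with exponentially fewer units than a single hidden layer would need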

2. **Gradient Problems**
   - Vanishing gradients: Common with sigmoid/tanh activations
   - Exploding gradients: Can occur with poor initialization
   - Solutions: Better activations, normalization, gradient clipping

3. **Modern Activation Functions**
   - ReLU: Simple and effective, but can die
   - Leaky ReLU/ELU: Address dead neuron problem
   - SELU: Self-normalizing properties
   - Swish/GELU: Smooth alternatives to ReLU that are widely used in modern architectures (GELU is common in Transformers)

4. **Training Techniques**
   - Proper initialization (He/Xavier) is crucial
   - Batch normalization stabilizes training
   - Gradient clipping prevents explosions
   - Dropout provides regularization
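
As a reminder of what the two initialization schemes named above compute, the sketch below derives their scale from the layer's fan-in and fan-out. It follows the standard formulas (standard deviation `sqrt(2 / fan_in)` for He normal, limit `sqrt(6 / (fan_in + fan_out))` for Glorot/Xavier uniform). The helper names are hypothetical, and the He sample here uses a plain normal distribution, which is a simplifying assumption.

```python
import numpy as np


def he_normal_std(fan_in):
    """Standard deviation used by He normal initialization."""
    return np.sqrt(2.0 / fan_in)


def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the interval used by Glorot (Xavier) uniform initialization."""
    return np.sqrt(6.0 / (fan_in + fan_out))


rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

W_he = rng.normal(0.0, he_normal_std(fan_in), size=(fan_in, fan_out))
limit = glorot_uniform_limit(fan_in, fan_out)
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

print(f"He normal      target std: {he_normal_std(fan_in):.4f}, sample std: {W_he.std():.4f}")
print(f"Glorot uniform limit:      {limit:.4f}, sample std: {W_glorot.std():.4f}")
```

Both schemes shrink the weights as layers get wider, which keeps the variance of activations and gradients roughly constant from layer to layer.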

### 📝 Practice Exercises

Try these exercises to deepen your understanding:

1. **Architecture Exploration**
   - Build a 10-layer network and train it with different activation functions
   - Compare convergence speed and final accuracy

2. **Custom Activation Function**
   - Implement your own activation function (e.g., x * tanh(sqrt(|x|)); plain sqrt(x) is undefined for negative inputs)
   - Test it on the spiral dataset

3. **Gradient Analysis**
   - Track gradient magnitudes during training
   - Visualize how they change with depth

4. **Hyperparameter Study**
   - Systematically vary learning rate, batch size, and network depth
   - Create a heatmap of performance

5. **Real Dataset Challenge**
   - Apply these concepts to MNIST or CIFAR-10
   - Build the deepest network you can train successfully

### 🚀 Advanced Topics to Explore

- Residual connections (ResNet)
- Dense connections (DenseNet)  
- Attention mechanisms
- Neural Architecture Search (NAS)
- Pruning and quantization

### 📚 References

1. Glorot & Bengio (2010) - Understanding the difficulty of training deep feedforward neural networks
2. He et al. (2015) - Delving Deep into Rectifiers
3. Ioffe & Szegedy (2015) - Batch Normalization
4. Ramachandran et al. (2017) - Searching for Activation Functions
5. Klambauer et al. (2017) - Self-Normalizing Neural Networks
