# Multi-layer Perceptrons (MLPs): From Theory to Practice
## Based on Lecture 5: From Logistic Regression to Multi-layer Perceptrons

**Author:** Ho-min Park  
**Interactive Notebook Version**

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand why we need neural networks (XOR problem)
2. Implement neurons and activation functions from scratch
3. Build a complete MLP architecture
4. Master forward propagation
5. Implement the backpropagation algorithm
6. Train neural networks on real datasets
7. Visualize decision boundaries and training dynamics
8. Compare different activation functions and architectures

---

## Part 0: Setup and Imports

Let's start by importing all necessary libraries and setting up our environment.

In [None]:
# Essential imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print('Setup complete! ✅')
print(f'NumPy version: {np.__version__}')
print(f'Pandas version: {pd.__version__}')

---
## Part 1: Neural Network Motivation

### Why do we need neural networks?

Linear models like logistic regression have fundamental limitations. Let's explore the famous XOR problem that demonstrates why we need non-linear models.

### Exercise 1: The XOR Problem - Linear Inseparability

#### 📚 Concept
The XOR (exclusive OR) problem is a classic example that shows the limitations of linear classifiers. The XOR function outputs 1 when inputs are different, and 0 when they're the same. This creates a pattern that cannot be separated by a single straight line.
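
The claim above can be checked with a quick brute-force sketch. It tries many candidate lines `w1*x1 + w2*x2 + b = 0` and records the best accuracy any of them reaches on the four XOR points. The grid range and resolution are arbitrary assumptions made for illustration, so this is a sanity check rather than a proof.

```python
import itertools
import numpy as np

# XOR inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hypothetical coarse grid of line parameters (w1, w2, b); the range is an arbitrary choice
grid = np.linspace(-2, 2, 21)

best_accuracy = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    # Classify each point by which side of the line w1*x1 + w2*x2 + b = 0 it falls on
    predictions = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best_accuracy = max(best_accuracy, np.mean(predictions == y))

print(f"Best accuracy of any line on the grid: {best_accuracy:.2f}")
```

No line on the grid gets more than three of the four points right, which is consistent with XOR not being linearly separable.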

#### 💻 Code Implementation

In [None]:
# Create XOR dataset
def create_xor_data():
    """Generate XOR problem dataset"""
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])  # XOR logic
    return X, y

# Generate data
X_xor, y_xor = create_xor_data()

# Display truth table
xor_df = pd.DataFrame({
    'x₁': X_xor[:, 0],
    'x₂': X_xor[:, 1],
    'XOR Output': y_xor
})

print("XOR Truth Table:")
print(xor_df.to_string(index=False))
print("\n🔍 Notice: identical inputs → 0, different inputs → 1")

In [None]:
# Visualize XOR problem
plt.figure(figsize=(10, 4))

# Subplot 1: Data points
plt.subplot(1, 2, 1)
colors = ['red' if y == 0 else 'blue' for y in y_xor]
plt.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors='black', linewidth=2)

# Add labels
for i, (x1, x2, y) in enumerate(zip(X_xor[:, 0], X_xor[:, 1], y_xor)):
    plt.annotate(f'({int(x1)},{int(x2)})\ny={y}', 
                xy=(x1, x2), xytext=(x1-0.15, x2+0.1), fontsize=10)

plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.xlabel('x₁', fontsize=12)
plt.ylabel('x₂', fontsize=12)
plt.title('XOR Problem: Data Points', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Subplot 2: Failed linear separation attempt
plt.subplot(1, 2, 2)
plt.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors='black', linewidth=2)

# Try to draw a separating line (will fail)
x_line = np.linspace(-0.5, 1.5, 100)
y_line = 1 - x_line  # Example line
plt.plot(x_line, y_line, 'g--', linewidth=2, label='Attempted separator')

plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.xlabel('x₁', fontsize=12)
plt.ylabel('x₂', fontsize=12)
plt.title('XOR: No Linear Separation Possible!', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("❌ No single line can separate red from blue points!")
print("✅ This is why we need neural networks with hidden layers.")

#### 🎯 Your Turn

**Task:** Implement a simple logistic regression to verify it cannot solve XOR. Then, think about what transformation might help.

**Hint:** What if we could transform the input space? For example, what if we added a feature like x₁ × x₂?

In [None]:
# TODO: Your code here
# 1. Try logistic regression on XOR (it will fail)
# 2. Add a new feature: x1 * x2
# 3. Try logistic regression again with the new feature

from sklearn.linear_model import LogisticRegression

# Your implementation:
# lr = LogisticRegression()
# lr.fit(X_xor, y_xor)
# predictions = lr.predict(X_xor)
# print(f"Accuracy with linear features: {accuracy_score(y_xor, predictions):.2f}")

---
### Exercise 2: Implementing Activation Functions

#### 📚 Concept
Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns. Without activation functions, even deep networks would collapse to a single linear transformation.
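
To see the collapse concretely, here is a minimal sketch with two stacked layers and no activation in between. The layer sizes and the random seed are arbitrary assumptions; the point is only that the two-layer output matches a single linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this illustration

# Two "layers" with no activation function (hypothetical sizes: 3 -> 5 -> 2)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=(1, 5))
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=(1, 2))

X = rng.normal(size=(10, 3))

# Passing through both layers in sequence
two_layer_output = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as ONE linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer_output = X @ W_combined + b_combined

print(np.allclose(two_layer_output, one_layer_output))
```

Because `(X W1 + b1) W2 + b2 = X (W1 W2) + (b1 W2 + b2)`, depth adds no expressive power unless a non-linearity sits between the layers.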

#### 💻 Code Implementation

In [None]:
class ActivationFunctions:
    """Collection of activation functions and their derivatives"""
    
    @staticmethod
    def sigmoid(z):
        """Sigmoid activation: σ(z) = 1/(1 + e^(-z))"""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    @staticmethod
    def sigmoid_derivative(z):
        """Derivative of sigmoid: σ'(z) = σ(z)(1 - σ(z))"""
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    @staticmethod
    def tanh(z):
        """Hyperbolic tangent activation"""
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        """Derivative of tanh: (1 - tanh²(z))"""
        return 1 - np.tanh(z) ** 2
    
    @staticmethod
    def relu(z):
        """ReLU activation: max(0, z)"""
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        """Derivative of ReLU: 1 if z > 0, else 0"""
        return (z > 0).astype(float)
    
    @staticmethod
    def leaky_relu(z, alpha=0.01):
        """Leaky ReLU: max(αz, z)"""
        return np.where(z > 0, z, alpha * z)
    
    @staticmethod
    def leaky_relu_derivative(z, alpha=0.01):
        """Derivative of Leaky ReLU"""
        return np.where(z > 0, 1, alpha)

# Test activation functions
af = ActivationFunctions()
z = np.linspace(-5, 5, 100)

print("Activation Functions Implementation Complete! ✅")
print(f"Sigmoid at z=0: {af.sigmoid(0):.4f}")
print(f"ReLU at z=-1: {af.relu(-1):.4f}")
print(f"ReLU at z=1: {af.relu(1):.4f}")

In [None]:
# Visualize activation functions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
z = np.linspace(-5, 5, 100)

# Define functions and derivatives
functions = [
    ('Sigmoid', af.sigmoid, af.sigmoid_derivative),
    ('Tanh', af.tanh, af.tanh_derivative),
    ('ReLU', af.relu, af.relu_derivative),
    ('Leaky ReLU', af.leaky_relu, af.leaky_relu_derivative)
]

for idx, (name, func, deriv) in enumerate(functions):
    # Plot activation function
    ax = axes[0, idx]
    ax.plot(z, func(z), linewidth=2.5, color=f'C{idx}')
    ax.set_title(f'{name}', fontsize=12, fontweight='bold')
    ax.set_xlabel('z')
    ax.set_ylabel('f(z)')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linestyle='-', alpha=0.2)
    ax.axvline(x=0, color='k', linestyle='-', alpha=0.2)
    
    # Plot derivative
    ax = axes[1, idx]
    ax.plot(z, deriv(z), linewidth=2.5, color=f'C{idx}', linestyle='--')
    ax.set_title(f'{name} Derivative', fontsize=12)
    ax.set_xlabel('z')
    ax.set_ylabel("f'(z)")
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linestyle='-', alpha=0.2)
    ax.axvline(x=0, color='k', linestyle='-', alpha=0.2)

plt.suptitle('Activation Functions and Their Derivatives', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("📊 Key Observations:")
print("• Sigmoid: Bounded in (0, 1), suffers from vanishing gradients")
print("• Tanh: Zero-centered, bounded in (-1, 1), also suffers from vanishing gradients")
print("• ReLU: Unbounded above, cheap to compute, but can suffer from dead neurons")
print("• Leaky ReLU: Mitigates dead neurons by keeping a small nonzero gradient for negative inputs")

---
### Exercise 3: Building a Single Neuron

#### 📚 Concept
A neuron (perceptron) is the basic building block of neural networks. It computes a weighted sum of inputs, adds a bias, and applies an activation function.

**Formula:** `output = activation(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)`
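
As a worked instance of the formula, the sketch below evaluates one neuron by hand. The weights, bias, and input are hypothetical values picked so the arithmetic is easy to follow, and sigmoid is assumed as the activation.

```python
import numpy as np

# Hypothetical weights, bias, and input chosen so the arithmetic is easy to follow
weights = np.array([0.5, -0.25])
bias = 0.1
x = np.array([1.0, 2.0])

# Weighted sum: 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
z = np.dot(weights, x) + bias

# Sigmoid activation squashes z into (0, 1)
output = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.2f}, output = {output:.4f}")
```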

#### 💻 Code Implementation

In [None]:
class Neuron:
    """A single neuron with configurable activation"""
    
    def __init__(self, n_inputs, activation='sigmoid'):
        """Initialize neuron with random weights and bias"""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = np.random.randn() * 0.1
        self.activation = activation
        self.af = ActivationFunctions()
        
        # Store for backpropagation
        self.last_input = None
        self.last_z = None
        self.last_output = None
    
    def forward(self, inputs):
        """Compute neuron output"""
        self.last_input = inputs
        self.last_z = np.dot(inputs, self.weights) + self.bias
        
        # Apply activation
        if self.activation == 'sigmoid':
            self.last_output = self.af.sigmoid(self.last_z)
        elif self.activation == 'tanh':
            self.last_output = self.af.tanh(self.last_z)
        elif self.activation == 'relu':
            self.last_output = self.af.relu(self.last_z)
        else:
            self.last_output = self.last_z  # Linear
        
        return self.last_output
    
    def backward(self, error, learning_rate=0.01):
        """Update weights by gradient descent; `error` is dLoss/dOutput (e.g. output - target)"""
        # Compute gradient based on activation
        if self.activation == 'sigmoid':
            grad = error * self.af.sigmoid_derivative(self.last_z)
        elif self.activation == 'tanh':
            grad = error * self.af.tanh_derivative(self.last_z)
        elif self.activation == 'relu':
            grad = error * self.af.relu_derivative(self.last_z)
        else:
            grad = error
        
        # Update weights and bias
        self.weights -= learning_rate * grad * self.last_input
        self.bias -= learning_rate * grad
        
        return grad

# Test single neuron
neuron = Neuron(n_inputs=2, activation='sigmoid')
test_input = np.array([1.0, 0.5])
output = neuron.forward(test_input)

print(f"Neuron created with {len(neuron.weights)} weights")
print(f"Weights: {neuron.weights}")
print(f"Bias: {neuron.bias:.4f}")
print(f"Input: {test_input}")
print(f"Output: {output:.4f}")

#### 🎯 Your Turn

**Task:** Train a single neuron to learn the AND gate logic.

AND gate truth table:
- (0,0) → 0
- (0,1) → 0  
- (1,0) → 0
- (1,1) → 1

In [None]:
# TODO: Train a neuron to learn AND gate
# Hint: Use the Neuron class above with a training loop

# AND gate data
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

# Your code here:
# and_neuron = Neuron(n_inputs=2, activation='sigmoid')
# for epoch in range(1000):
#     for x, target in zip(X_and, y_and):
#         output = and_neuron.forward(x)
#         error = output - target  # dLoss/dOutput for squared error; backward() subtracts the gradient
#         and_neuron.backward(error, learning_rate=0.1)

---
## Part 2: Multi-Layer Perceptron Architecture

Now let's build a complete MLP from scratch!

### Exercise 4: Complete MLP Implementation

#### 📚 Concept
An MLP consists of:
- **Input layer**: Receives raw features
- **Hidden layers**: Learn representations through non-linear transformations
- **Output layer**: Produces final predictions

Information flows forward during prediction and gradients flow backward during training.
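
The layer sizes alone determine how many trainable parameters a fully connected network has: each pair of adjacent layers contributes one weight matrix and one bias vector. A small sketch, where `count_parameters` is a hypothetical helper and not part of the lecture code:

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between the two layers
        total += n_out         # one bias per unit in the receiving layer
    return total

print(count_parameters([2, 4, 1]))     # 2*4 + 4 + 4*1 + 1 = 17
print(count_parameters([2, 8, 8, 1]))  # (16+8) + (64+8) + (8+1) = 105
```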

#### 💻 Code Implementation

In [None]:
class MLP:
    """Multi-Layer Perceptron implementation from scratch"""
    
    def __init__(self, layer_sizes, activation='relu', output_activation='sigmoid'):
        """
        Initialize MLP
        layer_sizes: list of layer dimensions [input_size, hidden1, hidden2, ..., output_size]
        """
        self.layer_sizes = layer_sizes
        self.n_layers = len(layer_sizes)
        self.activation = activation
        self.output_activation = output_activation
        self.af = ActivationFunctions()
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers - 1):
            # He initialization for ReLU; 1/fan_in scaling (a Xavier/LeCun-style variant) for others
            if activation == 'relu':
                w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            else:
                w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(1.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            
            self.weights.append(w)
            self.biases.append(b)
        
        # Storage for forward pass (needed for backprop)
        self.activations = []
        self.z_values = []
    
    def forward(self, X):
        """Forward propagation"""
        self.activations = [X]
        self.z_values = []
        
        current_input = X
        
        for i in range(self.n_layers - 1):
            z = np.dot(current_input, self.weights[i]) + self.biases[i]
            self.z_values.append(z)
            
            # Apply activation function
            if i == self.n_layers - 2:  # Output layer
                if self.output_activation == 'sigmoid':
                    a = self.af.sigmoid(z)
                elif self.output_activation == 'softmax':
                    a = self.softmax(z)
                else:
                    a = z  # Linear
            else:  # Hidden layers
                if self.activation == 'relu':
                    a = self.af.relu(z)
                elif self.activation == 'tanh':
                    a = self.af.tanh(z)
                elif self.activation == 'sigmoid':
                    a = self.af.sigmoid(z)
                else:
                    a = z
            
            self.activations.append(a)
            current_input = a
        
        return self.activations[-1]
    
    def backward(self, X, y, learning_rate=0.01):
        """Backpropagation algorithm"""
        m = X.shape[0]
        
        # Compute output layer gradient
        delta = self.activations[-1] - y
        
        # Backpropagate through layers
        for i in range(self.n_layers - 2, -1, -1):
            # Compute gradients
            dW = (1/m) * np.dot(self.activations[i].T, delta)
            db = (1/m) * np.sum(delta, axis=0, keepdims=True)
            
            # Propagate delta to the previous layer BEFORE updating weights[i],
            # so backprop uses the same weights as the forward pass
            if i > 0:
                delta = np.dot(delta, self.weights[i].T)
                # Apply activation derivative
                if self.activation == 'relu':
                    delta *= self.af.relu_derivative(self.z_values[i-1])
                elif self.activation == 'tanh':
                    delta *= self.af.tanh_derivative(self.z_values[i-1])
                elif self.activation == 'sigmoid':
                    delta *= self.af.sigmoid_derivative(self.z_values[i-1])
            
            # Update weights and biases (dW and db were computed from this layer's delta above)
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db
    
    def train(self, X, y, epochs=100, learning_rate=0.01, verbose=True):
        """Train the network"""
        losses = []
        
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)
            
            # Compute loss (binary cross-entropy for binary classification)
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            losses.append(loss)
            
            # Backward pass
            self.backward(X, y, learning_rate)
            
            if verbose and epoch % 10 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")
        
        return losses
    
    def predict(self, X):
        """Make predictions"""
        output = self.forward(X)
        return (output > 0.5).astype(int)
    
    def softmax(self, z):
        """Softmax activation for multi-class"""
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

# Test MLP on XOR
mlp_xor = MLP(layer_sizes=[2, 4, 1], activation='tanh', output_activation='sigmoid')
print(f"Created MLP with architecture: {mlp_xor.layer_sizes}")
print(f"Number of parameters: {sum([w.size for w in mlp_xor.weights]) + sum([b.size for b in mlp_xor.biases])}")

### Exercise 5: Solving XOR with MLP

#### 📚 Concept
Now we'll demonstrate that an MLP with hidden layers CAN solve the XOR problem that stumped our linear models.

#### 💻 Code Implementation

In [None]:
# Train MLP on XOR
X_xor, y_xor = create_xor_data()
y_xor = y_xor.reshape(-1, 1)

# Create and train network
mlp_xor = MLP(layer_sizes=[2, 4, 1], activation='tanh', output_activation='sigmoid')
losses = mlp_xor.train(X_xor, y_xor, epochs=1000, learning_rate=0.5, verbose=False)

# Make predictions
predictions = mlp_xor.predict(X_xor)
accuracy = np.mean(predictions == y_xor)

print(f"✅ MLP Accuracy on XOR: {accuracy:.2%}")
print("\nPredictions:")
for i, (x, y_true, y_pred) in enumerate(zip(X_xor, y_xor.flatten(), predictions.flatten())):
    print(f"  Input: {x}, True: {y_true}, Predicted: {y_pred}")

# Plot training loss
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(losses, linewidth=2)
plt.title('Training Loss', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Binary Cross-Entropy Loss')
plt.grid(True, alpha=0.3)

# Visualize decision boundary
plt.subplot(1, 2, 2)
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = mlp_xor.forward(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
plt.colorbar(label='Output')
colors = ['red' if y == 0 else 'blue' for y in y_xor.flatten()]
plt.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=200, edgecolors='black', linewidth=2)
plt.title('MLP Decision Boundary', fontsize=14, fontweight='bold')
plt.xlabel('x₁')
plt.ylabel('x₂')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎉 SUCCESS! The MLP learned a non-linear decision boundary to solve XOR!")

---
### Exercise 6: Visualizing Hidden Layer Representations

#### 📚 Concept
Hidden layers learn to transform the input space into representations where the data becomes linearly separable. Let's visualize what the hidden layer learns.
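
One way to build intuition before looking at learned activations is a hand-constructed hidden layer. The sketch below uses two hand-picked threshold units (OR and AND), which are assumptions chosen for illustration rather than weights a trained network would necessarily find, and shows that XOR becomes linearly separable in the resulting space.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-picked hidden units (an illustration, not learned weights):
# h1 fires for OR(x1, x2), h2 fires for AND(x1, x2)
h1 = step(X[:, 0] + X[:, 1] - 0.5)
h2 = step(X[:, 0] + X[:, 1] - 1.5)
hidden = np.column_stack([h1, h2])

# In (h1, h2) space a single line separates the classes: h1 - h2 - 0.5 = 0
output = step(h1 - h2 - 0.5)

print(hidden)
print(output)
```

The inputs (0,1) and (1,0) map to the same hidden point (1,0), so the two positive examples sit together on one side of a single line.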

#### 💻 Code Implementation

In [None]:
# Get hidden layer activations
_ = mlp_xor.forward(X_xor)
hidden_activations = mlp_xor.activations[1]  # First hidden layer

print("Hidden Layer Activations (4 neurons):")
print(hidden_activations)
print(f"\nShape: {hidden_activations.shape}")

# Create interactive 3D visualization with Plotly
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Original XOR Space', 'Hidden Layer Space'),
    specs=[[{'type': 'scatter'}, {'type': 'scatter3d'}]]
)

# Original 2D space
colors_plotly = ['red' if y == 0 else 'blue' for y in y_xor.flatten()]
fig.add_trace(
    go.Scatter(x=X_xor[:, 0], y=X_xor[:, 1],
               mode='markers+text',
               marker=dict(size=15, color=colors_plotly, line=dict(width=2, color='black')),
               text=[f'({int(x)},{int(y)})' for x, y in X_xor],
               textposition='top center',
               name='Original'),
    row=1, col=1
)

# Hidden layer 3D space (using first 3 hidden neurons)
if hidden_activations.shape[1] >= 3:
    fig.add_trace(
        go.Scatter3d(x=hidden_activations[:, 0],
                     y=hidden_activations[:, 1],
                     z=hidden_activations[:, 2],
                     mode='markers+text',
                     marker=dict(size=10, color=colors_plotly, line=dict(width=2, color='black')),
                     text=[f'Point {i}' for i in range(4)],
                     name='Hidden'),
        row=1, col=2
    )

fig.update_layout(height=500, title_text="Feature Space Transformation", showlegend=False)
fig.update_xaxes(title_text="x₁", row=1, col=1)
fig.update_yaxes(title_text="x₂", row=1, col=1)
fig.show()

print("\n🔍 Notice how the hidden layer transforms the space!")
print("The points that were not linearly separable in 2D might become separable in the hidden layer space.")

---
## Part 3: Practical Applications

Let's apply our MLP to more complex datasets!

### Exercise 7: Non-linear Classification - Moons Dataset

#### 📚 Concept
Real-world data often has complex, non-linear patterns. The moons dataset is a classic example of interleaving crescent shapes that require non-linear decision boundaries.

#### 💻 Code Implementation

In [None]:
# Generate moons dataset
from sklearn.datasets import make_moons
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, test_size=0.3, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Visualize dataset
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', alpha=0.7, edgecolors='black')
plt.title('Training Data', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')

# Train MLP
mlp_moons = MLP(layer_sizes=[2, 8, 8, 1], activation='relu', output_activation='sigmoid')
y_train_reshaped = y_train.reshape(-1, 1)

print("Training MLP on Moons dataset...")
losses = mlp_moons.train(X_train_scaled, y_train_reshaped, epochs=200, learning_rate=0.1, verbose=False)

# Evaluate
train_pred = mlp_moons.predict(X_train_scaled)
test_pred = mlp_moons.predict(X_test_scaled)

train_acc = np.mean(train_pred.flatten() == y_train)
test_acc = np.mean(test_pred.flatten() == y_test)

print(f"\nResults:")
print(f"Training Accuracy: {train_acc:.2%}")
print(f"Testing Accuracy: {test_acc:.2%}")

# Plot loss curve
plt.subplot(1, 3, 2)
plt.plot(losses, linewidth=2, color='orange')
plt.title('Training Loss', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)

# Plot decision boundary
plt.subplot(1, 3, 3)
x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = mlp_moons.forward(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.4)
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train, 
           cmap='viridis', edgecolors='black', linewidth=1)
plt.title('Decision Boundary', fontsize=14, fontweight='bold')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.colorbar(label='Prediction')

plt.tight_layout()
plt.show()

#### 🎯 Your Turn

**Task:** Experiment with different architectures and activation functions. Try:
1. Changing the number of hidden layers (try [2, 16, 1] vs [2, 8, 8, 8, 1])
2. Using different activation functions (relu vs tanh)
3. Adjusting the learning rate

What combination gives the best test accuracy?

In [None]:
# TODO: Your experiments here
# Try different architectures:
# architectures = [
#     [2, 16, 1],
#     [2, 8, 8, 1],
#     [2, 8, 8, 8, 1],
#     [2, 32, 16, 8, 1]
# ]

# for arch in architectures:
#     mlp = MLP(layer_sizes=arch, activation='relu')
#     # Train and evaluate...

---
### Exercise 8: Comparing Activation Functions

#### 📚 Concept
Different activation functions have different properties:
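- **Sigmoid**: Smooth, bounded in (0, 1), but saturates for large |z|, which leads to vanishing gradients
- **Tanh**: Zero-centered, bounded in (-1, 1), but also saturates and suffers from vanishing gradients
- **ReLU**: Cheap to compute and does not saturate for positive inputs, which mitigates vanishing gradients, but units can "die" (output 0 for every input)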
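
To put a number on the vanishing-gradient issue, the sketch below finds the largest value the sigmoid derivative can take and raises it to the power of the network depth. The depth of 10 is an arbitrary assumption, and the weight matrices (which also scale gradients) are deliberately ignored, so this is only a rough best-case bound on the contribution of the activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 2001)
sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))

# The sigmoid derivative peaks at z = 0 with value 0.25
max_grad = sigmoid_grad.max()

# Best-case factor contributed by the activations alone across 10 sigmoid layers
# (ignores the weight matrices, which also scale the gradient)
depth = 10
best_case_factor = max_grad ** depth

print(f"max sigmoid'(z) = {max_grad:.2f}")
print(f"best-case activation factor after {depth} layers = {best_case_factor:.2e}")
```

Even in the best case the activation derivatives alone shrink the signal by roughly six orders of magnitude over ten layers, whereas the ReLU derivative is exactly 1 wherever the unit is active.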

#### 💻 Code Implementation

In [None]:
# Compare activation functions on circles dataset
X_circles, y_circles = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)

# Prepare data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_circles, y_circles, test_size=0.3, random_state=42
)
scaler_c = StandardScaler()
X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)
y_train_c = y_train_c.reshape(-1, 1)
y_test_c = y_test_c.reshape(-1, 1)

# Test different activations
activations = ['sigmoid', 'tanh', 'relu']
results = {}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Activation Function Comparison on Circles Dataset', fontsize=16, fontweight='bold')

for idx, activation in enumerate(activations):
    print(f"\nTraining with {activation} activation...")
    
    # Train model
    mlp = MLP(layer_sizes=[2, 16, 16, 1], activation=activation, output_activation='sigmoid')
    losses = mlp.train(X_train_c_scaled, y_train_c, epochs=200, learning_rate=0.1, verbose=False)
    
    # Evaluate
    train_pred = mlp.predict(X_train_c_scaled)
    test_pred = mlp.predict(X_test_c_scaled)
    train_acc = np.mean(train_pred == y_train_c)
    test_acc = np.mean(test_pred == y_test_c)
    
    results[activation] = {
        'train_acc': train_acc,
        'test_acc': test_acc,
        'final_loss': losses[-1]
    }
    
    # Plot loss curve
    ax = axes[0, idx]
    ax.plot(losses, linewidth=2, color=f'C{idx}')
    ax.set_title(f'{activation.upper()} - Loss Curve', fontsize=12, fontweight='bold')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.grid(True, alpha=0.3)
    ax.text(0.5, 0.95, f'Train Acc: {train_acc:.2%}\nTest Acc: {test_acc:.2%}',
            transform=ax.transAxes, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    # Plot decision boundary
    ax = axes[1, idx]
    x_min, x_max = X_train_c_scaled[:, 0].min() - 0.5, X_train_c_scaled[:, 0].max() + 0.5
    y_min, y_max = X_train_c_scaled[:, 1].min() - 0.5, X_train_c_scaled[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = mlp.forward(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, levels=20, cmap='coolwarm', alpha=0.4)
    ax.scatter(X_train_c_scaled[:, 0], X_train_c_scaled[:, 1], 
              c=y_train_c.flatten(), cmap='coolwarm', edgecolors='black', linewidth=1)
    ax.set_title(f'{activation.upper()} - Decision Boundary', fontsize=12, fontweight='bold')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Summary table
results_df = pd.DataFrame(results).T
results_df.columns = ['Train Accuracy', 'Test Accuracy', 'Final Loss']
print("\n📊 Results Summary:")
print(results_df.round(4))
print(f"\n🏆 Best activation: {results_df['Test Accuracy'].idxmax()} with {results_df['Test Accuracy'].max():.2%} test accuracy")

---
## Part 4: Advanced Topics

Let's explore gradient flow and backpropagation visualization.

### Exercise 9: Understanding Gradient Flow

#### 📚 Concept
Backpropagation uses the chain rule to compute gradients layer by layer. Understanding gradient flow helps diagnose training problems like vanishing or exploding gradients.
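
A common way to confirm that a chain-rule derivation is right is a numerical gradient check. The sketch below is a minimal version under stated assumptions: a tiny hypothetical network (2 inputs, 3 tanh units, 1 sigmoid output), binary cross-entropy loss, and random data. It compares the backpropagated gradient of the first-layer weights with a central-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical network: 2 inputs -> 3 tanh units -> 1 sigmoid output
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))

def loss_fn(W1_candidate):
    """Binary cross-entropy as a function of the first-layer weights."""
    a1 = np.tanh(X @ W1_candidate + b1)
    a2 = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# Analytic gradient via the chain rule (backpropagation)
z1 = X @ W1 + b1
a1 = np.tanh(z1)
a2 = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
delta2 = (a2 - y) / X.shape[0]            # dL/dz2 for sigmoid + cross-entropy
delta1 = (delta2 @ W2.T) * (1 - a1 ** 2)  # chain rule through tanh
dW1_analytic = X.T @ delta1

# Numerical gradient via central differences
eps = 1e-5
dW1_numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW1_numeric[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(dW1_analytic - dW1_numeric))
print(f"Largest difference between analytic and numeric gradient: {max_diff:.2e}")
```

If the two gradients disagree by more than a very small amount, the backward pass most likely contains a bug.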

#### 💻 Code Implementation

In [None]:
class MLPWithGradientTracking(MLP):
    """Extended MLP that tracks gradients for visualization"""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gradient_history = []
    
    def backward(self, X, y, learning_rate=0.01):
        """Backpropagation with gradient tracking"""
        m = X.shape[0]
        gradients = []
        
        # Compute output layer gradient
        delta = self.activations[-1] - y
        
        # Backpropagate through layers
        for i in range(self.n_layers - 2, -1, -1):
            # Compute gradients
            dW = (1/m) * np.dot(self.activations[i].T, delta)
            db = (1/m) * np.sum(delta, axis=0, keepdims=True)
            
            # Store gradient magnitudes
            grad_norm = np.linalg.norm(dW)
            gradients.append(grad_norm)
            
            # Propagate delta to the previous layer BEFORE updating weights[i],
            # so backprop uses the same weights as the forward pass
            if i > 0:
                delta = np.dot(delta, self.weights[i].T)
                if self.activation == 'relu':
                    delta *= self.af.relu_derivative(self.z_values[i-1])
                elif self.activation == 'tanh':
                    delta *= self.af.tanh_derivative(self.z_values[i-1])
                elif self.activation == 'sigmoid':
                    delta *= self.af.sigmoid_derivative(self.z_values[i-1])
            
            # Update weights and biases (dW and db were computed from this layer's delta above)
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db
        
        self.gradient_history.append(gradients[::-1])  # Reverse to match layer order

# Train network with gradient tracking
print("Training network with gradient tracking...")
mlp_grad = MLPWithGradientTracking(layer_sizes=[2, 8, 8, 1], activation='sigmoid')
X_sample, y_sample = make_moons(n_samples=100, noise=0.1, random_state=42)
y_sample = y_sample.reshape(-1, 1)

# Manual training loop to track gradients
for epoch in range(100):
    output = mlp_grad.forward(X_sample)
    mlp_grad.backward(X_sample, y_sample, learning_rate=0.5)

# Visualize gradient flow
gradient_history = np.array(mlp_grad.gradient_history)

plt.figure(figsize=(14, 5))

# Plot 1: Gradient magnitude over time
plt.subplot(1, 2, 1)
for layer in range(gradient_history.shape[1]):
    plt.plot(gradient_history[:, layer], label=f'Layer {layer+1}', linewidth=2)
plt.title('Gradient Magnitude During Training', fontsize=14, fontweight='bold')
plt.xlabel('Training Step')
plt.ylabel('Gradient Norm')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 2: Gradient heatmap
plt.subplot(1, 2, 2)
plt.imshow(gradient_history.T, aspect='auto', cmap='viridis', interpolation='nearest')
plt.colorbar(label='Gradient Magnitude')
plt.title('Gradient Flow Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Training Step')
plt.ylabel('Layer')
plt.yticks(range(gradient_history.shape[1]), [f'Layer {i+1}' for i in range(gradient_history.shape[1])])

plt.tight_layout()
plt.show()

print("\n📊 Gradient Flow Analysis:")
print(f"• Average gradient (first layer): {gradient_history[:, 0].mean():.6f}")
print(f"• Average gradient (last layer): {gradient_history[:, -1].mean():.6f}")
print(f"• Gradient ratio (last/first): {gradient_history[:, -1].mean() / gradient_history[:, 0].mean():.2f}")
print("\n💡 Healthy gradient flow shows relatively consistent magnitudes across layers.")
print("   Vanishing gradients: earlier layers have much smaller gradients.")
print("   Exploding gradients: gradients grow exponentially through layers.")

---
### Exercise 10: Mini-batch Gradient Descent

#### 📚 Concept
Gradient descent variants differ in how much data each weight update uses:
- **Batch GD**: Uses all data per update; stable but slow
- **SGD**: Uses one sample per update; fast but noisy
- **Mini-batch**: Uses a small batch per update; a good balance between the two
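
The trade-off is easier to see as a count of weight updates per epoch. The sketch below assumes a dataset of 500 samples and a handful of illustrative batch sizes; the last partial batch still counts as one update.

```python
import math

n_samples = 500  # assumed dataset size for this illustration
batch_size_options = [1, 16, 32, 64, 500]

# Each batch triggers one weight update, including a final partial batch
updates_per_epoch = {b: math.ceil(n_samples / b) for b in batch_size_options}

for b, n_updates in updates_per_epoch.items():
    print(f"batch_size={b:>3} -> {n_updates} weight updates per epoch")
```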

#### 💻 Code Implementation

In [None]:
def train_with_minibatch(mlp, X, y, epochs=100, batch_size=32, learning_rate=0.01):
    """Train MLP using mini-batch gradient descent"""
    n_samples = X.shape[0]
    losses = []
    
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        epoch_losses = []
        
        # Mini-batch training
        for start_idx in range(0, n_samples, batch_size):
            end_idx = min(start_idx + batch_size, n_samples)
            X_batch = X_shuffled[start_idx:end_idx]
            y_batch = y_shuffled[start_idx:end_idx]
            
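            # Forward pass and binary cross-entropy loss (measured before the update)
            output = mlp.forward(X_batch)
            loss = -np.mean(y_batch * np.log(output + 1e-8) +
                           (1 - y_batch) * np.log(1 - output + 1e-8))
            # Weight by batch size so a smaller final batch does not skew the epoch average
            epoch_losses.append(loss * len(X_batch))
            # Backward pass and parameter update
            mlp.backward(X_batch, y_batch, learning_rate)
        
        losses.append(np.sum(epoch_losses) / n_samples)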
        
        if epoch % 20 == 0:
            print(f"Epoch {epoch}, Loss: {losses[-1]:.4f}")
    
    return losses

# Compare batch sizes
X_train_mb, y_train_mb = make_classification(n_samples=500, n_features=2, n_redundant=0,
                                            n_informative=2, n_clusters_per_class=2,
                                            random_state=42)
y_train_mb = y_train_mb.reshape(-1, 1)

batch_sizes = [1, 16, 32, 64, 500]  # 500 = full batch
results = {}

plt.figure(figsize=(15, 4))

for idx, batch_size in enumerate(batch_sizes):
    print(f"\nTraining with batch size: {batch_size}")
    
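    # Create a new network for each batch size. Re-seeding gives every run the
    # same starting weights, assuming MLP draws them from NumPy's global RNG.
    np.random.seed(42)
    mlp_mb = MLP(layer_sizes=[2, 16, 1], activation='relu')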
    
    # Train
    losses = train_with_minibatch(mlp_mb, X_train_mb, y_train_mb, 
                                 epochs=100, batch_size=batch_size, 
                                 learning_rate=0.01)
    
    results[f'Batch {batch_size}'] = losses
    
    # Plot
    plt.subplot(1, 5, idx+1)
    plt.plot(losses, linewidth=2)
    plt.title(f'Batch Size: {batch_size}', fontsize=12, fontweight='bold')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.grid(True, alpha=0.3)
    plt.ylim([0, max(losses[10:]) * 1.1])  # Zoom in after initial epochs

plt.suptitle('Effect of Batch Size on Training', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Batch Size Comparison:")
print("• Batch Size 1 (SGD): Noisy gradients, 500 updates per epoch")
print("• Batch Size 16-64: Good balance of speed and stability")
print("• Batch Size 500 (Full): Smooth, but only one update per epoch")

#### 🎯 Your Turn

**Task:** Implement learning rate scheduling - reduce the learning rate as training progresses.

Common schedules (lr0 is the initial learning rate):
1. Step decay: lr = lr0 * 0.9^(epoch // 10), i.e. multiply by 0.9 every 10 epochs
2. Exponential decay: lr = lr0 * exp(-decay * epoch)
3. 1/t decay: lr = lr0 / (1 + decay * epoch)
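
For a sense of scale, the three schedules can be evaluated at a few epochs. This sketch assumes an initial rate of 0.1 and a decay constant of 0.01 (both illustrative choices) and applies the decay to the initial rate rather than compounding it each epoch; `scheduled_lr` is a hypothetical helper.

```python
import math

def scheduled_lr(epoch, initial_lr=0.1, schedule="step", decay=0.01,
                 drop=0.9, epochs_per_drop=10):
    """Return the learning rate for a given epoch under the chosen schedule."""
    if schedule == "step":
        # Multiply by `drop` once every `epochs_per_drop` epochs
        return initial_lr * drop ** (epoch // epochs_per_drop)
    if schedule == "exponential":
        return initial_lr * math.exp(-decay * epoch)
    if schedule == "1/t":
        return initial_lr / (1.0 + decay * epoch)
    raise ValueError(f"Unknown schedule: {schedule}")

for epoch in (0, 25, 50, 99):
    rates = {name: scheduled_lr(epoch, schedule=name)
             for name in ("step", "exponential", "1/t")}
    summary = ", ".join(f"{name}={lr:.4f}" for name, lr in rates.items())
    print(f"epoch {epoch:2d}: {summary}")
```

All three start at the same rate and shrink it over time; they differ in how quickly and how smoothly the rate falls.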

In [None]:
# TODO: Implement learning rate scheduling
def train_with_lr_schedule(mlp, X, y, epochs=100, initial_lr=0.1, schedule='step'):
    """Train with learning rate scheduling"""
    losses = []
    lrs = []
    
    for epoch in range(epochs):
        # Calculate current learning rate
        if schedule == 'step':
            lr = initial_lr * (0.9 ** (epoch // 10))
        elif schedule == 'exponential':
            lr = initial_lr * np.exp(-0.01 * epoch)
        elif schedule == '1/t':
            lr = initial_lr / (1 + 0.01 * epoch)
        else:
            lr = initial_lr
        
        lrs.append(lr)
        
        # Your training code here
        # output = mlp.forward(X)
        # loss = ...
        # losses.append(loss)
        # mlp.backward(X, y, lr)
        
    return losses, lrs

# Test your implementation
# mlp_schedule = MLP(layer_sizes=[2, 16, 1], activation='relu')
# losses, lrs = train_with_lr_schedule(mlp_schedule, X_train_mb, y_train_mb)

---
## Part 5: Summary and Final Exercises

### 🎓 Key Takeaways

1. **Why Neural Networks?**
   - Linear models cannot solve XOR and other non-linearly separable problems
   - Hidden layers learn feature transformations
   - Multiple layers create hierarchical representations

2. **Architecture Components:**
   - **Neurons**: Basic computational units (weighted sum + activation)
   - **Activation Functions**: Introduce non-linearity
   - **Layers**: Transform representations progressively
   - **Weights & Biases**: Learnable parameters

3. **Learning Process:**
   - **Forward Propagation**: Compute predictions layer by layer
   - **Loss Function**: Measure prediction error
   - **Backpropagation**: Compute gradients using chain rule
   - **Gradient Descent**: Update weights to minimize loss

4. **Practical Considerations:**
   - **Initialization**: He/Xavier initialization for stable training
   - **Activation Choice**: ReLU for hidden layers, sigmoid/softmax for output
   - **Architecture Design**: Depth vs width tradeoff
   - **Training Techniques**: Mini-batch, learning rate scheduling
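
As a reminder of why initialization matters, the sketch below pushes random inputs through ten ReLU layers and compares the spread of the final activations under three weight scales. The layer width, depth, and the helper names (`he_normal`, `xavier_normal`, `small_normal`, `final_activation_std`) are illustrative assumptions, not code from earlier in this notebook.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """He initialization: std = sqrt(2 / fan_in), commonly paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, rng):
    """Xavier/Glorot initialization: std = sqrt(2 / (fan_in + fan_out))."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def small_normal(fan_in, fan_out, rng):
    """Naive baseline: a fixed small std regardless of layer size."""
    return rng.normal(0.0, 0.01, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
layer_sizes = [256] * 11            # ten weight matrices of shape 256 x 256
x = rng.normal(size=(1000, 256))    # random input batch

def final_activation_std(init_fn):
    """Forward pass through ReLU layers; return the std of the last activations."""
    a = x
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        a = np.maximum(0.0, a @ init_fn(fan_in, fan_out, rng))
    return float(a.std())

stds = {
    "he": final_activation_std(he_normal),
    "xavier": final_activation_std(xavier_normal),
    "small": final_activation_std(small_normal),
}
for name, value in stds.items():
    print(f"{name:>6}: final activation std = {value:.3e}")
```

With ReLU layers, He initialization keeps the activation scale roughly constant with depth, Xavier lets it shrink gradually, and the fixed small scale collapses the signal almost to zero.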

---

### 📝 Final Exercise: Build Your Own Neural Network Library

Create a complete neural network library with:
1. Multiple activation functions
2. Different weight initialization schemes
3. Various optimizers (SGD, Momentum, Adam)
4. Regularization (L1, L2, Dropout)
5. Early stopping
6. Model saving/loading
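
One way to approach item 3 is to give every optimizer the same `step(w, grad)` interface, so the training loop does not need to know which update rule is in use. The sketch below shows plain SGD, classical momentum, and Adam (with bias-corrected moment estimates) under that assumed interface. The class names, the `step` method, and the hyperparameter defaults are illustrative choices, not part of the notebook's `MLP` class.

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.1):
        self.lr = lr

    def step(self, w, grad):
        return w - self.lr * grad

class Momentum:
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta = lr, beta
        self.velocity = None

    def step(self, w, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(w)
        # Velocity accumulates an exponentially decaying sum of past updates
        self.velocity = self.beta * self.velocity - self.lr * grad
        return w + self.velocity

class Adam:
    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None  # running mean of gradients
        self.v = None  # running mean of squared gradients
        self.t = 0     # step counter for bias correction

    def step(self, w, grad):
        if self.m is None:
            self.m = np.zeros_like(w)
            self.v = np.zeros_like(w)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# Toy problem: minimize 0.5 * ||w - target||^2, whose gradient is (w - target)
target = np.array([3.0, -2.0])

def grad_fn(w):
    return w - target

final = {}
for opt in (SGD(), Momentum(), Adam()):
    w = np.zeros(2)
    for _ in range(500):
        w = opt.step(w, grad_fn(w))
    final[type(opt).__name__] = w
    print(f"{type(opt).__name__:>8}: w = {w}")
```

In a full library, each layer's weights and biases would need their own optimizer state (velocity or moment estimates), for example one optimizer instance per parameter array.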

In [None]:
# Your complete neural network library
class NeuralNetwork:
    """Your enhanced neural network implementation"""
    
    def __init__(self, architecture, activation='relu', optimizer='sgd',
                 regularization=None, reg_lambda=0.01):
        """Initialize your network"""
        # TODO: Your implementation
        pass
    
    def add_layer(self, units, activation=None):
        """Add a layer to the network"""
        # TODO: Your implementation
        pass
    
    def compile(self, loss='binary_crossentropy', metrics=('accuracy',)):
        """Compile the model"""
        # TODO: Your implementation
        pass
    
    def fit(self, X, y, epochs=100, batch_size=32, validation_split=0.2,
            callbacks=None):
        """Train the model"""
        # TODO: Your implementation
        pass
    
    def predict(self, X):
        """Make predictions"""
        # TODO: Your implementation
        pass
    
    def save(self, filepath):
        """Save model weights"""
        # TODO: Your implementation
        pass
    
    def load(self, filepath):
        """Load model weights"""
        # TODO: Your implementation
        pass

print("🎉 Congratulations! You've completed the MLP tutorial!")
print("\n📚 Next Steps:")
print("1. Implement the complete neural network library above")
print("2. Try on real datasets (MNIST, Fashion-MNIST)")
print("3. Explore convolutional neural networks (CNNs)")
print("4. Learn about recurrent neural networks (RNNs)")
print("5. Dive into modern architectures (Transformers, GANs)")

---
### 🔗 Resources for Further Learning

1. **Books:**
   - Deep Learning by Goodfellow, Bengio, and Courville
   - Neural Networks and Deep Learning by Michael Nielsen

2. **Courses:**
   - Andrew Ng's Deep Learning Specialization
   - Fast.ai Practical Deep Learning

3. **Frameworks to Explore:**
   - PyTorch
   - TensorFlow/Keras
   - JAX

4. **Papers:**
   - Backpropagation: Rumelhart, Hinton, and Williams (1986)
   - Universal Approximation: Cybenko (1989)
   - Deep Learning Review: LeCun, Bengio, Hinton (2015)

---

**Thank you for learning with us!** 🚀

**Author:** Ho-min Park  
**Contact:** homin.park@ghent.ac.kr