# Module 4 - Exercise 2: Feedforward Neural Networks

<a href="https://colab.research.google.com/github/jumpingsphinx/jumpingsphinx.github.io/blob/main/notebooks/module4-neural-networks/exercise2-feedforward-networks.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives

By the end of this exercise, you will be able to:

- Build multi-layer feedforward networks
- Understand forward propagation through layers
- Implement different activation functions
- Solve non-linearly separable problems (like XOR)
- Visualize network architectures and activations
- Initialize weights properly

## Prerequisites

- Completion of Exercise 1 (Perceptron)
- Understanding of matrix operations
- Familiarity with activation functions

## Setup

Run this cell first to import required libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

print("NumPy version:", np.__version__)
print("Setup complete!")

---

## Part 1: Building a 2-Layer Neural Network

### Background

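A 2-layer neural network (one hidden layer) applies two affine transformations, each followed by an activation function $f^{[l]}$. For a single input column vector $\mathbf{x}$, with weights $\mathbf{W}^{[l]}$ and biases $\mathbf{b}^{[l]}$ for layer $l$: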

**Layer 1 (Input → Hidden)**:
$$\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}$$
$$\mathbf{a}^{[1]} = f^{[1]}(\mathbf{z}^{[1]})$$

**Layer 2 (Hidden → Output)**:
$$\mathbf{z}^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}$$
$$\mathbf{a}^{[2]} = f^{[2]}(\mathbf{z}^{[2]})$$
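
To make the shapes in these equations concrete, here is a minimal NumPy sketch of one forward pass. The sizes (3 inputs, 4 hidden units, 5 examples), the variable names, and the choice of ReLU for $f^{[1]}$ and sigmoid for $f^{[2]}$ are illustrative assumptions. The inputs are stacked as columns of a matrix `X`, so the same equations process all examples at once.

```python
import numpy as np

rng = np.random.default_rng(0)

n_input, n_hidden, m = 3, 4, 5            # illustrative sizes (assumed)
X = rng.standard_normal((n_input, m))     # one example per column

# Parameters: small random weights, zero biases
W1 = rng.standard_normal((n_hidden, n_input)) * 0.01
b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((1, n_hidden)) * 0.01
b2 = np.zeros((1, 1))

# Layer 1 (input -> hidden)
Z1 = W1 @ X + b1                # shape (n_hidden, m); b1 broadcasts across columns
A1 = np.maximum(0, Z1)          # f^[1] = ReLU (assumed choice)

# Layer 2 (hidden -> output)
Z2 = W2 @ A1 + b2               # shape (1, m)
A2 = 1 / (1 + np.exp(-Z2))      # f^[2] = sigmoid, so outputs lie in (0, 1)

print("Z1:", Z1.shape, "A1:", A1.shape, "Z2:", Z2.shape, "A2:", A2.shape)
```

Each weight matrix has shape (units in this layer, units in the previous layer), which is what makes the matrix products line up.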

### Exercise 1.1: Implement a 2-Layer Network

**Task:** Complete the NeuralNetwork class with forward propagation.

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    """ReLU activation function."""
    return np.maximum(0, z)

def tanh(z):
    """Tanh activation function."""
    return np.tanh(z)

class NeuralNetwork:
    """
    2-layer neural network for binary classification.
    
    Architecture: [n_input, n_hidden, 1]
    """
    
    def __init__(self, n_input, n_hidden, hidden_activation='relu'):
        """
        Initialize network parameters.
        
        Parameters:
        -----------
        n_input : int
            Number of input features
        n_hidden : int
            Number of neurons in hidden layer
        hidden_activation : str
            Activation function for hidden layer ('relu', 'sigmoid', 'tanh')
        """
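        # Initialize weights with small random values and biases with zeros
        self.W1 = np.random.randn(n_hidden, n_input) * 0.01
        self.b1 = np.zeros((n_hidden, 1))
        self.W2 = np.random.randn(1, n_hidden) * 0.01
        self.b2 = np.zeros((1, 1))

        # Look up the hidden-layer activation function by name
        activations = {'relu': relu, 'sigmoid': sigmoid, 'tanh': tanh}
        if hidden_activation not in activations:
            raise ValueError(f"Unknown activation: {hidden_activation}")
        self.hidden_activation = activations[hidden_activation]

    def forward(self, X):
        """
        Forward propagation through the network.
        
        Parameters:
        -----------
        X : ndarray, shape (n_input, m)
            Input data
        
        Returns:
        --------
        A2 : ndarray, shape (1, m)
            Output predictions
        cache : dict
            Cached values for visualization
        """
        # Layer 1: Input → Hidden
        Z1 = np.dot(self.W1, X) + self.b1
        A1 = self.hidden_activation(Z1)

        # Layer 2: Hidden → Output (sigmoid turns Z2 into a probability)
        Z2 = np.dot(self.W2, A1) + self.b2
        A2 = sigmoid(Z2)

        cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
        return A2, cache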
    
    def predict(self, X):
        """
        Make binary predictions.
        
        Parameters:
        -----------
        X : ndarray
            Input data
        
        Returns:
        --------
        predictions : ndarray
            Binary predictions (0 or 1)
        """
        A2, _ = self.forward(X)
        return (A2 > 0.5).astype(int)
# Test the network
print("Testing 2-Layer Neural Network")
print("=" * 60)

nn = NeuralNetwork(n_input=2, n_hidden=4, hidden_activation='relu')

# Test with random data
X_test = np.random.randn(2, 5)
A2, cache = nn.forward(X_test)

print(f"Input shape: {X_test.shape}")
print(f"Hidden activations shape: {cache['A1'].shape}")
print(f"Output shape: {A2.shape}")
print(f"\nOutput probabilities:\n{A2}")
print(f"\nBinary predictions:\n{nn.predict(X_test)}")

---

## Part 2: Solving XOR with a 2-Layer Network

### Background

Recall that a single perceptron cannot solve XOR. A 2-layer network can!

The key insight: the hidden layer learns a new representation where XOR becomes linearly separable.
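
The claim that a single perceptron cannot solve XOR can be probed numerically. The sketch below tries every combination of weights and bias on a coarse grid for a single threshold unit and reports whether any combination reproduces a given truth table. The helper `separable_on_grid`, the grid range, and the array names are assumptions made for illustration. A grid search is a sanity check, not a proof, although for XOR the result also follows from a short algebraic argument.

```python
import itertools
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # one input pair per row
labels_and = np.array([0, 0, 0, 1])
labels_xor = np.array([0, 1, 1, 0])

def separable_on_grid(points, labels, grid):
    """Return True if some (w1, w2, b) on the grid classifies every point correctly."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        pred = (points @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(pred, labels):
            return True
    return False

grid = np.linspace(-2, 2, 21)   # candidate values for w1, w2 and b (assumed range)

print("AND solvable by one threshold unit:", separable_on_grid(points, labels_and, grid))
print("XOR solvable by one threshold unit:", separable_on_grid(points, labels_xor, grid))
```

AND is found quickly (for example with both weights near 1 and a bias near -1.5), while no grid point works for XOR: the constraints on the four inputs contradict each other for any choice of weights.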

### Exercise 2.1: Manually Set Weights for XOR

**Task:** Set weights manually to solve XOR (to understand how it works).

In [None]:
# XOR dataset
X_xor = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]])
y_xor = np.array([[0, 1, 1, 0]])

print("XOR Truth Table:")
print("x1  x2  | XOR")
print("-" * 15)
for i in range(4):
    print(f"{X_xor[0, i]}   {X_xor[1, i]}   | {y_xor[0, i]}")

# Visualize XOR problem
plt.figure(figsize=(8, 6))
plt.scatter(X_xor[0, y_xor[0]==0], X_xor[1, y_xor[0]==0], 
           c='blue', s=200, edgecolors='k', marker='o', label='Class 0')
plt.scatter(X_xor[0, y_xor[0]==1], X_xor[1, y_xor[0]==1], 
           c='red', s=200, edgecolors='k', marker='s', label='Class 1')
plt.xlabel('$x_1$', fontsize=12)
plt.ylabel('$x_2$', fontsize=12)
plt.title('XOR Problem: Not Linearly Separable!', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.show()

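# Create network and manually set weights
# Sigmoid hidden units saturate near 0 or 1, so they behave like logic gates.
# (With ReLU, these weights would misclassify the input (1, 1).)
nn_xor = NeuralNetwork(n_input=2, n_hidden=2, hidden_activation='sigmoid')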

# Manually set weights to implement XOR
# Hidden layer neurons: h1 ≈ AND(x1, x2), h2 ≈ OR(x1, x2)
# Output: y ≈ AND(NOT h1, h2) = XOR(x1, x2)
nn_xor.W1 = np.array([[20, 20],   # First hidden neuron (AND-like)
                      [20, 20]])   # Second hidden neuron (OR-like)
nn_xor.b1 = np.array([[-30],       # High threshold for AND
                      [-10]])      # Low threshold for OR

nn_xor.W2 = np.array([[-20, 20]])  # Negative first, positive second
nn_xor.b2 = np.array([[-10]])

# Test the network
predictions, cache = nn_xor.forward(X_xor)

print("\nXOR Solution with Manual Weights:")
print("=" * 60)
print(f"{'x1':<5} {'x2':<5} {'True':<8} {'Predicted':<12} {'Correct'}")
print("-" * 60)
for i in range(4):
    x1, x2 = X_xor[:, i]
    true_y = y_xor[0, i]
    pred_y = predictions[0, i]
    correct = '✓' if (pred_y > 0.5) == true_y else '✗'
    print(f"{x1:<5.0f} {x2:<5.0f} {true_y:<8.0f} {pred_y:<12.4f} {correct}")

accuracy = np.mean((predictions > 0.5) == y_xor)
print(f"\nAccuracy: {accuracy:.2%}")

# Show hidden layer transformations
print("\nHidden Layer Activations:")
print("-" * 60)
print("(Shows how the network transforms the input)")
print(f"{'x1':<5} {'x2':<5} {'h1 (AND-like)':<15} {'h2 (OR-like)'}")
print("-" * 60)
for i in range(4):
    x1, x2 = X_xor[:, i]
    h1, h2 = cache['A1'][:, i]
    print(f"{x1:<5.0f} {x2:<5.0f} {h1:<15.4f} {h2:<15.4f}")

### Exercise 2.2: Visualize Hidden Layer Transformation

**Task:** Visualize how the hidden layer makes XOR linearly separable.

In [None]:
# Get hidden layer activations
predictions, cache = nn_xor.forward(X_xor)
H = cache['A1']  # Hidden activations (2 x 4)

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Original space (not linearly separable)
ax1.scatter(X_xor[0, y_xor[0]==0], X_xor[1, y_xor[0]==0],
           c='blue', s=200, edgecolors='k', marker='o', label='Class 0')
ax1.scatter(X_xor[0, y_xor[0]==1], X_xor[1, y_xor[0]==1],
           c='red', s=200, edgecolors='k', marker='s', label='Class 1')
ax1.set_xlabel('$x_1$', fontsize=14)
ax1.set_ylabel('$x_2$', fontsize=14)
ax1.set_title('Original Space\n(Not Linearly Separable)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.set_xlim(-0.5, 1.5)
ax1.set_ylim(-0.5, 1.5)

# Hidden space (linearly separable!)
ax2.scatter(H[0, y_xor[0]==0], H[1, y_xor[0]==0],
           c='blue', s=200, edgecolors='k', marker='o', label='Class 0')
ax2.scatter(H[0, y_xor[0]==1], H[1, y_xor[0]==1],
           c='red', s=200, edgecolors='k', marker='s', label='Class 1')

# Draw decision boundary in hidden space
h_vals = np.linspace(-0.5, 1.5, 100)
# W2[0] * h1 + W2[1] * h2 + b2 = 0
# h2 = -(W2[0] * h1 + b2) / W2[1]
boundary_h2 = -(nn_xor.W2[0, 0] * h_vals + nn_xor.b2[0, 0]) / nn_xor.W2[0, 1]
ax2.plot(h_vals, boundary_h2, 'k-', linewidth=2, label='Decision Boundary')

ax2.set_xlabel('$h_1$ (AND-like)', fontsize=14)
ax2.set_ylabel('$h_2$ (OR-like)', fontsize=14)
ax2.set_title('Hidden Layer Space\n(Linearly Separable!)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-0.5, 1.5)
ax2.set_ylim(-0.5, 1.5)

plt.tight_layout()
plt.show()

print("The Magic of Hidden Layers:")
print("=" * 60)
print("In the original space (left), XOR is not linearly separable.")
print("But in the hidden layer space (right), a simple line separates the classes!")
print("\nThis is why neural networks are so powerful:")
print("Hidden layers learn representations that make problems easier to solve.")

---

## Part 3: Network Architecture - Depth vs Width

### Exercise 3.1: Experiment with Network Width

**Task:** Test different numbers of hidden neurons on XOR.

In [None]:
# Your code here: Create networks with different hidden layer sizes
# Test sizes: 2, 4, 8, 16 neurons
# For each, count the total number of parameters

hidden_sizes = [2, 4, 8, 16]

print("Network Width Comparison:")
print("=" * 60)
print(f"{'Hidden Size':<15} {'Parameters':<15} {'Architecture'}")
print("-" * 60)

for n_hidden in hidden_sizes:
    nn = NeuralNetwork(n_input=2, n_hidden=n_hidden)
    
    # Count parameters
    # Your code here
    params_layer1 = n_hidden * 2 + n_hidden  # W1 + b1
    params_layer2 = 
    total_params = params_layer1 + params_layer2
    
    print(f"{n_hidden:<15} {total_params:<15} [2, {n_hidden}, 1]")

print("\nObservation: More neurons = more parameters = more capacity")
print("But: Also more risk of overfitting and slower training!")
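
The parameter count generalizes to any fully connected architecture: a layer with `n_in` inputs and `n_out` units contributes `n_out * n_in` weights plus `n_out` biases. The helper below, `count_parameters`, is a hypothetical name introduced for this sketch and is not part of the exercise code.

```python
def count_parameters(layer_sizes):
    """Total number of weights and biases in a fully connected network.

    layer_sizes lists the units per layer, e.g. [2, 4, 3, 1].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_out * n_in + n_out   # weight matrix (n_out x n_in) + bias vector
    return total

for sizes in ([2, 2, 1], [2, 16, 1], [2, 4, 3, 1]):
    print(f"{str(sizes):<15} {count_parameters(sizes)} parameters")
```

You can use it to check the totals you computed by hand in the loop above.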

### Exercise 3.2: Build a 3-Layer Network (2 Hidden Layers)

**Task:** Extend the network to have 2 hidden layers.

In [None]:
class DeepNeuralNetwork:
    """
    3-layer neural network (2 hidden layers) for binary classification.
    
    Architecture: [n_input, n_hidden1, n_hidden2, 1]
    """
    
    def __init__(self, n_input, n_hidden1, n_hidden2):
        # Your code here: Initialize 3 layers of weights and biases
        # Layer 1: input -> hidden1
        self.W1 = np.random.randn(n_hidden1, n_input) * 0.01
        self.b1 = np.zeros((n_hidden1, 1))
        
        # Layer 2: hidden1 -> hidden2
        # Your code here
        self.W2 = 
        self.b2 = 
        
        # Layer 3: hidden2 -> output
        # Your code here
        self.W3 = 
        self.b3 = 
    
    def forward(self, X):
        """
        Forward propagation through 3 layers.
        """
        # Layer 1
        Z1 = np.dot(self.W1, X) + self.b1
        A1 = relu(Z1)
        
        # Layer 2
        # Your code here
        Z2 = 
        A2 = relu(Z2)
        
        # Layer 3
        # Your code here
        Z3 = 
        A3 = sigmoid(Z3)
        
        cache = {'A1': A1, 'A2': A2, 'A3': A3}
        return A3, cache
    
    def predict(self, X):
        A3, _ = self.forward(X)
        return (A3 > 0.5).astype(int)

# Test the deep network
print("Testing 3-Layer Neural Network")
print("=" * 60)

deep_nn = DeepNeuralNetwork(n_input=2, n_hidden1=4, n_hidden2=3)

# Test on XOR
predictions, cache = deep_nn.forward(X_xor)

print(f"Architecture: [2, 4, 3, 1]")
print(f"Hidden layer 1 activations shape: {cache['A1'].shape}")
print(f"Hidden layer 2 activations shape: {cache['A2'].shape}")
print(f"Output shape: {predictions.shape}")
print(f"\nPredictions (untrained):\n{predictions}")

---

## Part 4: Testing on Non-Linearly Separable Datasets

### Exercise 4.1: Moons Dataset

**Task:** Test the network on the moons dataset.

In [None]:
# Generate moons dataset
X_moons, y_moons = make_moons(n_samples=200, noise=0.2, random_state=42)
X_moons = X_moons.T  # Shape: (2, 200)
y_moons = y_moons.reshape(1, -1)  # Shape: (1, 200)

# Visualize dataset
plt.figure(figsize=(10, 6))
plt.scatter(X_moons[0, y_moons[0]==0], X_moons[1, y_moons[0]==0],
           c='blue', alpha=0.7, edgecolors='k', label='Class 0')
plt.scatter(X_moons[0, y_moons[0]==1], X_moons[1, y_moons[0]==1],
           c='red', alpha=0.7, edgecolors='k', label='Class 1')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Moons Dataset (Non-linearly Separable)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Dataset Information:")
print(f"Shape: {X_moons.shape}")
print(f"Labels: {np.unique(y_moons)}")
print(f"Class distribution: {np.bincount(y_moons[0])}")

### Exercise 4.2: Circles Dataset

**Task:** Test on the circles dataset (even harder!).

In [None]:
# Your code here: Generate and visualize circles dataset
X_circles, y_circles = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
X_circles = X_circles.T
y_circles = y_circles.reshape(1, -1)

# Visualize
plt.figure(figsize=(10, 6))
# Your code here

print("Circles Dataset - Even More Challenging!")
print("Requires more complex decision boundaries.")

---

## Part 5: Comparing with sklearn's MLPClassifier

### Exercise 5.1: Train MLPClassifier on XOR

**Task:** Use sklearn's MLP to solve XOR.

In [None]:
from sklearn.neural_network import MLPClassifier

# Prepare data for sklearn, which expects X with shape (n_samples, n_features)
X_xor_sklearn = X_xor.T  # Shape: (4, 2)
y_xor_sklearn = y_xor.ravel()  # Shape: (4,)

# Create MLP classifier
# Your code here: Create MLPClassifier with architecture (4,) - one hidden layer with 4 neurons
mlp = MLPClassifier(hidden_layer_sizes=(4,), 
                   activation='relu',
                   max_iter=2000,
                   learning_rate_init=0.1,
                   random_state=42)

# Train
mlp.fit(X_xor_sklearn, y_xor_sklearn)

# Predict
predictions_sklearn = mlp.predict(X_xor_sklearn)

print("sklearn MLPClassifier on XOR:")
print("=" * 60)
print(f"{'x1':<5} {'x2':<5} {'True':<8} {'Predicted':<12} {'Correct'}")
print("-" * 60)
for i in range(4):
    x1, x2 = X_xor_sklearn[i]
    true_y = y_xor_sklearn[i]
    pred_y = predictions_sklearn[i]
    correct = '✓' if pred_y == true_y else '✗'
    print(f"{x1:<5.0f} {x2:<5.0f} {true_y:<8.0f} {pred_y:<12.0f} {correct}")

accuracy_sklearn = accuracy_score(y_xor_sklearn, predictions_sklearn)
print(f"\nAccuracy: {accuracy_sklearn:.2%}")

print(f"\nNetwork architecture: {[2] + list(mlp.hidden_layer_sizes) + [1]}")
print(f"Number of iterations: {mlp.n_iter_}")
print(f"Loss: {mlp.loss_:.4f}")

### Exercise 5.2: Compare on Moons Dataset

**Task:** Train and compare different architectures.

In [None]:
# Prepare data
X_moons_sklearn = X_moons.T
y_moons_sklearn = y_moons.ravel()

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X_moons_sklearn, y_moons_sklearn, test_size=0.3, random_state=42
)

# Test different architectures
architectures = [
    (4,),
    (8,),
    (16,),
    (8, 4),
    (16, 8)
]

print("Comparing MLP Architectures on Moons Dataset:")
print("=" * 70)
print(f"{'Architecture':<20} {'Train Acc':<15} {'Test Acc':<15} {'Iterations'}")
print("-" * 70)

for arch in architectures:
    # Your code here: Train MLP with this architecture
    mlp = MLPClassifier(hidden_layer_sizes=arch,
                       activation='relu',
                       max_iter=1000,
                       random_state=42)
    mlp.fit(X_train, y_train)
    
    train_acc = mlp.score(X_train, y_train)
    test_acc = mlp.score(X_test, y_test)
    
    arch_str = str([2] + list(arch) + [1])
    print(f"{arch_str:<20} {train_acc:<15.4f} {test_acc:<15.4f} {mlp.n_iter_}")

print("\nObservations:")
print("  • Deeper/wider networks may achieve higher training accuracy")
print("  • But watch out for overfitting (train acc >> test acc)")
print("  • Balance between capacity and generalization")

---

## Part 6: Visualizing Decision Boundaries

### Exercise 6.1: Plot Decision Boundary

**Task:** Visualize what the network learned.

In [None]:
def plot_decision_boundary_mlp(mlp, X, y, title):
    """
    Plot decision boundary for sklearn MLP.
    
    Parameters:
    -----------
    mlp : MLPClassifier
        Trained MLP
    X : ndarray, shape (m, 2)
        Input data
    y : ndarray, shape (m,)
        Labels
    title : str
        Plot title
    """
    # Create mesh
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
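    # Predict the class-1 probability on the mesh (a smooth surface to contour)
    Z = mlp.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z = Z.reshape(xx.shape)
    
    # Plot (reversed colormap so that high P(class 1) is red, matching the scatter colors)
    plt.figure(figsize=(10, 7))
    plt.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.6)
    plt.colorbar(label='P(class 1)')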
    
    # Plot data
    plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', s=60,
               edgecolors='k', label='Class 0', alpha=0.7)
    plt.scatter(X[y==1, 0], X[y==1, 1], c='red', s=60,
               edgecolors='k', label='Class 1', alpha=0.7)
    
    plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2.5)
    
    plt.xlabel('Feature 1', fontsize=12)
    plt.ylabel('Feature 2', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Train on moons and visualize
mlp_moons = MLPClassifier(hidden_layer_sizes=(8, 4), 
                         activation='relu',
                         max_iter=1000,
                         random_state=42)
mlp_moons.fit(X_moons_sklearn, y_moons_sklearn)

plot_decision_boundary_mlp(mlp_moons, X_moons_sklearn, y_moons_sklearn,
                          "MLP Decision Boundary: Moons Dataset")

# Note: the model was fit on the full dataset, so this is training accuracy
print(f"Training accuracy: {mlp_moons.score(X_moons_sklearn, y_moons_sklearn):.2%}")

---

## Challenge Problems (Optional)

### Challenge 1: Implement 4-Layer Network

Extend the network to have 3 hidden layers.

In [None]:
# Your code here: Build a 4-layer network
# Architecture: [n_input, n_hidden1, n_hidden2, n_hidden3, 1]

print("Challenge: Implement 4-layer neural network!")

### Challenge 2: Activation Function Comparison

Compare ReLU, Sigmoid, and Tanh activations on the same dataset.

In [None]:
# Your code here: Train MLPs with different activation functions
# Compare performance

activations = ['relu', 'logistic', 'tanh']

# Test each activation on moons dataset
# Report train/test accuracy for each

print("Challenge: Compare activation functions!")

### Challenge 3: Universal Approximation

Approximate a complex function with a neural network.

In [None]:
# Your code here: Create a complex 1D function
# Use MLP to approximate it
# Plot original vs approximation

def complex_function(x):
    return np.sin(x) + 0.5 * np.cos(3*x) + 0.2 * np.sin(5*x)

# Generate data
X_func = np.linspace(-3, 3, 200).reshape(-1, 1)
y_func = complex_function(X_func.ravel())

# Train MLP regressor to approximate this function
# Compare with different network sizes

print("Challenge: Universal approximation theorem in action!")

---

## Reflection Questions

1. **Why do we need hidden layers?**
   - Think about what problems can't be solved with a single layer

2. **What's the difference between network depth (layers) and width (neurons)?**
   - When would you increase depth vs width?

3. **How does the hidden layer transform the input space?**
   - What did you observe in the XOR visualization?

4. **Why does the network need non-linear activation functions?**
   - What would happen with only linear activations?

5. **How do you choose the network architecture?**
   - What factors influence the number of layers and neurons?

---

## Summary

In this exercise, you learned:

- How to build multi-layer perceptrons (feedforward networks)
- Why hidden layers are crucial for solving non-linear problems
- How hidden layers transform the input space
- The XOR problem and its solution with 2-layer networks
- Network architecture design (depth vs width)
- How to use sklearn's MLPClassifier
- Visualizing decision boundaries

**Key Takeaways:**

- Single perceptrons are limited to linear boundaries
- Hidden layers learn new representations of the data
- Non-linear activations are essential
- Deeper/wider networks have more capacity but risk overfitting
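- With enough hidden units and a non-linear activation, a network with one hidden layer can approximate any continuous function on a bounded input domain to arbitrary accuracy (Universal Approximation Theorem); the theorem does not say how many units are needed or how to find the weights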
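
The takeaway that non-linear activations are essential can be checked numerically. The sketch below uses arbitrary illustrative shapes and random weights (assumptions, not values from the exercise) to show that two stacked layers with identity activations compute exactly the same function as one linear layer with weights $\mathbf{W}^{[2]}\mathbf{W}^{[1]}$ and bias $\mathbf{W}^{[2]}\mathbf{b}^{[1]} + \mathbf{b}^{[2]}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no non-linearity between them (illustrative shapes)
W1 = rng.standard_normal((4, 2))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))

X = rng.standard_normal((2, 10))          # 10 examples, one per column

# Forward pass with identity activations
two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer
W_eq = W2 @ W1                            # shape (1, 2)
b_eq = W2 @ b1 + b2                       # shape (1, 1)
one_layer = W_eq @ X + b_eq

print("Same outputs:", np.allclose(two_layer, one_layer))
```

Because the composition stays linear, adding depth without a non-linear activation gives no more expressive power than a single perceptron's linear boundary.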

**Next Steps:**

- Complete Exercise 3 on Backpropagation
- Review [Lesson 2: Feedforward Networks](https://jumpingsphinx.github.io/module4-neural-networks/02-feedforward-networks/)
- Experiment with different architectures on real datasets

---

**Need help?** Check the solution notebook or open an issue on [GitHub](https://github.com/jumpingsphinx/jumpingsphinx.github.io/issues).