# Feedforward Neural Networks

In this notebook, we'll build complete feedforward neural networks (also called multilayer perceptrons or MLPs).

## What You'll Learn:
1. **Architecture** - How to stack layers
2. **Forward pass from scratch** in NumPy
3. **Implementation in PyTorch**
4. **Training on toy datasets** (XOR, moons, blobs)
5. **Best practices** for network design

A feedforward network is called "feedforward" because information flows in one direction: input → hidden layers → output (no loops).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import train_test_split

np.random.seed(42)
torch.manual_seed(42)

## 1. Network Architecture

### Key Components:

1. **Input Layer**: Receives raw data (not really a "layer", just inputs)
2. **Hidden Layers**: Process and transform data (1 or more)
3. **Output Layer**: Produces final predictions

### Example: 784 → 128 → 64 → 10
- Input: 784 features (28x28 image flattened)
- Hidden 1: 128 neurons with ReLU
- Hidden 2: 64 neurons with ReLU  
- Output: 10 neurons (10 classes) with softmax
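
To make the 784 → 128 → 64 → 10 example concrete, the sketch below counts the weights and biases in each layer and traces activation shapes through a forward pass with random (untrained) weights. It assumes the features × samples layout used later in this notebook; the names `layer_sizes`, `params`, and `batch` and the batch size of 32 are illustrative choices.

```python
import numpy as np

layer_sizes = [784, 128, 64, 10]
rng = np.random.default_rng(0)

# One (W, b) pair per connection between consecutive layers.
params = [
    (rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in), np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

param_counts = [W.size + b.size for W, b in params]
total_params = sum(param_counts)
print("Parameters per layer:", param_counts)
print("Total parameters:", total_params)

# Trace shapes for a batch of 32 flattened images (features x samples).
batch = rng.standard_normal((784, 32))
A = batch
shapes = [A.shape]
for idx, (W, b) in enumerate(params):
    Z = W @ A + b
    if idx == len(params) - 1:
        # Softmax over the class axis (axis 0), shifted for numerical stability.
        exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
        A = exp_Z / exp_Z.sum(axis=0, keepdims=True)
    else:
        A = np.maximum(0, Z)  # ReLU
    shapes.append(A.shape)
print("Activation shapes:", shapes)
```

The first layer alone holds 100,480 of the 109,386 parameters, which is typical when the input is much wider than the hidden layers.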

### Universal Approximation Theorem:
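A feedforward network with a single hidden layer and a suitable non-linear activation can approximate **any continuous function** on a bounded input domain to arbitrary accuracy, given enough hidden neurons. The theorem only guarantees that such weights exist; it does not say how many neurons are needed or that training will find them.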

(In practice, deeper networks often work better than very wide shallow ones.)
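
As a rough illustration of the theorem (not a proof), the sketch below fits sin(x) on [-π, π] with a single hidden tanh layer. The hidden weights are random and fixed, and only the output weights are solved by least squares. This random-feature shortcut is an assumption made to keep the example short; it is not how the networks in this notebook are trained. The narrow model reuses the first two hidden units of the wide one, so the wide fit cannot be worse on these points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
target = np.sin(x)

# Random, fixed hidden layer with 100 tanh units.
W_hidden = rng.standard_normal((1, 100)) * 2.0
b_hidden = rng.standard_normal((1, 100)) * 2.0

def fit_mse(num_units):
    """Solve output weights by least squares using the first `num_units` hidden units."""
    H = np.tanh(x @ W_hidden[:, :num_units] + b_hidden[:, :num_units])
    H = np.hstack([H, np.ones((len(x), 1))])  # extra column acts as the output bias
    w_out, *_ = np.linalg.lstsq(H, target, rcond=None)
    return float(np.mean((H @ w_out - target) ** 2))

mse_narrow = fit_mse(2)
mse_wide = fit_mse(100)
print(f"MSE with 2 hidden units:   {mse_narrow:.2e}")
print(f"MSE with 100 hidden units: {mse_wide:.2e}")
```

The error shrinks as hidden units are added, which is the behaviour the theorem describes; it says nothing about how well either model generalizes outside the fitted interval.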

## 2. Helper Functions (NumPy Implementation)

In [None]:
# Activation functions
def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # clip for numerical stability

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def softmax(z):
    """Softmax activation (for multi-class output)"""
    exp_z = np.exp(z - np.max(z, axis=0, keepdims=True))  # numerical stability
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Loss functions
def mse_loss(y_true, y_pred):
    """Mean Squared Error"""
    return np.mean((y_true - y_pred)**2)

def binary_cross_entropy(y_true, y_pred):
    """Binary Cross-Entropy"""
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Categorical Cross-Entropy"""
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=0))

## 3. Feedforward Network from Scratch (NumPy)

We'll build a flexible class that can create networks with any architecture.

In [None]:
class NeuralNetwork:
    """
    A simple feedforward neural network implemented in NumPy
    
    Example:
        net = NeuralNetwork([2, 4, 3, 1])  # 2 inputs, 2 hidden layers (4 and 3 neurons), 1 output
    """
    
    def __init__(self, layer_sizes, activation='relu', output_activation='sigmoid'):
        """
        Args:
            layer_sizes: list of layer sizes [input_size, hidden1, hidden2, ..., output_size]
            activation: activation for hidden layers
            output_activation: activation for output layer
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.activation = activation
        self.output_activation = output_activation
        
        # Initialize parameters
        self.parameters = {}
        for i in range(1, self.num_layers):
            # He initialization for ReLU
            self.parameters[f'W{i}'] = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * np.sqrt(2.0 / layer_sizes[i-1])
            self.parameters[f'b{i}'] = np.zeros((layer_sizes[i], 1))
    
    def forward(self, X):
        """
        Forward propagation
        
        Args:
            X: input data (features x samples)
        
        Returns:
            cache: dictionary with all Z and A values
        """
        cache = {'A0': X}
        A = X
        
        # Forward through hidden layers
        for i in range(1, self.num_layers - 1):
            Z = self.parameters[f'W{i}'] @ A + self.parameters[f'b{i}']
            A = relu(Z) if self.activation == 'relu' else sigmoid(Z)
            cache[f'Z{i}'] = Z
            cache[f'A{i}'] = A
        
        # Output layer
        i = self.num_layers - 1
        Z = self.parameters[f'W{i}'] @ A + self.parameters[f'b{i}']
        
        if self.output_activation == 'sigmoid':
            A = sigmoid(Z)
        elif self.output_activation == 'softmax':
            A = softmax(Z)
        else:
            A = Z  # linear
        
        cache[f'Z{i}'] = Z
        cache[f'A{i}'] = A
        
        return cache
    
    def predict(self, X):
        """Make predictions"""
        cache = self.forward(X)
        return cache[f'A{self.num_layers - 1}']
    
    def compute_loss(self, y_true, y_pred, loss_type='mse'):
        """Compute loss"""
        if loss_type == 'mse':
            return mse_loss(y_true, y_pred)
        elif loss_type == 'binary_crossentropy':
            return binary_cross_entropy(y_true, y_pred)
        elif loss_type == 'categorical_crossentropy':
            return categorical_cross_entropy(y_true, y_pred)
        else:
            raise ValueError(f"Unknown loss_type: {loss_type!r}")

# Test it (named `net` so it does not shadow the `torch.nn` module imported as `nn`)
net = NeuralNetwork([2, 4, 3, 1])
X_test = np.random.randn(2, 5)  # 2 features, 5 samples
output = net.predict(X_test)

print("Network architecture:", net.layer_sizes)
print(f"Input shape: {X_test.shape}")
print(f"Output shape: {output.shape}")
print(f"Output values:\n{output}")

## 4. Training the Network: Simple Gradient Descent Preview

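This section is only a preview of training. Full backpropagation is in the next notebook!

For now, we'll estimate gradients with **numerical gradients** (central differences). They are slow, because every parameter needs two extra forward passes, but they are educational and useful for checking a backprop implementation.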
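
Numerical gradients can be tried on a problem small enough to check by hand. The sketch below is a toy under stated assumptions, separate from the notebook's own gradient function: it fits a two-parameter linear model with plain gradient descent, estimates each partial derivative as (L(w + ε) − L(w − ε)) / 2ε, and compares the first estimate with the analytic gradient. The names `loss_fn` and `central_difference`, the learning rate of 0.1, and the 200 steps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w  # noiseless targets

def loss_fn(w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def central_difference(f, w, epsilon=1e-5):
    """Estimate the gradient of f at w, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = epsilon
        grad[i] = (f(w + step) - f(w - step)) / (2 * epsilon)
    return grad

w = np.zeros(2)
analytic = 2 * X.T @ (X @ w - y) / len(y)
numeric = central_difference(loss_fn, w)
print("Max gradient difference:", np.max(np.abs(numeric - analytic)))

initial_loss = loss_fn(w)
for _ in range(200):
    w = w - 0.1 * central_difference(loss_fn, w)  # gradient descent step
final_loss = loss_fn(w)
print(f"Loss: {initial_loss:.4f} -> {final_loss:.6f}, w = {w}")
```

The cost is two loss evaluations per parameter per step, which is why this approach does not scale to real networks.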

In [None]:
def numerical_gradient(nn, X, y, param_name, loss_type='mse', epsilon=1e-5):
    """
    Compute gradient numerically (slow but useful for checking)
    """
    original = nn.parameters[param_name].copy()
    grad = np.zeros_like(original)
    
    it = np.nditer(original, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        
        # Compute f(x + epsilon)
        nn.parameters[param_name][idx] = original[idx] + epsilon
        y_pred_plus = nn.predict(X)
        loss_plus = nn.compute_loss(y, y_pred_plus, loss_type)
        
        # Compute f(x - epsilon)
        nn.parameters[param_name][idx] = original[idx] - epsilon
        y_pred_minus = nn.predict(X)
        loss_minus = nn.compute_loss(y, y_pred_minus, loss_type)
        
        # Compute gradient
        grad[idx] = (loss_plus - loss_minus) / (2 * epsilon)
        
        # Restore original value
        nn.parameters[param_name][idx] = original[idx]
        it.iternext()
    
    return grad

print("Numerical gradient function ready (we'll use backprop in the next notebook!)")

## 5. Solving XOR with a Feedforward Network

XOR is a classic problem that's not linearly separable. A single neuron can't solve it, but a network with one hidden layer can!
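
One way to see why a hidden layer suffices is to write the weights down by hand. The sketch below uses a well-known construction with two ReLU hidden units, h1 = ReLU(x1 + x2) and h2 = ReLU(x1 + x2 − 1), and a linear output h1 − 2·h2. These weights are chosen for illustration; a trained network will generally find different ones.

```python
import numpy as np

# Columns are the four XOR inputs (features x samples).
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([[0.0],
               [-1.0]])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([[0.0]])

hidden = np.maximum(0, W1 @ X + b1)  # ReLU hidden layer, shape (2, 4)
output = W2 @ hidden + b2            # linear output, shape (1, 4)
print(output)  # [[0. 1. 1. 0.]]
```

The second hidden unit only switches on for the input (1, 1), and subtracting it twice pulls that one output back down to 0.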

In [None]:
# XOR dataset
X_xor = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]])
y_xor = np.array([[0, 1, 1, 0]])

print("XOR Truth Table:")
for i in range(4):
    print(f"  {X_xor[0,i]} XOR {X_xor[1,i]} = {y_xor[0,i]}")

# Visualize
plt.figure(figsize=(6, 6))
colors = ['red' if y == 0 else 'blue' for y in y_xor[0]]
plt.scatter(X_xor[0], X_xor[1], c=colors, s=200, edgecolors='black', linewidth=2)
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('XOR Problem (Red=0, Blue=1)\nNot linearly separable!')
plt.grid(True, alpha=0.3)
plt.xlim(-0.5, 1.5)
plt.ylim(-0.5, 1.5)
plt.show()

print("\nNote: No single straight line can separate red from blue points!")

## 6. PyTorch Implementation

Now let's build the same kind of network in PyTorch and train it on XOR. Autograd computes the gradients, so no hand-written backward pass is needed.
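
When the architecture is fixed, a 2 → 8 → 1 network can also be written compactly with `nn.Sequential`. This is a sketch of an alternative style, not the flexible class this section defines; the name `xor_net` and the seed are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 2 inputs -> 8 ReLU hidden units -> 1 sigmoid output
xor_net = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)

inputs = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
with torch.no_grad():
    probs = xor_net(inputs)  # untrained, so the values are arbitrary

num_params = sum(p.numel() for p in xor_net.parameters())
print(probs.shape)  # torch.Size([4, 1])
print(num_params)   # (2*8 + 8) + (8*1 + 1) = 33
```

A custom `nn.Module` subclass becomes the better fit once the layer list or the activations need to be configurable.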

In [None]:
class FeedforwardNN(nn.Module):
    """
    Flexible feedforward neural network in PyTorch
    """
    def __init__(self, layer_sizes, activation='relu', output_activation='sigmoid'):
        super().__init__()
        self.layers = nn.ModuleList()
        self.activation = activation
        self.output_activation = output_activation
        
        # Create layers
        for i in range(len(layer_sizes) - 1):
            self.layers.append(nn.Linear(layer_sizes[i], layer_sizes[i+1]))
    
    def forward(self, x):
        # Forward through hidden layers
        for layer in self.layers[:-1]:
            x = layer(x)
            if self.activation == 'relu':
                x = torch.relu(x)
            elif self.activation == 'sigmoid':
                x = torch.sigmoid(x)
            elif self.activation == 'tanh':
                x = torch.tanh(x)
        
        # Output layer
        x = self.layers[-1](x)
        if self.output_activation == 'sigmoid':
            x = torch.sigmoid(x)
        elif self.output_activation == 'softmax':
            x = torch.softmax(x, dim=1)
        
        return x

# Create model for XOR
model = FeedforwardNN([2, 8, 1], activation='relu', output_activation='sigmoid')
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")

### 6.1 Training Loop in PyTorch

In [None]:
# Prepare data
X_train = torch.tensor(X_xor.T, dtype=torch.float32)  # shape: (4, 2)
y_train = torch.tensor(y_xor.T, dtype=torch.float32)  # shape: (4, 1)

# Define loss and optimizer
criterion = nn.BCELoss()  # Binary Cross-Entropy
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
losses = []
num_epochs = 2000

for epoch in range(num_epochs):
    # Forward pass
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)
    
    # Backward pass
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights
    
    losses.append(loss.item())
    
    if (epoch + 1) % 400 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}")

# Plot training loss
plt.figure(figsize=(10, 4))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Test the trained model
model.eval()
with torch.no_grad():
    predictions = model(X_train)
    
print("XOR Results After Training:")
print("Input\t\tTarget\tPrediction\tRounded")
print("-" * 50)
for i in range(4):
    x1, x2 = X_train[i]
    target = y_train[i].item()
    pred = predictions[i].item()
    rounded = round(pred)
    print(f"[{x1:.0f}, {x2:.0f}]\t\t{target:.0f}\t{pred:.4f}\t\t{rounded}")

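all_correct = all(round(p.item()) == int(t.item()) for p, t in zip(predictions, y_train))
if all_correct:
    print("\nSuccess! The network learned XOR!")
else:
    print("\nNot all outputs match yet; try more epochs or a different seed.")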

## 7. More Complex Dataset: Moons

Let's try a more challenging non-linear dataset.

In [None]:
# Generate moons dataset
X_moons, y_moons = make_moons(n_samples=1000, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=42)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis', alpha=0.5)
plt.colorbar(label='Class')
plt.title('Moons Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In [None]:
# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

# Create and train model
model_moons = FeedforwardNN([2, 16, 8, 1], activation='relu', output_activation='sigmoid')
criterion = nn.BCELoss()
optimizer = optim.Adam(model_moons.parameters(), lr=0.01)

# Training loop
losses = []
accuracies = []
num_epochs = 1000

for epoch in range(num_epochs):
    # Forward pass
    y_pred = model_moons(X_train_t)
    loss = criterion(y_pred, y_train_t)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Track metrics
    losses.append(loss.item())
    
    # Compute accuracy
    with torch.no_grad():
        y_pred_class = (y_pred > 0.5).float()
        accuracy = (y_pred_class == y_train_t).float().mean().item()
        accuracies.append(accuracy)
    
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}, Accuracy: {accuracy:.4f}")

# Plot metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss')
ax1.grid(True, alpha=0.3)

ax2.plot(accuracies)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training Accuracy')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate on test set
model_moons.eval()
with torch.no_grad():
    y_test_pred = model_moons(X_test_t)
    y_test_class = (y_test_pred > 0.5).float()
    test_accuracy = (y_test_class == y_test_t).float().mean().item()

print(f"Test Accuracy: {test_accuracy:.4f}")

### 7.1 Visualize Decision Boundary

In [None]:
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """
    Plot decision boundary for a binary classifier
    """
    # Create mesh
    h = 0.01
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    mesh_input = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    model.eval()
    with torch.no_grad():
        Z = model(mesh_input).numpy()
    Z = Z.reshape(xx.shape)
    
    # Plot
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.3, levels=20, cmap='viridis')
    plt.colorbar(label='Predicted Probability')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='black', linewidth=0.5)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

plot_decision_boundary(model_moons, X_test, y_test, "Decision Boundary on Test Set")

## 8. Multi-Class Classification

Let's build a network that classifies into 3 or more classes.
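
With more than two classes, the network produces one score (logit) per class and the predicted class is the index of the largest score. The NumPy sketch below, using made-up random logits in the samples × classes layout that PyTorch expects, shows that softmax turns each row into probabilities without changing which class wins.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 3))  # 5 samples, 3 classes

# Row-wise softmax, shifted by the row maximum for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

pred_from_logits = logits.argmax(axis=1)
pred_from_probs = probs.argmax(axis=1)
print(probs.sum(axis=1))                                   # each row sums to 1
print(np.array_equal(pred_from_logits, pred_from_probs))   # True
```

This is why accuracy can be computed directly from raw logits, without applying softmax first.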

In [None]:
# Generate 3-class dataset
from sklearn.datasets import make_blobs

X_blobs, y_blobs = make_blobs(n_samples=600, centers=3, n_features=2, 
                              cluster_std=1.0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X_blobs, y_blobs, test_size=0.2, random_state=42)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.colorbar(label='Class')
plt.title('3-Class Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In [None]:
# Convert to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)  # Long for CrossEntropyLoss
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.long)

# Create model - NOTE: No softmax in output (CrossEntropyLoss includes it)
model_multi = FeedforwardNN([2, 16, 8, 3], activation='relu', output_activation='linear')

# Loss and optimizer
criterion = nn.CrossEntropyLoss()  # Combines softmax + cross-entropy
optimizer = optim.Adam(model_multi.parameters(), lr=0.01)

# Training
losses = []
accuracies = []
num_epochs = 500

for epoch in range(num_epochs):
    # Forward
    y_pred = model_multi(X_train_t)
    loss = criterion(y_pred, y_train_t)
    
    # Backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Metrics
    losses.append(loss.item())
    with torch.no_grad():
        _, predicted = torch.max(y_pred, 1)
        accuracy = (predicted == y_train_t).float().mean().item()
        accuracies.append(accuracy)
    
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item():.4f}, Accuracy: {accuracy:.4f}")

# Evaluate on test
model_multi.eval()
with torch.no_grad():
    y_test_pred = model_multi(X_test_t)
    _, predicted = torch.max(y_test_pred, 1)
    test_accuracy = (predicted == y_test_t).float().mean().item()

print(f"\nTest Accuracy: {test_accuracy:.4f}")

## 9. Best Practices and Tips

### Architecture Design:

1. **Start Simple**: Begin with 1-2 hidden layers
2. **Layer Sizes**: Typically decrease as you go deeper (e.g., 128 → 64 → 32)
3. **Rule of Thumb**: More data → can support more parameters (larger/deeper networks)

### Activation Functions:

1. **Hidden Layers**: Use ReLU (default choice)
2. **Output Layer**:
   - Binary classification → Sigmoid
   - Multi-class classification → Softmax (in PyTorch, leave the output linear when using CrossEntropyLoss, which applies log-softmax internally)
   - Regression → Linear (no activation)

### Initialization:

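1. **He Initialization** for ReLU: $W \sim \mathcal{N}(0, \sigma^2)$ with standard deviation $\sigma = \sqrt{2/n_{in}}$
2. **Xavier (Glorot) Initialization** for tanh/sigmoid: $\sigma = \sqrt{1/n_{in}}$ in its simplified form; Glorot's original proposal uses $\sigma = \sqrt{2/(n_{in} + n_{out})}$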
3. **Biases**: Usually initialized to zero
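
The initialization schemes above can be sketched in NumPy as follows. The layer sizes are arbitrary, and the Xavier line uses the simplified $1/n_{in}$ variance from the list; Glorot's original proposal uses $2/(n_{in} + n_{out})$ instead. The empirical standard deviations are printed next to their targets as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

# He initialization: standard deviation sqrt(2 / n_in), suited to ReLU.
W_he = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Simplified Xavier initialization: standard deviation sqrt(1 / n_in).
W_xavier = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

b = np.zeros((n_out, 1))  # biases start at zero

he_target = np.sqrt(2.0 / n_in)
xavier_target = np.sqrt(1.0 / n_in)
print(f"He std:     {W_he.std():.4f} (target {he_target:.4f})")
print(f"Xavier std: {W_xavier.std():.4f} (target {xavier_target:.4f})")
```

Scaling by the fan-in keeps the variance of each layer's pre-activations roughly constant with depth, which is the motivation for both schemes.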

### Loss Functions:

1. **Regression**: MSE, MAE
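2. **Binary Classification**: BCELoss (with a sigmoid output) or BCEWithLogitsLoss (with raw logits)
3. **Multi-class**: CrossEntropyLoss (expects raw logits, not softmax outputs)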
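
How these PyTorch losses relate to the output activation can be checked numerically. The sketch below uses the functional forms from `torch.nn.functional` on random inputs (the tensors and labels are made up): cross-entropy on raw logits matches negative log-likelihood on log-softmax outputs, and binary cross-entropy on sigmoid outputs matches the logits-based variant.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class: cross-entropy on raw logits equals NLL on log-softmax outputs.
logits = torch.randn(6, 3)                 # 6 samples, 3 classes
labels = torch.tensor([0, 2, 1, 1, 0, 2])  # integer class indices
ce = F.cross_entropy(logits, labels)
nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Binary: BCE on sigmoid outputs equals the logits-based variant.
z = torch.randn(6, 1)
targets = torch.tensor([[0.], [1.], [1.], [0.], [1.], [0.]])
bce = F.binary_cross_entropy(torch.sigmoid(z), targets)
bce_logits = F.binary_cross_entropy_with_logits(z, targets)

print(torch.allclose(ce, nll, atol=1e-6))
print(torch.allclose(bce, bce_logits, atol=1e-6))
```

Applying softmax before `CrossEntropyLoss` would therefore apply it twice, which still trains but distorts the loss.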

### Common Issues:

1. **Vanishing Gradients**: Use ReLU instead of sigmoid/tanh
2. **Exploding Gradients**: Use gradient clipping, proper initialization
3. **Overfitting**: Add dropout, L2 regularization, reduce model size
4. **Underfitting**: Increase model size, train longer, reduce regularization
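
Several of these remedies are one-liners in PyTorch. The sketch below is a minimal illustration on made-up data with arbitrary hyperparameters: dropout between layers, an L2 penalty via the optimizer's `weight_decay` argument, and gradient clipping with `torch.nn.utils.clip_grad_norm_`. The large targets are there only to provoke a large gradient for one step.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Dropout between layers is one way to reduce overfitting.
model = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(16, 1),
)
# weight_decay adds an L2 penalty on the parameters.
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)

X = torch.randn(32, 2)
y = 100.0 + torch.randn(32, 1)  # large targets to provoke large gradients

model.train()  # dropout is active in training mode
loss = nn.functional.mse_loss(model(X), y)
optimizer.zero_grad()
loss.backward()

# Rescale gradients so their combined L2 norm is at most max_norm.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
optimizer.step()

print(f"Gradient norm before clipping: {float(norm_before):.2f}, after: {float(norm_after):.2f}")
```

Remember to call `model.eval()` before evaluation so that dropout is switched off.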

## Summary

### What We Learned:

1. **Feedforward Networks**:
   - Stack multiple layers: input → hidden → ... → output
   - Each layer: linear transform + activation
   - Information flows forward only (no loops)

2. **Implementation**:
   - Built the forward pass from scratch in NumPy (backpropagation comes next)
   - Much simpler in PyTorch
   - Forward pass computes predictions

3. **Training**:
   - Define loss function
   - Compute gradients (backprop)
   - Update weights with optimizer
   - Repeat!

4. **Applications**:
   - Binary classification (XOR, moons)
   - Multi-class classification (3 classes)
   - Works for non-linear decision boundaries

### Next Steps:
In the next notebook, we'll dive deep into **backpropagation** - how gradients are actually computed!

## Practice Exercises

In [None]:
# Exercise 1: Build a 3-layer network (input→16→8→output) for the circles dataset
X_circles, y_circles = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)

# Visualize the dataset
plt.figure(figsize=(6, 6))
plt.scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='viridis')
plt.title('Circles Dataset - Your Turn!')
plt.show()

# Your code here:
# 1. Convert to PyTorch tensors
# 2. Create model
# 3. Train for 1000 epochs
# 4. Evaluate accuracy


In [None]:
# Exercise 2: Experiment with different architectures
# Try these and compare performance:
# - Shallow and wide: [2, 64, 1]
# - Deep and narrow: [2, 8, 8, 8, 1]
# - Balanced: [2, 16, 16, 1]
# Your code here


In [None]:
# Exercise 3: Implement a regression network
# Create a dataset with y = x^2 + noise
# Build a network to learn this function
X_reg = np.random.randn(1000, 1) * 2
y_reg = X_reg**2 + np.random.randn(1000, 1) * 0.1

# Your code here
