# 3. Activation Functions

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/maleehahassan/NNBuildingBlocksTeachingPt1/blob/main/content/03_activation_functions.ipynb)

## Learning Objectives

By the end of this section, you will understand:
- Why activation functions are crucial in neural networks
- The problems with linear activation functions
- Common activation functions and their properties
- How to choose the right activation function
- The impact of activation functions on learning

## Why Do We Need Activation Functions?

In the previous section, we used a simple **step function** in our perceptron. But what if we want:
- **Smooth gradients** for better learning?
- **Probabilistic outputs** instead of hard 0/1?
- **Non-linear decision boundaries**?

This is where **activation functions** come to the rescue!

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Let's start by showing the problem with linear functions
def demonstrate_linear_problem():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Linear network (just matrix multiplications)
    ax1.set_title('Linear Network (Without Activation Functions)', fontsize=14, fontweight='bold')
    
    # Show that multiple linear layers = single linear layer
    x = np.linspace(-3, 3, 100)
    
    # Layer 1: y1 = 2x
    y1 = 2 * x
    ax1.plot(x, y1, 'b-', linewidth=2, label='Layer 1: 2x')
    
    # Layer 2: y2 = 3 * y1 = 3 * 2x = 6x
    y2 = 3 * y1
    ax1.plot(x, y2, 'r-', linewidth=2, label='Layer 2: 3×(2x) = 6x')
    
    # Equivalent single layer
    y_equivalent = 6 * x
    ax1.plot(x, y_equivalent, 'g--', linewidth=3, label='Equivalent: 6x', alpha=0.7)
    
    ax1.set_xlabel('Input')
    ax1.set_ylabel('Output')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.text(0, 10, 'Multiple linear layers\n= Single linear layer!', 
             ha='center', fontsize=12, 
             bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7))
    
    # Non-linear network (with activation functions)
    ax2.set_title('Non-Linear Network (With Activation Functions)', fontsize=14, fontweight='bold')
    
    # Layer 1 with activation
    y1_nonlinear = np.tanh(2 * x)  # tanh activation
    ax2.plot(x, y1_nonlinear, 'b-', linewidth=2, label='Layer 1: tanh(2x)')
    
    # Layer 2 with activation  
    y2_nonlinear = np.tanh(3 * y1_nonlinear)
    ax2.plot(x, y2_nonlinear, 'r-', linewidth=2, label='Layer 2: tanh(3×Layer1)')
    
    ax2.set_xlabel('Input')
    ax2.set_ylabel('Output')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.text(0, 0.5, 'Each layer adds\ncomplexity!', 
             ha='center', fontsize=12, 
             bbox=dict(boxstyle="round,pad=0.3", facecolor="lightgreen", alpha=0.7))
    
    plt.tight_layout()
    plt.show()

demonstrate_linear_problem()

print("Key Insight: Without activation functions, deep networks are just expensive linear models!")
print("Activation functions introduce NON-LINEARITY, enabling complex pattern recognition.")
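
The scalar picture above carries over to full weight matrices. The sketch below uses illustrative shapes and random weights (the names `W1`, `b1`, `W2`, `b2` are assumptions for this example, not part of the notebook's network) to check that two stacked linear layers can be folded into one, and that a `tanh` in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear "layers" with no activation in between (illustrative shapes)
W1 = rng.normal(size=(4, 3))   # layer 1: 3 inputs -> 4 units
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))   # layer 2: 4 units -> 2 outputs
b2 = rng.normal(size=2)

x = rng.normal(size=3)

# Pass the input through both layers
two_layer_output = W2 @ (W1 @ x + b1) + b2

# Fold both layers into a single equivalent layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))  # True

# With a non-linearity between the layers, the single layer no longer matches
nonlinear_output = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(nonlinear_output, one_layer_output))
```

However many linear layers are stacked, the same folding applies, so depth alone adds no expressive power.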

## Common Activation Functions

Let's explore the most important activation functions used in neural networks:

### 1. Step Function (Perceptron)
$$f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$$

### 2. Sigmoid (Logistic)
$$f(x) = \frac{1}{1 + e^{-x}}$$

### 3. Hyperbolic Tangent (tanh)
$$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

### 4. Rectified Linear Unit (ReLU)
$$f(x) = \max(0, x)$$
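
To make the four definitions concrete, the sketch below evaluates each one at a few sample inputs. It also checks a relationship that is easy to miss from the formulas alone: tanh is a rescaled and recentered sigmoid, tanh(x) = 2·sigmoid(2x) − 1. The local `sigmoid` helper is defined here only so the example is self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])

# Evaluate each activation from the definitions above
values = {
    "step":    (x >= 0).astype(float),
    "sigmoid": sigmoid(x),
    "tanh":    np.tanh(x),
    "relu":    np.maximum(0, x),
}
for name, y in values.items():
    print(f"{name:8s}", np.round(y, 3))

# tanh is a rescaled, recentered sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
identity_holds = np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
print(identity_holds)  # True
```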

In [None]:
# Define activation functions
def step_function(x):
    return (x >= 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -250, 250)))  # Clip to prevent overflow

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Plot all activation functions
x = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

functions = [
    (step_function, 'Step Function', 'Perceptron classic'),
    (sigmoid, 'Sigmoid', 'Smooth, outputs (0,1)'),
    (tanh, 'Hyperbolic Tangent', 'Smooth, outputs (-1,1)'),
    (relu, 'ReLU', 'Modern favorite'),
    (leaky_relu, 'Leaky ReLU', 'ReLU variant'),
    (lambda x: np.where(x > 0, x, 0.1 * (np.exp(x) - 1)), 'ELU', 'Exponential Linear Unit (α = 0.1)')
]

for i, (func, name, description) in enumerate(functions):
    y = func(x)
    axes[i].plot(x, y, linewidth=3, color=f'C{i}')
    axes[i].set_title(f'{name}\n{description}', fontsize=12, fontweight='bold')
    axes[i].grid(True, alpha=0.3)
    axes[i].set_xlabel('Input (x)')
    axes[i].set_ylabel('Output f(x)')
    axes[i].axhline(y=0, color='black', linestyle='-', alpha=0.3)
    axes[i].axvline(x=0, color='black', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()

print("Each activation function has different properties and use cases!")

## Detailed Analysis of Key Activation Functions

### Sigmoid Function: The Smooth Perceptron

The sigmoid function was a major step beyond the step function because it is:
- **Smooth and differentiable** (enables gradient-based learning)
- **Bounded in (0, 1)** (outputs can be read as probabilities)
- **S-shaped** (a gradual transition instead of a hard threshold)
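
The gradient used in gradient-based learning has a compact closed form. A short derivation, writing σ for the sigmoid defined earlier:

$$\sigma'(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

The derivative peaks at x = 0, where σ(0) = 0.5 and so σ'(0) = 0.25, and it decays toward 0 as |x| grows. That decay is the saturation effect behind the vanishing gradient problem.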

In [None]:
# Detailed sigmoid analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

x = np.linspace(-6, 6, 1000)
sig_x = sigmoid(x)

# 1. Sigmoid function
axes[0,0].plot(x, sig_x, 'b-', linewidth=3, label='σ(x) = 1/(1+e^(-x))')
axes[0,0].axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Decision threshold')
axes[0,0].set_title('Sigmoid Function', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Input (x)')
axes[0,0].set_ylabel('Output σ(x)')
axes[0,0].grid(True, alpha=0.3)
axes[0,0].legend()

# Add annotations
axes[0,0].annotate('Saturated\n(gradient ≈ 0)', xy=(-4, sigmoid(-4)), xytext=(-5, 0.3),
                   arrowprops=dict(arrowstyle='->', color='red', alpha=0.7))
axes[0,0].annotate('Saturated\n(gradient ≈ 0)', xy=(4, sigmoid(4)), xytext=(3, 0.7),
                   arrowprops=dict(arrowstyle='->', color='red', alpha=0.7))
axes[0,0].annotate('Steep gradient\n(fast learning)', xy=(0, 0.5), xytext=(1, 0.2),
                   arrowprops=dict(arrowstyle='->', color='green', alpha=0.7))

# 2. Sigmoid derivative (gradient)
sigmoid_derivative = sig_x * (1 - sig_x)
axes[0,1].plot(x, sigmoid_derivative, 'g-', linewidth=3, label="σ'(x) = σ(x)(1-σ(x))")
axes[0,1].set_title('Sigmoid Derivative (Gradient)', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Input (x)')
axes[0,1].set_ylabel('Gradient')
axes[0,1].grid(True, alpha=0.3)
axes[0,1].legend()

# 3. Comparison with step function
step_x = step_function(x)
axes[1,0].plot(x, step_x, 'r-', linewidth=3, label='Step function')
axes[1,0].plot(x, sig_x, 'b-', linewidth=3, label='Sigmoid', alpha=0.7)
axes[1,0].set_title('Step vs Sigmoid', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Input (x)')
axes[1,0].set_ylabel('Output')
axes[1,0].grid(True, alpha=0.3)
axes[1,0].legend()

# 4. Sigmoid with different slopes
for slope in [0.5, 1, 2, 5]:
    y = sigmoid(slope * x)
    axes[1,1].plot(x, y, linewidth=2, label=f'σ({slope}x)')
axes[1,1].set_title('Sigmoid with Different Slopes', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Input (x)')
axes[1,1].set_ylabel('Output')
axes[1,1].grid(True, alpha=0.3)
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("Sigmoid Properties:")
print("✓ Smooth and differentiable")
print("✓ Outputs between 0 and 1 (probabilities)")
print("✓ S-shaped curve")
print("✗ Vanishing gradient problem (saturates at extremes)")
print("✗ Not zero-centered (can slow learning)")

### ReLU: The Modern Champion

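ReLU (Rectified Linear Unit) became the default choice for hidden layers because it is:
- **Simple**: f(x) = max(0, x)
- **Fast**: a single comparison, with no exponentials to evaluate
- **Resistant to vanishing gradients**: the gradient is exactly 1 for positive inputs (and 0 otherwise), so it does not shrink as it passes through active units
- **Sparse**: many neurons output exactly 0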
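
A quick numerical check of the sparsity and gradient claims, as a minimal sketch. The pre-activations `z` are hypothetical zero-mean random values, not outputs of a trained network, so the exact fraction of zeros will differ in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations: 1000 samples x 64 hidden units, zero-mean
z = rng.normal(size=(1000, 64))

a = np.maximum(0, z)          # ReLU output
grad = (z > 0).astype(float)  # ReLU gradient: 1 where active, 0 elsewhere

sparsity = np.mean(a == 0)
print(f"Fraction of zero activations: {sparsity:.2f}")
print(f"Unique gradient values: {np.unique(grad)}")
```

With zero-mean inputs, roughly half of the activations are exactly zero, and the gradient only ever takes the values 0 and 1.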

In [None]:
# Detailed ReLU analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

x = np.linspace(-3, 3, 1000)
relu_x = relu(x)

# 1. ReLU function
axes[0,0].plot(x, relu_x, 'r-', linewidth=3, label='ReLU(x) = max(0, x)')
axes[0,0].axhline(y=0, color='black', linestyle='-', alpha=0.3)
axes[0,0].axvline(x=0, color='black', linestyle='-', alpha=0.3)
axes[0,0].set_title('ReLU Function', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Input (x)')
axes[0,0].set_ylabel('Output')
axes[0,0].grid(True, alpha=0.3)
axes[0,0].legend()

# Add annotations
axes[0,0].annotate('Dead zone\n(gradient = 0)', xy=(-1.5, 0), xytext=(-2, 1),
                   arrowprops=dict(arrowstyle='->', color='red', alpha=0.7))
axes[0,0].annotate('Active zone\n(gradient = 1)', xy=(1.5, 1.5), xytext=(0.5, 2.5),
                   arrowprops=dict(arrowstyle='->', color='green', alpha=0.7))

# 2. ReLU derivative
relu_derivative = (x > 0).astype(float)
axes[0,1].plot(x, relu_derivative, 'g-', linewidth=3, label="ReLU'(x)")
axes[0,1].set_title('ReLU Derivative', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Input (x)')
axes[0,1].set_ylabel('Gradient')
axes[0,1].grid(True, alpha=0.3)
axes[0,1].legend()
axes[0,1].set_ylim(-0.1, 1.1)

# 3. Comparison: Sigmoid vs ReLU
axes[1,0].plot(x, sigmoid(x), 'b-', linewidth=3, label='Sigmoid', alpha=0.7)
axes[1,0].plot(x, relu_x, 'r-', linewidth=3, label='ReLU')
axes[1,0].set_title('Sigmoid vs ReLU', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Input (x)')
axes[1,0].set_ylabel('Output')
axes[1,0].grid(True, alpha=0.3)
axes[1,0].legend()

# 4. ReLU variants
leaky_x = leaky_relu(x, 0.1)
elu_x = np.where(x > 0, x, 0.1 * (np.exp(x) - 1))

axes[1,1].plot(x, relu_x, 'r-', linewidth=3, label='ReLU')
axes[1,1].plot(x, leaky_x, 'g-', linewidth=3, label='Leaky ReLU (α = 0.1)')
axes[1,1].plot(x, elu_x, 'b-', linewidth=3, label='ELU (α = 0.1)')
axes[1,1].set_title('ReLU Variants', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Input (x)')
axes[1,1].set_ylabel('Output')
axes[1,1].grid(True, alpha=0.3)
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("ReLU Properties:")
print("✓ Simple and fast computation")
print("✓ No vanishing gradient (for x > 0)")
print("✓ Sparse activation (many zeros)")
print("✓ Biologically inspired")
print("✗ Dying ReLU problem (neurons can become permanently inactive)")
print("✗ Not differentiable at x = 0")

## Practical Demonstration: Impact on Learning

Let's see how different activation functions affect the learning process in a simple neural network.

In [None]:
# Create a simple neural network to compare activation functions
class SimpleNeuralNetwork:
    def __init__(self, activation_func, activation_derivative, learning_rate=0.1):
        self.activation = activation_func
        self.activation_derivative = activation_derivative
        self.learning_rate = learning_rate
        
        # Initialize weights
        self.w1 = np.random.randn(2, 3) * 0.5  # Input to hidden
        self.b1 = np.random.randn(1, 3) * 0.5
        self.w2 = np.random.randn(3, 1) * 0.5  # Hidden to output
        self.b2 = np.random.randn(1, 1) * 0.5
        
        self.losses = []
    
    def forward(self, X):
        self.z1 = np.dot(X, self.w1) + self.b1
        self.a1 = self.activation(self.z1)
        self.z2 = np.dot(self.a1, self.w2) + self.b2
        self.a2 = sigmoid(self.z2)  # Output layer always sigmoid for binary classification
        return self.a2
    
    def backward(self, X, y, output):
        m = X.shape[0]
        
        # Output layer gradients
        dz2 = output - y
        dw2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        
        # Hidden layer gradients
        da1 = np.dot(dz2, self.w2.T)
        dz1 = da1 * self.activation_derivative(self.z1)
        dw1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        
        # Update weights
        self.w2 -= self.learning_rate * dw2
        self.b2 -= self.learning_rate * db2
        self.w1 -= self.learning_rate * dw1
        self.b1 -= self.learning_rate * db1
    
    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            self.losses.append(loss)
            self.backward(X, y, output)

# Define activation functions and their derivatives
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def relu_derivative(x):
    return (x > 0).astype(float)

# Generate XOR-like dataset (non-linearly separable)
np.random.seed(42)
n_samples = 1000

# Label is 1 when both features share the same sign (quadrants I and III), 0 otherwise
X = np.random.randn(n_samples, 2)
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)

# Train networks with different activation functions
activations = [
    (sigmoid, sigmoid_derivative, 'Sigmoid'),
    (tanh, tanh_derivative, 'Tanh'),
    (relu, relu_derivative, 'ReLU')
]

networks = []
for activation, derivative, name in activations:
    print(f"Training network with {name} activation...")
    np.random.seed(0)  # Same initial weights for every network, so only the activation differs
    network = SimpleNeuralNetwork(activation, derivative, learning_rate=0.01)
    network.train(X, y, epochs=500)
    networks.append((network, name))

# Plot learning curves
plt.figure(figsize=(12, 8))

# Learning curves
plt.subplot(2, 2, 1)
for network, name in networks:
    plt.plot(network.losses, linewidth=2, label=name)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Learning Curves: Different Activation Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Decision boundaries
for i, (network, name) in enumerate(networks):
    plt.subplot(2, 2, i + 2)
    
    # Create decision boundary
    xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = network.forward(mesh_points)
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
    plt.colorbar()
    
    # Plot data points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y.ravel(), cmap='RdYlBu', edgecolors='black', alpha=0.7)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(f'Decision Boundary: {name}')
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)

plt.tight_layout()
plt.show()

# Calculate final accuracies
print("\nFinal Results:")
for network, name in networks:
    predictions = (network.forward(X) > 0.5).astype(int)
    accuracy = np.mean(predictions == y)
    final_loss = network.losses[-1]
    print(f"{name:8s}: Accuracy = {accuracy:.3f}, Final Loss = {final_loss:.4f}")
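
The `backward` method above sets `dz2 = output - y` without multiplying by a sigmoid derivative. That shortcut follows from pairing a sigmoid output with the binary cross-entropy loss computed in `train`. A sketch of the derivation for a single sample, writing a = σ(z) for the network output and L for the loss (symbols introduced here for illustration):

$$L = -\bigl[y \log a + (1 - y)\log(1 - a)\bigr], \qquad \frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}$$

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) = a - y$$

The a(1 − a) factor from the sigmoid derivative cancels, so the output layer does not saturate the gradient the way a sigmoid hidden layer can.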

## Choosing the Right Activation Function

### Guidelines for Selection:

#### Hidden Layers:
- **ReLU**: Default choice for most cases
  - Fast, simple, avoids vanishing gradients
  - Good for deep networks
  
- **Leaky ReLU/ELU**: When ReLU causes "dying" neurons
  - Addresses the dying ReLU problem
  - Slightly more expensive to compute, but can train more reliably when many units go inactive
  
- **Tanh**: When you need zero-centered outputs
  - Better than sigmoid for hidden layers
  - Still suffers from vanishing gradients

#### Output Layers:
- **Sigmoid**: Binary classification (0/1 probabilities)
- **Softmax**: Multi-class classification (probability distribution)
- **Linear**: Regression (no activation)
- **Tanh**: When outputs should be between -1 and 1
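
Softmax appears in the list above but is not defined elsewhere in this section. A minimal sketch, assuming row-wise logits of shape (samples, classes); the `softmax` helper and the example values are illustrative, not taken from a particular library.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

# Hypothetical raw scores (logits) for 2 samples and 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])  # large values stay finite

probs = softmax(logits)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Subtracting the maximum does not change the result, because softmax is invariant to adding a constant to every logit in a row.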

In [None]:
# Summary comparison table
import pandas as pd

comparison_data = {
    'Activation': ['Step', 'Sigmoid', 'Tanh', 'ReLU', 'Leaky ReLU', 'ELU'],
    'Range': ['{0, 1}', '(0, 1)', '(-1, 1)', '[0, ∞)', '(-∞, ∞)', '(-α, ∞)'],
    'Differentiable': ['No', 'Yes', 'Yes', 'Almost', 'Almost', 'Yes (if α = 1)'],
    'Vanishing Gradient': ['N/A', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Computational Cost': ['Very Low', 'Medium', 'Medium', 'Very Low', 'Low', 'Medium'],
    'Common Use': ['Historical', 'Output Layer', 'Hidden Layer', 'Hidden Layer', 'Hidden Layer', 'Hidden Layer'],
    'Zero-Centered': ['No', 'No', 'Yes', 'No', 'No', 'Almost']
}

df = pd.DataFrame(comparison_data)
print("Activation Function Comparison:")
print("=" * 80)
print(df.to_string(index=False))

# Visual comparison of key properties
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

x = np.linspace(-2, 2, 1000)

# Plot 1: Saturation comparison
axes[0].plot(x, sigmoid(x), label='Sigmoid (saturates)', linewidth=3)
axes[0].plot(x, tanh(x), label='Tanh (saturates)', linewidth=3)
axes[0].plot(x, relu(x), label='ReLU (no saturation)', linewidth=3)
axes[0].set_title('Saturation Behavior', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Input')
axes[0].set_ylabel('Output')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Zero-centered comparison
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.5, label='Zero line')
axes[1].plot(x, sigmoid(x) - 0.5, label='Sigmoid (shifted)', linewidth=3)
axes[1].plot(x, tanh(x), label='Tanh (zero-centered)', linewidth=3)
axes[1].plot(x, relu(x) - 1, label='ReLU (shifted)', linewidth=3)
axes[1].set_title('Zero-Centered Outputs', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Input')
axes[1].set_ylabel('Output')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Gradient comparison
sig_grad = sigmoid(x) * (1 - sigmoid(x))
tanh_grad = 1 - tanh(x)**2
relu_grad = (x > 0).astype(float)

axes[2].plot(x, sig_grad, label='Sigmoid derivative', linewidth=3)
axes[2].plot(x, tanh_grad, label='Tanh derivative', linewidth=3)
axes[2].plot(x, relu_grad, label='ReLU derivative', linewidth=3)
axes[2].set_title('Gradient Behavior', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Input')
axes[2].set_ylabel('Gradient')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Common Problems and Solutions

### 1. Vanishing Gradient Problem
**Problem**: Gradients shrink as they are propagated backward through many layers, so the early layers learn very slowly or not at all.

**Causes**: 
- Sigmoid/tanh saturation
- Many layers multiplying small gradients

**Solutions**:
- Use ReLU or variants
- Proper weight initialization
- Batch normalization
- Residual connections
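
As one illustration of "proper weight initialization", the sketch below compares a naive small-constant scale with He initialization across a stack of ReLU layers. It tracks only the forward activation scale, which is closely tied to how well gradients propagate. The layer sizes, depth, and the helper names `he_init` and `xavier_init` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He (Kaiming) normal initialization, commonly paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out):
    """Xavier (Glorot) normal initialization, commonly paired with tanh/sigmoid."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

# Push a random batch through 20 ReLU layers and track the activation scale
x = rng.normal(size=(256, 512))
a_naive, a_he = x, x
for _ in range(20):
    a_naive = np.maximum(0, a_naive @ (rng.normal(size=(512, 512)) * 0.01))
    a_he = np.maximum(0, a_he @ he_init(512, 512))

print(f"std with 0.01-scaled weights: {a_naive.std():.2e}")
print(f"std with He initialization:   {a_he.std():.2e}")
print(f"Xavier std for a 512x512 layer: {xavier_init(512, 512).std():.4f}")
```

With the naive scale the signal collapses toward zero after a few layers, while the He-scaled weights keep it at a roughly constant magnitude.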

### 2. Dying ReLU Problem
**Problem**: ReLU neurons become permanently inactive (always output 0).

**Causes**:
- Large negative bias
- Poor weight initialization
- High learning rates

**Solutions**:
- Leaky ReLU (small negative slope)
- ELU (exponential for negative values)
- Proper initialization
- Lower learning rates
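
One way to spot dying ReLU units is to look for hidden units that output 0 for every sample in a batch. The sketch below builds a hypothetical layer, gives two units a large negative bias (one of the causes listed above), and compares the gradient that ReLU and Leaky ReLU would pass through them. All shapes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden layer: 5 inputs -> 8 units
X = rng.normal(size=(500, 5))
W = rng.normal(size=(5, 8)) * 0.5
b = np.zeros((1, 8))
b[0, :2] = -20.0   # two units get a large negative bias

z = X @ W + b
relu_out = np.maximum(0, z)

# A unit is "dead" on this data if it outputs 0 for every sample
dead_units = np.all(relu_out == 0, axis=0)
print("Dead ReLU units:", np.flatnonzero(dead_units))

# Gradient w.r.t. z: ReLU passes nothing through dead units, Leaky ReLU passes alpha
relu_grad = (z > 0).astype(float)
leaky_grad = np.where(z > 0, 1.0, 0.01)
print("Mean ReLU gradient on dead units:      ", relu_grad[:, dead_units].mean())
print("Mean Leaky ReLU gradient on dead units:", leaky_grad[:, dead_units].mean())
```

Because the ReLU gradient on those units is exactly zero, no weight update can revive them, whereas the small Leaky ReLU slope leaves a path for recovery.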

### 3. Exploding Gradient Problem
**Problem**: Gradients become very large, causing unstable training.

**Solutions**:
- Gradient clipping
- Proper weight initialization
- Lower learning rates
- Batch normalization
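
Gradient clipping can be sketched in a few lines. The version below clips by global L2 norm, which is one common variant; the function name `clip_by_global_norm` and the example gradients are assumptions for illustration, not an API from a specific framework.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two parameter arrays: sqrt(3^2 + 4^2 + 12^2) = 13
grads = [np.array([[3.0, 4.0]]), np.array([12.0])]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"Norm before clipping: {norm_before:.2f}")
print(f"Norm after clipping:  {norm_after:.2f}")
```

Every array is scaled by the same factor, so the direction of the update is preserved and only its length is capped. Gradients whose norm is already below `max_norm` pass through unchanged.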

In [None]:
# Demonstrate vanishing gradient problem
def demonstrate_vanishing_gradients():
    # Simulate gradient flow through stacked sigmoid layers (scalar units, all weights = 1, no biases)
    layers = [5, 10, 15, 20]
    x = np.linspace(-3, 3, 100)
    
    plt.figure(figsize=(15, 10))
    
    for i, n_layers in enumerate(layers):
        plt.subplot(2, 2, i + 1)
        
        # Calculate gradient through n layers
        gradient = np.ones_like(x)
        activation = x.copy()
        
        # Forward pass through layers
        activations = [activation]
        for layer in range(n_layers):
            activation = sigmoid(activation)
            activations.append(activation)
        
        # Backward pass (chain rule)
        for layer in range(n_layers):
            # Sigmoid derivative: σ(x) * (1 - σ(x))
            layer_gradient = activations[n_layers - layer] * (1 - activations[n_layers - layer])
            gradient *= layer_gradient
        
        plt.plot(x, gradient, linewidth=3)
        plt.title(f'Gradient after {n_layers} Sigmoid Layers')
        plt.xlabel('Input')
        plt.ylabel('Gradient Magnitude')
        plt.grid(True, alpha=0.3)
        plt.yscale('log')
        
        # Show vanishing effect
        max_gradient = np.max(gradient)
        plt.text(0, max_gradient/2, f'Max gradient: {max_gradient:.2e}', 
                bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7))
    
    plt.tight_layout()
    plt.show()
    
    print("As we add more sigmoid layers, gradients become exponentially smaller!")
    print("This makes it very difficult to train deep networks with sigmoid activation.")

demonstrate_vanishing_gradients()

## Key Takeaways

### Why Activation Functions Matter:
1. **Enable non-linearity**: Without them, deep networks = linear models
2. **Control information flow**: Determine what gets passed to next layer
3. **Affect learning speed**: Gradient properties impact training
4. **Shape decision boundaries**: Different functions → different capabilities

### Modern Best Practices:
1. **Start with ReLU** for hidden layers
2. **Consider Leaky ReLU/ELU** if ReLU causes problems
3. **Use sigmoid/softmax** for output layers (classification)
4. **Avoid sigmoid/tanh** in the hidden layers of deep networks (vanishing gradients)
5. **Match activation to problem type**

### Historical Evolution:
- **1940s-1980s**: Step functions, early sigmoid
- **1980s-2000s**: Sigmoid, tanh dominance
- **2010s-present**: ReLU revolution
- **Late 2010s-present**: Swish, GELU, learned activations

## Discussion Questions

1. Why did ReLU become so popular despite being "just" max(0,x)?
2. When might you still choose sigmoid over ReLU?
3. How do activation functions relate to biological neurons?
4. What problems might arise with very deep ReLU networks?

---

**Next**: We'll explore **Loss Functions** - how neural networks measure and minimize their mistakes!