# Lab 1.3: Activation Function Implementation

## Duration: 45 minutes

## Learning Objectives
By the end of this lab, you will be able to:
- Understand the purpose and importance of activation functions in neural networks
- Implement common activation functions from scratch
- Analyze the mathematical properties of different activation functions
- Visualize activation functions and their derivatives
- Apply activation functions to transform neural network outputs

## Prerequisites
- Completed Lab 1.1 (Environment Setup)
- Completed Lab 1.2 (Mathematical Foundations)
- Understanding of derivatives and function behavior

---

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib
%matplotlib inline
plt.style.use('default')

print("Environment ready for activation function implementation!")

## Part 1: Understanding the Need for Activation Functions

Let's first understand why we need activation functions in neural networks.

In [None]:
print("=" * 50)
print("PART 1: WHY DO WE NEED ACTIVATION FUNCTIONS?")
print("=" * 50)

# Example: Linear transformations without activation functions
print("Linear Network Example (WITHOUT activation functions):")
print("-" * 55)

# Input
x = np.array([1, 2])
print(f"Input: {x}")

# First layer weights and computation
W1 = np.array([[0.5, 0.3], [0.2, 0.8]])
z1 = np.dot(x, W1)
print(f"\nFirst layer: z1 = x @ W1 = {z1}")

# Second layer weights and computation
W2 = np.array([[1.0, 0.5], [0.3, 1.2]])
z2 = np.dot(z1, W2)
print(f"Second layer: z2 = z1 @ W2 = {z2}")

# This is equivalent to a single linear transformation!
W_combined = np.dot(W1, W2)
z_direct = np.dot(x, W_combined)
print(f"\nDirect computation: x @ (W1 @ W2) = {z_direct}")
print(f"Are they equal? {np.allclose(z2, z_direct)}")

print("\n💡 Key Insight: Without activation functions, multiple layers collapse")
print("   into a single linear transformation! We need non-linearity.")
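
The flip side of the linear example above can be sketched as follows: once a ReLU sits between the two layers, the combined matrix no longer reproduces the result. The input `x_demo` is an assumption introduced here (it is not part of the lab); it is chosen so that one first-layer value is negative, because with the lab's input `[1, 2]` both values are positive and ReLU would leave them unchanged.

```python
import numpy as np

# Same weights as the linear example above
W1 = np.array([[0.5, 0.3], [0.2, 0.8]])
W2 = np.array([[1.0, 0.5], [0.3, 1.2]])

# Hypothetical input chosen so that one first-layer value is negative
x_demo = np.array([1.0, -2.0])

z1 = x_demo @ W1                     # first-layer pre-activation
linear_out = z1 @ W2                 # two linear layers, no activation
relu_out = np.maximum(0, z1) @ W2    # ReLU applied between the layers
collapsed = x_demo @ (W1 @ W2)       # single combined matrix

print(f"z1:                  {z1}")
print(f"Linear two-layer:    {linear_out}")
print(f"Collapsed matrix:    {collapsed}")
print(f"With ReLU between:   {relu_out}")
print(f"Linear == collapsed? {np.allclose(linear_out, collapsed)}")
print(f"ReLU   == collapsed? {np.allclose(relu_out, collapsed)}")
```

The purely linear stack matches the collapsed matrix, while the ReLU version does not, which is the non-linearity the key insight calls for.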

## Part 2: Implementing Basic Activation Functions

Let's implement the most common activation functions from scratch.

In [None]:
print("=" * 40)
print("PART 2: ACTIVATION FUNCTION IMPLEMENTATIONS")
print("=" * 40)

# 1. Sigmoid (Logistic) Function
def sigmoid(z):
    """
    Sigmoid activation function: σ(z) = 1 / (1 + e^(-z))
    
    Args:
        z: Input value(s) - can be scalar, vector, or matrix
    
    Returns:
        Output in range (0, 1)
    """
    # Clip z so that np.exp(-z) cannot overflow float64
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """
    Derivative of sigmoid function: σ'(z) = σ(z) * (1 - σ(z))
    """
    s = sigmoid(z)
    return s * (1 - s)

print("✅ Sigmoid function implemented")

# 2. Hyperbolic Tangent (tanh) Function
def tanh(z):
    """
    Hyperbolic tangent activation function: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
    
    Args:
        z: Input value(s)
    
    Returns:
        Output in range (-1, 1)
    """
    return np.tanh(z)  # NumPy has an optimized version

def tanh_derivative(z):
    """
    Derivative of tanh function: tanh'(z) = 1 - tanh²(z)
    """
    return 1 - np.tanh(z)**2

print("✅ Tanh function implemented")

# 3. Rectified Linear Unit (ReLU) Function
def relu(z):
    """
    ReLU activation function: ReLU(z) = max(0, z)
    
    Args:
        z: Input value(s)
    
    Returns:
        Output in range [0, +∞)
    """
    return np.maximum(0, z)

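def relu_derivative(z):
    """
    Derivative of ReLU function: ReLU'(z) = 1 if z > 0, else 0
    (the derivative is undefined at z = 0; this implementation uses 0 there)
    """
    # np.asarray lets the function accept Python scalars as well as arrays
    return (np.asarray(z) > 0).astype(float)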

print("✅ ReLU function implemented")

# 4. Leaky ReLU Function
def leaky_relu(z, alpha=0.01):
    """
    Leaky ReLU activation function: LeakyReLU(z) = max(αz, z), for 0 ≤ α < 1
    
    Args:
        z: Input value(s)
        alpha: Slope for negative values (default: 0.01)
    
    Returns:
        Output allowing small negative values
    """
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """
    Derivative of Leaky ReLU function: 1 if z > 0, else α
    """
    return np.where(z > 0, 1.0, alpha)

print("✅ Leaky ReLU function implemented")

# 5. Linear (Identity) Function
def linear(z):
    """
    Linear activation function: f(z) = z
    Often used in output layers for regression
    """
    return z

def linear_derivative(z):
    """
    Derivative of linear function: f'(z) = 1
    """
    return np.ones_like(z)

print("✅ Linear function implemented")
print("\nAll activation functions ready for testing!")
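
One way to gain confidence in hand-written derivatives is to compare them with a central finite difference. The sketch below is a minimal check under a few assumptions: `numerical_derivative` and the step size `h=1e-5` are illustrative choices made here, not part of the lab, and the functions are redefined so the block runs on its own. Only the smooth functions are checked, since a finite difference across the kink of ReLU at z = 0 would not match the convention used for its derivative.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_derivative(z):
    return 1 - np.tanh(z) ** 2

def numerical_derivative(f, z, h=1e-5):
    """Central difference: (f(z + h) - f(z - h)) / (2h). Hypothetical helper."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.linspace(-4, 4, 9)

checks = {
    'Sigmoid': (sigmoid, sigmoid_derivative),
    'Tanh': (np.tanh, tanh_derivative),
}

max_errors = {}
for name, (f, df) in checks.items():
    max_errors[name] = np.max(np.abs(df(z) - numerical_derivative(f, z)))
    print(f"{name:<8} max |analytic - numerical| = {max_errors[name]:.2e}")
```

If an analytic derivative were wrong (for example a missing factor), the reported error would be many orders of magnitude larger than the finite-difference noise.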

## Part 3: Testing Activation Functions

Let's test our implementations with various inputs.

In [None]:
print("=" * 35)
print("PART 3: TESTING ACTIVATION FUNCTIONS")
print("=" * 35)

# Test inputs
test_inputs = np.array([-5, -2, -1, 0, 1, 2, 5])
print(f"Test inputs: {test_inputs}")
print("\n" + "="*80)

# Test each activation function
functions = {
    'Sigmoid': sigmoid,
    'Tanh': tanh,
    'ReLU': relu,
    'Leaky ReLU': leaky_relu,
    'Linear': linear
}

print(f"{'Function':<12} | {'Input':<20} | {'Output':<30}")
print("-" * 80)

for name, func in functions.items():
    outputs = func(test_inputs)
    print(f"{name:<12} | {str(test_inputs):<20} | {str(np.round(outputs, 4)):<30}")

print("\n💡 Observations:")
print("   - Sigmoid: Outputs between 0 and 1")
print("   - Tanh: Outputs between -1 and 1, zero-centered")
print("   - ReLU: Zero for negative inputs, linear for positive")
print("   - Leaky ReLU: Small slope for negative inputs")
print("   - Linear: Outputs equal to inputs")
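
Beyond eyeballing the table, a few assertions can pin down values that are known exactly. The sketch below is one possible set of sanity checks, with the functions redefined so the block is self-contained; the particular test values and the `(3, 4)` batch shape are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Known values
assert np.isclose(sigmoid(0.0), 0.5)
assert np.isclose(np.tanh(0.0), 0.0)
assert np.array_equal(relu(np.array([-2.0, 0.0, 3.0])), np.array([0.0, 0.0, 3.0]))
assert np.allclose(leaky_relu(np.array([-2.0, 3.0])), np.array([-0.02, 3.0]))

# Symmetry of the sigmoid: sigmoid(-z) = 1 - sigmoid(z)
z = np.array([-5.0, -1.0, 0.5, 2.0])
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Activation functions should preserve the input shape
batch = np.zeros((3, 4))
shapes_ok = all(f(batch).shape == batch.shape
                for f in (sigmoid, np.tanh, relu, leaky_relu))
assert shapes_ok

print("All sanity checks passed")
```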

## Part 4: Visualizing Activation Functions

Visual understanding is crucial for choosing the right activation function.

In [None]:
print("=" * 40)
print("PART 4: VISUALIZING ACTIVATION FUNCTIONS")
print("=" * 40)

# Create input range for plotting
x = np.linspace(-6, 6, 1000)

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Activation Functions and Their Derivatives', fontsize=16, fontweight='bold')

# Function definitions for plotting
activation_functions = [
    ('Sigmoid', sigmoid, sigmoid_derivative, 'blue'),
    ('Tanh', tanh, tanh_derivative, 'red'),
    ('ReLU', relu, relu_derivative, 'green'),
    ('Leaky ReLU', leaky_relu, leaky_relu_derivative, 'orange'),
    ('Linear', linear, linear_derivative, 'purple')
]

# Plot each function in its own subplot (5 functions, 6 grid positions)
for ax, (name, func, deriv_func, color) in zip(axes.flat, activation_functions):
    # Compute function values
    y = func(x)
    dy = deriv_func(x)

    # Plot function and derivative
    ax.plot(x, y, color=color, linewidth=2, label=f'{name}')
    ax.plot(x, dy, color=color, linewidth=2, linestyle='--', alpha=0.7, label=f"{name}'")

    ax.set_title(f'{name} Function')
    ax.set_xlabel('Input (z)')
    ax.set_ylabel('Output')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.axhline(y=0, color='black', linewidth=0.5)
    ax.axvline(x=0, color='black', linewidth=0.5)

# Remove the empty subplot
fig.delaxes(axes[1, 2])

plt.tight_layout()
plt.show()

print("\n📊 Key Visual Observations:")
print("   - Sigmoid: S-shaped curve, saturates at extremes")
print("   - Tanh: Similar to sigmoid but zero-centered")
print("   - ReLU: Sharp corner at zero, zero gradient for negative inputs")
print("   - Leaky ReLU: Small but non-zero gradient for negative inputs")
print("   - Linear: Constant slope, constant gradient")

## Part 5: Analyzing Activation Function Properties

Let's analyze important properties: range, zero-centering, monotonicity, differentiability, and gradient behavior.

In [None]:
print("=" * 45)
print("PART 5: ACTIVATION FUNCTION PROPERTIES")
print("=" * 45)

# Analyze properties
print(f"{'Function':<12} | {'Range':<15} | {'Zero-Centered':<15} | {'Monotonic':<12} | {'Differentiable':<15}")
print("-" * 85)

properties = [
    ('Sigmoid', '(0, 1)', 'No', 'Yes', 'Yes'),
    ('Tanh', '(-1, 1)', 'Yes', 'Yes', 'Yes'),
    ('ReLU', '[0, +∞)', 'No', 'Yes', 'No (at 0)'),
    ('Leaky ReLU', '(-∞, +∞)', 'No', 'Yes', 'No (at 0)'),
    ('Linear', '(-∞, +∞)', 'Yes', 'Yes', 'Yes')
]

for name, range_val, zero_cent, monotonic, diff in properties:
    print(f"{name:<12} | {range_val:<15} | {zero_cent:<15} | {monotonic:<12} | {diff:<15}")

print("\n🔍 Detailed Analysis:")

In [None]:
# Gradient analysis - important for training
print("\nGradient Analysis:")
print("-" * 25)

# Test gradient magnitudes at different points
test_points = [-3, -1, 0, 1, 3]

print(f"{'Function':<12} | Input: {test_points}")
print("-" * 60)

grad_functions = {
    'Sigmoid': sigmoid_derivative,
    'Tanh': tanh_derivative,
    'ReLU': relu_derivative,
    'Leaky ReLU': leaky_relu_derivative,
    'Linear': linear_derivative
}

for name, grad_func in grad_functions.items():
    gradients = grad_func(np.array(test_points))
    grad_str = [f"{g:.3f}" for g in gradients]
    print(f"{name:<12} | Gradients: {grad_str}")

print("\n⚠️  Gradient Problems:")
print("   - Sigmoid/Tanh: Vanishing gradients for large |z|")
print("   - ReLU: Dead neurons (zero gradient for z ≤ 0)")
print("   - Leaky ReLU: Helps with dead neuron problem")
print("   - Linear: No gradient issues, but no non-linearity")
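
The vanishing-gradient point above can be made concrete with a rough bound. During backpropagation each sigmoid layer multiplies the gradient by σ'(z), which is at most 0.25. The sketch below shows how quickly that bound shrinks with depth; it deliberately ignores the weight matrices, so it is an illustration of the activation's contribution only, not a full backpropagation.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

# Best case for sigmoid: every pre-activation is 0, where the slope peaks
peak = sigmoid_derivative(0.0)
print(f"Peak sigmoid slope: {peak}")

for depth in (1, 5, 10, 20):
    sigmoid_factor = peak ** depth   # upper bound contributed by the activations
    relu_factor = 1.0 ** depth       # ReLU slope is exactly 1 on the active side
    print(f"depth {depth:>2}: sigmoid <= {sigmoid_factor:.2e}, "
          f"ReLU (active units) = {relu_factor:.0f}")
```

Even in the best case the sigmoid factor falls below one in a million by ten layers, whereas active ReLU units pass the gradient through unchanged.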

In [None]:
# Saturation analysis
print("\nSaturation Analysis:")
print("-" * 25)

# Check how quickly functions saturate
large_inputs = np.array([-10, -5, -2, 2, 5, 10])

print(f"{'Function':<12} | Input: {large_inputs}")
print("-" * 80)

for name, func in functions.items():
    # leaky_relu falls back to its default alpha=0.01, so no special case is needed
    outputs = func(large_inputs)
    out_str = [f"{o:.3f}" for o in outputs]
    print(f"{name:<12} | Outputs: {out_str}")

print("\n📈 Saturation Observations:")
print("   - Sigmoid: Saturates quickly (≈0 for z<-5, ≈1 for z>5)")
print("   - Tanh: Saturates at ±1")
print("   - ReLU: No saturation for positive inputs")
print("   - Leaky ReLU: No saturation")
print("   - Linear: No saturation")

## Part 6: Practical Applications and Neural Network Integration

Let's see how activation functions work in a simple neural network context.

In [None]:
print("=" * 50)
print("PART 6: ACTIVATION FUNCTIONS IN NEURAL NETWORKS")
print("=" * 50)

# Simulate a simple neural network layer
print("Simple Neural Network Layer Simulation:")
print("-" * 45)

# Sample input data (batch of 3 samples, 4 features each)
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [-0.5, 1.5, 2.0, -1.0],
              [2.0, -1.0, 0.0, 1.5]])

print(f"Input data X (3 samples, 4 features):\n{X}")

# Weights and bias for a layer with 3 neurons
W = np.random.randn(4, 3) * 0.5  # Small random weights
b = np.array([0.1, -0.2, 0.0])   # Small bias values

print(f"\nWeights W (4 inputs, 3 neurons):\n{W}")
print(f"\nBias b: {b}")

# Compute linear combination (before activation)
Z = np.dot(X, W) + b
print(f"\nLinear output Z = XW + b:\n{Z}")

In [None]:
# Apply different activation functions
print("\nApplying Different Activation Functions:")
print("-" * 45)

activations = {
    'No activation (linear)': Z,
    'Sigmoid': sigmoid(Z),
    'Tanh': tanh(Z),
    'ReLU': relu(Z),
    'Leaky ReLU': leaky_relu(Z)
}

for name, activation in activations.items():
    print(f"\n{name}:")
    print(activation)
    print(f"Range: [{np.min(activation):.3f}, {np.max(activation):.3f}]")

print("\n💡 Notice how each activation function transforms the outputs differently!")

In [None]:
# Demonstrate the effect of different activation functions on network depth
print("\nDeep Network Simulation (5 layers):")
print("-" * 40)

# Start with a simple input
x = np.array([1.0, -0.5])
print(f"Initial input: {x}")

# Define weights for each layer (2->2->2->2->2->1)
weights = [
    np.array([[0.5, -0.3], [0.2, 0.8]]),  # Layer 1
    np.array([[0.4, 0.6], [-0.1, 0.5]]),  # Layer 2  
    np.array([[0.3, -0.4], [0.7, 0.2]]),  # Layer 3
    np.array([[0.1, 0.9], [-0.6, 0.3]]),  # Layer 4
    np.array([[0.8], [0.4]])               # Layer 5 (output)
]

# Test with different activation functions
activation_funcs = {'ReLU': relu, 'Sigmoid': sigmoid, 'Tanh': tanh}

for act_name, act_func in activation_funcs.items():
    print(f"\n--- Using {act_name} activation ---")
    current_input = x.copy()
    
    for layer, W in enumerate(weights):
        # Linear transformation
        z = np.dot(current_input, W)
        
        # Apply activation (except for output layer)
        if layer < len(weights) - 1:  # Hidden layers
            current_input = act_func(z)
            print(f"  Layer {layer+1}: {current_input}")
        else:  # Output layer (no activation for this example)
            output = z
            print(f"  Output: {output}")

print("\n🧠 Observations:")
print("   - Different activation functions lead to different outputs")
print("   - ReLU can lead to some neurons 'dying' (outputting 0)")
print("   - Sigmoid/Tanh keep values in bounded ranges")
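
The "dying" ReLU observation can be isolated in a single neuron. In the sketch below, the data, the weights, and the large negative bias are all hypothetical values chosen so that every pre-activation is negative; under that assumption ReLU outputs zero for every sample and passes back a zero gradient, while Leaky ReLU still passes a small one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # 100 hypothetical samples, 3 features
w = np.array([0.5, -0.3, 0.2])      # hypothetical weights
b = -10.0                           # large negative bias (assumed for the demo)

z = X @ w + b                       # pre-activations, all negative here
relu_out = np.maximum(0, z)
relu_grad = (z > 0).astype(float)
leaky_grad = np.where(z > 0, 1.0, 0.01)

print(f"Largest pre-activation:  {z.max():.3f}")
print(f"ReLU outputs all zero?   {np.all(relu_out == 0)}")
print(f"ReLU gradients all zero? {np.all(relu_grad == 0)}")
print(f"Leaky ReLU gradient:     {leaky_grad[0]}")
```

With a zero gradient on every sample, gradient descent has no signal to move `w` or `b`, which is why such a neuron tends to stay dead.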

## Part 7: Advanced Activation Functions (Bonus)

Let's implement some modern activation functions used in current research.

In [None]:
print("=" * 40)
print("PART 7: ADVANCED ACTIVATION FUNCTIONS")
print("=" * 40)

# Swish (SiLU) activation function
def swish(z):
    """
    Swish activation function: f(z) = z * sigmoid(z)
    Also known as SiLU (Sigmoid Linear Unit)
    """
    return z * sigmoid(z)

def swish_derivative(z):
    """
    Derivative of Swish function
    """
    s = sigmoid(z)
    return s + z * s * (1 - s)

# GELU activation function (approximation)
def gelu(z):
    """
    GELU activation function (Gaussian Error Linear Unit)
    Approximation: GELU(z) ≈ 0.5 * z * (1 + tanh(√(2/π) * (z + 0.044715 * z³)))
    """
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

# ELU activation function
def elu(z, alpha=1.0):
    """
    ELU activation function: f(z) = z if z > 0, else α(e^z - 1)
    """
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

print("✅ Advanced activation functions implemented")

# Test advanced functions
x_test = np.linspace(-3, 3, 100)

plt.figure(figsize=(15, 5))

# Plot advanced activation functions
plt.subplot(1, 3, 1)
plt.plot(x_test, swish(x_test), 'b-', linewidth=2, label='Swish')
plt.plot(x_test, swish_derivative(x_test), 'b--', linewidth=2, alpha=0.7, label="Swish'")
plt.title('Swish Activation')
plt.xlabel('Input (z)')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)
plt.legend()

plt.subplot(1, 3, 2)
plt.plot(x_test, gelu(x_test), 'r-', linewidth=2, label='GELU')
plt.title('GELU Activation')
plt.xlabel('Input (z)')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)
plt.legend()

plt.subplot(1, 3, 3)
plt.plot(x_test, elu(x_test), 'g-', linewidth=2, label='ELU')
plt.title('ELU Activation')
plt.xlabel('Input (z)')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()

print("\n🔬 Advanced Function Properties:")
print("   - Swish: Smooth, non-monotonic, performs well in deep networks")
print("   - GELU: Smooth approximation of ReLU, used in transformers")
print("   - ELU: Smooth, has negative values, reduces bias shift")
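
Since the `gelu` above is the tanh approximation, it can be useful to compare it with the exact definition GELU(z) = z · Φ(z), where Φ is the standard normal CDF. The sketch below uses `math.erf` from the standard library; `gelu_exact` and `gelu_tanh` are names introduced here for illustration, and the comparison range [-3, 3] is an assumption matching the plots.

```python
import math
import numpy as np

def gelu_tanh(z):
    """Tanh approximation, same formula as the lab's gelu."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def gelu_exact(z):
    """Exact form: 0.5 * z * (1 + erf(z / sqrt(2)))."""
    erf = np.vectorize(math.erf)   # math.erf only accepts scalars
    return 0.5 * z * (1 + erf(z / math.sqrt(2)))

z = np.linspace(-3, 3, 601)
max_gap = np.max(np.abs(gelu_tanh(z) - gelu_exact(z)))
print(f"Max |tanh approximation - exact| on [-3, 3]: {max_gap:.2e}")
```

The gap should be small enough that the two curves are indistinguishable on the plot.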

## Part 8: Choosing the Right Activation Function

Guidelines for selecting activation functions in different scenarios.

In [None]:
print("=" * 50)
print("PART 8: ACTIVATION FUNCTION SELECTION GUIDE")
print("=" * 50)

selection_guide = {
    'Hidden Layers': {
        'Default choice': 'ReLU - Fast, simple, works well in most cases',
        'Deep networks': 'ReLU, Leaky ReLU, or ELU - Help with gradient flow',
        'When ReLU fails': 'Leaky ReLU, ELU, or Swish - Avoid dead neurons',
        'Research/Cutting-edge': 'GELU, Swish - Better performance in some cases'
    },
    'Output Layers': {
        'Binary classification': 'Sigmoid - Outputs a probability in (0, 1)',
        'Multi-class classification': 'Softmax - Outputs probability distribution',
        'Regression': 'Linear - No constraints on output range',
        'Bounded regression': 'Sigmoid or Tanh - For bounded target values'
    }
}

for layer_type, recommendations in selection_guide.items():
    print(f"\n{layer_type}:")
    print("-" * (len(layer_type) + 1))
    for scenario, recommendation in recommendations.items():
        print(f"  {scenario}: {recommendation}")

print("\n⚡ Performance Comparison (General Guidelines):")
print("-" * 50)

performance_data = {
    'Function': ['ReLU', 'Leaky ReLU', 'ELU', 'Sigmoid', 'Tanh', 'Swish', 'GELU'],
    'Speed': ['Fast', 'Fast', 'Medium', 'Slow', 'Medium', 'Slow', 'Slow'],
    'Gradient Flow': ['Good*', 'Good', 'Good', 'Poor', 'Poor', 'Good', 'Good'],
    'Deep Networks': ['Good*', 'Good', 'Good', 'Poor', 'Poor', 'Excellent', 'Excellent']
}

print(f"{'Function':<12} | {'Speed':<8} | {'Gradient Flow':<13} | {'Deep Networks':<13}")
print("-" * 55)
for i in range(len(performance_data['Function'])):
    func = performance_data['Function'][i]
    speed = performance_data['Speed'][i]
    grad = performance_data['Gradient Flow'][i]
    deep = performance_data['Deep Networks'][i]
    print(f"{func:<12} | {speed:<8} | {grad:<13} | {deep:<13}")

print("\n* Can suffer from dead neuron problem")
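
The guide recommends Softmax for multi-class outputs, but this lab does not implement it. A minimal sketch is shown below, assuming row-wise logits (one row per sample); subtracting the row maximum before exponentiating is a common way to avoid overflow and does not change the result. The example logits are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Hypothetical logits; the second row would overflow a naive np.exp
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1000.0, 1000.0]])
probs = softmax(logits)

print(probs)
print("Row sums:", probs.sum(axis=1))
```

Each row sums to 1, and equal logits produce equal probabilities regardless of their magnitude.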

## Progress Checklist

Mark each concept as understood:

- [ ] Purpose of activation functions in neural networks
- [ ] Sigmoid function implementation and properties
- [ ] Tanh function implementation and properties
- [ ] ReLU function implementation and properties
- [ ] Leaky ReLU function implementation and properties
- [ ] Visualization of activation functions and derivatives
- [ ] Understanding of gradient behavior
- [ ] Saturation effects in different functions
- [ ] Integration of activation functions in neural networks
- [ ] Guidelines for choosing activation functions

## Troubleshooting

### Common Issues:

**1. Numerical overflow in sigmoid:**
- Solution: Clip input values to prevent exp() overflow
- Use: `z = np.clip(z, -500, 500)`

**2. NaN values in calculations:**
- Check for division by zero
- Ensure proper handling of edge cases (e.g., z=0 in derivatives)

**3. Plotting issues:**
- Ensure input ranges are appropriate for each function
- Use sufficient resolution for smooth curves

**4. Shape mismatches:**
- Activation functions should preserve input shape
- Check that derivative functions return same shape as input

**5. Performance issues:**
- For large arrays, consider using NumPy's built-in functions when available
- Profile your code to identify bottlenecks
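
For issue 1, clipping is one option; another common pattern is to branch on the sign of the input so that `np.exp` only ever receives non-positive arguments. The sketch below is an alternative to the lab's clipped `sigmoid`, not a replacement for it; `stable_sigmoid` is a name introduced here, and it always returns an array (scalars become length-1 arrays).

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never calls exp on a positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0: 1 / (1 + e^(-z)), where -z <= 0
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    # For z < 0: e^z / (1 + e^z), an algebraically equal form with z < 0
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1 + exp_z)
    return out

z = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
result = stable_sigmoid(z)
print(result)
```

Both branches compute the same function; the split only changes which form is evaluated, so extreme inputs map cleanly to 0 and 1 without overflow.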

## Key Concepts Summary

1. **Non-linearity**: Activation functions introduce non-linearity to neural networks
2. **Gradient Flow**: Different functions affect how gradients flow during backpropagation
3. **Saturation**: Some functions saturate (flat gradients) at extreme values
4. **Dead Neurons**: ReLU can cause neurons to permanently output zero
5. **Function Choice**: Selection depends on layer type, network depth, and problem domain
6. **Computational Efficiency**: ReLU needs only a comparison per element, while sigmoid and tanh require exponentials, so ReLU is typically cheaper to compute
7. **Range and Centering**: Output ranges and zero-centering affect network behavior

## Next Steps

In the next lab, we'll implement a basic neuron that combines the mathematical operations from Lab 1.2 with the activation functions from this lab to create a complete computational unit.

---

**Congratulations! You've successfully implemented and analyzed activation functions for neural networks!**