# 🧠 MNIST Activation Functions Comparison

## 🧩 Problem Statement

### What Problem Are We Solving?

We're building **three neural networks** (computer brains) to recognize handwritten digits (0-9) and comparing how different **activation functions** affect their performance.

### Real-Life Analogy 🏫

Imagine three students learning to read handwritten numbers:
- **Student A (Sigmoid)**: Wakes up slowly, like on a lazy Sunday
- **Student B (Tanh)**: Wakes up moderately, like for school
- **Student C (ReLU)**: Wakes up instantly, like on Christmas morning!

We want to find which student learns fastest and gets the best grades.

---

## 🪜 Steps to Solve the Problem

```mermaid
flowchart TD
    A[📥 Load MNIST Data] --> B[🔧 Preprocess Data]
    B --> C[🏗️ Build 3 Models]
    C --> D[📚 Train All Models]
    D --> E[📊 Compare Results]
    E --> F[🔬 Analyze Gradients]
    F --> G[📝 Write Report]
```

1. **Load Data**: Get 60,000 training images of handwritten digits
2. **Preprocess**: Flatten 28x28 images to 784 numbers, normalize to 0-1
3. **Build Models**: Create 3 identical networks with different activations
4. **Train**: Each model learns for 20 epochs
5. **Compare**: Visualize accuracy, loss, and training time
6. **Analyze**: Check gradient flow to understand vanishing gradients

---

## 🎯 Expected Output

| Model | Expected Accuracy | Training Speed | Gradient Flow |
|-------|-------------------|----------------|---------------|
| Sigmoid | ~97-98% | Slowest | Weak |
| Tanh | ~97-98% | Medium | Medium |
| ReLU | ~98% | Fastest | Strong |

---

## 📚 Section 1: Imports

### 🔹 Import TensorFlow

#### 2.1 What the line does
Imports TensorFlow, the main deep learning framework, with alias `tf`.

#### 2.2 Why it is used
TensorFlow provides tools to build, train, and evaluate neural networks. It's industry standard and has Keras built-in.
- **Alternative**: PyTorch (equally popular)
- **Why TensorFlow**: Keras integration makes it beginner-friendly

#### 2.3 When to use it
At the start of any deep learning project.

#### 2.4 Where to use it
AI/ML companies like Google, Netflix, Uber use TensorFlow.

#### 2.5 How to use it
```python
import tensorflow as tf  # Convention is to use 'tf' alias
```

#### 2.6 How it works internally
TensorFlow 2 runs operations eagerly by default; Keras compiles the training step into a computation graph (via `tf.function`) and executes it on CPU or GPU.

#### 2.7 Output
Makes `tf.*` functions available (tf.keras, tf.GradientTape, etc.)

In [None]:
import tensorflow as tf
print(f"TensorFlow Version: {tf.__version__}")

### 🔹 Import NumPy

#### 2.1 What the line does
Imports NumPy for numerical operations on arrays.

#### 2.2 Why it is used
NumPy is faster than Python lists for mathematical operations. Essential for data manipulation.
- **Alternative**: Pure Python lists
- **Why NumPy**: Often 10-100x faster for large arrays (the exact speedup depends on the operation and hardware)

#### 2.3 When to use it
Always in data science and ML projects.

#### 2.4 Where to use it
Every data science project uses NumPy.

#### 2.5 How to use it
```python
import numpy as np  # Convention is 'np' alias
```

#### 2.6 How it works internally
Uses C-optimized code for fast array operations.

#### 2.7 Output
Makes `np.*` functions available (np.mean, np.abs, etc.)
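
As a rough illustration of the speed claim above, the sketch below sums the squares of one million integers with a plain Python loop and with a vectorized NumPy expression. The array size and the `timeit` repeat count are arbitrary choices made for this example, and the measured speedup varies by machine, so no specific figure is claimed.

```python
import timeit
import numpy as np

N = 1_000_000  # arbitrary size chosen for illustration
values_list = list(range(N))
values_array = np.arange(N, dtype=np.int64)

def sum_squares_python():
    """Sum of squares using a plain Python loop."""
    return sum(v * v for v in values_list)

def sum_squares_numpy():
    """Sum of squares using a vectorized NumPy expression."""
    return int(np.sum(values_array * values_array))

python_seconds = timeit.timeit(sum_squares_python, number=3)
numpy_seconds = timeit.timeit(sum_squares_numpy, number=3)

print(f"Pure Python: {python_seconds:.3f}s for 3 runs")
print(f"NumPy:       {numpy_seconds:.3f}s for 3 runs")
print(f"Results match: {sum_squares_python() == sum_squares_numpy()}")
```

Both functions return the same value; only the time taken differs.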

In [None]:
import numpy as np
print(f"NumPy Version: {np.__version__}")

### 🔹 Import Matplotlib

#### 2.1 What the line does
Imports matplotlib.pyplot for creating visualizations.

#### 2.2 Why it is used
We need to create plots comparing model performance. pyplot is the most common interface.

#### 2.3 When to use it
When creating charts, graphs, or any visual output.

#### 2.4 Where to use it
Reports, research papers, dashboards.

#### 2.5 How to use it
```python
import matplotlib.pyplot as plt  # Convention is 'plt' alias
```

#### 2.6 How it works internally
Creates figure objects and renders them as images.

#### 2.7 Output
Makes `plt.*` functions available (plt.plot, plt.savefig, etc.)

In [None]:
import matplotlib.pyplot as plt
import time
import os

# Enable inline plotting
%matplotlib inline

---

## ⚙️ Section 2: Configuration

### 🔹 Training Hyperparameters

**Hyperparameters** are like cooking settings - temperature, time, portion size. They control HOW the model trains.

| Parameter | Value | Real-Life Analogy |
|-----------|-------|-------------------|
| EPOCHS | 20 | Number of times to study all flashcards |
| BATCH_SIZE | 128 | Number of flashcards to study before taking a break |
| LEARNING_RATE | 0.001 | How big of a step to take when learning |
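
To make the table concrete, the short calculation below works out how many weight updates these settings imply. It assumes the 60,000-image MNIST training set and that the final, smaller batch of each epoch is kept rather than dropped (the usual behavior when training on in-memory arrays).

```python
import math

TRAIN_SAMPLES = 60_000  # MNIST training set size
EPOCHS = 20
BATCH_SIZE = 128

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
last_batch_size = TRAIN_SAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
total_updates = steps_per_epoch * EPOCHS

print(f"Weight updates per epoch: {steps_per_epoch}")    # 469
print(f"Samples in the final batch: {last_batch_size}")  # 96
print(f"Total weight updates: {total_updates}")          # 9380
```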

In [None]:
# Training hyperparameters
EPOCHS = 20           # How many times to go through all data
BATCH_SIZE = 128      # Samples per weight update
LEARNING_RATE = 0.001 # Step size for optimizer

# Output directory
OUTPUT_DIR = "../outputs"
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Epochs: {EPOCHS}")
print(f"Batch Size: {BATCH_SIZE}")
print(f"Learning Rate: {LEARNING_RATE}")

---

## 📥 Section 3: Load and Preprocess Data

### 🔹 Loading MNIST Dataset

#### 2.1 What the line does
Loads the MNIST dataset - 70,000 images of handwritten digits.

#### 2.2 Why it is used
MNIST is the "Hello World" of machine learning. It's:
- Small (easy to download)
- Fast to train
- Great for learning

#### 2.3 When to use it
When learning ML basics or benchmarking simple models.

#### 2.4 Where to use it
Computer vision, digit recognition, banking (check reading).

#### 2.5 How to use it
```python
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
```

#### 2.6 How it works internally
Downloads the data from the internet on first use, then loads it from a local cache.

#### 2.7 Output
- `X_train`: 60,000 images of shape (28, 28)
- `y_train`: 60,000 labels (0-9)
- `X_test`: 10,000 images of shape (28, 28) for testing
- `y_test`: 10,000 labels (0-9) for testing

In [None]:
# Load MNIST dataset
print("Loading MNIST dataset...")
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

print(f"Original training data shape: {X_train.shape}")  # (60000, 28, 28)
print(f"Training labels shape: {y_train.shape}")         # (60000,)
print(f"Test data shape: {X_test.shape}")                # (10000, 28, 28)

### 🔹 Visualize Sample Images

Let's see what the handwritten digits look like!

In [None]:
# Visualize sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(f"Label: {y_train[i]}", fontsize=12)
    ax.axis('off')
plt.suptitle("Sample MNIST Images", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 🔹 Preprocessing: Flatten and Normalize

#### 2.1 What the line does
1. **Reshape**: Changes 28x28 image to flat 784-length vector
2. **Normalize**: Divides by 255 to get values between 0-1

#### 2.2 Why it is used
- **Flatten**: Dense layers expect 1D input, not 2D images
- **Normalize**: Neural networks train better with small values

```mermaid
flowchart LR
    A[28x28 Image<br>Values: 0-255] --> B[Flatten]
    B --> C[784 Vector<br>Values: 0-255]
    C --> D[Normalize]
    D --> E[784 Vector<br>Values: 0-1]
```

#### 2.5 How to use it
```python
X_train = X_train.reshape(-1, 784) / 255.0
# -1 means "figure out this dimension" (will be 60000)
# 784 = 28 * 28 pixels
# / 255.0 normalizes to 0-1 range
```
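
The same two steps can be tried without downloading MNIST. The sketch below uses a small random array as a stand-in for the images (the name `fake_images` and the seed are made up for this example) and checks the resulting shape, dtype, and value range.

```python
import numpy as np

# Synthetic stand-in for MNIST: 3 random "images" with uint8 pixels in 0-255.
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(3, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a 784-length row; -1 infers the batch size.
flat = fake_images.reshape(-1, 28 * 28)

# Cast before dividing so the result is float32 rather than float64.
normalized = flat.astype("float32") / 255.0

print(flat.shape)        # (3, 784)
print(normalized.dtype)  # float32
print(normalized.min() >= 0.0, normalized.max() <= 1.0)  # True True
```

Casting to `float32` first halves the memory used compared with the default `float64` result of dividing an integer array.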

In [None]:
# Flatten: (60000, 28, 28) -> (60000, 784)
X_train = X_train.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)

# Normalize: 0-255 -> 0-1
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

print(f"After preprocessing:")
print(f"X_train shape: {X_train.shape}")  # (60000, 784)
print(f"X_test shape: {X_test.shape}")    # (10000, 784)
print(f"Value range: [{X_train.min():.2f}, {X_train.max():.2f}]")

---

## 🏗️ Section 4: Build Neural Network Models

### Understanding Activation Functions

An **activation function** decides how a neuron "fires" based on its input. It's like a volume control or filter.

```mermaid
flowchart LR
    A[Input x] --> B[Weighted Sum<br>z = Wx + b]
    B --> C["Activation<br>f(z)"]
    C --> D[Output]
```

| Activation | Formula | Range | Pros | Cons |
|------------|---------|-------|------|------|
| **Sigmoid** | 1/(1+e^-x) | (0, 1) | Smooth, probabilistic | Vanishing gradients |
| **Tanh** | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Zero-centered | Vanishing gradients |
| **ReLU** | max(0, x) | [0, ∞) | Cheap to compute; gradient is 1 for positive inputs | Dead neurons |
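
The pros and cons in the table come down to each function's derivative. As a minimal sketch using only the standard library (the helper names are chosen for this example), the code below evaluates each activation and its derivative at a few inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # The derivative at exactly 0 is undefined; 0 is used here by convention.
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f} (grad {sigmoid_grad(z):.3f})  "
          f"tanh={math.tanh(z):+.3f} (grad {tanh_grad(z):.3f})  "
          f"relu={relu(z):.1f} (grad {relu_grad(z):.1f})")
```

The sigmoid derivative peaks at 0.25 (at z = 0) and the tanh derivative peaks at 1, while ReLU's derivative is either 0 or 1.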

### 🔹 Visualize Activation Functions

In [None]:
# Visualize activation functions
x = np.linspace(-5, 5, 100)

# Compute activations
sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)
relu = np.maximum(0, x)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(x, sigmoid, 'r-', linewidth=2)
axes[0].set_title('Sigmoid: σ(x) = 1/(1+e^-x)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Input (x)')
axes[0].set_ylabel('Output')
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)

axes[1].plot(x, tanh, 'b-', linewidth=2)
axes[1].set_title('Tanh: tanh(x)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Input (x)')
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

axes[2].plot(x, relu, 'g-', linewidth=2)
axes[2].set_title('ReLU: max(0, x)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Input (x)')
axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'activation_functions.png'), dpi=150)
plt.show()

### 🔹 Build Model Function

#### 2.1 What the function does
Creates a neural network with 3 Dense layers (2 hidden + 1 output): 784 → 128 → 64 → 10

#### 2.2 Why this architecture
- **784 inputs**: One for each pixel (28×28 = 784)
- **128 hidden**: Learn basic patterns (edges, curves)
- **64 hidden**: Learn complex patterns (digit shapes)
- **10 outputs**: One for each digit (0-9)

```mermaid
graph LR
    A[784 Inputs<br>Pixels] --> B[128 Neurons<br>Hidden 1]
    B --> C[64 Neurons<br>Hidden 2]
    C --> D[10 Outputs<br>Digits 0-9]
    
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#fff3e0
    style D fill:#e8f5e9
```
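
As a quick check on the size of this architecture, the arithmetic below counts the trainable parameters, assuming fully connected layers with one bias per output neuron.

```python
layer_sizes = [784, 128, 64, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out  # one weight per input-output pair
    biases = n_out          # one bias per output neuron
    total += weights + biases
    print(f"{n_in:>3} -> {n_out:<3}: {weights + biases:,} parameters")

print(f"Total trainable parameters: {total:,}")  # 109,386
```

Most of the parameters (100,480 of them) sit in the first layer, because it connects every pixel to every hidden neuron.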

### ⚙️ Function Arguments

#### 3.1 activation_name (str)
- **What**: String specifying which activation function to use
- **Why**: Different activations have different properties
- **Options**: 'sigmoid', 'tanh', 'relu'
- **Effect**: Changes how neurons compute output

In [None]:
def build_model(activation_name):
    """
    Build a neural network with specified activation function.
    
    Architecture: 784 → 128 → 64 → 10
    
    Parameters:
    -----------
    activation_name : str
        'sigmoid', 'tanh', or 'relu'
        
    Returns:
    --------
    model : tf.keras.Model
        Compiled model ready for training
    """
    model = tf.keras.Sequential([
        # Hidden Layer 1: 784 -> 128 neurons
        tf.keras.layers.Dense(
            units=128,                    # Number of neurons
            activation=activation_name,   # Activation function
            input_shape=(784,)            # Input shape (only for first layer)
        ),
        
        # Hidden Layer 2: 128 -> 64 neurons
        tf.keras.layers.Dense(
            units=64,
            activation=activation_name
        ),
        
        # Output Layer: 64 -> 10 neurons (one per digit)
        tf.keras.layers.Dense(
            units=10,
            activation='softmax'  # Converts to probabilities
        )
    ])
    
    # Compile model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Test building a model
test_model = build_model('relu')
test_model.summary()
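
The `softmax` output layer and the `sparse_categorical_crossentropy` loss used in `build_model` can be sketched in plain NumPy. This is an illustrative re-implementation under simple assumptions (integer labels, a small clipping constant `eps` chosen for this example), not the Keras source code; the logits are made up.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """Mean negative log-probability assigned to the correct integer label."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(np.clip(picked, eps, 1.0))))

# Made-up logits for two samples and ten digit classes.
logits = np.zeros((2, 10))
logits[0, 3] = 5.0   # sample 0 strongly favors digit 3
logits[1, 7] = 0.5   # sample 1 only weakly favors digit 7
labels = np.array([3, 7])

probs = softmax(logits)
print(probs.sum(axis=1))  # each row sums to 1
print(sparse_categorical_crossentropy(labels, probs))
```

"Sparse" here means the labels are plain integers (0-9) rather than one-hot vectors, which is why `y_train` can be passed to the model unchanged.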

---

## 📚 Section 5: Training with Time Tracking

### 🔹 Custom Callback for Timing

A **callback** is code that runs at specific points during training. We use it to measure time per epoch.

#### 2.1 What the code does
Creates a custom callback that records start/end time of each epoch.

#### 2.2 Why it is used
To compare training speed of different activation functions.

#### 2.5 How to use it
```python
class TimeCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):  # Called at start of epoch
        ...
    def on_epoch_end(self, epoch, logs=None):    # Called at end of epoch
        ...
```
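
To show when these hooks fire, here is a TensorFlow-free sketch. `EpochTimer` mirrors the hook names of a Keras callback, and `fake_fit` is a hypothetical stand-in for a training loop, with a short `sleep` in place of real training work.

```python
import time

class EpochTimer:
    """Stand-in with the same hook names as a Keras callback (no TensorFlow)."""
    def __init__(self):
        self.times = []
        self.epoch_start = 0.0

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.perf_counter() - self.epoch_start)

def fake_fit(epochs, callbacks):
    """Hypothetical training loop that only shows when the hooks fire."""
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        time.sleep(0.01)  # placeholder for one epoch of training work
        for cb in callbacks:
            cb.on_epoch_end(epoch)

timer = EpochTimer()
fake_fit(epochs=3, callbacks=[timer])
print([f"{t:.3f}s" for t in timer.times])
```

After the loop, `timer.times` holds one duration per epoch, which is the same shape of data the real callback collects.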

In [None]:
class TimeCallback(tf.keras.callbacks.Callback):
    """
    Custom callback to measure training time per epoch.
    
    Like a stopwatch for each study session!
    """
    def __init__(self):
        super().__init__()
        self.times = []       # Store time for each epoch
        self.epoch_start = 0  # Track when epoch started
        
    def on_epoch_begin(self, epoch, logs=None):
        """Start the timer when epoch begins."""
        self.epoch_start = time.perf_counter()  # monotonic clock, suited to timing
        
    def on_epoch_end(self, epoch, logs=None):
        """Record elapsed time when epoch ends."""
        elapsed = time.perf_counter() - self.epoch_start
        self.times.append(elapsed)

### 🔹 Training Function

#### model.fit() Arguments Explained

| Argument | Value | Why |
|----------|-------|-----|
| `X_train` | Training images | Model learns from these |
| `y_train` | Correct labels | Model checks answers |
| `epochs` | 20 | Number of full passes |
| `batch_size` | 128 | Samples per update |
| `validation_data` | (X_test, y_test) | Tracks loss/accuracy on held-out data each epoch to spot overfitting (this notebook reuses the test set for that) |
| `callbacks` | [TimeCallback()] | Track timing |
| `verbose` | 1 | Show progress bar |

In [None]:
def train_model(model, X_train, y_train, X_test, y_test, model_name):
    """
    Train model and track time per epoch.
    
    Returns:
    --------
    history : Training history with metrics
    epoch_times : List of times for each epoch
    """
    print(f"\n{'='*60}")
    print(f"Training {model_name}")
    print(f"{'='*60}")
    
    time_callback = TimeCallback()
    
    history = model.fit(
        X_train, y_train,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        validation_data=(X_test, y_test),
        callbacks=[time_callback],
        verbose=1
    )
    
    return history, time_callback.times

---

## 🚀 Section 6: Train All Three Models

Now we train all three models and collect their results!

In [None]:
# Define models to compare
activations = ['sigmoid', 'tanh', 'relu']
model_names = ['Sigmoid Model', 'Tanh Model', 'ReLU Model']

# Storage for results
histories = []
epoch_times_list = []
models = []

# Train each model
for activation, name in zip(activations, model_names):
    # Build fresh model
    model = build_model(activation)
    
    # Train
    history, epoch_times = train_model(model, X_train, y_train, X_test, y_test, name)
    
    # Store results
    histories.append(history)
    epoch_times_list.append(epoch_times)
    models.append(model)
    
    # Evaluate
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print(f"\n{name} - Test Accuracy: {test_acc:.4f}")

---

## 📊 Section 7: Visualize Results

### 🔹 Plot Training History

In [None]:
# Define colors for each model
colors = ['#e74c3c', '#3498db', '#2ecc71']  # Red, Blue, Green

# Create figure with 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# ================== Plot 1: Training & Validation Accuracy ==================
ax1 = axes[0, 0]
for i, (history, name) in enumerate(zip(histories, model_names)):
    ax1.plot(history.history['accuracy'], color=colors[i], linestyle='-', 
             linewidth=2, label=f'{name} (Train)')
    ax1.plot(history.history['val_accuracy'], color=colors[i], linestyle='--', 
             linewidth=2, label=f'{name} (Val)')

ax1.set_title('Training vs Validation Accuracy', fontsize=14, fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend(loc='lower right', fontsize=8)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0.8, 1.0])

# ================== Plot 2: Training & Validation Loss ==================
ax2 = axes[0, 1]
for i, (history, name) in enumerate(zip(histories, model_names)):
    ax2.plot(history.history['loss'], color=colors[i], linestyle='-', 
             linewidth=2, label=f'{name} (Train)')
    ax2.plot(history.history['val_loss'], color=colors[i], linestyle='--', 
             linewidth=2, label=f'{name} (Val)')

ax2.set_title('Training vs Validation Loss', fontsize=14, fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend(loc='upper right', fontsize=8)
ax2.grid(True, alpha=0.3)

# ================== Plot 3: Final Test Accuracy Bar Chart ==================
ax3 = axes[1, 0]
final_accuracies = [h.history['val_accuracy'][-1] for h in histories]
bars = ax3.bar(model_names, final_accuracies, color=colors, edgecolor='black', linewidth=1.5)

for bar, acc in zip(bars, final_accuracies):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
             f'{acc:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

ax3.set_title('Final Test Accuracy Comparison', fontsize=14, fontweight='bold')
ax3.set_ylabel('Accuracy')
ax3.set_ylim([0.95, 1.0])
ax3.grid(True, alpha=0.3, axis='y')

# ================== Plot 4: Training Time ==================
ax4 = axes[1, 1]
avg_times = [np.mean(times) for times in epoch_times_list]
bars = ax4.bar(model_names, avg_times, color=colors, edgecolor='black', linewidth=1.5)

for bar, t in zip(bars, avg_times):
    ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{t:.2f}s', ha='center', va='bottom', fontsize=11, fontweight='bold')

ax4.set_title('Average Training Time per Epoch', fontsize=14, fontweight='bold')
ax4.set_ylabel('Time (seconds)')
ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'training_history.png'), dpi=150, bbox_inches='tight')
plt.show()

---

## 🔬 Section 8: Gradient Analysis

### Understanding Vanishing Gradients

**Gradients** are like feedback to the network. Strong gradients = learning well. Weak gradients = not learning!

```mermaid
flowchart LR
    A[Output Error] -->|Backprop| B[Layer 2]
    B -->|Multiply| C[Layer 1]
    C -->|Multiply| D[Input Layer]
    
    style A fill:#ffcdd2
    style B fill:#fff9c4
    style C fill:#c8e6c9
    style D fill:#bbdefb
```

**Problem**: Sigmoid's derivative is at most 0.25. Backpropagating through two sigmoid layers multiplies two such factors: 0.25 × 0.25 = 0.0625. Gradients shrink with every layer!
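
The arithmetic above extends to deeper networks. The sketch below multiplies the best-case activation derivative across several layers. It is a simplification: it tracks only the activation-derivative factor and ignores the weight matrices, which also scale the gradient during backpropagation.

```python
SIGMOID_MAX_GRAD = 0.25  # sigmoid'(z) peaks at z = 0
TANH_MAX_GRAD = 1.0      # tanh'(z) peaks at z = 0
RELU_ACTIVE_GRAD = 1.0   # relu'(z) for any z > 0

for depth in (2, 5, 10):
    print(f"{depth:>2} layers: sigmoid <= {SIGMOID_MAX_GRAD ** depth:.2e}, "
          f"tanh <= {TANH_MAX_GRAD ** depth:.2e}, "
          f"relu (active units) = {RELU_ACTIVE_GRAD ** depth:.2e}")
```

Tanh reaches its best case only at z = 0 and its derivative falls off quickly as |z| grows, which is why it can still suffer from vanishing gradients in practice.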

In [None]:
def compute_gradient_magnitude(model, X_batch, y_batch):
    """
    Compute mean absolute gradient for first layer.
    
    Higher = stronger gradient flow = better learning!
    Lower = vanishing gradients = learning struggles!
    """
    with tf.GradientTape() as tape:
        # Forward pass
        predictions = model(X_batch, training=True)
        
        # Compute loss
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_batch, predictions)
        loss = tf.reduce_mean(loss)
    
    # Get gradients for first layer
    first_layer_weights = model.layers[0].trainable_weights
    gradients = tape.gradient(loss, first_layer_weights)
    
    # Compute mean absolute gradient
    gradient_magnitude = np.mean(np.abs(gradients[0].numpy()))
    
    return gradient_magnitude

# Compute gradients for each model
gradient_mags = []
X_batch = X_train[:32]  # Use 32 samples
y_batch = y_train[:32]

for model, name in zip(models, model_names):
    grad_mag = compute_gradient_magnitude(model, X_batch, y_batch)
    gradient_mags.append(grad_mag)
    print(f"{name}: Gradient Magnitude = {grad_mag:.6f}")

### 🔹 Visualize Gradient Magnitudes

In [None]:
# Plot gradient magnitudes
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(model_names, gradient_mags, color=colors, edgecolor='black', linewidth=1.5)

for bar, mag in zip(bars, gradient_mags):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.0001,
            f'{mag:.6f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.set_title('Gradient Magnitude in First Layer\n(Higher = Better Gradient Flow)', 
             fontsize=14, fontweight='bold')
ax.set_ylabel('Mean Absolute Gradient')
ax.grid(True, alpha=0.3, axis='y')

# Add annotation
ax.annotate('⚠ Lower values = Vanishing Gradients',
            xy=(0, gradient_mags[0]), xytext=(0.5, max(gradient_mags) * 0.7),
            fontsize=10, ha='center',
            arrowprops=dict(arrowstyle='->', color='gray'))

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'gradient_magnitude_comparison.png'), dpi=150)
plt.show()

---

## 📊 Section 9: Results Summary

### Final Comparison Table

In [None]:
# Create summary table
print("\n" + "="*70)
print("RESULTS SUMMARY")
print("="*70)
print(f"{'Model':<20} {'Accuracy':>12} {'Avg Time/Epoch':>18} {'Gradient Mag':>15}")
print("-"*70)

for i, name in enumerate(model_names):
    acc = histories[i].history['val_accuracy'][-1]
    avg_time = np.mean(epoch_times_list[i])
    grad = gradient_mags[i]
    print(f"{name:<20} {acc:>12.4f} {avg_time:>15.2f}s {grad:>15.6f}")

print("="*70)

---

## 💼 Interview Perspective

### Common Interview Questions

**Q1: Why is ReLU preferred over Sigmoid in hidden layers?**
> ReLU is far less prone to vanishing gradients: its derivative is exactly 1 for positive inputs, so gradients pass through active units unscaled. Sigmoid's derivative is at most 0.25, so gradients can shrink exponentially with depth.

**Q2: When would you still use Sigmoid?**
> For the output layer in binary classification (outputs probability 0-1). Also in gates of LSTM/GRU networks where we need values between 0-1.

**Q3: What's the "dying ReLU" problem?**
> If a neuron's pre-activation is negative for every input, ReLU outputs 0 and its gradient is 0, so the neuron's weights stop updating and it "dies". Leaky ReLU mitigates this by keeping a small non-zero slope for negative inputs.

**Q4: Why is Tanh better than Sigmoid?**
> Tanh is zero-centered (outputs -1 to 1), making gradients more balanced. Sigmoid outputs only positive values (0-1), causing zigzag updates.
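
The Leaky ReLU fix mentioned in Q3 can be sketched in a few lines. The slope `alpha=0.01` is a commonly used small value picked here for illustration; it is an assumption of this example, not a setting from this notebook.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for z > 0, small slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

z = -2.0  # a negative pre-activation
print(relu_grad(z))        # 0.0  -> no gradient flows through this unit
print(leaky_relu_grad(z))  # 0.01 -> a small gradient still flows
print(leaky_relu(z))       # -0.02
```

Because the gradient never becomes exactly zero, a neuron stuck in the negative region can still receive updates and recover.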

---

## 🎓 Conclusion

### Key Takeaways

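1. **ReLU is the cheapest to compute** - Just max(0, x), with no exponentials; on a network this small, per-epoch time differences may be slight
2. **ReLU keeps gradients strongest** - Its derivative is 1 for active units, so it is far less prone to vanishing gradients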
3. **All models achieve similar accuracy on MNIST** - Dataset is simple enough
4. **Differences become more pronounced in deeper networks**

### Recommendations

| Use Case | Recommended Activation |
|----------|------------------------|
| Hidden layers (default) | ReLU |
| Binary classification output | Sigmoid |
| Multi-class output | Softmax |
| If ReLU neurons are dying | Leaky ReLU |
| RNN gates | Sigmoid/Tanh |