# ReLU Activation Function - From Scratch

## Question 14: Compare Activation Functions Mathematically and Visually

---

### 🧩 Problem Statement

**What problem is being solved?**
- Sigmoid and tanh saturate for large |z|, so their gradients shrink toward zero (the vanishing gradient problem)
- As a result, deep networks (10+ layers) were hard to train effectively
- ReLU mitigates this: its gradient is exactly 1 for all positive inputs

**Key Formula:** f(z) = max(0, z)

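**Revolutionary Insight:** The activation's gradient never shrinks for positive inputs!

**Trade-off:** Dead neurons. The gradient is 0 when z <= 0, so a neuron whose input stays negative for every training example stops updating.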
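
To make the vanishing-gradient point concrete, here is a minimal sketch that multiplies the activation derivative across layers. It assumes a hypothetical 10-layer chain in which every layer sees the same pre-activation `z = 1.0`, and it ignores the weight matrices entirely, so it shows the trend rather than the behavior of a real network. The names `depth`, `sigmoid_chain`, and `relu_chain` are illustrative only.

```python
import numpy as np

def sigmoid_derivative(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    """Derivative of ReLU: 1 for z > 0, else 0."""
    return np.where(z > 0, 1.0, 0.0)

depth = 10   # assumption: number of stacked layers
z = 1.0      # assumption: same positive pre-activation at every layer

# Backpropagation multiplies one activation derivative per layer
# (weights are ignored in this sketch).
sigmoid_chain = float(sigmoid_derivative(z)) ** depth
relu_chain = float(relu_derivative(z)) ** depth

print(f"sigmoid chain: {sigmoid_chain:.2e}")
print(f"relu chain:    {relu_chain:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, its product over 10 layers is below 1e-6, while the ReLU product stays at 1 as long as every pre-activation is positive.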

---

## Step 1: Import Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt

---

## Step 2: Implement ReLU Function

### 🔹 Line Explanation: `return np.maximum(0, z)`

#### 2.1 What the line does
Returns max(0, z) - zero for negatives, z itself for positives.

#### 2.2 Why it is used
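A cheap, piecewise-linear activation whose derivative is EXACTLY 1 for all z > 0, so the activation itself never shrinks the gradient on the positive side. (The weights can still scale gradients; ReLU only removes the saturation that sigmoid and tanh introduce.)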

#### 2.3 When to use it
Default choice for hidden layers in modern deep networks.

#### 2.4 Where to use it
CNNs (including ResNets), MLPs, and the feed-forward blocks of the original Transformer. Many newer Transformer models use smooth variants such as GELU instead.

#### 2.5-2.7 Usage
```python
relu(-5)  # Returns 0 (blocked)
relu(5)   # Returns 5 (passes through)
```
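
As a sketch of the "hidden layer" use case described above, the following runs a forward pass through a tiny two-layer network with ReLU on the hidden layer only. The data, the layer sizes, and the names `x`, `W1`, `b1`, `W2`, `b2` are all made up for illustration; the weight scaling is one common choice, not a requirement.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied elementwise."""
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# Assumption: 4 samples with 3 features each (random placeholder data)
x = rng.normal(size=(4, 3))

# Assumption: 3 -> 5 -> 1 network with He-style weight scaling
W1 = rng.normal(size=(3, 5)) * np.sqrt(2.0 / 3)
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1)) * np.sqrt(2.0 / 5)
b2 = np.zeros(1)

hidden = relu(x @ W1 + b1)   # ReLU on the hidden layer
output = hidden @ W2 + b2    # linear output layer, no activation

print(hidden.shape, output.shape)
```

Every entry of `hidden` is non-negative, while `output` can take either sign because no activation is applied to the last layer.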

In [None]:
def relu(z):
    """
    ReLU activation function.
    Formula: f(z) = max(0, z)
    Output: 0 for negative, z for positive
    """
    return np.maximum(0, z)

In [None]:
# Test ReLU
print("relu(-5) =", relu(-5))    # Expected: 0
print("relu(0) =", relu(0))      # Expected: 0
print("relu(5) =", relu(5))      # Expected: 5
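
Because `np.maximum` broadcasts, the same `relu` also works elementwise on whole arrays, which is how it is typically applied to a batch of pre-activations. A minimal sketch, with a made-up 2x3 array named `batch`:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z), applied elementwise."""
    return np.maximum(0, z)

# Assumption: a small batch of pre-activations (2 samples, 3 units)
batch = np.array([[-2.0, 0.0, 3.0],
                  [1.5, -0.5, -4.0]])

activated = relu(batch)   # negatives become 0, positives pass through
print(activated)
```

The output keeps the input shape; only the negative entries are replaced by zero.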

---

## Step 3: Implement ReLU Derivative

### 🔹 Formula: f'(z) = 1 if z > 0, else 0

**Key insight:** Gradient is EXACTLY 1 for ALL positive inputs - never decays!

In [None]:
def relu_derivative(z):
    """
    Derivative of ReLU function.
    Formula: f'(z) = 1 if z > 0, else 0
    Key: Gradient is 1 for ALL positive inputs!
    """
    return np.where(z > 0, 1, 0).astype(float)

In [None]:
# Test derivative
print("relu_derivative(-5) =", relu_derivative(-5))  # Expected: 0.0 (dead)
print("relu_derivative(0) =", relu_derivative(0))    # Expected: 0.0 (by convention; ReLU is not differentiable at 0)
print("relu_derivative(5) =", relu_derivative(5))    # Expected: 1.0 (perfect!)
print("relu_derivative(100) =", relu_derivative(100)) # Expected: 1.0 (still 1!)
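
One way to sanity-check the derivative is to compare it against a central finite difference. This is a sketch under stated assumptions: the step size `h` and the test points are arbitrary choices, and z = 0 is deliberately left out because ReLU has a kink there (the central difference would give 0.5, not 0).

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU: 1 for z > 0, else 0."""
    return np.where(z > 0, 1.0, 0.0)

# Assumption: test points chosen away from the kink at z = 0
z = np.array([-5.0, -0.5, 0.5, 5.0, 100.0])
h = 1e-6  # assumption: finite-difference step size

numerical = (relu(z + h) - relu(z - h)) / (2 * h)   # central difference
analytic = relu_derivative(z)

print("numerical:", numerical)
print("analytic: ", analytic)
print("match:", np.allclose(numerical, analytic))
```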

---

## Step 4: Visualization

In [None]:
z_range = np.linspace(-6, 6, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# ReLU function
ax1.plot(z_range, relu(z_range), 'r-', linewidth=2, label='ReLU')
ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Input (z)')
ax1.set_ylabel('Output')
ax1.set_title('ReLU Function: max(0, z)')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(-1, 7)

# Derivative
ax2.plot(z_range, relu_derivative(z_range), 'b-', linewidth=2, label='Derivative')
ax2.axhline(y=1.0, color='green', linestyle=':', alpha=0.7, label='Gradient=1')
ax2.axhline(y=0, color='red', linestyle=':', alpha=0.7, label='Dead zone')
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Input (z)')
ax2.set_ylabel('Gradient')
ax2.set_title('ReLU Derivative: 1 for positive, 0 for negative')
ax2.legend()
ax2.grid(True, alpha=0.3)

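plt.tight_layout()
# Create the output directory first; savefig fails if it does not exist
import os
os.makedirs('outputs', exist_ok=True)
plt.savefig('outputs/relu_combined.png', dpi=150)
plt.show()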

---

## Step 5: Numerical Analysis

In [None]:
test_inputs = np.array([-5, -2, -0.5, 0, 0.5, 2, 5])

print("RELU NUMERICAL ANALYSIS")
print("=" * 50)
print(f"{'Input':<10} {'ReLU':<15} {'Derivative':<15}")
print("-" * 40)

for z in test_inputs:
    print(f"{z:<10.1f} {relu(z):<15.1f} {relu_derivative(z):<15.1f}")

---

## Why ReLU Revolutionized Deep Learning

### Gradient Comparison at z=5

| Function | Gradient at z=5 | Status |
|----------|-----------------|--------|
| Sigmoid | 0.0066 | Vanishing! |
| Tanh | 0.0002 | Vanishing! |
| **ReLU** | **1.0** | **Perfect!** |

ReLU, combined with later advances such as residual connections and batch normalization, made it practical to train 100+ layer networks that were out of reach before!
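
The gradients in the table can be reproduced from the closed-form derivatives: s(z)(1 - s(z)) for sigmoid and 1 - tanh(z)^2 for tanh. The sketch below is one way to compute them; the helper names and the `gradients` dictionary are introduced here for illustration.

```python
import numpy as np

def sigmoid_derivative(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_derivative(z):
    """Derivative of ReLU: 1 for z > 0, else 0."""
    return np.where(z > 0, 1.0, 0.0)

z = 5.0
gradients = {
    "Sigmoid": float(sigmoid_derivative(z)),
    "Tanh": float(tanh_derivative(z)),
    "ReLU": float(relu_derivative(z)),
}

for name, grad in gradients.items():
    print(f"{name:<8} {grad:.4f}")
```

Rounded to four decimals, the printed values match the table: 0.0066, 0.0002, and 1.0000.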

---

## 💼 Interview Key Points

1. **Formula**: max(0, z) - one of the cheapest activations to compute
2. **Gradient = 1** for ALL positive inputs (no decay!)
3. **Dead neurons**: gradient = 0 for negative inputs, so a neuron that is always negative stops learning
4. **Mitigate dead neurons**: Use LeakyReLU (small non-zero slope for z < 0), He initialization, or a lower learning rate
5. **Use case**: Default for hidden layers in deep networks
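
The two dead-neuron fixes from point 4 can be sketched as follows. This is a minimal illustration, not a reference implementation: `alpha=0.01` is a commonly used default rather than a required value, and `he_normal` is a hypothetical helper name for drawing weights with standard deviation sqrt(2 / fan_in).

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """LeakyReLU: z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """Derivative of LeakyReLU: 1 for z > 0, alpha otherwise (never exactly 0)."""
    return np.where(z > 0, 1.0, alpha)

def he_normal(fan_in, fan_out, rng):
    """He initialization (hypothetical helper): N(0, sqrt(2 / fan_in))."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

print("leaky_relu(-5) =", leaky_relu(-5.0))
print("leaky_relu_derivative(-5) =", leaky_relu_derivative(-5.0))

rng = np.random.default_rng(0)
W = he_normal(256, 128, rng)   # assumption: a 256 -> 128 layer
print("W std:", W.std(), "target:", np.sqrt(2.0 / 256))
```

With LeakyReLU the negative side still carries a small gradient, so a neuron pushed into the negative region can recover. He initialization addresses the problem from the other direction by keeping pre-activation variance roughly stable across ReLU layers at the start of training.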