# 🚀 Run in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/melhzy/transformer_from_scratch/blob/main/transformer-foundation/04_feed_forward_networks.ipynb)

**Note:** If you're running this in Google Colab, execute the setup cell below to clone the repository and install dependencies.

In [None]:
# Google Colab Setup (run this cell only if you're in Colab)
import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("🔧 Running in Google Colab - Setting up environment...")
    if not os.path.exists('transformer_from_scratch'):
        print("📥 Cloning repository...")
        !git clone https://github.com/melhzy/transformer_from_scratch.git
        print("✅ Repository cloned!")
    os.chdir('transformer_from_scratch')
    print("📦 Installing dependencies...")
    !pip install -q torch torchvision matplotlib seaborn numpy
    print("✅ Dependencies installed!")
    if '/content/transformer_from_scratch' not in sys.path:
        sys.path.insert(0, '/content/transformer_from_scratch')
    print("✅ Setup complete! Ready to run the tutorial.")
else:
    print("💻 Running locally - no setup needed.")

In [None]:
# Setup
import sys
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent))

from src.modules.feed_forward import PositionwiseFeedForward, GLUFeedForward

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
torch.manual_seed(42)
np.random.seed(42)

print("✅ Imports successful!")

---

## 1. Position-wise Feed-Forward Networks <a id="ffn"></a>

### What is a Position-wise FFN?

After attention aggregates information, we need to **transform** it. The feed-forward network (FFN) applies the same transformation to each position **independently**.

### Mathematical Definition

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Or equivalently:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Where:
- $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ (expand)
- $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$ (contract)
- Typically: $d_{ff} = 4 \times d_{model}$
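To make the formula concrete, here is a minimal sketch in plain Python (no PyTorch) that evaluates $\text{FFN}(x)$ for a single token with toy sizes $d_{model} = 2$ and $d_{ff} = 4$. The weights are made-up illustrative values, and the helper names `linear`, `relu`, and `ffn_forward` are assumptions for this example, not part of the tutorial's `src` package.

```python
def linear(x, W, b):
    """Compute x @ W + b for a row vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, a) for a in v]

def ffn_forward(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    hidden = relu(linear(x, W1, b1))   # expand: d_model -> d_ff
    return linear(hidden, W2, b2)      # contract: d_ff -> d_model

# Toy weights (illustrative values only): d_model = 2, d_ff = 4
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]
b2 = [0.1, 0.2]

x = [1.0, 2.0]
hidden = relu(linear(x, W1, b1))
out = ffn_forward(x, W1, b1, W2, b2)
print(hidden)  # [1.0, 1.0, 0.0, 3.0]  (the negative pre-activation was clipped to 0)
print(out)     # approximately [2.6, -0.3]
```

The output has the same length as the input, which is what lets the result be added back to `x` in a residual connection.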

### Why "Position-wise"?

The same FFN is applied to **each position independently**:

```python
# NOT like this (one transformation that mixes information across positions):
output = FFN(entire_sequence)

# But like this (same weights, each position transformed on its own):
for position in range(len(sequence)):
    output[position] = FFN(sequence[position])
```

In practice, we process all positions in parallel using batch operations.
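The independence claim can be checked directly. The sketch below, in plain Python with made-up toy weights (the function `ffn` and all values are assumptions for illustration), runs the same FFN over a three-position sequence, then changes only position 1 and confirms that the outputs at positions 0 and 2 do not move.

```python
def ffn(token, W1, b1, W2, b2):
    """Apply ReLU(x W1 + b1) W2 + b2 to a single position (a list of floats)."""
    hidden = [max(0.0, sum(t * w for t, w in zip(token, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy shared weights (illustrative values): d_model = 2, d_ff = 3
W1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]
b1 = [0.0, 0.1, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
outputs = [ffn(tok, W1, b1, W2, b2) for tok in sequence]

# Change only position 1 and run the same FFN again
modified = [sequence[0], [9.0, -9.0], sequence[2]]
outputs_modified = [ffn(tok, W1, b1, W2, b2) for tok in modified]

print(outputs[0] == outputs_modified[0])  # True: position 0 is unaffected
print(outputs[2] == outputs_modified[2])  # True: position 2 is unaffected
print(outputs[1] == outputs_modified[1])  # False: only position 1 changed
```

Attention behaves differently: changing one position there generally changes the output at every position that attends to it.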

### The Expand-Contract Pattern

```
Input:  512 dimensions
   ↓ Linear + ReLU
Hidden: 2048 dimensions (4× expansion!)
   ↓ Linear
Output: 512 dimensions
```

**Why expand?**
- More capacity for complex transformations
- Non-linear mixing in higher dimensions
- Think of it as a "bottleneck" layer in reverse
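The cost of the expansion is easy to work out by hand. As a rough sketch (the helper `ffn_param_count` is a hypothetical name, and it assumes both linear layers carry a bias, as in the formula above), the parameter count for a few expansion ratios is:

```python
def ffn_param_count(d_model, d_ff, bias=True):
    """Count parameters of a two-layer FFN: W1, W2 and (optionally) b1, b2."""
    weights = d_model * d_ff + d_ff * d_model
    biases = (d_ff + d_model) if bias else 0
    return weights + biases

d_model = 512
for ratio in (1, 2, 4, 8):
    d_ff = ratio * d_model
    print(f"{ratio}x expansion: d_ff={d_ff:>4}, params={ffn_param_count(d_model, d_ff):,}")
```

With $d_{model} = 512$ and $d_{ff} = 2048$ this gives 2,099,712 parameters, so the parameter count grows linearly with the expansion ratio.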

### Activation Functions

The original Transformer paper used **ReLU**: $\text{ReLU}(x) = \max(0, x)$

Modern variants use:
- **GELU** (Gaussian Error Linear Unit): Smoother, better gradients
- **SwiGLU**: Combines Swish and GLU (Gated Linear Unit)
- **GLU variants**: Gating mechanism for better control
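For a feel of how these activations differ near zero, the following sketch evaluates ReLU, GELU, and Swish at a few points using only the standard library. The GELU here is the exact form $x \cdot \Phi(x)$ written with `math.erf`; the function names are chosen for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    """Swish (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}  swish={swish(x):+.4f}")
```

Unlike ReLU, both GELU and Swish return small negative values for moderately negative inputs, so their gradient there is not exactly zero.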

In [None]:
# Create and test a position-wise FFN
d_model = 512
d_ff = 2048
dropout = 0.1

ffn = PositionwiseFeedForward(d_model=d_model, d_ff=d_ff, dropout=dropout)

# Test input
batch_size = 2
seq_len = 10
x = torch.randn(batch_size, seq_len, d_model)

output = ffn(x)

print("🔧 Position-wise Feed-Forward Network\n")
print(f"Configuration:")
print(f"  - Input dimension (d_model): {d_model}")
print(f"  - Hidden dimension (d_ff): {d_ff} ({d_ff//d_model}× expansion)")
print(f"  - Output dimension: {d_model}")
print(f"  - Parameters: {sum(p.numel() for p in ffn.parameters()):,}")

print(f"\n📐 Shapes:")
print(f"  Input:  {x.shape}")
print(f"  Output: {output.shape}")
print(f"  ✓ Shape preserved (important for residual connections!)")

print(f"\n💡 Key Points:")
print(f"  - Applied independently to each position")
print(f"  - Same parameters shared across all positions")
print(f"  - Adds non-linear transformation capacity")
print(f"  - The 4× expansion allows rich feature mixing")

In [None]:
# Visualize the transformation
# Show input and output distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Input distribution
ax1 = axes[0]
ax1.hist(x.flatten().detach().numpy(), bins=50, alpha=0.7, color='blue')
ax1.set_title('Input Distribution', fontsize=14, weight='bold')
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')
ax1.axvline(0, color='red', linestyle='--', alpha=0.5)

# Get intermediate (after first linear + ReLU)
with torch.no_grad():
    intermediate = F.relu(ffn.linear1(x))

ax2 = axes[1]
ax2.hist(intermediate.flatten().detach().numpy(), bins=50, alpha=0.7, color='green')
ax2.set_title(f'After Expansion & ReLU\n({d_ff} dimensions)', fontsize=14, weight='bold')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')
ax2.axvline(0, color='red', linestyle='--', alpha=0.5)
print(f"\n📊 Note: ReLU sets all negative values to 0")

# Output distribution
ax3 = axes[2]
ax3.hist(output.flatten().detach().numpy(), bins=50, alpha=0.7, color='purple')
ax3.set_title('Output Distribution', fontsize=14, weight='bold')
ax3.set_xlabel('Value')
ax3.set_ylabel('Frequency')
ax3.axvline(0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print(f"\n📈 Distribution Analysis:")
print(f"  Input - Mean: {x.mean():.3f}, Std: {x.std():.3f}")
print(f"  Intermediate - Mean: {intermediate.mean():.3f}, Std: {intermediate.std():.3f}")
print(f"  Output - Mean: {output.mean():.3f}, Std: {output.std():.3f}")

---

## 2. Layer Normalization <a id="layernorm"></a>

### Why Normalization?

Deep networks are hard to train when the distribution of each layer's inputs keeps shifting as the parameters of earlier layers update, an effect often called **internal covariate shift**. This can make training unstable.

**Layer Normalization** mitigates this by normalizing across features:

$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:
- $\mu = \frac{1}{d}\sum_{i=1}^d x_i$ (mean across features)
- $\sigma^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2$ (variance)
- $\gamma, \beta$ are learnable scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
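The definition above can be followed step by step in plain Python. This is a minimal sketch for one feature vector, with $\gamma = 1$ and $\beta = 0$ by default and $\epsilon = 10^{-5}$ (the default used by `nn.LayerNorm`); the function name `layer_norm` is chosen for this example.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector x (list of floats) to zero mean, unit variance."""
    d = len(x)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d   # biased variance, divides by d
    return [g * (xi - mu) / math.sqrt(var + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

x = [10.0, 12.0, 8.0, 14.0]
y = layer_norm(x)
mean_y = sum(y) / len(y)
var_y = sum((yi - mean_y) ** 2 for yi in y) / len(y)
print([round(v, 3) for v in y])
print(f"mean={mean_y:.6f}, var={var_y:.6f}")  # mean close to 0, variance close to 1
```

The variance comes out marginally below 1 because of the $\epsilon$ term in the denominator.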

### Layer Norm vs Batch Norm

**Batch Normalization**: Normalizes across the batch dimension
```
For each feature:
  Compute mean & variance across all samples in batch
```

**Layer Normalization**: Normalizes across the feature dimension
```
For each sample:
  Compute mean & variance across all features
```
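The difference is only the axis over which statistics are computed. A small sketch with a toy 3-sample, 2-feature batch (values are illustrative; `mean_var` is a helper defined for this example):

```python
def mean_var(values):
    """Return (mean, biased variance) of a sequence of floats."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

# Toy activations: 3 samples (rows) x 2 features (columns)
batch = [[1.0, 10.0],
         [2.0, 20.0],
         [3.0, 30.0]]

# Layer norm statistics: one (mean, var) per sample, computed across features
layer_stats = [mean_var(row) for row in batch]

# Batch norm statistics: one (mean, var) per feature, computed across samples
batch_stats = [mean_var(col) for col in zip(*batch)]

print(layer_stats)  # 3 pairs, one per sample
print(batch_stats)  # 2 pairs, one per feature
```

Because each sample's layer norm statistics depend only on that sample, the result is the same whether the batch holds one sequence or a thousand.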

**Why Layer Norm for Transformers?**
1. Works with variable sequence lengths
2. Independent of batch size (important for generation)
3. More stable for NLP tasks
4. No need to track running statistics

### Effect of Layer Normalization

- **Stabilizes training**: Prevents activations from exploding/vanishing
- **Faster convergence**: Can use higher learning rates
- **Better generalization**: Acts as regularization
- **Gradient flow**: Ensures gradients don't get too large/small

In [None]:
# Demonstrate layer normalization
layer_norm = nn.LayerNorm(d_model)

# Create input with varying statistics
x_unnormalized = torch.randn(2, 10, d_model) * 5 + 10  # Mean=10, Std=5

# Apply layer norm
x_normalized = layer_norm(x_unnormalized)

print("📏 Layer Normalization Effect\n")
print("Before normalization:")
print(f"  Shape: {x_unnormalized.shape}")
print(f"  Mean (across features): {x_unnormalized.mean(dim=-1)[0, 0]:.3f}")
print(f"  Std (across features): {x_unnormalized.std(dim=-1)[0, 0]:.3f}")
print(f"  Global range: [{x_unnormalized.min():.3f}, {x_unnormalized.max():.3f}]")

print("\nAfter normalization:")
print(f"  Shape: {x_normalized.shape} (unchanged)")
print(f"  Mean (across features): {x_normalized.mean(dim=-1)[0, 0]:.3f} (≈ 0)")
print(f"  Std (across features): {x_normalized.std(dim=-1)[0, 0]:.3f} (≈ 1)")
print(f"  Global range: [{x_normalized.min():.3f}, {x_normalized.max():.3f}]")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

ax1 = axes[0]
ax1.hist(x_unnormalized[0].flatten().detach().numpy(), bins=50, alpha=0.7, color='red')
ax1.set_title('Before Layer Normalization\n(Varying mean & std)', fontsize=14, weight='bold')
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')
ax1.axvline(x_unnormalized[0].mean(), color='black', linestyle='--', label='Mean')
ax1.legend()

ax2 = axes[1]
ax2.hist(x_normalized[0].flatten().detach().numpy(), bins=50, alpha=0.7, color='green')
ax2.set_title('After Layer Normalization\n(Mean≈0, Std≈1)', fontsize=14, weight='bold')
ax2.set_xlabel('Value')
ax2.set_ylabel('Frequency')
# .item() converts the 0-d tensor (which tracks gradients) to a plain float for matplotlib
ax2.axvline(x_normalized[0].mean().item(), color='black', linestyle='--', label='Mean≈0')
ax2.legend()

plt.tight_layout()
plt.show()

print("\n✅ Layer normalization ensures stable, normalized activations!")

---

## 3. Residual Connections <a id="residual"></a>

### The Residual (Skip Connection) Concept

Instead of:
$$\text{output} = F(x)$$

We do:
$$\text{output} = F(x) + x$$

Where $F(x)$ is the sub-layer (attention or FFN).

### Why Residual Connections?

**1. Gradient Flow**
```python
# Without residual:
∂Loss/∂x = ∂Loss/∂F(x) · ∂F(x)/∂x  # Can vanish!

# With residual:
∂Loss/∂x = ∂Loss/∂(F(x)+x) · (∂F(x)/∂x + 1)  # Always has gradient from '+1'!
```

**2. Identity Mapping**
- Model can learn to ignore a layer by setting $F(x) \approx 0$
- Makes it easier to train very deep networks
- The network can always "fall back" to the identity

**3. Ensemble Behavior**
- Residual networks behave like ensembles of shallower networks
- Each path from input to output contributes
- More robust and better generalization
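The gradient-flow argument can be illustrated with a deliberately simplified model: a stack of scalar "layers" $F_i(x) = w_i x$, so each local derivative is just $w_i$. This is a toy sketch, not a real network, and `input_gradient` is a name made up for the example.

```python
def input_gradient(layer_slopes, residual):
    """Gradient d(output)/d(input) through a stack of scalar layers F_i(x) = w_i * x."""
    grad = 1.0
    for w in layer_slopes:
        # Chain rule: a plain layer contributes w, a residual layer contributes (w + 1)
        grad *= (w + 1.0) if residual else w
    return grad

slopes = [0.1] * 20   # 20 layers, each with a small local derivative

print(f"Without residual: {input_gradient(slopes, residual=False):.3e}")
print(f"With residual:    {input_gradient(slopes, residual=True):.3e}")
```

Without the skip path the product of 20 small factors collapses to roughly $10^{-20}$; with it, every factor is at least 1 and the gradient survives. The same product can also grow when the local derivatives are large, which is one reason residual connections are paired with normalization.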

### In Transformers

Every sub-layer (attention, FFN) has a residual connection:

```python
# After attention:
x = x + attention(x)

# After FFN:
x = x + ffn(x)
```

This allows information to flow directly through the network!

In [None]:
# Demonstrate residual connections
class SubLayerWithResidual(nn.Module):
    """Demonstrates a sub-layer with residual connection"""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout=0.0)
    
    def forward(self, x, use_residual=True):
        ffn_output = self.ffn(x)
        if use_residual:
            return x + ffn_output  # Residual connection
        else:
            return ffn_output  # No residual

# Test
sublayer = SubLayerWithResidual(d_model=512, d_ff=2048)
x = torch.randn(1, 10, 512)

output_with_residual = sublayer(x, use_residual=True)
output_without_residual = sublayer(x, use_residual=False)

print("🔗 Residual Connection Effect\n")
print(f"Input norm: {x.norm():.3f}")
print(f"\nWithout residual:")
print(f"  Output norm: {output_without_residual.norm():.3f}")
print(f"  Difference from input: {(output_without_residual - x).norm():.3f}")

print(f"\nWith residual:")
print(f"  Output norm: {output_with_residual.norm():.3f}")
print(f"  Difference from input: {(output_with_residual - x).norm():.3f}")

print(f"\n💡 Notice:")
print(f"  - With residual, output preserves input information")
print(f"  - The FFN learns to add a 'delta' to the input")
print(f"  - Gradients can flow directly through the '+' operation")

# Check gradient flow through the residual connection
x_input = torch.randn(1, 1, 512, requires_grad=True)
output = sublayer(x_input, use_residual=True)
loss = output.sum()
loss.backward()

print(f"\n🎯 Gradient Flow:")
print(f"  Input gradient norm: {x_input.grad.norm():.3f}")
print(f"  ✓ Gradients flow back easily thanks to residual!")

---

## 4. The Complete Sub-Layer Pattern <a id="pattern"></a>

### The Transformer's Universal Pattern

Every sub-layer in a Transformer follows this pattern:

```python
# Pre-LN (Pre-Normalization) - Modern approach:
output = x + SubLayer(LayerNorm(x))

# Post-LN (Post-Normalization) - Original paper:
output = LayerNorm(x + SubLayer(x))
```

Where `SubLayer` can be:
- Multi-head attention
- Feed-forward network

### Pre-LN vs Post-LN

**Post-LN (Original):**
```
x → SubLayer → Add (residual) → LayerNorm → output
```

**Pre-LN (Modern):**
```
x → LayerNorm → SubLayer → Add (residual) → output
```

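**Why Pre-LN is often preferred:**
- More stable training, especially early on
- Makes deeper models easier to train
- Often needs little or no learning rate warmup
- The residual path stays a clean identity path; only the sub-layer input is normalized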
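The two orderings differ in what ends up on the skip path. The sketch below applies both to one feature vector in plain Python, using a made-up element-wise `toy_sublayer` as a stand-in for attention or an FFN and a layer norm without learnable parameters; all names are assumptions for this illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer norm without learnable scale/shift (gamma = 1, beta = 0)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def toy_sublayer(x):
    """Stand-in for attention or an FFN: an arbitrary element-wise function."""
    return [0.5 * xi + 1.0 for xi in x]

def pre_ln(x, sublayer):
    # output = x + SubLayer(LayerNorm(x)); the skip path carries x unchanged
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

def post_ln(x, sublayer):
    # output = LayerNorm(x + SubLayer(x)); the sum is normalized afterwards
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def mean(v):
    return sum(v) / len(v)

x = [10.0, 12.0, 8.0, 14.0]
pre, post = pre_ln(x, toy_sublayer), post_ln(x, toy_sublayer)

print(f"Pre-LN  output mean: {mean(pre):.3f}")   # keeps the input's offset
print(f"Post-LN output mean: {mean(post):.3f}")  # re-centred around 0
```

In Pre-LN the input passes through the addition untouched, so a stack of such layers always has an unnormalized identity route from input to output. In Post-LN every layer's output is renormalized, including the contribution that came in through the skip path.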

### Complete Encoder Layer

```python
# 1. Self-Attention sub-layer
x = x + MultiHeadAttention(LayerNorm(x))

# 2. Feed-Forward sub-layer  
x = x + FeedForward(LayerNorm(x))
```

Two sub-layers, each with:
- Layer normalization
- The actual computation
- Residual connection
- Dropout (not shown)
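Since dropout is the one component not shown above, here is a rough sketch of where it sits. It uses "inverted" dropout (survivors are rescaled by $1/(1-p)$ during training, which is also what `nn.Dropout` does) and is written in plain Python with a seeded `random.Random`; `dropout` and `add_residual` are hypothetical helper names, and the sketch assumes $0 \le p < 1$.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)          # identity at inference time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def add_residual(x, branch_output, p, training, rng):
    """x + Dropout(branch_output), where branch_output = SubLayer(LayerNorm(x))."""
    dropped = dropout(branch_output, p, training, rng)
    return [xi + di for xi, di in zip(x, dropped)]

rng = random.Random(0)               # seeded so the sketch is repeatable
x = [1.0, 2.0, 3.0, 4.0]
branch_output = [0.5, 0.5, 0.5, 0.5]

print(add_residual(x, branch_output, p=0.1, training=False, rng=rng))  # [1.5, 2.5, 3.5, 4.5]
print(add_residual(x, branch_output, p=0.5, training=True, rng=rng))   # each entry is x_i or x_i + 1.0
```

Dropout only touches the sub-layer branch. The skip path `x` is never dropped, so the identity route through the layer stays intact during training.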

In [None]:
# Implement the complete pattern
class SubLayerConnection(nn.Module):
    """
    Complete sub-layer pattern:
    LayerNorm → SubLayer → Dropout → Residual
    """
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, sublayer):
        """Apply: x + dropout(sublayer(norm(x)))"""
        return x + self.dropout(sublayer(self.norm(x)))

# Test with FFN as sublayer
d_model = 512
connection = SubLayerConnection(d_model, dropout=0.1)
ffn = PositionwiseFeedForward(d_model, 2048, dropout=0.1)

x = torch.randn(2, 10, d_model)
output = connection(x, ffn)

print("🎯 Complete Sub-Layer Pattern\n")
print("Flow: Input → LayerNorm → SubLayer → Dropout → Add(Residual) → Output")
print(f"\nShapes:")
print(f"  Input:  {x.shape}")
print(f"  Output: {output.shape}")

print(f"\n📊 Statistics:")
print(f"  Input - Mean: {x.mean():.3f}, Std: {x.std():.3f}")
print(f"  Output - Mean: {output.mean():.3f}, Std: {output.std():.3f}")

print(f"\n✅ This pattern is used for EVERY sub-layer in Transformers!")
print(f"   - Makes architecture clean and modular")
print(f"   - Ensures stable training")
print(f"   - Enables very deep networks (100+ layers possible)")

In [None]:
# Visualize the complete pattern
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

fig, ax = plt.subplots(figsize=(10, 8))
ax.axis('off')
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)

# Draw the flow
components = [
    (5, 9, "Input x", 'lightblue'),
    (5, 7.5, "LayerNorm(x)", 'lightyellow'),
    (5, 6, "SubLayer\n(Attention or FFN)", 'lightgreen'),
    (5, 4.5, "Dropout", 'lightcoral'),
    (5, 3, "Add Residual\n(+ x)", 'plum'),
    (5, 1.5, "Output", 'lightblue'),
]

# Use cx, cy so the loop does not overwrite the tensor x defined in earlier cells
for cx, cy, text, color in components:
    box = FancyBboxPatch((cx-1.5, cy-0.3), 3, 0.6, boxstyle="round,pad=0.1",
                         edgecolor='black', facecolor=color, linewidth=2)
    ax.add_patch(box)
    ax.text(cx, cy, text, ha='center', va='center', fontsize=11, weight='bold')

# Draw arrows
for i in range(len(components)-1):
    ax.annotate('', xy=(5, components[i+1][1]+0.3), 
               xytext=(5, components[i][1]-0.3),
               arrowprops=dict(arrowstyle='->', lw=2, color='black'))

# Draw residual connection (skip connection)
ax.annotate('', xy=(7.5, 3), xytext=(7.5, 9),
           arrowprops=dict(arrowstyle='->', lw=3, color='red', 
                         connectionstyle="arc3,rad=.3"))
ax.text(8.5, 6, 'Residual\nConnection', fontsize=10, color='red', 
       weight='bold', ha='center')

plt.title('Complete Sub-Layer Pattern in Transformers', 
         fontsize=16, weight='bold', pad=20)
plt.tight_layout()
plt.show()

print("📋 This pattern appears in:")
print("  - Encoder self-attention")
print("  - Encoder feed-forward")
print("  - Decoder masked self-attention")
print("  - Decoder cross-attention")
print("  - Decoder feed-forward")
print("\n  → 5 times per encoder-decoder layer pair!")

---

## 5. DeepSeek Insights <a id="deepseek"></a>

### 🔬 DeepSeek-R1 Perspective on FFN & Normalization

#### 1. **FFN as Memory Storage**

> "While attention routes information, the feed-forward layers **store** knowledge. They're like the 'facts' memory of the network."

**Research findings:**
- FFN weights encode factual knowledge
- Different neurons activate for different concepts
- Editing FFN weights can change model's knowledge
- The 4× expansion creates rich representational space

#### 2. **The Expand-Contract is Key**

> "The expansion to 4× dimensions isn't arbitrary - it's the sweet spot between capacity and efficiency."

**Why 4×?**
- Less than 4×: Less capacity; the hidden layer can become a bottleneck
- More than 4×: Diminishing returns for the extra compute
- 4× is a widely used default that works well across many tasks

#### 3. **Layer Norm Enables Scale**

> "Without layer normalization, training transformers beyond 12 layers is nearly impossible. With it, we scale to 100+ layers."

**Why it works:**
- Prevents activation explosion/vanishing
- Makes optimization landscape smoother
- Reduces sensitivity to initialization
- Enables higher learning rates

#### 4. **Pre-LN vs Post-LN**

> "The shift from post-LN to pre-LN was crucial for scaling. Pre-LN is strictly superior for deep models."

**DeepSeek-R1 uses Pre-LN** because:
- Trains 3-5× faster
- More stable with high learning rates
- Scales better to 90+ layers
- No warmup needed

#### 5. **Residual Connections are Essential**

> "Without residuals, deep transformers simply don't train. The gradient flow is crucial for learning multi-hop reasoning."

**Reasoning connection:**
- Each layer adds a "reasoning step"
- Residuals allow building on previous steps
- Gradients flow back through all steps
- Enables learning complex multi-step logic

#### 6. **GLU Variants for Better Performance**

> "Modern transformers use gated variants (SwiGLU, GeGLU) instead of plain ReLU. The gating mechanism provides better control."

**SwiGLU formula:**
$$\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)$$

Where $\odot$ is element-wise multiplication (gating).
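As a rough illustration of the formula, the sketch below computes $\text{Swish}(xW_1) \odot (xW_2)$ for one token in plain Python. The weights are made-up toy values and the helper names are assumptions for this example; in a full gated FFN this result would typically be followed by a further linear projection back to $d_{model}$, which is omitted here.

```python
import math

def swish(v):
    """Swish: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(x, W):
    """Row vector x times matrix W (list of rows), no bias."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def swiglu(x, W1, W2):
    """SwiGLU(x) = Swish(x W1) * (x W2), multiplied element-wise."""
    gate = [swish(g) for g in matvec(x, W1)]   # gating branch
    value = matvec(x, W2)                      # linear branch
    return [g * v for g, v in zip(gate, value)]

# Toy weights (illustrative values): d_model = 2, hidden size = 3
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
W2 = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]

x = [1.0, 1.0]
out = swiglu(x, W1, W2)
print([round(v, 4) for v in out])
```

The third hidden unit shows the gating effect: its gate pre-activation is 0, so Swish returns 0 and the value 3.0 on the linear branch is blocked entirely.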

In [None]:
# Compare different activation functions
x = torch.linspace(-3, 3, 1000)

# Different activations
relu = F.relu(x)
gelu = F.gelu(x)
swish = x * torch.sigmoid(x)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(x, relu, label='ReLU (Original Transformer)', linewidth=2)
plt.plot(x, gelu, label='GELU (BERT, GPT-2)', linewidth=2)
plt.plot(x, swish, label='Swish (Modern)', linewidth=2)
plt.axhline(0, color='gray', linestyle='--', alpha=0.3)
plt.axvline(0, color='gray', linestyle='--', alpha=0.3)
plt.xlabel('Input', fontsize=12)
plt.ylabel('Output', fontsize=12)
plt.title('Activation Functions in Feed-Forward Networks', fontsize=14, weight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("📊 Activation Function Comparison:")
print("\nReLU: max(0, x)")
print("  ✓ Simple, fast")
print("  ✗ Hard cutoff at 0, dead neurons")

print("\nGELU: x·Φ(x) where Φ is Gaussian CDF")
print("  ✓ Smooth, probabilistic")
print("  ✓ Better gradients")

print("\nSwish: x·σ(x)")
print("  ✓ Smooth, self-gated")
print("  ✓ Works well in practice")

print("\n🎯 DeepSeek-R1 uses variants of these in gated form (SwiGLU)")

---

## 6. Practical Implementation <a id="implementation"></a>

Let's see how to use our modules:

In [None]:
# Test GLU variant
glu_ffn = GLUFeedForward(d_model=512, d_ff=2048, dropout=0.1)

x = torch.randn(2, 10, 512)
output_glu = glu_ffn(x)

print("🔧 GLU Feed-Forward Network\n")
print("Formula: GLU(x) = (xW1 + b1) ⊙ σ(xW2 + b2)")
print("  where ⊙ is element-wise multiplication (gating)\n")

print(f"Configuration:")
print(f"  Input dimension: 512")
print(f"  Hidden dimension: 2048")
print(f"  Gating mechanism: Sigmoid")
print(f"  Parameters: {sum(p.numel() for p in glu_ffn.parameters()):,}")

print(f"\nOutput:")
print(f"  Shape: {output_glu.shape}")
print(f"  ✓ GLU variants often perform better than plain FFN!")

---

## 🎯 Summary & Key Takeaways

### What We Learned

1. **Position-wise Feed-Forward**
   - Two linear layers with activation
   - 4× expansion (512 → 2048 → 512)
   - Applied independently to each position
   - Stores factual knowledge

2. **Layer Normalization**
   - Normalizes across features (not batch)
   - Mean ≈ 0, Std ≈ 1
   - Critical for training stability
   - Enables deep networks

3. **Residual Connections**
   - output = input + sublayer(input)
   - Ensures gradient flow
   - Enables identity mapping
   - Essential for depth

4. **Complete Pattern**
   - Pre-LN: x + SubLayer(LayerNorm(x))
   - Used for every sub-layer
   - Clean, modular architecture

5. **DeepSeek Insights**
   - FFN stores knowledge, attention routes it
   - 4× expansion is a strong empirical default
   - Pre-LN superior to Post-LN
   - Gated variants (GLU) improve performance

### The Formula

Complete sub-layer with all components:

$$\text{output} = x + \text{Dropout}(\text{SubLayer}(\text{LayerNorm}(x)))$$

Where SubLayer can be:
- Multi-head attention
- Feed-forward network

### Next Steps

In **Tutorial 5: Encoder & Decoder Architecture**, we'll see:
- How to stack these components into encoder layers
- Decoder layers with cross-attention
- Complete encoder and decoder stacks
- How information flows through the full model

These building blocks are now ready to be assembled! 🚀