# Dropout Neural Networks – Notes

This section summarizes the **dropout model** and how it is trained.

---

## 1. Standard Feedforward Network

Consider a neural network with $L$ hidden layers:

- $l \in \{1, \dots, L\}$ indexes the hidden layers.  
- $z^{(l)}$ = vector of inputs to layer $l$  
- $y^{(l)}$ = vector of outputs from layer $l$ (with $y^{(0)} = x$, the input)  
- $W^{(l)}, b^{(l)}$ = weights and biases of layer $l$; $W_i^{(l)}$ is the $i$-th row of $W^{(l)}$ (the incoming weights of unit $i$)

The standard feedforward operation for hidden unit $i$ is:

$$\begin{aligned}
z_i^{(l+1)} &= W_i^{(l+1)} y^{(l)} + b_i^{(l+1)} \\
y_i^{(l+1)} &= f(z_i^{(l+1)})
\end{aligned}$$

where $f$ is an activation function, e.g., sigmoid:

$$f(x) = \frac{1}{1 + e^{-x}}$$
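
The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random initialization, and the names `sigmoid` and `forward` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    """Run the standard feedforward pass and return [y^(0), ..., y^(L)]."""
    activations = [x]  # y^(0) = x
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # z^(l+1) = W^(l+1) y^(l) + b^(l+1)
        activations.append(sigmoid(z))   # y^(l+1) = f(z^(l+1))
    return activations

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 2 hidden units.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

ys = forward(rng.normal(size=3), weights, biases)
print([y.shape for y in ys])
```

Each row of a weight matrix holds the incoming weights of one unit, so `W @ y` computes every $z_i^{(l+1)}$ of a layer at once.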

---

## 2. Feedforward with Dropout

Dropout modifies the forward pass by **randomly dropping units**:

$$r_j^{(l)} \sim \text{Bernoulli}(p)$$

$$\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}$$

$$z_i^{(l+1)} = W_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$$

$$y_i^{(l+1)} = f(z_i^{(l+1)})$$

- $\odot$ = element-wise product  
- $r^{(l)}$ = vector of independent Bernoulli random variables with probability $p$ of being 1  
- $\tilde{y}^{(l)}$ = **thinned outputs** used as input to the next layer  
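- At **test time**, no units are dropped; the weights are scaled instead: $W^{(l)}_{\text{test}} = p W^{(l)}$, so that each unit's test-time output matches its expected output under training-time dropout  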
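
To see why the test-time scaling uses the factor $p$, the sketch below enumerates every dropout mask for a tiny layer and compares the exact expected pre-activation with the one obtained from scaled weights. The sizes, the value of `p`, and the random values are assumptions for illustration. The match is exact only for the pre-activation $z$; after the nonlinearity $f$ it is an approximation.

```python
import itertools
import numpy as np

p = 0.8                      # retain probability (hypothetical value)
rng = np.random.default_rng(0)
y = rng.normal(size=4)       # y^(l): outputs of a 4-unit layer
W = rng.normal(size=(3, 4))  # W^(l+1): weights into a 3-unit layer
b = rng.normal(size=3)

# Exact expectation of z^(l+1) over all 2^4 dropout masks r^(l).
expected_z = np.zeros(3)
for mask in itertools.product([0, 1], repeat=y.size):
    r = np.array(mask)
    prob = np.prod(np.where(r == 1, p, 1 - p))  # probability of this mask
    expected_z += prob * (W @ (r * y) + b)

# Test-time pre-activation: no dropout, weights scaled by p.
test_z = (p * W) @ y + b

print(np.allclose(expected_z, test_z))
```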

> Intuition: Dropout samples a **sub-network** from the full network. Training over many random sub-networks approximates averaging predictions over an exponential number of models.

---

## 3. Training Dropout Nets

### 3.1 Backpropagation

- Use **stochastic gradient descent (SGD)**, as for standard networks.  
- For each training case in a mini-batch:
  1. Sample a **thinned network** by dropping units.  
  2. Perform the **forward and backward pass** on this thinned network only.  
- Gradients for each parameter are **averaged over the training cases in the mini-batch**.  
  - A training case whose thinned network does not use a parameter contributes a gradient of 0 for that parameter.  
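
The procedure above can be sketched for a single linear layer with dropout applied to its inputs. The squared-error loss, the layer sizes, and the forced dropping of unit 0 are assumptions chosen to make the zero-gradient case visible; a real network would backpropagate through every layer in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                 # retain probability (hypothetical value)
X = rng.normal(size=(4, 3))             # mini-batch: 4 training cases, 3 input units
T = rng.normal(size=(4, 2))             # targets for 2 output units
W = rng.normal(size=(2, 3))             # weights of the layer being trained

# One mask per training case: each row defines a different thinned network.
R = rng.binomial(1, p, size=X.shape)
R[:, 0] = 0                             # illustration only: drop unit 0 in every case
X_thin = R * X

# Forward and backward pass on the thinned networks (linear outputs, squared error).
Z = X_thin @ W.T
dZ = Z - T                              # dLoss/dZ for loss 0.5 * ||Z - T||^2
grad_W = dZ.T @ X_thin / X.shape[0]     # gradient averaged over the mini-batch

# Column 0 holds the weights leaving the always-dropped unit: its gradient is zero.
print(grad_W[:, 0])
```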

**Additional techniques that help:**

- Momentum  
- Annealed learning rates  
- L2 weight decay  

---

### 3.2 Max-norm Regularization

- Constrain incoming weight vector of each hidden unit:

$$\| w \|_2 \leq c$$

- If $w$ goes out of this ball → **project back** onto the ball of radius $c$.  
- Helps prevent weights from **blowing up**, especially with **high learning rates**.  
- Often used **together with dropout**, large decaying learning rates, and high momentum for better performance.  
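
The projection step can be sketched as follows, assuming each row of `W` holds the incoming weight vector of one hidden unit. The function name `max_norm_project` and the example values are made up for illustration. In practice the projection would be applied after each SGD update.

```python
import numpy as np

def max_norm_project(W, c):
    """Project each row of W (one unit's incoming weights) onto the L2 ball of radius c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows with norm <= c are left unchanged; longer rows are rescaled to norm c.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5: outside the ball, rescaled to norm 2
              [0.6, 0.8]])   # norm 1: inside the ball, left unchanged
W_proj = max_norm_project(W, c=2.0)
print(np.linalg.norm(W_proj, axis=1))
```

Rescaling keeps the direction of each weight vector and changes only its length, which is what projecting onto an L2 ball means.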

> Noise from dropout allows the optimizer to explore **different regions of weight space**, improving generalization.

---

### 3.3 Unsupervised Pretraining

- Networks can be pretrained using:
  - RBMs (Restricted Boltzmann Machines)  
  - Autoencoders  
  - Deep Boltzmann Machines  

- Pretraining uses **unlabeled data** to initialize weights.  
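- **Dropout finetuning** procedure:
  1. Scale pretrained weights up by $1/p$, so that each unit's expected output under dropout equals its output during pretraining  
  2. Use smaller learning rates than when finetuning from random initialization; otherwise the noise from dropout can wipe out the pretrained weights  
- Together, these steps ensure **pretrained information is retained**, while still benefiting from dropout’s regularization.  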
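
The $1/p$ factor can be checked with a one-line expectation. Here $W_{\text{pre}}$ (a symbol introduced only for this check) stands for the pretrained weights, and the mask $r$ has independent $\text{Bernoulli}(p)$ entries as in Section 2, so $\mathbb{E}[r \odot y] = p\,y$:

$$\mathbb{E}_{r}\left[\tfrac{1}{p}\, W_{\text{pre}} \left(r \odot y\right)\right] = \tfrac{1}{p}\, W_{\text{pre}} \left(p\, y\right) = W_{\text{pre}}\, y$$

The expected input to the next layer under dropout therefore matches what that layer received during pretraining.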

---

### ✅ Key Takeaways

1. Dropout approximates **model averaging** over an exponential number of thinned sub-networks that share weights.  
2. Training requires **randomly masking units** per training case and averaging gradients across the mini-batch.  
3. Max-norm regularization + dropout + large decaying learning rates + high momentum → **better generalization than dropout alone**.  
4. Dropout works well even when combined with **pretraining** on unlabeled data.  

---

## 4. Practical Implementation Example

```python
import numpy as np
import matplotlib.pyplot as plt

def apply_dropout(layer_output, dropout_prob, training=True):
    """
    Apply dropout to a layer's output.

    Args:
        layer_output: numpy array of shape (batch_size, num_units)
        dropout_prob: probability of *keeping* each unit (p in the notes above);
            despite the name, this is not the probability of dropping a unit
        training: whether we're in training mode

    Returns:
        Thinned output if training, output scaled by p if testing
    """
    if training:
        # Generate Bernoulli random variables: r^(l) ~ Bernoulli(p)
        mask = np.random.binomial(1, dropout_prob, size=layer_output.shape)
        # Element-wise product: ỹ^(l) = r^(l) ⊙ y^(l)
        return layer_output * mask
    else:
        # At test time, keep every unit and scale by the retain probability p.
        # Scaling the outputs by p gives the next layer the same input as
        # scaling that layer's weights by p (W_test = p * W).
        return layer_output * dropout_prob

# Demonstrate dropout effect
np.random.seed(42)
layer_size = 10
batch_size = 5
dropout_prob = 0.5

# Simulate layer outputs
y = np.random.randn(batch_size, layer_size)
print("Original layer outputs:")
print(y.round(2))

print(f"\nWith dropout (p={dropout_prob}, training=True):")
y_dropout = apply_dropout(y, dropout_prob, training=True)
print(y_dropout.round(2))
print(f"Units kept: {np.sum(y_dropout != 0)}/{y_dropout.size}")

print(f"\nAt test time (p={dropout_prob}, training=False):")
y_test = apply_dropout(y, dropout_prob, training=False)
print(y_test.round(2))

# Visualize dropout effect
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Original
im1 = axes[0].imshow(y, cmap='RdBu', aspect='auto')
axes[0].set_title('Original Layer Output')
axes[0].set_xlabel('Units')
axes[0].set_ylabel('Batch Examples')
plt.colorbar(im1, ax=axes[0])

# Training (with dropout)
im2 = axes[1].imshow(y_dropout, cmap='RdBu', aspect='auto')
axes[1].set_title(f'Training: With Dropout (p={dropout_prob})')
axes[1].set_xlabel('Units')
axes[1].set_ylabel('Batch Examples')
plt.colorbar(im2, ax=axes[1])

# Test time (scaled)
im3 = axes[2].imshow(y_test, cmap='RdBu', aspect='auto')
axes[2].set_title(f'Test: Scaled by p={dropout_prob}')
axes[2].set_xlabel('Units')
axes[2].set_ylabel('Batch Examples')
plt.colorbar(im3, ax=axes[2])

plt.tight_layout()
plt.show()

# Mathematical relationship demonstration
print("\n" + "="*50)
print("MATHEMATICAL RELATIONSHIPS")
print("="*50)

# Show the mathematical formulation
print("1. Dropout mask generation:")
print("   r_j^(l) ~ Bernoulli(p)")
print(f"   Example mask: {(y_dropout != 0).astype(int)[0]}")

print("\n2. Thinned outputs (training):")
print("   ỹ^(l) = r^(l) ⊙ y^(l)")
print(f"   Original: {y[0].round(2)}")
print(f"   Thinned:  {y_dropout[0].round(2)}")

print("\n3. Test time scaling:")
print("   W_test^(l) = p * W^(l)")
print(f"   Scaled output: {y_test[0].round(2)}")
```
