<a href="https://www.kaggle.com/code/mrafraim/dl-day-35-cnn-weight-initialization?scriptVersionId=296345802" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 35: CNN Weight Initialization
*Xavier vs He · Dead ReLU · Empirical Behavior*

Welcome to Day 35!

Today is not about *choosing an initializer*.

It’s about understanding:
- Why some CNNs never learn
- Why loss curves look “flat but not broken”
- Why optimizers get blamed unfairly

By the end of today:

✔ You’ll diagnose learning failure in minutes  
✔ You’ll pick initialization without guessing  
✔ You’ll recognize Dead ReLU instantly  

If you found this notebook helpful, your **<b style="color:skyblue;">UPVOTE</b>** would be greatly appreciated! It helps others discover the work and supports continuous improvement.

---

# Why Initialization Matters

Training deep CNNs critically depends on **stable gradient flow**.

Weight initialization directly controls:
- **Activation scale** during the forward pass
- **Gradient scale** during backpropagation

### What goes wrong with poor initialization
- **Weights too small** → activations shrink layer by layer → gradients vanish  
- **Weights too large** → activations blow up → gradients explode  

Neither case usually causes runtime errors.

> Bad initialization doesn’t crash the model.  
> It silently prevents effective learning.

# What Failure Looks Like in Practice

### Common symptoms
- **Training loss decreases extremely slowly or plateaus early**  
- **Accuracy remains near random guessing**  
- **Changing the optimizer has little to no effect**  
- **Lowering the learning rate worsens training**  

Interpretation
> The network cannot propagate a usable learning signal through its layers.

Poor initialization causes activations or gradients to shrink or explode across layers, so learning signals never reach earlier layers. Optimization fails **silently**, even though the training loop appears normal.


# The Core Problem: Variance Propagation

Variance propagation describes how the statistical spread (variance) of activations or gradients changes as signals pass through successive layers of a neural network.

Formally, for a layer:
$$
z^{(l)} = \sum_{i=1}^{n} w_i^{(l)} x_i^{(l)}
$$

the output variance depends on the input variance:
$$
\boxed{\text{Var}\!\left(z^{(l)}\right)
= n \cdot \text{Var}\!\left(w^{(l)}\right)
\cdot \text{Var}\!\left(x^{(l)}\right)}
$$

Consider a single neuron in a neural network:

$$
z = \sum_{i=1}^{n} w_i x_i
$$

- $z$ → pre-activation output of the neuron (input to the non-linearity)  
- $n$ → number of input connections (fan-in)  
- $w_i$ → weight associated with the $i$-th input  
- $x_i$ → $i$-th input activation  

This is a weighted sum of inputs before applying an activation function.


### Variance of the neuron output

Assuming:
- inputs $x_i$ are independent and identically distributed  
- weights $w_i$ are independent of inputs  
- both have zero mean  

the variance of $z$ becomes:

$$
\text{Var}(z) = n \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

- $\text{Var}(z)$ → variance of the neuron’s output  
- $n$ → number of summed terms (fan-in)  
- $\text{Var}(w)$ → variance of the weight distribution  
- $\text{Var}(x)$ → variance of the input activations  

### Why this matters in deep networks

As signals propagate through layers:

- If $\text{Var}(z)$ **increases layer by layer** → activations and gradients explode  
- If $\text{Var}(z)$ **decreases layer by layer** → activations and gradients vanish  

Both cases make learning ineffective.
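
The two cases above can be reproduced with a small simulation. The sketch below is an illustration under simplifying assumptions, not a training experiment: it uses only the Python standard library, square linear layers with no activation, and a hypothetical helper `final_variance` whose `gain` argument stands for $n \cdot \text{Var}(w)$.

```python
import math
import random
import statistics

def final_variance(gain, n=128, depth=10, seed=0):
    """Push one random vector through `depth` random linear layers.

    Every layer has n inputs and n outputs, with zero-mean Gaussian
    weights of variance gain / n, so that n * Var(w) = gain.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w_std = math.sqrt(gain / n)
    for _ in range(depth):
        # Each output unit is a fresh weighted sum over all current inputs.
        x = [sum(rng.gauss(0.0, w_std) * xi for xi in x) for _ in range(n)]
    return statistics.pvariance(x)

results = {gain: final_variance(gain) for gain in (0.5, 1.0, 2.0)}
for gain, var in results.items():
    print(f"n*Var(w) = {gain}: variance after 10 layers = {var:.4g}")
```

With a gain below 1 the variance collapses towards zero, with a gain above 1 it grows by orders of magnitude, and only a gain of 1 keeps it near its starting value.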

### Key Insight
> Proper initialization chooses $\text{Var}(w)$ such that  
> $\text{Var}(z)$ remains **approximately constant across layers**.

This is the mathematical foundation behind **Xavier and He initialization**.

---

### <p style="text-align:center; color:orange; font-size:18px;"> Where Does  Var(z)  Come From?</p>

Start with the neuron equation:

$$
z = \sum_{i=1}^{n} w_i x_i
$$

This is a sum of random variables.


**Step 1: Variance of a sum**

A basic probability rule:

If random variables are independent,

$$
\text{Var}\!\left(\sum_{i=1}^{n} y_i\right)
= \sum_{i=1}^{n} \text{Var}(y_i)
$$

So we apply this with each term of the sum defined as:

$$
y_i = w_i x_i
$$

Then:

$$
\text{Var}(z) = \sum_{i=1}^{n} \text{Var}(w_i x_i)
$$


**Step 2: Variance of a product**

Another key rule (under independence and zero mean):

$$
\text{Var}(w_i x_i)
= \text{Var}(w_i)\,\text{Var}(x_i)
$$

Why this holds:
- $w_i$ and $x_i$ are independent  
- $\mathbb{E}[w_i] = 0$, $\mathbb{E}[x_i] = 0$  

So each term contributes:

$$
\text{Var}(w_i x_i) = \text{Var}(w)\,\text{Var}(x)
$$


**Step 3: Sum all contributions**

Since every term has the same variance:

$$
\text{Var}(z)
= \sum_{i=1}^{n} \text{Var}(w)\,\text{Var}(x)
$$

$$
\boxed{
\text{Var}(z) = n \cdot \text{Var}(w) \cdot \text{Var}(x)
}
$$
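
The boxed result can be checked numerically. The following is a minimal Monte Carlo sketch, assuming zero-mean Gaussian weights and inputs; the function name and the specific values of `n`, `w_std`, and `x_std` are chosen only for illustration.

```python
import random
import statistics

def simulate_preactivation_variance(n, w_std, x_std, trials=5000, seed=0):
    """Estimate Var(z) for z = sum_i w_i * x_i with zero-mean Gaussian w and x."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        z = sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, x_std) for _ in range(n))
        samples.append(z)
    return statistics.pvariance(samples)

n, w_std, x_std = 50, 0.2, 1.5
predicted = n * w_std**2 * x_std**2   # n * Var(w) * Var(x)
measured = simulate_preactivation_variance(n, w_std, x_std)
print(f"predicted Var(z) = {predicted:.3f}, measured Var(z) = {measured:.3f}")
```

The measured value should land close to the prediction, with a small gap that comes from the finite number of trials.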


### What this means

- Each input contributes **a little variance**
- Adding $n$ such contributions multiplies variance by $n$

> More connections = more variance unless weights are scaled down


### Why this breaks deep networks

Across layers:

$$
\text{Var}(x^{(l+1)}) = n^{(l)} \text{Var}(w^{(l)}) \text{Var}(x^{(l)})
$$

Repeat this 50 times → explosion or collapse.
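
To put numbers on that compounding, here is a short arithmetic sketch. It assumes every layer has the same per-layer factor $n \cdot \text{Var}(w)$ (called `gain` here, a name introduced for this example) and ignores activation functions.

```python
def variance_after(depth, gain, var0=1.0):
    """Variance after `depth` layers when each layer multiplies it by `gain`.

    `gain` stands for n * Var(w), assumed identical in every layer.
    """
    return var0 * gain ** depth

for gain in (0.8, 1.0, 1.2):
    print(f"gain {gain}: variance after 50 layers = {variance_after(50, gain):.3e}")
```

A per-layer factor only 20% away from 1 already moves the variance by four to five orders of magnitude over 50 layers, which is why the scale of $\text{Var}(w)$ has to be chosen so carefully.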

---


# Xavier (Glorot) Initialization


## Core idea

Xavier initialization is a weight initialization strategy designed to keep the **variance of activations and gradients approximately constant across layers** in deep neural networks.

Its objective is to prevent:
- **Vanishing signals** (variance shrinking with depth)
- **Exploding signals** (variance growing with depth)

during **both the forward and backward pass**.

For a neuron:
$$
z = \sum_{i=1}^{n_{in}} w_i x_i
$$

Assuming:
- inputs $x_i$ are i.i.d. with zero mean and variance $\text{Var}(x)$  
- weights $w_i$ are i.i.d. with zero mean and variance $\text{Var}(w)$  
- weights and inputs are independent  

the output variance becomes:
$$
\text{Var}(z) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

To keep signal magnitude stable across layers, we want:
$$
\text{Var}(z) \approx \text{Var}(x)
$$

This leads to:
$$
n_{in} \cdot \text{Var}(w) \approx 1
$$


### Why $n_{out}$ also appears

Backpropagation imposes a **similar constraint** on gradient variance, which depends on $n_{out}$ (fan-out).

To balance **both forward and backward variance**, Xavier initialization chooses:

$$
\boxed{
\text{Var}(w) = \frac{2}{n_{in} + n_{out}}
}
$$

where:
- $n_{in}$ → number of input connections (fan-in)  
- $n_{out}$ → number of output connections (fan-out)  
- $\text{Var}(w)$ → variance of the weight distribution  

This choice ensures:
- activations neither explode nor vanish in the forward pass  
- gradients remain well-scaled in the backward pass  

### Practical forms

Xavier initialization doesn’t set all weights to the same number.  
Instead, **each weight is chosen randomly** from a carefully controlled distribution to preserve variance.

#### 1. Uniform Xavier

$$
w \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\;
\sqrt{\frac{6}{n_{in}+n_{out}}}\right)
$$

- Each weight is drawn **independently at random** from the interval  
  $$[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]$$
- All values in this range are **equally likely**
- Example:  
  If $n_{in}=128$, $n_{out}=64$, then $\sqrt{6/192} \approx 0.176$  
  → $w \in [-0.176, 0.176]$ randomly


#### 2. Normal (Gaussian) Xavier

$$
w \sim \mathcal{N}\left(0, \frac{2}{n_{in}+n_{out}}\right)
$$

- Each weight is drawn independently from a **bell-shaped curve**  
  centered at 0 with variance $2/(n_{in}+n_{out})$
- Most weights are near 0, rare large values
- Example:  
  With the same $n_{in}$ and $n_{out}$, $\text{Var}(w) \approx 0.0104$, $\sigma \approx 0.102$  
  → most weights lie roughly in $[-0.3, 0.3]$
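
Both practical forms can be sampled by hand to confirm they hit the same target variance. This is a plain-Python sketch of the two formulas rather than PyTorch's implementation; the helper names are hypothetical and the layer sizes reuse the example values $n_{in}=128$, $n_{out}=64$.

```python
import math
import random
import statistics

def xavier_uniform_bound(n_in, n_out):
    """Half-width r of the uniform interval [-r, r]."""
    return math.sqrt(6.0 / (n_in + n_out))

def xavier_normal_std(n_in, n_out):
    """Standard deviation of the Gaussian variant."""
    return math.sqrt(2.0 / (n_in + n_out))

n_in, n_out = 128, 64
rng = random.Random(0)
r = xavier_uniform_bound(n_in, n_out)
sigma = xavier_normal_std(n_in, n_out)

# One weight per connection: n_in * n_out samples for each variant.
uniform_w = [rng.uniform(-r, r) for _ in range(n_in * n_out)]
normal_w = [rng.gauss(0.0, sigma) for _ in range(n_in * n_out)]

print(f"uniform bound r  = {r:.4f}, normal std = {sigma:.4f}")
print(f"target variance  = {2.0 / (n_in + n_out):.5f}")
print(f"uniform variance = {statistics.pvariance(uniform_w):.5f}")
print(f"normal variance  = {statistics.pvariance(normal_w):.5f}")
```

The two sample variances should both sit near $2/(n_{in}+n_{out})$, even though the shapes of the distributions differ.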

#### Why randomness matters

- Prevents neurons from being identical (symmetry problem)  
- Breaks correlation while controlling scale for stable signal propagation  

#### Why these ranges are chosen

- Ensures forward and backward variance is approximately constant:  
$$
\text{Var}(z^{(l)}) \approx \text{Var}(z^{(l-1)})
$$


---

### <p style="text-align:center; color:orange; font-size:18px;">Understanding the Math Behind Xavier Initialization</p>

Xavier is designed for symmetric activations (tanh, sigmoid), assuming no variance is lost after the activation.

### 1. Xavier Uniform

Rule:
$$
w \sim \mathcal{U}\Big(-\sqrt{\frac{6}{n_{in}+n_{out}}},\; \sqrt{\frac{6}{n_{in}+n_{out}}}\Big)
$$

Step-by-step:

1. Variance of uniform distribution $U[a,b]$:


$$
\text{Var}(U[a,b]) = \frac{(b-a)^2}{12}
$$

2. Plug in bounds: $a=-r, b=r$:

$$
\text{Var}(U[-r,r]) = \frac{(r - (-r))^2}{12} = \frac{4 r^2}{12} = \frac{r^2}{3}
$$

3. Set this equal to desired Xavier variance:
$$
\frac{r^2}{3} = \frac{2}{n_{in}+n_{out}} \quad \Rightarrow \quad \boxed{r = \sqrt{\frac{6}{n_{in}+n_{out}}}}
$$

> Each weight is sampled **uniformly in [-r, r]**, producing the correct variance.


### 2. Xavier Normal

Rule:
$$
w \sim \mathcal{N}\left(0,\; \frac{2}{n_{in}+n_{out}}\right)
$$

Step-by-step:

1. **Neuron output**:
$$
z = \sum_{i=1}^{n_{in}} w_i x_i
$$

2. **Variance of output**:
$$
\text{Var}(z) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

3. **Forward + backward balance**:  

We want both forward and backward variance to remain stable.  
- Forward: $n_{in} \cdot \text{Var}(w) \approx 1$  
- Backward: $n_{out} \cdot \text{Var}(w) \approx 1$

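4. **Compromise formula**:

Both constraints can hold exactly only when $n_{in} = n_{out}$. As a compromise, we replace the fan in each constraint with the average of the two fans, $\frac{n_{in}+n_{out}}{2}$:

$$
\text{Var}(w) = \left(\frac{n_{in}+n_{out}}{2}\right)^{-1} = \frac{2}{n_{in}+n_{out}}
$$

Equivalently, this is the harmonic mean of the two ideal variances $\frac{1}{n_{in}}$ and $\frac{1}{n_{out}}$.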

$$
\boxed{\text{Var}(w) = \frac{2}{n_{in}+n_{out}}}
$$
- That’s why 2 appears in the numerator  
- It balances forward and backward variance in one formula

> Each weight is sampled from a Gaussian with mean 0 and variance $\frac{2}{n_{in}+n_{out}}$
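
The compromise can be made concrete by computing how far each constraint is from 1 for a given layer. In this sketch, "forward gain" means $n_{in} \cdot \text{Var}(w)$ and "backward gain" means $n_{out} \cdot \text{Var}(w)$; both terms and the function names are introduced here for illustration.

```python
def xavier_variance(n_in, n_out):
    return 2.0 / (n_in + n_out)

def gains(n_in, n_out):
    """Return (forward gain, backward gain) under the Xavier variance."""
    var_w = xavier_variance(n_in, n_out)
    return n_in * var_w, n_out * var_w

for n_in, n_out in [(128, 128), (128, 64), (64, 256)]:
    fwd, bwd = gains(n_in, n_out)
    print(f"n_in={n_in:4d} n_out={n_out:4d} forward gain={fwd:.3f} backward gain={bwd:.3f}")
```

When the fans are equal both gains are exactly 1. Otherwise one gain is above 1 and the other below, and the two always sum to 2, which is the sense in which the formula splits the difference.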


**Key Intuition**

- Xavier Normal → Gaussian  
- Xavier Uniform → Uniform  
- Both achieve the **same target variance**  
- Suitable for **symmetric activations** (tanh, sigmoid)  
- Not ideal for ReLU (use He instead)

---

## When Xavier Works Well

Xavier initialization is most effective with **symmetric activation functions**:

Suitable activations:
- `sigmoid`  
- `tanh`  

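Why it works:
- `tanh` is zero-centered and close to linear near zero, so the variance propagation assumptions hold approximately  
- `sigmoid` is not zero-centered (its outputs lie in $(0, 1)$), so the assumptions hold only loosely, although Xavier is still a common choice for it  
- Forward and backward signals remain reasonably stable across layers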


## Important Caveat

CNNs almost always use **ReLU activations**, not `tanh`.

Problem:
- ReLU sets all negative activations to zero  
- This halves the effective variance of the signal  
- Xavier initialization assumes symmetric activations, so it underestimates the needed variance

Consequence:
> Learning is slow, unstable, or may fail to converge in deep ReLU networks

For ReLU-based CNNs, use **He initialization** instead of Xavier


## Xavier Initialization in PyTorch

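```python
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)  # example layer; the shape is illustrative

# Uniform Xavier
nn.init.xavier_uniform_(conv.weight)

# Normal (Gaussian) Xavier (use one variant or the other, not both)
nn.init.xavier_normal_(conv.weight)
```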

Notes:

* Each weight is randomly initialized using the Xavier formulas
* Works well for symmetric activations like `tanh` or `sigmoid`
* Keeps variance of activations and gradients roughly constant across layers

# He (Kaiming) Initialization


## Core Idea

**He initialization** (also called Kaiming initialization) is a weight initialization method designed specifically for ReLU-based neural networks. Weights are scaled so that activation variance stays approximately constant across layers, even though ReLU zeroes out roughly half of its inputs.

Formally, it sets the variance of weights as:
$$
\text{Var}(w) = \frac{2}{n_{in}}
$$

**Why He Initialization Exists**

ReLU behaves as:
$$
\text{ReLU}(x) = \max(0, x)
$$

This causes:
- ~50% of activations → exactly zero
- Output variance → reduced by ~½ after each ReLU

Without correction:
- Activations shrink layer by layer
- Gradients weaken
- Deep CNNs fail to train effectively

He initialization explicitly compensates for this variance loss.


For a neuron:
$$
z = \sum_{i=1}^{n_{in}} w_i x_i
$$

If:
- Inputs $x_i$ have variance 1
- Weights are zero-mean and independent

Then:
$$
\text{Var}(z) = n_{in} \cdot \text{Var}(w)
$$

After ReLU:
$$
\text{Var}(\text{ReLU}(z)) \approx \frac{1}{2}\text{Var}(z)
$$

To keep variance ≈ 1:
$$
\frac{1}{2} \cdot n_{in} \cdot \text{Var}(w) = 1
\Rightarrow \boxed{\text{Var}(w) = \frac{2}{n_{in}}}
$$
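
The factor of ½ can be checked empirically. One detail worth knowing: for zero-mean Gaussian $z$, the quantity that ReLU halves exactly is the mean square $\mathbb{E}[\text{ReLU}(z)^2]$, and that is what the next layer's pre-activation variance depends on when weights are zero-mean. The centered variance of the ReLU output is smaller still, because the output has a non-zero mean. The sketch below assumes unit-variance Gaussian pre-activations.

```python
import random

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(100_000)]   # pre-activations, Var(z) = 1
a = [max(0.0, zi) for zi in z]                      # ReLU outputs

zero_fraction = sum(1 for ai in a if ai == 0.0) / len(a)
mean = sum(a) / len(a)
mean_square = sum(ai * ai for ai in a) / len(a)
centered_var = sum((ai - mean) ** 2 for ai in a) / len(a)

print(f"fraction of zeros     = {zero_fraction:.3f}")   # close to 0.5
print(f"mean square E[a^2]    = {mean_square:.3f}")     # close to 0.5
print(f"centered variance     = {centered_var:.3f}")    # close to 0.34
```

The correction factor of 2 in $\text{Var}(w) = 2/n_{in}$ matches the halved mean square, so the derivation's conclusion is unaffected by this distinction.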


Practical Meaning of $n_{in}$:

- **Fully Connected layer**: number of input features  
- **Convolution layer**:

$$
n_{in} = k_h \times k_w \times c_{in}
$$
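
As a quick worked example, the fan-in and the resulting He standard deviation $\sqrt{2/n_{in}}$ can be computed for a few layer shapes. The helper names and the layer shapes below are illustrative choices, not taken from a specific model.

```python
import math

def fan_in_linear(in_features):
    return in_features

def fan_in_conv2d(c_in, k_h, k_w):
    return c_in * k_h * k_w

def he_std(fan_in):
    return math.sqrt(2.0 / fan_in)

layers = {
    "linear, 512 inputs": fan_in_linear(512),
    "conv 3x3, 3 input channels": fan_in_conv2d(3, 3, 3),
    "conv 3x3, 64 input channels": fan_in_conv2d(64, 3, 3),
}
for name, fan_in in layers.items():
    print(f"{name:30s} fan_in={fan_in:4d} He std={he_std(fan_in):.4f}")
```

Note how the first convolution of an RGB network (fan-in 27) gets much larger weights than a later 64-channel convolution (fan-in 576).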



## Does He Initialization Use Random or Fixed Values?
 
**Random values, not fixed constants.**  
He initialization defines how weights are randomly sampled, not a single value.

### What He Initialization Actually Does

He initialization controls the distribution of weights:

- Mean = 0  
- Variance = $\dfrac{2}{n_{in}}$

Each weight is random, but its scale is carefully chosen so that signal does not vanish after ReLU.


### Common He Initialization Variants

#### 1. He Normal (Most Common)
Weights are sampled from a normal distribution:
$$
w \sim \mathcal{N}\left(0,\; \frac{2}{n_{in}}\right)
$$

Interpretation:
- Bell-shaped distribution
- Most values are small
- Occasional larger values allowed


#### 2. He Uniform
Weights are sampled from a uniform distribution:
$$
w \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}}},\; \sqrt{\frac{6}{n_{in}}}\right)
$$

Interpretation:
- Every value in the range is equally likely
- Strict upper and lower bounds
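
Both variants can be sampled directly to see that they share one variance. This is a standard-library sketch of the two formulas, not PyTorch's implementation; the function names and the fan-in of a 3×3 kernel over 64 channels are assumptions made for the example.

```python
import math
import random

def sample_he_normal(fan_in, count, rng):
    std = math.sqrt(2.0 / fan_in)
    return [rng.gauss(0.0, std) for _ in range(count)]

def sample_he_uniform(fan_in, count, rng):
    bound = math.sqrt(6.0 / fan_in)
    return [rng.uniform(-bound, bound) for _ in range(count)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

fan_in = 3 * 3 * 64   # 3x3 kernel over 64 input channels
rng = random.Random(0)
normal_w = sample_he_normal(fan_in, 20_000, rng)
uniform_w = sample_he_uniform(fan_in, 20_000, rng)

print(f"target variance 2/fan_in = {2.0 / fan_in:.6f}")
print(f"He normal variance       = {variance(normal_w):.6f}")
print(f"He uniform variance      = {variance(uniform_w):.6f}")
print(f"largest |w|: normal = {max(map(abs, normal_w)):.3f}, uniform = {max(map(abs, uniform_w)):.3f}")
```

The last line shows the practical difference: the uniform variant never exceeds its bound, while the normal variant occasionally produces larger weights.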

### Critical Clarification

| Question | Answer |
|----------|--------|
| Are weights fixed? | No |
| Are all weights identical? | No |
| Are weights random each run? | Yes |
| Is randomness controlled? | Yes |

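If all weights were set to the same fixed value:
- Neurons in a layer would compute identical outputs and receive identical gradients
- They would keep learning identical features, so the layer would behave like a single neuron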

Randomness is required.  
He initialization makes it mathematically stable.


> **He initialization = random weights scaled correctly so ReLU doesn’t destroy gradient flow.**


---

### <p style="text-align:center; color:orange; font-size:18px;">Understanding the Math Behind He Initialization</p>

#### 1. He Normal

Rule:
$$
w \sim \mathcal{N}\left(0,\; \frac{2}{n_{in}}\right)
$$

Step-by-step:

1. **Neuron output**:
$$
z = \sum_{i=1}^{n_{in}} w_i x_i
$$

2. **Variance of output** (before ReLU):
$$
\text{Var}(z) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

3. **Effect of ReLU**:
- ReLU zeros out ~50% of inputs  
- Variance after ReLU:
$$
\text{Var}(\text{ReLU}(z)) \approx \frac{1}{2} \text{Var}(z)
$$

4. **To keep signal variance stable**, want:
$$
\text{Var}(\text{ReLU}(z)) \approx \text{Var}(x)
$$

5. **Solve for Var(w)**:
$$
\frac{1}{2} \cdot n_{in} \cdot \text{Var}(w) = 1
\quad \Rightarrow \quad
\boxed{\text{Var}(w) = \frac{2}{n_{in}}}
$$

> Each weight is then sampled from a Gaussian with mean 0 and variance $2/n_{in}$.


#### 2. He Uniform

Rule:
$$
w \sim \mathcal{U}\Big(-\sqrt{\frac{6}{n_{in}}},\; \sqrt{\frac{6}{n_{in}}}\Big)
$$

Step-by-step:

1. Variance of uniform distribution $U[a,b]$:
$$
\text{Var}(U[a,b]) = \frac{(b-a)^2}{12}
$$

2. Plug in bounds: $a=-r, b=r$  
$$
\text{Var}(w) = \frac{(r - (-r))^2}{12} = \frac{(2r)^2}{12} = \frac{4 r^2}{12} = \frac{r^2}{3}
$$

3. Set this equal to desired He variance $2/n_{in}$:
$$
\frac{r^2}{3} = \frac{2}{n_{in}} \quad \Rightarrow \quad r^2 = \frac{6}{n_{in}} \quad \Rightarrow \quad \boxed{r = \sqrt{\frac{6}{n_{in}}}}
$$

> Each weight is then sampled uniformly in $[-r, r]$, which produces the same target variance as He Normal.


**Key Intuition**

- Both Normal and Uniform produce random weights with variance $2/n_{in}$  
- Randomness breaks symmetry  
- Variance scaling compensates for ReLU killing half the signal

---

## Why He Initialization Is Optimal for CNNs

CNN Characteristics
- Very deep stacks of convolutional layers  
- ReLU activations in almost every layer  
- High fan-in (many input connections per neuron/filter)

Why He Works
- **Preserves forward signal** → prevents activations from shrinking  
- **Preserves backward gradients** → avoids vanishing/exploding gradients  
- **Enables learning immediately** → no slow “warm-up” as with Xavier + ReLU

> Bottom line: For deep, ReLU-based CNNs, **He initialization is almost always the default choice**.
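
The difference between the two schemes under ReLU shows up even in a toy forward pass. The sketch below is a simplified stand-in for a real CNN: it uses fully connected square layers of width 128, standard-library Python only, and a hypothetical helper `relu_stack_mean_square`. It tracks the mean squared activation after ten ReLU layers.

```python
import math
import random

def relu_stack_mean_square(var_w, n=128, depth=10, seed=0):
    """Mean squared activation after `depth` square ReLU layers of width n."""
    rng = random.Random(seed)
    w_std = math.sqrt(var_w)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(depth):
        # Weighted sum followed by ReLU for each of the n output units.
        x = [max(0.0, sum(rng.gauss(0.0, w_std) * xi for xi in x)) for _ in range(n)]
    return sum(xi * xi for xi in x) / n

n = 128
xavier = relu_stack_mean_square(2.0 / (n + n), n=n)   # Xavier with n_in = n_out = n
he = relu_stack_mean_square(2.0 / n, n=n)             # He with fan_in = n

print(f"Xavier + ReLU: mean square after 10 layers = {xavier:.5f}")
print(f"He + ReLU:     mean square after 10 layers = {he:.5f}")
```

Under these assumptions the Xavier signal loses roughly a factor of 2 per layer, while the He signal stays on the order of its starting scale.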


## He Initialization in PyTorch

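```python
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)  # example layer; the shape is illustrative

nn.init.kaiming_normal_(
    conv.weight,
    mode="fan_in",
    nonlinearity="relu"
)
```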

or

```python
nn.init.kaiming_uniform_(
    conv.weight,
    nonlinearity="relu"
)
```

**Default choice for CNNs**


# CNN-Specific Detail: `fan_in` in Conv Layers

For a `Conv2d` layer (`nn.Conv2d` in PyTorch):

$$
\text{fan\_in} = C_{in} \times K_h \times K_w
$$

Where:  
- $C_{in}$ → number of input channels  
- $K_h, K_w$ → kernel height and width

Notes:

- PyTorch automatically computes `fan_in` when you use `nn.init.kaiming_*` or `nn.init.xavier_*`  
- If you manually initialize weights, you **must respect fan_in**, otherwise:
  - Forward activations explode or vanish  
  - Backward gradients explode or vanish  
  - Variance assumptions in Xavier/He formulas break  

> Correct fan_in calculation is **critical for deep CNN stability**


# Why fan_in Matters in Convolutional Layers

### What fan_in represents
- fan_in = number of inputs contributing to a single neuron/output unit
- For `Conv2d`:
$$
\text{fan\_in} = C_{in} \times K_h \times K_w
$$
- Each output pixel is computed as:
$$
z = \sum_{c=1}^{C_{in}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} w_{c,i,j} \cdot x_{c,i,j}
$$
- So `fan_in` counts all weight-input products summed for one output pixel


### Why it affects weight initialization
Variance of neuron output:
$$
\text{Var}(z) = \text{fan\_in} \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

- Larger fan_in → more summed terms → output variance **increases**  
- Smaller fan_in → output variance **decreases**  

**Xavier/He formulas assume you know fan_in** to set $\text{Var}(w)$ correctly:
- Too small variance → activations shrink → slow learning  
- Too large variance → activations explode → unstable gradients
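
A short calculation shows what happens when the weight scale ignores fan_in. The sketch assumes unit-variance inputs and compares a fixed standard deviation of 0.05 (an arbitrary value picked for illustration) against He scaling for three kernel sizes.

```python
def preactivation_variance(fan_in, var_w, var_x=1.0):
    """Var(z) = fan_in * Var(w) * Var(x)."""
    return fan_in * var_w * var_x

fixed_var_w = 0.05 ** 2   # same std = 0.05 for every layer, ignoring fan_in

for c_in, k in [(64, 1), (64, 3), (64, 7)]:
    fan_in = c_in * k * k
    fixed = preactivation_variance(fan_in, fixed_var_w)
    he = preactivation_variance(fan_in, 2.0 / fan_in)
    print(f"{k}x{k} kernel, fan_in={fan_in:4d}: fixed-std Var(z)={fixed:.2f}, He Var(z)={he:.2f}")
```

With a fixed scale the pre-activation variance ranges from well below 1 to well above it depending on the kernel size, while He scaling holds it at 2 before the ReLU in every case.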


### PyTorch convenience
- Functions like `nn.init.kaiming_normal_` automatically compute fan_in for Conv layers  
- If you manually compute weights, you must use the correct fan_in in the formula:

$$
\text{Var}(w) = \frac{2}{\text{fan\_in}} \quad \text{(He for ReLU)}
$$

# Why `mode="fan_in"` in `nn.init.kaiming_normal_` Even Though PyTorch Calculates `fan_in`

### What `mode` does
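- `mode` tells PyTorch which fan to use when scaling the variance of the weights. The `nn.init.kaiming_*` functions accept two values:
  - `"fan_in"` → scales weights based on **number of input connections**  
  - `"fan_out"` → scales weights based on **number of output connections**  

- Averaging `fan_in` and `fan_out` is what Xavier initialization does; it is not a `mode` option of the `kaiming_*` functions.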

- Scaling formula (He initialization):

$$
\text{Var}(w) = \frac{2}{\text{fan\_in}} \quad \text{(for ReLU)}
$$

### Automatic calculation vs scaling decision
- PyTorch computes both `fan_in` and `fan_out` automatically for Conv/Linear layers  
- But it needs your guidance on which one to use for scaling, via `mode`  
- `"fan_in"` is the default and preserves the variance of activations in the forward pass, because each output sums over `fan_in` inputs  

> If you choose `"fan_out"` instead:
> - Gradient variance is preserved in the backward pass  
> - Forward activation variance is scaled by roughly `fan_in / fan_out` per layer, which matters when the two differ a lot
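
To see what the choice of `mode` changes numerically, the sketch below computes both fans for a convolution weight of shape `(c_out, c_in, k_h, k_w)` and the He standard deviation under each mode. The helpers are hypothetical and written in plain Python; they follow the fan convention described in this section rather than calling PyTorch itself. The per-layer factors assume a ReLU after each layer.

```python
import math

def conv2d_fans(c_out, c_in, k_h, k_w):
    """fan_in and fan_out for a weight of shape (c_out, c_in, k_h, k_w)."""
    receptive_field = k_h * k_w
    return c_in * receptive_field, c_out * receptive_field

def he_std(fan):
    return math.sqrt(2.0 / fan)

fan_in, fan_out = conv2d_fans(c_out=128, c_in=64, k_h=3, k_w=3)

for mode, fan in (("fan_in", fan_in), ("fan_out", fan_out)):
    var_w = 2.0 / fan
    forward = fan_in * var_w / 2.0    # per-layer factor on the forward signal
    backward = fan_out * var_w / 2.0  # per-layer factor on the backward signal
    print(f"mode={mode:7s} std={he_std(fan):.4f} forward factor={forward:.2f} backward factor={backward:.2f}")
```

Each mode makes one of the two factors exactly 1 and leaves the other at the ratio of the fans, so the choice is about which direction of signal flow to protect.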

# Dead ReLU: Silent Model Death


To be continued...

---

<p style="text-align:center; color:skyblue; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
