<a href="https://www.kaggle.com/code/mrafraim/dl-day-35-cnn-weight-initialization?scriptVersionId=295980165" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 35: CNN Weight Initialization
*Xavier vs He · Dead ReLU · Empirical Behavior*

Welcome to Day 35!

Today is not about *choosing an initializer*.

It’s about understanding:
- Why some CNNs never learn
- Why loss curves look “flat but not broken”
- Why optimizers get blamed unfairly

By the end of today:

✔ You’ll diagnose learning failure in minutes  
✔ You’ll pick initialization without guessing  
✔ You’ll recognize Dead ReLU instantly  

If you found this notebook helpful, your **<b style="color:skyblue;">UPVOTE</b>** would be greatly appreciated! It helps others discover the work and supports continuous improvement.

---

# Why Initialization Matters

Training deep CNNs critically depends on **stable gradient flow**.

Weight initialization directly controls:
- **Activation scale** during the forward pass
- **Gradient scale** during backpropagation

### What goes wrong with poor initialization
- **Weights too small** → activations shrink layer by layer → gradients vanish  
- **Weights too large** → activations blow up → gradients explode  

Neither case usually causes runtime errors.

> Bad initialization doesn’t crash the model.  
> It silently prevents effective learning.
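
To see this silent failure in numbers, here is a minimal sketch. It assumes a stack of purely linear layers (no activation) so that only the weight scale matters; the width, depth, and the three weight scales are illustrative choices, and `final_activation_std` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)

def final_activation_std(weight_std, depth=10, width=256, batch=512):
    """Pass random inputs through `depth` linear layers and return the output std."""
    x = torch.randn(batch, width)
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std
        x = x @ w
    return x.std().item()

scales = {
    "too small": 0.01,
    "balanced": 1 / 256 ** 0.5,   # n * Var(w) = 1 for width 256
    "too large": 1.0,
}
results = {name: final_activation_std(std) for name, std in scales.items()}
for name, value in results.items():
    print(f"{name:>9}: final activation std = {value:.3e}")
```

None of the three runs raises an error, yet only the balanced scale keeps the output on the same order of magnitude as the input.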

# What Failure Looks Like in Practice

### Common symptoms
- **Training loss decreases extremely slowly or plateaus early**  
- **Accuracy remains near random guessing**  
- **Changing the optimizer has little to no effect**  
- **Lowering the learning rate worsens training**  

### Interpretation
> The model is mathematically incapable of propagating signal.

Poor initialization causes activations or gradients to shrink or explode across layers, so learning signals never reach earlier layers. Optimization fails **silently**, even though the training loop appears normal.
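
One quick way to check whether the learning signal reaches the early layers is to print per-layer gradient norms after a single backward pass. The sketch below is illustrative only: `grad_norms` is a hypothetical helper, and the toy CNN and random batch are assumptions made to exercise it.

```python
import torch
import torch.nn as nn

def grad_norms(model):
    """Return {parameter name: gradient L2 norm} after a backward pass."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Hypothetical toy CNN and random batch, used only to exercise the helper.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10),
)
x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

norms = grad_norms(model)
for name, value in norms.items():
    print(f"{name:<10} grad norm = {value:.3e}")
```

If the norms of the first layers are orders of magnitude smaller (or larger) than those of the last layer, initialization is a likely suspect.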


# The Core Problem: Variance Propagation

Variance propagation describes how the statistical spread (variance) of activations or gradients changes as signals pass through successive layers of a neural network.

Formally, for a layer:
$$
z^{(l)} = \sum_{i=1}^{n} w_i^{(l)} x_i^{(l)}
$$

the output variance depends on the input variance:
$$
\boxed{\text{Var}\!\left(z^{(l)}\right)
= n \cdot \text{Var}\!\left(w^{(l)}\right)
\cdot \text{Var}\!\left(x^{(l)}\right)}
$$

Consider a single neuron in a neural network:

$$
z = \sum_{i=1}^{n} w_i x_i
$$

- $z$ → pre-activation output of the neuron (input to the non-linearity)  
- $n$ → number of input connections (fan-in)  
- $w_i$ → weight associated with the $i$-th input  
- $x_i$ → $i$-th input activation  

This is a weighted sum of inputs before applying an activation function.


### Variance of the neuron output

Assuming:
- inputs $x_i$ are independent and identically distributed  
- weights $w_i$ are independent of inputs  
- both have zero mean  

the variance of $z$ becomes:

$$
\text{Var}(z) = n \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

- $\text{Var}(z)$ → variance of the neuron’s output  
- $n$ → number of summed terms (fan-in)  
- $\text{Var}(w)$ → variance of the weight distribution  
- $\text{Var}(x)$ → variance of the input activations  
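
The formula can be checked empirically. This is a small Monte Carlo sketch under the same assumptions (independent, zero-mean weights and inputs); the values of `n`, `var_w`, and `var_x` are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)

n = 256            # fan-in
var_w = 0.01       # chosen weight variance (illustrative)
var_x = 4.0        # chosen input variance (illustrative)
samples = 20_000   # number of independent neurons simulated

w = torch.randn(samples, n) * var_w ** 0.5
x = torch.randn(samples, n) * var_x ** 0.5
z = (w * x).sum(dim=1)  # one weighted sum per simulated neuron

predicted = n * var_w * var_x
measured = z.var().item()
print(f"predicted Var(z) = {predicted:.3f}, measured Var(z) = {measured:.3f}")
```

The measured value should land close to the prediction, with a small sampling error.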

### Why this matters in deep networks

As signals propagate through layers:

- If $\text{Var}(z)$ **increases layer by layer** → activations and gradients explode  
- If $\text{Var}(z)$ **decreases layer by layer** → activations and gradients vanish  

Both cases make learning ineffective.

### Key Insight
> Proper initialization chooses $\text{Var}(w)$ such that  
> $\text{Var}(z)$ remains **approximately constant across layers**.

This is the mathematical foundation behind **Xavier and He initialization**.

---

### <p style="text-align:center; color:orange; font-size:18px;"> Where Does  Var(z)  Come From?</p>

Start with the neuron equation:

$$
z = \sum_{i=1}^{n} w_i x_i
$$

This is a sum of random variables.


**Step 1: Variance of a sum**

A basic probability rule: if the random variables $y_i$ are independent, then

$$
\text{Var}\!\left(\sum_{i=1}^{n} y_i\right)
= \sum_{i=1}^{n} \text{Var}(y_i)
$$

Apply this with $y_i = w_i x_i$, so that:

$$
z = \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} w_i x_i
$$

Then:

$$
\text{Var}(z) = \sum_{i=1}^{n} \text{Var}(w_i x_i)
$$


**Step 2: Variance of a product**

Another key rule (under independence and zero mean):

$$
\text{Var}(w_i x_i)
= \text{Var}(w_i)\,\text{Var}(x_i)
$$

Why this holds:
- $w_i$ and $x_i$ are independent  
- $\mathbb{E}[w_i] = 0$, $\mathbb{E}[x_i] = 0$  

So each term contributes:

$$
\text{Var}(w_i x_i) = \text{Var}(w)\,\text{Var}(x)
$$
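
For completeness, the product rule follows from expanding the definition of variance. The only assumptions are the ones already listed (independence and zero means), which give $\mathbb{E}[w_i^2] = \text{Var}(w_i)$ and $\mathbb{E}[x_i^2] = \text{Var}(x_i)$:

$$
\begin{aligned}
\text{Var}(w_i x_i)
&= \mathbb{E}\!\left[w_i^2 x_i^2\right] - \left(\mathbb{E}[w_i x_i]\right)^2 \\
&= \mathbb{E}\!\left[w_i^2\right]\mathbb{E}\!\left[x_i^2\right] - \left(\mathbb{E}[w_i]\,\mathbb{E}[x_i]\right)^2 \\
&= \text{Var}(w_i)\,\text{Var}(x_i) - 0
\end{aligned}
$$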


**Step 3: Sum all contributions**

Since every term has the same variance:

$$
\text{Var}(z)
= \sum_{i=1}^{n} \text{Var}(w)\,\text{Var}(x)
$$

$$
\boxed{
\text{Var}(z) = n \cdot \text{Var}(w) \cdot \text{Var}(x)
}
$$


### What this means

- Each input contributes **a little variance**
- Adding $n$ such contributions multiplies variance by $n$

> More connections = more variance unless weights are scaled down


### Why this breaks deep networks

Across layers:

$$
\text{Var}(x^{(l+1)}) = n^{(l)} \text{Var}(w^{(l)}) \text{Var}(x^{(l)})
$$

Repeat this 50 times → explosion or collapse.
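
The compounding is easy to underestimate, so here is the arithmetic as a short sketch. It assumes every layer has the same fan-in and weight variance; `variance_after` is a hypothetical helper and the gains 0.9, 1.0, and 1.1 are illustrative.

```python
def variance_after(depth, n, var_w, var_x0=1.0):
    """Apply Var(x^(l+1)) = n * Var(w) * Var(x^(l)) for `depth` identical layers."""
    return var_x0 * (n * var_w) ** depth

n = 256  # illustrative fan-in
for gain in (0.9, 1.0, 1.1):
    var_w = gain / n  # chosen so that n * Var(w) equals the per-layer gain
    print(f"n*Var(w) = {gain:.1f} -> Var after 50 layers = {variance_after(50, n, var_w):.3e}")
```

A per-layer gain only 10% away from 1 already changes the variance by a factor of roughly 0.005 or 117 after 50 layers.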

---


# Xavier (Glorot) Initialization


## Core idea

Xavier initialization is a weight initialization strategy designed to keep the **variance of activations and gradients approximately constant across layers** in deep neural networks.

Its objective is to prevent:
- **Vanishing signals** (variance shrinking with depth)
- **Exploding signals** (variance growing with depth)

during **both the forward and backward pass**.

For a neuron:
$$
z = \sum_{i=1}^{n_{in}} w_i x_i
$$

Assuming:
- inputs $x_i$ are i.i.d. with zero mean and variance $\text{Var}(x)$  
- weights $w_i$ are i.i.d. with zero mean and variance $\text{Var}(w)$  
- weights and inputs are independent  

the output variance becomes:
$$
\text{Var}(z) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)
$$

To keep signal magnitude stable across layers, we want:
$$
\text{Var}(z) \approx \text{Var}(x)
$$

This leads to:
$$
n_{in} \cdot \text{Var}(w) \approx 1
$$


### Why $n_{out}$ also appears

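Backpropagation imposes a **similar constraint** on gradient variance, which depends on $n_{out}$ (fan-out): $n_{out} \cdot \text{Var}(w) \approx 1$. Both constraints hold exactly only when $n_{in} = n_{out}$.

To balance **both forward and backward variance**, Xavier initialization averages the two fan values and chooses: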

$$
\boxed{
\text{Var}(w) = \frac{2}{n_{in} + n_{out}}
}
$$

where:
- $n_{in}$ → number of input connections (fan-in)  
- $n_{out}$ → number of output connections (fan-out)  
- $\text{Var}(w)$ → variance of the weight distribution  

This choice aims to keep both passes approximately stable:
- activations neither explode nor vanish in the forward pass  
- gradients remain well-scaled in the backward pass  
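
For a convolutional layer, fan-in and fan-out also count the kernel area. The sketch below computes both for an illustrative `nn.Conv2d` (the channel counts and kernel size are assumptions) and compares the target variance with what PyTorch's `nn.init.xavier_normal_` produces.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)  # illustrative layer

# Each output unit sums over in_channels * kH * kW inputs (fan-in);
# each input unit feeds out_channels * kH * kW outputs (fan-out).
kh, kw = conv.kernel_size
fan_in = conv.in_channels * kh * kw
fan_out = conv.out_channels * kh * kw

target_var = 2.0 / (fan_in + fan_out)

nn.init.xavier_normal_(conv.weight)
measured_var = conv.weight.var().item()
print(f"fan_in={fan_in}, fan_out={fan_out}")
print(f"target Var(w)={target_var:.6f}, measured Var(w)={measured_var:.6f}")
```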

### Practical forms

Xavier initialization doesn’t set all weights to the same number.  
Instead, **each weight is chosen randomly** from a carefully controlled distribution to preserve variance.

#### 1. Uniform Xavier

$$
w \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\;
\sqrt{\frac{6}{n_{in}+n_{out}}}\right)
$$

- Each weight is drawn **independently at random** from the interval  
  $$[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]$$
- All values in this range are **equally likely**
- Example:  
  If $n_{in}=128$, $n_{out}=64$, then $\sqrt{6/192} \approx 0.177$  
  → each weight is drawn from $[-0.177, 0.177]$
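
The worked example can be reproduced with a fully connected layer of the same size. This is a minimal check, assuming `nn.Linear(128, 64)` as a stand-in for the $n_{in}=128$, $n_{out}=64$ case.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=128, out_features=64)  # stand-in for the example sizes
bound = math.sqrt(6.0 / (128 + 64))

nn.init.xavier_uniform_(layer.weight)
w = layer.weight.detach()
print(f"bound = {bound:.4f}")
print(f"min = {w.min().item():.4f}, max = {w.max().item():.4f}")
print(f"Var(w) = {w.var().item():.5f}  (target {2.0 / 192:.5f})")
```

Every sampled weight should fall inside the bound, and the sample variance should sit near $2/(n_{in}+n_{out})$.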


#### 2. Normal (Gaussian) Xavier

$$
w \sim \mathcal{N}\left(0, \frac{2}{n_{in}+n_{out}}\right)
$$

- Each weight is drawn independently from a **bell-shaped curve**  
  centered at 0 with variance $2/(n_{in}+n_{out})$
- Most weights are near 0; large values are rare
- Example:  
  With the same $n_{in}$ and $n_{out}$, $\text{Var}(w) \approx 0.0104$, $\sigma \approx 0.102$  
  → almost all weights (about 99.7%) lie within $\pm 3\sigma$, roughly $[-0.3, 0.3]$

#### Why randomness matters

- Prevents neurons from being identical (symmetry problem)  
- Breaks correlation while controlling scale for stable signal propagation  

#### Why these ranges are chosen

- Ensures forward and backward variance is approximately constant:  
$$
\text{Var}(z^{(l)}) \approx \text{Var}(z^{(l-1)})
$$
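
The bound $\sqrt{6/(n_{in}+n_{out})}$ in the uniform form follows from the variance of a uniform distribution. A short derivation, writing $a$ for the half-width of the interval (a symbol introduced here only for illustration):

$$
w \sim \mathcal{U}(-a, a)
\;\Rightarrow\;
\text{Var}(w) = \frac{(2a)^2}{12} = \frac{a^2}{3}
$$

$$
\frac{a^2}{3} = \frac{2}{n_{in}+n_{out}}
\;\Rightarrow\;
a = \sqrt{\frac{6}{n_{in}+n_{out}}}
$$

So the uniform and normal forms target the same $\text{Var}(w)$; they differ only in the shape of the distribution.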


## When Xavier Works Well

Xavier initialization is most effective with **symmetric activation functions**:

Suitable activations:
- `sigmoid`  
- `tanh`  

Why it works:
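- `tanh` produces outputs centered around zero (`sigmoid` outputs are centered at 0.5, but it is still commonly paired with Xavier)  
- Both are approximately linear near zero, so the variance propagation assumptions roughly hold  
- Forward and backward signals remain stable across layers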


## Important Caveat

CNNs almost always use **ReLU activations**, not `tanh`.

Problem:
- ReLU sets all negative pre-activations to zero  
- For zero-mean, symmetric pre-activations, this removes about half of the signal's second moment at every layer  
- Xavier initialization assumes symmetric activations, so the weight variance it picks is too small to compensate

Consequence:
> Learning is slow, unstable, or may fail to converge in deep ReLU networks

For ReLU-based CNNs, use **He initialization** instead of Xavier.
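
Both effects can be observed directly. The sketch below is a toy experiment, not a CNN: it assumes Gaussian pre-activations and a stack of square fully connected ReLU layers whose weights use the Xavier variance; width and depth are illustrative.

```python
import torch

torch.manual_seed(0)

# 1) ReLU keeps about half of the second moment of a zero-mean symmetric input.
z = torch.randn(1_000_000)
ratio = (torch.relu(z) ** 2).mean().item() / (z ** 2).mean().item()
print(f"E[relu(z)^2] / E[z^2] = {ratio:.3f}")

# 2) Stack ReLU layers whose weights use the Xavier variance 2 / (n_in + n_out).
width, depth = 256, 20
x = torch.randn(512, width)
xavier_std = (2.0 / (width + width)) ** 0.5
for _ in range(depth):
    x = torch.relu(x @ (torch.randn(width, width) * xavier_std))
final_rms = (x ** 2).mean().sqrt().item()
print(f"RMS activation after {depth} ReLU layers: {final_rms:.2e}")
```

The ratio should come out near 0.5, and the activations after 20 layers should be far smaller than the unit-scale input, which is the shrinkage described above.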


## Xavier Initialization in PyTorch

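```python
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)  # example layer; sizes are illustrative

# Uniform Xavier (modifies conv.weight in place)
nn.init.xavier_uniform_(conv.weight)

# Normal (Gaussian) Xavier (alternative; overwrites the values set above)
nn.init.xavier_normal_(conv.weight)
```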

Notes:

- Each call re-initializes the weights in place using the Xavier formulas (the trailing underscore marks an in-place operation)
- Works well for symmetric activations like `tanh` or `sigmoid`
- Keeps variance of activations and gradients roughly constant across layers
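
In a real model you usually want to initialize every layer, not a single `conv`. One common pattern is `model.apply` with a small function. In this sketch, `init_xavier` is a hypothetical helper, the `tanh` network is a made-up example, and zeroing the biases is an assumption rather than part of Xavier initialization itself.

```python
import torch.nn as nn

def init_xavier(module):
    """Apply Xavier-uniform weights and zero biases to conv and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: start biases at zero

# Hypothetical small tanh network, used only to demonstrate the helper.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.Tanh(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.Tanh(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)
model.apply(init_xavier)  # visits every submodule recursively
```

Layers without weights (`Tanh`, pooling, `Flatten`) are skipped by the `isinstance` check.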

# He (Kaiming) Initialization


To be continued...

---

<p style="text-align:center; color:skyblue; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
