**Parameter Initialization**

Parameter initialization largely determines whether a network trains at all. While it might seem natural to set every weight to zero or to draw weights from an arbitrary random distribution, either choice can stall or destabilize training.

*1. The Problem with Zero Initialization*

It is acceptable to initialize biases ($b$) to zero. However, if we initialize the weights ($W$) to zero, we encounter the **Symmetry Problem**.
* If all weights are zero, every neuron in the layer performs the same calculation and outputs the same value.
* Consequently, during backpropagation, every neuron receives the same gradient update.
* The neurons never "break symmetry", so they cannot learn distinct features. Each layer effectively behaves like a single neuron, no matter how many units it has.
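
To make the symmetry problem concrete, here is a minimal NumPy sketch. The two-layer network, the sigmoid activation, the toy data, the learning rate, and all variable names are assumptions chosen for illustration; they do not come from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # toy inputs: 8 samples, 3 features
y = rng.normal(size=(8, 1))   # toy regression targets

# Zero initialization for every weight and bias.
W1 = np.zeros((3, 4)); b1 = np.zeros(4)   # hidden layer with 4 units
W2 = np.zeros((4, 1)); b2 = np.zeros(1)   # linear output layer

lr = 0.1
for _ in range(50):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    pred = h @ W2 + b2

    # Backward pass for the mean squared error
    d_pred = 2.0 * (pred - y) / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_z = (d_pred @ W2.T) * h * (1.0 - h)
    dW1 = X.T @ d_z
    db1 = d_z.sum(axis=0)

    # Plain gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Each column of W1 holds the incoming weights of one hidden unit.
columns_identical = np.allclose(W1, W1[:, [0]])
weights_moved = bool(np.any(W1 != 0))
print("W1 changed during training:", weights_moved)
print("all hidden units still identical:", columns_identical)
```

The weights do move away from zero, but every hidden unit receives the same update at every step, so the four units stay exact copies of one another.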

*2. The Problem with Random Initialization*

If we simply pick random numbers without careful scaling, we run into gradient stability issues:

* **Vanishing Gradient:** If weights are initialized too small, the signal shrinks as it passes through each layer. By the time backpropagation reaches the early layers, the gradients are close to zero. The network learns extremely slowly or stops learning entirely.

* **Exploding Gradient:** If weights are initialized too large, the signal grows uncontrollably with each layer. Gradients become massive, causing the weights to update wildly. This leads to unstable learning, divergence, or `NaN` (Not a Number) errors.
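
Both failure modes can be observed in a few lines. The sketch below is only an illustration under assumed settings (a 20-layer ReLU stack of width 256, and the hypothetical helper `final_activation_rms`). It measures the forward signal only; in a ReLU network the backward pass multiplies gradients by the same weight matrices, so it scales in a similar way.

```python
import numpy as np

def final_activation_rms(weight_scale, depth=20, width=256, seed=0):
    """Push random inputs through `depth` ReLU layers whose weights are
    standard-normal samples multiplied by `weight_scale`, and return the
    root-mean-square of the final activations."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((512, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale
        h = np.maximum(0.0, h @ W)
    return float(np.sqrt(np.mean(h ** 2)))

rms_small = final_activation_rms(0.01)  # weights too small: signal shrinks
rms_large = final_activation_rms(1.0)   # weights too large: signal grows
print(f"scale 0.01 -> final RMS {rms_small:.3e}")
print(f"scale 1.0  -> final RMS {rms_large:.3e}")
```

With the small scale the signal collapses toward zero, and with the large scale it grows by many orders of magnitude, even though the only difference between the two runs is a constant factor on the weights.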

*3. The Solution: Variance Control*

To prevent these issues, we must control the **variance** of the weights. The goal is to keep the variance of each layer's output roughly equal to the variance of its input, so the signal neither shrinks nor grows as it passes through the layers.
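
A short derivation shows why the weight variance is the quantity to control. It rests on simplifying assumptions that are not stated above: a single neuron with $D_{in}$ inputs $x_i$ and weights $w_i$ (symbols introduced here), weights that are zero-mean, share one variance, and are independent of each other and of the inputs, and inputs that share the same second moment $\mathbb{E}[x^2]$.

$$
z = \sum_{i=1}^{D_{in}} w_i x_i
\quad\Longrightarrow\quad
\operatorname{Var}(z) = D_{in} \, \operatorname{Var}(w) \, \mathbb{E}[x^2]
$$

Under these assumptions, each layer multiplies the scale of the signal by $D_{in} \operatorname{Var}(w)$. The signal keeps its scale when this factor is close to $1$, shrinks when it is below $1$, and grows when it is above $1$.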

*4. Implementation*

To correctly initialize $W$, we look at the activation function and the number of input connections ($D_{in}$) from the previous layer.

* **For ReLU (He Initialization):**
    Formula: `np.random.randn(...) * np.sqrt(2 / D_in)`, that is, a standard deviation of $\sqrt{\frac{2}{D_{in}}}$

    *Reasoning:* Since ReLU zeros out negative values (killing half the signal), we double the variance to preserve the signal magnitude.

* **For Sigmoid/Softmax (Xavier/Glorot Initialization):**
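    Formula: `np.random.randn(...) * np.sqrt(1 / D_in)`, that is, a standard deviation of $\sqrt{\frac{1}{D_{in}}}$

    This is the common fan-in form. The original Glorot formula uses a variance of $\frac{2}{D_{in} + D_{out}}$, where $D_{out}$ is the number of output connections.

    *Reasoning:* This keeps the pre-activation values mostly within the near-linear middle region of the S-curve, preventing the gradients from vanishing at the flat tails of the function.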
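
The two formulas can be wrapped in small helpers and checked empirically. This is a sketch under stated assumptions: the function names `he_init` and `xavier_init`, the layer width, the depth, and the random inputs are illustrative, and `rng.standard_normal` is used as the generator-based equivalent of `np.random.randn`.

```python
import numpy as np

def he_init(d_in, d_out, rng):
    """Standard-normal weights scaled by sqrt(2 / d_in), for ReLU layers."""
    return rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)

def xavier_init(d_in, d_out, rng):
    """Standard-normal weights scaled by sqrt(1 / d_in) (fan-in form)."""
    return rng.standard_normal((d_in, d_out)) * np.sqrt(1.0 / d_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 512
x = rng.standard_normal((1000, d))   # toy input batch

h_relu = x
h_sig = x
for _ in range(20):
    # ReLU stack with He initialization
    h_relu = np.maximum(0.0, h_relu @ he_init(d, d, rng))
    # Sigmoid stack with Xavier initialization; keep the pre-activations
    z_sig = h_sig @ xavier_init(d, d, rng)
    h_sig = sigmoid(z_sig)

relu_rms = float(np.sqrt(np.mean(h_relu ** 2)))
sigmoid_preact_std = float(z_sig.std())
print(f"ReLU + He: RMS of activations after 20 layers = {relu_rms:.3f}")
print(f"Sigmoid + Xavier: std of last pre-activations = {sigmoid_preact_std:.3f}")
```

After 20 layers the ReLU activations remain on the same order of magnitude as the input, and the sigmoid pre-activations stay small enough to sit mostly in the near-linear part of the curve.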