## Weight Initialization

### Why not identical initialization?

Suppose we initialize every node within layer $l$ with the same parameters:

$ \forall g, h: $

$ \textbf{w}^{[l]}_{g, :} = \textbf{w}^{[l]}_{h, :} $

$ b^{[l]}_g = b^{[l]}_h $

This implies that each node within layer $l$ activates identically:

$ \forall g, h: $

$ z^{[l]}_{s, g} = \textbf{a}^{[l-1]}_{s, :} \cdot \textbf{w}^{[l]T}_{g, :} + b_g^{[l]} $

$ z^{[l]}_{s, h} = \textbf{a}^{[l-1]}_{s, :} \cdot \textbf{w}^{[l]T}_{h, :} + b_h^{[l]} $

$ z^{[l]}_{s, g} = z^{[l]}_{s, h} $

$ a^{[l]}_{s, g} = f(z^{[l]}_{s, g}) = f(z^{[l]}_{s, h}) = a^{[l]}_{s, h} $

This implies that the weight gradients of all nodes within layer $l$ are identical:

$$ 
\begin{align}
    \forall g, h: & \\
    \frac {\partial C_s} {\partial w^{[l]}_{g, n}} 
    & = \frac {\partial C_s} {\partial a^{[l]}_{s, g}} 
    * f'(z^{[l]}_{s, g})
    * a^{[l-1]}_{s, n} 
    \\
    \frac {\partial C_s} {\partial w^{[l]}_{h, n}} 
    & = \frac {\partial C_s} {\partial a^{[l]}_{s, h}} 
    * f'(z^{[l]}_{s, h})
    * a^{[l-1]}_{s, n} 
    \\
    \frac {\partial C_s} {\partial w^{[l]}_{g, n}} 
    & = \frac {\partial C_s} {\partial w^{[l]}_{h, n}}  
\end{align}
$$

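Note that this equality also requires the upstream gradients $\partial C_s / \partial a^{[l]}_{s, g}$ and $\partial C_s / \partial a^{[l]}_{s, h}$ to be identical.
This is true if layer $l+1$ weights the outputs of nodes $g$ and $h$ identically, for example when every weight in layer $l+1$ is initialized to the same value.
Gradient descent then applies the same update to every node in layer $l$, so the nodes stay identical throughout training and the layer is no more expressive than a single node.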
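
The argument can be checked numerically. The following is a minimal sketch, not part of the derivation itself: it assumes a $\tanh$ activation for $f$, a single linear output node as layer $l+1$, and a squared-error cost $C_s = \frac{1}{2}(\hat{y} - y)^2$. The helper name `forward_backward` and all numeric values are made up for illustration.

```python
import math

def forward_backward(a_prev, W, b, v, c, y):
    """Return dC/dW for one sample (hypothetical helper, illustration only).

    Layer l:   z_g = a_prev . w_g + b_g,  a_g = tanh(z_g)
    Layer l+1: y_hat = a . v + c          (single linear output node)
    Cost:      C = 0.5 * (y_hat - y)**2   (assumed for this sketch)
    """
    z = [sum(a_n * w_n for a_n, w_n in zip(a_prev, row)) + b_g
         for row, b_g in zip(W, b)]
    a = [math.tanh(z_g) for z_g in z]
    y_hat = sum(a_g * v_g for a_g, v_g in zip(a, v)) + c
    # Upstream gradient: dC/da_g = (y_hat - y) * v_g
    dC_da = [(y_hat - y) * v_g for v_g in v]
    # dC/dw_{g,n} = dC/da_g * f'(z_g) * a_prev_n, with tanh'(z) = 1 - tanh(z)**2
    return [[dC_da[g] * (1.0 - a[g] ** 2) * a_n for a_n in a_prev]
            for g in range(len(W))]

a_prev = [0.5, -1.2]                       # activations of layer l-1
W = [[0.3, -0.7] for _ in range(3)]        # three nodes, identical weight rows
b = [0.1, 0.1, 0.1]                        # identical biases

# Case 1: layer l+1 also weights every node identically.
grad_identical = forward_backward(a_prev, W, b, v=[0.4, 0.4, 0.4], c=0.0, y=1.0)

# Case 2: layer l+1 weights the nodes differently, so the upstream gradient differs.
grad_broken = forward_backward(a_prev, W, b, v=[0.4, -0.2, 0.9], c=0.0, y=1.0)

print(grad_identical)
print(grad_broken)
```

In the first case every row of the gradient is the same, so the three nodes receive identical updates. In the second case the rows differ even though layer $l$ itself is identically initialized, which is why the condition on the upstream gradient is needed.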

### Glorot (Xavier) uniform

$$ W_i \sim U\left( -\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}} \right) $$
$$ n_{in},\ n_{out} = \text{count of input and output connections, respectively} $$
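
As a rough illustration of the uniform formula above, here is a sketch using only Python's standard library. The function name `glorot_uniform`, the `rng` parameter, and the layer sizes are assumptions for the example. The weight matrix has one row per node, matching the $\textbf{w}^{[l]}_{g, :}$ notation used earlier. Since $U(-a, a)$ has variance $a^2 / 3$, this limit gives each weight a variance of $2 / (n_{in} + n_{out})$.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng=None):
    """Sample an n_out x n_in weight matrix from U(-limit, limit).

    limit = sqrt(6 / (n_in + n_out)); `rng` is an optional random.Random
    instance so that results can be made reproducible.
    """
    if rng is None:
        rng = random.Random()
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

# Example: a layer with 128 inputs and 256 nodes (sizes chosen arbitrarily).
W = glorot_uniform(n_in=128, n_out=256, rng=random.Random(0))
```

Because every weight is drawn independently, no two nodes start with the same parameters, which breaks the symmetry described in the previous section.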

### He normal

$$ W_i \sim N\left( 0, \sigma^2 \right), \quad \sigma = \sqrt{\frac{2}{n_{in}}} $$
$$ n_{in} = \text{count of input connections} $$
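
The normal-distribution formula above can be sketched the same way. The $\sqrt{2 / n_{in}}$ scaling is the one commonly called He initialization. This sketch reads that value as the standard deviation, not the variance, which is an assumption about the notation. The function name `he_normal`, the `rng` parameter, and the layer sizes are illustrative only.

```python
import math
import random

def he_normal(n_in, n_out, rng=None):
    """Sample an n_out x n_in weight matrix from N(0, std**2).

    std = sqrt(2 / n_in) is treated as the standard deviation (assumption);
    `rng` is an optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)]
            for _ in range(n_out)]

# Example: a layer with 128 inputs and 256 nodes (sizes chosen arbitrarily).
W_he = he_normal(n_in=128, n_out=256, rng=random.Random(0))
```

Unlike the uniform scheme, the scale here depends only on $n_{in}$, so the number of nodes in the layer does not affect the spread of the sampled weights.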