# Shallow Neural Networks

## Network Representation

### Vectorization: Single Input Vector Case

Consider a shallow, 2-layer network with three input features, $\textbf{x} = [x_1, x_2, x_3]^T$, and a single hidden layer with four nodes (neurons), whose activations are $a_{1}^{[1]}$, $a_{2}^{[1]}$, $a_{3}^{[1]}$, $a_{4}^{[1]}$.

- $n_{h}^{[l]}$ refers to the number of hidden units in the $l^{th}$ layer  
- $\textbf{w}_{i}^{[l]}$ denotes the weight vector of the $i^{th}$ node in the $l^{th}$ layer  
- $W^{[l]} \in \mathbb{R}^{n_{h}^{[l]} \; \times \; n_{h}^{[l-1]}} \equiv W^{[l]} \in \mathbb{R}^{\textrm{number of units in layer } l \; \times \; \textrm{number of units in previous layer}}$ is the weight matrix of the $l^{th}$ layer (for the first layer, $n_{h}^{[0]} = n_x$)  
- $\textbf{x} \in \mathbb{R}^{n_x \times 1}$, which can also be written as $\textbf{a}^{[0]}$, is the input vector, and the network maps $\textbf{x} \mapsto a^{[2]} = \hat{y}$  
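
The shape conventions above can be checked with a small NumPy sketch. The variable names (`W1`, `b1`, `W2`, `b2`) and the random placeholder values are assumptions made for illustration; only the shapes come from the notation in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for reproducibility

n_x, n_h, n_y = 3, 4, 1  # input features, hidden units, output units

# W^{[l]} has one row per unit in layer l and one column per unit in layer l-1.
W1 = rng.standard_normal((n_h, n_x))  # W^{[1]}: (4, 3), placeholder values
b1 = np.zeros((n_h, 1))               # b^{[1]}: (4, 1)
W2 = rng.standard_normal((n_y, n_h))  # W^{[2]}: (1, 4), placeholder values
b2 = np.zeros((n_y, 1))               # b^{[2]}: (1, 1)

for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]:
    print(name, p.shape)
```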

First Node:
$$ z_{1}^{[1]} = w_{1}^{[1]T} \textbf{x} + b_{1}^{[1]} \;\;\; ;\quad a_{1}^{[1]} = \sigma(z_{1}^{[1]}) \tag{1, 2} $$

Second Node:
$$ z_{2}^{[1]} = w_{2}^{[1]T} \textbf{x} + b_{2}^{[1]} \;\;\; ;\quad a_{2}^{[1]} = \sigma(z_{2}^{[1]}) \tag{3, 4} $$
and so on for the third and fourth nodes.
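
As a rough illustration, the per-node equations can be written as an explicit loop over the four hidden nodes. The names `w`, `b`, `z`, `a` and the random values are hypothetical, and `sigmoid` is a small helper defined here to stand in for $\sigma$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, standing in for sigma in the equations."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))                       # input vector, shape (3, 1)
w = [rng.standard_normal((3, 1)) for _ in range(4)]   # one weight vector per hidden node
b = rng.standard_normal(4)                            # one scalar bias per hidden node

z = np.zeros(4)
a = np.zeros(4)
for i in range(4):
    # z_i = w_i^T x + b_i, then a_i = sigma(z_i)
    z[i] = (w[i].T @ x).item() + b[i]
    a[i] = sigmoid(z[i])
```

Each iteration handles one node, which is the loop that the matrix form removes.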


Stack the transposed weight vectors as the rows of $W^{[1]}$ and multiply it with the features to compute all four $z_{i}^{[1]}$ at once:

$$
\textbf{z}^{[1]}_{4 \times 1}
=
\begin{bmatrix}
z_{1}^{[1]}\\ 
z_{2}^{[1]}\\
z_{3}^{[1]}\\
z_{4}^{[1]}\\
\end{bmatrix}_{4 \times 1}
=
\begin{bmatrix}
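\rule[.5ex]{1.5em}{0.4pt} \; w_{1}^{[1]T} \; \rule[.5ex]{1.5em}{0.4pt} \\
\rule[.5ex]{1.5em}{0.4pt} \; w_{2}^{[1]T} \; \rule[.5ex]{1.5em}{0.4pt} \\
\rule[.5ex]{1.5em}{0.4pt} \; w_{3}^{[1]T} \; \rule[.5ex]{1.5em}{0.4pt} \\
\rule[.5ex]{1.5em}{0.4pt} \; w_{4}^{[1]T} \; \rule[.5ex]{1.5em}{0.4pt} \\
\end{bmatrix}_{4 \times 3}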
\begin{bmatrix}
x_{1} \\ 
x_{2} \\
x_{3} \\
\end{bmatrix}_{3 \times 1}
+
\begin{bmatrix}
b_{1}^{[1]} \\ 
b_{2}^{[1]} \\
b_{3}^{[1]} \\
b_{4}^{[1]} \\
\end{bmatrix}_{4 \times 1}
=
\begin{bmatrix}
w_{1}^{[1]T} \textbf{x}+ b_{1}^{[1]} \\ 
w_{2}^{[1]T} \textbf{x}+ b_{2}^{[1]} \\
w_{3}^{[1]T} \textbf{x}+ b_{3}^{[1]} \\
w_{4}^{[1]T} \textbf{x}+ b_{4}^{[1]} \\
\end{bmatrix}
=
W^{[1]} \textbf{x} + \textbf{b}^{[1]}
\tag{5}
$$

$$
\textbf{a}^{[1]}_{4 \times 1} = \sigma(\textbf{z}^{[1]}) \tag{6}
$$
and
$$
\begin{aligned}
z^{[2]}_{1 \times 1} &= W^{[2]}_{1 \times 4} \; \textbf{a}^{[1]}_{4 \times 1} + b^{[2]}_{1 \times 1} \\
a^{[2]}_{1 \times 1} &= \sigma(z^{[2]})
\end{aligned}
$$

which produces the output $\hat{y} = a^{[2]}$.
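
The forward pass for a single input vector can be sketched in NumPy as follows. This is a minimal sketch, not a definitive implementation: `forward_single`, `sigmoid`, and the random parameter values are assumptions introduced for illustration. The last lines check that entry $i$ of $W^{[1]} \textbf{x} + \textbf{b}^{[1]}$ equals the per-node value $w_{i}^{[1]T} \textbf{x} + b_{i}^{[1]}$.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_single(x, W1, b1, W2, b2):
    """Forward pass for one input column vector x of shape (3, 1)."""
    z1 = W1 @ x + b1   # (4, 3) @ (3, 1) + (4, 1) -> (4, 1)
    a1 = sigmoid(z1)   # (4, 1)
    z2 = W2 @ a1 + b2  # (1, 4) @ (4, 1) + (1, 1) -> (1, 1)
    a2 = sigmoid(z2)   # (1, 1), the prediction y_hat
    return a2

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

y_hat = forward_single(x, W1, b1, W2, b2)

# Row i of W1 is w_i^T, so the matrix form matches the node-by-node form.
z1 = W1 @ x + b1
z1_nodewise = np.array([[W1[i] @ x[:, 0] + b1[i, 0]] for i in range(4)])
same = np.allclose(z1, z1_nodewise)
```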

### Computing multiple $z$ as a vector

Given the input matrix $X \in \mathbb{R}^{n_x \times m}$, also written as $A^{[0]}$, where
$$
X =
\begin{bmatrix}
| & | & & | \\
\textbf{x}^{(1)} & \textbf{x}^{(2)} & \ldots & \textbf{x}^{(m)} \\
| & | & & | \\
\end{bmatrix}_{n_x \times m}
$$
the network maps $X \longmapsto \hat{\textbf{y}}$ one column at a time, i.e.
$$
\begin{aligned}
\textbf{x}^{(1)} &\mapsto \hat{y}^{(1)} := a^{[2](1)} \\
\textbf{x}^{(2)} &\mapsto \hat{y}^{(2)} := a^{[2](2)} \\
&\;\;\vdots \\
\textbf{x}^{(m)} &\mapsto \hat{y}^{(m)} := a^{[2](m)}
\end{aligned}
$$
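
As a small illustration of this layout, the matrix $X$ can be assembled by placing the training examples side by side as columns. The list `examples` and the sizes used here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5  # hypothetical sizes: 3 features, 5 training examples

# x^{(1)}, ..., x^{(m)}, each a column vector of shape (n_x, 1)
examples = [rng.standard_normal((n_x, 1)) for _ in range(m)]

# Stack the column vectors horizontally: X has shape (n_x, m).
X = np.hstack(examples)

# With zero-based indexing, column i of X holds example x^{(i+1)}.
second_example = X[:, [1]]
```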

### Unvectorized Implementation
`for i=1 to m:`

&emsp; $\textbf{z}^{[1](i)} = W^{[1]} \textbf{x}^{(i)} + \textbf{b}^{[1]}$

&emsp; $\textbf{a}^{[1](i)} = \sigma(\textbf{z}^{[1](i)})$

&emsp; $z^{[2](i)} = W^{[2]} \textbf{a}^{[1](i)} + b^{[2]}$

&emsp; $a^{[2](i)} = \sigma(z^{[2](i)})$

The primary goal of vectorization is to remove this `for` loop, which runs over all $m$ training examples.
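
For reference, the loop above could look like this in NumPy. It is a sketch under assumptions: `forward_loop` and `sigmoid` are hypothetical helper names, and the parameters are random placeholders rather than trained values.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_loop(X, W1, b1, W2, b2):
    """Unvectorized forward pass: one training example per iteration."""
    m = X.shape[1]
    A2 = np.zeros((W2.shape[0], m))
    for i in range(m):
        x_i = X[:, [i]]       # keep the column shape (n_x, 1)
        z1 = W1 @ x_i + b1    # (4, 1)
        a1 = sigmoid(z1)      # (4, 1)
        z2 = W2 @ a1 + b2     # (1, 1)
        a2 = sigmoid(z2)      # (1, 1)
        A2[:, [i]] = a2       # store the prediction for example i
    return A2

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))  # 6 hypothetical training examples
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

A2_loop = forward_loop(X, W1, b1, W2, b2)
```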

### Vectorized Implementation

Consider:

$$
Z^{[1]} =
\begin{bmatrix}
| & | & & | \\
\textbf{z}^{[1](1)} & \textbf{z}^{[1](2)} & \ldots & \textbf{z}^{[1](m)} \\
| & | & & | \\
\end{bmatrix}_{4 \times m} \; \textrm{where} \; \textbf{z}^{[1](i)} \in \mathbb{R}^{4 \times 1}
$$

$Z^{[1]}$ is thus simply the $m$ column vectors $\textbf{z}^{[1](i)}$ stacked horizontally.  
The weight matrix $W^{[1]}$ keeps the same shape, $(4, 3)$, and $X$ has shape $(3, m)$.

Therefore,

$$Z^{[1]}_{4 \times m} = W^{[1]} X + b^{[1]} $$
$$A^{[1]}_{4 \times m} = \sigma(Z^{[1]}) $$
$$Z^{[2]}_{1 \times m} = W^{[2]}_{1 \times 4} \; A^{[1]} + b^{[2]} $$
$$A^{[2]}_{1 \times m} = \sigma(Z^{[2]}) $$

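$A^{[1]}$ is similar to $Z^{[1]}$: the $m$ column vectors $\textbf{a}^{[1](i)}$ stacked horizontally.  
The bias $b^{[1]} \in \mathbb{R}^{4 \times 1}$ is added to each of the $m$ columns of $W^{[1]} X$, and likewise $b^{[2]}$ is added to each column of $W^{[2]} A^{[1]}$.  
Note that the row index corresponds to the node in the layer and the column index to the training example: the top-left entry of $Z^{[1]}$ and $A^{[1]}$ belongs to the first node of the first layer, evaluated on the first training example.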
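
The four vectorized equations map directly onto NumPy, where adding a `(4, 1)` bias to a `(4, m)` matrix relies on broadcasting across the columns. The sketch below uses hypothetical names (`forward_vectorized`, `sigmoid`) and random placeholder parameters, and checks that one column of $Z^{[1]}$ matches the per-example computation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_vectorized(X, W1, b1, W2, b2):
    """Forward pass over all m examples at once, with no loop over examples."""
    Z1 = W1 @ X + b1   # (4, 3) @ (3, m) -> (4, m); b1 (4, 1) is broadcast over columns
    A1 = sigmoid(Z1)   # (4, m)
    Z2 = W2 @ A1 + b2  # (1, 4) @ (4, m) -> (1, m); b2 (1, 1) is broadcast over columns
    A2 = sigmoid(Z2)   # (1, m)
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)
m = 6  # hypothetical number of training examples
X = rng.standard_normal((3, m))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

Z1, A1, Z2, A2 = forward_vectorized(X, W1, b1, W2, b2)

# Column i of Z1 equals z^{[1]} computed for example i alone.
i = 2
z1_i = W1 @ X[:, [i]] + b1
columns_match = np.allclose(Z1[:, [i]], z1_i)
```

Rows of `Z1` and `A1` index hidden nodes and columns index training examples, matching the layout described above.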