# Neural Networks
## Terminology
**Activation Function**: the sigmoid (logistic) function $g$. $a_i^{(j)}$ denotes the activation of unit i in layer j

**Weights**: parameters $\theta$

**Input layer**: layer 1, where we input our examples

**Hidden layer**: any layer between the input layer and the output layer

**Output layer**: last layer, outputs our hypothesis

**$\Theta^{(j)}$**: matrix of weights controlling function mapping from layer j to layer j + 1


## Overview
<img src='img/4.1.png' />

If the network has $s_j$ units in layer j and $s_{j+1}$ units in layer j + 1, then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$.

Each row corresponds to one neuron in layer j + 1, excluding the bias unit. Each column corresponds to one neuron in layer j, including the bias unit.
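The dimension rule can be checked with a small sketch. The helper name `weight_shapes` and the 3-5-2 layer sizes are hypothetical, chosen only for illustration.

```python
def weight_shapes(layer_sizes):
    """Shape of each Theta^(j), given the unit count of every layer (bias units excluded)."""
    return [
        (s_next, s_curr + 1)  # rows: units in layer j+1, columns: units in layer j plus bias
        for s_curr, s_next in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

# Hypothetical network: 3 input units, 5 hidden units, 2 output units.
print(weight_shapes([3, 5, 2]))  # [(5, 4), (2, 6)]
```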

## Forward Propagation Vectorization
$$
z^{(j+1)} = \Theta^{(j)}a^{(j)} \\
a^{(j+1)} = g(z^{(j+1)}) \\
a_0^{(j+1)} = 1
$$
note: $a^{(1)} = x$
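A minimal NumPy sketch of these equations for a single example follows. The function names `sigmoid` and `forward` and the all-zero weights are assumptions made for illustration, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Return a^(L), the output activations, for one input vector x."""
    a = np.asarray(x, dtype=float)       # a^(1) = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))   # prepend the bias unit a_0 = 1
        z = theta @ a                    # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)                   # a^(j+1) = g(z^(j+1))
    return a

# Hypothetical 3-5-2 network with all-zero weights: every z is 0, so g(z) = 0.5.
thetas = [np.zeros((5, 4)), np.zeros((2, 6))]
print(forward(thetas, [1.0, 2.0, 3.0]))  # [0.5 0.5]
```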

## Cost Function

$$
h_\Theta(x) \in R^K \\
y \in R^K
$$
$$
J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + (1-y_k^{(i)})\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2
$$
where m is the number of training examples

L is the number of layers

K is the number of neurons in the output layer

$s_l$ is the number of neurons in layer l, not counting the bias unit
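The cost can be sketched in NumPy as below, assuming `X` holds one example per row and `Y` holds one-hot labels of shape (m, K). The names `hypothesis` and `cost` are hypothetical. The bias column of each weight matrix is left out of the regularization term, matching sums that start at index 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(thetas, X):
    """Forward propagate all m rows of X at once; returns an (m, K) array."""
    A = X
    for theta in thetas:
        A = np.hstack([np.ones((A.shape[0], 1)), A])  # add the bias column
        A = sigmoid(A @ theta.T)
    return A

def cost(thetas, X, Y, lam):
    """Regularized cross-entropy cost J(Theta)."""
    m = X.shape[0]
    H = hypothesis(thetas, X)
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Skip column 0 of each Theta: bias weights are not regularized.
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

# With all-zero weights every output is 0.5, so J = K * ln(2) for K = 2 classes.
thetas = [np.zeros((5, 4)), np.zeros((2, 6))]
X = np.zeros((3, 3))
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(round(cost(thetas, X, Y, lam=1.0), 4))  # 1.3863
```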

## Backpropagation

$\delta_j^{(l)}$ is the error of node j in layer l

example (a network with L = 4 layers):
$$
\delta^{(4)} = a^{(4)} - y \\
\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)})
$$
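
In the example above, `.*` is the element-wise product. For the sigmoid activation named in the terminology, $g'$ has a closed form, obtained by differentiating $g(z) = \frac{1}{1+e^{-z}}$, so the derivative term can be computed from activations that forward propagation has already produced:
$$
g'(z) = g(z)(1 - g(z)) \\
g'(z^{(3)}) = a^{(3)} .* (1 - a^{(3)})
$$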

The algorithm:

set $\Delta_{ij}^{(l)} = 0$ for all l,i,j

for i = 1 to m:

set $a^{(1)} = x^{(i)}$

perform forward propagation to compute $a^{(l)}$ for l = 2, 3, ..., L

using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$

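compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$

accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$

after the loop, the unregularized gradient is $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = \frac{1}{m}\Delta_{ij}^{(l)}$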
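
The loop can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a definitive implementation: it uses the standard accumulation $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$, divides by m at the end, and adds $\frac{\lambda}{m}\Theta$ to the non-bias columns. The function name `backprop` and the argument layout (one example per row of `X`, one-hot rows in `Y`, float weight arrays) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(thetas, X, Y, lam=0.0):
    """Return the gradient of J with respect to each Theta^(l), same shapes as thetas."""
    m = X.shape[0]
    Deltas = [np.zeros_like(t) for t in thetas]       # Delta^(l) = 0 for all l
    for x, y in zip(X, Y):
        # Forward pass, storing each a^(l) with its bias unit prepended.
        acts = []
        a = x                                         # a^(1) = x^(i)
        for theta in thetas:
            a = np.concatenate(([1.0], a))
            acts.append(a)
            a = sigmoid(theta @ a)
        delta = a - y                                 # delta^(L) = a^(L) - y^(i)
        for l in range(len(thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, acts[l])     # accumulate delta (a)^T
            if l > 0:
                # Propagate the error back; g'(z) = a * (1 - a). Drop the bias entry.
                back = thetas[l].T @ delta
                delta = (back * acts[l] * (1 - acts[l]))[1:]
    grads = []
    for theta, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam / m * theta[:, 1:]            # bias column is not regularized
        grads.append(D)
    return grads
```

Each returned array has the same shape as the matching `thetas` entry, so the result can be flattened and handed to an optimizer or compared against a numerical gradient.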

## Training a neural network
number of input layer neurons: dimension of the features

number of output layer neurons: number of classes

hidden layers: more is usually better, but too many can be very computationally expensive

1. randomly initialize weights
2. implement forward propagation to compute the hypothesis
3. implement code to compute the loss function
4. implement back propagation to compute the partial derivatives
5. use gradient checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backpropagation against a numerical estimate of the gradient. Then disable gradient checking before training.
6. use an optimization method to minimize $J(\Theta)$
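
Steps 1 and 5 can be sketched as below. The helper names `random_init` and `numerical_gradient`, the initialization range `epsilon_init=0.12`, and the step size `eps=1e-4` are assumptions made for illustration. The toy cost $J(\theta) = \sum\theta^2$ stands in for the network cost so that the exact gradient is known.

```python
import numpy as np

def random_init(s_in, s_out, epsilon_init=0.12, rng=None):
    """Weights for one layer, drawn uniformly from [-epsilon_init, epsilon_init].

    Random values break the symmetry that all-zero weights would create.
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-epsilon_init, epsilon_init, size=(s_out, s_in + 1))

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference estimate of dJ/dtheta for a flat parameter vector."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

# Toy check: the exact gradient of sum(theta^2) is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
print(np.allclose(approx, 2 * theta, atol=1e-6))  # True
```

For a real network, `J` would unroll the flat vector back into the weight matrices and evaluate the cost. The numerical estimate needs two cost evaluations per parameter, which is why it is only used as a check and not for training.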