# Week 5 - Neural Networks: Learning

This time, we'll discuss how to *train* NNs. We will learn the "backpropagation" algorithm for training these models. The topics we'll discuss include:

* Cost Function and Backpropagation
  * Cost Function
  * Backpropagation Algorithm
  * Backpropagation Intuition
* Backpropagation in Practice
  * Implementation Note: Unrolling Parameters
  * Gradient Checking
  * Random Initialization
  * Putting It Together
* Application of Neural Networks
  * Autonomous Driving
  
## Cost Function and Backpropagation

### Cost Function

NNs are one of the most powerful learning algorithms we have today. We learned a bit about them last week, but now we need to know how to fit the parameters for a training set. We'll start by talking about the cost function for fitting the parameters of the network.

Suppose we have a training dataset

$$ \big\{ (x^{(1)},y^{(1)}) , (x^{(2)},y^{(2)}) , \cdots , (x^{(m)},y^{(m)}) \big\} $$

with 
* $L$ number of layers in the network, and 
* $s_l$ number of units (not including the bias unit) in layer $l$

We'll consider two types of classification:
1. **binary classification**, where
   $$ y \in \{0,1\} $$
   and there is 1 output unit, or
   $$s_L = 1$$
   For simplicity, we'll say that $K = 1$ in this case (one output unit, even though there are two classes).
2. **multiclass classification** of $K$ classes, where
   $$y \in \mathbb{R}^K$$
   and there are $K$ output units, or 
   $$s_L = K, K \ge 3$$ 

Our cost function for the NN is a more general version of the logistic regression cost function:

$$ 
\begin{align}
J(\Theta) = & - \frac{1}{m} \Big[ \sum_{i=1}^m \sum_{k=1}^K
               y_k^{(i)} \log \big( (h_\Theta (x^{(i)}))_k \big) + 
               (1-y_k^{(i)}) \log \big( 1 - (h_\Theta (x^{(i)}))_k \big)
               \Big] \\
               & + \frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} 
               \big( \Theta_{j,i}^{(l)} \big)^2
\end{align} 
\\ 
\\
h_\Theta (x) \in \mathbb{R}^K, (h_\Theta(x))_i := i\text{th output}
$$

Again, this resembles the logistic regression cost function, except that we're summing the logistic regression cost over each of the $K$ output units. The regularization term is similar too, but now we sum over every element of the parameter matrix (rather than a parameter vector) for each of the layers, skipping the bias weights.
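As a rough sketch, the cost above could be computed with NumPy along the following lines. The function name `nn_cost` and the argument layout are assumptions made for illustration: `H` holds the network outputs with one row per sample, `Y` holds the matching 0/1 labels, and `thetas` is a list of weight matrices whose first column multiplies the bias unit.

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    H      : (m, K) array, row i holds h_Theta(x^(i))
    Y      : (m, K) array of 0/1 labels
    thetas : list of weight matrices, bias weights in column 0 (assumed layout)
    lam    : regularization strength lambda
    """
    m = Y.shape[0]
    # Double sum over samples i and output units k
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Triple sum over layers and weights, skipping the bias column
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return data_term + reg_term
```

With $\lambda = 0$ and every output equal to 0.5, this reduces to $\log 2$, which is a handy sanity check.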

### Backpropagation Algorithm

The backpropagation algorithm gives us what we need to minimize the cost function over the parameters. To do so, we'll again need to compute, for a given set of parameters

$$ J(\Theta) \text{ and } \frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) $$

Let's start by talking about a training set of only one sample,

$$(x,y)$$

and using a 4-layer NN. Our forward propagation looks as follows:

$$
\begin{align}
a^{(1)} & = x \\
z^{(2)} & = \Theta^{(1)} a^{(1)} \\
a^{(2)} & = g(z^{(2)}), \text{ add } a^{(2)}_0 \\
z^{(3)} & = \Theta^{(2)} a^{(2)} \\
a^{(3)} & = g(z^{(3)}), \text{ add } a^{(3)}_0 \\
z^{(4)} & = \Theta^{(3)} a^{(3)} \\
a^{(4)} & = h_\Theta(x) = g(z^{(4)}) 
\end{align}
$$
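The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, assuming a sigmoid activation $g$ and weight matrices whose first column multiplies the bias unit; the names `sigmoid` and `forward` are hypothetical helpers, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward propagation for one sample x (1-D array, no bias entry).

    Returns the lists of z and a vectors; every activation except the
    output layer has the bias unit a_0 = 1 prepended.
    """
    a = np.concatenate(([1.0], x))   # a^(1) = x, with the bias unit added
    activations, zs = [a], []
    for l, theta in enumerate(thetas):
        z = theta @ a                # next z = Theta * current a
        a = sigmoid(z)               # next a = g(z)
        if l < len(thetas) - 1:      # hidden layer: add the bias unit
            a = np.concatenate(([1.0], a))
        zs.append(z)
        activations.append(a)
    return zs, activations
```

For the 4-layer network, `thetas` would hold three matrices and the last entry of `activations` is $h_\Theta(x)$.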

Now to compute the gradients, we need to use backpropagation. The intuition is 
- $\delta_j^{(l)}$ : "error" of node $j$ in layer $l$ 

This is the "error" in the activation of node $j$ in layer $l$. For the output layer, with $L=4$,

$$ \delta_j^{(4)} = a_j^{(4)} - y_j  = (h_\Theta (x))_j - y_j .$$

And now we need the earlier layer errors:

$$ \delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \cdot g^\prime(z^{(3)}) \\
   \delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \cdot g^\prime(z^{(2)}) $$
   
The "$\cdot$" here is element-wise multiplication. To compute $g^\prime(z^{(3)})$,

$$ g^\prime(z^{(3)}) = a^{(3)} \cdot (1 - a^{(3)}) $$
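
This identity follows directly from the definition of the activation function, assuming $g$ is the sigmoid used in the previous week:

$$ g(z) = \frac{1}{1 + e^{-z}} \implies g^\prime(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \Big( 1 - \frac{1}{1 + e^{-z}} \Big) = g(z) \big( 1 - g(z) \big) $$

Since $a^{(3)} = g(z^{(3)})$, the derivative can be written entirely in terms of the activations we already computed during forward propagation.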

Note that there is no $\delta^{(1)}$ term, since the first layer is just our input and has no error associated with it.

It turns out (the proof is possible, but fairly involved) that, ignoring regularization,

$$ \frac{\partial}{\partial \Theta_{i,j}^{(l)}} J ( \Theta ) = a_j^{(l)} \delta_i^{(l+1)} $$

Now let's put it all together. Let's go back to our training set

$$ \big\{ (x^{(1)},y^{(1)}) , (x^{(2)},y^{(2)}) , \cdots , (x^{(m)},y^{(m)}) \big\} $$

First, we set

$$ \Delta_{i,j}^{(l)} = 0 $$

which will be used to compute our derivative terms. Next we loop through our training samples and perform the following computations:

> For $i = 1$ to $m$:
>
> &nbsp;&nbsp;&nbsp;&nbsp; Set $a^{(1)} = x^{(i)}$
>
> &nbsp;&nbsp;&nbsp;&nbsp; Perform forward propagation to compute $a^{(l)}$ for $l \in \{2,3,\cdots,L\}$
> 
> &nbsp;&nbsp;&nbsp;&nbsp; Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
> 
> &nbsp;&nbsp;&nbsp;&nbsp; Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots , \delta^{(2)}$
> 
> &nbsp;&nbsp;&nbsp;&nbsp; $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$

The last step can be written in vectorized notation as 

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T $$

And now finally, we can compute, with regularization,

$$ \begin{align}
D_{i,j}^{(l)} & := \frac{1}{m} \Delta_{i,j}^{(l)}+ \frac{\lambda}{m} \Theta_{i,j}^{(l)} \text{ for } j \neq 0 \\
D_{i,j}^{(l)} & := \frac{1}{m} \Delta_{i,j}^{(l)} \text{ for } j = 0 \\
\end{align} $$

and these $D_{i,j}^{(l)}$ terms are the partial derivatives $\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$ that we set out to calculate.
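
The whole procedure can be sketched in NumPy roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function name `backprop` is hypothetical, $g$ is taken to be the sigmoid, each weight matrix keeps its bias weights in column 0, and the regularization term is scaled by $\lambda / m$, which is the derivative of the $\frac{\lambda}{2m}$ penalty in the cost function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, Y, thetas, lam):
    """Return one gradient matrix per weight matrix in `thetas`.

    X : (m, n) inputs, Y : (m, K) labels, bias weights in column 0.
    List index l (0-based) holds the weights between layers l+1 and l+2.
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in thetas]      # accumulators start at 0
    for x, y in zip(X, Y):
        # Forward pass, storing each activation with the bias unit prepended
        acts = [np.concatenate(([1.0], x))]
        for l, T in enumerate(thetas):
            a = sigmoid(T @ acts[-1])
            if l < len(thetas) - 1:
                a = np.concatenate(([1.0], a))
            acts.append(a)
        # Output-layer error: activation minus label
        delta = acts[-1] - y
        # Walk backwards through the layers
        for l in range(len(thetas) - 1, -1, -1):
            # Accumulate the outer product of the error and the activation
            Deltas[l] += np.outer(delta, acts[l])
            if l > 0:
                a = acts[l]
                # Theta^T delta, times g'(z) = a(1 - a); drop the bias entry
                delta = ((thetas[l].T @ delta) * a * (1 - a))[1:]
    grads = []
    for T, Delta in zip(thetas, Deltas):
        D = Delta / m
        D[:, 1:] += lam / m * T[:, 1:]               # no penalty on bias column
        grads.append(D)
    return grads
```

The inner accumulation line is the vectorized update $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$, expressed with `np.outer`.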

### Backpropagation Intuition

Here we'll work through a few more of the mechanical steps of backpropagation. Note that
* $\delta_j^{(l)}$ is the "error" of the cost for $a_j^{(l)}$

Formally,

$$ \delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} 
   \Big[ - y^{(i)} \log \big( h_\Theta (x^{(i)}) \big) - (1 - y^{(i)}) \log \big(1 - h_\Theta(x^{(i)}) \big) 
   \Big] $$
   
Backpropagation mirrors forward propagation, with the errors flowing from right to left. For example, leaving out the $g^\prime(z_2^{(2)})$ factor,

$$ \delta_2^{(2)} = \Theta_{1,2}^{(2)} \delta_1^{(3)} + \Theta_{2,2}^{(2)} \delta_2^{(3)} $$

for backward propagation, and for forward propagation we had

$$ z_1^{(3)} = \Theta_{1,0}^{(2)} 1 + \Theta_{1,1}^{(2)} a_1^{(2)} + \Theta_{1,2}^{(2)} a_2^{(2)} $$

Note that we ignored the bias unit in the backward step. Whether you compute a $\delta$ for the bias units just depends on how you implement the algorithm.

## Backpropagation in Practice

### Implementation Note: Unrolling Parameters

Now, let's talk about unrolling the parameters from matrices into vectors. Let's say we're passing the cost function and gradients to a minimizing function. These minimizing functions assume the parameters and gradients are vectors, not matrices.

Let's take the example of a model with

$$ s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1 \\
   \Theta^{(1)} \in \mathbb{R}^{10 \times 11} , \Theta^{(2)} \in \mathbb{R}^{10 \times 11}, \Theta^{(3)} \in \mathbb{R}^{1 \times 11} \\
   D^{(1)} \in \mathbb{R}^{10 \times 11} , D^{(2)} \in \mathbb{R}^{10 \times 11}, D^{(3)} \in \mathbb{R}^{1 \times 11} $$
   
So we'll have to make a vector containing *all* of the elements in *each* of the matrices so that we'd have

$$ \theta \in \mathbb{R}^{10 \cdot 11 + 10 \cdot 11 + 1 \cdot 11} = \mathbb{R}^{231} $$

To recover the matrices, we reshape slices of the vector: elements 1 through 110 form the first 10 by 11 matrix, elements 111 through 220 form the second 10 by 11 matrix, and elements 221 through 231 form the 1 by 11 matrix. In Python, this whole process involves the `ravel` and `reshape` functions in NumPy. The gradient matrices need the same treatment.
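
A minimal sketch of the round trip, using the matrix shapes from the example above. The helper names `unroll` and `roll` are hypothetical; the NumPy calls (`ravel`, `reshape`, `concatenate`) are standard. Note that Python slices are 0-based, so the "first 110 elements" are `vector[0:110]`.

```python
import numpy as np

# Shapes from the example above
shapes = [(10, 11), (10, 11), (1, 11)]
thetas = [np.random.rand(*s) for s in shapes]

def unroll(matrices):
    """Flatten a list of matrices into a single 1-D vector."""
    return np.concatenate([M.ravel() for M in matrices])

def roll(vector, shapes):
    """Split a 1-D vector back into matrices of the given shapes."""
    matrices, start = [], 0
    for rows, cols in shapes:
        end = start + rows * cols
        matrices.append(vector[start:end].reshape(rows, cols))
        start = end
    return matrices

theta_vec = unroll(thetas)            # 1-D vector with 231 elements
restored = roll(theta_vec, shapes)    # the same three matrices again
```

`ravel` flattens in row-major order by default, and `reshape` uses the same order, so the round trip reproduces the original matrices exactly.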

### Gradient Checking

In the simplest case, where $\Theta$ is a single real number, gradient checking means estimating the derivative numerically with a two-sided difference:

$$ \frac{\mathrm{d}}{\mathrm{d}\Theta} J ( \Theta )\approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2 \epsilon} , \epsilon \ll 1$$

Typically, we use a small value like $\epsilon = 10^{-4}$.

Now let's see what our approximation would look like for an unrolled parameter vector. What we'd have is 

$$ \theta = [\theta_1,\theta_2,\cdots,\theta_n] \\
\frac{\partial}{\partial \theta_1} J(\theta) \approx 
\frac{J(\theta_1+\epsilon,\theta_2,\cdots,\theta_n)-J(\theta_1-\epsilon,\theta_2,\cdots,\theta_n)}{2 \epsilon} \\
\frac{\partial}{\partial \theta_2} J(\theta) \approx 
\frac{J(\theta_1,\theta_2+\epsilon,\cdots,\theta_n)-J(\theta_1,\theta_2-\epsilon,\cdots,\theta_n)}{2 \epsilon} \\
\vdots \\
\frac{\partial}{\partial \theta_n} J(\theta) \approx 
\frac{J(\theta_1,\theta_2,\cdots,\theta_n+\epsilon)-J(\theta_1,\theta_2,\cdots,\theta_n-\epsilon)}{2 \epsilon}$$

We can then use this estimate to check the results of backpropagation. This is purely a debugging step, though. Turn gradient checking off before training for real, because evaluating the cost function twice per parameter makes it very slow.
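
The estimate can be sketched as a short NumPy function. The name `numerical_gradient` is hypothetical, `J` is assumed to be any function that takes an unrolled 1-D parameter vector and returns a scalar cost, and the toy quadratic cost is only there to show the check against a gradient we know exactly.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference estimate of the gradient of J at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps                     # perturb only the k-th parameter
        grad[k] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

# Toy cost with a known gradient: J = sum(theta^2), dJ/dtheta = 2 * theta
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
exact = 2 * theta
```

In practice, `exact` would be the unrolled gradient from backpropagation, and the two vectors should agree to several decimal places.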

### Random Initialization

So far we have been initializing our parameters to zero. With a NN, however, if all of the weights start at the same value, such as zero, then every unit in a hidden layer computes the same activation and therefore receives the same gradient. Gradient descent will still update the parameters, but the weights feeding each hidden unit remain identical to one another, leaving identical hidden layer units again. Since the units in the last hidden layer are mapped to the output, and all of them compute the same thing, the network effectively maps the inputs through one single feature, which is not very interesting. This is the problem of symmetric weights.
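
The symmetry problem is easy to observe numerically. The sketch below is an illustration under assumed settings (a 2-3-1 sigmoid network, a tiny AND-style data set, a learning rate of 1.0, no regularization): it starts from all-zero weights, runs plain gradient descent, and then checks that every hidden unit still has the same weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny data set (logical AND) and a 2-3-1 network with all weights at zero
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [0.0], [0.0], [1.0]])
Theta1 = np.zeros((3, 3))   # hidden layer: 3 units, 2 inputs + bias
Theta2 = np.zeros((1, 4))   # output layer: 1 unit, 3 hidden units + bias
m = X.shape[0]
A1 = np.hstack([np.ones((m, 1)), X])

for _ in range(200):                        # plain gradient descent
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    A3 = sigmoid(A2 @ Theta2.T)
    d3 = A3 - y                             # output-layer error
    d2 = (d3 @ Theta2)[:, 1:] * A2[:, 1:] * (1 - A2[:, 1:])
    Theta2 -= 1.0 * (d3.T @ A2) / m
    Theta1 -= 1.0 * (d2.T @ A1) / m

# The weights have moved away from zero, but every row of Theta1 matches
rows_identical = np.allclose(Theta1, Theta1[0])
```

Here `rows_identical` comes out `True`: the three hidden units have learned exactly the same function.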

In order to get around this problem, we initialize randomly. Programmatically, we will use the `numpy.random` module to get random arrays of a set size. We want to initialize each parameter as a random number within some small range

$$ [ -\tau , \tau ] $$

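and programmatically that looks something like the following, since `numpy.random.rand` returns values between 0 and 1:

    import numpy as np
    tau = 0.1  # example value, chosen here only for illustration
    Theta1 = np.random.rand(10, 11) * (2 * tau) - tau  # uniform in [-tau, tau)
    Theta2 = np.random.rand(1, 11) * (2 * tau) - tau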


### Putting It Together

Now let's put all of the pieces together to see how to implement a NN learning algorithm.

#### Pick an architecture

The first thing you need to do is select a network architecture. You may want a 3-element input layer feeding one hidden layer of 5 activation elements, followed by a 4-class output layer. Or perhaps you want 2, or even 3, hidden layers.

It is reasonable to start with just one hidden layer, and this is most common. It's also common, if you wish to use multiple hidden layers, to maintain the same number of activation units in each hidden layer. In general, having more hidden units is better, but it can be a bit more computationally expensive.

Remember that if you have a multiclass output layer, you need to have outputs that are simply columns of the identity matrix. For example, if the training data output resembles

$$ y \in \{ 1, 2, \cdots, 10 \} $$

then $y=5$ needs to be mapped to 

$$ y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} $$ 
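
This mapping can be sketched in one line of NumPy by indexing into the identity matrix. The helper name `one_hot` is hypothetical, and it assumes the labels are integers starting at 1, as in the example above. Each sample becomes a row here, so the encoded labels form an $m \times K$ matrix.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map labels in {1, ..., num_classes} to rows of the identity matrix."""
    labels = np.asarray(labels)
    return np.eye(num_classes)[labels - 1]   # label k -> k-th unit vector

Y = one_hot([5, 1, 10], 10)   # first row has its single 1 in position 5
```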

#### Training a NN

The steps for training a NN are as follows:
1. Randomly initialize weights $\Theta$
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
3. Implement code to compute cost function $J(\Theta)$
4. Implement backpropagation to compute partial derivatives $\frac{\partial}{\partial \Theta_{j,k}^{(l)}} J (\Theta)$
5. Use gradient checking to compare the gradients found with backpropagation against a numerical estimate of the cost function gradient
   Then after we've checked, we need to make sure to disable the gradient checking code.
6. Use gradient descent or advanced optimization methods with backpropagation to minimize the cost function on the parameters.

We usually do the forward and backward propagation with a `for` loop iterated over our training samples. Also, the cost function is non-convex for NNs, so minimization functions can get stuck in a local minimum. In practice this is usually not a problem: even if the optimizer does get stuck, as long as the cost at the local minimum is reasonably low, the error should be low as well.
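
The six steps can be strung together in a compact sketch. Everything specific here is an assumption made for illustration: the 2-4-1 architecture, the XOR data, the values of `tau`, `lam`, and `alpha`, and the helper name `cost_and_grads`. The sketch also processes all samples at once with matrix operations instead of looping over them, and it uses plain gradient descent rather than an advanced optimizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(Theta1, Theta2, X, Y, lam):
    """Steps 2-4: forward propagation, cost, and backpropagation."""
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    d3 = H - Y                                            # output error
    d2 = (d3 @ Theta2)[:, 1:] * A2[:, 1:] * (1 - A2[:, 1:])
    D1, D2 = d2.T @ A1 / m, d3.T @ A2 / m
    D1[:, 1:] += lam / m * Theta1[:, 1:]                  # skip bias column
    D2[:, 1:] += lam / m * Theta2[:, 1:]
    return J, D1, D2

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])                # XOR labels
tau, lam, alpha = 0.5, 0.0, 0.5                           # assumed settings

# Step 1: random initialization in [-tau, tau]
Theta1 = rng.uniform(-tau, tau, (4, 3))
Theta2 = rng.uniform(-tau, tau, (1, 5))

# Step 5: gradient check on a single weight, done once before training
eps = 1e-4
_, D1, _ = cost_and_grads(Theta1, Theta2, X, Y, lam)
Tp, Tm = Theta1.copy(), Theta1.copy()
Tp[0, 1] += eps
Tm[0, 1] -= eps
numeric = (cost_and_grads(Tp, Theta2, X, Y, lam)[0]
           - cost_and_grads(Tm, Theta2, X, Y, lam)[0]) / (2 * eps)
check_error = abs(numeric - D1[0, 1])

# Step 6: gradient descent, with gradient checking no longer running
initial_cost = cost_and_grads(Theta1, Theta2, X, Y, lam)[0]
for _ in range(5000):
    J, D1, D2 = cost_and_grads(Theta1, Theta2, X, Y, lam)
    Theta1 -= alpha * D1
    Theta2 -= alpha * D2
final_cost = cost_and_grads(Theta1, Theta2, X, Y, lam)[0]
```

Comparing `final_cost` with `initial_cost` shows whether training made progress. Because the cost is non-convex, how low it ends up depends on the random starting point.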