# Week 5 - Neural Networks: Learning

This week, we'll discuss how to *train* neural networks (NNs) using the "backpropagation" algorithm. The topics we'll discuss include:

* Cost Function and Backpropagation
  * Cost Function
  * Backpropagation Algorithm
  * Backpropagation Intuition
* Backpropagation in Practice
  * Implementation Note: Unrolling Parameters
  * Gradient Checking
  * Random Initialization
  * Putting it Together
* Application of Neural Networks
  * Autonomous Driving
  
## Cost Function and Backpropagation

### Cost Function

NNs are one of the most powerful learning algorithms we have today. We learned a bit about them last week, but now we need to know how to fit the parameters for a training set. We'll start by talking about the cost function for fitting the parameters of the network.

Suppose we have a training dataset

$$ \big\{ (x^{(1)},y^{(1)}) , (x^{(2)},y^{(2)}) , \cdots , (x^{(m)},y^{(m)}) \big\} $$

and a network with
* $L$, the total number of layers in the network, and 
* $s_l$, the number of units (not including the bias unit) in layer $l$.

We'll consider two types of classification:
1. **binary classification**, where
   $$ y \in \{0,1\} $$
   and there is 1 output unit, or
   $$s_L = 1$$
   For simplicity, we'll say that $K = 1$ in this case, so that $K$ always equals the number of output units.
2. **multiclass classification** of $K$ classes, where
   $$y \in \mathbb{R}^K$$
   and there are $K$ output units, or 
   $$s_L = K, K \ge 3$$ 
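
To make the multiclass label format concrete, here is a minimal sketch that assumes each $y$ is a vector with a single 1 in the position of the true class and 0 elsewhere (a one-hot vector). The helper name `one_hot` and the 0-based class labels are illustrative choices, not part of the notes above.

```python
def one_hot(label, num_classes):
    """Return a length-num_classes list with a 1 at index `label` (0-based)."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1 if k == label else 0 for k in range(num_classes)]


# Three hypothetical samples with class labels 0, 2 and 1, out of K = 4 classes
Y = [one_hot(label, 4) for label in (0, 2, 1)]
print(Y)  # [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
```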

Our cost function for the NN is a generalization of the regularized logistic regression cost function:

$$ 
\begin{align}
J(\Theta) = & - \frac{1}{m} \Big[ \sum_{i=1}^m \sum_{k=1}^K
               y_k^{(i)} \log \big( (h_\Theta (x^{(i)}))_k \big) + 
               (1-y_k^{(i)}) \log \big( 1 - (h_\Theta (x^{(i)}))_k \big)
               \Big] \\
               & + \frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} 
               \big( \Theta_{j,i}^{(l)} \big)^2
\end{align} 
$$

where

$$ h_\Theta (x) \in \mathbb{R}^K, \quad (h_\Theta(x))_k := k\text{th output} $$

Again, this resembles the logistic regression cost: the first term sums a logistic regression cost over each of the $K$ output units. The regularization term is also similar, but now sums the squared entries of the parameter matrix (rather than vector) of every layer. Because the index $i$ starts at 1, the bias weights are not regularized.
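
As a rough illustration of the cost function above, the following sketch evaluates $J(\Theta)$ in plain Python from already-computed network outputs. The function name `nn_cost`, its arguments, and the nested-list layout (with column 0 of each matrix holding the bias weights) are assumptions made for this example.

```python
import math


def nn_cost(H, Y, thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    H[i][k]  : k-th network output for sample i, strictly between 0 and 1
    Y[i][k]  : k-th label component for sample i (0 or 1)
    thetas   : list of weight matrices; thetas[l][j][0] is a bias weight
    lam      : regularization strength lambda
    """
    m = len(Y)
    data_term = 0.0
    for h_i, y_i in zip(H, Y):          # sum over samples i
        for h, y in zip(h_i, y_i):      # sum over output units k
            data_term += y * math.log(h) + (1 - y) * math.log(1 - h)
    # Skip column 0 of every matrix: bias weights are not regularized
    reg_term = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term


# One sample, one output unit, one 1x2 weight matrix (bias weight 1.0, weight 2.0)
J = nn_cost(H=[[0.5]], Y=[[1]], thetas=[[[1.0, 2.0]]], lam=1.0)
```

In this toy call the data term contributes $-\log 0.5 = \log 2$ and the regularization term contributes $\frac{1}{2} \cdot 2^2 = 2$, so `J` is approximately 2.693.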

### Backpropagation Algorithm

The backpropagation algorithm is a method for computing the gradient of the cost function with respect to the parameters, which we need in order to minimize the cost. To do so, we'll again need to compute, for a given set of parameters

$$ J(\Theta) \text{ and } \frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) $$

Let's start by talking about a training set of only one sample,

$$(x,y)$$

and using a 4-layer NN. Our forward propagation looks as follows:

$$
\begin{align}
a^{(1)} & = x \\
z^{(2)} & = \Theta^{(1)} a^{(1)} \\
a^{(2)} & = g(z^{(2)}), \text{ add } a^{(2)}_0 \\
z^{(3)} & = \Theta^{(2)} a^{(2)} \\
a^{(3)} & = g(z^{(3)}), \text{ add } a^{(3)}_0 \\
z^{(4)} & = \Theta^{(3)} a^{(3)} \\
a^{(4)} & = h_\Theta(x) = g(z^{(4)}) 
\end{align}
$$
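
The forward pass above can be sketched in plain Python as follows. This is a minimal illustration, assuming $g$ is the sigmoid function and that each $\Theta^{(l)}$ is stored as a nested list whose column 0 multiplies the bias unit. The helper names `sigmoid`, `matvec`, and `forward` are hypothetical. Note that the sketch also prepends a bias unit to $a^{(1)}$, which the equations leave implicit.

```python
import math


def sigmoid(z):
    """Logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(theta, a):
    """Matrix-vector product Theta * a for nested-list matrices."""
    return [sum(w * x for w, x in zip(row, a)) for row in theta]


def forward(x, thetas):
    """Return the activations a^(1), ..., a^(L) for one input x.

    Every layer except the output layer has a bias unit 1.0 prepended.
    """
    a = [1.0] + list(x)                  # a^(1) = x, plus bias unit
    activations = [a]
    for l, theta in enumerate(thetas):
        z = matvec(theta, a)             # z^(l+1) = Theta^(l) a^(l)
        a = [sigmoid(v) for v in z]      # a^(l+1) = g(z^(l+1))
        if l < len(thetas) - 1:          # hidden layers get a bias unit
            a = [1.0] + a
        activations.append(a)
    return activations


# A 4-layer network (2 inputs, two hidden layers of 2 units, 1 output)
# with all-zero weights, so every unit outputs g(0) = 0.5
thetas = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # Theta^(1): 2 x 3
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # Theta^(2): 2 x 3
    [[0.0, 0.0, 0.0]],                   # Theta^(3): 1 x 3
]
activations = forward([0.5, -1.0], thetas)
print(activations[-1])  # [0.5]
```

Keeping every layer's activations, rather than only the final output, is a deliberate choice here: backpropagation reuses them when computing the gradient.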