1. [Linear Models are Limited](#Linear Models are Limited)
2. [2-Layer Neural Network](#2-Layer Neural Network)
    1. [Network of ReLUs](#Network of ReLUs)
    2. [TensorFlow ReLUs](#TensorFlow ReLUs)
3. [Chain Rule and Backpropagation](#Chain Rule and Backpropagation)

# 1. Linear Models are Limited <a name='Linear Models are Limited'></a>

The model we've trained so far is simple, and it's also relatively limited.

__Quiz:__

How many trainable parameters did that model actually have? Recall that the input was a 28 by 28 image and the output was 10 classes.

<img src='Figures4/Screen Shot 2017-03-26 at 18.00.25.png' width=400>

__Answer:__

$$
\begin{align}
\text{Total number of parameters} &= \text{size of } \mathbf{W} + \text{size of } \mathbf{b} \\
&= 28 \times 28 \times 10 + 10 \\
&= 7850
\end{align}
$$
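The count above can be checked with a few lines of Python. This is only a sketch; `linear_param_count` is a hypothetical helper name, not something from the course material.

```python
def linear_param_count(n_inputs, n_outputs):
    """Parameters of a linear model: one weight per (input, output) pair
    plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# 28 x 28 pixel inputs, 10 output classes
print(linear_param_count(28 * 28, 10))  # 7850
```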

If we have $N$ inputs and $K$ outputs, we have $(N+1)K$ parameters to use. In practice, we might want to use many, many more parameters. The model is also linear, which means that the kinds of interactions we can represent with it are somewhat limited. For example, if two inputs interact in an additive way, $y = x_1 + x_2$, the model can represent them well as a matrix multiply. But if two inputs interact so that the outcome depends on the product of the two, $y = x_1 \times x_2$, we won't be able to model that efficiently with a linear model.
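To make the additive-versus-product point concrete, here is a minimal NumPy sketch. It assumes random inputs drawn uniformly from $[-1, 1]$ and uses an ordinary least-squares fit as a stand-in for "the best possible linear model"; the names `linear_fit_error`, `err_sum`, and `err_prod` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(1000, 2))

# Design matrix: the two inputs plus a constant column for the bias
A = np.hstack([x, np.ones((len(x), 1))])


def linear_fit_error(targets):
    """Mean squared error of the best least-squares linear fit to targets."""
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return np.mean((A @ coef - targets) ** 2)


err_sum = linear_fit_error(x[:, 0] + x[:, 1])   # additive interaction
err_prod = linear_fit_error(x[:, 0] * x[:, 1])  # multiplicative interaction

print("error fitting x1 + x2:", err_sum)
print("error fitting x1 * x2:", err_prod)
```

The additive target is fitted essentially exactly, while the product target leaves a large residual that no choice of weights and bias can remove.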

However, linear operations are really nice. Big matrix multiplies are exactly what GPUs were designed for, and numerically, linear operations are very stable.

<img src='Figures4/Screen Shot 2017-03-26 at 18.26.49.png' width=300>

We can show mathematically that small changes in the input can never yield big changes in the output. In addition, the derivatives are very nice too: the derivative of a linear function is constant.

<img src='Figures4/Screen Shot 2017-03-26 at 18.27.18.png' width=300>

We can't get more numerically stable than a constant. So, we would like to keep the parameters inside big linear functions, but we would also want the entire model to be nonlinear.

We can't just keep multiplying our inputs by linear functions, because that's just equivalent to one big linear function as below.

<img src='Figures4/Screen Shot 2017-03-26 at 18.52.19.png' width=300>
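The collapse of stacked linear layers into one can be verified numerically. The sketch below assumes arbitrary random weights and shapes chosen only for illustration; `W1`, `b1`, `W2`, `b2` are not taken from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
x = rng.normal(size=(5, 4))  # 5 examples with 4 features each

# Two linear layers applied one after the other
stacked = (x @ W1 + b1) @ W2 + b2

# The single equivalent linear layer: (x W1 + b1) W2 + b2 = x (W1 W2) + (b1 W2 + b2)
W = W1 @ W2
b = b1 @ W2 + b2
collapsed = x @ W + b

print(np.allclose(stacked, collapsed))  # True
```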

So, we're going to have to introduce non-linearities.

# 2. 2-Layer Neural Network <a name='2-Layer Neural Network'></a>

### 2.1. Network of ReLUs <a name='Network of ReLUs'></a>

Rectified Linear Units (ReLUs) are literally the simplest non-linear functions. They're linear if $x$ is greater than $0$, and they're $0$ everywhere else.

<img src='Figures4/Screen Shot 2017-03-26 at 18.35.17.png' width = 400>

ReLUs have nice derivatives, as well. 

<img src='Figures4/Screen Shot 2017-03-26 at 18.37.28.png' width = 200>

When $x$ is less than $0$, the value is $0$, so the derivative is $0$ as well. When $x$ is greater than $0$, the value is equal to $x$, so the derivative is equal to $1$.
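A minimal NumPy sketch of the ReLU and its derivative follows. The function names are illustrative, and the derivative at exactly $x = 0$ (where it is not defined) is set to $0$ here by convention, which is an assumption of this sketch.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, otherwise 0 (including at x == 0)."""
    return (x > 0).astype(float)


x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # negative and zero inputs map to 0, 3 stays 3
print(relu_grad(x))  # slope 0 for the first two entries, 1 for the last
```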

We can now make our logistic classifier non-linear. Instead of having a single matrix multiply as our classifier, we're going to insert a ReLU right in the middle.

<img src='Figures4/Screen Shot 2017-03-26 at 18.41.05.png' width=500>

We now have two matrices: one going from the inputs to the ReLUs, and another one connecting the ReLUs to the classifier.

We've solved two of our problems. Our function is now nonlinear thanks to the ReLUs in the middle, and we now have a new knob that we can tune: the number $H$, which corresponds to the number of ReLU units that we have in the classifier. We can make it as big as we want.

__Note__: Depicted above is a "2-layer" neural network:

<img src='Figures4/RELU.png' width=500>

1. The first layer effectively consists of the set of weights and biases applied to $X$ and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.

2. The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities.
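The two layers described above can be sketched in plain NumPy. Everything here is an assumption made for illustration: the hidden size `H = 32`, the random initial weights, and the helper names `softmax` and `two_layer_forward`.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def two_layer_forward(X, W1, b1, W2, b2):
    hidden = np.maximum(0.0, X @ W1 + b1)  # layer 1: linear + ReLU (hidden layer)
    logits = hidden @ W2 + b2              # layer 2: linear
    return softmax(logits)                 # probabilities over the classes


rng = np.random.default_rng(0)
N, H, K = 28 * 28, 32, 10  # inputs, hidden ReLU units, output classes
X = rng.normal(size=(4, N))  # a batch of 4 fake "images"
W1, b1 = rng.normal(scale=0.01, size=(N, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.01, size=(H, K)), np.zeros(K)

probs = two_layer_forward(X, W1, b1, W2, b2)
print(probs.shape)        # (4, 10)
print(probs.sum(axis=1))  # each row sums to 1
```

Changing `H` changes the number of parameters without touching the input or output sizes, which is the tunable knob mentioned earlier.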

### 2.2. TensorFlow ReLUs <a name='TensorFlow ReLUs'></a>

<img src='Figures4/Screen Shot 2017-03-26 at 18.35.17.png' width = 400>

A Rectified Linear Unit (ReLU) is a type of [activation function](https://en.wikipedia.org/wiki/Activation_function) that is defined as $f(x) = \max(0, x)$. The function returns $0$ if $x$ is negative, otherwise it returns $x$. TensorFlow provides the ReLU function as __tf.nn.relu()__, as shown below.

```Python
# Hidden Layer with ReLU activation function
hidden_layer = tf.add(tf.matmul(features, hidden_weights), hidden_biases)
hidden_layer = tf.nn.relu(hidden_layer)

output = tf.add(tf.matmul(hidden_layer, output_weights), output_biases)
```

<img src='Figures4/insert-relu.png' width=500>

The above code applies the __tf.nn.relu()__ function to the __hidden_layer__, setting any negative values to zero and acting like an on/off switch. Adding further layers, like the output layer, after an activation function turns the model into a nonlinear function. This nonlinearity allows the network to solve more complex problems.

__Quiz:__

Use TensorFlow's ReLU function to turn the linear model below into a nonlinear model.

```Python
# Solution is available in the other "solution.py" tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
linear1 = tf.add(tf.matmul(features, weights[0]), biases[0])
ReLU = tf.nn.relu(linear1)
linear2 = tf.add(tf.matmul(ReLU, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output = sess.run(linear2)
print(output)
```

Output:

```
[[  5.11000013   8.44000053]
 [  0.           0.        ]
 [ 24.01000214  38.23999786]]
```


__Answer:__

```Python
# Quiz Solution
# Note: You can't run code in this tab
import tensorflow as tf

output = None
hidden_layer_weights = [
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]]
out_weights = [
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]]

# Weights and biases
weights = [
    tf.Variable(hidden_layer_weights),
    tf.Variable(out_weights)]
biases = [
    tf.Variable(tf.zeros(3)),
    tf.Variable(tf.zeros(2))]

# Input
features = tf.Variable([[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0], [11.0, 12.0, 13.0, 14.0]])

# TODO: Create Model
hidden_layer = tf.add(tf.matmul(features, weights[0]), biases[0])
hidden_layer = tf.nn.relu(hidden_layer)
logits = tf.add(tf.matmul(hidden_layer, weights[1]), biases[1])

# TODO: Print session results
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits))
```
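The solution uses `tf.Session`, which belongs to the TensorFlow 1.x API. As a cross-check that does not depend on the TensorFlow version, the same arithmetic can be sketched in NumPy. This is not the course's solution, only a re-computation with the same weights and inputs; the results should agree with the values printed in the quiz cell (5.11, 8.44, 0, 0, 24.01, 38.24) up to floating-point rounding.

```python
import numpy as np

hidden_layer_weights = np.array([
    [0.1, 0.2, 0.4],
    [0.4, 0.6, 0.6],
    [0.5, 0.9, 0.1],
    [0.8, 0.2, 0.8]])
out_weights = np.array([
    [0.1, 0.6],
    [0.2, 0.1],
    [0.7, 0.9]])
features = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [-1.0, -2.0, -3.0, -4.0],
    [11.0, 12.0, 13.0, 14.0]])

# Biases are all zero in the quiz, so they are omitted here
hidden_layer = np.maximum(0.0, features @ hidden_layer_weights)  # linear + ReLU
output = hidden_layer @ out_weights                               # output layer

print(np.round(output, 2))
```

The second row is all zeros because every pre-activation for the negative input is negative, so the ReLU switches the whole hidden layer off for that example.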

# 3. Chain Rule and Backpropagation <a name='Chain Rule and Backpropagation'></a>

<img src='Figures4/Screen Shot 2017-03-26 at 19.50.36.png' width = 400>

One reason to build this network by stacking simple operations, like multiplications, and sums, and ReLUs, on top of each other is
that it makes the math very simple.

The key mathematical insight is the chain rule. 

<img src='Figures4/Screen Shot 2017-03-26 at 19.52.42.png' width = 300>

If we have two functions that get composed, the chain rule tells us that we can compute the derivative of the composed function simply by taking the product of the derivatives of the components. That's very powerful.

As long as you know how to write the derivatives of your individual functions, there is a simple graphical way to combine them together and compute the derivative for the whole function.
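As a small worked example of the chain rule, the sketch below composes two functions chosen purely for illustration, $g(x) = x^2$ and $f(u) = \sin(u)$, and compares the chain-rule derivative with a finite-difference estimate.

```python
import math


def g(x):
    return x * x


def g_prime(x):
    return 2.0 * x


def f(u):
    return math.sin(u)


def f_prime(u):
    return math.cos(u)


def composed_derivative(x):
    """Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)


x0 = 0.7
eps = 1e-6
# Central finite difference as an independent numerical estimate
numeric = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2.0 * eps)

print(composed_derivative(x0), numeric)  # the two values should agree closely
```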

<img src='Figures4/Screen Shot 2017-03-26 at 19.54.26.png' width = 500>

There is a way to write this chain rule that is very efficient computationally, with lots of data reuse, and that looks like a very simple data pipeline.

Imagine the network is a stack of simple operations. Some have parameters, like the matrix transforms; some don't, like the ReLUs.

<img src='Figures4/Screen Shot 2017-03-26 at 19.59.04.png' width = 400>

When we apply data to the input $x$, the data flows through the stack up to the predictions $y$.

To compute the derivatives, we create another graph.

<img src='Figures4/Screen Shot 2017-03-26 at 20.01.01.png' width = 400>

The data in the new graph flows backwards through the network, gets combined using the chain rule that we saw before, and produces gradients. That graph can be derived completely automatically from the individual operations in the network. This is called back-propagation, and it's a very powerful concept.

It makes computing derivatives of complex functions very efficient, as long as the function is made up of simple blocks with simple derivatives.
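To illustrate the backward graph, here is a hand-written sketch of back-propagation through a linear, ReLU, linear stack in NumPy. Several things are assumptions made to keep it short: a squared-error loss is used instead of the softmax classifier from the notes, the shapes and random values are arbitrary, and the function names are hypothetical. In practice, frameworks such as TensorFlow derive this backward pass automatically.

```python
import numpy as np


def forward(params, X):
    """Forward prop: linear -> ReLU -> linear. Returns intermediates for reuse."""
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    h = np.maximum(0.0, z1)
    y = h @ W2 + b2
    return z1, h, y


def loss_fn(params, X, T):
    """Squared-error loss (an assumption of this sketch)."""
    return 0.5 * np.sum((forward(params, X)[2] - T) ** 2)


def backward(params, X, T):
    """Back prop: walk the stack in reverse, applying the chain rule per block."""
    W1, b1, W2, b2 = params
    z1, h, y = forward(params, X)
    dy = y - T               # dL/dy for the squared-error loss
    dW2 = h.T @ dy           # second linear block: gradient w.r.t. its weights
    db2 = dy.sum(axis=0)
    dh = dy @ W2.T           # gradient passed back to the ReLU output
    dz1 = dh * (z1 > 0)      # ReLU block: multiply by its 0/1 derivative
    dW1 = X.T @ dz1          # first linear block
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
T = rng.normal(size=(5, 2))
params = (rng.normal(size=(4, 3)), np.zeros(3), rng.normal(size=(3, 2)), np.zeros(2))

dW1, db1, dW2, db2 = backward(params, X, T)

# Check one entry of dW1 against a central finite difference
eps = 1e-6
W1_plus = params[0].copy()
W1_plus[0, 0] += eps
W1_minus = params[0].copy()
W1_minus[0, 0] -= eps
numeric = (loss_fn((W1_plus,) + params[1:], X, T)
           - loss_fn((W1_minus,) + params[1:], X, T)) / (2.0 * eps)

print(dW1[0, 0], numeric)  # the two values should agree closely
```

Note how the backward pass reuses `z1` and `h` from the forward pass; that data reuse is what makes the computation efficient.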

Running the model up to the predictions is often called forward prop, and the model that goes backwards is called back prop.

So, to recap: to run stochastic gradient descent, for every single little batch of data in your training set, you run the forward prop and then the back prop. That gives you gradients for each of the weights in your model. Then you apply those gradients, scaled by the learning rate, to your original weights and update them. You repeat that all over again, many, many times. This is how your entire model gets optimized.

We won't go through more of the math of what's going on in each of those blocks because, again, you don't typically have to worry about that, and it's essentially the chain rule. But keep this diagram in mind. In particular, each block of the back prop often takes about twice the memory that's needed for the forward prop, and twice the compute. That's important when you want to size your model and fit it in memory, for example.
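The training loop recapped above can be sketched in a few lines. To keep the example self-contained, it assumes a tiny linear model with a squared-error loss and synthetic, noise-free data rather than the neural network from these notes; the learning rate, batch size, and variable names are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: targets come from a known weight vector
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)        # weights to be learned
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))  # visit the data in a new random order
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        pred = X[idx] @ w                              # forward prop
        grad = X[idx].T @ (pred - y[idx]) / len(idx)   # back prop (gradient of the loss)
        w -= learning_rate * grad                      # apply gradient scaled by learning rate

print(w)  # should end up close to true_w
```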

