In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import neural_net_helper
%aimport neural_net_helper

nnh = neural_net_helper.NN_Helper()

# How does the NN "learn" the transformations ?

The matrix $\W$ contains the "patterns" that serve to recognize the synthetic features created by each layer


<div align="middle">
    <center>Layer</center>
    <br>
    <!-- edX: Original: <img src="images/NN_Layer_multi_unit.png"> replace by EdX created image -->
    <img src=images/Layers_W8_L3_Sl5.png width=40%>
</div>

- $\W_{\llp, j}$ are the weights (the pattern) for feature $\y_{\llp,j}$


- How are these patterns discovered ?


The answer is: exactly as we did in Classical Machine Learning
- Define a loss function that is parameterized by $\W$: 
$$\loss = L(\hat{\y},\y; \W)$$
    - Per example loss $\loss^\ip$
    - Average loss $\loss = \frac{1}{m} \sum_{i=1}^m { \loss^\ip }$
- Our goal is to find $\W^*$, the "best" set of weights
$$
\W^* = \argmin{\W} L(\hat{\y},\y; \W)
$$
- Find $\W^*$ using Gradient Descent !

Very much in the spirit of the multi-layer architecture
- We add a new layer $(L+1)$ to compute the loss $\loss$ !

<div>
    <center><strong>Additional Loss Layer (L+1)</strong></center>
<br>
     <!-- edX: Original: <img src="images/NN_Layers_plus_Loss.png"> replace by EdX created image -->
    <img src="images/Addtl_Loss_Layer_W8_L5_Sl4.png">
</div>

# Gradient Descent review

Gradient Descent is an iterative method for finding the minimum of a function.
<!--EdX:
Omit this from EdX: can't refer to prior course
- See the [Gradient Descent lecture](Gradient_Descent.ipynb) in the Classical ML part of the course for more details
-->

Let's review Gradient Descent using our current notation


- We start with an initial guess for $\W$ and iteratively improve it.
- Compute the loss $\loss$ given the current $\W$
    - Average loss over the $m$ training examples
- Compute the gradient
$$
\frac{\partial \loss}{\partial \W}
$$
- Update $\W$ in the direction of the *negative* of the gradient
    - Scaled by a learning rate $\alpha$
$$
\W = \W - \alpha * \frac{\partial \loss}{\partial \W}
$$


To first order, a unit increase in an element of $\W$ *changes* $\loss$ by the corresponding element of $\frac{\partial \loss}{\partial \W}$
- That's why there is a negative sign: we proceed in the direction *opposite* the one that increases $\loss$
- We move only a fraction $\alpha$ (typically $\alpha \le 1$) of the negative gradient
    - To avoid the possibility of over-shooting the minimum
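
The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's code: the toy data, the squared-error loss, and the names `loss`, `gradient`, and `alpha` are all assumptions chosen to keep the example short.

```python
import numpy as np

# Toy data (assumed for illustration): the true relationship is y = 3 * x
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0]

def loss(W):
    """Average squared-error loss over all m examples."""
    return np.mean((X @ W - y) ** 2)

def gradient(W):
    """Derivative of the average loss with respect to each element of W."""
    m = len(y)
    return 2.0 / m * X.T @ (X @ W - y)

W = np.zeros(1)   # initial guess
alpha = 0.1       # learning rate
losses = []
for step in range(200):
    losses.append(loss(W))
    W = W - alpha * gradient(W)   # step in the direction opposite the gradient
```

Each iteration lowers the loss a little; with a much larger `alpha` the same loop can over-shoot the minimum and diverge.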

$\W$ is multi-dimensional (a matrix), not a scalar
- So the gradient is multi-dimensional
- We will formally discuss Matrix Gradients in a later lecture
    - For now: we compute the derivative with respect to each element of $\W$ and arrange the results in a matrix with the same shape as $\W$
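
One way to make "a derivative for each element, arranged in a matrix" concrete is a finite-difference approximation. The sketch below is illustrative only: `numerical_gradient` is a hypothetical helper, and the squared-error loss and random data are assumptions.

```python
import numpy as np

def numerical_gradient(loss_fn, W, eps=1e-6):
    """Approximate d(loss)/dW element by element using central differences."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus = W.copy()
        W_plus[idx] += eps
        W_minus = W.copy()
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, 3 features
Y = rng.normal(size=(5, 2))   # 5 examples, 2 targets
W = rng.normal(size=(3, 2))   # one column of weights per target

def loss_fn(W):
    """Mean squared error over all entries of the prediction matrix."""
    return np.mean((X @ W - Y) ** 2)

G = numerical_gradient(loss_fn, W)   # same shape as W
```

The result `G` has one entry per weight. In practice the gradient is computed analytically rather than by perturbing each weight, since this loop costs two loss evaluations per element.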


**Minibatch gradient descent**

The average loss $\loss$ is defined over all $m$ training examples.

- This can be expensive to compute when $m$ is large.
- The loss, and therefore its gradient, can be *approximated* by sampling from the $m$ training examples
    - Choose a *random subset (of size $m' \le m$)* of examples: $I = \{ i_1, \ldots, i_{m'} \}$
    - Approximate $\loss$ on $I$
$$
\loss \approx \frac{1}{|I|}\sum_{i \in I} \loss^\ip
$$
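
A small numerical sketch of this approximation follows. The per-example losses here are random stand-ins (an assumption for illustration), and `m_prime` plays the role of $m'$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m_prime = 10_000, 100

# Stand-in per-example losses; in practice these come from the model
per_example_loss = rng.exponential(scale=1.0, size=m)

# Exact average loss over all m examples
full_loss = per_example_loss.mean()

# Random subset I of size m' (sampled without replacement)
I = rng.choice(m, size=m_prime, replace=False)
approx_loss = per_example_loss[I].mean()
```

The estimate `approx_loss` uses only 1% of the examples; it is noisier than `full_loss`, and the noise shrinks as `m_prime` grows.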


*Minibatch gradient descent* divides the $m$ training examples into chunks (mini-batches) and *approximates* the gradient on each one

- The $m$ training examples are divided
- Into $b = m/m'$ disjoint batches of $m' \le m$ examples each

On each mini-batch, we approximate the gradient (using the loss on that mini-batch) and update our guess of $\W$

$$
\W = \W - \alpha * \frac{\partial \loss}{\partial \W}
$$

An **epoch** is defined as the processing of all $m$ examples (using $b$ batches of size $m'$)
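
As a sketch of how one epoch might be split into disjoint batches: `minibatch_indices` below is a hypothetical helper, not part of any library, and shuffling before splitting is an assumption (a common choice, but not one stated above).

```python
import numpy as np

def minibatch_indices(m, batch_size, rng):
    """Yield disjoint index arrays that together cover all m examples once."""
    perm = rng.permutation(m)               # shuffle the example indices
    for start in range(0, m, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(m=10, batch_size=5, rng=rng))
```

With $m = 10$ and $m' = 5$ this yields $b = 2$ batches. When $m$ is not a multiple of the batch size, the last batch is simply smaller.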


<div>
    <center><strong>Minibatch: Forward Pass<br>From minibatch to Loss</strong></center>
    <br>
    <!-- edX: Original: <img src=images/NN_Layers_Forward.png > replace by EdX created image -->
    <img src=images/Minibatch_fwrdpass_W8_L5_Sl12.png>
</div>

<div>
    <center><strong>Minibatch: Backwards Pass<br>From minibatch Loss to Gradient</strong></center>
    <br>
    <!-- edX: Original: <img src=images/NN_Layers_Backward.png > replace by EdX created image -->
    <img src="images/Minibatch_bkwrdpass_W8_L5_Sl13.png">
</div>

During one epoch, the weights $\W$ get updated $b$ times (once per mini-batch)
- Contrast to the single update when there is a single batch ($m' = m$)
- Training may converge faster, as updates occur more frequently

# The Training loop

Gradient Descent is an iterative process
- Iterate over multiple epochs
- Within an epoch
    - Iterate over mini-batches

This iterative process is called the *training loop*.

Here is some pseudo-code:
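
A minimal sketch of such a loop, written as runnable Python rather than pseudo-code. The linear model, the squared-error loss, and the names `train`, `alpha`, `batch_size`, and `n_epochs` are assumptions for illustration; a real Neural Network would replace the forward and backward steps.

```python
import numpy as np

def train(X, y, alpha=0.1, batch_size=20, n_epochs=50, seed=0):
    """Minibatch gradient descent for a linear model with squared-error loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = np.zeros(n)                              # initial guess
    history = []
    for epoch in range(n_epochs):                # iterate over epochs
        perm = rng.permutation(m)
        for start in range(0, m, batch_size):    # iterate over mini-batches
            idx = perm[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            y_hat = X_b @ W                      # forward pass: predictions
            grad = 2.0 / len(idx) * X_b.T @ (y_hat - y_b)   # backward pass
            W = W - alpha * grad                 # update on this mini-batch
        history.append(np.mean((X @ W - y) ** 2))   # full loss after the epoch
    return W, history

# Toy data (assumed): targets are an exact linear function of the inputs
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])
W, history = train(X, y)
```

The outer loop runs over epochs and the inner loop over mini-batches, so `W` is updated once per mini-batch.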


It used to be the case that this fairly standard training loop was coded for each problem.

Just as `sklearn` wrapped common code into a high-level API
- We will use a toolkit that hides the training loop behind a high-level API

# Scaling the inputs

Many times in this course we have pointed out that some models are *scale sensitive*.

Neural Networks are not *mathematically* scale sensitive but tend to be so *in practice*.

It is *highly recommended* to scale your data so that its absolute values are around 1.0, or at least somewhat small.

Gradient Descent is the root of the problem:

- Two features on different scales can cause the optimizer to favor one over the other
- Activations can *saturate*
    - The output of the dot product (Dense layer) falls in the "flat" area of the activation
    - Near-zero derivative: little or no learning
- The Loss may be large in initial epochs when the target values are too different from the dot products
    - Weights are typically initialized to values less than 1.0, leading to small dot products
    - *Large* gradients: unstable learning
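
Saturation is easy to see numerically. The sketch below assumes a sigmoid activation and a single weight `w`; both are illustrative choices, not taken from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

w = 0.5                          # a typical small initial weight (assumed)
x_small, x_large = 1.0, 1000.0   # the same feature on two different scales

d_small = sigmoid_derivative(w * x_small)   # input near the steep region
d_large = sigmoid_derivative(w * x_large)   # input deep in the flat region
```

For the unscaled input the derivative is numerically zero, so the gradient with respect to `w` vanishes and the weight barely moves.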
    

Remember: if you re-scale the inputs, you will need to invert the transformation when
communicating the results.
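
A minimal sketch of scaling and un-scaling with NumPy, assuming standardization (subtract the mean, divide by the standard deviation). The data and the helper `to_original_units` are hypothetical; the same statistics computed on the training data must be reused for new inputs.

```python
import numpy as np

# Hypothetical training data on large scales
X_train = np.array([[1000.0], [2000.0], [3000.0]])
y_train = np.array([100_000.0, 250_000.0, 400_000.0])

# Statistics are computed on the training data only
x_mu, x_sigma = X_train.mean(axis=0), X_train.std(axis=0)
y_mu, y_sigma = y_train.mean(), y_train.std()

X_scaled = (X_train - x_mu) / x_sigma
y_scaled = (y_train - y_mu) / y_sigma

# New inputs are scaled with the *training* statistics
X_new_scaled = (np.array([[2500.0]]) - x_mu) / x_sigma

def to_original_units(y_hat_scaled):
    """Invert the target scaling before reporting predictions."""
    return y_hat_scaled * y_sigma + y_mu
```

A model trained on `X_scaled` and `y_scaled` predicts in scaled units; `to_original_units` maps those predictions back to the scale of `y_train`.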

In [4]:
print("Done")

Done
