In [1]:
%run Latex_macros.ipynb

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

# Training neural networks: the nitty-gritty

We know that 
the weights $\W^*$ that minimize the average loss

$$
\W^* = \argmin{\W} L(\hat{\y},\y; \W)
$$

can be found by using Back Propagation
([see the earlier module](Training_Neural_Network_Backprop.ipynb))
to compute the derivatives needed by the Gradient Descent algorithm.
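
As a reminder, Gradient Descent uses these derivatives to update the weights iteratively. In the sketch below, the learning rate $\alpha$ and the step index $(t)$ are notation assumed here for illustration, not taken from the earlier module:

$$
\W_{(t+1)} = \W_{(t)} - \alpha \, \frac{\partial L}{\partial \W} \bigg|_{\W = \W_{(t)}}
$$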


So training a Neural Network sounds simple enough.

But we also recalled several "AI Winters" in which progress in Deep Learning stalled.

History indicates that Training is more complex than it appears.

We will illustrate the complexities via a running example
- Training a binary classifier
- On the dataset shown on the left
    - Two features: $x_0, x_1$
    - Targets: 0, 1 
- Using the Neural Network architecture shown on the right



<table>
    <tr>
        <td><center><strong>Data</strong></center></td>
        <td><center><strong>Neural Network</strong></center></td>
     </tr>
    <tr>
        <td>
            <img src="images/tnn_data.png">
        </td>
        <td>
            <img src="images/tnn_arch.png">
         </td>
    </tr>
</table>

## Effect of different Activation functions

Let's see the effect of choosing among ReLU, Sigmoid and Tanh activation functions
- Keeping the architecture identical
- but changing the activation functions of the `Dense` layers
    - the Classifier `Dense` layer always uses the sigmoid (as a classifier must)

We see a difference in Loss and Accuracy
- note the difference in the scale of the vertical axes between plots

<img src="images/tnn_loss_and_acc.png">
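
For reference, here is a minimal sketch of the three activation functions compared above, together with their derivatives, in plain Python. The function names are illustrative and are not taken from the code that produced the plots.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return math.tanh(z)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    # 0 for z < 0, 1 for z > 0; we pick 0 at z == 0 by convention
    return 1.0 if z > 0 else 0.0

# Compare the derivatives at a very negative, a zero, and a very positive input
for z in (-5.0, 0.0, 5.0):
    print(f"z={z:+.1f}  sigmoid'={d_sigmoid(z):.4f}  "
          f"tanh'={d_tanh(z):.4f}  relu'={d_relu(z):.1f}")
```

The derivative of the sigmoid never exceeds 0.25, and both the sigmoid and tanh derivatives approach zero for inputs of large magnitude. The ReLU derivative stays at 1 for every positive input. These differences matter because Back Propagation multiplies such derivatives together, layer by layer.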

## Effect of weight initialization

Training is the process of discovering optimal weights/parameters for the components of the Neural Network.
- updating an initial choice via Gradient Descent

Yet, we have not specified how to initialize the weights.

Does it matter ?

In the plots above, we saw the Loss and Accuracy when initializing weights
- according to a Random Normal distribution with mean 0 and unit variance

Here are the same plots when initializing weights
- to all zeros

<img src="images/tnn_loss_and_acc_zero_init.png">

The Loss and Accuracy of **all** the networks are notably worse than with the Random Normal initialization
- The Loss hardly changes with training: the network is not "learning"
- The Accuracy is not better than a coin flip
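
To see why an all-zero start can prevent learning, here is a minimal sketch of one Back Propagation step in plain Python. It assumes a hypothetical network with 2 inputs, 3 tanh hidden units, a sigmoid output and cross-entropy loss, with biases also initialized to zero. The architecture in the figure may differ, and all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(x, y, W1, b1, w2, b2):
    """Backprop for one example: tanh hidden layer, sigmoid output, cross-entropy loss."""
    # Forward pass
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)

    # Backward pass
    d_out = y_hat - y                    # dLoss / d(output pre-activation)
    g_w2 = [d_out * hj for hj in h]      # gradient for output-layer weights
    g_b2 = d_out                         # gradient for output-layer bias
    # dLoss / d(hidden pre-activation); tanh'(z) = 1 - tanh(z)^2
    d_hid = [d_out * w * (1.0 - hj ** 2) for w, hj in zip(w2, h)]
    g_W1 = [[d * xi for xi in x] for d in d_hid]
    g_b1 = d_hid
    return g_W1, g_b1, g_w2, g_b2

n_hidden = 3                             # hypothetical layer size
W1 = [[0.0, 0.0] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [0.0] * n_hidden
b2 = 0.0

g_W1, g_b1, g_w2, g_b2 = gradients(x=[0.7, -1.3], y=1, W1=W1, b1=b1, w2=w2, b2=b2)

print("hidden-layer weight gradients all zero:", all(g == 0 for row in g_W1 for g in row))
print("output-layer weight gradients all zero:", all(g == 0 for g in g_w2))
print("output bias gradient:", g_b2)
```

In this sketch only the output bias receives a nonzero gradient: the hidden activations are zero, so the output weights get no update, and the output weights are zero, so nothing flows back to the hidden layer. The same holds for a ReLU hidden layer. With a sigmoid hidden layer the hidden outputs are 0.5 rather than 0, so the output weights do receive a gradient, but every hidden unit receives the same update and the units remain copies of one another.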


## Understanding what happens during training

As we saw from our experiments
- choices that seem minor
    - activation function
    - weight initialization
- can have a major impact on the success of training a Neural Network.

We now spend some time investigating the causes of, and solutions to, the difficulty of training networks.

Broadly speaking the issues are
- Gradients vanishing (shrinking toward zero) or exploding (growing without bound), either of which inhibits learning (the weight updates in Gradient Descent)
- Proper scaling of the inputs
- Initialization of learnable weights
- Maintaining proper scaling of the inputs to every layer, not just the first

# Vanishing and Exploding Gradients

Although Backpropagation is mechanically simple, there are some mathematical subtleties.

Let's explore the problem of
[Vanishing and Exploding Gradients](Vanishing_and_Exploding_Gradients.ipynb)
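
As a rough intuition before following the link: the chain rule multiplies one local derivative per layer, so the gradient reaching an early layer is a product of many factors. The sketch below makes the simplifying assumption that every layer contributes the same scalar factor. The function name and the factors chosen are illustrative only.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Size of a gradient after passing back through n_layers,
    assuming each layer multiplies it by the same scalar factor."""
    return per_layer_factor ** n_layers

# 0.25 is the largest value the sigmoid derivative can take;
# 1.5 stands in for a layer that amplifies the gradient.
for n in (1, 5, 10, 20):
    print(f"layers={n:2d}  shrinking={gradient_scale(0.25, n):.3e}  "
          f"growing={gradient_scale(1.5, n):.3e}")
```

Factors below 1 drive the product toward zero as depth grows, while factors above 1 make it grow without bound.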

# Initializing and maintaining layer inputs

Neural Networks are sensitive to the scale of the layer inputs.

Creating the correct situation to learn is the subject of [Scaling and Initialization](Training_Neural_Networks_Scaling_and_Initialization.ipynb).
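
As one common example of input scaling, the sketch below standardizes a single feature to zero mean and unit standard deviation using only the Python standard library. The helper name and the sample values are hypothetical, and the linked notebook may use a different scheme.

```python
import statistics

def standardize(column):
    """Rescale feature values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        # A constant feature carries no information: map it to zeros
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

x0 = [1000.0, 2000.0, 3000.0, 4000.0]    # hypothetical raw feature on a large scale
x0_scaled = standardize(x0)

print("scaled values:", [round(v, 3) for v in x0_scaled])
print("mean:", round(statistics.fmean(x0_scaled), 6))
print("std :", round(statistics.pstdev(x0_scaled), 6))
```

In practice the mean and standard deviation would be computed on the training set only and then reused for the test set.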


# Improving trainability

Apart from the mathematical issues of preventing activations and gradients from exploding/vanishing, there are many ways to make training successful.

Let's explore techniques for [Improving trainability](Training_Neural_Networks_Tweaks.ipynb).

# How big should my NN be ?

There is a paradox in building Neural Networks:
- Start off training an overly large NN (many units)
- Many units turn out to be "dead": their weights are near zero
- Reduce the number of units
- The smaller NN can't be trained!

Given a fixed number of layers, it is easier to train a big NN than a small one.

"Somewhere in this big mess must be something valuable"


<table>
    <tr>
        <th><center>"Big" NN -- some nodes are dead</center></th>
    </tr>
    <tr>
        <td><img src="images/Dropout_NN_wo_dropout.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>"Big" NN after dead nodes have been pruned</center></th>
    </tr>
    <tr>
        <td><img src="images/Dropout_NN_w_dropout.png"></td>
    </tr>
</table>


[The Lottery Ticket Hypothesis](https://arxiv.org/abs/1803.03635)
is an interesting paper that addresses this issue.

For now:
- Use bigger-than-necessary NNs
- With regularization to "prune" them
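
A minimal sketch of this idea in plain Python follows. It assumes a hypothetical layout in which `W[j]` holds the weights of unit `j`, uses an L1 penalty as one regularizer commonly used to push weights toward zero, and treats a unit as "dead" when all its weights fall below an arbitrary tolerance. None of these names come from the notebook.

```python
def l1_penalty(W, strength=0.01):
    """L1 regularization term that would be added to the loss during training."""
    return strength * sum(abs(w) for row in W for w in row)

def dead_units(W, tol=1e-3):
    """Indices of units whose weights are all within tol of zero."""
    return [j for j, row in enumerate(W) if all(abs(w) < tol for w in row)]

def prune(W, tol=1e-3):
    """Return the weights with the dead units removed."""
    dead = set(dead_units(W, tol))
    return [row for j, row in enumerate(W) if j not in dead]

# Hypothetical trained weights for a layer with 4 units and 2 inputs each
W = [
    [0.8, -1.2],
    [1e-5, -2e-4],   # near-zero weights: "dead"
    [0.0, 0.0],      # exactly zero: "dead"
    [0.3, 0.0],      # one live weight, so the unit is kept
]

print("dead units:", dead_units(W))
print("units kept:", len(prune(W)))
```

The tolerance is a judgment call: too large a value removes units that still contribute, too small a value removes nothing.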


# Conclusion

We sometimes take training Neural Networks for granted.

After all, Gradient Descent seems like a simple procedure.

It turns out that there are *many* subtleties.

Uncovering and solving these subtle problems was key to the recent rapid advance
of Deep Learning.

Without these solutions, we'd still be living in an "AI Winter".

In [3]:
print("Done")

Done
