<h1>Improving the way neural networks learn</h1>

In this chapter I explain a suite of techniques which can be used to improve on our vanilla implementation of backpropagation, and so improve the way our networks learn.

When using the quadratic cost function our neural networks have difficulty learning when they're badly wrong--which can make learning very slow. We want a way to avoid such slowdowns.

<h2>Introducing the cross-entropy cost function</h2>
For a single sigmoid neuron with output activation $a$ and desired output $y$, we define the cross-entropy cost function by

$$
\begin{eqnarray} 
  C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right],
\tag{57}\end{eqnarray}
$$

The cross-entropy is non-negative, and tends toward zero as the neuron gets better at computing the desired output, $y$, for all training inputs, $x$. Unlike the quadratic cost, its gradient grows with the error in the output, so the larger the error, the faster the neuron learns.
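To see why a larger error means faster learning, here is a minimal sketch for a single sigmoid neuron with one weight and one training input. For the cross-entropy, $\partial C / \partial w = x(a - y)$, whereas the quadratic cost carries an extra factor $\sigma'(z) = a(1-a)$ that vanishes when the neuron saturates. The function names and the values $x = 1$, $y = 0$ are assumptions chosen for illustration; they are not part of `network2`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(a, y):
    """Cross-entropy cost for a single training example."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def grad_w_quadratic(z, x, y):
    """dC/dw for the quadratic cost C = (a - y)^2 / 2."""
    a = sigmoid(z)
    return (a - y) * a * (1 - a) * x   # includes sigma'(z) = a(1 - a)

def grad_w_cross_entropy(z, x, y):
    """dC/dw for the cross-entropy cost: the sigma'(z) factor cancels."""
    a = sigmoid(z)
    return (a - y) * x

x, y = 1.0, 0.0   # illustrative input and desired output
for z in (0.5, 2.0, 5.0):   # larger z means the neuron is more badly wrong
    print("z=%.1f  quadratic=%.4f  cross-entropy=%.4f"
          % (z, grad_w_quadratic(z, x, y), grad_w_cross_entropy(z, x, y)))
```

As $z$ grows the neuron's output moves further from $y = 0$: the quadratic gradient shrinks while the cross-entropy gradient keeps growing.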

In [11]:
import sys
# Raw string, so that "\n" in the Windows path is not read as a newline
sys.path.insert(1, r'E:\Github\neural_networks_and_deep_learning\src')

import network2
import mnist_loader

In [15]:
"""
mnist_loader
~~~~~~~~~~~~
"""
#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    f = gzip.open('data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

In [17]:
training_data, validation_data, test_data = load_data_wrapper()

net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)

Epoch 0 training complete
Accuracy on evaluation data: 9046 / 10000

Epoch 1 training complete
Accuracy on evaluation data: 9044 / 10000

Epoch 2 training complete
Accuracy on evaluation data: 9355 / 10000

Epoch 3 training complete
Accuracy on evaluation data: 9333 / 10000

Epoch 4 training complete
Accuracy on evaluation data: 9333 / 10000

Epoch 5 training complete
Accuracy on evaluation data: 9371 / 10000

Epoch 6 training complete
Accuracy on evaluation data: 9422 / 10000

Epoch 7 training complete
Accuracy on evaluation data: 9439 / 10000

Epoch 8 training complete
Accuracy on evaluation data: 9469 / 10000

Epoch 9 training complete
Accuracy on evaluation data: 9478 / 10000

Epoch 10 training complete
Accuracy on evaluation data: 9474 / 10000

Epoch 11 training complete
Accuracy on evaluation data: 9496 / 10000

Epoch 12 training complete
Accuracy on evaluation data: 9502 / 10000

Epoch 13 training complete
Accuracy on evaluation data: 9487 / 10000

Epoch 14 training complete
Acc

([],
 [9046,
  9044,
  9355,
  9333,
  9333,
  9371,
  9422,
  9439,
  9469,
  9478,
  9474,
  9496,
  9502,
  9487,
  9513,
  9496,
  9497,
  9474,
  9490,
  9499,
  9517,
  9500,
  9506,
  9493,
  9533,
  9430,
  9514,
  9520,
  9508,
  9505],
 [],
 [])

<h2>Softmax</h2>
The idea of softmax is to define a new type of output layer for our neural networks. We don't apply the sigmoid function to get the output. Instead, we apply the so-called softmax function to the output layer.

According to this function, the activation $a^L_j$ of the $j^{\rm th}$ output neuron is

$$
\begin{eqnarray} 
  a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}},
\tag{78}\end{eqnarray}
$$

where in the denominator we sum over all the output neurons.

The softmax function makes the output layer behave like a probability distribution. All the output activations are positive and sum to one, and when the weighted input to one output neuron increases, the activations of all the other output neurons decrease. This gives us a convenient way to interpret the output activations of the network.
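A minimal NumPy sketch of equation (78) follows. The function name `softmax` and the example inputs are made up for illustration, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array of weighted inputs z^L_j."""
    shifted = z - np.max(z)      # stability trick; cancels in the ratio
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([2.5, -1.0, 3.2, 0.5])   # illustrative weighted inputs
a = softmax(z)

z_bumped = z.copy()
z_bumped[0] += 1.0                    # increase the input to neuron 0 only
a_bumped = softmax(z_bumped)

print(a, a.sum())
print(a_bumped, a_bumped.sum())
```

Comparing `a` with `a_bumped` shows the behaviour described above: the first activation goes up, all the others go down, and the total stays at one.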

https://machinelearningmastery.com/softmax-activation-function-with-python/

<h2>Overfitting and regularization</h2>

If we train for too long, our network keeps improving on the training data but no longer generalizes to the test data, so what it learns is no longer useful. In this case, we say the network is **overfitting** or **overtraining**.

Looking at this graph:

![Overfitting](imgs/overfit_2.PNG)

We might believe that our model is getting continuously better. But when we look at this figure:

![Overfitting](imgs/overfit_1.PNG)

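The accuracy stops improving around epoch 280, so the apparent improvement after that point is an illusion: the network is fitting peculiarities of the training data rather than learning to generalize.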

Usually, when the accuracy stops improving, we should stop training, which helps us avoid overfitting. We measure the accuracy on the validation data at the end of each epoch and stop once it no longer improves by some minimum amount. This method is called **early stopping**.
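One simple way to make "stops improving" precise is a "no improvement in the last few epochs" rule. The sketch below is an assumption about how that rule could be coded, not the `network2` implementation: `early_stopping_epoch` and `patience` are hypothetical names, and the accuracies are made-up numbers.

```python
def early_stopping_epoch(accuracies, patience=10):
    """Return the epoch at which training would stop.

    Training stops at the first epoch where the best validation accuracy
    seen so far has not been beaten for `patience` consecutive epochs.
    If that never happens, the last epoch is returned.
    """
    best = float("-inf")
    epochs_since_best = 0
    for epoch, accuracy in enumerate(accuracies):
        if accuracy > best:
            best = accuracy
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            return epoch
    return len(accuracies) - 1

validation_accuracies = [90, 92, 93, 93, 92, 93, 91, 94]   # made-up values
stop_at = early_stopping_epoch(validation_accuracies, patience=3)
print("stop after epoch", stop_at)
```

A small `patience` stops early but can miss later gains (here the jump to 94 in the final epoch); a larger value waits longer at the cost of extra training time.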

The **validation data set** is a set of examples held back from training and used to tune hyper-parameters. We don't want to tune hyper-parameters with the test data or the training data, because the chosen values may simply fit peculiarities of that data. Using a separate validation data set reduces this bias.

<h3>Regularization</h3>

We can reduce overfitting with **regularization**, specifically with **L2 regularization**. The idea is that we add an extra term to the cost function, a term called the regularization term. The regularized cross-entropy function would be:

$$
\begin{eqnarray} C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j+(1-y_j) \ln
(1-a^L_j)\right] + \frac{\lambda}{2n} \sum_w w^2.
\tag{85}\end{eqnarray}
$$

We've added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor $\lambda/2n$, where $\lambda>0$ is known as the **regularization parameter**, and $n$ is, as usual, the size of our training set.

Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function.
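One way to see why regularization favours small weights is to look at a single gradient descent step. Writing $C_0$ for the unregularized cost and $\eta$ for the learning rate (both symbols are introduced here), the regularization term contributes $\frac{\lambda}{n} w$ to the gradient, so the update becomes $w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \, \partial C_0 / \partial w$. The helper below is a hypothetical sketch with illustrative parameter values, not the `network2` code.

```python
import numpy as np

def l2_regularized_step(w, grad_c0, eta, lmbda, n):
    """One gradient descent step on C = C0 + (lmbda / (2 * n)) * sum(w ** 2).

    w        -- array of weights
    grad_c0  -- gradient of the unregularized cost C0 with respect to w
    """
    weight_decay = 1.0 - eta * lmbda / n   # shrinks every weight toward zero
    return weight_decay * w - eta * grad_c0

w = np.array([2.0, -3.0, 0.5])
no_gradient = np.zeros_like(w)   # pretend C0 is flat, to isolate the decay
w_next = l2_regularized_step(w, no_gradient, eta=0.5, lmbda=5.0, n=50000)
print(w_next)
```

With no gradient from $C_0$, every weight is simply rescaled toward zero, which is why a weight stays large only if the first part of the cost keeps pushing it there.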

Looking at the same example as last time, but now with regularization, we see that the cost decreases in much the same way as before:

![Overfitting](imgs/overfit_3.PNG)

But the accuracy now keeps increasing over the entire 400 epochs, which suggests that regularization has suppressed overfitting.

![Overfitting](imgs/overfit_4.PNG)

<h2>Other techniques for regularization</h2>

<h3>Dropout</h3>
We randomly (and temporarily) delete half the hidden neurons in the network, while leaving the input and output neurons untouched.

We train the remaining hidden neurons on a mini-batch, then restore the dropped neurons. We select a new random subset of hidden neurons to delete for the next mini-batch and repeat the process.

This is roughly like training many different networks and then averaging their outputs. The averaging reduces overfitting, and in practice dropout tends to improve accuracy.
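The sketch below shows only the masking step, under stated assumptions: `dropout_mask` is a hypothetical helper, the activations are random placeholders, and the forward and backward passes that would use the thinned layer are omitted.

```python
import numpy as np

def dropout_mask(n_hidden, rng):
    """Return an (n_hidden, 1) mask with half of the entries set to 0."""
    mask = np.ones((n_hidden, 1))
    dropped = rng.permutation(n_hidden)[: n_hidden // 2]   # neurons to delete
    mask[dropped] = 0.0
    return mask

rng = np.random.RandomState(0)
hidden_activations = rng.rand(30, 1)   # made-up activations of 30 hidden neurons

for mini_batch in range(2):
    mask = dropout_mask(30, rng)            # a fresh subset for every mini-batch
    thinned = hidden_activations * mask     # dropped neurons output 0
    print("mini-batch %d: %d of 30 hidden neurons active"
          % (mini_batch, int(mask.sum())))
```

Because only half the hidden neurons are active during training, the weights leaving the hidden layer are typically halved when the full network is finally run.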

<h3>Artificially expanding the training data</h3>
Getting more training data is a good way to improve a network, but it can be expensive. We can instead modify the data we already have to create additional training examples.

For example, we can rotate the handwritten digits by a few degrees and add the modified digits to the training data.
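As a sketch of the same idea, the code below expands a data set by shifting each image one pixel in each direction. Note that this swaps in translation for the rotation mentioned above, because a shift needs only NumPy. The helper names are hypothetical, and the label is kept as a plain integer for brevity.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image by dx columns and dy rows, filling gaps with 0."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    source = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = source
    return shifted

def expand_training_data(training_data):
    """Return the original (x, y) pairs plus four shifted copies of each."""
    expanded = []
    for x, y in training_data:
        expanded.append((x, y))
        image = np.reshape(x, (28, 28))
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            moved = shift_image(image, dx, dy)
            expanded.append((np.reshape(moved, (784, 1)), y))
    return expanded

# A made-up "digit": a single bright pixel, labelled 7.
image = np.zeros((28, 28))
image[5, 5] = 1.0
training_data = [(np.reshape(image, (784, 1)), 7)]

expanded = expand_training_data(training_data)
print(len(training_data), "->", len(expanded), "examples")
```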

<h3>An aside on big data and what it means to compare classification accuracies</h3>

Suppose we have two algorithms A and B. Sometimes algorithm A will outperform algorithm B with one set of training data. But does that make A a better algorithm than B?

Sometimes, when we change the data, algorithm B will outperform algorithm A. Usually, we look for both better algorithms and better training data.



<h2>Weight Initialization</h2>

As described in chapter 1, we have initialized weights and biases using independent Gaussian random variables, normalized to have mean 0 and standard deviation 1. The distribution of the weighted sum $z = \sum_j w_j x_j + b$ will look like:

![Distribution](imgs/distribution_2.PNG)

But with this method, $|z|$ is often large, that is, $z \gg 1$ or $z \ll -1$. The output $\sigma(z)$ of the hidden neuron is then very close to 1 or 0, so the neuron saturates and learning will be very slow.

Choosing a different cost function helps decrease saturation in the output neurons, but it does nothing for saturated hidden neurons.

Instead, we will initialize the weights as Gaussian random variables with mean 0 and standard deviation $1/\sqrt{n_{\rm in}}$, where $n_{\rm in}$ is the number of input weights to the neuron. The weighted sum $z = \sum_j w_j x_j + b$ will again be Gaussian with mean 0, but it'll be much more sharply peaked than before.

![Distribution](imgs/distribution_1.PNG)
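The difference between the two distributions can be checked numerically. The setup below is an assumption made for illustration: a neuron with $n_{\rm in} = 1000$ inputs, half of them set to 1 and the rest to 0, sampled over 2000 random initializations.

```python
import numpy as np

rng = np.random.RandomState(0)

n_in = 1000
x = np.zeros(n_in)
x[: n_in // 2] = 1.0                  # half of the inputs are switched on
trials = 2000

b = rng.randn(trials)                 # biases: mean 0, standard deviation 1

# Old method: weights with standard deviation 1
w_old = rng.randn(trials, n_in)
z_old = w_old.dot(x) + b

# New method: weights with standard deviation 1 / sqrt(n_in)
w_new = rng.randn(trials, n_in) / np.sqrt(n_in)
z_new = w_new.dot(x) + b

print("std of z, old initialization: %.2f" % z_old.std())
print("std of z, new initialization: %.2f" % z_new.std())
```

Under these assumptions the theoretical values are $\sqrt{501} \approx 22.4$ for the old method and $\sqrt{3/2} \approx 1.22$ for the new one, so far fewer neurons start out saturated.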

The initialization procedure for biases doesn't really matter, so we'll continue to use the old method.

With the new method, we get the following classification accuracy:

![Classification Accuracy](imgs/accuracy.PNG)

Both methods end up at about the same accuracy, but the network learns faster with the new initialization. In other cases, better weight initialization also results in a higher final classification accuracy.

<h2>How to choose a neural network's hyper-parameters</h2>

<h3>Broad Strategy</h3>

Our first goal should be to create a network that can achieve better results than chance. We can start by doing the following:

- Strip the problem down to its simplest form, for example distinguishing 0s from 1s instead of all ten digits.
- Strip the network down to the simplest form that can still achieve meaningful results.
- Increase the frequency of monitoring (say, every 1,000 images instead of every 50,000).

The feedback will be noisier, but we get it much faster. Now we can experiment with hyper-parameters, and when we get a signal that something is working, we can carry it over to the full problem.

<h3>Learning Rate</h3>

1. Find the order of magnitude of the learning rate at which the cost decreases during the first few epochs.
2. Find the largest value at which the cost still decreases during the first few epochs. This is the threshold learning rate.
3. Experiment with values below this threshold but of the same order of magnitude.

The learning rate mainly controls the step size in gradient descent and only incidentally affects the final classification accuracy, so we can tune it by monitoring the training cost and do not need the validation data set.
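The search can be sketched on a toy problem. Everything here is an assumption made for illustration: the cost is the one-dimensional quadratic $C(w) = 2w^2$ rather than a network's cost, "the first few epochs" is replaced by five gradient steps, and `cost_decreases` is a hypothetical helper.

```python
def cost_decreases(eta, steps=5, curvature=4.0):
    """True if gradient descent on C(w) = curvature * w^2 / 2 lowers the
    cost on every one of the first `steps` steps, starting from w = 1."""
    w = 1.0
    cost = 0.5 * curvature * w ** 2
    for _ in range(steps):
        w -= eta * curvature * w              # gradient of C is curvature * w
        new_cost = 0.5 * curvature * w ** 2
        if new_cost >= cost:
            return False
        cost = new_cost
    return True

# Step 1: the largest order of magnitude at which the cost still decreases
candidates = [10.0 ** k for k in range(-4, 2)]      # 0.0001, ..., 10
magnitude = max(eta for eta in candidates if cost_decreases(eta))

# Step 2: the largest multiple of that magnitude at which the cost decreases
threshold = max(m * magnitude for m in range(1, 10)
                if cost_decreases(m * magnitude))

# Step 3: try values below the threshold, e.g. half of it
print("magnitude:", magnitude, "threshold:", threshold, "try:", threshold / 2)
```

For a real network, `cost_decreases` would train for a few epochs and compare training costs, which is much more expensive but follows the same pattern.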

<h3>Use early stopping to determine the number of training epochs</h3>

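Early stopping means terminating training once the classification accuracy on the validation data has failed to improve by a certain amount over a certain number of epochs. The number of training epochs is then set automatically and no longer needs to be chosen by hand.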

<h3>Learning rate schedule</h3>

We can start with a large learning rate and decrease it when the validation accuracy starts to get worse. A variable learning rate schedule can improve performance, but it adds more hyper-parameters to tune.
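A minimal sketch of one such schedule follows. The specific rule is an assumption: the learning rate is halved whenever the validation accuracy has not improved for `patience` epochs. The function and parameter names are hypothetical, and the accuracies are made up.

```python
def scheduled_learning_rates(accuracies, eta0, patience=2, factor=0.5):
    """Return the learning rate used in each epoch.

    `accuracies` holds the validation accuracy measured after each epoch.
    The rate is multiplied by `factor` each time the best accuracy has not
    been beaten for `patience` consecutive epochs.
    """
    eta = eta0
    best = float("-inf")
    epochs_since_best = 0
    etas = []
    for accuracy in accuracies:
        etas.append(eta)                 # rate used during this epoch
        if accuracy > best:
            best = accuracy
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                eta *= factor            # decay, then start counting again
                epochs_since_best = 0
    return etas

validation_accuracies = [90, 92, 91, 91, 93, 92, 92, 92]   # made-up values
print(scheduled_learning_rates(validation_accuracies, eta0=0.5))
```

Both `patience` and `factor` are exactly the kind of extra hyper-parameters mentioned above.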

<h2>Other models of artificial neuron</h2>

We can use

$$
\begin{eqnarray}
 \tanh(w \cdot x+b), 
\tag{109}\end{eqnarray}
$$

as the output of a neuron instead of the sigmoid function. This is a tanh neuron, and its output ranges from $-1$ to $1$ rather than from $0$ to $1$.

We can also use a rectified linear neuron (rectified linear unit), whose output is

$$
\begin{eqnarray}
  \max(0, w \cdot x+b).
\tag{112}\end{eqnarray}
$$

<h2>On stories of neural networks</h2>

We don't always have rigorous mathematical proofs of why certain techniques in neural networks work. Neural networks push at the limits of human understanding, and it may take decades before we understand them fully.