# Introduction to Machine Learning and Deep Learning
### Neural network training

#### Optimisers

There are two main steps to training neural networks:

1. Computation of the (stochastic) gradient of the loss function with respect to the model parameters
2. Use of the computed gradient to update the parameters

In step 1, the gradients of the loss with respect to the parameters can be computed efficiently using the backpropagation algorithm. For step 2, several popular gradient-based optimisation algorithms are used in deep learning, many of which are available as built-in optimisers in Keras.
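The two steps can be seen in a minimal sketch of minibatch stochastic gradient descent. To keep it self-contained, the example assumes a linear model with a hand-written gradient in place of a neural network with backpropagation; the data, the learning rate, the batch size and the helper names `mse_loss` and `mse_gradient` are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 3x - 1 plus a little noise (illustrative values).
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] - 1.0 + rng.normal(0, 0.05, size=200)

def mse_loss(params, xb, yb):
    """Mean squared error of the linear model w * x + b on a batch."""
    w, b = params
    return np.mean((w * xb[:, 0] + b - yb) ** 2)

def mse_gradient(params, xb, yb):
    """Gradient of the batch MSE with respect to (w, b)."""
    w, b = params
    residual = w * xb[:, 0] + b - yb
    return np.array([2 * np.mean(residual * xb[:, 0]), 2 * np.mean(residual)])

params = np.zeros(2)   # (w, b)
learning_rate = 0.1
batch_size = 20
initial_loss = mse_loss(params, X, y)

for epoch in range(50):
    order = rng.permutation(len(X))          # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = mse_gradient(params, X[idx], y[idx])   # step 1: stochastic gradient
        params = params - learning_rate * grad        # step 2: parameter update

final_loss = mse_loss(params, X, y)
print(params, initial_loss, final_loss)
```

In Keras, both steps are handled inside `model.fit`, with the update rule of step 2 determined by the chosen optimiser.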

#### Weight regularisation and early stopping

Deep learning models are typically heavily over-parameterised, often with millions of parameters spread over many layers. They are universal approximators (see e.g. [Cybenko](#Cybenko89) for the large width case, or [Lu et al.](#Lu17) for the large depth case), so they have the capacity to fit noise in the training data, and overfitting can be a problem. It is therefore important to regularise neural networks during training.

**$\mathcal{l}^2$ and $\mathcal{l}^1$ regularisation.** Recall that for a linear model of the form

$$
f(\mathbf{x}) = \sum_j w_j \phi_j(\mathbf{x}),
$$

a typical regularisation is to add a sum-of-squares penalty term to the loss to discourage the weights $w_j$ from growing too large. In this case, the regularised loss takes the form


$$
L(\mathbf{w}, \alpha_2) = L_0(\mathbf{w}) + \alpha_2 \sum_i w_i^2,
$$

where $L_0$ is the unregularised loss function, and $\alpha_2$ is a regularisation hyperparameter. This is $\mathcal{l}^2$ regularisation.

This form of regularisation is often referred to as **weight decay**, although the two terms are technically not the same. Weight decay ([Hanson & Pratt](#Hanson88)) is defined as a modification to the update rule, rather than to the loss function itself:

$$
\theta_{t+1} \leftarrow (1 - \lambda)\theta_t - \eta g_t,
$$

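where $\theta\in\mathbb{R}^p$ are the model parameters, $\lambda$ and $\eta$ are hyperparameters (the decay rate and the learning rate respectively), and $g_t$ is the update computed from the $t$-th minibatch. In the case of stochastic gradient descent, the update is $g_t = \nabla_\theta L_0(\theta_t; \mathcal{D}_m)$, the gradient of the unregularised loss on a minibatch $\mathcal{D}_m$, and the two formulations are equivalent: a gradient step on the $\mathcal{l}^2$-regularised loss gives $\theta_{t+1} = \theta_t - \eta(g_t + 2\alpha_2\theta_t) = (1 - 2\eta\alpha_2)\theta_t - \eta g_t$, which is the weight decay update with $\lambda = 2\eta\alpha_2$. However, this is not the case for all gradient-based optimisers commonly used in deep learning.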
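The equivalence for plain stochastic gradient descent can be checked numerically by taking $\lambda = 2\eta\alpha_2$. In the sketch below, `grad_L0` is a random stand-in for the minibatch gradient of the unregularised loss, and the values of `eta` and `alpha_2` are arbitrary; none of these names come from a library.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = rng.normal(size=5)      # current parameters theta_t
grad_L0 = rng.normal(size=5)    # stand-in for the minibatch gradient of L_0
eta, alpha_2 = 0.1, 0.01        # learning rate and l2 coefficient (arbitrary)

# (a) SGD step on the l2-regularised loss L_0 + alpha_2 * sum(theta**2):
#     the penalty contributes 2 * alpha_2 * theta to the gradient.
step_l2 = theta - eta * (grad_L0 + 2 * alpha_2 * theta)

# (b) Weight decay step with lambda = 2 * eta * alpha_2.
lam = 2 * eta * alpha_2
step_weight_decay = (1 - lam) * theta - eta * grad_L0

print(np.allclose(step_l2, step_weight_decay))
```

The two updates agree here because plain SGD applies the gradient directly. When an optimiser rescales or accumulates the gradient before applying it, the penalty gradient in (a) is transformed along with it, whereas the decay term in (b) is not.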

An alternative weight regularisation is $\mathcal{l}^1$ regularisation, in which the sum of the absolute values of the weights is added to the loss term:

$$
L(\mathbf{w}, \alpha_1) = L_0(\mathbf{w}) + \alpha_1 \sum_i |w_i|.
$$

This form of regularisation encourages sparsity in the weights. Both $\mathcal{l}^1$ and $\mathcal{l}^2$ regularisation discourage the weights from growing too large, which restricts the capacity of the network.

It is also possible to add a weighted combination of both $\mathcal{l}^2$ and $\mathcal{l}^1$ regularisation to the loss function.
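As a rough illustration, the penalty terms and their (sub)gradients can be written in a few lines of NumPy. The helper names `combined_penalty` and `combined_penalty_grad` and the example values are hypothetical, chosen only for this sketch.

```python
import numpy as np

def combined_penalty(w, alpha_1=0.0, alpha_2=0.0):
    """Weighted sum of the l1 and l2 penalties on the weights w."""
    return alpha_1 * np.sum(np.abs(w)) + alpha_2 * np.sum(w ** 2)

def combined_penalty_grad(w, alpha_1=0.0, alpha_2=0.0):
    """(Sub)gradient of the penalty; np.sign(0) = 0 is used at w_i = 0."""
    return alpha_1 * np.sign(w) + 2 * alpha_2 * w

w = np.array([2.0, -0.5, 0.01, 0.0])

penalty = combined_penalty(w, alpha_1=0.1, alpha_2=0.01)
grad_l1 = combined_penalty_grad(w, alpha_1=0.1)    # l1 only
grad_l2 = combined_penalty_grad(w, alpha_2=0.01)   # l2 only

print(penalty)
print(grad_l1)
print(grad_l2)
```

The $\mathcal{l}^1$ gradient has the same magnitude for every non-zero weight, however small, so small weights keep being pushed towards zero. The $\mathcal{l}^2$ gradient is proportional to the weight itself, so it shrinks large weights strongly but has little effect on weights that are already small.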

**Early stopping**. You might have found in the last week that it is difficult to set a good number of epochs to train for ahead of time. In toy examples the training is usually quick so it is not a problem to experiment, but in many cases training could take hours or days (or even longer!) and so this is not an option. 

Recall that deep learning models are usually vastly overparameterised, and have the capacity to drastically overfit. A simple but effective remedy is to stop the training before the model starts to overfit.

With early stopping, the aim is to stop the training when the validation error is at a minimum. This means that the model needs to be regularly evaluated on a held-out validation set (that is not used for training), and the optimisation routine is terminated when the validation error starts to rise. Validation is normally performed once per epoch in the training run.

In practice, the validation error measurements will be noisy, so stopping as soon as the validation error first increases is not reliable. What is usually done is to save a model checkpoint periodically (say once per epoch) and to set a **patience** threshold: the maximum number of consecutive validation runs allowed in which the validation error does not improve upon the best score so far. If this patience threshold is reached, the training is terminated.

The early stopping algorithm is outlined in pseudocode below.

Early stopping inputs: `val_metric`, `max_patience`

-------
>```
>best_valid_loss = np.inf
>patience = 0
>
>for epoch in range(max_epochs):
>    epoch_train_loss = train_model(train_data, train_loss)
>    epoch_valid_loss = validate_model(valid_data, val_metric)
>    if epoch_valid_loss < best_valid_loss:
>        best_valid_loss = epoch_valid_loss
>        save_model(epoch)
>        patience = 0
>    else:
>        patience += 1
>    
>    if patience >= max_patience:
>        break  # terminate training
>```

-------

It is also possible to validate the model using a measure that is different to the loss function used for training the model. Therefore `val_metric` is also an input to the early stopping algorithm above.
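The pseudocode can be made concrete with a small runnable sketch. It assumes a degree-9 polynomial model trained by full-batch gradient descent on the MSE loss, with the mean absolute error playing the role of `val_metric`; the helpers `features`, `train_one_epoch` and `mae`, the data and all hyperparameter values are illustrative, and copying the weights stands in for saving a checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, degree=9):
    """Polynomial features 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

# Noisy sine data, split into a small training set and a validation set.
x = rng.uniform(-1, 1, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)
phi_train, y_train = features(x[:20]), y[:20]
phi_valid, y_valid = features(x[20:]), y[20:]

def train_one_epoch(w, lr=0.05):
    """One full-batch gradient descent step on the training MSE."""
    residual = phi_train @ w - y_train
    return w - lr * 2 * phi_train.T @ residual / len(y_train)

def mae(w, phi, targets):
    """Validation metric: mean absolute error."""
    return np.mean(np.abs(phi @ w - targets))

max_epochs, max_patience = 5000, 50
w = np.zeros(phi_train.shape[1])
best_valid_loss, best_w, best_epoch = np.inf, w.copy(), -1
patience = 0

for epoch in range(max_epochs):
    w = train_one_epoch(w)
    epoch_valid_loss = mae(w, phi_valid, y_valid)
    if epoch_valid_loss < best_valid_loss:
        best_valid_loss = epoch_valid_loss
        best_w, best_epoch = w.copy(), epoch   # "checkpoint" the best weights
        patience = 0
    else:
        patience += 1

    if patience >= max_patience:
        break  # terminate training

print(f"stopped at epoch {epoch}, best epoch {best_epoch}, "
      f"best validation MAE {best_valid_loss:.4f}")
```

After the loop, `best_w` rather than `w` would be used for prediction, which mirrors restoring the saved checkpoint.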

### Keras regularisers, Dropout layers and callbacks

In this notebook we will build on what we have covered already with the `Sequential` API, and include weight regularisers, `Dropout` layers, and introduce callback objects. These are very useful objects for dynamically performing operations during the training run; an example is the `EarlyStopping` callback.

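```python
import numpy as np

def target_f(x):
    """Noise-free target function."""
    return np.sin(2 * np.pi * x)

np.random.seed(16)
# 100 inputs drawn uniformly from [-1, 1), with Gaussian noise added to the targets
X = np.random.rand(100, 1) * 2 - 1
y = target_f(X) + np.random.normal(0, 0.1, size=X.shape)
```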

```python
from sklearn.model_selection import train_test_split

# test_size=0.9 holds out 90 of the 100 samples, leaving only 10 for training
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.9)
```

```python
from keras.models import Sequential
from keras.layers import Input, Dense

model = Sequential([
    Input(shape=[x_train.shape[1]]),
    Dense(100, activation='relu'),
    Dense(100, activation='relu'),
    Dense(100, activation='relu'),
    Dense(y_train.shape[1])
])
model.summary()
```

```python
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
# The held-out split (x_test, y_test) is used as the validation set here
history = model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test), verbose=0)
```

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3))

ax1.plot(history.history['loss'], label='train')
ax1.plot(history.history['val_loss'], label='val')
ax1.set_xlabel("Epoch")
ax1.set_ylabel("MSE Loss")
ax1.legend()

ax2.plot(history.history['mae'], label='train')
ax2.plot(history.history['val_mae'], label='val')
ax2.set_xlabel("Epoch")
ax2.set_ylabel("MAE")
ax2.legend()

plt.show()
```

### Dropout & weight regularisation

```python
from keras.layers import Dropout
from keras.regularizers import l2

# l2(1e-3) adds 1e-3 * sum(kernel ** 2) to the loss for that layer's weights;
# Dropout(0.1) zeroes 10% of the incoming activations during training
model = Sequential([
    Input(shape=[x_train.shape[1]]),
    Dense(100, activation='relu', kernel_regularizer=l2(1e-3)),
    Dropout(0.1),
    Dense(100, activation='relu', kernel_regularizer=l2(1e-3)),
    Dropout(0.1),
    Dense(100, activation='relu'),
    Dense(y_train.shape[1])
])
```

```python
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history = model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test), verbose=0)
```

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3))

ax1.plot(history.history['loss'], label='train')
ax1.plot(history.history['val_loss'], label='val')
ax1.set_xlabel("Epoch")
ax1.set_ylabel("MSE Loss")
ax1.legend()

ax2.plot(history.history['mae'], label='train')
ax2.plot(history.history['val_mae'], label='val')
ax2.set_xlabel("Epoch")
ax2.set_ylabel("MAE")
ax2.legend()

plt.show()
```

### Early stopping

```python
from keras.callbacks import EarlyStopping

# Stop once val_loss has failed to improve for 20 epochs, then restore the best weights
earlystopping = EarlyStopping(patience=20, monitor='val_loss', restore_best_weights=True)
```

```python
model = Sequential([
    Input(shape=[x_train.shape[1]]),
    Dense(100, activation='relu', kernel_regularizer=l2(1e-3)),
    Dropout(0.1),
    Dense(100, activation='relu', kernel_regularizer=l2(1e-3)),
    Dropout(0.1),
    Dense(100, activation='relu'),
    Dense(y_train.shape[1])
])
```

```python
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history = model.fit(x_train, y_train, epochs=200, validation_data=(x_test, y_test), verbose=0,
                    callbacks=[earlystopping])
```

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 3))

ax1.plot(history.history['loss'], label='train')
ax1.plot(history.history['val_loss'], label='val')
ax1.set_xlabel("Epoch")
ax1.set_ylabel("MSE Loss")
# get_ylim returns (bottom, top), in that order
ymin, ymax = ax1.get_ylim()
ax1.vlines(earlystopping.best_epoch, ymin=ymin, ymax=ymax, linestyle='--', color='r', label='Best epoch')
ax1.legend()

ax2.plot(history.history['mae'], label='train')
ax2.plot(history.history['val_mae'], label='val')
ax2.set_xlabel("Epoch")
ax2.set_ylabel("MAE")
ax2.legend()

plt.show()
```

### References

* Chen, J. & Kyrillidis, A. (2019) "Decaying Momentum Helps Neural Network Training", arXiv preprint arXiv:1910.04952.
* Cybenko, G. (1989) "Approximation by superpositions of a sigmoidal function", *Mathematics of Control, Signals, and Systems*, **2** (4), 303–314.
* Hanson, S. J. & Pratt, L. Y. (1988) "Comparing biases for minimal network construction with back-propagation", in *Proceedings of the 1st International Conference on Neural Information Processing Systems*, 177–185.
* Kingma, D. P. & Ba, J. L. (2015) "Adam: A Method for Stochastic Optimization", International Conference on Learning Representations, 1–13.
* Lu, Z., Pu, H., Wang, F., Hu, Z. & Wang, L. (2017) "The Expressive Power of Neural Networks: A View from the Width", *Advances in Neural Information Processing Systems 30*, Curran Associates, Inc., 6231–6239.
* Nesterov, Y. (1983) "A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$", *Doklady AN SSSR* (translated as *Soviet Math. Dokl.*), **269**, 543–547.
* Qian, N. (1999) "On the momentum term in gradient descent learning algorithms", *Neural Networks: The Official Journal of the International Neural Network Society*, **12** (1), 145–151.
* Robbins, H. & Monro, S. (1951) "A stochastic approximation method", *The Annals of Mathematical Statistics*, 400–407.