## Regularization techniques in Keras - a basic overview 
Overfitting is a big problem when training a model. It occurs when the model achieves very high accuracy on the training data, i.e. it has learned the training data's features very well (and sometimes even memorizes every single datapoint), but fails to achieve similar accuracy on new (test) data. 

This is a problem because what we actually care about is how well the model predicts *new* data, for which we don't have the answer (for example, in a classification problem, which class a sample belongs to). 
For the training data we already *have* the answer, so its accuracy isn't as important. 

There's a lot of discussion about whether over- or underfitting is worse (make sure to check out the [question](https://stats.stackexchange.com/questions/521835/is-overfitting-better-than-underfitting) I asked on Cross Validated). The key is to find the sweet spot between the two. 

![Graph](/Users/luca/Desktop/Screenshot.png)

In the following we're going to focus on some regularization techniques to prevent overfitting in neural networks.
**Note**: Here I will mainly focus on the code and how to implement it using Keras, and not as much on the theoretical background. 

### Overview of regularization
#### 1) Create a simpler model
Often a model that is too complex (here, complexity means a high number of parameters), for example too deep or too wide, will overfit. On the flip side, a model that isn't complex *enough* won't learn from the data and will underfit. So the first thing we should ask ourselves is whether we can reduce model complexity by reducing the number of parameters. 

#### 2) Reducing batch_size
Lowering the batch size when we fit the model to the training data, i.e. train the model, makes the parameter updates noisier and can help to reduce overfitting. For example, a model trained with batch_size = 500 will tend to overfit more easily than a model trained with batch_size = 100. 

#### 3) Dropout
This is one of the most widely used and effective ways of reducing overfitting. 
Dropout means randomly ignoring or “dropping out” some number of layer outputs.
Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.

#### 4) Early Stopping
Early stopping means exactly what it sounds like: We stop the model before it overfits. 
Usually, with increasing training time, the model will approach 100% training accuracy while the test accuracy drops. Therefore, by stopping the training before all planned epochs have run, we can sometimes prevent overfitting. 

#### 4) Batch Normalization
Batch normalization applies a transformation that maintains the mean output close to 0 and the output standard deviation close to 1. Batch normalization is mostly effective for very deep neural networks. 

#### 5) Add penalty to cost function - L1, L2 and elastic net regularizer
Neural networks learn a set of weights that best map inputs to outputs.

A network with large weights can be a sign of an unstable network where small changes in the input can lead to large changes in the output. This can be a sign that the network has overfit the training dataset and will likely perform poorly when making predictions on new data.

A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.
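
As a rough illustration of what such a penalty looks like, the sketch below computes L1, L2 and elastic net penalty terms for toy weight vectors in NumPy. The function names, the weight values and the coefficient `0.01` are made up for illustration; during training, a term like this would be added to the data loss.

```python
import numpy as np

def l1_penalty(weights, l1):
    """L1 penalty: coefficient times the sum of absolute weights."""
    return l1 * np.sum(np.abs(weights))

def l2_penalty(weights, l2):
    """L2 penalty: coefficient times the sum of squared weights."""
    return l2 * np.sum(np.square(weights))

def elastic_net_penalty(weights, l1, l2):
    """Elastic net: the sum of the L1 and the L2 penalty."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)

# Toy weights (illustrative values only)
small_weights = np.array([0.1, -0.2, 0.05])
large_weights = np.array([3.0, -4.0, 2.5])

for name, w in [("small", small_weights), ("large", large_weights)]:
    print(name,
          "L1:", l1_penalty(w, 0.01),
          "L2:", l2_penalty(w, 0.01),
          "elastic net:", elastic_net_penalty(w, 0.01, 0.01))
```

The larger weights produce a clearly larger penalty, which is what pushes the optimizer towards smaller weights.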

#### 6) Get more training data
This can be tricky at times, because getting data may be expensive, inaccessible or simply not available. 
But generally, the more data we have, the better the model can estimate the "real" value of the parameters. 



*For a simple example:* <br>
Imagine you wanted to estimate the population mean of the height of students at a university.
If you picked a sample of 5 students, your sample mean would probably vary greatly from the "real" (population) mean. Estimating the population mean from a sample of only 5 students therefore results in high variance. <br>
On the other hand, if you had a sample of 500 students and estimated the population mean from that sample mean, you could be much more confident, because the estimate has far less variance. 
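
The example above can be checked with a small simulation. This is only a sketch: the population of heights is synthetic (normally distributed with an assumed mean of 175 cm and standard deviation of 10 cm), and the helper `spread_of_sample_means` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 20,000 student heights in cm (synthetic data)
population = rng.normal(loc=175.0, scale=10.0, size=20_000)

def spread_of_sample_means(sample_size, n_repeats=1_000):
    """Draw many samples of the given size and return the standard
    deviation of their means (how much the estimate varies)."""
    means = [
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_repeats)
    ]
    return float(np.std(means))

spread_5 = spread_of_sample_means(5)
spread_500 = spread_of_sample_means(500)

print(f"Spread of the sample mean with   5 students: {spread_5:.2f} cm")
print(f"Spread of the sample mean with 500 students: {spread_500:.2f} cm")
```

The sample mean from 500 students scatters far less around the population mean than the one from 5 students.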

***What are alternative ways to get more data and make the model more robust?***

#### 7) Data Augmentation 
This means creating slightly modified copies of each datapoint in the set. 
For example, for a training set with dog pictures, we augment the data by cropping, brightening, zooming in, darkening or mirroring the picture. 
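
A minimal NumPy sketch of such modifications, applied to a toy grayscale image, could look as follows. The function `augment`, the 20% brightness change and the central crop are illustrative assumptions, not a fixed recipe.

```python
import numpy as np

def augment(image):
    """Return simple augmented copies of a 2D grayscale image
    with pixel values in [0, 1]."""
    h, w = image.shape
    return {
        "mirrored": image[:, ::-1],                  # horizontal flip
        "brighter": np.clip(image * 1.2, 0.0, 1.0),  # 20% brighter, capped at 1
        "darker": image * 0.8,                       # 20% darker
        "cropped": image[h // 4: h - h // 4, w // 4: w - w // 4],  # central crop
    }

# Toy 4x4 "image" (illustrative values only)
image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
augmented = augment(image)

for name, img in augmented.items():
    print(name, img.shape)
```

In a real pipeline the cropped image would usually be resized back to the original shape before being fed to the model.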

#### 8) Noise injection
*Why are small datasets a problem?*
- The first problem is that the model might memorize the training set entirely. This means it will learn each specific input and its associated output, when it should learn a general mapping from inputs to outputs. 
- The second problem is that a small dataset provides less opportunity to describe the structure of the input space and its relationship to the output. More training data provides a richer description of the problem from which the model may learn. Fewer data points means that rather than a smooth input space, the points may represent a jarring and disjointed structure that may result in a difficult, if not unlearnable, mapping function.

Adding random noise then helps to smooth out those datapoints and create a larger input space.


### Implementation in Keras
We will work with the MNIST dataset, a well-known collection of 28x28 pixel images of handwritten digits from 0 to 9. The purpose of this dataset is to train models that recognize handwritten numbers.

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()


# As we've said, the images consist of a 28x28 matrix (that is the pixels).
# Each pixel is encoded to its corresponding color channel (in this case only gray), 
# with values between 0 (black) and 255 (white). 

# It is usual to normalize the values to work with a range between 0.0 and 1.0.
x_train, x_test = x_train / 255.0, x_test / 255.0


# Keep only a small, stratified 2% subset of the training data (x, y)
# to induce more overfitting; the remaining 98% is discarded
_, x, _, y = train_test_split(
    x_train, y_train, test_size=0.02, random_state=1, stratify=y_train)

#### Baseline model (without regularization) 
We'll start by building our baseline model without any regularization. The inputs are flattened to a vector (since we cannot feed a matrix into a [feed-forward neural network](https://en.wikipedia.org/wiki/Feedforward_neural_network)).

To build the model we will use the Functional API, as opposed to the [sequential model](https://hanifi.medium.com/sequential-api-vs-functional-api-model-in-keras-266823d7cd5e).

In [None]:
# Baseline model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
base_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
base_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = base_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = base_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

### Regularize Baseline Model
#### 1) Simpler Model
The first and easiest thing to do is to ask ourselves whether we can create a simpler model without compromising its ability to learn. In practice this means **reducing the number of parameters** by removing either neurons or layers.
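
To get a feeling for the numbers, the following sketch counts the trainable parameters of a fully connected network from its layer sizes. The helper `dense_param_count` is a hypothetical name; the layer sizes are those of the baseline model (128, 64, 64) and of the simpler model used in this section (80, 60), each with 784 inputs and 10 outputs.

```python
def dense_param_count(layer_sizes):
    """Number of trainable parameters of a fully connected network.
    Each layer has (inputs * outputs) weights plus one bias per output."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 28 * 28 = 784 inputs after flattening, 10 output classes
baseline_params = dense_param_count([784, 128, 64, 64, 10])
simple_params = dense_param_count([784, 80, 60, 10])

print("Baseline model:", baseline_params)
print("Simpler model: ", simple_params)
```

The simpler model has roughly 40% fewer parameters than the baseline. The same totals should appear in the output of `model.summary()`.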

In [None]:
# Simpler model -- remove third layer and reduce number of neurons in the remaining

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(80, activation='relu')(flat)
## hidden layer 2
l_2 = layers.Dense(60, activation='relu')(l_1)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_2)

## Model definition
simple_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
simple_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_2 = simple_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results_2 = simple_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results_2[0]))
print('Test Accuracy: {}'.format(results_2[1]))

#### 2) Reduce `batch_size`
`batch_size` is an argument we pass when fitting the model. It is the number of training samples that go through one forward and backward pass. After each pass the internal model parameters (the weights and biases) are updated. 
Larger batch sizes require more memory, while smaller ones make the updates fluctuate more (each update is estimated from fewer samples, and the parameters get updated more frequently). This is what is meant by saying they introduce "uncertainty".

*The important thing is*: It can help to prevent overfitting, but keep in mind that it will also result in longer training times. <br>
In practice it's literally as easy as reducing the number in the `batch_size` parameter when fitting the model. 
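
The effect on the number of parameter updates can be sketched in a few lines. The value `n_samples = 1200` is an assumption: it corresponds to 2% of the 60,000 MNIST training images, and `updates_per_epoch` is a hypothetical helper name.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates in one epoch (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

n_samples = 1200  # assumed size of the reduced training subset (2% of 60,000)

for batch_size in (80, 40):
    n_updates = updates_per_epoch(n_samples, batch_size)
    print(f"batch_size={batch_size}: {n_updates} updates per epoch")
```

Halving the batch size doubles the number of updates per epoch, which is why training takes longer.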

In [None]:
# Reduced batch_size model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
redbatch_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
redbatch_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = redbatch_model.fit(
    x,
    y,
    batch_size=40, # In practice it's as simple as reducing the batch size from let's say 80 to a lower number, e.g. 40
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = redbatch_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### 3) Dropout
This is a technique that's exceptionally effective.<br>
In practice, we "switch off" some neurons in the model with a probability of p.
This is done by adding *dropout layers* to the model.<br>
In this case we'll go with a probability of p=0.5. 
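
To show what a dropout layer does to its input, here is a minimal NumPy sketch of so-called inverted dropout: units are zeroed with probability `rate` during training and the remaining ones are scaled up so that the expected sum stays the same. The function `dropout` below is a hand-written illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale the survivors by 1 / (1 - rate); at inference time
    the input is passed through unchanged."""
    if not training:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(seed=0)
activations = np.ones((2, 8))

train_output = dropout(activations, rate=0.5, training=True, rng=rng)
eval_output = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_output)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(eval_output)   # unchanged
```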

In [None]:
# Dropout model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)
#Adding dropout layer
flat = layers.Dropout(0.5, name='dropout_flat')(flat)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
l_1 = layers.Dropout(0.5, name='dropout_l1')(l_1)
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
l_2 = layers.Dropout(0.5, name='dropout_l2')(l_2)
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)
l_3 = layers.Dropout(0.5, name='dropout_l3')(l_3)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
dropout_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
dropout_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = dropout_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = dropout_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### 4) Early Stopping
Often we see that with increasing training time the training accuracy approaches 1 while the accuracy on the test data gets lower. <br>
We can avoid this by stopping the training at the point where the performance on held-out (validation) data stops improving (the "sweet spot" mentioned before). Early stopping also results in shorter training times. 
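
To make the stopping rule concrete, here is a minimal pure-Python sketch of a patience-based rule. The function `early_stopping_epoch` and the validation losses are hypothetical; the sketch only illustrates the idea and is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed. Training stops once
    the validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran for all epochs
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving at first, then getting worse
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"Stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

A larger `patience` tolerates more noisy epochs before stopping, at the cost of training longer past the best epoch.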

To perform early stopping we will use the Keras `EarlyStopping` callback:

`tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=0,
    verbose=0,
)`

The model architecture is basically the same, we just have to specify the callback and pass it as an argument when we fit the model. 

In [None]:
# Early stopping model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
es_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Specifying callback
es_callback = keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    patience=5,  # if during 5 epochs there is no improvement in `val_loss`, the execution will stop
    verbose=1)

In [None]:
# Compile
es_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
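fit_1 = es_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True,
    validation_split=0.2,  # hold out 20% of the training subset so that `val_loss` is available
    callbacks=[es_callback]  # the argument is `callbacks` (plural)
)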

In [None]:
# Evaluation
results = es_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### 5) Batch Normalization 
Batch normalization applies a transformation that keeps the mean close to 0 and the standard deviation close to 1. 
In Keras we simply add a BatchNorm layer after each existing layer, similar to what we do with the Dropout layers. 
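
The transformation itself can be sketched in NumPy. This is a simplified, training-time view under stated assumptions: `gamma` and `beta` are fixed here although they are learned in a real layer, and a real layer normalizes with moving averages of mean and variance at inference time. The function name `batch_norm` and the toy batch are illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize each feature (column) over the batch, then scale by
    `gamma` and shift by `beta`."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(seed=0)
# Toy batch: 80 samples, 4 features with mean 5 and standard deviation 3
batch = rng.normal(loc=5.0, scale=3.0, size=(80, 4))
normalized = batch_norm(batch)

print("mean per feature:", normalized.mean(axis=0).round(3))
print("std per feature: ", normalized.std(axis=0).round(3))
```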

In [None]:
# BatchNorm model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
l_1 = layers.BatchNormalization()(l_1) # adding the BatchNorm layer 
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
l_2 = layers.BatchNormalization()(l_2) # adding the BatchNorm layer
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)
l_3 = layers.BatchNormalization()(l_3) # adding the BatchNorm layer

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
batchnorm_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
batchnorm_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit = batchnorm_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = batchnorm_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### 5) L1, L2 and Elastic Net
In Keras, we just **create the regularizers and pass them as an argument to the layer**.

Furthermore, it is possible to choose whether to include the penalty in the cost function on the weights, the biases or on the activation, with the following arguments:

- `kernel_regularizer`: penalty on the layer's weights only.
- `bias_regularizer`: penalty on the layer's biases only.
- `activity_regularizer`: penalty on the layer's output.

In [None]:
# Creating the regularizers
from tensorflow.keras import regularizers  # needed for the calls below

L1_regularizer = regularizers.l1(1e-5)
L2_regularizer = regularizers.l2(5e-4)
elastic_net = regularizers.l1_l2(l1=1e-5, l2=5e-4)

In [None]:
# Regularizer model - here we only use the L2 regularizer, which often works well in practice

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu', kernel_regularizer=L2_regularizer)(flat)
## hidden layer 2
l_2 = layers.Dense(64, activation='relu', kernel_regularizer=L2_regularizer)(l_1)
## hidden layer 3
l_3 = layers.Dense(64, activation='relu', kernel_regularizer=L2_regularizer)(l_2)

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
regularizers_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
regularizers_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = regularizers_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = regularizers_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### 6) Noise Injection
Adding random noise helps to smooth out the datapoints and create a larger input space.

In Keras we **just add a noise layer**, similar to what we did with the Dropout layers. 
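
What such a noise layer does can be sketched in NumPy: zero-mean Gaussian noise is added during training, and the input is left untouched at inference time. The function `gaussian_noise` and the value `stddev=0.1` are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def gaussian_noise(x, stddev, training, rng):
    """Add zero-mean Gaussian noise during training only."""
    if not training:
        return x
    return x + rng.normal(loc=0.0, scale=stddev, size=x.shape)

rng = np.random.default_rng(seed=0)
activations = np.zeros((2, 5))

noisy = gaussian_noise(activations, stddev=0.1, training=True, rng=rng)
clean = gaussian_noise(activations, stddev=0.1, training=False, rng=rng)

print(noisy.round(3))  # small random values around 0
print(clean)           # unchanged
```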

In [None]:
# Noise injection model

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)

stddev = 2 # standard deviation of the Gaussian noise

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat)
l_1 = layers.GaussianNoise(stddev, name='noise_l1')(l_1) # add noise layer
## hidden layer 2 (no weight regularizer, so the noise is tested in isolation)
l_2 = layers.Dense(64, activation='relu')(l_1)
l_2 = layers.GaussianNoise(stddev, name='noise_l2')(l_2) # add noise layer
## hidden layer 3 (no weight regularizer, so the noise is tested in isolation)
l_3 = layers.Dense(64, activation='relu')(l_2)
l_3 = layers.GaussianNoise(stddev, name='noise_l3')(l_3) # add noise layer

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
noise_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
noise_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = noise_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = noise_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))

#### Combining different methods
Of course we can also combine different regularizers in one and the same model. 
<br>
For example combining **dropout with batchnorm** like in the following: <br><br>
NOTE: There's some discussion about what the order of the layers should be, i.e. whether BatchNorm should be applied before or after dropout. 

In [None]:
# Combined model: Dropout + BatchNorm

## Input
inputs = tf.keras.Input(shape=(28, 28))
## Convert the 2D image to a vector
flat = layers.Flatten()(inputs)
#Adding dropout layer
flat = layers.Dropout(0.5, name='dropout_flat')(flat)

## hidden layer 1
l_1 = layers.Dense(128, activation='relu')(flat) # dense layer 
l_1 = layers.BatchNormalization()(l_1)           # BatchNorm layer 
l_1 = layers.Dropout(0.5, name='dropout_l1')(l_1)# Dropout layer 
## hidden layer 2
l_2 = layers.Dense(64, activation='relu')(l_1)
l_2 = layers.BatchNormalization()(l_2)           # BatchNorm layer 
l_2 = layers.Dropout(0.5, name='dropout_l2')(l_2)# Dropout layer 
## hidden layer 3
l_3 = layers.Dense(64, activation='relu')(l_2)
l_3 = layers.BatchNormalization()(l_3)           # BatchNorm layer 
l_3 = layers.Dropout(0.5, name='dropout_l3')(l_3)# Dropout layer 

## Outputs
outputs = layers.Dense(10, activation='softmax')(l_3)

## Model definition
combined_model = keras.Model(inputs=inputs, outputs=outputs)

In [None]:
# Compile
combined_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Training
fit_1 = combined_model.fit(
    x,
    y,
    batch_size=80,
    epochs=100,
    shuffle=True
)

In [None]:
# Evaluation
results = combined_model.evaluate(x_test, y_test, verbose=0)
print('Test Loss: {}'.format(results[0]))
print('Test Accuracy: {}'.format(results[1]))