It is possible to achieve high accuracy on the training set, but what we really want is to develop models that generalize well to a testing set (or data they haven't seen before).

The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons: the model is not powerful enough, is over-regularized, or has simply not been trained long enough. This means the network has not learned the relevant patterns in the training data.

If you train for too long though, the model will start to overfit and learn patterns from the training data that do not generalize to the test data. We need to strike a balance.

To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this notebook, we'll explore two common regularization techniques, weight regularization and dropout, and use them to improve our IMDB movie review classification notebook.

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)

In [0]:
# Rather than using an embedding as in the previous notebook, here we will
# multi-hot encode the sentences. This model will quickly overfit to the training set.
# It will be used to demonstrate when overfitting occurs, and how to fight it.

# Multi-hot encoding our lists means turning them into vectors of 0s and 1s.
# Concretely, this would mean for instance turning the sequence [3, 5] into a
# 10,000-dimensional vector that would be all-zeros except for indices 3 and 5,
# which would be ones.

NUM_WORDS = 10000

(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

def multi_hot_sequences(sequences, dimension):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        results[i, word_indices] = 1.0
    return results

train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)
test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)
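
To make the encoding concrete, here is a toy version that works on a single sequence and a 10-dimensional vector instead of 10,000. It is a minimal sketch in plain Python rather than NumPy, and the helper name `multi_hot` is made up for this illustration. Note that a repeated index still produces a single 1: the encoding records which words are present, not how often they occur.

```python
def multi_hot(word_indices, dimension):
    """Return a list of 0.0/1.0 flags with ones at the given indices."""
    vector = [0.0] * dimension
    for index in word_indices:
        vector[index] = 1.0  # repeated indices simply overwrite the same slot
    return vector

# The sequence [3, 5, 3] mentions word 3 twice, but the result only has two ones.
encoded = multi_hot([3, 5, 3], dimension=10)
print(encoded)
```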

In [0]:
# Let's look at one of the resulting multi-hot vectors.
# The word indices are sorted by frequency, so it is expected that there are more
# 1-values near index zero.
plt.plot(train_data[0])
train_data[0]

# Demonstrate overfitting
The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity". Intuitively, a model with more parameters will have more "memorization capacity" and therefore will be able to easily learn a perfect dictionary-like mapping between training samples and their targets. Such a mapping has no generalization power and would be useless when making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between "too much capacity" and "not enough capacity".
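
As a rough way to compare capacity, the parameter count of a stack of fully connected layers can be worked out by hand: each layer contributes one weight per input-output pair plus one bias per unit. The sketch below applies this to the three layer configurations used in this notebook (4-4-1, 16-16-1 and 512-512-1 units on a 10,000-dimensional input). The helper name `dense_param_count` is made up for this illustration and is not part of Keras.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units  # this layer's output feeds the next layer
    return total

NUM_WORDS = 10000
configs = {
    "smaller": [4, 4, 1],
    "baseline": [16, 16, 1],
    "bigger": [512, 512, 1],
}
counts = {name: dense_param_count(NUM_WORDS, units)
          for name, units in configs.items()}

for name, count in counts.items():
    print(f"{name}: {count:,} parameters")
```

Almost all of the parameters sit in the first layer, because the input is so wide. The baseline has about four times as many parameters as the smaller model, and the bigger model has over thirty times as many as the baseline.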


In [0]:
# Create a baseline model
baseline_model = keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS, )),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

In [0]:
baseline_model.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy', 'binary_crossentropy'])

In [0]:
baseline_model.summary()

In [0]:
baseline_history = baseline_model.fit(train_data,
                             train_labels,
                             epochs=20,
                             batch_size=512,
                             validation_split=0.2,
                             verbose=2)

In [0]:
# Create a smaller model
# Let's create a model with fewer hidden units to compare against the baseline model
smaller_model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

smaller_model.summary()

In [0]:
smaller_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])

In [0]:
smaller_history = smaller_model.fit(train_data,
                            train_labels,
                            epochs=20,
                            batch_size=512,
                            validation_split=0.2,
                            verbose=2)

In [0]:
# Create a bigger model
# As an exercise, we can create an even larger model, and see
# how quickly it begins overfitting.
# Let's add to this benchmark a network that has much more capacity,
# far more than the problem would warrant.

bigger_model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

bigger_model.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy', 'binary_crossentropy'])

bigger_model.summary()

In [0]:
bigger_history = bigger_model.fit(train_data,
                                  train_labels,
                                  epochs=20,
                                  batch_size=512,
                                  validation_split=0.2,
                                  verbose=2)

In [0]:
# Plot the training and validation loss
# Note that the smaller network begins overfitting later than the baseline model (after 6 epochs rather than 4)
# and its performance degrades much more slowly once it starts overfitting.

def plot_history(histories, key='binary_crossentropy'):
    plt.figure(figsize=(16, 10))
    
    for name, history in histories:
        # Dashed line for validation, solid line in the same color for training.
        val = plt.plot(history.epoch, history.history['val_' + key],
                       '--', label=name.title() + ' Val')
        plt.plot(history.epoch, history.history[key], color=val[0].get_color(),
                 label=name.title() + ' Train')
    plt.xlabel('Epochs')
    plt.ylabel(key.replace('_', ' ').title())
    plt.legend()
    
    plt.xlim([0, max(history.epoch)])

In [0]:
plot_history([('baseline', baseline_history),
              ('smaller', smaller_history),
              ('bigger', bigger_history)])

The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
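
The same comparison can be made numerically instead of by eye. The sketch below computes the per-epoch gap between validation and training loss and finds the epoch with the lowest validation loss. The loss values are made up for illustration and are not results from this notebook; with a real run, the two lists would come from `history.history['binary_crossentropy']` and `history.history['val_binary_crossentropy']`. The function names are hypothetical.

```python
def generalization_gaps(train_losses, val_losses):
    """Per-epoch difference between validation and training loss."""
    return [val - train for train, val in zip(train_losses, val_losses)]

def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda epoch: val_losses[epoch])

# Hypothetical loss curves, made up for illustration only.
train_losses = [0.60, 0.40, 0.28, 0.20, 0.14, 0.10]
val_losses = [0.58, 0.42, 0.35, 0.33, 0.36, 0.41]

gaps = generalization_gaps(train_losses, val_losses)
turning_point = best_epoch(val_losses)

print("lowest validation loss at epoch", turning_point)
print("gap at that epoch:", round(gaps[turning_point], 2))
print("gap at final epoch:", round(gaps[-1], 2))
```

Training loss keeps falling after the turning point while validation loss rises, so the gap keeps widening. That widening gap is the signature of overfitting.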

# Strategies to prevent overfitting
## Add weight regularization
You may be familiar with the Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights.
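
For the L2 variant used in this notebook, the added cost is the sum of the squared weights multiplied by a small factor. The sketch below shows the arithmetic in plain Python on a tiny 2x2 weight matrix. The weight values and the data loss of 0.30 are made up for illustration, and `l2_penalty` is a hypothetical helper, not a Keras function.

```python
def l2_penalty(weights, factor):
    """Return factor * (sum of squared entries) for a 2-D list of weights."""
    return factor * sum(w * w for row in weights for w in row)

# Hypothetical values, chosen only to make the arithmetic easy to follow.
weights = [[0.5, -1.0],
           [2.0, 0.0]]
data_loss = 0.30  # e.g. the binary cross-entropy on a batch

penalty = l2_penalty(weights, factor=0.001)  # 0.001 * (0.25 + 1.0 + 4.0 + 0.0)
total_loss = data_loss + penalty

print("penalty:", penalty)
print("total loss:", total_loss)
```

Because the penalty grows with the square of each weight, the largest weight (2.0 here) accounts for most of it, so the optimizer is pushed hardest to shrink large weights.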

In [0]:
l2_model = keras.models.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                      activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

l2_model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy', 'binary_crossentropy'])

l2_model_history = l2_model.fit(train_data, train_labels, epochs=20,
                                batch_size=512, validation_split=0.2, verbose=2)


In [0]:
# l2(0.001) means that every coefficient in the weight matrix of the layer
# will add 0.001 * weight_coefficient_value **2 to the total loss of the network.

plot_history([('baseline', baseline_history),
              ('l2', l2_model_history)])

As we can see, the L2 regularized model has become much more resistant to overfitting than the baseline model, even though both models have the same number of parameters.

## Add dropout
Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1]. The "dropout rate" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to one minus the dropout rate (the fraction of units kept), so as to balance for the fact that more units are active than at training time.
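
The paragraph above describes rescaling at test time. The sketch below uses the equivalent "inverted dropout" variant instead, which scales the surviving values up by 1 / (1 - rate) during training and leaves them untouched at test time; this is also how `keras.layers.Dropout` behaves. It is a plain-Python illustration, not the Keras implementation, and the function name and the `rng` argument are choices made for this example.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability `rate` and the
    survivors are divided by (1 - rate). At inference, values pass through
    unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # seeded so repeated runs drop the same units
layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]

train_out = dropout(layer_output, rate=0.5, training=True, rng=rng)
test_out = dropout(layer_output, rate=0.5, training=False, rng=rng)

print("training:", train_out)
print("inference:", test_out)
```

With a rate of 0.5, every surviving activation is doubled during training, so the expected value of each output matches what the layer produces at inference.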

In [0]:
dpt_model = keras.models.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid')
])

dpt_model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy', 'binary_crossentropy'])

dpt_model_history = dpt_model.fit(train_data,
                                  train_labels,
                                  epochs=20,
                                  batch_size=512,
                                  validation_split=0.2,
                                  verbose=2)

In [0]:
plot_history([('baseline', baseline_history),
              ('dropout', dpt_model_history)])

In [0]:
# Recap: here are the most common ways to prevent overfitting in neural networks.
# - Get more training data
# - Reduce the capacity of the network
# - Add weight regularization
# - Add dropout
# - Use batch normalization
# - Use data augmentation