# <center> Avoiding Overfitting Through Regularization </center>
---

## Early Stopping

One way to implement early stopping with TensorFlow is to evaluate the model on a validation
set at regular intervals (e.g., every 50 steps), and save a “winner” snapshot if it outperforms previous “winner” snapshots. Count the number of steps since the last “winner” snapshot was saved, and interrupt training when this number reaches some limit
(e.g., 2,000 steps). Then restore the last “winner” snapshot.

Although early stopping works very well in practice, you can usually get much higher
performance out of your network by combining it with other regularization techniques.
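
The bookkeeping described above can be sketched in plain Python, without TensorFlow. This is only an illustration under stated assumptions: the function name `train_with_early_stopping`, its parameters, and the list of validation losses are all hypothetical, and saving or restoring a snapshot is reduced to remembering the best step.

```python
def train_with_early_stopping(val_losses, eval_every=50,
                              max_steps_without_progress=2000):
    """Return (best_step, best_loss, stop_step) for a series of validation losses.

    val_losses[i] is the (hypothetical) validation loss measured at
    step (i + 1) * eval_every.
    """
    best_loss = float("inf")
    best_step = 0
    stop_step = None
    for i, loss in enumerate(val_losses):
        step = (i + 1) * eval_every
        if loss < best_loss:
            # New "winner": a real training loop would save a snapshot here.
            best_loss, best_step = loss, step
        elif step - best_step >= max_steps_without_progress:
            # Too many steps since the last winner: interrupt training.
            stop_step = step
            break
    return best_step, best_loss, stop_step


# Made-up validation losses, evaluated every 50 steps.
result = train_with_early_stopping(
    [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.95],
    eval_every=50, max_steps_without_progress=150)
print(result)
```

In a real training loop you would restore the snapshot saved at `best_step` once the loop is interrupted.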

## ℓ1 and ℓ2 Regularization

One way to apply ℓ1 or ℓ2 regularization using TensorFlow is to simply add the appropriate regularization
terms to your cost function. For example, assuming you have just one hidden layer
with weights W1 and one output layer with weights W2, then you can apply ℓ1 regularization like this:

In [None]:
[...] # construct the neural network
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")
scale = 0.001 # l1 regularization hyperparameter
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                            logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")
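
To make the arithmetic behind this loss concrete, here is a minimal pure-Python sketch of the same computation. The weight values, the base loss, and the helper name `l1_penalty` are made-up assumptions for illustration; only the formula (base loss plus `scale` times the sum of absolute weights) comes from the code above.

```python
def l1_penalty(weight_matrices):
    """Sum of the absolute values of every weight in every matrix."""
    return sum(abs(w)
               for matrix in weight_matrices
               for row in matrix
               for w in row)


# Hypothetical toy values, not taken from any real model.
W1 = [[0.5, -1.0],
      [2.0, 0.0]]
W2 = [[-0.25],
      [0.75]]
base_loss = 0.40   # stands in for the average cross entropy
scale = 0.001      # l1 regularization hyperparameter

reg_losses = l1_penalty([W1, W2])       # 3.5 from W1 plus 1.0 from W2
loss = base_loss + scale * reg_losses   # total loss to minimize
print(reg_losses, loss)
```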

However, if there are many layers, this approach is not very convenient. Fortunately,
TensorFlow provides a better option. Many functions that create variables (such as
get_variable() or tf.layers.dense()) accept a *_regularizer argument for each
created variable (e.g., kernel_regularizer). You can pass any function that takes
weights as an argument and returns the corresponding regularization loss. The
l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return
such functions. The following code puts all this together:

In [None]:
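from functools import partial  # required by partial() below

# Dense layer factory: ReLU activation and l1 regularization by default
my_dense_layer = partial(tf.layers.dense, activation=tf.nn.relu,
                        kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))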
with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

This code creates a neural network with two hidden layers and one output layer, and
it also creates nodes in the graph to compute the ℓ1 regularization loss corresponding
to each layer’s weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization
losses to your overall loss, like this:

In [None]:
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")

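Another way is to apply ℓ2 regularization by hand, summing tf.nn.l2_loss() over every trainable variable except the biases:

In [None]:
trainable_vars = tf.trainable_variables()  # avoids shadowing the built-in vars()
# l2 penalty over all non-bias variables, scaled by the regularization hyperparameter
lossL2 = tf.add_n([tf.nn.l2_loss(v) for v in trainable_vars
                   if 'bias' not in v.name]) * 0.001

with tf.name_scope("loss"):
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(cross_entropy + lossL2, name="loss")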
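
As a rough illustration of what this penalty adds up to, the sketch below mimics it in plain Python. Note that tf.nn.l2_loss() returns half the sum of squares, which the helper reproduces. The variable names and values in the dictionary are hypothetical, and the tensors are flattened to plain lists for simplicity.

```python
def l2_loss(values):
    """Half the sum of squares, the quantity tf.nn.l2_loss() computes."""
    return sum(v * v for v in values) / 2


# Hypothetical variables, flattened to lists and keyed by variable name.
variables = {
    "hidden1/kernel:0": [1.0, -2.0],
    "hidden1/bias:0": [10.0],
    "outputs/kernel:0": [3.0],
    "outputs/bias:0": [-10.0],
}

# Only the kernels contribute; anything with "bias" in its name is skipped.
penalty = 0.001 * sum(l2_loss(values)
                      for name, values in variables.items()
                      if "bias" not in name)
print(penalty)
```

Filtering on the substring `bias` relies on the variables following that naming convention, which is an assumption about how the layers were created.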

## Dropout

The most popular regularization technique for deep neural networks is arguably
dropout. It was proposed by G. E. Hinton in 2012 and further detailed in a paper by
Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This
may not sound like a lot, but when a model already has 95% accuracy, getting a 2%
accuracy boost means dropping the error rate by almost 40% (going from 5% error to
roughly 3%).

It is a fairly simple algorithm: at every training step, every neuron (including the
input neurons but excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step,
but it may be active during the next step. The hyperparameter p is
called the dropout rate, and it is typically set to 50%. After training, neurons don’t get
dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).

It is quite surprising at first that this rather brutal technique works at all. Would a
company perform better if its employees were told to toss a coin every morning to
decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this
expertise would have to be spread across several people. Employees would have to
learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make
much of a difference.

It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot
co-adapt with their neighboring neurons; they have to be as useful as possible on
their own. They also cannot rely excessively on just a few input neurons; they must
pay attention to each of their input neurons. They end up being less sensitive to slight
changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural
network is generated at each training step. Since each neuron can be either present or
absent, there is a total of $2^N$ possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same
neural network to be sampled twice. Once you have run 10,000 training steps, you
have essentially trained 10,000 different neural networks (each with just one training
instance). These neural networks are obviously not independent since they share
many of their weights, but they are nevertheless all different. The resulting neural
network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. **Suppose p = 50%, in which case
during testing a neuron will be connected to twice as many input neurons as it was
(on average) during training. To compensate for this fact, we need to multiply each
neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will
get a total input signal roughly twice as large as what the network was trained on, and
it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide
each neuron’s output by the keep probability during training (these alternatives are
not perfectly equivalent, but they work equally well).**

**To implement dropout using TensorFlow, you can simply apply the tf.layers.dropout() function to the input layer and/or to the output of any hidden layer you want.
During training, this function randomly drops some items (setting them to 0) and
divides the remaining items by the keep probability. After training, this function does
nothing at all. The following code applies dropout regularization to our three-layer
neural network:**

In [None]:
[...]
training = tf.placeholder_with_default(False, shape=(), name='training')
dropout_rate = 0.5 # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                                name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                                name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")

P.S. Use the tf.layers.dropout() function, not
tf.nn.dropout(). The first one turns off (becomes a no-op) when not training, which is what you want, while the second one does not.
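
To make the scaling detail concrete, here is a minimal pure-Python sketch of the "divide by the keep probability during training" variant described above. It is not TensorFlow's implementation: the function `dropout`, its arguments, and the all-ones activations are assumptions made for illustration.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and divide the
    survivors by the keep probability; return the values unchanged otherwise."""
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(42)        # fixed seed so the run is repeatable
activations = [1.0] * 10000    # hypothetical layer outputs

dropped = dropout(activations, rate=0.5, training=True, rng=rng)
mean_training = sum(dropped) / len(dropped)
unchanged = dropout(activations, rate=0.5, training=False, rng=rng)
print(mean_training)
```

Because the surviving values are scaled up during training, the average signal stays close to its original level, so nothing needs to be rescaled at test time.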

**Of course, just like you did earlier for Batch Normalization, you need to set training
to True when training, and leave the default False value when testing.
If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training
set. It can also help to increase the dropout rate for large layers, and reduce it for
small ones.
Dropout does tend to significantly slow down convergence, but it usually results in a
much better model when tuned properly. So, it is generally well worth the extra time
and effort.**

## Max-Norm Regularization

Another regularization technique that is quite popular for neural networks is called
max-norm regularization: for each neuron, it constrains the weights $w$ of the incoming connections such that $\lVert w \rVert_2 \le r$, where $r$ is the max-norm hyperparameter and
$\lVert \cdot \rVert_2$ is the ℓ2 norm.

We typically implement this constraint by computing $\lVert w \rVert_2$ after each training step
and clipping $w$ if needed ($w \leftarrow w \, r / \lVert w \rVert_2$).

Reducing $r$ increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).

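
The clipping rule can be sketched for a single neuron's incoming weight vector in plain Python. The helper name `clip_by_max_norm` and the example vectors are illustrative assumptions; the rule itself is the one stated above (rescale the vector by r divided by its ℓ2 norm, but only when the norm exceeds r).

```python
import math


def clip_by_max_norm(w, r):
    """Rescale w so its l2 norm equals r if it exceeds r; otherwise copy it."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= r:
        return list(w)
    return [x * r / norm for x in w]


clipped = clip_by_max_norm([3.0, 4.0], r=1.0)     # norm 5.0, gets rescaled
untouched = clip_by_max_norm([0.3, 0.4], r=1.0)   # norm 0.5, left as is
print(clipped, untouched)
```
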
TensorFlow does not provide an off-the-shelf max-norm regularizer, but it is not too
hard to implement. The following code gets a handle on the weights of the first hidden layer, then it uses the clip_by_norm() function to create an operation that will clip the weights along the second axis so that each row vector ends up with a maximum norm of 1.0. The last line creates an assignment operation that will assign the
clipped weights to the weights variable:

In [None]:
threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

Then you just apply this operation after each training step, like so:

In [None]:
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
clip_weights.eval()

In general, you would do this for every hidden layer. Although this solution should
work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer()
function and use it just like the earlier l1_regularizer() function:

In [None]:
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                        collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None # there is no regularization loss term
    return max_norm

This function returns a parametrized max_norm() function that you can use like any
other regularizer:

In [None]:
max_norm_reg = max_norm_regularizer(threshold=1.0)
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Note that max-norm regularization does not require adding a regularization loss term
to your overall loss function, which is why the max_norm() function returns None. But
you still need to be able to run the clip_weights operations after each training step,
so you need to be able to get a handle on them. This is why the max_norm() function
adds the clip_weights operation to a collection of max-norm clipping operations.
You need to fetch these clipping operations and run them after each training step:

In [None]:
clip_all_weights = tf.get_collection("max_norm")  # all max-norm clipping ops
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)  # re-clip the weights after each training step

## Data Augmentation

One last regularization technique, data augmentation, consists of generating new
training instances from existing ones, artificially boosting the size of the training set.
This will reduce overfitting, making this a regularization technique. The trick is to
generate realistic training instances; ideally, a human should not be able to tell which
instances were generated and which ones were not. Moreover, simply adding white
noise will not help; the modifications you apply should be learnable (white noise is
not).

For example, if your model is meant to classify pictures of mushrooms, you can
slightly shift, rotate, and resize every picture in the training set by various amounts
and add the resulting pictures to the training set (see Figure 11-10). This forces the
model to be more tolerant to the position, orientation, and size of the mushrooms in
the picture. If you want the model to be more tolerant to lighting conditions, you can
similarly generate many images with various contrasts. Assuming the mushrooms are
symmetrical, you can also flip the pictures horizontally. By combining these transformations you can greatly increase the size of your training set.

It is often preferable to generate training instances on the fly during training rather
than wasting storage space and network bandwidth. TensorFlow offers several image
manipulation operations such as transposing (shifting), rotating, resizing, flipping,
and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see
the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
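
As a rough illustration of generating instances on the fly, the sketch below applies a random horizontal flip and a small random shift to a tiny image stored as a list of rows. It deliberately avoids TensorFlow; the helper names (`flip_horizontal`, `shift_right`, `augment`) and the 2×3 "image" are assumptions made up for this example.

```python
import random


def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift an image right by `pixels` columns, padding the left with `fill`."""
    shifted = []
    for row in image:
        n = max(0, min(pixels, len(row)))   # never shift further than the width
        shifted.append([fill] * n + row[:len(row) - n])
    return shifted


def augment(image, rng):
    """Return a randomly flipped and shifted copy of the image."""
    out = flip_horizontal(image) if rng.random() < 0.5 else [row[:] for row in image]
    return shift_right(out, rng.randint(0, 1))


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = flip_horizontal(image)
shifted = shift_right(image, 1)

rng = random.Random(0)
batch = [augment(image, rng) for _ in range(4)]   # new variants for each batch
print(flipped, shifted)
```

Calling `augment` inside the training loop produces fresh variants for each batch, so no augmented copies need to be stored.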