# Improving generalization with regularizers and constraints

Neural networks usually have a very large number of parameters, which can lead to overfitting, especially when the dataset is small. There are many regularization methods; here we cover the most common ones that are already implemented in Keras.

For more details and theoretical grounds for the regularization methods described here, a good reference is [Chapter 7 of the Deep Learning Book](http://www.deeplearningbook.org/contents/regularization.html).

## Regularizers (`keras.regularizers`)

- `l1(l=0.01)`: L1 weight regularization penalty, also known as LASSO
- `l2(l=0.01)`: L2 weight regularization penalty, also known as weight decay, or Ridge
- `l1l2(l1=0.01, l2=0.01)`: L1-L2 weight regularization penalty, also known as ElasticNet
- `activity_l1(l=0.01)`: L1 activity regularization
- `activity_l2(l=0.01)`: L2 activity regularization
- `activity_l1l2(l1=0.01, l2=0.01)`: L1+L2 activity regularization


In [None]:
# Example: a Dense layer with L2 regularization on its weights and its activations
# (assumes `model` is an existing Sequential model)
from keras.layers import Dense
from keras.regularizers import l2, activity_l2
model.add(Dense(256, W_regularizer=l2(0.01), activity_regularizer=activity_l2(0.05)))
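
To make the list of penalties concrete, here is a minimal pure-Python sketch of what each weight penalty computes and how it enters the loss. It assumes the penalty is the coefficient times a sum over all weights; the exact reduction used inside Keras may differ between versions. The helper names, the toy matrix `W`, and `data_loss` are hypothetical and not part of the Keras API.

```python
def l1_penalty(weights, l=0.01):
    """L1 penalty: l * sum(|w|) over all weights."""
    return l * sum(abs(w) for row in weights for w in row)


def l2_penalty(weights, l=0.01):
    """L2 penalty: l * sum(w^2) over all weights."""
    return l * sum(w * w for row in weights for w in row)


def l1l2_penalty(weights, l1=0.01, l2=0.01):
    """ElasticNet-style penalty: the sum of the L1 and L2 terms."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)


# Toy 2x2 weight matrix (hypothetical values, for illustration only).
W = [[1.0, -2.0],
     [0.5, 0.0]]

data_loss = 0.25  # placeholder for the loss computed on a batch
total_loss = data_loss + l2_penalty(W, l=0.01)  # the penalty is added to the loss
```

The `activity_*` variants apply the same formulas to a layer's outputs rather than to its weights.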

## Constraints (`keras.constraints`)

- `maxnorm(m=2)`: maximum-norm constraint
- `nonneg()`: non-negativity constraint
- `unitnorm()`: unit-norm constraint, constrains the weights to have unit norm along the last axis



In [None]:
# Example: enforce non-negativity on the weights of a convolutional layer
from keras.layers import Convolution1D
from keras.constraints import nonneg
model.add(Convolution1D(64, 3, border_mode='same', W_constraint=nonneg()))
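
A constraint can be thought of as a projection applied to the weights after each gradient update. The sketch below illustrates the three constraints on a single weight vector in pure Python. The function names are hypothetical, and which axis Keras computes the norm over for a full weight matrix is deliberately left out, so this shows the idea rather than the library's implementation.

```python
import math


def max_norm(vector, m=2.0):
    """Rescale `vector` so its L2 norm is at most `m`; shorter vectors pass through."""
    norm = math.sqrt(sum(w * w for w in vector))
    if norm <= m:
        return list(vector)
    return [w * m / norm for w in vector]


def non_neg(vector):
    """Clip negative entries to zero."""
    return [max(w, 0.0) for w in vector]


def unit_norm(vector):
    """Rescale `vector` to have L2 norm 1 (assumes it is not all zeros)."""
    norm = math.sqrt(sum(w * w for w in vector))
    return [w / norm for w in vector]


w = [3.0, -4.0]            # L2 norm is 5
print(max_norm(w, m=2.0))  # same direction, norm reduced to 2
print(non_neg(w))          # negative entry replaced by 0
print(unit_norm(w))        # same direction, norm 1
```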

## Dropout

Dropout is a different regularization technique, based on randomly dropping internal features and/or inputs during training. In its usual formulation (the one implemented in Keras), dropout sets each input or feature to zero with probability P, and only during training; on average, a fraction P of the inputs/features is zeroed.

This is how you use Dropout in Keras:

In [None]:
from keras.layers import Dense, Dropout
model.add(Dense(128, input_dim=64))
model.add(Dropout(0.5))  # drop 50% of the features from the dense layer

Note that whenever Dropout is the first layer of a network, you have to specify the `input_shape` as usual. The parameter passed to Dropout should be between zero and one, and 0.5 is the usual value chosen for internal features. For inputs, you usually want to drop out a smaller amount of input features (0.1 or 0.2 are good values to start with).
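
The mechanics of binary dropout can be sketched in a few lines of pure Python. This sketch uses the "inverted" variant, which rescales the surviving features by 1 / (1 - P) during training so that nothing has to change at test time; treat that rescaling as an assumption about the implementation rather than something stated above. The function name and the `training` flag are hypothetical.

```python
import random


def dropout(features, p=0.5, training=True, rng=random):
    """Zero each feature with probability `p`; rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(features)  # dropout is a no-op outside training
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in features]


rng = random.Random(0)  # seeded generator so the run is repeatable
features = [1.0] * 10000
dropped = dropout(features, p=0.5, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)  # close to p
mean_after = sum(dropped) / len(dropped)             # close to the original mean
```

With many features, `fraction_zeroed` is close to `p` and `mean_after` stays close to the mean of the input, which is the point of the rescaling.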

As an alternative to this sort of "binary" dropout, one can also apply multiplicative, one-centered Gaussian noise to the inputs/features. This is implemented in Keras as the `GaussianDropout` layer:


In [None]:
from keras.layers.noise import GaussianDropout
model.add(GaussianDropout(0.1))  # multiplicative Gaussian noise with parameter 0.1

where the parameter `p` is a drop probability, as with `Dropout`: the noise is sampled from a Gaussian with mean 1 and standard deviation $\sqrt{p/(1-p)}$.

## Adding noise to the inputs and/or internal features

Instead of multiplicative Gaussian noise, you can also use good old additive Gaussian noise. Usage is similar to the dropout layers described above:


In [None]:
from keras.layers.noise import GaussianNoise
model.add(GaussianNoise(0.2))

Here the parameter is the $\sigma$ of the Gaussian distribution, and the noise is zero-centered, as usual for additive Gaussian noise.
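
The two kinds of Gaussian noise can be contrasted with a small pure-Python sketch. It is only an illustration: the function names are hypothetical, `stddev` is the standard deviation of the noise itself (how a layer's constructor argument maps to it is left aside), and, as with dropout, the noise is applied during training only.

```python
import random


def gaussian_dropout(features, stddev, training=True, rng=random):
    """Multiply each feature by noise drawn from N(1, stddev^2)."""
    if not training:
        return list(features)
    return [x * rng.gauss(1.0, stddev) for x in features]


def gaussian_noise(features, stddev, training=True, rng=random):
    """Add noise drawn from N(0, stddev^2) to each feature."""
    if not training:
        return list(features)
    return [x + rng.gauss(0.0, stddev) for x in features]


rng = random.Random(0)  # seeded generator so the run is repeatable
features = [2.0] * 20000

multiplied = gaussian_dropout(features, stddev=0.1, rng=rng)
added = gaussian_noise(features, stddev=0.2, rng=rng)

mean_multiplied = sum(multiplied) / len(multiplied)  # stays close to 2.0
mean_added = sum(added) / len(added)                 # stays close to 2.0
```

One practical difference: multiplicative noise scales with the feature, so a feature that is exactly zero stays zero, while additive noise perturbs every feature by the same amount regardless of its magnitude.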

## Early stopping

Early stopping avoids overfitting the training data by monitoring performance on a validation set and stopping training when it stops improving. In Keras, it is implemented as a callback (`keras.callbacks.EarlyStopping`). Because the validation metric is noisy, the callback takes a `patience` argument: training stops only when no improvement has been seen for `patience` epochs.

In [None]:
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(patience=5)

Note that after training with early stopping, the model parameters are those from the last epoch, not those from the "best" epoch. For this reason, `EarlyStopping` is usually combined with the `ModelCheckpoint` callback with `save_best_only=True`, so that you can load the best model after `EarlyStopping` interrupts training.
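
The patience logic, and why the last epoch is generally not the best one, can be illustrated with a short simulation over a precomputed list of validation losses. This is a sketch of the idea only: the function name and the loss values are hypothetical, and the exact epoch counting inside the Keras callback may differ.

```python
def train_with_early_stopping(val_losses, patience=5):
    """Simulate early stopping on per-epoch validation losses.

    Returns (stopped_epoch, best_epoch), both 0-based.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    stopped_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: a real run would checkpoint the weights here.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return stopped_epoch, best_epoch


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]  # made-up validation losses
stopped, best = train_with_early_stopping(losses, patience=3)
```

Here training stops at epoch 5, three epochs after the best epoch (epoch 2), so the weights left in memory are worse than the checkpointed ones.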