# Improve the performance I.
### Reduce Overﬁtting With Dropout Regularization
- dropout on your input layers
- dropout on your hidden layers

### What is the dropout Regularization?
Dropout is a regularization technique for neural network model. Dropout is a technique where randomly selected neurons are ignored during training. This means that their contribution to the activation of downstream neurons is temporally removed on the forward pass and any weight updates are not applied to the neuron on the backward pass. As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for speciﬁc features providing some specialization. Neighboring neurons come to rely on this specialization, which if taken too far can result in a fragile model too specialized to the training data (This reliance on context for a neuron during training is referred to as complex co-adaptations). You can imagine that if neurons are randomly dropped out of the network during training, that other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network. 

The eﬀect is that the network becomes less sensitive to the speciﬁc weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overﬁt the training data.

Dropout is easily implemented by randomly selecting nodes to be dropped-out with a given probability (e.g. 20%) each weight update cycle. This is how Dropout is implemented in Keras. Dropout is only used during the training of a model and is not used when evaluating the skill of the model. 

In this example, we will evaluate the developed models using scikit-learn with **10-fold cross-validation**, in order to better tease out diﬀerences in the results. There are **60 input** values and a **single output** value and the **input values are standardized** before being used in the network. 

The baseline neural network model has two hidden layers, the ﬁrst with 60 units and the second with 30. Stochastic gradient descent is used to train the model with a relatively low learning rate and momentum. The full baseline model is listed below.

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

from keras.optimizers import SGD # Stochastic Gradient Descent

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(47)
df = pd.read_csv('sonar.csv', header=None)
data = df.values

X = data[:,0:60].astype(float) 
y = data[:,60]
encoder = LabelEncoder()
encoder.fit(y)

encoded_y = encoder.transform(y)

Using TensorFlow backend.


In [2]:
# the baseline model without drop-out 
#  Generate an estimated classiﬁcation accuracy....

def create_baseline():
    model = Sequential()
    model.add(Dense(60, input_dim=60, activation='relu', kernel_initializer='normal'))
    model.add(Dense(30, activation='relu', kernel_initializer='normal'))
    model.add(Dense(1, activation='sigmoid', kernel_initializer='normal'))
    
    sgd = SGD(lr=0.01, momentum=0.8, decay=0.0, nesterov=False)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return(model)

estimators =[]
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=300, batch_size=16, verbose=0)))

pipeline = Pipeline(estimators)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=47)

results = cross_val_score(pipeline, X, encoded_y, cv=kfold)

print('Baseline_mean: %.2f (+/-%.2f)' %(results.mean(), results.std()))

Baseline_mean: 0.86 (+/-0.09)


### Using Dropout on the Visible Layer
Dropout can be applied to **input neurons** called the visible layer. In the example below we add a **new Dropout layer** between the input (or visible layer) and the ﬁrst hidden layer. The dropout rate is set to **20%**, meaning one in ﬁve inputs will be randomly excluded from each update cycle. Additionally, as recommended in the original paper on dropout, a constraint is imposed on the weights for each hidden layer, ensuring that the **maximum norm of the weights does not exceed a value of 3**. This is done by setting the **kernel_constraint** argument on the **Dense** class when constructing the layers. 

The **learning rate** was lifted by one order of magnitude and the **momentum** was increased to 0.9. These increases in the learning rate were also recommended in the original dropout paper. Continuing on from the baseline example above, the code below exercises the same network with input dropout.

In [5]:
# Example of Dropout on the Sonar Dataset: Visible input Layer
from keras.layers import Dropout

def create_model():  
    model = Sequential() 
    
    model.add(Dropout(0.2, input_shape=(60,))) 
    
    model.add(Dense(60, kernel_initializer='normal', activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dense(30, kernel_initializer='normal', activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid')) 
    # Compile model 
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False) 
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy']) 
    return(model)

estimators =[]
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=300, batch_size=16, verbose=0)))

pipeline = Pipeline(estimators)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=47)

results = cross_val_score(pipeline, X, encoded_y, cv=kfold)

print('Add_dropout_inputLayer_mean: %.2f (+/-%.2f)' %(results.mean(), results.std()))

Add_dropout_inputLayer_mean: 0.87 (+/-0.06)


### Using Dropout on Hidden Layers
Dropout can be applied to hidden neurons in the body of your network model. In the example below dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again a dropout rate of **20%** is used as is a **weight constraint** on those layers. 

In [7]:
# Example of Dropout with weight constraint on the Sonar Dataset: Hidden Layer
from keras.layers import Dropout
from keras.constraints import maxnorm

def create_model():
    model = Sequential() 
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu', kernel_constraint=maxnorm(3))) 
    
    model.add(Dropout(0.2)) 
    model.add(Dense(30, kernel_initializer='normal', activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2)) 
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    
    # Compile model 
    sgd = SGD(lr=0.1, momentum=0.9, decay=0.0, nesterov=False) 
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy']) 
    return(model)

estimators =[]
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, epochs=300, batch_size=16, verbose=0)))

pipeline = Pipeline(estimators)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=47)

results = cross_val_score(pipeline, X, encoded_y, cv=kfold)

print('Add_dropout_HiddenLayer_mean: %.2f (+/-%.2f)' %(results.mean(), results.std()))

Add_dropout_HiddenLayer_mean: 0.87 (+/-0.09)


What if the performance with tuning was worse than the baseline? It is possible that additional training **epochs** are required or that further tuning is required to the **learning rate**.

### Useful heuristics to consider when using dropout in practice:
Generally use a small dropout value of 20%-50% of neurons with 20% providing a good starting point. A probability too low has minimal eﬀect and a value too high results in under-learning by the network.
- Use a larger network. You are likely to get better performance when dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
- Use dropout on input (visible) as well as hidden layers. Application of dropout at each layer of the network has shown good results.
- Use a large learning rate with decay and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
- Constrain the size of network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of network weights such as max-norm regularization with a size of 4 or 5 has been shown to improve results.
