# Lift Your Model Performance with Learning Rate Schedules
Training a large deep learning model is a difficult optimization task. The classical algorithm for training neural networks is **stochastic gradient descent**. It is well established that on some problems you can achieve better performance and faster training by using a **learning rate** that changes during training. This lesson covers two such approaches:
- a time-based learning rate schedule
- a drop-based learning rate schedule

Adapting the **learning rate** for your **stochastic gradient descent** optimization procedure can increase performance and reduce training time. Sometimes this is called **learning rate annealing** or **adaptive learning rates**. Here we will call this approach a learning rate schedule, where the default schedule is to use a **constant learning rate** to update network weights for each training epoch. 

The simplest and perhaps most widely used adaptations are techniques that **reduce the learning rate over time**. A larger learning rate at the start of training produces large weight updates; as the rate decreases, the updates made later in training become smaller. **This has the effect of quickly learning good weights early and fine-tuning them later**. Two popular and easy-to-use learning rate schedules are:
- Decrease the learning rate gradually based on the **epoch**
- Decrease the learning rate using punctuated **large drops** at specific epochs

### Dataset
The **Ionosphere binary classification problem** is used as a demonstration in this lesson. The dataset describes **radar returns** where the target was free electrons in the ionosphere. It is a binary classification problem where positive cases (g for good) show **evidence of some type of structure** in the ionosphere and negative cases (b for bad) do not. It is a good dataset for practicing with neural networks because all of the inputs are small numerical values of the same scale.

There are **34 attributes and 351 observations**. State-of-the-art results on this dataset achieve an accuracy of approximately 94% to 98% using 10-fold cross-validation. You can learn more about the Ionosphere dataset on the UCI Machine Learning Repository website.

In [2]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import LabelEncoder

from keras.optimizers import SGD # Stochastic Gradient Descent

np.random.seed(47)
df = pd.read_csv('ionosphere.csv', header=None)
data = df.values
df.head(2)

Using TensorFlow backend.


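|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.99539 | -0.05889 | 0.85243 | 0.02306 | 0.83398 | -0.37708 | 1.0 | 0.0376 | ... | -0.51171 | 0.41078 | -0.46168 | 0.21266 | -0.3409 | 0.42267 | -0.54487 | 0.18641 | -0.453 | g |
| 1 | 1 | 0 | 1.0 | -0.18829 | 0.93035 | -0.36156 | -0.10868 | -0.93597 | 1.0 | -0.04549 | ... | -0.26569 | -0.20468 | -0.18401 | -0.1904 | -0.11593 | -0.16626 | -0.06288 | -0.13738 | -0.02447 | b |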


In [3]:
X = data[:, 0:34].astype(float)

y = data[:, -1]
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
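
As a rough illustration of what the encoding step produces, the sketch below mimics the mapping in plain Python: the distinct class labels are sorted and each label is replaced by its index, so `b` becomes 0 and `g` becomes 1. The helper name `encode_labels` and the sample labels are hypothetical and stand in for `LabelEncoder` and the real class column only for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index, in sorted label order."""
    classes = sorted(set(labels))  # e.g. ['b', 'g']
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# Hypothetical sample of the class column ('g' = good, 'b' = bad)
classes, encoded = encode_labels(['g', 'b', 'g', 'b', 'g'])
print(classes)  # ['b', 'g']
print(encoded)  # [1, 0, 1, 0, 1]
```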

### Time-Based Learning Rate Schedule
The stochastic gradient descent optimization algorithm implementation in the **SGD** class has an argument called **decay**. This argument is used in the **time-based learning rate decay schedule** equation as follows: 

$LearningRate = LearningRate \times \frac{1}{1 + (decay \times epoch)}$

When the **decay** argument is zero (the default), this has no effect on the learning rate:
- LearningRate = $0.1 \times \frac{1}{1 + (0.0 \times 1)}$ 
- LearningRate = $0.1$

When the decay argument is specified, the learning rate is reduced from its value in the previous epoch by a factor that depends on the decay and the epoch number. For example, with an initial learning rate of 0.1 and a decay of 0.001, the learning rate is adapted as follows:
<img src="46.jpg">

You can create a nice default schedule by **setting the decay value** as follows:
- Decay = LearningRate / Epochs 
- Decay = 0.1 / 100 
- Decay = 0.001
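
The schedule can also be tabulated with a few lines of plain Python. This is a minimal sketch that reads the equation above as a recursive update applied once per epoch. The function name `time_based_schedule` is an assumption made for illustration, and the optimizer's internal bookkeeping may differ (for example, it may count batch updates rather than epochs), so treat the numbers as indicative only.

```python
def time_based_schedule(initial_lr, decay, n_epochs):
    """Return the learning rate used in each of the first n_epochs epochs.

    Assumes the update lr <- lr * 1 / (1 + decay * epoch) is applied once
    at the end of every epoch, starting from initial_lr.
    """
    rates = [initial_lr]
    lr = initial_lr
    for epoch in range(1, n_epochs):
        lr = lr * 1.0 / (1.0 + decay * epoch)
        rates.append(lr)
    return rates


initial_lr = 0.1
n_epochs = 100
decay = initial_lr / n_epochs  # the default suggested above: 0.001

rates = time_based_schedule(initial_lr, decay, n_epochs)
for epoch, lr in enumerate(rates[:5], start=1):
    print(f"epoch {epoch}: lr = {lr:.10f}")
```

Under this reading the rate falls slowly at first and faster as the epoch count grows, because the divisor increases with every epoch.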

The example below demonstrates using the time-based learning rate adaptation schedule in Keras. A small neural network model is constructed with a single hidden layer with **34 neurons** and using the **rectifier activation** function. The output layer has a single neuron and uses the sigmoid activation function in order to output probability-like values. The learning rate for stochastic gradient descent has been set to a higher value of **0.1**. The model is trained for 50 epochs and the decay argument has been set to 0.002. Additionally, it can be a good idea to use **momentum** when using an adaptive learning rate. In this case we use a **momentum value of 0.8**.

In [4]:
# Create the model
model = Sequential() 
model.add(Dense(34, input_dim=34, kernel_initializer='normal', activation='relu')) 
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

In [10]:
# Compile and fit the model
epochs = 50 
learning_rate = 0.1 
decay_rate = learning_rate/epochs 
momentum = 0.8 
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False) 

model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X, encoded_y, validation_split=0.33, epochs=epochs, batch_size=28, verbose=2)
# The model is trained on 67% of the dataset and evaluated using a 33% validation dataset.

Train on 235 samples, validate on 116 samples
Epoch 1/50
 - 0s - loss: 0.0174 - acc: 0.9957 - val_loss: 0.0725 - val_acc: 0.9741
Epoch 2/50
 - 0s - loss: 0.0172 - acc: 0.9957 - val_loss: 0.0725 - val_acc: 0.9741
Epoch 3/50
 - 0s - loss: 0.0186 - acc: 0.9957 - val_loss: 0.0826 - val_acc: 0.9741
Epoch 4/50
 - 0s - loss: 0.0182 - acc: 0.9957 - val_loss: 0.0745 - val_acc: 0.9741
Epoch 5/50
 - 0s - loss: 0.0180 - acc: 0.9957 - val_loss: 0.0722 - val_acc: 0.9828
Epoch 6/50
 - 0s - loss: 0.0232 - acc: 0.9957 - val_loss: 0.0820 - val_acc: 0.9741
Epoch 7/50
 - 0s - loss: 0.0169 - acc: 0.9957 - val_loss: 0.0756 - val_acc: 0.9741
Epoch 8/50
 - 0s - loss: 0.0212 - acc: 0.9957 - val_loss: 0.0753 - val_acc: 0.9655
Epoch 9/50
 - 0s - loss: 0.0147 - acc: 0.9957 - val_loss: 0.0903 - val_acc: 0.9655
Epoch 10/50
 - 0s - loss: 0.0170 - acc: 0.9957 - val_loss: 0.0899 - val_acc: 0.9655
Epoch 11/50
 - 0s - loss: 0.0155 - acc: 0.9957 - val_loss: 0.0798 - val_acc: 0.9741
Epoch 12/50
 - 0s - loss: 0.0162 - acc:

<keras.callbacks.History at 0x224138370f0>

Running the example shows a classification accuracy of 99.14%. This is higher than the baseline of 95.69% without the learning rate decay or momentum.

### Drop-Based Learning Rate Schedule
Another popular learning rate schedule used with deep learning models is to systematically drop the learning rate at specific times during training. Often this method is implemented by halving the learning rate every fixed number of epochs. For example, we may have an initial learning rate of 0.1 and drop it by a factor of 0.5 every 10 epochs. The first 10 epochs of training would use a value of 0.1, the next 10 epochs a learning rate of 0.05, and so on. Plotting the learning rates for this example out to 100 epochs gives the following:
<img src="45.jpg">

We can implement this in Keras using the **LearningRateScheduler** callback when fitting the model. It allows us to define a function that takes the **epoch number** as an argument and returns the **learning rate** to use in stochastic gradient descent. When the callback is used, the learning rate specified in the SGD optimizer is ignored. In the code below, we use the same example as before of a single hidden layer network on the Ionosphere dataset. A new **`step_decay()`** function is defined that implements the equation:

$LearningRate = InitialLearningRate \times DropRate^{\left\lfloor \frac{1 + Epoch}{EpochDrop} \right\rfloor}$

where **InitialLearningRate** is the learning rate at the beginning of the run, **EpochDrop** is how often the learning rate is dropped, in epochs, and **DropRate** is the factor by which the learning rate is multiplied each time it is dropped.

In [12]:
from keras.callbacks import LearningRateScheduler
import math

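def step_decay(epoch):
    # Multiply the learning rate by `drop` every `epochs_drop` epochs
    initial_lrate = 0.1  # learning rate at the start of training
    drop = 0.5  # factor applied at each drop
    epochs_drop = 10.0  # number of epochs between drops
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate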
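
To check the schedule before training, the function can be evaluated directly. The sketch below repeats the definition so that it runs on its own, and it assumes the callback passes a zero-based epoch index. Under that assumption the `1 + epoch` term means the first halving takes effect on the tenth epoch (index 9).

```python
import math


def step_decay(epoch):
    """Drop-based schedule: halve the rate every 10 epochs."""
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))


# Learning rate at a few zero-based epoch indices
schedule = {epoch: step_decay(epoch) for epoch in (0, 8, 9, 18, 19, 29)}
for epoch, lr in schedule.items():
    print(f"epoch index {epoch}: lr = {lr}")
```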

In [14]:
# Compile model 
sgd = SGD(lr=0.0, momentum=0.9, decay=0.0, nesterov=False) 

model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy']) 
# learning schedule callback 
lrate = LearningRateScheduler(step_decay) 
callbacks_list = [lrate] 
# Fit the model 
model.fit(X, encoded_y, validation_split=0.33, epochs=50, batch_size=28, callbacks=callbacks_list, verbose=2)

Train on 235 samples, validate on 116 samples
Epoch 1/50
 - 0s - loss: 0.0132 - acc: 0.9957 - val_loss: 0.0886 - val_acc: 0.9655
Epoch 2/50
 - 0s - loss: 0.0174 - acc: 0.9957 - val_loss: 0.0808 - val_acc: 0.9655
Epoch 3/50
 - 0s - loss: 0.0170 - acc: 0.9957 - val_loss: 0.0687 - val_acc: 0.9828
Epoch 4/50
 - 0s - loss: 0.0249 - acc: 0.9957 - val_loss: 0.1020 - val_acc: 0.9655
Epoch 5/50
 - 0s - loss: 0.0311 - acc: 0.9915 - val_loss: 0.0777 - val_acc: 0.9828
Epoch 6/50
 - 0s - loss: 0.1003 - acc: 0.9660 - val_loss: 0.1816 - val_acc: 0.9310
Epoch 7/50
 - 0s - loss: 0.0934 - acc: 0.9745 - val_loss: 0.1035 - val_acc: 0.9569
Epoch 8/50
 - 0s - loss: 0.0567 - acc: 0.9745 - val_loss: 0.0728 - val_acc: 0.9828
Epoch 9/50
 - 0s - loss: 0.0450 - acc: 0.9830 - val_loss: 0.1503 - val_acc: 0.9483
Epoch 10/50
 - 0s - loss: 0.0377 - acc: 0.9830 - val_loss: 0.0860 - val_acc: 0.9741
Epoch 11/50
 - 0s - loss: 0.0255 - acc: 0.9957 - val_loss: 0.0738 - val_acc: 0.9741
Epoch 12/50
 - 0s - loss: 0.0267 - acc:

<keras.callbacks.History at 0x22414a9a390>

Running the example results in a classification accuracy of 99.14% on the validation dataset, again an improvement over the baseline for the model on this dataset.

### Tips for Using Learning Rate Schedules
- **Increase the initial learning rate.** Because the learning rate will decrease, start with a larger value to decrease from. A larger learning rate will result in much larger changes to the weights, at least in the beginning, allowing you to benefit from fine-tuning later.
- **Use a large momentum.** Using a larger momentum value will help the optimization algorithm continue to make updates in the right direction when your learning rate shrinks to small values.
- **Experiment with different schedules.** It will not be clear in advance which learning rate schedule to use, so try a few with different configuration options and see what works best on your problem. Also try schedules that change exponentially and even schedules that respond to the accuracy of your model on the training or test datasets.
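
As one way to act on the last tip, an exponential schedule can be written as a function of the epoch, in the same shape as `step_decay()`. This is a sketch under assumptions: the function name `exp_decay` and the decay constant `k = 0.1` are illustrative choices, not values from this lesson.

```python
import math


def exp_decay(epoch):
    """Exponential schedule: lr = initial_lrate * exp(-k * epoch)."""
    initial_lrate = 0.1  # same starting rate as the other examples
    k = 0.1  # hypothetical decay constant; tune by experiment
    return initial_lrate * math.exp(-k * epoch)


rates = [exp_decay(epoch) for epoch in range(50)]
print(rates[0], rates[10], rates[49])
```

A function like this could be passed to `LearningRateScheduler` in place of `step_decay`; a suitable value of `k` would need to be found by experiment.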
