
keras ReduceLROnPlateau, is this a bug? #10924

Closed

naomifridman opened this issue Aug 17, 2018 · 6 comments

@naomifridman

I am training a Keras sequential model, and I want the learning rate to be reduced when training is not progressing.

I use the ReduceLROnPlateau callback.

After the first 2 epochs without progress, the learning rate is reduced as expected. But then it is reduced every 2 epochs, causing the training to stop progressing.

Is that a Keras bug, or am I using the function the wrong way?

The code:

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

earlystopper = EarlyStopping(patience=8, verbose=1)
checkpointer = ModelCheckpoint(filepath='model_zero7.{epoch:02d}-{val_loss:.6f}.hdf5',
                               verbose=1,
                               save_best_only=True, save_weights_only=True)

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)

history_zero7 = model_zero.fit_generator(bach_gen_only1,
                                         validation_data=(v_im, v_lb),
                                         steps_per_epoch=25, epochs=100,
                                         callbacks=[earlystopper, checkpointer, reduce_lr])

The output:

Epoch 00006: val_loss did not improve from 0.68605
Epoch 7/100
25/25 [==============================] - 213s 9s/step - loss: 0.6873 - binary_crossentropy: 0.0797 - dice_coef_loss: -0.8224 - jaccard_distance_loss_flat: 0.2998 - val_loss: 0.6865 - val_binary_crossentropy: 0.0668 - val_dice_coef_loss: -0.8513 - val_jaccard_distance_loss_flat: 0.2578

Epoch 00007: val_loss did not improve from 0.68605

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.000200000009499.
Epoch 8/100
25/25 [==============================] - 214s 9s/step - loss: 0.6865 - binary_crossentropy: 0.0648 - dice_coef_loss: -0.8547 - jaccard_distance_loss_flat: 0.2528 - val_loss: 0.6860 - val_binary_crossentropy: 0.0694 - val_dice_coef_loss: -0.8575 - val_jaccard_distance_loss_flat: 0.2485

Epoch 00008: val_loss improved from 0.68605 to 0.68598, saving model to model_zero7.08-0.685983.hdf5
Epoch 9/100
25/25 [==============================] - 208s 8s/step - loss: 0.6868 - binary_crossentropy: 0.0624 - dice_coef_loss: -0.8554 - jaccard_distance_loss_flat: 0.2518 - val_loss: 0.6860 - val_binary_crossentropy: 0.0746 - val_dice_coef_loss: -0.8527 - val_jaccard_distance_loss_flat: 0.2557

Epoch 00009: val_loss improved from 0.68598 to 0.68598, saving model to model_zero7.09-0.685982.hdf5

Epoch 00009: ReduceLROnPlateau reducing learning rate to 4.00000018999e-05.
Epoch 10/100
25/25 [==============================] - 211s 8s/step - loss: 0.6865 - binary_crossentropy: 0.0640 - dice_coef_loss: -0.8570 - jaccard_distance_loss_flat: 0.2493 - val_loss: 0.6859 - val_binary_crossentropy: 0.0630 - val_dice_coef_loss: -0.8688 - val_jaccard_distance_loss_flat: 0.2311

Epoch 00010: val_loss improved from 0.68598 to 0.68589, saving model to model_zero7.10-0.685890.hdf5
Epoch 11/100
25/25 [==============================] - 211s 8s/step - loss: 0.6869 - binary_crossentropy: 0.0610 - dice_coef_loss: -0.8580 - jaccard_distance_loss_flat: 0.2480 - val_loss: 0.6859 - val_binary_crossentropy: 0.0681 - val_dice_coef_loss: -0.8616 - val_jaccard_distance_loss_flat: 0.2422

Epoch 00011: val_loss improved from 0.68589 to 0.68589, saving model to model_zero7.11-0.685885.hdf5
Epoch 12/100
25/25 [==============================] - 210s 8s/step - loss: 0.6866 - binary_crossentropy: 0.0575 - dice_coef_loss: -0.8612 - jaccard_distance_loss_flat: 0.2426 - val_loss: 0.6858 - val_binary_crossentropy: 0.0636 - val_dice_coef_loss: -0.8679 - val_jaccard_distance_loss_flat: 0.2325

Epoch 00012: val_loss improved from 0.68589 to 0.68585, saving model to model_zero7.12-0.685847.hdf5

Epoch 00012: ReduceLROnPlateau reducing learning rate to 8.0000005255e-06.
@AndreGuerra123 commented Aug 18, 2018

Look at this mock example:

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
import numpy

# fix random seed for reproducibility
numpy.random.seed(7)

# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
earlystopper = EarlyStopping(monitor='loss',
                              min_delta=0.1,
                              patience=8,
                              verbose=1, mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=2, min_delta=0.04, verbose=1)

model.fit(X, Y, epochs=100, batch_size=5, callbacks=[earlystopper, reduce_lr])

From https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

The way I see this working is:
earlystopper callback: if the train loss does NOT improve by at least 0.1 within 8 epochs, stop the training.
reduce_lr callback: if the train loss does NOT improve by at least 0.04 for 2 epochs, reduce the learning rate by 80% (factor=0.2); see the sketch below.
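
For reference, here is a rough sketch of how I understand the plateau logic inside ReduceLROnPlateau (simplified, 'min' mode only, cooldown omitted; this is my reading of the behaviour, not the actual Keras source):

# Simplified sketch of ReduceLROnPlateau's decision rule (my reading, not the Keras source).
# An epoch only resets the patience counter if the monitored value improves by MORE than min_delta.
class PlateauSketch:
    def __init__(self, factor=0.2, patience=2, min_delta=1e-4, min_lr=1e-6):
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float('inf')   # 'min' mode: lower is better
        self.wait = 0              # epochs since the last "real" improvement

    def on_epoch_end(self, monitored, lr):
        if monitored < self.best - self.min_delta:
            self.best = monitored
            self.wait = 0                      # real improvement: reset the counter
        else:
            self.wait += 1                     # tiny or no improvement: counter keeps growing
            if self.wait >= self.patience:
                lr = max(lr * self.factor, self.min_lr)
                self.wait = 0
        return lr

If that reading is right, an improvement smaller than min_delta still counts as a plateau epoch, which would explain why the learning rate can keep dropping even while ModelCheckpoint keeps reporting (very small) val_loss improvements.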

The common approach I've seen people take is to make the stopping callback use patience=0 and a small min_delta. PS: I don't know if it was on purpose, but you are not monitoring the same values in both callbacks. Hope I helped.
Here is my output:

Epoch 5/100
768/768 [==============================] - 0s 172us/step - loss: 0.6750 - acc: 0.6758
Epoch 6/100
768/768 [==============================] - 0s 190us/step - loss: 0.6553 - acc: 0.6745
Epoch 7/100
768/768 [==============================] - 0s 233us/step - loss: 0.6389 - acc: 0.7096

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 8/100
768/768 [==============================] - 0s 292us/step - loss: 0.6112 - acc: 0.6953
Epoch 9/100
768/768 [==============================] - 0s 230us/step - loss: 0.5997 - acc: 0.7044
Epoch 10/100
768/768 [==============================] - 0s 186us/step - loss: 0.6004 - acc: 0.6953

Epoch 00010: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.
Epoch 11/100
768/768 [==============================] - 0s 172us/step - loss: 0.5929 - acc: 0.7122
Epoch 12/100
768/768 [==============================] - 0s 181us/step - loss: 0.5920 - acc: 0.7018

Epoch 00012: ReduceLROnPlateau reducing learning rate to 8.000000525498762e-06.
Epoch 13/100
768/768 [==============================] - 0s 173us/step - loss: 0.5909 - acc: 0.7227
Epoch 14/100
768/768 [==============================] - 0s 193us/step - loss: 0.5903 - acc: 0.7279

Epoch 00014: ReduceLROnPlateau reducing learning rate to 1.6000001778593287e-06.
Epoch 15/100
768/768 [==============================] - 0s 211us/step - loss: 0.5893 - acc: 0.7266
Epoch 16/100
768/768 [==============================] - 0s 168us/step - loss: 0.5893 - acc: 0.7266

Epoch 00016: ReduceLROnPlateau reducing learning rate to 3.200000264769187e-07.
Epoch 17/100
768/768 [==============================] - 0s 184us/step - loss: 0.5892 - acc: 0.7240
Epoch 18/100
768/768 [==============================] - 0s 171us/step - loss: 0.5892 - acc: 0.7240

Epoch 00018: ReduceLROnPlateau reducing learning rate to 6.400000529538374e-08.
Epoch 19/100
768/768 [==============================] - 0s 186us/step - loss: 0.5892 - acc: 0.7240
Epoch 00019: early stopping
768/768 [==============================] - 0s 53us/step

acc: 72.40%

@naomifridman (Author)

In your example, the learning rate is reduced when the loss is NOT improving. In my example, it is reduced even when val_loss is improving.
Why do you think I am monitoring different values? I thought I monitor val_loss in all the callbacks.
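
For what it's worth, EarlyStopping and ModelCheckpoint both default to monitor='val_loss', so with the monitors written out explicitly my setup should be equivalent to:

# Same callbacks as above, with the monitored quantity written out explicitly
# (EarlyStopping and ModelCheckpoint already default to monitor='val_loss')
earlystopper = EarlyStopping(monitor='val_loss', patience=8, verbose=1)
checkpointer = ModelCheckpoint(filepath='model_zero7.{epoch:02d}-{val_loss:.6f}.hdf5',
                               monitor='val_loss', verbose=1,
                               save_best_only=True, save_weights_only=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)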

@picagrad commented Sep 5, 2018

I've looked into this as well. In my case it seems that after the initial patience period has passed,
ReduceLROnPlateau is triggered every "patience" epochs regardless of loss.

I've seen a similar issue on this, but it was marked closed sometime around 2016. It seems the problem is still here...

These are some of the outputs of my code (not attaching everything because I'm using a patience of 25 (yes, I know that's really high), so it would be a lot of text, but this shows the gist of it):

Epoch 00173: ReduceLROnPlateau reducing learning rate to 0.0006249999860301614.

Epoch 00190: val_loss improved from 0.00341 to 0.00341, saving model to ../data/comparison/2018_09_05_1532/Meta_inds_[1]_weights.hdf5

Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.0003124999930150807.

Epoch 00213: val_loss improved from 0.00340 to 0.00340, saving model to ../data/comparison/2018_09_05_1532/Meta_inds_[1]_weights.hdf5

Epoch 00223: ReduceLROnPlateau reducing learning rate to 0.00015624999650754035.

Notice that ReduceLROnPlateau reduces the learning rate every 25 epochs (the patience parameter), regardless of improvement in the loss.

@soumendra

Any progress on this?

@melaanya

Try setting mode='min' explicitly.
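
For example, something like this (keeping the other arguments from the original snippet):

# Explicitly set mode='min' so that "improvement" means a decrease in val_loss
reduce_lr = ReduceLROnPlateau(monitor='val_loss', mode='min', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)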

@naomifridman (Author)

I'm having a similar issue again. Probably the numbers of epochs without improvement are summed together, even if they are not consecutive.
