
keras ReduceLROnPlateau, is this a bug? #10924

Closed

naomifridman opened this issue Aug 17, 2018 · 6 comments

@naomifridman

I am training a Keras sequential model, and I want the learning rate to be reduced when training is not progressing.

I use the ReduceLROnPlateau callback.

After the first 2 epochs without progress, the learning rate is reduced as expected. But then it is reduced every 2 epochs, causing the training to stop progressing.

Is that a Keras bug, or am I using the function the wrong way?

The code:

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

earlystopper = EarlyStopping(patience=8, verbose=1)
checkpointer = ModelCheckpoint(filepath='model_zero7.{epoch:02d}-{val_loss:.6f}.hdf5',
                               verbose=1,
                               save_best_only=True, save_weights_only=True)

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)

history_zero7 = model_zero.fit_generator(bach_gen_only1,
                                         validation_data=(v_im, v_lb),
                                         steps_per_epoch=25, epochs=100,
                                         callbacks=[earlystopper, checkpointer, reduce_lr])

The output:

Epoch 00006: val_loss did not improve from 0.68605
Epoch 7/100
25/25 [==============================] - 213s 9s/step - loss: 0.6873 - binary_crossentropy: 0.0797 - dice_coef_loss: -0.8224 - jaccard_distance_loss_flat: 0.2998 - val_loss: 0.6865 - val_binary_crossentropy: 0.0668 - val_dice_coef_loss: -0.8513 - val_jaccard_distance_loss_flat: 0.2578

Epoch 00007: val_loss did not improve from 0.68605

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.000200000009499.
Epoch 8/100
25/25 [==============================] - 214s 9s/step - loss: 0.6865 - binary_crossentropy: 0.0648 - dice_coef_loss: -0.8547 - jaccard_distance_loss_flat: 0.2528 - val_loss: 0.6860 - val_binary_crossentropy: 0.0694 - val_dice_coef_loss: -0.8575 - val_jaccard_distance_loss_flat: 0.2485

Epoch 00008: val_loss improved from 0.68605 to 0.68598, saving model to model_zero7.08-0.685983.hdf5
Epoch 9/100
25/25 [==============================] - 208s 8s/step - loss: 0.6868 - binary_crossentropy: 0.0624 - dice_coef_loss: -0.8554 - jaccard_distance_loss_flat: 0.2518 - val_loss: 0.6860 - val_binary_crossentropy: 0.0746 - val_dice_coef_loss: -0.8527 - val_jaccard_distance_loss_flat: 0.2557

Epoch 00009: val_loss improved from 0.68598 to 0.68598, saving model to model_zero7.09-0.685982.hdf5

Epoch 00009: ReduceLROnPlateau reducing learning rate to 4.00000018999e-05.
Epoch 10/100
25/25 [==============================] - 211s 8s/step - loss: 0.6865 - binary_crossentropy: 0.0640 - dice_coef_loss: -0.8570 - jaccard_distance_loss_flat: 0.2493 - val_loss: 0.6859 - val_binary_crossentropy: 0.0630 - val_dice_coef_loss: -0.8688 - val_jaccard_distance_loss_flat: 0.2311

Epoch 00010: val_loss improved from 0.68598 to 0.68589, saving model to model_zero7.10-0.685890.hdf5
Epoch 11/100
25/25 [==============================] - 211s 8s/step - loss: 0.6869 - binary_crossentropy: 0.0610 - dice_coef_loss: -0.8580 - jaccard_distance_loss_flat: 0.2480 - val_loss: 0.6859 - val_binary_crossentropy: 0.0681 - val_dice_coef_loss: -0.8616 - val_jaccard_distance_loss_flat: 0.2422

Epoch 00011: val_loss improved from 0.68589 to 0.68589, saving model to model_zero7.11-0.685885.hdf5
Epoch 12/100
25/25 [==============================] - 210s 8s/step - loss: 0.6866 - binary_crossentropy: 0.0575 - dice_coef_loss: -0.8612 - jaccard_distance_loss_flat: 0.2426 - val_loss: 0.6858 - val_binary_crossentropy: 0.0636 - val_dice_coef_loss: -0.8679 - val_jaccard_distance_loss_flat: 0.2325

Epoch 00012: val_loss improved from 0.68589 to 0.68585, saving model to model_zero7.12-0.685847.hdf5

Epoch 00012: ReduceLROnPlateau reducing learning rate to 8.0000005255e-06.
@AndreGuerra123 commented Aug 18, 2018

Look at this mock example:

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
import numpy

# fix random seed for reproducibility
numpy.random.seed(7)

# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
earlystopper = EarlyStopping(monitor='loss',
                              min_delta=0.1,
                              patience=8,
                              verbose=1, mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=2, min_delta=0.04, verbose=1)

model.fit(X, Y, epochs=100, batch_size=5, callbacks=[earlystopper, reduce_lr])

From https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

The way I see this working is:
earlystopper callback: if the train loss does NOT improve by at least 0.1 within 8 epochs, stop the training.
reduce_lr callback: if the train loss does NOT improve by at least 0.04 for 2 epochs, reduce the learning rate by 80% (factor=0.2); see the sketch below.
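
For reference, here is a rough sketch of how I understand the plateau logic inside ReduceLROnPlateau (simplified, 'min' mode only, cooldown omitted; this is my reading of the behaviour, not the actual Keras source):

# Simplified sketch of ReduceLROnPlateau's decision rule (my reading, not the Keras source).
# An epoch only resets the patience counter if the monitored value improves by MORE than min_delta.
class PlateauSketch:
    def __init__(self, factor=0.2, patience=2, min_delta=1e-4, min_lr=1e-6):
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float('inf')   # 'min' mode: lower is better
        self.wait = 0              # epochs since the last "real" improvement

    def on_epoch_end(self, monitored, lr):
        if monitored < self.best - self.min_delta:
            self.best = monitored
            self.wait = 0                      # real improvement: reset the counter
        else:
            self.wait += 1                     # tiny or no improvement: counter keeps growing
            if self.wait >= self.patience:
                lr = max(lr * self.factor, self.min_lr)
                self.wait = 0
        return lr

If that reading is right, an improvement smaller than min_delta still counts as a plateau epoch, which would explain why the learning rate can keep dropping even while ModelCheckpoint keeps reporting (very small) val_loss improvements.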

The common approach I've seen people take is to make the stopping callback use patience=0 and a small min_delta. PS: I don't know if it was on purpose, but you are not monitoring the same values in both callbacks. Hope I helped.
Here is my output:

Epoch 5/100
768/768 [==============================] - 0s 172us/step - loss: 0.6750 - acc: 0.6758
Epoch 6/100
768/768 [==============================] - 0s 190us/step - loss: 0.6553 - acc: 0.6745
Epoch 7/100
768/768 [==============================] - 0s 233us/step - loss: 0.6389 - acc: 0.7096

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 8/100
768/768 [==============================] - 0s 292us/step - loss: 0.6112 - acc: 0.6953
Epoch 9/100
768/768 [==============================] - 0s 230us/step - loss: 0.5997 - acc: 0.7044
Epoch 10/100
768/768 [==============================] - 0s 186us/step - loss: 0.6004 - acc: 0.6953

Epoch 00010: ReduceLROnPlateau reducing learning rate to 4.0000001899898055e-05.
Epoch 11/100
768/768 [==============================] - 0s 172us/step - loss: 0.5929 - acc: 0.7122
Epoch 12/100
768/768 [==============================] - 0s 181us/step - loss: 0.5920 - acc: 0.7018

Epoch 00012: ReduceLROnPlateau reducing learning rate to 8.000000525498762e-06.
Epoch 13/100
768/768 [==============================] - 0s 173us/step - loss: 0.5909 - acc: 0.7227
Epoch 14/100
768/768 [==============================] - 0s 193us/step - loss: 0.5903 - acc: 0.7279

Epoch 00014: ReduceLROnPlateau reducing learning rate to 1.6000001778593287e-06.
Epoch 15/100
768/768 [==============================] - 0s 211us/step - loss: 0.5893 - acc: 0.7266
Epoch 16/100
768/768 [==============================] - 0s 168us/step - loss: 0.5893 - acc: 0.7266

Epoch 00016: ReduceLROnPlateau reducing learning rate to 3.200000264769187e-07.
Epoch 17/100
768/768 [==============================] - 0s 184us/step - loss: 0.5892 - acc: 0.7240
Epoch 18/100
768/768 [==============================] - 0s 171us/step - loss: 0.5892 - acc: 0.7240

Epoch 00018: ReduceLROnPlateau reducing learning rate to 6.400000529538374e-08.
Epoch 19/100
768/768 [==============================] - 0s 186us/step - loss: 0.5892 - acc: 0.7240
Epoch 00019: early stopping
768/768 [==============================] - 0s 53us/step

acc: 72.40%

@naomifridman (Author)

In your example, the learning rate is reduced when the loss is NOT improving. In my example, it is reduced even when val_loss is improving.
Why do you think I am monitoring different values? I thought I monitor val_loss in all the callbacks.
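
For what it's worth, EarlyStopping and ModelCheckpoint both default to monitor='val_loss', so with the monitors written out explicitly my setup should be equivalent to:

# Same callbacks as above, with the monitored quantity written out explicitly
# (EarlyStopping and ModelCheckpoint already default to monitor='val_loss')
earlystopper = EarlyStopping(monitor='val_loss', patience=8, verbose=1)
checkpointer = ModelCheckpoint(filepath='model_zero7.{epoch:02d}-{val_loss:.6f}.hdf5',
                               monitor='val_loss', verbose=1,
                               save_best_only=True, save_weights_only=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)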

@picagrad commented Sep 5, 2018

I've looked into this as well. In my case it seems that after the initial patience period has passed,
ReduceLROnPlateau is triggered every "patience" epochs regardless of loss.

I've seen a similar issue on this, but it was marked closed sometime around 2016. It seems the problem is still here...

These are some of the outputs of my code (not attaching everything because I'm using a patience of 25 (yes, I know that's really high), so it would be a lot of text, but this shows the gist of it):

Epoch 00173: ReduceLROnPlateau reducing learning rate to 0.0006249999860301614.

Epoch 00190: val_loss improved from 0.00341 to 0.00341, saving model to ../data/comparison/2018_09_05_1532/Meta_inds_[1]_weights.hdf5

Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.0003124999930150807.

Epoch 00213: val_loss improved from 0.00340 to 0.00340, saving model to ../data/comparison/2018_09_05_1532/Meta_inds_[1]_weights.hdf5

Epoch 00223: ReduceLROnPlateau reducing learning rate to 0.00015624999650754035.

Notice that ReduceLROnPlateau reduces the learning rate every 25 epochs (the patience parameter), regardless of improvement in the loss.

@soumendra

Any progress on this?

@melaanya

Try setting mode='min' explicitly.
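
For example, something like this (keeping the other arguments from the original snippet):

# Explicitly set mode='min' so that "improvement" means a decrease in val_loss
reduce_lr = ReduceLROnPlateau(monitor='val_loss', mode='min', factor=0.2,
                              patience=2, min_lr=0.000001, verbose=1)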

@naomifridman (Author)

I'm having a similar issue again. Probably the numbers of epochs without improvement are summed together, even if they are not consecutive.
