
model.evaluate() gives a different loss on training data from the one in training process #6977

Closed
alanwang93 opened this issue Jun 13, 2017 · 67 comments

@alanwang93

I'm implementing a CNN model. When I have just a few layers, it works well. When I tried a deeper network, I could achieve high performance (a small loss reported during the training process) on the training data, but when I use model.evaluate() on the training data, I get poor performance (a much greater loss). I wonder why this happens, since the evaluation is run on the same training data.

Here is what I got:

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import (Conv2D, Activation, BatchNormalization, MaxPooling2D,
                          Dropout, GlobalAveragePooling2D, Dense)
from keras.optimizers import Adam

input_shape = (X.shape[1], X.shape[2], 1)
model = Sequential()

y = [label2id[l] for l in labels.reshape(-1)]
y = keras.utils.to_categorical(y)

model.add(Conv2D(32, (5, 5), strides=(2,2), input_shape=input_shape))
model.add(Activation('relu'))
model.add(BatchNormalization())


model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))

model.add(Conv2D(512, (1, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))

model.add(Conv2D(15, (1, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())


model.add(GlobalAveragePooling2D())

model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(15, activation='softmax'))

opt = Adam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

model.fit(np.expand_dims(X, axis=3), y, batch_size=200, epochs=15, validation_data=(np.expand_dims(X_val,3), y_val))

The log during training:

Train on 582 samples, validate on 290 samples
Epoch 1/15
582/582 [==============================] - 14s - loss: 2.6431 - acc: 0.1821 - val_loss: 2.6653 - val_acc: 0.0759
Epoch 2/15
582/582 [==============================] - 12s - loss: 2.3759 - acc: 0.3832 - val_loss: 3.9411 - val_acc: 0.0655
Epoch 3/15
582/582 [==============================] - 13s - loss: 2.0834 - acc: 0.4141 - val_loss: 7.2338 - val_acc: 0.0655
Epoch 4/15
582/582 [==============================] - 13s - loss: 1.8380 - acc: 0.5120 - val_loss: 9.4135 - val_acc: 0.0655
Epoch 5/15
582/582 [==============================] - 13s - loss: 1.6002 - acc: 0.5550 - val_loss: 10.0389 - val_acc: 0.0655
Epoch 6/15
582/582 [==============================] - 13s - loss: 1.3725 - acc: 0.6117 - val_loss: 11.0042 - val_acc: 0.0759
Epoch 7/15
582/582 [==============================] - 13s - loss: 1.1924 - acc: 0.6443 - val_loss: 10.2766 - val_acc: 0.0862
Epoch 8/15
582/582 [==============================] - 13s - loss: 1.0529 - acc: 0.6993 - val_loss: 9.2593 - val_acc: 0.0862
Epoch 9/15
582/582 [==============================] - 13s - loss: 0.9137 - acc: 0.7491 - val_loss: 9.9668 - val_acc: 0.0897
Epoch 10/15
582/582 [==============================] - 13s - loss: 0.7928 - acc: 0.7784 - val_loss: 9.4821 - val_acc: 0.0966
Epoch 11/15
582/582 [==============================] - 13s - loss: 0.6885 - acc: 0.8179 - val_loss: 8.7342 - val_acc: 0.1000
Epoch 12/15
582/582 [==============================] - 12s - loss: 0.6094 - acc: 0.8213 - val_loss: 8.5325 - val_acc: 0.1207
Epoch 13/15
582/582 [==============================] - 12s - loss: 0.5345 - acc: 0.8488 - val_loss: 7.9924 - val_acc: 0.1207
Epoch 14/15
582/582 [==============================] - 12s - loss: 0.4800 - acc: 0.8643 - val_loss: 7.8522 - val_acc: 0.1000
Epoch 15/15
582/582 [==============================] - 12s - loss: 0.4357 - acc: 0.8660 - val_loss: 7.1004 - val_acc: 0.1172

When I evaluate on training data:

score = model.evaluate(np.expand_dims(X, axis=3), y, batch_size=32)
print score
576/582 [============================>.] - ETA: 0s[7.6189327469396426, 0.10309278350515463]

On validation data

score = model.evaluate(np.expand_dims(X_val, axis=3), y_val, batch_size=32)
print score
288/290 [============================>.] - ETA: 0s[7.1004119609964302, 0.11724137931034483]

Could someone help me? Thanks a lot.

@ouzan19

ouzan19 commented Jun 17, 2017

Same problem happens for me...

@danielS91
Contributor

danielS91 commented Jun 17, 2017

It's due to the dropout layers. During the training phase, neurons are randomly dropped; in contrast, during prediction all neurons remain in the network structure. So it's quite likely that the results will differ.
You can see it directly from the results for the validation data: they are equal, because both results are generated in the same way.

Edit: The batch normalization layers also influence the results.
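
A quick way to see this effect is to compare the model's output in the two phases (a minimal sketch, not code from this issue, assuming a single-input Keras model named model and a NumPy batch x are already in scope; it uses the learning-phase pattern from the Keras backend docs):

import numpy as np
import keras.backend as K

# Build a function that takes the learning phase as an extra input.
predict_fn = K.function([model.input, K.learning_phase()], [model.output])

out_train_mode = predict_fn([x, 1])[0]  # phase 1: dropout active, batch norm uses batch statistics
out_test_mode = predict_fn([x, 0])[0]   # phase 0: dropout off, batch norm uses moving statistics

print(np.abs(out_train_mode - out_test_mode).max())  # non-zero when dropout/batch norm are present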

@danielS91
Contributor

Regarding the large gap between the two losses: it looks like your model structure does not fit the problem well.

@ouzan19

ouzan19 commented Jun 17, 2017

Even without dropout layers and batch normalization, the same issue persists for me. I don't agree that the problem is caused by the model structure, because the training and the test data are the same.

@danielS91
Contributor

How large is the difference in your case?
The two loss values will not match exactly, because during training the network parameters change from batch to batch and Keras reports the mean loss over all batches...
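
One way to make the two numbers comparable (a minimal sketch, assuming NumPy arrays x_train and y_train; not code from this thread) is to re-evaluate the training data with fixed weights at the end of every epoch and compare that to the averaged loss Keras logs:

from keras.callbacks import LambdaCallback

# Recompute the training loss with fixed weights after each epoch, so it is
# directly comparable to model.evaluate() instead of being averaged over
# batches that were computed while the weights were still changing.
eval_on_train = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(
        'epoch %d: evaluate() on training data -> %s'
        % (epoch, model.evaluate(x_train, y_train, verbose=0))
    )
)

model.fit(x_train, y_train, epochs=15, batch_size=200, callbacks=[eval_on_train])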

@ouzan19

ouzan19 commented Jun 17, 2017

I use only one batch. In training, final loss (mse) is 0.045. Evaluating with training data gives 1.14

@danielS91
Contributor

That's strange. Did you try to use a different dataset?
Can you provide some code to reproduce the problem? (small public dataset would be great)

@bzhong2

bzhong2 commented Jun 24, 2017

#6895 I have a similar problem and even tried it with a public data set. I was doing fine-tuning.

@fraztto

fraztto commented Jul 17, 2017

I had an issue like this one, and the solution for me was very simple. I was evaluating on the training data and the accuracy was quite different from the one reported while training. It turned out that when evaluating I had swapped the dims of the input images: height was width and width was height (silly me).

@ouzan19

ouzan19 commented Dec 1, 2017

Hi guys,

Other than dropout, batch norm also causes the same problem. I suspect this is caused by the fact that the number of samples used in batch norm after the activation is 200 (the batch size) at training time, whereas it is only 1 at test time. This causes a different normalization and a different loss.

What are your thoughts?

@renato145

#6895 Yes, I just encountered that problem with Resnet50.

@BrianHuf

BrianHuf commented Feb 12, 2018

I'm running into the same problem. When I create learning curves from fit metrics, train and test look unrealistically different.

As an experiment, I tried calculating my own metrics.

from keras.callbacks import Callback

class SecondOpinion(Callback):
    def __init__(self, model, x_train, y_train, x_test, y_test):
        super(SecondOpinion, self).__init__()
        self.model = model
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test

    def on_epoch_end(self, epoch, logs={}):
        y_train_pred = self.model.predict(self.x_train)
        y_test_pred = self.model.predict(self.x_test)

        mse_train = ((y_train_pred - self.y_train) ** 2).mean()
        mse_test = ((y_test_pred - self.y_test) ** 2).mean()

        print("\n                                             Second Opinion loss: %5.4f - val_loss: %5.4f" % (mse_train, mse_test))

...

model.compile(
	loss='mean_squared_error',
	optimizer=adam
)

second_opinion = SecondOpinion(model, data.x_train, data.y_train, data.x_test, data.y_test)

model.fit(
	x=data.x_train,
	y=data.y_train,
	validation_data=(data.x_test, data.y_test),
	batch_size=200,
	epochs=200,
	callbacks=[second_opinion]
)

With batch normalization and dropout included, the training loss is very different (~3x). The validation losses are different, but not substantially.

Epoch 1/200
7200/7255 [============================>.] - ETA: 0s - loss: 208810.7629
                                             Second Opinion loss: 147483.0938 - val_loss: 164947.0781
7255/7255 [==============================] - 59s 8ms/step - loss: 207874.9320 - val_loss: 140131.2018

Epoch 2/200
7200/7255 [============================>.] - ETA: 0s - loss: 57029.7061
                                             Second Opinion loss: 128558.4609 - val_loss: 142726.4375
7255/7255 [==============================] - 55s 8ms/step - loss: 57108.7740 - val_loss: 135797.0371

Epoch 3/200
7200/7255 [============================>.] - ETA: 0s - loss: 49392.7298
                                             Second Opinion loss: 154096.3281 - val_loss: 173001.8438
7255/7255 [==============================] - 55s 8ms/step - loss: 49370.2950 - val_loss: 151737.2370

With batch normalization and dropout removed, the loss is somewhat different and val_loss matches:

Epoch 1/200
7200/7255 [============================>.] - ETA: 0s - loss: 1691567.5816
                                             Second Opinion loss: 592996.7500 - val_loss: 631589.8125
7255/7255 [==============================] - 35s 5ms/step - loss: 1682561.1545 - val_loss: 631589.8356

Epoch 2/200
7200/7255 [============================>.] - ETA: 0s - loss: 557553.0530
                                             Second Opinion loss: 503776.0000 - val_loss: 539686.3750
7255/7255 [==============================] - 32s 4ms/step - loss: 557585.9540 - val_loss: 539686.4883

Epoch 3/200
7200/7255 [============================>.] - ETA: 0s - loss: 434417.9800
                                             Second Opinion loss: 353186.8750 - val_loss: 383728.2500
7255/7255 [==============================] - 32s 4ms/step - loss: 434553.5198 - val_loss: 383728.2623

I'm not schooled enough to know if these differences are intentional by Keras or not. Anyone?

@mikowals

I am new to Keras, so maybe this is expected behaviour, but I can't find it documented in .fit() or .evaluate() that .fit() must be run first.

model.evaluate consistently gets a wrong result if run after loading saved weights. Running model.fit, even training for one step with a learning rate of 0, fixes the evaluation results, even though the weights should not be changed.

Loading weights from the file again after model.fit() will cause the problem with model.evaluate() to reoccur.

Result of initial evaluate:

10000/10000 [==============================] - 27s 3ms/step
Loss: 14.154, Accuracy: 0.114

Now train one step:

Train on 256 samples, validate on 256 samples
Epoch 1/1
256/256 [==============================] - 46s 178ms/step - loss: 0.7133 - acc: 0.7930 - val_loss: 0.7081 - val_acc: 0.7930

Now run same evaluate call again:

10000/10000 [==============================] - 24s 2ms/step
Loss: 0.659, Accuracy: 0.798

The code to produce this is:

from __future__ import print_function
import keras

batch_size = 256
img_rows, img_cols, img_channels = 32, 32, 3

(_, _), (x_test, y_test) = keras.datasets.cifar10.load_data()

x_test = x_test.astype('float32') / 255.
y_test = keras.utils.to_categorical(y_test, num_classes=10)

model = keras.applications.nasnet.NASNetMobile(
    input_shape=(img_rows, img_cols, img_channels),
    weights=None,
    classes=10
)
model.load_weights('weights.81-0.35-0.873.txt', by_name=True)
optimizer = keras.optimizers.SGD(lr=0.000, momentum=0.0, clipnorm=5)
model.compile(loss=['categorical_crossentropy'],
          optimizer=optimizer, metrics=['accuracy'])

def eval():
    metrics = model.evaluate(
        x=x_test,
        y=y_test,
        batch_size=batch_size,
        verbose=1,
        sample_weight=None
    )
    print ('Loss: {:.3f}, Accuracy: {:.3f}'.format(metrics[0], metrics[1]))

eval()
model.fit(x=x_test[:batch_size,...], y=y_test[:batch_size,...], batch_size=batch_size, epochs=1,     validation_data=(x_test[:batch_size,...], y_test[:batch_size,...]))
eval()

I am using Keras (2.1.4) installed by pip on macOS 10.13.4. This version of Keras prints a ton of deprecation warnings (from TensorFlow, I think) which I have omitted from the output for clarity, but if you see them it is not a problem with the code.

weights.81-0.35-0.873.txt

@BrianHuf

I'm still in over my head here, but here's how things appear to me. Can anyone confirm I'm on the right track?

This is all tied to learning_phase (see https://keras.io/backend/) and loss/metric estimation based on batches.

Dropout is only active when the learning_phase is set to training; otherwise it is ignored. It's unclear to me whether BatchNormalization is still active when learning_phase is set to test.

Batching presumes each batch can represent the entire data set. If the data is heavily skewed or batches aren't well randomized, I can imagine this will magnify the differences between losses from fit vs. predict.

It seems to me that learning curves are more correct when losses and metrics are evaluated with learning_phase set to test and applied across all batches. I can imagine this is not done during fit because it would be computationally expensive.

@lorenzoriano

I'm seeing the same problem

@Deepu14

Deepu14 commented Apr 4, 2018

I have the same problem.
Epoch 28/30 5760/5760 [==============================] - 4s 641us/step - loss: 0.0166 - acc: 0.9934 - val_loss: 0.0299 - val_acc: 0.9891

Epoch 29/30 5760/5760 [==============================] - 4s 644us/step - loss: 0.0163 - acc: 0.9932 - val_loss: 0.0296 - val_acc: 0.9875

Epoch 30/30 5760/5760 [==============================] - 4s 641us/step - loss: 0.0165 - acc: 0.9925 - val_loss: 0.0318 - val_acc: 0.9875

Evaluating on test data: `1712/1712 [==============================] - 0s 236us/step $loss [1] 0.329597

$acc [1] 0.9281542`

There is a huge difference between train-validation loss and test loss.

@emerygoossens

emerygoossens commented Apr 4, 2018

I am having the same issue. I train a model, save the weights, and load the model. The resulting evaluation call gives results that change each time.

@azmathmoosa

azmathmoosa commented May 14, 2018

I, too, have the same issue. I was training a DenseNet121 with all layers frozen except the last 1 or 2.

Epoch 00032: val_acc did not improve from 0.29563
Epoch 33/90
154/154 [==============================] - 148s 963ms/step - loss: 0.1546 - acc: 0.9538 - val_loss: 6.4297 - val_acc: 0.2246

Epoch 00033: val_acc did not improve from 0.29563
Epoch 34/90
154/154 [==============================] - 148s 963ms/step - loss: 0.1416 - acc: 0.9573 - val_loss: 6.1487 - val_acc: 0.2423

Epoch 00034: val_acc did not improve from 0.29563
Epoch 35/90
154/154 [==============================] - 147s 955ms/step - loss: 0.1415 - acc: 0.9556 - val_loss: 6.6624 - val_acc: 0.2016

Epoch 00035: val_acc did not improve from 0.29563
Epoch 36/90
154/154 [==============================] - 147s 957ms/step - loss: 0.1457 - acc: 0.9545 - val_loss: 5.9998 - val_acc: 0.2548

Epoch 00036: val_acc did not improve from 0.29563
Epoch 00036: early stopping
154/154 [==============================] - 191s 1s/step
Final Training loss: 6.1547
Training accuracy:  0.2037

I ran evaluate() on the training data itself, and the validation data between each epoch is also the training data! Yet the difference is huge.

I'm planning to drop Keras and move to TF.

@raghavab1992

I am facing the same issue while trying to fine-tune inception_v3. I added two Dense layers and set all other Inception layers to trainable=False. So, without any dropout layers, I am getting completely different metrics for the training data during training and evaluation!
Epoch 1/25
35/35 [==============================] - 24s 693ms/step - loss: 2.1526 - categorical_accuracy: 0.2010 - val_loss: 12.1775 - val_categorical_accuracy: 0.0993
Epoch 2/25
35/35 [==============================] - 19s 557ms/step - loss: 1.8757 - categorical_accuracy: 0.3301 - val_loss: 12.5643 - val_categorical_accuracy: 0.1066
Epoch 3/25
35/35 [==============================] - 19s 533ms/step - loss: 1.6845 - categorical_accuracy: 0.4497 - val_loss: 12.5669 - val_categorical_accuracy: 0.1176

print(model.metrics_names, model.evaluate_generator(train_gen), model.evaluate_generator(val_gen))
['loss', 'categorical_accuracy'] [12.482194125054637, 0.0966271650022339] [12.378837978138643, 0.10294117647058823]

As none of the Inception layers are being trained, the batch norm layers should use the default mean and std dev, and hence shouldn't give different results in the training and evaluation phases!
Any idea why this is happening?
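
One workaround that later became the standard tf.keras transfer-learning recipe (a sketch under the assumption that the model can be rebuilt; the head layers and class count below are illustrative, not taken from this issue) is to call the frozen base with training=False, so the batch norm layers stay in inference mode even during fit():

import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False, pooling='avg', weights='imagenet')
base.trainable = False                       # freeze all Inception layers

inputs = tf.keras.Input(shape=(299, 299, 3))
x = base(inputs, training=False)             # batch norm uses its moving statistics even in fit()
x = tf.keras.layers.Dense(256, activation='relu')(x)
outputs = tf.keras.layers.Dense(10, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])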

@ub216

ub216 commented Jun 22, 2018

Has anyone solved this? I'm having the same issue: model.evaluate gives completely different results compared to model.fit (with the learning rate set to zero). I don't use a dropout layer. I tried playing with the batch norm layers' "trainable" parameter but got similar performance.

@shunjiangxu

I am having the same problem as well. In my case, I am trying to reuse the pre-trained Keras ResNet50 model and add my own last few layers. I get very large differences between .fit and .evaluate using the same training data. When I look at the prediction results on the training data, it's clear that .evaluate gives the right loss/accuracy. Does anyone have any ideas? I don't believe the batch norm/dropout layers are the reason here. Below are my differences:
From .fit:
Epoch 1/1
657/657 [==============================] - 327s 498ms/step - loss: 0.1465 - acc: 0.9691

From evaluate with the same training data:
657/657 [==============================] - 356s 542ms/step
[2.496475939699867, 0.4247050990252734]

@j0bby

j0bby commented Jun 26, 2018

Hello everyone,

Here is the official Keras answer to this question:
https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss

Even without dropout or batch normalization, the problem will persist. The reason is that when you use fit, the weights are updated at each batch of the training data. The loss value returned by the fit method is not the loss of the final model, but the mean of the losses of all the slightly different models used on each batch.
On the other hand, when you use evaluate, the same model is used on the whole dataset. And that final model doesn't even appear in the loss reported by the fit method, since even at the last batch of training, the loss that is computed is then used to update the model's weights.

To sum up, fit and evaluate have two completely different behaviours, and comparing their outputs doesn't make much sense!

@ub216

ub216 commented Jun 26, 2018

Hey j0bby, thanks for your reply.
The link you referred to describes expected behavior when taken over the whole epoch. However, I see this discrepancy even when testing on a single batch of the data! Moreover, I also tried setting "loss_weights" to zero so as to have zero gradients, and model.fit() still gives different (better) performance compared to model.evaluate().
Furthermore, if you notice from shunjiangxu's post, model.fit() is doing better than model.evaluate(), not worse as explained in your link.

@j0bby

j0bby commented Jun 26, 2018

Hello @ub216,
May I ask what your model is? If you have some sort of regularizer, your gradient is not 0.
My model does include dropout and no regularizer. It has 3 outputs on which I compute the loss as well. When I set loss_weights to 0.0, after one epoch on 1 batch, the overall loss returned is 0.0 (as expected), while the loss computed for each output is greater than 0.0. However, the validation and training losses are different, as expected because of the dropout. Finally, some training losses are greater than the validation losses and some are lower.

Here is how to get the same output from fit and evaluate:

  • model.fit(x_train, y_train, validation_data=(x_train, y_train))
  • model.evaluate(x_train, y_train)
    Then the metrics on the validation set from the fit method are equal to the ones from the evaluate method (a short sketch of this recipe follows below). The training and validation losses are still different (validation is better), even though the dataset is the same, as explained in the link.
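
In code, that recipe looks roughly like this (a sketch assuming x_train/y_train arrays; the point is that the val_* numbers from fit and the numbers from evaluate are both computed with fixed weights in inference mode, so they should agree, while the training loss logged by fit will not):

history = model.fit(
    x_train, y_train,
    batch_size=32,
    epochs=5,
    validation_data=(x_train, y_train),   # validate on the training data itself
)

print('val_loss from fit: ', history.history['val_loss'][-1])
print('evaluate() on train:', model.evaluate(x_train, y_train, verbose=0))
# The val_loss above should match the loss returned by evaluate(); the training
# loss logged by fit() will generally not.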

About @shunjiangxu's results: the two methods return different results, as expected. However, in that case the evaluate method, which is expected to give better results, is performing worse. This can have different explanations depending on the hyperparameters of the model and the training.

@mikowals

Hi @j0bby,

Thanks for trying to get to the bottom of this. If you run your code with the order of commands reversed do you still get matching results? Like this:

  • model.evaluate(x_train,y_train)
  • model.fit(x_train, y_train, validation_data=(x_train, y_train))

In my example above from the 19th of Feb, running fit() first works as you say, but why does fit() need to be run first? Is this behaviour documented somewhere?

@ub216

ub216 commented Jun 27, 2018

Hey @j0bby
Thanks again for your prompt reply. I don't have dropout or regularizers. My model has three outputs and the total loss is a weighted sum of the three. For debugging purposes I have set their weights to zero. When I run:

model.fit(x_train, [y_train1, y_train2, y_train3], validation_data=(x_train, [y_train1, y_train2, y_train3]))

I get:

Epoch 1/1 12/12 [==============================] - 1s 77ms/step - loss: 0.0000e+00 - out1_loss: 2.8200 - out2_loss: 0.3365 - out3_loss: 1.8442 - out1_categorical_cross_entropy4d_split: 0.0660 - out2_mean_squared_error: 0.1878 - out3_categorical_accuracy: 0.2500 - val_loss: 0.0000e+00 - val_out1_loss: 2.3867 - val_out2_loss: 0.3041 - val_out3_loss: 0.7214 - val_out1_categorical_cross_entropy4d_split: 0.0337 - val_out2_mean_squared_error: 0.1578 - val_out3_categorical_accuracy: 0.0000e+00

I checked the weights before and after executing the command, but they haven't changed! Any ideas/pointers on why this discrepancy occurs?

@mikowals I had the same issue with this model as well. I'm trying to figure this out first but maybe they are related.

@shunjiangxu

Thanks for following this up. I am trying to dig out the reason for this. In my case, evaluate gave a much, much worse [loss, accuracy] result than .fit. I am trying to use a VGG16 model instead of ResNet50.
@ub216 if you initialize all the weights to 0, then they will always stay at 0 during training due to the symmetry-breaking issue, is that right?

@ub216

ub216 commented Jun 27, 2018

@shunjiangxu
By "weights" I mean the weights for the weighted loss not the NN parameters, sorry for the confusion.

@shunjiangxu

@ub216 All right, sorry, I did not understand correctly. I was searching for a Keras callback function/parameter to print out the .fit output but can't seem to find one. The only way seems to be to run .evaluate in on_batch_end/on_epoch_end, but that is not really what .fit has calculated. Does anyone know if a callback can get the .fit 'prediction' output?

@SimonZhao777

Has anyone solved the problem yet? I'm facing the same problem here...
I'm using Keras version 2.2.4 and TensorFlow version 1.5.0.
I tried to print the results every 10 training epochs, and the results of model.evaluate, model.predict, and model.test_on_batch are all consistent with each other, but none of them match the training-phase results, even when I used the same training data for all of them.

here are the results:

epoch=281, loss=16.09882 max_margin_loss=15.543743 ortho_loss=0.5550766
epoch=282, loss=15.947226 max_margin_loss=15.379948 ortho_loss=0.5672779
epoch=283, loss=15.848539 max_margin_loss=15.284585 ortho_loss=0.56395435
epoch=284, loss=15.519976 max_margin_loss=14.971162 ortho_loss=0.5488138
epoch=285, loss=14.816533 max_margin_loss=14.289791 ortho_loss=0.526742
epoch=286, loss=14.412685 max_margin_loss=13.907438 ortho_loss=0.5052471
epoch=287, loss=14.295979 max_margin_loss=13.805334 ortho_loss=0.49064445
epoch=288, loss=14.7037945 max_margin_loss=14.220262 ortho_loss=0.4835329
epoch=289, loss=14.691599 max_margin_loss=14.213996 ortho_loss=0.47760296
epoch=290, loss=14.596203 max_margin_loss=14.125141 ortho_loss=0.4710617
model.evaluate========
train_loss=[20.45014190673828, 19.984760284423828]
val_loss=[20.450117111206055, 19.984760284423828]
test_loss=[20.450183868408203, 19.984760284423828]
model.predict========
prediction_result len=2708
[[19.999733]
[19.963854]
[20.013517]
...
[20.03875 ]
[20.024363]
[20.024124]]
model.test_on_batch========
test_train_batch_result=[20.450142, 19.98476]
test_val_batch_result=[20.450117, 19.98476]
test_test_batch_result=[20.450184, 19.98476]

@andrey999333

It might be the problem I have described in this Stack Overflow post:
https://stackoverflow.com/questions/51123198/strange-behaviour-of-the-loss-function-in-keras-model-with-pretrained-convoluti

@SimonZhao777

SimonZhao777 commented Mar 14, 2020

Hey guys, I found an easy solution which works at least in my case (my model has a Dropout layer but no BatchNormalization layer), thanks to OverLordGoldDragon in the links here and here

The easy fix for me is to set the Keras learning phase to 0 before building and initializing my model. Here is a demo:

import keras.backend as K

K.set_learning_phase(0)

# ... then the model building and compiling code ...

Now the four results (model.train_on_batch, model.evaluate, model.predict, model.test_on_batch) are all as expected.

Below is the experiment output:
epoch=882, loss=8.4112625 max_margin_loss=7.6551723 ortho_loss=0.75609016
epoch=883, loss=8.406249 max_margin_loss=7.6501327 ortho_loss=0.7561164
epoch=884, loss=8.400357 max_margin_loss=7.644247 ortho_loss=0.7561102
epoch=885, loss=8.395483 max_margin_loss=7.639352 ortho_loss=0.7561312
epoch=886, loss=8.398947 max_margin_loss=7.642764 ortho_loss=0.7561827
epoch=887, loss=8.394142 max_margin_loss=7.6379457 ortho_loss=0.7561965
epoch=888, loss=8.387917 max_margin_loss=7.63174 ortho_loss=0.7561765
epoch=889, loss=8.383256 max_margin_loss=7.6270676 ortho_loss=0.7561884
epoch=890, loss=8.386976 max_margin_loss=7.6307592 ortho_loss=0.756217
model.evaluate========
train_loss=[8.382329940795898, 7.62611198425293]
val_loss=[8.38233470916748, 7.62611198425293]
test_loss=[8.382333755493164, 7.62611198425293]
model.predict========
prediction_result len=2708
[[11.143183 ]
[ 2.2248592]
[ 4.534893 ]
...
[ 7.269316 ]
[ 9.213724 ]
[ 5.3815193]]
model.test_on_batch========
test_train_batch_result=[8.38233, 7.626112]
test_val_batch_result=[8.382335, 7.626112]
test_test_batch_result=[8.382334, 7.626112]

@liangsun-ponyai

@Osdel Why does your change fix your problem?

@Echosanmao

The same problem happens for me...
Hello! Have you resolved this issue?
Could you tell me something about it? Thank you!!

@fire717

fire717 commented Apr 22, 2020

I just removed the dropout and have no problem any more.

@Nikeshbajaj

Nikeshbajaj commented Apr 26, 2020

Looking into callbacks, I think model.evaluate returns the final loss and accuracy, while the verbose output prints the average loss and accuracy over the epoch as stored in logs; see https://www.tensorflow.org/guide/keras/custom_callback#usage_of_logs_dict

def on_epoch_end(self, epoch, logs=None):
    print('The average loss for epoch {} is {:7.2f} and mean absolute error is {:7.2f}.'.format(epoch, logs['loss'], logs['mae']))

If we need the final performance on the training set, we could use a callback to store it (a sketch follows below), which is slow, but I don't know if there is any option to force the log to record the final performance rather than the average.
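
A sketch of such a callback (assuming x_train/y_train are in scope; the class name FinalEpochMetrics is made up here): it evaluates the full training set with fixed weights after every epoch and stores the results in the logs dict, so they end up in the History object next to the averaged values:

from tensorflow.keras.callbacks import Callback

class FinalEpochMetrics(Callback):
    # Evaluate on the full training set after each epoch and add the results to logs.
    def __init__(self, x_train, y_train):
        super().__init__()
        self.x_train = x_train
        self.y_train = y_train

    def on_epoch_end(self, epoch, logs=None):
        logs = logs if logs is not None else {}
        results = self.model.evaluate(self.x_train, self.y_train, verbose=0)
        if not isinstance(results, list):
            results = [results]               # evaluate() returns a scalar when there are no extra metrics
        for name, value in zip(self.model.metrics_names, results):
            logs['final_' + name] = value     # e.g. final_loss, final_acc in history.history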

@Yoskutik

Same problem. I do not have dropout layers, but there are a few batch normalizations. The difference between the validation accuracy during training and after evaluating is around 15%.

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D, Dense,
                                     GlobalAvgPool2D, concatenate)
from tensorflow.keras.metrics import top_k_categorical_accuracy
from tensorflow.keras.models import Model

target_size = (160, 160)
batch_size = 50

def conv2d(filters, kernel_size, prev_layer, activation='relu'):
    layer = Conv2D(filters, kernel_size, padding='same')(prev_layer)
    layer = BatchNormalization()(layer)
    layer = Activation(activation)(layer)
    return layer

def inception_block(n_filters, prev_layer, compress_to=None):
    layer_conv11 = conv2d(n_filters, (1, 1), prev_layer)
    if compress_to is None:
        layer_conv33 = conv2d(n_filters, (3, 3), prev_layer)
        layer_conv55 = conv2d(n_filters, (5, 5), prev_layer)
    else:
        layer_conv33 = conv2d(n_filters, (3, 3), conv2d(compress_to, (1, 1), prev_layer))
        layer_conv55 = conv2d(n_filters, (5, 5), conv2d(compress_to, (1, 1), prev_layer))
    return concatenate([
        layer_conv11, layer_conv33, layer_conv55
    ])

def top_2_acc(x, y):
    return top_k_categorical_accuracy(x, y, 2)

base_model = MobileNetV2(
    include_top=False, 
    input_shape=(*target_size, 3)
)

layer = inception_block(256, base_model.output, compress_to=128)
layer = inception_block(512, layer, compress_to=128)
layer = concatenate([
    GlobalAvgPool2D()(layer),
    GlobalAvgPool2D()(base_model.output),
]) 
layer = Dense(256, activation='relu')(layer)
layer = BatchNormalization()(layer)
out = Dense(120, activation='softmax')(layer)

model = Model(inputs=base_model.input, outputs=out)

I was saving the best model while training. So, the best epoch was:

Epoch 43/150
340/340 [==============================] - 158s 466ms/step - loss: 0.0034 - acc: 0.9998 - top_2_acc: 1.0000 - val_loss: 0.9116 - val_acc: 0.7999 - val_top_2_acc: 0.8962

But the evaluation on validation subset is:

37/37 [==============================] - 3s 83ms/step - loss: 0.2102 - acc: 0.9486 - top_2_acc: 0.9773

Even if there is an influence of batch normalisation, is it okay to have that much improvement?

P.S. The test accuracy is also around 95%, so I think this number is pretty representative. I'm just confused by the difference.

@kechan

kechan commented Oct 17, 2020

I just hit this problem with sparse_categorical_accuracy. I believe that whatever is reported during training is totally wrong. I ran model.evaluate on the same train_ds and obtained an answer that agrees with y_pred = model.predict(...), and then explicitly computed the metrics from y_pred and y.

For my case, the sparse_categorical_accuracy during training is way better than it should be.

Looks like this issue has been open for a long time... not sure if anyone knows the answer.

@jan0410

jan0410 commented Oct 20, 2020

Hi everyone - if you use different batch_size for .fit() and .evaluate(), you will get a different loss.
For example, model.fit(X,y,batch_size=32,...) will approximately have half the error of model.evaluate(X,y,batch_size=64,...)

@kechan

kechan commented Oct 27, 2020

I came to understand my specific case: it has to do with the presence of batch norm layers, which can lead to a different "prediction" during training vs. evaluation. For a simple model, if I remove any batch norm and use full-batch GD, the training-set metrics are exactly the same during training and evaluation.

@marziehoghbaie

@emerygoossens
I am having the same issue. I trained my model and the validation accuracy is 70% during training. I save the model and reload it from a checkpoint, but evaluating the model on the same validation dataset gives me 10% accuracy. It is kind of weird, because it sounds like the prediction is random.
I do not use any batch normalization.

@kasrahabib

kasrahabib commented Dec 2, 2020

It is due to dropout.
Make sure to evaluate training loss (after training) without dropout.

@mgroth0

mgroth0 commented Jan 12, 2021

Hi, I'm experiencing an issue with batch normalization layers wondering if anyone has a bit of helpful insight.

My issue is described in full on stack overflow here. In that post there's also demo code you can run and a complete dataset you can download. There have been no direct answers yet, but one commenter was able to confirm that the issue is caused by batch normalization layers.

I'm not completely sure that this issue is connected to my issue, but I have been suspecting it.

@janpfeifer

I had a similar problem with BatchNormalization, and after lots of investigation, and the help of a friend, he pointed out that the issue can be caused by the moving-average momentum in BatchNormalization -- typically set in Keras Applications to 0.999 -- which leaves only a very small update rate for the moving statistics. So the evaluation would use garbage means/variances (while training uses the in-batch mean/variance).

The fix was to introspect into the model and change the momentum of all BatchNormalization layers -- see the example code in the Stack Overflow post mentioned in the comment above, and the sketch below.

A nice fix would be for the BatchNormalization layer to use a plain mean up to a certain number of examples, and only afterwards start using the moving average.
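
For reference, the layer-introspection idea looks roughly like this (a sketch in tf.keras; whether mutating momentum takes effect immediately or requires rebuilding/recompiling the model depends on the Keras version, so treat it as the shape of the fix rather than a drop-in):

import tensorflow as tf

def lower_bn_momentum(model, momentum=0.9):
    # Reduce the moving-average momentum of every BatchNormalization layer,
    # so the moving mean/variance adapt faster during training.
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.momentum = momentum
        elif isinstance(layer, tf.keras.Model):   # Keras Applications bases are often nested
            lower_bn_momentum(layer, momentum)

lower_bn_momentum(model)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])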

@alexander-soare

alexander-soare commented Jun 25, 2021

EDIT - If you've got this far on this thread, don't be stubborn/lazy like me (pre-edit below). Just do the batch norm thing. It cleaned everything up for me.

The code is here. Thanks @mgroth0


What a journey this ticket has been. Regular PyTorch user here, trying to retrain a model in TensorFlow. I find this totally bizarre, as I've trained this model 100 times in PyTorch. I use the same datagen for train and val, and this is what the progbar printout looks like:

61/61 [==============================] - 23s 295ms/step - loss: 6.3648 - categorical_accuracy: 0.0994 - lr: 3.1000e-09 - val_loss: 5.7572 - val_categorical_accuracy: 0.0954 - val_lr: 6.1000e-09
Epoch 2/10
61/61 [==============================] - 19s 294ms/step - loss: 4.8986 - categorical_accuracy: 0.2354 - lr: 1.1250e-04 - val_loss: 2.6929 - val_categorical_accuracy: 0.5301 - val_lr: 2.9672e-04
Epoch 3/10
61/61 [==============================] - 19s 298ms/step - loss: 0.5306 - categorical_accuracy: 0.8373 - lr: 2.9216e-04 - val_loss: 1.7681 - val_categorical_accuracy: 0.6370 - val_lr: 2.8780e-04
Epoch 4/10
61/61 [==============================] - 19s 293ms/step - loss: 0.2582 - categorical_accuracy: 0.9253 - lr: 2.8338e-04 - val_loss: 1.1328 - val_categorical_accuracy: 0.7145 - val_lr: 2.7915e-04
Epoch 5/10
61/61 [==============================] - 19s 292ms/step - loss: 0.1742 - categorical_accuracy: 0.9502 - lr: 2.7487e-04 - val_loss: 0.8952 - val_categorical_accuracy: 0.7417 - val_lr: 2.7077e-04
Epoch 6/10
61/61 [==============================] - 19s 297ms/step - loss: 0.1396 - categorical_accuracy: 0.9568 - lr: 2.6661e-04 - val_loss: 0.7403 - val_categorical_accuracy: 0.7638 - val_lr: 2.6263e-04
Epoch 7/10
61/61 [==============================] - 21s 304ms/step - loss: 0.1107 - categorical_accuracy: 0.9664 - lr: 2.5860e-04 - val_loss: 0.8310 - val_categorical_accuracy: 0.7219 - val_lr: 2.5474e-04
Epoch 8/10
61/61 [==============================] - 20s 319ms/step - loss: 0.0981 - categorical_accuracy: 0.9701 - lr: 2.5083e-04 - val_loss: 1.0080 - val_categorical_accuracy: 0.6621 - val_lr: 2.4708e-04
Epoch 9/10
61/61 [==============================] - 19s 296ms/step - loss: 0.0930 - categorical_accuracy: 0.9712 - lr: 2.4329e-04 - val_loss: 1.3283 - val_categorical_accuracy: 0.5703 - val_lr: 2.3966e-04
Epoch 10/10
61/61 [==============================] - 20s 292ms/step - loss: 0.0815 - categorical_accuracy: 0.9751 - lr: 2.3598e-04 - val_loss: 1.6118 - val_categorical_accuracy: 0.4788 - val_lr: 2.3246e-04

Somehow my val accuracy is way off from my train accuracy and takes a nosedive after a few epochs...

I would look into the batch norm thing, but it seems so wrong that it can't be right.

@Raverss

Raverss commented Oct 25, 2021

Issue: model.evaluate(val_data) gives an abysmal result compared to the val_accuracy reported during model.fit().

Description: I have custom image data divided into train/val/test sets. For loading the data, I'm using the tensorflow datasets library (with Albumentations transformations). I'm using an EfficientNet model from tensorflow.keras.applications. During training, I'm able to reach around 82% accuracy on the validation split. After training, when I load the model with load_model(), I get extremely poor performance from evaluate(): 4% accuracy vs 82% accuracy.

Solution: Load model from a checkpoint with load_model() AND THEN load the weights from the same checkpoint, i.e.

model = load_model(model_chckpt)
model.load_weights(model_chckpt)

After this, evaluate() returns sensible values (I didn't check whether they are identical to the training-time ones).

@gitpeblo

gitpeblo commented Jan 4, 2022

Issue: The output layer for the last epoch of model.fit(...) is different from the one generated by model.predict(..) when applied to the training data, but exactly the same when applied to the validation data.

Testing: I observed that model.fit(..) produces a different output even when passing the same data both for the actual training and the validation (I will call these data "X_trainval"). Basically, I call model.fit(..) as:
model.fit(X_trainval, X_trainval ..., validation_data=(X_trainval,X_trainval))
NOTE: I am working on an autoencoder, hence I have (X_trainval,X_trainval) twice instead of (X_trainval,y_trainval)

What I obtain is that the output for what model.fit(..) thinks are the training data is different from that of what model.fit(..) thinks are the validation data (while they are in fact both the same data, i.e. X_trainval).

However, the separate output for the validation is identical to a post-fit application of model.predict(..) to X_trainval. Same goes for the loss. That suggests that what is used for the validation in model.fit(...) stays frozen for model.predict(..).

Solution: This is most probably due to a last backpropagation step between the last training pass and the last validation pass inside model.fit(...). The output layer (and loss) for the validation data is calculated using a model whose weights went through one more update than they had for the training data.

It has nothing to do with BackPropagation: I tried both using it or not.

@whitetechnologies

whitetechnologies commented Aug 22, 2022

I'm not sure why this is closed when people are still having issues. Here is a good example:

https://datascience.stackexchange.com/questions/113627/lstm-sequential-val-loss-train-loss-on-same-dataset

We all understand that model.fit() works slightly differently from model.evaluate() because of how it updates weights between batches. But there still should be no reason why model.evaluate would produce a training loss (for example, MAE) that is an order of magnitude higher on the same X_train dataset.

@Raverss that solution unfortunately did not work. I'm not sure why it did for you, but loading the model from a checkpoint or from a file created by model.save always seems to work fine for model.fit. In other words, continuing the training from a loaded model resumes the training from very close to the last training loss, regardless of how the model is loaded. This suggests that your case was a bit unique, and the issue is not with loading weights.

@alexander-soare this problem occurs even when there are NO batch normalization layers. So your proposed solution makes sense, but how do you change the behavior of a layer that's not even in the model???

@gitpeblo Same observations as you. The result of model.fit's validation function equals model.evaluate (this is expected, since that is how it's designed to work in TF/Keras). Further, model.predict() results are always equal to model.evaluate(). Whatever is going on is the same in those two functions, but different in model.fit's training function.

@BrianHuf It does seem that setting keras.backend.set_learning_phase(1) should force model.evaluate() to behave like model.fit, but that doesn't work either...

Has anyone successfully resolved this?

@yuraSomatic

yuraSomatic commented Dec 13, 2022

Try changing your data scaling based on your specific activation functions.
For example, [-1; 1] -> [0; 1] if you're using ReLU. (The orange curve is after rescaling; no dropout, batch norm momentum 0.0001.)
[image omitted]

@sonukiller

Hi, if you are using tf.keras.preprocessing.image.ImageDataGenerator for loading the data, try shuffle=False for test data! By default shuffle=True
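
For example (a sketch; the directory path and image size are placeholders, and passing a generator straight to model.evaluate assumes a recent tf.keras). A fixed order mostly matters when you line up model.predict output against generator.classes or the filenames:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

test_gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/test',               # hypothetical directory
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    shuffle=False,             # default is True; keep a fixed order for evaluation
)

print(model.evaluate(test_gen))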

@Raverss

Raverss commented Apr 9, 2023

Hi, if you are using tf.keras.preprocessing.image.ImageDataGenerator for loading the data, try shuffle=False for test data! By default shuffle=True

Turning shuffling on or off shouldn't change the loss or accuracy. I agree that turning shuffling off for the test data is a sensible thing to do.
