
Evaluate_generator produces wrong accuracy scores? #6499

Closed
skoch9 opened this issue May 4, 2017 · 34 comments

Comments

@skoch9

skoch9 commented May 4, 2017

Hello, I'm running a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (Keras 2.0.3/TensorFlow on Ubuntu with GPU). It looks like the following:

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))

model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])

With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 (80%) classes correctly, whereas evaluate_generator reports an accuracy of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that, I get 365/800 = 45% from the manual count and 89% from evaluate_generator.

Is there something wrong with my evaluation, or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the stated accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced shed some light on this problem? Thanks.

@joeyearsley
Contributor

Having looked into the backend, I believe this is due to the number of workers: if you have more than one worker, there is nothing to ensure consistent file loading across them. It is therefore possible that some files are shown multiple times in these methods, because the 12 generators are randomly initialised but don't actually share a file-list state.

Maybe @fchollet or @farizrahman4u can confirm?

@skoch9
Author

skoch9 commented May 8, 2017

Thanks for your answer. I already considered that the image generators are not thread-safe. However, I was able to reproduce the problem with all three worker counts set to 1. Could it be that the order of the filenames doesn't correspond to the order of the scores?

Recently added similar issues are #6540 and #6544.

@skoch9
Author

skoch9 commented May 8, 2017

Plus, @fchollet mentioned here that the ImageDataGenerator supports multiprocessing.

@avn3r
Contributor

avn3r commented May 15, 2017

@skoch9 try setting pickle_safe=True. As @joeyearsley mentioned, for me it had to do with workers > 1. I am actually running a custom version of Keras where I make evaluate_generator inside fit_generator use workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.

@fchollet Please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until this is fixed.

Make sure:

  1. shuffle = False
  2. pickle_safe = True
  3. workers = 1

Let me know if that gives you consistent results (see the sketch below).
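
For reference, a minimal sketch of those three settings applied to the validation pipeline from the original post, assuming Keras 2.0.x argument names (pickle_safe and max_q_size were later renamed use_multiprocessing and max_queue_size):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # 1. keep the file order fixed

# 2./3. multiprocessing enabled, but only a single worker
score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size,
                                 pickle_safe=True, workers=1)
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size,
                                 pickle_safe=True, workers=1)

Note that later comments in this thread point out that reusing the same generator instance for both calls can still desynchronize the predictions from the file order when the sample count is not a multiple of the batch size; recreating or resetting the generator between the calls avoids that.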

@skoch9
Author

skoch9 commented May 16, 2017

Thanks @abnerA for your answer. I investigated a little further and found that running evaluate_generator before predict_generator without setting pickle_safe=True messes up the predictions of the latter, even without multiprocessing.

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, pickle_safe=False)
scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size)

As for the parameters, when setting workers >1, shouldn't pickle_safe automatically be set to True also for the fit_generator?

@avn3r
Contributor

avn3r commented May 16, 2017

@skoch9 no. pickle_safe is False by default in all Keras generator methods. The only difference between these two modes is that False uses multithreading and True uses multiprocessing.

@skoch9
Author

skoch9 commented May 16, 2017

Sure, but it doesn't make sense to allow invalid parameter configurations. And I would also consider it problematic (i.e. a bug) that running evaluate_generator before predict_generator changes the prediction results.

@romainVala

Hello,
I made a test based on the mnist_cnn.py example.
I just changed model.fit to

gen = matrix_generator(x_train, y_train, batch_size=batch_size)
val_steps = len(y_test) / batch_size

model.fit_generator(gen, epochs=epochs, steps_per_epoch=steps_per_epoch, validation_data=(x_test, y_test))

with the generator defined to take sub-parts of the input matrix:

def matrix_generator(x, y, batch_size=1, validation=False):
    import numpy as np
    item = 0
    tot_item, dimx, dimy, dimz = x.shape

    X = np.zeros((batch_size, dimx, dimy, dimz), dtype=np.float32)

    if len(y.shape) == 1:
        Y = np.zeros((batch_size))
    else:
        Y = np.zeros((batch_size, y.shape[1]))

    while True:
        for bs in range(batch_size):
            ximg = x[item, :, :, :]

            yout = y[item, :]
            yout = yout.reshape(1, yout.shape[0])
            Y[bs, :] = yout

            X[bs, :, :, :] = ximg

            item += 1
            if item > tot_item - 1:
                item = 0

        if validation:
            yield X
        else:
            yield X, Y
Now if I test the model

model.evaluate(x_test, y_test, verbose=0)
Out[18]: [0.16519621040821075, 0.96389999999999998]

but
gen_test = matrix_generator(x_test,y_test,batch_size=batch_size)
model.evaluate_generator(gen_test,len(y_test)/batch_size,workers=1)
Out[17]: [0.31920816290837067, 0.93760016025641024]

Same learned parameters, same test data, but different results.

What should I do?
Many thanks for your help.

@romainVala

To follow up: I get the correct result only if I set max_q_size=1.

If I only set workers=1 it does not work either (and gives different results each time).
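
A minimal sketch of that working configuration, assuming the mnist_cnn.py setup and the matrix_generator from the previous comment (Keras 2.0.x argument names; max_q_size later became max_queue_size):

gen_test = matrix_generator(x_test, y_test, batch_size=batch_size)
score = model.evaluate_generator(gen_test,
                                 steps=len(y_test) // batch_size,
                                 max_q_size=1,  # the queue holds only one batch at a time
                                 workers=1)
print(score)  # [loss, accuracy], which should now match model.evaluate(x_test, y_test)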

@ghost

ghost commented Oct 10, 2017

Hi,
I am having a similar problem with a binary classifier that uses 2 outputs. model.evaluate_generator() and model.predict() both suggest that I have 50% accuracy (chance) and a loss of over 1.0, while model.fit_generator() always gives me a loss of under 0.5 and accuracy of over 80%.

for x in range(10000):
    print("starting epoch {0}".format(x))
    mg_acc = mg_model.evaluate_generator(mg_gen, 2, 
        max_queue_size=1, workers=1, use_multiprocessing=False)
    mg_model.fit_generator(mg_gen, 2, epochs=1, callbacks=mg_callbacks, 
        max_queue_size=1, workers=1, use_multiprocessing=False, verbose=1)
    print(mg_acc)
    print(mg_model.metrics_names)

It does not matter if use_multiprocessing is True or False, or the order of the calls (fit_generator before evaluate_generator or vice-versa), or if one call is commented out.

I know that this code will not produce identical generator output (each call advances the generator), but the output is similar between iterations, and evaluate_generator is always the one that does poorly, regardless of order.

EDIT: SOLVED: My accuracy difference was eliminated when I removed all batch normalization layers (I also removed the dropout layers, but I don't think those were the cause).

@GalAvineri

GalAvineri commented Dec 4, 2017

I too have a big difference between the reported fit_generator results and the later evaluate_generator results.

I've looked into this a bit and found the following:

  1. When I use evaluate_generator with a generator that does not shuffle the data, I get results that are
    very different from those reported by fit_generator.
    But when I use evaluate_generator with a generator that does shuffle the data, I get results that are
    similar to those reported by fit_generator.

  2. When I use evaluate (without any generators), the output is exactly the same as evaluate_generator
    without shuffling.

  3. When I use model.predict and compute the metrics manually, I get the same results as those
    reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?

A note:
I use a very simple model - just one dense layer,
with no dropouts or batch normalizations to create any doubts as mentioned by @jeremydr2.

@jnygaard

@GalAvineri: Very interesting... I just came across your post after posting my own frustrations in #5818. I will try adding shuffling to my evaluate_generator to see if it makes any difference and whether I get the same results as you do, although I cannot see any sense in shuffling during model evaluation...

@csandmann

I had a similar problem using fit_generator with multiprocessing under Linux:
During training the loss was falling rapidly, with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again.
It turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator happened outside of the multiprocessing part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).

Maybe this saves time for some of you.

[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
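
A minimal sketch of that kind of fix, assuming a keras.utils.Sequence-style generator (a hypothetical class, not csandmann's actual code) that shuffles internally with np.random:

import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):
    # Hypothetical generator that draws random batches with np.random.

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Re-seed from OS entropy on every call so that workers created by
        # fork() stop sharing the np.random state inherited from the parent.
        np.random.seed()
        indices = np.random.choice(len(self.x), self.batch_size, replace=False)
        return self.x[indices], self.y[indices]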

@xing89qs

In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is the default.
I found that predict_generator(validation_generator) then returns predictions in shuffled order, while validation_generator.classes and validation_generator.filenames keep the original order... So an accuracy calculated from the predict_generator output against them can be wrong.
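
A sketch of that point using the generators from the original post: the manually computed accuracy only lines up with validation_generator.classes / .filenames when the generator is created with shuffle=False (assuming here that every validation sample is seen exactly once, i.e. the batch size divides the number of samples):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # the default shuffle=True breaks the alignment below

probs = model.predict_generator(validation_generator,
                                nb_validation_samples // batch_size)
preds = (probs[:, 0] > 0.5).astype(int)
labels = validation_generator.classes[:len(preds)]
print("manual accuracy:", (preds == labels).mean())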

@keshavunni

keshavunni commented Feb 19, 2018

I also ran into a similar issue where fit_generator gives an accuracy of 98% and a loss of 0.08. evaluate_generator gives the same accuracy if I use rescale=1. / 255, but if I don't, I get an accuracy of 50% and a loss of 7.9. predict_generator always gives only 50% accuracy, whether or not I use rescale=1. / 255. What should I do?

@jeremydr2 When you removed the batch normalization and dropout, what were your accuracy and loss? Were they still 80% and 0.5? Because for me they fell to 50%.

@GalAvineri I have a similar issue, but I don't think it is about shuffling; for me it comes from the rescale=1. / 255 in ImageDataGenerator.

@hokmund

hokmund commented Apr 20, 2018

This issue is reproduced regularly while using fit_generator / evaluate_generator, and it seems pretty critical since it makes fit_generator output during training completely useless.

@ubershmekel

My problem was using loss='binary_crossentropy' instead of loss='categorical_crossentropy'. This caused my reported accuracy to be 96% before I even started training on my 15 different classes.

Just noting that for future googlers; it might not be relevant to this thread's topic specifically. It might be worth warning against this during evaluation in Keras when a model has 2+ outputs and the accuracy metric is binary.

https://stackoverflow.com/questions/42081257/keras-binary-crossentropy-vs-categorical-crossentropy-performance
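
To illustrate the pitfall with a hypothetical 15-class example: with one-hot labels and loss='binary_crossentropy', Keras reports the element-wise binary accuracy over all 15 outputs, which starts out high (roughly 14/15 ≈ 93% when every output rounds to 0) even for an untrained model. A mutually exclusive multi-class problem should instead be compiled along these lines (the input size of 128 is arbitrary):

from keras.models import Sequential
from keras.layers import Dense

num_classes = 15   # mirrors the 15-class case described above

model = Sequential()
model.add(Dense(num_classes, activation='softmax', input_shape=(128,)))
model.compile(loss='categorical_crossentropy',   # not 'binary_crossentropy'
              optimizer='sgd',
              metrics=['accuracy'])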

@klday

klday commented Sep 13, 2018

@csandmann I'm not sure I understand your solution. When you say "the resolution was to seed np.random in __getitem__(self, idx)", could you explain how you did this a little more thoroughly? How do you seed np.random in __getitem__(self, idx)?

@ghost

ghost commented Dec 25, 2018

Hey guys, I'm having similar issues with predict_generator.

When I'm training, I get around 92% train accuracy and 80% val_acc, but when I make predictions and put them through a confusion matrix, the accuracy drops to 50%. Any updates on this?

@maglkp

maglkp commented May 13, 2019

Hi, same here - manually computed accuracy on this Kaggle competition https://www.kaggle.com/c/aerial-cactus-identification gives me 50-65% (binary classification problem), while predict_generator gives roughly 95%. I tried the drop_duplicates=True, seed=2019, pickle_safe=True, workers=1 args, and I'm using the correct loss='binary_crossentropy' for the model. I'm happy to give full details, but the overall picture seems to be the same as above.

truth_generator = datagen.flow_from_dataframe(dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus", class_mode="binary", target_size=(32,32), batch_size=200, drop_duplicates=True, seed=2019, pickle_safe = True, workers=1)
predictions = model.predict_generator(truth_generator, steps=20)

@lipinski

lipinski commented Jun 3, 2019

You need to add shuffle=False to flow_from_dataframe.

@adilincepto

Hi,

I encountered the same issue recently, and the solution is actually quite simple.
You use validation_generator twice in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator's internal index is shifted after you use it in model.evaluate_generator, so when you call it again, the generator won't yield the samples in the order you expect.

You should therefore create a second generator to use in model.predict_generator, or only evaluate your model via either evaluate_generator or predict_generator:

validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator2, nb_validation_samples/batch_size, workers=12)

@pcko1

pcko1 commented Jun 17, 2019

I am experiencing a similar problem: my model is trained using fit_generator, saved using model.save, loaded using load_model, and evaluated using evaluate_generator, but its accuracy is similar to that of the untrained model. However, model.predict (without a generator) works adequately well.

Using keras.__version__ = 2.2.4.

@JaeDukSeo

This still seems to be a problem. One simple fix is not to use multiprocessing.

@BruceDai003

I encountered this problem today, and I found the solution:

  1. You must set shuffle=False in your generator.
  2. You need to reset your generator before calling the predict_generator() function.
    For example:
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))
step_size_valid = np.ceil(valid_generator.n / valid_generator.batch_size)
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)

@Dovermore

For me, predict_generator on a custom data generator also produces a different result than evaluate_generator and the verbose output from fit_generator. I didn't shuffle the indices for predict_generator.

@ghasemikasra39

Same issue. Totally confused by the answers above. Any simple solution?

@achbogga

Same issue. Any simple solution from the official authors yet?

@EXJUSTICE

Similar observations to @GalAvineri: when setting shuffle=False, my evaluate_generator accuracy goes up to 92%, which sounds great but is unrealistic. Setting it to True yields much more realistic results. I would appreciate some official guidelines on this - there are others who argue otherwise (@lipinski), which makes it incredibly confusing.

@kkviks

kkviks commented Jul 6, 2020

Same issue. Totally confused by the answers above. Any simple solution?

@jtxtina

jtxtina commented Jul 10, 2020

Same issue.

@jtxtina

jtxtina commented Jul 10, 2020

Ok, I solved the issue.
My code previously:

scores_evaluation = model.evaluate_generator(test_generator.flow(X_test,Y_test, batch_size=32, shuffle=False),len(Y_test)/32)
scores_prediction = model.predict_generator(test_generator.flow(X_test, batch_size=32, shuffle=False),len(Y_test)/32)

However, in flow() shuffle defaults to True, so I added shuffle=False to each flow() call here. That solved the problem.

@ahmedhassen7

Setting shuffle=False just saved my life :D
