
Evaluate_generator produces wrong accuracy scores? #6499

Closed
skoch9 opened this issue May 4, 2017 · 34 comments

Comments

@skoch9

skoch9 commented May 4, 2017

Hello, I'm running a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (Keras 2.0.3/TensorFlow on Ubuntu with GPU). It looks like the following:

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False, input_shape=(img_width, img_height, 3))

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))

model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy', optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])

With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 (80%) classes correctly, whereas evaluate_generator reports an accuracy of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that, I get 365/800 = 45% from the manual count and 89% from evaluate_generator.

Is there something wrong with my evaluation, or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the stated accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced shed some light on this problem? Thanks.

@joeyearsley
Contributor

Having looked into the backend, I believe this is due to the number of workers: if you have more than one worker, there is nothing to ensure consistent file loading across them. It is therefore possible that some files are shown multiple times in these methods, because the 12 generators are randomly initialised but don't actually share a file-list state.

Maybe @fchollet or @farizrahman4u can confirm?

@skoch9
Author

skoch9 commented May 8, 2017

Thanks for your answer. I already considered that the image generators are not thread-safe. However, I was able to reproduce the problem with all three worker counts set to 1. Could it be that the order of the filenames doesn't correspond to the order of the scores?

Recently added similar issues are #6540 and #6544.

@skoch9
Author

skoch9 commented May 8, 2017

Plus, @fchollet mentioned here that the ImageDataGenerator supports multiprocessing.

@avn3r
Contributor

avn3r commented May 15, 2017

@skoch9 try setting pickle_safe=True. As @joeyearsley mentioned, for me it had to do with workers > 1. I am actually running a custom version of Keras where I make evaluate_generator inside fit_generator use workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.

@fchollet Please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until this is fixed.

Make sure:

  1. shuffle = False
  2. pickle_safe = True
  3. workers = 1

Let me know if that gives you consistent results (see the sketch below).
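
For reference, a minimal sketch of those three settings applied to the validation pipeline from the original post, assuming Keras 2.0.x argument names (pickle_safe and max_q_size were later renamed use_multiprocessing and max_queue_size):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # 1. keep the file order fixed

# 2./3. multiprocessing enabled, but only a single worker
score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size,
                                 pickle_safe=True, workers=1)
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size,
                                 pickle_safe=True, workers=1)

Note that later comments in this thread point out that reusing the same generator instance for both calls can still desynchronize the predictions from the file order when the sample count is not a multiple of the batch size; recreating or resetting the generator between the calls avoids that.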

@skoch9
Author

skoch9 commented May 16, 2017

Thanks @abnerA for your answer. I investigated a little further and found that running evaluate_generator before predict_generator without setting pickle_safe=True messes up the predictions of the latter, even without multiprocessing.

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, pickle_safe=False)
scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size)

As for the parameters, when setting workers >1, shouldn't pickle_safe automatically be set to True also for the fit_generator?

@avn3r
Contributor

avn3r commented May 16, 2017

@skoch9 no. pickle_safe is False by default in all Keras generator methods. The only difference between these two modes is that False uses multithreading and True uses multiprocessing.

@skoch9
Author

skoch9 commented May 16, 2017

Sure, but it doesn't make sense to allow invalid parameter configurations. And I would also consider it problematic (i.e. a bug) that running evaluate_generator before predict_generator changes the prediction results.

@romainVala

Hello,
I made a test based on the mnist_cnn.py example.
I just changed model.fit to

gen = matrix_generator(x_train, y_train, batch_size=batch_size)
val_steps = len(y_test) / batch_size

model.fit_generator(gen, epochs=epochs, steps_per_epoch=steps_per_epoch, validation_data=(x_test, y_test))

with the generator defined to take sub-parts of the input matrix:

def matrix_generator(x, y, batch_size=1, validation=False):
    import numpy as np
    item = 0
    tot_item, dimx, dimy, dimz = x.shape

    X = np.zeros((batch_size, dimx, dimy, dimz), dtype=np.float32)

    if len(y.shape) == 1:
        Y = np.zeros((batch_size))
    else:
        Y = np.zeros((batch_size, y.shape[1]))

    while True:
        for bs in range(batch_size):
            ximg = x[item, :, :, :]

            yout = y[item, :]
            yout = yout.reshape(1, yout.shape[0])
            Y[bs, :] = yout

            X[bs, :, :, :] = ximg

            item += 1
            if item > tot_item - 1:
                item = 0

        if validation:
            yield X
        else:
            yield X, Y
Now if I test the model

model.evaluate(x_test, y_test, verbose=0)
Out[18]: [0.16519621040821075, 0.96389999999999998]

but
gen_test = matrix_generator(x_test,y_test,batch_size=batch_size)
model.evaluate_generator(gen_test,len(y_test)/batch_size,workers=1)
Out[17]: [0.31920816290837067, 0.93760016025641024]

Same learned parameters, same test data, but different results.

What should I do?
Many thanks for your help.

@romainVala

To follow up: I get the correct result only if I set max_q_size=1.

If I only set workers=1 it does not work either (and gives different results each time).
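
A minimal sketch of that working configuration, assuming the mnist_cnn.py setup and the matrix_generator from the previous comment (Keras 2.0.x argument names; max_q_size later became max_queue_size):

gen_test = matrix_generator(x_test, y_test, batch_size=batch_size)
score = model.evaluate_generator(gen_test,
                                 steps=len(y_test) // batch_size,
                                 max_q_size=1,  # the queue holds only one batch at a time
                                 workers=1)
print(score)  # [loss, accuracy], which should now match model.evaluate(x_test, y_test)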

@ghost

ghost commented Oct 10, 2017

Hi,
I am having a similar problem with a binary classifier that uses 2 outputs. model.evaluate_generator() and model.predict() both suggest that I have 50% accuracy (chance) and a loss of over 1.0, while model.fit_generator() always gives me a loss of under 0.5 and accuracy of over 80%.

for x in range(10000):
    print("starting epoch {0}".format(x))
    mg_acc = mg_model.evaluate_generator(mg_gen, 2, 
        max_queue_size=1, workers=1, use_multiprocessing=False)
    mg_model.fit_generator(mg_gen, 2, epochs=1, callbacks=mg_callbacks, 
        max_queue_size=1, workers=1, use_multiprocessing=False, verbose=1)
    print(mg_acc)
    print(mg_model.metrics_names)

It does not matter if use_multiprocessing is True or False, or the order of the calls (fit_generator before evaluate_generator or vice-versa), or if one call is commented out.

I know that this code will not produce identical generator output (each call advances the generator), but the output is similar between iterations, and evaluate_generator is always the one that does poorly, regardless of order.

EDIT: SOLVED: My accuracy difference was eliminated when I removed all batch normalization layers (I also removed the dropout layers, but I don't think those were the cause).

@GalAvineri

GalAvineri commented Dec 4, 2017

I too have a big difference between the reported fit_generator results and the later evaluate_generator results.

I've looked into this a bit and found the following:

  1. When I use evaluate_generator with a generator that does not shuffle the data, I get results that are
    very different from those reported by fit_generator.
    But when I use evaluate_generator with a generator that does shuffle the data, I get results that are
    similar to those reported by fit_generator.

  2. When I use evaluate (without any generators), the output is exactly the same as evaluate_generator
    without shuffling.

  3. When I use model.predict and compute the metrics manually, I get the same results as those
    reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?

A note:
I use a very simple model - just one dense layer,
with no dropouts or batch normalizations to create any doubts as mentioned by @jeremydr2.

@jnygaard

@GalAvineri: Very interesting... I just came across your post after posting my own frustrations in #5818. I will try adding shuffling to my evaluate_generator to see if it makes any difference and whether I get the same results as you do, although I cannot see any sense in shuffling during model evaluation...

@csandmann

I had a similar problem using fit_generator with multiprocessing under Linux:
During training the loss was falling rapidly, with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again.
It turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator happened outside of the multiprocessing part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).

Maybe this saves time for some of you.

[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
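
A minimal sketch of that kind of fix, assuming a keras.utils.Sequence-style generator (a hypothetical class, not csandmann's actual code) that shuffles internally with np.random:

import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):
    # Hypothetical generator that draws random batches with np.random.

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Re-seed from OS entropy on every call so that workers created by
        # fork() stop sharing the np.random state inherited from the parent.
        np.random.seed()
        indices = np.random.choice(len(self.x), self.batch_size, replace=False)
        return self.x[indices], self.y[indices]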

@xing89qs

In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is the default.
I found that predict_generator(validation_generator) then returns predictions in shuffled order, while validation_generator.classes and validation_generator.filenames keep the original order... So an accuracy calculated from the predict_generator output against them can be wrong.
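
A sketch of that point using the generators from the original post: the manually computed accuracy only lines up with validation_generator.classes / .filenames when the generator is created with shuffle=False (assuming here that every validation sample is seen exactly once, i.e. the batch size divides the number of samples):

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # the default shuffle=True breaks the alignment below

probs = model.predict_generator(validation_generator,
                                nb_validation_samples // batch_size)
preds = (probs[:, 0] > 0.5).astype(int)
labels = validation_generator.classes[:len(preds)]
print("manual accuracy:", (preds == labels).mean())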

@keshavunni

keshavunni commented Feb 19, 2018

I also ran into a similar issue where fit_generator gives an accuracy of 98% and a loss of 0.08. evaluate_generator gives the same accuracy if I use rescale=1. / 255, but if I don't, I get an accuracy of 50% and a loss of 7.9. predict_generator always gives only 50% accuracy, whether or not I use rescale=1. / 255. What should I do?

@jeremydr2 When you removed the batch normalization and dropout, what were your accuracy and loss? Were they still 80% and 0.5? Because for me they fell to 50%.

@GalAvineri I have a similar issue, but I don't think it is about shuffling; for me it comes from the rescale=1. / 255 in ImageDataGenerator.

@hokmund

hokmund commented Apr 20, 2018

This issue is reproduced regularly while using fit_generator / evaluate_generator, and it seems pretty critical since it makes fit_generator output during training completely useless.

@ubershmekel

My problem was using loss='binary_crossentropy' instead of loss='categorical_crossentropy'. This caused my reported accuracy to be 96% before I even started training on my 15 different classes.

Just noting that for future googlers; it might not be relevant to this thread's topic specifically. It might be worth warning against this during evaluation in Keras when a model has 2+ outputs and the accuracy metric is binary.

https://stackoverflow.com/questions/42081257/keras-binary-crossentropy-vs-categorical-crossentropy-performance
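
To illustrate the pitfall with a hypothetical 15-class example: with one-hot labels and loss='binary_crossentropy', Keras reports the element-wise binary accuracy over all 15 outputs, which starts out high (roughly 14/15 ≈ 93% when every output rounds to 0) even for an untrained model. A mutually exclusive multi-class problem should instead be compiled along these lines (the input size of 128 is arbitrary):

from keras.models import Sequential
from keras.layers import Dense

num_classes = 15   # mirrors the 15-class case described above

model = Sequential()
model.add(Dense(num_classes, activation='softmax', input_shape=(128,)))
model.compile(loss='categorical_crossentropy',   # not 'binary_crossentropy'
              optimizer='sgd',
              metrics=['accuracy'])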

@klday

klday commented Sep 13, 2018

@csandmann I'm not sure I understand your solution. When you say "the resolution was to seed np.random in __getitem__(self, idx)", could you explain how you did this a little more thoroughly? How do you seed np.random in __getitem__(self, idx)?

@ghost

ghost commented Dec 25, 2018

Hey guys, I'm having similar issues with predict_generator.

When I'm training, I get around 92% train accuracy and 80% val_acc, but when I make predictions and put them through a confusion matrix, the accuracy drops to 50%. Any updates on this?

@maglkp

maglkp commented May 13, 2019

Hi, same here - manually computed accuracy on this Kaggle competition https://www.kaggle.com/c/aerial-cactus-identification gives me 50-65% (binary classification problem), while predict_generator gives roughly 95%. I tried the drop_duplicates=True, seed=2019, pickle_safe=True, workers=1 args, and I'm using the correct loss='binary_crossentropy' for the model. I'm happy to give full details, but the overall picture seems to be the same as above.

truth_generator = datagen.flow_from_dataframe(dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus", class_mode="binary", target_size=(32,32), batch_size=200, drop_duplicates=True, seed=2019, pickle_safe = True, workers=1)
predictions = model.predict_generator(truth_generator, steps=20)

@lipinski

lipinski commented Jun 3, 2019

You need to add shuffle=False to flow_from_dataframe.

@adilincepto

Hi,

I encountered the same issue recently, and the solution is actually quite simple.
You use validation_generator twice in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator's internal index is shifted after you use it in model.evaluate_generator, so when you call it again, the generator won't yield the samples in the order you expect.

You should therefore create a second generator to use in model.predict_generator, or only evaluate your model via either evaluate_generator or predict_generator:

validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, workers=12)

scores = model.predict_generator(validation_generator2, nb_validation_samples/batch_size, workers=12)

@pcko1

pcko1 commented Jun 17, 2019

I am experiencing a similar problem: my model is trained using fit_generator, saved using model.save, loaded using load_model, and evaluated using evaluate_generator, but its accuracy is similar to that of the untrained model. However, model.predict (without a generator) works adequately well.

Using keras.__version__ = 2.2.4.

@JaeDukSeo

This still seems to be a problem. One simple fix is not to use multiprocessing.

@BruceDai003

I encountered this problem today, and I found the solution:

  1. You must set shuffle=False in your generator.
  2. You need to reset your generator before calling the predict_generator() function.
    For example:
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))
step_size_valid = np.ceil(valid_generator.n / valid_generator.batch_size)
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)

@Dovermore

For me, predict_generator on a custom data generator also produces a different result than evaluate_generator and the verbose output from fit_generator. I didn't shuffle the indices for predict_generator.

@ghasemikasra39

Same issue. Totally confused by the answers above. Any simple solution?

@achbogga

Same issue. Any simple solution from the official authors yet?

@EXJUSTICE

Similar observations to @GalAvineri: when setting shuffle=False, my evaluate_generator accuracy goes up to 92%, which sounds great but is unrealistic. Setting it to True yields much more realistic results. I would appreciate some official guidelines on this - there are others who argue otherwise (@lipinski), which makes it incredibly confusing.

@kkviks

kkviks commented Jul 6, 2020

Same issue. Totally confused by the answers above. Any simple solution?

@jtxtina

jtxtina commented Jul 10, 2020

Same issue.

@jtxtina

jtxtina commented Jul 10, 2020

Ok, I solved the issue.
My code previously:

scores_evaluation = model.evaluate_generator(test_generator.flow(X_test,Y_test, batch_size=32, shuffle=False),len(Y_test)/32)
scores_prediction = model.predict_generator(test_generator.flow(X_test, batch_size=32, shuffle=False),len(Y_test)/32)

However, in flow() shuffle defaults to True, so I added shuffle=False to each flow() call here. That solved the problem.

@ahmedhassen7

Setting shuffle=False just saved my life :D
