Evaluate_generator produces wrong accuracy scores? #6499
Comments
Having looked into the backend, I believe this is due to the number of workers: if you have more than one worker, there is nothing to ensure consistency of file loading across them. It is therefore possible that some files are shown numerous times in these methods, as the generators are randomly initialised but don't actually share a file-list state. Maybe @fchollet or @farizrahman4u can confirm?
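A minimal sketch of the classic workaround for that failure mode: wrap the shared generator in a lock so only one worker can advance it at a time. This only helps for the threaded backend, and the class name is illustrative, not a Keras API:

```python
import threading

class ThreadSafeIter:
    """Wrap a generator so concurrent worker threads cannot
    interleave its internal state."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # only one worker advances the underlying generator at a time
        with self.lock:
            return next(self.it)

# usage sketch (model/my_generator/steps are placeholders):
# model.evaluate_generator(ThreadSafeIter(my_generator), steps, workers=4)
```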
Thanks for your answer. I already considered that the image generators are not thread-safe. However, I was able to reproduce the problem with all three worker counts set to 1. Does the order of the filenames maybe not correspond to the order of the scores?
@skoch9 try to set `workers=1` and `pickle_safe=False` in evaluate_generator and predict_generator.
Let me know if that gives you consistent results.
Thanks @abnerA for your answer. I investigated a little further and found that running

```python
score = model.evaluate_generator(validation_generator, nb_validation_samples/batch_size, pickle_safe=False)
scores = model.predict_generator(validation_generator, nb_validation_samples/batch_size)
```

still produces inconsistent scores. As for the parameters, when setting …
@skoch9 no. pickle_safe is False by default in all Keras generator methods. The only difference between these two modes is that False uses multithreading and True uses multiprocessing.
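For reference, in Keras 2.0.x (where the flag was still called pickle_safe; later versions renamed it use_multiprocessing) the two backends are selected like this; `gen` and `steps` are placeholders:

```python
# identical call, two concurrency backends (Keras 2.0.x argument names)
model.evaluate_generator(gen, steps, workers=4, pickle_safe=False)  # threads
model.evaluate_generator(gen, steps, workers=4, pickle_safe=True)   # processes
```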
Sure, but it doesn't make sense to allow invalid parameter configurations. And I would also consider it problematic (i.e. a bug) that running the same evaluation twice can produce different results.
Hello,
with a generator defined to take a sub-part of the input matrix, I get different results than from `model.evaluate(x_test, y_test, verbose=0)`: same learned parameters, same test data, but different results. What should I do?
To follow up: if I only set `workers=1` it does not work either (and gives different results each time).
Hi,
it does not matter if use_multiprocessing is True or False, or the order of the calls. I know that this code will not result in identical generator output (each call advances the generator), but the output of the generator is similar between iterations, and the accuracy difference remains.
EDIT: SOLVED: My accuracy difference was eliminated when I removed all batch normalization layers (I also removed dropout layers, but I don't think that was the cause).
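If batch normalization is the cause, one way to check is to run the same batch through the model in training mode and in inference mode and compare the outputs. A sketch against the Keras 2 backend API; `model` and `x_batch` are assumed to exist:

```python
import numpy as np
from keras import backend as K

# training mode (learning_phase=1): batch norm uses batch statistics and
# dropout is active; inference mode (learning_phase=0): moving averages,
# no dropout
predict_fn = K.function([model.input, K.learning_phase()], [model.output])
out_train = predict_fn([x_batch, 1])[0]
out_infer = predict_fn([x_batch, 0])[0]
print('mean abs difference:', np.abs(out_train - out_infer).mean())
```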
I too have a big difference between the reported fit_generator results and the later evaluate_generator results. I've looked into this a bit; my findings suggest the outcome depends on the shuffle setting of the evaluation generator. Can anyone verify that any of the above happens to them as well?
@GalAvineri: Very interesting... I just came across your post, after posting my own frustrations: #5818. I will try adding shuffling to my evaluate_generator to see if this makes any difference, and whether I get the same results as you do, although I cannot see any sense in shuffling during model evaluation...
I had a similar problem using fit_generator with multiprocessing under Linux: worker processes are created by fork, so every worker inherits the same numpy random state and can end up producing the same "random" batches [1]. Maybe this saves time for some of you.
[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
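A sketch of the fix that follows from [1], assuming a plain Python generator fed to fit_generator with pickle_safe=True; all names are placeholders:

```python
import os
import time

import numpy as np

def batch_generator(x, y, batch_size):
    # re-seed from the pid so forked workers diverge instead of replaying
    # the parent's numpy random state after a Linux fork
    np.random.seed((os.getpid() + int(time.time())) % (2 ** 32))
    while True:
        idx = np.random.randint(0, len(x), batch_size)
        yield x[idx], y[idx]
```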
I also ran into a similar issue where my evaluation results did not match the training output.
@jeremydr2 When you removed the batch normalization and dropout, what were your accuracy and loss? Were they still 80% and 0.5? Because for me it fell to 50%.
@GalAvineri I have a similar issue, but I do not think the issue is about using shuffle; it is because of the numpy random state being duplicated across forked workers. Resolution was to seed np.random in __getitem__(self, idx).
This issue is reproduced regularly while using fit_generator / evaluate_generator, and it seems pretty critical, since it makes fit_generator output during training completely useless.
My problem was using binary accuracy on a model with more than one output. Just noting that for future googlers; it might not be relevant for this thread topic specifically. It might be worth warning against this during evaluation in Keras when a model has 2+ outputs and the accuracy is binary.
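For reference, the usual shape of this gotcha (a sketch, not this poster's exact code): with binary_crossentropy, Keras resolves metrics=['accuracy'] to binary_accuracy, which can look misleadingly high on multi-class targets. Requesting the metric explicitly avoids the ambiguity:

```python
# explicit metric instead of the ambiguous 'accuracy' string; `model`
# is a placeholder for an already-built multi-class model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])
```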
I'm not sure I understand your solution. When you say "Resolution was to seed np.random in __getitem__(self, idx)", could you explain how you did this a little more thoroughly? How do you seed np.random in __getitem__(self, idx)?
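One way to do it, as a minimal sketch built on keras.utils.Sequence; the class and the noise "augmentation" are illustrative assumptions, not the original poster's code:

```python
import numpy as np
from keras.utils import Sequence

class SeededSequence(Sequence):
    """Derive a per-batch seed so every worker process draws different,
    yet reproducible, augmentation randomness."""
    def __init__(self, x, y, batch_size, base_seed=42):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.base_seed = base_seed

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # the seed depends only on the batch index, so it is the same
        # no matter which worker handles the batch
        rng = np.random.RandomState(self.base_seed + idx)
        lo = idx * self.batch_size
        batch_x = self.x[lo:lo + self.batch_size]
        batch_y = self.y[lo:lo + self.batch_size]
        noise = rng.normal(0, 0.01, batch_x.shape)  # stand-in for augmentation
        return batch_x + noise, batch_y
```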
Hey guys, I'm having similar issues with predict_generator. When I'm training I get around 92% train acc and 80% val_acc, but when I make predictions and put them through a confusion matrix the acc drops to 50%. Any updates on this?
Hi, same here. Manually computed accuracy on this Kaggle competition https://www.kaggle.com/c/aerial-cactus-identification gives me 50-65% accuracy (binary classification problem), while the accuracy reported alongside predict_generator is much higher. My generator:

```python
truth_generator = datagen.flow_from_dataframe(
    dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus",
    class_mode="binary", target_size=(32, 32), batch_size=200,
    drop_duplicates=True, seed=2019, pickle_safe=True, workers=1)
```
You need to add `shuffle=False` to flow_from_dataframe.
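Concretely, for the generator above (a sketch keeping the same arguments, only disabling shuffling so predictions line up with truth_generator.classes):

```python
# same generator with shuffling disabled, so the i-th prediction matches
# truth_generator.classes[i]; note that pickle_safe/workers are arguments
# of predict_generator, not of flow_from_dataframe
truth_generator = datagen.flow_from_dataframe(
    dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus",
    class_mode="binary", target_size=(32, 32), batch_size=200,
    drop_duplicates=True, seed=2019, shuffle=False)
```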
Hi, I encountered the same issue recently, and actually the solution is quite simple: the generator you trained with has already been advanced, so its output no longer lines up with its labels. You should create a second generator to use in model.predict_generator, or only evaluate your model via evaluate_generator or predict_generator.
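A sketch of that evaluation pattern; the paths, image size, batch size and the binary task are assumptions, and `model` is an already-trained model:

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# a fresh, non-shuffled generator used only for evaluation
eval_datagen = ImageDataGenerator(rescale=1. / 255)
eval_generator = eval_datagen.flow_from_directory(
    'data/validation', target_size=(150, 150),
    batch_size=32, class_mode='binary', shuffle=False)

# predictions now come out in the same order as eval_generator.classes
probs = model.predict_generator(eval_generator,
                                steps=eval_generator.samples // 32)
preds = (probs.ravel() > 0.5).astype(int)
print('accuracy:', np.mean(preds == eval_generator.classes[:len(preds)]))
```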
I am experiencing a similar problem: my model is trained using fit_generator. Using evaluate_generator afterwards gives me different results.
This still seems to be a problem. One simple fix was not to use multiprocessing.
I have encountered this problem today.
Same issue. Totally confused by the answers above. Any simple solution?
Same issue. Any simple solution from the official authors yet?
Similar observations to @GalAvineri: when setting shuffle=False, my evaluate_generator accuracy goes up to 92%, which is great but unrealistic. Setting it to True yields much more realistic results. I would appreciate some official guidelines on this; there are others who argue otherwise (@lipinski), which makes it incredibly confusing.
Same issue.
OK, I solved the issue.
In flow(), shuffle defaults to True, so for each flow call here I added shuffle=False. Then the problem was solved.
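For example (a sketch; `datagen`, `model`, `x_val` and `y_val` are placeholders):

```python
# evaluation-time flow with shuffling disabled (it defaults to True)
val_flow = datagen.flow(x_val, y_val, batch_size=32, shuffle=False)
score = model.evaluate_generator(val_flow, steps=len(x_val) // 32)
```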
Setting shuffle=False just saved my life :D
Hello, I run a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (with Keras 2.0.3/TensorFlow on Ubuntu with GPU); my code closely follows that example. With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 (80%) classes correctly, whereas evaluate_generator produces an accuracy score of 95%. Someone in #3477 suggests removing the `rescale=1. / 255` parameter from the validation generator; with that I get results of 365/800 = 45% and 89% from evaluate_generator. Is there something wrong with my evaluation or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the stated accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced maybe shine some light onto this problem? Thanks
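For context, a minimal sketch of the validation setup that style of fine-tuning example uses. This is illustrative only, not the reporter's exact code; the paths, image size and sample counts are assumptions, and `model` is the fine-tuned model:

```python
from keras.preprocessing.image import ImageDataGenerator

batch_size = 16
nb_validation_samples = 800

# validation pipeline in the style of the fine-tuning example; the rescale
# factor must match the preprocessing used during training
test_datagen = ImageDataGenerator(rescale=1. / 255)
validation_generator = test_datagen.flow_from_directory(
    'data/validation', target_size=(150, 150),
    batch_size=batch_size, class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator,
                                 nb_validation_samples // batch_size)
scores = model.predict_generator(validation_generator,
                                 nb_validation_samples // batch_size)
```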