Keras freezing on last batch of first epoch (can't move to second epoch) #8595
Comments
I have the same issue. I've been trying to change batch sizes, but that doesn't seem to change anything.
I think there is a bug with ImageDataGenerator. If I load my images from h5py using
Same issue here.
This is likely due to recent changes. @Dref360 please take a look, this seems like a serious issue.
@moustaki Are you also using flow_from_directory?
Could you all update to master / 2.1.2 please?
@Dref360 Thanks - just tried both master and 2.1.2 and it indeed fixes the issue. Should have tried that before -- sorry about that! For your earlier question, I am using a custom Sequence sub-class.
I still have this problem with Keras 2.1.2 using tensorflow-gpu 1.4.1. Any advice on how to solve it?
NikeNano - make sure that your validation_steps is reasonable. I had a similar problem, but it turns out I forgot to divide by batch_size.
Same as @NikeNano: using Keras 2.1.2 and tensorflow-gpu 1.4.1, and Keras freezes on epoch 11.
I have the same problem; it is stuck on the last batch of the first epoch. Epoch 1/30 1/6428 [..............................] - ETA: 9:25:55 - loss: 0.0580
It's solved. It just took a very long time on the last batch, but then it got to epoch 2.
I also have the same issue, where the first epoch hangs on the last step. Using the latest Keras, GPU, Python 3.5, Windows 10.
If you are still having this problem, try rebooting. I don't know why, but that fixed my issue; I was running Keras on the cloud.
Hello! I am still running into this issue on Ubuntu with Python 3.5.2 and Keras 2.1.4. I've been waiting a few hours at the end of the first epoch on a very similar issue (training a transfer binary classifier on VGG19). At first I thought it must have just been running through my validation data, which was taking an exorbitant amount of time, until I found this thread. Is it still a possibility that it is just a very slow iteration over my validation set (about 12,000 images, running on a GTX 950)? Or is my mental model of how fit_generator works mistaken? Also, thanks to all who are maintaining this project! It's been great to work with as I'm beginning to dive deeper into ML. 😄 Update: Found I was using the Keras 1 API for the fit_generator method; switched to the Keras 2 API and it's working now.
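For illustration, a minimal sketch of that difference (the model, generators, and sample counts below are placeholders, and the old argument names are quoted from memory): the Keras 1 arguments counted samples, while the Keras 2 arguments count steps (batches).

```python
batch_size = 32

# Keras 1 style (per-sample counts), roughly:
# model.fit_generator(train_gen, samples_per_epoch=20000, nb_epoch=10,
#                     validation_data=val_gen, nb_val_samples=5000)

# Keras 2 style (per-step counts); 20000 and 5000 are placeholder sample counts:
model.fit_generator(train_gen,
                    steps_per_epoch=20000 // batch_size,
                    epochs=10,
                    validation_data=val_gen,
                    validation_steps=5000 // batch_size)
```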
@minaMagedNaeem: same as @oliran, I had the same issue and resolved it after setting validation_steps=validation_size//batch_size: history_ft = model.fit_generator(
Same here. I have this problem with the code from Deep Learning with Python, Listing 6.37.
I had the same issue when running Inception V3 to do transfer learning. Windows 10, Python 3.5, Keras 2.1.6, TensorFlow 1.4 (GPU).
Same here with python3, keras v2.1.6, tensorflow v1.8, ubuntu 18.04.
I had a similar issue with python3, keras v2.1.6, tensorflow v1.8.0, ubuntu 16.04. I interrupted the processing and was able to see that it was busy running
Changing validation_steps=validation_size//batch_size worked for me.
Experiencing the same with Keras 2.2.0 and TensorFlow 1.8 on Ubuntu 16.04.
Experiencing the same with Keras 2.2.0 and TensorFlow 1.10 on Ubuntu 16.04.
Experiencing the same - stuck on the final batch for my CNN!
Same. For what it's worth, I think this is a CPU thing, because when I run my code on a 1080 it works fine.
I had this issue with both CPU and GPU, Keras 2.2.0. What solved it for me was to set workers=0.
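For reference, a sketch of that workaround (the model, generators, and step counts are placeholders assumed to be defined elsewhere):

```python
model.fit_generator(
    train_generator,
    steps_per_epoch=1000,           # placeholder
    epochs=10,
    validation_data=validation_generator,
    validation_steps=100,           # placeholder
    workers=0,                      # generate batches on the main thread
    use_multiprocessing=False)      # avoid worker processes that can hang
```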
This worked for me:
I encountered this problem using the
I can confirm the valid_generator was the problem. The problem was gone after I turned it off. But if the validation set is big, I still need the method. I would appreciate it if the Keras team could help with this!
Any progress with this issue? |
I am seeing the same issue with Keras 2.2.4, TF 1.8.
@BioScince I think that's just a problem with the website. It looks like you can't scroll down within the output. Try committing and see if it still clips the output, or write your standard output to a text file.
I also solved this by removing the validation process entirely. I use Ubuntu 18.04 LTS, CUDA 10.0, cuDNN 7.6, Keras 2.3.1, and TensorFlow 1.14.
The same issue happened during fine-tuning of a VGG16 model. Keras 2.2.0 and Python 2.7.15+. Any update from the development team on this problem?
I've got the same issue as @BioScince.
In my case, the problem was here:
At the end of each epoch, I shuffled the data with shuffle from the random package instead of sklearn.utils.shuffle; random.shuffle returns None because it is an in-place operation.
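A minimal sketch of that pitfall inside a custom Sequence (the class and data names are made up for illustration): random.shuffle works in place and returns None, while sklearn.utils.shuffle returns the shuffled arrays.

```python
import numpy as np
from keras.utils import Sequence
from sklearn.utils import shuffle  # returns shuffled copies, keeps x and y aligned


class ToySequence(Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

    def on_epoch_end(self):
        # Buggy version: `self.x = random.shuffle(self.x)` shuffles in place
        # and returns None, so the next epoch has no data to draw from.
        self.x, self.y = shuffle(self.x, self.y)  # correct: returns new arrays
```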
I had the same problem and I solved it by downgrading the graphics card driver.
This problem can also occur when the path to the validation data is invalid, which was actually my case. I have two separate directories for training and validation, but the path to my validation set was incorrect. So at the end of the epoch, Keras could not load the validation data and it froze. I think it would be better if Keras could raise an error like "file not found" or something like that.
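A tiny guard along those lines (the directory name is a placeholder), so the run fails immediately instead of hanging at the end of the first epoch:

```python
import os

val_dir = 'data/validation'  # placeholder path
if not os.path.isdir(val_dir):
    raise IOError('Validation directory not found: %s' % val_dir)
```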
That worked for me, thanks! But I had only done the second step and forgotten the validation_steps.
Hi, I am very new to deep learning and I am using a CNN for image classification. I am having the same problem: the epoch is not moving beyond 1/15. I left it to train overnight, but there was no response and the kernel shows as busy. I am using Windows 10, TensorFlow 2.0.0, Keras 2.3.1 and Python 3.6.1. def cnn(x_train, y_train, x_test, y_test):
The output is stuck at: hello. Please help me out. I tried changing the batch_size from 128 to 16 and set verbose=1; still no change.
Hi, @rajpaldish!
I had a similar issue; I just set initial_epoch=1 and the problem went away. Epochs will start from number 2 though, so add one more epoch to the existing total.
Validating 200k samples costs me 20 minutes on my 6-GPU machine at the end of every epoch, and it took me a week to accept that. Unbelievable, right? What looks to us like a STUCK/FREEZE is actually the machine still computing; there is simply no log output while the validation loss is being computed, which makes it look like something has gone wrong.
I faced the same issue.
- matterport/Mask_RCNN#2243 - SOLUTION: downgrade to scikit-image==0.16.2.
- matterport/Mask_RCNN#749 - you can safely ignore this warning; it's a preemptive warning from TensorFlow when it cannot be certain of the size of the generated tensor. SOLUTION: import warnings; warnings.filterwarnings('ignore').
- matterport/Mask_RCNN#127 and keras-team/keras#8595 (comment) - SOLUTION: set workers=1 and ensure use_multiprocessing=False in model.py.
- matterport/Mask_RCNN#2111 - delete use_mini_mask from the argument list; it's enabled by default in config.py in this version.
I also experienced a similar issue when running a training job with TensorFlow 1.15 using a Keras sequential model. I also got a warning: "Method (on_train_batch_end) is slow compared to the batch update. Check your callbacks" (refer to this issue). I was able to overcome this issue by following these steps:
For those who were not able to solve this issue using the methods above: if you are using
I've been stuck on this issue for about a day, but I found an elegant fix with this.
The key for me was to define validation_steps and steps_per_epoch from the sample counts and batch size used within the generators, so there won't be any discrepancies or mistakes.
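For illustration, a sketch of that approach with flow_from_directory (the paths, image size, and model are assumed): both step counts come from the generators' own samples attribute, so they cannot drift out of sync with the data.

```python
from keras.preprocessing.image import ImageDataGenerator

batch_size = 32
datagen = ImageDataGenerator(rescale=1. / 255)

train_gen = datagen.flow_from_directory('data/train', target_size=(150, 150),
                                        batch_size=batch_size, class_mode='binary')
val_gen = datagen.flow_from_directory('data/validation', target_size=(150, 150),
                                      batch_size=batch_size, class_mode='binary')

model.fit_generator(
    train_gen,
    steps_per_epoch=train_gen.samples // batch_size,   # derived from the generator
    epochs=10,
    validation_data=val_gen,
    validation_steps=val_gen.samples // batch_size)    # derived from the generator
```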
@leon-kwy Did you ever figure it out? I'm having the exact same problem with Mask RCNN.
I'm using Keras 2.1.1 and Tensorflow 1.4, Python 3.6, Windows 7.
I'm attempting transfer learning using the Inception model.
The code is straight from the Keras Applications API, with just a few tweaks (using my data).
Here is the code: