
Keras freezing on last batch of first epoch (can't move to second epoch) #8595

Closed
Moondra opened this issue Nov 26, 2017 · 71 comments

Moondra commented Nov 26, 2017

I'm using Keras 2.1.1 and Tensorflow 1.4, Python 3.6, Windows 7.

I'm attempting transfer learning using the Inception model.
The code is straight from the Keras Application API, just a few tweaks (using my data).

Here is the code

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
from keras import optimizers


img_width, img_height = 299, 299
train_data_dir = r'C:\Users\Moondra\Desktop\Keras Applications\data\train'
total_samples = 13581
batch_size = 3
epochs = 5


train_datagen = ImageDataGenerator(
    rescale = 1./255,
    horizontal_flip = True,
    zoom_range = 0.1,
    rotation_range=15)


train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size,
    class_mode = 'categorical')  #class_mode = 'categorical'


# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- let's say we have 200 classes
predictions = Dense(12, activation='softmax')(x)

# this is the model we will train
model = Model(input=base_model.input, output=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer=optimizers.SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# train the model on the new data for a few epochs
model.fit_generator(
    train_generator,
    steps_per_epoch = 20,
    epochs = epochs)


# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
   print(i, layer.name)

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 249 layers and unfreeze the rest:
for layer in model.layers[:249]:
   layer.trainable = False
for layer in model.layers[249:]:
   layer.trainable = True

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(
    train_generator,
    steps_per_epoch = 25,
    epochs = epochs)


Output is

Found 13581 images belonging to 12 classes.

Warning (from warnings module):
  File "C:\Users\Moondra\Desktop\Keras Applications\keras_transfer_learning_inception_problem_one_epoch.py", line 44
    model = Model(input=base_model.input, output=predictions)
UserWarning: Update your `Model` call to the Keras 2 API: `Model(inputs=Tensor("in..., outputs=Tensor("de...)`
Epoch 1/5

 1/20 [>.............................] - ETA: 38s - loss: 2.8652 - acc: 0.0000e+00
 3/20 [===>..........................] - ETA: 12s - loss: 2.6107 - acc: 0.1111
 4/20 [=====>........................] - ETA: 8s - loss: 2.6454 - acc: 0.0833
 5/20 [======>.......................] - ETA: 6s - loss: 2.6483 - acc: 0.0667
 6/20 [========>.....................] - ETA: 5s - loss: 2.6863 - acc: 0.0556
 7/20 [=========>....................] - ETA: 4s - loss: 2.6230 - acc: 0.0952
 8/20 [===========>..................] - ETA: 3s - loss: 2.6212 - acc: 0.0833
 9/20 [============>.................] - ETA: 3s - loss: 2.6192 - acc: 0.1111
10/20 [==============>...............] - ETA: 2s - loss: 2.6223 - acc: 0.1000
11/20 [===============>..............] - ETA: 2s - loss: 2.6626 - acc: 0.0909
12/20 [=================>............] - ETA: 2s - loss: 2.6562 - acc: 0.1111
13/20 [==================>...........] - ETA: 1s - loss: 2.6436 - acc: 0.1282
14/20 [====================>.........] - ETA: 1s - loss: 2.6319 - acc: 0.1190
15/20 [=====================>........] - ETA: 1s - loss: 2.6343 - acc: 0.1111

Warning (from warnings module):
  File "C:\Users\Moondra\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\callbacks.py", line 116
    % delta_t_median)
UserWarning: Method on_batch_end() is slow compared to the batch update (0.102000). Check your callbacks.

16/20 [=======================>......] - ETA: 0s - loss: 2.6310 - acc: 0.1042
17/20 [========================>.....] - ETA: 0s - loss: 2.6207 - acc: 0.1176
18/20 [==========================>...] - ETA: 0s - loss: 2.6063 - acc: 0.1296
19/20 [===========================>..] - ETA: 0s - loss: 2.6056 - acc: 0.1228




It just hangs at 19/20.

I already asked on Stack Overflow but got no help.

https://stackoverflow.com/questions/47382952/cant-get-past-first-epoch-just-hangs-keras-transfer-learning-inception


@whatisAI

I have the same issue. I've been trying to change batch sizes, but that doesn't seem to change anything.

@moondra2017

I think there is a bug with ImageDataGenerator. If I load my images from h5py and use
model.train_on_batch, I have no problems.
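For reference, a minimal sketch of that workaround (the HDF5 file name and the 'images'/'labels' dataset names are hypothetical, and model and epochs are assumed to be defined as in the original post):

import h5py
import numpy as np

batch_size = 32
with h5py.File('train_data.h5', 'r') as f:            # hypothetical file name
    images = f['images']                               # hypothetical dataset names
    labels = f['labels']
    n_samples = images.shape[0]
    for epoch in range(epochs):
        for start in range(0, n_samples, batch_size):
            # h5py datasets support slicing, so each batch is read lazily from disk
            x_batch = np.asarray(images[start:start + batch_size], dtype='float32') / 255.0
            y_batch = labels[start:start + batch_size]
            loss, acc = model.train_on_batch(x_batch, y_batch)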

@moustaki
Contributor

moustaki commented Dec 1, 2017

Same issue here. fit_generator works fine in 2.0.9, but hangs indefinitely at the end of the first epoch from 2.1.0 onwards.

@fchollet
Member

fchollet commented Dec 1, 2017

This is likely due to changes in keras/utils/data_utils.py between 2.0.9 and 2.1.0. Specifically this: 612f530#diff-ba9d38600a2df565e5ae8757eb2b1b35

@Dref360 please take a look, this seems like a serious issue.

@Dref360
Contributor

Dref360 commented Dec 2, 2017

@moustaki Are you also using flow_from_directory?

@Dref360
Contributor

Dref360 commented Dec 2, 2017

Could you all update to master / 2.1.2 please?
Pretty sure this has been fixed with: 2f3edf9#diff-299cfd5886683a4b012f286403769fc1

@moustaki
Contributor

moustaki commented Dec 2, 2017

@Dref360 Thanks - just tried both master and 2.1.2 and it indeed fixes the issue. Should have tried that before -- sorry about that! For your earlier question, I am using a custom Sequence sub-class.

@NikeNano

I still have this problem with Keras 2.1.2 using tensorflow-gpu 1.4.1. Any advice on how to solve it?

@oliran

oliran commented Jan 4, 2018

@NikeNano - make sure that your validation_steps is reasonable. I had a similar problem, but it turns out I had forgotten to divide by batch_size.
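In other words (a minimal sketch; the sample count and batch size are made-up placeholders):

num_val_samples = 2000   # hypothetical size of the validation set
batch_size = 32

# each validation step consumes one batch from the generator, so the step count
# should be the number of batches, not the number of samples
validation_steps = num_val_samples // batch_size   # 62, not 2000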

@LivingProgram

Same as @NikeNano, using Keras 2.1.2 and tensorflow-gpu 1.4.1; Keras freezes on epoch 11.

@minaMagedNaeem

I have the same problem; it is stuck on the last batch of the first epoch.
Keras version 2.1.3
Tensorflow version 1.4.0

Epoch 1/30
C:\Users\Minal\AppData\Local\Programs\Python\Python36\lib\site-packages\skimage\transform_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
warn("The default mode, 'constant', will be changed to 'reflect' in "

1/6428 [..............................] - ETA: 9:25:55 - loss: 0.0580
2/6428 [..............................] - ETA: 7:46:11 - loss: 0.0560
3/6428 [..............................] - ETA: 7:14:06 - loss: 0.0569
4/6428 [..............................] - ETA: 6:52:54 - loss: 0.0536
5/6428 [..............................] - ETA: 6:49:36 - loss: 0.0541
6/6428 [..............................] - ETA: 6:51:51 - loss: 0.0556
7/6428 [..............................] - ETA: 6:45:15 - loss: 0.0580
8/6428 [..............................] - ETA: 6:33:50 - loss: 0.0595
9/6428 [..............................] - ETA: 6:20:48 - loss: 0.0594
10/6428 [..............................] - ETA: 6:12:55 - loss: 0.0604
11/6428 [..............................] - ETA: 6:07:12 - loss: 0.0596
12/6428 [..............................] - ETA: 6:00:31 - loss: 0.0588
13/6428 [..............................] - ETA: 6:00:06 - loss: 0.0589
14/6428 [..............................] - ETA: 5:59:53 - loss: 0.0591
15/6428 [..............................] - ETA: 5:57:44 - loss: 0.0590
16/6428 [..............................] - ETA: 5:55:21 - loss: 0.0601
.
.
.
6420/6428 [============================>.] - ETA: 14s - loss: 0.0213
6421/6428 [============================>.] - ETA: 12s - loss: 0.0213
6422/6428 [============================>.] - ETA: 10s - loss: 0.0213
6423/6428 [============================>.] - ETA: 8s - loss: 0.0213
6424/6428 [============================>.] - ETA: 7s - loss: 0.0213
6425/6428 [============================>.] - ETA: 5s - loss: 0.0213
6426/6428 [============================>.] - ETA: 3s - loss: 0.0213
6427/6428 [============================>.] - ETA: 1s - loss: 0.0212

@minaMagedNaeem

It's solved. It just took a long time on the last batch, but then it got to epoch 2.

@KenHollandWHY

I also have the same issue, where the first epoch hangs on the last step. Using the latest Keras, GPU, Python 3.5, Windows 10.

@LivingProgram

If you are still having this problem, try rebooting. I don't know why, but that fixed my issue; I was running Keras in the cloud.

@JackCurrie

JackCurrie commented Apr 13, 2018

Hello! I am running into this issue still on Ubuntu running Python 3.5.2 and Keras 2.1.4. I've been waiting a few hours at the end of the first epoch on a very similar issue (Training a transfer binary classifier on VGG19).

At first I thought that it must have been just running through my validation data which was taking an exorbitant amount of time until I found this thread. Is it still a possibility that it is just a very slow iteration over my validation set (it's about 12,000 images, running on a GTX 950)? Or is my mental model of how fit_generator works mistaken?

Also, thanks to all who are maintaining this project! It's been great to work with as I'm beginning to dive deeper into ML. 😄

Update: I found I was using the Keras 1 API for the fit_generator method; I switched to the Keras 2 API and it's working now.

@kaka7

kaka7 commented Apr 26, 2018

@minaMagedNaeem: same as @oliran, I had the same issue and resolved it after setting validation_steps=validation_size//batch_size

history_ft = model.fit_generator(
    generator_train,  # can be customized
    samples_per_epoch=4170,  # nb_train_samples
    # steps_per_epoch=10,  # nb_train_samples; samples traversed per epoch
    validation_data=generator_test,  # can be customized
    nb_epoch=10,
    # verbose=0,
    validation_steps=530//64,
    # epochs=100
    # nb_val_samples=530
    )

@ptah23

ptah23 commented May 2, 2018

Same here. I have this problem with the code from Deep Learning with Python, Listing 6.37.
I am on Ubuntu 18.04 with keras 2.1.6, tensorflow-gpu 1.8.0.

@Tensorfengsheng1926

I had the same issue when running Inception V3 for transfer learning. Windows 10, Python 3.5, Keras 2.1.6, tensorflow-gpu 1.4.

@hashJoe

hashJoe commented May 10, 2018

Same here with Python 3, Keras v2.1.6, TensorFlow v1.8, Ubuntu 18.04.
After multiple reinstallations and attempts,
the solution was to wait several minutes for it to jump to epoch 2/25 after it was stuck on epoch 1 (7999/8000) xD

@ldelphinpoulat

I had a similar issue with Python 3, Keras v2.1.6, TensorFlow v1.8.0, Ubuntu 16.04. I interrupted the processing and was able to see that it was busy running self.sess.run([self.merged], feed_dict=feed_dict) in keras/callbacks.py.
I guessed that it was related to histogram computations in TensorBoard, so I set histogram_freq=0 on TensorBoard object creation. For me that solved the issue, at the cost of losing the TensorBoard histograms.
I had previous versions of Keras and TensorFlow for which the histogram computation for TensorBoard did not take such a long time (unfortunately I do not recall which versions those were).
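For anyone who wants to try the same workaround, here is a minimal sketch; the generators, step counts, and log directory are placeholders assumed to be defined as in the earlier examples:

from keras.callbacks import TensorBoard

# histogram_freq=0 disables the weight/activation histograms that are otherwise
# computed at the end of every epoch and can stall it for a long time
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0)

model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    callbacks=[tensorboard])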

@shaktisd

Changing validation_steps=validation_size//batch_size worked for me

@whatdhack

Experiencing the same with Keras 2.2.0 and Tensorflow 1.8 on Ubuntu 16.04.

@bmitrauncc

[screenshot of training progress output]

Getting stuck here

@yangjh39

Experiencing the same with Keras 2.2.0 and Tensorflow 1.10 on Ubuntu 16.04.

@kjaisingh

Experiencing the same - stuck on the final batch for my CNN!

@ejcer

ejcer commented Aug 4, 2018

Same. For what it's worth, I think this is a CPU thing, because when I run my code on a 1080 it works fine.

@dantheman3333

dantheman3333 commented Aug 10, 2018

Have the same issue. Stuck on first epoch at step 1999/2000. Using Windows, tensorflow-gpu 1.10.0, Keras 2.2.2, CUDA V9.0.176. Using ImageDataGenerator flow_from_directory for training and validation.

I have way too much data - I have 50 million images and I split it 70% train and 30% val, so I thought it had way too much validation data to run through every epoch. But if I set validation_steps in fit_generator to 1, shouldn't it only do one step of validation (one batch?) before moving on to the next epoch?

I'm new to this so I'm having a hard time debugging, but this is the profile after a few hours:
[profiler screenshots: call_count and time]

When sorted by time taken, the top two methods are get and wait in pool.py, and the other get is from keras' data_utils.py.

Edit: I downgraded Keras to 2.0.9 and now it works
Edit: I actually still sometimes have this issue on 2.0.9. Can't seem to find out why it's happening occasionally.

@MinnML

MinnML commented Aug 22, 2018

I had this issue with both CPU and GPU, keras 2.2.0. What solved it for me was to set workers=0.

@ashuta03

This worked for me (a short sketch follows the list):

  1. set workers=1, and use_multiprocessing=False in self.keras_model.fit_generator in model.py
  2. Make sure that:
    steps_per_epoch = number of train samples//batch_size
    and
    validation_steps = number of validation samples//batch_size
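A rough sketch of that combination as a plain Keras fit_generator call; the generator names and sample counts are made-up placeholders (in Mask R-CNN the same arguments would be passed to self.keras_model.fit_generator in model.py):

num_train_samples = 8000     # hypothetical sample counts
num_val_samples = 2000
batch_size = 32

model.fit_generator(
    train_generator,
    steps_per_epoch=num_train_samples // batch_size,
    epochs=10,
    validation_data=val_generator,
    validation_steps=num_val_samples // batch_size,
    workers=1,                   # a single worker thread
    use_multiprocessing=False)   # avoid the multiprocessing queue that can hang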

@Quetzalcohuatl

I encountered this problem using the fit function. I believe I fixed it by setting batch_size=2 and using Adam instead of SGD as my optimizer. I think it may be a memory issue, and the machine was coping by using swap memory, which is notoriously slow.

@mnguyenmti

I confirm the valid_generator was the problem. The problem was gone after I turned it off. But if the validation set is big, I still need the method. I would appreciate it if the Keras team could help with this!

@srv902

srv902 commented Aug 30, 2019

Any progress with this issue?

@SWHL

SWHL commented Sep 16, 2019

I am seeing the same issue with Keras 2.2.4, TF 1.8.
I think the reason is IMAGES_PER_GPU = 4. When I change IMAGES_PER_GPU from 4 to 1, the problem goes away.

@BioScince

I am seeing the same issue with Keras.
[screenshot of output]
The code stops at epoch 18 of 60. Can anyone help me?

@Quetzalcohuatl

@BioScince I think that's just a problem with the website. It looks like you can't scroll down within the output. Try committing and see if it still clips the output, or write your standard output to a text file.

@Akiqqqqqqq

I also solved this by removing the validation process entirely. I use Ubuntu 18.04 LTS, CUDA 10.0, cuDNN 7.6, Keras 2.3.1, and TensorFlow 1.14.

@ghost

ghost commented Nov 7, 2019

The same issue happened during fine-tuning of a VGG16 model, with Keras 2.2.0 and Python 2.7.15+.
Removing use_multiprocessing=True solves the freezing problem.

Any update from the development teams on this problem?

@BEEugene

I've got the same issue as @BioScince.
It wasn't an issue when I used a small number of images.
The training is run on a Linux cluster.
Python 3.6.8, Keras 2.3.1.
I waited for a day for it to start a new epoch after validation; nothing happened.
I'm training the network with the following generator.

import os
import cv2
import numpy as np
from random import shuffle  # note: importing shuffle from random (not sklearn.utils) is the bug discussed in the follow-up comment below
from keras.utils import Sequence

class DataGeneratorFilesCrop(Sequence):
    def __init__(self, image_mask_prepoc, image_filenames=None, mask_names=None, image_folder=None, mask_folder=None, root_dir=None,
                 batch_size=1, image_size=256, nb_y_features=1,
                 augmentation=None, mask_transform=lambda x: np.expand_dims((x > 0).astype(np.int8), -1),
                 suffle=True):
        self.image_filenames = image_filenames if image_filenames else self.listdir_fullpath(os.path.join(root_dir, image_folder))
        self.mask_names = mask_names if mask_names else self.listdir_fullpath(os.path.join(root_dir, mask_folder))
        self.batch_size = batch_size
        self.currentIndex = 0
        self.augmentation = augmentation
        self.image_size = image_size
        self.nb_y_features = nb_y_features
        self.indexes = None
        self.mask_transform = mask_transform
        self.suffle = suffle

    def listdir_fullpath(self, d):
        return np.sort([os.path.join(d, f) for f in os.listdir(d)])

    def __len__(self):
        """
        Calculates size of batch
        """
        return int(len(self.image_filenames) / (self.batch_size))

    def on_epoch_end(self):
        """Updates indexes after each epoch"""
        if self.suffle == True:
            self.image_filenames, self.mask_names = shuffle(self.image_filenames, self.mask_names)

    def read_image_mask(self, image_name, mask_name):
        return cv2.resize(cv2.imread(image_name), (self.image_size, self.image_size)) / 255, cv2.resize(cv2.imread(mask_name, 0), (self.image_size,self.image_size))

    def __getitem__(self, index):
        """
        Generate one batch of data

        """
        # Generate indexes of the batch
        data_index_min = int(index * self.batch_size)
        data_index_max = int(min((index + 1) * self.batch_size, len(self.image_filenames)))

        indexes = self.image_filenames[data_index_min:data_index_max]

        this_batch_size = len(indexes)  # The last batch can be smaller than the others

        # Defining dataset
        X = np.empty((this_batch_size, self.image_size, self.image_size, 3), dtype=np.float32)
        y = np.empty((this_batch_size, self.image_size, self.image_size, self.nb_y_features), dtype=np.uint8)

        for i, sample_index in enumerate(indexes):

            X_sample, y_sample = self.read_image_mask(self.image_filenames[index * self.batch_size + i],
                                                      self.mask_names[index * self.batch_size + i])

            if self.mask_transform:
                y_sample = self.mask_transform(y_sample)

            # if augmentation is defined, we assume its a train set

            X[i, ...] = np.clip(X_sample, a_min=0, a_max=1)
            y[i, ...] = y_sample

        return X, y

@BEEugene

In my case, the problem was here:

def on_epoch_end(self):
        """Updates indexes after each epoch"""
        if self.suffle == True:
            self.image_filenames, self.mask_names = shuffle(self.image_filenames, self.mask_names)

At the end of the epoch I shuffled the data with shuffle from the random package instead of sklearn.utils.shuffle; random.shuffle returns None because it shuffles in place.
In my case, it threw an exception which wasn't shown in the terminal, and the process stopped.
That is why it didn't start a new epoch and seemed to be frozen.
This is a clue to much of what was discussed here.
So, a note to the developers - let the exception appear :)
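For completeness, a sketch of the corrected method using sklearn.utils.shuffle, which returns the shuffled lists instead of shuffling in place (assuming the generator class posted above):

from sklearn.utils import shuffle

def on_epoch_end(self):
    """Updates indexes after each epoch."""
    if self.suffle == True:
        # sklearn.utils.shuffle applies the same permutation to both lists and
        # returns them, so image filenames and mask names stay aligned
        self.image_filenames, self.mask_names = shuffle(self.image_filenames, self.mask_names)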

@jalilasadi

I had the same problem and I solved it by downgrading the graphics card driver.

@doantientai

This problem can also occur when the path to the validation data is invalid, which was actually my case. I have two separate directories for training and validation. However, the path to my validation set was incorrect. So at the end of the epoch, Keras could not load the validation data and it froze.

I think it would be better if Keras could raise an error like "file not found" or something similar.

@erasmo-aln

erasmo-aln commented Apr 11, 2020

This worked for me:

  1. set workers=1, and use_multiprocessing=False in self.keras_model.fit_generator in model.py
  2. Make sure that:
    steps_per_epoch = number of train samples//batch_size
    and
    validation_steps = number of validation samples//batch_size

That worked for me, thanks! I only needed the second step; I had forgotten validation_steps.

@rajpaldish

Hi, I am very new to deep learning and I am using a CNN for image classification. I am having the same problem: the epoch is not moving beyond 1/15. I have left it to train overnight but got no response, and the kernel shows as busy. I am using Windows 10, TensorFlow 2.0.0, Keras 2.3.1 and Python 3.6.1.

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

def cnn(x_train, y_train, x_test, y_test):
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Flatten())

    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=15, batch_size=16, verbose=1)
    loss, accuracy = model.evaluate(x_train, y_train)
    probabilities = model.predict(x_test)
    predictions = [float(np.round(x)) for x in probabilities]
    accuracy = np.mean(predictions == y_test)
    print("Prediction Accuracy: %.2f%%" % (accuracy * 100))
    model.save('result.h5')

The output is stuck at:

hello
i am in second block
i am in third block
(1, 64, 64)
(0, 64, 64)
Epoch 1/15

Please help me out. I tried changing the batch_size from 128 to 16 and setting verbose=1, but still no change.

@BEEugene

BEEugene commented Apr 28, 2020

Hi @rajpaldish!
It seems the shape of your second batch is zero: (0, 64, 64). That means it is an empty dataset, which is why Keras freezes. Try to test the generator separately; you will probably find some mistakes in it. Check this issue.

@dushyant-007

I had a similar issue; I just set initial_epoch=1 and the problem went away. Epochs will start from number 2 though, so add one more epoch to the existing number.

@yyhhlancelot

200k validation samples cost me 20 minutes on my 6-GPU machine at the end of every epoch, and it took me a week to accept the truth! Unbelievable, right? From our point of view it looks STUCK/FROZEN, but the machine is still computing. And there is no log output while it computes the validation loss, which makes us think something is going wrong...

@lalitbhagat7

I faced the same issue.
This is because the model is running on the validation dataset, and this usually takes a lot of time. Try reducing the validation dataset, or wait for a while; that worked for me. It seems like it's stuck, but it is running on the validation dataset.

bogdanlalu added a commit to bogdanlalu/maskrcnn_TF2 that referenced this issue Apr 25, 2021
matterport/Mask_RCNN#2243
SOLUTION: Downgrade to scikit-image==0.16.2

matterport/Mask_RCNN#749
You can safely ignore this warning. It's a preemptive warning from TensorFlow when it cannot be certain of the size of the generated tensor.
SOLUTION:
import warnings
warnings.filterwarnings('ignore')

matterport/Mask_RCNN#127
keras-team/keras#8595 (comment)
SOLUTION: Set workers=1 and ensure use_multiprocessing=False in model.py

matterport/Mask_RCNN#2111
Delete the use_mini_mask from the argument list. It's enabled by default in config.py in this version.
@NeerajanS

I also experienced a similar issue when running a training job with TensorFlow 1.15 using a Keras Sequential model. I also got the warning "Method (on_train_batch_end) is slow compared to the batch update. Check your callbacks" (see this issue).

I was able to overcome this issue by following these steps:

  1. Increase the batch size (earlier I was using 128; I increased it to 512)
  2. Set validation_steps = int(number of validation samples / batch_size) in model.fit

I found that changing the verbose value has no effect on the issue.

@ronithsaju

For those who were not able to solve this issue using the methods above: if you are using from tensorflow.keras.preprocessing.image import ImageDataGenerator, try changing it to from keras.preprocessing.image import ImageDataGenerator, or vice versa. It worked for me. It's said that you should never mix keras and tensorflow.keras imports.
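As a minimal illustration of keeping every import in a single namespace (here tensorflow.keras; the tiny model, paths and sizes are made-up placeholders, not from the comment above):

# everything comes from tensorflow.keras; do not mix with standalone keras.* imports
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense

datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory(
    'data/train',                    # hypothetical path
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPool2D(),
    Flatten(),
    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_generator, epochs=5)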

@supermomo668

supermomo668 commented Jul 9, 2021

I'd been stuck on this issue for about a day, but I found an elegant fix with this:

#train_generator = ...
#val_generator = ...
history = model.fit(
        train_generator,
        epochs=200,
        validation_data=val_generator,
        use_multiprocessing=True,
        workers=16,
        steps_per_epoch= train_generator.samples//train_generator.batch_size,  ######  Here
        validation_steps= val_generator.samples//val_generator.batch_size,   ##### Here
        callbacks=callbacks
        )

The key for me was to define validation_steps and steps_per_epoch from the samples and batch_size attributes of the generators themselves, so there won't be any discrepancies or mistakes.

@leon-kwy

leon-kwy commented Mar 4, 2022

[screenshot of the stack trace]

Same issue. I got stuck right at the beginning when training the model on Colab. It shows that I am stuck in the get function of _get_next_batch, and Colab doesn't show anything to tell me what went wrong. Could anyone tell me what is going on?

svbeuningen added a commit to Living-Technologies/Mask_RCNN that referenced this issue Mar 7, 2022
@emmanuel-nwogu

@leon-kwy Did you ever figure it out? I'm having the exact same problem with Mask RCNN.
