
keras 2 - fit_generator broken? #5818

Closed
daavoo opened this issue Mar 16, 2017 · 25 comments

@daavoo
Contributor

daavoo commented Mar 16, 2017

I updated to Keras v2 yesterday.

I adapted all my code from version 1 to the new API, following all the warnings I encountered.

However, I'm having some very strange problems with the fit_generator method of Model.

Using this toy example, which worked totally fine in version 1:

from keras.models import Model
from keras.layers import Input, Dense, Flatten
from keras.optimizers import SGD
from keras.losses import categorical_crossentropy
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator()
train_batches = gen.flow_from_directory("D:/GitHub/Kaggle/redux/train/")

inp = Input(shape=(256,256,3))
l1 = Flatten()(inp)
out = Dense(2, activation="softmax")(l1)

model = Model(inp, out)

model.compile(loss=categorical_crossentropy, optimizer=SGD(lr=0.01))

model.fit_generator(train_batches, train_batches.samples / train_batches.batch_size)

The output in the Jupyter notebook is quite strange, printing an unknown symbol until the notebook crashes:

Epoch 1/1
   23/718 [..............................] - ETA: 522s - loss: 8.4146 �������������������������������������������������������

Running the code from the terminal doesn't print those strange symbols.

The code works perfectly when manually getting the batches from the generator to use with model.fit:

import numpy as np

n = 0
for imgs, labels in train_batches:
    if n > 3:
        break
    X_train = np.array(imgs)
    y_train = np.array(labels)
    model.fit(X_train, y_train)
    n += 1
Epoch 1/1
   32/32 [==============================] - 0s - loss: 7.5555
   Epoch 1/1
   32/32 [==============================] - 0s - loss: 8.5627
   Epoch 1/1
   32/32 [==============================] - 0s - loss: 6.5480
   Epoch 1/1
   32/32 [==============================] - 0s - loss: 10.0738

Is anyone facing similar problems with fit_generator, and/or does anyone know something about it?

@daavoo daavoo closed this as completed Mar 16, 2017
@georgenizharadze

I'm also having a problem with fit_generator after upgrading to Keras 2. The model training time has gone up about 1000 times! I have not yet figured out why. I read that in Keras 2 the fit_generator number of samples has been replaced by the number of batches. I suspect this is the cause of the issue but don't know for sure.
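
For anyone else hitting that slowdown, here is a minimal, hedged sketch of the argument change (the model, generator and numbers below are made up, not from this thread). In Keras 1 the second fit_generator argument counted samples per epoch; in Keras 2 it counts batches (steps), so reusing the old sample count makes every epoch run roughly batch_size times more batches than intended.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def dummy_gen(batch_size=32):
    # yields random batches forever, as fit_generator expects
    while True:
        yield np.random.rand(batch_size, 10), np.random.rand(batch_size, 1)

model = Sequential([Dense(1, input_shape=(10,))])
model.compile(optimizer='sgd', loss='mse')

n_samples, batch_size = 3200, 32

# Keras 1 (old API):
#   model.fit_generator(dummy_gen(), samples_per_epoch=n_samples, nb_epoch=2)
# Keras 2: pass the number of *batches*; passing the old sample count here
# would make each epoch roughly batch_size times longer than intended.
model.fit_generator(dummy_gen(batch_size),
                    steps_per_epoch=n_samples // batch_size,
                    epochs=2)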

@mxbi

mxbi commented Mar 17, 2017

Hi @daavoo, can you explain why you closed this issue? Did you manage to solve it? I have seen several people having this issue today, so if you have found a workaround it would be much appreciated.

@daavoo daavoo reopened this Mar 17, 2017
@daavoo
Contributor Author

daavoo commented Mar 17, 2017

Ok so:

  • About loading images one by one: the problem was that I was used to Keras v1, where the number printed was the number of images. Now it is the number of steps. The slowness is just caused by the overhead of printing that strange symbol.

  • Setting verbose=0 on fit_generator avoids printing this strange thing, at the cost of not printing anything at all (a small example follows below).

Lastly, this is a little embarrassing, but I closed the issue by mistake while closing issues from my repos. xD
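
To make those two points concrete, a small hedged example reusing model and train_batches from the code at the top of the thread (the integer step count is an assumption, not something prescribed in this issue):

# Keras 2: the progress counter shows steps (batches), not samples.
steps = train_batches.samples // train_batches.batch_size
# verbose=0 suppresses the progress bar entirely, which avoids the garbled
# notebook output at the cost of printing nothing at all.
model.fit_generator(train_batches, steps_per_epoch=steps, epochs=1, verbose=0)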

@daavoo
Contributor Author

daavoo commented Mar 17, 2017

I tracked down the problem and I think the issue is related to this class:
class Progbar(object):
in https://github.com/fchollet/keras/blob/master/keras/utils/generic_utils.py#L211
Too lazy to continue the investigation today.
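
For context, a stripped-down, paraphrased sketch of the write pattern involved (this is not the actual Keras source; see the Progbar class linked above). The dynamic display rewinds the current line with backspace and carriage-return characters, which terminals interpret but which some front-ends render as replacement characters instead:

import sys

prev_total_width = 40  # width of the previously printed bar
sys.stdout.write('\b' * prev_total_width)  # rewind with backspaces
sys.stdout.write('\r')                     # return to the start of the line
sys.stdout.write('23/718 [>.............] - ETA: 522s - loss: 8.4146')
sys.stdout.flush()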

@debarundhar

Hi, I'm having this same problem after upgrading to Keras 2 on Ubuntu 16.04. Any progress on a fix?

@jerpint

jerpint commented Mar 22, 2017

Hello, I'm not sure if this is the right place to ask, but hopefully someone can help. I'm having some issues with fit_generator() in Keras v1.

It seems to work well on the first epoch, but not on the epochs afterwards. I say this because on the first epoch the model takes a significant amount of training time and returns accuracy metrics that seem plausible. However, from epoch 2 onwards the training time decreases significantly and the accuracy shoots up to 1 (obviously suspicious). It seems as though the generator doesn't reset appropriately. Does anyone know what could be causing this? (A hedged sketch of one possible cause follows after the output below.)

My code :

def batch_generator_train():
    
    from keras.utils import np_utils

    global f_train
    dset_train = f_train['urbansound']
    global batch_size
    global count_train
    global meta_info_train
    global nb_classes
    idx = range(0,count_train)
    np.random.shuffle(idx)
    count=0
    while 1:
        idx_tmp = idx[count*batch_size:(count+1)*batch_size]
        X_train = np.zeros((batch_size,128,128,1))
        y_train = np.zeros(batch_size)
        #y_meta_train_all = []
        for ii,jj in enumerate( idx_tmp ):
            X_train[ii,:,:,0] = dset_train[jj]
            y_train[ii] = meta_info_train[jj][6]
            #y_meta_train_all.append( meta_info_train[jj])
        Y_train = np_utils.to_categorical(y_train, nb_classes)
        yield X_train,Y_train
        count=count+1

    
def batch_generator_val():
    
    from keras.utils import np_utils

    global f_val
    dset_val = f_val['urbansound']
    global batch_size
    global count_val
    global meta_info_valid
    global nb_classes
    idx = range(0,count_val)
    np.random.shuffle(idx)
    count=0
    while 1:
        idx_tmp = idx[count*batch_size:(count+1)*batch_size]
        X_val = np.zeros((batch_size,128,128,1))
        y_val = np.zeros(batch_size)
        #y_meta_train_all = []
        for ii,jj in enumerate( idx_tmp ):
            X_val[ii,:,:,0] = dset_val[jj]
            y_val[ii] = meta_info_valid[jj][6]
            #y_meta_train_all.append( meta_info_train[jj])
        Y_val = np_utils.to_categorical(y_val, nb_classes)
        yield X_val,Y_val
        count=count+1

and my network definitions


f_train = h5py.File("/home/jerpint/Desktop/Audiostuff/aug/Xtrain.h5", "r")
f_val = h5py.File("/home/jerpint/Desktop/Audiostuff/aug/Xvalid.h5", "r")

generator_train = batch_generator_train()
generator_val = batch_generator_val()

# callbacks
filepath = 'test2_callback_audio.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

#count_val = number of validation samples, count_train = number of train samples
history = model.fit_generator(generator=generator_train,
                              samples_per_epoch=int(np.floor(count_train / batch_size) * batch_size),
                              nb_epoch=5, verbose=2,
                              validation_data=generator_val,
                              nb_val_samples=int(np.floor(count_val / batch_size) * batch_size))  # , callbacks=callbacks_list
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
f_train.close()
f_val.close()

this, in turn, returns

75584 test samples
Epoch 1/5
7412s - loss: 0.8559 - acc: 0.7073 - val_loss: 1.3755 - val_acc: 0.6275
Epoch 2/5
435s - loss: 4.5010e-04 - acc: 0.9999 - val_loss: 1.1921e-07 - val_acc: 1.0000
Epoch 3/5
437s - loss: 3.4126e-06 - acc: 1.0000 - val_loss: 1.1921e-07 - val_acc: 1.0000
Epoch 4/5
437s - loss: 1.8840e-06 - acc: 1.0000 - val_loss: 1.1921e-07 - val_acc: 1.0000
Epoch 5/5
437s - loss: 1.6184e-06 - acc: 1.0000 - val_loss: 1.1921e-07 - val_acc: 1.0000
Test score: 7.35714074846
Test accuracy: 0.40946496613
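
A hedged guess at the cause, based only on the code above (not confirmed anywhere in this thread): count keeps growing across epochs and is never reset, so after the first pass idx_tmp becomes empty and the generator yields the pre-allocated all-zero arrays, which the network "learns" trivially. A minimal sketch of a generator that wraps its counter instead (hypothetical names, not the data used above):

import numpy as np

def batch_generator(data, labels, batch_size):
    idx = np.arange(len(data))
    np.random.shuffle(idx)
    n_batches = len(data) // batch_size
    count = 0
    while True:
        if count == n_batches:
            # reset (and reshuffle) at the end of every pass through the data
            np.random.shuffle(idx)
            count = 0
        batch_idx = idx[count * batch_size:(count + 1) * batch_size]
        yield data[batch_idx], labels[batch_idx]
        count += 1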

@fchollet
Member

fchollet commented Mar 22, 2017 via email

@daavoo
Contributor Author

daavoo commented Mar 22, 2017

I'm not having this issue anymore.

@daavoo daavoo closed this as completed Mar 22, 2017
@harpone

harpone commented Mar 25, 2017

FYI I'm getting the exact same weird character printing issue in a Jupyter notebook without data generators.

Steps to reproduce: copy-paste the mnist_cnn example below into a notebook and evaluate the cell.

@daavoo
Contributor Author

daavoo commented Mar 26, 2017

Are you using Windows?

@jerpint

jerpint commented Mar 26, 2017

Jupyter doesn't handle Keras verbosity very well for me either (Python 2.7, Ubuntu 16.04); it crashes it most of the time for me as well. I recommend setting verbosity to a minimum or to 0 completely, or running Python directly from a terminal, which seems to work fine.

@harpone

harpone commented Mar 26, 2017

@daavoo Ubuntu 16.04 and Jupyter on latest Chrome

@daavoo
Contributor Author

daavoo commented Mar 26, 2017

@harpone This is really strange. I was facing this issue last week on Windows, but right now I'm using the same configuration as yours and it works fine for me. Do you have the latest Jupyter and Keras versions?

PS. Solutions for those interested, apart from setting verbose to 0:

  • Convert the notebook to a Python script with:
    jupyter nbconvert --to script [YOUR_NOTEBOOK].ipynb
  • Run the code as a Python script. The issue seems to be related to Jupyter.

@daavoo
Contributor Author

daavoo commented Mar 26, 2017

@harpone as I mentioned above, the issue is related to the class Progbar(object) in https://github.com/fchollet/keras/blob/master/keras/utils/generic_utils.py#L211

I noticed the problem using fit_generator, but any method that uses the Progbar will also print those strange symbols, which in the mnist example is fit.

@harpone

harpone commented Mar 26, 2017 via email

@Abhilash-Chandran

Abhilash-Chandran commented Sep 8, 2017

For those who are looking at this issue now.

My solution:
switch both steps_per_epoch and validation_steps to the number of samples / batch size.

I believe this is what @fchollet mentioned in his response.
For me verbose=1 worked fine.

So my batch size is 32, the number of samples was 8000, and the validation set size was 2000.

As per the Keras documentation, steps_per_epoch = number of train samples / batch_size and validation_steps = number of validation samples / batch_size:

  1. steps_per_epoch = 8000/32 = 250

  2. validation_steps = 2000/32 = 62.5

The following code was tested on the day of this comment, 08 Sep 2017 (a small sketch for computing whole-number steps follows after it).

Old Code
classifier.fit_generator(training_set, samples_per_epoch=8000, epochs=25, verbose=1, validation_data=test_set, validation_steps=2000)

Changed Code
classifier.fit_generator(training_set, steps_per_epoch=250, epochs=25, verbose=1, validation_data=test_set, validation_steps=62.5)
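
If you want the steps to be whole numbers (the snippet above passes validation_steps=62.5 directly), here is a small hedged sketch deriving them from the same 32/8000/2000 figures, rounding up so the last, partially filled batch is still covered (an assumption, not something the comment above requires):

batch_size = 32
n_train, n_val = 8000, 2000

# round up with integer arithmetic so the last partial batch is covered
steps_per_epoch = (n_train + batch_size - 1) // batch_size   # 250
validation_steps = (n_val + batch_size - 1) // batch_size    # 63 instead of 62.5

classifier.fit_generator(training_set, steps_per_epoch=steps_per_epoch, epochs=25,
                         verbose=1, validation_data=test_set,
                         validation_steps=validation_steps)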

@deltaz0

deltaz0 commented Nov 7, 2017

In case it helps anyone coming across this issue of fit() calls flooding jupyter with characters and crashing the tab:

you can probably fix with

conda upgrade notebook

if that doesn't work you can try

conda upgrade ipython ipywidgets jupyter jupyter_console notebook qtconsole spyder widgetsnbextension

The old Jupyter kernel doesn't correctly interpret \b and \r, and the progbar output will flood the browser tab with these until it crashes. Specifically, this happens in the 'dynamic display' case, in which these lines overwrite the bar between batches. The \b character will print a � and the \r character will clear a line but seems to keep the text in memory, or fails to flush old lines, or something.

The new Jupyter kernel properly interprets these characters and should not have this issue.

There is a similar issue in VSCode's output, where \b is interpreted as � and \r is simply ignored. Not a big deal, as it flushes itself so it won't crash the debugger, but it makes the output ugly. The solution is to wait for VSCode to properly interpret these characters, but it can be temporarily worked around by changing these lines (in the Progbar class in keras/utils/generic_utils.py) from:

sys.stdout.write('\b' * prev_total_width)
sys.stdout.write('\r')

to:

#sys.stdout.write('\b' * prev_total_width)
sys.stdout.write('\n')

Or just use verbose=2, depending on how badly you want a progress bar.

There is also another solution in #4880, which just replaces the progbar entirely.

@jnygaard

Were these issues (as far as they were not user errors) with Keras (2) and fit_generator resolved? I arrived at this thread googling for an explanation of, or solution to, similar problems that I have. I have a very simple (toy) model set up. Training etc. works very well when using fit:

Epoch 1/100
1507/1507 [==============================] - 2s 1ms/step - loss: 1.0992 - acc: 0.3384 - val_loss: 1.1007 - val_acc: 0.3260
...
Epoch 100/100
1507/1507 [==============================] - 1s 719us/step - loss: 0.0303 - acc: 1.0000 - val_loss: 0.2245 - val_acc: 0.9304

Now I'm trying to get fit_generator to work, but without success. Note that all my data fits well in memory at the same time, and the generator is just a dummy one, only taking care of shuffling and batch-division, no real data processing:

12/12 [==============================] - 1s 114ms/step - loss: 1.0815 - acc: 0.5072 - val_loss: 1.0649 - val_acc: 0.5977
...
Epoch 100/100
12/12 [==============================] - 1s 103ms/step - loss: 0.0241 - acc: 0.9889 - val_loss: 0.0226 - val_acc: 0.9902

First, a comment regarding the "verbose=1" outputs above: why is the "fit version" showing "sample/samples" for each epoch, while the "fit_generator version" shows "step/steps"? Shouldn't this have been a change from version 1 to 2 of Keras, rather than a difference between fit and fit_generator?

Second: Loss and accuracy values seem fine, and converging nicely, in both cases. Running time is also about the same. However, the test on additional test data shows that only the "fit version" produces a good model! (96.57% vs. 40.20% accuracy on test data!)

Note that my number of training samples is 1507, batch size 128, so number of batches should be 12, with the last one not completely filled.

Any comments appreciated!

@jnygaard

A short follow-up to my own post: even stranger is that, for a given model, evaluation on a test data set gives different results depending on whether evaluate or evaluate_generator is used:

(loss, accuracy) = best_model.evaluate( testData, testLabels, batch_size=batch_size, verbose=1 )
print( "\n[INFO] accuracy: {:.2f}%".format(accuracy * 100) )
990/990 [==============================] - 0s 428us/step
[INFO] accuracy: 40.81%

(loss, accuracy) = best_model.evaluate_generator( g2.generator(), steps=g2.steps() )
print( "\n[INFO] accuracy: {:.2f}%".format(accuracy * 100) )
[INFO] accuracy: 42.09%

In this case, it's hard to make any mistakes (maybe I still managed?!) since the generator should be extremely simple; mine is this:

class DataGenForEval:

    def __init__( self, width, height, depth, b_size, testData_len ):
        self.width, self.height, self.depth, self.b_size = width, height, depth, b_size
        self.test_size    = testData_len
        self.test_indices = np.arange( testData_len ) # Since we are not splitting an index set into two (train+validate), we don't really need this

    def __generate_a_batch( self, sample_list_for_a_batch ):
        X = np.empty( (self.b_size, self.width, self.height, self.depth, 1) )
        Y = np.empty( (self.b_size, testLabels.shape[1]), dtype=int )
        for i, sample in enumerate(sample_list_for_a_batch):
            X[i, :, :, :, :] = testData[i]   # testData and testLabels are global variables, this is a "hello world" test case...
            Y[i]             = testLabels[i]
        return X, Y
    
    def generator( self ):
        while 1:
            indexes = self.test_indices
            for i in range( int( (len(indexes)-1)/self.b_size ) + 1 ):
                indexes_subset = indexes[i*self.b_size:(i+1)*self.b_size]
                yield self.__generate_a_batch( indexes_subset )
                
    def steps( self ):
        return int((self.test_size-1)/self.b_size)+1

I really need to get this resolved before committing more effort to using this system...

@jnygaard

jnygaard commented Dec 10, 2017

Ugh, yet another comment: I just noticed that subsequent calls to the test data evaluation with the evaluate_generator method, and parameter workers=0, yield different results, with no changes to the model whatsoever, nor to anything else in the Python "state". This has to be bad. I use no randomness in the evaluator/generator, and workers=0 should ensure no threading issues, right?!
Still,

(loss, accuracy) = best_model.evaluate_generator( g2.generator(), workers=0, steps=g2.steps() )
print( "\n[INFO] accuracy: {:.2f}%".format(accuracy * 100) )
[INFO] accuracy: 42.29%
[...about 5 calls returning the same result, then...]
[INFO] accuracy: 42.68%

@jnygaard

Uh... I must hurriedly admit to having made an embarrassingly stupid bug involving a missed indirection in training data dereferencing, which appears to be the cause of my odd observations. Hope I didn't waste anybody's time too much... "Rubber duck debugging" to a real and observant colleague solved it!

@GuillaumeDesforges

GuillaumeDesforges commented Oct 18, 2018

Hi,

I have the same issue with my notebook: https://gist.github.com/GuillaumeDesforges/da20d65b825a8e13da9cc1489eeee543/7be860e635b03291e0657f1f4896212d9ccf3f4c

I can clearly see that before the fit_generator call, RAM usage is stable at 6 GB. When I start fit_generator I see the RAM usage growing until it reaches the maximum on my PC (16 GB).

However, getting an item from my Sequence class does not increase RAM.

Does fit_generator do anything other than just computing the batches? (A sketch of the prefetch settings follows below.)

Thanks,
Guillaume
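
Regarding the prefetch question: fit_generator does more than call the generator lazily; it keeps a queue of prepared batches in memory and can build them in background workers, so memory use can grow beyond what a single __getitem__ call needs. A hedged, self-contained sketch of the settings that control this (the Sequence subclass, model and numbers are hypothetical, not the notebook above):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import Sequence

class RandomBatches(Sequence):
    # hypothetical Sequence yielding random batches
    def __init__(self, n_samples=1000, batch_size=32):
        self.n_samples, self.batch_size = n_samples, batch_size

    def __len__(self):
        return self.n_samples // self.batch_size

    def __getitem__(self, idx):
        return (np.random.rand(self.batch_size, 10),
                np.random.rand(self.batch_size, 1))

model = Sequential([Dense(1, input_shape=(10,))])
model.compile(optimizer='sgd', loss='mse')

seq = RandomBatches()
# Up to max_queue_size batches can sit in memory at once, built by `workers`
# background threads (or processes with use_multiprocessing=True).
model.fit_generator(seq, steps_per_epoch=len(seq), epochs=1,
                    max_queue_size=10, workers=1, use_multiprocessing=False)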

@pravendrakhichi

I guess that's because you are passing a DataFrame both as input and as target.
fit_generator expects the input as a DataFrame/ndarray and the target variable as plain data, e.g. you can pass dataframe.values or an ndarray. I guess this might be helpful.

@basuCool

(quoting @Abhilash-Chandran's solution above)

Hi,
Even after changing the fit method, I am still getting a file-not-found error.
Code:

# Convolutional Neural Network

# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# Initializing the CNN
classifier = Sequential()

# Step 1: Convolution
classifier.add(Convolution2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))

# Step 2: Max pooling
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Step 3: Flattening
classifier.add(Flatten())

# Step 4: Full connection
classifier.add(Dense(units=128, activation='relu'))

# Adding the output layer
classifier.add(Dense(units=1, activation='sigmoid'))

# Compiling the CNN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fitting the CNN to the images
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size=(64, 64),
                                                 batch_size=32,
                                                 class_mode='binary')

test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size=(64, 64),
                                            batch_size=32,
                                            class_mode='binary')

Error in console:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/training_set\dogs\dog.3373.jpg'

@karthik2215

karthik2215 commented Nov 29, 2021

Hi, while training on my dataset, fewer .model files were saved than the number of epochs

(i.e. the epoch count was 10, but only 6 .model files were saved on my system).
