Could Keras handle a large dataset, for instance more than 50GB? #107

Closed
shawnLeeZX opened this issue May 10, 2015 · 6 comments

Comments

@shawnLeeZX

It is really great to see such an elegant design. I am wondering whether it is possible for Keras to use a NoSQL database such as LMDB as its data source, and then load data and do computation in parallel?

@fchollet
Member

Keras can work with datasets that don't fit in memory, through the use of batch training.

There are two ways to make this work:

# let's say you have a BatchGenerator that yields a large batch of samples at a time
# (but still small enough for the GPU memory)
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in BatchGenerator():  # these are large chunks of ~10k pictures
        model.fit(X_batch, Y_batch, batch_size=32, nb_epoch=1)


# Alternatively, let's say you have a MiniBatchGenerator that yields 32-64 samples at a time:
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in MiniBatchGenerator():  # these are small chunks of 32-64 samples
        model.train(X_batch, Y_batch)

@BoltzmannBrain

For people finding this for reference, the above calls to train and fit the model should be model.fit(X_train, Y_train, batch_size=32, nb_epoch=1) and model.train(X_train, Y_train), respectively.
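
For concreteness, a corrected sketch of the first loop (BatchGenerator is still the hypothetical generator from the example above):

# Corrected batch-training loop: the variables yielded by the generator
# are the ones actually passed to fit().
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in BatchGenerator():
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)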

@fchollet
Member

Note that nowadays you can use the model.fit_generator method with Python data generators.
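
A minimal sketch of that pattern, assuming a hypothetical load_batch() function that reads one chunk of the dataset from disk; the argument names follow the Keras 1.x API (samples_per_epoch, nb_epoch), whereas later versions use steps_per_epoch and epochs instead:

# Generator that streams (X, y) batches from disk indefinitely,
# which is what fit_generator expects.
def batch_generator(batch_size=32):
    while True:
        X_batch, y_batch = load_batch(batch_size)  # placeholder disk-reading function
        yield X_batch, y_batch

model.fit_generator(batch_generator(32),
                    samples_per_epoch=50000,  # total samples consumed per epoch
                    nb_epoch=10)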

@BoltzmannBrain

Thanks @fchollet, but this has not been working as expected for me (#3461), so I'm using this method temporarily.

@dudeperf3ct

dudeperf3ct commented Mar 24, 2017

Thanks @fchollet. Is it necessary to load all of the training data into memory before using a custom generator to yield indefinite batches? I am training on 27k images (without loading them all into memory), which I did with the method above, but how can this be achieved with fit_generator()? It supports callbacks and can also handle a validation dataset in the same call, which cannot be done with either train_on_batch() or fit().
Here is the code I used for loading batches of size 32:

# Training large data by breaking it into batches
import numpy as np

for e in range(num_epoch):
    print("epoch %d" % e)
    for step in range(num_iters):
        train_x, train_y = LoadTrainBatch(32)  # batch_size = 32
        train_x = np.asarray(train_x)
        train_y = np.asarray(train_y)
        for train_X, train_Y in zip(train_x, train_y):
            train_X = train_X.reshape(1, row, col, 3)  # add a batch dimension of 1
            model.fit(train_X, train_Y, nb_epoch=1)
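
One possible way to adapt this loop to fit_generator() (a sketch only: train_generator is a hypothetical wrapper around the LoadTrainBatch, num_iters, row and col names used above, and the argument names follow the Keras 1.x API):

import numpy as np

def train_generator(batch_size=32):
    # Yield batches indefinitely, as fit_generator expects.
    while True:
        train_x, train_y = LoadTrainBatch(batch_size)
        X = np.asarray(train_x).reshape(-1, row, col, 3)
        Y = np.asarray(train_y)
        yield X, Y

# Callbacks and a validation generator can be passed to this same call.
model.fit_generator(train_generator(32),
                    samples_per_epoch=num_iters * 32,
                    nb_epoch=num_epoch)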

@wjgan7

wjgan7 commented Jun 29, 2017

Does training for nb_epoch epochs on one batch and then moving on to the next batch significantly decrease the quality of the model, and can this method be done in parallel? It seems like redundantly reading from disk can be a bottleneck.
