Could Keras handle a large dataset, for instance more than 50GB? #107

Closed
shawnLeeZX opened this issue May 10, 2015 · 6 comments

Comments

@shawnLeeZX

It is really great to see such an elegant design. I am wondering whether it is possible for Keras to use a NoSQL database such as LMDB as its data source, and then load data and do computation in parallel?

@fchollet
Member

Keras can work with datasets that don't fit in memory, through the use of batch training.

There are two ways to make this work:

# let's say you have a BatchGenerator that yields a large batch of samples at a time
# (but still small enough for the GPU memory)
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in BatchGenerator():  # these are large chunks of ~10k pictures
        model.fit(X_batch, Y_batch, batch_size=32, nb_epoch=1)


# Alternatively, let's say you have a MiniBatchGenerator that yields 32-64 samples at a time:
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in MiniBatchGenerator():  # these are small chunks of 32-64 samples
        model.train(X_batch, Y_batch)

@BoltzmannBrain

For people finding this for reference, the above calls to train and fit the model should be model.fit(X_train, Y_train, batch_size=32, nb_epoch=1) and model.train(X_train, Y_train), respectively.
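
For concreteness, a corrected sketch of the first loop (BatchGenerator is still the hypothetical generator from the example above):

# Corrected batch-training loop: the variables yielded by the generator
# are the ones actually passed to fit().
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in BatchGenerator():
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)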

@fchollet
Member

Note that nowadays you can use the model.fit_generator method with Python data generators.
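
A minimal sketch of that pattern, assuming a hypothetical load_batch() function that reads one chunk of the dataset from disk; the argument names follow the Keras 1.x API (samples_per_epoch, nb_epoch), whereas later versions use steps_per_epoch and epochs instead:

# Generator that streams (X, y) batches from disk indefinitely,
# which is what fit_generator expects.
def batch_generator(batch_size=32):
    while True:
        X_batch, y_batch = load_batch(batch_size)  # placeholder disk-reading function
        yield X_batch, y_batch

model.fit_generator(batch_generator(32),
                    samples_per_epoch=50000,  # total samples consumed per epoch
                    nb_epoch=10)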

@BoltzmannBrain

Thanks @fchollet, but this has not been working as expected for me (#3461), so I'm using this method temporarily.

@dudeperf3ct

dudeperf3ct commented Mar 24, 2017

Thanks @fchollet. Is it necessary to load all of the training data into memory before using a custom generator to yield indefinite batches? I am training on 27k images (without loading them all into memory), which I did with the method above, but how can this be achieved with fit_generator()? It supports callbacks and can also handle a validation dataset in the same call, which cannot be done with either train_on_batch() or fit().
Here is the code I used for loading batches of size 32:

# Training large data by breaking it into batches
import numpy as np

for e in range(num_epoch):
    print("epoch %d" % e)
    for step in range(num_iters):
        train_x, train_y = LoadTrainBatch(32)  # batch_size = 32
        train_x = np.asarray(train_x)
        train_y = np.asarray(train_y)
        for train_X, train_Y in zip(train_x, train_y):
            train_X = train_X.reshape(1, row, col, 3)  # add a batch dimension of 1
            model.fit(train_X, train_Y, nb_epoch=1)
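
One possible way to adapt this loop to fit_generator() (a sketch only: train_generator is a hypothetical wrapper around the LoadTrainBatch, num_iters, row and col names used above, and the argument names follow the Keras 1.x API):

import numpy as np

def train_generator(batch_size=32):
    # Yield batches indefinitely, as fit_generator expects.
    while True:
        train_x, train_y = LoadTrainBatch(batch_size)
        X = np.asarray(train_x).reshape(-1, row, col, 3)
        Y = np.asarray(train_y)
        yield X, Y

# Callbacks and a validation generator can be passed to this same call.
model.fit_generator(train_generator(32),
                    samples_per_epoch=num_iters * 32,
                    nb_epoch=num_epoch)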

@wjgan7

wjgan7 commented Jun 29, 2017

Does training for nb_epoch epochs on one batch and then moving on to the next batch significantly decrease the quality of the model, and can this method be done in parallel? It seems like redundantly reading from disk can be a bottleneck.
