Working with large datasets like Imagenet #68

Closed
superhans opened this Issue Apr 20, 2015 · 16 comments


superhans commented Apr 20, 2015

Hi Guys,

First and foremost, I think Keras is quite amazing!

So far, I see that the largest dataset used in the examples has about 50,000 images. I was wondering if it is possible to work with ImageNet-scale datasets (around 1,000,000 images, which are too big to fit in memory) by pre-processing the data (i.e., splitting it into, say, 1000 containers of 1000 images each) and feeding one container at a time to the model.fit() function. Or do I have to save_weights() and load_weights() after each container?

Thanks for reading.


fchollet commented Apr 20, 2015

Keras models absolutely do support batch training; the CIFAR10 example demonstrates this.

What's more, you can use the image preprocessing module (data augmentation and normalization) on batches as well. Here's a quick example:

datagen = ImageDataGenerator(
        featurewise_center=True, # set input mean to 0 over the dataset
        samplewise_center=False, # set each sample mean to 0
        featurewise_std_normalization=True, # divide inputs by std of the dataset
        samplewise_std_normalization=False, # divide each input by its std
        zca_whitening=False, # apply ZCA whitening
        rotation_range=20, # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.2, # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.2, # randomly shift images vertically (fraction of total height)
        horizontal_flip=True, # randomly flip images horizontally
        vertical_flip=False) # randomly flip images vertically

datagen.fit(X_sample) # let's say X_sample is a small-ish but statistically representative sample of your data

# let's say you have an ImageNet generator that yields ~10k samples at a time.
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet(): # these are chunks of ~10k pictures
        for X_batch, Y_batch in datagen.flow(X_train, Y_train, batch_size=32): # these are chunks of 32 samples
            loss = model.train(X_batch, Y_batch)

# Alternatively, without data augmentation / normalization:
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet(): # these are chunks of ~10k pictures
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)
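
Note that the ImageNet() generator above is just a placeholder, not something Keras provides. A minimal sketch of such a chunk loader, assuming a hypothetical layout of one sub-folder per class and leaving label encoding and dim ordering to the caller, could look like this:

import os
import numpy as np
from PIL import Image

def ImageNet(root_dir='imagenet_train', chunk_size=10000, target_size=(256, 256)):
    # Hypothetical layout: root_dir/<class_name>/<image>.JPEG
    class_names = sorted(os.listdir(root_dir))
    paths, labels = [], []
    for label, name in enumerate(class_names):
        class_dir = os.path.join(root_dir, name)
        for fname in os.listdir(class_dir):
            paths.append(os.path.join(class_dir, fname))
            labels.append(label)

    for start in range(0, len(paths), chunk_size):
        X, Y = [], []
        for path, label in zip(paths[start:start + chunk_size],
                               labels[start:start + chunk_size]):
            img = Image.open(path).convert('RGB').resize(target_size)
            X.append(np.asarray(img, dtype='float32') / 255.0)
            Y.append(label)
        # One ~10k-image chunk at a time; convert Y with np_utils.to_categorical
        # and transpose X to your backend's dim ordering before training.
        yield np.array(X), np.array(Y)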

superhans commented Apr 21, 2015

Thanks for your reply. Actually, I'm having a hard time even getting a "toy program" to work. Maybe I've done something wrong. Both data and labels are n x d and n x c numpy arrays (where d is the dimension of the data and c is the number of classes), right?

I wasn't able to get this code working correctly on the iris dataset. get_data() is a function that reads the data from a file.

from keras.models import Sequential                                        
from keras.layers.core import Dense, Dropout, Activation                   
from keras.optimizers import SGD                                           
from keras.utils import np_utils, generic_utils                            
import numpy, scipy, scipy.io                                              
import sys                                                                 

model = Sequential()                                                       
model.add(Dense(4, 4, init='uniform'))                                     
model.add(Activation('tanh'))                                              
model.add(Dense(4, 3, init='uniform'))                                     
model.add(Activation('softmax'))                                           
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.1, nesterov=True)                
model.compile(loss='mean_squared_error', optimizer=sgd)                    

train_data, train_labels = get_data('iris_training.dat')
test_data, test_labels = get_data('iris_test.dat')
valid_data, valid_labels = get_data('iris_validation.dat')

nb_classes = 3
t = test_labels
train_labels = np_utils.to_categorical(train_labels, nb_classes)   
test_labels = np_utils.to_categorical(test_labels, nb_classes)     
valid_labels = np_utils.to_categorical(valid_labels, nb_classes)   

model.fit(train_data, train_labels, nb_epoch=5, batch_size = 10, show_accuracy = True)      
score = model.evaluate(valid_data, valid_labels)                      
print model.predict_classes(valid_data)  # records output of program                             

This code outputs the following, which shows it isn't learning anything at all:
Epoch 0
75/75 [==============================] - 0s - loss: 0.2223 - acc.: 0.2636
Epoch 1
75/75 [==============================] - 0s - loss: 0.2222 - acc.: 0.4439
Epoch 2
75/75 [==============================] - 0s - loss: 0.2222 - acc.: 0.4167
Epoch 3
75/75 [==============================] - 0s - loss: 0.2222 - acc.: 0.4030
Epoch 4
75/75 [==============================] - 0s - loss: 0.2221 - acc.: 0.4030
37/37 [==============================] - 0s - loss: 0.2228
37/37 [==============================] - 0s
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


fchollet commented Apr 21, 2015

A toy example should be at least a properly formulated ML problem, otherwise the point is lost. You should look at MNIST; it's a good toy example: https://www.kaggle.com/users/123235/fchollet/digit-recognizer/simple-deep-mlp-with-keras

Here's a simpler version of your code, and its output. You can see it is in fact learning the training data, but with only a hundred or so samples, it starts overfitting immediately.

from keras.models import Sequential                                        
from keras.layers.core import Dense, Activation                                                              
from keras.utils import np_utils
from sklearn import datasets

iris = datasets.load_iris()    
print iris.data.shape
print iris.target.shape                                                       

model = Sequential()                                                       
model.add(Dense(4, 3, init='uniform'))                                   
model.add(Activation('softmax'))                                           

model.compile(loss='mean_squared_error', optimizer='rmsprop')                    

labels = np_utils.to_categorical(iris.target)                                              
model.fit(iris.data, labels, nb_epoch=5, batch_size=1, show_accuracy=True, validation_split=0.3)
Train on 105 samples, validate on 45 samples
Epoch 0
105/105 [==============================] - 0s - loss: 0.2116 - acc.: 0.3714 - val. loss: 0.3828 - val. acc.: 0.0000
Epoch 1
105/105 [==============================] - 0s - loss: 0.1659 - acc.: 0.5048 - val. loss: 0.4688 - val. acc.: 0.0000
Epoch 2
105/105 [==============================] - 0s - loss: 0.1428 - acc.: 0.7905 - val. loss: 0.5031 - val. acc.: 0.0000
Epoch 3
105/105 [==============================] - 0s - loss: 0.1258 - acc.: 0.9524 - val. loss: 0.5391 - val. acc.: 0.0000
Epoch 4
105/105 [==============================] - 0s - loss: 0.1113 - acc.: 0.9524 - val. loss: 0.5564 - val. acc.: 0.0000

nagadomi commented Apr 21, 2015

The iris data is sorted by label, so in this case validation_split in model.fit does not produce a representative validation set.
We should shuffle the data before training.

from keras.models import Sequential                                        
from keras.layers.core import Dense, Activation                                                              
from keras.utils import np_utils
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
shuffle = np.arange(len(iris.data))
np.random.shuffle(shuffle)
iris.data = iris.data[shuffle]
iris.target = iris.target[shuffle]

print iris.data.shape
print iris.target.shape                                                       

model = Sequential()                                                       
model.add(Dense(4, 3, init='uniform'))                                   
model.add(Activation('softmax'))                                           

model.compile(loss='mean_squared_error', optimizer='rmsprop')                    

labels = np_utils.to_categorical(iris.target)                                              
model.fit(iris.data, labels, nb_epoch=5, batch_size=1, show_accuracy=True, validation_split=0.3)

(150, 4)
(150,)
Train on 105 samples, validate on 45 samples
Epoch 0
105/105 [==============================] - 0s - loss: 0.2135 - acc.: 0.3524 - val. loss: 0.2137 - val. acc.: 0.2667
Epoch 1
105/105 [==============================] - 0s - loss: 0.2004 - acc.: 0.4190 - val. loss: 0.2051 - val. acc.: 0.5778
Epoch 2
105/105 [==============================] - 0s - loss: 0.1891 - acc.: 0.6952 - val. loss: 0.1956 - val. acc.: 0.6000
Epoch 3
105/105 [==============================] - 0s - loss: 0.1787 - acc.: 0.6952 - val. loss: 0.1842 - val. acc.: 0.6000
Epoch 4
105/105 [==============================] - 0s - loss: 0.1686 - acc.: 0.6952 - val. loss: 0.1757 - val. acc.: 0.6000


jfsantos commented Apr 21, 2015

If you have a huge dataset stored as an HDF5 file, you can use keras.utils.io_utils.HDF5Matrix to load the dataset. It only reads one batch at a time into memory, but there are some limitations (e.g., you cannot read shuffled data from the file, only sequential slices). A workaround is to shuffle the data before you store it to disk (but you would still read the same batches in the same order every epoch).

Here is a short example of how to do this. It assumes you have all of your samples in the same HDF5 file, with features and targets stored in HDF5 datasets named 'features' and 'targets':

from keras.utils.io_utils import HDF5Matrix

def load_data(datapath, train_start, test_start, n_training_examples, n_test_examples):
    # normalize_data is assumed to be a normalization function defined elsewhere
    X_train = HDF5Matrix(datapath, 'features', train_start, train_start+n_training_examples, normalizer=normalize_data)
    y_train = HDF5Matrix(datapath, 'targets', train_start, train_start+n_training_examples)
    X_test = HDF5Matrix(datapath, 'features', test_start, test_start+n_test_examples, normalizer=normalize_data)
    y_test = HDF5Matrix(datapath, 'targets', test_start, test_start+n_test_examples)
    return X_train, y_train, X_test, y_test

The returned variables here are not real Numpy arrays, but they implement the same interface so everything works transparently in keras (as long as you don't try to read shuffled indices).
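
For completeness, a minimal sketch of how such a file could be written with h5py (the file name, array shapes, and chunk size are hypothetical, and the random arrays stand in for real, pre-shuffled data):

import h5py
import numpy as np

n_samples, n_features = 100000, 3 * 32 * 32  # hypothetical sizes

with h5py.File('dataset.h5', 'w') as f:
    features = f.create_dataset('features', (n_samples, n_features), dtype='float32')
    targets = f.create_dataset('targets', (n_samples,), dtype='int32')
    # Write in chunks so the full array never has to fit in memory.
    # Shuffle the data before/while writing, since HDF5Matrix reads it sequentially.
    step = 10000
    for start in range(0, n_samples, step):
        stop = min(start + step, n_samples)
        features[start:stop] = np.random.rand(stop - start, n_features)    # stand-in data
        targets[start:stop] = np.random.randint(0, 10, size=stop - start)  # stand-in labels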


superhans commented Apr 21, 2015

I have a couple of other questions:

  1. model.fit() can be used to update models, right? That's what I've understood it to do. It doesn't train from scratch each time it is called.
  2. The batch_size seems to be a really crucial parameter. Is there some rule of thumb for selecting the batch size?

fchollet commented Apr 23, 2015

model.fit() can be used to update models, right? That's what I've understood it to do. It doesn't train from scratch each time it is called.

Yes. You're starting from the previous model state (i.e., the weights are not re-initialized).

The batch_size seems to be a really crucial parameter. Is there some rule of thumb for selecting the batch size?

In general, smaller batches give better results; however, larger batches make training faster, so there is a compromise to strike somewhere in between. I generally use 16 or 32, unless there is very little data, in which case I go full stochastic (batch_size = 1).


patyork commented Apr 23, 2015

In general, smaller batches give better results; however, larger batches make training faster, so there is a compromise to strike somewhere in between. I generally use 16 or 32, unless there is very little data, in which case I go full stochastic (batch_size = 1).

There's much debate there. I learned the opposite way: small batch sizes will approach a minimum faster, but using large batches will better approximate the distribution of the training data, thus giving a better result.

I go full stochastic (batch_size = 1).

This is online learning. Full stochastic would be one random sample per update, such that after nb_samples updates there is no guarantee that all of the training data has appeared once; some samples wouldn't have been trained on yet, and some would have been trained on twice. Yeah?


fchollet commented Apr 23, 2015

I learned the opposite way: small batch sizes will approach a minimum faster, but using large batches will better approximate the distribution of the training data, thus giving a better result.

Depends on what you mean by fast: stochastic learning gets to a minimum in fewer epochs/samples seen, but you're doing many more gradient updates, and the average time per sample increases dramatically. So the computing time will be longer (unless you are using very large batch sizes, in which case you're doing redundant computations). Just try it: batch_size = 1 vs. batch_size = 32 on, say, the CIFAR10 example. 32 will be much faster (computationally).

LeCun has long argued that the result obtained with stochastic learning is almost always better, thanks to the random noise it introduces. See: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

Stochastic learning also often results in better solutions because of the noise in the updates. Nonlinear networks usually have multiple local minima of differing depths. The goal of training is to locate one of these minima. Batch learning will discover the minimum of whatever basin the weights are initially placed. In stochastic learning, the noise present in the updates can result in the weights jumping into the basin of another, possibly deeper, local minimum. This has been demonstrated in certain simplified cases.


patyork commented Apr 23, 2015

Depends on what you mean by fast

Fair enough - I was referring to computational time.

Batch learning will discover the minimum of whatever basin the weights are initially placed.

With pre-training (although that just raises the point again: batches or not during pre-training?) this may not be a bad thing. Random noise is random, so there's always the possibility of jumping out of a global optimum, but such is life for those in the machine learning field.

I think we're arguing different points here though: LeCun is referring to batch training (batch_size == nb_examples), not to mini-batch learning (batch_size < nb_examples). I think mini-batch is the best of both worlds, and I usually train with stochastic mini-batches (a random subset of the data comprises a pass, and an epoch is just an arbitrary number of these passes, so that I can save the model frequently).
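
For illustration, a minimal sketch of that stochastic mini-batch scheme, assuming a compiled model and in-memory numpy arrays X_train / Y_train as in the earlier snippets (the pass count and checkpoint naming are arbitrary):

import numpy as np

batch_size = 32
nb_passes_per_epoch = 1000   # an "epoch" here is just a checkpointing interval
nb_epoch = 10

for e in range(nb_epoch):
    for _ in range(nb_passes_per_epoch):
        # draw a random subset of the training data for this pass
        idx = np.random.choice(len(X_train), size=batch_size, replace=False)
        loss = model.train_on_batch(X_train[idx], Y_train[idx])
    model.save_weights('checkpoint_%d.h5' % e, overwrite=True)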


superhans commented Apr 23, 2015

Got it. Thanks, everyone!!


hadi-ds commented May 9, 2016

Hi there, I have a related issue regarding consecutive calls of the .fit method on batches of the data. For example, in the following:

for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet(): # these are chunks of ~10k pictures
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)

It seems like each time fit is invoked, the model is fit to the given batch of data, but the next fit resets the model and starts fitting to the new batch (instead of starting from the weights at the end of the previous round).

The reason I am saying so is that, in my case, I have a large dataset that I read 10K lines at a time and use to fit the model. I see that the loss decreases steadily while fit runs, but when the next fit starts on the next batch, the loss jumps back up and starts decreasing again as before.
Therefore, what I do instead is to load as much data as I can at once and run several epochs on that.

Can someone comment on whether this is the expected behaviour every time fit is applied?


jfsantos commented May 9, 2016

You should not call model.fit, as it does exactly what you said it does: reset and train a model from scratch. If you're training on successive batches of data, use model.train_on_batch instead. You can also write a wrapper for your data using the Python generator interface and then use model.fit_generator, which behaves like model.fit but will use whatever generator you pass to it instead of a Numpy array.


fchollet commented May 9, 2016

No, no, model.fit does NOT reset the weights of your model. It starts from the previous state of the model. You can definitely call model.fit multiple times.

The difference between model.fit and model.train_on_batch is mainly that model.fit will break up your data into small batches whereas model.train_on_batch will use the data it gets as a single batch, running a single gradient update.
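
Concretely, a sketch of the two roughly equivalent ways to consume one in-memory chunk, using the API names from this thread (X_chunk and Y_chunk are hypothetical arrays, and the model is assumed to be compiled already):

batch_size = 32

# 1) let model.fit split the chunk into mini-batches internally
model.fit(X_chunk, Y_chunk, batch_size=batch_size, nb_epoch=1)

# 2) split the chunk yourself and run one gradient update per mini-batch
for start in range(0, len(X_chunk), batch_size):
    loss = model.train_on_batch(X_chunk[start:start + batch_size],
                                Y_chunk[start:start + batch_size])

(model.fit also shuffles the chunk by default, so the two loops are not strictly identical.)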


jfsantos commented May 10, 2016

Oops, my bad. The behaviour @hadi-ds is seeing is then probably due to the model overfitting a bit to the different slices of the dataset. Definitely the best practice with larger-than-memory datasets is to use either model.fit_generator with a "smart" generator or HDF5Matrix.
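
For example, a minimal sketch of such a generator feeding model.fit_generator, assuming a Keras version from that era (samples_per_epoch/nb_epoch arguments), a list of file paths with a numpy array of integer labels, and a hypothetical load_image() helper that returns one image array:

import numpy as np
from keras.utils import np_utils

def batch_generator(paths, labels, batch_size=32, nb_classes=1000):
    # Loop forever, yielding one shuffled mini-batch at a time.
    while True:
        order = np.random.permutation(len(paths))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            X = np.array([load_image(paths[i]) for i in idx])  # load_image() is assumed
            y = np_utils.to_categorical(labels[idx], nb_classes)
            yield X, y

# model.fit_generator(batch_generator(paths, labels),
#                     samples_per_epoch=len(paths), nb_epoch=10)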


kishensurajp commented Feb 23, 2017

Will these images get loaded in main memory or GPU memory?
