Keras doesn't reproduce Caffe example code accuracy #4444
Here is an even simpler example: a dense feedforward network with only 2 hidden layers. No momentum, no weight decay, fixed learning rate. I also got rid of mean subtraction, so it runs with the CIFAR data that the Caffe examples download for you out of the box.

The Caffe code, which consistently does better than 20% accuracy:

The Keras code, which rarely seems to break 15%, and often gets stuck at 10%:

import tensorflow as tf
sess = tf.Session()
tf.python.control_flow_ops = tf  # common monkey-patch for Keras 1.x on TensorFlow 0.11
from keras import backend as K
K.set_session(sess)
from keras.layers.core import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import SGD
from keras.initializations import normal
from keras.datasets import cifar10
from keras.utils.np_utils import to_categorical
from functools import partial
# Helper functions
def gaussian(shape, name=None, scale=.01):
    return normal(shape, scale=scale, name=name)
# Network definition
nn = Sequential()
nn.add(Flatten(input_shape=(32, 32, 3)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.1)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.1)))
nn.add(Dense(10, activation='softmax', init=partial(gaussian, scale=.1)))
nn.compile(loss='categorical_crossentropy',
           optimizer=SGD(.00001),
           metrics=['accuracy'])
# Get and format data:
(train, label), (test, test_label) = cifar10.load_data()
label, test_label = to_categorical(label), to_categorical(test_label)
# Evaluate network
nn.fit(train, label, validation_data=(test, test_label), nb_epoch=24, batch_size=100)
print("\n", "Final Accuracy:", nn.evaluate(test, test_label)[1]) Learning rate, connectivity, initialization, momentum, loss, batch size, training epochs, everything else seems to be identical. Has anyone had success reproducing any Caffe code in Keras 1 to 1? It seems that given the exact same network definition Caffe reliably does better than Keras in terms of accuracy. Is there something obvious I am missing? |
By the way, you can make your hyperparameters better:

from keras.layers.core import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import Adam
from keras.initializations import normal
from keras.datasets import cifar10
from keras.utils.np_utils import to_categorical
from functools import partial
# Helper functions
def gaussian(shape, name=None, scale=.01):
    return normal(shape, scale=scale, name=name)
# Network definition
nn = Sequential()
nn.add(Flatten(input_shape=(3, 32, 32)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.0001)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.0001)))
nn.add(Dense(10, activation='softmax', init=partial(gaussian, scale=.0001)))
nn.compile(loss='categorical_crossentropy',
           optimizer=Adam(3e-4),
           metrics=['accuracy'])
# Get and format data:
(train, label), (test, test_label) = cifar10.load_data()
label, test_label = to_categorical(label), to_categorical(test_label)
# Evaluate network
nn.fit(train, label, validation_data=(test, test_label), nb_epoch=10, batch_size=32)
print("\n", "Final Accuracy:", nn.evaluate(test, test_label)[1]) |
I appreciate your feedback! Yes, absolutely there are ways I can tune the hyperparameters to be better - that is actually the nature of the project I am working on. That said, my original problem still applies: when I implement those same hyperparameters in Caffe, it reliably gets 1-2% better accuracy. The underlying point is that, as far as I can tell, if you implement the same model in Keras and in Caffe, Caffe reliably does better. This is pretty bad when trying to reproduce published results - 1 percent can be make or break when trying to beat ImageNet, etc. Is there some other explanation for why Keras seems to underperform Caffe?

As a postscript, here are 5 trials each of Caffe, Keras w/ Theano dev, and Keras w/ TensorFlow with those hyperparameters, to illustrate what I mean:

ACCURACIES:
Theano:
Caffe:

Note especially the consistency that Caffe shows, in addition to its performance.
I haven't used Caffe, but for Keras you have nb_epoch=24, and for Caffe you have max_iter=12000. Are those the same?
Yup: 12000 batches * 100 samples per batch / 50000 samples in the training set = 24 epochs. (You can also verify this by calling model.train_on_batch() 12000 times while moving across the dataset - note that you then have to handle callbacks manually. You end up with the same results.)
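For reference, a minimal sketch of that manual loop (it assumes the nn, train, and label objects from the code above):

batch_size = 100
n_iterations = 12000  # 12000 * 100 / 50000 = 24 passes over the training set
n_samples = len(train)

for i in range(n_iterations):
    start = (i * batch_size) % n_samples
    # Train on one 100-sample slice; no callbacks fire here, so handle them manually
    nn.train_on_batch(train[start:start + batch_size], label[start:start + batch_size])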
I don't think your Caffe and Keras examples are equivalent. You have padding set to 2 on your convolutions in Caffe, but not in Keras (note that zero-padding after the conv isn't the same thing, since padding before the convolution lets the kernel convolve partially with the valid data, which you aren't doing here in Keras).
Thanks Yann! That is a huge help. It looks like zero padding should be done before the convolution is applied (using border_mode='same'), and not after (using ZeroPadding2D), in order to match how Caffe handles it. Good to know. So that solves the mystery of the first example, great! Does anyone have any thoughts on the second? It is much simpler, and doesn't have any convolutional layers to misconfigure.
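To make the distinction concrete, here is a small sketch in the Keras 1 API (the filter count and kernel size are illustrative, not taken from the original prototxt):

from keras.layers import Convolution2D, ZeroPadding2D

# Matches Caffe's pad=2 with a 5x5 kernel: zeros are added around the input,
# so the kernel slides partially over real data at the borders.
conv_like_caffe = Convolution2D(32, 5, 5, border_mode='same')

# NOT equivalent: a 'valid' convolution first shrinks the feature map, and
# zero-padding the output afterwards just appends borders of zeros.
conv_then_pad = [Convolution2D(32, 5, 5, border_mode='valid'),
                 ZeroPadding2D(padding=(2, 2))]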
I tried more complicated examples with larger models (e.g. VGG) and more images. I was wondering why no one has tried training ImageNet on Keras.
Yes, even with the CNN code I posted originally, tweaking hyperparameters easily causes the Caffe and Keras results to diverge again - even with border_mode='same'. I agree that there seems to be a lack of empirical evidence showing that Keras can effectively replicate many of the common models out there (without simply copying their weights). I love Keras, but I know that researchers, businesses, etc. would think twice if they could get better results implementing the exact same model in another framework.

It is possible there is still some flaw in my implementation, and if anyone notices it please let me know. But I am starting to get the impression that there is a deeper issue here - it is just tricky to pin down. Is there something Keras is doing that is fundamentally different from Caffe?

After running a battery of hyperparameter search trials in both Keras and Caffe with the original model I posted, Keras seems to do better when it has less regularization applied, relative to Caffe, in most layers. One idea is that Keras, or the underlying frameworks, are less numerically optimized, which would add noise to the training process and thus add implicit regularization, or potentially inhibit convergence (as you see with your VGG trials). I know Caffe's backward pass is hand-coded, whereas the Keras backends (TensorFlow and Theano) and most other frameworks rely on automatic differentiation. Maybe we are missing some tricks from numerical analysis - does anyone have any thoughts on that?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.
I've noticed the same thing. I've tried multiple datasets and always get better accuracy with Caffe. The training loss in Keras tends to go down more slowly, and it takes many more epochs than Caffe. I've tried many hyperparameters and settings, but I think you are on to something here.
I have tried to train on ImageNet from scratch using TF Keras, but the accuracy is a few percent lower than with Caffe. I'm wondering if anyone else has had this experience?
@qtianmercury What model did you train, and what accuracies are you getting in Keras and Caffe?
Some of my own models. Also, in the past week, I tried ResNet50 on ImageNet. The accuracy I got using TF is approximately 75%, but with Keras (ResNet50 directly from keras-applications), the SGD optimizer, and exponential decay, the best accuracy I got is only 65% after 90 epochs. I did not try to train it in Caffe myself (since the original ResNet experiments were done in Caffe). My starting learning rate is 0.18 on 4 GPUs (with a batch size of 128 * 4). Maybe the learning rate and batch size are too high?
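If it helps with debugging, the exponential decay can be made explicit with a callback; a minimal sketch (the per-epoch decay factor here is illustrative, not the value from the experiment above):

from keras.callbacks import LearningRateScheduler

base_lr = 0.18  # starting learning rate mentioned above
decay = 0.94    # illustrative decay factor

def exp_decay(epoch):
    # Learning rate for the given (0-indexed) epoch
    return base_lr * (decay ** epoch)

# model.fit(..., callbacks=[LearningRateScheduler(exp_decay)])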
I have tried to reproduce several of Caffe's example networks, but the accuracy in Keras consistently seems to fall a few percent behind with the convolutional architectures. For example, here is a slightly modified version of Caffe's cifar10_quick prototxt, with the lr_mult settings removed and a step learning-rate schedule:
https://gist.github.com/dnola/459e0ab043b22e8dd93234b26eb66e24
I use the CIFAR data that Caffe downloads for you, and I use Caffe's compute_image_mean script to generate a mean file - just like in their example.
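On the Keras side, the equivalent of Caffe's mean file is just a per-pixel mean over the training images; a minimal sketch, assuming train and test are the image arrays returned by cifar10.load_data():

# Per-pixel mean over the training set, analogous to Caffe's compute_image_mean
mean_image = train.astype('float32').mean(axis=0)

# Subtract the training-set mean from both splits
train = train.astype('float32') - mean_image
test = test.astype('float32') - mean_image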
And here is Keras code that, as near as I can tell, is identical:
https://gist.github.com/dnola/538aa2cba85a20287dd68d1a5333674f
The Caffe code consistently gets above 76% accuracy (typically around 76.5%), while the Keras code consistently stays below 76%. In fact, in about a dozen trials, the best Keras result I got fell short of the worst Caffe result.
It is admittedly a small difference, but a very important one when trying to reproduce or beat the state of the art. I have had the same issue with reproductions of the Caffe cifar10_full example and the network-in-network example code - every time falling a few percent shy.
One thought is that Caffe rolls its softmax and loss into a single layer (http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1SoftmaxWithLossLayer.html#details), but I doubt numerical instability is the issue. Any ideas what might be going on? Alternatively, any Keras code that can reliably reproduce any of Caffe's CIFAR-10 examples would be appreciated, in case there is something I am missing.
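For what it's worth, the fused formulation is easy to try with the TensorFlow backend: make the final Dense layer linear and compute the cross-entropy directly from the logits. A minimal sketch, not the code from the gist:

import tensorflow as tf

# Fused softmax + cross-entropy, analogous to Caffe's SoftmaxWithLoss layer.
# Assumes the model's final Dense layer uses activation='linear' (raw logits).
def softmax_with_loss(y_true, y_pred):
    return tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)

# model.compile(loss=softmax_with_loss, optimizer=..., metrics=['accuracy'])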
I am on the latest development version of Keras, and I am using the TensorFlow backend (version 0.11.0rc1), though I get the same sub-76% results with Theano.
Thanks!