Keras doesn't reproduce Caffe example code accuracy #4444

Closed · dnola opened this issue Nov 20, 2016 · 14 comments

dnola commented Nov 20, 2016

I have tried to reproduce several of Caffe's example models, but with the convolutional architectures the accuracy in Keras consistently falls a few percent behind. For example, here is a slightly modified version of Caffe's cifar10_quick prototxt, with the lr_mult settings removed and a step learning-rate schedule:

https://gist.github.com/dnola/459e0ab043b22e8dd93234b26eb66e24

I use the CIFAR data that Caffe downloads for you, and I use Caffe's compute_image_mean script to generate a mean file - just like in their example.
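(For reference, the preprocessing this produces is just a per-pixel mean subtraction over the training set. A minimal numpy sketch of the equivalent step on the Keras side - variable names are mine, not from the gist:)

import numpy as np
from keras.datasets import cifar10

(train, label), (test, test_label) = cifar10.load_data()

# Caffe's compute_image_mean produces a per-pixel mean over the training
# images; subtracting it is the only preprocessing the example applies.
pixel_mean = train.astype('float32').mean(axis=0)
train = train.astype('float32') - pixel_mean
test = test.astype('float32') - pixel_mean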

And here is Keras code that, as near as I can tell, is identical:

https://gist.github.com/dnola/538aa2cba85a20287dd68d1a5333674f

The Caffe code consistently gets above a 76% accuracy (typically around 76.5%), while the Keras code consistently gets below a 76% accuracy. In fact, in about a dozen trials, the best Keras result I got fell short of the worst Caffe result I got.

It is admittedly a small difference, but it matters a lot when trying to reproduce or beat the state of the art. I have had the same issue reproducing Caffe's cifar10_full example and the network-in-network example code - every time falling a few percent short.

One thought is that Caffe rolls its softmax and loss into a single layer (http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1SoftmaxWithLossLayer.html#details), but I doubt numerical instability is the issue. Any ideas what might be going on? Alternatively, any Keras code that can reliably reproduce any of Caffe's CIFAR-10 examples would be appreciated, in case there is something I am missing.
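(For context, here is a rough TensorFlow sketch of the difference, with placeholder tensors of my own. The fused form computes the log-softmax via log-sum-exp instead of taking the log of an already-computed softmax:)

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 10])  # raw pre-softmax scores
labels = tf.placeholder(tf.float32, [None, 10])  # one-hot targets

# Two-step version (separate softmax layer, then categorical crossentropy);
# can underflow when a softmax probability rounds to zero:
probs = tf.nn.softmax(logits)
loss_two_step = -tf.reduce_sum(labels * tf.log(probs), 1)

# Fused version, like Caffe's SoftmaxWithLossLayer; numerically stabler:
loss_fused = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)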

I am on the latest development version of Keras, using the TensorFlow backend (version 0.11.0rc1), though I get the same sub-76% results with Theano.

Thanks!

dnola commented Nov 21, 2016

Here is an even simpler example: a dense feedforward network with only 2 hidden layers. No momentum, no weight decay, fixed learning rate. I also got rid of mean subtraction, so it runs out of the box with the CIFAR data that the Caffe examples download for you.

The Caffe code, which consistently does better than 20% accuracy:
https://gist.github.com/dnola/f98ce117eeb99b27133bc5a75a02f4f2

The Keras code, which rarely seems to break 15%, and often gets stuck at 10%:

import tensorflow as tf

sess = tf.Session()
# Work around a Keras/TF 0.11 incompatibility (this version combination
# fails to find control_flow_ops without the monkey-patch), then register
# the session with Keras:
tf.python.control_flow_ops = tf
from keras import backend as K

K.set_session(sess)

from keras.layers.core import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import SGD
from keras.initializations import normal
from keras.datasets import cifar10
from keras.utils.np_utils import to_categorical

from functools import partial

# Helper functions

# Matches Caffe's "gaussian" weight filler: zero-mean normal with the given std.
def gaussian(shape, name=None, scale=.01):
    return normal(shape, scale=scale, name=name)

# Network definition

nn = Sequential()
nn.add(Flatten(input_shape=(32, 32, 3)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.1)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.1)))
nn.add(Dense(10, activation='softmax', init=partial(gaussian, scale=.1)))
nn.compile(loss='categorical_crossentropy',
           optimizer=SGD(.00001),
           metrics=['accuracy'])

# Get and format data (raw 0-255 pixels, no scaling or mean subtraction, matching the Caffe setup):
(train, label), (test, test_label) = cifar10.load_data()
label, test_label = to_categorical(label), to_categorical(test_label)

# Evaluate network
nn.fit(train, label, validation_data=(test, test_label), nb_epoch=24, batch_size=100)

print("\n", "Final Accuracy:", nn.evaluate(test, test_label)[1])

Learning rate, connectivity, initialization, momentum, loss, batch size, training epochs: everything seems to be identical.

Has anyone had success reproducing any Caffe code in Keras 1-to-1? It seems that, given the exact same network definition, Caffe reliably does better than Keras in terms of accuracy. Is there something obvious I am missing?

yukoba commented Nov 21, 2016

By the way, you can do better with your hyperparameters.
This code reaches loss: 1.7683 - acc: 0.3838 - val_loss: 1.7583 - val_acc: 0.3826 at epoch 10 on Theano.

from keras.layers.core import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import Adam
from keras.initializations import normal
from keras.datasets import cifar10
from keras.utils.np_utils import to_categorical

from functools import partial

# Helper functions

def gaussian(shape, name=None, scale=.01):
    return normal(shape, scale=scale, name=name)

# Network definition

nn = Sequential()
nn.add(Flatten(input_shape=(3, 32, 32)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.0001)))
nn.add(Dense(128, activation='linear', init=partial(gaussian, scale=.0001)))
nn.add(Dense(10, activation='softmax', init=partial(gaussian, scale=.0001)))
nn.compile(loss='categorical_crossentropy',
           optimizer=Adam(3e-4),
           metrics=['accuracy'])

# Get and format data:
(train, label), (test, test_label) = cifar10.load_data()
label, test_label = to_categorical(label), to_categorical(test_label)

# Evaluate network
nn.fit(train, label, validation_data=(test, test_label), nb_epoch=10, batch_size=32)

print("\n", "Final Accuracy:", nn.evaluate(test, test_label)[1])

dnola commented Nov 21, 2016

I appreciate your feedback! Yes, there are absolutely ways I can tune the hyperparameters to do better - that is actually the nature of the project I am working on.

That said, my original problem still applies: when I implement those same hyperparameters in Caffe, it reliably gets 1-2% better accuracy. The underlying point is that, as far as I can tell, if you implement the same model in Keras and in Caffe, Caffe reliably does better. That is a real problem when trying to reproduce published results - one percent can make or break beating the state of the art on ImageNet, etc. Is there some other explanation for why Keras seems to underperform Caffe?

As a postscript, to illustrate what I mean, here are 5 trials each of Caffe, Keras with Theano (dev), and Keras with TensorFlow, using those hyperparameters:

ACCURACIES:
Tensorflow:
0.3411
0.3473
0.3269
0.3781
0.3676
Max: 0.3781
Mean: 0.3522

Theano:
0.3724
0.3511
0.381
0.3892
0.3173
Max: 0.3892
Mean: 0.3622

Caffe:
0.3874
0.3829
0.3789
0.3887
0.3913
Max: 0.3913
Mean: 0.38584

Note especially Caffe's consistency, in addition to its performance.

@HristoBuyukliev

I haven't used Caffe, but for Keras you have nb_epoch=24, and for Caffe you have max_iter=12000. Are those the same?

dnola commented Nov 21, 2016

Yup, 12000 batches * 100 samples per batch / 50000 samples in train = 24 epochs.

(You can also verify this by calling model.train_on_batch() 12000 times, moving across the dataset - note that you then have to handle callbacks manually. You end up with the same results.)
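(A minimal sketch of that verification loop, assuming nn, train, and label are defined as in the code I posted above:)

batch_size = 100
n = len(train)  # 50000 training samples
for i in range(12000):
    start = (i * batch_size) % n  # wraps cleanly: 12000 * 100 / 50000 = 24 passes
    nn.train_on_batch(train[start:start + batch_size], label[start:start + batch_size])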

yhenon commented Nov 21, 2016

I don't think your Caffe and Keras examples are equivalent. You have padding set to 2 on your convolutions in Caffe, but border_mode='valid' (the default) in Keras, followed by a zero-padding step. Setting border_mode='same' and getting rid of the zero padding gives me >76% accuracy in Keras consistently.

(Note that zero-padding after the conv isn't the same thing: with padding applied before the convolution, the kernel can convolve partially with the valid data at the borders, which your Keras version never does.)
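To make it concrete, here is a sketch of the change in Keras 1.x terms (reusing the 5x5, 32-filter layer from your gist; I haven't run this exact snippet):

from keras.models import Sequential
from keras.layers import Convolution2D, ZeroPadding2D

# What the gist does - NOT equivalent to Caffe's pad: 2. The convolution
# runs unpadded (shrinking the feature map), and zeros are bolted onto
# the output afterwards, so the kernel never overlaps the image border:
before_fix = Sequential()
before_fix.add(Convolution2D(32, 5, 5, border_mode='valid', input_shape=(32, 32, 3)))
before_fix.add(ZeroPadding2D((2, 2)))

# Equivalent to Caffe's pad: 2 with a 5x5 kernel - the input is zero-padded
# before the convolution, so the kernel convolves partially with real data
# at the borders and the output keeps the input's spatial size:
after_fix = Sequential()
after_fix.add(Convolution2D(32, 5, 5, border_mode='same', input_shape=(32, 32, 3)))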

dnola commented Nov 21, 2016

Thanks Yann! That is a huge help. It looks like, to match how Caffe handles it, zero padding should be done before the convolution is applied (using border_mode='same'), not after (using ZeroPadding2D). Good to know. So that solves the mystery of the first example, great!

Does anyone have any thoughts on the second? It is much simpler, and doesn't have any convolutional layers to misconfigure.

@divamgupta

I tried more complicated examples with larger models (e.g. VGG) and more images. The Keras model gets stuck at a constant loss from the first epoch, whereas the same model in Caffe does very well.

I was wondering why no one has tried training ImageNet on Keras.

dnola commented Dec 9, 2016

Yes, even with the CNN code I posted originally, tweaking hyperparameters easily causes the Caffe and Keras results to diverge again - even with border_mode='same'. I agree that there seems to be a lack of empirical evidence showing that Keras can effectively replicate many of the common models out there (without simply copying their weights). I love Keras, but I know that researchers, businesses, etc. would think twice if they could get better results implementing the exact same model in another framework.

It is possible there is still some flaw in my implementation, and if anyone notices one, please let me know. But I am starting to get the impression that there is a deeper issue here - it is just tricky to pin down. Is there something Keras does that is fundamentally different from Caffe? After running a battery of hyperparameter-search trials in both Keras and Caffe with the original model I posted, Keras seems to do best with less regularization than Caffe in most layers. One idea is that Keras, or the underlying frameworks, are less numerically optimized, which would add noise to the training process and thus implicit regularization - or potentially inhibit convergence (as you see in your VGG trials).

I know Caffe's backward pass is hand-coded, whereas Keras (via TensorFlow and Theano) and most other frameworks rely on automatic differentiation. Maybe we are missing some tricks from numerical analysis - does anyone have thoughts on that?

@stale stale bot added the stale label May 23, 2017

stale bot commented May 23, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.

@stale stale bot closed this as completed Jun 22, 2017

paras42 commented Oct 28, 2017

I've noticed the same thing. I've tried multiple datasets and always get better accuracy with Caffe. The training loss in Keras tends to go down more slowly and takes many more epochs than in Caffe. I've tried many hyperparameters and settings, but I think you are on to something here.

@qtianreal

I have tried training on ImageNet from scratch using TF Keras, but the accuracy is a few percentage points lower than with Caffe. I'm wondering if anyone has had a similar experience?

divamgupta commented Jul 20, 2019

@qtianmercury What model did you train, and what accuracies are you getting in Keras and Caffe?

qtianreal commented Jul 28, 2019

> @qtianmercury What model did you train, and what accuracies are you getting in Keras and Caffe?

Some of my own models. Also, in the past week I tried ResNet50 on ImageNet. The accuracy I got using TF is approximately 75%, but with Keras (resnet50 directly from keras-applications), an SGD optimizer, and exponential decay, the best accuracy I got is only 65% after 90 epochs. I did not try training it in Caffe myself (the original ResNet experiments were done in Caffe). My starting learning rate is 0.18 on 4 GPUs (with a batch size of 128 * 4). Maybe the learning rate and batch size are too high?
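(For what it's worth, one common heuristic here is the linear scaling rule from Goyal et al.'s "Accurate, Large Minibatch SGD" paper: lr = 0.1 * total_batch_size / 256 for ResNet50 with SGD, combined with a warmup over the first few epochs. A quick arithmetic check, not something verified in this thread:)

# Linear scaling rule (Goyal et al. 2017): base lr 0.1 at batch size 256.
base_lr, base_batch = 0.1, 256
total_batch = 128 * 4  # 4 GPUs x 128 per GPU
scaled_lr = base_lr * total_batch / base_batch
print(scaled_lr)  # 0.2, so a start of 0.18 is in the expected range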
