
Saved model behaves differently on different machines #7676

Closed
basaldella opened this issue Aug 17, 2017 · 29 comments

@basaldella

basaldella commented Aug 17, 2017

After studying #439, #2228, #2743, #6737 and the new FAQ about reproducibility, I was able to get consistent, reproducible results on my development machines using Theano. If I run my code twice, I get the exact same results.

The problem is that the results are reproducible only on the same machine. In other words, if I

  • Train a model on machine A
  • Evaluate the model using predict
  • Save the model (using save_model, or model_to_json and save_weights)
  • Transfer the model to machine B and load it
  • Evaluate the model again on machine B using predict

The results of the two predicts are different. Using CPU or GPU makes no difference - after I copy the model file(s) from one machine to another, the performance of predict changes dramatically.
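For concreteness, here is a minimal, self-contained sketch of the round trip described above (Keras 2.x; the toy model and file names are illustrative, not the actual project code):

```python
import numpy as np
from keras.layers import Dense
from keras.models import Sequential, load_model

np.random.seed(0)
x_eval = np.random.rand(4, 10)

# --- machine A: build, save, predict ---
model = Sequential([Dense(8, activation="relu", input_shape=(10,)),
                    Dense(3, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.save("model.h5")              # architecture + weights + optimizer state
preds_a = model.predict(x_eval)

# --- machine B: load the transferred file and predict on the same inputs ---
model_b = load_model("model.h5")
preds_b = model_b.predict(x_eval)

# Expected to be ~0; the reports in this thread see large differences
# when the load happens on a different machine.
print(np.abs(preds_a - preds_b).max())
```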

The only differences between the two machines are the hardware (I use my laptop's 980M and a workstation with a Titan X Pascal) and the NVIDIA driver version, which is slightly older on the workstation. Both computers run Ubuntu 16.04 LTS and CUDA 8 with cuDNN. All libraries are on the same versions on both machines, and the Python version is the same as well (3.6.1).

Is this behavior intended? I expect that running a pre-trained model with the same architecture and weights on two different machines yields the same results, but this doesn't seem to be the case.

On a side note, a suggestion: the FAQ about reproducibility should explicitly state that the development version of Theano is needed to get reproducible results.
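For readers landing here, the usual same-machine reproducibility recipe referenced above boils down to seeding every source of randomness up front; a condensed sketch (seed values are arbitrary, and the backend-specific call depends on whether Theano or TensorFlow is used):

```python
# Condensed same-machine reproducibility sketch: seed every source of
# randomness before building the model. This makes repeated runs match on
# one machine; it does not address the cross-machine differences reported
# in this issue.
import os
os.environ["PYTHONHASHSEED"] = "0"   # set before any hashing-dependent code runs

import random
import numpy as np

random.seed(42)
np.random.seed(42)
# plus the backend-specific seed, e.g. tf.set_random_seed(42) on the
# TensorFlow backend (TF 1.x API, matching the era of this thread)
```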

@RunshengSong

RunshengSong commented Aug 30, 2017

Same here with the TensorFlow backend. I trained my models on my local machine (Ubuntu 16.04 LTS). When I tested my model on an AWS EC2 instance, I got different prediction numbers.

Have you solved this?

@basaldella
Author

No, I'm still experiencing this problem.

You can check if any of the solutions posted in #4875 work for you. What versions of the libraries are you running?

@RunshengSong

RunshengSong commented Aug 31, 2017

I am running Keras 2.0.6.

In my experience the package versions are not the problem (at least for now). I did some experiments today and found that if I remove PCA from my modeling pipeline, everything works fine. Did you use sklearn's PCA to reduce the dimensionality of your inputs? If so, you might want to try removing it. That solved my problem for now.

I don't know why this happens. This post says that non-determinism in the weights could cause problems, but that doesn't explain why the neural network behaves consistently on the same machine. Inspired by that post, I guess the differing dimensionality-reduction results lead to this problem.

@rsmith49

Not sure if this will help at all, but I was dealing with this for a day and a half before I realized it was a difference in how the machines handled hashing words before I passed them to my Embedding layer.
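To illustrate the hashing pitfall described here: Python's built-in hash() for strings is randomized per process unless PYTHONHASHSEED is fixed, so any word-to-index mapping derived from it is not stable across sessions or machines. A toy example (the vocabulary and modulus are made up):

```python
# Run this twice (or on two machines) without PYTHONHASHSEED set and the
# indices will generally differ; run with PYTHONHASHSEED=0 and they match.
vocab = ["saved", "model", "behaves", "differently"]
word_to_index = {word: hash(word) % 1000 for word in vocab}
print(word_to_index)
```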

@basaldella
Author

@RunshengSong, I don't use sklearn. @rsmith49, are you talking about setting PYTHONHASHSEED?

@halhenke

halhenke commented Sep 1, 2017

I found the same thing as @rsmith49 - I had a word2vec model that would act as if the weights had been completely re-initialized when I loaded them from disk in a new session. After also saving/pickling the dicts that mapped words to ints, and reloading them from disk when I started a new training session, the model behaved as expected.
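A minimal sketch of what is described here: persist the word-to-int mapping next to the model so a new session reuses exactly the same indices instead of rebuilding them (file and variable names are illustrative):

```python
import pickle

# built during training (illustrative contents)
word_to_int = {"the": 1, "model": 2, "weights": 3}

# save alongside the model weights
with open("word_to_int.pkl", "wb") as f:
    pickle.dump(word_to_int, f)

# in a new session, load the mapping instead of rebuilding it
with open("word_to_int.pkl", "rb") as f:
    word_to_int = pickle.load(f)
```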

@rsmith49

rsmith49 commented Sep 1, 2017

@basaldella Yes, turns out my issue was more along the lines of #4875, and was inconsistent between different Python sessions, not just different machines.

@basaldella
Author

@halhenke I'm also using a word2vec model, but I use GloVe's pre-trained weights following this tutorial, so I guess that shouldn't be the issue. Are you using pre-trained weights as well?

@wangchenouc

@basaldella Have you fixed this issue?
It seems that I have the same problem. I retrained a model by fine-tuning InceptionV3 on my own images on a GPU machine. After training, the accuracy got up to 91%, which I am happy with. During training the improved model was saved with callbacks, so I can load the best retrained model with model.load_model(model_path), and I tested it with one image. The prediction results are always the same and correct (because I know which class this image belongs to).
The result is like this: [[ 0.00197385 0.01141251 0.02262068 0.9121536 0.00810914 0.01657074
0.00370198 0.00617629 0.00972648 0.00531203 0.00224261]]

Now, when I copy the retrained model (HDF5 file) to my laptop, load the model again, and test it with the same image, I get a totally different result:
[[ 0.00373867 0.22160383 0.10066977 0.35440436 0.02839879 0.17799987
0.01744748 0.02645957 0.0299265 0.03026218 0.00908909]]

The Python environments are the same on the two machines, with Keras 2.0.8:
The results are always the same on the same machine.
The weights are the same after I load the model file.
......I checked many things.

Why are the results different on the two machines? Does anybody know about this?

@basaldella
Author

@wangchenouc no, I was absolutely not able to fix this issue. If you have any news, please tag me in your issue as well. I'm actually thinking of switching to a lower-level framework because I'm not able to solve this problem.

@wangchenouc

@basaldella Please look at this #8149

@basaldella
Author

@wangchenouc thanks, but this does not solve my issue. In my case the versions of Keras are the same on both machines.

@wangchenouc

@basaldella Just comparing the Keras version is not enough; you may need to compare every function you use. Try a very simple case like I did, so it's easy to compare your differences step by step. I spent 3 whole days debugging the code step by step and finally solved my problem.

Good luck to you!

@basaldella
Author

@wangchenouc I know. But I'm cloning the same repo on 2 machines and installing the same Python and library versions with a script, and I still have no luck getting the same results.

Thanks for encouraging me though :)

@philiprekers

Any news on this issue? I'm running into the same problem.
Two instances - identical, because it's the same hardware setup and the second was installed from an image of the first.
When I model.save() my model on the first instance and load_model() on the second, the results seem to be random when evaluating on the second instance. Accuracy also drops to unreasonable values (from .97 to .52).

What are the possible causes other than differences in code/setup/hardware? I've been searching for solutions for the last 3 days and nothing seems to work.

@dterg

dterg commented Jan 30, 2018

I've looked at the several potential solutions reported here and in related threads, but no luck either, @Philipduerholt. In my case the last layer is a softmax, and when I predict on the same training data (not even test data), I get equal probabilities across my classes, i.e. the model is completely random.

@philiprekers

philiprekers commented Jan 31, 2018

It worked, finally. And in hindsight it looks simple. I'll try to include all relevant points:
I'll be talking about training instance and production instance.
I use TensorFlow backend.
Python version 3.5.4.
Keras version 2.0.5.

  • I pickle everything I use as input or as ID-map (like word_id_dictionaries).
  • e.g. at the end I have word2index.dict, label2index.dict (most people would use .pkl).
  • For evaluation I also pickle X_test and y_test.
  • Build and train your model in training instance.
  • I use a Sequential model with an Embedding layer (some had problems with that).
  • I use ModelCheckpoint() and save it as a list (callbacks_list), file names end on .hdf5.
  • model.fit has callbacks = callbacks_list.
  • After training, I choose the most promising saved model.
  • I can load_model('models/most_promising.hdf5') in the same instance and evaluate.
  • This works as expected.
  • I transfer the .hdf5 file and all pickled files to production instance.
  • In production I make sure all package versions are equal to training instance.
  • Best to use something like conda env.
  • I import: from keras import backend as K
  • Immediately after the imports I set the learning phase: K.set_learning_phase(0)
  • I initialize/load all the things:
    • model = load_model(model_path)
    • with open(word2index_path, "rb") as f:
      word2index = pickle.load(f)
    • etc.
  • Evaluation works as expected.
  • Predict works as expected.

I hope it helps.
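A minimal sketch of the production-side loading flow from the steps above (the paths and pickle names follow the example and are illustrative):

```python
import pickle

from keras import backend as K
from keras.models import load_model

# set the learning phase to "test" immediately after the imports,
# before the model is loaded
K.set_learning_phase(0)

model = load_model("models/most_promising.hdf5")

with open("word2index.dict", "rb") as f:
    word2index = pickle.load(f)
with open("label2index.dict", "rb") as f:
    label2index = pickle.load(f)

# evaluation / prediction then behave as in the training instance, e.g.:
# scores = model.evaluate(X_test, y_test)
# preds = model.predict(X_new)
```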

@ghost

ghost commented Mar 21, 2018

I had the same problem...
upgrading Keras on both machines to version 2.1.5 solved the problem for me.

@han963xiao


I had the same problem...
upgrading Keras on both machines to version 2.1.5 solved problem for me.
Amazing! The solution is that each machine should have the same Keras version. The same inputs on different versions can produce different outputs...
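Since mismatched versions keep coming up in this thread, a quick sanity check worth running on both machines before comparing predictions (prints the interpreter, Keras, NumPy, and backend versions):

```python
import sys

import keras
import numpy as np

print("python :", sys.version)
print("keras  :", keras.__version__)
print("numpy  :", np.__version__)
print("backend:", keras.backend.backend())
```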

@xiaoleitw

running into the same problem and looking for a solution

@xiaoleitw

(quoting @Philipduerholt's solution steps above)

this is not working for me :(

@shrutimittal90

tf.set_random_seed(0) worked for me

@jewelcai

tf.set_random_seed(0) worked for me

where should this line be placed? Before sess = tf.Session(config=config)?
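For reference, with the TF 1.x API discussed here the graph-level seed is typically set right after importing TensorFlow, before any ops (or the session) are created; a minimal sketch (the config contents are placeholders):

```python
import numpy as np
import tensorflow as tf

# seed both NumPy and the TF graph before building any ops
np.random.seed(0)
tf.set_random_seed(0)

# ... build the graph / Keras model here ...

config = tf.ConfigProto()
sess = tf.Session(config=config)
```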

@urmilanayak

I am facing the same problem in Golang; the following is my approach:

  1. Train a model on Ubuntu 18.04 (using Python, Tensorflow and Keras)
  2. Optimized, froze, and saved the model to be used with the Tensorflow Go API
  3. LoadSavedModel on Ubuntu 18.04 using Tensorflow Go API
  4. LoadSavedModel on Raspberry Pi 4 using Tensorflow Go API

The weights for all layers are different when loaded on Ubuntu (step 3) and on the Raspberry Pi (step 4), which is causing the different softmax predictions.

Sample weights in the different environments:
These are just sample weights; however, all the weights in all layers are different.
Tensorflow API versions used to load the model: Go Tensorflow (r2.0), Tensorflow C (r2.0), Golang (1.13.6)

Loaded weights on Ubuntu:
[0.5031438 -0.062892914 -0.10482144 -0.04192853 0.7127869 0.46121502 -0.3983221 ....]

Loaded weights on the Raspberry Pi for the same layer and same filter as above:
[0.49415612 -0.07188058 -0.11380911 -0.050916195 0.70379925 0.45222735 -0.40730977 ....]

How to solve this?

@alankongfq

I am facing the same issue, but in a somewhat weird fashion.
Long story short: I had set up 3 pipelines.

  1. Training pipeline -- train on Azure ML with TF 2.0 - NC6s V2 (cloud VM) --- training OK
  2. Testing pipeline -- testing on a local machine with TF 2.3, RTX 2070 --- prediction results OK
  3. Deployment pipeline -- NC6s V2 for inference with TF 2.3 (cloud VM) --- erratic behavior of model.predict

For pipelines 2 and 3 the environment is kept the same, with the same code. The only difference is the hardware and GPU.
What baffles me is that the prediction results on the local machine were as expected, but when deployed on the cloud VM it sometimes works and sometimes doesn't. What is even weirder is that if I run inference on a few images in sequence -- say [image1, image2, image3] -- image1 and image3 predict fine, but image2 does not get a complete prediction: most of the prediction works except for the last few tiles of the image.

I am at a loss here because I don't know where to start debugging, and I can't just spin up my VM to test as it costs money. I am not sure if it is related to some memory issue, weight initialization, etc. Does anyone have any pointers?

@tu-curious

tu-curious commented Oct 23, 2020

@alankongfq: Not even going that far, I found out after a whole day of debugging that my TF 2.1 model gives different predictions when run on CPU vs GPU, keeping EVERYTHING else the same (same machine, same OS, fixed saved weights, no randomness anywhere). I knew there are precision differences between the two devices; I just didn't realize they could be so significant. I think it also depends on the particular NN architecture: with lots of parameters, a little error in each parameter can accumulate into a BIG error in the final predictions. The first point of the first answer to this SO question makes the same argument: https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device
That answer also links some closed TensorFlow GitHub issues which conclude that this is expected behavior and not a bug. Hope this helps.
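A small TF 2.x sketch of the kind of check described above: run the same saved weights on CPU and GPU and compare the outputs ("model.h5" and the input shape are placeholders; a visible GPU is required):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")
x = np.random.rand(8, 224, 224, 3).astype("float32")   # shape depends on the model

with tf.device("/CPU:0"):
    cpu_pred = model(x, training=False).numpy()
with tf.device("/GPU:0"):
    gpu_pred = model(x, training=False).numpy()

# Per-parameter float differences between devices can accumulate into
# visible prediction differences in deep models.
print("max abs diff:", np.abs(cpu_pred - gpu_pred).max())
```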

@alankongfq

Hi @tu-curious, thanks for the pointers, I will take a closer look at this when I have the time.

@Dave-Vedant

I am experiencing the same problem. I trained a simple 4-dense-layer neural network on an Ubuntu 20.04 system and it gives me a max accuracy of 94.05%, while the same model on Google Colab gives me an accuracy of 99.96%. I am wondering what the reason for that is. I also trained them multiple times; on each machine the accuracy is constant across runs (within ±0.5%), but between the two machines there is a large difference of 4.0%. Why???

@hanzhuangsyr

I just fixed this problem. I thought something was wrong with Keras or TensorFlow, but it turns out there was a bug in my own code. The bug didn't show up on my Windows computer, but it did on my Linux computer. Wasted a lot of time.
