Saved model behaves differently on different machines #7676
Same here with the TensorFlow backend. I trained my models on my local machine (Ubuntu 16.04 LTS). When I tested my model on an AWS EC2 instance I got different prediction numbers. Have you solved this?
No, I'm still experiencing this problem. You can check if any of the solutions posted in #4875 work for you. What versions of the libraries are you running?
I am running Keras 2.0.6. From my experience the package versions are not the problem (at least for now). I did some experiments today and found that if I remove PCA from my modeling pipeline, everything works fine. Did you use sklearn's PCA to reduce the dimension of your inputs? If so, you might want to try removing it; it solved my problem for now. I don't know why this happens. This post says that non-determinism in the weights could cause problems, but that doesn't explain why the neural network behaves the same on the same machine. Inspired by that post, I guess differing results of the dimensionality reduction lead to this problem.
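For what it's worth, one concrete way PCA can differ across machines is the sign ambiguity of SVD: the singular vectors are only defined up to a sign, and different BLAS/LAPACK builds can legitimately pick different signs, flipping the projected features. A minimal numpy-only sketch (not sklearn's actual implementation; the sign-fixing convention here is my own illustration of how to make the output deterministic):

```python
import numpy as np

def pca_components(X, n_components):
    """Minimal PCA via SVD, with a sign convention applied.

    U and Vt from an SVD are only defined up to a per-component sign, and
    different LAPACK builds may choose different signs. Fixing a convention
    (largest-magnitude entry of each component made positive) removes that
    machine-dependent ambiguity.
    """
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:n_components]
    # Sign of the largest-magnitude entry in each component row:
    signs = np.sign(comps[np.arange(len(comps)), np.abs(comps).argmax(axis=1)])
    return comps * signs[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
C1 = pca_components(X, 2)
C2 = pca_components(X, 2)
```

With the convention applied, refitting on the same data yields identical components; without it, two machines can return components that differ by a sign even when both are "correct".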
Not sure if this will help at all, but I was dealing with this for a day and a half before I realized it was a difference in how the machines handled hashing words before I passed them to my …
@RunshengSong, I don't use sklearn. @rsmith49, are you talking about setting …
I found the same thing as @rsmith49: I had a word2vec model that would act as if the weights had been completely re-initialized when I loaded them from disk in a new session. After also saving/pickling the dicts that mapped words to ints, and reloading them from disk when I started a new training session, the model behaved as expected.
@basaldella Yes, turns out my issue was more along the lines of #4875, and was inconsistent between different Python sessions, not just different machines.
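If the hashing in question involves Python's built-in hash(), that alone explains cross-session differences: since Python 3.3, str hashes are salted per process (controlled by PYTHONHASHSEED), so any word-to-index mapping derived from hash() changes between runs and between machines. A sketch of a stable alternative (the helper name and bucket scheme are my own illustration):

```python
import hashlib

def stable_index(word, vocab_size):
    # An md5 digest of the UTF-8 bytes is identical on every machine and
    # in every session, unlike hash(), which is salted per process.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % vocab_size

idx = stable_index("hello", 10_000)  # same bucket on any machine
```

Keras's text hashing utility (hashing_trick) accepts hash_function='md5' for exactly this reason; the default uses hash() and is therefore not stable across sessions.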
@halhenke I'm also using a word2vec model, but I use GloVe's pre-trained weights following this tutorial, so I guess that shouldn't be the issue. Are you using pre-trained weights as well?
@basaldella Have you fixed this issue? I copied the retrained model (an HDF5 file) to my laptop, loaded the model again, and tested it with the same image, and I got a totally different result. The Python environments are the same on the two machines, with Keras 2.0.8. Why are the results different on the two machines? Does anybody know about this?
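Before blaming Keras, it's worth ruling out a corrupted or incomplete transfer: checksumming the HDF5 file on both machines takes a minute. A small stdlib-only sketch (the helper name is mine; the temp file below just stands in for the real model.h5):

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    # Stream the file in chunks so large weight files need not fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

# Demo with a throwaway file standing in for model.h5:
with tempfile.NamedTemporaryFile(suffix=".h5", delete=False) as f:
    f.write(b"pretend these are HDF5 bytes")
    path = f.name
digest = file_sha256(path)
os.unlink(path)
```

If the digests differ between the two machines, the problem is the copy, not the framework; if they match, the divergence happens at load or inference time.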
@wangchenouc No, I was absolutely not able to fix this issue. If you have any news, please tag me in your issue as well. I'm actually thinking of switching to a lower-level framework because I'm not able to solve this problem.
@basaldella Please look at #8149
@wangchenouc Thanks, but this does not solve my issue. In my case the versions of Keras are the same on both machines.
@basaldella Just comparing the Keras versions is not enough; you may need to compare every function you used. Try a very simple case like I did, so it's easy to compare the differences step by step. I spent 3 whole days debugging the code step by step and finally solved my problem. Good luck!
@wangchenouc I know. But I'm cloning the same repo on 2 machines, installing the same Python and library versions with a script, and still I have no luck getting the same results. Thanks for encouraging me though :)
Any news on this issue? I'm running into the same problem. What are the possible causes other than differences in code/setup/hardware? I've been searching for solutions for the last 3 days and nothing seems to work.
I've looked at the potential solutions reported here and in related threads, but no luck either, @Philipduerholt. In my case my last layer is a softmax, and when I predict on the same training data (not even test data), I get equal probabilities across my classes, i.e. the model is completely random.
It worked, finally. And in hindsight it looks simple. I'll try to include all relevant points:
I hope it helps.
I had the same problem...
Running into the same problem and looking for a solution.
This is not working for me :(
tf.set_random_seed(0) worked for me |
Where should this line be placed? Before sess = tf.Session(config=config)?
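For TF1-style code, yes: the graph-level seed must be set before the graph is built, and therefore before sess = tf.Session(config=config). The Keras reproducibility FAQ pairs it with the other seed sources; a sketch of the ordering (the TensorFlow lines are left as comments so the snippet runs without TF installed):

```python
import os
import random

import numpy as np

# 1) PYTHONHASHSEED must be set in the environment BEFORE the Python
#    process starts; assigning it here only documents the intent.
os.environ["PYTHONHASHSEED"] = "0"

# 2) Seed Python and numpy before any model code runs.
random.seed(0)
np.random.seed(0)

# 3) Then the TF graph seed, then the session (TF1 API):
# import tensorflow as tf
# tf.set_random_seed(0)
# sess = tf.Session(config=config)

# With seeds fixed, repeated runs draw identical numbers:
a = np.random.rand(3)
np.random.seed(0)
b = np.random.rand(3)
```

Note this makes runs reproducible on one machine; it does not by itself guarantee identical results across different hardware or cuDNN versions.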
I am facing the same problem in Golang; my approach is as follows:
The weights for all layers are different when loaded on Ubuntu (step 3) and on the Raspberry Pi (step 4), which causes the different softmax predictions. The sample weights in the two environments (the weights loaded in Ubuntu, and the weights loaded on the Raspberry Pi for the same layer and same filter as above) differ. How can I solve this?
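One way to compare what the two environments actually loaded, without eyeballing printed floats, is to fingerprint the weight tensors. A hedged numpy sketch (weights_fingerprint is my own helper; with Keras you would feed it model.get_weights(), and the same digest logic could be ported to Go):

```python
import hashlib

import numpy as np

def weights_fingerprint(weights):
    # Normalize dtype and byte order so the digest changes only when the
    # numbers themselves change, not the platform's native layout.
    h = hashlib.sha256()
    for w in weights:
        h.update(np.ascontiguousarray(w, dtype="<f4").tobytes())
    return h.hexdigest()[:16]

weights = [np.ones((2, 2), dtype=np.float64), np.arange(4, dtype=np.float32)]
fp_a = weights_fingerprint(weights)
fp_b = weights_fingerprint([w.copy() for w in weights])
```

If the fingerprints differ between Ubuntu and the Raspberry Pi immediately after loading, the divergence happens at load time, before any inference runs; if they match, look at the inference kernels instead.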
I am facing the same issue, but in a bit of a weird fashion.
For pipelines 2 and 3 the environment is kept the same, with the same code; the only difference is the hardware and GPU. I am at a loss here because I don't know where to start debugging, and I can't just spin up my VM to test, as it costs money. I am not sure if it is related to some memory issue, weight initialization, etc. Does anyone have any pointers?
@alankongfq: Not even going that far, I found out after a whole day of debugging that my TF 2.1 model gives different predictions when run on CPU vs. GPU, keeping EVERYTHING else the same (same machine, same OS, fixed saved weights, no randomness anywhere). I knew there are precision differences between the two devices; I didn't realize they could be so significant. I think it has to do with particular NN architectures as well: with lots of parameters, sometimes a little error in each parameter accumulates into a BIG error in the final predictions. The first point of the first answer to this SO question makes the same point: https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device
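The accumulation effect is easy to demonstrate: floating-point addition is not associative, and CPU and GPU kernels reduce sums in different orders. A tiny sketch:

```python
# Floating-point addition is not associative, so summing the same numbers
# in a different order (as a GPU reduction does vs. a sequential CPU loop)
# can change the result at the last bit.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: 0.6000000000000001 vs 0.6
```

A deep network chains millions of such sums, so per-op discrepancies on the order of one ulp can compound into visibly different predictions, especially near a decision boundary.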
Hi @tu-curious, thanks for the pointers, I will take a look at this closely when I have the time.
I am experiencing the same problem. I trained a simple 4-dense-layer neural network on Ubuntu 20.04 and it gives me a max accuracy of 94.05%, while the same model on Google Colab gives an accuracy of 99.96%. I am wondering what the reason for that is. I also trained them multiple times: on each machine the accuracy is constant across runs (within ±0.5%), but between machines there is a large difference of 4.0%. Why???
I just fixed this problem. I thought something was wrong with Keras or TensorFlow, but it turned out there was a bug in my code. The bug didn't show up on my Windows computer, but it did on my Linux computer. I wasted a lot of time.
After studying #439, #2228, #2743, #6737 and the new FAQ about reproducibility, I was able to get consistent, reproducible results on my development machine using Theano. If I run my code twice, I get the exact same results.
The problem is that the results are reproducible only on the same machine. In other words, if I train a model, predict, save it (with save_model, or with model_to_json and save_weights), load it on another machine, and predict again, the results of the two predicts are different. Using CPU or GPU makes no difference: after I copy the model file(s) from one machine to another, the performance of predict changes dramatically.
The only difference between the two machines is the hardware (I use my laptop's 980M and a workstation with a Titan X Pascal) and the NVIDIA driver version, which is slightly older on the workstation. Both computers run Ubuntu 16.04 LTS and CUDA 8 with cuDNN. All libraries are on the same versions on both machines, and the Python version is the same as well (3.6.1).
Is this behavior intended? I expect that running a pre-trained model with the same architecture and weights on two different machines yields the same results, but this doesn't seem to be the case.
On a side note, a suggestion: the FAQ about reproducibility should explicitly state that the development version of Theano is needed to get reproducible results.