Can't transfer trained model between computers. Totally different results almost identical systems. #8149
Is the code really identical on both systems? The reason I am asking is that you have sigmoid as the final layer, although you want a softmax for classification tasks (or just 1 unit with sigmoid activation). I am just wondering if that could have an effect.
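The softmax-vs-sigmoid point above can be illustrated numerically. A minimal sketch in plain Python (the logits are made up for illustration): per-class sigmoids are independent and don't sum to 1, while softmax produces a proper probability distribution over the classes.

```python
import math

def softmax(zs):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.0, 0.1]           # hypothetical scores for 3 classes
probs = softmax(logits)             # sums to 1: a distribution over classes
indep = [sigmoid(z) for z in logits]  # sums to > 1: independent per-class scores

print(round(sum(probs), 6))  # 1.0
print(sum(indep) > 1.0)      # True
```

This is why a softmax output (or a single sigmoid unit for binary problems) is the usual choice for mutually exclusive classes.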
The code's the same; I just copied it from one over to the other. The second suggestion is really good though — I had trouble finding examples of loading models, and your suggestion makes a lot of sense. I'll check it and report back.
@bmullen-steelcase Have you fixed this problem? I am facing the same problem: a model saved on the GPU machine does not work correctly on another machine without a GPU.
@wangchenouc I was able to come up with sort of a workaround. My model was using the Adam optimizer. What I found was that, for some reason, by changing the optimizer to another one in the same family (Adadelta, I believe), I was actually able to improve on the training on the new system. Previously the accuracy would blow up.
@bmullen-steelcase Does the saved model (after your training) give the same results with the same data (the same image, for example) on a different machine? My training is good enough for me now; my problem is that it does not work well on the other machine. To describe this in more detail: on machine A (with a powerful GPU):
On machine B (without GPU): The two machines have exactly the same Python environment with Keras 2.0.8.
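The comparison described above can be made concrete by saving the prediction array for a fixed input on machine A and comparing it on machine B. A sketch with hypothetical file names (here the two arrays are simulated; in practice each would come from one machine):

```python
import numpy as np

# On machine A: preds_a = model.predict(x); np.save("preds_a.npy", preds_a)
# On machine B: preds_b = model.predict(x) for the same input, then compare
# against the loaded file. Simulated here with fixed arrays:
preds_a = np.array([[0.10, 0.70, 0.20]])
preds_b = np.array([[0.10, 0.70, 0.20]])

# Tiny float differences across hardware are normal; large differences point
# to a preprocessing or library-version mismatch rather than the saved weights.
same = np.allclose(preds_a, preds_b, atol=1e-5)
print(same)  # True
```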
@wangchenouc I didn't have good results after the move either. I ended up running a few iterations of training on the CPU system with the Adadelta optimizer. It quickly got back to the same level of accuracy (in my case).
@bmullen-steelcase Do you mean that you gave up training your model on the GPU machine? Does the model you trained on the CPU machine give the same results on your GPU machine? I just don't understand why it is a problem to copy the saved model from one machine to another. This is very important to me, as the model trained on my GPU system will be used by somebody else. @fchollet Could you give some advice about this?
@wangchenouc No, I still train on the GPU machine. When I move the model to the CPU machine, it doesn't come over at 100%. Also, the optimizer gets dumped. If I need to do additional training on the CPU system, I compile with a new optimizer and run 5-10 iterations on the CPU system to set the parameters for the new optimizer. It seems to pick up training where it left off.
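The workaround described in the comments above can be sketched roughly as follows. The model architecture, file path, and data are illustrative, not from the thread, and this assumes a `tensorflow.keras` install: load the trained model on the new machine, compile it with a fresh optimizer (Adadelta here), and run a few short iterations to rebuild the optimizer state.

```python
import os
import tempfile
import numpy as np
from tensorflow import keras

# Stand-in for the model trained on the GPU machine (illustrative only).
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
path = os.path.join(tempfile.mkdtemp(), "model.h5")
model.save(path)

# "On the CPU machine": load the weights, attach a new optimizer, and
# fine-tune briefly so the fresh optimizer accumulates its own state.
restored = keras.models.load_model(path)
restored.compile(optimizer="adadelta", loss="categorical_crossentropy",
                 metrics=["accuracy"])

x = np.random.rand(16, 4).astype("float32")
y = keras.utils.to_categorical(np.random.randint(0, 3, 16), 3)
restored.fit(x, y, epochs=2, verbose=0)  # a few iterations, as described above
```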
@bmullen-steelcase @fchollet Does this confirm that model.save() and load_model() don't work the same on different machines with the saved model? Continuing training on the new machine with the saved model is not a good approach in practice, but it seems that this is all we can do so far...
@bmullen-steelcase @fchollet My problem is solved: I found that keras.preprocessing.image.load_img() behaves differently on my two machines. The GPU machine has interpolation='bilinear', while my CPU machine lost this because of a different Keras version that I had installed. So there is no problem with model.save() and load_model(); some preprocessing functions may be slightly different on different machines, which affects the results. @bmullen-steelcase I suggest you check the Keras version too — I compared their code and found my problem.
I know a lot of time has passed since this question was asked.
I have two systems, one has GPU support which I use for training models, and the other doesn't. Otherwise, they have the exact same versions of Keras and Tensorflow.
I trained a model on the GPU system, copied it to the CPU system, and got unusable results — accuracy is terrible. I then tried setting the model configuration to deterministic, retraining on the GPU system, and moving to the CPU, using the following:
THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python3 training.py
Still getting unusable results, I tried one more thing. I tried training the model from scratch on the CPU system, using the parameters found from trial and error on the GPU system. Here's the model design:
Running the exact same code on both systems, side by side, the GPU system gets good accuracy, the CPU system will not converge at all.
What can I do to get this into production?
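One sanity check worth running before blaming the framework (not from the original posts, but implied by the version-mismatch finding above): checksum the saved model file on both machines to rule out a corrupted copy. A small helper, with a hypothetical file name:

```python
import hashlib

def file_md5(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large .h5 model files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# Run on both machines against the copied file, e.g.:
#   print(file_md5("model.h5"))
# Matching digests mean the bytes transferred intact, so any remaining
# difference comes from library versions or preprocessing, not the file.
```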