Can't transfer trained model between computers: totally different results on almost identical systems #8149

Closed
bmullen-steelcase opened this issue Oct 16, 2017 · 11 comments

Comments

@bmullen-steelcase

I have two systems: one has GPU support, which I use for training models, and the other doesn't. Otherwise, they have the exact same versions of Keras and TensorFlow.

I trained a model on the GPU system, copied it to the CPU system, and got unusable results; accuracy is terrible. I then tried setting the model configuration to deterministic, retraining on the GPU system, and moving the model to the CPU, using the following:

THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python3 training.py

Still getting unusable results, I tried one more thing: training the model from scratch on the CPU system, using the parameters found by trial and error on the GPU system. Here's the model design:

    # Fully connected classifier: 80 input features, two-unit sigmoid output.
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout

    model = Sequential()

    model.add(Dense(units=1024, input_dim=80))
    model.add(Activation('relu'))
    model.add(Dropout(0.4))

    model.add(Dense(units=1024))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))

    model.add(Dense(units=2))
    model.add(Activation('sigmoid'))

Running the exact same code on both systems, side by side, the GPU system gets good accuracy while the CPU system will not converge at all.

What can I do to get this into production?
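
For reference, a minimal sketch of the kind of save/load round trip involved, assuming model.save() on one side and load_model() on the other (the file name and test batch are placeholders):

    from keras.models import load_model

    # On the GPU system, after training: save the architecture, weights and optimizer state.
    model.save('trained_model.h5')  # placeholder file name

    # On the CPU system: restore the model and run inference.
    model = load_model('trained_model.h5')
    preds = model.predict(x_test)  # x_test: a preprocessed input batch (placeholder)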

@tRosenflanz

Is the code really identical on both systems? The reason I am asking is that you have sigmoid as the final layer, although you would want a softmax for a classification task (or just one unit with sigmoid activation). I am just wondering if that could have an effect.
Also, please make sure that the weights of the layers actually change after you load a trained model. The proper way is to compile the model first and then load the weights; frequently people compile the model after loading the weights, and that reinitializes them.
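
A minimal sketch of the ordering suggested above, assuming the weights were saved separately with model.save_weights() (the architecture and file name below are placeholders):

    from keras.models import Sequential
    from keras.layers import Dense

    # Rebuild the exact same architecture first (placeholder layers shown here).
    model = Sequential()
    model.add(Dense(1024, activation='relu', input_dim=80))
    model.add(Dense(2, activation='softmax'))

    # Compile first...
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # ...then load the trained weights, so nothing reinitializes them afterwards.
    before = model.get_weights()[0].copy()
    model.load_weights('weights.h5')  # placeholder file name
    after = model.get_weights()[0]

    # Sanity check from the comment above: the loaded weights should actually differ
    # from the fresh initialization.
    print('weights changed:', not (before == after).all())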

@bmullen-steelcase
Author

The code's the same; I just copied it from one system over to the other.

The second suggestion is really good, though. I had trouble finding examples of loading models, and your suggestion makes a lot of sense. I'll check it and report back.

@wangchenouc

@bmullen-steelcase Have you fixed this problem? I am facing the same issue: a model saved on the GPU machine does not work correctly on another machine without a GPU.
The same problem can be found here: #7676

@bmullen-steelcase
Author

@wangchenouc I was able to come up with a sort of workaround. My model was using the Adam optimizer. What I found was that, for some reason, by changing the optimizer to another one in the same family (Adadelta, I believe), I was able to actually improve on the training on the new system. Previously the accuracy would blow up.

@wangchenouc

@bmullen-steelcase Does the saved model (after your training) give the same results with the same data (the same image, for example) on a different machine? My training is good enough for me now; my problem is that the model does not work well on the other machine. Let me describe this in more detail:

In machine A (with a powerful GPU):

Training:

1. Fine-tune Inception V3 on a new image data set.
2. Save the best model to an HDF5 file (model.hdf5) using model.save(mf_path).

Testing:

3. Load an image:

        img_path = "test.jpg"
        img = image.load_img(img_path, target_size=(299, 299))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)

4. Load the model and predict:

        mf_path = 'model.hdf5'
        model = load_model(mf_path)
        preds = model.predict(x)

5. Results (the results always look like this):

        [[ 0.00197385  0.01141251  0.02262068  0.9121536   0.00810914  0.01657074
           0.00370198  0.00617629  0.00972648  0.00531203  0.00224261]]

In machine B (without GPU), using the same testing code as in machine A, I always get results like this:

    [[ 0.00373867  0.22160383  0.10066977  0.35440436  0.02839879  0.17799987
       0.01744748  0.02645957  0.0299265   0.03026218  0.00908909]]

The two machines have exactly the same Python environment, with Keras 2.0.8.
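
One way to narrow this down (a sketch, not something from this thread) is to dump the loaded weights on both machines and compare them; if they match, the save/load step is fine and the difference has to come from preprocessing or inference:

    import numpy as np
    from keras.models import load_model

    # Run this on both machines against the same model.hdf5 and compare the output.
    model = load_model('model.hdf5')
    flat = np.concatenate([w.ravel() for w in model.get_weights()])
    print('param count:', flat.size, 'sum:', flat.sum(), 'first values:', flat[:5])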

@bmullen-steelcase
Author

@wangchenouc I didn't have good results after the move either. I ended up running a few iterations of training on the CPU system with the Adadelta optimizer. It quickly got back to the same level of accuracy (in my case).

@wangchenouc

@bmullen-steelcase Do you mean that you gave up training your model on the GPU machine? Does the model you trained on the CPU machine give the same results on your GPU machine?

I just don't understand why it is a problem to copy the saved model from one machine to another. This is very important to me, as the trained model on my GPU system will be used by somebody else.

@fchollet Could you give some advice about this?

@bmullen-steelcase
Author

@wangchenouc No, I still train on the GPU machine. When I move the model to the CPU machine, it doesn't come over 100%, and the optimizer state gets dumped. If I need to do additional training on the CPU system, I compile with a new optimizer and run 5-10 iterations on the CPU system to set the parameters for the new optimizer. It seems to pick up training where it left off.
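
A rough sketch of that workaround as I understand it from this thread (the optimizer and epoch count come from the comments above; the file name, loss and training arrays are placeholders):

    from keras.models import load_model

    # On the CPU system: load the transferred model...
    model = load_model('trained_model.h5')  # placeholder file name

    # ...recompile with a fresh optimizer (Adadelta, per the workaround above),
    # replacing whatever optimizer state did or didn't survive the transfer...
    model.compile(optimizer='adadelta', loss='categorical_crossentropy',  # placeholder loss
                  metrics=['accuracy'])

    # ...and run a handful of epochs so the new optimizer's parameters settle in.
    model.fit(x_train, y_train, epochs=5, batch_size=32)  # x_train/y_train: placeholders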

@wangchenouc

@bmullen-steelcase @fchollet Does this confirm that model.save() and load_model() don't work the same way on a different machine with the saved model?

Continuing training on the new machine with the saved model is not a good approach in practice, but it seems that is all we can do so far...

@wangchenouc

@bmullen-steelcase @fchollet My problem is solved: I found that keras.preprocessing.image.load_img() is different on my two machines. The GPU machine has interpolation='bilinear', while my CPU machine lost this because of a different version of Keras that I installed.

Anyway, there is no problem with model.save() and load_model(). Some preprocessing functions may be slightly different on different machines, which affects the results.

@bmullen-steelcase I suggest you check the Keras version too; I compared the code of the two installs and found my problem.
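
If the installed Keras versions support it, one way to avoid this kind of mismatch is to pass the interpolation explicitly instead of relying on the default. A sketch, assuming load_img() accepts an interpolation argument on both machines and that preprocess_input comes from the InceptionV3 application:

    import numpy as np
    from keras.preprocessing import image
    from keras.applications.inception_v3 import preprocess_input

    # Pin the resize behaviour so both machines preprocess the image identically.
    img = image.load_img('test.jpg', target_size=(299, 299), interpolation='bilinear')
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))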

@LBartolini

> (quoting the original issue above in full)

I know a lot of time has passed since this question was asked.
I had a similar issue and found out that it was caused by an environment variable called PYTHONHASHSEED.
I just had to set it to a fixed value and everything worked as expected.
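
For anyone else landing here, a rough sketch of what pinning that variable (plus the other common seeds) looks like; the seed value 0 is arbitrary:

    import os
    import random
    import numpy as np

    # PYTHONHASHSEED only takes effect if it is set before the Python process starts,
    # e.g. `PYTHONHASHSEED=0 python3 training.py`; setting it here is shown for completeness.
    os.environ['PYTHONHASHSEED'] = '0'

    # Fix the other seeds that commonly cause run-to-run differences.
    random.seed(0)
    np.random.seed(0)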
