Can't transfer trained model between computers: totally different results on almost identical systems #8149

Closed
bmullen-steelcase opened this issue Oct 16, 2017 · 11 comments

Comments

@bmullen-steelcase

I have two systems: one has GPU support, which I use for training models, and the other doesn't. Otherwise, they have the exact same versions of Keras and TensorFlow.

I trained a model on the GPU system, copied it to the CPU system, and got unusable results; accuracy is terrible. I then tried setting the model configuration to deterministic, retraining on the GPU system, and moving the model to the CPU, using the following:

THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python3 training.py

Still getting unusable results, I tried one more thing: training the model from scratch on the CPU system, using the parameters found by trial and error on the GPU system. Here's the model design:

    # Fully connected classifier: 80 input features, two-unit sigmoid output.
    from keras.models import Sequential
    from keras.layers import Dense, Activation, Dropout

    model = Sequential()

    model.add(Dense(units=1024, input_dim=80))
    model.add(Activation('relu'))
    model.add(Dropout(0.4))

    model.add(Dense(units=1024))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))

    model.add(Dense(units=512))
    model.add(Activation('relu'))

    model.add(Dense(units=2))
    model.add(Activation('sigmoid'))

Running the exact same code on both systems, side by side, the GPU system gets good accuracy while the CPU system will not converge at all.

What can I do to get this into production?
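
For reference, a minimal sketch of the kind of save/load round trip involved, assuming model.save() on one side and load_model() on the other (the file name and test batch are placeholders):

    from keras.models import load_model

    # On the GPU system, after training: save the architecture, weights and optimizer state.
    model.save('trained_model.h5')  # placeholder file name

    # On the CPU system: restore the model and run inference.
    model = load_model('trained_model.h5')
    preds = model.predict(x_test)  # x_test: a preprocessed input batch (placeholder)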

@tRosenflanz

Is the code really identical on both systems? The reason I am asking is that you have sigmoid as the final layer, although you would want a softmax for a classification task (or just one unit with sigmoid activation). I am just wondering if that could have an effect.
Also, please make sure that the weights of the layers actually change after you load a trained model. The proper way is to compile the model first and then load the weights; frequently people compile the model after loading the weights, and that reinitializes them.
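
A minimal sketch of the ordering suggested above, assuming the weights were saved separately with model.save_weights() (the architecture and file name below are placeholders):

    from keras.models import Sequential
    from keras.layers import Dense

    # Rebuild the exact same architecture first (placeholder layers shown here).
    model = Sequential()
    model.add(Dense(1024, activation='relu', input_dim=80))
    model.add(Dense(2, activation='softmax'))

    # Compile first...
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # ...then load the trained weights, so nothing reinitializes them afterwards.
    before = model.get_weights()[0].copy()
    model.load_weights('weights.h5')  # placeholder file name
    after = model.get_weights()[0]

    # Sanity check from the comment above: the loaded weights should actually differ
    # from the fresh initialization.
    print('weights changed:', not (before == after).all())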

@bmullen-steelcase
Author

The code's the same; I just copied it from one system over to the other.

The second suggestion is really good, though. I had trouble finding examples of loading models, and your suggestion makes a lot of sense. I'll check it and report back.

@wangchenouc

@bmullen-steelcase Have you fixed this problem? I am facing the same issue: a model saved on the GPU machine does not work correctly on another machine without a GPU.
The same problem can be found here: #7676

@bmullen-steelcase
Author

@wangchenouc I was able to come up with a sort of workaround. My model was using the Adam optimizer. What I found was that, for some reason, by changing the optimizer to another one in the same family (Adadelta, I believe), I was able to actually improve on the training on the new system. Previously the accuracy would blow up.

@wangchenouc

@bmullen-steelcase Does the saved model (after your training) give the same results with the same data (the same image, for example) on a different machine? My training is good enough for me now; my problem is that the model does not work well on the other machine. Let me describe this in more detail:

In machine A (with a powerful GPU):

Training:

1. Fine-tune Inception V3 on a new image data set.
2. Save the best model to an HDF5 file (model.hdf5) using model.save(mf_path).

Testing:

3. Load an image:

        img_path = "test.jpg"
        img = image.load_img(img_path, target_size=(299, 299))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)

4. Load the model and predict:

        mf_path = 'model.hdf5'
        model = load_model(mf_path)
        preds = model.predict(x)

5. Results (the results always look like this):

        [[ 0.00197385  0.01141251  0.02262068  0.9121536   0.00810914  0.01657074
           0.00370198  0.00617629  0.00972648  0.00531203  0.00224261]]

In machine B (without GPU), using the same testing code as in machine A, I always get results like this:

    [[ 0.00373867  0.22160383  0.10066977  0.35440436  0.02839879  0.17799987
       0.01744748  0.02645957  0.0299265   0.03026218  0.00908909]]

The two machines have exactly the same Python environment, with Keras 2.0.8.
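
One way to narrow this down (a sketch, not something from this thread) is to dump the loaded weights on both machines and compare them; if they match, the save/load step is fine and the difference has to come from preprocessing or inference:

    import numpy as np
    from keras.models import load_model

    # Run this on both machines against the same model.hdf5 and compare the output.
    model = load_model('model.hdf5')
    flat = np.concatenate([w.ravel() for w in model.get_weights()])
    print('param count:', flat.size, 'sum:', flat.sum(), 'first values:', flat[:5])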

@bmullen-steelcase
Author

@wangchenouc I didn't have good results after the move either. I ended up running a few iterations of training on the CPU system with the Adadelta optimizer. It quickly got back to the same level of accuracy (in my case).

@wangchenouc

@bmullen-steelcase Do you mean that you gave up training your model on the GPU machine? Does the model you trained on the CPU machine give the same results on your GPU machine?

I just don't understand why it is a problem to copy the saved model from one machine to another. This is very important to me, as the trained model on my GPU system will be used by somebody else.

@fchollet Could you give some advice about this?

@bmullen-steelcase
Author

@wangchenouc No, I still train on the GPU machine. When I move the model to the CPU machine, it doesn't come over 100%, and the optimizer state gets dumped. If I need to do additional training on the CPU system, I compile with a new optimizer and run 5-10 iterations on the CPU system to set the parameters for the new optimizer. It seems to pick up training where it left off.
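
A rough sketch of that workaround as I understand it from this thread (the optimizer and epoch count come from the comments above; the file name, loss and training arrays are placeholders):

    from keras.models import load_model

    # On the CPU system: load the transferred model...
    model = load_model('trained_model.h5')  # placeholder file name

    # ...recompile with a fresh optimizer (Adadelta, per the workaround above),
    # replacing whatever optimizer state did or didn't survive the transfer...
    model.compile(optimizer='adadelta', loss='categorical_crossentropy',  # placeholder loss
                  metrics=['accuracy'])

    # ...and run a handful of epochs so the new optimizer's parameters settle in.
    model.fit(x_train, y_train, epochs=5, batch_size=32)  # x_train/y_train: placeholders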

@wangchenouc

@bmullen-steelcase @fchollet Does this confirm that model.save() and load_model() don't work the same way on a different machine with the saved model?

Continuing training on the new machine with the saved model is not a good approach in practice, but it seems that is all we can do so far...

@wangchenouc

@bmullen-steelcase @fchollet My problem is solved: I found that keras.preprocessing.image.load_img() is different on my two machines. The GPU machine has interpolation='bilinear', while my CPU machine lost this because of a different version of Keras that I installed.

Anyway, there is no problem with model.save() and load_model(). Some preprocessing functions may be slightly different on different machines, which affects the results.

@bmullen-steelcase I suggest you check the Keras version too; I compared the code of the two installs and found my problem.
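
If the installed Keras versions support it, one way to avoid this kind of mismatch is to pass the interpolation explicitly instead of relying on the default. A sketch, assuming load_img() accepts an interpolation argument on both machines and that preprocess_input comes from the InceptionV3 application:

    import numpy as np
    from keras.preprocessing import image
    from keras.applications.inception_v3 import preprocess_input

    # Pin the resize behaviour so both machines preprocess the image identically.
    img = image.load_img('test.jpg', target_size=(299, 299), interpolation='bilinear')
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))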

@LBartolini

> (quoting the original issue above in full)

I know a lot of time has passed since this question was asked.
I had a similar issue and found out that it was caused by an environment variable called PYTHONHASHSEED.
I just had to set it to a fixed value and everything worked as expected.
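
For anyone else landing here, a rough sketch of what pinning that variable (plus the other common seeds) looks like; the seed value 0 is arbitrary:

    import os
    import random
    import numpy as np

    # PYTHONHASHSEED only takes effect if it is set before the Python process starts,
    # e.g. `PYTHONHASHSEED=0 python3 training.py`; setting it here is shown for completeness.
    os.environ['PYTHONHASHSEED'] = '0'

    # Fix the other seeds that commonly cause run-to-run differences.
    random.seed(0)
    np.random.seed(0)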
