multi_gpu_model not working w/ TensorFlow 1.14 #13057

rohit-gupta · 2019-07-03T11:05:14Z

System information

Have I written custom code (as opposed to using example directory): No/Yes (very slight change to an example)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow backend (yes / no): yes
TensorFlow version: 1.14
Keras version: Latest master from github
Python version: 3.7 (through Anaconda)
CUDA/cuDNN version: 10.0/7.4.2
GPU model and memory: 2x Tesla K80 (11GB each)

Describe the current behavior

I am using the cifar-10 ResNet example from the Keras examples directory, with the addition of the following line at Line number 360 (just before compilation) in order to use multiple GPUs while training. However this doesn't work.

Line Added:
model = keras.utils.multi_gpu_model(model, gpus=2)

Traceback Error log:

Traceback (most recent call last):
  File "cifar10_resnet_multigpu.py", line 360, in <module>
    model = keras.utils.multi_gpu_model(model, gpus=2)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
    outputs = model(inputs)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

Describe the expected behavior

Previously, this typically worked fine and results in faster training due to parallelization across GPUs.

Note: This works fine if the backend is Tensorflow 1.13, so this is a regression.

The text was updated successfully, but these errors were encountered:

QtacierP · 2019-07-04T01:18:14Z

The same with you. The codes can run with tf==1.12.0, but cannot run with tf=1.14.0. I don't know the reason. The largest change is that I have transferred CUDA from 9.0 to 10.0, then nothing has been changed.

rohit-gupta · 2019-07-04T08:30:32Z

@QtacierP It works with TF 1.13 and CUDA 10.0 for me, its just TF 1.14 that's a problem

derekhsu · 2019-07-04T16:22:51Z

I have the same problem now. Currently, is downgrade an only way to solve this problem?

rohit-gupta · 2019-07-04T17:16:41Z

@derekhsu I can't really speak for Keras maintainers, but I don't know of any other solution. Bugs like this with critical features like Multi-GPU training are a big problem for Keras.

TheStoneMX · 2019-07-05T19:34:25Z

Same here!!!! very disappointed solutions, please.....

ju-he · 2019-08-12T19:01:36Z

I received the same error message as above when using tf1.14, but after downgrading to 1.12 as well as to 1.13 I am confronted with:

Traceback (most recent call last):
  File "trainer_temp.py", line 226, in <module>
    main()
  File "trainer_temp.py", line 137, in main
    model = multi_gpu_model(build.models['vae'], gpus=gpus)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 227, in multi_gpu_model
    outputs = model(inputs)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1858, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4261, in _apply_device_functions
    op._set_device(device_spec.function(op))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 314, in _device_function
    current_device = DeviceSpec.from_string(node_def.device or "")
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 232, in from_string
    return DeviceSpec().parse_from_string(spec)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 150, in parse_from_string
    splits = [x.split(":") for x in spec.split("/")]
AttributeError: 'DeviceSpec' object has no attribute 'split'

Any suggestions whether this is caused by the same issue or if I might have another problem?

System information

OS: Red Hat Enterprise Linux
Python: 3.6.5
Keras: 2.2.4
Tensorflow: 1.12/1.13.1/1.14
Cuda: 9/10
GPU model: NVIDIA GeForce GTX980 Ti

fengwang · 2019-08-20T13:12:36Z

Same problem here. Log:

Traceback (most recent call last):
  File "phase_retrieval_-108_gan.py", line 39, in <module>
    discriminator = multi_gpu_model( discriminator, gpus=2 )
  File "/usr/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
    outputs = model(inputs)
  File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/usr/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

TheStoneMX · 2019-08-20T13:46:19Z

I am trying to learn Pytorch.... little by little.... I dont kow when they will fix this.... it have been three months

KalyanKumarPichuka · 2019-08-21T05:22:58Z

same problem here too...

AttributeError Traceback (most recent call last)
in
10 opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
11 #opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
---> 12 parallel_model = multi_gpu_model(model, gpus=2)
13 #parallel_model.compile(loss='categorical_crossentropy',optimizer='rmsprop')
14 parallel_model.compile(optimizer=opt, loss=bce_dice_loss, metrics=[dice_coef])

/opt/conda/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py in multi_gpu_model(model, gpus, cpu_merge, cpu_relocation)
225 # Apply model on slice
226 # (creating a model replica on the target device).
--> 227 outputs = model(inputs)
228 outputs = to_list(outputs)
229

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in call(self, inputs, **kwargs)
455 # Actually call the layer,
456 # collecting output(s), mask(s), and shape(s).
--> 457 output = self.call(inputs, **kwargs)
458 output_mask = self.compute_mask(inputs, previous_mask)
459

/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in call(self, inputs, mask)
562 return self._output_tensor_cache[cache_key]
563 else:
--> 564 output_tensors, _, _ = self.run_internal_graph(inputs, masks)
565 return output_tensors
566

/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in run_internal_graph(self, inputs, masks)
719 kwargs['mask'] = computed_mask
720 output_tensors = to_list(
--> 721 layer.call(computed_tensor, **kwargs))
722 output_masks = layer.compute_mask(computed_tensor,
723 computed_mask)

/opt/conda/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
183 normed_training, mean, variance = K.normalize_batch_in_training(
184 inputs, self.gamma, self.beta, reduction_axes,
--> 185 epsilon=self.epsilon)
186
187 if K.backend() != 'cntk':

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in normalize_batch_in_training(x, gamma, beta, reduction_axes, epsilon)
1856 """
1857 if ndim(x) == 4 and list(reduction_axes) in [[0, 1, 2], [0, 2, 3]]:
-> 1858 if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
1859 return _broadcast_normalize_batch_in_training(x, gamma, beta,
1860 reduction_axes,

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _has_nchw_support()
289 bool: if the current scope device placement would support nchw
290 """
--> 291 explicitly_on_cpu = _is_current_explicit_device('CPU')
292 gpus_available = len(_get_available_gpus()) > 0
293 return (not explicitly_on_cpu and gpus_available)

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _is_current_explicit_device(device_type)
264 if device_type not in ['CPU', 'GPU']:
265 raise ValueError('device_type should be either "CPU" or "GPU".')
--> 266 device = _get_current_tf_device()
267 return (device is not None and device.device_type == device_type.upper())
268

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _get_current_tf_device()
245 g = tf.get_default_graph()
246 op = _TfDeviceCaptureOp()
--> 247 g._apply_device_functions(op)
248 return op.device
249

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _apply_device_functions(self, op)
4579 # strings, since identity checks are faster than equality checks.
4580 if device_string is not prior_device_string:
-> 4581 op._set_device_from_string(device_string)
4582 prior_device_string = device_string
4583 op._device_code_locations = self._snapshot_device_function_stack_metadata()

AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

Borda · 2019-08-23T10:56:48Z

yes, TF 1.14 issue, see tensorflow/tensorflow#30728

ju-he · 2019-08-26T16:52:02Z

i found a workaround (which worked at least for my configuration):
since parse_from_string treated the DeviceSpec object it wanted to receive as an string, there was a simple solution, adding the following in
site-packages/tensorflow/python/framework/device.py", line 150, in parse_from_string:

if isinstance(spec, DeviceSpec):
        return spec

might not be the optimal solution, but it finally allowed me to use all of my GPUs...

TheStoneMX · 2019-08-26T21:27:12Z

@ju-he thanks for the post, but I cant find that line of code, I dont have anything in line 150, just some comments....

I have this

if isinstance(spec, MergeDevice):
return spec

but never found

parse_from_string in the file

I don't understand what did you changed...

Thanks.

ju-he · 2019-08-26T21:52:10Z

@TheStoneMX which tf version are you using? I downgraded from 1.14 to 1.12 and to 1.10, since some suggested this solution but I still had the issues as described above. Maybe it's due to my specific configuration. But do I understand you correctly, that the "if isinstance-return spec" part (that's the only thing I added) is already there in your version? Then apparently this bug has already been fixed, just not in the version I was using.

TheStoneMX · 2019-08-27T10:11:51Z

@ju-he I am using 1.14 and still can't use multiple GPUs, I am thinking to learn Pytorch..... it has been too long and they are not fixing this bug

ju-he · 2019-08-27T10:19:11Z

@TheStoneMX have you tried switching to tf 1.12 or even 1.10? This together with the Bugfix I posted above should work fine.

TheStoneMX · 2019-08-27T10:46:24Z

Hi @ju-he

Thanks for the email, but I havent be able, I am using conda and everytim I install keras, it reinstall 1.4....

Do you know how I can do it ?

But I thought that 1.3 does not have this problem, because before everything was working.

ju-he · 2019-08-27T13:55:00Z

Hi @TheStoneMX
I stopped using conda a while ago, but I think you can choose the version of a specific package via conda install package=version

TheStoneMX · 2019-08-27T14:02:42Z

I was not using conda for a while then I started to use it again, I am switching to see if I can make work, thanks bro.

JVGD · 2019-08-27T14:57:24Z

Hi, same issue here when trying to use multi GPU with Keras.
I had to fall back to tensorflow-gpu 1.13.2 to make it work :(
Any news on this? Hope things get fixed soon 👍

TheStoneMX · 2019-08-27T16:50:59Z

Hi @ju-he I got it working removing anaconda and using pip3 and installed TensorFlow-GPU 1.13.2

karolbadowski · 2019-09-11T21:07:20Z

I have the same problem when calling:
with device('/gpu:0' if use_GPU else '/cpu:0'): portion of code

Tensorflow-gpu 1.14 has disappointed me as well. I consider 1.13.2 a last reliable version.

Just importing it causes incompatibilities:

for example with numpy,

with management of GPUs / CPUs.

Many things have changed package path and there is no backwards compatibility, for example:

package path to TocoConverter/TFLiteConverter

package path to set_image_dim_ordering

many other places get warning like for example "tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead"

I believe 1.14 is currently more similar to TensorFlow 2 rather than to TensorFlow 1.
Why would there be explicit necessity to change package path to "v1" otherwise.

Please consider backwards compatibility for 1.x.x versions if the version still starts with 1.

jtk1919 · 2019-12-26T09:17:01Z

I, too, got it working by installing a pip3 environment separate from my Anaconda environment.

pip3 for python 3.7.5
tensorflow-gpu v. 1.14.0
keras 2.3.1
cuda 10.0 libraries only into /usr/local/cuda-10.0 in addition to cuda 10.1 + drivers that were previously installed

jayagami · 2020-02-20T01:17:44Z

I, too, got it working by installing a pip3 environment separate from my Anaconda environment.

pip3 for python 3.7.5

tensorflow-gpu v. 1.14.0

keras 2.3.1

cuda 10.0 libraries only into /usr/local/cuda-10.0 in addition to cuda 10.1 + drivers that were previously installed

Just upgrade tf version to 1.15, works for me.

JivanRoquet · 2020-02-27T13:01:01Z

Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4

jayagami · 2020-03-04T05:40:41Z

Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4

keras                     2.3.1                    pypi_0    pypi
tensorboard               1.15.0                   pypi_0    pypi
tensorflow-estimator      1.15.1                   pypi_0    pypi
tensorflow-gpu            1.15.0                   pypi_0    pypi

Tested on my computer.

Ubuntu 19.10 , gtx1080ti sli, python3.7, cuda 10.1

alexw994 · 2020-09-29T09:52:42Z

keras 2.2.4
tensorflow 1.13.1
it works for me.

rohit-gupta changed the title ~~multi_gpu_model not working~~ multi_gpu_model not working w/ TensorFlow 1.14 Jul 3, 2019

jvishnuvardhan added the backend:tensorflow label Jul 9, 2019

jvishnuvardhan self-assigned this Jul 9, 2019

jvishnuvardhan added the type:bug/performance label Jul 9, 2019

jvishnuvardhan assigned robieta and unassigned jvishnuvardhan Jul 9, 2019

jvishnuvardhan added the stat:awaiting keras-eng Awaiting response from Keras engineer label Jul 9, 2019

robieta mentioned this issue Aug 27, 2019

sync changes to _TfDeviceCaptureOp #13255

Merged

fchollet closed this as completed in #13255 Aug 28, 2019

CMCDragonkai mentioned this issue Nov 5, 2019

19.09 Keras 2.2.4 is not compatible with Tensorflow 1.14 due to multi-gpu NixOS/nixpkgs#72799

Closed

kvrooman mentioned this issue Dec 6, 2019

Error at training start because of multi GPU bad detection deepfakes/faceswap#948

Closed

psteinb mentioned this issue Feb 11, 2020

Multi gpu training juglab/n2v#69

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multi_gpu_model not working w/ TensorFlow 1.14 #13057

multi_gpu_model not working w/ TensorFlow 1.14 #13057

rohit-gupta commented Jul 3, 2019 •

edited

Loading

QtacierP commented Jul 4, 2019

rohit-gupta commented Jul 4, 2019

derekhsu commented Jul 4, 2019

rohit-gupta commented Jul 4, 2019

TheStoneMX commented Jul 5, 2019

ju-he commented Aug 12, 2019 •

edited

Loading

fengwang commented Aug 20, 2019

TheStoneMX commented Aug 20, 2019

KalyanKumarPichuka commented Aug 21, 2019

Borda commented Aug 23, 2019

ju-he commented Aug 26, 2019

TheStoneMX commented Aug 26, 2019

ju-he commented Aug 26, 2019

TheStoneMX commented Aug 27, 2019

ju-he commented Aug 27, 2019

TheStoneMX commented Aug 27, 2019

ju-he commented Aug 27, 2019

TheStoneMX commented Aug 27, 2019

JVGD commented Aug 27, 2019 •

edited

Loading

TheStoneMX commented Aug 27, 2019

karolbadowski commented Sep 11, 2019 •

edited

Loading

jtk1919 commented Dec 26, 2019 •

edited

Loading

jayagami commented Feb 20, 2020

JivanRoquet commented Feb 27, 2020 •

edited

Loading

jayagami commented Mar 4, 2020 •

edited

Loading

alexw994 commented Sep 29, 2020

multi_gpu_model not working w/ TensorFlow 1.14 #13057

multi_gpu_model not working w/ TensorFlow 1.14 #13057

Comments

rohit-gupta commented Jul 3, 2019 • edited Loading

QtacierP commented Jul 4, 2019

rohit-gupta commented Jul 4, 2019

derekhsu commented Jul 4, 2019

rohit-gupta commented Jul 4, 2019

TheStoneMX commented Jul 5, 2019

ju-he commented Aug 12, 2019 • edited Loading

fengwang commented Aug 20, 2019

TheStoneMX commented Aug 20, 2019

KalyanKumarPichuka commented Aug 21, 2019

Borda commented Aug 23, 2019

ju-he commented Aug 26, 2019

TheStoneMX commented Aug 26, 2019

ju-he commented Aug 26, 2019

TheStoneMX commented Aug 27, 2019

ju-he commented Aug 27, 2019

TheStoneMX commented Aug 27, 2019

ju-he commented Aug 27, 2019

TheStoneMX commented Aug 27, 2019

JVGD commented Aug 27, 2019 • edited Loading

TheStoneMX commented Aug 27, 2019

karolbadowski commented Sep 11, 2019 • edited Loading

jtk1919 commented Dec 26, 2019 • edited Loading

jayagami commented Feb 20, 2020

JivanRoquet commented Feb 27, 2020 • edited Loading

jayagami commented Mar 4, 2020 • edited Loading

alexw994 commented Sep 29, 2020

rohit-gupta commented Jul 3, 2019 •

edited

Loading

ju-he commented Aug 12, 2019 •

edited

Loading

JVGD commented Aug 27, 2019 •

edited

Loading

karolbadowski commented Sep 11, 2019 •

edited

Loading

jtk1919 commented Dec 26, 2019 •

edited

Loading

JivanRoquet commented Feb 27, 2020 •

edited

Loading

jayagami commented Mar 4, 2020 •

edited

Loading