multi_gpu_model not working w/ TensorFlow 1.14 #13057

Closed
rohit-gupta opened this issue Jul 3, 2019 · 26 comments · Fixed by #13255
Labels: stat:awaiting keras-eng (Awaiting response from Keras engineer), type:bug/performance


@rohit-gupta

rohit-gupta commented Jul 3, 2019

System information

  • Have I written custom code (as opposed to using example directory): No/Yes (very slight change to an example)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow backend (yes / no): yes
  • TensorFlow version: 1.14
  • Keras version: Latest master from github
  • Python version: 3.7 (through Anaconda)
  • CUDA/cuDNN version: 10.0/7.4.2
  • GPU model and memory: 2x Tesla K80 (11GB each)

Describe the current behavior

I am using the CIFAR-10 ResNet example from the Keras examples directory, with the following line added at line 360 (just before compilation) to use multiple GPUs while training. However, this doesn't work.

Line Added:
model = keras.utils.multi_gpu_model(model, gpus=2)

Traceback Error log:

Traceback (most recent call last):
  File "cifar10_resnet_multigpu.py", line 360, in <module>
    model = keras.utils.multi_gpu_model(model, gpus=2)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
    outputs = model(inputs)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

Describe the expected behavior

Previously, this worked fine and resulted in faster training due to parallelization across GPUs.

Note: This works fine if the backend is Tensorflow 1.13, so this is a regression.
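
The traceback above ends in a call to `_set_device_from_string` on an object that does not define it. The following is a minimal sketch of that failure mode using stand-in classes, not the real Keras or TensorFlow code: `CaptureOpOld`, `CaptureOpPatched`, `apply_device_functions`, and `capture_device` are hypothetical names, and the added method is only one plausible shape for a fix, not a confirmed patch.

```python
# Stand-in sketch of the failure mode; none of these classes are the real
# Keras/TensorFlow implementations.


class CaptureOpOld:
    """Mimics a duck-typed op that only knows the older setter."""

    def __init__(self):
        self.device = None

    def _set_device(self, device):
        self.device = str(device)


class CaptureOpPatched(CaptureOpOld):
    """Adds the setter that the newer caller expects (assumed fix shape)."""

    def _set_device_from_string(self, device_str):
        self.device = device_str


def apply_device_functions(op, device_string):
    """Hypothetical caller that, like the last frame in the traceback,
    uses the string-based setter."""
    op._set_device_from_string(device_string)


def capture_device(op_cls, device_string="/device:GPU:0"):
    """Return the captured device, or the error text if capture fails."""
    op = op_cls()
    try:
        apply_device_functions(op, device_string)
    except AttributeError as exc:
        return "AttributeError: {}".format(exc)
    return op.device


old_result = capture_device(CaptureOpOld)
patched_result = capture_device(CaptureOpPatched)
print(old_result)
print(patched_result)
```

The old stand-in fails the same way as the report, while the patched one records the device string it is given.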

rohit-gupta changed the title from "multi_gpu_model not working" to "multi_gpu_model not working w/ TensorFlow 1.14" on Jul 3, 2019
@QtacierP

QtacierP commented Jul 4, 2019

Same here. The code runs with tf==1.12.0 but not with tf==1.14.0, and I don't know why. The biggest change on my side was moving from CUDA 9.0 to 10.0; nothing else has changed.

@rohit-gupta
Author

@QtacierP It works with TF 1.13 and CUDA 10.0 for me; it's just TF 1.14 that's a problem.

@derekhsu

derekhsu commented Jul 4, 2019

I have the same problem now. Is downgrading currently the only way to solve it?

@rohit-gupta
Author

@derekhsu I can't really speak for Keras maintainers, but I don't know of any other solution. Bugs like this with critical features like Multi-GPU training are a big problem for Keras.

@TheStoneMX

Same here! Very disappointed. Any solutions, please?

@ju-he

ju-he commented Aug 12, 2019

I received the same error message as above when using tf1.14, but after downgrading to 1.12 as well as to 1.13 I am confronted with:

Traceback (most recent call last):
  File "trainer_temp.py", line 226, in <module>
    main()
  File "trainer_temp.py", line 137, in main
    model = multi_gpu_model(build.models['vae'], gpus=gpus)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 227, in multi_gpu_model
    outputs = model(inputs)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1858, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4261, in _apply_device_functions
    op._set_device(device_spec.function(op))
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 314, in _device_function
    current_device = DeviceSpec.from_string(node_def.device or "")
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 232, in from_string
    return DeviceSpec().parse_from_string(spec)
  File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 150, in parse_from_string
    splits = [x.split(":") for x in spec.split("/")]
AttributeError: 'DeviceSpec' object has no attribute 'split'

Any suggestions whether this is caused by the same issue or if I might have another problem?

System information

  • OS: Red Hat Enterprise Linux
  • Python: 3.6.5
  • Keras: 2.2.4
  • Tensorflow: 1.12/1.13.1/1.14
  • Cuda: 9/10
  • GPU model: NVIDIA GeForce GTX980 Ti

@fengwang

Same problem here. Log:

Traceback (most recent call last):
  File "phase_retrieval_-108_gan.py", line 39, in <module>
    discriminator = multi_gpu_model( discriminator, gpus=2 )
  File "/usr/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
    outputs = model(inputs)
  File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/usr/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
    epsilon=self.epsilon)
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

@TheStoneMX

I am trying to learn PyTorch, little by little. I don't know when they will fix this; it has been three months.

@KalyanKumarPichuka

same problem here too...


AttributeError Traceback (most recent call last)
in
10 opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
11 #opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
---> 12 parallel_model = multi_gpu_model(model, gpus=2)
13 #parallel_model.compile(loss='categorical_crossentropy',optimizer='rmsprop')
14 parallel_model.compile(optimizer=opt, loss=bce_dice_loss, metrics=[dice_coef])

/opt/conda/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py in multi_gpu_model(model, gpus, cpu_merge, cpu_relocation)
225 # Apply model on slice
226 # (creating a model replica on the target device).
--> 227 outputs = model(inputs)
228 outputs = to_list(outputs)
229

/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
455 # Actually call the layer,
456 # collecting output(s), mask(s), and shape(s).
--> 457 output = self.call(inputs, **kwargs)
458 output_mask = self.compute_mask(inputs, previous_mask)
459

/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in call(self, inputs, mask)
562 return self._output_tensor_cache[cache_key]
563 else:
--> 564 output_tensors, _, _ = self.run_internal_graph(inputs, masks)
565 return output_tensors
566

/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in run_internal_graph(self, inputs, masks)
719 kwargs['mask'] = computed_mask
720 output_tensors = to_list(
--> 721 layer.call(computed_tensor, **kwargs))
722 output_masks = layer.compute_mask(computed_tensor,
723 computed_mask)

/opt/conda/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
183 normed_training, mean, variance = K.normalize_batch_in_training(
184 inputs, self.gamma, self.beta, reduction_axes,
--> 185 epsilon=self.epsilon)
186
187 if K.backend() != 'cntk':

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in normalize_batch_in_training(x, gamma, beta, reduction_axes, epsilon)
1856 """
1857 if ndim(x) == 4 and list(reduction_axes) in [[0, 1, 2], [0, 2, 3]]:
-> 1858 if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
1859 return _broadcast_normalize_batch_in_training(x, gamma, beta,
1860 reduction_axes,

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _has_nchw_support()
289 bool: if the current scope device placement would support nchw
290 """
--> 291 explicitly_on_cpu = _is_current_explicit_device('CPU')
292 gpus_available = len(_get_available_gpus()) > 0
293 return (not explicitly_on_cpu and gpus_available)

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _is_current_explicit_device(device_type)
264 if device_type not in ['CPU', 'GPU']:
265 raise ValueError('device_type should be either "CPU" or "GPU".')
--> 266 device = _get_current_tf_device()
267 return (device is not None and device.device_type == device_type.upper())
268

/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _get_current_tf_device()
245 g = tf.get_default_graph()
246 op = _TfDeviceCaptureOp()
--> 247 g._apply_device_functions(op)
248 return op.device
249

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _apply_device_functions(self, op)
4579 # strings, since identity checks are faster than equality checks.
4580 if device_string is not prior_device_string:
-> 4581 op._set_device_from_string(device_string)
4582 prior_device_string = device_string
4583 op._device_code_locations = self._snapshot_device_function_stack_metadata()

AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

@Borda

Borda commented Aug 23, 2019

yes, TF 1.14 issue, see tensorflow/tensorflow#30728

@ju-he

ju-he commented Aug 26, 2019

I found a workaround (which worked at least for my configuration).
Since parse_from_string was handed a DeviceSpec object but treated it as a string, there was a simple solution: adding the following to parse_from_string in site-packages/tensorflow/python/framework/device.py (line 150):

if isinstance(spec, DeviceSpec):
    return spec

It might not be the optimal solution, but it finally allowed me to use all of my GPUs.
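
A minimal sketch of the guard described above, using a stand-in class rather than TensorFlow's `device.py`. `MiniDeviceSpec` and its `fields` dictionary are hypothetical; only the `split`-based parsing expression and the `isinstance` check mirror what the traceback and the workaround show.

```python
class MiniDeviceSpec:
    """Hypothetical stand-in for a device spec; not TensorFlow's DeviceSpec."""

    def __init__(self):
        self.fields = {}

    def parse_from_string(self, spec):
        # Guard from the workaround: a spec object is passed through
        # unchanged instead of being treated as a string.
        if isinstance(spec, MiniDeviceSpec):
            return spec
        # Same splitting expression as the failing frame in the traceback.
        splits = [x.split(":") for x in spec.split("/")]
        for part in splits:
            if len(part) == 2:
                self.fields[part[0]] = part[1]
            elif len(part) == 3:
                self.fields["device_type"] = part[1]
                self.fields["device_index"] = part[2]
        return self

    @classmethod
    def from_string(cls, spec):
        return cls().parse_from_string(spec)


# A string is parsed as usual.
parsed = MiniDeviceSpec.from_string("/job:worker/device:GPU:0")
# An already-parsed spec is returned as-is; without the guard this call
# would raise AttributeError because the object has no 'split' method.
same = MiniDeviceSpec.from_string(parsed)
print(parsed.fields, same is parsed)
```

Edits made under site-packages are lost when the package is reinstalled or upgraded, so a patch like this is best treated as a stopgap.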

@TheStoneMX

@ju-he thanks for the post, but I can't find that line of code. I don't have anything at line 150, just some comments.

I have this:

if isinstance(spec, MergeDevice):
    return spec

but I never found parse_from_string in the file.

I don't understand what you changed.

Thanks.

@ju-he

ju-he commented Aug 26, 2019

@TheStoneMX which tf version are you using? I downgraded from 1.14 to 1.12 and to 1.10, since some suggested this solution but I still had the issues as described above. Maybe it's due to my specific configuration. But do I understand you correctly, that the "if isinstance-return spec" part (that's the only thing I added) is already there in your version? Then apparently this bug has already been fixed, just not in the version I was using.

@TheStoneMX

@ju-he I am using 1.14 and still can't use multiple GPUs. I am thinking of learning PyTorch; it has been too long and they are not fixing this bug.

@ju-he

ju-he commented Aug 27, 2019

@TheStoneMX have you tried switching to tf 1.12 or even 1.10? This together with the Bugfix I posted above should work fine.

@TheStoneMX

Hi @ju-he

Thanks for the email, but I haven't been able to. I am using conda, and every time I install keras it reinstalls 1.4.

Do you know how I can do it?

But I thought that 1.3 does not have this problem, because everything was working before.

@ju-he

ju-he commented Aug 27, 2019

Hi @TheStoneMX
I stopped using conda a while ago, but I think you can choose the version of a specific package via conda install package=version

@TheStoneMX

I had not been using conda for a while and then started using it again. I am switching to see if I can make it work, thanks bro.

@JVGD

JVGD commented Aug 27, 2019

Hi, same issue here when trying to use multi GPU with Keras.
I had to fall back to tensorflow-gpu 1.13.2 to make it work :(
Any news on this? Hope things get fixed soon 👍

@TheStoneMX

Hi @ju-he, I got it working by removing Anaconda, using pip3, and installing TensorFlow-GPU 1.13.2.

@karolbadowski

karolbadowski commented Sep 11, 2019

I have the same problem when calling:
with device('/gpu:0' if use_GPU else '/cpu:0'): portion of code

Tensorflow-gpu 1.14 has disappointed me as well. I consider 1.13.2 the last reliable version.

Just importing it causes incompatibilities:

  • for example with numpy,
  • with management of GPUs / CPUs.

Many things have changed package paths with no backwards compatibility, for example:

  • package path to TocoConverter/TFLiteConverter
  • package path to set_image_dim_ordering
  • many other places emit warnings such as "tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead"

I believe 1.14 is currently more similar to TensorFlow 2 than to TensorFlow 1.
Why else would there be an explicit need to change package paths to "v1"?

Please consider backwards compatibility for 1.x.x versions if the version still starts with 1.
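
One way to cope with symbols that moved between package paths is to try the new location first and fall back to the old one. The helper below is a hypothetical sketch (`resolve_first` is not a TensorFlow or Keras function). It is demonstrated on the standard library so that it runs without TensorFlow, and the commented-out lines show how it might be pointed at the `tf.compat.v1.placeholder` and `tf.placeholder` paths named in the deprecation warning.

```python
import importlib


def resolve_first(*dotted_paths):
    """Return the first resolvable attribute among 'module.attr' paths."""
    errors = []
    for path in dotted_paths:
        module_name, _, attr = path.rpartition(".")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError, ValueError) as exc:
            errors.append("{}: {}".format(path, exc))
    raise ImportError("none of the paths resolved: " + "; ".join(errors))


# Standard-library demonstration: the first path does not exist, so the
# helper falls back to the second.
sqrt = resolve_first("math.no_such_function", "math.sqrt")
print(sqrt(9))

# Assumed usage with TensorFlow (commented out so the sketch runs without it):
# placeholder = resolve_first("tensorflow.compat.v1.placeholder",
#                             "tensorflow.placeholder")
```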

@jtk1919

jtk1919 commented Dec 26, 2019

I, too, got it working by installing a pip3 environment separate from my Anaconda environment.

  • pip3 for python 3.7.5
  • tensorflow-gpu v. 1.14.0
  • keras 2.3.1
  • cuda 10.0 libraries only into /usr/local/cuda-10.0 in addition to cuda 10.1 + drivers that were previously installed

@jayagami

Just upgrade tf version to 1.15, works for me.

@JivanRoquet

JivanRoquet commented Feb 27, 2020

Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4

@jayagami

jayagami commented Mar 4, 2020

@JivanRoquet wrote: "Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4"

keras                     2.3.1                    pypi_0    pypi
tensorboard               1.15.0                   pypi_0    pypi
tensorflow-estimator      1.15.1                   pypi_0    pypi
tensorflow-gpu            1.15.0                   pypi_0    pypi

Tested on my computer.

Ubuntu 19.10, GTX 1080 Ti SLI, Python 3.7, CUDA 10.1

@alexw994

keras 2.2.4
tensorflow 1.13.1
it works for me.
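
The version reports in this thread can be summarized as a rough startup check. This is a sketch inferred only from the comments above, not an official compatibility matrix: `parse_version`, `likely_affected`, and the thresholds (TensorFlow 1.14 or newer combined with Keras older than 2.3.1) are assumptions, and at least one report above (a different error on 1.12/1.13.1) does not fit the pattern. Version strings are passed in explicitly so the snippet runs without either library installed.

```python
def parse_version(text):
    """Turn '1.14.0' into (1, 14, 0), stopping at any non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def likely_affected(keras_version, tf_version):
    """Heuristic based on thread reports only (assumed thresholds)."""
    tf_new = parse_version(tf_version) >= (1, 14)
    keras_old = parse_version(keras_version) < (2, 3, 1)
    return tf_new and keras_old


# Combinations reported in the thread.
reports = [("2.2.4", "1.13.1"), ("2.2.4", "1.15.0"),
           ("2.3.1", "1.14.0"), ("2.3.1", "1.15.0")]
results = [likely_affected(k, t) for k, t in reports]
for (keras_v, tf_v), flag in zip(reports, results):
    print(keras_v, tf_v, flag)
```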
