
[Bug]: convolve with distributed kernel on multiple GPUs #1085

Closed
mtar opened this issue Feb 2, 2023 · 3 comments · Fixed by #1095
Comments

mtar (Collaborator) commented Feb 2, 2023

What happened?

convolve does not work when the kernel is distributed and more than one GPU is available.

Code snippet triggering the error

import heat as ht

# both signal and kernel are split across processes and placed on the GPU
dis_signal = ht.arange(0, 16, split=0, device='gpu', dtype=ht.int)
dis_kernel_odd = ht.ones(3, split=0, dtype=ht.int, device='gpu')
conv = ht.convolve(dis_signal, dis_kernel_odd, mode='full')

Error message or erroneous outcome

$ CUDA_VISIBLE_DEVICES=0,1,2,3 srun --ntasks=2 -l python test.py 
1:Traceback (most recent call last):
1:   File ".../test.py", line 7, in <module>
1:     conv = ht.convolve(dis_signal, dis_kernel_odd, mode='full')
1:   File ".../heat-venv_2023/lib/python3.10/site-packages/heat/core/signal.py", line 161, in convolve
1:     local_signal_filtered = fc.conv1d(signal, t_v1)
1: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

Version

main (development branch)

Python version

3.10

PyTorch version

1.12

MPI version

OpenMPI 4.1.4
mtar added the bug, devices, signal processing, redistribution, and communication labels and removed the devices and redistribution labels · Feb 2, 2023
mtar (Collaborator, Author) commented Feb 3, 2023

rec_v = v.comm.bcast(t_v, root=r)

This is causing the issue. Unlike our own implementation, bcast keeps the tensor's device across processes: on rank 1 the broadcast kernel still lives on cuda:0 while the local signal chunk is on cuda:1.
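
For illustration, a minimal sketch of the two broadcast flavours using mpi4py and PyTorch directly rather than heat's comm wrapper. The per-rank device assignment and variable names are assumptions, and the GPU Bcast path additionally assumes a CUDA-aware MPI build with mpi4py buffer support for CUDA tensors:

from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# illustrative round-robin device assignment, one GPU per rank
local_device = torch.device("cuda", rank % torch.cuda.device_count())

# pickle-based bcast: the tensor is serialized together with its device index,
# so every receiver reconstructs it on the *sender's* device (cuda:0), which
# then clashes with a local signal chunk on cuda:1, cuda:2, ...
kernel = torch.ones(3, device=local_device) if rank == 0 else None
kernel = comm.bcast(kernel, root=0)   # kernel.device is cuda:0 on every rank

# buffer-based Bcast: each rank allocates the receive buffer on its own device
# and MPI only fills in the values, so the data stays on the local GPU
recv = torch.ones(3, device=local_device) if rank == 0 else torch.empty(3, device=local_device)
comm.Bcast(recv, root=0)              # recv.device is the rank-local GPU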

shahpratham (Collaborator) commented
@mtar should I use 'Bcast' then, as mentioned here in #790?

mtar (Collaborator, Author) commented Feb 6, 2023

> @mtar should I use 'Bcast' then, as mentioned here in #790?

Yes, that should fix it.
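
For reference, a hedged sketch of what that fix could look like around the broadcast loop from the traceback. The names t_v, r, and v.comm are taken from the snippet above; the buffer allocation and the assumption that chunk shapes agree across ranks are mine, and the actual change is in #1095:

import torch

# sketch only, not the merged code: use the buffer-based Bcast so the received
# kernel chunk stays on each rank's own device
if v.comm.rank == r:
    rec_v = t_v.clone()            # clone first: Bcast needs a contiguous, writable buffer
else:
    rec_v = torch.empty_like(t_v)  # assumption: chunk shapes match across ranks
v.comm.Bcast(rec_v, root=r)        # in-place fill; rec_v keeps the local device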

mtar added a commit that referenced this issue Feb 27, 2023
* changed to Bcast

* clone tensor before Bcast


Co-authored-by: Michael Tarnawa <m.tarnawa@fz-juelich.de>