
[Bug]: bug (or feature (?)) in gather when gathering from several GPU-devices #1171

Closed
mrfh92 opened this issue Jun 23, 2023 · 2 comments
Labels
bug (Something isn't working) · communication · MPI (Anything related to MPI communication)

Comments


mrfh92 commented Jun 23, 2023

What happened?

When performing gather operations on torch tensors spread over several GPUs, the tensors in the resulting list are still on different devices (though, surprisingly, all on the same MPI rank); see the example below with 8 MPI processes on 2 nodes with 4 GPUs each.

This is actually the problem that causes `print` to fail when communication over GPU is required (see #1121, #1170): at the end of `_torch_data` the local data are gathered from all MPI ranks, and the concatenation via `torch.cat` fails because the gathered local tensors are still on different devices.
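
Below is a minimal sketch (not from the issue) of that failure mode and a possible workaround: moving every gathered piece onto the root rank's local device before concatenating. The helper `concat_gathered` and the `gathered` list are hypothetical stand-ins for what `_torch_data` receives from the gather call.

```python
# Hypothetical sketch: `gathered` stands in for the list that comm.gather
# returns on the root rank in the scenario described above.
import torch

def concat_gathered(gathered, device):
    # torch.cat raises "Expected all tensors to be on the same device" when the
    # list mixes cuda:0 ... cuda:3, so move every piece to one device first.
    return torch.cat([t.to(device) for t in gathered])

# On rank 0 of the example below:
#   local = a.larray                  # e.g. tensor([0., 1.], device='cuda:0')
#   gathered = a.comm.gather(local)   # pieces still live on cuda:0 ... cuda:3
#   result = concat_gathered(gathered, local.device)
```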

Code snippet triggering the error

```python
import heat as ht
import time

N = 2**4
a = ht.arange(N, dtype=ht.float32, device='gpu', split=0)

time.sleep(a.comm.rank / 100)  # just for nice output
print(a.comm.rank, a.larray)

time.sleep(1)
b = a.comm.gather(a.larray)

time.sleep(a.comm.rank / 100)  # just for nice output
print(a.comm.rank, b)
```

run with

```bash
#!/bin/bash

#SBATCH --account=******
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --gres=gpu:4

srun python test_print.py
```


### Error message or erroneous outcome

```shell
1 tensor([2., 3.], device='cuda:1')
1 None
2 tensor([4., 5.], device='cuda:2')
2 None
3 tensor([6., 7.], device='cuda:3')
3 None
5 tensor([10., 11.], device='cuda:1')
5 None
4 tensor([8., 9.], device='cuda:0')
4 None
6 tensor([12., 13.], device='cuda:2')
6 None
7 tensor([14., 15.], device='cuda:3')
7 None
0 tensor([0., 1.], device='cuda:0')
0 [tensor([0., 1.], device='cuda:0'), tensor([2., 3.], device='cuda:1'), tensor([4., 5.], device='cuda:2'), tensor([6., 7.], device='cuda:3'), tensor([8., 9.], device='cuda:0'), tensor([10., 11.], device='cuda:1'), tensor([12., 13.], device='cuda:2'), tensor([14., 15.], device='cuda:3')]
```

I would have expected all tensors in the list on process 0 to be on device 'cuda:0' of node 1, since (at least in my impression) that is the device where MPI process 0 should "live".



### Version

main (development branch)

### Python version

None

### PyTorch version

None

### MPI version

_No response_
@mrfh92 added the bug, communication, and MPI labels on Jun 23, 2023

mrfh92 commented Jun 26, 2023

This is due to the way CUDA-aware MPI works.

@mrfh92 closed this as completed on Jun 26, 2023

mrfh92 commented Jun 27, 2023

Interesting to know: the buffered (Heat-adapted) version `Gather` works as expected. Only the unbuffered version `gather` (which simply falls back to the mpi4py routine of the same name) causes this issue.
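
For comparison, here is a minimal sketch (not from the issue) of what the buffered path might look like, assuming Heat's `Gather` follows the mpi4py-style `Gather(sendbuf, recvbuf, root=0)` convention and accepts torch tensors as buffers; check `heat.core.communication` for the exact signature before relying on this.

```python
# Hypothetical sketch of the buffered path; the exact signature and buffer
# handling of Heat's Gather are assumptions, verify against heat.core.communication.
import heat as ht
import torch

N = 2**4
a = ht.arange(N, dtype=ht.float32, device='gpu', split=0)

# Receive buffer allocated on each rank's *local* device (only rank 0 actually
# uses it), sized for all process-local pieces; equal-sized chunks are assumed.
recvbuf = torch.empty(N, dtype=torch.float32, device=a.larray.device)

a.comm.Gather(a.larray, recvbuf, root=0)

if a.comm.rank == 0:
    # One contiguous tensor on rank 0's device, no mixed-device list.
    print(a.comm.rank, recvbuf)
```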
