
[Bug]: bug (or feature (?)) in gather when gathering from several GPU-devices #1171

Closed
mrfh92 opened this issue Jun 23, 2023 · 2 comments
Labels
bug (Something isn't working) · communication · MPI (Anything related to MPI communication)

Comments


mrfh92 commented Jun 23, 2023

What happened?

When performing gather operations on torch tensors spread over several GPUs, the tensors in the resulting list are still on different devices (though, surprisingly, all on the same MPI rank); see the example below with 8 MPI processes on 2 nodes with 4 GPUs each.

This is actually the problem that causes `print` to fail when communication over GPU is required (see #1121, #1170): at the end of `_torch_data` the local data are gathered from all MPI ranks, and the concatenation via `torch.cat` fails because the gathered local tensors are still on different devices.
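
Below is a minimal sketch (not from the issue) of that failure mode and a possible workaround: moving every gathered piece onto the root rank's local device before concatenating. The helper `concat_gathered` and the `gathered` list are hypothetical stand-ins for what `_torch_data` receives from the gather call.

```python
# Hypothetical sketch: `gathered` stands in for the list that comm.gather
# returns on the root rank in the scenario described above.
import torch

def concat_gathered(gathered, device):
    # torch.cat raises "Expected all tensors to be on the same device" when the
    # list mixes cuda:0 ... cuda:3, so move every piece to one device first.
    return torch.cat([t.to(device) for t in gathered])

# On rank 0 of the example below:
#   local = a.larray                  # e.g. tensor([0., 1.], device='cuda:0')
#   gathered = a.comm.gather(local)   # pieces still live on cuda:0 ... cuda:3
#   result = concat_gathered(gathered, local.device)
```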

Code snippet triggering the error

```python
import heat as ht
import time

N = 2**4
a = ht.arange(N, dtype=ht.float32, device='gpu', split=0)

time.sleep(a.comm.rank / 100)  # just for nice output
print(a.comm.rank, a.larray)

time.sleep(1)
b = a.comm.gather(a.larray)

time.sleep(a.comm.rank / 100)  # just for nice output
print(a.comm.rank, b)
```

run with

```bash
#!/bin/bash

#SBATCH --account=******
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --gres=gpu:4

srun python test_print.py
```


### Error message or erroneous outcome

```shell
1 tensor([2., 3.], device='cuda:1')
1 None
2 tensor([4., 5.], device='cuda:2')
2 None
3 tensor([6., 7.], device='cuda:3')
3 None
5 tensor([10., 11.], device='cuda:1')
5 None
4 tensor([8., 9.], device='cuda:0')
4 None
6 tensor([12., 13.], device='cuda:2')
6 None
7 tensor([14., 15.], device='cuda:3')
7 None
0 tensor([0., 1.], device='cuda:0')
0 [tensor([0., 1.], device='cuda:0'), tensor([2., 3.], device='cuda:1'), tensor([4., 5.], device='cuda:2'), tensor([6., 7.], device='cuda:3'), tensor([8., 9.], device='cuda:0'), tensor([10., 11.], device='cuda:1'), tensor([12., 13.], device='cuda:2'), tensor([14., 15.], device='cuda:3')]
```

I would have expected all tensors in the list on process 0 to be on device 'cuda:0' of node 1, since (at least in my impression) that is the device where MPI process 0 should "live".



### Version

main (development branch)

### Python version

None

### PyTorch version

None

### MPI version

_No response_
@mrfh92 added the bug, communication, and MPI labels on Jun 23, 2023

mrfh92 commented Jun 26, 2023

This is due to the way CUDA-aware MPI works.

@mrfh92 closed this as completed on Jun 26, 2023

mrfh92 commented Jun 27, 2023

Interesting to know: the buffered (Heat-adapted) version `Gather` works as expected. Only the unbuffered version `gather` (which simply falls back to the mpi4py routine of the same name) causes this issue.
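
For comparison, here is a minimal sketch (not from the issue) of what the buffered path might look like, assuming Heat's `Gather` follows the mpi4py-style `Gather(sendbuf, recvbuf, root=0)` convention and accepts torch tensors as buffers; check `heat.core.communication` for the exact signature before relying on this.

```python
# Hypothetical sketch of the buffered path; the exact signature and buffer
# handling of Heat's Gather are assumptions, verify against heat.core.communication.
import heat as ht
import torch

N = 2**4
a = ht.arange(N, dtype=ht.float32, device='gpu', split=0)

# Receive buffer allocated on each rank's *local* device (only rank 0 actually
# uses it), sized for all process-local pieces; equal-sized chunks are assumed.
recvbuf = torch.empty(N, dtype=torch.float32, device=a.larray.device)

a.comm.Gather(a.larray, recvbuf, root=0)

if a.comm.rank == 0:
    # One contiguous tensor on rank 0's device, no mixed-device list.
    print(a.comm.rank, recvbuf)
```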
