
[Bug]: heat.print fails if communication over GPUs is required #1121

Closed
ClaudiaComito opened this issue Mar 15, 2023 · 2 comments · Fixed by #1170
Labels: bug (Something isn't working), HW:CUDA

Comments

@ClaudiaComito (Contributor)

What happened?

Printing a large array that is distributed across GPUs fails whenever inter-process communication is required to assemble the output.

Code snippet triggering the error

import heat

N = 2**10
a = heat.arange(N, dtype=heat.float32, device='gpu', split=0)
print(a)

Error message or erroneous outcome

> srun --ntasks=2 python test_gpu_qa.py

Traceback (most recent call last):
  File "/p/project/haf/users/comito1/devel/heat/test_gpu_qa.py", line 9, in <module>
    print("BEFORE reshape: ", a)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/dndarray.py", line 1806, in __str__
    return printing.__str__(self)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
[1678880909.786307] [hdfmlc03:18341:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x6902e00 was not matched
[1678880909.786324] [hdfmlc03:18341:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x14780c86eb50 was not matched
srun: error: hdfmlc03: task 0: Exited with exit code 1
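
For context, the failing call in heat/core/printing.py mixes devices: the local data tensor lives on the GPU, while torch.arange(end) allocates its index tensor on the CPU by default. A minimal standalone PyTorch sketch (assuming a CUDA device is available; this is an illustration, not Heat code) reproduces the same RuntimeError:

import torch

# Standalone reproduction of the device mismatch seen above (illustration only).
data = torch.arange(1024, dtype=torch.float32, device="cuda")  # local chunk on the GPU
index = torch.arange(6)                                        # allocated on the CPU by default
torch.index_select(data, 0, index)  # RuntimeError: Expected all tensors to be on the same device ...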

Version

1.2.x

Python version

3.9

PyTorch version

1.12

MPI version

OpenMPI 4.1.2
ClaudiaComito added the bug (Something isn't working), HW:CUDA, HW:HPC, and MPI (Anything related to MPI communication) labels, then removed the HW:HPC and MPI labels, on Mar 15, 2023
mrfh92 self-assigned this on Jun 23, 2023
@mrfh92 (Collaborator) commented Jun 23, 2023

I could reproduce the error on 1 node with 4 GPUs as well:

Traceback (most recent call last):
  File "test_print.py", line 5, in <module>
    print(a)
  File "************/heat/heat/core/dndarray.py", line 1821, in __str__
    return printing.__str__(self)
  File "************/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "************/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "************/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)



srun: error: **************: task 0: Exited with exit code 1

@mrfh92 (Collaborator) commented Jun 23, 2023

Two smaller errors have already been corrected; see the draft in #1170.
The main problem, however, is due to #1171 (see also the corresponding comment in #1170).
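
For reference, one way to avoid the device mismatch in _torch_data is to allocate the index tensor on the same device as the local data before calling index_select. The sketch below only illustrates that idea and is not necessarily the change merged in #1170; the helper name is made up for illustration.

import torch

def _select_edge_items(data, dim, end):
    # Hypothetical helper: build the index on data.device so that
    # torch.index_select never mixes a CUDA tensor with a CPU index tensor.
    index = torch.arange(end, device=data.device)
    return torch.index_select(data, dim, index)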
