
[Bug]: heat.print fails if communication over GPUs is required #1121

Closed
ClaudiaComito opened this issue Mar 15, 2023 · 2 comments · Fixed by #1170
Labels: bug (Something isn't working), HW:CUDA

Comments

@ClaudiaComito (Contributor)

What happened?

Printing a large array that is distributed across GPUs fails whenever inter-process communication is required to assemble the output.

Code snippet triggering the error

import heat

N = 2**10
a = heat.arange(N, dtype=heat.float32, device='gpu', split=0)
print(a)

Error message or erroneous outcome

> srun --ntasks=2 python test_gpu_qa.py

Traceback (most recent call last):
  File "/p/project/haf/users/comito1/devel/heat/test_gpu_qa.py", line 9, in <module>
    print("BEFORE reshape: ", a)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/dndarray.py", line 1806, in __str__
    return printing.__str__(self)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "/p/project/haf/users/comito1/devel/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
[1678880909.786307] [hdfmlc03:18341:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x6902e00 was not matched
[1678880909.786324] [hdfmlc03:18341:0]       tag_match.c:62   UCX  WARN  unexpected tag-receive descriptor 0x14780c86eb50 was not matched
srun: error: hdfmlc03: task 0: Exited with exit code 1
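
For context, the failing call in heat/core/printing.py mixes devices: the local data tensor lives on the GPU, while torch.arange(end) allocates its index tensor on the CPU by default. A minimal standalone PyTorch sketch (assuming a CUDA device is available; this is an illustration, not Heat code) reproduces the same RuntimeError:

import torch

# Standalone reproduction of the device mismatch seen above (illustration only).
data = torch.arange(1024, dtype=torch.float32, device="cuda")  # local chunk on the GPU
index = torch.arange(6)                                        # allocated on the CPU by default
torch.index_select(data, 0, index)  # RuntimeError: Expected all tensors to be on the same device ...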

Version

1.2.x

Python version

3.9

PyTorch version

1.12

MPI version

OpenMPI 4.1.2
ClaudiaComito added the bug (Something isn't working), HW:CUDA, HW:HPC, and MPI (Anything related to MPI communication) labels, then removed the HW:HPC and MPI labels, on Mar 15, 2023
mrfh92 self-assigned this on Jun 23, 2023
@mrfh92 (Collaborator) commented Jun 23, 2023

I could reproduce the error on 1 node with 4 GPUs as well:

Traceback (most recent call last):
  File "test_print.py", line 5, in <module>
    print(a)
  File "************/heat/heat/core/dndarray.py", line 1821, in __str__
    return printing.__str__(self)
  File "************/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "************/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "************/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)



srun: error: **************: task 0: Exited with exit code 1

@mrfh92 (Collaborator) commented Jun 23, 2023

Two smaller errors have already been corrected; see the draft in #1170.
The main problem, however, is due to #1171 (see also the corresponding comment in #1170).
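
For reference, one way to avoid the device mismatch in _torch_data is to allocate the index tensor on the same device as the local data before calling index_select. The sketch below only illustrates that idea and is not necessarily the change merged in #1170; the helper name is made up for illustration.

import torch

def _select_edge_items(data, dim, end):
    # Hypothetical helper: build the index on data.device so that
    # torch.index_select never mixes a CUDA tensor with a CPU index tensor.
    index = torch.arange(end, device=data.device)
    return torch.index_select(data, dim, index)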
