
Distributed MPI simulation: cudaErrorInvalidResourceHandle #77

Closed
danlkv opened this issue Aug 7, 2023 · 9 comments


danlkv commented Aug 7, 2023

I'm following the example for distributed multi-GPU simulation using MPI: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/custatevec/distributed_index_bit_swap_mpi.py

When I run it with mpiexec -n 2 python distributed_index_bit_swap_mpi.py, I get the following stack trace:

  File "...distributed_index_bit_swap_mpi.py", line 266, in <module>
    run_distributed_index_bit_swaps(
  File "...distributed_index_bit_swap_mpi.py", line 166, in run_distributed_index_bit_swaps
    d_sub_sv_p2p = cp.cuda.runtime.ipcOpenMemHandle(dst_mem_handle)
  File "cupy_backends/cuda/api/runtime.pyx", line 456, in cupy_backends.cuda.api.runtime.ipcOpenMemHandle
  File "cupy_backends/cuda/api/runtime.pyx", line 462, in cupy_backends.cuda.api.runtime.ipcOpenMemHandle
  File "cupy_backends/cuda/api/runtime.pyx", line 143, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidResourceHandle: invalid resource handle
x3005c0s37b0n0.hsn.cm.polaris.alcf.anl.gov: rank 1 exited with code 1

I also created a minimal reproducible example that gives the same error:

import cupy as cp
from mpi4py import MPI

# Bind each MPI rank to its own GPU.
rank = MPI.COMM_WORLD.Get_rank()
cp.cuda.runtime.setDevice(rank)

# Allocate a device buffer and export an IPC handle for it.
X = cp.zeros(100)
ipc_mem_handle = cp.cuda.runtime.ipcGetMemHandle(X.data.ptr)

# This line will also raise (cudaErrorDeviceUninitialized, see below):
# local_open_handle = cp.cuda.runtime.ipcOpenMemHandle(ipc_mem_handle)

# Exchange handles between ranks and open the neighbor's handle.
ipc_mem_handles = MPI.COMM_WORLD.allgather(ipc_mem_handle)
other = (rank + 1) % MPI.COMM_WORLD.Get_size()
remote_handle = ipc_mem_handles[other]
remote_open_handle = cp.cuda.runtime.ipcOpenMemHandle(remote_handle)  # raises cudaErrorInvalidResourceHandle

Notably, calling ipcOpenMemHandle on the local handle raises cudaErrorDeviceUninitialized: invalid device context, similar to a previously reported issue.

My config is:

In [1]: import cupy as cp; cp.show_config()
OS                           : Linux-5.3.18-150300.59.115-default-x86_64-with-glibc2.31
Python Version               : 3.10.9
CuPy Version                 : 11.5.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.23.5
SciPy Version                : 1.11.1
Cython Build Version         : 0.29.32
Cython Runtime Version       : 0.29.33
CUDA Root                    : /soft/compilers/cudatoolkit/cuda-11.4.4
nvcc PATH                    : /soft/compilers/cudatoolkit/cuda-11.4.4/bin/nvcc
CUDA Build Version           : 11080
CUDA Driver Version          : 11040
CUDA Runtime Version         : 11040
cuBLAS Version               : (available)
cuFFT Version                : 10502
cuRAND Version               : 10205
cuSOLVER Version             : (11, 2, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (11, 4)
Thrust Version               : 101501
CUB Build Version            : 101501
Jitify Build Version         : 4a37de0
cuDNN Build Version          : 8700
cuDNN Version                : 8600
NCCL Build Version           : 21602
NCCL Runtime Version         : 21602
cuTENSOR Version             : 10700
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA A100-SXM4-40GB
Device 0 Compute Capability  : 80
Device 0 PCI Bus ID          : 0000:07:00.0
Device 1 Name                : NVIDIA A100-SXM4-40GB
Device 1 Compute Capability  : 80
Device 1 PCI Bus ID          : 0000:46:00.0
Device 2 Name                : NVIDIA A100-SXM4-40GB
Device 2 Compute Capability  : 80
Device 2 PCI Bus ID          : 0000:85:00.0
Device 3 Name                : NVIDIA A100-SXM4-40GB
Device 3 Compute Capability  : 80
Device 3 PCI Bus ID          : 0000:C7:00.0

Appreciate any help!

@jagunnels

Given the effort that you put into creating a minimal reproducible example, I would suggest that you contact support for this system. Generally, on a large system, the support team would be thrilled to see a ~10-line reproducer. Please let us know what they say (or if that path is not fruitful).

mtjrider (Collaborator) commented Aug 8, 2023

Hi @danlkv ... you're using Polaris, right? I believe this system runs SS11 with a special CUDA-aware MPICH. Referencing this.

I'd make sure that the correct MPI implementation is being loaded by mpi4py.
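For reference, a quick way to check this (standard mpi4py introspection calls, shown here just as a suggestion) is to print the build-time configuration and the runtime library version:

import mpi4py
from mpi4py import MPI

# Which MPI was mpi4py built against, and which library did it load at runtime?
print(mpi4py.get_config())            # build-time MPI compiler/library settings
print(MPI.Get_library_version())      # identification string of the loaded MPI library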

I'm making a few assumptions about which compute resource you're using, but keep us looped in with what their support teams say.

danlkv (Author) commented Aug 8, 2023

Thanks for getting back to me. Yes, I'm using a single node of Polaris for these tests. I did verify that the data after allgather is correct. Does ipcOpenMemHandle use MPI? I thought this was a custom Nvidia p2p communication protocol.

DmitryLyakh (Collaborator) commented Aug 8, 2023

@danlkv No, ipcOpenMemHandle does not rely on MPI. Can you build and run the standard CUDA IPC sample on Polaris? https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simpleIPC
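If building the samples is inconvenient, a rough Python-only sketch along the same lines (not the simpleIPC sample itself; it assumes at least two visible GPUs and a working CuPy install) can also exercise CUDA IPC between two processes without involving MPI:

import multiprocessing as mp

def import_handle(handle):
    # Importing process: open the handle exported by the other process.
    import cupy as cp
    cp.cuda.runtime.setDevice(1)
    ptr = cp.cuda.runtime.ipcOpenMemHandle(handle)
    print("ipcOpenMemHandle succeeded in child process")
    cp.cuda.runtime.ipcCloseMemHandle(ptr)

if __name__ == "__main__":
    import cupy as cp
    # Exporting process: allocate on GPU 0 and export an IPC handle.
    cp.cuda.runtime.setDevice(0)
    x = cp.zeros(100)
    handle = cp.cuda.runtime.ipcGetMemHandle(x.data.ptr)
    ctx = mp.get_context("spawn")  # spawn avoids forking an initialized CUDA context
    p = ctx.Process(target=import_handle, args=(handle,))
    p.start()
    p.join()
    print("child exit code:", p.exitcode)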

mtjrider (Collaborator) commented Aug 8, 2023

Thanks @DmitryLyakh

@danlkv there are two things happening:

  1. You're trying to create IPC handles.
  2. You're passing the IPC handles to MPI.

(2) would only work if the MPI implementation loaded by mpi4py is CUDA-aware (unless the data is moved to the host prior to the actual allgather collective).

Regarding (1), you'd need to confirm that all GPUs are actually peer accessible, and that there isn't an issue with your CuPy installation.
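For (1), a quick sketch (assuming CuPy itself imports and runs fine) to check pairwise peer accessibility between all visible GPUs could look like this:

import cupy as cp

# Report whether each GPU can directly access each other GPU's memory (P2P).
n = cp.cuda.runtime.getDeviceCount()
for dev in range(n):
    for peer in range(n):
        if dev != peer:
            ok = cp.cuda.runtime.deviceCanAccessPeer(dev, peer)
            print(f"GPU {dev} -> GPU {peer}: peer access {'yes' if ok else 'NO'}")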

danlkv (Author) commented Aug 11, 2023

@DmitryLyakh I can build the sample using cudatoolkit 11.8, but not with older versions:

nvcc fatal   : Unsupported gpu architecture 'compute_89'

I did not previously use CUDA 11.8, since I'm not sure it is compatible with the device (driver CUDA version 11.4). I checked the compute capability with the deviceQuery sample compiled with CUDA 11.8:

-> % ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A100 80GB PCIe"
  CUDA Driver Version / Runtime Version          11.4 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 80995 MBytes (84929216512 bytes)
[...]

Does 8.0 mean that it will not be compatible with 8.9? Should I try to run everything with 11.8, or change the Makefile so it uses a lower compute capability?

P.S. The samples' make doesn't look in PATH for nvcc, so I had to export CUDA_PATH to make it work.

@DmitryLyakh (Collaborator)

If you run on an A100 GPU, you only need compute capability 8.0, and you can use CUDA 11.8. You do not need compute capability 8.9 in this case (you can just remove those flags). I would recommend trying with CUDA 11.8 regardless.
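As a quick sanity check (a small sketch using CuPy, just to confirm the target architecture), the compute capability of each visible device can be printed with:

import cupy as cp

# Print the compute capability of every visible GPU; A100 reports '80'.
for i in range(cp.cuda.runtime.getDeviceCount()):
    print(f"GPU {i}: compute capability {cp.cuda.Device(i).compute_capability}")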

danlkv (Author) commented Aug 22, 2023

Update: the problem just magically disappeared and everything now works on all CUDA versions. I now use 11.4 and have checked it on up to 64 nodes. I guess there was some problem on the ALCF side.

danlkv closed this as completed Aug 22, 2023
@mtjrider (Collaborator)

> Update: the problem just magically disappeared and everything now works on all CUDA versions. I now use 11.4 and have checked it on up to 64 nodes. I guess there was some problem on the ALCF side.

Thanks for keeping us posted.
