Distributed MPI simulation: cudaErrorInvalidResourceHandle #77
Given the effort that you put into creating a minimal reproducible example, I would suggest that you contact support for this system. Generally, on a large system, the support team would be thrilled to see a ~10-line reproducer. Please let us know what they say (or if that path is not fruitful).
Hi @danlkv ... you're using Polaris, right? I believe this system runs SS11 with a special CUDA-aware MPICH. Referencing this. I'd make sure that the correct MPI implementation is being loaded by mpi4py. I'm making a few assumptions about which compute resource you're using, but keep us looped in with what their support teams say.
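A quick way to check that (a sketch, assuming a standard mpi4py installation) is to print both the MPI library that mpi4py was built against and the one actually loaded at runtime:

```python
# check_mpi.py -- sketch: verify which MPI implementation mpi4py is using
import mpi4py
from mpi4py import MPI

# Build-time configuration (compiler and MPI include/lib paths).
print(mpi4py.get_config())

# MPI library loaded at runtime; on Polaris this should report the
# CUDA-aware Cray MPICH build rather than a stock MPICH/Open MPI.
print(MPI.Get_library_version())
```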
Thanks for getting back to me. Yes, I'm using a single node of Polaris for these tests. I did verify that the data after allgather is correct. Does ipcOpenMemHandle rely on MPI?
@danlkv No, ipcOpenMemHandle does not rely on MPI. Can you build and run the standard CUDA IPC sample on Polaris? https://github.com/NVIDIA/cuda-samples/tree/master/Samples/0_Introduction/simpleIPC |
Thanks @DmitryLyakh. @danlkv, there are two things happening here: (1) opening the CUDA IPC memory handles (via cupy), and (2) exchanging data between processes with MPI.
(2) would only work if the MPI implementation loaded by mpi4py is CUDA aware (unless data is being moved to the host prior to the actual allgather call). Regarding (1), you'd need to confirm that all GPUs are actually peer accessible, and that there isn't an issue with your cupy installation.
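A minimal sketch of the peer-access check in (1), using the CUDA runtime bindings exposed by cupy.cuda.runtime:

```python
# peer_check.py -- sketch: confirm all visible GPUs are peer accessible
import cupy as cp

ndev = cp.cuda.runtime.getDeviceCount()
for i in range(ndev):
    for j in range(ndev):
        if i == j:
            continue
        # Returns 1 if device i can directly access memory on device j.
        ok = cp.cuda.runtime.deviceCanAccessPeer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access {'OK' if ok else 'NOT available'}")
```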
@DmitryLyakh I can build the sample using cudatoolkit 11.8, but not with older versions.
I did not previously use CUDA 11.8, since I'm not sure it is compatible with the device (the driver reports CUDA version 11.4). I checked the compute capability of the device, and it reports 8.0.
Does 8.0 mean that it will not be compatible with 8.9? Should I try to run everything with 11.8, or change the Makefile so it uses a lower compute capability? P.S. samples
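For reference, the compute capability can also be queried directly from Python with cupy (a sketch; the original check may have used deviceQuery or nvidia-smi instead):

```python
import cupy as cp

# Returns a string such as '80' for an A100 (compute capability 8.0).
print(cp.cuda.Device(0).compute_capability)
```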
If you run on an A100 GPU, you only need compute capability 8.0, and you can use CUDA 11.8. You do not need compute capability 8.9 in this case (you can just remove those flags). I would recommend trying with CUDA 11.8 regardless.
Update: the problem just magically disappeared and everything now works on all CUDA versions. I now use 11.4 and have checked it up to 64 nodes. I guess there was some problem on the ALCF side.
Thanks for keeping us posted.
I'm following the example for distributed multi-GPU simulation using MPI: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/custatevec/distributed_index_bit_swap_mpi.py
When I run it with
```
mpiexec -n 2 python distributed_index_bit_swap_mpi.py
```
I get the following stacktrace:
I also created a minimal reproducible example that gives the same error:
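(For illustration, a sketch of the kind of reproducer described here, following the pattern of the cuQuantum sample; this is not the exact code from the report. Each rank allocates a GPU buffer with cupy, exchanges CUDA IPC handles via an mpi4py allgather, and opens the remote handles.)

```python
# ipc_repro.py -- sketch of an mpi4py + cupy CUDA IPC handle exchange
# (illustrative, not the exact reproducer from the report)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One GPU per rank (the sample maps rank -> device the same way).
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# Allocate a small device buffer and export an IPC handle for it.
buf = cp.arange(16, dtype=cp.float64)
handle = cp.cuda.runtime.ipcGetMemHandle(buf.data.ptr)

# Lowercase allgather pickles the handles through the host,
# so this step does not require a CUDA-aware MPI build.
handles = comm.allgather(handle)

# Open every *remote* handle; opening the local one is expected to fail
# (cudaErrorDeviceUninitialized / "invalid device context").
for peer, h in enumerate(handles):
    if peer == rank:
        continue
    ptr = cp.cuda.runtime.ipcOpenMemHandle(h)
    print(f"rank {rank}: opened IPC handle from rank {peer} -> ptr {ptr:#x}")
    cp.cuda.runtime.ipcCloseMemHandle(ptr)
```

Run with, e.g., `mpiexec -n 2 python ipc_repro.py` on a node with at least two GPUs.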
Notably, calling ipcOpenMemHandle on the local handle raises cudaErrorDeviceUninitialized: invalid device context, similar to this issue.
My config is:
Appreciate any help!