
cudart.cudaSetDevice allocates memory on GPU other than target #20

Closed
QuiteAFoxtrot opened this issue Mar 29, 2022 · 3 comments

QuiteAFoxtrot commented Mar 29, 2022

cuda-python 11.6.1
cuda toolkit 11.2
Ubuntu Linux

If you run something like the following on a multi-GPU machine

from cuda import cuda, cudart  # cuda-python bindings

device_num = 5                                  # target GPU ordinal
err, = cuda.cuInit(0)
err, device = cuda.cuDeviceGet(device_num)      # ordinal -> CUdevice handle
err, cuda_context = cuda.cuCtxCreate(0, device)
err, = cudart.cudaSetDevice(device)

The call to cudart.cudaSetDevice will correctly set your device to 5, but it will also allocate ~305 MB of memory on device 0 (or whichever device is listed first in CUDA_VISIBLE_DEVICES). I suspect this issue (possibly in the underlying C CUDA runtime?) may be the root of many downstream problems in libraries like TensorFlow and PyTorch, which show similar behavior: a user selects one device but still gets large allocations on others. 305 MB may not sound like much, but I'm running a program on an NVIDIA DGX with 16 GPUs and 64 worker processes, so 64 * 305 MB ≈ 19 GB of unusable space gets allocated on GPU 0, which crashes the program. I cannot simply set CUDA_VISIBLE_DEVICES to work around this, because the workers communicate with their parent process via shared GPU memory (CUDA IPC memory handles), and the parent process needs access to all GPUs. Additionally, the worker processes perform data augmentation on one GPU while writing output to another GPU with a different device ID.

I am investigating a workaround that avoids calling cudart.cudaSetDevice entirely, but when it is not called I cannot use the pointer returned by cuda.cuMemAlloc to create a PyTorch tensor. When I do call cudart.cudaSetDevice, the pointer works as expected.
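Every cuda-python call in the snippet above returns an (err, …) tuple. A small helper can centralize the error checking; this is a hypothetical convenience sketch (the name unwrap is not part of cuda-python), assuming the error code converts to int and that 0 means success, which holds for both CUresult.CUDA_SUCCESS and cudaError_t.cudaSuccess:

```python
def unwrap(result):
    """Split a cuda-python (err, *values) result tuple and raise on failure.

    Hypothetical helper: assumes the error enum converts to int and that
    0 is the success value (true for CUDA_SUCCESS and cudaSuccess).
    """
    err, *values = result
    if int(err) != 0:
        raise RuntimeError(f"CUDA call failed with error code {int(err)}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)

# Usage sketch (requires a GPU; mirrors the snippet above):
#   unwrap(cuda.cuInit(0))
#   device = unwrap(cuda.cuDeviceGet(device_num))
#   cuda_context = unwrap(cuda.cuCtxCreate(0, device))
```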

@vzhurba01
Collaborator

Thanks for the report! I've pushed fix a6511d5 to the main branch; it can be installed from source.

The PyPI/Conda packages will receive the fix in the next release.

FYI, in your code snippet you likely want to pass device_num instead of device to cudaSetDevice: device_num is a device ordinal, whereas device is a device handle, and their integer representations may not always match. (In fact, cudaSetDevice internally calls cuDeviceGet on the passed ordinal to obtain a device handle.)

@QuiteAFoxtrot
Author

Great, thank you! A question regarding your device_num advice: does it hold up under CUDA_VISIBLE_DEVICES? For example, if CUDA_VISIBLE_DEVICES=4,5,6,7 and I set device_num=2, that presumably gives me device 6. So if I passed a "2" into cuda.cuCtxCreate, would that also map correctly to device 6?

I've also noticed some strange behavior with cuda.cuDeviceCanAccessPeer: if you pass the same device twice, it reports that access is not possible (which I read as: a device can't map its own memory?). Is that intentional? If so, would you like me to open another issue?


vzhurba01 commented Apr 1, 2022

does that hold up under CUDA_VISIBLE_DEVICES?

It holds up. Your example works exactly as you describe.
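The logical-to-physical mapping under CUDA_VISIBLE_DEVICES can be sketched in plain Python. This is a hypothetical illustration (the helper physical_ordinal is not part of any CUDA API), assuming the variable contains plain comma-separated integer indices rather than GPU UUIDs:

```python
import os

def physical_ordinal(logical, visible_devices=None):
    """Map a logical device ordinal (what cuDeviceGet / cudaSetDevice see)
    to the physical GPU index, given a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration; assumes comma-separated
    integer indices (UUID entries are not handled).
    """
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible_devices:
        return logical  # no masking: logical and physical ordinals coincide
    order = [int(tok) for tok in visible_devices.split(",")]
    return order[logical]

# With CUDA_VISIBLE_DEVICES=4,5,6,7, logical ordinal 2 is physical GPU 6,
# matching the example in the question above.
```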

which I interpret as, it can't map its own memory?

That's expected; peer-to-self access is disallowed.
