NV50_P2P allocation class unimplemented in nvproxy #9827
Comments
Fixes #9827. PiperOrigin-RevId: 591875507
Let me know if #9828 solves this issue. Do you know if this will repro on a T4 as well? It's hard to get hold of an A100, but I will try again tomorrow.
I tried running the above-mentioned Docker image on an A100 with runsc; it segfaults and crashes with a different error:
The boot logs show that a different set of ioctls is unimplemented:
What Nvidia driver version is being used at Modal? I was testing on 525.105.17.
Here are the relevant logs from the segfault (it looks like a null pointer dereference): Boot logs
RIP = 0x7eb1f7de9ddb
So we need to look at offset
Using
@nixprime pointed out that I was looking at the objdump of the wrong file. He figured out the actual fault instruction:
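The general technique here (which only works if you disassemble the right file) is to subtract the library's load base from the faulting RIP and inspect the instruction at that file offset. A sketch with illustrative addresses — the BASE value below is assumed, not taken from the actual logs; in practice it comes from /proc/&lt;pid&gt;/maps or the sentry's mapping logs:

```shell
# Illustrative addresses: RIP from the fault report, BASE assumed to be
# the start of the library's executable mapping in /proc/<pid>/maps.
RIP=0x7eb1f7de9ddb
BASE=0x7eb1f7a00000   # assumed mapping base for this sketch
OFFSET=$(printf '0x%x' $(( RIP - BASE )))
echo "fault offset in file: $OFFSET"
# Then disassemble the *correct* file at that offset, e.g.:
#   objdump -d --start-address=$OFFSET --stop-address=$((OFFSET + 0x40)) libnccl.so.2
```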
Per the logs, #9833 fixes this issue, but now we get a different exception:
The logs also show an unimplemented control command. They also show 4 user faults happening at the same instruction:
So we need to look at
Updates #9827 PiperOrigin-RevId: 592873910
If libnccl.so.2:getHostHash() fails to fopen(/proc/sys/kernel/random/boot_id), it calls fclose(NULL) and takes SIGSEGV. Updates #9827 PiperOrigin-RevId: 592899854
This also adds a repro case for issue #9827, although it is commented out for now since it doesn't work yet. PiperOrigin-RevId: 592958681
Our driver version is still
Sorry I couldn't get a multi-GPU A100 VM to test the repro myself. Looks like it's thrown up a lot of things! Internally we populate
Do you know if the A100 GPU has 40GB memory or 80GB?
So the
Some investigation showed that we were over-allocating in tmpfs on page faults. See: gvisor/pkg/sentry/fsimpl/tmpfs/regular_file.go Lines 300 to 310 in 149350e
We were trying to allocate
So I updated tmpfs to only allocate
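The shape of that fix — allocating only the minimal page-aligned range that covers the fault rather than a larger hint range — can be sketched as below. Names and signatures are illustrative, not gVisor's actual tmpfs API:

```go
package main

import "fmt"

const pageSize = 4096

// alignDown/alignUp round an offset to page boundaries.
func alignDown(x uint64) uint64 { return x &^ (pageSize - 1) }
func alignUp(x uint64) uint64   { return alignDown(x + pageSize - 1) }

// requiredRange returns the minimal page-aligned range that must be
// backed to satisfy a fault on [start, end), ignoring any larger
// "optional" readahead range. This mirrors the idea of the tmpfs fix
// described above; the real code operates on memmap ranges.
func requiredRange(start, end uint64) (uint64, uint64) {
	return alignDown(start), alignUp(end)
}

func main() {
	// A fault touching bytes 8292..8392 only needs the pages
	// spanning [8192, 12288), not the whole file region.
	lo, hi := requiredRange(8192+100, 8192+200)
	fmt.Println(lo, hi) // 8192 12288
}
```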
…te(). Updates #9827 PiperOrigin-RevId: 593304736
Strace logs show an interesting pattern:
These files are created in
Hey, just got back from PTO. Thanks again for your investigation. Looks like I should incorporate the related fixes into our runtime and see what I get with the repro program.
This also adds a repro case for issue #9827, although it is commented out for now since it doesn't work yet. PiperOrigin-RevId: 595238934
I added a modified version of the
Can you assist with making it run for just a small amount of time under
I think
Updates #9827. PiperOrigin-RevId: 596766912
Per Jamie's findings, the reproducer should now work with gVisor (after increasing the /dev/shm size limit).
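For reference, Docker caps /dev/shm at 64MB by default, which is too small for NCCL's shared-memory transport. One way to raise the limit when running a reproducer container (the image name is illustrative):

```shell
# --shm-size raises the container's /dev/shm limit (Docker default: 64MB).
docker run --runtime=runsc --gpus all --shm-size=2g nccl-repro:latest
```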
There are TODOs still referencing this issue:
Search TODO
Thanks for all your great help here! Still meaning to look into https://github.com/google/gvisor/blob/HEAD/images/gpu/pytorch/issue_9827.py#L87
We tested this on A100s with 40GB and 80GB, and the merged fix works.
Revisiting the currently skipped test added during this issue's investigation. Could use your guidance @ayushr2 on how to get debug logs enabled for test executions. I've tried the typical way (daemon.json and reloading Docker) but that has not done anything. I read through the test code a bit and didn't see anything. I'm running the test on an A100 80GB SXM4 server and getting a SIGSEGV:
I usually look at the Makefile and Buildkite configuration to see how the
Lines 304 to 310 in a5b10b7
And see how this is invoked in Buildkite: gvisor/.buildkite/pipeline.yaml Lines 180 to 185 in a5b10b7
So for your use case, maybe comment out Lines 307-309 in the Makefile and invoke Line 141 in a5b10b7
To find the debug logs, see what
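For what it's worth, when tests do go through Docker's configured runtime, the usual way to enable runsc debug logs is via runtimeArgs in /etc/docker/daemon.json (the runsc path below is an example). If the test harness invokes runsc directly instead of through the Docker daemon, this config is bypassed, which could explain why editing daemon.json had no visible effect:

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--debug", "--debug-log=/tmp/runsc/"]
    }
  }
}
```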
Thanks a lot 🙏
Description
When running multi-GPU training on A100s, applications can attempt to use the unimplemented NV50_P2P allocation class. This presents as a 'mapping of buffer object failed' error. I've taken a look at the implementation of this allocation class and unfortunately it's non-trivial: https://github.dev/NVIDIA/open-gpu-kernel-modules/blob/4c29105335610933e744f4ab2524ea63fc39edaf/src/common/sdk/nvidia/inc/class/cl503b.h#L57
Opening this as a tracking issue.
Steps to reproduce
Dockerfile
This Dockerfile runs but I wasn't able to use it to reproduce the issue because I couldn't get an on-demand multi-GPU A100 VM in GCP 😓.
On Modal this script reproduces the issue most of the time. I was able to observe the 'unknown allocation class' logline by running this script with debug logs enabled and then jumping onto the worker it ran on to read them.
runsc version
docker version (if using docker)
No response
uname
No response
runsc debug logs (if available)
Omitted. Logs show 'unknown allocation class' for 0x0000503b, which is NV50_P2P.