feat: Support running createContainer hooks in CDI spec #13034
LandonTClipp wants to merge 1 commit
Conversation
@ayushr2 FYI. This should be better than my last attempt because it doesn't touch the old NVIDIA_VISIBLE_DEVICES codepath and instead relies on CDI spec files defining the hooks to run. I still suspect that a config parameter enabling this behavior would be desired, so let me know how you want to proceed so that this does not introduce backwards-incompatible changes to your system.
I don't understand this part and why the special-casing is needed. So running it normally and not special-casing it like it is done now should work...
I think if we take the approach of adding general support for running createContainer hooks...
Here is some more context as to why the naive approach fails. I tried executing the hooks in the gofer right after the /dev setup:

```go
// Set up /dev directory if needed.
if devIoFD >= 0 {
	if err := SetupDev(spec, conf, root, procPath); err != nil {
		util.Fatalf("error setting up /dev: %v", err)
	}
}
if spec.Hooks != nil {
	state := specs.State{
		Version:     specs.Version,
		ID:          containerID,
		Status:      specs.StateCreating,
		Pid:         0,
		Bundle:      bundleDir,
		Annotations: spec.Annotations,
	}
	if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
		util.Fatalf("error executing CreateContainer hooks: %v", err)
	}
}
```

This fails to run, and the gofer emits an error. This is because pivot_root(2) requires the new root to be a mount point, and runc satisfies that by bind-mounting the rootfs onto itself.
The gVisor gofer, on the other hand, does not bind-mount the rootfs onto itself, so spec.Root.Path is a plain directory when the hooks run. I ran an experiment to prove this: I bind-mount the rootfs onto itself before executing the hooks:

```go
if spec.Hooks != nil {
	origRoot := spec.Root.Path
	if err := unix.Mount(origRoot, origRoot, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		log.Warningf("PoC: failed to self-bind-mount rootfs %q: %v", origRoot, err)
	} else {
		log.Infof("PoC: self-bind-mounted %q; pivot_root in hooks should now succeed", origRoot)
		defer unix.Unmount(origRoot, unix.MNT_DETACH)
	}
	state := specs.State{
		Version:     specs.Version,
		ID:          containerID,
		Status:      specs.StateCreating,
		Pid:         0,
		Bundle:      bundleDir,
		Annotations: spec.Annotations,
	}
	if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
		util.Fatalf("error executing CreateContainer hooks: %v", err)
	}
}
```

My pod now starts successfully. But nvidia-smi doesn't work, and you can see the ldconfig step didn't produce the updated cache. This is why my hacky method in the earlier PR went the route it did.

TL;DR

I think the core problem can be distilled down to the fact that gVisor bind-mounts CDI files at /proc/fs/root, not at spec.Root.Path. The root of the problem is a mismatch between what nvidia-ctk expects (a rootfs at spec.Root.Path with CDI mounts attached) and what gVisor provides (CDI mounts attached to /proc/fs/root, not to spec.Root.Path). I hope that long explanation makes some sort of sense.

Options

There are a few routes we can take to fix this:
Options 2 and 3 would preserve the OCI semantics and let the hooks operate on the rootfs they expect.
The startup sequence of the gofer is fairly complex.
But I wonder if what we can do is the following:
RE option (2): The hook takes in the container state...
Not to add too much noise here, but I was wondering if there's a way to avoid special-casing the hooks. If we self-bind-mount the rootfs, wouldn't the hooks' pivot_root(2) just work?
Yes, this is what Landon is proposing in option (2).
Yes, this is possible too.
Yes. I will look into the route you described. I think that's the correct way of doing it, since runc also operates on the rootfs according to the bundle.
I discussed more with @avagin. He pointed out a legitimate use case: the container rootfs provided in the OCI spec is read-only from the beginning and the spec doesn't contain any mounts. In this case, the current implementation will not create any new directories/files inside the container rootfs, but my proposal will fail while trying to create them. What @shayonj described in the second paragraph of his comment seems more correct than my proposal. Maybe that is the path to pursue. @avagin, thoughts?
This implements the idea @shayonj suggested in google#13034 (comment). We do all of the CDI mounts and createContainer hooks inside of spec.Root.Path instead of /proc/fs/root. This should allow hooks which need to pivot_root(2) to successfully run AFTER the CDI mounts have been performed.
I implemented @shayonj's idea, but I'm on a plane and transferring binaries around is hard, so I'll do some final testing maybe tomorrow. I think it will work.
@ayushr2 @shayonj I confirmed the current changes work. @shayonj's idea turned out to be way simpler than I expected. Please take a look; I think this might be the one!
shayonj left a comment:
Looking good, just some minor comments from my review if it's useful to you.
@shayonj please see the most recent set of changes. Highlights:
I don't yet know how we can solve createContainer hooks on a read-only rootfs, and I don't have a system I can test this on, so I'm happy with implementing it only for lisafs.
shayonj left a comment:
Looking good to me, just some minor comments. Perhaps the gVisor team can help with spotting anything else and seeing it through 🙏🏾
great work!
@ayushr2 please take another look at your earliest convenience.
Reviewing! Could you please squash the commits? Copybara doesn't have the ability to squash-and-merge yet, so all commits from the PR are applied. We want to keep the master branch clean.
```go
flags := uint32(unix.MS_SLAVE | unix.MS_REC)
if spec.Linux != nil && spec.Linux.RootfsPropagation != "" {
	flags = specutils.PropOptionsToFlags([]string{spec.Linux.RootfsPropagation})
}
if err := specutils.SafeMount("", goferRootFs, "", uintptr(flags), "", procPath); err != nil {
	return fmt.Errorf("mounting root (%q) with flags: %#x, err: %v", goferRootFs, flags, err)
}
```
OK, the rootfs propagation flags are a bit tricky. See code pointers from the runc implementation: [1] (prepareRoot()) and [2] (the logic after pivot_root(2)).
My reading is that runc does the following:

1. Sets all mounts in `/` recursively as `MS_SLAVE` (unless `spec.Linux.RootfsPropagation` is specially configured; note that it can be configured with `MS_SHARED` too).
2. Sets the mount point containing `spec.Root.Path` as `MS_PRIVATE` (or `MS_SLAVE` if `spec.Linux.RootfsPropagation` contains that flag). This is important for the mentioned reasons.
3. Remounts rootfs onto itself.
4. Then does all the bind-mounts inside `spec.Root.Path`.
5. pivot_root.
6. Then updates the rootfs propagation flags back to `spec.Linux.RootfsPropagation` (in case it contained something like `MS_SHARED`).
This dance ensures that the bind mounts in rootfs are never propagated to the host mount namespace, even if spec.Linux.RootfsPropagation contains MS_SHARED. And at the end, the mount propagation across all mounts is as the user requested.
In gVisor's case, we don't support MS_SHARED at all. Mount operations inside the sandbox are not mirrored on the host. We validate that the OCI spec does not contain MS_SHARED.
Before this change, the steps we were taking were:

1. Set all mounts in `/` recursively as `MS_SLAVE`.
2. Bind-mount the container rootfs into `/proc/fs/root`.
3. Remount rootfs based on `spec.Linux.RootfsPropagation`.
4. Do all the bind-mounts inside `/proc/fs/root`.
Per the current state of the PR, the steps we take are:

1. Set all mounts in `/` recursively as `MS_SLAVE`.
2. Do all the bind-mounts inside `spec.Root.Path`.
3. Remount rootfs onto itself using `MS_BIND|MS_REC`.
4. Bind-mount the container rootfs recursively into `/proc/fs/root`.
5. Remount `/proc/fs/root` based on `spec.Linux.RootfsPropagation`.
By executing the container bind mounts before bind-mounting the rootfs onto itself with the recursive flag (MS_REC), you are cloning the entire populated mount tree and stacking it directly on top of itself. This doubles the number of mount objects in the kernel for that tree.
The order must be swapped to match runc: you must make spec.Root.Path a mount point first (Step 3), and then perform the bind mounts inside of it (Step 2). We can also do this self-bind-mount unconditionally (similar to runc), citing the reason that createContainer hooks might pivot_root(2) and that runc does it this way. (Although runc self-bind-mounts for a different reason: pivot_root(2) into spec.Root.Path requires it to be a mount point.)
[1] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L1064-L1115
[2] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L239-L249
I took a few minutes to grok what you are saying, and this is my understanding of the issue:

1. `containerRootFs` is a regular directory. We call SetupMounts to bind-mount the mounts from the OCI spec into here. There are N mounts added.
2. We self-bind-mount `containerRootFs` onto itself, which replicates the mounts added in step 1. We're now at 2*N mounts.
3. We bind-mount `containerRootFs -> goferRootFs`, which again replicates the mounts originating from step 1. We're now at 3*N mounts.

If I'm interpreting that correctly, then I understand the need to unconditionally self-bind-mount containerRootFs before we call SetupMounts. I'll change that, great callout!
I guess I gave way more detail than was required. I started writing this comment believing there was a mount propagation flag bug being introduced, but the fact that we don't support MS_SHARED changed that. I left all of that context in case it helps other reviewers find any mount propagation flag bugs introduced here.
I am trying to be very thorough with the changes here. These code paths are critical to gVisor's security architecture, so the margin for error is low.
As you pointed out, we are still doubling the mounts when we bind-mount containerRootFs -> goferRootFs. But after pivot_root(2), those underlying bind mounts from containerRootFs should be released by the host kernel (because runsc/cmd/sandboxsetup/fs.go:PivotRoot() unmounts the old_root).
I do think that there is still one bug remaining regarding the mount propagation flags. Before this PR, RootfsPropagation flags were applied before bind-mounts were created in rootfs. Now RootfsPropagation is applied after bind-mounts are applied in rootfs.
If RootfsPropagation contains MS_PRIVATE, then only the top-level /proc/fs/root mount becomes private, while all N sub-mounts remain MS_SLAVE (which they inherited from the rootfs mountpoint at the time of the bind-mount). Before this change, the rootfs mountpoint would have been MS_PRIVATE and all the bind mounts would have inherited that instead. So we need to move the rootfs propagation logic after the self-bind-mount.
I definitely appreciate the attention you are giving here, I also recognize how sensitive this codepath is so I appreciate you taking care with it. I am mostly unfamiliar with the context here so I rely on your input!
I don't know if it's worth creating some kind of test that asserts the mount points have the attributes we expect at specific points in the code. We could create hook points in this function that let tests assert certain properties at specific points in the execution. Otherwise we are having an academic discussion without concrete proof of what's happening.
I'll make the change you specified here.
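A hook point like that could assert on the propagation state the kernel itself reports. As a sketch (my own helper, not existing gVisor code), the optional fields of /proc/self/mountinfo expose exactly this:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagationOf returns the propagation-related optional fields
// ("shared:N", "master:N", "propagate_from:N") for the mount at mountPoint,
// as listed in /proc/self/mountinfo. An empty result means the mount is
// private. Hypothetical helper for test assertions.
func propagationOf(mountPoint string) ([]string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Field 5 (index 4) is the mount point; optional fields run from
		// index 6 up to the "-" separator.
		if len(fields) < 7 || fields[4] != mountPoint {
			continue
		}
		var opts []string
		for _, fld := range fields[6:] {
			if fld == "-" {
				break
			}
			opts = append(opts, fld)
		}
		return opts, sc.Err()
	}
	return nil, fmt.Errorf("mount point %q not found", mountPoint)
}

func main() {
	opts, err := propagationOf("/")
	fmt.Println(opts, err) // e.g. [shared:1] <nil>, or [] <nil> when private
}
```

A test hook could call this after each mount step and assert, say, that every sub-mount under the rootfs carries a `master:` tag (MS_SLAVE) and no `shared:` tag.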
The latest commit contains the changes you requested. One thing worth noting:
The propagation-flag block was previously inside if rootfsConf.ShouldUseLisafs(). After moving it up, it now also runs in the non-lisafs path (EROFS etc., where containerRootFs == goferRootFs == /proc/fs/root). Before this PR the non-lisafs path never honored spec.Linux.RootfsPropagation at all. Applying it is arguably more correct, but it is a behavior change for that path. I just want to confirm that applying it uniformly in both cases is okay with you.
I agree that this is more correct and we should do this. Thanks for the heads up. Also thanks for working patiently through all the iterations. I think this is the right solution. I hope you have tested the latest changes with your CDI hook reproducer?
I will do one final test with all of the changes tomorrow then hopefully we should be good to go. I'll trust that you guys have a testing mechanism for EROFS since I have no way to check that. I'll get back to you tomorrow. Thanks!
ayushr2 left a comment:
LGTM! Could you squash your commits?
Pulling this in and running all tests.
With the latest changes, I'm able to confirm this works on our systems.
Description
------------
This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.
While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.
gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.
How it Works
------------
In gofer_mount.go, the code is changed to make explicit which path is
the containerRootFs (usually under /var/lib/.../root) and which is the
goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.
This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.
Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed; solving that is left for
another day.
Result
-------
I ran this on an H200 system and confirmed both nvidia-smi:
```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 Off | N/A Off | 0 |
| N/A 27C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#
```
And CUDA vectoradd:
```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
both work.
This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.
FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support f3bd6c0
PiperOrigin-RevId: 914666567
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Description
------------
This commit adds the ability for gVisor to run the createContainer hooks defined in CDI spec files. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.
While injecting the device nodes (cdevs) and client libraries already worked with gVisor, the createContainer hooks that create the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and update the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus had no way to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` device nodes. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.
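For illustration, a trimmed CDI spec of the kind the plugin generates might look like the following. All paths, versions, and device names here are placeholders, not output from a real system:

```json
{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [
      { "path": "/dev/nvidiactl" }
    ],
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
        "options": ["ro", "nosuid", "nodev", "bind"]
      }
    ],
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "create-symlinks",
                 "--link", "libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so"]
      },
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "update-ldcache"]
      }
    ]
  }
}
```

The `hooks` entries are what this PR teaches gVisor to execute; the `deviceNodes` and `mounts` edits were already handled.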
gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path that does not use CDI at all.
How it Works
------------
In gofer_mount.go, the code now explicitly distinguishes between the
containerRootFs (usually under /var/lib/.../root) and the goferRootFs
(/proc/fs). The issue with nvidia-ctk hooks was that they would
pivot_root(2) into the containerRootFs while gVisor operated under the
goferRootFs, so nvidia-ctk did not see any of the CDI devices, which
were never mounted into the containerRootFs.
This commit changes gVisor such that all device and filesystem setup is
done under the containerRootFs. We then bind-mount containerRootFs into
goferRootFs after running the createContainer hooks. The gofer then
pivot_root(2)s into the goferRootFs as before.
Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. Solving this is left for
future work.
Result
-------
I ran this on an H200 system and confirmed both nvidia-smi:
```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 Off | N/A Off | 0 |
| N/A 27C P0 76W / 700W | 0MiB / 143771MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#
```
And CUDA vectoradd:
```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
both work.
This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_VISIBLE_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.
FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84
PiperOrigin-RevId: 914666567