feat: Support running createContainer hooks in CDI spec #13034

Open
LandonTClipp wants to merge 1 commit into google:master from LandonTClipp:k8s-device-plugin-support

Conversation

LandonTClipp commented Apr 28, 2026

Description

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in DEVICE_LIST_STRATEGY=cdi-cri. In this mode, the plugin creates a CDI spec file at /var/run/cdi/[...].json that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which nvidia-ctk hooks need to be run.
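
For illustration, here is a heavily abridged sketch of what such a CDI spec file can look like. All paths and values below are invented for illustration; the real file is generated by the device plugin:

{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [{"path": "/dev/nvidia0"}]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [{"path": "/dev/nvidiactl"}],
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
        "options": ["ro", "nosuid", "nodev", "bind"]
      }
    ],
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "update-ldcache", "--folder", "/usr/lib/x86_64-linux-gnu"]
      }
    ]
  }
}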

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1) and updated the ldconfig cache (nvidia-ctk hook update-ldcache) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the /dev/nvidiactl and /dev/nvidia${n} cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the nvidia-container-cli configure command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works

In gofer_mount.go, the code now distinguishes explicitly between the containerRootFs (usually under /var/lib/.../root) and the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they pivot_root(2) into the containerRootFs while gVisor operates under the goferRootFs, so nvidia-ctk did not see any of the CDI devices, which gVisor had mounted under the goferRootFs instead.

This commit changes gVisor such that all device and rootfs setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks, and the gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed; solving that is left for another
day.

Result

I ran this on an H200 system and confirmed both nvidia-smi:

root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/# 

And CUDA vectoradd:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

LandonTClipp force-pushed the k8s-device-plugin-support branch from 3bbfaaa to 86b7d45 on April 28, 2026 19:36
LandonTClipp (Author) commented Apr 28, 2026

@ayushr2 FYI. This should be better than my last attempt because this doesn't touch the old NVIDIA_VISIBLE_DEVICES codepath and relies on CDI spec files defining the hooks to run.

I still suspect that a config parameter enabling this behavior would be desired, so let me know how you want to proceed so that this does not introduce backwards-incompatible changes for your systems.

Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/container.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
ayushr2 (Collaborator) commented May 2, 2026

At this point in the codebase, we're already in the gofer's mount namespace, so not only would nvidia-ctk update-ldcache be doing something redundant, but it messes up paths inside of the new namespace.

I don't understand this part, or why the nvidia-ctk hook update-ldcache hook cannot be run normally in the gofer. This is what runc does as well: it runs the CreateContainer hooks from inside the container namespace, but before pivot_root-ing.

So running it normally, rather than special-casing it as is done now, should work...
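
For context, the OCI runtime spec has the runtime pass the container state to each hook as JSON on stdin. A minimal sketch of that mechanism (not gVisor's actual container.ExecuteHooks implementation):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// runHook executes one OCI hook, feeding it the container state as JSON
// on stdin as the runtime spec requires. CreateContainer hooks run in the
// container's mount namespace, before pivot_root(2).
func runHook(h specs.Hook, state specs.State) error {
	stateJSON, err := json.Marshal(state)
	if err != nil {
		return err
	}
	cmd := exec.Command(h.Path)
	if len(h.Args) > 0 {
		cmd.Args = h.Args // Hook.Args carries the full argv, including argv[0].
	}
	cmd.Env = h.Env
	cmd.Stdin = bytes.NewReader(stateJSON)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("hook %q failed: %w; output: %s", h.Path, err, out)
	}
	return nil
}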

ayushr2 (Collaborator) commented May 2, 2026

I still suspect that a config parameter enabling this behavior would be desired, so let me know how you want to proceed so that this does not introduce backwards-incompatible changes for your systems.

If we take the approach of adding general support for running CreateContainer hooks in the gofer before pivot_root, I think your fix will be backwards compatible and won't need any flag gating. I don't think the GKE device plugin relies on CreateContainer hooks.

LandonTClipp force-pushed the k8s-device-plugin-support branch from 10e0830 to 4c3968d on May 4, 2026 16:33
LandonTClipp (Author) commented May 4, 2026

Here is some more context as to why the nvidia-ctk hook update-ldcache is breaking. nvidia-ctk does a pivot_root into the rootfs, which requires the rootfs to be a mount point (e.g. a self-bind-mount). Currently, when I try running the update-ldcache hook in the gofer using something like this in gofer_mount.go:

	// Set up /dev directory if needed.
	if devIoFD >= 0 {
		if err := SetupDev(spec, conf, root, procPath); err != nil {
			util.Fatalf("error setting up /dev: %v", err)
		}
	}

	// Run the CreateContainer hooks while still in the gofer's mount
	// namespace, passing the OCI container state as the runtime spec requires.
	if spec.Hooks != nil {
		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

This fails to run and the gofer emits this error:

W0504 16:35:52.912118       1 util.go:107] FATAL ERROR: error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

This is because . (which is the container rootfs) is not a mount point. Runc does the following:

  1. Creates a new mount namespace via clone(CLONE_NEWNS | ...).
  2. In prepareRootfs(), bind-mounts spec.Root.Path onto itself so it becomes a mount point, then performs the container mounts inside it before pivot_root(2).

The gVisor gofer, on the other hand, does not bind-mount spec.Root.Path (see SetupMounts()). It attaches all CDI mounts as children of /proc/fs/root/<dest>. So even if you make spec.Root.Path a bind mount so that pivot_root succeeds, the CDI mounts are in /proc/fs/root, not in spec.Root.Path, so nvidia-ctk/ldconfig won't see the libraries.
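
To make the failure mode concrete, the hook's internal logic boils down to the classic pivot_root idiom. This is a sketch of that pattern, not nvidia-ctk's actual code:

import "golang.org/x/sys/unix"

// enterRootfs pivots into rootfs using the pivot_root(".", ".") idiom.
// pivot_root(2) fails with EINVAL ("invalid argument") when new_root is
// not a mount point, which is exactly the error seen above.
func enterRootfs(rootfs string) error {
	if err := unix.Chdir(rootfs); err != nil {
		return err
	}
	if err := unix.PivotRoot(".", "."); err != nil {
		return err // EINVAL when rootfs is a plain directory.
	}
	// The old root is now stacked underneath; detach it.
	return unix.Unmount(".", unix.MNT_DETACH)
}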

I ran an experiment to prove this. I bind-mount spec.Root.Path just so that pivot_root succeeds:

	if spec.Hooks != nil {
		origRoot := spec.Root.Path
		if err := unix.Mount(origRoot, origRoot, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
			log.Warningf("PoC: failed to self-bind-mount rootfs %q: %v", origRoot, err)
		} else {
			log.Infof("PoC: self-bind-mounted %q; pivot_root in hooks should now succeed", origRoot)
			defer unix.Unmount(origRoot, unix.MNT_DETACH)
		}

		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

My pod now starts successfully:

  Normal  Started    0s    kubelet            spec.containers{ubuntu}: Started container ubuntu
lclipp@CW-HP216DG9DT-L gvisor % 

But nvidia-smi doesn't work:

root@debug-pod:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

And you can see the ldconfig step didn't produce the libcuda.so.1 -> libcuda.so.580.126.20 symlink:

root@debug-pod:/# ls -lah /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root  12 May  4 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
-rw-r--r-- 1 root root 92M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20
-rw-r--r-- 1 root root 10M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.580.126.20

This is why my hacky method of extracting the --folder arguments and running ldconfig ourselves works: it removes the need for the pivot_root that nvidia-ctk performs, and we can point ldconfig at the /proc/fs/root path instead of spec.Root.Path.

TL;DR

I think the core problem can be distilled down to the fact that gVisor bind-mounts the CDI files into /proc/fs/root instead of spec.Root.Path when TestOnlyAllowRunAsCurrentUserWithoutChroot = false. When nvidia-ctk hook update-ldcache then goes to pivot_root into spec.Root.Path, it fails because gVisor does not self-bind-mount it; and even if you add that, spec.Root.Path still does not have the client libraries bind-mounted (they're under /proc/fs/root instead).

The root of the problem is a mismatch between what nvidia-ctk expects (a rootfs at spec.Root.Path with CDI mounts attached) and what gVisor provides (CDI mounts attached to /proc/fs/root, not to spec.Root.Path).

I hope that long explanation makes some sort of sense.

Options

There are a few routes we can take to fix this:

  1. Run ldconfig manually like what I was doing before.
  2. Modify the spec.State bundle being passed as stdin to the hooks to use /proc/fs/root for the root. This is messy and weird but I think it would work.
  3. Change gVisor to do all CDI mounting on spec.Root.Path instead of /proc/fs/root.

Options 2 and 3 would preserve the OCI semantics and let nvidia-ctk run as normal. I'm not sure which of those two would be easier to implement. I do not understand why gVisor ever mounted devices onto /proc/fs/root.

ayushr2 (Collaborator) commented May 4, 2026

I do not understand why gVisor ever mounted devices onto /proc/fs/root

The startup sequence of the gofer is fairly complex.

  • We want to run the gofer with minimal capabilities. This capability set does not include CAP_SYS_ADMIN.
  • After most of the gofer set-up work is done, we re-execute the gofer and drop all capabilities except for the ones linked above. See this.
  • pivot_root(2) requires CAP_SYS_ADMIN. So it needs to happen before we drop capabilities during re-exec. Hence we pivot_root in sandboxsetup.SetupRootFS() before re-exec and set --setup-root=false during the re-exec.
  • To re-exec, we need access to /proc/self/exe. We can't bind-mount the host /proc into the container rootfs; that'd be a security risk. This is why the pivot_root(2) is done in /proc/fs: inside that directory, /root is the rootfs and /proc is the host procfs. Then later, we unmount the host procfs and chroot into /root.

But I wonder if we could do the following:

  • Do all work in the container rootfs.
  • Create 2 bind mounts in the container rootfs: /proc/self/fd and /proc/self/exe. This should be done after the other bind mounts are created.
  • pivot_root into the container rootfs.
  • re-exec without capabilities.
  • open /proc/self/fd as we do now.
  • unmount both of these mounts.

RE option (2): The hook takes in the spec.State bundle. Do you know how the nvidia hook figures out the container root from there? The ldconfig command expects a container-root flag: https://github.com/NVIDIA/nvidia-container-toolkit/blob/3cfea27c9a7fb47af2d9607e2f661fefd67c0ab3/internal/ldconfig/ldconfig.go#L99. Who is setting this? Is this already present in the OCI spec's hook arguments? And does it point to spec.Root.Path?

shayonj (Contributor) commented May 4, 2026

Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D
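
The proposed sequence, as a rough sketch (names hypothetical; the CDI mounts, /dev setup, and hook execution are elided):

import "golang.org/x/sys/unix"

// prepareAndExpose makes containerRootFs a mount point, lets the CDI
// mounts and hooks run against it, then exposes the finished tree at
// goferRootFs for the existing pivot_root flow.
func prepareAndExpose(containerRootFs, goferRootFs string) error {
	// Self-bind so pivot_root(2) inside hooks sees a mount point.
	if err := unix.Mount(containerRootFs, containerRootFs, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		return err
	}
	// ... apply CDI mounts and /dev setup under containerRootFs ...
	// ... run CreateContainer hooks ...
	// Recursively bind the prepared tree where the gofer expects it.
	return unix.Mount(containerRootFs, goferRootFs, "", unix.MS_BIND|unix.MS_REC, "")
}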

ayushr2 (Collaborator) commented May 4, 2026

Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

Yes, this is what Landon is proposing in option (2).

If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D

Yes this is possible too.

LandonTClipp (Author) commented:

Do you know how the nvidia hook figures out the container root from there?

Yes, the specs.State has a Bundle attribute: a path to a directory containing further configuration, including the rootfs path. This is what nvidia-ctk pivot_roots into.
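
In other words, the hook side does something like the following sketch (illustrating the mechanism, not nvidia-ctk's actual code):

package main

import (
	"encoding/json"
	"errors"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// containerRoot reads the OCI state from stdin, loads the bundle's
// config.json, and returns the rootfs path the hook will pivot into.
func containerRoot() (string, error) {
	var state specs.State
	if err := json.NewDecoder(os.Stdin).Decode(&state); err != nil {
		return "", err
	}
	f, err := os.Open(filepath.Join(state.Bundle, "config.json"))
	if err != nil {
		return "", err
	}
	defer f.Close()
	var spec specs.Spec
	if err := json.NewDecoder(f).Decode(&spec); err != nil {
		return "", err
	}
	if spec.Root == nil {
		return "", errors.New("config.json has no root")
	}
	root := spec.Root.Path
	if !filepath.IsAbs(root) {
		root = filepath.Join(state.Bundle, root) // root.path may be bundle-relative.
	}
	return root, nil
}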

I will look into the route you described. I think that's the correct way of doing it since runc also operates on the rootfs according to the bundle.

ayushr2 (Collaborator) commented May 5, 2026

I discussed this more with @avagin. He pointed out a legitimate use case: the container rootfs provided in the OCI spec may be read-only from the beginning, with the spec containing no mounts. In that case, the current implementation will not create any new directories/files inside the container rootfs, but my proposal would fail while creating the /proc/self/fd and /proc/self/exe bind mounts inside it.

What @shayonj described in the second paragraph of his comment seems more correct than my proposal. Maybe that is the path to pursue. @avagin thoughts?

LandonTClipp added a commit to LandonTClipp/gvisor that referenced this pull request May 5, 2026
This implements the idea @shayonj suggested in
google#13034 (comment). We
do all of the CDI mounts and createContainer hooks inside of
spec.Root.Path instead of /proc/fs/root. This should allow hooks which
need to pivot_root(2) to successfully run AFTER the CDI mounts have been
performed.
LandonTClipp (Author) commented:

I implemented @shayonj's idea but I'm on a plane and transferring binaries around is hard, so I'll do some final testing maybe tomorrow. I think it will work.

LandonTClipp (Author) commented:

@ayushr2 @shayonj I confirmed the current changes work:

root@debug-pod:/# nvidia-smi
Wed May  6 20:09:03 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   30C    P0             77W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

@shayonj's idea turned out to be way simpler than I expected. Please take a look, I think this might be the one!

shayonj (Contributor) left a comment:

Looking good, just some minor comments from my review if it's useful to you.

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch from aeec302 to c262c0b on May 6, 2026 21:08
LandonTClipp (Author) commented:

@shayonj please see the most recent set of changes. Highlights:

  1. I split out two variables: containerRootFs and goferRootFs. In the lisafs case, containerRootFs = spec.Root.Path and goferRootFs = /proc/fs/root. In the non-lisafs case, both are set to /proc/fs/root.
  2. Most/all rootfs preparation is done on the containerRootFs variable.
  3. createContainer hooks are not run in non-lisafs cases; a warning is logged when they are skipped.

I don't yet know how we can solve createContainer hooks on a readonly rootfs, and I don't have a system on which to test that, so I'm happy with implementing this only for lisafs.

LandonTClipp requested review from ayushr2 and shayonj on May 6, 2026 21:21
shayonj (Contributor) left a comment:

Looking good to me, just some minor comments. Perhaps the gVisor team can help with spotting anything else and seeing it through 🙏🏾

great work!

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp (Author) commented:

@ayushr2 please take another look at your earliest convenience.

ayushr2 (Collaborator) commented May 11, 2026

Reviewing! Could you please squash the commits? Copybara doesn't have the ability to squash-and-merge yet, so all commits from the PR are applied. We want to keep the master branch clean.

LandonTClipp force-pushed the k8s-device-plugin-support branch from e2ae571 to 64680ff on May 11, 2026 21:37
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/gofer.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment on lines +186 to +192
flags := uint32(unix.MS_SLAVE | unix.MS_REC)
if spec.Linux != nil && spec.Linux.RootfsPropagation != "" {
flags = specutils.PropOptionsToFlags([]string{spec.Linux.RootfsPropagation})
}
if err := specutils.SafeMount("", goferRootFs, "", uintptr(flags), "", procPath); err != nil {
return fmt.Errorf("mounting root (%q) with flags: %#x, err: %v", goferRootFs, flags, err)
}
ayushr2 (Collaborator) commented:

OK, the rootfs propagation flags are a bit tricky. See these code pointers into the runc implementation: [1] (prepareRoot()) and [2] (the logic after pivot_root(2)).

My reading is that, runc does the following:

  • Sets all mounts in / recursively as MS_SLAVE (unless spec.Linux.RootfsPropagation is specially configured; note that it can be configured with MS_SHARED too).
  • Sets the mount point containing spec.Root.Path as MS_PRIVATE (or MS_SLAVE if spec.Linux.RootfsPropagation contains that flag). This is important for the reasons mentioned in [1].
  • Remounts the rootfs onto itself.
  • Then does all the bind-mounts inside spec.Root.Path.
  • pivot_root.
  • Then updates the rootfs propagation flags back to spec.Linux.RootfsPropagation (in case it contained something like MS_SHARED).

This dance ensures that the bind mounts in rootfs are never propagated to the host mount namespace, even if spec.Linux.RootfsPropagation contains MS_SHARED. And at the end, the mount propagation across all mounts is as the user requested.

In gVisor's case, we don't support MS_SHARED at all. Mount operations inside the sandbox are not mirrored on the host. We validate that the OCI spec does not contain MS_SHARED.

Before this change, the steps we were taking were:

  • Set all mounts in / recursively as MS_SLAVE.
  • Bind-mount the container rootfs into /proc/fs/root.
  • Remount the rootfs based on spec.Linux.RootfsPropagation.
  • Do all the bind-mounts inside /proc/fs/root.

Per the current state of the PR, the steps we take are:

  • Set all mounts in / recursively as MS_SLAVE.
  • Do all the bind-mounts inside spec.Root.Path.
  • Remount the rootfs onto itself using MS_BIND|MS_REC.
  • Bind-mount the container rootfs recursively into /proc/fs/root.
  • Remount /proc/fs/root based on spec.Linux.RootfsPropagation.

By executing the container bind mounts before bind-mounting the rootfs onto itself with the recursive flag (MS_REC), you are cloning the entire populated mount tree and stacking it directly on top of itself. This doubles the number of mount objects in the kernel for that tree.

The order must be swapped to match runc: you must make spec.Root.Path a mount point first (step 3), and then perform the bind mounts inside of it (step 2). We could also do this self-bind-mount unconditionally (similar to runc), citing that createContainer hooks might pivot_root(2) and that runc does it this way. (Although runc self-bind-mounts for a different reason: pivot_root(2) into spec.Root.Path requires it to be a mount point.)

[1] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L1064-L1115
[2] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L239-L249
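
A sketch of the corrected ordering (names hypothetical, error handling abbreviated; this mirrors runc's prepareRoot() rather than gVisor's exact code):

import "golang.org/x/sys/unix"

// setupRootfsOrdered sketches the corrected sequence: make the rootfs a
// mount point while it is still free of sub-mounts, then populate it.
func setupRootfsOrdered(containerRootFs, goferRootFs string, propFlags uintptr) error {
	// 1. Stop mount events from propagating back to the host namespace.
	if err := unix.Mount("", "/", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
		return err
	}
	// 2. Self-bind-mount the rootfs before any sub-mounts exist, so the
	//    recursive bind does not clone a populated tree.
	if err := unix.Mount(containerRootFs, containerRootFs, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		return err
	}
	// 3. Apply the requested rootfs propagation to the new mount point so
	//    that later sub-mounts inherit it (see the follow-up below).
	if err := unix.Mount("", containerRootFs, "", propFlags, ""); err != nil {
		return err
	}
	// 4. ... perform all the bind-mounts inside containerRootFs ...
	// 5. Expose the prepared tree at the gofer's rootfs location.
	return unix.Mount(containerRootFs, goferRootFs, "", unix.MS_BIND|unix.MS_REC, "")
}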

LandonTClipp (Author) commented:

I took a few minutes to grok what you are saying and this is my understanding of the issue:

  1. containerRootFs is a regular directory. We call SetupMounts to bind-mount the mounts from the OCI spec into it. There are N mounts added.
  2. We self-bind-mount containerRootFs onto itself, which replicates the mounts added in step 1. We're now at 2*N mounts.
  3. We bind-mount containerRootFs -> goferRootFs, which again replicates the mounts originating from step 1. We're now at 3*N mounts.

If I'm interpreting that correctly, then I understand the need to unconditionally self-bind-mount containerRootFs before we call SetupMounts. I'll change that. Great callout!

ayushr2 (Collaborator) commented:

I guess I gave more detail than was required. I started writing this comment believing a mount propagation flag bug was being introduced, but the fact that we don't support MS_SHARED changed that. I left all of that context in case it helps other reviewers spot any mount propagation flag bugs introduced here.

I am trying to be very thorough with the changes here. These code paths are critical to gVisor's security architecture, so the margin for error is low.

As you pointed out, we are still doubling the mounts when we bind-mount containerRootFs -> goferRootFs. But after pivot_root(2), those underlying bind mounts from containerRootFs should be released by the host kernel (because runsc/cmd/sandboxsetup/fs.go:PivotRoot() unmounts the old_root).

I do think there is still one bug remaining regarding the mount propagation flags. Before this PR, the RootfsPropagation flags were applied before the bind-mounts were created in the rootfs. Now RootfsPropagation is applied after the bind-mounts are created.

If RootfsPropagation contains MS_PRIVATE, then only the top-level /proc/fs/root mount is made private, while all N sub-mounts remain MS_SLAVE (what they inherited from the rootfs mountpoint at the time of the bind-mount). Before this change, the rootfs mountpoint would have been MS_PRIVATE and all the bind mounts would have inherited that instead. So we need to move the rootfs propagation logic to just after the self-bind-mount.

LandonTClipp (Author) commented:

I definitely appreciate the attention you are giving this; I recognize how sensitive this codepath is, so I'm glad you're taking care with it. I am mostly unfamiliar with the context here, so I rely on your input!

I don't know if it's worth creating some kind of test that asserts the mount points have the attributes we expect at specific points in the code. We could create hook points in this function that let tests assert certain properties at specific points in the execution (see the sketch below). Otherwise we are having an academic discussion without concrete proof of what's happening.
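
One possible shape for such a hook point (entirely hypothetical names):

// checkpoint is a no-op in production. Tests can replace it to inspect
// mount state at well-defined stages of the setup sequence.
var checkpoint = func(stage string) {}

// A test could swap it in to dump /proc/self/mountinfo:
//
//	checkpoint = func(stage string) {
//		data, _ := os.ReadFile("/proc/self/mountinfo")
//		t.Logf("mounts at %s:\n%s", stage, data)
//	}
//
// Production code would then call checkpoint("after-self-bind"), etc., at
// the points whose mount attributes we want to assert.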

I'll make the change you specified here.

LandonTClipp (Author) commented May 12, 2026

The latest commit contains the changes you requested. One thing worth noting:

The propagation-flag block was previously inside if rootfsConf.ShouldUseLisafs(). After moving it up, it now also runs in the non-lisafs path (EROFS etc., where containerRootFs == goferRootFs == /proc/fs/root). Before this PR the non-lisafs path never honored spec.Linux.RootfsPropagation at all. Applying it is arguably more correct, but it is a behavior change for that path. I just want to confirm that applying it uniformly in both cases is okay with you.

ayushr2 (Collaborator) commented:

I agree that this is more correct and we should do this. Thanks for the heads up. Also thanks for working patiently through all the iterations. I think this is the right solution. I hope you have tested the latest changes with your CDI hook reproducer?

LandonTClipp (Author) commented:

I will do one final test with all of the changes tomorrow, and then hopefully we should be good to go. I'll trust that you guys have a testing mechanism for EROFS, since I have no way to check that. I'll get back to you tomorrow. Thanks!

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/gofer.go Outdated
Comment thread runsc/container/container.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch from 2532590 to f1be32f on May 12, 2026 19:05
ayushr2 (Collaborator) left a comment:

LGTM! Could you squash your commits?

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch 2 times, most recently from 26088bf to ff89142 on May 13, 2026 01:29
ayushr2 (Collaborator) commented May 13, 2026

Pulling this in and running all tests.

Comment thread runsc/container/container.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch 3 times, most recently from a489149 to f3bd6c0 on May 13, 2026 15:36
LandonTClipp (Author) commented:

With the latest changes, I'm able to confirm this works on our systems:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
lclipp@CW-HP216DG9DT-L gvisor % k exec -it debug-pod -- /bin/bash 
root@debug-pod:/# nvidia-smi
Wed May 13 17:39:16 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   31C    P0             78W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/# 

copybara-service Bot pushed commits that referenced this pull request May 13, 2026
LandonTClipp force-pushed the k8s-device-plugin-support branch from f3bd6c0 to ae18a84 on May 13, 2026 19:00
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026