feat: Support running createContainer hooks in CDI spec #13034

Open
LandonTClipp wants to merge 1 commit into google:master from LandonTClipp:k8s-device-plugin-support

Conversation

LandonTClipp commented Apr 28, 2026

Description

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in DEVICE_LIST_STRATEGY=cdi-cri. In this mode, the plugin creates a CDI spec file at /var/run/cdi/[...].json that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which nvidia-ctk hooks need to be run.
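
For illustration, here is a heavily abridged sketch of what such a CDI spec file can look like. All paths and values below are invented for illustration; the real file is generated by the device plugin:

{
  "cdiVersion": "0.6.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [{"path": "/dev/nvidia0"}]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [{"path": "/dev/nvidiactl"}],
    "mounts": [
      {
        "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
        "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
        "options": ["ro", "nosuid", "nodev", "bind"]
      }
    ],
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": ["nvidia-ctk", "hook", "update-ldcache", "--folder", "/usr/lib/x86_64-linux-gnu"]
      }
    ]
  }
}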

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1) and updated the ldconfig cache (nvidia-ctk hook update-ldcache) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the /dev/nvidiactl and /dev/nvidia${n} cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the nvidia-container-cli configure command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works

In gofer_mount.go, the code now distinguishes explicitly between the containerRootFs (usually under /var/lib/.../root) and the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they pivot_root(2) into the containerRootFs while gVisor operates under the goferRootFs, so nvidia-ctk did not see any of the CDI devices, which gVisor had mounted under the goferRootFs instead.

This commit changes gVisor such that all device and rootfs setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks, and the gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed; solving that is left for another
day.

Result

I ran this on an H200 system and confirmed both nvidia-smi:

root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/# 

And CUDA vectoradd:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

LandonTClipp force-pushed the k8s-device-plugin-support branch from 3bbfaaa to 86b7d45 on April 28, 2026 19:36
LandonTClipp (Author) commented Apr 28, 2026

@ayushr2 FYI. This should be better than my last attempt because this doesn't touch the old NVIDIA_VISIBLE_DEVICES codepath and relies on CDI spec files defining the hooks to run.

I still suspect that a config parameter enabling this behavior would be desired, so let me know how you want to proceed so that this does not introduce backwards-incompatible changes for your systems.

Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/container/container.go Outdated
Comment thread runsc/container/hook.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
ayushr2 (Collaborator) commented May 2, 2026

At this point in the codebase, we're already in the gofer's mount namespace, so not only would nvidia-ctk update-ldcache be doing something redundant, but it messes up paths inside of the new namespace.

I don't understand this part, or why the nvidia-ctk hook update-ldcache hook cannot be run normally in the gofer. This is what runc does as well: it runs the CreateContainer hooks from inside the container namespace, but before pivot_root-ing.

So running it normally, rather than special-casing it as is done now, should work...
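
For context, the OCI runtime spec has the runtime pass the container state to each hook as JSON on stdin. A minimal sketch of that mechanism (not gVisor's actual container.ExecuteHooks implementation):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// runHook executes one OCI hook, feeding it the container state as JSON
// on stdin as the runtime spec requires. CreateContainer hooks run in the
// container's mount namespace, before pivot_root(2).
func runHook(h specs.Hook, state specs.State) error {
	stateJSON, err := json.Marshal(state)
	if err != nil {
		return err
	}
	cmd := exec.Command(h.Path)
	if len(h.Args) > 0 {
		cmd.Args = h.Args // Hook.Args carries the full argv, including argv[0].
	}
	cmd.Env = h.Env
	cmd.Stdin = bytes.NewReader(stateJSON)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("hook %q failed: %w; output: %s", h.Path, err, out)
	}
	return nil
}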

ayushr2 (Collaborator) commented May 2, 2026

I still suspect that a config parameter enabling this behavior would be desired, so let me know how you want to proceed so that this does not introduce backwards-incompatible changes for your systems.

If we take the approach of adding general support for running CreateContainer hooks in the gofer before pivot_root, I think your fix will be backwards compatible and won't need any flag gating. I don't think the GKE device plugin relies on CreateContainer hooks.

LandonTClipp force-pushed the k8s-device-plugin-support branch from 10e0830 to 4c3968d on May 4, 2026 16:33
LandonTClipp (Author) commented May 4, 2026

Here is some more context as to why the nvidia-ctk hook update-ldcache is breaking. nvidia-ctk does a pivot_root into the rootfs, which requires the rootfs to be a mount point (e.g. a self-bind-mount). Currently, when I try running the update-ldcache hook in the gofer using something like this in gofer_mount.go:

	// Set up /dev directory if needed.
	if devIoFD >= 0 {
		if err := SetupDev(spec, conf, root, procPath); err != nil {
			util.Fatalf("error setting up /dev: %v", err)
		}
	}

	// Run the CreateContainer hooks while still in the gofer's mount
	// namespace, passing the OCI container state as the runtime spec requires.
	if spec.Hooks != nil {
		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

This fails to run and the gofer emits this error:

W0504 16:35:52.912118       1 util.go:107] FATAL ERROR: error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

error executing CreateContainer hooks: failure executing hook "/usr/bin/nvidia-ctk", err: exit status 1
stdout: 
stderr: 2026/05/04 16:35:52 Error updating ldcache: error running pivot_root: pivot_root .: invalid argument
exit status 1

This is because . (which is the container rootfs) is not a mount point. Runc does the following:

  1. Creates a new mount namespace via clone(CLONE_NEWNS | ...).
  2. In prepareRootfs(), bind-mounts spec.Root.Path onto itself so it becomes a mount point, then performs the container mounts inside it before pivot_root(2).

The gVisor gofer, on the other hand, does not bind-mount spec.Root.Path (see SetupMounts()). It attaches all CDI mounts as children of /proc/fs/root/<dest>. So even if you make spec.Root.Path a bind mount so that pivot_root succeeds, the CDI mounts are in /proc/fs/root, not in spec.Root.Path, so nvidia-ctk/ldconfig won't see the libraries.
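
To make the failure mode concrete, the hook's internal logic boils down to the classic pivot_root idiom. This is a sketch of that pattern, not nvidia-ctk's actual code:

import "golang.org/x/sys/unix"

// enterRootfs pivots into rootfs using the pivot_root(".", ".") idiom.
// pivot_root(2) fails with EINVAL ("invalid argument") when new_root is
// not a mount point, which is exactly the error seen above.
func enterRootfs(rootfs string) error {
	if err := unix.Chdir(rootfs); err != nil {
		return err
	}
	if err := unix.PivotRoot(".", "."); err != nil {
		return err // EINVAL when rootfs is a plain directory.
	}
	// The old root is now stacked underneath; detach it.
	return unix.Unmount(".", unix.MNT_DETACH)
}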

I ran an experiment to prove this. I bind-mount spec.Root.Path just so that pivot_root succeeds:

	if spec.Hooks != nil {
		origRoot := spec.Root.Path
		if err := unix.Mount(origRoot, origRoot, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
			log.Warningf("PoC: failed to self-bind-mount rootfs %q: %v", origRoot, err)
		} else {
			log.Infof("PoC: self-bind-mounted %q; pivot_root in hooks should now succeed", origRoot)
			defer unix.Unmount(origRoot, unix.MNT_DETACH)
		}

		state := specs.State{
			Version:     specs.Version,
			ID:          containerID,
			Status:      specs.StateCreating,
			Pid:         0,
			Bundle:      bundleDir,
			Annotations: spec.Annotations,
		}
		if err := container.ExecuteHooks(spec.Hooks.CreateContainer, state); err != nil {
			util.Fatalf("error executing CreateContainer hooks: %v", err)
		}
	}

My pod now starts successfully:

  Normal  Started    0s    kubelet            spec.containers{ubuntu}: Started container ubuntu
lclipp@CW-HP216DG9DT-L gvisor % 

But nvidia-smi doesn't work:

root@debug-pod:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

And you can see the ldconfig step didn't produce the libcuda.so.1 -> libcuda.so.580.126.20 symlink:

root@debug-pod:/# ls -lah /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root  12 May  4 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
-rw-r--r-- 1 root root 92M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20
-rw-r--r-- 1 root root 10M Apr 29 20:54 /usr/lib/x86_64-linux-gnu/libcudadebugger.so.580.126.20

This is why my hacky method of extracting the --folder arguments and running ldconfig ourselves works: it removes the need for the pivot_root that nvidia-ctk performs, and we can point ldconfig at the /proc/fs/root path instead of spec.Root.Path.

TL;DR

I think the core problem can be distilled down to the fact that gVisor bind-mounts the CDI files into /proc/fs/root instead of spec.Root.Path when TestOnlyAllowRunAsCurrentUserWithoutChroot = false. When nvidia-ctk hook update-ldcache then goes to pivot_root into spec.Root.Path, it fails because gVisor does not self-bind-mount it; and even if you add that, spec.Root.Path still does not have the client libraries bind-mounted (they're under /proc/fs/root instead).

The root of the problem is a mismatch between what nvidia-ctk expects (a rootfs at spec.Root.Path with CDI mounts attached) and what gVisor provides (CDI mounts attached to /proc/fs/root, not to spec.Root.Path).

I hope that long explanation makes some sort of sense.

Options

There are a few routes we can take to fix this:

  1. Run ldconfig manually like what I was doing before.
  2. Modify the spec.State bundle being passed as stdin to the hooks to use /proc/fs/root for the root. This is messy and weird but I think it would work.
  3. Change gVisor to do all CDI mounting on spec.Root.Path instead of /proc/fs/root.

Options 2 and 3 would preserve the OCI semantics and let nvidia-ctk run as normal. I'm not sure which of those two would be easier to implement. I do not understand why gVisor ever mounted devices onto /proc/fs/root.

ayushr2 (Collaborator) commented May 4, 2026

I do not understand why gVisor ever mounted devices onto /proc/fs/root

The startup sequence of the gofer is fairly complex.

  • We want to run the gofer with minimal capabilities. This capability set does not include CAP_SYS_ADMIN.
  • After most of the gofer set-up work is done, we re-execute the gofer and drop all capabilities except for the ones linked above. See this.
  • pivot_root(2) requires CAP_SYS_ADMIN. So it needs to happen before we drop capabilities during re-exec. Hence we pivot_root in sandboxsetup.SetupRootFS() before re-exec and set --setup-root=false during the re-exec.
  • To re-exec, we need access to /proc/self/exe. We can't bind-mount the host /proc into the container rootfs; that'd be a security risk. This is why the pivot_root(2) is done in /proc/fs: inside that directory, /root is the rootfs and /proc is the host procfs. Then later, we unmount the host procfs and chroot into /root.

But I wonder if we could do the following:

  • Do all work in the container rootfs.
  • Create 2 bind mounts in the container rootfs: /proc/self/fd and /proc/self/exe. This should be done after the other bind mounts are created.
  • pivot_root into the container rootfs.
  • re-exec without capabilities.
  • open /proc/self/fd as we do now.
  • unmount both of these mounts.

RE option (2): The hook takes in the spec.State bundle. Do you know how the nvidia hook figures out the container root from there? The ldconfig command expects a container-root flag: https://github.com/NVIDIA/nvidia-container-toolkit/blob/3cfea27c9a7fb47af2d9607e2f661fefd67c0ab3/internal/ldconfig/ldconfig.go#L99. Who is setting this? Is this already present in the OCI spec's hook arguments? And does it point to spec.Root.Path?

shayonj (Contributor) commented May 4, 2026

Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D
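
The proposed sequence, as a rough sketch (names hypothetical; the CDI mounts, /dev setup, and hook execution are elided):

import "golang.org/x/sys/unix"

// prepareAndExpose makes containerRootFs a mount point, lets the CDI
// mounts and hooks run against it, then exposes the finished tree at
// goferRootFs for the existing pivot_root flow.
func prepareAndExpose(containerRootFs, goferRootFs string) error {
	// Self-bind so pivot_root(2) inside hooks sees a mount point.
	if err := unix.Mount(containerRootFs, containerRootFs, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		return err
	}
	// ... apply CDI mounts and /dev setup under containerRootFs ...
	// ... run CreateContainer hooks ...
	// Recursively bind the prepared tree where the gofer expects it.
	return unix.Mount(containerRootFs, goferRootFs, "", unix.MS_BIND|unix.MS_REC, "")
}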

ayushr2 (Collaborator) commented May 4, 2026

Not to add too much noise here, but I was wondering if there’s a way to avoid special-casing ldconfig. Given that nvidia-cdi-hook gets the container root by reading state.Bundle/config.json and using root.path, could we instead make the assembled rootfs visible at spec.Root.Path before running CreateContainer hooks?

Yes, this is what Landon is proposing in option (2).

If we self-bind spec.Root.Path, apply the CDI mounts and /dev setup there, run the hooks, and then recursively bind that prepared tree to /proc/fs/root before the existing gVisor pivot_root, I think the NVIDIA hooks would see the same root they expect from runc while preserving the current gofer reexec flow. That might let this stay as generic CreateContainer hook support without parsing NVIDIA hook args or running ldconfig ourselves. Def feel free to lmk if I missed something :D

Yes this is possible too.

LandonTClipp (Author) commented:

Do you know how the nvidia hook figures out the container root from there?

Yes, the specs.State has a Bundle attribute: a path to a directory containing further configuration, including the rootfs path. This is what nvidia-ctk pivot_roots into.
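
In other words, the hook side does something like the following sketch (illustrating the mechanism, not nvidia-ctk's actual code):

package main

import (
	"encoding/json"
	"errors"
	"os"
	"path/filepath"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// containerRoot reads the OCI state from stdin, loads the bundle's
// config.json, and returns the rootfs path the hook will pivot into.
func containerRoot() (string, error) {
	var state specs.State
	if err := json.NewDecoder(os.Stdin).Decode(&state); err != nil {
		return "", err
	}
	f, err := os.Open(filepath.Join(state.Bundle, "config.json"))
	if err != nil {
		return "", err
	}
	defer f.Close()
	var spec specs.Spec
	if err := json.NewDecoder(f).Decode(&spec); err != nil {
		return "", err
	}
	if spec.Root == nil {
		return "", errors.New("config.json has no root")
	}
	root := spec.Root.Path
	if !filepath.IsAbs(root) {
		root = filepath.Join(state.Bundle, root) // root.path may be bundle-relative.
	}
	return root, nil
}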

I will look into the route you described. I think that's the correct way of doing it since runc also operates on the rootfs according to the bundle.

ayushr2 (Collaborator) commented May 5, 2026

I discussed this more with @avagin. He pointed out a legitimate use case: the container rootfs provided in the OCI spec may be read-only from the beginning, with the spec containing no mounts. In that case, the current implementation will not create any new directories/files inside the container rootfs, but my proposal would fail while creating the /proc/self/fd and /proc/self/exe bind mounts inside it.

What @shayonj described in the second paragraph of his comment seems more correct than my proposal. Maybe that is the path to pursue. @avagin thoughts?

LandonTClipp added a commit to LandonTClipp/gvisor that referenced this pull request May 5, 2026
This implements the idea @shayonj suggested in
google#13034 (comment). We
do all of the CDI mounts and createContainer hooks inside of
spec.Root.Path instead of /proc/fs/root. This should allow hooks which
need to pivot_root(2) to successfully run AFTER the CDI mounts have been
performed.
LandonTClipp (Author) commented:

I implemented @shayonj's idea but I'm on a plane and transferring binaries around is hard, so I'll do some final testing maybe tomorrow. I think it will work.

LandonTClipp (Author) commented:

@ayushr2 @shayonj I confirmed the current changes work:

root@debug-pod:/# nvidia-smi
Wed May  6 20:09:03 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   30C    P0             77W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

@shayonj's idea turned out to be way simpler than I expected. Please take a look, I think this might be the one!

shayonj (Contributor) left a comment:

Looking good, just some minor comments from my review if it's useful to you.

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch from aeec302 to c262c0b on May 6, 2026 21:08
LandonTClipp (Author) commented:

@shayonj please see the most recent set of changes. Highlights:

  1. I split out two variables: containerRootFs and goferRootFs. In the lisafs case, containerRootFs = spec.Root.Path and goferRootFs = /proc/fs/root. In the non-lisafs case, both are set to /proc/fs/root.
  2. Most/all rootfs preparation is done on the containerRootFs variable.
  3. createContainer hooks are not run in non-lisafs cases; a warning is logged when they are skipped.

I don't yet know how we can solve createContainer hooks on a readonly rootfs, and I don't have a system on which to test that, so I'm happy with implementing this only for lisafs.

LandonTClipp requested review from ayushr2 and shayonj on May 6, 2026 21:21
shayonj (Contributor) left a comment:

Looking good to me, just some minor comments. Perhaps the gVisor team can help with spotting anything else and seeing it through 🙏🏾

great work!

Comment thread runsc/cmd/sandboxsetup/BUILD Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp (Author) commented:

@ayushr2 please take another look at your earliest convenience.

ayushr2 (Collaborator) commented May 11, 2026

Reviewing! Could you please squash the commits? Copybara doesn't have the ability to squash-and-merge yet, so all commits from the PR are applied. We want to keep the master branch clean.

LandonTClipp force-pushed the k8s-device-plugin-support branch from e2ae571 to 64680ff on May 11, 2026 21:37
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/gofer.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment on lines +186 to +192
flags := uint32(unix.MS_SLAVE | unix.MS_REC)
if spec.Linux != nil && spec.Linux.RootfsPropagation != "" {
flags = specutils.PropOptionsToFlags([]string{spec.Linux.RootfsPropagation})
}
if err := specutils.SafeMount("", goferRootFs, "", uintptr(flags), "", procPath); err != nil {
return fmt.Errorf("mounting root (%q) with flags: %#x, err: %v", goferRootFs, flags, err)
}
ayushr2 (Collaborator) commented:

OK, the rootfs propagation flags are a bit tricky. See these code pointers into the runc implementation: [1] (prepareRoot()) and [2] (the logic after pivot_root(2)).

My reading is that, runc does the following:

  • Sets all mounts in / recursively as MS_SLAVE (unless spec.Linux.RootfsPropagation is specially configured; note that it can be configured with MS_SHARED too).
  • Sets the mount point containing spec.Root.Path as MS_PRIVATE (or MS_SLAVE if spec.Linux.RootfsPropagation contains that flag). This is important for the reasons mentioned in [1].
  • Remounts the rootfs onto itself.
  • Then does all the bind-mounts inside spec.Root.Path.
  • pivot_root.
  • Then updates the rootfs propagation flags back to spec.Linux.RootfsPropagation (in case it contained something like MS_SHARED).

This dance ensures that the bind mounts in rootfs are never propagated to the host mount namespace, even if spec.Linux.RootfsPropagation contains MS_SHARED. And at the end, the mount propagation across all mounts is as the user requested.

In gVisor's case, we don't support MS_SHARED at all. Mount operations inside the sandbox are not mirrored on the host. We validate that the OCI spec does not contain MS_SHARED.

Before this change, the steps we were taking were:

  • Set all mounts in / recursively as MS_SLAVE.
  • Bind-mount the container rootfs into /proc/fs/root.
  • Remount the rootfs based on spec.Linux.RootfsPropagation.
  • Do all the bind-mounts inside /proc/fs/root.

Per the current state of the PR, the steps we take are:

  • Set all mounts in / recursively as MS_SLAVE.
  • Do all the bind-mounts inside spec.Root.Path.
  • Remount the rootfs onto itself using MS_BIND|MS_REC.
  • Bind-mount the container rootfs recursively into /proc/fs/root.
  • Remount /proc/fs/root based on spec.Linux.RootfsPropagation.

By executing the container bind mounts before bind-mounting the rootfs onto itself with the recursive flag (MS_REC), you are cloning the entire populated mount tree and stacking it directly on top of itself. This doubles the number of mount objects in the kernel for that tree.

The order must be swapped to match runc: you must make spec.Root.Path a mount point first (step 3), and then perform the bind mounts inside of it (step 2). We could also do this self-bind-mount unconditionally (similar to runc), citing that createContainer hooks might pivot_root(2) and that runc does it this way. (Although runc self-bind-mounts for a different reason: pivot_root(2) into spec.Root.Path requires it to be a mount point.)

[1] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L1064-L1115
[2] https://github.com/opencontainers/runc/blob/0811f957a516ddda171cbf75d5f5ff36b7154893/libcontainer/rootfs_linux.go#L239-L249
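
A sketch of the corrected ordering (names hypothetical, error handling abbreviated; this mirrors runc's prepareRoot() rather than gVisor's exact code):

import "golang.org/x/sys/unix"

// setupRootfsOrdered sketches the corrected sequence: make the rootfs a
// mount point while it is still free of sub-mounts, then populate it.
func setupRootfsOrdered(containerRootFs, goferRootFs string, propFlags uintptr) error {
	// 1. Stop mount events from propagating back to the host namespace.
	if err := unix.Mount("", "/", "", unix.MS_SLAVE|unix.MS_REC, ""); err != nil {
		return err
	}
	// 2. Self-bind-mount the rootfs before any sub-mounts exist, so the
	//    recursive bind does not clone a populated tree.
	if err := unix.Mount(containerRootFs, containerRootFs, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		return err
	}
	// 3. Apply the requested rootfs propagation to the new mount point so
	//    that later sub-mounts inherit it (see the follow-up below).
	if err := unix.Mount("", containerRootFs, "", propFlags, ""); err != nil {
		return err
	}
	// 4. ... perform all the bind-mounts inside containerRootFs ...
	// 5. Expose the prepared tree at the gofer's rootfs location.
	return unix.Mount(containerRootFs, goferRootFs, "", unix.MS_BIND|unix.MS_REC, "")
}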

LandonTClipp (Author) commented:

I took a few minutes to grok what you are saying and this is my understanding of the issue:

  1. containerRootFs is a regular directory. We call SetupMounts to bind-mount the mounts from the OCI spec into it. There are N mounts added.
  2. We self-bind-mount containerRootFs onto itself, which replicates the mounts added in step 1. We're now at 2*N mounts.
  3. We bind-mount containerRootFs -> goferRootFs, which again replicates the mounts originating from step 1. We're now at 3*N mounts.

If I'm interpreting that correctly, then I understand the need to unconditionally self-bind-mount containerRootFs before we call SetupMounts. I'll change that. Great callout!

ayushr2 (Collaborator) commented:

I guess I gave more detail than was required. I started writing this comment believing a mount propagation flag bug was being introduced, but the fact that we don't support MS_SHARED changed that. I left all of that context in case it helps other reviewers spot any mount propagation flag bugs introduced here.

I am trying to be very thorough with the changes here. These code paths are critical to gVisor's security architecture, so the margin for error is low.

As you pointed out, we are still doubling the mounts when we bind-mount containerRootFs -> goferRootFs. But after pivot_root(2), those underlying bind mounts from containerRootFs should be released by the host kernel (because runsc/cmd/sandboxsetup/fs.go:PivotRoot() unmounts the old_root).

I do think there is still one bug remaining regarding the mount propagation flags. Before this PR, the RootfsPropagation flags were applied before the bind-mounts were created in the rootfs. Now RootfsPropagation is applied after the bind-mounts are created.

If RootfsPropagation contains MS_PRIVATE, then only the top-level /proc/fs/root mount is made private, while all N sub-mounts remain MS_SLAVE (what they inherited from the rootfs mountpoint at the time of the bind-mount). Before this change, the rootfs mountpoint would have been MS_PRIVATE and all the bind mounts would have inherited that instead. So we need to move the rootfs propagation logic to just after the self-bind-mount.

LandonTClipp (Author) commented:

I definitely appreciate the attention you are giving this; I recognize how sensitive this codepath is, so I'm glad you're taking care with it. I am mostly unfamiliar with the context here, so I rely on your input!

I don't know if it's worth creating some kind of test that asserts the mount points have the attributes we expect at specific points in the code. We could create hook points in this function that let tests assert certain properties at specific points in the execution (see the sketch below). Otherwise we are having an academic discussion without concrete proof of what's happening.
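
One possible shape for such a hook point (entirely hypothetical names):

// checkpoint is a no-op in production. Tests can replace it to inspect
// mount state at well-defined stages of the setup sequence.
var checkpoint = func(stage string) {}

// A test could swap it in to dump /proc/self/mountinfo:
//
//	checkpoint = func(stage string) {
//		data, _ := os.ReadFile("/proc/self/mountinfo")
//		t.Logf("mounts at %s:\n%s", stage, data)
//	}
//
// Production code would then call checkpoint("after-self-bind"), etc., at
// the points whose mount attributes we want to assert.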

I'll make the change you specified here.

LandonTClipp (Author) commented May 12, 2026

The latest commit contains the changes you requested. One thing worth noting:

The propagation-flag block was previously inside if rootfsConf.ShouldUseLisafs(). After moving it up, it now also runs in the non-lisafs path (EROFS etc., where containerRootFs == goferRootFs == /proc/fs/root). Before this PR the non-lisafs path never honored spec.Linux.RootfsPropagation at all. Applying it is arguably more correct, but it is a behavior change for that path. I just want to confirm that applying it uniformly in both cases is okay with you.

ayushr2 (Collaborator) commented:

I agree that this is more correct and we should do this. Thanks for the heads up. Also thanks for working patiently through all the iterations. I think this is the right solution. I hope you have tested the latest changes with your CDI hook reproducer?

LandonTClipp (Author) commented:

I will do one final test with all of the changes tomorrow, and then hopefully we should be good to go. I'll trust that you guys have a testing mechanism for EROFS, since I have no way to check that. I'll get back to you tomorrow. Thanks!

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
Comment thread runsc/cmd/gofer.go Outdated
Comment thread runsc/container/container.go Outdated
Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch from 2532590 to f1be32f on May 12, 2026 19:05
ayushr2 (Collaborator) left a comment:

LGTM! Could you squash your commits?

Comment thread runsc/cmd/sandboxsetup/gofer_mount.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch 2 times, most recently from 26088bf to ff89142 on May 13, 2026 01:29
ayushr2 (Collaborator) commented May 13, 2026

Pulling this in and running all tests.

Comment thread runsc/container/container.go Outdated
LandonTClipp force-pushed the k8s-device-plugin-support branch 3 times, most recently from a489149 to f3bd6c0 on May 13, 2026 15:36
LandonTClipp (Author) commented:

With the latest changes, I'm able to confirm this works on our systems:

lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
lclipp@CW-HP216DG9DT-L gvisor % k exec -it debug-pod -- /bin/bash 
root@debug-pod:/# nvidia-smi
Wed May 13 17:39:16 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   31C    P0             78W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/# 

copybara-service Bot pushed commits that referenced this pull request May 13, 2026
LandonTClipp force-pushed the k8s-device-plugin-support branch from f3bd6c0 to ae18a84 on May 13, 2026 19:00
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026