feat: Support NVIDIA k8s-device-plugin with CDI #13024

Closed

LandonTClipp wants to merge 2 commits into google:master from LandonTClipp:cdi-support


Conversation


@LandonTClipp LandonTClipp commented Apr 27, 2026

Description

gVisor supports two paths for injecting NVIDIA devices:

  1. Through the legacy OCI pre-start hook created by
    nvidia-container-runtime-hook (legacy)
  2. Through CDI annotations.

When gVisor detected GPUs were being attached through the legacy hook
method, it would correctly run the nvidia-container-cli to modify the
container's rootfs to create the NVIDIA library symlinks that client
applications need (through ldconfig) to speak to the host driver. When
the NVIDIA k8s-device-plugin is used in CDI mode, it injects the client
libraries but not the symlinks. The only step that was missing was the
symlink creation with nvidia-container-cli.

This commit modifies the GPU detection logic to check both the CDI
annotations for the presence of GPUs and the legacy hook-based method.
If GPUs are detected in the CDI annotations, the code path to run the
symlinking step is performed.

The normal limitations apply in this scenario regarding
nvidia-container-cli, namely that the container's rootfs must be
writable. If it's not writable, like when using a readonly image
distribution mechanism (EROFS for example), this will not work. This
limitation applied in the legacy scenario anyway, so there is no
regression.
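For illustration, CDI device requests arrive as OCI spec annotations whose keys carry the `cdi.k8s.io/` prefix and whose values are comma-separated device names such as `nvidia.com/gpu=0`. A minimal sketch of the detection described above (hypothetical helper name, not gVisor's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// gpusRequestedViaCDI reports whether any CDI annotation on the OCI spec
// requests an NVIDIA GPU. CDI annotation keys are prefixed "cdi.k8s.io/"
// and their values list device names like "nvidia.com/gpu=0".
// Illustrative sketch only, not gVisor's detection code.
func gpusRequestedViaCDI(annotations map[string]string) bool {
	for key, val := range annotations {
		if !strings.HasPrefix(key, "cdi.k8s.io/") {
			continue
		}
		for _, dev := range strings.Split(val, ",") {
			if strings.HasPrefix(strings.TrimSpace(dev), "nvidia.com/gpu=") {
				return true
			}
		}
	}
	return false
}

func main() {
	annotations := map[string]string{
		"cdi.k8s.io/devices": "nvidia.com/gpu=0,nvidia.com/gpu=1",
	}
	fmt.Println(gpusRequestedViaCDI(annotations)) // true
}
```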

Doesn't k8s-device-plugin already do symlinking?

You'll notice in k8s-device-plugin, it already has hooks for creating these symlinks. For example:

  "containerEdits": {
    "hooks": [
      {
        "hookName": "createContainer",
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "create-symlinks",
          "--link",
          "libnvidia-opticalflow.so.1::/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so",
          "--link",
          "libGLX_nvidia.so.580.126.20::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0",
          "--link",
          "libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so"
        ]
      },
    ],

However, gVisor explicitly does not run createContainer hooks because they're not supported.

This is why gVisor must run nvidia-container-cli configure manually to create the symlinks and update the ldconfig cache.

A nil pointer dereference was happening when an error occurred in the
gofer, which obscured the underlying problem. This is a simple nil check
before calling Close on the `runc.IO`.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
@LandonTClipp
Author

@ayushr2 one major question I have is whether this will break GKE's infrastructure. I had to make an assumption that GKE environments are somehow performing the client library symlinking that is missing. In this case, it might make sense to introduce a config parameter that toggles the behavior in this PR and keep it disabled by default. You'll have to let me know what you think.

@ayushr2
Collaborator

ayushr2 commented Apr 27, 2026

Yeah I think this might break GKE because specutils.GPUFunctionalityRequested() is true for GKE. I am not sure if nvidia-container-cli is present in GKE Nodes.

If the GPU devices are already bind-mounted into the container filesystem, then we should probably not run nvidia-container-cli configure again, which would bind-mount the devices a second time. All we want is to emulate the createContainer hook (by invoking only nvidia-ctk to create symlinks), just as we emulate the prestart hook.

@LandonTClipp
Author

I will look into narrowing the scope of what we're doing to the rootfs with nvidia-ctk. I did test the current method on our systems and it works as expected, but I suspected I was using too broad a hammer.

@ayushr2
Collaborator

ayushr2 commented Apr 28, 2026

Also, I think you might be able to get away with just passing --nvproxy-docker runsc flag. It forces the nvidia prestart hook code path even when the OCI spec doesn't have that hook. So you don't need this PR.

But I still think the right thing to do is to emulate nvidia-ctk invocation in gVisor for symlink creation. Because the pre-start hook is doing extra work.

@LandonTClipp
Author

The issue I had with the GPUFunctionalityRequestedViaHook code path was that it would mount all of the /dev/nvidiaN devices from the host, not just the ones requested via CDI. This is why I started investigating whether we can just use raw CDI for everything: the k8s-device-plugin gives us a CDI spec file that, if adhered to correctly, yields a correct working environment inside the container.

The issues I discovered were multi-layered:

  1. gVisor would correctly mount the client library as specified in the CDI spec file. Example:

       {
         "hostPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
         "containerPath": "/usr/lib/x86_64-linux-gnu/libcuda.so.580.126.20",
         "options": [
           "ro",
           "nosuid",
           "nodev",
           "bind"
         ]
       },
    

    But it would not create the libcuda.so -> libcuda.so.1 symlink that is supposed to be created by this createContainer hook:

       {
         "hookName": "createContainer",
         "path": "/usr/bin/nvidia-ctk",
         "args": [
           "nvidia-ctk",
           "hook",
           "create-symlinks",
           "--link",
           "libGLX_nvidia.so.580.126.20::/usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0",
           "--link",
           "libcuda.so.1::/usr/lib/x86_64-linux-gnu/libcuda.so",
           "--link",
           "libnvidia-opticalflow.so.1::/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so"
         ]
       }


    Because createContainer hooks were explicitly not run.

  2. After the libcuda.so -> libcuda.so.1 symlink is created, you're supposed to also update the ldcache. This step completes the symlink chain so that libcuda.so.1 -> libcuda.so.580.126.20. This step also wasn't being run because it's a createContainer hook.

  3. The reason why createContainer hooks were disabled appears to be that nobody had figured out how to make the symlinking and ldcache dance work with the gofer (the filesystem proxy) and the sentry (the component that manages the VFS). I have a working solution internally that I will submit as a PR soon.

@LandonTClipp
Author

LandonTClipp commented Apr 28, 2026

Closed in favor of #13034

copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code now distinguishes explicitly between the
containerRootFs (usually under /var/lib/.../root) and the goferRootFs
(/proc/fs). The issue with nvidia-ctk hooks is that they would
pivot_root(2) into the containerRootFs while gVisor would operate under
the goferRootFs, so nvidia-ctk did not see any CDI devices mounted into
the containerRootFs.

This commit changes gVisor so that all device and filesystem setup is
done under the containerRootFs. We then bind-mount containerRootFs into
goferRootFs after running the createContainer hooks. The gofer
pivot_roots into the goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed; solving that is left for
another day.
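The ordering matters: hooks must run against the containerRootFs before the bind-mount, or they would again miss the mounted devices. A sketch that only records the sequence (real code performs mount(2) and pivot_root(2); the type and method names here are illustrative, not gVisor's):

```go
package main

import "fmt"

// rootfsSetup records the order of operations described above without
// performing real mounts: CDI devices and createContainer hooks are
// applied to containerRootFs first; only then is containerRootFs
// bind-mounted into goferRootFs for the gofer to pivot_root into.
type rootfsSetup struct{ ops []string }

func (s *rootfsSetup) mountCDIDevices(root string) {
	s.ops = append(s.ops, "mount CDI devices under "+root)
}

func (s *rootfsSetup) runCreateContainerHooks(root string) {
	s.ops = append(s.ops, "run createContainer hooks in "+root)
}

func (s *rootfsSetup) bindMount(src, dst string) {
	s.ops = append(s.ops, "bind-mount "+src+" -> "+dst)
}

func main() {
	const (
		containerRootFs = "/var/lib/.../root" // elided path, as in the description
		goferRootFs     = "/proc/fs"
	)
	s := &rootfsSetup{}
	s.mountCDIDevices(containerRootFs)
	s.runCreateContainerHooks(containerRootFs) // hooks now see the mounted devices
	s.bindMount(containerRootFs, goferRootFs)  // gofer pivot_roots into goferRootFs
	for _, op := range s.ops {
		fmt.Println(op)
	}
}
```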

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support f3bd6c0
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support f3bd6c0
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support f3bd6c0
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026
Description
------------

This commit adds the ability for gVisor to run createContainer hooks in the CDI spec. This is needed to support NVIDIA's k8s-device-plugin running in `DEVICE_LIST_STRATEGY=cdi-cri`. In this mode, the plugin creates a CDI spec file at `/var/run/cdi/[...].json` that contains instructions on how to mount GPU devices, which client libraries to bind-mount into the container, and which `nvidia-ctk` hooks need to be run.

While the device cdev and client library injection mechanism already worked with gVisor, the createContainer hooks that created the library symlinks (e.g. `/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1`) and updated the ldconfig cache (`nvidia-ctk hook update-ldcache`) were missing. This meant that processes inside the container could not resolve the client libraries and thus did not know how to communicate with the `/dev/nvidiactl` and `/dev/nvidia${n}` cdevs. The CDI spec file contains the instructions on how to do this, so now gVisor follows it.

gVisor previously solved this problem by using the `nvidia-container-cli configure` command. This largely did the same things that the CDI spec file instructs us to do, but it is a legacy path and is not using CDI at all.

How it Works
------------

In gofer_mount.go, the code is changed to have explicit understandings
as to what is the containerRootFs (usually under /var/lib/.../root) and
the goferRootFs (/proc/fs). The issue with nvidia-ctk hooks is that they
would pivot_root(2) into the containerRootFs while gVisor would operate
under the goferRootFs. This meant that nvidia-ctk did not see any CDI
devices mounted into the containerRootFs.

This commit changes gVisor such that all devices and setup is done under
the containerRootFs. We then bind-mount containerRootFs into goferRootFs
after running the CreateContainer hooks. The gofer pivot_roots into the
goferRootFs as before.

Note that createContainer hooks are only run if the underlying rootfs is
writable. There are many scenarios, such as when using EROFS, where
createContainer hooks can't be executed. This problem will be saved for
another day to solve.

Result
-------

I ran this on an H200 system and confirmed both nvidia-smi:

```
root@debug-pod:/# nvidia-smi
Tue Apr 28 19:34:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   N/A              Off |                    0 |
| N/A   27C    P0             76W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@debug-pod:/#

```

And CUDA vectoradd:

```
lclipp@CW-HP216DG9DT-L gvisor % k logs cuda-vectoradd-kata-gvisor
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

both work.

This supersedes #13024 because this method does not touch the legacy hook-based NVIDIA_DEVICES method. This PR makes gVisor fully compatible with the NVIDIA k8s-device-plugin when using it in cdi-cri mode.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13034 from LandonTClipp:k8s-device-plugin-support ae18a84
PiperOrigin-RevId: 914666567
copybara-service Bot pushed a commit that referenced this pull request May 13, 2026