runsc: skip no-op CDI disable-device-node-modification hook under nvproxy #13308

Closed
a7i wants to merge 1 commit into google:master from a7i:a7i/skip-nvidia-disable-device-node-modification
@a7i a7i commented May 28, 2026

Fixes #13283. Alternative approach to #13284, per @ayushr2's suggestion on that PR.

What this does

NVIDIA's k8s-device-plugin and nvidia-ctk cdi generate emit a CDI createContainer hook nvidia-ctk hook disable-device-node-modification that bind-mounts a modified /proc/driver/nvidia/params (with ModifyDeviceFiles=0) over the container's procfs, so in-container libnvidia-ml does not try to recreate /dev/nvidiaN device files.

Under nvproxy this hook is a no-op for two reasons:

  1. nvproxy mediates all NVIDIA device access and the sentry owns /dev, so in-container libnvidia-ml cannot create new device files regardless of what /proc/driver/nvidia/params says.
  2. gVisor already exposes /proc/driver/nvidia/params inside the sandbox with ModifyDeviceFiles forced to 0; see pkg/sentry/devices/nvproxy/nvproxy.go, consistent with libnvidia-container's src/nvc_mount.c:mount_procfs().
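To make the ModifyDeviceFiles override concrete, here is a minimal sketch of rewriting a params file so that the flag reads 0. It is not gVisor's nvproxy code or the nvidia-ctk implementation: the function name forceModifyDeviceFilesOff is hypothetical, the sample content is made up, and the "Key: value" line format is an assumption about how /proc/driver/nvidia/params is laid out.

```go
package main

import (
	"fmt"
	"strings"
)

// forceModifyDeviceFilesOff returns params with every "ModifyDeviceFiles: <n>"
// line rewritten to "ModifyDeviceFiles: 0". All other lines pass through
// unchanged. The "Key: value" format is an assumption of this sketch.
func forceModifyDeviceFilesOff(params string) string {
	lines := strings.Split(params, "\n")
	for i, line := range lines {
		key, _, found := strings.Cut(line, ":")
		if found && strings.TrimSpace(key) == "ModifyDeviceFiles" {
			lines[i] = "ModifyDeviceFiles: 0"
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Hypothetical sample content, not captured from a real host.
	host := "ResmanDebugLevel: 4294967295\nModifyDeviceFiles: 1\nDeviceFileMode: 438\n"
	fmt.Print(forceModifyDeviceFilesOff(host))
}
```

Because the sandbox already serves a file in this state, a hook that produces the same result adds nothing under nvproxy.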

In addition, executing the hook is currently fatal during sandbox setup. The hook opens <containerRootFs>/proc/driver/nvidia/params, but procfs is not mounted into containerRootFs at hook time (the sentry serves /proc itself later), so the open(2) fails with ENOENT and aborts container creation:

hooks.go:63] Executing hook nvidia-ctk hook disable-device-node-modification
util.go:107] FATAL ERROR: error executing CreateContainer hooks
stderr: failed to mount modified params file: open o_path procfd:
  open /run/containerd/.../rootfs/proc/driver/nvidia/params:
  no such file or directory

This change filters that hook out of spec.Hooks.CreateContainer before invoking specutils.ExecuteHooks, when nvproxy is enabled. The filter is targeted at the specific argv shape (nvidia-ctk hook disable-device-node-modification) and is a no-op for any other hook.
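A minimal sketch of such a filter follows. It is not the code from this PR: the Hook struct is a local stand-in for the OCI runtime-spec hook type, the function names are hypothetical, and matching on the base name of Args[0] is an assumption about how the argv shape is recognized.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Hook is a local stand-in for the OCI runtime-spec hook type, reduced to
// the fields this sketch needs.
type Hook struct {
	Path string
	Args []string
}

// isDisableDeviceNodeModificationHook reports whether h has the argv shape
// "nvidia-ctk hook disable-device-node-modification". Short argv, other
// binaries (including names that merely start with "nvidia-ctk"), and other
// subcommands are rejected.
func isDisableDeviceNodeModificationHook(h Hook) bool {
	if len(h.Args) < 3 {
		return false
	}
	return filepath.Base(h.Args[0]) == "nvidia-ctk" &&
		h.Args[1] == "hook" &&
		h.Args[2] == "disable-device-node-modification"
}

// hooksToExecute returns hooks unchanged when nvproxy is disabled, and
// otherwise returns them in the same order with the no-op hook dropped.
func hooksToExecute(hooks []Hook, nvproxyEnabled bool) []Hook {
	if !nvproxyEnabled {
		return hooks
	}
	var kept []Hook
	for _, h := range hooks {
		if isDisableDeviceNodeModificationHook(h) {
			continue
		}
		kept = append(kept, h)
	}
	return kept
}

func main() {
	hooks := []Hook{
		{Path: "/usr/bin/nvidia-ctk", Args: []string{"nvidia-ctk", "hook", "create-symlinks"}},
		{Path: "/usr/bin/nvidia-ctk", Args: []string{"nvidia-ctk", "hook", "disable-device-node-modification"}},
		{Path: "/usr/bin/nvidia-ctk", Args: []string{"nvidia-ctk", "hook", "update-ldcache"}},
	}
	for _, h := range hooksToExecute(hooks, true) {
		fmt.Println(h.Args[2])
	}
}
```

Comparing the base name exactly, rather than by prefix, keeps a binary such as a hypothetical nvidia-ctk-wrapper from matching. Gating on the nvproxy flag leaves the hook list untouched for every other configuration.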

Sequence

gofer setup (containerRootFs)
  SetupMounts          → libcuda / library bind-mounts
  SetupDev             → /dev/nvidia* cdevs
  ExecuteHooks         ← NEW: drop disable-device-node-modification before calling
                          ExecuteHooks; the other three NVIDIA hooks
                          (create-symlinks, enable-cuda-compat, update-ldcache)
                          continue to execute
bind-mount containerRootFs → goferRootFs (existing behavior)
pivot_root into goferRootFs
sentry boots; serves its own /proc with ModifyDeviceFiles=0

Why this instead of #13284

@ayushr2 pointed out on #13284 that the hook's only semantic effect is something nvproxy already does. Skipping is therefore functionally equivalent to making the hook succeed (as #13284 does by bind-mounting host /proc/driver/nvidia into containerRootFs), and is simpler:

| Aspect | #13284 (bind-mount) | This PR (skip) |
| --- | --- | --- |
| Container surface | Adds a read-only bind mount of host /proc/driver/nvidia into the per-container rootfs (the sentry then shadows /proc again, so the hook's write is overwritten anyway) | None added |
| Code | New SetupNvidiaProcDriver + os.MkdirAll + bind mount + EROFS handling | Single filter function next to ExecuteHooks |
| Behavior on EROFS rootfs | Fatal on MkdirAll | Unchanged from today |
| Generalization | Generalizes to any future hook that wants procfs at hook time | Targeted; address future cases when they arise |

Tests

  • TestIsNvidiaDisableDeviceNodeModificationHook covers the predicate (matches the specific nvidia-ctk hook disable-device-node-modification argv shape, rejects other subcommands, other binaries, short argv, and nvidia-ctk-prefix-but-not-suffix paths).
  • TestFilterNVProxyNoOpHooks covers the filter behavior (nil/empty input, only-no-op input, no no-op present, no-op in various positions, interleaved with non-NVIDIA hooks).
  • The integration call site is SetupRootFS's existing if rootfsConf.ShouldUseLisafs() block where it already calls specutils.ExecuteHooks; the filter is applied right before that call, gated on specutils.NVProxyEnabled(spec, conf). The other GPU integration tests exercise the same path and continue to apply.

Verified end-to-end

Tested on a Tesla T4 host (Ubuntu 24.04, kernel 6.17, containerd 2.2.2, kubelet 1.35, k8s-device-plugin v0.17.0 in DEVICE_LIST_STRATEGY=cdi-cri, runsc release-20260520.0 with this change applied as a one-off patch to a sandbox node). Before this change, the gofer log shows the FATAL ERROR above. With this change applied, the gofer logs three successful hook executions (create-symlinks, enable-cuda-compat, update-ldcache) and skips disable-device-node-modification, the sandbox starts, and a PyTorch CUDA pod reports cuda available: True end to end.

Risks

  • No host-filesystem exposure added: the previous approach bind-mounted host /proc/driver/nvidia into containerRootFs; this PR adds no new mounts.
  • No-op when nvproxy is disabled: the filter is gated on specutils.NVProxyEnabled(spec, conf).
  • No-op for any other hook: predicate matches only the specific nvidia-ctk hook disable-device-node-modification argv shape.
  • Behavior change when the hook would have failed: under nvproxy, the hook is now silently skipped instead of aborting container creation, so the sandbox starts. This is functionally equivalent to the hook succeeding because the sentry overrides /proc/driver/nvidia/params anyway.

…roxy

NVIDIA's k8s-device-plugin and `nvidia-ctk cdi generate` emit a CDI
createContainer hook `nvidia-ctk hook disable-device-node-modification`,
which bind-mounts a modified /proc/driver/nvidia/params (ModifyDeviceFiles=0)
over the container's procfs so in-container libnvidia-ml does not try to
recreate /dev/nvidiaN device files.

Under nvproxy this hook is a no-op: gVisor already exposes
/proc/driver/nvidia/params inside the sandbox with ModifyDeviceFiles forced
to 0 (see pkg/sentry/devices/nvproxy/nvproxy.go's procDriverNvidiaParams
setup, consistent with libnvidia-container's src/nvc_mount.c:mount_procfs()),
and nvproxy mediates all device access regardless.

In addition, executing the hook is currently fatal during sandbox setup
because the hook opens <containerRootFs>/proc/driver/nvidia/params, but
procfs is not mounted into containerRootFs at hook time (the sentry serves
/proc itself later), so the open(2) fails with ENOENT and aborts container
creation:

  hooks.go:63] Executing hook nvidia-ctk hook disable-device-node-modification
  util.go:107] FATAL ERROR: error executing CreateContainer hooks
  stderr: failed to mount modified params file: open o_path procfd:
    open /run/containerd/.../rootfs/proc/driver/nvidia/params:
    no such file or directory

Filter the hook out before executing CreateContainer hooks when nvproxy is
enabled. This is functionally equivalent to making the hook succeed and is
simpler than google#13284 (no host-filesystem bind-mount into containerRootFs);
the approach was suggested by @ayushr2 on google#13284.

Fixes google#13283.
@a7i a7i closed this May 28, 2026
@a7i a7i deleted the a7i/skip-nvidia-disable-device-node-modification branch May 28, 2026 20:02
Development

Successfully merging this pull request may close these issues.

createContainer hook disable-device-node-modification fails: /proc/driver/nvidia/params not available in containerRootFs (gvisor#13034 follow-up)