
nvproxy: frontendFDs are present during runsc checkpoint despite being closed #12144

@mattnappo

Description

Hello! I am currently using cuda-checkpoint within gVisor, and runsc checkpoint is failing with nvproxy.frontendFD is not saveable. From my understanding, this error means that a /dev/nvidia#, /dev/nvidiactl, or /dev/nvidia-uvm file descriptor is open at the time of a gVisor snapshot. However, running lsof -nP inside the container after cuda-checkpoint indicates that none of these file descriptors are currently open. Why is there a discrepancy between lsof and the nvproxy.frontendFDs set? Is it possible that gVisor has a bug where it holds on to these frontendFD file descriptors even if they're closed by the guest?
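
For what it's worth, lsof can be cross-checked by resolving /proc/<pid>/fd directly; a minimal sketch, assuming the container name and in-container PID 1 from the repro steps below:

# Resolve every fd symlink of the workload process and look for NVIDIA device nodes.
runsc exec sglang-container sh -c \
  'for fd in /proc/1/fd/*; do readlink "$fd"; done | grep nvidia || echo "no nvidia fds visible"'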

Steps to reproduce

I came across this issue when attempting to checkpoint/restore sglang.

Step 1: Build the image from this Dockerfile

FROM nvidia/cuda:12.9.1-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    wget \
    git \
    python3 \
    python3-pip \
    lsof \
    && rm -rf /var/lib/apt/lists/*

# Install CUDA Toolkit (nvcc)
RUN arch="x86_64" && \
    distro="debian12" && \
    filename="cuda-keyring_1.1-1_all.deb" && \
    cuda_keyring_url="https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/${filename}" && \
    wget ${cuda_keyring_url} && \
    dpkg -i ${filename} && \
    rm -f ${filename} && \
    apt-get update && \
    apt-get install -y cuda-nvcc-12-9 && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get upgrade -y && apt-get install -y numactl

# Clone and install sglang
RUN pip install --upgrade pip setuptools
RUN git clone https://github.com/sgl-project/sglang.git /sglang && \
    cd /sglang && \
    pip install -e "python[all]"

RUN wget -O /usr/bin/cuda-checkpoint https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint

CMD ["python3", "-c", "import sglang.srt.entrypoints.engine; import time; import os; print(f'pid: {os.getpid()}', flush=True); time.sleep(1000)"]

Step 2: Prepare the container
Follow the instructions here to create an OCI bundle and config. Make sure to set up GPU support; this guide is useful.
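
As a rough sketch, assuming the image was tagged sglang-cr as above, the bundle can be assembled like this (GPU support still has to be added to config.json per the guide):

mkdir -p bundle/rootfs
cid=$(docker create sglang-cr)
docker export "$cid" | tar -C bundle/rootfs -xf -
docker rm "$cid"
# runsc spec writes a default config.json into the bundle directory; edit it for GPU support.
(cd bundle && runsc spec)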

Step 3: Running the container
Run the container: runsc -nvproxy run -bundle bundle/ sglang-container

Then, once the PID is printed, run cuda-checkpoint inside the container: runsc exec sglang-container cuda-checkpoint --toggle --pid 1. This step should succeed.
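
Before moving on, the checkpoint state can be queried to confirm the toggle took effect; hedged, since the state-query flag depends on the cuda-checkpoint build you fetched:

# Expected to report a checkpointed (not running) state after the toggle.
runsc exec sglang-container cuda-checkpoint --get-state --pid 1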

You can now verify that no handles to /dev/nvidia* are present: runsc exec sglang-container lsof -nP.
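
For example, filtering the output for the device nodes nvproxy cares about:

runsc exec sglang-container sh -c 'lsof -nP | grep -i nvidia || echo "no nvidia handles open"'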

Step 4: Checkpoint the container
This step fails: runsc checkpoint -image-path checkpointed_img/ sglang-container, with the error message pasted below. I've also uploaded the container's lsof -nP output, grepped for cuda|nvidia. In addition, I'm seeing lots of the following and am not sure whether it's related (a debug-logging sketch follows the excerpt):

object.go:260 [ 1689: 1689] nvproxy: freeing object with unknown handle 0xc1d09e26:0x5c000006
frontend.go:76 [ 1707: 1707] nvproxy: failed to open nvidia0: no such file or directory
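
To get the sentry's view of the failure, the checkpoint can be rerun with debug logging enabled; a sketch, with an illustrative log directory:

mkdir -p /tmp/runsc-debug
runsc -debug -debug-log=/tmp/runsc-debug/ checkpoint -image-path checkpointed_img/ sglang-container
# Search the resulting sentry logs for the frontendFD error and the unknown-handle messages above.
grep -E 'frontendFD|unknown handle' /tmp/runsc-debug/*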

Log output

runsc version

docker version (if using docker)

uname

Oracle Linux 9.6, kernel 5.15.0-309.180.4.el9uek.x86_64

kubectl (if using Kubernetes)

N/A

repo state (if built from source)

No response

runsc debug logs (if available)

See the above file upload.
