Description
Hello! I am currently using cuda-checkpoint within gVisor, and runsc checkpoint is failing with nvproxy.frontendFD is not saveable. From my understanding, this error means that a /dev/nvidia#, /dev/nvidiactl, or /dev/nvidia-uvm file descriptor is open at the time of a gVisor snapshot. However, running lsof -nP inside the container after cuda-checkpoint has toggled the process indicates that none of these file descriptors are currently open. Why is there a discrepancy between lsof and the nvproxy.frontendFDs set? Is it possible that gVisor has a bug where it holds on to these frontendFD file descriptors even after they're closed by the guest?
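As a quick way to double-check the lsof result, assuming /proc is visible inside the sandbox and the workload process is PID 1 in the container, one can inspect the fd symlinks directly:

# list the workload's fd symlinks and look for /dev/nvidia* targets
runsc exec sglang-container sh -c 'ls -l /proc/1/fd | grep -i nvidia'

If the guest really has closed everything, this should print nothing.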
Steps to reproduce
I came across this issue when attempting to checkpoint/restore sglang.
Step 1: Build the image from this Dockerfile
FROM nvidia/cuda:12.9.1-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
wget \
git \
python3 \
python3-pip \
lsof \
&& rm -rf /var/lib/apt/lists/*
# Install CUDA Toolkit (nvcc)
RUN arch="x86_64" && \
distro="debian12" && \
filename="cuda-keyring_1.1-1_all.deb" && \
cuda_keyring_url="https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/${filename}" && \
wget ${cuda_keyring_url} && \
dpkg -i ${filename} && \
rm -f ${filename} && \
apt-get update && \
apt-get install -y cuda-nvcc-12-9 && \
rm -rf /var/lib/apt/lists/*
RUN apt-get update -y && apt-get upgrade -y
RUN apt-get install numactl -y
# Clone and install sglang
RUN pip install --upgrade pip setuptools
RUN git clone https://github.com/sgl-project/sglang.git /sglang && \
cd /sglang && \
pip install -e "python[all]"
RUN wget -O /usr/bin/cuda-checkpoint https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint
CMD ["python3", "-c", "import sglang.srt.entrypoints.engine; import time; import os; print(f'pid: {os.getpid()}', flush=True); time.sleep(1000)"]
Step 2: Prepare the container
Follow the instructions here to create an OCI bundle and config.json; a rough sketch is below. Make sure to set up GPU support. This guide is useful.
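Assuming the sglang-cuda-ckpt image tag from step 1, the bundle can be prepared along these lines (the GPU device mounts and flags from the linked guide still need to be added to the generated config.json):

mkdir -p bundle/rootfs
docker export "$(docker create sglang-cuda-ckpt)" | tar -xf - -C bundle/rootfs
cd bundle
runsc spec -- python3 -c "import sglang.srt.entrypoints.engine; import time; import os; print(f'pid: {os.getpid()}', flush=True); time.sleep(1000)"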
Step 3: Running the container
Run the container: runsc -nvproxy run -bundle bundle/ sglang-container
Then, once the PID is printed, run cuda-checkpoint inside the container: runsc exec sglang-container cuda-checkpoint --toggle --pid 1. This step should succeed.
You can now verify that no handles to /dev/nvidia* are present: runsc exec sglang-container lsof -nP.
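Concretely, the check below should print nothing but the fallback message once the toggle has completed (the echo is only there for readability):

runsc exec sglang-container lsof -nP | grep /dev/nvidia || echo "no /dev/nvidia* handles open"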
Step 4: Checkpoint the container
This step fails: runsc checkpoint -image-path checkpointed_img/ sglang-container, with the error message pasted below. I've also uploaded the container's lsof -nP output, grepped for cuda|nvidia. I'm additionally seeing a lot of the following log lines, and I'm not sure if they're related.
object.go:260 [ 1689: 1689] nvproxy: freeing object with unknown handle 0xc1d09e26:0x5c000006
frontend.go:76 [ 1707: 1707] nvproxy: failed to open nvidia0: no such file or directory
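For anyone reproducing, debug logs can be captured with runsc's standard flags (the log directory is arbitrary, and the same flags can be passed to the checkpoint invocation):

mkdir -p /tmp/runsc-logs
runsc -nvproxy -debug -debug-log=/tmp/runsc-logs/ run -bundle bundle/ sglang-container
runsc -debug -debug-log=/tmp/runsc-logs/ checkpoint -image-path checkpointed_img/ sglang-container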
Log output
runsc version
docker version (if using docker)
uname
oracle linux 9.6, kernel 5.15.0-309.180.4.el9uek.x86_64
kubectl (if using Kubernetes)
N/A
repo state (if built from source)
No response
runsc debug logs (if available)
See the above file upload.