
nvproxy: frontendFDs are present during runsc checkpoint despite being closed #12144

@mattnappo

Description

Hello! I am currently using cuda-checkpoint within gVisor, and runsc checkpoint is failing with nvproxy.frontendFD is not saveable. From my understanding, this error means that a /dev/nvidia#, /dev/nvidiactl, or /dev/nvidia-uvm file descriptor is open at the time of a gVisor snapshot. However, running lsof -nP inside the container after cuda-checkpoint indicates that none of these file descriptors are currently open. Why is there a discrepancy between lsof and the nvproxy.frontendFDs set? Is it possible that gVisor has a bug where it holds on to these frontendFD file descriptors even if they're closed by the guest?
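
For what it's worth, lsof can be cross-checked by resolving /proc/<pid>/fd directly; a minimal sketch, assuming the container name and in-container PID 1 from the repro steps below:

# Resolve every fd symlink of the workload process and look for NVIDIA device nodes.
runsc exec sglang-container sh -c \
  'for fd in /proc/1/fd/*; do readlink "$fd"; done | grep nvidia || echo "no nvidia fds visible"'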

Steps to reproduce

I came across this issue when attempting to checkpoint/restore sglang.

Step 1: Build the image from this Dockerfile

FROM nvidia/cuda:12.9.1-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    wget \
    git \
    python3 \
    python3-pip \
    lsof \
    && rm -rf /var/lib/apt/lists/*

# Install CUDA Toolkit (nvcc)
RUN arch="x86_64" && \
    distro="debian12" && \
    filename="cuda-keyring_1.1-1_all.deb" && \
    cuda_keyring_url="https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/${filename}" && \
    wget ${cuda_keyring_url} && \
    dpkg -i ${filename} && \
    rm -f ${filename} && \
    apt-get update && \
    apt-get install -y cuda-nvcc-12-9 && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get upgrade -y && apt-get install -y numactl

# Clone and install sglang
RUN pip install --upgrade pip setuptools
RUN git clone https://github.com/sgl-project/sglang.git /sglang && \
    cd /sglang && \
    pip install -e "python[all]"

RUN wget -O /usr/bin/cuda-checkpoint https://github.com/NVIDIA/cuda-checkpoint/raw/refs/heads/main/bin/x86_64_Linux/cuda-checkpoint && \
    chmod +x /usr/bin/cuda-checkpoint

CMD ["python3", "-c", "import sglang.srt.entrypoints.engine; import time; import os; print(f'pid: {os.getpid()}', flush=True); time.sleep(1000)"]

Step 2: Prepare the container
Follow the instructions here to create an OCI bundle and config. Make sure to set up GPU support; this guide is useful.
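
As a rough sketch, assuming the image was tagged sglang-cr as above, the bundle can be assembled like this (GPU support still has to be added to config.json per the guide):

mkdir -p bundle/rootfs
cid=$(docker create sglang-cr)
docker export "$cid" | tar -C bundle/rootfs -xf -
docker rm "$cid"
# runsc spec writes a default config.json into the bundle directory; edit it for GPU support.
(cd bundle && runsc spec)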

Step 3: Running the container
Run the container: runsc -nvproxy run -bundle bundle/ sglang-container

Then, once the PID is printed, run cuda-checkpoint inside the container: runsc exec sglang-container cuda-checkpoint --toggle --pid 1. This step should succeed.
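
Before moving on, the checkpoint state can be queried to confirm the toggle took effect; hedged, since the state-query flag depends on the cuda-checkpoint build you fetched:

# Expected to report a checkpointed (not running) state after the toggle.
runsc exec sglang-container cuda-checkpoint --get-state --pid 1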

You can now verify that no handles to /dev/nvidia* are present: runsc exec sglang-container lsof -nP.
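
For example, filtering the output for the device nodes nvproxy cares about:

runsc exec sglang-container sh -c 'lsof -nP | grep -i nvidia || echo "no nvidia handles open"'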

Step 4: Checkpoint the container
This step fails: runsc checkpoint -image-path checkpointed_img/ sglang-container, with the error message pasted below. I've also uploaded the container's lsof -nP output, grepped for cuda|nvidia. In addition, I'm seeing lots of the following and am not sure whether it's related (a debug-logging sketch follows the excerpt):

object.go:260 [ 1689: 1689] nvproxy: freeing object with unknown handle 0xc1d09e26:0x5c000006
frontend.go:76 [ 1707: 1707] nvproxy: failed to open nvidia0: no such file or directory
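
To get the sentry's view of the failure, the checkpoint can be rerun with debug logging enabled; a sketch, with an illustrative log directory:

mkdir -p /tmp/runsc-debug
runsc -debug -debug-log=/tmp/runsc-debug/ checkpoint -image-path checkpointed_img/ sglang-container
# Search the resulting sentry logs for the frontendFD error and the unknown-handle messages above.
grep -E 'frontendFD|unknown handle' /tmp/runsc-debug/*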

Log output

runsc version

docker version (if using docker)

uname

Oracle Linux 9.6, kernel 5.15.0-309.180.4.el9uek.x86_64

kubectl (if using Kubernetes)

N/A

repo state (if built from source)

No response

runsc debug logs (if available)

See the above file upload.
