
nvproxy: 2x performance degradation in GPU→CPU pageable transfer bandwidth when shmem_enabled=always #12804

@luiscape

Description


When the host has transparent_hugepage/shmem_enabled set to always, d2h (GPU→CPU) pageable cudaMemcpy bandwidth drops by ~50% under gVisor/nvproxy compared to runc. Pinned transfers are unaffected.

The root cause is that the NVIDIA driver's pin_user_pages() must split THP compound pages backing the Sentry's memfd on every ~4 MB DMA chunk. Adding MADV_NOHUGEPAGE on DMA-pinned regions in rmAllocOSDescriptor fully recovers the lost bandwidth.
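The fix can be sketched from userspace. The snippet below is not the driver patch itself (that lives in the kernel-side rmAllocOSDescriptor path); it is a minimal Python illustration of the same MADV_NOHUGEPAGE hint, applied to an ordinary anonymous shared mapping standing in for a slice of the Sentry's memfd. MADV_NOHUGEPAGE is a Linux-only constant (exposed by Python 3.8+), hence the guard.

```python
import mmap

CHUNK = 4 * 1024 * 1024  # ~4 MB, the per-pin DMA chunk size described above

# Anonymous shared mapping standing in for (a slice of) the Sentry's
# tmpfs-backed memfd; with shmem_enabled=always the kernel may back a
# region like this with 2 MB compound pages.
buf = mmap.mmap(-1, CHUNK)

# The proposed fix: advise the kernel not to back this region with huge
# pages, so pin_user_pages() never has to split a compound page here.
# MADV_NOHUGEPAGE is Linux-only, hence the hasattr() guard.
if hasattr(mmap, "MADV_NOHUGEPAGE"):
    buf.madvise(mmap.MADV_NOHUGEPAGE)
```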

Environment

  • GPU: NVIDIA A10G (PCIe Gen4 x16)
  • Driver: 580.95.05
  • gVisor: d2403f2
  • Host: Amazon Linux 2023, kernel 6.1.x
  • Instance: g5.12xlarge (also reproduced on g5.48xlarge)

Reproduction

Step 1: Set the host THP shmem setting

# Check current setting:
cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
# If it shows [never] or [advise], set it to 'always':
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled

Step 2: Run a pageable d2h transfer benchmark

Any PyTorch script that does cpu_tensor.copy_(gpu_tensor) with non-pinned memory will demonstrate the issue. Minimal example:

import torch

device = torch.device("cuda:0")
size = 256 * 1024 * 1024  # 256 MB
n = size // 4

gpu = torch.empty(n, dtype=torch.float32, device=device)
cpu = torch.empty(n, dtype=torch.float32)  # pageable (not pinned)

# Warmup
for _ in range(5):
    cpu.copy_(gpu)
torch.cuda.synchronize()

# Timed
t0 = torch.cuda.Event(enable_timing=True)
t1 = torch.cuda.Event(enable_timing=True)
t0.record()
for _ in range(20):
    cpu.copy_(gpu)
t1.record()
torch.cuda.synchronize()

elapsed = t0.elapsed_time(t1) / 1000.0  # elapsed_time() returns milliseconds
bw = (size * 20) / elapsed / (1024**3)  # bytes/s -> GiB/s
print(f"d2h pageable: {bw:.2f} GB/s")

Step 3: Compare results

Run the above inside a gVisor container (--runtime=runsc --gpus all) with shmem_enabled set to always vs never:

# Broken: shmem_enabled=always
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~5.1 GB/s

# Working: shmem_enabled=never
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s

# Control: runc is unaffected by either setting
docker run --rm --runtime=nvidia --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s (both settings)

Measured results (A10G, 256 MB d2h pageable, 20 repeats)

| Runtime | shmem_enabled | d2h pageable | d2h pinned | h2d pageable |
|---------|---------------|--------------|------------|--------------|
| runc    | never         | 10.79 GB/s   | 12.30 GB/s | 12.22 GB/s   |
| runc    | always        | 10.79 GB/s   | 12.30 GB/s | 12.22 GB/s   |
| gVisor  | never         | 10.69 GB/s   | 12.30 GB/s | 12.17 GB/s   |
| gVisor  | always        | 5.11 GB/s    | 12.30 GB/s | 12.17 GB/s   |

Key observations:

  • Only gVisor + shmem_enabled=always is affected.
  • Only d2h (GPU→CPU) pageable transfers are affected.
  • Pinned transfers are identical across all configurations.
  • h2d (CPU→GPU) pageable is unaffected.
  • runc is unaffected because the NVIDIA driver pins the application process's pages directly; these are private anonymous mappings, not shmem/tmpfs-backed, so shmem_enabled does not apply to them.
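The ~2× figure in the title follows directly from the table:

```python
# Measured d2h pageable bandwidth under gVisor (GB/s), from the table above.
bw_never = 10.69   # shmem_enabled=never
bw_always = 5.11   # shmem_enabled=always

degradation = bw_never / bw_always
print(f"{degradation:.2f}x slower with shmem_enabled=always")  # ≈ 2.09x
```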

Root cause

Background

gVisor stores all application memory in a single memfd (runsc-memory), created in runsc/boot/loader.go:createMemoryFile(). This memfd is backed by the kernel's tmpfs/shmem subsystem. The shmem_enabled sysctl controls whether tmpfs uses transparent huge pages (THP).

When shmem_enabled=always, the kernel backs this memfd with 2 MB compound pages. This is beneficial for general workloads — gVisor takes a host page fault for every page of application memory, and THP reduces the fault count by 512× (2 MB / 4 KB).
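The fault-count arithmetic behind that 512× figure:

```python
BASE_PAGE = 4 * 1024          # 4 KB base page
HUGE_PAGE = 2 * 1024 * 1024   # 2 MB THP compound page

# gVisor takes one host page fault per faulted page, so THP cuts the
# fault count by the page-size ratio.
print(HUGE_PAGE // BASE_PAGE)  # → 512

# E.g. faulting in 1 GB of application memory:
gib = 1024 ** 3
print(gib // BASE_PAGE, "faults at 4 KB vs", gib // HUGE_PAGE, "at 2 MB")
```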

This is documented in #8734 and is the recommended configuration for production gVisor deployments with GPU workloads.

The problem

As far as I understand, when CUDA performs a pageable d2h cudaMemcpy, the NVIDIA driver uses a chunked DMA pattern:

  1. Pin ~4 MB of host pages via pin_user_pages(FOLL_WRITE|FOLL_LONGTERM)
  2. DMA from GPU VRAM to the pinned host pages
  3. Unpin the pages
  4. Repeat for the next chunk (~256 times for 1 GB)
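The pin/unpin cycle count implied by the chunking above, for a 1 GB transfer (a sketch; the exact chunk size is driver-internal):

```python
CHUNK = 4 * 1024 * 1024       # ~4 MB pin granularity (step 1 above)
HUGE_PAGE = 2 * 1024 * 1024   # 2 MB THP compound page
TRANSFER = 1024 ** 3          # 1 GB

chunks = TRANSFER // CHUNK
print(chunks)  # → 256 pin/unpin cycles per GB

# Each chunk covers at least CHUNK // HUGE_PAGE compound pages (one more
# if the chunk is not 2 MB-aligned), so every cycle can force multiple
# compound-page splits.
print(chunks * (CHUNK // HUGE_PAGE))  # → 512 compound pages (aligned case)
```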

In nvproxy, rmAllocOSDescriptor() (pkg/sentry/devices/nvproxy/frontend.go) maps the application's memory into the Sentry's address space and passes the Sentry-space pointer to the host NVIDIA driver. The driver then calls pin_user_pages() on the Sentry's pages.

When those pages are backed by 2 MB THP compound pages, pin_user_pages() must split the compound page into 4 KB pages before it can pin a sub-range. This splitting is expensive:

  • Takes the anon_vma lock (contended across CPUs)
  • Walks and updates all page table entries pointing to the compound page
  • Triggers TLB shootdown IPIs to all CPUs that may have cached the mapping
  • Modifies page reference counts and flags on each sub-page

This splitting is repeated for every chunk of the pageable transfer (768 times for a 1 GB copy, consistent with each of the ~256 chunks spanning up to three 2 MB compound pages when unaligned), causing the ~2× bandwidth regression.
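To confirm whether the Sentry's memfd is actually THP-backed while the benchmark runs, the ShmemHugePages counter in /proc/meminfo is a quick check. A Linux-only sketch (run on the host, not in the container); on other platforms it returns None:

```python
import os

def shmem_huge_kb():
    """Return the ShmemHugePages counter from /proc/meminfo, in kB (Linux-only)."""
    if not os.path.exists("/proc/meminfo"):
        return None  # non-Linux host
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("ShmemHugePages:"):
                return int(line.split()[1])
    return None

# Nonzero while THP backs shmem (e.g. the runsc-memory memfd); compare
# readings with shmem_enabled=always vs never during the benchmark.
print(shmem_huge_kb())
```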

Does this make sense?
