Description
When the host has transparent_hugepage/shmem_enabled set to always, d2h (GPU→CPU) pageable cudaMemcpy bandwidth drops by ~50% under gVisor/nvproxy compared to runc. Pinned transfers are unaffected.
The root cause is that the NVIDIA driver's pin_user_pages() must split THP compound pages backing the Sentry's memfd on every ~4 MB DMA chunk. Adding MADV_NOHUGEPAGE on DMA-pinned regions in rmAllocOSDescriptor fully recovers the lost bandwidth.
Environment
- GPU: NVIDIA A10G (PCIe Gen4 x16)
- Driver: 580.95.05
- gVisor: d2403f2
- Host: Amazon Linux 2023, kernel 6.1.x
- Instance: g5.12xlarge (also reproduced on g5.48xlarge)
Reproduction
Step 1: Set the host THP shmem setting
```sh
# Check current setting:
cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
# If it shows [never] or [advise], set it to 'always':
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
```
Step 2: Run a pageable d2h transfer benchmark
Any PyTorch script that does cpu_tensor.copy_(gpu_tensor) with non-pinned memory will demonstrate the issue. Minimal example:
```python
import torch

device = torch.device("cuda:0")
size = 256 * 1024 * 1024  # 256 MB
n = size // 4
gpu = torch.empty(n, dtype=torch.float32, device=device)
cpu = torch.empty(n, dtype=torch.float32)  # pageable (not pinned)

# Warmup
for _ in range(5):
    cpu.copy_(gpu)
torch.cuda.synchronize()

# Timed
t0 = torch.cuda.Event(enable_timing=True)
t1 = torch.cuda.Event(enable_timing=True)
t0.record()
for _ in range(20):
    cpu.copy_(gpu)
t1.record()
torch.cuda.synchronize()
elapsed = t0.elapsed_time(t1) / 1000.0  # ms -> s
bw = (size * 20) / elapsed / (1024**3)  # GiB/s
print(f"d2h pageable: {bw:.2f} GB/s")
```
Step 3: Compare results
Run the above inside a gVisor container (--runtime=runsc --gpus all) with shmem_enabled set to always vs never:
```sh
# Broken: shmem_enabled=always
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~5.1 GB/s

# Working: shmem_enabled=never
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s

# Control: runc is unaffected by either setting
docker run --rm --runtime=nvidia --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s (both settings)
```
Measured results (A10G, 256 MB d2h pageable, 20 repeats)
| Runtime | shmem_enabled | d2h pageable | d2h pinned | h2d pageable |
|---|---|---|---|---|
| runc | never | 10.79 GB/s | 12.30 GB/s | 12.22 GB/s |
| runc | always | 10.79 GB/s | 12.30 GB/s | 12.22 GB/s |
| gVisor | never | 10.69 GB/s | 12.30 GB/s | 12.17 GB/s |
| gVisor | always | 5.11 GB/s | 12.30 GB/s | 12.17 GB/s |
Key observations:
- Only gVisor + `shmem_enabled=always` is affected.
- Only d2h (GPU→CPU) pageable transfers are affected.
- Pinned transfers are identical across all configurations.
- h2d (CPU→GPU) pageable is unaffected.
- runc is unaffected because the NVIDIA driver pins the application process's pages directly; these are private anonymous mappings, not shmem/tmpfs-backed.
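One way to check the anonymous-vs-shmem distinction empirically is `/proc/self/smaps`: private anonymous THP is accounted under `AnonHugePages`, while shmem/tmpfs THP (the backing of gVisor's memfd) shows up under `ShmemPmdMapped`. A Linux-only sketch using the kernel's smaps field names:

```python
import sys

def thp_bytes(field: str) -> int:
    """Sum a hugepage accounting field (e.g. 'AnonHugePages' or
    'ShmemPmdMapped') across all mappings of the current process."""
    total = 0
    with open("/proc/self/smaps") as f:
        for line in f:
            if line.startswith(field + ":"):
                total += int(line.split()[1]) * 1024  # smaps values are in kB
    return total

if sys.platform == "linux":
    print("AnonHugePages:", thp_bytes("AnonHugePages"))
    print("ShmemPmdMapped:", thp_bytes("ShmemPmdMapped"))
```

Running this inside the Sentry's address space (versus a runc container) would show the difference in which counter grows.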
Root cause
Background
gVisor stores all application memory in a single memfd (runsc-memory), created in runsc/boot/loader.go:createMemoryFile(). This memfd is backed by the kernel's tmpfs/shmem subsystem. The shmem_enabled sysctl controls whether tmpfs uses transparent huge pages (THP).
When shmem_enabled=always, the kernel backs this memfd with 2 MB compound pages. This is beneficial for general workloads — gVisor takes a host page fault for every page of application memory, and THP reduces the fault count by 512× (2 MB / 4 KB).
This is documented in #8734 and is the recommended configuration for production gVisor deployments with GPU workloads.
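For completeness, the active mode is the bracketed token in the sysfs file; a tiny helper (hypothetical, not part of gVisor) to read it:

```python
def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) token from a THP sysfs file, e.g.
    'always within_size advise [never] deny force' -> 'never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode marked")

def read_shmem_enabled() -> str:
    with open("/sys/kernel/mm/transparent_hugepage/shmem_enabled") as f:
        return active_thp_mode(f.read())
```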
The problem
As far as I understand, when CUDA performs a pageable d2h cudaMemcpy, the NVIDIA driver uses a chunked DMA pattern:
- Pin ~4 MB of host pages via `pin_user_pages(FOLL_WRITE | FOLL_LONGTERM)`
- DMA from GPU VRAM to the pinned host pages
- Unpin the pages
- Repeat for the next chunk (~256 times for 1 GB)
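The arithmetic behind these counts can be sketched as follows (the ~4 MB chunk size is the assumption stated above; 2 MB is the THP compound page size):

```python
CHUNK = 4 * 1024 * 1024            # ~4 MB DMA staging chunk (assumed)
HUGE = 2 * 1024 * 1024             # 2 MB THP compound page
transfer = 1024 * 1024 * 1024      # 1 GB pageable copy

chunks = transfer // CHUNK                 # pin/unpin cycles per transfer
huge_pages = transfer // HUGE              # compound pages covering 1 GB
max_splits = chunks * (CHUNK // HUGE + 1)  # an unaligned chunk can span one extra compound page
print(chunks, huge_pages, max_splits)      # → 256 512 768
```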
In nvproxy, rmAllocOSDescriptor() (pkg/sentry/devices/nvproxy/frontend.go) maps the application's memory into the Sentry's address space and passes the Sentry-space pointer to the host NVIDIA driver. The driver then calls pin_user_pages() on the Sentry's pages.
When those pages are backed by 2 MB THP compound pages, pin_user_pages() must split the compound page into 4 KB pages before it can pin a sub-range. This splitting is expensive:
- Takes the `anon_vma` lock (contended across CPUs)
- Walks and updates all page table entries pointing to the compound page
- Triggers TLB shootdown IPIs to all CPUs that may have cached the mapping
- Modifies page reference counts and flags on each sub-page
This splitting happens on every chunk of the pageable transfer (up to 768 splits for 1 GB, since each unaligned ~4 MB chunk can touch three 2 MB compound pages), causing the ~2× bandwidth regression.
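The proposed fix applies `MADV_NOHUGEPAGE` to the mapping before the driver pins it. The host-side effect can be illustrated from userspace with Python's stdlib `mmap` (an illustration of the madvise call only, not the actual nvproxy/rmAllocOSDescriptor change):

```python
import mmap

length = 4 * 1024 * 1024
buf = mmap.mmap(-1, length)  # anonymous shared mapping (shmem-backed on Linux)

# Ask the kernel not to back this range with transparent huge pages, so a
# later pin_user_pages() on any sub-range never has to split compound pages.
if hasattr(mmap, "MADV_NOHUGEPAGE"):  # Linux, Python 3.8+
    buf.madvise(mmap.MADV_NOHUGEPAGE)

buf.close()
```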
Does this make sense?