
nvproxy: 2x performance degradation in GPU→CPU pageable transfer bandwidth when shmem_enabled=always #12804

@luiscape

Description


When the host has transparent_hugepage/shmem_enabled set to always, d2h (GPU→CPU) pageable cudaMemcpy bandwidth drops by ~50% under gVisor/nvproxy compared to runc. Pinned transfers are unaffected.

The root cause is that the NVIDIA driver's pin_user_pages() must split THP compound pages backing the Sentry's memfd on every ~4 MB DMA chunk. Adding MADV_NOHUGEPAGE on DMA-pinned regions in rmAllocOSDescriptor fully recovers the lost bandwidth.
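The fix can be sketched from userspace. The snippet below is not the driver patch itself (that lives in the kernel-side rmAllocOSDescriptor path); it is a minimal Python illustration of the same MADV_NOHUGEPAGE hint, applied to an ordinary anonymous shared mapping standing in for a slice of the Sentry's memfd. MADV_NOHUGEPAGE is a Linux-only constant (exposed by Python 3.8+), hence the guard.

```python
import mmap

CHUNK = 4 * 1024 * 1024  # ~4 MB, the per-pin DMA chunk size described above

# Anonymous shared mapping standing in for (a slice of) the Sentry's
# tmpfs-backed memfd; with shmem_enabled=always the kernel may back a
# region like this with 2 MB compound pages.
buf = mmap.mmap(-1, CHUNK)

# The proposed fix: advise the kernel not to back this region with huge
# pages, so pin_user_pages() never has to split a compound page here.
# MADV_NOHUGEPAGE is Linux-only, hence the hasattr() guard.
if hasattr(mmap, "MADV_NOHUGEPAGE"):
    buf.madvise(mmap.MADV_NOHUGEPAGE)
```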

Environment

  • GPU: NVIDIA A10G (PCIe Gen4 x16)
  • Driver: 580.95.05
  • gVisor: d2403f2
  • Host: Amazon Linux 2023, kernel 6.1.x
  • Instance: g5.12xlarge (also reproduced on g5.48xlarge)

Reproduction

Step 1: Set the host THP shmem setting

# Check current setting:
cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
# If it shows [never] or [advise], set it to 'always':
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled

Step 2: Run a pageable d2h transfer benchmark

Any PyTorch script that does cpu_tensor.copy_(gpu_tensor) with non-pinned memory will demonstrate the issue. Minimal example:

import torch

device = torch.device("cuda:0")
size = 256 * 1024 * 1024  # 256 MB
n = size // 4

gpu = torch.empty(n, dtype=torch.float32, device=device)
cpu = torch.empty(n, dtype=torch.float32)  # pageable (not pinned)

# Warmup
for _ in range(5):
    cpu.copy_(gpu)
torch.cuda.synchronize()

# Timed
t0 = torch.cuda.Event(enable_timing=True)
t1 = torch.cuda.Event(enable_timing=True)
t0.record()
for _ in range(20):
    cpu.copy_(gpu)
t1.record()
torch.cuda.synchronize()

elapsed = t0.elapsed_time(t1) / 1000.0  # elapsed_time() returns milliseconds
bw = (size * 20) / elapsed / (1024**3)  # bytes/s -> GiB/s
print(f"d2h pageable: {bw:.2f} GB/s")

Step 3: Compare results

Run the above inside a gVisor container (--runtime=runsc --gpus all) with shmem_enabled set to always vs never:

# Broken: shmem_enabled=always
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~5.1 GB/s

# Working: shmem_enabled=never
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/shmem_enabled
docker run --rm --runtime=runsc --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s

# Control: runc is unaffected by either setting
docker run --rm --runtime=nvidia --gpus all <image> python3 bench.py
# → d2h pageable: ~10.7 GB/s (both settings)

Measured results (A10G, 256 MB d2h pageable, 20 repeats)

| Runtime | shmem_enabled | d2h pageable | d2h pinned | h2d pageable |
|---------|---------------|--------------|------------|--------------|
| runc    | never         | 10.79 GB/s   | 12.30 GB/s | 12.22 GB/s   |
| runc    | always        | 10.79 GB/s   | 12.30 GB/s | 12.22 GB/s   |
| gVisor  | never         | 10.69 GB/s   | 12.30 GB/s | 12.17 GB/s   |
| gVisor  | always        | 5.11 GB/s    | 12.30 GB/s | 12.17 GB/s   |

Key observations:

  • Only gVisor + shmem_enabled=always is affected.
  • Only d2h (GPU→CPU) pageable transfers are affected.
  • Pinned transfers are identical across all configurations.
  • h2d (CPU→GPU) pageable is unaffected.
  • runc is unaffected because the NVIDIA driver pins the application process's pages directly; these are private anonymous mappings, not shmem/tmpfs-backed, so shmem_enabled does not apply to them.
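The ~2× figure in the title follows directly from the table:

```python
# Measured d2h pageable bandwidth under gVisor (GB/s), from the table above.
bw_never = 10.69   # shmem_enabled=never
bw_always = 5.11   # shmem_enabled=always

degradation = bw_never / bw_always
print(f"{degradation:.2f}x slower with shmem_enabled=always")  # ≈ 2.09x
```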

Root cause

Background

gVisor stores all application memory in a single memfd (runsc-memory), created in runsc/boot/loader.go:createMemoryFile(). This memfd is backed by the kernel's tmpfs/shmem subsystem. The shmem_enabled sysctl controls whether tmpfs uses transparent huge pages (THP).

When shmem_enabled=always, the kernel backs this memfd with 2 MB compound pages. This is beneficial for general workloads — gVisor takes a host page fault for every page of application memory, and THP reduces the fault count by 512× (2 MB / 4 KB).
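The fault-count arithmetic behind that 512× figure:

```python
BASE_PAGE = 4 * 1024          # 4 KB base page
HUGE_PAGE = 2 * 1024 * 1024   # 2 MB THP compound page

# gVisor takes one host page fault per faulted page, so THP cuts the
# fault count by the page-size ratio.
print(HUGE_PAGE // BASE_PAGE)  # → 512

# E.g. faulting in 1 GB of application memory:
gib = 1024 ** 3
print(gib // BASE_PAGE, "faults at 4 KB vs", gib // HUGE_PAGE, "at 2 MB")
```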

This is documented in #8734 and is the recommended configuration for production gVisor deployments with GPU workloads.

The problem

As far as I understand, when CUDA performs a pageable d2h cudaMemcpy, the NVIDIA driver uses a chunked DMA pattern:

  1. Pin ~4 MB of host pages via pin_user_pages(FOLL_WRITE|FOLL_LONGTERM)
  2. DMA from GPU VRAM to the pinned host pages
  3. Unpin the pages
  4. Repeat for the next chunk (~256 times for 1 GB)
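The pin/unpin cycle count implied by the chunking above, for a 1 GB transfer (a sketch; the exact chunk size is driver-internal):

```python
CHUNK = 4 * 1024 * 1024       # ~4 MB pin granularity (step 1 above)
HUGE_PAGE = 2 * 1024 * 1024   # 2 MB THP compound page
TRANSFER = 1024 ** 3          # 1 GB

chunks = TRANSFER // CHUNK
print(chunks)  # → 256 pin/unpin cycles per GB

# Each chunk covers at least CHUNK // HUGE_PAGE compound pages (one more
# if the chunk is not 2 MB-aligned), so every cycle can force multiple
# compound-page splits.
print(chunks * (CHUNK // HUGE_PAGE))  # → 512 compound pages (aligned case)
```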

In nvproxy, rmAllocOSDescriptor() (pkg/sentry/devices/nvproxy/frontend.go) maps the application's memory into the Sentry's address space and passes the Sentry-space pointer to the host NVIDIA driver. The driver then calls pin_user_pages() on the Sentry's pages.

When those pages are backed by 2 MB THP compound pages, pin_user_pages() must split the compound page into 4 KB pages before it can pin a sub-range. This splitting is expensive:

  • Takes the anon_vma lock (contended across CPUs)
  • Walks and updates all page table entries pointing to the compound page
  • Triggers TLB shootdown IPIs to all CPUs that may have cached the mapping
  • Modifies page reference counts and flags on each sub-page

This splitting is repeated for every chunk of the pageable transfer (768 times for a 1 GB copy, consistent with each of the ~256 chunks spanning up to three 2 MB compound pages when unaligned), causing the ~2× bandwidth regression.
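To confirm whether the Sentry's memfd is actually THP-backed while the benchmark runs, the ShmemHugePages counter in /proc/meminfo is a quick check. A Linux-only sketch (run on the host, not in the container); on other platforms it returns None:

```python
import os

def shmem_huge_kb():
    """Return the ShmemHugePages counter from /proc/meminfo, in kB (Linux-only)."""
    if not os.path.exists("/proc/meminfo"):
        return None  # non-Linux host
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("ShmemHugePages:"):
                return int(line.split()[1])
    return None

# Nonzero while THP backs shmem (e.g. the runsc-memory memfd); compare
# readings with shmem_enabled=always vs never during the benchmark.
print(shmem_huge_kb())
```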

Does this make sense?
