Skip to content

fix(cuda): preallocation memory leak + new vmaf_cuda_state_free API (Netflix#1300, ADR-0157)#94

Merged
lusoris merged 1 commit intomasterfrom
fix/cuda-preallocation-leak-netflix-1300
Apr 24, 2026
Merged

fix(cuda): preallocation memory leak + new vmaf_cuda_state_free API (Netflix#1300, ADR-0157)#94
lusoris merged 1 commit intomasterfrom
fix/cuda-preallocation-leak-netflix-1300

Conversation

@lusoris
Copy link
Copy Markdown
Owner

@lusoris lusoris commented Apr 24, 2026

Summary

Addresses Netflix upstream issue #1300 ("CUDA-VMAF Memory Leak (preallocation method) in libvmaf", OPEN since 2024).

Root causes (four framework-side leaks confirmed via ASan):

  1. VmafCudaState heap allocation had no public free. vmaf_cuda_state_init() mallocs the struct; vmaf_cuda_import_state() copies-by-value; vmaf_close() → vmaf_cuda_release() frees the internals + memsets, but never the heap allocation itself. No public API to free it.
  2. CudaFunctions driver table never freed. vmaf_cuda_state_init dlopens libcuda.so via cuda_load_functions(); vmaf_cuda_release never called cuda_free_functions().
  3. vmaf_ring_buffer_close() destroyed a locked pthread_mutex — POSIX UB.
  4. Cold-start leak in init_with_primary_context() — retained primary context not released on cuStreamCreateWithPriority failure.

Fixes:

  • New public API vmaf_cuda_state_free(VmafCudaState *cu_state) — NULL-safe free() wrapper. Must be called AFTER vmaf_close(). Mirrors the SYCL backend's vmaf_sycl_state_free() pattern.
  • vmaf_cuda_release saves CudaFunctions* before the existing memset, then calls cuda_free_functions() via the saved local. Order is load-bearing.
  • vmaf_ring_buffer_close now does unlock → destroy → free(pic) → free(rb).
  • init_with_primary_context cold-start unwind releases the retained primary context; outer vmaf_cuda_state_init failure-unwind frees c + c->f.

Scope

Test

  • meson test -C libvmaf/build-cuda40/40 pass (was 39; + new reducer).
  • meson test -C build (CPU-only) → 35/35 pass.
  • ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak183 bytes in 4 allocations, all inside libcuda.so.1 (cuInit's process-lifetime driver cache — does NOT grow per cycle; verified N=1 vs N=10 produces identical byte counts). Zero libvmaf/src/* frames in any leak trace.
  • clang-tidy -p libvmaf/build-cuda --quiet <5 touched files>exit 0.
  • CI-equivalent clang-tidy -p build --quiet libvmaf/include/libvmaf/libvmaf_cuda.h (the only CI-visible file post-ADR-0156 exclusion) → exit 0.
  • pre-commit run --files <touched> → all hooks pass.

Type

  • fix — bug fix
  • feat — new public API (vmaf_cuda_state_free)

Checklist

  • Commits follow Conventional Commits.
  • Pre-commit hooks green locally.
  • Unit tests pass: meson test -C libvmaf/build-cuda → 40/40, meson test -C build → 35/35.
  • ASan confirms zero framework-side leaked bytes across 10 cycles.
  • Reducer verified fail-without-fix / pass-with-fix.
  • Visible behaviour change flagged under ### Added + ### Fixed in CHANGELOG.
  • ADR-0122 / ADR-0123 / ADR-0156 preserved verbatim.

Netflix golden-data gate (ADR-0024)

  • I did not modify any assertAlmostEqual(...) score in the Netflix golden Python tests.
  • CUDA ownership-API change only; CPU numerical paths untouched.

Cross-backend numerical results

N/A — memory-management refactor; no CUDA kernel arithmetic changed.
Netflix golden CPU pair bit-identical (35/35 CPU tests pass pre- and post-PR).

Deep-dive deliverables (ADR-0108)

  • Research digest — no digest needed: upstream issue links directly from the ADR; direct fix.
  • Decision matrix — ADR-0157 §Alternatives considered (4 options: full fix / surgical / implicit ownership / document).
  • AGENTS.md invariant notelibvmaf/src/cuda/AGENTS.md under "Rebase-sensitive invariants".
  • Reproducer / smoke-test command — pasted below.
  • CHANGELOG.md "lusoris fork" entry — under ### Added + ### Fixed.
  • Rebase note — entry 0050 in docs/rebase-notes.md.

Reproducer

Verify the fix is live:

cd libvmaf
meson setup build-cuda -Denable_cuda=true -Denable_sycl=false
ninja -C build-cuda
meson test -C build-cuda test_cuda_preallocation_leak -v
# Expect: test_cuda_preallocation_leak_reducer passes (10 cycles clean).

ASan verification:

cd libvmaf
meson setup build-asan-cuda -Db_sanitize=address -Denable_cuda=true -Denable_sycl=false --buildtype=debug
ninja -C build-asan-cuda
ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak
# Expect: "SUMMARY: AddressSanitizer: ~183 byte(s) leaked in 4 allocation(s)."
# ALL 4 allocations are inside libcuda.so.1 (cuInit process cache).
# ZERO entries with "in vmaf_" or "src/" frames.

Issue-matching real-world repro (requires actual video frames):

for (batch = 0; batch < N; batch++) {
    VmafContext *vmaf; VmafCudaState *cu_state; VmafModel *model;
    vmaf_init(&vmaf, cfg);
    vmaf_cuda_state_init(&cu_state, cu_cfg);
    vmaf_cuda_import_state(vmaf, cu_state);
    vmaf_model_load(&model, &mcfg, "vmaf_v0.6.1");
    vmaf_use_features_from_model(vmaf, model);
    vmaf_cuda_preallocate_pictures(vmaf, pic_cfg);
    for (i = 0; i < 30; i++) { /* process 30 frames */ }
    vmaf_close(vmaf);
    vmaf_cuda_state_free(cu_state);  /* NEW: required step */
    vmaf_model_destroy(model);
}
/* Pre-fix: GPU memory rises every iteration.
 * Post-fix: GPU memory flat. */

Migration guide for existing callers

Every caller that does:

vmaf_cuda_state_init(&cu_state, cfg);
vmaf_cuda_import_state(vmaf, cu_state);
... use ...
vmaf_close(vmaf);
/* cu_state pointer was leaked before */

must now add the free:

vmaf_cuda_state_init(&cu_state, cfg);
vmaf_cuda_import_state(vmaf, cu_state);
... use ...
vmaf_close(vmaf);
vmaf_cuda_state_free(cu_state);  // NEW: required AFTER vmaf_close()

Order is load-bearing: vmaf_close first (tears down the CUDA stream + ctx via the struct copy), then vmaf_cuda_state_free (releases the heap allocation). Reversing is a use-after-free. Callers who were using free(cu_state) directly must switch — mixing in-tree free() with the new API will double-free.

Known follow-ups

  • ffmpeg filter libavfilter/vf_libvmaf.c should be audited for the same cleanup sequence during the next ffmpeg-patches refresh — noted as follow-up in ADR-0157.
  • Self-hosted GPU runner (backlog T7-3) would give test_cuda_preallocation_leak real CI coverage instead of SKIP on the driver-probe step.

🤖 Generated with Claude Code

…etflix#1300, ADR-0157)

Netflix upstream issue Netflix#1300 reports that CUDA-accelerated VMAF in an
init/preallocate/fetch/close loop leaks GPU memory monotonically across
cycles. Verified via ASan on `test_cuda_pic_preallocation`:
30 799 bytes leaked across 28 allocations, with four distinct
framework-side paths.

Root causes + fixes:

1. `VmafCudaState` heap allocation had no public free.
   `vmaf_cuda_state_init(&cu_state, cfg)` mallocs the struct;
   `vmaf_cuda_import_state` copies-by-value without taking ownership;
   `vmaf_close → vmaf_cuda_release` frees the internals + memset's but
   never the heap allocation itself. Fix: new public symbol
   `vmaf_cuda_state_free(VmafCudaState *cu_state)` in
   `libvmaf/include/libvmaf/libvmaf_cuda.h`, implemented as a NULL-safe
   `free()` wrapper in `libvmaf/src/cuda/common.c`. Mirrors the SYCL
   backend's `vmaf_sycl_state_free()` ownership pattern.

2. `CudaFunctions` driver function-pointer table was never freed.
   `vmaf_cuda_state_init` calls `cuda_load_functions()` which dlopens
   libcuda.so and allocates the table; `vmaf_cuda_release` destroyed
   the stream + context but never called `cuda_free_functions()`. Fix:
   save the `CudaFunctions *` pointer before the existing `memset`,
   then call `cuda_free_functions(&f)` via the saved local. Order
   matters — memset first so `cu_state->f` is zeroed in the caller's
   struct, then free via the saved local.

3. `vmaf_ring_buffer_close()` destroyed the `pthread_mutex` while
   locked — POSIX UB. Fix: unlock → destroy → free(pic) → free(rb).

4. Adjacent cold-start leak in `init_with_primary_context()`: if
   `cuStreamCreateWithPriority` failed after `cuDevicePrimaryCtxRetain`
   succeeded, the retained primary context was never released. Fix:
   release on the `fail_after_pop` path. Also added an outer failure
   unwind in `vmaf_cuda_state_init` so a botched inner init frees both
   `c` and `c->f` cleanly.

Test-side cleanup (separate from framework fix):

- `test_cuda_pic_preallocation.c` — every test that calls
  `vmaf_cuda_state_init()` now calls `vmaf_cuda_state_free()` after
  `vmaf_close()`; every test that calls `vmaf_model_load()` now calls
  `vmaf_model_destroy()` after `vmaf_close()`.
- `test_cuda_buffer_alloc_oom.c` — swapped internal `free(cu_state)`
  for the new public `vmaf_cuda_state_free()`.

New GPU-gated reducer `libvmaf/test/test_cuda_preallocation_leak.c`:
runs 10 cycles of init / preallocate / fetch 10 pictures / close with
full cleanup on each cycle. Registered under `enable_cuda` guard in
`libvmaf/test/meson.build`. SKIPs cleanly when no CUDA device visible.

**Visible behaviour change** for callers: every CUDA caller must now
call `vmaf_cuda_state_free(cu_state)` AFTER `vmaf_close(vmaf)`.
Callers relying on informal `free(cu_state)` will double-free —
flagged under `### Added` + `### Fixed` in CHANGELOG.

Preserves ADR-0122 / ADR-0123 null-guards on public entries + ADR-0156
CHECK_CUDA_GOTO cleanup paths verbatim; all four ADRs compose cleanly.

Verification:
- `meson test -C libvmaf/build-cuda` → 40/40 pass (was 39; + new
  reducer).
- `meson test -C build` (CPU-only) → 35/35 pass.
- `ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1'
  build-asan-cuda/test/test_cuda_preallocation_leak` → 183 bytes
  leaked in 4 allocations, all inside `libcuda.so.1`'s cuInit cache
  (persists for process lifetime, does NOT grow per cycle).
  **Zero `libvmaf/src/*` frames** in the leak traces.
- `clang-tidy -p build-cuda --quiet <5 touched files>` → exit 0.
- CI-equivalent `clang-tidy -p build --quiet
  libvmaf/include/libvmaf/libvmaf_cuda.h` (only CI-visible file
  post-exclusion) → exit 0.

Closes backlog item T1-7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris merged commit fd1b22c into master Apr 24, 2026
46 checks passed
@lusoris lusoris deleted the fix/cuda-preallocation-leak-netflix-1300 branch April 24, 2026 13:58
@github-actions github-actions Bot mentioned this pull request Apr 24, 2026
lusoris added a commit that referenced this pull request Apr 24, 2026
…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md`
plus the audit-flagged docs gaps that surfaced during scope-checking.

T7-1 — Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
  Closes Issue #20. New tracked file `docs/state.md` (Open / Recently
  closed / Confirmed not-affected / Deferred) is the canonical in-tree
  bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR
  update on every bug close / open / rule-out. PR template carries a
  checkbox; opt-out `no state delta: REASON` for PRs without bug-status
  impact. ADRs cover decisions, this file covers bug status.

T7-2 — MCP server release artifact channel (ADR-0166)
  Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release
  attachment with Sigstore keyless signing + PEP 740 attestations + SLSA
  L3 provenance. Wired as new `mcp-build` / `mcp-sign` /
  `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this
  lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher
  binding required (operational note in the ADR).

T7-3 — Self-hosted GPU runner enrollment guide
  New docs/development/self-hosted-runner.md pins the registration
  steps so an operator can stand a runner up in ~10 minutes. Per popup
  2026-04-25 the user's local dev box (CUDA + Intel) will be the first
  runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`,
  `avx512`) reserved for future job targeting.

ADR-0167 — Path-mapped doc-drift enforcement
  Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed
  in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing
  checks because the workflow was advisory + accepted ADR additions
  as "docs were touched"). Two layers:

  Layer 1 (in-session): new project hook
  `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an
  informational `NOTICE` when a user-discoverable surface is touched
  but no matching `docs/<topic>/` file is touched. Mirrors the
  `auto-snapshot-warn.sh` pattern — informational stderr, no block.

  Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check`
  promoted from advisory (`continue-on-error: true`) to blocking +
  rewritten with a path-mapped surface→docs check. ADR additions no
  longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine
  internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):
  - docs/api/gpu.md — vmaf_cuda_state_free() public API documented
    (was: missing entirely, despite the symbol shipping in PR #94).
  - docs/api/index.md — -EAGAIN error code added to the error
    semantics list (PR #91 / ADR-0154).
  - docs/api/index.md — vmaf_read_pictures monotonic-index requirement
    documented (PR #88 / ADR-0152).
  - docs/metrics/features.md — SSIMULACRA 2 backends matrix updated
    (was: "scalar only. SIMD / GPU paths are follow-up workstreams"
    36 hours after PRs #98/#99/#100 landed all three SIMD ports).
  - docs/metrics/features.md — PSNR-HVS backends updated for AVX2
    (PR #96) + NEON (PR #97).
  - docs/metrics/features.md — float_ms_ssim <176×176 minimum
    documented (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant