fix(cuda): preallocation memory leak + new vmaf_cuda_state_free API (Netflix#1300, ADR-0157) by lusoris · Pull Request #94 · lusoris/vmaf

lusoris · 2026-04-24T13:43:35Z

Summary

Addresses Netflix upstream issue #1300 ("CUDA-VMAF Memory Leak (preallocation method) in libvmaf", OPEN since 2024).

Root causes (four framework-side leaks confirmed via ASan):

VmafCudaState heap allocation had no public free. vmaf_cuda_state_init() mallocs the struct; vmaf_cuda_import_state() copies-by-value; vmaf_close() → vmaf_cuda_release() frees the internals + memsets, but never the heap allocation itself. No public API to free it.
CudaFunctions driver table never freed. vmaf_cuda_state_init dlopens libcuda.so via cuda_load_functions(); vmaf_cuda_release never called cuda_free_functions().
vmaf_ring_buffer_close() destroyed a locked pthread_mutex — POSIX UB.
Cold-start leak in init_with_primary_context() — retained primary context not released on cuStreamCreateWithPriority failure.

Fixes:

New public API vmaf_cuda_state_free(VmafCudaState *cu_state) — NULL-safe free() wrapper. Must be called AFTER vmaf_close(). Mirrors the SYCL backend's vmaf_sycl_state_free() pattern.
vmaf_cuda_release saves CudaFunctions* before the existing memset, then calls cuda_free_functions() via the saved local. Order is load-bearing.
vmaf_ring_buffer_close now does unlock → destroy → free(pic) → free(rb).
init_with_primary_context cold-start unwind releases the retained primary context; outer vmaf_cuda_state_init failure-unwind frees c + c->f.

Scope

libvmaf/include/libvmaf/libvmaf_cuda.h — new vmaf_cuda_state_free() declaration.
libvmaf/src/cuda/common.c — new implementation + cuda_free_functions() call in release + cold-start unwinds.
libvmaf/src/cuda/ring_buffer.c — unlock + destroy before free.
libvmaf/test/test_cuda_preallocation_leak.c — new 10-cycle GPU-gated reducer with full cleanup.
libvmaf/test/test_cuda_pic_preallocation.c, test_cuda_buffer_alloc_oom.c — add missing vmaf_cuda_state_free() + vmaf_model_destroy() after vmaf_close().
libvmaf/test/meson.build — register new reducer under enable_cuda.

Test

meson test -C libvmaf/build-cuda → 40/40 pass (was 39; + new reducer).
meson test -C build (CPU-only) → 35/35 pass.
ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak → 183 bytes in 4 allocations, all inside libcuda.so.1 (cuInit's process-lifetime driver cache — does NOT grow per cycle; verified N=1 vs N=10 produces identical byte counts). Zero libvmaf/src/* frames in any leak trace.
clang-tidy -p libvmaf/build-cuda --quiet <5 touched files> → exit 0.
CI-equivalent clang-tidy -p build --quiet libvmaf/include/libvmaf/libvmaf_cuda.h (the only CI-visible file post-ADR-0156 exclusion) → exit 0.
pre-commit run --files <touched> → all hooks pass.

Type

fix — bug fix
feat — new public API (vmaf_cuda_state_free)

Checklist

Commits follow Conventional Commits.
Pre-commit hooks green locally.
Unit tests pass: meson test -C libvmaf/build-cuda → 40/40, meson test -C build → 35/35.
ASan confirms zero framework-side leaked bytes across 10 cycles.
Reducer verified fail-without-fix / pass-with-fix.
Visible behaviour change flagged under ### Added + ### Fixed in CHANGELOG.
ADR-0122 / ADR-0123 / ADR-0156 preserved verbatim.

Netflix golden-data gate (ADR-0024)

I did not modify any assertAlmostEqual(...) score in the Netflix golden Python tests.
CUDA ownership-API change only; CPU numerical paths untouched.

Cross-backend numerical results

N/A — memory-management refactor; no CUDA kernel arithmetic changed.
Netflix golden CPU pair bit-identical (35/35 CPU tests pass pre- and post-PR).

Deep-dive deliverables (ADR-0108)

Research digest — no digest needed: upstream issue links directly from the ADR; direct fix.
Decision matrix — ADR-0157 §Alternatives considered (4 options: full fix / surgical / implicit ownership / document).
AGENTS.md invariant note — libvmaf/src/cuda/AGENTS.md under "Rebase-sensitive invariants".
Reproducer / smoke-test command — pasted below.
CHANGELOG.md "lusoris fork" entry — under ### Added + ### Fixed.
Rebase note — entry 0050 in docs/rebase-notes.md.

Reproducer

Verify the fix is live:

cd libvmaf
meson setup build-cuda -Denable_cuda=true -Denable_sycl=false
ninja -C build-cuda
meson test -C build-cuda test_cuda_preallocation_leak -v
# Expect: test_cuda_preallocation_leak_reducer passes (10 cycles clean).

ASan verification:

cd libvmaf
meson setup build-asan-cuda -Db_sanitize=address -Denable_cuda=true -Denable_sycl=false --buildtype=debug
ninja -C build-asan-cuda
ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak
# Expect: "SUMMARY: AddressSanitizer: ~183 byte(s) leaked in 4 allocation(s)."
# ALL 4 allocations are inside libcuda.so.1 (cuInit process cache).
# ZERO entries with "in vmaf_" or "src/" frames.

Issue-matching real-world repro (requires actual video frames):

for (batch = 0; batch < N; batch++) {
    VmafContext *vmaf; VmafCudaState *cu_state; VmafModel *model;
    vmaf_init(&vmaf, cfg);
    vmaf_cuda_state_init(&cu_state, cu_cfg);
    vmaf_cuda_import_state(vmaf, cu_state);
    vmaf_model_load(&model, &mcfg, "vmaf_v0.6.1");
    vmaf_use_features_from_model(vmaf, model);
    vmaf_cuda_preallocate_pictures(vmaf, pic_cfg);
    for (i = 0; i < 30; i++) { /* process 30 frames */ }
    vmaf_close(vmaf);
    vmaf_cuda_state_free(cu_state);  /* NEW: required step */
    vmaf_model_destroy(model);
}
/* Pre-fix: GPU memory rises every iteration.
 * Post-fix: GPU memory flat. */

Migration guide for existing callers

Every caller that does:

vmaf_cuda_state_init(&cu_state, cfg);
vmaf_cuda_import_state(vmaf, cu_state);
... use ...
vmaf_close(vmaf);
/* cu_state pointer was leaked before */

must now add the free:

vmaf_cuda_state_init(&cu_state, cfg);
vmaf_cuda_import_state(vmaf, cu_state);
... use ...
vmaf_close(vmaf);
vmaf_cuda_state_free(cu_state);  // NEW: required AFTER vmaf_close()

Order is load-bearing: vmaf_close first (tears down the CUDA stream + ctx via the struct copy), then vmaf_cuda_state_free (releases the heap allocation). Reversing is a use-after-free. Callers who were using free(cu_state) directly must switch — mixing in-tree free() with the new API will double-free.

Known follow-ups

ffmpeg filter libavfilter/vf_libvmaf.c should be audited for the same cleanup sequence during the next ffmpeg-patches refresh — noted as follow-up in ADR-0157.
Self-hosted GPU runner (backlog T7-3) would give test_cuda_preallocation_leak real CI coverage instead of SKIP on the driver-probe step.

🤖 Generated with Claude Code

…etflix#1300, ADR-0157) Netflix upstream issue Netflix#1300 reports that CUDA-accelerated VMAF in an init/preallocate/fetch/close loop leaks GPU memory monotonically across cycles. Verified via ASan on `test_cuda_pic_preallocation`: 30 799 bytes leaked across 28 allocations, with four distinct framework-side paths. Root causes + fixes: 1. `VmafCudaState` heap allocation had no public free. `vmaf_cuda_state_init(&cu_state, cfg)` mallocs the struct; `vmaf_cuda_import_state` copies-by-value without taking ownership; `vmaf_close → vmaf_cuda_release` frees the internals + memset's but never the heap allocation itself. Fix: new public symbol `vmaf_cuda_state_free(VmafCudaState *cu_state)` in `libvmaf/include/libvmaf/libvmaf_cuda.h`, implemented as a NULL-safe `free()` wrapper in `libvmaf/src/cuda/common.c`. Mirrors the SYCL backend's `vmaf_sycl_state_free()` ownership pattern. 2. `CudaFunctions` driver function-pointer table was never freed. `vmaf_cuda_state_init` calls `cuda_load_functions()` which dlopens libcuda.so and allocates the table; `vmaf_cuda_release` destroyed the stream + context but never called `cuda_free_functions()`. Fix: save the `CudaFunctions *` pointer before the existing `memset`, then call `cuda_free_functions(&f)` via the saved local. Order matters — memset first so `cu_state->f` is zeroed in the caller's struct, then free via the saved local. 3. `vmaf_ring_buffer_close()` destroyed the `pthread_mutex` while locked — POSIX UB. Fix: unlock → destroy → free(pic) → free(rb). 4. Adjacent cold-start leak in `init_with_primary_context()`: if `cuStreamCreateWithPriority` failed after `cuDevicePrimaryCtxRetain` succeeded, the retained primary context was never released. Fix: release on the `fail_after_pop` path. Also added an outer failure unwind in `vmaf_cuda_state_init` so a botched inner init frees both `c` and `c->f` cleanly. Test-side cleanup (separate from framework fix): - `test_cuda_pic_preallocation.c` — every test that calls `vmaf_cuda_state_init()` now calls `vmaf_cuda_state_free()` after `vmaf_close()`; every test that calls `vmaf_model_load()` now calls `vmaf_model_destroy()` after `vmaf_close()`. - `test_cuda_buffer_alloc_oom.c` — swapped internal `free(cu_state)` for the new public `vmaf_cuda_state_free()`. New GPU-gated reducer `libvmaf/test/test_cuda_preallocation_leak.c`: runs 10 cycles of init / preallocate / fetch 10 pictures / close with full cleanup on each cycle. Registered under `enable_cuda` guard in `libvmaf/test/meson.build`. SKIPs cleanly when no CUDA device visible. **Visible behaviour change** for callers: every CUDA caller must now call `vmaf_cuda_state_free(cu_state)` AFTER `vmaf_close(vmaf)`. Callers relying on informal `free(cu_state)` will double-free — flagged under `### Added` + `### Fixed` in CHANGELOG. Preserves ADR-0122 / ADR-0123 null-guards on public entries + ADR-0156 CHECK_CUDA_GOTO cleanup paths verbatim; all four ADRs compose cleanly. Verification: - `meson test -C libvmaf/build-cuda` → 40/40 pass (was 39; + new reducer). - `meson test -C build` (CPU-only) → 35/35 pass. - `ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak` → 183 bytes leaked in 4 allocations, all inside `libcuda.so.1`'s cuInit cache (persists for process lifetime, does NOT grow per cycle). **Zero `libvmaf/src/*` frames** in the leak traces. - `clang-tidy -p build-cuda --quiet <5 touched files>` → exit 0. - CI-equivalent `clang-tidy -p build --quiet libvmaf/include/libvmaf/libvmaf_cuda.h` (only CI-visible file post-exclusion) → exit 0. Closes backlog item T1-7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nner guide, doc-drift enforcement (#103) Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md` plus the audit-flagged docs gaps that surfaced during scope-checking. T7-1 — Tracked docs/state.md + bug-status hygiene rule (ADR-0165) Closes Issue #20. New tracked file `docs/state.md` (Open / Recently closed / Confirmed not-affected / Deferred) is the canonical in-tree bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR update on every bug close / open / rule-out. PR template carries a checkbox; opt-out `no state delta: REASON` for PRs without bug-status impact. ADRs cover decisions, this file covers bug status. T7-2 — MCP server release artifact channel (ADR-0166) Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release attachment with Sigstore keyless signing + PEP 740 attestations + SLSA L3 provenance. Wired as new `mcp-build` / `mcp-sign` / `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher binding required (operational note in the ADR). T7-3 — Self-hosted GPU runner enrollment guide New docs/development/self-hosted-runner.md pins the registration steps so an operator can stand a runner up in ~10 minutes. Per popup 2026-04-25 the user's local dev box (CUDA + Intel) will be the first runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`, `avx512`) reserved for future job targeting. ADR-0167 — Path-mapped doc-drift enforcement Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing checks because the workflow was advisory + accepted ADR additions as "docs were touched"). Two layers: Layer 1 (in-session): new project hook `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an informational `NOTICE` when a user-discoverable surface is touched but no matching `docs/<topic>/` file is touched. Mirrors the `auto-snapshot-warn.sh` pattern — informational stderr, no block. Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check` promoted from advisory (`continue-on-error: true`) to blocking + rewritten with a path-mapped surface→docs check. ADR additions no longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine internal-refactor / bug-fix / test PRs. Audit fixes (2 HIGH + 4 MEDIUM): - docs/api/gpu.md — vmaf_cuda_state_free() public API documented (was: missing entirely, despite the symbol shipping in PR #94). - docs/api/index.md — -EAGAIN error code added to the error semantics list (PR #91 / ADR-0154). - docs/api/index.md — vmaf_read_pictures monotonic-index requirement documented (PR #88 / ADR-0152). - docs/metrics/features.md — SSIMULACRA 2 backends matrix updated (was: "scalar only. SIMD / GPU paths are follow-up workstreams" 36 hours after PRs #98/#99/#100 landed all three SIMD ports). - docs/metrics/features.md — PSNR-HVS backends updated for AVX2 (PR #96) + NEON (PR #97). - docs/metrics/features.md — float_ms_ssim <176×176 minimum documented (PR #90 / ADR-0153). ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20. Co-authored-by: Lusoris <lusoris@pm.me> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lusoris merged commit fd1b22c into master Apr 24, 2026
46 checks passed

lusoris deleted the fix/cuda-preallocation-leak-netflix-1300 branch April 24, 2026 13:58

github-actions Bot mentioned this pull request Apr 24, 2026

chore: release master #1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cuda): preallocation memory leak + new vmaf_cuda_state_free API (Netflix#1300, ADR-0157)#94

fix(cuda): preallocation memory leak + new vmaf_cuda_state_free API (Netflix#1300, ADR-0157)#94
lusoris merged 1 commit intomasterfrom
fix/cuda-preallocation-leak-netflix-1300

lusoris commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lusoris commented Apr 24, 2026

Summary

Scope

Test

Type

Checklist

Netflix golden-data gate (ADR-0024)

Cross-backend numerical results

Deep-dive deliverables (ADR-0108)

Reproducer

Migration guide for existing callers

Known follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant