…etflix#1300, ADR-0157)

Netflix upstream issue Netflix#1300 reports that CUDA-accelerated VMAF in an init/preallocate/fetch/close loop leaks GPU memory monotonically across cycles. Verified via ASan on `test_cuda_pic_preallocation`: 30 799 bytes leaked across 28 allocations, with four distinct framework-side paths.

Root causes + fixes:

1. `VmafCudaState` heap allocation had no public free. `vmaf_cuda_state_init(&cu_state, cfg)` mallocs the struct; `vmaf_cuda_import_state` copies by value without taking ownership; `vmaf_close → vmaf_cuda_release` frees the internals and memsets the struct, but never frees the heap allocation itself. Fix: new public symbol `vmaf_cuda_state_free(VmafCudaState *cu_state)` in `libvmaf/include/libvmaf/libvmaf_cuda.h`, implemented as a NULL-safe `free()` wrapper in `libvmaf/src/cuda/common.c`. Mirrors the SYCL backend's `vmaf_sycl_state_free()` ownership pattern.
2. `CudaFunctions` driver function-pointer table was never freed. `vmaf_cuda_state_init` calls `cuda_load_functions()`, which dlopens libcuda.so and allocates the table; `vmaf_cuda_release` destroyed the stream + context but never called `cuda_free_functions()`. Fix: save the `CudaFunctions *` pointer before the existing `memset`, then call `cuda_free_functions(&f)` via the saved local. Order matters: memset first so `cu_state->f` is zeroed in the caller's struct, then free via the saved local.
3. `vmaf_ring_buffer_close()` destroyed the `pthread_mutex` while locked, which is POSIX UB. Fix: unlock → destroy → free(pic) → free(rb).
4. Adjacent cold-start leak in `init_with_primary_context()`: if `cuStreamCreateWithPriority` failed after `cuDevicePrimaryCtxRetain` succeeded, the retained primary context was never released. Fix: release on the `fail_after_pop` path. Also added an outer failure unwind in `vmaf_cuda_state_init` so a botched inner init frees both `c` and `c->f` cleanly.
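Items 2 and 3 are both teardown-ordering fixes. The sketch below is a minimal model of the two sequences, not the code in `libvmaf/src/cuda/`: the `Mock*` types and `mock_*` helpers are hypothetical stand-ins for the `CudaFunctions` table, `VmafCudaState`, and the ring buffer, and the idea that close does work under the lock before unlocking is an assumption made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: none of these names exist in libvmaf. */
typedef struct MockFunctions { int loaded; } MockFunctions;
typedef struct MockState { MockFunctions *f; int stream; int ctx; } MockState;
typedef struct MockRing { pthread_mutex_t lock; int *pic; } MockRing;

static int mock_tables_live = 0; /* function tables not yet freed */

/* Models cuda_load_functions(): allocates the table. */
static int mock_load_functions(MockFunctions **out)
{
    MockFunctions *f = calloc(1, sizeof(*f));
    if (!f) return -1;
    f->loaded = 1;
    mock_tables_live++;
    *out = f;
    return 0;
}

/* Models cuda_free_functions(): frees the table, clears the pointer. */
static void mock_free_functions(MockFunctions **f)
{
    if (!f || !*f) return;
    free(*f);
    *f = NULL;
    mock_tables_live--;
}

/* Item 2: save the table pointer, zero the struct, free via the local. */
static void mock_release(MockState *s)
{
    if (!s) return;
    MockFunctions *f = s->f;   /* 1. save before the memset wipes it */
    memset(s, 0, sizeof(*s));  /* 2. caller's struct keeps no stale pointer */
    mock_free_functions(&f);   /* 3. free through the saved local */
}

/* Hypothetical constructor for the ring-buffer model. */
static MockRing *mock_ring_init(size_t n)
{
    MockRing *rb = calloc(1, sizeof(*rb));
    if (!rb) return NULL;
    rb->pic = calloc(n, sizeof(*rb->pic));
    if (!rb->pic || pthread_mutex_init(&rb->lock, NULL) != 0) {
        free(rb->pic);
        free(rb);
        return NULL;
    }
    return rb;
}

/* Item 3: unlock -> destroy -> free(pic) -> free(rb). */
static int mock_ring_close(MockRing *rb)
{
    if (!rb) return -1;
    pthread_mutex_lock(&rb->lock);
    /* assumed: per-picture teardown would run here, under the lock */
    pthread_mutex_unlock(&rb->lock);            /* never destroy a locked mutex */
    int err = pthread_mutex_destroy(&rb->lock);
    free(rb->pic);
    free(rb);
    return err;
}
```

Freeing through a saved local keeps the zeroed struct and the freed table from ever being reachable through the same pointer, which is why the memset can safely come before the free.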
Test-side cleanup (separate from framework fix):

- `test_cuda_pic_preallocation.c`: every test that calls `vmaf_cuda_state_init()` now calls `vmaf_cuda_state_free()` after `vmaf_close()`; every test that calls `vmaf_model_load()` now calls `vmaf_model_destroy()` after `vmaf_close()`.
- `test_cuda_buffer_alloc_oom.c`: swapped internal `free(cu_state)` for the new public `vmaf_cuda_state_free()`.

New GPU-gated reducer `libvmaf/test/test_cuda_preallocation_leak.c`: runs 10 cycles of init / preallocate / fetch 10 pictures / close with full cleanup on each cycle. Registered under the `enable_cuda` guard in `libvmaf/test/meson.build`. SKIPs cleanly when no CUDA device is visible.

**Visible behaviour change** for callers: every CUDA caller must now call `vmaf_cuda_state_free(cu_state)` AFTER `vmaf_close(vmaf)`. Callers relying on informal `free(cu_state)` will double-free; flagged under `### Added` + `### Fixed` in CHANGELOG.

Preserves ADR-0122 / ADR-0123 null-guards on public entries + ADR-0156 CHECK_CUDA_GOTO cleanup paths verbatim; all four ADRs compose cleanly.

Verification:

- `meson test -C libvmaf/build-cuda` → 40/40 pass (was 39; + new reducer).
- `meson test -C build` (CPU-only) → 35/35 pass.
- `ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak` → 183 bytes leaked in 4 allocations, all inside `libcuda.so.1`'s cuInit cache (persists for process lifetime, does NOT grow per cycle). **Zero `libvmaf/src/*` frames** in the leak traces.
- `clang-tidy -p build-cuda --quiet <5 touched files>` → exit 0.
- CI-equivalent `clang-tidy -p build --quiet libvmaf/include/libvmaf/libvmaf_cuda.h` (only CI-visible file post-exclusion) → exit 0.

Closes backlog item T1-7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris added a commit that referenced this pull request Apr 24, 2026
…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md` plus the audit-flagged docs gaps that surfaced during scope-checking.

T7-1: Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
Closes Issue #20. New tracked file `docs/state.md` (Open / Recently closed / Confirmed not-affected / Deferred) is the canonical in-tree bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR update on every bug close / open / rule-out. PR template carries a checkbox; opt-out `no state delta: REASON` for PRs without bug-status impact. ADRs cover decisions, this file covers bug status.

T7-2: MCP server release artifact channel (ADR-0166)
Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release attachment with Sigstore keyless signing + PEP 740 attestations + SLSA L3 provenance. Wired as new `mcp-build` / `mcp-sign` / `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher binding required (operational note in the ADR).

T7-3: Self-hosted GPU runner enrollment guide
New docs/development/self-hosted-runner.md pins the registration steps so an operator can stand a runner up in ~10 minutes. Per popup 2026-04-25 the user's local dev box (CUDA + Intel) will be the first runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`, `avx512`) reserved for future job targeting.

ADR-0167: Path-mapped doc-drift enforcement
Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing checks because the workflow was advisory + accepted ADR additions as "docs were touched"). Two layers:

- Layer 1 (in-session): new project hook `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an informational `NOTICE` when a user-discoverable surface is touched but no matching `docs/<topic>/` file is touched. Mirrors the `auto-snapshot-warn.sh` pattern: informational stderr, no block.
- Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check` promoted from advisory (`continue-on-error: true`) to blocking + rewritten with a path-mapped surface→docs check. ADR additions no longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):

- docs/api/gpu.md: vmaf_cuda_state_free() public API documented (was: missing entirely, despite the symbol shipping in PR #94).
- docs/api/index.md: -EAGAIN error code added to the error semantics list (PR #91 / ADR-0154).
- docs/api/index.md: vmaf_read_pictures monotonic-index requirement documented (PR #88 / ADR-0152).
- docs/metrics/features.md: SSIMULACRA 2 backends matrix updated (was: "scalar only. SIMD / GPU paths are follow-up workstreams" 36 hours after PRs #98/#99/#100 landed all three SIMD ports).
- docs/metrics/features.md: PSNR-HVS backends updated for AVX2 (PR #96) + NEON (PR #97).
- docs/metrics/features.md: float_ms_ssim <176×176 minimum documented (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Addresses Netflix upstream issue #1300 ("CUDA-VMAF Memory Leak (preallocation method) in libvmaf", OPEN since 2024).
Root causes (four framework-side leaks confirmed via ASan):
1. `VmafCudaState` heap allocation had no public free. `vmaf_cuda_state_init()` mallocs the struct; `vmaf_cuda_import_state()` copies by value; `vmaf_close()` → `vmaf_cuda_release()` frees the internals and memsets the struct, but never frees the heap allocation itself. No public API to free it.
2. `CudaFunctions` driver table never freed. `vmaf_cuda_state_init` dlopens libcuda.so via `cuda_load_functions()`; `vmaf_cuda_release` never called `cuda_free_functions()`.
3. `vmaf_ring_buffer_close()` destroyed a locked `pthread_mutex`, which is POSIX UB.
4. `init_with_primary_context()`: retained primary context not released on `cuStreamCreateWithPriority` failure.

Fixes:

1. `vmaf_cuda_state_free(VmafCudaState *cu_state)`: NULL-safe `free()` wrapper. Must be called AFTER `vmaf_close()`. Mirrors the SYCL backend's `vmaf_sycl_state_free()` pattern.
2. `vmaf_cuda_release` saves `CudaFunctions*` before the existing `memset`, then calls `cuda_free_functions()` via the saved local. Order is load-bearing.
3. `vmaf_ring_buffer_close` now does `unlock → destroy → free(pic) → free(rb)`.
4. `init_with_primary_context` cold-start unwind releases the retained primary context; outer `vmaf_cuda_state_init` failure-unwind frees `c` + `c->f`.

Scope
- `libvmaf/include/libvmaf/libvmaf_cuda.h`: new `vmaf_cuda_state_free()` declaration.
- `libvmaf/src/cuda/common.c`: new implementation + `cuda_free_functions()` call in release + cold-start unwinds.
- `libvmaf/src/cuda/ring_buffer.c`: unlock + destroy before free.
- `libvmaf/test/test_cuda_preallocation_leak.c`: new 10-cycle GPU-gated reducer with full cleanup.
- `libvmaf/test/test_cuda_pic_preallocation.c`, `test_cuda_buffer_alloc_oom.c`: add missing `vmaf_cuda_state_free()` + `vmaf_model_destroy()` after `vmaf_close()`.
- `libvmaf/test/meson.build`: register new reducer under `enable_cuda`.

Test
- `meson test -C libvmaf/build-cuda` → 40/40 pass (was 39; + new reducer).
- `meson test -C build` (CPU-only) → 35/35 pass.
- `ASAN_OPTIONS='detect_leaks=1:leak_check_at_exit=1' build-asan-cuda/test/test_cuda_preallocation_leak` → 183 bytes in 4 allocations, all inside `libcuda.so.1` (cuInit's process-lifetime driver cache; does NOT grow per cycle; verified N=1 vs N=10 produces identical byte counts). Zero `libvmaf/src/*` frames in any leak trace.
- `clang-tidy -p libvmaf/build-cuda --quiet <5 touched files>` → exit 0.
- `clang-tidy -p build --quiet libvmaf/include/libvmaf/libvmaf_cuda.h` (the only CI-visible file post-ADR-0156 exclusion) → exit 0.
- `pre-commit run --files <touched>` → all hooks pass.

Type
- `fix`: bug fix
- `feat`: new public API (`vmaf_cuda_state_free`)

Checklist
- `meson test -C libvmaf/build-cuda` → 40/40, `meson test -C build` → 35/35.
- `### Added` + `### Fixed` in CHANGELOG.

Netflix golden-data gate (ADR-0024)
- `assertAlmostEqual(...)` score in the Netflix golden Python tests.

Cross-backend numerical results
Deep-dive deliverables (ADR-0108)
- `AGENTS.md` invariant note: `libvmaf/src/cuda/AGENTS.md` under "Rebase-sensitive invariants".
- `CHANGELOG.md` "lusoris fork" entry: under `### Added` + `### Fixed`.

Reproducer
Verify the fix is live:
ASan verification:
Issue-matching real-world repro (requires actual video frames):
Migration guide for existing callers
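Every caller that creates a state with `vmaf_cuda_state_init()`, hands it over with `vmaf_cuda_import_state()`, and later calls `vmaf_close(vmaf)` must now add the free: a call to `vmaf_cuda_state_free(cu_state)` once `vmaf_close(vmaf)` has returned.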
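As a rough illustration of the required call order, the sketch below models the ownership rules with hypothetical `mock_*` functions and `Mock*` types. These are stand-ins invented for the example, not the libvmaf API; in real code the corresponding calls are `vmaf_cuda_state_init()`, `vmaf_cuda_import_state()`, `vmaf_close()`, and `vmaf_cuda_state_free()`. The mock only counts heap structs, so it shows the leak accounting and not the CUDA teardown itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins that model the ownership rules only;
 * the mock_* names are not part of libvmaf. */
typedef struct MockCudaState { int stream; int ctx; } MockCudaState;
typedef struct MockVmaf { MockCudaState cu; } MockVmaf;

static int mock_live_states = 0; /* heap structs not yet freed */

/* Models vmaf_cuda_state_init(): heap-allocates the state for the caller. */
static int mock_cuda_state_init(MockCudaState **out)
{
    MockCudaState *s = calloc(1, sizeof(*s));
    if (!s) return -1;
    s->stream = 1;
    s->ctx = 1;
    mock_live_states++;
    *out = s;
    return 0;
}

/* Models vmaf_cuda_import_state(): copies by value, takes no ownership. */
static void mock_cuda_import_state(MockVmaf *vmaf, const MockCudaState *s)
{
    vmaf->cu = *s;
}

/* Models vmaf_close(): tears down the internals of the copy only. */
static void mock_close(MockVmaf *vmaf)
{
    memset(&vmaf->cu, 0, sizeof(vmaf->cu));
}

/* Models vmaf_cuda_state_free(): NULL-safe release of the heap struct. */
static void mock_cuda_state_free(MockCudaState *s)
{
    if (!s) return;
    free(s);
    mock_live_states--;
}

/* Runs n init / import / close / free cycles; returns structs still live. */
static int mock_run_cycles(int n)
{
    for (int i = 0; i < n; i++) {
        MockVmaf vmaf;
        MockCudaState *cu_state = NULL;
        if (mock_cuda_state_init(&cu_state) != 0) return -1;
        mock_cuda_import_state(&vmaf, cu_state);
        mock_close(&vmaf);               /* 1. close first */
        mock_cuda_state_free(cu_state);  /* 2. then free the heap struct */
    }
    return mock_live_states;
}
```

With the free in place, `mock_run_cycles(1)` and `mock_run_cycles(10)` both leave zero structs outstanding, which mirrors the N=1 vs N=10 comparison used for the real reducer.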
Order is load-bearing: `vmaf_close` first (tears down the CUDA stream + ctx via the struct copy), then `vmaf_cuda_state_free` (releases the heap allocation). Reversing is a use-after-free. Callers who were using `free(cu_state)` directly must switch, because mixing in-tree `free()` with the new API will double-free.

Known follow-ups
- `libavfilter/vf_libvmaf.c` should be audited for the same cleanup sequence during the next `ffmpeg-patches` refresh; noted as follow-up in ADR-0157.
- `test_cuda_preallocation_leak` real CI coverage instead of SKIP on the driver-probe step.

🤖 Generated with Claude Code