perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83
Merged
perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83
Conversation
Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed) into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the enqueue hot path. Mechanics: - New `VmafThreadPool::free_jobs` list (protected by queue.lock) recycles `VmafThreadPoolJob` slots between enqueue calls instead of malloc/free on every job. - New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the job struct. Payloads <= 64 bytes are copied into it and `job->data = job->inline_data`, avoiding a second malloc on the common caller path (main extractor dispatch, MCP frame events). - Split cleanup: `_clear_data` distinguishes inline vs heap via `job->data != job->inline_data`; `_recycle` pushes onto free list; `_destroy` is kept for destructor-only use. - Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy` walks and frees the recycle list after the workers exit. Adapted to the fork's `void (*func)(void *data, void **thread_data)` signature and `VmafThreadPoolWorker` per-worker-data path, which Netflix upstream lacks. No API change; callers unmodified. Verification: - meson test -C build: 32/32 pass (threaded framework tests included). - Netflix golden pair (src01_hrc00/01_576x324, full VMAF with vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and `--threads 4` (attribute order may differ — insertion ordering in feature_collector, unchanged by this PR). - Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0 (scalar) and =255 (SIMD). diff exit 0. - clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings, no NOLINT. - Micro-benchmark (500k jobs, 4 threads, int payload): BEFORE (master): ~1.20 M jobs/sec median AFTER (this PR): ~2.20 M jobs/sec median => ~1.8-2.6x enqueue throughput win. Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of T3-6 was already landed via fork commit 81fcd42 (with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage). Deliverables (ADR-0108): 1. research digest: no digest needed - narrow upstream port, no novel algorithm 2. decision matrix: ADR-0147 §Alternatives considered 3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling + inline data buffer invariant with inline_data guard) 4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD diff exit 0 5. CHANGELOG: fork entry under Changed 6. rebase-notes: entry 0040 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Port the thread-pool portion of Netflix upstream PR #1464 (closed, not merged) into
libvmaf/src/thread_pool.c. Eliminatesmalloc/freechurn from the enqueue hot path:VmafThreadPool::free_jobslinked list (protected by the existingqueue.lock) recyclesVmafThreadPoolJobslots between enqueue calls instead of allocating a fresh one every time.char inline_data[JOB_INLINE_DATA_SIZE = 64]at the tail of the job struct. Payloads ≤ 64 bytes are copied into it andjob->data = job->inline_data, avoiding a secondmallocon the common caller path (main extractor dispatch, MCP frame events).job->data != job->inline_data.Adapted to the fork's
void (*func)(void *data, void **thread_data)signature and per-workerVmafThreadPoolWorkerdata path, which Netflix upstream lacks. No API change; callers unmodified.~1.8–2.6× enqueue throughput on a 500 000-job, 4-thread micro-benchmark (median 1.20 M jobs/sec → 2.20 M jobs/sec). Netflix-golden-pair VMAF score bit-identical between
--threads 4and serial, and betweenVMAF_CPU_MASK=0and=255under--threads 4.Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half was already landed via fork commit
81fcd42e(with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage).Type
perf— performance improvementport— cherry-pick from upstream Netflix/vmaf (thread-pool portion only)Checklist
make format && make lintgreen locally.meson test -C build→ 32/32.--threads 4, diff exit 0..c/.h/.cppfiles added.Netflix golden-data gate (ADR-0024)
assertAlmostEqual(...)score in the Netflix golden Python tests.Cross-backend numerical results
Performance
Deep-dive deliverables (ADR-0108)
AGENTS.mdinvariant note — added tolibvmaf/AGENTS.md(thread-pool recycling + inline data buffer invariant, with the load-bearingjob->data != job->inline_dataguard).CHANGELOG.md"lusoris fork" entry — bullet added under### ChangedinCHANGELOG.md.docs/rebase-notes.md.Reproducer
Known follow-ups
81fcd42e). Not porting the rest.🤖 Generated with Claude Code