perf(thread_pool): recycle job slots + inline data buffer (ADR-0147) by lusoris · Pull Request #83 · lusoris/vmaf

lusoris · 2026-04-23T23:19:14Z

Summary

Port the thread-pool portion of Netflix upstream PR #1464 (closed, not merged) into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the enqueue hot path:

VmafThreadPool::free_jobs linked list (protected by the existing queue.lock) recycles VmafThreadPoolJob slots between enqueue calls instead of allocating a fresh one every time.
char inline_data[JOB_INLINE_DATA_SIZE = 64] at the tail of the job struct. Payloads ≤ 64 bytes are copied into it and job->data = job->inline_data, avoiding a second malloc on the common caller path (main extractor dispatch, MCP frame events).
Cleanup path distinguishes inline vs heap payloads via job->data != job->inline_data.

Adapted to the fork's void (*func)(void *data, void **thread_data) signature and per-worker VmafThreadPoolWorker data path, which Netflix upstream lacks. No API change; callers unmodified.

~1.8–2.6× enqueue throughput on a 500 000-job, 4-thread micro-benchmark (median 1.20 M jobs/sec → 2.20 M jobs/sec). Netflix-golden-pair VMAF score bit-identical between --threads 4 and serial, and between VMAF_CPU_MASK=0 and =255 under --threads 4.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half was already landed via fork commit 81fcd42e (with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Type

perf — performance improvement
port — cherry-pick from upstream Netflix/vmaf (thread-pool portion only)

Checklist

Commits follow Conventional Commits.
make format && make lint green locally.
Unit tests pass: meson test -C build → 32/32.
Touched SIMD/threaded path — ran scalar-vs-SIMD bit-exact check under --threads 4, diff exit 0.
No new .c / .h / .cpp files added.
Not a breaking change.

Netflix golden-data gate (ADR-0024)

I did not modify any assertAlmostEqual(...) score in the Netflix golden Python tests.

Cross-backend numerical results

VMAF_CPU_MASK=0 vs =255 (scalar vs SIMD), --threads 4:
  VMAF + VIF + ADM + MOTION + SSIM   scalar-vs-simd = 0-ULP  (bit-identical)

--threads 1 vs --threads 4 (serial vs threaded), VMAF_CPU_MASK=255:
  All numeric values match per frame; attribute emission order may
  differ (feature_collector insertion-order artefact, unchanged by
  this PR). VMAF score: 83.856284 in both (frame 0).

Performance

Micro-benchmark (500 000 jobs, 4 worker threads, 4-byte payload):

BEFORE (master)
  500000 jobs in 0.3904s = 1,280,787 jobs/sec
  500000 jobs in 0.5330s =   938,156 jobs/sec
  500000 jobs in 0.4156s = 1,203,079 jobs/sec

AFTER (this PR)
  500000 jobs in 0.2289s = 2,184,026 jobs/sec
  500000 jobs in 0.2643s = 1,892,018 jobs/sec
  500000 jobs in 0.1598s = 3,129,623 jobs/sec

=> ~1.8-2.6x enqueue throughput.

Deep-dive deliverables (ADR-0108)

Research digest — no digest needed: narrow upstream port, no novel algorithm.
Decision matrix — captured in ADR-0147 §Alternatives considered.
AGENTS.md invariant note — added to libvmaf/AGENTS.md (thread-pool recycling + inline data buffer invariant, with the load-bearing job->data != job->inline_data guard).
Reproducer / smoke-test command — pasted below under "Reproducer".
CHANGELOG.md "lusoris fork" entry — bullet added under ### Changed in CHANGELOG.md.
Rebase note — entry 0040 in docs/rebase-notes.md.

Reproducer

ninja -C build && meson test -C build

# Scalar vs SIMD under threads, Netflix golden pair:
for mask in 0 255; do
  VMAF_CPU_MASK=$mask ./build/tools/vmaf \
    --reference python/test/resource/yuv/src01_hrc00_576x324.yuv \
    --distorted python/test/resource/yuv/src01_hrc01_576x324.yuv \
    --width 576 --height 324 --pixel_format 420 --bitdepth 8 \
    -m version=vmaf_v0.6.1 --threads 4 -o /tmp/vmaf_t_$mask.xml
done
diff <(grep -v fyi /tmp/vmaf_t_0.xml) <(grep -v fyi /tmp/vmaf_t_255.xml)
# expect exit 0 (bit-identical scalar vs SIMD, threaded)

# Micro-benchmark (see PR "Performance" section for full numbers):
# See /tmp/bench_tp.c in the PR (500k-job 4-thread job enqueue).

Known follow-ups

Netflix PR Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations Netflix/vmaf#1464 bundles eleven other optimizations (AVX2 PSNR, ADM bit-shift micro-opts, VIF epsilon removal, predict.c stack-alloc, convolution stride hoisting, feature-collector capacity 8→512, comprehensive test suite, etc.). Those either conflict with fork-local work already landed (ADR-0138/0139/0142 bit-exactness, T7-5 predict.c refactor, fork's feature_collector extensions) or are already covered by the fork's own commits (PSNR SIMD at 81fcd42e). Not porting the rest.

🤖 Generated with Claude Code

Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed) into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the enqueue hot path. Mechanics: - New `VmafThreadPool::free_jobs` list (protected by queue.lock) recycles `VmafThreadPoolJob` slots between enqueue calls instead of malloc/free on every job. - New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the job struct. Payloads <= 64 bytes are copied into it and `job->data = job->inline_data`, avoiding a second malloc on the common caller path (main extractor dispatch, MCP frame events). - Split cleanup: `_clear_data` distinguishes inline vs heap via `job->data != job->inline_data`; `_recycle` pushes onto free list; `_destroy` is kept for destructor-only use. - Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy` walks and frees the recycle list after the workers exit. Adapted to the fork's `void (*func)(void *data, void **thread_data)` signature and `VmafThreadPoolWorker` per-worker-data path, which Netflix upstream lacks. No API change; callers unmodified. Verification: - meson test -C build: 32/32 pass (threaded framework tests included). - Netflix golden pair (src01_hrc00/01_576x324, full VMAF with vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and `--threads 4` (attribute order may differ — insertion ordering in feature_collector, unchanged by this PR). - Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0 (scalar) and =255 (SIMD). diff exit 0. - clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings, no NOLINT. - Micro-benchmark (500k jobs, 4 threads, int payload): BEFORE (master): ~1.20 M jobs/sec median AFTER (this PR): ~2.20 M jobs/sec median => ~1.8-2.6x enqueue throughput win. Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of T3-6 was already landed via fork commit 81fcd42 (with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage). Deliverables (ADR-0108): 1. research digest: no digest needed - narrow upstream port, no novel algorithm 2. decision matrix: ADR-0147 §Alternatives considered 3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling + inline data buffer invariant with inline_data guard) 4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD diff exit 0 5. CHANGELOG: fork entry under Changed 6. rebase-notes: entry 0040 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

lusoris merged commit 8fb2fe1 into master Apr 23, 2026
45 checks passed

lusoris deleted the perf/thread-pool-job-pool-t3-6 branch April 23, 2026 23:34

github-actions Bot mentioned this pull request Apr 23, 2026

chore: release master #1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83

perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83
lusoris merged 1 commit intomasterfrom
perf/thread-pool-job-pool-t3-6

lusoris commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lusoris commented Apr 23, 2026

Summary

Type

Checklist

Netflix golden-data gate (ADR-0024)

Cross-backend numerical results

Performance

Deep-dive deliverables (ADR-0108)

Reproducer

Known follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant