Skip to content

perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83

Merged
lusoris merged 1 commit intomasterfrom
perf/thread-pool-job-pool-t3-6
Apr 23, 2026
Merged

perf(thread_pool): recycle job slots + inline data buffer (ADR-0147)#83
lusoris merged 1 commit intomasterfrom
perf/thread-pool-job-pool-t3-6

Conversation

@lusoris
Copy link
Copy Markdown
Owner

@lusoris lusoris commented Apr 23, 2026

Summary

Port the thread-pool portion of Netflix upstream PR #1464 (closed, not merged) into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the enqueue hot path:

  • VmafThreadPool::free_jobs linked list (protected by the existing queue.lock) recycles VmafThreadPoolJob slots between enqueue calls instead of allocating a fresh one every time.
  • char inline_data[JOB_INLINE_DATA_SIZE = 64] at the tail of the job struct. Payloads ≤ 64 bytes are copied into it and job->data = job->inline_data, avoiding a second malloc on the common caller path (main extractor dispatch, MCP frame events).
  • Cleanup path distinguishes inline vs heap payloads via job->data != job->inline_data.

Adapted to the fork's void (*func)(void *data, void **thread_data) signature and per-worker VmafThreadPoolWorker data path, which Netflix upstream lacks. No API change; callers unmodified.

~1.8–2.6× enqueue throughput on a 500 000-job, 4-thread micro-benchmark (median 1.20 M jobs/sec → 2.20 M jobs/sec). Netflix-golden-pair VMAF score bit-identical between --threads 4 and serial, and between VMAF_CPU_MASK=0 and =255 under --threads 4.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half was already landed via fork commit 81fcd42e (with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Type

  • perf — performance improvement
  • port — cherry-pick from upstream Netflix/vmaf (thread-pool portion only)

Checklist

  • Commits follow Conventional Commits.
  • make format && make lint green locally.
  • Unit tests pass: meson test -C build → 32/32.
  • Touched SIMD/threaded path — ran scalar-vs-SIMD bit-exact check under --threads 4, diff exit 0.
  • No new .c / .h / .cpp files added.
  • Not a breaking change.

Netflix golden-data gate (ADR-0024)

  • I did not modify any assertAlmostEqual(...) score in the Netflix golden Python tests.

Cross-backend numerical results

VMAF_CPU_MASK=0 vs =255 (scalar vs SIMD), --threads 4:
  VMAF + VIF + ADM + MOTION + SSIM   scalar-vs-simd = 0-ULP  (bit-identical)

--threads 1 vs --threads 4 (serial vs threaded), VMAF_CPU_MASK=255:
  All numeric values match per frame; attribute emission order may
  differ (feature_collector insertion-order artefact, unchanged by
  this PR). VMAF score: 83.856284 in both (frame 0).

Performance

Micro-benchmark (500 000 jobs, 4 worker threads, 4-byte payload):

BEFORE (master)
  500000 jobs in 0.3904s = 1,280,787 jobs/sec
  500000 jobs in 0.5330s =   938,156 jobs/sec
  500000 jobs in 0.4156s = 1,203,079 jobs/sec

AFTER (this PR)
  500000 jobs in 0.2289s = 2,184,026 jobs/sec
  500000 jobs in 0.2643s = 1,892,018 jobs/sec
  500000 jobs in 0.1598s = 3,129,623 jobs/sec

=> ~1.8-2.6x enqueue throughput.

Deep-dive deliverables (ADR-0108)

  • Research digest — no digest needed: narrow upstream port, no novel algorithm.
  • Decision matrix — captured in ADR-0147 §Alternatives considered.
  • AGENTS.md invariant note — added to libvmaf/AGENTS.md (thread-pool recycling + inline data buffer invariant, with the load-bearing job->data != job->inline_data guard).
  • Reproducer / smoke-test command — pasted below under "Reproducer".
  • CHANGELOG.md "lusoris fork" entry — bullet added under ### Changed in CHANGELOG.md.
  • Rebase note — entry 0040 in docs/rebase-notes.md.

Reproducer

ninja -C build && meson test -C build

# Scalar vs SIMD under threads, Netflix golden pair:
for mask in 0 255; do
  VMAF_CPU_MASK=$mask ./build/tools/vmaf \
    --reference python/test/resource/yuv/src01_hrc00_576x324.yuv \
    --distorted python/test/resource/yuv/src01_hrc01_576x324.yuv \
    --width 576 --height 324 --pixel_format 420 --bitdepth 8 \
    -m version=vmaf_v0.6.1 --threads 4 -o /tmp/vmaf_t_$mask.xml
done
diff <(grep -v fyi /tmp/vmaf_t_0.xml) <(grep -v fyi /tmp/vmaf_t_255.xml)
# expect exit 0 (bit-identical scalar vs SIMD, threaded)

# Micro-benchmark (see PR "Performance" section for full numbers):
# See /tmp/bench_tp.c in the PR (500k-job 4-thread job enqueue).

Known follow-ups

  • Netflix PR Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations Netflix/vmaf#1464 bundles eleven other optimizations (AVX2 PSNR, ADM bit-shift micro-opts, VIF epsilon removal, predict.c stack-alloc, convolution stride hoisting, feature-collector capacity 8→512, comprehensive test suite, etc.). Those either conflict with fork-local work already landed (ADR-0138/0139/0142 bit-exactness, T7-5 predict.c refactor, fork's feature_collector extensions) or are already covered by the fork's own commits (PSNR SIMD at 81fcd42e). Not porting the rest.

🤖 Generated with Claude Code

Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed)
into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the
enqueue hot path.

Mechanics:
- New `VmafThreadPool::free_jobs` list (protected by queue.lock)
  recycles `VmafThreadPoolJob` slots between enqueue calls instead
  of malloc/free on every job.
- New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the
  job struct. Payloads <= 64 bytes are copied into it and
  `job->data = job->inline_data`, avoiding a second malloc on the
  common caller path (main extractor dispatch, MCP frame events).
- Split cleanup: `_clear_data` distinguishes inline vs heap via
  `job->data != job->inline_data`; `_recycle` pushes onto free list;
  `_destroy` is kept for destructor-only use.
- Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy`
  walks and frees the recycle list after the workers exit.

Adapted to the fork's `void (*func)(void *data, void **thread_data)`
signature and `VmafThreadPoolWorker` per-worker-data path, which
Netflix upstream lacks. No API change; callers unmodified.

Verification:
- meson test -C build: 32/32 pass (threaded framework tests included).
- Netflix golden pair (src01_hrc00/01_576x324, full VMAF with
  vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and
  `--threads 4` (attribute order may differ — insertion ordering in
  feature_collector, unchanged by this PR).
- Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0
  (scalar) and =255 (SIMD). diff exit 0.
- clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings,
  no NOLINT.
- Micro-benchmark (500k jobs, 4 threads, int payload):
    BEFORE (master):  ~1.20 M jobs/sec median
    AFTER (this PR):  ~2.20 M jobs/sec median
  => ~1.8-2.6x enqueue throughput win.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of
T3-6 was already landed via fork commit 81fcd42 (with additional
AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Deliverables (ADR-0108):
 1. research digest: no digest needed - narrow upstream port, no novel algorithm
 2. decision matrix: ADR-0147 §Alternatives considered
 3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling
    + inline data buffer invariant with inline_data guard)
 4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD
    diff exit 0
 5. CHANGELOG: fork entry under Changed
 6. rebase-notes: entry 0040

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris merged commit 8fb2fe1 into master Apr 23, 2026
45 checks passed
@lusoris lusoris deleted the perf/thread-pool-job-pool-t3-6 branch April 23, 2026 23:34
@github-actions github-actions Bot mentioned this pull request Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant