Skip to content

fix(libvmaf): score_pooled returns -EAGAIN for pending features (Netflix#755, ADR-0154)#91

Merged
lusoris merged 1 commit intomasterfrom
fix/netflix-755-score-pooled-eagain-t1-1
Apr 24, 2026
Merged

fix(libvmaf): score_pooled returns -EAGAIN for pending features (Netflix#755, ADR-0154)#91
lusoris merged 1 commit intomasterfrom
fix/netflix-755-score-pooled-eagain-t1-1

Conversation

@lusoris
Copy link
Copy Markdown
Owner

@lusoris lusoris commented Apr 24, 2026

Summary

Addresses Netflix upstream issue #755 ("VMAF functions exit with error when vmaf_read_pictures and vmaf_score_pooled calls are interleaved", OPEN).

Root cause: integer_motion's motion2/motion3 (and the 5-frame-window variant) write frame N's score retroactively when frame N+1 is extracted — and the tail on flush. So at the instant vmaf_read_pictures(ctx, ref, dist, i) returns, frame i's motion2 is not yet in the feature collector. vmaf_feature_collector_get_score returned -EINVAL in that case — indistinguishable from genuine misuse (bad pointer, out-of-range, unknown feature).

Fix: split the error code.

  • -EAGAIN → feature index is valid but not yet written (transient; retry after one more vmaf_read_pictures or after flush).
  • -EINVAL → genuine misuse (null pointer, i >= capacity, unknown feature name).

Downstream integrations that want per-frame streaming VMAF output can now distinguish transient-retry from abort-now. Upstream maintainer closed the door on the streaming use case in 2020 ("you cannot call vmaf_score_pooled() in a loop"); the fork reopens it with one changed error code and one inline-helper signature tweak — no change to the retroactive-write design.

Scope

  • libvmaf/src/feature/feature_collector.c:vmaf_feature_collector_get_score — the "not written" branch returns -EAGAIN.
  • libvmaf/src/feature/feature_collector.h:vmaf_feature_vector_get_score (inline fast-path) — splits the previous combined return -1 into -EINVAL (null / out-of-range) and -EAGAIN (not written). Added #include <errno.h>.
  • Drive-by (ADR-0141 touched-file rule): rename reserved __VMAF_FEATURE_COLLECTOR_H__ header guard to VMAF_FEATURE_COLLECTOR_INCLUDED.

Supported streaming pattern (documented via the new test):

for (unsigned i = 0; i < N; i++) {
    vmaf_read_pictures(ctx, &ref, &dist, i);
    if (i >= 2) {
        /* 3-frame motion window: after reading i, frame i-2 is complete. */
        vmaf_score_pooled(ctx, model, MEAN, &score, i - 2, i - 2);
    }
}
vmaf_read_pictures(ctx, NULL, NULL, 0);  /* flush → retroactive tail */
for (unsigned i = 0; i < N; i++)
    vmaf_score_pooled(ctx, model, MEAN, &score, i, i);

Test

New libvmaf/test/test_score_pooled_eagain.c with 4 subtests:

  1. returns_eagain_on_pending — submit frames 0 + 1 then score_pooled(1,1) must return -EAGAIN (motion2 for frame 1 not written yet).
  2. streaming_pattern — after read(3), score_pooled(0,0) and score_pooled(1,1) both succeed with finite scores.
  3. after_flush_complete — after flush, every in-range index is poolable.
  4. still_rejects_bad_range — inverted range + NULL score pointer still return -EINVAL (not -EAGAIN).

Reducer confirmed: git stash push libvmaf/src/feature/feature_collector.{c,h} && ninja -C build && meson test -C build test_score_pooled_eagain reports Fail: 1.

Type

  • fix — bug fix

Checklist

  • Commits follow Conventional Commits.
  • Pre-commit hooks green locally.
  • Unit tests pass: meson test -C build → 35/35.
  • clang-tidy -p build libvmaf/src/feature/feature_collector.{c,h} → zero warnings.
  • Reducer verified fail-without-fix / pass-with-fix.
  • Visible behaviour change flagged under ### Changed in CHANGELOG.

Netflix golden-data gate (ADR-0024)

  • I did not modify any assertAlmostEqual(...) score in the Netflix golden Python tests.

Cross-backend numerical results

N/A — error-code split; no numeric path changed.

Deep-dive deliverables (ADR-0108)

  • Research digest — no digest needed: error-code split.
  • Decision matrix — ADR-0154 §Alternatives considered.
  • AGENTS.md invariant note — fork-ahead -EAGAIN/-EINVAL split in rebase-notes 0047.
  • Reproducer / smoke-test command — pasted below under Reproducer.
  • CHANGELOG.md "lusoris fork" entry — under ### Changed.
  • Rebase note — entry 0047 in docs/rebase-notes.md.

Reproducer

ninja -C build
meson test -C build test_score_pooled_eagain
# Expect: 4/4 subtests pass

# Reducer confirmation:
git stash push libvmaf/src/feature/feature_collector.c libvmaf/src/feature/feature_collector.h
ninja -C build && meson test -C build test_score_pooled_eagain
# Expect: Fail: 1 (tests fail without the -EAGAIN split)
git stash pop

Issue-matching standalone reproducer (loop of read + per-frame pool):

read_pictures(0) -> 0
score_pooled(0,0) -> rc=0 score=97.428043
read_pictures(1) -> 0
score_pooled(1,1) -> rc=-11 score=97.428043   # -EAGAIN (was -EINVAL)
read_pictures(2) -> 0
score_pooled(2,2) -> rc=-11 score=97.428043   # -EAGAIN
flush -> 0
final_pool(0..2) -> rc=0 score=97.428043

Known follow-ups

  • A future vmaf_feature_available_up_to(ctx, unsigned *i_max) polling API could build on this. Out of scope for this PR; the -EAGAIN signal is the minimal contract.

🤖 Generated with Claude Code

…lix#755, ADR-0154)

Several feature extractors (integer_motion's motion2/motion3, the
five-frame-window variant) write frame N's score retroactively
when frame N+1 is extracted — and on flush for the tail.
Previously `vmaf_score_pooled(ctx, ..., i, i)` called immediately
after `vmaf_read_pictures(ctx, ref, dist, i)` returned -EINVAL,
indistinguishable from programmer error.

Split the error:
 - -EAGAIN: feature index is valid but not yet written (transient
   — retry after one more read or after flush).
 - -EINVAL: genuine misuse (bad pointer, out-of-range, unknown
   feature name).

Changes:
 - feature_collector.c:vmaf_feature_collector_get_score — "not
   written" branch returns -EAGAIN.
 - feature_collector.h:vmaf_feature_vector_get_score — inline
   fast-path splits the previous combined `return -1` into
   -EINVAL (null/out-of-range) and -EAGAIN (not written). Added
   #include <errno.h>.
 - Drive-by (ADR-0141 touched-file rule): rename reserved header
   guard `__VMAF_FEATURE_COLLECTOR_H__` -> `VMAF_FEATURE_COLLECTOR_INCLUDED`.

Visible behaviour change:
 - Callers that want to distinguish transient from fatal can now
   branch on -EAGAIN and retry.
 - Callers that treat any non-zero as fatal are unchanged.
 - Inline get_score previously returned -1 for both cases; now
   splits the same way.

Test: libvmaf/test/test_score_pooled_eagain.c with 4 subtests:
 1. `returns_eagain_on_pending` — submit frames 0 + 1, then
    score_pooled(1,1) must return -EAGAIN (motion2 for frame 1
    not written yet).
 2. `streaming_pattern` — after read(3), score_pooled(0,0) and
    score_pooled(1,1) must both succeed (retroactive writes
    have filled in).
 3. `after_flush_complete` — after flush, every in-range index
    is poolable.
 4. `still_rejects_bad_range` — inverted range and NULL score
    pointer still return -EINVAL (not -EAGAIN).

Verified as reducer: git stash on feature_collector.{c,h} makes
the new test fail; restoring passes. 35/35 total tests pass.

Closes backlog item T1-1.

Deliverables (ADR-0108):
 1. research digest: no digest needed - error-code semantics fix
 2. decision matrix: ADR-0154 §Alternatives considered
 3. AGENTS.md invariant: fork-ahead -EAGAIN/-EINVAL split
    documented in rebase-notes 0047
 4. reproducer: standalone C program matching Netflix#755 pattern
 5. CHANGELOG: fork entry under Changed
 6. rebase-notes: entry 0047

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris merged commit 9b983e0 into master Apr 24, 2026
46 checks passed
@lusoris lusoris deleted the fix/netflix-755-score-pooled-eagain-t1-1 branch April 24, 2026 10:52
@github-actions github-actions Bot mentioned this pull request Apr 24, 2026
lusoris added a commit that referenced this pull request Apr 24, 2026
…nner guide, doc-drift enforcement (#103)

Bundles the four open Tier-7 long-tail items from `.workingdir2/BACKLOG.md`
plus the audit-flagged docs gaps that surfaced during scope-checking.

T7-1 — Tracked docs/state.md + bug-status hygiene rule (ADR-0165)
  Closes Issue #20. New tracked file `docs/state.md` (Open / Recently
  closed / Confirmed not-affected / Deferred) is the canonical in-tree
  bug-status surface. New CLAUDE.md §12 rule 13 mandates a same-PR
  update on every bug close / open / rule-out. PR template carries a
  checkbox; opt-out `no state delta: REASON` for PRs without bug-status
  impact. ADRs cover decisions, this file covers bug status.

T7-2 — MCP server release artifact channel (ADR-0166)
  Both PyPI (Trusted Publishing via OIDC, no token) and GitHub release
  attachment with Sigstore keyless signing + PEP 740 attestations + SLSA
  L3 provenance. Wired as new `mcp-build` / `mcp-sign` /
  `mcp-publish-pypi` jobs in the existing supply-chain.yml. After this
  lands, `pip install vmaf-mcp` works. One-time PyPI Trusted Publisher
  binding required (operational note in the ADR).

T7-3 — Self-hosted GPU runner enrollment guide
  New docs/development/self-hosted-runner.md pins the registration
  steps so an operator can stand a runner up in ~10 minutes. Per popup
  2026-04-25 the user's local dev box (CUDA + Intel) will be the first
  runner. Fine-grained label scheme (`gpu-cuda`, `gpu-intel`,
  `avx512`) reserved for future job targeting.

ADR-0167 — Path-mapped doc-drift enforcement
  Closes the gap surfaced by the 2026-04-25 docs audit (16 PRs landed
  in 2 days; 2 HIGH + 4 MEDIUM doc gaps slipped past the existing
  checks because the workflow was advisory + accepted ADR additions
  as "docs were touched"). Two layers:

  Layer 1 (in-session): new project hook
  `.claude/hooks/docs-drift-warn.sh` (PostToolUse:Edit|Write) emits an
  informational `NOTICE` when a user-discoverable surface is touched
  but no matching `docs/<topic>/` file is touched. Mirrors the
  `auto-snapshot-warn.sh` pattern — informational stderr, no block.

  Layer 2 (pre-merge): rule-enforcement.yml `doc-substance-check`
  promoted from advisory (`continue-on-error: true`) to blocking +
  rewritten with a path-mapped surface→docs check. ADR additions no
  longer satisfy. Per-PR opt-out `no docs needed: REASON` for genuine
  internal-refactor / bug-fix / test PRs.

Audit fixes (2 HIGH + 4 MEDIUM):
  - docs/api/gpu.md — vmaf_cuda_state_free() public API documented
    (was: missing entirely, despite the symbol shipping in PR #94).
  - docs/api/index.md — -EAGAIN error code added to the error
    semantics list (PR #91 / ADR-0154).
  - docs/api/index.md — vmaf_read_pictures monotonic-index requirement
    documented (PR #88 / ADR-0152).
  - docs/metrics/features.md — SSIMULACRA 2 backends matrix updated
    (was: "scalar only. SIMD / GPU paths are follow-up workstreams"
    36 hours after PRs #98/#99/#100 landed all three SIMD ports).
  - docs/metrics/features.md — PSNR-HVS backends updated for AVX2
    (PR #96) + NEON (PR #97).
  - docs/metrics/features.md — float_ms_ssim <176×176 minimum
    documented (PR #90 / ADR-0153).

ADRs: 0165, 0166, 0167. Closes BACKLOG T7-1, T7-2, T7-3 + Issue #20.

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant