feat(ci): nightly bisect-model-quality + sticky tracker (closes #4) by lusoris · Pull Request #41 · lusoris/vmaf

lusoris · 2026-04-18T11:50:11Z

Summary

Wires the existing vmaf-train bisect-model-quality tool into a nightly GitHub Actions workflow (.github/workflows/nightly-bisect.yml) that runs against a deterministic synthetic placeholder cache (committed under ai/testdata/bisect/, regenerable via ai/scripts/build_bisect_cache.py).
Always edits a single sticky comment on tracker issue #40 with the verdict and a per-step PLCC/SROCC/RMSE table; fails CI (red) when first_bad_index is not None.
Closes #4 (Wire bisect-model-quality into a nightly CI workflow). The real DMOS-aligned cache and canonical model timeline ordering swap in via a follow-up; see ADR-0109 and Research-0001 for the swap path.

Type

feat — new feature
ci — tooling / infra

Checklist

Commits follow Conventional Commits (feat(ci): …).
No SIMD/GPU code touched — /cross-backend-diff not applicable.
No feature extractor SIMD/GPU twins touched.
New .py files carry the Copyright 2026 Lusoris and Claude (Anthropic) header.
Not a breaking change.

Netflix golden-data gate (ADR-0024)

No assertAlmostEqual(...) scores in Netflix golden Python tests modified.

Cross-backend numerical results

Not applicable — no SIMD/GPU code touched.

Performance

Not applicable — net cost is one nightly job (Python-only, ~15 min budget).

Deep-dive deliverables (ADR-0108)

Research digest — docs/research/0001-bisect-model-quality-cache.md.
Decision matrix — captured in ADR-0109 ## Alternatives considered (6 options, 5 alternatives).
AGENTS.md invariant note — added byte-stable-cache invariant to ai/AGENTS.md.
Reproducer / smoke-test command — see below.
CHANGELOG.md "lusoris fork" entry — bullet under ### Added (Tiny AI nightly bisect).
Rebase note — entry 0011 — Nightly bisect-model-quality + fixture cache in docs/rebase-notes.md.
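The byte-stable-cache invariant noted above can be illustrated with a small sketch. It assumes the generator derives every byte from a fixed seed and a fixed serialisation layout; `build_payload` and `check` are hypothetical names, and the payload here is a stand-in for the real features.parquet and ONNX fixtures, not what build_bisect_cache.py actually writes.

```python
import random
import struct
from pathlib import Path


def build_payload(seed: int = 0, n: int = 64) -> bytes:
    """Generate a fixed-seed byte payload (stand-in for the real fixture files)."""
    rng = random.Random(seed)  # local RNG, so global random state cannot leak in
    values = [rng.random() for _ in range(n)]
    # An explicit little-endian layout keeps the bytes independent of the host platform.
    return struct.pack(f"<{n}d", *values)


def check(committed: Path, seed: int = 0) -> bool:
    """Return True when regenerating reproduces the committed file byte for byte."""
    return committed.read_bytes() == build_payload(seed)
```

Comparing raw bytes rather than parsed contents is the stricter choice: any change in seed, library serialisation, or field ordering shows up as a failed check instead of silently shifting the nightly baseline.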

Reproducer

# Local smoke (matches what the workflow does):
python ai/scripts/build_bisect_cache.py --check
vmaf-train bisect-model-quality \
    ai/testdata/bisect/models/model_*.onnx \
    --features ai/testdata/bisect/features.parquet \
    --min-plcc 0.85 --input-name input \
    --json /tmp/bisect-result.json --fail-on-first-bad
# Expected: "no regression in this range"; first_bad_index = None; exit 0.

# Verify the synthetic-regression case (existing pytest):
pytest ai/tests/test_bisect_model_quality.py::test_bisect_localises_first_bad -v
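The "first_bad_index = None; exit 0" expectation above can be read as the result of a bisection over an ordered model timeline. The sketch below is one plausible shape of that logic, assuming quality is monotone (good models followed by bad ones); `first_bad_index`, `plcc_at`, and `exit_code` are illustrative names, not the tool's real API.

```python
from typing import Callable, Optional


def first_bad_index(
    n_models: int, plcc_at: Callable[[int], float], min_plcc: float
) -> Optional[int]:
    """Binary-search the earliest model whose PLCC falls below min_plcc."""
    if n_models == 0 or plcc_at(n_models - 1) >= min_plcc:
        return None  # newest model passes: no regression in this range
    lo, hi = 0, n_models - 1  # invariant: model hi is known to be bad
    while lo < hi:
        mid = (lo + hi) // 2
        if plcc_at(mid) >= min_plcc:
            lo = mid + 1  # mid is good, so the first bad model is later
        else:
            hi = mid      # mid is bad, so the first bad model is mid or earlier
    return lo


def exit_code(index: Optional[int]) -> int:
    """Mirror the --fail-on-first-bad behaviour: non-zero only on a regression."""
    return 0 if index is None else 1
```

With a regression-free timeline such as the committed fixture, every probe passes the threshold and the function returns None, which maps to exit 0.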

Known follow-ups

Replace the synthetic placeholder cache with a real DMOS-aligned subset (NFLX-public is the likely first target). The workflow file + sticky-comment helper stay unchanged at swap time.
Optionally retire the synthetic cache once a real one lands, or keep both as separate jobs (synthetic = wiring smoke, real = quality regression detection). Lean toward keeping both.

🤖 Generated with Claude Code

Wire the existing vmaf-train bisect-model-quality tool into a nightly GitHub Actions workflow that runs against a deterministic synthetic placeholder cache and posts the verdict to sticky tracker issue #40.

Why synthetic-placeholder, not real DMOS-aligned: the issue itself flagged three prerequisites (frozen NFLX/LIVE/KonIQ subset, DMOS labels, canonical model timeline ordering) that are independently blocked by dataset access + label collection + frozen libvmaf build. Shipping the wiring now against a fixed-seed synthetic cache unblocks the AC; the real cache swaps in via a follow-up without touching the workflow file or the sticky-comment helper.

Deliverables (per ADR-0108):
- ai/scripts/build_bisect_cache.py: deterministic cache generator (--check asserts byte-equality)
- ai/testdata/bisect/: ~16 KB committed fixture (features.parquet + 8 ONNX models)
- .github/workflows/nightly-bisect.yml: cron 04:37 UTC, fails red on first_bad_index, always uploads JSON artifact, always edits #40
- scripts/ci/post-bisect-comment.py: sticky-comment helper (finds prior bot comment by header, PATCHes in place)
- docs/ai/bisect-model-quality.md: user-facing reference
- docs/adr/0109-...md: design choices + alternatives
- docs/research/0001-...md: cache-shape research digest
- CHANGELOG.md, docs/rebase-notes.md (entry 0011), ai/AGENTS.md

Synthetic-regression "deliberately bad ONNX trips alert" AC is satisfied by the existing pytest test_bisect_localises_first_bad. The committed timeline is intentionally regression-free so a green nightly means the wiring still works.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
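The sticky-comment behaviour described in this commit (find the prior bot comment by its header, then PATCH it in place) might look roughly like the sketch below. It only plans the request rather than sending it. The `MARKER` header and the function names are hypothetical; the endpoint paths follow the GitHub REST API for issue comments, and `comments` is assumed to be the parsed JSON list from the issue's comments endpoint.

```python
from typing import Optional

# Hypothetical header line used to recognise the bot's own comment.
MARKER = "<!-- nightly-bisect-sticky -->"


def find_sticky(comments: list[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the first comment whose body starts with the marker."""
    for comment in comments:
        if comment.get("body", "").startswith(marker):
            return comment["id"]
    return None


def plan_update(comments, repo, issue, report, marker=MARKER):
    """Return (method, endpoint, body): PATCH the sticky comment, or POST a new one."""
    body = f"{marker}\n{report}"
    comment_id = find_sticky(comments, marker)
    if comment_id is None:
        # First run: no sticky comment exists yet, so create it.
        return ("POST", f"/repos/{repo}/issues/{issue}/comments", body)
    # Later runs: overwrite the same comment so the tracker issue stays tidy.
    return ("PATCH", f"/repos/{repo}/issues/comments/{comment_id}", body)
```

Keying on a header in the body, rather than on the comment author alone, lets other bots or maintainers comment on the tracker issue without being overwritten.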

black --check noticed two minor whitespace deltas in the new files; ruff S603/S607 flagged the gh subprocess call in the sticky-comment helper.

Fix:
- black-reformat ai/scripts/build_bisect_cache.py and scripts/ci/post-bisect-comment.py
- bind ["gh", *args] to a local + suppress the S603 warning with a comment explaining the trust boundary (caller-controlled args, gh resolved from $PATH per GitHub-Actions convention)

Cache --check still passes byte-identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
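The lint fix described in this commit can be pictured as follows. This is a sketch of the pattern only, not the helper's actual code: `run_gh` and its injectable `runner` parameter are illustrative, added here so the function can be exercised without the GitHub CLI installed.

```python
import subprocess
from typing import Callable, Sequence


def run_gh(args: Sequence[str], runner: Callable[..., object] = subprocess.run) -> object:
    """Invoke the GitHub CLI with caller-controlled arguments."""
    # Bind the argv list to a local so the trust boundary is explicit:
    # args come from our own workflow, never from untrusted input, and
    # gh is resolved from $PATH as is conventional on GitHub Actions runners.
    cmd = ["gh", *args]
    return runner(cmd, check=True, capture_output=True, text=True)  # noqa: S603
```

Passing a list (no shell=True) means the arguments are never interpreted by a shell, which is the main reason the suppression is defensible here.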

Lusoris and others added 2 commits April 18, 2026 13:49

lusoris merged commit 6cd4fb0 into master Apr 18, 2026
22 of 23 checks passed

lusoris deleted the feat/nightly-bisect-model-quality branch April 18, 2026 12:01

github-actions Bot mentioned this pull request Apr 18, 2026

chore: release master #1

Open

lusoris merged 2 commits into master from feat/nightly-bisect-model-quality