feat(ci): nightly bisect-model-quality + sticky tracker (closes #4) #41
Merged
Wire the existing vmaf-train bisect-model-quality tool into a nightly GitHub Actions workflow that runs against a deterministic synthetic placeholder cache and posts the verdict to sticky tracker issue #40.

Why synthetic-placeholder, not real DMOS-aligned: the issue itself flagged three prerequisites (frozen NFLX/LIVE/KonIQ subset, DMOS labels, canonical model timeline ordering) that are independently blocked by dataset access + label collection + frozen libvmaf build. Shipping the wiring now against a fixed-seed synthetic cache unblocks the AC; the real cache swaps in via a follow-up without touching the workflow file or the sticky-comment helper.

Deliverables (per ADR-0108):
- ai/scripts/build_bisect_cache.py: deterministic cache generator (--check asserts byte-equality)
- ai/testdata/bisect/: ~16 KB committed fixture (features.parquet + 8 ONNX models)
- .github/workflows/nightly-bisect.yml: cron 04:37 UTC, fails red on first_bad_index, always uploads JSON artifact, always edits #40
- scripts/ci/post-bisect-comment.py: sticky-comment helper (finds prior bot comment by header, PATCHes in place)
- docs/ai/bisect-model-quality.md: user-facing reference
- docs/adr/0109-...md: design choices + alternatives
- docs/research/0001-...md: cache-shape research digest
- CHANGELOG.md, docs/rebase-notes.md (entry 0011), ai/AGENTS.md

Synthetic-regression "deliberately bad ONNX trips alert" AC is satisfied by the existing pytest test_bisect_localises_first_bad. The committed timeline is intentionally regression-free so a green nightly means the wiring still works.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
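For readers unfamiliar with the byte-equality idea behind --check, here is a minimal sketch. It is not the code in ai/scripts/build_bisect_cache.py: the seed value, the function names build_cache_bytes and check, and the flat float64 layout are all assumptions made for illustration (the real fixture is a parquet file plus ONNX models). The point is only that a fixed-seed generator can be re-run in memory and compared byte-for-byte with the committed file.

```python
import random
import struct
from pathlib import Path

SEED = 1234  # hypothetical fixed seed; the real generator's seed is not shown in this PR


def build_cache_bytes(seed: int = SEED, rows: int = 64) -> bytes:
    """Return a deterministic byte blob standing in for the fixture contents."""
    rng = random.Random(seed)  # private RNG, so global random state cannot leak in
    # Pack every value as a little-endian float64 so the layout does not depend on the host.
    return b"".join(struct.pack("<d", rng.random()) for _ in range(rows))


def check(path: Path, seed: int = SEED) -> bool:
    """Regenerate the blob in memory and compare it byte-for-byte with the file on disk."""
    return path.read_bytes() == build_cache_bytes(seed)
```

A CI step built on this pattern would exit non-zero when check() returns False, which is one way to notice that the committed fixture and its generator have drifted apart.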
black --check noticed two minor whitespace deltas in the new files; ruff S603/S607 flagged the gh subprocess call in the sticky-comment helper.

Fix:
- black-reformat ai/scripts/build_bisect_cache.py and scripts/ci/post-bisect-comment.py
- bind ["gh", *args] to a local + suppress the S603 warning with a comment explaining the trust boundary (caller-controlled args, gh resolved from $PATH per GitHub-Actions convention)

Cache --check still passes byte-identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
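The sticky-comment pattern described in these commits (find the prior bot comment by its header, then PATCH it in place, with the gh call bound to a local) could look roughly like the sketch below. This is an illustration, not the contents of scripts/ci/post-bisect-comment.py: the HEADER marker, the function names, and the choice of `gh api` with the REST issue-comment endpoints are assumptions.

```python
import subprocess
from typing import Optional

HEADER = "<!-- nightly-bisect -->"  # hypothetical marker; the real header text is not shown here


def find_sticky(comments: list[dict], header: str = HEADER) -> Optional[int]:
    """Return the id of the first comment whose body starts with the header, else None."""
    for comment in comments:
        if comment.get("body", "").startswith(header):
            return comment["id"]
    return None


def gh_args(repo: str, issue: int, body: str, comment_id: Optional[int]) -> list[str]:
    """Build `gh api` arguments: PATCH the existing comment, or POST a new one."""
    if comment_id is not None:
        endpoint = f"repos/{repo}/issues/comments/{comment_id}"
        return ["api", "-X", "PATCH", endpoint, "-f", f"body={body}"]
    endpoint = f"repos/{repo}/issues/{issue}/comments"
    return ["api", "-X", "POST", endpoint, "-f", f"body={body}"]


def run_gh(args: list[str]) -> str:
    """Run gh with arguments built by this script and return its stdout."""
    cmd = ["gh", *args]
    # Trust boundary: args come from this script, not from untrusted input,
    # and gh is resolved from $PATH as on GitHub-hosted runners.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)  # noqa: S603
    return result.stdout
```

For the lookup to keep working on later runs, the posted body has to begin with the same header, for example `HEADER + "\n" + verdict_text`.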
Summary
- Wire the existing vmaf-train bisect-model-quality tool into a nightly GitHub Actions workflow (.github/workflows/nightly-bisect.yml) that runs against a deterministic synthetic placeholder cache (committed under ai/testdata/bisect/, regenerable via ai/scripts/build_bisect_cache.py).
- Fails red when first_bad_index is not None.
Type
- feat: new feature
- ci: tooling / infra
Checklist
- feat(ci): …).
- /cross-backend-diff not applicable.
- .py files carry the Copyright 2026 Lusoris and Claude (Anthropic) header.
Netflix golden-data gate (ADR-0024)
- assertAlmostEqual(...) scores in Netflix golden Python tests modified.
Cross-backend numerical results
Not applicable — no SIMD/GPU code touched.
Performance
Not applicable — net cost is one nightly job (Python-only, ~15 min budget).
Deep-dive deliverables (ADR-0108)
- docs/research/0001-bisect-model-quality-cache.md.
- ## Alternatives considered (6 options, 5 alternatives).
- AGENTS.md invariant note: added byte-stable-cache invariant to ai/AGENTS.md.
- CHANGELOG.md "lusoris fork" entry: bullet under ### Added (Tiny AI nightly bisect).
- 0011 (Nightly bisect-model-quality + fixture cache) in docs/rebase-notes.md.
Reproducer
Known follow-ups
🤖 Generated with Claude Code