Goal
The bisect-model-quality tool landed in commit 4a6b76eb (feat(ai): bisect-model-quality — binary-search a checkpoint timeline), but it's
only usable locally today. To make it load-bearing we should run it
nightly against our model registry and alert on regressions.
What's needed
- Golden feature cache — a committed or cached parquet of feature
vectors + DMOS targets drawn from a frozen subset of NFLX-public /
LIVE / KonIQ. Must be stable across CI runs; probably lives in
testdata/ or ai/testdata/.
- Nightly workflow (.github/workflows/nightly-bisect.yml):
  - Checkout, install ai/ package + onnxruntime
  - Run vmaf-train bisect-model-quality --models model/*.onnx --features <cache> --min-plcc 0.85
  - When first_bad_index is not None, post a comment to a tracking
    issue (don't file a new one each time; use a sticky label and
    edit the comment in place)
- Registry ordering — the tool assumes monotonic quality on the
model list. We need a canonical ordering (git log on model/*.onnx?
release tags?) that maps models → timeline indices reproducibly.
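One possible shape for the ordering, as a rough sketch only: key each model on the committer timestamp of the commit that first added it, and break ties on the path so the result is identical across runs. The function names below are hypothetical and are not part of vmaf-train; the git-log key is just one of the two options floated above (release tags would slot into the same sort).

```python
import subprocess


def first_commit_timestamp(path: str, repo: str = ".") -> int:
    """Unix committer time of the oldest commit that added `path`.

    Assumption: "when the file was first added" is the ordering key we want.
    """
    out = subprocess.run(
        ["git", "-C", repo, "log", "--diff-filter=A", "--format=%ct", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    if not out:
        raise ValueError(f"{path} has no add commit in the available history")
    return int(out[-1])  # git log prints newest first, so the last entry is the oldest


def order_models(timestamps: dict[str, int]) -> list[str]:
    """Sort model paths oldest-first; ties fall back to the path for reproducibility."""
    return sorted(timestamps, key=lambda p: (timestamps[p], p))


# Example with made-up timestamps (no git call needed to see the ordering rule).
example = {"model/b.onnx": 200, "model/a.onnx": 200, "model/c.onnx": 100}
ordered = order_models(example)
```

Note that this needs full history in the nightly job: a shallow checkout would leave first_commit_timestamp with nothing to read.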
Why this is deferred
- Requires designing the golden feature cache first (item 1 above is
non-trivial: needs to be reproducible, not too big for git, aligned
to the FR/NR regressor input shape).
- We don't have a stable model registry cadence yet — tiny-AI model
releases are still in flux.
- Other CI items (see .workingdir2/analysis/ci-security-triage.md)
  are higher prio (P0/P1 vs this being P2).
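To make "reproducible, not too big for git" concrete, here is a minimal sketch of a guard the nightly job could run before bisecting. It assumes we pin a SHA-256 of the cache and agree on a byte budget. Both are assumptions, as are the function and parameter names; it says nothing about the input-shape alignment, which would need a parquet reader.

```python
import hashlib
from pathlib import Path


def check_cache(path: Path, expected_sha256: str, max_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the cache looks usable."""
    problems = []
    data = path.read_bytes()
    # Size budget: keeps the file small enough to commit or cache cheaply.
    if len(data) > max_bytes:
        problems.append(
            f"{path.name} is {len(data)} bytes, over the {max_bytes}-byte budget"
        )
    # Pinned digest: any drift in the cache contents shows up as a mismatch.
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        problems.append(f"{path.name} sha256 {digest} does not match the pinned value")
    return problems
```

Failing the job on a non-empty result would keep a silently regenerated cache from being mistaken for a model regression.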
Acceptance criteria
- Nightly workflow runs and passes on a known-good model set
(first_bad_index is None).
- Synthetic regression test: introducing a deliberately bad ONNX into
the set trips the alert.
- Report is readable (render_table() output is posted to the tracking
  issue).
- No false positives from stochastic eval (fixed RNG seeds in feature
cache).
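The first two criteria can be illustrated with a toy bisection. This is a sketch under the monotonic-quality assumption, not the tool's actual implementation: the first_bad_index function below only borrows the name of the result field, and the PLCC values are invented for the example.

```python
from typing import Callable, Optional, Sequence


def first_bad_index(
    models: Sequence[str], is_good: Callable[[str], bool]
) -> Optional[int]:
    """Binary-search for the first failing model; None if the newest one passes.

    Assumes monotonic quality: once a model is bad, every later one is bad too.
    """
    if not models or is_good(models[-1]):
        return None
    lo, hi = 0, len(models) - 1  # invariant: models[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(models[mid]):
            lo = mid + 1  # first bad model is strictly after mid
        else:
            hi = mid  # mid is bad, so the first bad model is mid or earlier
    return lo


# Hypothetical PLCC per model; the last two stand in for deliberately bad ONNX files.
plcc = {"m0.onnx": 0.93, "m1.onnx": 0.91, "m2.onnx": 0.60, "m3.onnx": 0.58}
timeline = list(plcc)
MIN_PLCC = 0.85

known_good = first_bad_index(timeline[:2], lambda m: plcc[m] >= MIN_PLCC)
regressed = first_bad_index(timeline, lambda m: plcc[m] >= MIN_PLCC)
```

One consequence for the synthetic test: the bad model should be placed at the end of the timeline. A bad model in the middle followed by good ones breaks the monotonic assumption, and a search like this one would report no regression because the newest model passes.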
Out of scope
- Writing the cache-generation script itself (separate tiny-AI task).
- Per-commit bisection (would 10x CI cost — nightly is the right cadence).