Wire bisect-model-quality into a nightly CI workflow #4

@lusoris

Goal

The bisect-model-quality tool landed in commit 4a6b76eb (feat(ai): bisect-model-quality — binary-search a checkpoint timeline), but it's
only usable locally today. To make it load-bearing we should run it
nightly against our model registry and alert on regressions.

What's needed

  1. Golden feature cache — a committed or cached parquet of feature
    vectors + DMOS targets drawn from a frozen subset of NFLX-public /
    LIVE / KonIQ. Must be stable across CI runs; probably lives in
    testdata/ or ai/testdata/.
  2. Nightly workflow (.github/workflows/nightly-bisect.yml):
    • Checkout, install ai/ package + onnxruntime
    • Run vmaf-train bisect-model-quality --models model/*.onnx --features <cache> --min-plcc 0.85
    • If first_bad_index is not None, post a comment to a tracking
      issue (do not file a new one each time; use a sticky label and
      edit the comment in place)
  3. Registry ordering: the tool assumes quality is monotonic along the
    model list (once a model falls below the threshold, every later
    model does too). We need a canonical ordering (git log on
    model/*.onnx? release tags?) that maps models → timeline indices
    reproducibly.
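
To make the ordering requirement in item 3 concrete, here is a minimal sketch of a binary search over a checkpoint timeline. It is not the tool's actual implementation: the function name is borrowed from the first_bad_index field mentioned above, and the model names and PLCC values are invented for illustration.

```python
from typing import Callable, Optional, Sequence


def first_bad_index(
    timeline: Sequence[str],
    is_good: Callable[[str], bool],
) -> Optional[int]:
    """Return the index of the first failing model, or None if all pass.

    Assumes the timeline is ordered oldest to newest and contains at most
    one good-to-bad transition (every model after the first bad one is
    also bad). This is the monotonicity assumption from item 3.
    """
    if not timeline:
        return None
    # If the newest model passes, monotonicity implies every model passes.
    if is_good(timeline[-1]):
        return None
    lo, hi = 0, len(timeline) - 1  # invariant: timeline[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(timeline[mid]):
            lo = mid + 1  # first bad model is strictly after mid
        else:
            hi = mid  # mid is bad, so the first bad model is at or before mid
    return lo


MIN_PLCC = 0.85

# Hypothetical per-model PLCC scores, listed in timeline order.
monotone = {"m0.onnx": 0.93, "m1.onnx": 0.91, "m2.onnx": 0.80, "m3.onnx": 0.78}
monotone_result = first_bad_index(
    list(monotone), lambda m: monotone[m] >= MIN_PLCC
)

# Same idea, but the regression at index 1 "recovers" later. The search
# only probes the newest model, sees it pass, and misses the regression.
non_monotone = {"m0.onnx": 0.93, "m1.onnx": 0.80, "m2.onnx": 0.91, "m3.onnx": 0.90}
non_monotone_result = first_bad_index(
    list(non_monotone), lambda m: non_monotone[m] >= MIN_PLCC
)

print(monotone_result, non_monotone_result)
```

With the monotone timeline the search reports index 2; with the non-monotone one it reports None even though m1.onnx is below the threshold. That is why a reproducible ordering matters: if the list order does not reflect the real timeline, a regression can be silently skipped.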

Why this is deferred

  • Requires designing the golden feature cache first (item 1 above is
    non-trivial: needs to be reproducible, not too big for git, aligned
    to the FR/NR regressor input shape).
  • We don't have a stable model registry cadence yet — tiny-AI model
    releases are still in flux.
  • Other CI items (see .workingdir2/analysis/ci-security-triage.md)
    are higher priority (P0/P1, whereas this is P2).

Acceptance criteria

  • Nightly workflow runs and passes on a known-good model set
    (first_bad_index is None).
  • Synthetic regression test: introducing a deliberately bad ONNX into
    the set trips the alert.
  • Report is readable (render_table() output is posted to the tracking
    issue).
  • No false positives from stochastic eval (fixed RNG seeds in feature
    cache).
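
The last three criteria can be prototyped without any ONNX models. The sketch below is an assumption-laden illustration, not the tool's code: make_cache, the seed value, and the synthetic DMOS data are all hypothetical, and the real cache would hold feature vectors rather than ready-made predictions. It shows a seeded cache that is identical across runs, a PLCC gate at 0.85, and a deliberately bad predictor that trips the gate.

```python
import math
import random
from typing import List, Sequence, Tuple


def plcc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_p * var_t)


def make_cache(seed: int, size: int = 200) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for the golden cache: (targets, good predictions)."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    dmos = [rng.uniform(0.0, 100.0) for _ in range(size)]
    good_pred = [d + rng.gauss(0.0, 5.0) for d in dmos]
    return dmos, good_pred


MIN_PLCC = 0.85
SEED = 1234  # illustrative value only

dmos, good_pred = make_cache(SEED)
# Deliberately bad "model": predictions are perfectly anti-correlated.
bad_pred = [100.0 - d for d in dmos]

good_passes = plcc(good_pred, dmos) >= MIN_PLCC
bad_passes = plcc(bad_pred, dmos) >= MIN_PLCC
reproducible = make_cache(SEED) == make_cache(SEED)

print(good_passes, bad_passes, reproducible)
```

Using a local random.Random(seed) instead of the module-level functions keeps the cache deterministic even if other code in the same process draws random numbers.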

Out of scope

  • Writing the cache-generation script itself (separate tiny-AI task).
  • Per-commit bisection (would raise CI cost roughly 10x; nightly is the right cadence).

Metadata

Assignees

No one assigned

    Labels

    ci (CI/CD workflows, release automation)
    enhancement (New feature or request)
    tiny-ai (ai/ tiny-AI models, training, and ONNX infra)
