
feat(ci): nightly bisect-model-quality + sticky tracker (closes #4) #41

Merged: lusoris merged 2 commits into master from feat/nightly-bisect-model-quality on Apr 18, 2026

lusoris (Owner) commented Apr 18, 2026

Summary

Type

  • feat — new feature
  • ci — tooling / infra

Checklist

  • Commits follow Conventional Commits (feat(ci): …).
  • No SIMD/GPU code touched — /cross-backend-diff not applicable.
  • No feature extractor SIMD/GPU twins touched.
  • New .py files carry the Copyright 2026 Lusoris and Claude (Anthropic) header.
  • Not a breaking change.

Netflix golden-data gate (ADR-0024)

  • No assertAlmostEqual(...) scores in Netflix golden Python tests modified.

Cross-backend numerical results

Not applicable — no SIMD/GPU code touched.

Performance

Not applicable — net cost is one nightly job (Python-only, ~15 min budget).

Deep-dive deliverables (ADR-0108)

  • Research digest: docs/research/0001-bisect-model-quality-cache.md.
  • Decision matrix — captured in ADR-0109 ## Alternatives considered (6 options, 5 alternatives).
  • AGENTS.md invariant note — added byte-stable-cache invariant to ai/AGENTS.md.
  • Reproducer / smoke-test command — see below.
  • CHANGELOG.md "lusoris fork" entry — bullet under ### Added (Tiny AI nightly bisect).
  • Rebase note — entry 0011 — Nightly bisect-model-quality + fixture cache in docs/rebase-notes.md.
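
The byte-stable-cache invariant mentioned above can be illustrated with a small sketch. This is not the code in ai/scripts/build_bisect_cache.py: `generate_blob`, `check`, and the seed value are hypothetical, and the payload is a stand-in for the real features.parquet + ONNX fixture. It only shows the idea that a fixed seed plus an explicit byte layout makes regeneration reproducible, so a `--check`-style mode can compare bytes directly.

```python
import random
import struct
import tempfile
from pathlib import Path

SEED = 1234  # hypothetical fixed seed; the real generator's seed is not shown in this PR


def generate_blob(seed: int = SEED, n: int = 256) -> bytes:
    """Produce a deterministic byte payload from a fixed seed."""
    rng = random.Random(seed)
    # Pack with an explicit little-endian layout so the bytes do not
    # depend on the host platform's native byte order.
    return b"".join(struct.pack("<d", rng.random()) for _ in range(n))


def check(path: Path, seed: int = SEED) -> bool:
    """Return True when the file on disk equals a fresh regeneration, byte for byte."""
    return path.read_bytes() == generate_blob(seed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fixture = Path(tmp) / "fixture.bin"
        fixture.write_bytes(generate_blob())
        print("byte-identical:", check(fixture))
```

Comparing raw bytes rather than parsed values is the stricter choice: it also catches changes in serialization (for example a library upgrade that alters metadata) that would leave the decoded numbers unchanged.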

Reproducer

# Local smoke (matches what the workflow does):
python ai/scripts/build_bisect_cache.py --check
vmaf-train bisect-model-quality \
    ai/testdata/bisect/models/model_*.onnx \
    --features ai/testdata/bisect/features.parquet \
    --min-plcc 0.85 --input-name input \
    --json /tmp/bisect-result.json --fail-on-first-bad
# Expected: "no regression in this range"; first_bad_index = None; exit 0.

# Verify the synthetic-regression case (existing pytest):
pytest ai/tests/test_bisect_model_quality.py::test_bisect_localises_first_bad -v
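
For scripting around the smoke test, the JSON written by `--json` could be turned into a verdict along these lines. This is a sketch under an assumption: the result schema is not shown in this PR, so the code assumes a top-level `first_bad_index` key that is null when no regression is found. `verdict_from_json` is a hypothetical helper, not part of vmaf-train.

```python
import json


def verdict_from_json(text: str) -> int:
    """Map a bisect result document to an exit code: 0 = clean, 1 = regression."""
    data = json.loads(text)
    # Assumed schema: "first_bad_index" is null (None) when the range is clean.
    first_bad = data.get("first_bad_index")
    if first_bad is None:
        print("no regression in this range")
        return 0
    print(f"first bad model at index {first_bad}")
    return 1


if __name__ == "__main__":
    # Two hand-written sample documents, not real tool output.
    verdict_from_json('{"first_bad_index": null}')
    verdict_from_json('{"first_bad_index": 3}')
```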

Known follow-ups

  • Replace the synthetic placeholder cache with a real DMOS-aligned subset (NFLX-public is the likely first target). The workflow file + sticky-comment helper stay unchanged at swap time.
  • Optionally retire the synthetic cache once a real one lands, or keep both as separate jobs (synthetic = wiring smoke, real = quality regression detection). Lean toward keeping both.

🤖 Generated with Claude Code

Lusoris and others added 2 commits April 18, 2026 13:49
Wire the existing vmaf-train bisect-model-quality tool into a nightly
GitHub Actions workflow that runs against a deterministic synthetic
placeholder cache and posts the verdict to sticky tracker issue #40.

Why synthetic-placeholder, not real DMOS-aligned: the issue itself
flagged three prerequisites (frozen NFLX/LIVE/KonIQ subset, DMOS labels,
canonical model timeline ordering) that are independently blocked by
dataset access + label collection + frozen libvmaf build. Shipping the
wiring now against a fixed-seed synthetic cache unblocks the AC; the
real cache swaps in via a follow-up without touching the workflow file
or the sticky-comment helper.

Deliverables (per ADR-0108):

- ai/scripts/build_bisect_cache.py     deterministic cache generator
                                       (--check asserts byte-equality)
- ai/testdata/bisect/                  ~16 KB committed fixture
                                       (features.parquet + 8 ONNX models)
- .github/workflows/nightly-bisect.yml cron 04:37 UTC, fails red on
                                       first_bad_index, always uploads
                                       JSON artifact, always edits #40
- scripts/ci/post-bisect-comment.py    sticky-comment helper (finds
                                       prior bot comment by header,
                                       PATCHes in place)
- docs/ai/bisect-model-quality.md      user-facing reference
- docs/adr/0109-...md                  design choices + alternatives
- docs/research/0001-...md             cache-shape research digest
- CHANGELOG.md, docs/rebase-notes.md (entry 0011), ai/AGENTS.md
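
The sticky-comment behaviour described above (find the prior bot comment by its header, then PATCH it in place) can be sketched as follows. This is not the contents of scripts/ci/post-bisect-comment.py: the `HEADER` marker, the function names, and the `owner/repo` slug are hypothetical. The two endpoint shapes are the standard GitHub REST routes for updating and creating issue comments. The sketch only plans the request and makes no network call.

```python
from typing import Optional

HEADER = "<!-- nightly-bisect -->"  # hypothetical marker; the real header text is not shown


def find_sticky(comments: list, header: str = HEADER) -> Optional[int]:
    """Return the id of the first comment whose body starts with the header."""
    for comment in comments:
        if comment.get("body", "").startswith(header):
            return comment["id"]
    return None


def plan_request(repo: str, issue: int, comments: list, verdict: str) -> tuple:
    """Choose between editing the existing sticky comment and creating one."""
    body = f"{HEADER}\n{verdict}"
    sticky_id = find_sticky(comments)
    if sticky_id is not None:
        # A prior sticky comment exists: edit it so the issue keeps one comment.
        return ("PATCH", f"/repos/{repo}/issues/comments/{sticky_id}", body)
    # First run: no sticky comment yet, so create it.
    return ("POST", f"/repos/{repo}/issues/{issue}/comments", body)


if __name__ == "__main__":
    existing = [
        {"id": 11, "body": "unrelated human comment"},
        {"id": 12, "body": HEADER + "\nold verdict"},
    ]
    print(plan_request("owner/repo", 40, existing, "no regression in this range"))
    print(plan_request("owner/repo", 40, [], "no regression in this range"))
```

Matching on a header rather than on the comment author keeps the lookup independent of which token posts the comment, at the cost of requiring that the header string never changes.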

Synthetic-regression "deliberately bad ONNX trips alert" AC is satisfied
by the existing pytest test_bisect_localises_first_bad. The committed
timeline is intentionally regression-free so a green nightly means the
wiring still works.
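
To make "localises first bad" concrete, here is a minimal bisection sketch. It is not the vmaf-train implementation, whose internals are not shown here. It assumes quality is monotone along the timeline (once a model drops below the threshold, every later model does too), and `first_bad_index` and the sample PLCC values are illustrative only.

```python
from typing import Callable, Optional


def first_bad_index(score_at: Callable[[int], float], n: int, min_plcc: float) -> Optional[int]:
    """Binary-search for the first index whose score falls below min_plcc.

    Relies on the monotonicity assumption: a bad model is never followed
    by a good one.
    """
    if n == 0 or score_at(n - 1) >= min_plcc:
        return None  # last model is good, so nothing regressed in this range
    lo, hi = 0, n - 1  # invariant: the model at index hi is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if score_at(mid) < min_plcc:
            hi = mid  # mid is bad, the first bad one is at or before mid
        else:
            lo = mid + 1  # mid is good, the first bad one is after mid
    return lo


if __name__ == "__main__":
    # Made-up PLCC values for an 8-model timeline with a regression at index 3.
    plcc = [0.95, 0.94, 0.93, 0.60, 0.58, 0.57, 0.55, 0.50]
    print(first_bad_index(lambda i: plcc[i], len(plcc), min_plcc=0.85))
    clean = [0.95] * 8
    print(first_bad_index(lambda i: clean[i], len(clean), min_plcc=0.85))
```

On a regression-free timeline like the committed one, the function returns `None` after evaluating only the last model, which matches the "green nightly" case described above.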

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

black --check noticed two minor whitespace deltas in the new files;
ruff S603/S607 flagged the gh subprocess call in the sticky-comment
helper. Fix:

- black-reformat ai/scripts/build_bisect_cache.py and
  scripts/ci/post-bisect-comment.py
- bind ["gh", *args] to a local + suppress the S603 warning with a
  comment explaining the trust boundary (caller-controlled args, gh
  resolved from $PATH per GitHub-Actions convention)
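
The shape of that fix can be sketched like this. It is an illustration, not the exact code in the helper: `build_gh_command` and `run_gh` are hypothetical names. The command is bound to a local list, passed to `subprocess.run` without a shell, and the Ruff suppression sits on the call line next to a comment stating the trust boundary.

```python
import subprocess


def build_gh_command(*args: str) -> list:
    """Bind the full argv to a list; no shell string is ever built."""
    return ["gh", *args]


def run_gh(*args: str) -> str:
    """Run the gh CLI and return its stdout, raising on a non-zero exit."""
    cmd = build_gh_command(*args)
    # Trust boundary: args come from the calling script, not from user input,
    # no shell is involved, and gh is resolved from $PATH on the runner.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)  # noqa: S603
    return result.stdout


if __name__ == "__main__":
    # Only print the argv here; running it would need gh installed and authenticated.
    print(build_gh_command("issue", "view", "40"))
```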

Cache --check still passes byte-identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
lusoris merged commit 6cd4fb0 into master on Apr 18, 2026
22 of 23 checks passed
lusoris deleted the feat/nightly-bisect-model-quality branch on Apr 18, 2026 12:01
github-actions bot mentioned this pull request on Apr 18, 2026
Linked issue: Wire bisect-model-quality into a nightly CI workflow