feat(ai): fr_regressor_v2 ensemble — real-corpus retrain harness + flip workflow (ADR-0309) by lusoris · Pull Request #405 · lusoris/vmaf

lusoris · 2026-05-05T20:37:30Z

Summary

Follow-up to ADR-0303 /
PR #399 that ships the operational harness for actually running the
5-seed x 9-fold LOSO retrain against the locally available Netflix
Public Dataset (.workingdir2/netflix/) and emitting a
machine-checkable verdict file.

- Wrapper (ai/scripts/run_ensemble_v2_real_corpus_loso.sh):
  validates the corpus, loops the seeds through the existing
  train_fr_regressor_v2_ensemble_loso.py, tees timestamped per-seed
  logs under runs/ensemble_v2_real/logs/.
- Validator (ai/scripts/validate_ensemble_seeds.py): calls the
  ADR-0303 gate, snapshots the corpus YUV file list as sha256, writes
  PROMOTE.json on gate-pass (recommends flipping the five
  fr_regressor_v2_ensemble_v1_seed{0..4} rows in
  model/tiny/registry.json from smoke: true to smoke: false)
  or HOLD.json on gate-fail.

The harness deliberately does not run the LOSO inside the PR
(6–12 h GPU work) and does not flip the registry; the registry
flip is a separate follow-up PR gated on a passing PROMOTE.json.
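
To make the verdict-file contract concrete, here is a minimal sketch of how a validator could emit exactly one of PROMOTE.json or HOLD.json. It is not the code in validate_ensemble_seeds.py: the function name `write_verdict` and the payload keys (`verdict`, `recommendation`, `details`) are assumptions for illustration; only the two file names and the recommended `smoke` values come from the description above.

```python
import json
from pathlib import Path


def write_verdict(out_dir: Path, gate_passed: bool, details: dict) -> Path:
    """Write PROMOTE.json on gate-pass or HOLD.json on gate-fail.

    Hypothetical helper: the payload layout is an assumption, not the
    schema used by the real validator.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    keep, stale = (
        ("PROMOTE.json", "HOLD.json") if gate_passed else ("HOLD.json", "PROMOTE.json")
    )
    # Drop a leftover verdict of the opposite kind so the directory never
    # holds contradictory files from an earlier run.
    (out_dir / stale).unlink(missing_ok=True)
    payload = {
        "verdict": "PROMOTE" if gate_passed else "HOLD",
        # The PR text says PROMOTE recommends smoke: false and HOLD keeps smoke: true.
        "recommendation": {"smoke": not gate_passed},
        "details": details,
    }
    path = out_dir / keep
    path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

A follow-up flip PR could then check for the presence of PROMOTE.json in the run directory before touching model/tiny/registry.json.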

Six deep-dive deliverables (ADR-0108)

(1) Research digest: docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md — corpus sufficiency, LOSO fold sizing, seed-diversity hyperparameters, Seeking_25fps weak-fold diagnostic.
(2) Decision matrix: ADR-0309 §Alternatives considered (4 options).
(3) AGENTS.md invariant note: ai/AGENTS.md — registry-flip is a
separate PR; never flip during a rebase.
(4) Reproducer / smoke-test command: pytest ai/tests/test_validate_ensemble_seeds.py -v
(5) CHANGELOG fragment: Unreleased — lusoris fork row added.
(6) Rebase note: docs/rebase-notes.md entry 0309.

Test plan

- pytest ai/tests/test_validate_ensemble_seeds.py -v (7/7 pass)
- python ai/scripts/validate_ensemble_seeds.py --help
- bash -n ai/scripts/run_ensemble_v2_real_corpus_loso.sh
- black --check / ruff check / isort --check clean on
  new Python files
- Out-of-band: real LOSO run on .workingdir2/netflix/
  (6–12 h, deferred to follow-up flip PR per ADR-0309)

Status: DRAFT

Leaving as draft until the user confirms direction. The follow-up
flip PR is blocked on a maintainer running the wrapper out-of-band
and producing a passing PROMOTE.json.

🤖 Generated with Claude Code

…(ADR-0313) (#410)

* ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313)

  The 23-named-required-check posture (ADR-0037) deadlocks doc/Python-only PRs: the C-build matrix path-filter-skips on their diffs, but branch protection counts a path-filter-skip + a never-ran-at-all as not satisfying the required-check. PR #400 hit this concretely (10/23 succeeded; 13/23 either skipped or never reported; gh pr merge returned "the base branch policy prohibits the merge").

  Aggregator is one workflow with no path filter. It polls up to 8 minutes for sibling workflows to register, then verifies each named check on the head SHA reported success/skipped/neutral (or didn't appear at all, which is the documented path-filter rejection semantics). Aggregator becomes the single branch-protection required check; the 23 individual workflows continue to run unchanged.

  Manual operator step at adoption (after this PR merges):

      gh api -X PUT "repos/lusoris/vmaf/branches/master/protection/required_status_checks" \
        -F 'strict=true' -F 'contexts=["Required Checks Aggregator"]'

  Unblocks #400, #403, #404, #405, #406, #407 currently stuck on the deadlock. Per user popup direction 2026-05-05.

  Files: .github/workflows/required-aggregator.yml (new), docs/adr/0313-*.md (new), changelog.d/added/*.md (new), docs/adr/README.md (+1 row), docs/adr/_index_fragments/_order.txt (+1 line + new fragment), docs/rebase-notes.md §0313.

* ci: retrigger after PR body cleanup

* ci: retrigger after deliverables opt-out polarity fix

---------

Co-authored-by: Lusoris <lusoris@pm.me>
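
The aggregator's pass/fail rule described in this commit message can be sketched as a pure function. This is an illustration only, not the contents of required-aggregator.yml: the name `aggregate`, the `reported` mapping, and the use of the string "pending" for unfinished checks are assumptions, and the polling and GitHub API calls are left out.

```python
# Conclusions the commit message treats as satisfying a required check.
ACCEPTED = frozenset({"success", "skipped", "neutral"})


def aggregate(required: list[str], reported: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (ok, failing_checks) for one head SHA.

    `required` lists the named checks; `reported` maps check name to its
    conclusion. A name missing from `reported` never appeared at all, which
    the commit message treats as a path-filter skip and therefore accepts.
    Anything else (e.g. "failure", "cancelled", or the assumed placeholder
    "pending") blocks the merge.
    """
    failing = []
    for name in required:
        conclusion = reported.get(name)
        if conclusion is None or conclusion in ACCEPTED:
            continue
        failing.append(name)
    return (not failing, failing)
```

Under this rule a doc-only PR where the C-build matrix never registers still passes, while a check that actually ran and failed still blocks.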

…ip workflow (ADR-0309)

Follow-up to ADR-0303 / PR #399 that ships the operational harness for actually running the 5-seed x 9-fold LOSO retrain against the locally available Netflix Public Dataset and emitting a machine-checkable verdict file.

- ai/scripts/run_ensemble_v2_real_corpus_loso.sh: Bash wrapper that validates .workingdir2/netflix/, loops the seeds through the existing train_fr_regressor_v2_ensemble_loso.py, tees timestamped per-seed logs.
- ai/scripts/validate_ensemble_seeds.py: applies the ADR-0303 gate (mean PLCC >= 0.95 AND max-min <= 0.005), snapshots the corpus YUV file list as sha256 over sorted relpath+size, writes PROMOTE.json on gate-pass or HOLD.json on gate-fail.
- ai/tests/test_validate_ensemble_seeds.py: 7 tests covering both verdict paths plus exit-code coverage.
- docs/ai/ensemble-v2-real-corpus-retrain-runbook.md: prerequisites, two-command run, verdict interpretation, rollback procedure.
- docs/adr/0309-*.md (Proposed): decision matrix with 4 alternatives.
- docs/research/0081-*.md: corpus-size sufficiency, LOSO sizing, seed-diversity hyperparameters, Seeking_25fps weak-fold diagnostic.
- ai/AGENTS.md: appended ADR-0309 invariant (registry-flip is a separate PR; never flip during a rebase).

The harness deliberately does NOT run the LOSO inside the PR (6-12 h GPU work) and does NOT flip the registry; the registry flip is a separate follow-up PR gated on a passing PROMOTE.json.

Reproducer: pytest ai/tests/test_validate_ensemble_seeds.py -v

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
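
The gate quoted in this commit message (mean PLCC >= 0.95 AND max-min <= 0.005) is small enough to write out. The sketch below assumes the gate is evaluated over one PLCC value per seed, with "max-min" meaning the spread across seeds; that reading, the function name `adr0303_gate`, and the returned keys are assumptions, since the real gate lives in the ADR-0303 code and is not shown here.

```python
from statistics import fmean

# Thresholds taken from the commit message above.
MEAN_PLCC_MIN = 0.95
SPREAD_MAX = 0.005


def adr0303_gate(per_seed_plcc: dict[int, float]) -> dict:
    """Evaluate the two-part gate over one PLCC value per seed (assumed input)."""
    if not per_seed_plcc:
        raise ValueError("at least one seed result is required")
    values = list(per_seed_plcc.values())
    mean_plcc = fmean(values)
    spread = max(values) - min(values)
    return {
        "mean_plcc": mean_plcc,
        "spread": spread,
        # Both conditions must hold for the ensemble to be promoted.
        "passed": mean_plcc >= MEAN_PLCC_MIN and spread <= SPREAD_MAX,
    }
```

Note that the spread condition can fail even when every seed clears 0.95, which is why a HOLD verdict recommends investigating seed diversity and hyperparameters rather than accuracy alone.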

Copilot

Pull request overview

This PR introduces the operational “real-corpus LOSO retrain harness” for the fr_regressor_v2 deep-ensemble workflow described in ADR-0309: a bash wrapper to run per-seed LOSO training, a Python validator to apply the ADR-0303 gate and emit PROMOTE.json/HOLD.json, plus accompanying tests and documentation (runbook, ADR, research digest).

Changes:

Add ai/scripts/run_ensemble_v2_real_corpus_loso.sh and ai/scripts/validate_ensemble_seeds.py (with pytest coverage) to drive and validate an out-of-band retrain.
Add ADR-0309 + a runbook documenting the operator workflow and rollback guidance.
Add a new research digest and index entries for the workstream.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| docs/research/README.md | Adds Research-0081 to the research index. |
| docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md | New research digest for the real-corpus retrain methodology. |
| docs/rebase-notes.md | Adds a rebase-note entry for ADR-0309. |
| docs/ai/ensemble-v2-real-corpus-retrain-runbook.md | New operator runbook for running wrapper + validator and interpreting verdicts. |
| docs/adr/README.md | Adds ADR-0309 to the ADR index table (but this file is generated). |
| docs/adr/0309-fr-regressor-v2-ensemble-real-corpus-retrain.md | New ADR documenting the harness/flip workflow decision. |
| CHANGELOG.md | Adds an Unreleased entry (but this file is generated). |
| ai/tests/test_validate_ensemble_seeds.py | New tests for validator verdict emission + exit codes. |
| ai/scripts/validate_ensemble_seeds.py | New validator script that applies the ADR-0303 gate and writes PROMOTE/HOLD verdict files. |
| ai/scripts/run_ensemble_v2_real_corpus_loso.sh | New wrapper intended to run per-seed LOSO training and collect logs/artefacts. |
| ai/AGENTS.md | Adds an invariant note that registry flips must happen in a separate PR. |


+- **`fr_regressor_v2` ensemble — real-corpus retrain harness +
+  flip workflow (ADR-0309).** Follow-up to
+  [ADR-0303](docs/adr/0303-fr-regressor-v2-ensemble-prod-flip.md) /
+  PR #399 that ships the operational harness for actually running
+  the 5-seed × 9-fold LOSO retrain against the locally available
+  Netflix Public Dataset (`.workingdir2/netflix/`) and emitting a
+  machine-checkable verdict file. Adds
+  [`ai/scripts/run_ensemble_v2_real_corpus_loso.sh`](ai/scripts/run_ensemble_v2_real_corpus_loso.sh)
+  (Bash wrapper that validates the corpus, loops the seeds through
+  `train_fr_regressor_v2_ensemble_loso.py`, and tees timestamped
+  per-seed logs under `runs/ensemble_v2_real/logs/`),
+  [`ai/scripts/validate_ensemble_seeds.py`](ai/scripts/validate_ensemble_seeds.py)
+  (Python validator that calls the ADR-0303 gate, snapshots the
+  corpus YUV file list as sha256 over sorted `relpath\tsize`, and
+  writes `PROMOTE.json` on gate-pass with a recommendation to flip
+  the five `fr_regressor_v2_ensemble_v1_seed{0..4}` rows in
+  `model/tiny/registry.json` from `smoke: true` to `smoke: false`,
+  or `HOLD.json` on gate-fail with the failing-seed details and a
+  recommendation to keep `smoke: true` and investigate diversity /
+  hyperparameters), unit tests for both verdict paths, and a
+  runbook
+  [`docs/ai/ensemble-v2-real-corpus-retrain-runbook.md`](docs/ai/ensemble-v2-real-corpus-retrain-runbook.md)
+  covering prerequisites, the two-command run, verdict
+  interpretation, and rollback. The harness deliberately does
+  **not** run the LOSO inside the PR (6–12 h GPU work) and does
+  **not** flip the registry — the registry flip is a separate
+  follow-up PR gated on a passing `PROMOTE.json` (preserves a clean
+  revert surface and honours the new `ai/AGENTS.md` invariant that
+  registry-flip never happens during a rebase). Companion research
+  digest:
+  [Research-0081](docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md).
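
The corpus snapshot mentioned in this changelog entry (sha256 over sorted `relpath\tsize`) could look roughly like the sketch below. It is a guess at the shape, not the validator's code: the function name, the `*.yuv` glob, the POSIX-style relative paths, and joining entries with newlines are all assumptions.

```python
import hashlib
from pathlib import Path


def corpus_snapshot_sha256(corpus_root: Path, pattern: str = "*.yuv") -> str:
    """Hash the sorted list of 'relpath<TAB>size' entries under corpus_root.

    Only file names and sizes are hashed, so the snapshot is cheap even for
    multi-gigabyte YUV files, at the cost of not detecting same-size edits.
    """
    entries = []
    for path in corpus_root.rglob(pattern):
        if path.is_file():
            rel = path.relative_to(corpus_root).as_posix()
            entries.append(f"{rel}\t{path.stat().st_size}")
    # Sorting makes the digest independent of filesystem traversal order.
    entries.sort()
    return hashlib.sha256("\n".join(entries).encode("utf-8")).hexdigest()
```

Recording this digest in the verdict file lets a reviewer confirm that a PROMOTE.json was produced against the same file list they have locally.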



+   `train_fr_regressor_v2_ensemble_loso.py --seed N
+   --corpus-root $CORPUS_ROOT
+   --output runs/ensemble_v2_real/loso_seed{N}.json` per seed.


+2. **Verify the registry** — `python ai/scripts/validate_model_registry.py`
+   should pass; the five rows must read `"smoke": true` again.
+3. **Verify the C-side ORT loader** — re-run
+   `python ai/tests/test_registry.py` to confirm the smoke graphs


@@ -312,5 +312,6 @@ ADRs may exist there for local session continuity, but the tracked
 | [ADR-0255](0253-fastdvdnet-pre-real-weights.md) | T6-7b — FastDVDnet temporal pre-filter real upstream weights drop. Replaces the [ADR-0215](0215-fastdvdnet-pre-filter.md) smoke-only placeholder ONNX with the verbatim trained checkpoint from upstream `m-tassano/fastdvdnet` (commit `c8fdf61`, MIT) wrapped by a `LumaAdapter` PyTorch module that preserves the C-side luma `[1, 5, H, W]` → `[1, 1, H, W]` contract: each luma plane is `Concat`-tiled into RGB (`Y → [Y, Y, Y]`) to match upstream's 15-channel input, a constant `sigma = 25/255` noise map (upstream's reference inference level) is broadcast via `ones_like(centre) * sigma`, and the upstream RGB output is collapsed back to luma using BT.601 weights (`Y = 0.299 R + 0.587 G + 0.114 B`). Every `nn.PixelShuffle` instance in upstream's UpBlock is swapped pre-export for an allowlist-safe `Reshape`/`Transpose`/`Reshape` decomposition (zero learned params → numerically identical, verified `< 1e-6` max-abs diff between upstream PyTorch and exported ONNX); `DepthToSpace` deliberately stays off the op allowlist. Shipped graph uses only allowlisted ops. Registry row flips `smoke: false` with `license: MIT`, upstream commit pin, and refreshed `sha256`; sidecar JSON + doc `docs/ai/models/fastdvdnet_pre.md` carry full provenance. New `ai/scripts/export_fastdvdnet_pre.py` (replaces the `_placeholder.py` exporter — kept for reference). 9.5 MiB ONNX (well under the 50 MiB DNN size cap). Luma-native retrain tracked as T6-7c follow-up; INT8 PTQ tracked as T6-7d follow-up. | Accepted | ai, dnn, feature-extractor, wave-1, weights-drop, fork-local |


+  echo "[ensemble-v2-real] seed=$seed -> $log_file"
+  python "$repo_root/ai/scripts/train_fr_regressor_v2_ensemble_loso.py" \
+    --seeds "$seed" \
+    --corpus-root "$corpus_root" \
+    --output "$out_dir/loso_seed${seed}.json" \
+    --out-dir "$out_dir" \
+    2>&1 | tee "$log_file"
+done
+


+- **Status**: Active
+- **Date**: 2026-05-05
+- **ADR**: [ADR-0309](../adr/0309-fr-regressor-v2-ensemble-real-corpus-retrain.md)
+- **Related**: [Research-0075](0075-fr-regressor-v2-ensemble-prod-flip.md)
+  (parent — gate theory + conformal calibration sketch),
+  [Research-0067](0067-fr-regressor-v2-prod-loso.md)
+  (deterministic LOSO baseline),
+  [Research-0058](0058-fr-regressor-v2-feasibility.md)
+  (codec-aware feasibility).


The 2026-05-06 merge train shipped 13 ADRs whose implementing PRs landed but Status was never bumped from Proposed to Accepted. Per docs/adr/README.md and ADR-0028, ADRs flip to Accepted once the deliverable lands. The train moved faster than the per-ADR Status edits could keep up; this PR catches up.

Flipped:

- ADR-0302 (#401, ENCODER_VOCAB v3 schema expansion)
- ADR-0303 (#399, fr_regressor_v2 ensemble prod-flip gate)
- ADR-0304 (#402, vmaf-tune fast-path Optuna TPE)
- ADR-0305 (#400, knob-sweep Pareto analysis scaffold)
- ADR-0307 (#404, vmaf-tune ladder default sampler)
- ADR-0308 (#406, knob-sweep recipe-regression policy)
- ADR-0309 (#405, ensemble retrain harness)
- ADR-0311 (#408, libfuzzer harness expansion)
- ADR-0313 (#410, CI Required Checks Aggregator) [table-format Status, sed-edited inline]
- ADR-0314 (#412, vmaf-tune --score-backend=vulkan)
- ADR-0316 (#414, cli_parse long-only-option assertion fix)
- ADR-0317 (#415, CI Docker + FFmpeg-SYCL flake fix)
- ADR-0319 (#422, ensemble LOSO trainer real impl)

Already-Accepted (no change): ADR-0310 (#407), ADR-0312 (#425), ADR-0315 (skeleton, intentionally Proposed), ADR-0321 (#424).

lusoris mentioned this pull request May 5, 2026

ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313) #410

Merged

10 tasks

lusoris marked this pull request as ready for review May 6, 2026 01:31

Copilot AI review requested due to automatic review settings May 6, 2026 01:31

lusoris force-pushed the feat/fr-regressor-v2-ensemble-real-corpus-retrain branch from 2875cb6 to 31e4aa2 Compare May 6, 2026 01:32

Copilot started reviewing on behalf of lusoris May 6, 2026 01:32 View session

ci: retrigger after deliverables checkbox format fix

abf2e3e

Copilot AI reviewed May 6, 2026

lusoris merged commit e45299e into master May 6, 2026
55 checks passed

lusoris deleted the feat/fr-regressor-v2-ensemble-real-corpus-retrain branch May 6, 2026 01:55

This was referenced May 6, 2026

fix(ai): run_ensemble_v2_real_corpus_loso.sh — wrapper-trainer interface mismatch + Phase-A pre-step doc #421

Closed

docs(adr): bulk flip Proposed → Accepted for 13 merge-train ADRs #426

Merged

lusoris mentioned this pull request May 6, 2026

feat/vmaf tune score backend vulkan #436

Merged

9 tasks

lusoris merged 2 commits into master from feat/fr-regressor-v2-ensemble-real-corpus-retrain
