Conversation
05c7c9b to 4a0f1d5
…al scaffold) Adds a probabilistic head on top of the codec-aware fr_regressor_v2 (parent: ADR-0272 / PR #347 in flight) so producers can drive the in-flight `vmaf-tune --quality-confidence 0.95` flag (ADR-0237) off a calibrated prediction interval instead of v2's bare MOS scalar. PR #354 audit Bucket #18 (top-3 ranked).

Trainer (`ai/scripts/train_fr_regressor_v2_ensemble.py`) trains N=5 copies of the v2 architecture (`FRRegressor(num_codecs=NUM_CODECS)`) under distinct seeds, exports each as a separate two-input ONNX (`features [N, 6]` + `codec_onehot [N, NUM_CODECS]`), and writes an ensemble manifest sidecar that pins per-member sha256s, feature standardisation, codec vocab, nominal coverage, and an optional split-conformal residual quantile from a held-out calibration split.

Inference rule is `mu ± q · σ` with `q = 1.96` (Gaussian) or the empirical conformal quantile (Vovk 2005, Romano 2019 — distribution-free marginal coverage on exchangeable data).

Evaluator (`ai/scripts/eval_probabilistic_proxy.py`) reports empirical coverage at 50/80/95 % nominal levels, mean interval width, and the mean-prediction PLCC; reports the conformal-interval row when the manifest carries a conformal scalar.

Smoke-only ship: synthetic 100-row corpus, 1 epoch / member. Production training is gated on the multi-codec Phase A corpus (T7-FR-REGRESSOR-V2-PROBABILISTIC).

Six ADR-0108 deliverables:

1. Research digest: docs/research/0054-fr-regressor-v2-probabilistic.md.
2. Decision matrix: ADR-0279 § Alternatives considered.
3. AGENTS.md invariant note: appended to ai/AGENTS.md.
4. Reproducer: `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` followed by `python ai/scripts/eval_probabilistic_proxy.py --smoke`.
5. CHANGELOG ### Added entry under Unreleased — lusoris fork.
6. Rebase-notes entry: ### 0229 in docs/rebase-notes.md.

Test plan:

- `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` produces 5 valid two-input ONNX members + manifest sidecar (ran locally).
- `python ai/scripts/eval_probabilistic_proxy.py --smoke` aggregates the 5 ONNX outputs into (mu, sigma) and reports coverage at 50/80/95 %.
- `python ai/scripts/validate_model_registry.py` → 15 entries valid.
- `pre-commit run --files <changed>` → Passed (black / isort / ruff / json-check / secrets / semgrep).
- `markdownlint-cli2` on all new docs → 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
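A minimal sketch of that inference rule, assuming the per-member predictions have already been collected into a numpy array (the ONNX session plumbing is elided, and the function name is illustrative, not the trainer's actual helper):

```python
import numpy as np

def predict_interval(member_preds: np.ndarray, q: float = 1.96):
    """Aggregate per-member predictions into (mu, sigma) and an interval.

    member_preds: shape (n_members, n_samples) — one MOS prediction per
    ensemble member per sample (e.g. from 5 ONNX sessions).
    q: 1.96 for the Gaussian 95 % interval, or the manifest's
    split-conformal residual quantile when present.
    """
    mu = member_preds.mean(axis=0)
    sigma = member_preds.std(axis=0, ddof=1)  # cross-member spread, unbiased
    return mu, sigma, (mu - q * sigma, mu + q * sigma)
```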
4a0f1d5 to f7f0aad
Pull request overview
This PR scaffolds a probabilistic (interval-producing) training/eval workflow for the codec-aware fr_regressor_v2 by introducing a 5-member deep ensemble + optional split-conformal calibration, along with the associated tiny-model artifacts and documentation.
Changes:
- Add a trainer script to produce an ONNX-per-member ensemble plus a JSON manifest capturing ensemble metadata and calibration parameters.
- Add an evaluator script to compute empirical coverage/width metrics for Gaussian vs conformal intervals from the manifest + members.
- Register the shipped smoke-only ONNX members in the tiny-model registry and document the design via ADR/research/model-card/rebase-notes/changelog updates.
Reviewed changes
Copilot reviewed 11 out of 21 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `ai/scripts/train_fr_regressor_v2_ensemble.py` | New ensemble trainer (multi-seed training, ONNX export per member, manifest emission, registry update). |
| `ai/scripts/eval_probabilistic_proxy.py` | New evaluator that loads the manifest + members and reports empirical coverage/width metrics. |
| `model/tiny/fr_regressor_v2_ensemble_v1.json` | New ensemble manifest sidecar pinning members, standardisation stats, vocab, and confidence parameters. |
| `model/tiny/registry.json` | Adds 5 new smoke-only `kind: "fr"` entries for the ensemble members. |
| `docs/ai/models/fr_regressor_v2_probabilistic.md` | New model card explaining interval semantics, manifest layout, and (re)training/eval usage. |
| `docs/research/0067-fr-regressor-v2-probabilistic.md` | New research digest motivating deep ensemble + conformal and outlining tradeoffs. |
| `docs/adr/0279-fr-regressor-v2-probabilistic.md` | New ADR capturing the decision and alternatives considered. |
| `docs/adr/README.md` | Adds ADR-0279 row to the ADR index table. |
| `ai/AGENTS.md` | Adds rebase-sensitive invariants for the ensemble/manifest contract. |
| `docs/rebase-notes.md` | Adds rebase note entry documenting touched files and re-test commands. |
| `CHANGELOG.md` | Adds an Unreleased "Added" entry describing the scaffold. |
Comments suppressed due to low confidence (3)
docs/adr/README.md:265
docs/adr/README.md is generated from docs/adr/_index_fragments/ (see docs/adr/_index_fragments/README.md). Direct edits will be overwritten and tend to cause merge conflicts; add a row fragment, append the slug to _order.txt, then regenerate via scripts/docs/concat-adr-index.sh --write.
| [ADR-0272](0272-fr-regressor-v2-codec-aware-scaffold.md) | `fr_regressor_v2` codec-aware scaffold — first downstream consumer of the vmaf-tune Phase A JSONL corpus ([ADR-0237](0237-quality-aware-encode-automation.md)). Ships [`ai/scripts/train_fr_regressor_v2.py`](../../ai/scripts/train_fr_regressor_v2.py), a smoke ONNX (`fr_regressor_v2.onnx` registered with `smoke: true`), sidecar JSON, and full doc surface ([model card](../ai/models/fr_regressor_v2.md), [research digest](../research/0058-fr-regressor-v2-feasibility.md)). Two-input ONNX: 6 canonical libvmaf features (`adm2`, `vif_scale0..3`, `motion2`, StandardScaler-normalised) + 8-D codec block (6-way encoder one-hot + preset_norm + crf_norm, both in `[0, 1]`). MLP shape `6 -> 16 -> 16 -> 1` with codec block concatenated before the first dense layer (matches the existing `FRRegressor(num_codecs=8)` plumbing landed by [ADR-0235](0235-codec-aware-fr-regressor.md)). Registry row stays `smoke: true` until a follow-up PR (T7-FR-REGRESSOR-V2-PROD) re-runs training on a real Phase A corpus and clears v1's 0.95 LOSO PLCC ship gate with the ≥0.005 multi-codec lift required by ADR-0235. | Proposed | ai, dnn, tiny-ai, fr-regressor, codec-aware, vmaf-tune, fork-local |
CHANGELOG.md:34
- The Unreleased section of CHANGELOG.md is rendered from changelog.d/ fragments (see changelog.d/README.md). This direct edit will drift from the generated output and is likely to fail the fragment-drift check; please add a fragment under changelog.d/added/ and regenerate instead.
- **`fr_regressor_v2` codec-aware scaffold — first downstream consumer
of the vmaf-tune Phase A JSONL corpus (ADR-0272, prereq for
Phase B).** Ships
[`ai/scripts/train_fr_regressor_v2.py`](ai/scripts/train_fr_regressor_v2.py)
— a scaffold-only trainer that consumes the JSONL corpus emitted by
`vmaf-tune corpus` (ADR-0237 Phase A) and trains the codec-aware
variant of the v1 FR regressor. Two-input ONNX (`features` shape
`(N, 6)` canonical-6 + `codec` shape `(N, 8)` block —
`[encoder_onehot(6), preset_norm, crf_norm]`); reuses the existing
`FRRegressor(num_codecs=8)` class plumbed by ADR-0235. A `--smoke`
mode synthesises 100 fake corpus rows and trains 1 epoch so the
pipeline is end-to-end exercisable in CI without hours of encode
time. Registers `fr_regressor_v2` in `model/tiny/registry.json`
with `smoke: true` until a follow-up PR runs production training on
a real Phase A corpus and clears the ADR-0235 ship gate (≥0.005
multi-codec PLCC lift over v1's 0.95 LOSO floor). Doc surface:
[model card](docs/ai/models/fr_regressor_v2.md),
[research digest](docs/research/0058-fr-regressor-v2-feasibility.md),
[ADR-0272](docs/adr/0272-fr-regressor-v2-codec-aware-scaffold.md),
`ai/AGENTS.md` invariant note pinning the codec block layout and
encoder vocabulary. Smoke validated locally (`python
ai/scripts/train_fr_regressor_v2.py --smoke` produces a valid
opset-17 two-input ONNX, op-allowlist clean, torch-vs-ORT roundtrip
within 1e-4 atol). No upstream-mirror file touched; pure additive
docs/rebase-notes.md:6682
- This rebase note says the ADR index row was appended directly to docs/adr/README.md, but that file is generated from docs/adr/_index_fragments/. Please update the note to reflect the fragment-based workflow (add fragment + update _order.txt + regenerate) so future rebases don't repeat a non-durable edit.
unchanged. The migration only touches state-management
boilerplate around the kernel; the SSE accumulator math, the
per-bpc kernel function lookup, the host-side `log10` score
formula, and the dispatch grid-dim calculation are byte-identical
to the prior implementation. Netflix golden gate + CPU/CUDA
Comment on lines +1 to +8

```markdown
# Research-0054: probabilistic `fr_regressor_v2` — deep-ensemble + conformal

- **Date**: 2026-05-03
- **Authors**: Lusoris, Claude (Anthropic)
- **Status**: Final (scaffold-time digest)
- **Tags**: ai, fr-regressor, probabilistic, ensemble, conformal
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
  ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
```
Comment on lines +7 to +8

```markdown
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
  ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
```
```markdown
# `fr_regressor_v2_ensemble_v1` — probabilistic FR regressor (deep-ensemble + conformal)

`fr_regressor_v2_ensemble_v1` is a **probabilistic** successor to the
codec-aware `fr_regressor_v2` (parent: [ADR-0272](../../adr/0272-fr-regressor-v2-codec-aware-scaffold.md))
```
```markdown
## Context

The codec-aware [`fr_regressor_v2`](0272-fr-regressor-v2-codec-aware-scaffold.md)
```
Comment on lines +11 to +12

```markdown
distinct random seeds and exports each copy as
``model/tiny/fr_regressor_v2_seed<N>.onnx`` plus a manifest sidecar
```
```python
])  # (5, N)
mu, sigma = preds.mean(axis=0), preds.std(axis=0, ddof=1)

q = manifest["confidence"].get("conformal_q_residual") or manifest["confidence"]["gaussian_z"]
```
Comment on lines +57 to +63

| Method | UCI 95 % cov. | KITTI depth 95 % cov. | Notes |
| --- | --- | --- | --- |
| Deep ensemble (N=5) | 0.93–0.95 | 0.91–0.94 | Best of the four pre-conformal; dominates MC-dropout consistently. |
| MC-dropout (T=10) | 0.85–0.91 | 0.78–0.86 | Underestimates variance; gets worse on OOD inputs. |
| Heteroscedastic NLL | 0.78–0.92 (high variance) | 0.70–0.88 | Aleatoric only; collapses on epistemic-uncertainty regimes. |
| Bayesian last-layer | 0.90–0.94 | 0.88–0.92 | Comparable to MC-dropout; substantially more engineering. |
| **Any method + conformal** | **≥ 0.95 by construction** | **≥ 0.95 by construction** | Marginal coverage guarantee on exchangeable data (Vovk 2005, Lei 2018). |
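The "by construction" row follows from the split-conformal recipe: take a finite-sample-corrected quantile of residuals on a held-out calibration split. A minimal sketch, assuming σ-normalised residuals so the result plugs directly into the PR's `mu ± q · σ` rule (the function name and exact normalisation are illustrative, not the trainer's actual helper):

```python
import numpy as np

def split_conformal_q(y_cal, mu_cal, sigma_cal, alpha=0.05):
    """Residual quantile q such that mu +/- q * sigma covers >= 1 - alpha
    of exchangeable data (Vovk 2005, Lei 2018)."""
    scores = np.abs(y_cal - mu_cal) / sigma_cal        # sigma-normalised residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return float(np.quantile(scores, level, method="higher"))
```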
Comment on lines +39 to +40

```markdown
nominal_coverage: 0.95, conformal_q_residual: <float?>,
feature_mean / feature_std: list[6] }``.
```
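Read together with the aggregation snippet above, those fields suggest a manifest shaped roughly like the following. This is purely illustrative — the exact nesting and every value here are guesses, not the shipped `model/tiny/fr_regressor_v2_ensemble_v1.json`:

```json
{
  "ensemble_id": "fr_regressor_v2_ensemble_v1",
  "members": [
    { "path": "fr_regressor_v2_ensemble_v1_seed0.onnx", "sha256": "<hex>" },
    { "path": "fr_regressor_v2_ensemble_v1_seed1.onnx", "sha256": "<hex>" }
  ],
  "codec_vocab": ["<encoder names, order pinned by ai/AGENTS.md>"],
  "feature_mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  "feature_std": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
  "confidence": {
    "nominal_coverage": 0.95,
    "gaussian_z": 1.96,
    "conformal_q_residual": null
  }
}
```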
Comment on lines +320 to +333

```python
    """Add / replace the ensemble registry row.

    The registry schema only knows scoring kinds (fr / nr / filter), so
    each ensemble *member* is registered as kind=``fr`` with a stable
    id ``<ensemble_id>_seed<N>`` and the manifest sidecar
    (``<ensemble_id>.json``) is the higher-level entry point. This
    keeps `validate_model_registry.py` green without a schema bump.
    The ensemble manifest itself is referenced via the first member's
    ``notes`` field so downstream tooling can discover it.
    """
    registry = json.loads(registry_path.read_text())
    models = registry.get("models", [])
    keep = [m for m in models if not m.get("id", "").startswith(f"{ensemble_id}_seed")]
    keep = [m for m in keep if m.get("id") != ensemble_id]
```
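Per that docstring, a member row would look something like the sketch below. Only the `id`, `kind`, `smoke`, and `notes` semantics are confirmed by the PR; the remaining field names are assumptions about the registry schema:

```json
{
  "id": "fr_regressor_v2_ensemble_v1_seed0",
  "kind": "fr",
  "smoke": true,
  "sha256": "<hex>",
  "notes": "Member of fr_regressor_v2_ensemble_v1; manifest: model/tiny/fr_regressor_v2_ensemble_v1.json"
}
```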
```markdown
- [ADR-0040](0040-dnn-session-multi-input-api.md),
  [ADR-0041](0041-lpips-sq-extractor.md) — multi-input ONNX precedent
  the v2 ensemble member graph follows.
- Source: `req` (PR #354 audit Bucket #18, top-3 ranked).
```
lusoris pushed a commit that referenced this pull request May 5, 2026: …e (ADR-0303)
lusoris added a commit that referenced this pull request May 5, 2026

…e (ADR-0303) (#399)

Builds on PR #372 (ensemble scaffold — five smoke seed rows in model/tiny/registry.json) and ADR-0291 (deterministic v2 prod flip + 0.95 LOSO PLCC ship gate). Adds the LOSO trainer + production-flip gate so the seeds can flip from smoke: true to smoke: false after a real-corpus LOSO run.

The production ship gate is two-part per ADR-0303:

* mean_i(PLCC_i) >= 0.95 — inherits the ADR-0235 / ADR-0291 ship gate per ensemble member.
* max_i(PLCC_i) - min_i(PLCC_i) <= 0.005 — variance bound that protects the predictive-distribution semantics that the in-flight vmaf-tune --quality-confidence flag (ADR-0237 consumer) relies on. Without it, the mean PLCC could mask a one-seed-wins-four-seeds-tie configuration that breaks conformal calibration.

Per-seed registry rows flip smoke: true -> false only after that seed clears its individual PLCC_i >= 0.95 gate; the ensemble-mean entry (if/when registered) flips only after all five seeds clear AND the variance bound holds.

The trainer's body is a stub on this branch — the real Phase A canonical-6 corpus is not present and the registry rows are NOT flipped here. CI workflow wiring of the gate is intentionally deferred to the follow-up flip PR (no real loso_seed{N}.json artefacts exist on master to gate on yet).

Verification:

* python3 -c "import ast; ast.parse(open('ai/scripts/train_fr_regressor_v2_ensemble_loso.py').read())" — clean.
* python3 -c "import ast; ast.parse(open('scripts/ci/ensemble_prod_gate.py').read())" — clean.
* python ai/scripts/train_fr_regressor_v2_ensemble_loso.py --help — exits 0.
* python scripts/ci/ensemble_prod_gate.py --help — exits 0.

Refs: PR #372 (ensemble scaffold), ADR-0291 (deterministic v2 prod flip), ADR-0279 (probabilistic head), ADR-0237 (vmaf-tune Phase A consumer), ADR-0235 (codec-aware decision + ship gate).

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
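A minimal sketch of that two-part check, assuming one LOSO PLCC per seed as input (the real scripts/ci/ensemble_prod_gate.py body is not shown in this PR and may differ):

```python
def ensemble_prod_gate(plcc_by_seed: list[float],
                       mean_floor: float = 0.95,
                       spread_cap: float = 0.005) -> bool:
    """ADR-0303 two-part ship gate: mean-PLCC floor + cross-seed spread bound."""
    mean_ok = sum(plcc_by_seed) / len(plcc_by_seed) >= mean_floor
    spread_ok = max(plcc_by_seed) - min(plcc_by_seed) <= spread_cap
    return mean_ok and spread_ok

def seed_may_flip(plcc_i: float, floor: float = 0.95) -> bool:
    """Per-seed registry flip: smoke true -> false once this seed clears the floor."""
    return plcc_i >= floor
```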
Summary
Adds a probabilistic head on top of the codec-aware `fr_regressor_v2` (parent: ADR-0272 / PR #347 in flight) so producers can drive the in-flight `vmaf-tune --quality-confidence 0.95` flag (ADR-0237) off a calibrated prediction interval instead of v2's bare MOS scalar. PR #354 audit Bucket #18 (top-3 ranked).

- `ai/scripts/train_fr_regressor_v2_ensemble.py` trains N=5 copies of the v2 architecture (`FRRegressor(num_codecs=NUM_CODECS)`) under distinct seeds, exports each as a separate two-input ONNX (`features [N, 6]` + `codec_onehot [N, NUM_CODECS]`), and writes the ensemble manifest sidecar `model/tiny/fr_regressor_v2_ensemble_v1.json` pinning per-member sha256s, feature standardisation, codec vocab, nominal coverage, and an optional split-conformal residual quantile from a held-out calibration split.
- Inference rule: `mu +/- q * sigma` with `q = 1.96` (Gaussian) or the empirical conformal quantile (Vovk 2005, Romano 2019 - distribution-free marginal coverage on exchangeable data).
- `ai/scripts/eval_probabilistic_proxy.py` reports empirical coverage at 50/80/95 % nominal levels, mean interval width, and the mean-prediction PLCC; an extra conformal row reports the calibrated interval's coverage when the manifest carries the conformal scalar.
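For intuition, empirical coverage at a nominal level is just the fraction of true values that land inside the interval. A self-contained toy check (illustrative only, not the evaluator's actual code; Gaussian z-values assumed per level):

```python
import numpy as np

def empirical_coverage(y, mu, sigma, q):
    """Fraction of true values falling inside mu +/- q * sigma."""
    return float((np.abs(y - mu) <= q * sigma).mean())

# Well-calibrated Gaussian predictions should track the nominal levels.
rng = np.random.default_rng(0)
mu = rng.normal(size=10_000)
sigma = np.full_like(mu, 0.5)
y = mu + sigma * rng.normal(size=mu.shape)
for level, z in [(0.50, 0.674), (0.80, 1.282), (0.95, 1.960)]:
    print(f"{level:.0%} nominal -> {empirical_coverage(y, mu, sigma, z):.3f} empirical")
```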
Six deep-dive deliverables (ADR-0108)

1. Research digest: `docs/research/0054-fr-regressor-v2-probabilistic.md`.
2. Decision matrix: `docs/adr/0279-fr-regressor-v2-probabilistic.md` § Alternatives considered (deep-ensemble vs. heteroscedastic NLL vs. MC-dropout vs. quantile regression vs. Bayesian last-layer vs. bootstrap).
3. AGENTS.md invariant note: appended to `ai/AGENTS.md`.
4. Reproducer / smoke-test command: `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` followed by `python ai/scripts/eval_probabilistic_proxy.py --smoke`.
5. CHANGELOG fragment: `CHANGELOG.md` `### Added` entry under "Unreleased - lusoris fork".
6. Rebase note: `### 0229` in `docs/rebase-notes.md`.

Test plan
- `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` trains 5 members in ~3.5 s and exports 5 valid two-input ONNX members + manifest sidecar (ran locally).
- `python ai/scripts/eval_probabilistic_proxy.py --smoke` loads all 5 ONNX members, aggregates `(mu, sigma)`, and reports coverage at 50/80/95 % (numbers are nonsensical on the 1-epoch synthetic smoke - the script is the gate, not the score).
- `python ai/scripts/validate_model_registry.py` - 15 entries valid against `registry.schema.json`.
- `pre-commit run --files <changed>` - Passed (black / isort / ruff / json-check / secrets / semgrep).
- `markdownlint-cli2` on all 3 new docs - 0 errors.
- Production training: deferred, tracked as backlog item T7-FR-REGRESSOR-V2-PROBABILISTIC (per ADR-0279 § Consequences).
Notes for the reviewer
- The `fr_regressor_v2` deterministic scaffold is in flight as PR #347 ("feat(ai): fr_regressor_v2 codec-aware scaffold (Phase B prereq)"; ADR-0261 in that PR's tree). This PR cites it as ADR-0272 in `## References` as a placeholder; renumber at merge time if needed.
- Each ensemble member is its own `kind: "fr"` row in `model/tiny/registry.json` (5 new rows: `fr_regressor_v2_ensemble_v1_seed{0..4}`) so the existing tiny-model verifier sha256-checks each member without a registry-schema bump. A future `kind: "fr_ensemble"` schema bump is noted as a follow-up.
- The consumer API (`vmaf_dnn_score_with_interval`) and the `vmaf-tune --quality-confidence` flag are separate follow-up PRs; this PR is the training-side scaffold only.
🤖 Generated with Claude Code