
feat(ai): fr_regressor_v2 probabilistic head (deep-ensemble + conformal scaffold)#372

Merged
lusoris merged 2 commits into master from
feat/ai-fr-regressor-v2-probabilistic
May 5, 2026
Conversation

@lusoris
Owner

@lusoris lusoris commented May 3, 2026

Summary

  • Scaffolds a probabilistic head on top of the codec-aware
    fr_regressor_v2 (parent: ADR-0272 / PR #347 in flight) so producers
    can drive the in-flight vmaf-tune --quality-confidence 0.95 flag
    (ADR-0237) off a calibrated prediction interval instead of v2's
    bare MOS scalar. PR #354 audit Bucket #18 (top-3 ranked).
  • Trainer
    ai/scripts/train_fr_regressor_v2_ensemble.py
    trains N=5 copies of the v2 architecture
    (FRRegressor(num_codecs=NUM_CODECS)) under distinct seeds, exports
    each as a separate two-input ONNX (features [N, 6] +
    codec_onehot [N, NUM_CODECS]), and writes the ensemble manifest
    sidecar model/tiny/fr_regressor_v2_ensemble_v1.json pinning
    per-member sha256s, feature standardisation, codec vocab, nominal
    coverage, and an optional split-conformal residual quantile from a
    held-out calibration split.
  • Inference rule is mu ± q · sigma with q = 1.96 (Gaussian) or
    the empirical conformal quantile (Vovk 2005, Romano 2019:
    distribution-free marginal coverage on exchangeable data).
  • Evaluator
    ai/scripts/eval_probabilistic_proxy.py
    reports empirical coverage at 50/80/95 % nominal levels, mean
    interval width, and the mean-prediction PLCC; an extra conformal
    row reports the calibrated interval's coverage when the manifest
    carries the conformal scalar.
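
The mu ± q · sigma rule from the bullets above can be sketched in a few lines of NumPy. This is a minimal sketch: the shipped scripts run the five ONNX members through onnxruntime and read q from the manifest sidecar, so `ensemble_interval` and the toy numbers here are purely illustrative:

```python
import numpy as np

def ensemble_interval(member_preds: np.ndarray, q: float = 1.96):
    """Turn per-member MOS predictions into a prediction interval.

    member_preds has shape (M, N): M ensemble members, N samples.
    q is the half-width multiplier: 1.96 for the Gaussian 95 % rule,
    or the conformal residual quantile when the manifest carries one.
    """
    mu = member_preds.mean(axis=0)            # ensemble mean per sample
    sigma = member_preds.std(axis=0, ddof=1)  # member spread (epistemic proxy)
    return mu, mu - q * sigma, mu + q * sigma

# Toy stand-in for the 5 ONNX members' outputs on 3 samples.
preds = np.array([
    [4.0, 3.0, 2.0],
    [4.2, 3.1, 2.1],
    [3.8, 2.9, 1.9],
    [4.1, 3.0, 2.0],
    [3.9, 3.0, 2.0],
])
mu, lo, hi = ensemble_interval(preds)
```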

Smoke-only ship. The shipped artefacts are the trainer's
--smoke output (synthetic 100-row corpus, 1 epoch / member). They
are load-path probes, not quality models. Production training is
gated on the multi-codec Phase A corpus and is tracked as backlog
item T7-FR-REGRESSOR-V2-PROBABILISTIC.

Six deep-dive deliverables (ADR-0108)

Test plan

  • python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke
    trains 5 members in ~3.5s, exports 5 valid two-input ONNX
    members + manifest sidecar (ran locally).
  • python ai/scripts/eval_probabilistic_proxy.py --smoke loads
    all 5 ONNX members, aggregates (mu, sigma), reports coverage
    at 50/80/95 % (numbers are nonsensical on the 1-epoch synthetic
    smoke - the script is the gate, not the score).
  • python ai/scripts/validate_model_registry.py - 15 entries
    valid against registry.schema.json.
  • pre-commit run --files <changed> - Passed (black / isort /
    ruff / json-check / secrets / semgrep).
  • markdownlint-cli2 on all 3 new docs - 0 errors.
  • Production training run on a real Phase A multi-codec corpus
    - deferred, tracked as backlog item
    T7-FR-REGRESSOR-V2-PROBABILISTIC (per ADR-0279 § Consequences).

Notes for the reviewer

  • Parent fr_regressor_v2 deterministic scaffold is in flight as PR
    #347 (ADR-0261 in that PR's tree). This PR cites it as ADR-0272 in
    ## References as a placeholder; renumber at merge time if needed.
  • Each ensemble member is registered as a kind: "fr" row in
    model/tiny/registry.json (5 new rows:
    fr_regressor_v2_ensemble_v1_seed{0..4}) so the existing
    tiny-model verifier sha256-checks each member without a
    registry-schema bump. A future kind: "fr_ensemble" schema bump is
    noted as a follow-up.
  • C-side runtime adapter (vmaf_dnn_score_with_interval) and
    vmaf-tune --quality-confidence flag are separate follow-up PRs;
    this PR is the training-side scaffold only.

🤖 Generated with Claude Code

@lusoris lusoris force-pushed the feat/ai-fr-regressor-v2-probabilistic branch from 05c7c9b to 4a0f1d5 on May 3, 2026 19:43
@lusoris lusoris marked this pull request as ready for review May 5, 2026 12:01
Copilot AI review requested due to automatic review settings May 5, 2026 12:01
…al scaffold)

Adds a probabilistic head on top of the codec-aware fr_regressor_v2
(parent: ADR-0272 / PR #347 in flight) so producers can drive the
in-flight `vmaf-tune --quality-confidence 0.95` flag (ADR-0237) off a
calibrated prediction interval instead of v2's bare MOS scalar. PR #354
audit Bucket #18 (top-3 ranked).

Trainer (`ai/scripts/train_fr_regressor_v2_ensemble.py`) trains N=5
copies of the v2 architecture (`FRRegressor(num_codecs=NUM_CODECS)`)
under distinct seeds, exports each as a separate two-input ONNX
(`features [N, 6]` + `codec_onehot [N, NUM_CODECS]`), and writes an
ensemble manifest sidecar that pins per-member sha256s, feature
standardisation, codec vocab, nominal coverage, and an optional
split-conformal residual quantile from a held-out calibration split.
Inference rule is `mu ± q · σ` with `q = 1.96` (Gaussian) or the
empirical conformal quantile (Vovk 2005, Romano 2019 — distribution-free
marginal coverage on exchangeable data).
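
The split-conformal residual quantile mentioned above can be sketched as follows. This is an assumption-laden sketch: the PR does not spell out the nonconformity score, so standardised residuals |y - mu| / sigma are assumed, and the finite-sample rank correction is the standard one from the conformal literature:

```python
import numpy as np

def split_conformal_q(residuals: np.ndarray, nominal: float = 0.95) -> float:
    """Empirical quantile of held-out calibration residuals.

    Uses the finite-sample rank ceil((n + 1) * nominal), which is what
    gives split conformal its distribution-free >= nominal marginal
    coverage on exchangeable data (Vovk 2005; Romano 2019).
    """
    n = len(residuals)
    rank = int(np.ceil((n + 1) * nominal))
    if rank > n:
        # Too few calibration points for this level; fall back to the max.
        return float(np.max(residuals))
    return float(np.sort(residuals)[rank - 1])
```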

Evaluator (`ai/scripts/eval_probabilistic_proxy.py`) reports empirical
coverage at 50/80/95 % nominal levels, mean interval width, and the
mean-prediction PLCC; reports the conformal-interval row when the
manifest carries a conformal scalar.
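
The evaluator's core metric, empirical coverage and mean width at one nominal level, reduces to a few lines (a sketch only; the real script repeats this at 50/80/95 % and adds the mean-prediction PLCC):

```python
import numpy as np

def coverage_and_width(y: np.ndarray, lo: np.ndarray, hi: np.ndarray):
    """Empirical interval coverage and mean interval width."""
    covered = (y >= lo) & (y <= hi)
    return float(covered.mean()), float((hi - lo).mean())

# Toy ground truth vs intervals; the third interval misses its target.
y  = np.array([3.0, 4.0, 5.0, 2.0])
lo = np.array([2.5, 3.9, 5.2, 1.0])
hi = np.array([3.5, 4.1, 5.5, 3.0])
cov, width = coverage_and_width(y, lo, hi)  # cov = 0.75, width ≈ 0.875
```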

Smoke-only ship: synthetic 100-row corpus, 1 epoch / member. Production
training is gated on the multi-codec Phase A corpus (T7-FR-REGRESSOR-V2-PROBABILISTIC).

Six ADR-0108 deliverables:
1. Research digest: docs/research/0054-fr-regressor-v2-probabilistic.md.
2. Decision matrix: ADR-0279 § Alternatives considered.
3. AGENTS.md invariant note: appended to ai/AGENTS.md.
4. Reproducer: `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke`
   followed by `python ai/scripts/eval_probabilistic_proxy.py --smoke`.
5. CHANGELOG ### Added entry under Unreleased — lusoris fork.
6. Rebase-notes entry: ### 0229 in docs/rebase-notes.md.

Test plan:
- `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` produces
  5 valid two-input ONNX members + manifest sidecar (ran locally).
- `python ai/scripts/eval_probabilistic_proxy.py --smoke` aggregates the
  5 ONNX outputs into (mu, sigma) and reports coverage at 50/80/95 %.
- `python ai/scripts/validate_model_registry.py` → 15 entries valid.
- `pre-commit run --files <changed>` → Passed (black / isort / ruff /
  json-check / secrets / semgrep).
- `markdownlint-cli2` on all new docs → 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris force-pushed the feat/ai-fr-regressor-v2-probabilistic branch from 4a0f1d5 to f7f0aad on May 5, 2026 12:03

Copilot AI left a comment


Pull request overview

This PR scaffolds a probabilistic (interval-producing) training/eval workflow for the codec-aware fr_regressor_v2 by introducing a 5-member deep ensemble + optional split-conformal calibration, along with the associated tiny-model artifacts and documentation.

Changes:

  • Add a trainer script to produce an ONNX-per-member ensemble plus a JSON manifest capturing ensemble metadata and calibration parameters.
  • Add an evaluator script to compute empirical coverage/width metrics for Gaussian vs conformal intervals from the manifest + members.
  • Register the shipped smoke-only ONNX members in the tiny-model registry and document the design via ADR/research/model-card/rebase-notes/changelog updates.

Reviewed changes

Copilot reviewed 11 out of 21 changed files in this pull request and generated 13 comments.

Summary per file:

| File | Description |
| --- | --- |
| `ai/scripts/train_fr_regressor_v2_ensemble.py` | New ensemble trainer (multi-seed training, ONNX export per member, manifest emission, registry update). |
| `ai/scripts/eval_probabilistic_proxy.py` | New evaluator that loads the manifest + members and reports empirical coverage/width metrics. |
| `model/tiny/fr_regressor_v2_ensemble_v1.json` | New ensemble manifest sidecar pinning members, standardisation stats, vocab, and confidence parameters. |
| `model/tiny/registry.json` | Adds 5 new smoke-only kind: "fr" entries for the ensemble members. |
| `docs/ai/models/fr_regressor_v2_probabilistic.md` | New model card explaining interval semantics, manifest layout, and (re)training/eval usage. |
| `docs/research/0067-fr-regressor-v2-probabilistic.md` | New research digest motivating deep ensemble + conformal and outlining tradeoffs. |
| `docs/adr/0279-fr-regressor-v2-probabilistic.md` | New ADR capturing the decision and alternatives considered. |
| `docs/adr/README.md` | Adds ADR-0279 row to the ADR index table. |
| `ai/AGENTS.md` | Adds rebase-sensitive invariants for the ensemble/manifest contract. |
| `docs/rebase-notes.md` | Adds rebase note entry documenting touched files and re-test commands. |
| `CHANGELOG.md` | Adds an Unreleased "Added" entry describing the scaffold. |

Comments suppressed due to low confidence (3)

docs/adr/README.md:265

  • docs/adr/README.md is generated from docs/adr/_index_fragments/ (see docs/adr/_index_fragments/README.md). Editing it directly will be overwritten and tends to cause merge conflicts; add a row fragment + append the slug to _order.txt, then regenerate via scripts/docs/concat-adr-index.sh --write.
| [ADR-0272](0272-fr-regressor-v2-codec-aware-scaffold.md) | `fr_regressor_v2` codec-aware scaffold — first downstream consumer of the vmaf-tune Phase A JSONL corpus ([ADR-0237](0237-quality-aware-encode-automation.md)). Ships [`ai/scripts/train_fr_regressor_v2.py`](../../ai/scripts/train_fr_regressor_v2.py), a smoke ONNX (`fr_regressor_v2.onnx` registered with `smoke: true`), sidecar JSON, and full doc surface ([model card](../ai/models/fr_regressor_v2.md), [research digest](../research/0058-fr-regressor-v2-feasibility.md)). Two-input ONNX: 6 canonical libvmaf features (`adm2`, `vif_scale0..3`, `motion2`, StandardScaler-normalised) + 8-D codec block (6-way encoder one-hot + preset_norm + crf_norm, both in `[0, 1]`). MLP shape `6 -> 16 -> 16 -> 1` with codec block concatenated before the first dense layer (matches the existing `FRRegressor(num_codecs=8)` plumbing landed by [ADR-0235](0235-codec-aware-fr-regressor.md)). Registry row stays `smoke: true` until a follow-up PR (T7-FR-REGRESSOR-V2-PROD) re-runs training on a real Phase A corpus and clears v1's 0.95 LOSO PLCC ship gate with the ≥0.005 multi-codec lift required by ADR-0235. | Proposed | ai, dnn, tiny-ai, fr-regressor, codec-aware, vmaf-tune, fork-local |

CHANGELOG.md:34

  • The Unreleased section of CHANGELOG.md is rendered from changelog.d/ fragments (see changelog.d/README.md). This direct edit will drift from the generated output and is likely to fail the fragment-drift check; please add a fragment under changelog.d/added/ and regenerate instead.
- **`fr_regressor_v2` codec-aware scaffold — first downstream consumer
  of the vmaf-tune Phase A JSONL corpus (ADR-0272, prereq for
  Phase B).** Ships
  [`ai/scripts/train_fr_regressor_v2.py`](ai/scripts/train_fr_regressor_v2.py)
  — a scaffold-only trainer that consumes the JSONL corpus emitted by
  `vmaf-tune corpus` (ADR-0237 Phase A) and trains the codec-aware
  variant of the v1 FR regressor. Two-input ONNX (`features` shape
  `(N, 6)` canonical-6 + `codec` shape `(N, 8)` block —
  `[encoder_onehot(6), preset_norm, crf_norm]`); reuses the existing
  `FRRegressor(num_codecs=8)` class plumbed by ADR-0235. A `--smoke`
  mode synthesises 100 fake corpus rows and trains 1 epoch so the
  pipeline is end-to-end exercisable in CI without hours of encode
  time. Registers `fr_regressor_v2` in `model/tiny/registry.json`
  with `smoke: true` until a follow-up PR runs production training on
  a real Phase A corpus and clears the ADR-0235 ship gate (≥0.005
  multi-codec PLCC lift over v1's 0.95 LOSO floor). Doc surface:
  [model card](docs/ai/models/fr_regressor_v2.md),
  [research digest](docs/research/0058-fr-regressor-v2-feasibility.md),
  [ADR-0272](docs/adr/0272-fr-regressor-v2-codec-aware-scaffold.md),
  `ai/AGENTS.md` invariant note pinning the codec block layout and
  encoder vocabulary. Smoke validated locally (`python
  ai/scripts/train_fr_regressor_v2.py --smoke` produces a valid
  opset-17 two-input ONNX, op-allowlist clean, torch-vs-ORT roundtrip
  within 1e-4 atol). No upstream-mirror file touched; pure additive

docs/rebase-notes.md:6682

  • This rebase note says the ADR index row was appended directly to docs/adr/README.md, but that file is generated from docs/adr/_index_fragments/. Please update the note to reflect the fragment-based workflow (add fragment + update _order.txt + regenerate) so future rebases don’t repeat a non-durable edit.
  unchanged. The migration only touches state-management
  boilerplate around the kernel; the SSE accumulator math, the
  per-bpc kernel function lookup, the host-side `log10` score
  formula, and the dispatch grid-dim calculation are byte-identical
  to the prior implementation. Netflix golden gate + CPU/CUDA


Comment on lines +1 to +8
# Research-0054: probabilistic `fr_regressor_v2` — deep-ensemble + conformal

- **Date**: 2026-05-03
- **Authors**: Lusoris, Claude (Anthropic)
- **Status**: Final (scaffold-time digest)
- **Tags**: ai, fr-regressor, probabilistic, ensemble, conformal
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
Comment on lines +7 to +8
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
# `fr_regressor_v2_ensemble_v1` — probabilistic FR regressor (deep-ensemble + conformal)

`fr_regressor_v2_ensemble_v1` is a **probabilistic** successor to the
codec-aware `fr_regressor_v2` (parent: [ADR-0272](../../adr/0272-fr-regressor-v2-codec-aware-scaffold.md))

## Context

The codec-aware [`fr_regressor_v2`](0272-fr-regressor-v2-codec-aware-scaffold.md)
Comment on lines +11 to +12
distinct random seeds and exports each copy as
``model/tiny/fr_regressor_v2_seed<N>.onnx`` plus a manifest sidecar
]) # (5, N)
mu, sigma = preds.mean(axis=0), preds.std(axis=0, ddof=1)

q = manifest["confidence"].get("conformal_q_residual") or manifest["confidence"]["gaussian_z"]
Comment on lines +57 to +63
| Method | UCI 95 % cov. | KITTI depth 95 % cov. | Notes |
| --- | --- | --- | --- |
| Deep ensemble (N=5) | 0.93–0.95 | 0.91–0.94 | Best of the four pre-conformal; dominates MC-dropout consistently. |
| MC-dropout (T=10) | 0.85–0.91 | 0.78–0.86 | Underestimates variance; gets worse on OOD inputs. |
| Heteroscedastic NLL | 0.78–0.92 (high variance) | 0.70–0.88 | Aleatoric only; collapses on epistemic-uncertainty regimes. |
| Bayesian last-layer | 0.90–0.94 | 0.88–0.92 | Comparable to MC-dropout; substantially more engineering. |
| **Any method + conformal** | **≥ 0.95 by construction** | **≥ 0.95 by construction** | Marginal coverage guarantee on exchangeable data (Vovk 2005, Lei 2018). |
Comment on lines +39 to +40
nominal_coverage: 0.95, conformal_q_residual: <float?>,
feature_mean / feature_std: list[6] }``.
Comment on lines +320 to +333
"""Add / replace the ensemble registry row.

The registry schema only knows scoring kinds (fr / nr / filter), so
each ensemble *member* is registered as kind=``fr`` with a stable
id ``<ensemble_id>_seed<N>`` and the manifest sidecar
(``<ensemble_id>.json``) is the higher-level entry point. This
keeps `validate_model_registry.py` green without a schema bump.
The ensemble manifest itself is referenced via the first member's
``notes`` field so downstream tooling can discover it.
"""
registry = json.loads(registry_path.read_text())
models = registry.get("models", [])
keep = [m for m in models if not m.get("id", "").startswith(f"{ensemble_id}_seed")]
keep = [m for m in keep if m.get("id") != ensemble_id]
- [ADR-0040](0040-dnn-session-multi-input-api.md),
[ADR-0041](0041-lpips-sq-extractor.md) — multi-input ONNX precedent
the v2 ensemble member graph follows.
- Source: `req` (PR #354 audit Bucket #18, top-3 ranked).
@lusoris lusoris merged commit de6c0a0 into master May 5, 2026
54 checks passed
@lusoris lusoris deleted the feat/ai-fr-regressor-v2-probabilistic branch May 5, 2026 12:35
lusoris pushed a commit that referenced this pull request May 5, 2026
…e (ADR-0303)

Builds on PR #372 (ensemble scaffold — five smoke seed rows in
model/tiny/registry.json) and ADR-0291 (deterministic v2 prod flip +
0.95 LOSO PLCC ship gate). Adds the LOSO trainer + production-flip
gate so the seeds can flip from smoke: true to smoke: false after a
real-corpus LOSO run.

The production ship gate is two-part per ADR-0303:
  * mean_i(PLCC_i) >= 0.95 — inherits the ADR-0235 / ADR-0291 ship
    gate per ensemble member.
  * max_i(PLCC_i) - min_i(PLCC_i) <= 0.005 — variance bound that
    protects the predictive-distribution semantics that the in-flight
    vmaf-tune --quality-confidence flag (ADR-0237 consumer) relies
    on. Without it, the mean PLCC could mask a one-seed-wins-four-
    seeds-tie configuration that breaks conformal calibration.

Per-seed registry rows flip smoke: true -> false only after that
seed clears its individual PLCC_i >= 0.95 gate; the ensemble-mean
entry (if/when registered) flips only after all five seeds clear AND
the variance bound holds.

The trainer's body is a stub on this branch — the real Phase A
canonical-6 corpus is not present and the registry rows are NOT
flipped here. CI workflow wiring of the gate is intentionally
deferred to the follow-up flip PR (no real loso_seed{N}.json
artefacts exist on master to gate on yet).

Verification:
  * python3 -c "import ast; ast.parse(open('ai/scripts/train_fr_regressor_v2_ensemble_loso.py').read())" — clean.
  * python3 -c "import ast; ast.parse(open('scripts/ci/ensemble_prod_gate.py').read())" — clean.
  * python ai/scripts/train_fr_regressor_v2_ensemble_loso.py --help — exits 0.
  * python scripts/ci/ensemble_prod_gate.py --help — exits 0.

Refs: PR #372 (ensemble scaffold), ADR-0291 (deterministic v2 prod
flip), ADR-0279 (probabilistic head), ADR-0237 (vmaf-tune Phase A
consumer), ADR-0235 (codec-aware decision + ship gate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
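
The two-part gate described in that commit message reduces to a few lines of Python. An illustrative sketch only: the real scripts/ci/ensemble_prod_gate.py reads per-seed LOSO artefacts, and the function shape and return keys here are assumptions:

```python
def ensemble_prod_gate(plcc, mean_floor=0.95, spread_bound=0.005):
    """Two-part ADR-0303 ship gate over per-seed LOSO PLCCs.

    Part 1: mean_i(PLCC_i) >= 0.95 (inherited ADR-0235 / ADR-0291 floor).
    Part 2: max_i(PLCC_i) - min_i(PLCC_i) <= 0.005 (variance bound).
    A seed's registry row may flip smoke only if its own PLCC clears the
    floor; the ensemble entry flips only when everything passes.
    """
    mean_ok = sum(plcc) / len(plcc) >= mean_floor
    spread_ok = max(plcc) - min(plcc) <= spread_bound
    per_seed_flip = [p >= mean_floor for p in plcc]
    return {
        "mean_ok": mean_ok,
        "spread_ok": spread_ok,
        "per_seed_flip": per_seed_flip,
        "ensemble_flip": mean_ok and spread_ok and all(per_seed_flip),
    }
```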
lusoris added a commit that referenced this pull request May 5, 2026
…e (ADR-0303) (#399)
