Conversation
05c7c9b to 4a0f1d5
…al scaffold) Adds a probabilistic head on top of the codec-aware fr_regressor_v2 (parent: ADR-0272 / PR #347 in flight) so producers can drive the in-flight `vmaf-tune --quality-confidence 0.95` flag (ADR-0237) off a calibrated prediction interval instead of v2's bare MOS scalar. PR #354 audit Bucket #18 (top-3 ranked).

Trainer (`ai/scripts/train_fr_regressor_v2_ensemble.py`) trains N=5 copies of the v2 architecture (`FRRegressor(num_codecs=NUM_CODECS)`) under distinct seeds, exports each as a separate two-input ONNX (`features [N, 6]` + `codec_onehot [N, NUM_CODECS]`), and writes an ensemble manifest sidecar that pins per-member sha256s, feature standardisation, codec vocab, nominal coverage, and an optional split-conformal residual quantile from a held-out calibration split.

Inference rule is `mu ± q · σ` with `q = 1.96` (Gaussian) or the empirical conformal quantile (Vovk 2005, Romano 2019 — distribution-free marginal coverage on exchangeable data).

Evaluator (`ai/scripts/eval_probabilistic_proxy.py`) reports empirical coverage at 50/80/95 % nominal levels, mean interval width, and the mean-prediction PLCC; reports the conformal-interval row when the manifest carries a conformal scalar.

Smoke-only ship: synthetic 100-row corpus, 1 epoch / member. Production training is gated on the multi-codec Phase A corpus (T7-FR-REGRESSOR-V2-PROBABILISTIC).

Six ADR-0108 deliverables:

1. Research digest: docs/research/0054-fr-regressor-v2-probabilistic.md.
2. Decision matrix: ADR-0279 § Alternatives considered.
3. AGENTS.md invariant note: appended to ai/AGENTS.md.
4. Reproducer: `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` followed by `python ai/scripts/eval_probabilistic_proxy.py --smoke`.
5. CHANGELOG ### Added entry under Unreleased — lusoris fork.
6. Rebase-notes entry: ### 0229 in docs/rebase-notes.md.

Test plan:

- `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` produces 5 valid two-input ONNX members + manifest sidecar (ran locally).
- `python ai/scripts/eval_probabilistic_proxy.py --smoke` aggregates the 5 ONNX outputs into (mu, sigma) and reports coverage at 50/80/95 %.
- `python ai/scripts/validate_model_registry.py` → 15 entries valid.
- `pre-commit run --files <changed>` → Passed (black / isort / ruff / json-check / secrets / semgrep).
- `markdownlint-cli2` on all new docs → 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
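A minimal sketch of that inference rule, assuming the per-member predictions have already been collected into a numpy array (the ONNX session plumbing is elided, and the function name is illustrative, not the trainer's actual helper):

```python
import numpy as np

def predict_interval(member_preds: np.ndarray, q: float = 1.96):
    """Aggregate per-member predictions into (mu, sigma) and an interval.

    member_preds: shape (n_members, n_samples) — one MOS prediction per
    ensemble member per sample (e.g. from 5 ONNX sessions).
    q: 1.96 for the Gaussian 95 % interval, or the manifest's
    split-conformal residual quantile when present.
    """
    mu = member_preds.mean(axis=0)
    sigma = member_preds.std(axis=0, ddof=1)  # cross-member spread, unbiased
    return mu, sigma, (mu - q * sigma, mu + q * sigma)
```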
4a0f1d5 to f7f0aad
Pull request overview
This PR scaffolds a probabilistic (interval-producing) training/eval workflow for the codec-aware fr_regressor_v2 by introducing a 5-member deep ensemble + optional split-conformal calibration, along with the associated tiny-model artifacts and documentation.
Changes:
- Add a trainer script to produce an ONNX-per-member ensemble plus a JSON manifest capturing ensemble metadata and calibration parameters.
- Add an evaluator script to compute empirical coverage/width metrics for Gaussian vs conformal intervals from the manifest + members.
- Register the shipped smoke-only ONNX members in the tiny-model registry and document the design via ADR/research/model-card/rebase-notes/changelog updates.
Reviewed changes
Copilot reviewed 11 out of 21 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `ai/scripts/train_fr_regressor_v2_ensemble.py` | New ensemble trainer (multi-seed training, ONNX export per member, manifest emission, registry update). |
| `ai/scripts/eval_probabilistic_proxy.py` | New evaluator that loads the manifest + members and reports empirical coverage/width metrics. |
| `model/tiny/fr_regressor_v2_ensemble_v1.json` | New ensemble manifest sidecar pinning members, standardisation stats, vocab, and confidence parameters. |
| `model/tiny/registry.json` | Adds 5 new smoke-only `kind: "fr"` entries for the ensemble members. |
| `docs/ai/models/fr_regressor_v2_probabilistic.md` | New model card explaining interval semantics, manifest layout, and (re)training/eval usage. |
| `docs/research/0067-fr-regressor-v2-probabilistic.md` | New research digest motivating deep ensemble + conformal and outlining tradeoffs. |
| `docs/adr/0279-fr-regressor-v2-probabilistic.md` | New ADR capturing the decision and alternatives considered. |
| `docs/adr/README.md` | Adds ADR-0279 row to the ADR index table. |
| `ai/AGENTS.md` | Adds rebase-sensitive invariants for the ensemble/manifest contract. |
| `docs/rebase-notes.md` | Adds rebase note entry documenting touched files and re-test commands. |
| `CHANGELOG.md` | Adds an Unreleased "Added" entry describing the scaffold. |
Comments suppressed due to low confidence (3)
docs/adr/README.md:265
docs/adr/README.md is generated from docs/adr/_index_fragments/ (see docs/adr/_index_fragments/README.md). Direct edits will be overwritten and tend to cause merge conflicts; add a row fragment, append the slug to _order.txt, then regenerate via scripts/docs/concat-adr-index.sh --write.
| [ADR-0272](0272-fr-regressor-v2-codec-aware-scaffold.md) | `fr_regressor_v2` codec-aware scaffold — first downstream consumer of the vmaf-tune Phase A JSONL corpus ([ADR-0237](0237-quality-aware-encode-automation.md)). Ships [`ai/scripts/train_fr_regressor_v2.py`](../../ai/scripts/train_fr_regressor_v2.py), a smoke ONNX (`fr_regressor_v2.onnx` registered with `smoke: true`), sidecar JSON, and full doc surface ([model card](../ai/models/fr_regressor_v2.md), [research digest](../research/0058-fr-regressor-v2-feasibility.md)). Two-input ONNX: 6 canonical libvmaf features (`adm2`, `vif_scale0..3`, `motion2`, StandardScaler-normalised) + 8-D codec block (6-way encoder one-hot + preset_norm + crf_norm, both in `[0, 1]`). MLP shape `6 -> 16 -> 16 -> 1` with codec block concatenated before the first dense layer (matches the existing `FRRegressor(num_codecs=8)` plumbing landed by [ADR-0235](0235-codec-aware-fr-regressor.md)). Registry row stays `smoke: true` until a follow-up PR (T7-FR-REGRESSOR-V2-PROD) re-runs training on a real Phase A corpus and clears v1's 0.95 LOSO PLCC ship gate with the ≥0.005 multi-codec lift required by ADR-0235. | Proposed | ai, dnn, tiny-ai, fr-regressor, codec-aware, vmaf-tune, fork-local |
CHANGELOG.md:34
- The Unreleased section of CHANGELOG.md is rendered from changelog.d/ fragments (see changelog.d/README.md). This direct edit will drift from the generated output and is likely to fail the fragment-drift check; please add a fragment under changelog.d/added/ and regenerate instead.
- **`fr_regressor_v2` codec-aware scaffold — first downstream consumer
of the vmaf-tune Phase A JSONL corpus (ADR-0272, prereq for
Phase B).** Ships
[`ai/scripts/train_fr_regressor_v2.py`](ai/scripts/train_fr_regressor_v2.py)
— a scaffold-only trainer that consumes the JSONL corpus emitted by
`vmaf-tune corpus` (ADR-0237 Phase A) and trains the codec-aware
variant of the v1 FR regressor. Two-input ONNX (`features` shape
`(N, 6)` canonical-6 + `codec` shape `(N, 8)` block —
`[encoder_onehot(6), preset_norm, crf_norm]`); reuses the existing
`FRRegressor(num_codecs=8)` class plumbed by ADR-0235. A `--smoke`
mode synthesises 100 fake corpus rows and trains 1 epoch so the
pipeline is end-to-end exercisable in CI without hours of encode
time. Registers `fr_regressor_v2` in `model/tiny/registry.json`
with `smoke: true` until a follow-up PR runs production training on
a real Phase A corpus and clears the ADR-0235 ship gate (≥0.005
multi-codec PLCC lift over v1's 0.95 LOSO floor). Doc surface:
[model card](docs/ai/models/fr_regressor_v2.md),
[research digest](docs/research/0058-fr-regressor-v2-feasibility.md),
[ADR-0272](docs/adr/0272-fr-regressor-v2-codec-aware-scaffold.md),
`ai/AGENTS.md` invariant note pinning the codec block layout and
encoder vocabulary. Smoke validated locally (`python
ai/scripts/train_fr_regressor_v2.py --smoke` produces a valid
opset-17 two-input ONNX, op-allowlist clean, torch-vs-ORT roundtrip
within 1e-4 atol). No upstream-mirror file touched; pure additive
docs/rebase-notes.md:6682
- This rebase note says the ADR index row was appended directly to docs/adr/README.md, but that file is generated from docs/adr/_index_fragments/. Please update the note to reflect the fragment-based workflow (add fragment + update _order.txt + regenerate) so future rebases don't repeat a non-durable edit.
unchanged. The migration only touches state-management
boilerplate around the kernel; the SSE accumulator math, the
per-bpc kernel function lookup, the host-side `log10` score
formula, and the dispatch grid-dim calculation are byte-identical
to the prior implementation. Netflix golden gate + CPU/CUDA
Comment on lines +1 to +8

```markdown
# Research-0054: probabilistic `fr_regressor_v2` — deep-ensemble + conformal

- **Date**: 2026-05-03
- **Authors**: Lusoris, Claude (Anthropic)
- **Status**: Final (scaffold-time digest)
- **Tags**: ai, fr-regressor, probabilistic, ensemble, conformal
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
  ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
```
Comment on lines +7 to +8

```markdown
- **Related**: ADR-0279 (this scaffold), ADR-0272 (parent v2 deterministic),
  ADR-0237 (vmaf-tune Phase A consumer), PR #354 audit Bucket #18
```
```markdown
# `fr_regressor_v2_ensemble_v1` — probabilistic FR regressor (deep-ensemble + conformal)

`fr_regressor_v2_ensemble_v1` is a **probabilistic** successor to the
codec-aware `fr_regressor_v2` (parent: [ADR-0272](../../adr/0272-fr-regressor-v2-codec-aware-scaffold.md))
```
```markdown
## Context

The codec-aware [`fr_regressor_v2`](0272-fr-regressor-v2-codec-aware-scaffold.md)
```
Comment on lines +11 to +12

```markdown
distinct random seeds and exports each copy as
``model/tiny/fr_regressor_v2_seed<N>.onnx`` plus a manifest sidecar
```
```python
])  # (5, N)
mu, sigma = preds.mean(axis=0), preds.std(axis=0, ddof=1)

q = manifest["confidence"].get("conformal_q_residual") or manifest["confidence"]["gaussian_z"]
```
Comment on lines +57 to +63

| Method | UCI 95 % cov. | KITTI depth 95 % cov. | Notes |
| --- | --- | --- | --- |
| Deep ensemble (N=5) | 0.93–0.95 | 0.91–0.94 | Best of the four pre-conformal; dominates MC-dropout consistently. |
| MC-dropout (T=10) | 0.85–0.91 | 0.78–0.86 | Underestimates variance; gets worse on OOD inputs. |
| Heteroscedastic NLL | 0.78–0.92 (high variance) | 0.70–0.88 | Aleatoric only; collapses on epistemic-uncertainty regimes. |
| Bayesian last-layer | 0.90–0.94 | 0.88–0.92 | Comparable to MC-dropout; substantially more engineering. |
| **Any method + conformal** | **≥ 0.95 by construction** | **≥ 0.95 by construction** | Marginal coverage guarantee on exchangeable data (Vovk 2005, Lei 2018). |
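The "by construction" row follows from the split-conformal recipe: take a finite-sample-corrected quantile of residuals on a held-out calibration split. A minimal sketch, assuming σ-normalised residuals so the result plugs directly into the PR's `mu ± q · σ` rule (the function name and exact normalisation are illustrative, not the trainer's actual helper):

```python
import numpy as np

def split_conformal_q(y_cal, mu_cal, sigma_cal, alpha=0.05):
    """Residual quantile q such that mu +/- q * sigma covers >= 1 - alpha
    of exchangeable data (Vovk 2005, Lei 2018)."""
    scores = np.abs(y_cal - mu_cal) / sigma_cal        # sigma-normalised residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return float(np.quantile(scores, level, method="higher"))
```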
Comment on lines +39 to +40

```markdown
nominal_coverage: 0.95, conformal_q_residual: <float?>,
feature_mean / feature_std: list[6] }``.
```
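Read together with the aggregation snippet above, those fields suggest a manifest shaped roughly like the following. This is purely illustrative — the exact nesting and every value here are guesses, not the shipped `model/tiny/fr_regressor_v2_ensemble_v1.json`:

```json
{
  "ensemble_id": "fr_regressor_v2_ensemble_v1",
  "members": [
    { "path": "fr_regressor_v2_ensemble_v1_seed0.onnx", "sha256": "<hex>" },
    { "path": "fr_regressor_v2_ensemble_v1_seed1.onnx", "sha256": "<hex>" }
  ],
  "codec_vocab": ["<encoder names, order pinned by ai/AGENTS.md>"],
  "feature_mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
  "feature_std": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
  "confidence": {
    "nominal_coverage": 0.95,
    "gaussian_z": 1.96,
    "conformal_q_residual": null
  }
}
```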
Comment on lines +320 to +333

```python
    """Add / replace the ensemble registry row.

    The registry schema only knows scoring kinds (fr / nr / filter), so
    each ensemble *member* is registered as kind=``fr`` with a stable
    id ``<ensemble_id>_seed<N>`` and the manifest sidecar
    (``<ensemble_id>.json``) is the higher-level entry point. This
    keeps `validate_model_registry.py` green without a schema bump.
    The ensemble manifest itself is referenced via the first member's
    ``notes`` field so downstream tooling can discover it.
    """
    registry = json.loads(registry_path.read_text())
    models = registry.get("models", [])
    keep = [m for m in models if not m.get("id", "").startswith(f"{ensemble_id}_seed")]
    keep = [m for m in keep if m.get("id") != ensemble_id]
```
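Per that docstring, a member row would look something like the sketch below. Only the `id`, `kind`, `smoke`, and `notes` semantics are confirmed by the PR; the remaining field names are assumptions about the registry schema:

```json
{
  "id": "fr_regressor_v2_ensemble_v1_seed0",
  "kind": "fr",
  "smoke": true,
  "sha256": "<hex>",
  "notes": "Member of fr_regressor_v2_ensemble_v1; manifest: model/tiny/fr_regressor_v2_ensemble_v1.json"
}
```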
```markdown
- [ADR-0040](0040-dnn-session-multi-input-api.md),
  [ADR-0041](0041-lpips-sq-extractor.md) — multi-input ONNX precedent
  the v2 ensemble member graph follows.
- Source: `req` (PR #354 audit Bucket #18, top-3 ranked).
```
lusoris pushed a commit that referenced this pull request May 5, 2026: …e (ADR-0303)
lusoris added a commit that referenced this pull request May 5, 2026

…e (ADR-0303) (#399)

Builds on PR #372 (ensemble scaffold — five smoke seed rows in model/tiny/registry.json) and ADR-0291 (deterministic v2 prod flip + 0.95 LOSO PLCC ship gate). Adds the LOSO trainer + production-flip gate so the seeds can flip from smoke: true to smoke: false after a real-corpus LOSO run.

The production ship gate is two-part per ADR-0303:

* mean_i(PLCC_i) >= 0.95 — inherits the ADR-0235 / ADR-0291 ship gate per ensemble member.
* max_i(PLCC_i) - min_i(PLCC_i) <= 0.005 — variance bound that protects the predictive-distribution semantics that the in-flight vmaf-tune --quality-confidence flag (ADR-0237 consumer) relies on. Without it, the mean PLCC could mask a one-seed-wins-four-seeds-tie configuration that breaks conformal calibration.

Per-seed registry rows flip smoke: true -> false only after that seed clears its individual PLCC_i >= 0.95 gate; the ensemble-mean entry (if/when registered) flips only after all five seeds clear AND the variance bound holds.

The trainer's body is a stub on this branch — the real Phase A canonical-6 corpus is not present and the registry rows are NOT flipped here. CI workflow wiring of the gate is intentionally deferred to the follow-up flip PR (no real loso_seed{N}.json artefacts exist on master to gate on yet).

Verification:

* python3 -c "import ast; ast.parse(open('ai/scripts/train_fr_regressor_v2_ensemble_loso.py').read())" — clean.
* python3 -c "import ast; ast.parse(open('scripts/ci/ensemble_prod_gate.py').read())" — clean.
* python ai/scripts/train_fr_regressor_v2_ensemble_loso.py --help — exits 0.
* python scripts/ci/ensemble_prod_gate.py --help — exits 0.

Refs: PR #372 (ensemble scaffold), ADR-0291 (deterministic v2 prod flip), ADR-0279 (probabilistic head), ADR-0237 (vmaf-tune Phase A consumer), ADR-0235 (codec-aware decision + ship gate).

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
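A minimal sketch of that two-part check, assuming one LOSO PLCC per seed as input (the real scripts/ci/ensemble_prod_gate.py body is not shown in this PR and may differ):

```python
def ensemble_prod_gate(plcc_by_seed: list[float],
                       mean_floor: float = 0.95,
                       spread_cap: float = 0.005) -> bool:
    """ADR-0303 two-part ship gate: mean-PLCC floor + cross-seed spread bound."""
    mean_ok = sum(plcc_by_seed) / len(plcc_by_seed) >= mean_floor
    spread_ok = max(plcc_by_seed) - min(plcc_by_seed) <= spread_cap
    return mean_ok and spread_ok

def seed_may_flip(plcc_i: float, floor: float = 0.95) -> bool:
    """Per-seed registry flip: smoke true -> false once this seed clears the floor."""
    return plcc_i >= floor
```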
Summary
Adds a probabilistic head on top of the codec-aware `fr_regressor_v2` (parent: ADR-0272 / PR #347 in flight) so producers can drive the in-flight `vmaf-tune --quality-confidence 0.95` flag (ADR-0237) off a calibrated prediction interval instead of v2's bare MOS scalar. PR #354 audit Bucket #18 (top-3 ranked).

- `ai/scripts/train_fr_regressor_v2_ensemble.py` trains N=5 copies of the v2 architecture (`FRRegressor(num_codecs=NUM_CODECS)`) under distinct seeds, exports each as a separate two-input ONNX (`features [N, 6]` + `codec_onehot [N, NUM_CODECS]`), and writes the ensemble manifest sidecar `model/tiny/fr_regressor_v2_ensemble_v1.json` pinning per-member sha256s, feature standardisation, codec vocab, nominal coverage, and an optional split-conformal residual quantile from a held-out calibration split.
- Inference rule: `mu +/- q * sigma` with `q = 1.96` (Gaussian) or the empirical conformal quantile (Vovk 2005, Romano 2019 - distribution-free marginal coverage on exchangeable data).
- `ai/scripts/eval_probabilistic_proxy.py` reports empirical coverage at 50/80/95 % nominal levels, mean interval width, and the mean-prediction PLCC; an extra conformal row reports the calibrated interval's coverage when the manifest carries the conformal scalar.
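For intuition, empirical coverage at a nominal level is just the fraction of true values that land inside the interval. A self-contained toy check (illustrative only, not the evaluator's actual code; Gaussian z-values assumed per level):

```python
import numpy as np

def empirical_coverage(y, mu, sigma, q):
    """Fraction of true values falling inside mu +/- q * sigma."""
    return float((np.abs(y - mu) <= q * sigma).mean())

# Well-calibrated Gaussian predictions should track the nominal levels.
rng = np.random.default_rng(0)
mu = rng.normal(size=10_000)
sigma = np.full_like(mu, 0.5)
y = mu + sigma * rng.normal(size=mu.shape)
for level, z in [(0.50, 0.674), (0.80, 1.282), (0.95, 1.960)]:
    print(f"{level:.0%} nominal -> {empirical_coverage(y, mu, sigma, z):.3f} empirical")
```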
Six deep-dive deliverables (ADR-0108)

1. Research digest: `docs/research/0054-fr-regressor-v2-probabilistic.md`.
2. Decision matrix: `docs/adr/0279-fr-regressor-v2-probabilistic.md` § Alternatives considered (deep-ensemble vs. heteroscedastic NLL vs. MC-dropout vs. quantile regression vs. Bayesian last-layer vs. bootstrap).
3. AGENTS.md invariant note: appended to `ai/AGENTS.md`.
4. Reproducer / smoke-test command: `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` followed by `python ai/scripts/eval_probabilistic_proxy.py --smoke`.
5. CHANGELOG fragment: `CHANGELOG.md` `### Added` entry under "Unreleased - lusoris fork".
6. Rebase note: `### 0229` in `docs/rebase-notes.md`.

Test plan
- `python ai/scripts/train_fr_regressor_v2_ensemble.py --smoke` trains 5 members in ~3.5 s and exports 5 valid two-input ONNX members + manifest sidecar (ran locally).
- `python ai/scripts/eval_probabilistic_proxy.py --smoke` loads all 5 ONNX members, aggregates `(mu, sigma)`, and reports coverage at 50/80/95 % (numbers are nonsensical on the 1-epoch synthetic smoke - the script is the gate, not the score).
- `python ai/scripts/validate_model_registry.py` - 15 entries valid against `registry.schema.json`.
- `pre-commit run --files <changed>` - Passed (black / isort / ruff / json-check / secrets / semgrep).
- `markdownlint-cli2` on all 3 new docs - 0 errors.
- Production training: deferred, tracked as backlog item T7-FR-REGRESSOR-V2-PROBABILISTIC (per ADR-0279 § Consequences).
Notes for the reviewer
- The `fr_regressor_v2` deterministic scaffold is in flight as PR #347 ("feat(ai): fr_regressor_v2 codec-aware scaffold (Phase B prereq)"; ADR-0261 in that PR's tree). This PR cites it as ADR-0272 in `## References` as a placeholder; renumber at merge time if needed.
- Each ensemble member is its own `kind: "fr"` row in `model/tiny/registry.json` (5 new rows: `fr_regressor_v2_ensemble_v1_seed{0..4}`) so the existing tiny-model verifier sha256-checks each member without a registry-schema bump. A future `kind: "fr_ensemble"` schema bump is noted as a follow-up.
- The consumer API (`vmaf_dnn_score_with_interval`) and the `vmaf-tune --quality-confidence` flag are separate follow-up PRs; this PR is the training-side scaffold only.
🤖 Generated with Claude Code