research(ai): encoder knob-sweep — Pareto hulls + recipe regressions (Research-0080) by lusoris · Pull Request #406 · lusoris/vmaf

lusoris · 2026-05-05T20:37:32Z

Summary

Runs the Research-0077 / ADR-0305 analysis script (ships in PR #400) over the 12,636-cell Phase A knob sweep at runs/phase_a/full_grid/comprehensive.jsonl and writes the populated findings into docs/research/0080-encoder-knob-sweep-findings.md. ADR-0308 commits the fork to a structural-vs-content-dependent threshold for revision policy.

Headline findings (one sentence)

CQP regresses roughly 3× less often than CBR/VBR (6.6 % vs 20.2 % / 18.7 %), and h264_nvenc dominates the structural regression cluster: each of the top-15 bad-recipe cells (h264_nvenc + bf3 / spatial_aq / full_hq under CBR/VBR, plus a smaller hevc_nvenc + spatial_aq cluster) reproduces on all 9 corpus sources, re-confirming Research-0063 with hard numbers.

| codec | slices | max VMAF | bitrate p50 (kbps) | bitrate p95 (kbps) | enc time p50 (ms) | regressions |
|---|---|---|---|---|---|---|
| `av1_nvenc` | 27 | 99.98 | 2,266 | 11,733 | 546 | 289 |
| `av1_qsv` | 27 | 99.97 | 2,566 | 14,249 | 398 | 84 |
| `h264_nvenc` | 27 | 99.87 | 3,643 | 18,519 | 540 | 636 |
| `h264_qsv` | 27 | 99.97 | 3,511 | 17,556 | 435 | 281 |
| `hevc_nvenc` | 27 | 99.93 | 3,537 | 16,553 | 543 | 515 |
| `hevc_qsv` | 27 | 99.97 | 2,571 | 10,690 | 405 | 110 |
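
For readers unfamiliar with the per-slice hulls counted in the table, the following is a minimal sketch of a Pareto frontier over (bitrate, VMAF) points, where lower bitrate and higher VMAF are both better. It is an illustration only: the function name and the tuple layout are assumptions, and the real analyser (`ai/scripts/analyze_knob_sweep.py`) may use a different hull definition.

```python
def pareto_frontier(points):
    """Return the non-dominated (bitrate_kbps, vmaf) points of one slice.

    A point is dominated when another point has bitrate <= and VMAF >=,
    with at least one of the two strictly better.
    """
    frontier = []
    best_vmaf = float("-inf")
    # Sort by bitrate ascending; on bitrate ties put the highest VMAF first
    # so that only the best point at a given bitrate can be kept.
    for bitrate, vmaf in sorted(points, key=lambda p: (p[0], -p[1])):
        if vmaf > best_vmaf:
            frontier.append((bitrate, vmaf))
            best_vmaf = vmaf
    return frontier


# Hypothetical slice: five encodes, two of which survive on the frontier.
slice_points = [(1000, 90.0), (2000, 95.0), (1500, 89.0), (2000, 94.0), (3000, 95.0)]
frontier = pareto_frontier(slice_points)
```

The single sorted pass keeps the sketch at O(n log n) per slice, which is ample for a sweep of this size.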

Decision (ADR-0308)

A recipe regression is structural iff it reproduces on ≥7 of 9 corpus sources within one (codec, rc_mode, recipe, preset, q) cell. Structural regressions are forbidden as tools/vmaf-tune/codec_adapters/* defaults and forbidden as vmaf-tune recommend outputs without explicit override. Content-dependent regressions (1-6 sources) are filtered at recommend-time only via the per-slice hull lookup. The detector remains an offline gate.
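
The 7-of-9 rule above can be sketched as a small grouping step. This is an illustrative sketch, not the code in the analyser or in `vmaf-tune`: the row field names (`codec`, `rc_mode`, `recipe`, `preset`, `q`, `source`) and the function name are assumptions chosen to mirror the cell definition in the ADR.

```python
from collections import defaultdict

# ADR-0308 threshold: a cell is structural when it reproduces on >= 7 of 9 sources.
STRUCTURAL_MIN_SOURCES = 7


def classify_regressions(rows, min_sources=STRUCTURAL_MIN_SOURCES):
    """Label each (codec, rc_mode, recipe, preset, q) cell of flagged rows.

    `rows` is an iterable of dicts, one per flagged regression; field names
    are hypothetical. Returns {cell: "structural" | "content-dependent"}.
    """
    sources_by_cell = defaultdict(set)
    for row in rows:
        cell = (row["codec"], row["rc_mode"], row["recipe"], row["preset"], row["q"])
        # Count distinct sources, so repeated rows for one source do not inflate the tally.
        sources_by_cell[cell].add(row["source"])
    return {
        cell: "structural" if len(sources) >= min_sources else "content-dependent"
        for cell, sources in sources_by_cell.items()
    }
```

Counting distinct sources rather than rows keeps the classification stable if the same (cell, source) pair is flagged more than once.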

Six deep-dive deliverables (CLAUDE §11 / ADR-0108)

(1) Research digest: docs/research/0080-encoder-knob-sweep-findings.md
(2) Decision matrix: ADR-0308 §Alternatives considered (4 options: 7-of-9 structural, forbid-all, accept-all, fixture-only CI gate)
(3) AGENTS.md invariant note: ai/AGENTS.md §Knob-sweep recipe-regression policy (cites ADR-0305 invariant + ADR-0308 cut)
(4) Reproducer / smoke-test command: see Test plan
(5) CHANGELOG fragment: changelog.d/changed/encoder-knob-sweep-findings.md
(6) Rebase note: docs/rebase-notes.md §0308

Constraints honoured

Did not modify ai/scripts/analyze_knob_sweep.py (used public API unchanged via a throw-away wrapper for the field-name rename src→source etc).
Did not modify tools/vmaf-tune/codec_adapters/* (recipe revisions land in follow-up PRs).
Did not commit runs/ artefacts (.gitignore covers them).
Documentation-only; ~452 LOC against the 600-LOC budget.
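
The throw-away wrapper mentioned in the first constraint is not part of this PR. A minimal sketch of what such a field-rename adapter could look like is shown below; only the `src` to `source` rename is named in the PR, so the mapping table and both function names are assumptions.

```python
import json

# Only src -> source is stated in the PR; extend this mapping as needed.
FIELD_RENAMES = {"src": "source"}


def adapt_row(row, renames=FIELD_RENAMES):
    """Return a copy of one sweep row with legacy field names renamed."""
    return {renames.get(key, key): value for key, value in row.items()}


def adapt_jsonl(lines, renames=FIELD_RENAMES):
    """Yield adapted JSONL lines, skipping blank lines in the input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.dumps(adapt_row(json.loads(line), renames))
```

Because the adapter only rewrites keys, the analyser's public API can be called unchanged on the adapted file.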

Test plan

Wait for PR #400 (ADR-0305 + Research-0077 + analyser script) to merge first.
Rebase this branch onto master post-#400.
pytest ai/tests/test_knob_sweep_analysis.py -v (analyser logic, lands in PR #400; verifies the script my findings depend on).
Offline regenerate sweep + reanalyse: python tools/vmaf-tune/src/vmaftune/hw_encoder_corpus.py … (~3 h on a single host, NVENC + QSV) → adapt fields → python ai/scripts/analyze_knob_sweep.py --jsonl <adapted.jsonl> --out-dir runs/phase_a/full_grid/reports/ → diff summary.md against the headline table in Research-0080.
Verify make format-check (no Python touched in this PR; markdown only).
Confirm no lint regressions on touched files (ai/AGENTS.md, docs/adr/README.md, docs/rebase-notes.md, the three new markdown files).

Known queueing

Queued behind PR #400. Rebase target: master after PR #400 merges. Findings cite ADR-0305 / Research-0077 / ai/scripts/analyze_knob_sweep.py as forward references; those land via PR #400.

🤖 Generated with Claude Code

…(ADR-0313) (#410)

* ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313)

  The 23-named-required-check posture (ADR-0037) deadlocks doc/Python-only PRs: the C-build matrix path-filter-skips on their diffs, but branch protection counts a path-filter-skip + a never-ran-at-all as not satisfying the required-check. PR #400 hit this concretely (10/23 succeeded; 13/23 either skipped or never reported; gh pr merge returned "the base branch policy prohibits the merge").

  Aggregator is one workflow with no path filter. It polls up to 8 minutes for sibling workflows to register, then verifies each named check on the head SHA reported success/skipped/neutral (or didn't appear at all, which is the documented path-filter rejection semantics). Aggregator becomes the single branch-protection required check; the 23 individual workflows continue to run unchanged.

  Manual operator step at adoption (after this PR merges):

      gh api -X PUT "repos/lusoris/vmaf/branches/master/protection/required_status_checks" \
        -F 'strict=true' -F 'contexts=["Required Checks Aggregator"]'

  Unblocks #400, #403, #404, #405, #406, #407 currently stuck on the deadlock. Per user popup direction 2026-05-05.

  Files: .github/workflows/required-aggregator.yml (new), docs/adr/0313-*.md (new), changelog.d/added/*.md (new), docs/adr/README.md (+1 row), docs/adr/_index_fragments/_order.txt (+1 line + new fragment), docs/rebase-notes.md §0313.

* ci: retrigger after PR body cleanup

* ci: retrigger after deliverables opt-out polarity fix

---------

Co-authored-by: Lusoris <lusoris@pm.me>
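
The aggregator's pass/fail rule described in the commit message above can be sketched as follows. This is not the workflow's actual implementation (which lives in `.github/workflows/required-aggregator.yml` and also handles polling); the function name, the input shape, and the example check names are assumptions.

```python
# Conclusions the commit message lists as acceptable for a named check.
PASSING_CONCLUSIONS = {"success", "skipped", "neutral"}


def failing_checks(required_names, reported):
    """Return {name: conclusion} for required checks that block the merge.

    `reported` maps a check name to its conclusion on the head SHA. A name
    missing from `reported` is treated as a path-filter skip and passes,
    mirroring the semantics described in the commit message.
    """
    failures = {}
    for name in required_names:
        conclusion = reported.get(name)
        if conclusion is None:
            continue  # never appeared: treated as path-filter-skipped
        if conclusion not in PASSING_CONCLUSIONS:
            failures[name] = conclusion
    return failures


# Hypothetical check names, for illustration only.
blocked = failing_checks(
    ["build-linux", "docs-lint", "python-tests"],
    {"docs-lint": "success", "python-tests": "failure"},
)
```

An empty result means the aggregator would report success; the polling window only exists so that sibling workflows have time to register before the absent-means-skipped rule is applied.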

…earch-0080)

Runs the Research-0077 / ADR-0305 analysis script (ai/scripts/analyze_knob_sweep.py, ships in PR #400) over the 12,636-cell Phase A sweep at runs/phase_a/full_grid/comprehensive.jsonl and records the populated Pareto-hull populations + recipe-regression count per codec in Research-0080. ADR-0308 commits the fork to a structural-vs-content-dependent threshold for revision policy.

Headline findings:

- 162 realised slices (every slice has a populated hull).
- 1,915 recipe-vs-bare regressions at default tolerances (bitrate_tol_pct=5, vmaf_tol=0.1).
- CQP regression rate 6.6 % vs CBR 20.2 % / VBR 18.7 % re-confirms Research-0063 with hard numbers.
- Top-15 aggregated bad-recipe cells all reproduce on all 9 corpus sources, clustered around h264_nvenc + bf3 / spatial_aq / full_hq under CBR/VBR plus a smaller hevc_nvenc + spatial_aq cluster.

Decision (ADR-0308): a recipe regression is structural iff it reproduces on >=7 of 9 corpus sources within one (codec, rc_mode, recipe, preset, q) cell. Structural regressions are forbidden as tools/vmaf-tune/codec_adapters/* defaults and forbidden as vmaf-tune recommend outputs without explicit override; content-dependent regressions (1-6 sources) are filtered at recommend-time only via the per-slice hull lookup. The detector remains an offline gate (3-hour sweep too expensive for CI); promotion to a CI gate is deferred until a smaller stratified sample reproduces the structural patterns. Per-codec adapter revisions land as separate follow-up PRs for clean bisect signals.

Six deep-dive deliverables (CLAUDE §11 / ADR-0108):

- Research digest: docs/research/0080-encoder-knob-sweep-findings.md.
- Decision matrix: ADR-0308 §Alternatives considered (4 options).
- AGENTS.md invariant note: ai/AGENTS.md §Knob-sweep recipe-regression policy (cites ADR-0305 + ADR-0308).
- Reproducer: pytest ai/tests/test_knob_sweep_analysis.py -v (script logic, ships in PR #400) + offline analyser run command in docs/rebase-notes.md §0308.
- CHANGELOG fragment: changelog.d/changed/encoder-knob-sweep-findings.md.
- Rebase note: docs/rebase-notes.md §0308.

Constraints honoured:

- Did not modify ai/scripts/analyze_knob_sweep.py (uses public API unchanged via a throw-away wrapper for the field-name rename).
- Did not modify tools/vmaf-tune/codec_adapters/* (recipe revisions land in follow-up PRs).
- Did not commit runs/ artefacts (.gitignore covers them).
- Documentation-only; ~452 LOC against the 600-LOC budget.

This PR queues behind PR #400 (ADR-0305 / Research-0077 / analysis script). Rebase target: master after PR #400 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
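
The default tolerances quoted in the commit message (`bitrate_tol_pct=5`, `vmaf_tol=0.1`) suggest a recipe-vs-bare comparison along the following lines. The PR does not spell out the analyser's exact matching rule, so this sketch assumes one plausible reading: two encodes are "matched" when their bitrates differ by at most `bitrate_tol_pct` percent of the bare bitrate, and the recipe regresses when it then loses more than `vmaf_tol` VMAF. The function and field names are hypothetical.

```python
def is_regression(bare, recipe, bitrate_tol_pct=5.0, vmaf_tol=0.1):
    """Return True when `recipe` loses VMAF against `bare` at matched bitrate.

    Both arguments are dicts with hypothetical keys "bitrate_kbps" and "vmaf".
    The matching rule is an assumption, not the analyser's documented logic.
    """
    gap_pct = abs(recipe["bitrate_kbps"] - bare["bitrate_kbps"]) / bare["bitrate_kbps"] * 100.0
    matched = gap_pct <= bitrate_tol_pct
    lost_quality = (bare["vmaf"] - recipe["vmaf"]) > vmaf_tol
    return matched and lost_quality


# Hypothetical pair: about 3.3 % more bitrate and 0.5 VMAF lower than bare.
bare_row = {"bitrate_kbps": 3000.0, "vmaf": 95.0}
recipe_row = {"bitrate_kbps": 3100.0, "vmaf": 94.5}
flagged = is_regression(bare_row, recipe_row)
```

Under this reading, a recipe whose bitrate falls outside the tolerance band is simply not comparable and is never flagged, which is one reason the real detector may apply a different rule.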

Copilot

Pull request overview

Documentation-only PR that records the populated results of the Phase A encoder knob-sweep analysis (Pareto hull populations + recipe-vs-bare regressions) and introduces a fork policy (ADR-0308) for classifying “structural” vs “content-dependent” recipe regressions to guide future vmaf-tune adapter-default decisions.

Changes:

Adds Research-0080 with the populated knob-sweep findings and aggregated regression patterns.
Adds ADR-0308 defining a 7-of-9 threshold policy for structural recipe regressions and how regressions should gate future defaults/recommendations.
Updates fork process/docs surfaces (ADR index row, rebase notes, changelog fragment, ai/AGENTS.md) to reflect the new policy and findings.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


| File | Description |
|---|---|
| docs/research/0080-encoder-knob-sweep-findings.md | New research digest capturing populated sweep findings and regression clusters. |
| docs/adr/0308-encoder-knob-sweep-recipe-regression-policy.md | New ADR defining the structural-vs-content-dependent regression policy. |
| docs/adr/README.md | Adds an ADR index row for ADR-0308 (but ADR index is generated from fragments). |
| docs/rebase-notes.md | Adds rebase-sensitive invariant note for the 7-of-9 threshold policy. |
| changelog.d/changed/encoder-knob-sweep-findings.md | Changelog fragment announcing the new findings + policy. |
| ai/AGENTS.md | Adds an invariant/policy note for contributors extending the analyzer/consumers. |


@@ -313,5 +313,6 @@ ADRs may exist there for local session continuity, but the tracked
 | [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) | `vmaf-tune corpus --coarse-to-fine` and a new `vmaf-tune recommend` subcommand replace the 52-encode full-grid sweep with a 2-pass coarse-then-fine search. Defaults: `coarse_step=10` over `[10..50]` (5 points) + `fine_radius=5 step=1` around best-coarse (up to 10 points) = 15 visited encodes per (source, preset) → 3.46× wall-time speedup vs full grid. 1-pass shortcut when the highest-CRF coarse point already meets `--target-vmaf` skips refinement entirely (~10× speedup). Builds on [ADR-0237](0237-quality-aware-encode-automation.md) (Phase A harness); no JSONL schema bump (visited rows use existing `SCHEMA_VERSION=1`). Widens the libx264 adapter `quality_range` from the old `(15, 40)` informative window to the codec's nominal `(0, 51)` so the search domain matches the user's CLI. | Accepted | tooling, automation, vmaf-tune, ffmpeg, fork-local |
 | [ADR-0307](0307-vmaf-tune-ladder-default-sampler.md) | `vmaf-tune` Phase E ladder default sampler is wired. `tools/vmaf-tune/src/vmaftune/ladder.py::_default_sampler` no longer raises `NotImplementedError`; it composes `corpus.iter_rows` (Phase A encode + score) with `recommend.pick_target_vmaf` (smallest-CRF-clearing-target predicate) over the canonical 5-point CRF sweep `DEFAULT_SAMPLER_CRF_SWEEP = (18, 23, 28, 33, 38)` at the codec adapter's mid-range preset (`"medium"` for libx264 / libx265 / libsvtav1). Builds on [ADR-0295](0295-vmaf-tune-phase-e-bitrate-ladder.md) (Phase E scaffold) and [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) (Phase B-equivalent recommend surface). The `SamplerFn` seam stays open — callers needing a finer grid or a non-CRF predicate pass an explicit `sampler=`. Companion research digest: [`docs/research/0079-vmaf-tune-ladder-default-sampler.md`](../research/0079-vmaf-tune-ladder-default-sampler.md). | Proposed | tooling, automation, vmaf-tune, ladder, fork-local |
 | [ADR-0309](0309-fr-regressor-v2-ensemble-real-corpus-retrain.md) | `fr_regressor_v2` ensemble real-corpus retrain harness + flip workflow. Follow-up to ADR-0303 / PR #399 that ships the operational harness for actually running the 5-seed × 9-fold LOSO retrain against the locally available Netflix Public Dataset (`.workingdir2/netflix/`) and emitting a machine-checkable verdict file. Adds `ai/scripts/run_ensemble_v2_real_corpus_loso.sh` (Bash wrapper that validates the corpus, loops the seeds through the existing `train_fr_regressor_v2_ensemble_loso.py`, and tees timestamped per-seed logs), `ai/scripts/validate_ensemble_seeds.py` (Python validator that calls the ADR-0303 gate, snapshots the corpus YUV file list as sha256 over sorted `relpath\tsize`, and writes `PROMOTE.json` on gate-pass with a recommendation to flip the five `fr_regressor_v2_ensemble_v1_seed{0..4}` rows in `model/tiny/registry.json` from `smoke: true` to `smoke: false`, or `HOLD.json` on gate-fail with the failing-seed details and a recommendation to keep `smoke: true` and investigate diversity / hyperparameters), unit tests for both verdict paths, and a runbook (`docs/ai/ensemble-v2-real-corpus-retrain-runbook.md`) covering prerequisites, the two-command run, verdict interpretation, and rollback if the registry was flipped prematurely. The harness deliberately does **not** run the LOSO inside the PR (6–12 h GPU work) and does **not** flip the registry (separate follow-up PR gated on a passing `PROMOTE.json` — preserves a clean revert surface and honours the ai/AGENTS.md invariant that registry-flip never happens during a rebase). Companion research digest: [`docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md`](../research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md). | Proposed | ai, fr-regressor, ensemble, loso, runbook, fork-local |
+| [ADR-0308](0308-encoder-knob-sweep-recipe-regression-policy.md) | Encoder knob-sweep recipe-regression revision policy: structural regressions (≥7 of 9 sources within a `(codec, rc_mode, recipe, preset, q)` cell) are forbidden as adapter-level defaults and `vmaf-tune recommend` outputs; content-dependent regressions filtered at recommend-time only. Detector stays offline (non-CI). Companion to [ADR-0305](0305-encoder-knob-space-pareto-analysis.md) + [Research-0080](../research/0080-encoder-knob-sweep-findings.md). | Proposed | ai, vmaf-tune, codec-adapters, knob-sweep, fork-local |



+# Research-0080: Encoder knob-sweep — populated Pareto hulls and recipe regressions
+
+- **Status**: Findings ready
+- **Date**: 2026-05-05
+- **Companion ADRs**: [ADR-0305](../adr/0305-encoder-knob-space-pareto-analysis.md) (methodology), [ADR-0308](../adr/0308-encoder-knob-sweep-recipe-regression-policy.md) (regression-revision policy)
+- **Companion digests**: [Research-0063](0063-encoder-knob-space-cq-vs-vbr-stratification.md) (CQ vs VBR stratification), [Research-0077](0077-encoder-knob-space-pareto-frontiers.md) (analysis scaffold)


+[ADR-0305](0305-encoder-knob-space-pareto-analysis.md) commits the
+fork to per-slice Pareto stratification on the 12,636-cell knob sweep
+and ships a regression detector
+([`ai/scripts/analyze_knob_sweep.py`](../../ai/scripts/analyze_knob_sweep.py))
+that flags recipes losing VMAF against the bare encoder default at
+matched bitrate within a slice. The policy question ADR-0305 left
+open is **what to do with the regressions once they are detected**:
+the analyser produces 1,915 flagged rows on the populated sweep


+  [Research-0077 / ADR-0305](docs/adr/0305-encoder-knob-space-pareto-analysis.md)
+  analysis script over the 12,636-cell Phase A sweep
+  (`runs/phase_a/full_grid/comprehensive.jsonl`) and records the


The 2026-05-06 merge train shipped 13 ADRs whose implementing PRs landed but Status was never bumped from Proposed to Accepted. Per docs/adr/README.md and ADR-0028, ADRs flip to Accepted once the deliverable lands. The train moved faster than the per-ADR Status edits could keep up; this PR catches up.

Flipped:

- ADR-0302 (#401, ENCODER_VOCAB v3 schema expansion)
- ADR-0303 (#399, fr_regressor_v2 ensemble prod-flip gate)
- ADR-0304 (#402, vmaf-tune fast-path Optuna TPE)
- ADR-0305 (#400, knob-sweep Pareto analysis scaffold)
- ADR-0307 (#404, vmaf-tune ladder default sampler)
- ADR-0308 (#406, knob-sweep recipe-regression policy)
- ADR-0309 (#405, ensemble retrain harness)
- ADR-0311 (#408, libfuzzer harness expansion)
- ADR-0313 (#410, CI Required Checks Aggregator) [table-format Status, sed-edited inline]
- ADR-0314 (#412, vmaf-tune --score-backend=vulkan)
- ADR-0316 (#414, cli_parse long-only-option assertion fix)
- ADR-0317 (#415, CI Docker + FFmpeg-SYCL flake fix)
- ADR-0319 (#422, ensemble LOSO trainer real impl)

Already-Accepted (no change): ADR-0310 (#407), ADR-0312 (#425), ADR-0315 (skeleton, intentionally Proposed), ADR-0321 (#424).

lusoris mentioned this pull request May 5, 2026

ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313) #410

Merged

10 tasks

lusoris marked this pull request as ready for review May 6, 2026 01:55

Copilot AI review requested due to automatic review settings May 6, 2026 01:55

lusoris force-pushed the research/encoder-knob-sweep-findings branch from a1a5923 to b039b9f Compare May 6, 2026 01:56

lusoris force-pushed the research/encoder-knob-sweep-findings branch from b039b9f to 37b1bc6 Compare May 6, 2026 01:56

Copilot started reviewing on behalf of lusoris May 6, 2026 01:57

ci: retrigger after deliverables N-prefix fix

8445f07

Copilot AI reviewed May 6, 2026

lusoris merged commit 30cd808 into master May 6, 2026
55 checks passed

lusoris deleted the research/encoder-knob-sweep-findings branch May 6, 2026 02:27

lusoris mentioned this pull request May 6, 2026

docs(adr): bulk flip Proposed → Accepted for 13 merge-train ADRs #426

Merged

9 tasks

lusoris mentioned this pull request May 6, 2026

feat/vmaf tune score backend vulkan #436

Merged

9 tasks

lusoris merged 2 commits into master from research/encoder-knob-sweep-findings

