research(ai): encoder knob-sweep — Pareto hulls + recipe regressions (Research-0080)#406

Merged
lusoris merged 2 commits into master from research/encoder-knob-sweep-findings
May 6, 2026

@lusoris lusoris commented May 5, 2026

Summary

Runs the Research-0077 / ADR-0305 analysis script (ships in PR #400) over the 12,636-cell Phase A knob sweep at runs/phase_a/full_grid/comprehensive.jsonl and writes the populated findings into docs/research/0080-encoder-knob-sweep-findings.md. ADR-0308 commits the fork to a structural-vs-content-dependent threshold for revision policy.

Headline findings (one sentence)

CQP regresses roughly 3× less often than CBR/VBR (6.6 % vs 20.2 % / 18.7 %), and h264_nvenc dominates the structural regression cluster: the top-15 bad-recipe cells (h264_nvenc + bf3 / spatial_aq / full_hq under CBR/VBR plus a smaller hevc_nvenc + spatial_aq cluster) all reproduce on all 9 corpus sources, re-confirming Research-0063 with hard numbers.

| codec | slices | max VMAF | bitrate p50 (kbps) | bitrate p95 (kbps) | enc time p50 (ms) | regressions |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| av1_nvenc | 27 | 99.98 | 2,266 | 11,733 | 546 | 289 |
| av1_qsv | 27 | 99.97 | 2,566 | 14,249 | 398 | 84 |
| h264_nvenc | 27 | 99.87 | 3,643 | 18,519 | 540 | 636 |
| h264_qsv | 27 | 99.97 | 3,511 | 17,556 | 435 | 281 |
| hevc_nvenc | 27 | 99.93 | 3,537 | 16,553 | 543 | 515 |
| hevc_qsv | 27 | 99.97 | 2,571 | 10,690 | 405 | 110 |
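
The per-codec figures can be cross-checked against the totals quoted elsewhere in this thread (162 slices, 1,915 regressions) and against the "3×" headline. The snippet below is only a sanity-check sketch over numbers copied from this PR; it does not read the sweep JSONL, and the variable names are illustrative.

```python
# Per-codec regression counts copied from the table in this PR.
regressions = {
    "av1_nvenc": 289,
    "av1_qsv": 84,
    "h264_nvenc": 636,
    "h264_qsv": 281,
    "hevc_nvenc": 515,
    "hevc_qsv": 110,
}
slices_per_codec = 27  # every codec row reports 27 slices

total_regressions = sum(regressions.values())
total_slices = slices_per_codec * len(regressions)

# Headline regression rates (percent) per rate-control mode, from the PR body.
rate_pct = {"cqp": 6.6, "cbr": 20.2, "vbr": 18.7}
cbr_vs_cqp = rate_pct["cbr"] / rate_pct["cqp"]
vbr_vs_cqp = rate_pct["vbr"] / rate_pct["cqp"]

# Share of all flagged rows that come from h264_nvenc alone.
h264_nvenc_share = regressions["h264_nvenc"] / total_regressions

print(total_slices, total_regressions)
print(round(cbr_vs_cqp, 2), round(vbr_vs_cqp, 2), round(h264_nvenc_share, 3))
```

The ratios come out slightly above and slightly below 3, which is why the headline is best read as "roughly 3×".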

Decision (ADR-0308)

A recipe regression is structural iff it reproduces on ≥7 of 9 corpus sources within one (codec, rc_mode, recipe, preset, q) cell. Structural regressions are forbidden as tools/vmaf-tune/codec_adapters/* defaults and forbidden as vmaf-tune recommend outputs without explicit override. Content-dependent regressions (1-6 sources) are filtered at recommend-time only via the per-slice hull lookup. The detector remains an offline gate.
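
A minimal sketch of the 7-of-9 cut, assuming each flagged regression row is a dict carrying the five cell keys plus a `source` field. These field names and the `classify_regressions` helper are hypothetical and are not taken from `ai/scripts/analyze_knob_sweep.py`.

```python
from collections import defaultdict

# ADR-0308 threshold: a cell is structural if it regresses on >= 7 of 9 sources.
STRUCTURAL_MIN_SOURCES = 7

CELL_KEYS = ("codec", "rc_mode", "recipe", "preset", "q")


def classify_regressions(rows, min_sources=STRUCTURAL_MIN_SOURCES):
    """Map each (codec, rc_mode, recipe, preset, q) cell to a policy label.

    `rows` is an iterable of flagged regression rows (assumed dict layout).
    """
    sources_by_cell = defaultdict(set)
    for row in rows:
        cell = tuple(row[key] for key in CELL_KEYS)
        # Count distinct sources, so duplicate rows cannot inflate a cell.
        sources_by_cell[cell].add(row["source"])

    return {
        cell: "structural" if len(sources) >= min_sources else "content-dependent"
        for cell, sources in sources_by_cell.items()
    }
```

Counting distinct sources per cell, rather than rows, is a deliberate choice in this sketch: the policy is phrased in terms of how many of the 9 corpus sources reproduce the regression.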

Six deep-dive deliverables (CLAUDE §11 / ADR-0108)

  • (1) Research digest: docs/research/0080-encoder-knob-sweep-findings.md
  • (2) Decision matrix: ADR-0308 §Alternatives considered (4 options: 7-of-9 structural, forbid-all, accept-all, fixture-only CI gate)
  • (3) AGENTS.md invariant note: ai/AGENTS.md §Knob-sweep recipe-regression policy (cites ADR-0305 invariant + ADR-0308 cut)
  • (4) Reproducer / smoke-test command: see Test plan
  • (5) CHANGELOG fragment: changelog.d/changed/encoder-knob-sweep-findings.md
  • (6) Rebase note: docs/rebase-notes.md §0308

Constraints honoured

  • Did not modify ai/scripts/analyze_knob_sweep.py (its public API is used unchanged, via a throw-away wrapper for field-name renames such as src→source).
  • Did not modify tools/vmaf-tune/codec_adapters/* (recipe revisions land in follow-up PRs).
  • Did not commit runs/ artefacts (.gitignore covers them).
  • Documentation-only; ~452 LOC against the 600-LOC budget.

Test plan

Known queueing

Queued behind PR #400. Rebase target: master after PR #400 merges. Findings cite ADR-0305 / Research-0077 / ai/scripts/analyze_knob_sweep.py as forward references; those land via PR #400.

🤖 Generated with Claude Code

lusoris added a commit that referenced this pull request May 5, 2026
…(ADR-0313) (#410)

* ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313)

The 23-named-required-check posture (ADR-0037) deadlocks doc/Python-only
PRs: the C-build matrix path-filter-skips on their diffs, but branch
protection counts a path-filter-skip + a never-ran-at-all as not
satisfying the required-check. PR #400 hit this concretely (10/23
succeeded; 13/23 either skipped or never reported; gh pr merge returned
"the base branch policy prohibits the merge").

Aggregator is one workflow with no path filter. It polls up to 8 minutes
for sibling workflows to register, then verifies each named check on the
head SHA reported success/skipped/neutral (or didn't appear at all,
which is the documented path-filter rejection semantics). Aggregator
becomes the single branch-protection required check; the 23 individual
workflows continue to run unchanged.

Manual operator step at adoption (after this PR merges):

  gh api -X PUT "repos/lusoris/vmaf/branches/master/protection/required_status_checks" \
    -F 'strict=true' -F 'contexts=["Required Checks Aggregator"]'

Unblocks #400, #403, #404, #405, #406, #407 currently stuck on the
deadlock. Per user popup direction 2026-05-05.

Files: .github/workflows/required-aggregator.yml (new),
docs/adr/0313-*.md (new), changelog.d/added/*.md (new),
docs/adr/README.md (+1 row), docs/adr/_index_fragments/_order.txt
(+1 line + new fragment), docs/rebase-notes.md §0313.

* ci: retrigger after PR body cleanup

* ci: retrigger after deliverables opt-out polarity fix

---------

Co-authored-by: Lusoris <lusoris@pm.me>
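
The verdict rule described in the ADR-0313 commit message above can be sketched as a pure function. This is an illustration under stated assumptions, not the contents of `.github/workflows/required-aggregator.yml`: the `failing_checks` helper and the check names in the usage comment are hypothetical, and the 8-minute polling loop is left out.

```python
# Conclusions the commit message lists as acceptable for a named check.
ACCEPTED_CONCLUSIONS = {"success", "skipped", "neutral"}


def failing_checks(required, reported):
    """Return {check_name: conclusion} for every required check that blocks the merge.

    `required` is the list of named required checks.
    `reported` maps check name -> conclusion for checks that appeared on the head SHA.
    A check that never appeared is treated as a path-filter skip and passes,
    matching the semantics described in the commit message.
    """
    failures = {}
    for name in required:
        if name not in reported:
            continue  # never registered: assumed to be path-filtered out
        conclusion = reported[name]
        if conclusion not in ACCEPTED_CONCLUSIONS:
            failures[name] = conclusion
    return failures


# Usage (hypothetical check names):
# failing_checks(["build", "lint", "docs"], {"build": "success", "lint": "skipped"})
# returns an empty dict, so the aggregator would report success.
```

Treating "never appeared" as a pass is what makes the polling window matter: the aggregator has to wait long enough for sibling workflows to register before it concludes that a check was path-filtered.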
@lusoris lusoris marked this pull request as ready for review May 6, 2026 01:55
Copilot AI review requested due to automatic review settings May 6, 2026 01:55
@lusoris lusoris force-pushed the research/encoder-knob-sweep-findings branch from a1a5923 to b039b9f Compare May 6, 2026 01:56
…earch-0080)

Runs the Research-0077 / ADR-0305 analysis script
(ai/scripts/analyze_knob_sweep.py, ships in PR #400) over the
12,636-cell Phase A sweep at
runs/phase_a/full_grid/comprehensive.jsonl and records the
populated Pareto-hull populations + recipe-regression count per
codec in Research-0080. ADR-0308 commits the fork to a
structural-vs-content-dependent threshold for revision policy.

Headline findings:
- 162 realised slices (every slice has a populated hull).
- 1,915 recipe-vs-bare regressions at default tolerances
  (bitrate_tol_pct=5, vmaf_tol=0.1).
- CQP regression rate 6.6 % vs CBR 20.2 % / VBR 18.7 %
  re-confirms Research-0063 with hard numbers.
- Top-15 aggregated bad-recipe cells all reproduce on all 9
  corpus sources, clustered around h264_nvenc + bf3 / spatial_aq
  / full_hq under CBR/VBR plus a smaller hevc_nvenc + spatial_aq
  cluster.
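
One plausible reading of "recipe-vs-bare regression at default tolerances" is sketched below. It is an assumption about the rule, not the logic of `ai/scripts/analyze_knob_sweep.py`: the `is_regression` helper and the `bitrate_kbps` / `vmaf` field names are hypothetical; only the two tolerance names and defaults come from this message.

```python
def is_regression(recipe, bare, bitrate_tol_pct=5.0, vmaf_tol=0.1):
    """Flag a recipe row that loses VMAF against the bare encoder at matched bitrate.

    "Matched bitrate" is taken here to mean the recipe's bitrate is within
    bitrate_tol_pct percent of the bare row's bitrate (assumed interpretation).
    A row is flagged only if its VMAF is lower by more than vmaf_tol.
    """
    delta_pct = abs(recipe["bitrate_kbps"] - bare["bitrate_kbps"]) / bare["bitrate_kbps"] * 100.0
    bitrate_matched = delta_pct <= bitrate_tol_pct
    vmaf_loss = bare["vmaf"] - recipe["vmaf"]
    return bitrate_matched and vmaf_loss > vmaf_tol


bare = {"bitrate_kbps": 3000.0, "vmaf": 95.0}
print(is_regression({"bitrate_kbps": 3100.0, "vmaf": 94.5}, bare))   # matched bitrate, clear VMAF loss
print(is_regression({"bitrate_kbps": 3100.0, "vmaf": 94.95}, bare))  # loss inside the VMAF tolerance
print(is_regression({"bitrate_kbps": 3500.0, "vmaf": 94.0}, bare))   # bitrate not matched
```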

Decision (ADR-0308): a recipe regression is structural iff it
reproduces on >=7 of 9 corpus sources within one
(codec, rc_mode, recipe, preset, q) cell. Structural regressions
are forbidden as tools/vmaf-tune/codec_adapters/* defaults and
forbidden as vmaf-tune recommend outputs without explicit
override; content-dependent regressions (1-6 sources) are filtered
at recommend-time only via the per-slice hull lookup. The detector
remains an offline gate (3-hour sweep too expensive for CI);
promotion to a CI gate is deferred until a smaller stratified
sample reproduces the structural patterns. Per-codec adapter
revisions land as separate follow-up PRs for clean bisect signals.
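
For readers unfamiliar with the per-slice hull, the sketch below computes a plain Pareto frontier over (bitrate, VMAF) points: a point survives only if no other point has lower-or-equal bitrate and higher-or-equal VMAF. Whether the analyser additionally applies a convexity step is not stated in this PR, so treat `pareto_frontier` as a hypothetical stand-in rather than the script's method.

```python
def pareto_frontier(points):
    """Return the non-dominated (bitrate_kbps, vmaf) points, sorted by bitrate.

    Lower bitrate is better, higher VMAF is better.
    """
    frontier = []
    best_vmaf = float("-inf")
    # Sort by bitrate ascending; at equal bitrate, visit the higher VMAF first.
    for bitrate, vmaf in sorted(points, key=lambda p: (p[0], -p[1])):
        if vmaf > best_vmaf:
            # Nothing cheaper (or equally cheap) scored at least this well.
            frontier.append((bitrate, vmaf))
            best_vmaf = vmaf
    return frontier


slice_points = [(1000, 80.0), (1500, 78.0), (2000, 90.0), (2000, 85.0), (3000, 95.0)]
print(pareto_frontier(slice_points))
```

A recommend-time filter in this spirit would drop any candidate cell that falls off its slice's frontier, which is how content-dependent regressions can be excluded without forbidding the recipe outright.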

Six deep-dive deliverables (CLAUDE §11 / ADR-0108):
- Research digest: docs/research/0080-encoder-knob-sweep-findings.md.
- Decision matrix: ADR-0308 §Alternatives considered (4 options).
- AGENTS.md invariant note: ai/AGENTS.md §Knob-sweep
  recipe-regression policy (cites ADR-0305 + ADR-0308).
- Reproducer: pytest ai/tests/test_knob_sweep_analysis.py -v
  (script logic, ships in PR #400) + offline analyser run command
  in docs/rebase-notes.md §0308.
- CHANGELOG fragment: changelog.d/changed/encoder-knob-sweep-findings.md.
- Rebase note: docs/rebase-notes.md §0308.

Constraints honoured:
- Did not modify ai/scripts/analyze_knob_sweep.py (uses public API
  unchanged via a throw-away wrapper for the field-name rename).
- Did not modify tools/vmaf-tune/codec_adapters/* (recipe revisions
  land in follow-up PRs).
- Did not commit runs/ artefacts (.gitignore covers them).
- Documentation-only; ~452 LOC against the 600-LOC budget.

This PR queues behind PR #400 (ADR-0305 / Research-0077 / analysis
script). Rebase target: master after PR #400 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris force-pushed the research/encoder-knob-sweep-findings branch from b039b9f to 37b1bc6 Compare May 6, 2026 01:56
Copilot AI left a comment

Pull request overview

Documentation-only PR that records the populated results of the Phase A encoder knob-sweep analysis (Pareto hull populations + recipe-vs-bare regressions) and introduces a fork policy (ADR-0308) for classifying “structural” vs “content-dependent” recipe regressions to guide future vmaf-tune adapter-default decisions.

Changes:

  • Adds Research-0080 with the populated knob-sweep findings and aggregated regression patterns.
  • Adds ADR-0308 defining a 7-of-9 threshold policy for structural recipe regressions and how regressions should gate future defaults/recommendations.
  • Updates fork process/docs surfaces (ADR index row, rebase notes, changelog fragment, ai/AGENTS.md) to reflect the new policy and findings.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| docs/research/0080-encoder-knob-sweep-findings.md | New research digest capturing populated sweep findings and regression clusters. |
| docs/adr/0308-encoder-knob-sweep-recipe-regression-policy.md | New ADR defining the structural-vs-content-dependent regression policy. |
| docs/adr/README.md | Adds an ADR index row for ADR-0308 (but ADR index is generated from fragments). |
| docs/rebase-notes.md | Adds rebase-sensitive invariant note for the 7-of-9 threshold policy. |
| changelog.d/changed/encoder-knob-sweep-findings.md | Changelog fragment announcing the new findings + policy. |
| ai/AGENTS.md | Adds an invariant/policy note for contributors extending the analyzer/consumers. |

Comment thread docs/adr/README.md
@@ -313,5 +313,6 @@ ADRs may exist there for local session continuity, but the tracked
| [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) | `vmaf-tune corpus --coarse-to-fine` and a new `vmaf-tune recommend` subcommand replace the 52-encode full-grid sweep with a 2-pass coarse-then-fine search. Defaults: `coarse_step=10` over `[10..50]` (5 points) + `fine_radius=5 step=1` around best-coarse (up to 10 points) = 15 visited encodes per (source, preset) → 3.46× wall-time speedup vs full grid. 1-pass shortcut when the highest-CRF coarse point already meets `--target-vmaf` skips refinement entirely (~10× speedup). Builds on [ADR-0237](0237-quality-aware-encode-automation.md) (Phase A harness); no JSONL schema bump (visited rows use existing `SCHEMA_VERSION=1`). Widens the libx264 adapter `quality_range` from the old `(15, 40)` informative window to the codec's nominal `(0, 51)` so the search domain matches the user's CLI. | Accepted | tooling, automation, vmaf-tune, ffmpeg, fork-local |
| [ADR-0307](0307-vmaf-tune-ladder-default-sampler.md) | `vmaf-tune` Phase E ladder default sampler is wired. `tools/vmaf-tune/src/vmaftune/ladder.py::_default_sampler` no longer raises `NotImplementedError`; it composes `corpus.iter_rows` (Phase A encode + score) with `recommend.pick_target_vmaf` (smallest-CRF-clearing-target predicate) over the canonical 5-point CRF sweep `DEFAULT_SAMPLER_CRF_SWEEP = (18, 23, 28, 33, 38)` at the codec adapter's mid-range preset (`"medium"` for libx264 / libx265 / libsvtav1). Builds on [ADR-0295](0295-vmaf-tune-phase-e-bitrate-ladder.md) (Phase E scaffold) and [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) (Phase B-equivalent recommend surface). The `SamplerFn` seam stays open — callers needing a finer grid or a non-CRF predicate pass an explicit `sampler=`. Companion research digest: [`docs/research/0079-vmaf-tune-ladder-default-sampler.md`](../research/0079-vmaf-tune-ladder-default-sampler.md). | Proposed | tooling, automation, vmaf-tune, ladder, fork-local |
| [ADR-0309](0309-fr-regressor-v2-ensemble-real-corpus-retrain.md) | `fr_regressor_v2` ensemble real-corpus retrain harness + flip workflow. Follow-up to ADR-0303 / PR #399 that ships the operational harness for actually running the 5-seed × 9-fold LOSO retrain against the locally available Netflix Public Dataset (`.workingdir2/netflix/`) and emitting a machine-checkable verdict file. Adds `ai/scripts/run_ensemble_v2_real_corpus_loso.sh` (Bash wrapper that validates the corpus, loops the seeds through the existing `train_fr_regressor_v2_ensemble_loso.py`, and tees timestamped per-seed logs), `ai/scripts/validate_ensemble_seeds.py` (Python validator that calls the ADR-0303 gate, snapshots the corpus YUV file list as sha256 over sorted `relpath\tsize`, and writes `PROMOTE.json` on gate-pass with a recommendation to flip the five `fr_regressor_v2_ensemble_v1_seed{0..4}` rows in `model/tiny/registry.json` from `smoke: true` to `smoke: false`, or `HOLD.json` on gate-fail with the failing-seed details and a recommendation to keep `smoke: true` and investigate diversity / hyperparameters), unit tests for both verdict paths, and a runbook (`docs/ai/ensemble-v2-real-corpus-retrain-runbook.md`) covering prerequisites, the two-command run, verdict interpretation, and rollback if the registry was flipped prematurely. The harness deliberately does **not** run the LOSO inside the PR (6–12 h GPU work) and does **not** flip the registry (separate follow-up PR gated on a passing `PROMOTE.json` — preserves a clean revert surface and honours the ai/AGENTS.md invariant that registry-flip never happens during a rebase). Companion research digest: [`docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md`](../research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md). | Proposed | ai, fr-regressor, ensemble, loso, runbook, fork-local |
| [ADR-0308](0308-encoder-knob-sweep-recipe-regression-policy.md) | Encoder knob-sweep recipe-regression revision policy: structural regressions (≥7 of 9 sources within a `(codec, rc_mode, recipe, preset, q)` cell) are forbidden as adapter-level defaults and `vmaf-tune recommend` outputs; content-dependent regressions filtered at recommend-time only. Detector stays offline (non-CI). Companion to [ADR-0305](0305-encoder-knob-space-pareto-analysis.md) + [Research-0080](../research/0080-encoder-knob-sweep-findings.md). | Proposed | ai, vmaf-tune, codec-adapters, knob-sweep, fork-local |
Comment on lines +5 to +6
- **Companion ADRs**: [ADR-0305](../adr/0305-encoder-knob-space-pareto-analysis.md) (methodology), [ADR-0308](../adr/0308-encoder-knob-sweep-recipe-regression-policy.md) (regression-revision policy)
- **Companion digests**: [Research-0063](0063-encoder-knob-space-cq-vs-vbr-stratification.md) (CQ vs VBR stratification), [Research-0077](0077-encoder-knob-space-pareto-frontiers.md) (analysis scaffold)
Comment on lines +1 to +6
# Research-0080: Encoder knob-sweep — populated Pareto hulls and recipe regressions

- **Status**: Findings ready
- **Date**: 2026-05-05
- **Companion ADRs**: [ADR-0305](../adr/0305-encoder-knob-space-pareto-analysis.md) (methodology), [ADR-0308](../adr/0308-encoder-knob-sweep-recipe-regression-policy.md) (regression-revision policy)
- **Companion digests**: [Research-0063](0063-encoder-knob-space-cq-vs-vbr-stratification.md) (CQ vs VBR stratification), [Research-0077](0077-encoder-knob-space-pareto-frontiers.md) (analysis scaffold)
Comment on lines +10 to +17
[ADR-0305](0305-encoder-knob-space-pareto-analysis.md) commits the
fork to per-slice Pareto stratification on the 12,636-cell knob sweep
and ships a regression detector
([`ai/scripts/analyze_knob_sweep.py`](../../ai/scripts/analyze_knob_sweep.py))
that flags recipes losing VMAF against the bare encoder default at
matched bitrate within a slice. The policy question ADR-0305 left
open is **what to do with the regressions once they are detected**:
the analyser produces 1,915 flagged rows on the populated sweep
Comment on lines +3 to +5
[Research-0077 / ADR-0305](docs/adr/0305-encoder-knob-space-pareto-analysis.md)
analysis script over the 12,636-cell Phase A sweep
(`runs/phase_a/full_grid/comprehensive.jsonl`) and records the
@lusoris lusoris merged commit 30cd808 into master May 6, 2026
55 checks passed
@lusoris lusoris deleted the research/encoder-knob-sweep-findings branch May 6, 2026 02:27
lusoris added a commit that referenced this pull request May 6, 2026
The 2026-05-06 merge train shipped 13 ADRs whose implementing PRs
landed but Status was never bumped from Proposed to Accepted. Per
docs/adr/README.md and ADR-0028, ADRs flip to Accepted once the
deliverable lands. The train moved faster than the per-ADR Status
edits could keep up; this PR catches up.

Flipped:
- ADR-0302 (#401, ENCODER_VOCAB v3 schema expansion)
- ADR-0303 (#399, fr_regressor_v2 ensemble prod-flip gate)
- ADR-0304 (#402, vmaf-tune fast-path Optuna TPE)
- ADR-0305 (#400, knob-sweep Pareto analysis scaffold)
- ADR-0307 (#404, vmaf-tune ladder default sampler)
- ADR-0308 (#406, knob-sweep recipe-regression policy)
- ADR-0309 (#405, ensemble retrain harness)
- ADR-0311 (#408, libfuzzer harness expansion)
- ADR-0313 (#410, CI Required Checks Aggregator) [table-format Status, sed-edited inline]
- ADR-0314 (#412, vmaf-tune --score-backend=vulkan)
- ADR-0316 (#414, cli_parse long-only-option assertion fix)
- ADR-0317 (#415, CI Docker + FFmpeg-SYCL flake fix)
- ADR-0319 (#422, ensemble LOSO trainer real impl)

Already-Accepted (no change): ADR-0310 (#407), ADR-0312 (#425),
ADR-0315 (skeleton, intentionally Proposed), ADR-0321 (#424).
@lusoris lusoris mentioned this pull request May 6, 2026
9 tasks