lusoris added a commit that referenced this pull request on May 5, 2026
…(ADR-0313) (#410)

* ci(policy): Required Checks Aggregator — unblock doc/Python-only PRs (ADR-0313)

The 23-named-required-check posture (ADR-0037) deadlocks doc/Python-only PRs: the C-build matrix path-filter-skips on their diffs, but branch protection does not count a path-filter skip or a check that never ran as satisfying the required check. PR #400 hit this concretely (10/23 succeeded; 13/23 either skipped or never reported; `gh pr merge` returned "the base branch policy prohibits the merge").

The aggregator is one workflow with no path filter. It polls for up to 8 minutes for sibling workflows to register, then verifies that each named check on the head SHA reported success/skipped/neutral (or did not appear at all, which is the documented path-filter rejection semantics). The aggregator becomes the single branch-protection required check; the 23 individual workflows continue to run unchanged.

Manual operator step at adoption (after this PR merges):
`gh api -X PUT "repos/lusoris/vmaf/branches/master/protection/required_status_checks" -F 'strict=true' -F 'contexts=["Required Checks Aggregator"]'`

Unblocks #400, #403, #404, #405, #406, and #407, which are currently stuck on the deadlock. Per user popup direction 2026-05-05.

Files: .github/workflows/required-aggregator.yml (new), docs/adr/0313-*.md (new), changelog.d/added/*.md (new), docs/adr/README.md (+1 row), docs/adr/_index_fragments/_order.txt (+1 line + new fragment), docs/rebase-notes.md §0313.

* ci: retrigger after PR body cleanup
* ci: retrigger after deliverables opt-out polarity fix

---------

Co-authored-by: Lusoris <lusoris@pm.me>
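The pass/fail rule the aggregator applies can be sketched as a pure function. This is a minimal illustration, not the workflow's actual code: it assumes the sibling check conclusions for the head SHA have already been fetched into a dict, omits the 8-minute polling loop, and treats a check that registered but is still unfinished when polling ends as a failure (an assumption). The function name `aggregate` and the check names in the usage line are hypothetical.

```python
from typing import List, Mapping, Optional, Sequence, Tuple

# Conclusions the aggregator accepts for a named check (per the commit message).
PASSING = frozenset({"success", "skipped", "neutral"})


def aggregate(required: Sequence[str],
              conclusions: Mapping[str, Optional[str]]) -> Tuple[bool, List[str]]:
    """Decide the aggregator verdict for one head SHA (hypothetical helper).

    `conclusions` maps check name -> conclusion string, or None if the check
    registered but had not finished when the polling window closed.
    A name missing from `conclusions` never ran (path-filter rejection) and
    is treated as satisfied.
    """
    failures: List[str] = []
    for name in required:
        if name not in conclusions:
            continue  # never appeared: path-filtered out, counts as satisfied
        result = conclusions[name]
        if result not in PASSING:
            failures.append(f"{name}: {result or 'pending'}")
    return (not failures, failures)


# Hypothetical check names: the C build was path-filtered away, docs passed.
ok, why = aggregate(["build-linux", "docs-lint"], {"docs-lint": "success"})
print(ok, why)  # True []
```

Keeping the verdict separate from the polling makes the "never appeared counts as satisfied" rule easy to unit-test.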
Branch updated from a1a5923 to b039b9f.
…earch-0080)

Runs the Research-0077 / ADR-0305 analysis script (ai/scripts/analyze_knob_sweep.py, ships in PR #400) over the 12,636-cell Phase A sweep at runs/phase_a/full_grid/comprehensive.jsonl and records the populated Pareto-hull populations + recipe-regression count per codec in Research-0080. ADR-0308 commits the fork to a structural-vs-content-dependent threshold for revision policy.

Headline findings:
- 162 realised slices (every slice has a populated hull).
- 1,915 recipe-vs-bare regressions at default tolerances (bitrate_tol_pct=5, vmaf_tol=0.1).
- CQP regression rate 6.6 % vs CBR 20.2 % / VBR 18.7 % re-confirms Research-0063 with hard numbers.
- Top-15 aggregated bad-recipe cells all reproduce on all 9 corpus sources, clustered around h264_nvenc + bf3 / spatial_aq / full_hq under CBR/VBR plus a smaller hevc_nvenc + spatial_aq cluster.

Decision (ADR-0308): a recipe regression is structural iff it reproduces on ≥7 of 9 corpus sources within one (codec, rc_mode, recipe, preset, q) cell. Structural regressions are forbidden as tools/vmaf-tune/codec_adapters/* defaults and forbidden as vmaf-tune recommend outputs without explicit override; content-dependent regressions (1-6 sources) are filtered at recommend-time only via the per-slice hull lookup. The detector remains an offline gate (the 3-hour sweep is too expensive for CI); promotion to a CI gate is deferred until a smaller stratified sample reproduces the structural patterns. Per-codec adapter revisions land as separate follow-up PRs for clean bisect signals.

Six deep-dive deliverables (CLAUDE §11 / ADR-0108):
- Research digest: docs/research/0080-encoder-knob-sweep-findings.md.
- Decision matrix: ADR-0308 §Alternatives considered (4 options).
- AGENTS.md invariant note: ai/AGENTS.md §Knob-sweep recipe-regression policy (cites ADR-0305 + ADR-0308).
- Reproducer: pytest ai/tests/test_knob_sweep_analysis.py -v (script logic, ships in PR #400) + offline analyser run command in docs/rebase-notes.md §0308.
- CHANGELOG fragment: changelog.d/changed/encoder-knob-sweep-findings.md.
- Rebase note: docs/rebase-notes.md §0308.

Constraints honoured:
- Did not modify ai/scripts/analyze_knob_sweep.py (uses its public API unchanged via a throw-away wrapper for the field-name rename).
- Did not modify tools/vmaf-tune/codec_adapters/* (recipe revisions land in follow-up PRs).
- Did not commit runs/ artefacts (.gitignore covers them).
- Documentation-only; ~452 LOC against the 600-LOC budget.

This PR queues behind PR #400 (ADR-0305 / Research-0077 / analysis script). Rebase target: master after PR #400 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
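The two-stage logic above (flag recipe-vs-bare regressions per source, then apply the ADR-0308 7-of-9 cut per cell) can be sketched as follows. This is an illustration under stated assumptions, not the code in `ai/scripts/analyze_knob_sweep.py`: the `Row` schema, the `"bare"` recipe label, and the matching rule (same codec, rc_mode, preset, q and source; recipe bitrate within `bitrate_tol_pct` of bare; VMAF lower by more than `vmaf_tol`) are all hypothetical readings of the prose.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

BITRATE_TOL_PCT = 5.0        # default tolerance quoted in the commit message
VMAF_TOL = 0.1               # default tolerance quoted in the commit message
STRUCTURAL_MIN_SOURCES = 7   # ADR-0308 cut: >=7 of 9 corpus sources

Cell = Tuple[str, str, str, str, int]  # (codec, rc_mode, recipe, preset, q)


@dataclass(frozen=True)
class Row:  # hypothetical per-encode record
    codec: str
    rc_mode: str
    recipe: str          # hypothetical convention: "bare" = encoder default
    preset: str
    q: int
    source: str
    bitrate_kbps: float
    vmaf: float


def is_regression(recipe: Row, bare: Row) -> bool:
    """Recipe scores lower VMAF than bare at approximately matched bitrate."""
    matched = abs(recipe.bitrate_kbps - bare.bitrate_kbps) <= (
        bare.bitrate_kbps * BITRATE_TOL_PCT / 100.0)
    return matched and recipe.vmaf < bare.vmaf - VMAF_TOL


def classify(rows: List[Row]) -> Dict[Cell, str]:
    """Label every regressing cell 'structural' or 'content-dependent'."""
    bare = {(r.codec, r.rc_mode, r.preset, r.q, r.source): r
            for r in rows if r.recipe == "bare"}
    regressing: Dict[Cell, Set[str]] = defaultdict(set)
    for r in rows:
        if r.recipe == "bare":
            continue
        ref = bare.get((r.codec, r.rc_mode, r.preset, r.q, r.source))
        if ref is not None and is_regression(r, ref):
            regressing[(r.codec, r.rc_mode, r.recipe, r.preset, r.q)].add(r.source)
    return {cell: ("structural" if len(srcs) >= STRUCTURAL_MIN_SOURCES
                   else "content-dependent")
            for cell, srcs in regressing.items()}
```

Counting distinct sources (a set per cell) rather than flagged rows keeps the cut aligned with the "reproduces on N of 9 sources" wording.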
Branch updated from b039b9f to 37b1bc6.
Pull request overview
Documentation-only PR that records the populated results of the Phase A encoder knob-sweep analysis (Pareto hull populations + recipe-vs-bare regressions) and introduces a fork policy (ADR-0308) for classifying “structural” vs “content-dependent” recipe regressions to guide future vmaf-tune adapter-default decisions.
Changes:
- Adds Research-0080 with the populated knob-sweep findings and aggregated regression patterns.
- Adds ADR-0308 defining a 7-of-9 threshold policy for structural recipe regressions and how regressions should gate future defaults/recommendations.
- Updates fork process/docs surfaces (ADR index row, rebase notes, changelog fragment, ai/AGENTS.md) to reflect the new policy and findings.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/research/0080-encoder-knob-sweep-findings.md | New research digest capturing populated sweep findings and regression clusters. |
| docs/adr/0308-encoder-knob-sweep-recipe-regression-policy.md | New ADR defining the structural-vs-content-dependent regression policy. |
| docs/adr/README.md | Adds an ADR index row for ADR-0308 (but the ADR index is generated from fragments). |
| docs/rebase-notes.md | Adds rebase-sensitive invariant note for the 7-of-9 threshold policy. |
| changelog.d/changed/encoder-knob-sweep-findings.md | Changelog fragment announcing the new findings + policy. |
| ai/AGENTS.md | Adds an invariant/policy note for contributors extending the analyzer/consumers. |
@@ -313,5 +313,6 @@ ADRs may exist there for local session continuity, but the tracked
| [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) | `vmaf-tune corpus --coarse-to-fine` and a new `vmaf-tune recommend` subcommand replace the 52-encode full-grid sweep with a 2-pass coarse-then-fine search. Defaults: `coarse_step=10` over `[10..50]` (5 points) + `fine_radius=5 step=1` around best-coarse (up to 10 points) = 15 visited encodes per (source, preset) → 3.46× wall-time speedup vs full grid. 1-pass shortcut when the highest-CRF coarse point already meets `--target-vmaf` skips refinement entirely (~10× speedup). Builds on [ADR-0237](0237-quality-aware-encode-automation.md) (Phase A harness); no JSONL schema bump (visited rows use existing `SCHEMA_VERSION=1`). Widens the libx264 adapter `quality_range` from the old `(15, 40)` informative window to the codec's nominal `(0, 51)` so the search domain matches the user's CLI. | Accepted | tooling, automation, vmaf-tune, ffmpeg, fork-local |
| [ADR-0307](0307-vmaf-tune-ladder-default-sampler.md) | `vmaf-tune` Phase E ladder default sampler is wired. `tools/vmaf-tune/src/vmaftune/ladder.py::_default_sampler` no longer raises `NotImplementedError`; it composes `corpus.iter_rows` (Phase A encode + score) with `recommend.pick_target_vmaf` (smallest-CRF-clearing-target predicate) over the canonical 5-point CRF sweep `DEFAULT_SAMPLER_CRF_SWEEP = (18, 23, 28, 33, 38)` at the codec adapter's mid-range preset (`"medium"` for libx264 / libx265 / libsvtav1). Builds on [ADR-0295](0295-vmaf-tune-phase-e-bitrate-ladder.md) (Phase E scaffold) and [ADR-0306](0306-vmaf-tune-coarse-to-fine.md) (Phase B-equivalent recommend surface). The `SamplerFn` seam stays open — callers needing a finer grid or a non-CRF predicate pass an explicit `sampler=`. Companion research digest: [`docs/research/0079-vmaf-tune-ladder-default-sampler.md`](../research/0079-vmaf-tune-ladder-default-sampler.md). | Proposed | tooling, automation, vmaf-tune, ladder, fork-local |
| [ADR-0309](0309-fr-regressor-v2-ensemble-real-corpus-retrain.md) | `fr_regressor_v2` ensemble real-corpus retrain harness + flip workflow. Follow-up to ADR-0303 / PR #399 that ships the operational harness for actually running the 5-seed × 9-fold LOSO retrain against the locally available Netflix Public Dataset (`.workingdir2/netflix/`) and emitting a machine-checkable verdict file. Adds `ai/scripts/run_ensemble_v2_real_corpus_loso.sh` (Bash wrapper that validates the corpus, loops the seeds through the existing `train_fr_regressor_v2_ensemble_loso.py`, and tees timestamped per-seed logs), `ai/scripts/validate_ensemble_seeds.py` (Python validator that calls the ADR-0303 gate, snapshots the corpus YUV file list as sha256 over sorted `relpath\tsize`, and writes `PROMOTE.json` on gate-pass with a recommendation to flip the five `fr_regressor_v2_ensemble_v1_seed{0..4}` rows in `model/tiny/registry.json` from `smoke: true` to `smoke: false`, or `HOLD.json` on gate-fail with the failing-seed details and a recommendation to keep `smoke: true` and investigate diversity / hyperparameters), unit tests for both verdict paths, and a runbook (`docs/ai/ensemble-v2-real-corpus-retrain-runbook.md`) covering prerequisites, the two-command run, verdict interpretation, and rollback if the registry was flipped prematurely. The harness deliberately does **not** run the LOSO inside the PR (6–12 h GPU work) and does **not** flip the registry (separate follow-up PR gated on a passing `PROMOTE.json` — preserves a clean revert surface and honours the ai/AGENTS.md invariant that registry-flip never happens during a rebase). Companion research digest: [`docs/research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md`](../research/0081-fr-regressor-v2-ensemble-real-corpus-methodology.md). | Proposed | ai, fr-regressor, ensemble, loso, runbook, fork-local |
| [ADR-0308](0308-encoder-knob-sweep-recipe-regression-policy.md) | Encoder knob-sweep recipe-regression revision policy: structural regressions (≥7 of 9 sources within a `(codec, rc_mode, recipe, preset, q)` cell) are forbidden as adapter-level defaults and `vmaf-tune recommend` outputs; content-dependent regressions filtered at recommend-time only. Detector stays offline (non-CI). Companion to [ADR-0305](0305-encoder-knob-space-pareto-analysis.md) + [Research-0080](../research/0080-encoder-knob-sweep-findings.md). | Proposed | ai, vmaf-tune, codec-adapters, knob-sweep, fork-local |
Comment on lines +5 to +6
- **Companion ADRs**: [ADR-0305](../adr/0305-encoder-knob-space-pareto-analysis.md) (methodology), [ADR-0308](../adr/0308-encoder-knob-sweep-recipe-regression-policy.md) (regression-revision policy)
- **Companion digests**: [Research-0063](0063-encoder-knob-space-cq-vs-vbr-stratification.md) (CQ vs VBR stratification), [Research-0077](0077-encoder-knob-space-pareto-frontiers.md) (analysis scaffold)
Comment on lines +1 to +6
# Research-0080: Encoder knob-sweep — populated Pareto hulls and recipe regressions

- **Status**: Findings ready
- **Date**: 2026-05-05
- **Companion ADRs**: [ADR-0305](../adr/0305-encoder-knob-space-pareto-analysis.md) (methodology), [ADR-0308](../adr/0308-encoder-knob-sweep-recipe-regression-policy.md) (regression-revision policy)
- **Companion digests**: [Research-0063](0063-encoder-knob-space-cq-vs-vbr-stratification.md) (CQ vs VBR stratification), [Research-0077](0077-encoder-knob-space-pareto-frontiers.md) (analysis scaffold)
Comment on lines +10 to +17
[ADR-0305](0305-encoder-knob-space-pareto-analysis.md) commits the
fork to per-slice Pareto stratification on the 12,636-cell knob sweep
and ships a regression detector
([`ai/scripts/analyze_knob_sweep.py`](../../ai/scripts/analyze_knob_sweep.py))
that flags recipes losing VMAF against the bare encoder default at
matched bitrate within a slice. The policy question ADR-0305 left
open is **what to do with the regressions once they are detected**:
the analyser produces 1,915 flagged rows on the populated sweep
Comment on lines +3 to +5
[Research-0077 / ADR-0305](docs/adr/0305-encoder-knob-space-pareto-analysis.md)
analysis script over the 12,636-cell Phase A sweep
(`runs/phase_a/full_grid/comprehensive.jsonl`) and records the
lusoris added a commit that referenced this pull request on May 6, 2026
The 2026-05-06 merge train shipped 13 ADRs whose implementing PRs landed but whose Status was never bumped from Proposed to Accepted. Per docs/adr/README.md and ADR-0028, ADRs flip to Accepted once the deliverable lands. The train outpaced the per-ADR Status edits; this PR catches up.

Flipped:
- ADR-0302 (#401, ENCODER_VOCAB v3 schema expansion)
- ADR-0303 (#399, fr_regressor_v2 ensemble prod-flip gate)
- ADR-0304 (#402, vmaf-tune fast-path Optuna TPE)
- ADR-0305 (#400, knob-sweep Pareto analysis scaffold)
- ADR-0307 (#404, vmaf-tune ladder default sampler)
- ADR-0308 (#406, knob-sweep recipe-regression policy)
- ADR-0309 (#405, ensemble retrain harness)
- ADR-0311 (#408, libfuzzer harness expansion)
- ADR-0313 (#410, CI Required Checks Aggregator) [table-format Status, sed-edited inline]
- ADR-0314 (#412, vmaf-tune --score-backend=vulkan)
- ADR-0316 (#414, cli_parse long-only-option assertion fix)
- ADR-0317 (#415, CI Docker + FFmpeg-SYCL flake fix)
- ADR-0319 (#422, ensemble LOSO trainer real impl)

No change: ADR-0310 (#407), ADR-0312 (#425), and ADR-0321 (#424) are already Accepted; ADR-0315 (skeleton) stays intentionally Proposed.
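A catch-up flip like this one can be scripted instead of hand-edited. The sketch below is hypothetical: it assumes an ADR's Status appears either as a bullet line `- **Status**: Proposed` (the header style seen in the Research-0080 diff) or as a table row `| Status | Proposed |` (the commit notes ADR-0313 uses a table-format Status), and it rewrites only that line, leaving every other occurrence of the word untouched.

```python
import re
from typing import Tuple

# Hypothetical Status formats; adjust to the actual ADR template.
_BULLET = re.compile(r"^(- \*\*Status\*\*:\s*)Proposed\b", re.MULTILINE)
_TABLE = re.compile(
    r"^(\|\s*\*{0,2}Status\*{0,2}\s*\|\s*)Proposed(\s*\|)", re.MULTILINE)


def flip_status(text: str) -> Tuple[str, int]:
    """Flip Proposed -> Accepted on the Status line; return (new_text, n_changes)."""
    text, n_bullet = _BULLET.subn(r"\1Accepted", text)
    text, n_table = _TABLE.subn(r"\1Accepted\2", text)
    return text, n_bullet + n_table
```

Returning the change count lets a caller fail loudly when an ADR listed as "flipped" has no matching Status line.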
Summary
Runs the Research-0077 / ADR-0305 analysis script (ships in PR #400) over the 12,636-cell Phase A knob sweep at `runs/phase_a/full_grid/comprehensive.jsonl` and writes the populated findings into `docs/research/0080-encoder-knob-sweep-findings.md`. ADR-0308 commits the fork to a structural-vs-content-dependent threshold for revision policy.

Headline findings (one sentence)
CQP regresses roughly 3× less often than CBR/VBR (6.6 % vs 20.2 % / 18.7 %), and h264_nvenc dominates the structural regression cluster: the top-15 bad-recipe cells (h264_nvenc + bf3 / spatial_aq / full_hq under CBR/VBR, plus a smaller hevc_nvenc + spatial_aq cluster) all reproduce on all 9 corpus sources, re-confirming Research-0063 with hard numbers.
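The "3×" is the ratio of the reported regression rates; using only the three percentages above:

```latex
\frac{20.2\,\%}{6.6\,\%} \approx 3.06 \quad (\text{CBR}), \qquad
\frac{18.7\,\%}{6.6\,\%} \approx 2.83 \quad (\text{VBR})
```

So CBR regresses about 3× as often as CQP, and VBR slightly under 3×.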
[Per-codec table: av1_nvenc, av1_qsv, h264_nvenc, h264_qsv, hevc_nvenc, hevc_qsv]

Decision (ADR-0308)
A recipe regression is structural iff it reproduces on ≥7 of 9 corpus sources within one `(codec, rc_mode, recipe, preset, q)` cell. Structural regressions are forbidden as `tools/vmaf-tune/codec_adapters/*` defaults and forbidden as `vmaf-tune recommend` outputs without explicit override. Content-dependent regressions (1-6 sources) are filtered at recommend-time only via the per-slice hull lookup. The detector remains an offline gate.

Six deep-dive deliverables (CLAUDE §11 / ADR-0108)
- `docs/research/0080-encoder-knob-sweep-findings.md`
- `ai/AGENTS.md` §Knob-sweep recipe-regression policy (cites ADR-0305 invariant + ADR-0308 cut)
- `changelog.d/changed/encoder-knob-sweep-findings.md`
- `docs/rebase-notes.md` §0308

Constraints honoured
- Did not modify `ai/scripts/analyze_knob_sweep.py` (used its public API unchanged via a throw-away wrapper for the field-name rename `src` → `source`, etc.).
- Did not modify `tools/vmaf-tune/codec_adapters/*` (recipe revisions land in follow-up PRs).
- Did not commit `runs/` artefacts (`.gitignore` covers them).

Test plan
- `pytest ai/tests/test_knob_sweep_analysis.py -v` (analyser logic, lands in PR #400 "research(ai): encoder knob-space Pareto frontiers — analysis scaffold (ADR-0305 / Research-0077)"; verifies the script my findings depend on).
- `python tools/vmaf-tune/src/vmaftune/hw_encoder_corpus.py …` (~3 h on a single host, NVENC + QSV) → adapt fields → `python ai/scripts/analyze_knob_sweep.py --jsonl <adapted.jsonl> --out-dir runs/phase_a/full_grid/reports/` → diff `summary.md` against the headline table in Research-0080.
- `make format-check` (no Python touched in this PR; markdown only).
- … `ai/AGENTS.md`, `docs/adr/README.md`, `docs/rebase-notes.md`, the three new markdown files).

Known queueing
Queued behind PR #400. Rebase target: master after PR #400 merges. Findings cite ADR-0305 / Research-0077 / `ai/scripts/analyze_knob_sweep.py` as forward references; those land via PR #400.

🤖 Generated with Claude Code