
refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations)#229

Closed
lusoris wants to merge 0 commits into master from refactor/gpu-kernel-templates

Conversation


@lusoris lusoris commented Apr 29, 2026

Summary

  • Introduces per-backend GPU kernel scaffolding templates as header-only inline helpers — libvmaf/src/cuda/kernel_template.h (296 LOC) and libvmaf/src/vulkan/kernel_template.h (410 LOC) — that absorb the lifecycle boilerplate every fork-added GPU feature kernel currently re-implements by hand (CUDA: private non-blocking stream + 2 events + device-accumulator + pinned-readback; Vulkan: descriptor-set layout + pipeline + descriptor pool + per-frame command-buffer + fence).
  • Templates only — no kernel migrations. Each future kernel migration ships in its own PR gated by the places=4 cross-backend-diff lane (per ADR-0214) plus the Netflix CPU golden gate. Three deferred-migration follow-up T-rows added to CHANGELOG.
  • Per-backend (not cross-backend) because CUDA's async-stream + event model and Vulkan's command-buffer + fence + descriptor-pool model share no concrete shape. Helper functions (not macros) for cuda-gdb / Nsight / RenderDoc step-debugging.

Test plan

  • CUDA full build: meson setup libvmaf/build-cuda libvmaf -Denable_cuda=true -Denable_nvcc=true -Denable_vulkan=disabled -Denable_sycl=false && ninja -C libvmaf/build-cuda — green.
  • CUDA tests: meson test -C libvmaf/build-cuda — 45/45 passing.
  • Vulkan full build: meson setup libvmaf/build-vulkan libvmaf -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false && ninja -C libvmaf/build-vulkan — green.
  • Vulkan tests: meson test -C libvmaf/build-vulkan — 41/41 passing.
  • Header smoke compile (templates instantiated in isolated TU) — both compile cleanly.
  • pre-commit run --files on every touched file — all checks pass (trailing whitespace, EOF, clang-format, copyright headers, conventional commit).
  • No Netflix golden assertions touched.
  • No existing kernel implementations touched.

Deep-dive deliverables (ADR-0108)

  • Research digest — not needed: refactor of an established pattern; cites the sister GPU-template scope-analysis report.
  • Decision matrix — ADR-0221 § Alternatives considered (per-backend vs cross-backend; templates-only vs templates+migrations; macros vs helpers).
  • AGENTS.md invariant note — kernel-template contract row in libvmaf/src/cuda/AGENTS.md and a new libvmaf/src/vulkan/AGENTS.md.
  • Reproducer / smoke-test command — see Test plan.
  • CHANGELOG.md entry — Unreleased § Added.
  • Rebase note — docs/rebase-notes.md entry 0095.

🤖 Generated with Claude Code

lusoris pushed a commit that referenced this pull request Apr 30, 2026
…ghtning

Initial pin `pip<26.1` was insufficient: pip 26.0.1 has the same
"No matching distribution for lightning<3.0,>=2.5" regression as 26.1.
Verified on this PR's own run (job 73817357146): venv installed
pip-26.0.1 (under 26.1), then `pip install -e ai` failed identically.

The runner image ships pip 24.0 pre-installed (PR #229 baseline,
2026-04-29, succeeded). Drop the `--upgrade pip` line so the venv
inherits 24.0 and skips the broken 26.x release entirely. Re-introduce
the upgrade once a pip release lands that resolves lightning correctly.
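
As an illustration only (the workflow path and step name here are assumptions, not the repo's actual file), the resulting venv step would look roughly like:

```yaml
# Hypothetical excerpt of the Tiny AI workflow -- step and file names
# are illustrative, not copied from this repo.
- name: Set up Tiny AI venv
  run: |
    python -m venv .venv
    source .venv/bin/activate
    # Dropped: python -m pip install --upgrade pip
    # Keeping the runner image's pre-installed pip 24.0 means the venv
    # never picks up the 26.x releases that fail to resolve
    # lightning<3.0,>=2.5.
    pip install -e ai
```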

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris added a commit that referenced this pull request Apr 30, 2026
* fix(ci): pin pip <26.1 in Tiny AI workflow — lightning resolution regression

pip 26.1 (released 2026-04-30) regresses transitive resolution for
`lightning>=2.5,<3.0` and fails the `pip install -e ai` step in the
Tiny AI workflow with:

  ERROR: Could not find a version that satisfies the requirement
         lightning<3.0,>=2.5 (from vmaf-train) (from versions: none)

This blocks every PR's Tiny AI gate. Concrete cases observed:

- PR #213 same-SHA reruns: 12:41 UTC pass (pip 24.0) → 15:54 UTC fail
  (pip 26.1, after `pip install --upgrade pip` pulled the new release).
- PR #229 yesterday: passed with pip 24.0, lightning-2.6.1 resolved
  cleanly.

The PR diff itself doesn't touch `ai/`, so this is purely an upstream
pip regression breaking a working workflow.

Pin to `pip<26.1` in the Tiny AI venv until pip ships a fix. Other
workflows that don't depend on `lightning` resolution are left alone
to keep this change narrow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): drop pip upgrade in Tiny AI venv — pip 26.x can't resolve lightning


---------

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris added a commit that referenced this pull request Apr 30, 2026
* fix(ai): switch lightning → pytorch-lightning (PyPI 404)

Lightning AI un-published the `lightning` distribution from PyPI on
2026-04-30. Verified:

  $ curl -sSI https://pypi.org/pypi/lightning/json
  HTTP/2 404
  $ curl -sSI https://pypi.org/pypi/pytorch-lightning/json
  HTTP/2 200    # version 2.6.1 — same wheel that pip resolved yesterday

Yesterday's PR #229 successfully downloaded `lightning-2.6.1-py3-none-any.whl`;
today every PR's Tiny AI gate fails with "No matching distribution found
for lightning<3.0,>=2.5 (from versions: none)" — the resolver is correctly
reporting the 404. PR #231's earlier `pip<26.1` pin rested on a
misdiagnosis (a presumed pip resolver bug); the pin is harmless but
unnecessary now and will be dropped in a follow-up.

`pytorch-lightning >= 2.0` ships both `import lightning` and
`import pytorch_lightning` entry points, so `import lightning as L` in
ai/src/vmaf_train/{datamodule,train,models/*}.py and the train scripts
continues to work without source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ai): migrate `import lightning` → `import pytorch_lightning`

The previous commit on this branch swapped the PyPI distribution name
(`lightning>=2.5,<3.0` → `pytorch-lightning>=2.5,<3.0`) under the
incorrect assumption that pytorch-lightning ships both the
`pytorch_lightning` and `lightning` import namespaces. CI proved
otherwise: pytest collection errors on this branch with
`ModuleNotFoundError: No module named 'lightning'` from
ai/tests/test_export_roundtrip.py and ai/tests/test_registry.py
(import-side test discovery; these import from vmaf_train).

The pytorch-lightning distribution only exposes `pytorch_lightning`
as a top-level import. The `lightning` shim was exclusive to the
now-404 `lightning` distribution.

Migrated all six call sites:
  ai/src/vmaf_train/datamodule.py
  ai/src/vmaf_train/train.py            (also `lightning.pytorch.callbacks`
                                         → `pytorch_lightning.callbacks`)
  ai/src/vmaf_train/models/fr_regressor.py
  ai/src/vmaf_train/models/nr_metric.py
  ai/src/vmaf_train/models/learned_filter.py
  ai/scripts/train_konvid.py            (same callbacks rename)

API surface is unchanged — pytorch_lightning >= 2.0 has the same
LightningModule / LightningDataModule / Trainer / callbacks API as
the lightning distribution did. Updated the pyproject comment to
reflect the namespace reality.
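
The rename is mechanical enough that it could be scripted; a minimal
sketch (hypothetical helper, not a tool from this repo) of the rewrite
rules, applying the specific `lightning.pytorch.callbacks` rename first:

```python
import re

# Order matters: rewrite the specific submodule path before the bare
# module name, so `lightning.pytorch.callbacks` is not half-renamed.
_RULES = [
    (re.compile(r"\blightning\.pytorch\.callbacks\b"), "pytorch_lightning.callbacks"),
    (re.compile(r"\bimport lightning\b"), "import pytorch_lightning"),
    (re.compile(r"\bfrom lightning\b"), "from pytorch_lightning"),
]

def migrate(source: str) -> str:
    """Rewrite `import lightning` usages to `pytorch_lightning`."""
    for pattern, replacement in _RULES:
        source = pattern.sub(replacement, source)
    return source
```

For example, `migrate("import lightning as L")` yields
`"import pytorch_lightning as L"`, so aliased call sites need no other
source changes.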

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 1, 2026 18:13
@lusoris lusoris closed this May 1, 2026
@lusoris lusoris force-pushed the refactor/gpu-kernel-templates branch from ad0c29a to 3130ca4 on May 1, 2026 18:13

Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



lusoris pushed a commit that referenced this pull request May 3, 2026
Scaffolds the Phase E ladder generator (ADR-0277) — the highest-leverage
gap surfaced by PR #354's capability audit (Bucket #6). Mirrors the
Netflix per-title encoding paper: sample (resolution × target-VMAF),
take the Pareto upper-convex hull on (bitrate, vmaf), pick n rungs along
the hull, emit an HLS / DASH / JSON manifest.

Currently scaffold-only: the production sampler that drives Phase B's
target-VMAF bisect (PR #347) lands once that PR merges. Default sampler
raises NotImplementedError; tests inject a synthetic stub modelled on
the Netflix paper's R-D curves.

- New module tools/vmaf-tune/src/vmaftune/ladder.py — build_ladder,
  convex_hull (Pareto filter + diminishing-returns envelope),
  select_knees (log-bitrate or VMAF spacing), emit_manifest (HLS / DASH
  / JSON), and a build_and_emit convenience.
- New `vmaf-tune ladder` CLI subcommand with the canonical 5-rung
  1080p/720p/480p/360p/240p default rendition set.
- 15 new ladder tests (28 total in tools/vmaf-tune/tests/) covering hull
  correctness on a synthetic Netflix-paper-shaped cloud, knee selection
  invariants, and HLS / DASH / JSON manifest emit shape.
- ADR-0277 (Proposed; flips to Accepted once Phase B integration PR
  lands and a real-corpus PLCC validation digest reports the delta).
- Research-0054 surveys the algorithm space (Netflix per-title paper,
  Apple HLS authoring spec, JND-spaced, BO sampling).
- docs/usage/vmaf-tune.md gains a "Per-title ladder (Phase E)" section
  with the canonical invocation.
- CHANGELOG, rebase-notes (#229), AGENTS.md invariant note.
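
As a sketch of the hull-and-rung idea (an assumption-laden illustration;
the actual `convex_hull` / `select_knees` in vmaftune/ladder.py may
differ in signature and detail):

```python
import math

def _cross(o, a, b):
    # > 0: left turn, < 0: right turn, 0: collinear.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Upper convex hull of (bitrate, vmaf) samples, in bitrate order.
    Points on or below the envelope (dominated or diminishing-returns
    interior points) are dropped."""
    hull = []
    for p in sorted(set(points)):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def select_knees(hull, n):
    """Pick n rungs spaced evenly in log-bitrate along the hull,
    snapping each target to the nearest hull point."""
    if n >= len(hull):
        return list(hull)
    if n == 1:
        return [hull[-1]]
    lo, hi = math.log(hull[0][0]), math.log(hull[-1][0])
    rungs = []
    for i in range(n):
        target = lo + i * (hi - lo) / (n - 1)
        best = min(hull, key=lambda p: abs(math.log(p[0]) - target))
        if best not in rungs:
            rungs.append(best)
    return rungs
```

Endpoint rungs are always the cheapest and richest hull points; interior
rungs snap to sampled hull points rather than interpolating between them.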

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris pushed a commit that referenced this pull request May 5, 2026
lusoris pushed a commit that referenced this pull request May 5, 2026
lusoris added a commit that referenced this pull request May 5, 2026
…er) (#371)

* feat(tools): vmaf-tune Phase E — per-title bitrate ladder (game-changer)

* chore(docs): renumber phase-e ADR 0277→0295 + research 0066→0068 (collisions)

---------

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris pushed a commit that referenced this pull request May 7, 2026
lusoris added a commit that referenced this pull request May 7, 2026
…er) (#433)

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
