feat(vulkan): T7-36 — cambi Vulkan integration (Strategy II)#196
Merged
8d6c6a0 to
14dc3d2
Compare
lusoris
pushed a commit
that referenced
this pull request
Apr 29, 2026
Resolves PR #196 Doc-Substance Gate (ADR-0167) failure. The cambi feature extractor gained a Vulkan backend in this PR (T7-36 / ADR-0210), making `feature_extractor.c` a touched "feature extractor" surface per ADR-0100/0167 — which requires a matching `docs/metrics/` edit. Adds a "## GPU support" section to docs/metrics/cambi.md with the integer-phase / host-residual split summary, the meson flag recipe, and pointers to ADR-0210 + Research-0032. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b8147ca to
48e6e3d
Compare
lusoris
pushed a commit
that referenced
this pull request
Apr 29, 2026
PR #196 (T7-36 cambi) keeps ADR-0210; this PR (T3-15(b) chroma psnr) bumps to 0216. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris
pushed a commit
that referenced
this pull request
Apr 29, 2026
Closes the GPU long-tail matrix terminus (per ADR-0192 + ADR-0205).
Replaces the spike scaffold's `init_stub`/`extract_stub`/`close_stub`
triple in `libvmaf/src/feature/vulkan/cambi_vulkan.c` with the full
Vulkan-aware lifecycle. After this PR every registered feature
extractor in the fork has at least one GPU twin (lpips remains via
ORT EPs per ADR-0022).
Strategy II hybrid (per ADR-0205 §Decision):
- GPU runs the integer phases — preprocess (forward-compatible
scaffold; v1 wires the CPU bilinear-resize for bit-exactness on
resolution mismatches), per-pixel derivative, the 7×7 spatial
mask SAT, 2× decimate, and the separable 3-tap mode filter.
- Host runs the precision-sensitive sliding-histogram
`calculate_c_values` + top-K spatial pooling + scale-weighted
final score on byte-identical readback buffers.
- Bit-exact w.r.t. CPU by construction (every GPU phase is
integer arithmetic; host residual runs the unmodified CPU code
on byte-identical buffers); cross-backend gate runs at
`places=4` from day one with no per-metric tolerance carve-out.
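The bit-exactness argument above rests on every GPU phase being integer-only. A minimal sketch of one such phase, a 7×7 window count built from a summed-area table (SAT) followed by a threshold compare, is given here in plain C. The function names, the clipped-window edge handling, and the "count greater than threshold" rule are assumptions for illustration and are not taken from `cambi.c` or the shaders.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive summed-area table: sat[y*w + x] holds the sum of src over
 * rows 0..y and columns 0..x. Only int32 additions are used, so a CPU
 * and a GPU evaluating the same recurrence produce the same bits. */
static void build_sat(const uint16_t *src, int32_t *sat, int w, int h)
{
    assert(src && sat && w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        int32_t row = 0; /* running sum of the current row */
        for (int x = 0; x < w; x++) {
            row += src[y * w + x];
            sat[y * w + x] = row + (y > 0 ? sat[(y - 1) * w + x] : 0);
        }
    }
}

/* Sum of src over the square window of radius r centred on (cx, cy).
 * The window is clipped to the image (an assumption of this sketch). */
static int32_t box_sum(const int32_t *sat, int w, int h, int cx, int cy, int r)
{
    int x0 = cx - r < 0 ? 0 : cx - r;
    int y0 = cy - r < 0 ? 0 : cy - r;
    int x1 = cx + r >= w ? w - 1 : cx + r;
    int y1 = cy + r >= h ? h - 1 : cy + r;
    int32_t a = sat[y1 * w + x1];
    int32_t b = x0 > 0 ? sat[y1 * w + (x0 - 1)] : 0;
    int32_t c = y0 > 0 ? sat[(y0 - 1) * w + x1] : 0;
    int32_t d = (x0 > 0 && y0 > 0) ? sat[(y0 - 1) * w + (x0 - 1)] : 0;
    return a - b - c + d;
}

/* Hypothetical mask rule: 1 when the 7x7 window (radius 3) contains
 * more than `threshold` flagged pixels, else 0. */
static uint16_t mask_at(const int32_t *sat, int w, int h, int x, int y,
                        int32_t threshold)
{
    return box_sum(sat, w, h, x, y, 3) > threshold ? 1 : 0;
}
```

Because nothing here touches floating point, there is no rounding mode or fused-multiply-add behaviour that could differ between backends, which is the property the `places=4` gate relies on.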
New shaders + 1 unified TU for the 3 SAT phases:
- `cambi_preprocess.comp` (new) — per-pixel decimate + bit-shift
+ optional anti-dither, exact-resolution fast path.
- `cambi_mask_dp.comp` (new) — single TU with `PASS=0/1/2` spec
const for row-SAT / col-SAT / threshold-compare.
- Existing `cambi_derivative.comp`, `cambi_filter_mode.comp`,
`cambi_decimate.comp` shaders wired into the dispatch chain
unchanged (renamed `min3` → `cambi_min3` / `mode3` →
`cambi_mode3` to avoid the GLSL precision-overload conflict).
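For readers unfamiliar with a 3-tap mode filter, the following C sketch shows what helpers named like `cambi_min3` / `cambi_mode3` could compute. It assumes that when all three taps differ the filter falls back to the minimum (suggested by the `min3` helper name, but not confirmed by this PR), and that edge pixels are copied through unchanged; both are assumptions of the sketch, not the shader's documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest of three values. */
static inline uint16_t cambi_min3(uint16_t a, uint16_t b, uint16_t c)
{
    uint16_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Most frequent of three values. When all three differ there is no
 * mode; this sketch ASSUMES the tie-break is the minimum. */
static inline uint16_t cambi_mode3(uint16_t a, uint16_t b, uint16_t c)
{
    if (a == b || a == c)
        return a;
    if (b == c)
        return b;
    return cambi_min3(a, b, c);
}

/* One horizontal pass of the filter over a row of w pixels. The first
 * and last pixel are copied through (an assumption). A second pass
 * over columns would complete a separable 3x3 filter. */
static void mode_filter_row(const uint16_t *src, uint16_t *dst, int w)
{
    assert(src && dst && w > 0);
    dst[0] = src[0];
    dst[w - 1] = src[w - 1];
    for (int x = 1; x + 1 < w; x++)
        dst[x] = cambi_mode3(src[x - 1], src[x], src[x + 1]);
}
```

The filter uses only equality and ordering comparisons on integers, so it fits the integer-only contract of the GPU phases.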
`cambi_internal.h` (new) exposes cambi.c's file-static helpers
(`vmaf_cambi_calculate_c_values`, `vmaf_cambi_get_spatial_mask`,
`vmaf_cambi_decimate`, `vmaf_cambi_filter_mode`,
`vmaf_cambi_spatial_pooling`, `vmaf_cambi_weight_scores_per_scale`,
`vmaf_cambi_get_pixels_in_window`, `vmaf_cambi_preprocessing`,
`vmaf_cambi_default_callbacks`) to the GPU twin via a thin
trampoline block at the bottom of `cambi.c` — no upstream-mirror
function-static code is renamed or moved, keeping Netflix sync
clean. Picked over the buffer-pair refactor ADR-0205 sketched
because the latter would ripple through CPU AVX2 / AVX-512 / NEON
callsites for ~200 LOC of churn (vs the trampoline's <70).
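The trampoline approach described above can be sketched in a few lines of C. The helper body, the `example_cambi_decimate` name, and its signature are hypothetical stand-ins; the real signatures live in `cambi_internal.h` and are not reproduced in this PR text.

```c
#include <assert.h>
#include <stdint.h>

/* What a header such as cambi_internal.h would declare (hypothetical
 * name and signature, for illustration only). */
void example_cambi_decimate(const uint16_t *src, uint16_t *dst,
                            int dst_w, int dst_h, int src_stride);

/* Stands in for upstream-mirror code: file-static, left untouched by
 * the fork so that upstream syncs apply cleanly. */
static void decimate(const uint16_t *src, uint16_t *dst,
                     int dst_w, int dst_h, int src_stride)
{
    for (int y = 0; y < dst_h; y++)
        for (int x = 0; x < dst_w; x++)
            dst[y * dst_w + x] = src[(2 * y) * src_stride + 2 * x];
}

/* Stands in for the fork-added trampoline block at the bottom of the
 * file: a thin exported wrapper that forwards to the static helper. */
void example_cambi_decimate(const uint16_t *src, uint16_t *dst,
                            int dst_w, int dst_h, int src_stride)
{
    assert(src && dst && dst_w > 0 && dst_h > 0);
    decimate(src, dst, dst_w, dst_h, src_stride);
}
```

If upstream later renames the static helper, only the one forwarding call inside the wrapper changes; the exported name and the callers in the GPU twin stay as they are.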
Wires:
- Registers 5 cambi shaders in `vulkan_shader_sources[]` and
`cambi_vulkan.c` in `vulkan_sources` in
`libvmaf/src/vulkan/meson.build`.
- Registers `vmaf_fex_cambi_vulkan` in
`feature_extractor_list[]` under `#if HAVE_VULKAN`.
- Adds a `cambi` row to `scripts/ci/cross_backend_vif_diff.py`'s
`FEATURE_METRICS` so the cross-backend gate at `places=4` runs
against the CPU baseline.
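As a rough illustration of what a `places=4` gate checks, here is a C sketch of the comparison. It assumes the script follows the Python `unittest` `assertAlmostEqual` convention, where two values agree when their difference rounds to zero at the given number of decimal places; that reading, and the helper names, are assumptions and are not taken from `cross_backend_vif_diff.py`.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED semantics of places=4: |a - b| rounds to 0 at four decimal
 * places, approximated here as |a - b| < 0.5e-4. */
static int agree_at_places4(double a, double b)
{
    double d = a - b;
    if (d < 0)
        d = -d;
    return d < 0.5e-4;
}

/* Index of the first frame whose CPU and GPU scores disagree, or -1
 * when every frame passes. */
static long first_mismatch(const double *cpu, const double *gpu, size_t n)
{
    assert(cpu && gpu);
    for (size_t i = 0; i < n; i++)
        if (!agree_at_places4(cpu[i], gpu[i]))
            return (long)i;
    return -1;
}
```

A bit-exact backend reports a difference of exactly zero on every frame, so it clears this tolerance with room to spare.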
Documentation (six deep-dive deliverables per ADR-0108):
- ADR-0210 (`docs/adr/0210-cambi-vulkan-integration.md`)
- Research-0031 (`docs/research/0031-cambi-vulkan-integration.md`)
- `docs/rebase-notes.md` entry 0090
- `docs/backends/vulkan/overview.md` extractor row
- `libvmaf/src/feature/AGENTS.md` rebase-sensitive invariant note
(lock-step CPU residual + cambi_internal.h signature contract)
- `CHANGELOG.md` Unreleased / lusoris fork entry
Smoke verified: 38/38 meson tests pass on the Vulkan-enabled build
including `test_cambi`, `test_vulkan_smoke`, `test_feature_extractor`.
Pre-commit (clang-format + ruff + ADR-0105 copyright header gate)
clean on every touched file. Closes backlog item T7-36.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h T7-9 T7-9 (#194, just merged) shipped Research-0031 (Intel AI-PC NPU applicability digest). This PR's cambi-vulkan-integration digest was independently numbered 0031 by the agent that drafted it. Renumber to 0032 to keep the one-number-per-digest invariant. References updated: filename, in-body title, ADR-0210 cross-link, ADR-0210 README index row, CHANGELOG.md, docs/rebase-notes.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
48e6e3d to
25021aa
Compare
lusoris
added a commit
that referenced
this pull request
Apr 29, 2026
* feat(hip): T7-10 — HIP (AMD) backend scaffold (audit-first)
Mirrors the Vulkan T5-1 scaffold (ADR-0175) for the HIP / AMD ROCm
backend. Lands the static surfaces (header, build wiring, kernel
stubs, smoke test, CI matrix row, docs) so the runtime + first-kernel
PRs that follow have a stable base to land on.
Public C-API surface (libvmaf/include/libvmaf/libvmaf_hip.h)
+ VmafHipState (opaque)
+ VmafHipConfiguration { device_index, flags }
+ vmaf_hip_state_init / _import_state / _state_free
+ vmaf_hip_list_devices / vmaf_hip_available
Header purity: HIP runtime types cross the ABI as uintptr_t.
Backend tree (libvmaf/src/hip/)
+ common.{c,h} — context_new/destroy/device_count + public C-API stubs.
+ picture_hip.{c,h} — alloc/free stubs (-ENOSYS).
+ dispatch_strategy.{c,h} — feature-name → kernel routing stub.
+ meson.build — paths relative to libvmaf/src/hip/, with `..`
walks to the feature stubs at libvmaf/src/feature/hip/.
Feature kernel stubs (libvmaf/src/feature/hip/)
+ adm_hip.c, vif_hip.c, motion_hip.c — _init / _run return -ENOSYS
pending real implementations.
+ feature_hip.h — forward declarations so the stubs aren't flagged
by clang-tidy's misc-use-internal-linkage checker.
Build wiring
+ new `enable_hip` boolean option in libvmaf/meson_options.txt
(default false — matches enable_cuda / enable_sycl convention;
Vulkan's `feature` form is intentionally not mirrored, see
ADR-0209 § "Decision").
+ conditional `subdir('hip')` in libvmaf/src/meson.build, with
`hip_sources` threaded into libvmaf_feature_static_lib and
`hip_deps` into the top-level library() dependencies list.
+ cdata.set10('HAVE_HIP', true) when enabled.
+ dependency('hip-lang', required: false) optional probe — no hard
SDK requirement for the scaffold.
Smoke test
+ libvmaf/test/test_hip_smoke.c — 9 sub-tests pinning the contract
(4 internal-context lifecycle + 5 public C-API entry-point
-ENOSYS / -EINVAL / NULL-safe assertions). Wired in
libvmaf/test/meson.build under `if get_option('enable_hip')`.
+ verified locally: meson setup -Denable_hip=true + ninja +
test_hip_smoke → 9/9 pass. Default no-HIP build still 37/37.
CI matrix
+ new "Build — Ubuntu HIP (T7-10 scaffold)" row in
.github/workflows/libvmaf-build-matrix.yml. Compiles with
-Denable_hip=true. No ROCm SDK install step needed.
Docs
+ new docs/backends/hip/overview.md — "scaffold only" warning,
build instructions, "what lands next" sequence.
+ ADR-0209 captures the audit-first decision + alternatives
(separate libvmaf_hip.so vs in-tree, AMD-only vs hipify
auto-translation, ROCm vs HIP runtime, boolean vs feature
option type).
+ Research-0032 covers AMD market share + ROCm 6.x Linux
maturity check.
+ ADR README index updated; docs/backends/index.md flipped from
"planned" to "scaffold"; docs/development/build-flags.md row
added; docs/rebase-notes.md entry 0074 added; libvmaf/AGENTS.md
rebase-sensitive invariant entry added.
+ CHANGELOG entry under [Unreleased] § Added.
Zero hard runtime dependencies — `dependency('hip-lang')` probe
stays optional. Adding the real linkage is the responsibility of
the runtime PR (T7-10b).
ADRs: 0209.
Research: 0032.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(docs): renumber ADR-0209 → ADR-0212 and Research-0032 → 0033
PRs #195/#199/#200/#201 all picked ADR-0209; #196/#200 both
picked Research-0032. #195 keeps ADR-0209 (opened first), #196
keeps Research-0032 (cambi). This PR (T7-10 HIP) bumps both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris
added a commit
that referenced
this pull request
Apr 29, 2026
* feat(ai): T6-7 — FastDVDnet temporal pre-filter (5-frame window)
Adds the libvmaf-side contract for the Wave 1 §3.3 FastDVDnet temporal
denoise pre-filter: a registered feature extractor `fastdvdnet_pre`
backed by an ONNX model with a 5-frame sliding window.
The extractor maintains an internal 5-slot ring buffer of normalised
luma planes, gathers `[t-2, t-1, t, t+1, t+2]` into a `[1, 5, H, W]`
input tensor, runs `vmaf_dnn_session_run`, and emits a per-frame
scalar `fastdvdnet_pre_l1_residual` (mean-abs difference between the
input centre and the denoised output) so the existing feature plumbing
has something to record. Replicate-edge clamp covers clip start/end.
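The window gathering and the residual described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes random access to frame indices; the extractor itself uses a ring buffer, whose handling of the look-ahead frames is not described here.

```c
#include <assert.h>

#define WINDOW 5 /* frames t-2 .. t+2 */

/* Replicate-edge clamp: indices before the first frame map to frame 0,
 * indices past the last frame map to frame n - 1. */
static int clamp_index(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Fill out[] with the source frame index for each of the 5 slots. */
static void window_indices(int t, int n_frames, int out[WINDOW])
{
    assert(n_frames > 0 && t >= 0 && t < n_frames);
    for (int k = 0; k < WINDOW; k++)
        out[k] = clamp_index(t + k - WINDOW / 2, n_frames);
}

/* Mean absolute difference between the centre input plane and the
 * denoised output plane, both holding n samples. */
static double l1_residual(const float *centre, const float *denoised, int n)
{
    assert(centre && denoised && n > 0);
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)centre[i] - (double)denoised[i];
        acc += d < 0 ? -d : d;
    }
    return acc / n;
}
```

For a 10-frame clip, frame 0 would gather indices 0, 0, 0, 1, 2 and frame 9 would gather 7, 8, 9, 9, 9.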
This PR ships a smoke-only placeholder ONNX (~6 KB,
randomly-initialised 3-layer CNN with the correct shape contract);
real upstream-derived FastDVDnet weights and the FFmpeg
`vmaf_pre_temporal` filter that consumes the denoised frame buffer
are tracked as T6-7b. The registry row carries `smoke: true` and the
sidecar JSON carries the same flag plus the input/output names so
downstream consumers can validate the contract without parsing the
graph.
Touched files:
- libvmaf/src/feature/fastdvdnet_pre.c (new) + meson.build wiring
- libvmaf/src/feature/feature_extractor.c — register vmaf_fex_fastdvdnet_pre
- libvmaf/test/test_fastdvdnet_pre.c (new) — registration + options
smoke; mirrors test_lpips.c
- model/tiny/fastdvdnet_pre.{onnx,json} (new) + registry.json row
- ai/scripts/export_fastdvdnet_pre_placeholder.py (new) — placeholder
weights regen helper
- docs/ai/models/fastdvdnet_pre.md (new) — user-facing model doc
- docs/ai/roadmap.md — Wave 1 §3.3 status row
- docs/adr/0210-fastdvdnet-pre-filter.md (new) + README index
- libvmaf/src/feature/AGENTS.md — 5-frame-window invariant note
- CHANGELOG.md — Unreleased § Added entry
Local gate: meson test -C build-cpu — 38/38 OK including the new
test_fastdvdnet_pre and the existing dnn registry test.
Closes backlog item T6-7.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(docs): renumber ADR-0210 → ADR-0215 (collision with #196 cambi)
PR #196 (T7-36 cambi) keeps ADR-0210; this PR (T6-7 FastDVDnet) bumps to 0215.
Sister #204 chroma psnr → 0216.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(metrics): add fastdvdnet_pre section (T6-7 / ADR-0215)
Adds the user-discoverable extractor surface entry for
``fastdvdnet_pre`` in docs/metrics/features.md per the
project-wide doc-substance rule (CLAUDE.md §12 r10 /
ADR-0100). Mirrors the LPIPS pattern: invocation,
output metric, range, input formats, options, backends,
and limitations — including the ADR-0215 placeholder
checkpoint caveat.
Resolves the PR #203 doc-substance gate failure: the
feature_extractor.c registration of ``fastdvdnet_pre``
now has its matching docs/metrics/ surface entry,
complementing the existing docs/ai/models/fastdvdnet_pre.md
deep-dive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris
added a commit
that referenced
this pull request
Apr 29, 2026
) * feat(vulkan): T3-15(b) — psnr chroma (psnr_cb / psnr_cr) on Vulkan
Extends `psnr_vulkan` from luma-only to full Y/Cb/Cr coverage by
running three back-to-back dispatches of the existing plane-agnostic
`psnr.comp` shader against per-plane buffers in a single command buffer.
Chroma sizing follows `pix_fmt` (4:2:0 → w/2 × h/2, 4:2:2 → w/2 × h,
4:4:4 → w × h); YUV400 clamps to luma-only.
`provided_features` becomes `{psnr_y, psnr_cb, psnr_cr}` so the
dispatcher routes chroma queries to Vulkan instead of the silent CPU
fall-through. `psnr_max[p]` follows CPU integer_psnr.c default
((6 * bpc) + 12).
Cross-backend gate (`scripts/ci/cross_backend_vif_diff.py --feature psnr`)
extended to assert all three plane scores at places=4; lavapipe
measurement on testdata/ref_576x324_48f.yuv vs
testdata/dis_576x324_48f.yuv reports max_abs_diff = 0.0 across 48
frames for psnr_y / psnr_cb / psnr_cr (deterministic int64 SSE on both
sides).
See ADR-0210 for the design + alternatives. Doc update under
docs/backends/vulkan/overview.md; AGENTS.md invariant note on the
chroma contract.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(docs): renumber ADR-0210 → ADR-0216 (collision with #196 cambi)
PR #196 (T7-36 cambi) keeps ADR-0210; this PR (T3-15(b) chroma psnr)
bumps to 0216.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
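The chroma sizing rule and the `psnr_max` default quoted in this commit message can be written out as a small C sketch. The enum and function names are hypothetical, and the plain integer division for subsampled planes follows the `w/2` / `h/2` wording of the message; how the real code rounds odd dimensions is not stated here.

```c
#include <assert.h>

/* Hypothetical pixel-format tag for this sketch only. */
enum example_pix_fmt { EX_FMT_400, EX_FMT_420, EX_FMT_422, EX_FMT_444 };

/* Writes the chroma plane size for a w x h luma plane and returns the
 * number of planes to score: 3 for Y/Cb/Cr, 1 when YUV400 clamps the
 * extractor to luma-only. */
static int chroma_dims(enum example_pix_fmt fmt, unsigned w, unsigned h,
                       unsigned *cw, unsigned *ch)
{
    assert(cw && ch);
    switch (fmt) {
    case EX_FMT_420: *cw = w / 2; *ch = h / 2; return 3;
    case EX_FMT_422: *cw = w / 2; *ch = h;     return 3;
    case EX_FMT_444: *cw = w;     *ch = h;     return 3;
    default:         *cw = 0;     *ch = 0;     return 1;
    }
}

/* Default PSNR cap as quoted in the message: (6 * bpc) + 12. */
static unsigned psnr_max_default(unsigned bpc)
{
    return 6 * bpc + 12;
}
```

For the 576×324 4:2:0 8-bit fixture named in the message, this gives 288×162 chroma planes and a cap of 60.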
lusoris
pushed a commit
that referenced
this pull request
May 1, 2026
…-1a Netflix Public dataset row)
Update docs/state.md `_Updated:` stamp to 2026-04-29 and rewrite the
"Tiny-AI C1 baseline `fr_regressor_v1.onnx`" deferral row's
reopen-trigger to TRIGGERED: the Netflix Public training corpus that
gated C1 is now locally available at `.workingdir2/netflix/` (9 ref +
70 dis YUVs, ~37 GB, gitignored; provided by lawrence 2026-04-27),
unblocking BACKLOG T6-1a.
Verified the rest of state.md against the 2026-04-29-session merged PR
set (#193–#205, #209). Every merged PR was feature / chore / docs /
perf with no bug-status delta to record per CLAUDE §12 rule 13:
- #193 chore(dnn) T7-12 env override removal: chore.
- #194 docs(research) T7-9 NPU digest: research.
- #195 feat(mcp) T5-2 embedded scaffold: feature.
- #196 feat(vulkan) T7-36 cambi integration: feature.
- #197 feat(motion) Netflix b949ceb port: upstream port.
- #198 chore(backlog) T7-32 micro-investigations: verify-only.
- #199 feat(ai) T6-9 model registry: feature.
- #200 feat(hip) T7-10 HIP scaffold: feature.
- #201 feat(simd) T7-38 SVE2 ports: feature.
- #202 feat(ci) T6-8 parity matrix: feature.
- #203 feat(ai) T6-7 FastDVDnet: feature.
- #205 docs(audit) T7-4 quarterly audit: explicitly notes "no state.md
changes (no upstream commit ruled in/out a fork bug)".
- #209 perf(sycl) T7-17 fp64-less device: perf.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lusoris
added a commit
that referenced
this pull request
May 1, 2026
…-1a Netflix Public dataset row) (#245)
Update docs/state.md `_Updated:` stamp to 2026-04-29 and rewrite the
"Tiny-AI C1 baseline `fr_regressor_v1.onnx`" deferral row's
reopen-trigger to TRIGGERED: the Netflix Public training corpus that
gated C1 is now locally available at `.workingdir2/netflix/` (9 ref +
70 dis YUVs, ~37 GB, gitignored; provided by lawrence 2026-04-27),
unblocking BACKLOG T6-1a.
Verified the rest of state.md against the 2026-04-29-session merged PR
set (#193–#205, #209). Every merged PR was feature / chore / docs /
perf with no bug-status delta to record per CLAUDE §12 rule 13:
- #193 chore(dnn) T7-12 env override removal: chore.
- #194 docs(research) T7-9 NPU digest: research.
- #195 feat(mcp) T5-2 embedded scaffold: feature.
- #196 feat(vulkan) T7-36 cambi integration: feature.
- #197 feat(motion) Netflix b949ceb port: upstream port.
- #198 chore(backlog) T7-32 micro-investigations: verify-only.
- #199 feat(ai) T6-9 model registry: feature.
- #200 feat(hip) T7-10 HIP scaffold: feature.
- #201 feat(simd) T7-38 SVE2 ports: feature.
- #202 feat(ci) T6-8 parity matrix: feature.
- #203 feat(ai) T6-7 FastDVDnet: feature.
- #205 docs(audit) T7-4 quarterly audit: explicitly notes "no state.md
changes (no upstream commit ruled in/out a fork bug)".
- #209 perf(sycl) T7-17 fp64-less device: perf.
Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Replaces the spike scaffold's `init_stub`/`extract_stub`/`close_stub` triple in `libvmaf/src/feature/vulkan/cambi_vulkan.c` with the full Vulkan-aware lifecycle, registers `vmaf_fex_cambi_vulkan` in `feature_extractor_list[]`, and wires the 5 cambi shaders into `vulkan_shader_sources[]`. Closes the GPU long-tail matrix terminus declared in ADR-0192: every registered feature extractor in the fork now has at least one GPU twin (lpips remains via ORT EPs per ADR-0022).
- GPU runs the integer phases (preprocess scaffold + per-pixel derivative + 7×7 spatial-mask SAT + 2× decimate + separable 3-tap mode filter); the precision-sensitive sliding-histogram `calculate_c_values` + top-K spatial pooling stay on the host.
- Bit-exact w.r.t. CPU by construction: every GPU phase is integer arithmetic (`uint16` derivative, `int32` SAT, `>` compare, stride-2 gather, 3-element `mode3` lookup); the host residual runs the unmodified CPU code on byte-identical buffers.
- `cambi_internal.h` (new) exposes cambi.c's file-static helpers to the GPU twin via a thin trampoline block at the bottom of `cambi.c`. Picked over the buffer-pair refactor ADR-0205 sketched because the latter would ripple through CPU AVX2 / AVX-512 / NEON callsites for ~200 LOC of churn.
Test plan
- `meson setup build-vulkan libvmaf -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false && ninja -C build-vulkan`: green (403/403 targets).
- `meson test -C build-vulkan`: 38/38 OK including `test_cambi`, `test_vulkan_smoke`, `test_feature_extractor`.
- `pre-commit run --files <touched>`: clang-format + ruff + ADR-0105 copyright + assertion-density all green.
- `python3 scripts/ci/cross_backend_vif_diff.py --backend vulkan --feature cambi --ref testdata/ref_576x324_48f.yuv --dist testdata/dis_576x324_48f.yuv --width 576 --height 324 --pixel-format 420 --bitdepth 8 --frames 48`: expects `places=4 PASS` with `max_abs_diff = 0.0` (validates ULP=0 prediction).
- … `vmaf_fex_cambi` extractor unchanged; only added a trampoline block at end of `cambi.c`).
Deep-dive deliverables (ADR-0108 r11 checklist)
- `docs/research/0031-cambi-vulkan-integration.md` (integration-time trade-offs: trampoline vs buffer-pair refactor, mask-DP single-TU vs three-TU split, per-stage vs per-frame command buffers, GPU preprocess wired-vs-scaffolded).
- `libvmaf/src/feature/AGENTS.md` rebase-sensitive invariants section: cambi.c trampoline-block invariant + `cambi_internal.h` signature lock-step contract + GPU/CPU residual lock-step.
- `CHANGELOG.md` Unreleased / lusoris fork → Added section.
- `docs/rebase-notes.md` entry 0090.
Notes for reviewers
- … byte-identical to what the CPU code path would have written in-place (only integer ops, no float rounding). The host residual then runs the exact same `calculate_c_values` and `spatial_pooling` on those buffers, so the emitted score is bit-identical to `vmaf_fex_cambi`. ULP=0 / max_abs_diff=0 against the CPU on the smoke fixture; `places=4` is a comfortable 5-decade margin.
- `cambi.c` upstream-mirror discipline: the trampoline block at the bottom of `cambi.c` is the only fork-added code inside that file. Everything above it is upstream-mirror byte-identical. Any future Netflix sync that renames a file-static helper (e.g. `decimate` → `cambi_decimate`) updates the trampoline body but not the header; see the AGENTS.md invariant + rebase-notes 0090.
- … `calculate_c_values` (Strategy III, ~9× CPU bandwidth, needs profile data; tracked as future v2 ADR); CUDA + SYCL twins (per ADR-0192 cadence); GPU heatmap dump; `high_res_speedup` GPU shortcut.