refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations) by lusoris · Pull Request #254 · lusoris/vmaf

lusoris · 2026-05-01T18:25:53Z

Summary

Introduces per-backend GPU kernel scaffolding templates as header-only inline helpers — libvmaf/src/cuda/kernel_template.h (296 LOC) and libvmaf/src/vulkan/kernel_template.h (410 LOC) — that absorb the lifecycle boilerplate every fork-added GPU feature kernel currently re-implements by hand (CUDA: private non-blocking stream + 2 events + device-accumulator + pinned-readback; Vulkan: descriptor-set layout + pipeline + descriptor pool + per-frame command-buffer + fence).
Templates only — no kernel migrations. Each future kernel migration ships in its own PR gated by the places=4 cross-backend-diff lane (per ADR-0214) plus the Netflix CPU golden gate. Three deferred-migration follow-up T-rows added to CHANGELOG.
Per-backend (not cross-backend) because CUDA's async-stream + event model and Vulkan's command-buffer + fence + descriptor-pool model share no concrete shape. Helper functions (not macros) for cuda-gdb / Nsight / RenderDoc step-debugging.

Test plan

CUDA full build: meson setup libvmaf/build-cuda libvmaf -Denable_cuda=true -Denable_nvcc=true -Denable_vulkan=disabled -Denable_sycl=false && ninja -C libvmaf/build-cuda — green.
CUDA tests: meson test -C libvmaf/build-cuda — 45/45 passing.
Vulkan full build: meson setup libvmaf/build-vulkan libvmaf -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false && ninja -C libvmaf/build-vulkan — green.
Vulkan tests: meson test -C libvmaf/build-vulkan — 41/41 passing.
Header smoke compile (templates instantiated in isolated TU) — both compile cleanly.
pre-commit run --files on every touched file — all checks pass (trailing whitespace, EOF, clang-format, copyright headers, conventional commit).
No Netflix golden assertions touched.
No existing kernel implementations touched.

Deep-dive deliverables (ADR-0108)

no research digest needed: refactor of established pattern; cites sister GPU-template scope-analysis report.
Decision matrix — ADR-0221 § Alternatives considered (per-backend vs cross-backend; templates-only vs templates+migrations; macros vs helpers).
AGENTS.md invariant note — kernel-template contract row in libvmaf/src/cuda/AGENTS.md and a new libvmaf/src/vulkan/AGENTS.md.
Reproducer / smoke-test command — see Test plan.
CHANGELOG fragment — Unreleased § Added.
Rebase note — docs/rebase-notes.md entry 0095.

🤖 Generated with Claude Code

Supersedes #229 (closed by accident in batch-rebase incident, 2026-05-01).

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds reusable, header-only scaffolding helpers for CUDA and Vulkan GPU feature kernels, plus accompanying ADR/docs/agent guidance, without migrating any existing kernels.

Changes:

Introduces new per-backend kernel template headers for CUDA and Vulkan to centralize common init/submit/teardown boilerplate.
Documents the templates and migration approach (ADR-0221, kernel-scaffolding docs, rebase notes).
Updates/introduces AGENTS guidance for CUDA/Vulkan backend invariants.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
libvmaf/src/vulkan/kernel_template.h	New Vulkan helper inlines for pipeline creation and per-frame submit/wait/free scaffolding.
libvmaf/src/vulkan/AGENTS.md	New Vulkan backend agent orientation + invariants referencing the new template.
libvmaf/src/cuda/kernel_template.h	New CUDA helper inlines for stream/event lifecycle + readback allocation/wait/teardown.
libvmaf/src/cuda/AGENTS.md	Adds template invariant row + directory listing update.
docs/rebase-notes.md	Adds rebase note entry describing template additions and invariants.
docs/backends/kernel-scaffolding.md	New user/developer doc describing template surfaces and migration sketches.
docs/adr/README.md	Registers ADR-0221 in the ADR index table.
docs/adr/0221-gpu-kernel-template.md	New ADR documenting the decision and alternatives.
CHANGELOG.md	Adds an “Unreleased” entry describing the new templates and planned follow-ups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-05-01T18:34:09Z

+static inline int vmaf_vulkan_kernel_submit_end_and_wait(VmafVulkanContext *ctx,
+                                                         VmafVulkanKernelSubmit *sub)
+{
+    if (vkEndCommandBuffer(sub->cmd) != VK_SUCCESS) {
+        return -EIO;
+    }
+    VkSubmitInfo si = {
+        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
+        .commandBufferCount = 1,
+        .pCommandBuffers = &sub->cmd,
+    };
+    if (vkQueueSubmit(ctx->queue, 1, &si, sub->fence) != VK_SUCCESS) {
+        return -EIO;
+    }
+    if (vkWaitForFences(ctx->device, 1, &sub->fence, VK_TRUE, UINT64_MAX) != VK_SUCCESS) {
+        return -EIO;
+    }
+    return 0;
+}


vmaf_vulkan_kernel_submit_end_and_wait dereferences sub and uses ctx without validating either (unlike *_submit_begin). This can crash on accidental misuse and makes the helper less robust; add ctx == NULL || sub == NULL early returns (e.g., -EINVAL) similar to the other helpers.

Copilot · 2026-05-01T18:34:10Z

+static inline void vmaf_vulkan_kernel_pipeline_destroy(VmafVulkanContext *ctx,
+                                                       VmafVulkanKernelPipeline *pl)
+{
+    if (ctx == NULL || ctx->device == VK_NULL_HANDLE) {
+        return;
+    }
+    vkDeviceWaitIdle(ctx->device);
+    if (pl->desc_pool != VK_NULL_HANDLE) {
+        vkDestroyDescriptorPool(ctx->device, pl->desc_pool, NULL);
+        pl->desc_pool = VK_NULL_HANDLE;
+    }


vmaf_vulkan_kernel_pipeline_destroy guards ctx but not pl; it dereferences pl unconditionally. Add a pl == NULL guard (or require non-NULL via API contract and assert/return) to avoid potential crashes in error/unwind paths.

Copilot · 2026-05-01T18:34:10Z

+    const VkPushConstantRange pcr = {
+        .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
+        .offset = 0,
+        .size = desc->push_constant_size,
+    };


Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot · 2026-05-01T18:34:10Z

+    VkShaderModuleCreateInfo smci = {
+        .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
+        .codeSize = desc->spv_size,
+        .pCode = desc->spv_bytes,
+    };


Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot · 2026-05-01T18:34:10Z

+    VkComputePipelineCreateInfo cpci = desc->pipeline_create_info;
+    cpci.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
+    cpci.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
+    cpci.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
+    cpci.stage.module = out->shader;
+    cpci.layout = out->pipeline_layout;


Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot · 2026-05-01T18:34:11Z

+    if (vkCreateShaderModule(ctx->device, &smci, NULL, &out->shader) != VK_SUCCESS) {
+        vkDestroyPipelineLayout(ctx->device, out->pipeline_layout, NULL);
+        vkDestroyDescriptorSetLayout(ctx->device, out->dsl, NULL);
+        out->pipeline_layout = VK_NULL_HANDLE;
+        out->dsl = VK_NULL_HANDLE;
+        return -ENOMEM;
+    }


These helpers currently return -ENOMEM for all Vulkan failures, but the Vulkan backend guidance in libvmaf/src/vulkan/AGENTS.md states kernels should return -EIO / -ENOMEM based on failure class. Consider mapping VK_ERROR_OUT_OF_HOST_MEMORY/VK_ERROR_OUT_OF_DEVICE_MEMORY to -ENOMEM and treating other VkResult failures as -EIO so callers don’t mis-handle non-OOM errors as allocation pressure.

Copilot · 2026-05-01T18:34:11Z

+    CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
+    ctx_pushed = 1;
+    CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
+    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
+    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
+    CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail_after_pop);
+    return 0;
+
+fail:
+    if (ctx_pushed) {
+        (void)cu_f->cuCtxPopCurrent(NULL);
+    }
+fail_after_pop:
+    /* Best-effort: any of the three handles that did create cleanly
+     * are leaked deliberately — the caller will hit the same failure
+     * on the next init() and the process is in an unrecoverable CUDA
+     * state anyway. */


The docstring says this function “rolls back any partial state — lc ends up zeroed”, but the implementation explicitly leaks any successfully-created stream/events on failure and does not reset lc fields. This is a real resource leak and contradicts the stated contract; a recoverable transient failure (e.g., resource exhaustion) could leak repeatedly. Prefer best-effort cleanup (destroy any created events/stream) and set lc->str/submit/finished back to NULL/0 before return.

Suggested change

CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);

ctx_pushed = 1;

CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);

CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);

CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);

CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail_after_pop);

return 0;

fail:

if (ctx_pushed) {

(void)cu_f->cuCtxPopCurrent(NULL);

}

fail_after_pop:

/* Best-effort: any of the three handles that did create cleanly

* are leaked deliberately — the caller will hit the same failure

* on the next init() and the process is in an unrecoverable CUDA

* state anyway. */

lc->str = NULL;

lc->submit = NULL;

lc->finished = NULL;

CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);

ctx_pushed = 1;

CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);

CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);

CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);

CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail);

ctx_pushed = 0;

return 0;

fail:

if (ctx_pushed) {

if (lc->finished) {

(void)cu_f->cuEventDestroy(lc->finished);

}

if (lc->submit) {

(void)cu_f->cuEventDestroy(lc->submit);

}

if (lc->str) {

(void)cu_f->cuStreamDestroy(lc->str);

}

(void)cu_f->cuCtxPopCurrent(NULL);

}

lc->str = NULL;

lc->submit = NULL;

lc->finished = NULL;

Copilot · 2026-05-01T18:34:12Z

+ * Returns 0 on success or -ENOMEM. On failure the caller must call
+ * vmaf_cuda_kernel_readback_free; the function leaves `rb` partially
+ * populated rather than rolling back so the unwind path stays
+ * uniform with the multi-readback case.
+ */
+static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
+                                                  VmafCudaState *cu_state, size_t bytes)
+{
+    rb->bytes = bytes;
+    int err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
+    if (err != 0) {
+        return err;
+    }
+    err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
+    if (err != 0) {
+        return err;
+    }


vmaf_cuda_kernel_readback_alloc allocates both device and pinned-host memory, but vmaf_cuda_kernel_readback_free does not free the pinned-host allocation (it only NULLs the pointer). This makes the alloc/free pair non-symmetric and easy to misuse, undermining the goal of centralizing boilerplate. Consider either (mandatory) having *_readback_free call the matching pinned-host free helper, or (optional) renaming/splitting APIs so the name clearly communicates it only frees the device side.

Suggested change

* Returns 0 on success or -ENOMEM. On failure the caller must call

* vmaf_cuda_kernel_readback_free; the function leaves `rb` partially

* populated rather than rolling back so the unwind path stays

* uniform with the multi-readback case.

*/

static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,

VmafCudaState *cu_state, size_t bytes)

{

rb->bytes = bytes;

int err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);

if (err != 0) {

return err;

}

err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);

if (err != 0) {

return err;

}

* Returns 0 on success or the negative errno from the underlying

* allocation helper. On failure the function rolls back any partial

* allocation so `rb` remains fully unallocated.

*/

static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,

VmafCudaState *cu_state, size_t bytes)

{

int err;

rb->device = NULL;

rb->host_pinned = NULL;

rb->bytes = 0;

err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);

if (err != 0) {

return err;

}

err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);

if (err != 0) {

vmaf_cuda_buffer_free(cu_state, &rb->device);

rb->device = NULL;

return err;

}

rb->bytes = bytes;

Copilot · 2026-05-01T18:34:12Z

+    /* host_pinned is owned by the CUDA host alloc table; the
+     * matching free path lives behind common.c's
+     * vmaf_cuda_buffer_host_free. The template doesn't claim to
+     * release it — callers that adopted the template still call
+     * the existing host-free helper directly. Documented in the
+     * migration guide. */
+    rb->host_pinned = NULL;


vmaf_cuda_kernel_readback_alloc allocates both device and pinned-host memory, but vmaf_cuda_kernel_readback_free does not free the pinned-host allocation (it only NULLs the pointer). This makes the alloc/free pair non-symmetric and easy to misuse, undermining the goal of centralizing boilerplate. Consider either (mandatory) having *_readback_free call the matching pinned-host free helper, or (optional) renaming/splitting APIs so the name clearly communicates it only frees the device side.

Suggested change

/* host_pinned is owned by the CUDA host alloc table; the

* matching free path lives behind common.c's

* vmaf_cuda_buffer_host_free. The template doesn't claim to

* release it — callers that adopted the template still call

* the existing host-free helper directly. Documented in the

* migration guide. */

rb->host_pinned = NULL;

if (rb->host_pinned != NULL) {

const int e = vmaf_cuda_buffer_host_free(cu_state, rb->host_pinned);

if (e != 0 && rc == 0) {

rc = e;

}

rb->host_pinned = NULL;

}

Copilot · 2026-05-01T18:34:13Z

+  unused in PR #NNN — each future kernel migration is its own
+  gated PR (`places=4` cross-backend-diff per ADR-0214). **On
+  rebase**: keep both the header and any kernel call-sites that
+  later adopt it; upstream has no equivalent. Reference
+  implementation that mirrors the template's shape lives in


PR #NNN is a placeholder and will become stale/confusing once merged. Replace it with the actual PR number before merge, or reword to avoid embedding an explicit PR number (e.g., “lands unused in the initial templates-only PR”).

Suggested change

unused in PR #NNN — each future kernel migration is its own

gated PR (`places=4` cross-backend-diff per ADR-0214). **On

rebase**: keep both the header and any kernel call-sites that

later adopt it; upstream has no equivalent. Reference

implementation that mirrors the template's shape lives in

unused in the initial templates-only change — each future kernel

migration is its own gated PR (`places=4`

cross-backend-diff per ADR-0214). **On rebase**: keep both the

header and any kernel call-sites that later adopt it; upstream

has no equivalent. Reference implementation that mirrors the

template's shape lives in

…Vulkan, no migrations) Header-only inline-helper templates absorb the lifecycle boilerplate every fork-added GPU feature kernel currently re-implements by hand. Templates land unused; each future kernel migration is a separate PR gated by the places=4 cross-backend-diff lane (ADR-0214) plus the Netflix CPU golden gate. - libvmaf/src/cuda/kernel_template.h (296 LOC) — VmafCudaKernelLifecycle + VmafCudaKernelReadback + 6 inlines for the private non-blocking stream + 2-event + device-accumulator + pinned-readback shape. - libvmaf/src/vulkan/kernel_template.h (410 LOC) — VmafVulkanKernelPipeline + VmafVulkanKernelSubmit + 5 inlines for the descriptor-set layout + pipeline + descriptor pool + per-frame command-buffer + fence shape. Per-backend (not cross-backend) because CUDA's async-stream + event model and Vulkan's command-buffer + fence + descriptor-pool model share no concrete shape. Helper functions (not macros) for cuda-gdb / Nsight / RenderDoc step-debugging. CUDA build: 45/45 tests pass. Vulkan build: 41/41 tests pass. See ADR-0221, docs/backends/kernel-scaffolding.md. Deferred follow-ups (CHANGELOG): T7-XX-followup-{a,b,c}. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 1, 2026 18:25

Copilot AI reviewed May 1, 2026

View reviewed changes

lusoris force-pushed the refactor/gpu-kernel-templates branch from ad0c29a to 2b4f839 Compare May 1, 2026 18:52

lusoris force-pushed the refactor/gpu-kernel-templates branch from 2b4f839 to 657d24e Compare May 2, 2026 02:29

lusoris merged commit 4aa75b1 into master May 2, 2026
54 of 57 checks passed

lusoris deleted the refactor/gpu-kernel-templates branch May 2, 2026 02:48

github-actions Bot mentioned this pull request May 2, 2026

chore: release master #1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations)#254

refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations)#254
lusoris merged 1 commit intomasterfrom
refactor/gpu-kernel-templates

lusoris commented May 1, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Copilot AI May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

lusoris commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Deep-dive deliverables (ADR-0108)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lusoris commented May 1, 2026 •

edited

Loading