Skip to content

refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations)#254

Merged
lusoris merged 1 commit intomasterfrom
refactor/gpu-kernel-templates
May 2, 2026
Merged

refactor(gpu): land per-backend kernel scaffolding templates (CUDA + Vulkan, no migrations)#254
lusoris merged 1 commit intomasterfrom
refactor/gpu-kernel-templates

Conversation

@lusoris
Copy link
Copy Markdown
Owner

@lusoris lusoris commented May 1, 2026

Summary

  • Introduces per-backend GPU kernel scaffolding templates as header-only inline helpers — libvmaf/src/cuda/kernel_template.h (296 LOC) and libvmaf/src/vulkan/kernel_template.h (410 LOC) — that absorb the lifecycle boilerplate every fork-added GPU feature kernel currently re-implements by hand (CUDA: private non-blocking stream + 2 events + device-accumulator + pinned-readback; Vulkan: descriptor-set layout + pipeline + descriptor pool + per-frame command-buffer + fence).
  • Templates only — no kernel migrations. Each future kernel migration ships in its own PR gated by the places=4 cross-backend-diff lane (per ADR-0214) plus the Netflix CPU golden gate. Three deferred-migration follow-up T-rows added to CHANGELOG.
  • Per-backend (not cross-backend) because CUDA's async-stream + event model and Vulkan's command-buffer + fence + descriptor-pool model share no concrete shape. Helper functions (not macros) for cuda-gdb / Nsight / RenderDoc step-debugging.

Test plan

  • CUDA full build: meson setup libvmaf/build-cuda libvmaf -Denable_cuda=true -Denable_nvcc=true -Denable_vulkan=disabled -Denable_sycl=false && ninja -C libvmaf/build-cuda — green.
  • CUDA tests: meson test -C libvmaf/build-cuda45/45 passing.
  • Vulkan full build: meson setup libvmaf/build-vulkan libvmaf -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false && ninja -C libvmaf/build-vulkan — green.
  • Vulkan tests: meson test -C libvmaf/build-vulkan41/41 passing.
  • Header smoke compile (templates instantiated in isolated TU) — both compile cleanly.
  • pre-commit run --files on every touched file — all checks pass (trailing whitespace, EOF, clang-format, copyright headers, conventional commit).
  • No Netflix golden assertions touched.
  • No existing kernel implementations touched.

Deep-dive deliverables (ADR-0108)

  • no research digest needed: refactor of established pattern; cites sister GPU-template scope-analysis report.
  • Decision matrix — ADR-0221 § Alternatives considered (per-backend vs cross-backend; templates-only vs templates+migrations; macros vs helpers).
  • AGENTS.md invariant note — kernel-template contract row in libvmaf/src/cuda/AGENTS.md and a new libvmaf/src/vulkan/AGENTS.md.
  • Reproducer / smoke-test command — see Test plan.
  • CHANGELOG fragment — Unreleased § Added.
  • Rebase notedocs/rebase-notes.md entry 0095.

🤖 Generated with Claude Code


Supersedes #229 (closed by accident in batch-rebase incident, 2026-05-01).

Copilot AI review requested due to automatic review settings May 1, 2026 18:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds reusable, header-only scaffolding helpers for CUDA and Vulkan GPU feature kernels, plus accompanying ADR/docs/agent guidance, without migrating any existing kernels.

Changes:

  • Introduces new per-backend kernel template headers for CUDA and Vulkan to centralize common init/submit/teardown boilerplate.
  • Documents the templates and migration approach (ADR-0221, kernel-scaffolding docs, rebase notes).
  • Updates/introduces AGENTS guidance for CUDA/Vulkan backend invariants.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
libvmaf/src/vulkan/kernel_template.h New Vulkan helper inlines for pipeline creation and per-frame submit/wait/free scaffolding.
libvmaf/src/vulkan/AGENTS.md New Vulkan backend agent orientation + invariants referencing the new template.
libvmaf/src/cuda/kernel_template.h New CUDA helper inlines for stream/event lifecycle + readback allocation/wait/teardown.
libvmaf/src/cuda/AGENTS.md Adds template invariant row + directory listing update.
docs/rebase-notes.md Adds rebase note entry describing template additions and invariants.
docs/backends/kernel-scaffolding.md New user/developer doc describing template surfaces and migration sketches.
docs/adr/README.md Registers ADR-0221 in the ADR index table.
docs/adr/0221-gpu-kernel-template.md New ADR documenting the decision and alternatives.
CHANGELOG.md Adds an “Unreleased” entry describing the new templates and planned follow-ups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +335 to +353
static inline int vmaf_vulkan_kernel_submit_end_and_wait(VmafVulkanContext *ctx,
VmafVulkanKernelSubmit *sub)
{
if (vkEndCommandBuffer(sub->cmd) != VK_SUCCESS) {
return -EIO;
}
VkSubmitInfo si = {
.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
.commandBufferCount = 1,
.pCommandBuffers = &sub->cmd,
};
if (vkQueueSubmit(ctx->queue, 1, &si, sub->fence) != VK_SUCCESS) {
return -EIO;
}
if (vkWaitForFences(ctx->device, 1, &sub->fence, VK_TRUE, UINT64_MAX) != VK_SUCCESS) {
return -EIO;
}
return 0;
}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmaf_vulkan_kernel_submit_end_and_wait dereferences sub and uses ctx without validating either (unlike *_submit_begin). This can crash on accidental misuse and makes the helper less robust; add ctx == NULL || sub == NULL early returns (e.g., -EINVAL) similar to the other helpers.

Copilot uses AI. Check for mistakes.
Comment on lines +377 to +387
static inline void vmaf_vulkan_kernel_pipeline_destroy(VmafVulkanContext *ctx,
VmafVulkanKernelPipeline *pl)
{
if (ctx == NULL || ctx->device == VK_NULL_HANDLE) {
return;
}
vkDeviceWaitIdle(ctx->device);
if (pl->desc_pool != VK_NULL_HANDLE) {
vkDestroyDescriptorPool(ctx->device, pl->desc_pool, NULL);
pl->desc_pool = VK_NULL_HANDLE;
}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmaf_vulkan_kernel_pipeline_destroy guards ctx but not pl; it dereferences pl unconditionally. Add a pl == NULL guard (or require non-NULL via API contract and assert/return) to avoid potential crashes in error/unwind paths.

Copilot uses AI. Check for mistakes.
Comment on lines +196 to +200
const VkPushConstantRange pcr = {
.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
.offset = 0,
.size = desc->push_constant_size,
};
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot uses AI. Check for mistakes.
Comment on lines +215 to +219
VkShaderModuleCreateInfo smci = {
.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
.codeSize = desc->spv_size,
.pCode = desc->spv_bytes,
};
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot uses AI. Check for mistakes.
Comment on lines +230 to +235
VkComputePipelineCreateInfo cpci = desc->pipeline_create_info;
cpci.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
cpci.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
cpci.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
cpci.stage.module = out->shader;
cpci.layout = out->pipeline_layout;
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several Vulkan spec constraints aren’t validated here and can lead to hard-to-diagnose vkCreate* failures: (1) desc->spv_bytes should be non-NULL and desc->spv_size should be non-zero and a multiple of 4; (2) push_constant_size should be a multiple of 4 when non-zero; (3) cpci.stage.pName must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.

Copilot uses AI. Check for mistakes.
Comment on lines +220 to +226
if (vkCreateShaderModule(ctx->device, &smci, NULL, &out->shader) != VK_SUCCESS) {
vkDestroyPipelineLayout(ctx->device, out->pipeline_layout, NULL);
vkDestroyDescriptorSetLayout(ctx->device, out->dsl, NULL);
out->pipeline_layout = VK_NULL_HANDLE;
out->dsl = VK_NULL_HANDLE;
return -ENOMEM;
}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helpers currently return -ENOMEM for all Vulkan failures, but the Vulkan backend guidance in libvmaf/src/vulkan/AGENTS.md states kernels should return -EIO / -ENOMEM based on failure class. Consider mapping VK_ERROR_OUT_OF_HOST_MEMORY/VK_ERROR_OUT_OF_DEVICE_MEMORY to -ENOMEM and treating other VkResult failures as -EIO so callers don’t mis-handle non-OOM errors as allocation pressure.

Copilot uses AI. Check for mistakes.
Comment on lines +132 to +148
CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
ctx_pushed = 1;
CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail_after_pop);
return 0;

fail:
if (ctx_pushed) {
(void)cu_f->cuCtxPopCurrent(NULL);
}
fail_after_pop:
/* Best-effort: any of the three handles that did create cleanly
* are leaked deliberately — the caller will hit the same failure
* on the next init() and the process is in an unrecoverable CUDA
* state anyway. */
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says this function “rolls back any partial state — lc ends up zeroed”, but the implementation explicitly leaks any successfully-created stream/events on failure and does not reset lc fields. This is a real resource leak and contradicts the stated contract; a recoverable transient failure (e.g., resource exhaustion) could leak repeatedly. Prefer best-effort cleanup (destroy any created events/stream) and set lc->str/submit/finished back to NULL/0 before return.

Suggested change
CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
ctx_pushed = 1;
CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail_after_pop);
return 0;
fail:
if (ctx_pushed) {
(void)cu_f->cuCtxPopCurrent(NULL);
}
fail_after_pop:
/* Best-effort: any of the three handles that did create cleanly
* are leaked deliberatelythe caller will hit the same failure
* on the next init() and the process is in an unrecoverable CUDA
* state anyway. */
lc->str = NULL;
lc->submit = NULL;
lc->finished = NULL;
CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
ctx_pushed = 1;
CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail);
ctx_pushed = 0;
return 0;
fail:
if (ctx_pushed) {
if (lc->finished) {
(void)cu_f->cuEventDestroy(lc->finished);
}
if (lc->submit) {
(void)cu_f->cuEventDestroy(lc->submit);
}
if (lc->str) {
(void)cu_f->cuStreamDestroy(lc->str);
}
(void)cu_f->cuCtxPopCurrent(NULL);
}
lc->str = NULL;
lc->submit = NULL;
lc->finished = NULL;

Copilot uses AI. Check for mistakes.
Comment on lines +155 to +171
* Returns 0 on success or -ENOMEM. On failure the caller must call
* vmaf_cuda_kernel_readback_free; the function leaves `rb` partially
* populated rather than rolling back so the unwind path stays
* uniform with the multi-readback case.
*/
static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
VmafCudaState *cu_state, size_t bytes)
{
rb->bytes = bytes;
int err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
if (err != 0) {
return err;
}
err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
if (err != 0) {
return err;
}
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmaf_cuda_kernel_readback_alloc allocates both device and pinned-host memory, but vmaf_cuda_kernel_readback_free does not free the pinned-host allocation (it only NULLs the pointer). This makes the alloc/free pair non-symmetric and easy to misuse, undermining the goal of centralizing boilerplate. Consider either (mandatory) having *_readback_free call the matching pinned-host free helper, or (optional) renaming/splitting APIs so the name clearly communicates it only frees the device side.

Suggested change
* Returns 0 on success or -ENOMEM. On failure the caller must call
* vmaf_cuda_kernel_readback_free; the function leaves `rb` partially
* populated rather than rolling back so the unwind path stays
* uniform with the multi-readback case.
*/
static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
VmafCudaState *cu_state, size_t bytes)
{
rb->bytes = bytes;
int err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
if (err != 0) {
return err;
}
err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
if (err != 0) {
return err;
}
* Returns 0 on success or the negative errno from the underlying
* allocation helper. On failure the function rolls back any partial
* allocation so `rb` remains fully unallocated.
*/
static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
VmafCudaState *cu_state, size_t bytes)
{
int err;
rb->device = NULL;
rb->host_pinned = NULL;
rb->bytes = 0;
err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
if (err != 0) {
return err;
}
err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
if (err != 0) {
vmaf_cuda_buffer_free(cu_state, &rb->device);
rb->device = NULL;
return err;
}
rb->bytes = bytes;

Copilot uses AI. Check for mistakes.
Comment on lines +281 to +287
/* host_pinned is owned by the CUDA host alloc table; the
* matching free path lives behind common.c's
* vmaf_cuda_buffer_host_free. The template doesn't claim to
* release it — callers that adopted the template still call
* the existing host-free helper directly. Documented in the
* migration guide. */
rb->host_pinned = NULL;
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmaf_cuda_kernel_readback_alloc allocates both device and pinned-host memory, but vmaf_cuda_kernel_readback_free does not free the pinned-host allocation (it only NULLs the pointer). This makes the alloc/free pair non-symmetric and easy to misuse, undermining the goal of centralizing boilerplate. Consider either (mandatory) having *_readback_free call the matching pinned-host free helper, or (optional) renaming/splitting APIs so the name clearly communicates it only frees the device side.

Suggested change
/* host_pinned is owned by the CUDA host alloc table; the
* matching free path lives behind common.c's
* vmaf_cuda_buffer_host_free. The template doesn't claim to
* release itcallers that adopted the template still call
* the existing host-free helper directly. Documented in the
* migration guide. */
rb->host_pinned = NULL;
if (rb->host_pinned != NULL) {
const int e = vmaf_cuda_buffer_host_free(cu_state, rb->host_pinned);
if (e != 0 && rc == 0) {
rc = e;
}
rb->host_pinned = NULL;
}

Copilot uses AI. Check for mistakes.
Comment on lines +115 to +119
unused in PR #NNN — each future kernel migration is its own
gated PR (`places=4` cross-backend-diff per ADR-0214). **On
rebase**: keep both the header and any kernel call-sites that
later adopt it; upstream has no equivalent. Reference
implementation that mirrors the template's shape lives in
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #NNN is a placeholder and will become stale/confusing once merged. Replace it with the actual PR number before merge, or reword to avoid embedding an explicit PR number (e.g., “lands unused in the initial templates-only PR”).

Suggested change
unused in PR #NNN — each future kernel migration is its own
gated PR (`places=4` cross-backend-diff per ADR-0214). **On
rebase**: keep both the header and any kernel call-sites that
later adopt it; upstream has no equivalent. Reference
implementation that mirrors the template's shape lives in
unused in the initial templates-only change — each future kernel
migration is its own gated PR (`places=4`
cross-backend-diff per ADR-0214). **On rebase**: keep both the
header and any kernel call-sites that later adopt it; upstream
has no equivalent. Reference implementation that mirrors the
template's shape lives in

Copilot uses AI. Check for mistakes.
@lusoris lusoris force-pushed the refactor/gpu-kernel-templates branch from ad0c29a to 2b4f839 Compare May 1, 2026 18:52
…Vulkan, no migrations)

Header-only inline-helper templates absorb the lifecycle boilerplate
every fork-added GPU feature kernel currently re-implements by hand.
Templates land unused; each future kernel migration is a separate
PR gated by the places=4 cross-backend-diff lane (ADR-0214) plus
the Netflix CPU golden gate.

- libvmaf/src/cuda/kernel_template.h (296 LOC) — VmafCudaKernelLifecycle
  + VmafCudaKernelReadback + 6 inlines for the private non-blocking
  stream + 2-event + device-accumulator + pinned-readback shape.
- libvmaf/src/vulkan/kernel_template.h (410 LOC) — VmafVulkanKernelPipeline
  + VmafVulkanKernelSubmit + 5 inlines for the descriptor-set layout
  + pipeline + descriptor pool + per-frame command-buffer + fence shape.

Per-backend (not cross-backend) because CUDA's async-stream + event
model and Vulkan's command-buffer + fence + descriptor-pool model
share no concrete shape. Helper functions (not macros) for cuda-gdb
/ Nsight / RenderDoc step-debugging.

CUDA build: 45/45 tests pass. Vulkan build: 41/41 tests pass.

See ADR-0221, docs/backends/kernel-scaffolding.md.
Deferred follow-ups (CHANGELOG): T7-XX-followup-{a,b,c}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lusoris lusoris force-pushed the refactor/gpu-kernel-templates branch from 2b4f839 to 657d24e Compare May 2, 2026 02:29
@lusoris lusoris merged commit 4aa75b1 into master May 2, 2026
54 of 57 checks passed
@lusoris lusoris deleted the refactor/gpu-kernel-templates branch May 2, 2026 02:48
@github-actions github-actions Bot mentioned this pull request May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants