Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds reusable, header-only scaffolding helpers for CUDA and Vulkan GPU feature kernels, plus accompanying ADR/docs/agent guidance, without migrating any existing kernels.
Changes:
- Introduces new per-backend kernel template headers for CUDA and Vulkan to centralize common init/submit/teardown boilerplate.
- Documents the templates and migration approach (ADR-0221, kernel-scaffolding docs, rebase notes).
- Updates/introduces AGENTS guidance for CUDA/Vulkan backend invariants.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| libvmaf/src/vulkan/kernel_template.h | New Vulkan helper inlines for pipeline creation and per-frame submit/wait/free scaffolding. |
| libvmaf/src/vulkan/AGENTS.md | New Vulkan backend agent orientation + invariants referencing the new template. |
| libvmaf/src/cuda/kernel_template.h | New CUDA helper inlines for stream/event lifecycle + readback allocation/wait/teardown. |
| libvmaf/src/cuda/AGENTS.md | Adds template invariant row + directory listing update. |
| docs/rebase-notes.md | Adds rebase note entry describing template additions and invariants. |
| docs/backends/kernel-scaffolding.md | New user/developer doc describing template surfaces and migration sketches. |
| docs/adr/README.md | Registers ADR-0221 in the ADR index table. |
| docs/adr/0221-gpu-kernel-template.md | New ADR documenting the decision and alternatives. |
| CHANGELOG.md | Adds an “Unreleased” entry describing the new templates and planned follow-ups. |
```c
static inline int vmaf_vulkan_kernel_submit_end_and_wait(VmafVulkanContext *ctx,
                                                         VmafVulkanKernelSubmit *sub)
{
    if (vkEndCommandBuffer(sub->cmd) != VK_SUCCESS) {
        return -EIO;
    }
    VkSubmitInfo si = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &sub->cmd,
    };
    if (vkQueueSubmit(ctx->queue, 1, &si, sub->fence) != VK_SUCCESS) {
        return -EIO;
    }
    if (vkWaitForFences(ctx->device, 1, &sub->fence, VK_TRUE, UINT64_MAX) != VK_SUCCESS) {
        return -EIO;
    }
    return 0;
}
```
`vmaf_vulkan_kernel_submit_end_and_wait` dereferences `sub` and uses `ctx` without validating either (unlike `*_submit_begin`). This can crash on accidental misuse and makes the helper less robust; add `ctx == NULL || sub == NULL` early returns (e.g., returning -EINVAL), similar to the other helpers.
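The requested guard is a small addition. A minimal sketch of the pattern, using stand-in struct definitions for illustration (the real `VmafVulkanContext` and `VmafVulkanKernelSubmit` live in the template header):

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the real template types; fields are placeholders. */
typedef struct { int queue; } VmafVulkanContext;
typedef struct { int cmd; } VmafVulkanKernelSubmit;

/* Guard pattern the review asks for: validate both pointers up front
 * and return -EINVAL, mirroring the *_submit_begin helper. */
static int submit_end_and_wait_guarded(VmafVulkanContext *ctx,
                                       VmafVulkanKernelSubmit *sub)
{
    if (ctx == NULL || sub == NULL)
        return -EINVAL;
    /* ... end/submit/wait sequence would follow here ... */
    return 0;
}
```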
```c
static inline void vmaf_vulkan_kernel_pipeline_destroy(VmafVulkanContext *ctx,
                                                       VmafVulkanKernelPipeline *pl)
{
    if (ctx == NULL || ctx->device == VK_NULL_HANDLE) {
        return;
    }
    vkDeviceWaitIdle(ctx->device);
    if (pl->desc_pool != VK_NULL_HANDLE) {
        vkDestroyDescriptorPool(ctx->device, pl->desc_pool, NULL);
        pl->desc_pool = VK_NULL_HANDLE;
    }
```
`vmaf_vulkan_kernel_pipeline_destroy` guards `ctx` but not `pl`; it dereferences `pl` unconditionally. Add a `pl == NULL` guard (or require non-NULL via the API contract and assert/return) to avoid potential crashes in error/unwind paths.
```c
const VkPushConstantRange pcr = {
    .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
    .offset = 0,
    .size = desc->push_constant_size,
};
```
Several Vulkan spec constraints aren't validated here and can lead to hard-to-diagnose `vkCreate*` failures: (1) `desc->spv_bytes` should be non-NULL and `desc->spv_size` should be non-zero and a multiple of 4; (2) `push_constant_size` should be a multiple of 4 when non-zero; (3) `cpci.stage.pName` must be non-NULL (consider defaulting to "main" when unset, or explicitly validate and return -EINVAL). Adding these checks makes failures deterministic and returns a clear errno back to callers.
```c
VkShaderModuleCreateInfo smci = {
    .sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,
    .codeSize = desc->spv_size,
    .pCode = desc->spv_bytes,
};
```
```c
VkComputePipelineCreateInfo cpci = desc->pipeline_create_info;
cpci.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
cpci.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
cpci.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
cpci.stage.module = out->shader;
cpci.layout = out->pipeline_layout;
```
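The three checks can be done once, up front, before any `vkCreate*` call. A minimal sketch using a stand-in descriptor struct (field names mirror the review comment; the real struct lives in `kernel_template.h`, and `entry_point` is a hypothetical name for whatever feeds `cpci.stage.pName`):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the pipeline descriptor, for illustration only. */
typedef struct {
    const uint32_t *spv_bytes;
    size_t spv_size;
    size_t push_constant_size;
    const char *entry_point; /* would feed cpci.stage.pName */
} KernelPipelineDesc;

/* Up-front validation of the spec constraints the review lists:
 * SPIR-V blob present with a 4-byte-multiple size, push-constant size
 * a multiple of 4 when non-zero, and a non-NULL entry-point name
 * (defaulted to "main" when unset). */
static int pipeline_desc_validate(KernelPipelineDesc *desc)
{
    if (desc == NULL || desc->spv_bytes == NULL ||
        desc->spv_size == 0 || desc->spv_size % 4 != 0)
        return -EINVAL;
    if (desc->push_constant_size != 0 && desc->push_constant_size % 4 != 0)
        return -EINVAL;
    if (desc->entry_point == NULL)
        desc->entry_point = "main";
    return 0;
}
```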
```c
if (vkCreateShaderModule(ctx->device, &smci, NULL, &out->shader) != VK_SUCCESS) {
    vkDestroyPipelineLayout(ctx->device, out->pipeline_layout, NULL);
    vkDestroyDescriptorSetLayout(ctx->device, out->dsl, NULL);
    out->pipeline_layout = VK_NULL_HANDLE;
    out->dsl = VK_NULL_HANDLE;
    return -ENOMEM;
}
```
These helpers currently return -ENOMEM for all Vulkan failures, but the Vulkan backend guidance in libvmaf/src/vulkan/AGENTS.md states kernels should return -EIO / -ENOMEM based on failure class. Consider mapping `VK_ERROR_OUT_OF_HOST_MEMORY`/`VK_ERROR_OUT_OF_DEVICE_MEMORY` to -ENOMEM and treating other `VkResult` failures as -EIO so callers don't mis-handle non-OOM errors as allocation pressure.
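One way to centralize that mapping is a tiny helper every error path funnels through. A sketch with stand-in `VkResult` values (the real enum comes from `<vulkan/vulkan.h>`; the constants below use the spec's numeric values but are redeclared here only so the example is self-contained):

```c
#include <errno.h>

/* Stand-in VkResult values for illustration. */
typedef enum {
    VK_SUCCESS = 0,
    VK_ERROR_OUT_OF_HOST_MEMORY = -1,
    VK_ERROR_OUT_OF_DEVICE_MEMORY = -2,
    VK_ERROR_DEVICE_LOST = -4,
} VkResult;

/* Map a VkResult onto the errno classes the backend guidance expects:
 * OOM results become -ENOMEM, every other failure becomes -EIO. */
static int vk_result_to_errno(VkResult r)
{
    if (r == VK_SUCCESS)
        return 0;
    if (r == VK_ERROR_OUT_OF_HOST_MEMORY ||
        r == VK_ERROR_OUT_OF_DEVICE_MEMORY)
        return -ENOMEM;
    return -EIO;
}
```

Error paths would then read `return vk_result_to_errno(res);` instead of a hard-coded -ENOMEM.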
```c
    CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
    ctx_pushed = 1;
    CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
    CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail_after_pop);
    return 0;

fail:
    if (ctx_pushed) {
        (void)cu_f->cuCtxPopCurrent(NULL);
    }
fail_after_pop:
    /* Best-effort: any of the three handles that did create cleanly
     * are leaked deliberately — the caller will hit the same failure
     * on the next init() and the process is in an unrecoverable CUDA
     * state anyway. */
```
The docstring says this function “rolls back any partial state — lc ends up zeroed”, but the implementation explicitly leaks any successfully-created stream/events on failure and does not reset lc fields. This is a real resource leak and contradicts the stated contract; a recoverable transient failure (e.g., resource exhaustion) could leak repeatedly. Prefer best-effort cleanup (destroy any created events/stream) and set lc->str/submit/finished back to NULL/0 before return.
Suggested change:

```c
    CHECK_CUDA_GOTO(cu_f, cuCtxPushCurrent(cu_state->ctx), fail);
    ctx_pushed = 1;
    CHECK_CUDA_GOTO(cu_f, cuStreamCreateWithPriority(&lc->str, CU_STREAM_NON_BLOCKING, 0), fail);
    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->submit, CU_EVENT_DEFAULT), fail);
    CHECK_CUDA_GOTO(cu_f, cuEventCreate(&lc->finished, CU_EVENT_DEFAULT), fail);
    CHECK_CUDA_GOTO(cu_f, cuCtxPopCurrent(NULL), fail);
    ctx_pushed = 0;
    return 0;

fail:
    if (ctx_pushed) {
        if (lc->finished) {
            (void)cu_f->cuEventDestroy(lc->finished);
        }
        if (lc->submit) {
            (void)cu_f->cuEventDestroy(lc->submit);
        }
        if (lc->str) {
            (void)cu_f->cuStreamDestroy(lc->str);
        }
        (void)cu_f->cuCtxPopCurrent(NULL);
    }
    lc->str = NULL;
    lc->submit = NULL;
    lc->finished = NULL;
```
```c
 * Returns 0 on success or -ENOMEM. On failure the caller must call
 * vmaf_cuda_kernel_readback_free; the function leaves `rb` partially
 * populated rather than rolling back so the unwind path stays
 * uniform with the multi-readback case.
 */
static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
                                                  VmafCudaState *cu_state, size_t bytes)
{
    rb->bytes = bytes;
    int err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
    if (err != 0) {
        return err;
    }
    err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
    if (err != 0) {
        return err;
    }
```
vmaf_cuda_kernel_readback_alloc allocates both device and pinned-host memory, but vmaf_cuda_kernel_readback_free does not free the pinned-host allocation (it only NULLs the pointer). This makes the alloc/free pair non-symmetric and easy to misuse, undermining the goal of centralizing boilerplate. Consider either (mandatory) having *_readback_free call the matching pinned-host free helper, or (optional) renaming/splitting APIs so the name clearly communicates it only frees the device side.
Suggested change:

```c
 * Returns 0 on success or the negative errno from the underlying
 * allocation helper. On failure the function rolls back any partial
 * allocation so `rb` remains fully unallocated.
 */
static inline int vmaf_cuda_kernel_readback_alloc(VmafCudaKernelReadback *rb,
                                                  VmafCudaState *cu_state, size_t bytes)
{
    int err;
    rb->device = NULL;
    rb->host_pinned = NULL;
    rb->bytes = 0;
    err = vmaf_cuda_buffer_alloc(cu_state, &rb->device, bytes);
    if (err != 0) {
        return err;
    }
    err = vmaf_cuda_buffer_host_alloc(cu_state, &rb->host_pinned, bytes);
    if (err != 0) {
        vmaf_cuda_buffer_free(cu_state, &rb->device);
        rb->device = NULL;
        return err;
    }
    rb->bytes = bytes;
```
```c
    /* host_pinned is owned by the CUDA host alloc table; the
     * matching free path lives behind common.c's
     * vmaf_cuda_buffer_host_free. The template doesn't claim to
     * release it — callers that adopted the template still call
     * the existing host-free helper directly. Documented in the
     * migration guide. */
    rb->host_pinned = NULL;
```
Suggested change:

```c
    if (rb->host_pinned != NULL) {
        const int e = vmaf_cuda_buffer_host_free(cu_state, rb->host_pinned);
        if (e != 0 && rc == 0) {
            rc = e;
        }
        rb->host_pinned = NULL;
    }
```
```
unused in PR #NNN — each future kernel migration is its own
gated PR (`places=4` cross-backend-diff per ADR-0214). **On
rebase**: keep both the header and any kernel call-sites that
later adopt it; upstream has no equivalent. Reference
implementation that mirrors the template's shape lives in
```
PR #NNN is a placeholder and will become stale/confusing once merged. Replace it with the actual PR number before merge, or reword to avoid embedding an explicit PR number (e.g., “lands unused in the initial templates-only PR”).
Suggested change:

```
unused in the initial templates-only change — each future kernel
migration is its own gated PR (`places=4`
cross-backend-diff per ADR-0214). **On rebase**: keep both the
header and any kernel call-sites that later adopt it; upstream
has no equivalent. Reference implementation that mirrors the
template's shape lives in
```
Force-pushed from ad0c29a to 2b4f839
…Vulkan, no migrations)
Header-only inline-helper templates absorb the lifecycle boilerplate
every fork-added GPU feature kernel currently re-implements by hand.
Templates land unused; each future kernel migration is a separate
PR gated by the places=4 cross-backend-diff lane (ADR-0214) plus
the Netflix CPU golden gate.
- libvmaf/src/cuda/kernel_template.h (296 LOC) — VmafCudaKernelLifecycle
+ VmafCudaKernelReadback + 6 inlines for the private non-blocking
stream + 2-event + device-accumulator + pinned-readback shape.
- libvmaf/src/vulkan/kernel_template.h (410 LOC) — VmafVulkanKernelPipeline
+ VmafVulkanKernelSubmit + 5 inlines for the descriptor-set layout
+ pipeline + descriptor pool + per-frame command-buffer + fence shape.
Per-backend (not cross-backend) because CUDA's async-stream + event
model and Vulkan's command-buffer + fence + descriptor-pool model
share no concrete shape. Helper functions (not macros) for cuda-gdb
/ Nsight / RenderDoc step-debugging.
CUDA build: 45/45 tests pass. Vulkan build: 41/41 tests pass.
See ADR-0221, docs/backends/kernel-scaffolding.md.
Deferred follow-ups (CHANGELOG): T7-XX-followup-{a,b,c}.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 2b4f839 to 657d24e
Summary
- Adds two header-only kernel templates, `libvmaf/src/cuda/kernel_template.h` (296 LOC) and `libvmaf/src/vulkan/kernel_template.h` (410 LOC), that absorb the lifecycle boilerplate every fork-added GPU feature kernel currently re-implements by hand (CUDA: private non-blocking stream + 2 events + device-accumulator + pinned-readback; Vulkan: descriptor-set layout + pipeline + descriptor pool + per-frame command-buffer + fence).
- Templates land unused; each future kernel migration is a separate PR gated by the `places=4` cross-backend-diff lane (per ADR-0214) plus the Netflix CPU golden gate. Three deferred-migration follow-up T-rows added to CHANGELOG.

Test plan
- `meson setup libvmaf/build-cuda libvmaf -Denable_cuda=true -Denable_nvcc=true -Denable_vulkan=disabled -Denable_sycl=false && ninja -C libvmaf/build-cuda` — green.
- `meson test -C libvmaf/build-cuda` — 45/45 passing.
- `meson setup libvmaf/build-vulkan libvmaf -Denable_vulkan=enabled -Denable_cuda=false -Denable_sycl=false && ninja -C libvmaf/build-vulkan` — green.
- `meson test -C libvmaf/build-vulkan` — 41/41 passing.
- `pre-commit run --files` on every touched file — all checks pass (trailing whitespace, EOF, clang-format, copyright headers, conventional commit).

Deep-dive deliverables (ADR-0108)
- Updated `libvmaf/src/cuda/AGENTS.md` and a new `libvmaf/src/vulkan/AGENTS.md`.
- `docs/rebase-notes.md` entry 0095.

🤖 Generated with Claude Code
Supersedes #229 (closed by accident in batch-rebase incident, 2026-05-01).