Skip to content

[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV#6231

Draft
prsabahrami wants to merge 1 commit intomodular:mainfrom
prsabahrami:speedtrain/pragma-use-gevm-kernel-for-non-warpsize-aligned-k-in-gemv
Draft

[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV#6231
prsabahrami wants to merge 1 commit intomodular:mainfrom
prsabahrami:speedtrain/pragma-use-gevm-kernel-for-non-warpsize-aligned-k-in-gemv

Conversation

@prsabahrami
Copy link
Copy Markdown
Contributor

[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV

BEGIN_PUBLIC
[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV

When M=1, N is WARP_SIZE-aligned, but K is not, the GEMV dispatch
previously fell back to the naive matmul kernel. Add a bounds-checked
variant of the GEVM kernel (via comptime check_k_bounds parameter)
to handle this case efficiently, avoiding the ~20x performance cliff
between e.g. K=4096 and K=4095.
END_PUBLIC

Signed-off-by: PRAGMA Agent pragma-agent@modular.com

@prsabahrami prsabahrami requested a review from a team as a code owner March 19, 2026 20:57
Copilot AI review requested due to automatic review settings March 19, 2026 20:57
@prsabahrami
Copy link
Copy Markdown
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@github-actions
Copy link
Copy Markdown


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


PRAGMA Agent seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves GEMV GPU dispatch for the M=1, N%WARP_SIZE==0, K%WARP_SIZE!=0 case by using a bounds-checked GEVM kernel variant instead of falling back to the naive matmul kernel, avoiding a major performance cliff for near-aligned K.

Changes:

  • Added a check_k_bounds compile-time parameter to gevm_kernel and guarded K-iteration loads when enabled.
  • Updated gemv_gpu dispatch to launch the bounds-checked GEVM kernel when M=1 and N is warp-aligned but K is not, instead of selecting the naive matmul path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1080 to +1082
c.type,
a.type,
b.type,
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this new GEVM fast-path, c.type/a.type/b.type don’t appear to exist on TileTensor (it exposes dtype). This will fail to compile when specializing gevm_kernel[...]. Use the tensors’ dtypes (e.g., c.dtype, a.dtype, b.dtype) consistent with the rest of this file (see gemv_gpu_dispatch).

Suggested change
c.type,
a.type,
b.type,
c.dtype,
a.dtype,
b.dtype,

Copilot uses AI. Check for mistakes.
Comment on lines +1088 to +1094
var c_tensor = from_ndbuffer_row_major(c)
var a_tensor = from_ndbuffer_row_major(a)
var b_tensor = from_ndbuffer_row_major(b)
ctx.enqueue_function[kernel, kernel](
c_tensor.to_device_buffer(ctx),
a_tensor.to_device_buffer(ctx),
b_tensor.to_device_buffer(ctx),
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_ndbuffer_row_major is used here but isn’t imported into this module (and layout.__init__ doesn’t re-export it). Also, from_ndbuffer_row_major is defined for NDBuffer, while gemv_gpu takes TileTensor arguments, so these calls are likely a type mismatch even if imported. Consider using c.to_layout_tensor() / a.to_layout_tensor() / b.to_layout_tensor() (as gemv_gpu_dispatch does) or passing c.to_device_buffer(ctx) directly since gevm_kernel takes raw pointers.

Suggested change
var c_tensor = from_ndbuffer_row_major(c)
var a_tensor = from_ndbuffer_row_major(a)
var b_tensor = from_ndbuffer_row_major(b)
ctx.enqueue_function[kernel, kernel](
c_tensor.to_device_buffer(ctx),
a_tensor.to_device_buffer(ctx),
b_tensor.to_device_buffer(ctx),
ctx.enqueue_function[kernel, kernel](
c.to_device_buffer(ctx),
a.to_device_buffer(ctx),
b.to_device_buffer(ctx),

Copilot uses AI. Check for mistakes.
Comment on lines 1073 to +1103
elif m == 1 and n % WARP_SIZE == 0 and k % WARP_SIZE == 0:
kernel_func = GEMVAlgorithm.GEVM_KERNEL

elif m == 1 and n % WARP_SIZE == 0:
# K is not aligned to WARP_SIZE; use GEVM_KERNEL with bounds checking.
comptime WARPS_PER_BLOCK_LOCAL = 1024 // WARP_SIZE
comptime kernel = gevm_kernel[
c.type,
a.type,
b.type,
tile_size = WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
check_k_bounds=True,
elementwise_lambda_fn=elementwise_lambda_fn,
pdl_level=pdl_level,
]
var c_tensor = from_ndbuffer_row_major(c)
var a_tensor = from_ndbuffer_row_major(a)
var b_tensor = from_ndbuffer_row_major(b)
ctx.enqueue_function[kernel, kernel](
c_tensor.to_device_buffer(ctx),
a_tensor.to_device_buffer(ctx),
b_tensor.to_device_buffer(ctx),
m,
n,
k,
grid_dim=ceildiv(n, WARPS_PER_BLOCK_LOCAL),
block_dim=WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
attributes=pdl_launch_attributes(pdl_level),
)
return

Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch duplicates the GEVM enqueue logic that already exists in gemv_gpu_dispatch (GEVM_KERNEL case). Duplicated launch setup (kernel specialization, buffer conversions, grid/block dims) increases the chance of future drift. A more maintainable approach is to route through gemv_gpu_dispatch by introducing a dedicated algorithm/flag for the bounds-checked GEVM variant, or by extending the existing GEVM dispatch to accept check_k_bounds.

Suggested change
elif m == 1 and n % WARP_SIZE == 0 and k % WARP_SIZE == 0:
kernel_func = GEMVAlgorithm.GEVM_KERNEL
elif m == 1 and n % WARP_SIZE == 0:
# K is not aligned to WARP_SIZE; use GEVM_KERNEL with bounds checking.
comptime WARPS_PER_BLOCK_LOCAL = 1024 // WARP_SIZE
comptime kernel = gevm_kernel[
c.type,
a.type,
b.type,
tile_size = WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
check_k_bounds=True,
elementwise_lambda_fn=elementwise_lambda_fn,
pdl_level=pdl_level,
]
var c_tensor = from_ndbuffer_row_major(c)
var a_tensor = from_ndbuffer_row_major(a)
var b_tensor = from_ndbuffer_row_major(b)
ctx.enqueue_function[kernel, kernel](
c_tensor.to_device_buffer(ctx),
a_tensor.to_device_buffer(ctx),
b_tensor.to_device_buffer(ctx),
m,
n,
k,
grid_dim=ceildiv(n, WARPS_PER_BLOCK_LOCAL),
block_dim=WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
attributes=pdl_launch_attributes(pdl_level),
)
return
elif m == 1 and n % WARP_SIZE == 0:
# Use the GEVM kernel for GEVM-style (m == 1) GEMV when n is warp-aligned.
# The common GEMV GPU dispatch will handle kernel launch details.
kernel_func = GEMVAlgorithm.GEVM_KERNEL

Copilot uses AI. Check for mistakes.
@prsabahrami prsabahrami marked this pull request as draft March 20, 2026 12:38
BEGIN_PUBLIC
[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV

When M=1, N is WARP_SIZE-aligned, but K is not, the GEMV dispatch
previously fell back to the naive matmul kernel. Add a bounds-checked
variant of the GEVM kernel (via comptime check_k_bounds parameter)
to handle this case efficiently, avoiding the ~20x performance cliff
between e.g. K=4096 and K=4095.
END_PUBLIC

Signed-off-by: PRAGMA Agent <pragma-agent@modular.com>
@prsabahrami prsabahrami force-pushed the speedtrain/pragma-use-gevm-kernel-for-non-warpsize-aligned-k-in-gemv branch from e6d8e7c to a30cf97 Compare March 20, 2026 12:52
@prsabahrami
Copy link
Copy Markdown
Contributor Author

Closing per policy: draft PRs must not be opened without prior approval. Will re-open after getting explicit approval.

@JoeLoser
Copy link
Copy Markdown
Collaborator

Closing per policy: draft PRs must not be opened without prior approval. Will re-open after getting explicit approval.

Hi @prsabahrami, there's no issue/need for prior approval to open draft PRs. I chatted with the kernel team this morning and they'll check out the PRs from you the past day. No need for an explicit approval.

Appreciate your contributions, we'll get back to you soon! I'll reopen this in the meantime 🚀

@JoeLoser JoeLoser reopened this Mar 20, 2026
@@ -0,0 +1,25 @@
# Nsight Compute Profiling Configuration
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we accept profiling configs, any particular reason why this is needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants