[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV#6231
Conversation
I have read the CLA Document and I hereby sign the CLA

PRAGMA Agent seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Pull request overview
This PR improves GEMV GPU dispatch for the M=1, N%WARP_SIZE==0, K%WARP_SIZE!=0 case by using a bounds-checked GEVM kernel variant instead of falling back to the naive matmul kernel, avoiding a major performance cliff for near-aligned K.
Changes:
- Added a check_k_bounds compile-time parameter to gevm_kernel and guarded K-iteration loads when enabled.
- Updated gemv_gpu dispatch to launch the bounds-checked GEVM kernel when M=1 and N is warp-aligned but K is not, instead of selecting the naive matmul path.
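To make the dispatch change concrete, here is an illustrative Python sketch of the selection logic this PR modifies. The function name and the string labels are hypothetical stand-ins, not the actual Mojo API; only the predicate structure (M=1, N warp-aligned, K aligned or not) mirrors the PR.

```python
# Hypothetical sketch of the updated GEMV dispatch predicate.
# Names are illustrative, not the real Mojo identifiers.
WARP_SIZE = 32

def select_gemv_kernel(m: int, n: int, k: int) -> str:
    """Pick a kernel for C[m, n] = A[m, k] @ B[k, n] on the GPU."""
    if m == 1 and n % WARP_SIZE == 0:
        if k % WARP_SIZE == 0:
            return "GEVM_KERNEL"             # fast path, no bounds checks
        return "GEVM_KERNEL_CHECK_K_BOUNDS"  # new in this PR: guarded K loads
    return "NAIVE_MATMUL"                    # previous fallback for unaligned K

# K=4095 now takes the bounds-checked GEVM path instead of the naive matmul.
print(select_gemv_kernel(1, 4096, 4095))
```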
    c.type,
    a.type,
    b.type,
In this new GEVM fast-path, c.type/a.type/b.type don’t appear to exist on TileTensor (it exposes dtype). This will fail to compile when specializing gevm_kernel[...]. Use the tensors’ dtypes (e.g., c.dtype, a.dtype, b.dtype) consistent with the rest of this file (see gemv_gpu_dispatch).
Suggested change:
-    c.type,
-    a.type,
-    b.type,
+    c.dtype,
+    a.dtype,
+    b.dtype,
    var c_tensor = from_ndbuffer_row_major(c)
    var a_tensor = from_ndbuffer_row_major(a)
    var b_tensor = from_ndbuffer_row_major(b)
    ctx.enqueue_function[kernel, kernel](
        c_tensor.to_device_buffer(ctx),
        a_tensor.to_device_buffer(ctx),
        b_tensor.to_device_buffer(ctx),
from_ndbuffer_row_major is used here but isn’t imported into this module (and layout.__init__ doesn’t re-export it). Also, from_ndbuffer_row_major is defined for NDBuffer, while gemv_gpu takes TileTensor arguments, so these calls are likely a type mismatch even if imported. Consider using c.to_layout_tensor() / a.to_layout_tensor() / b.to_layout_tensor() (as gemv_gpu_dispatch does) or passing c.to_device_buffer(ctx) directly since gevm_kernel takes raw pointers.
Suggested change:
-    var c_tensor = from_ndbuffer_row_major(c)
-    var a_tensor = from_ndbuffer_row_major(a)
-    var b_tensor = from_ndbuffer_row_major(b)
-    ctx.enqueue_function[kernel, kernel](
-        c_tensor.to_device_buffer(ctx),
-        a_tensor.to_device_buffer(ctx),
-        b_tensor.to_device_buffer(ctx),
+    ctx.enqueue_function[kernel, kernel](
+        c.to_device_buffer(ctx),
+        a.to_device_buffer(ctx),
+        b.to_device_buffer(ctx),
    elif m == 1 and n % WARP_SIZE == 0 and k % WARP_SIZE == 0:
        kernel_func = GEMVAlgorithm.GEVM_KERNEL
    elif m == 1 and n % WARP_SIZE == 0:
        # K is not aligned to WARP_SIZE; use GEVM_KERNEL with bounds checking.
        comptime WARPS_PER_BLOCK_LOCAL = 1024 // WARP_SIZE
        comptime kernel = gevm_kernel[
            c.type,
            a.type,
            b.type,
            tile_size = WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
            check_k_bounds=True,
            elementwise_lambda_fn=elementwise_lambda_fn,
            pdl_level=pdl_level,
        ]
        var c_tensor = from_ndbuffer_row_major(c)
        var a_tensor = from_ndbuffer_row_major(a)
        var b_tensor = from_ndbuffer_row_major(b)
        ctx.enqueue_function[kernel, kernel](
            c_tensor.to_device_buffer(ctx),
            a_tensor.to_device_buffer(ctx),
            b_tensor.to_device_buffer(ctx),
            m,
            n,
            k,
            grid_dim=ceildiv(n, WARPS_PER_BLOCK_LOCAL),
            block_dim=WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
            attributes=pdl_launch_attributes(pdl_level),
        )
        return
This branch duplicates the GEVM enqueue logic that already exists in gemv_gpu_dispatch (GEVM_KERNEL case). Duplicated launch setup (kernel specialization, buffer conversions, grid/block dims) increases the chance of future drift. A more maintainable approach is to route through gemv_gpu_dispatch by introducing a dedicated algorithm/flag for the bounds-checked GEVM variant, or by extending the existing GEVM dispatch to accept check_k_bounds.
Suggested change:
     elif m == 1 and n % WARP_SIZE == 0 and k % WARP_SIZE == 0:
         kernel_func = GEMVAlgorithm.GEVM_KERNEL
-    elif m == 1 and n % WARP_SIZE == 0:
-        # K is not aligned to WARP_SIZE; use GEVM_KERNEL with bounds checking.
-        comptime WARPS_PER_BLOCK_LOCAL = 1024 // WARP_SIZE
-        comptime kernel = gevm_kernel[
-            c.type,
-            a.type,
-            b.type,
-            tile_size = WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
-            check_k_bounds=True,
-            elementwise_lambda_fn=elementwise_lambda_fn,
-            pdl_level=pdl_level,
-        ]
-        var c_tensor = from_ndbuffer_row_major(c)
-        var a_tensor = from_ndbuffer_row_major(a)
-        var b_tensor = from_ndbuffer_row_major(b)
-        ctx.enqueue_function[kernel, kernel](
-            c_tensor.to_device_buffer(ctx),
-            a_tensor.to_device_buffer(ctx),
-            b_tensor.to_device_buffer(ctx),
-            m,
-            n,
-            k,
-            grid_dim=ceildiv(n, WARPS_PER_BLOCK_LOCAL),
-            block_dim=WARP_SIZE * WARPS_PER_BLOCK_LOCAL,
-            attributes=pdl_launch_attributes(pdl_level),
-        )
-        return
+    elif m == 1 and n % WARP_SIZE == 0:
+        # Use the GEVM kernel for GEVM-style (m == 1) GEMV when n is warp-aligned.
+        # The common GEMV GPU dispatch will handle kernel launch details.
+        kernel_func = GEMVAlgorithm.GEVM_KERNEL
BEGIN_PUBLIC
[Kernels] Use GEVM kernel for non-WARP_SIZE-aligned K in GEMV

When M=1, N is WARP_SIZE-aligned, but K is not, the GEMV dispatch previously fell back to the naive matmul kernel. Add a bounds-checked variant of the GEVM kernel (via comptime check_k_bounds parameter) to handle this case efficiently, avoiding the ~20x performance cliff between e.g. K=4096 and K=4095.
END_PUBLIC

Signed-off-by: PRAGMA Agent <pragma-agent@modular.com>
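As a reference for what the bounds-checked variant must compute, here is a hedged pure-Python sketch: the kernel walks K in WARP_SIZE-sized chunks, and with check_k_bounds enabled the last partial chunk's loads are clamped to K instead of reading past the end. This is a behavioral model only, not the Mojo kernel.

```python
# Pure-Python reference for a GEVM (m == 1) with a non-WARP_SIZE-aligned K:
# c[n] = sum_k a[k] * b[k][n], iterating K in WARP_SIZE chunks with a
# clamped upper bound on the final partial chunk (the role of check_k_bounds).
WARP_SIZE = 32

def gevm_reference(a, b):
    k, n = len(b), len(b[0])
    c = [0.0] * n
    for k0 in range(0, k, WARP_SIZE):
        k1 = min(k0 + WARP_SIZE, k)  # guard: don't load past K in the tail chunk
        for kk in range(k0, k1):
            for j in range(n):
                c[j] += a[kk] * b[kk][j]
    return c

# K=65 is deliberately not a multiple of WARP_SIZE (two full chunks + tail of 1).
K, N = 65, 4
a = [float(i % 7) for i in range(K)]
b = [[float((i + j) % 5) for j in range(N)] for i in range(K)]
print(gevm_reference(a, b))
```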
Force-pushed from e6d8e7c to a30cf97 (Compare)
Closing per policy: draft PRs must not be opened without prior approval. Will re-open after getting explicit approval.

Hi @prsabahrami, there's no issue/need for prior approval to open draft PRs. I chatted with the kernel team this morning and they'll check out the PRs you've opened over the past day. No need for an explicit approval. Appreciate your contributions, we'll get back to you soon! I'll reopen this in the meantime 🚀
@@ -0,0 +1,25 @@
# Nsight Compute Profiling Configuration
I don't think we accept profiling configs, any particular reason why this is needed?