Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS#28411
Conversation
|
Hi @hariharans29 |
There was a problem hiding this comment.
Pull request overview
This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.
Changes:
- Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
- Wires the new kernels into
MLAS_PLATFORMinitialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV. - Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/mlas/lib/snchwc.cpp | Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size. |
| onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp | New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling. |
| onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp | New RVV 3x3 s1 depthwise CHW kernel implementation for the multiplier-1 path. |
| onnxruntime/core/mlas/lib/platform.cpp | Registers RVV NCHWc conv/pool kernels and sets NCHWc block size for riscv64 when RVV is present. |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to MLAS_PLATFORM. |
| onnxruntime/core/mlas/lib/convolve.cpp | Enables the depthwise-direct algorithm path on RISCV64 and updates stride restrictions comment/logic. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasConvAlgorithmDepthwise in the enum for RISCV64 builds. |
| cmake/onnxruntime_mlas.cmake | Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Both comments point to the same concern: "What if In practice, however, this is not an issue because the minimum VLEN on current RISC-V CPUs is 128 bits. Therefore, this case is effectively covered (though I have implemented changes regardless). The calculation is as follows:
|
Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot. |
|
Copilot generated a lot more comments in this round - can you please take a look ? |
I've finished the modifications, mainly adding some asserts. Thanks a lot! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
onnxruntime/core/mlas/lib/convolve.cpp:1572
- On riscv64 builds that enable MLAS_USE_RVV, the depthwise fast-path (MlasConvAlgorithmDepthwise) can still be selected purely based on compile-time macros. In this configuration the depthwise implementation is RVV-only (and the scalar/generic depthwise source is removed in CMake), so selecting this algorithm when RVV is not available at runtime (or when ORT_MLAS_RISCV_FORCE_SCALAR disables RVV dispatch) will execute vector instructions and can fault (illegal instruction). Consider gating depthwise algorithm selection on runtime RVV availability (e.g., a platform dispatch pointer/flag) or keeping a non-RVV fallback implementation compiled in and dispatching accordingly.
#if defined(MLAS_TARGET_WASM_SCALAR) || defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)
// Scalar (WASM_SCALAR) / vectorized (ARM64/RISCV64) direct conv for depthwise convolution.
// Currently only support 3x3 kernel with padding <=1 and dilations = 1
// and on ARM64/RISCV64, it is further restricted to strides = 1.
// TODO: support more general depthwise convolution.
#if defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)
bool depthwise_conv_stride_support_check = Parameters->StrideShape[0] == 1 && Parameters->StrideShape[1] == 1;
#else
bool depthwise_conv_stride_support_check = true;
#endif
if (Dimensions == 2
&& Parameters->FilterCount == 1 && Parameters->InputChannels == 1
&& Parameters->KernelShape[0] == 3 && Parameters->KernelShape[1] == 3
&& Parameters->Padding[0] <= 1 && Parameters->Padding[1] <= 1
&& Parameters->Padding[2] <= 1 && Parameters->Padding[3] <= 1
&& depthwise_conv_stride_support_check
&& Parameters->DilationShape[0] == 1 && Parameters->DilationShape[1] == 1) {
|
Could you please re-run the CI? |
Description
New kernel files:
Motivation and Context
Following #28261, Optimize more MLAS kernels using RISC-V Vector (RVV) extensions.
Please Note:
On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group — but we only request 16. So we're using half the available vector width.
The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change.
Benchmark ((SpacemiT K3, VLEN=256, 8-core))
All tests pass with zero numerical error.