Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS by velonica0 · Pull Request #28411 · microsoft/onnxruntime

velonica0 · 2026-05-08T10:06:46Z

Description

New kernel files:

riscv64/sconv_depthwise_kernel_rvv.cpp — RVV-optimized 3x3 stride-1 depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4 generic vectorized version
riscv64/sconv_nchwc_kernel_rvv.cpp — 7 NCHWc kernels using vfloat32m4_t (LMUL=4, BlockSize=16):
- Direct NCHW conv (MlasConvNchwFloatKernelRvv)
- Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
- Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
- Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
- Max/AvgExcludePad/AvgIncludePad pooling

Motivation and Context

Following #28261, Optimize more MLAS kernels using RISC-V Vector (RVV) extensions.

Please Note:

On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group — but we only request 16. So we're using half the available vector width.
The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change.

Benchmark ((SpacemiT K3, VLEN=256, 8-core))

All tests pass with zero numerical error.

Kernel	Speedup (RVV vs scalar)
Direct NCHW Conv	1.27–1.29x
Direct NCHWc Conv	1.93–1.95x
Depthwise NCHWc Conv	10.8–12.5x
Pointwise NCHWc Conv	29.4–30.4x
Max Pooling	12.5–20.0x
Avg Pooling (exclude pad)	4.0–4.3x
Avg Pooling (include pad)	5.5–5.8x

velonica0 · 2026-05-08T10:11:32Z

Hi @hariharans29
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

Copilot

Pull request overview

This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.

Changes:

Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
Wires the new kernels into MLAS_PLATFORM initialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV.
Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
onnxruntime/core/mlas/lib/snchwc.cpp	Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size.
onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp	New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling.
onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp	New RVV 3x3 s1 depthwise CHW kernel implementation for the multiplier-1 path.
onnxruntime/core/mlas/lib/platform.cpp	Registers RVV NCHWc conv/pool kernels and sets NCHWc block size for riscv64 when RVV is present.
onnxruntime/core/mlas/lib/mlasi.h	Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to `MLAS_PLATFORM`.
onnxruntime/core/mlas/lib/convolve.cpp	Enables the depthwise-direct algorithm path on RISCV64 and updates stride restrictions comment/logic.
onnxruntime/core/mlas/inc/mlas.h	Exposes `MlasConvAlgorithmDepthwise` in the enum for RISCV64 builds.
cmake/onnxruntime_mlas.cmake	Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

velonica0 · 2026-05-11T01:51:50Z

Both comments point to the same concern: "What if $vl < 16$?"

In practice, however, this is not an issue because the minimum VLEN on current RISC-V CPUs is 128 bits. Therefore, this case is effectively covered (though I have implemented changes regardless).

The calculation is as follows:

__riscv_vsetvl_e32m4(avl) returns $\min(avl, VLMAX)$, where:

$$VLMAX = (VLEN / SEW) \times LMUL = (VLEN / 32) \times 4$$

hariharans29 · 2026-05-11T17:07:47Z

Both comments point to the same concern: "What if v l < 16 ?"

In practice, however, this is not an issue because the minimum VLEN on current RISC-V CPUs is 128 bits. Therefore, this case is effectively covered (though I have implemented changes regardless).

The calculation is as follows:

__riscv_vsetvl_e32m4(avl) returns min ( a v l , V L M A X ) , where:

V L M A X = ( V L E N / S E W ) × L M U L = ( V L E N / 32 ) × 4

Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

hariharans29 · 2026-05-11T17:27:55Z

Copilot generated a lot more comments in this round - can you please take a look ?

velonica0 · 2026-05-12T01:48:40Z

Copilot generated a lot more comments in this round - can you please take a look ?

I've finished the modifications, mainly adding some asserts. Thanks a lot!

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/convolve.cpp:1572

On riscv64 builds that enable MLAS_USE_RVV, the depthwise fast-path (MlasConvAlgorithmDepthwise) can still be selected purely based on compile-time macros. In this configuration the depthwise implementation is RVV-only (and the scalar/generic depthwise source is removed in CMake), so selecting this algorithm when RVV is not available at runtime (or when ORT_MLAS_RISCV_FORCE_SCALAR disables RVV dispatch) will execute vector instructions and can fault (illegal instruction). Consider gating depthwise algorithm selection on runtime RVV availability (e.g., a platform dispatch pointer/flag) or keeping a non-RVV fallback implementation compiled in and dispatching accordingly.

#if defined(MLAS_TARGET_WASM_SCALAR) || defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)

        // Scalar (WASM_SCALAR) / vectorized (ARM64/RISCV64) direct conv for depthwise convolution.
        // Currently only support 3x3 kernel with padding <=1 and dilations = 1
        // and on ARM64/RISCV64, it is further restricted to strides = 1.
        // TODO: support more general depthwise convolution.

    #if defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)
        bool depthwise_conv_stride_support_check = Parameters->StrideShape[0] == 1 && Parameters->StrideShape[1] == 1;
    #else
        bool depthwise_conv_stride_support_check = true;
    #endif

        if (Dimensions == 2
                && Parameters->FilterCount == 1 && Parameters->InputChannels == 1
                && Parameters->KernelShape[0] == 3 && Parameters->KernelShape[1] == 3
                && Parameters->Padding[0] <= 1 && Parameters->Padding[1] <= 1
                && Parameters->Padding[2] <= 1 && Parameters->Padding[3] <= 1
                && depthwise_conv_stride_support_check
                && Parameters->DilationShape[0] == 1 && Parameters->DilationShape[1] == 1) {

velonica0 · 2026-05-12T23:03:00Z

Could you please re-run the CI?
My code has no relation to IOS.
🙏🙏🙏

convolution and pooling kernels

99f3914

hariharans29 requested a review from Copilot May 10, 2026 21:53

Copilot started reviewing on behalf of hariharans29 May 10, 2026 21:54 View session

Copilot AI reviewed May 10, 2026

View reviewed changes

Comment thread onnxruntime/core/mlas/lib/platform.cpp

Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp

Copilot

7428f0c

hariharans29 requested a review from Copilot May 11, 2026 17:07

Copilot AI reviewed May 11, 2026

View reviewed changes

Copilot started reviewing on behalf of hariharans29 May 11, 2026 17:38 View session

add assert

5f25da6

hariharans29 requested a review from Copilot May 12, 2026 02:34

Copilot started reviewing on behalf of hariharans29 May 12, 2026 02:35 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread cmake/onnxruntime_mlas.cmake Outdated

remove ORT_MLAS_RISCV_FORCE_SCALAR

3bcc7d7

hariharans29 approved these changes May 12, 2026

View reviewed changes

hariharans29 merged commit 0e72188 into microsoft:main May 12, 2026
90 of 91 checks passed

velonica0 deleted the rvv_pr branch May 13, 2026 01:07

Conversation

velonica0 commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Benchmark ((SpacemiT K3, VLEN=256, 8-core))

Uh oh!

velonica0 commented May 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

velonica0 commented May 11, 2026

Uh oh!

hariharans29 commented May 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hariharans29 commented May 11, 2026

Uh oh!

velonica0 commented May 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

velonica0 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

velonica0 commented May 8, 2026 •

edited

Loading

velonica0 commented May 12, 2026 •

edited

Loading