Skip to content

Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS#28411

Merged
hariharans29 merged 4 commits into
microsoft:mainfrom
velonica0:rvv_pr
May 12, 2026
Merged

Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS#28411
hariharans29 merged 4 commits into
microsoft:mainfrom
velonica0:rvv_pr

Conversation

@velonica0
Copy link
Copy Markdown
Contributor

@velonica0 velonica0 commented May 8, 2026

Description

New kernel files:

  • riscv64/sconv_depthwise_kernel_rvv.cpp — RVV-optimized 3x3 stride-1 depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4 generic vectorized version
  • riscv64/sconv_nchwc_kernel_rvv.cpp — 7 NCHWc kernels using vfloat32m4_t (LMUL=4, BlockSize=16):
    • Direct NCHW conv (MlasConvNchwFloatKernelRvv)
    • Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
    • Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
    • Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
    • Max/AvgExcludePad/AvgIncludePad pooling

Motivation and Context

Following #28261, Optimize more MLAS kernels using RISC-V Vector (RVV) extensions.

Please Note:

  • On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group — but we only request 16. So we're using half the available vector width.

  • The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change.

Benchmark ((SpacemiT K3, VLEN=256, 8-core))

All tests pass with zero numerical error.

Kernel Speedup (RVV vs scalar)
Direct NCHW Conv 1.27–1.29x
Direct NCHWc Conv 1.93–1.95x
Depthwise NCHWc Conv 10.8–12.5x
Pointwise NCHWc Conv 29.4–30.4x
Max Pooling 12.5–20.0x
Avg Pooling (exclude pad) 4.0–4.3x
Avg Pooling (include pad) 5.5–5.8x

@velonica0
Copy link
Copy Markdown
Contributor Author

Hi @hariharans29
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.

Changes:

  • Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
  • Wires the new kernels into MLAS_PLATFORM initialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV.
  • Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
onnxruntime/core/mlas/lib/snchwc.cpp Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size.
onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling.
onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp New RVV 3x3 s1 depthwise CHW kernel implementation for the multiplier-1 path.
onnxruntime/core/mlas/lib/platform.cpp Registers RVV NCHWc conv/pool kernels and sets NCHWc block size for riscv64 when RVV is present.
onnxruntime/core/mlas/lib/mlasi.h Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to MLAS_PLATFORM.
onnxruntime/core/mlas/lib/convolve.cpp Enables the depthwise-direct algorithm path on RISCV64 and updates stride restrictions comment/logic.
onnxruntime/core/mlas/inc/mlas.h Exposes MlasConvAlgorithmDepthwise in the enum for RISCV64 builds.
cmake/onnxruntime_mlas.cmake Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/core/mlas/lib/platform.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp
@velonica0
Copy link
Copy Markdown
Contributor Author

Both comments point to the same concern: "What if $vl < 16$?"

In practice, however, this is not an issue because the minimum VLEN on current RISC-V CPUs is 128 bits. Therefore, this case is effectively covered (though I have implemented changes regardless).

The calculation is as follows:

__riscv_vsetvl_e32m4(avl) returns $\min(avl, VLMAX)$, where:

$$VLMAX = (VLEN / SEW) \times LMUL = (VLEN / 32) \times 4$$

@hariharans29
Copy link
Copy Markdown
Member

Both comments point to the same concern: "What if v l < 16 ?"

In practice, however, this is not an issue because the minimum VLEN on current RISC-V CPUs is 128 bits. Therefore, this case is effectively covered (though I have implemented changes regardless).

The calculation is as follows:

__riscv_vsetvl_e32m4(avl) returns min ( a v l , V L M A X ) , where:

V L M A X = ( V L E N / S E W ) × L M U L = ( V L E N / 32 ) × 4

Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot.

@hariharans29 hariharans29 requested a review from Copilot May 11, 2026 17:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp
@hariharans29
Copy link
Copy Markdown
Member

Copilot generated a lot more comments in this round - can you please take a look ?

@velonica0
Copy link
Copy Markdown
Contributor Author

Copilot generated a lot more comments in this round - can you please take a look ?

I've finished the modifications, mainly adding some asserts. Thanks a lot!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/convolve.cpp:1572

  • On riscv64 builds that enable MLAS_USE_RVV, the depthwise fast-path (MlasConvAlgorithmDepthwise) can still be selected purely based on compile-time macros. In this configuration the depthwise implementation is RVV-only (and the scalar/generic depthwise source is removed in CMake), so selecting this algorithm when RVV is not available at runtime (or when ORT_MLAS_RISCV_FORCE_SCALAR disables RVV dispatch) will execute vector instructions and can fault (illegal instruction). Consider gating depthwise algorithm selection on runtime RVV availability (e.g., a platform dispatch pointer/flag) or keeping a non-RVV fallback implementation compiled in and dispatching accordingly.
#if defined(MLAS_TARGET_WASM_SCALAR) || defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)

        // Scalar (WASM_SCALAR) / vectorized (ARM64/RISCV64) direct conv for depthwise convolution.
        // Currently only support 3x3 kernel with padding <=1 and dilations = 1
        // and on ARM64/RISCV64, it is further restricted to strides = 1.
        // TODO: support more general depthwise convolution.

    #if defined(MLAS_TARGET_ARM64) || defined(MLAS_TARGET_RISCV64)
        bool depthwise_conv_stride_support_check = Parameters->StrideShape[0] == 1 && Parameters->StrideShape[1] == 1;
    #else
        bool depthwise_conv_stride_support_check = true;
    #endif

        if (Dimensions == 2
                && Parameters->FilterCount == 1 && Parameters->InputChannels == 1
                && Parameters->KernelShape[0] == 3 && Parameters->KernelShape[1] == 3
                && Parameters->Padding[0] <= 1 && Parameters->Padding[1] <= 1
                && Parameters->Padding[2] <= 1 && Parameters->Padding[3] <= 1
                && depthwise_conv_stride_support_check
                && Parameters->DilationShape[0] == 1 && Parameters->DilationShape[1] == 1) {

Comment thread cmake/onnxruntime_mlas.cmake Outdated
@velonica0
Copy link
Copy Markdown
Contributor Author

velonica0 commented May 12, 2026

Could you please re-run the CI?
My code has no relation to IOS.
🙏🙏🙏

@hariharans29 hariharans29 merged commit 0e72188 into microsoft:main May 12, 2026
90 of 91 checks passed
@velonica0 velonica0 deleted the rvv_pr branch May 13, 2026 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants