
Add RISC-V Vector (RVV) support for CPU Execution Provider#28261

Merged
hariharans29 merged 6 commits into microsoft:main from velonica0:rvv_pr
Apr 30, 2026

Conversation

@velonica0
Contributor

Motivation and Context

Close #17466 and #24596

MLAS already provides architecture-specific optimized kernels for multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x. However, riscv64 has lacked comparable RVV-optimized coverage and has largely fallen back to scalar code for the operators addressed in this PR.

This PR introduces RISC-V Vector (RVV) extension support to the ONNX Runtime CPU Execution Provider.

This PR focuses on two operators: SGEMM and Softmax.
We have already completed optimizations for several other operators. Following the acceptance of this PR, I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs.

Benchmark Results

SGEMM

| Case | pack_b | RVV pack (ms) | RVV compute (ms) | Scalar pack (ms) | Scalar compute (ms) | Compute speedup | End-to-end speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128x3072x768 | 1 | 63.21 | 114.52 | 66.71 | 414.44 | 3.62x | 2.71x |
| 64x1024x1024 | 1 | 22.07 | 27.66 | 23.14 | 96.64 | 3.49x | 2.41x |
| 32x4096x1024 | 1 | 119.04 | 56.82 | 118.86 | 188.34 | 3.31x | 1.75x |

Softmax

| Case | Scalar (ms) | RVV (ms) | Speedup |
| --- | --- | --- | --- |
| 4096x128 | 1955.25 | 611.65 | 3.20x |
| 1024x1024 | 717.26 | 236.73 | 3.03x |

@velonica0
Contributor Author

@microsoft-github-policy-service agree

@velonica0
Contributor Author

Hi, @hariharans29
Could you please take a look at this PR? Thank you for your help.

@hariharans29 hariharans29 requested a review from Copilot April 29, 2026 17:22
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

No pipelines are associated with this pull request.

Comment thread onnxruntime/test/mlas/bench/riscv64/softmax_rvv_compare.cpp Fixed
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds RISC-V Vector (RVV) support to the CPU Execution Provider’s MLAS path, focusing on optimized SGEMM and Softmax kernels with build-time enablement and runtime dispatch.

Changes:

  • Added build and CMake options to enable RVV (--enable_rvv, onnxruntime_USE_RVV) and compile RVV intrinsic sources on riscv64.
  • Implemented RVV-optimized SGEMM (kernel + packing) and Softmax critical-path kernels, plus platform runtime dispatch with an opt-out env var.
  • Added riscv64-specific standalone benchmark/compare tools and wired them into the test build as separate executables.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/ci_build/build_args.py | Adds --enable_rvv build flag for RVV-enabled MLAS builds. |
| tools/ci_build/build.py | Plumbs --enable_rvv into CMake via onnxruntime_USE_RVV. |
| onnxruntime/test/mlas/bench/riscv64/softmax_rvv_compare.cpp | Adds a standalone RVV vs scalar Softmax validation/timing tool. |
| onnxruntime/test/mlas/bench/riscv64/sgemm_riscv_bench.cpp | Adds a standalone SGEMM benchmark to compare RVV vs scalar. |
| onnxruntime/test/mlas/bench/riscv64/README.md | Documents how to build and run the riscv64 benchmarks/tools. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Hooks RVV pack-B and RVV SGEMM kernel dispatch on riscv64. |
| onnxruntime/core/mlas/lib/riscv64/softmax_kernel_rvv.cpp | Implements RVV Softmax primitives (reduce max, sum-exp, normalize, log-softmax output). |
| onnxruntime/core/mlas/lib/riscv64/sgemm_pack_b_rvv.cpp | Implements RVV-accelerated SGEMM packed-B copy routine. |
| onnxruntime/core/mlas/lib/riscv64/sgemm_kernel_rvv.cpp | Implements RVV SGEMM kernel for packed-B tiles. |
| onnxruntime/core/mlas/lib/platform.cpp | Adds riscv64 runtime dispatch for RVV kernels and ORT_MLAS_RISCV_FORCE_SCALAR opt-out. |
| onnxruntime/core/mlas/lib/mlasi.h | Extends platform/kernel declarations to include riscv64 and RVV kernel symbols. |
| onnxruntime/core/mlas/lib/compute.cpp | Routes Softmax path through platform function pointers on riscv64. |
| onnxruntime/core/mlas/inc/mlas.h | Adds MLAS_TARGET_RISCV64 target detection macro. |
| cmake/onnxruntime_unittests.cmake | Excludes riscv64 bench sources from the generic benchmark target; adds riscv64 standalone executables. |
| cmake/onnxruntime_mlas.cmake | Adds riscv64 platform selection and conditional RVV intrinsic compile checks/flags. |
| cmake/CMakeLists.txt | Introduces onnxruntime_USE_RVV CMake option. |
Comments suppressed due to low confidence (1)

cmake/onnxruntime_unittests.cmake:1

  • The newly added endif() at line 1423 changes the CMake block structure around the MLAS benchmark/test targets. This looks like it may prematurely close an enclosing if() (based on surrounding indentation and flow) and could alter which platforms/configurations generate subsequent test executables. Please re-check the intended scoping and adjust the endif() placement so the benchmark and new riscv64 targets remain under the same guard(s) as before.
# Copyright (c) Microsoft Corporation. All rights reserved.


Comment thread onnxruntime/core/mlas/lib/riscv64/softmax_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
Comment thread cmake/onnxruntime_mlas.cmake Outdated
Comment thread onnxruntime/core/mlas/lib/mlasi.h
@hariharans29
Member

Can you please resolve the Copilot comments, with a note on each stating whether you took it in or not?

@velonica0
Contributor Author

Yes, I have made the changes according to Copilot's requirements and resolved the format issues in CI.

@hariharans29 hariharans29 enabled auto-merge (squash) April 30, 2026 04:05
@hariharans29 hariharans29 merged commit 62f742f into microsoft:main Apr 30, 2026
85 of 87 checks passed
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 1, 2026
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a
standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8
GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for
any VLEN without rebuild.

Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the
  if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects
  MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations,
  matching the inline-extern style used by ARM64EC / WASM_SIMD /
  S390X branches above it.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel
throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50
stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 1, 2026
Add an RVV-vectorised activation/compute family at
  onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp

covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf. Wired
into the dispatch framework introduced by microsoft#28261:

- mlasi.h: extend the existing
  `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six
  new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32),
  and add four MLAS_PLATFORM dispatch fields
  (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine,
  ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new
  fields to the upstream scalar kernels and override them with the RVV
  variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp:
  extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...`
  guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the
  `if(HAS_RISCV64_RVV)` source list (additive, no new block).

Kernel strategy: LMUL=m4 throughout (32 floats per vector at
VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite
range reduction + 6th-order minimax polynomial; erf/gelu use the
Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error).
Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.

Signed-off-by: qiurui144 <happyqiurui@163.com>
