
Add RISC-V Vector (RVV) support for CPU Execution Provider#28261

Merged
hariharans29 merged 6 commits into microsoft:main from velonica0:rvv_pr
Apr 30, 2026

Conversation

@velonica0
Contributor

Motivation and Context

Close #17466 and #24596

MLAS already provides architecture-specific optimized kernels for multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x. However, riscv64 has lacked comparable RVV-optimized coverage and has largely fallen back to scalar code for the operators addressed in this PR.

This PR introduces RISC-V Vector (RVV) extension support to the ONNX Runtime CPU Execution Provider.

This PR focuses on two operators: SGEMM and Softmax.
We have already completed optimizations for several other operators. Following the acceptance of this PR, I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs.

Benchmark Results

SGEMM

| Case | pack_b | RVV pack (ms) | RVV compute (ms) | Scalar pack (ms) | Scalar compute (ms) | Compute speedup | End-to-end speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 128x3072x768 | 1 | 63.21 | 114.52 | 66.71 | 414.44 | 3.62x | 2.71x |
| 64x1024x1024 | 1 | 22.07 | 27.66 | 23.14 | 96.64 | 3.49x | 2.41x |
| 32x4096x1024 | 1 | 119.04 | 56.82 | 118.86 | 188.34 | 3.31x | 1.75x |

Softmax

| Case | Scalar (ms) | RVV (ms) | Speedup |
| --- | --- | --- | --- |
| 4096x128 | 1955.25 | 611.65 | 3.20x |
| 1024x1024 | 717.26 | 236.73 | 3.03x |

@velonica0
Contributor Author

@microsoft-github-policy-service agree

@velonica0
Contributor Author

Hi, @hariharans29
Could you please take a look at this PR? Thank you for your help.

@hariharans29 hariharans29 requested a review from Copilot April 29, 2026 17:22
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

No pipelines are associated with this pull request.

Comment thread onnxruntime/test/mlas/bench/riscv64/softmax_rvv_compare.cpp Fixed
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds RISC-V Vector (RVV) support to the CPU Execution Provider’s MLAS path, focusing on optimized SGEMM and Softmax kernels with build-time enablement and runtime dispatch.

Changes:

  • Added build and CMake options to enable RVV (--enable_rvv, onnxruntime_USE_RVV) and compile RVV intrinsic sources on riscv64.
  • Implemented RVV-optimized SGEMM (kernel + packing) and Softmax critical-path kernels, plus platform runtime dispatch with an opt-out env var.
  • Added riscv64-specific standalone benchmark/compare tools and wired them into the test build as separate executables.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/ci_build/build_args.py | Adds --enable_rvv build flag for RVV-enabled MLAS builds. |
| tools/ci_build/build.py | Plumbs --enable_rvv into CMake via onnxruntime_USE_RVV. |
| onnxruntime/test/mlas/bench/riscv64/softmax_rvv_compare.cpp | Adds a standalone RVV vs scalar Softmax validation/timing tool. |
| onnxruntime/test/mlas/bench/riscv64/sgemm_riscv_bench.cpp | Adds a standalone SGEMM benchmark to compare RVV vs scalar. |
| onnxruntime/test/mlas/bench/riscv64/README.md | Documents how to build and run the riscv64 benchmarks/tools. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Hooks RVV pack-B and RVV SGEMM kernel dispatch on riscv64. |
| onnxruntime/core/mlas/lib/riscv64/softmax_kernel_rvv.cpp | Implements RVV Softmax primitives (reduce max, sum-exp, normalize, log-softmax output). |
| onnxruntime/core/mlas/lib/riscv64/sgemm_pack_b_rvv.cpp | Implements RVV-accelerated SGEMM packed-B copy routine. |
| onnxruntime/core/mlas/lib/riscv64/sgemm_kernel_rvv.cpp | Implements RVV SGEMM kernel for packed-B tiles. |
| onnxruntime/core/mlas/lib/platform.cpp | Adds riscv64 runtime dispatch for RVV kernels and ORT_MLAS_RISCV_FORCE_SCALAR opt-out. |
| onnxruntime/core/mlas/lib/mlasi.h | Extends platform/kernel declarations to include riscv64 and RVV kernel symbols. |
| onnxruntime/core/mlas/lib/compute.cpp | Routes Softmax path through platform function pointers on riscv64. |
| onnxruntime/core/mlas/inc/mlas.h | Adds MLAS_TARGET_RISCV64 target detection macro. |
| cmake/onnxruntime_unittests.cmake | Excludes riscv64 bench sources from the generic benchmark target; adds riscv64 standalone executables. |
| cmake/onnxruntime_mlas.cmake | Adds riscv64 platform selection and conditional RVV intrinsic compile checks/flags. |
| cmake/CMakeLists.txt | Introduces onnxruntime_USE_RVV CMake option. |
Comments suppressed due to low confidence (1)

cmake/onnxruntime_unittests.cmake:1

  • The newly added endif() at line 1423 changes the CMake block structure around the MLAS benchmark/test targets. This looks like it may prematurely close an enclosing if() (based on surrounding indentation and flow) and could alter which platforms/configurations generate subsequent test executables. Please re-check the intended scoping and adjust the endif() placement so the benchmark and new riscv64 targets remain under the same guard(s) as before.
# Copyright (c) Microsoft Corporation. All rights reserved.


Comment thread onnxruntime/core/mlas/lib/riscv64/softmax_kernel_rvv.cpp
Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
Comment thread cmake/onnxruntime_mlas.cmake Outdated
Comment thread onnxruntime/core/mlas/lib/mlasi.h
@hariharans29
Member

Can you please resolve the Copilot comments, with a note on each stating whether you took it in or not?

@velonica0
Contributor Author

Yes, I have made the changes according to Copilot's requirements and resolved the format issues in CI.

@hariharans29 hariharans29 enabled auto-merge (squash) April 30, 2026 04:05
@hariharans29 hariharans29 merged commit 62f742f into microsoft:main Apr 30, 2026
85 of 87 checks passed
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 1, 2026
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a
standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8
GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for
any VLEN without rebuild.

Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the
  if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects
  MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations,
  matching the inline-extern style used by ARM64EC / WASM_SIMD /
  S390X branches above it.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel
throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50
stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
qiurui144 added a commit to qiurui144/onnxruntime that referenced this pull request May 1, 2026
Add an RVV-vectorised activation/compute family at
  onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp

covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf. Wired
into the dispatch framework introduced by microsoft#28261:

- mlasi.h: extend the existing
  `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six
  new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32),
  and add four MLAS_PLATFORM dispatch fields
  (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine,
  ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new
  fields to the upstream scalar kernels and override them with the RVV
  variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp:
  extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...`
  guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the
  `if(HAS_RISCV64_RVV)` source list (additive, no new block).

Kernel strategy: LMUL=m4 throughout (32 floats per vector at
VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite
range reduction + 6th-order minimax polynomial; erf/gelu use the
Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error).
Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.

Signed-off-by: qiurui144 <happyqiurui@163.com>
