Add RISC-V Vector (RVV) support for CPU Execution Provider #28261
hariharans29 merged 6 commits into microsoft:main
@microsoft-github-policy-service agree
Hi, @hariharans29
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline
No pipelines are associated with this pull request.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds RISC-V Vector (RVV) support to the CPU Execution Provider’s MLAS path, focusing on optimized SGEMM and Softmax kernels with build-time enablement and runtime dispatch.
Changes:
- Added build and CMake options to enable RVV (`--enable_rvv`, `onnxruntime_USE_RVV`) and compile RVV intrinsic sources on riscv64.
- Implemented RVV-optimized SGEMM (kernel + packing) and Softmax critical-path kernels, plus platform runtime dispatch with an opt-out env var.
- Added riscv64-specific standalone benchmark/compare tools and wired them into the test build as separate executables.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/ci_build/build_args.py | Adds --enable_rvv build flag for RVV-enabled MLAS builds. |
| tools/ci_build/build.py | Plumbs --enable_rvv into CMake via onnxruntime_USE_RVV. |
| onnxruntime/test/mlas/bench/riscv64/softmax_rvv_compare.cpp | Adds a standalone RVV vs scalar Softmax validation/timing tool. |
| onnxruntime/test/mlas/bench/riscv64/sgemm_riscv_bench.cpp | Adds a standalone SGEMM benchmark to compare RVV vs scalar. |
| onnxruntime/test/mlas/bench/riscv64/README.md | Documents how to build and run the riscv64 benchmarks/tools. |
| onnxruntime/core/mlas/lib/sgemm.cpp | Hooks RVV pack-B and RVV SGEMM kernel dispatch on riscv64. |
| onnxruntime/core/mlas/lib/riscv64/softmax_kernel_rvv.cpp | Implements RVV Softmax primitives (reduce max, sum-exp, normalize, log-softmax output). |
| onnxruntime/core/mlas/lib/riscv64/sgemm_pack_b_rvv.cpp | Implements RVV-accelerated SGEMM packed-B copy routine. |
| onnxruntime/core/mlas/lib/riscv64/sgemm_kernel_rvv.cpp | Implements RVV SGEMM kernel for packed-B tiles. |
| onnxruntime/core/mlas/lib/platform.cpp | Adds riscv64 runtime dispatch for RVV kernels and ORT_MLAS_RISCV_FORCE_SCALAR opt-out. |
| onnxruntime/core/mlas/lib/mlasi.h | Extends platform/kernel declarations to include riscv64 and RVV kernel symbols. |
| onnxruntime/core/mlas/lib/compute.cpp | Routes Softmax path through platform function pointers on riscv64. |
| onnxruntime/core/mlas/inc/mlas.h | Adds MLAS_TARGET_RISCV64 target detection macro. |
| cmake/onnxruntime_unittests.cmake | Excludes riscv64 bench sources from the generic benchmark target; adds riscv64 standalone executables. |
| cmake/onnxruntime_mlas.cmake | Adds riscv64 platform selection and conditional RVV intrinsic compile checks/flags. |
| cmake/CMakeLists.txt | Introduces onnxruntime_USE_RVV CMake option. |
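The runtime dispatch described for platform.cpp (scalar kernels by default, RVV kernels when the hardware supports them, and an opt-out through ORT_MLAS_RISCV_FORCE_SCALAR) can be sketched roughly as below. This is a minimal, portable illustration and not the code from the PR: `Platform`, `InitPlatform`, `ForceScalarRequested`, `ReduceMaxScalar` and `ReduceMaxVector` are hypothetical names, the "vector" kernel is a plain C++ stand-in, and the rule for which env var values count as an opt-out is an assumption.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical kernel signature; the real MLAS kernel signatures differ.
using ReduceMaxFn = float (*)(const float* data, std::size_t n);

// Scalar fallback. Precondition: n >= 1.
float ReduceMaxScalar(const float* data, std::size_t n) {
    float m = data[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (data[i] > m) m = data[i];
    }
    return m;
}

// Portable stand-in for a vector kernel: four independent "lanes" plus a
// scalar tail. A real RVV kernel would use intrinsics instead.
float ReduceMaxVector(const float* data, std::size_t n) {
    float lanes[4] = {data[0], data[0], data[0], data[0]};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t l = 0; l < 4; ++l) {
            if (data[i + l] > lanes[l]) lanes[l] = data[i + l];
        }
    }
    float m = lanes[0];
    for (std::size_t l = 1; l < 4; ++l) {
        if (lanes[l] > m) m = lanes[l];
    }
    for (; i < n; ++i) {
        if (data[i] > m) m = data[i];
    }
    return m;
}

// Hypothetical platform table holding the selected kernel.
struct Platform {
    ReduceMaxFn reduce_max;
    bool using_vector;
};

// Assumption: any non-empty value other than "0" requests the scalar path.
bool ForceScalarRequested(const char* value) {
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Default to scalar, then override only when the vector extension is
// available and the user has not opted out.
Platform InitPlatform(bool has_rvv, const char* force_scalar_env) {
    Platform p{ReduceMaxScalar, false};
    if (has_rvv && !ForceScalarRequested(force_scalar_env)) {
        p.reduce_max = ReduceMaxVector;
        p.using_vector = true;
    }
    return p;
}
```

A caller would pass the result of `std::getenv("ORT_MLAS_RISCV_FORCE_SCALAR")` (from `<cstdlib>`) as the second argument. Taking the value as a parameter keeps the selection logic testable without touching the process environment.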
Comments suppressed due to low confidence (1)
cmake/onnxruntime_unittests.cmake:1
- The newly added `endif()` at line 1423 changes the CMake block structure around the MLAS benchmark/test targets. This looks like it may prematurely close an enclosing `if()` (based on surrounding indentation and flow) and could alter which platforms/configurations generate subsequent test executables. Please re-check the intended scoping and adjust the `endif()` placement so the benchmark and new riscv64 targets remain under the same guard(s) as before.
# Copyright (c) Microsoft Corporation. All rights reserved.
Can you please resolve the Copilot comments, with a comment on each stating whether you took it in or not?
Yes, I have made the changes according to Copilot's requirements and resolved the format issues in CI.
Add onnxruntime/core/mlas/lib/riscv64/qgemm_kernel_rvv.cpp, a standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8 GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for any VLEN without rebuild.
Wired into the existing RISCV64 RVV build block introduced by microsoft#28261:
- cmake/onnxruntime_mlas.cmake: append qgemm_kernel_rvv.cpp to the if(HAS_RISCV64_RVV) source list (additive, no new block).
- qgemm.h: add an MLAS_TARGET_RISCV64 dispatch branch that selects MlasGemmU8S8DispatchRvv for all four (A,B) signedness combinations, matching the inline-extern style used by ARM64EC / WASM_SIMD / S390X branches above it.
Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50 stays at 89ms (unchanged from upstream main; no regression).
Signed-off-by: qiurui144 <happyqiurui@163.com>
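To make the widening pattern in the commit message above concrete, here is a portable scalar model of a strip-mined unsigned 8-bit dot product. It is only a sketch under assumptions: `SetVl` and `DotU8Widening` are hypothetical names, `SetVl` models vsetvli in a simplified way (it returns the smaller of the remaining count and the hardware maximum), and only the unsigned case is shown. How the actual kernel handles signed operands, tiling and packing is not covered here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of vsetvli: how many elements this iteration handles.
std::size_t SetVl(std::size_t remaining, std::size_t vlmax) {
    return std::min(remaining, vlmax);
}

// Dot product of two u8 arrays of length k. vlmax (must be > 0) plays the
// role of the hardware vector length, so the same loop works for any VLEN.
std::uint32_t DotU8Widening(const std::uint8_t* a, const std::uint8_t* b,
                            std::size_t k, std::size_t vlmax) {
    std::vector<std::uint32_t> acc(vlmax, 0);  // one u32 accumulator per lane
    for (std::size_t done = 0; done < k;) {
        const std::size_t vl = SetVl(k - done, vlmax);
        for (std::size_t i = 0; i < vl; ++i) {
            // Models vwmulu.vv: u8 * u8 -> u16 (255 * 255 = 65025 fits).
            const std::uint16_t prod = static_cast<std::uint16_t>(
                static_cast<std::uint16_t>(a[done + i]) * b[done + i]);
            // Models vwaddu.wv: u32 accumulator += u16 product.
            acc[i] += prod;
        }
        done += vl;
    }
    // Final horizontal reduction across lanes.
    std::uint32_t sum = 0;
    for (std::uint32_t lane : acc) sum += lane;
    return sum;
}
```

Because the trip count comes from `SetVl` on every iteration, there is no separate tail loop, and the result is independent of `vlmax`. That is the property the commit message refers to with "works for any VLEN without rebuild".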
Add an RVV-vectorised activation/compute family at onnxruntime/core/mlas/lib/riscv64/activation_kernel_rvv.cpp covering Erf, Tanh, Logistic, ComputeExpF32, Silu and GeluErf.
Wired into the dispatch framework introduced by microsoft#28261:
- mlasi.h: extend the existing `MLAS_TARGET_RISCV64 && MLAS_USE_RVV` kernel-decl block with the six new symbols (Erf, Logistic, GeluErf, Silu, Tanh, ComputeExpF32), and add four MLAS_PLATFORM dispatch fields (GeluErfKernelRoutine, SiluKernelRoutine, TanhKernelRoutine, ComputeExpF32Kernel) under a RISCV64-only block.
- platform.cpp: in the RISCV64 init block, default-assign the four new fields to the upstream scalar kernels and override them with the RVV variants inside the existing `if (has_rvv)` gate.
- erf.cpp / logistic.cpp / tanh.cpp / compute.cpp / gelu.cpp / silu.cpp: extend the dispatch-site `#if defined(MLAS_TARGET_AMD64) || ...` guard to include `MLAS_TARGET_RISCV64`.
- cmake/onnxruntime_mlas.cmake: append activation_kernel_rvv.cpp to the `if(HAS_RISCV64_RVV)` source list (additive, no new block).
Kernel strategy: LMUL=m4 throughout (32 floats per vector at VLEN=256, scales with VLEN via dynamic vsetvli). exp uses Cody-Waite range reduction + 6th-order minimax polynomial; erf/gelu use the Abramowitz & Stegun 5-term approximation (max ~2.5e-5 abs error). Silu fuses `x * sigmoid(x)` in a single pass to halve memory traffic.
Signed-off-by: qiurui144 <happyqiurui@163.com>
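As a scalar reference for two of the ideas in the kernel strategy above, the sketch below shows a five-coefficient Abramowitz & Stegun erf approximation and a single-pass SiLU. It is an illustration under assumptions, not the RVV kernel: `ErfApprox` and `SiluFused` are hypothetical names, the coefficients are those of the commonly cited formula 7.1.26, and whether the PR's kernel uses exactly this variant or these constants is not confirmed by the commit message.

```cpp
#include <cmath>
#include <cstddef>

// erf(x) ~= sign(x) * (1 - poly(t) * exp(-x^2)), with t = 1 / (1 + p*|x|).
// Coefficients: Abramowitz & Stegun formula 7.1.26 (assumed variant).
float ErfApprox(float x) {
    const float p = 0.3275911f;
    const float a1 = 0.254829592f;
    const float a2 = -0.284496736f;
    const float a3 = 1.421413741f;
    const float a4 = -1.453152027f;
    const float a5 = 1.061405429f;
    const float sign = (x < 0.0f) ? -1.0f : 1.0f;
    const float ax = std::fabs(x);
    const float t = 1.0f / (1.0f + p * ax);
    // Horner evaluation of a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5.
    const float poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t;
    return sign * (1.0f - poly * std::exp(-ax * ax));
}

// SiLU computed in one pass: each input is read once and each output written
// once, instead of materialising sigmoid(x) in a temporary buffer first.
void SiluFused(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float x = in[i];
        out[i] = x / (1.0f + std::exp(-x));  // x * sigmoid(x)
    }
}
```

In a vectorised version the loop body would operate on a whole register group per iteration, but the memory-traffic argument is the same: the fused form avoids one full write and one full read of an intermediate array.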
Motivation and Context
Close #17466 and #24596
MLAS already provides architecture-specific optimized kernels for multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x. However, riscv64 has not had comparable RVV-optimized coverage for the operators in this PR and has mainly fallen back to scalar code.
This PR introduces RISC-V Vector (RVV) extension support to the ONNX Runtime CPU Execution Provider.
This PR focuses on two operators: SGEMM and Softmax.
We have already completed optimizations for several other operators. Following the acceptance of this PR, I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs.
Benchmark Results
SGEMM
Softmax