[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261) by qiurui144 · Pull Request #28287 · microsoft/onnxruntime

qiurui144 · 2026-04-30T06:58:46Z

Summary

Adds two RVV-vectorised kernels to the MLAS RISC-V 64 path, follow-up to #28261:

INT8 GEMM (qgemm_kernel_rvv.cpp, 452 lines): standard-RVV vwmulu.vv + vwaddu.wv widening pattern, dynamic vsetvli, works for any VLEN ≥ 128 without rebuild.
GEMV (riscv64/sgemv_kernel_rvv.cpp, 86 lines): RVV M=1 GEMV kernel, LMUL=m4 (32 floats/vector at VLEN=256, scales with VLEN), 4× K-unroll for FMA latency hiding.

This is the first follow-up PR in the series @velonica0 mentioned in #28261 ("I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs").

Hardware compatibility

Both kernels: any RV64 with V extension 1.0 (RVA22V profile or later).
Tested on SpacemiT K3 X100 (RVA22V, VLEN=256, 2.4 GHz, GCC 15.2 -O3 -march=rv64gcv).

Build

onnxruntime_USE_RVV=ON (introduced by #28261, opt-in via --enable_rvv flag) automatically builds the new kernels with -march=rv64gcv. No new cmake option.

Benchmarks

K3 X100 (8 threads, governor=performance, cooldown ≤65 °C, 10 reps), comparing main (#28261) vs develop (#28261 + this PR):

bge-small-zh-v1.5 INT8 (where the new INT8 kernel applies)

variant	P50	speedup
main (no INT8 RVV kernel)	166.2 ms	1.00×
+ this PR (INT8 GEMM dispatch)	81.4 ms	2.04×

PPOCRv4 (CNN 1×1 conv triggers GEMV path)

variant	det P50	rec P50	det speedup	rec speedup
main	231.8 ms	132.5 ms	1.00×	1.00×
+ this PR (GEMV kernel)	214.6 ms	115.0 ms	1.08×	1.15×

FP32 transformer baseline (no regression)

#28261 already added the FP32 SGEMM RVV kernel, so FP32 transformer P50 is unchanged within noise:

model	main	develop	delta
bge-small FP32	89.8 ms	90.5 ms	+0.8%
bge-base FP32	600.9 ms	605.8 ms	+0.8%
bge-reranker FP32	598.7 ms	600.2 ms	+0.3%

Hardware without V does not define __riscv_vector and never reaches the new dispatch; the existing scalar fallback path is unchanged.

Files changed

file	LOC	purpose
`cmake/onnxruntime_mlas.cmake`	+5	Additive: register both kernels in #28261 RISCV64 RVV block
`onnxruntime/core/mlas/lib/mlasi.h`	+1	Forward-declare `MlasGemvFloatKernel` for RISCV64
`onnxruntime/core/mlas/lib/qgemm.h`	+5	Dispatch INT8 GEMM to RVV kernel for `MLAS_TARGET_RISCV64`
`onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp`	+452 (new)	INT8 GEMM kernel implementation
`onnxruntime/core/mlas/lib/sgemm.cpp`	+1	Add RISCV64 to `#elif ARM64
`onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp`	+86 (new)	GEMV kernel implementation

Total: +550 / 0 (no deletions).

Built on top of Add RISC-V Vector (RVV) support for CPU Execution Provider #28261 (Add RISC-V Vector (RVV) support for CPU EP).
Subsequent PRs in this series (per @velonica0's plan): activation kernels (Erf/Tanh/Logistic/Exp/Silu/GeluErf), LayerNorm fast path, SGEMM LMUL=m2 fast path. Will be sent as separate PRs.

Add onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp, a standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8 GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for any VLEN without rebuild. - cmake/onnxruntime_mlas.cmake: new RISCV64 build block that compiles the RVV kernel with -march=rv64gcv. FP32 SGEMM still uses the upstream scalar fallback (scalar/*.cpp), kept intact. - qgemm.h: add MLAS_TARGET_RISCV64 dispatch branch selecting the RVV kernel for all four (A,B) signedness combinations. No MLAS_PLATFORM struct change required: the dispatch is wired via the standard extern-global pattern used by MLAS_TARGET_WASM_SIMD / MLAS_TARGET_ARM64EC, not via a platform field. Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50 stays at 89ms (unchanged from upstream main; no regression). Signed-off-by: qiurui144 <happyqiurui@163.com>

Add an RVV M=1 GEMV kernel at onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp Follows the ARM64/WASM pattern: the existing sgemm.cpp fast path already calls MlasGemvFloatKernel() when TransB == CblasNoTrans; this patch extends the '#elif ARM64 || WASM' guard to include RISCV64 and declares the symbol in mlasi.h alongside the ARM64/WASM branch. Kernel strategy: - LMUL=m4, 32 floats/vector at VLEN=256, scales with VLEN via vsetvli - 4x unroll over K to hide FMA latency Wired via direct extern symbol (no MLAS_PLATFORM field required) to match the ARM64 integration, kept minimal to stay focused. Measured K3 (SpacemiT X100, 8T): PPOCRv4-det -8% end-to-end (CNN 1x1 conv trails benefit). Transformer models unchanged within noise. Signed-off-by: qiurui144 <happyqiurui@163.com>

qiurui144 · 2026-04-30T07:00:28Z

@microsoft-github-policy-service agree

qiurui144 added 2 commits April 30, 2026 14:32

qiurui144 closed this Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261)#28287

[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261)#28287
qiurui144 wants to merge 2 commits intomicrosoft:mainfrom
qiurui144:feat/mlas-rvv-int8-gemv-kernels

qiurui144 commented Apr 30, 2026

Uh oh!

qiurui144 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

qiurui144 commented Apr 30, 2026

Summary

Hardware compatibility

Build

Benchmarks

bge-small-zh-v1.5 INT8 (where the new INT8 kernel applies)

PPOCRv4 (CNN 1×1 conv triggers GEMV path)

FP32 transformer baseline (no regression)

Files changed

Related

Uh oh!

qiurui144 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant