Skip to content

[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261)#28287

Closed
qiurui144 wants to merge 2 commits intomicrosoft:mainfrom
qiurui144:feat/mlas-rvv-int8-gemv-kernels
Closed

[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261)#28287
qiurui144 wants to merge 2 commits intomicrosoft:mainfrom
qiurui144:feat/mlas-rvv-int8-gemv-kernels

Conversation

@qiurui144
Copy link
Copy Markdown

Summary

Adds two RVV-vectorised kernels to the MLAS RISC-V 64 path, follow-up to #28261:

  1. INT8 GEMM (qgemm_kernel_rvv.cpp, 452 lines): standard-RVV vwmulu.vv + vwaddu.wv widening pattern, dynamic vsetvli, works for any VLEN ≥ 128 without rebuild.
  2. GEMV (riscv64/sgemv_kernel_rvv.cpp, 86 lines): RVV M=1 GEMV kernel, LMUL=m4 (32 floats/vector at VLEN=256, scales with VLEN), 4× K-unroll for FMA latency hiding.

This is the first follow-up PR in the series @velonica0 mentioned in #28261 ("I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs").

Hardware compatibility

  • Both kernels: any RV64 with V extension 1.0 (RVA22V profile or later).
  • Tested on SpacemiT K3 X100 (RVA22V, VLEN=256, 2.4 GHz, GCC 15.2 -O3 -march=rv64gcv).

Build

onnxruntime_USE_RVV=ON (introduced by #28261, opt-in via --enable_rvv flag) automatically builds the new kernels with -march=rv64gcv. No new cmake option.

Benchmarks

K3 X100 (8 threads, governor=performance, cooldown ≤65 °C, 10 reps), comparing main (#28261) vs develop (#28261 + this PR):

bge-small-zh-v1.5 INT8 (where the new INT8 kernel applies)

variant P50 speedup
main (no INT8 RVV kernel) 166.2 ms 1.00×
+ this PR (INT8 GEMM dispatch) 81.4 ms 2.04×

PPOCRv4 (CNN 1×1 conv triggers GEMV path)

variant det P50 rec P50 det speedup rec speedup
main 231.8 ms 132.5 ms 1.00× 1.00×
+ this PR (GEMV kernel) 214.6 ms 115.0 ms 1.08× 1.15×

FP32 transformer baseline (no regression)

#28261 already added the FP32 SGEMM RVV kernel, so FP32 transformer P50 is unchanged within noise:

model main develop delta
bge-small FP32 89.8 ms 90.5 ms +0.8%
bge-base FP32 600.9 ms 605.8 ms +0.8%
bge-reranker FP32 598.7 ms 600.2 ms +0.3%

Hardware without V does not define __riscv_vector and never reaches the new dispatch; the existing scalar fallback path is unchanged.

Files changed

file LOC purpose
cmake/onnxruntime_mlas.cmake +5 Additive: register both kernels in #28261 RISCV64 RVV block
onnxruntime/core/mlas/lib/mlasi.h +1 Forward-declare MlasGemvFloatKernel for RISCV64
onnxruntime/core/mlas/lib/qgemm.h +5 Dispatch INT8 GEMM to RVV kernel for MLAS_TARGET_RISCV64
onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp +452 (new) INT8 GEMM kernel implementation
onnxruntime/core/mlas/lib/sgemm.cpp +1 Add RISCV64 to `#elif ARM64
onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp +86 (new) GEMV kernel implementation

Total: +550 / 0 (no deletions).

Related

Add onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp, a standard-RVV
(baseline V extension, VLEN>=128, dynamic vsetvli) INT8 GEMM kernel
using the vwmulu.vv + vwaddu.wv widening pattern. Works for any VLEN
without rebuild.

- cmake/onnxruntime_mlas.cmake: new RISCV64 build block that compiles
  the RVV kernel with -march=rv64gcv. FP32 SGEMM still uses the
  upstream scalar fallback (scalar/*.cpp), kept intact.
- qgemm.h: add MLAS_TARGET_RISCV64 dispatch branch selecting the RVV
  kernel for all four (A,B) signedness combinations.

No MLAS_PLATFORM struct change required: the dispatch is wired via the
standard extern-global pattern used by MLAS_TARGET_WASM_SIMD /
MLAS_TARGET_ARM64EC, not via a platform field.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel
throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50
stays at 89ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
Add an RVV M=1 GEMV kernel at
  onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp

Follows the ARM64/WASM pattern: the existing sgemm.cpp fast path
already calls MlasGemvFloatKernel() when TransB == CblasNoTrans; this
patch extends the '#elif ARM64 || WASM' guard to include RISCV64
and declares the symbol in mlasi.h alongside the ARM64/WASM branch.

Kernel strategy:
- LMUL=m4, 32 floats/vector at VLEN=256, scales with VLEN via vsetvli
- 4x unroll over K to hide FMA latency

Wired via direct extern symbol (no MLAS_PLATFORM field required) to
match the ARM64 integration, kept minimal to stay focused.

Measured K3 (SpacemiT X100, 8T): PPOCRv4-det -8% end-to-end
(CNN 1x1 conv trails benefit). Transformer models unchanged within
noise.

Signed-off-by: qiurui144 <happyqiurui@163.com>
@qiurui144
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

@qiurui144 qiurui144 closed this Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant