[RISC-V] Add RVV INT8 GEMM and GEMV kernels (follow-up #28261) #28287
Closed
qiurui144 wants to merge 2 commits into microsoft:main from
Add onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp, a standard-RVV (baseline V extension, VLEN>=128, dynamic vsetvli) INT8 GEMM kernel using the vwmulu.vv + vwaddu.wv widening pattern. Works for any VLEN without rebuild.

- cmake/onnxruntime_mlas.cmake: new RISCV64 build block that compiles the RVV kernel with -march=rv64gcv. FP32 SGEMM still uses the upstream scalar fallback (scalar/*.cpp), kept intact.
- qgemm.h: add MLAS_TARGET_RISCV64 dispatch branch selecting the RVV kernel for all four (A,B) signedness combinations.

No MLAS_PLATFORM struct change required: the dispatch is wired via the standard extern-global pattern used by MLAS_TARGET_WASM_SIMD / MLAS_TARGET_ARM64EC, not via a platform field.

Measured K3 (SpacemiT X100, VLEN=256, 8T): bge-small INT8 kernel throughput ~2.5x vs scalar default. FP32 bge-small no-dispatch P50 stays at 89 ms (unchanged from upstream main; no regression).

Signed-off-by: qiurui144 <happyqiurui@163.com>
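To make the widening pattern in this commit message concrete, here is a minimal portable sketch of a strip-mined unsigned 8-bit dot product. It is a scalar model, not the kernel from this PR: the function name `WideningDotU8` and the `vlmax` parameter are hypothetical, and the per-lane arithmetic only imitates what `vwmulu.vv` (u8 x u8 -> u16) followed by `vwaddu.wv` (u32 += u16) would compute under a `vsetvli`-chosen vector length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a strip-mined widening dot product (illustrative only).
// vlmax stands in for the element count that vsetvli would grant per strip.
uint32_t WideningDotU8(const uint8_t* a, const uint8_t* b, size_t k, size_t vlmax) {
    assert(vlmax > 0);
    std::vector<uint32_t> acc(vlmax, 0);  // 32-bit accumulator lanes
    for (size_t done = 0; done < k;) {
        const size_t vl = std::min(vlmax, k - done);  // models vsetvli
        for (size_t lane = 0; lane < vl; ++lane) {
            // Models vwmulu.vv: u8 x u8 -> u16 (255 * 255 = 65025 fits in 16 bits).
            const uint16_t prod = static_cast<uint16_t>(
                static_cast<uint16_t>(a[done + lane]) * b[done + lane]);
            // Models vwaddu.wv: 32-bit accumulator += zero-extended 16-bit product.
            acc[lane] += prod;
        }
        done += vl;
    }
    uint32_t sum = 0;
    for (uint32_t lane : acc) sum += lane;  // final horizontal reduction
    return sum;
}
```

Because the strip length is recomputed on every iteration, the same loop covers any `vlmax`, which is the property the commit message describes as working for any VLEN without a rebuild.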
Add an RVV M=1 GEMV kernel at onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp

Follows the ARM64/WASM pattern: the existing sgemm.cpp fast path already calls MlasGemvFloatKernel() when TransB == CblasNoTrans; this patch extends the '#elif ARM64 || WASM' guard to include RISCV64 and declares the symbol in mlasi.h alongside the ARM64/WASM branch.

Kernel strategy:
- LMUL=m4, 32 floats/vector at VLEN=256, scales with VLEN via vsetvli
- 4x unroll over K to hide FMA latency

Wired via direct extern symbol (no MLAS_PLATFORM field required) to match the ARM64 integration, kept minimal to stay focused.

Measured K3 (SpacemiT X100, 8T): PPOCRv4-det -8% end-to-end (CNN 1x1 conv trails benefit). Transformer models unchanged within noise.

Signed-off-by: qiurui144 <happyqiurui@163.com>
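The kernel strategy above can be sketched in portable C++ as follows. This is a scalar model under stated assumptions, not the code in sgemv_kernel_rvv.cpp: the name `GemvModel`, its signature, the row-major K x N layout of B, the use of four independent accumulators, and the choice to overwrite `y` rather than accumulate into it are all illustrative guesses. `vlmax` stands in for the lane count that `vsetvli` would return.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of an M=1 GEMV, y = a * B, with B stored row-major (K x N).
// Hypothetical signature; this is not the MlasGemvFloatKernel prototype.
void GemvModel(const float* a, const float* b, float* y,
               size_t k_count, size_t n_count, size_t ldb, size_t vlmax) {
    assert(vlmax > 0 && ldb >= n_count);
    std::vector<float> acc0(vlmax), acc1(vlmax), acc2(vlmax), acc3(vlmax);
    for (size_t n = 0; n < n_count;) {
        const size_t vl = std::min(vlmax, n_count - n);  // models vsetvli
        std::fill(acc0.begin(), acc0.end(), 0.0f);
        std::fill(acc1.begin(), acc1.end(), 0.0f);
        std::fill(acc2.begin(), acc2.end(), 0.0f);
        std::fill(acc3.begin(), acc3.end(), 0.0f);
        size_t k = 0;
        // 4x unroll over K: four independent accumulators, so consecutive
        // multiply-adds do not wait on each other's results.
        for (; k + 4 <= k_count; k += 4) {
            for (size_t lane = 0; lane < vl; ++lane) {
                acc0[lane] += a[k + 0] * b[(k + 0) * ldb + n + lane];
                acc1[lane] += a[k + 1] * b[(k + 1) * ldb + n + lane];
                acc2[lane] += a[k + 2] * b[(k + 2) * ldb + n + lane];
                acc3[lane] += a[k + 3] * b[(k + 3) * ldb + n + lane];
            }
        }
        for (; k < k_count; ++k) {  // K remainder (fewer than 4 rows left)
            for (size_t lane = 0; lane < vl; ++lane) {
                acc0[lane] += a[k] * b[k * ldb + n + lane];
            }
        }
        for (size_t lane = 0; lane < vl; ++lane) {
            y[n + lane] = (acc0[lane] + acc1[lane]) + (acc2[lane] + acc3[lane]);
        }
        n += vl;
    }
}
```

Splitting the sum across accumulators changes the floating-point summation order relative to a naive loop, so results can differ in the last bits for general inputs.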
Author comment: @microsoft-github-policy-service agree
Summary
Adds two RVV-vectorised kernels to the MLAS RISC-V 64 path, follow-up to #28261:
- qgemm_kernel_rvv.cpp (452 lines): standard-RVV vwmulu.vv + vwaddu.wv widening pattern, dynamic vsetvli, works for any VLEN ≥ 128 without rebuild.
- riscv64/sgemv_kernel_rvv.cpp (86 lines): RVV M=1 GEMV kernel, LMUL=m4 (32 floats/vector at VLEN=256, scales with VLEN), 4× K-unroll for FMA latency hiding.

This is the first follow-up PR in the series @velonica0 mentioned in #28261 ("I will work with @qiurui144 to upstream the remaining optimized kernels in a series of subsequent PRs").
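The "32 floats/vector at VLEN=256" figure for the GEMV kernel follows from the RVV relation VLMAX = (VLEN / SEW) × LMUL. A small compile-time check of that arithmetic, where the helper name `Vlmax` is hypothetical and only integer LMUL values are modelled:

```cpp
#include <cstddef>

// VLMAX = (VLEN / SEW) * LMUL for integer LMUL; all sizes are in bits.
constexpr std::size_t Vlmax(std::size_t vlen_bits, std::size_t sew_bits, std::size_t lmul) {
    return vlen_bits / sew_bits * lmul;
}

// FP32 (SEW=32) at LMUL=m4: 32 lanes on a VLEN=256 part, 16 lanes at VLEN=128.
static_assert(Vlmax(256, 32, 4) == 32, "VLEN=256, e32, m4");
static_assert(Vlmax(128, 32, 4) == 16, "VLEN=128, e32, m4");
```

Since `vsetvli` reports the granted length at run time, a kernel written against it does not need these values baked in; the check only illustrates where the quoted number comes from.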
Hardware compatibility
-O3 -march=rv64gcv).
Build
onnxruntime_USE_RVV=ON (introduced by #28261, opt-in via the --enable_rvv flag) automatically builds the new kernels with -march=rv64gcv. No new cmake option.
Benchmarks
K3 X100 (8 threads, governor=performance, cooldown ≤65 °C, 10 reps), comparing main (#28261) vs develop (#28261 + this PR):
bge-small-zh-v1.5 INT8 (where the new INT8 kernel applies)
PPOCRv4 (CNN 1×1 conv triggers GEMV path)
FP32 transformer baseline (no regression)
#28261 already added the FP32 SGEMM RVV kernel, so FP32 transformer P50 is unchanged within noise:
Hardware without V does not define __riscv_vector and never reaches the new dispatch; the existing scalar fallback path is unchanged.
Files changed
- cmake/onnxruntime_mlas.cmake
- onnxruntime/core/mlas/lib/mlasi.h: MlasGemvFloatKernel for RISCV64
- onnxruntime/core/mlas/lib/qgemm.h: MLAS_TARGET_RISCV64
- onnxruntime/core/mlas/lib/qgemm_kernel_rvv.cpp
- onnxruntime/core/mlas/lib/sgemm.cpp
- onnxruntime/core/mlas/lib/riscv64/sgemv_kernel_rvv.cpp

Total: +550 / 0 (no deletions).
Related