MpGEMM

Demystifying ARM SME to Optimize General Matrix Multiplications.

This paper appears in the Proceedings of the 40th IEEE International Parallel & Distributed Processing Symposium (IPDPS '26), and you can find it here. If you find our work useful for your research, please consider citing the paper and starring the repository. If you have any questions, feel free to contact us at chenchengdeng@nudt.edu.cn.

Abstract

MpGEMM is an open-source library that exploits the key architectural features of ARM SME to optimize GEMM across multiple precisions. Through a systematic architectural characterization of SME, we derive practical optimization guidelines that directly inform the library design. MpGEMM integrates cache-aware partitioning, efficient data packing with on-the-fly transposition, and specialized micro-kernels that use multi-vector loads and all available tile registers.
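To make the three ingredients concrete, the following is a minimal Python sketch of a blocked GEMM that packs its operands and accumulates rank-1 (outer-product) updates, which is the form of computation SME's outer-product instructions perform into tile registers. It is not the library's implementation: the function names, the block sizes `MC`, `NC`, and `KC`, and the choice to transpose the A block during packing are all assumptions made for illustration.

```python
# Illustrative sketch only: names and block sizes are hypothetical and are
# not taken from the MpGEMM sources.

def pack_a_transposed(A, i0, p0, mc, kc):
    """Copy an mc x kc block of row-major A into a K-major (transposed) buffer.

    packed[p] holds the column slice A[i0:i0+mc][p0+p], i.e. the contiguous
    vector that one outer-product step consumes.
    """
    return [[A[i0 + i][p0 + p] for i in range(mc)] for p in range(kc)]


def pack_b(B, p0, j0, kc, nc):
    """Copy a kc x nc block of row-major B into a contiguous buffer."""
    return [[B[p0 + p][j0 + j] for j in range(nc)] for p in range(kc)]


def micro_kernel(a_pack, b_pack, C, i0, j0):
    """Accumulate a sum of rank-1 (outer-product) updates into a block of C."""
    for a_col, b_row in zip(a_pack, b_pack):
        for i, a in enumerate(a_col):
            row = C[i0 + i]
            for j, b in enumerate(b_row):
                row[j0 + j] += a * b


def blocked_gemm(A, B, M, N, K, MC=4, NC=4, KC=4):
    """Compute C = A @ B, blocked over N, K, and M (hypothetical block sizes)."""
    C = [[0.0] * N for _ in range(M)]
    for j0 in range(0, N, NC):
        nc = min(NC, N - j0)
        for p0 in range(0, K, KC):
            kc = min(KC, K - p0)
            b_pack = pack_b(B, p0, j0, kc, nc)      # reused across all row blocks
            for i0 in range(0, M, MC):
                mc = min(MC, M - i0)
                a_pack = pack_a_transposed(A, i0, p0, mc, kc)
                micro_kernel(a_pack, b_pack, C, i0, j0)
    return C
```

In a real kernel the block sizes would be chosen from the cache hierarchy and the tile dimensions, and the inner loops would map to vector loads and outer-product instructions rather than scalar Python.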

How to use

cd src
make
cd ../benchmark
make
./correct.x
./singlePerformance.x
./multiPerformance.x
./irregular.x

You need to adjust the values of the arrays Me, Ke, and Ne to evaluate different workloads.
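As a rough illustration of how such workload arrays are typically consumed, the sketch below walks three parallel lists and reports GFLOPS for each shape. It assumes that entry `i` of `Me`, `Ne`, and `Ke` gives the M, N, and K dimensions of one GEMM; the shapes shown are placeholders rather than the ones used in the paper, and `naive_gemm` and `run_workloads` are hypothetical helpers, not part of the benchmark sources.

```python
import time

# Placeholder workload lists (assumption: entry i is one M x N x K problem).
Me = [64, 8, 128]
Ke = [64, 256, 32]
Ne = [64, 256, 8]


def gemm_flops(m, n, k):
    """One multiply and one add for every (i, j, p) triple."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Convert a measured run time into GFLOPS for an m x n x k GEMM."""
    return gemm_flops(m, n, k) / seconds / 1e9


def naive_gemm(A, B, m, n, k):
    """Reference triple loop, standing in for the routine under test."""
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C


def run_workloads(Me, Ne, Ke, gemm):
    """Time gemm once per shape and return (m, n, k, GFLOPS) tuples."""
    results = []
    for m, n, k in zip(Me, Ne, Ke):
        A = [[1.0] * k for _ in range(m)]
        B = [[1.0] * n for _ in range(k)]
        start = time.perf_counter()
        gemm(A, B, m, n, k)
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        results.append((m, n, k, gflops(m, n, k, elapsed)))
    return results


if __name__ == "__main__":
    for m, n, k, rate in run_workloads(Me, Ne, Ke, naive_gemm):
        print(f"M={m:4d} N={n:4d} K={k:4d}  {rate:.4f} GFLOPS")
```

Adding a shape means appending one value to each of the three lists so they stay the same length.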

Platform and Workloads

To reflect practical use cases, we benchmark GEMM workloads derived from DeepSeek and LLaMA. The workloads include both large and skinny matrices, enabling us to evaluate performance across diverse aspect ratios.

Performance

We compare MpGEMM against four SME-enabled GEMM implementations: Apple’s Accelerate, LIBXSMM, KleidiAI, and OpenBLAS. Experimental results show that MpGEMM achieves an average speedup of 1.23× over the vendor-optimized BLAS library in Accelerate and significantly outperforms the other open-source libraries.
