2.0.14
CPU backend — per-ISA variant compute tests
Every supported instruction set now runs as its own test with a labeled output
(e.g. "Single-precision compute (SSE2)", "(AVX2+FMA)", "(AVX-512)"), so users
can directly compare instruction set performance on the same hardware.
ARM matrix engines use the instruction name rather than the feature name:
BFMMLA (FEAT_BF16) and SMMLA (FEAT_I8MM), paralleling x86 "AMX".
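As a rough illustration of the labeling scheme described above, the sketch below builds a per-variant test label from a base name and an ISA identifier. The enum `IsaVariant` and the helpers `isaSuffix` and `testLabel` are hypothetical names chosen for this example; they are not taken from the clpeak source, and the real implementation may differ.

```cpp
#include <string>

// Hypothetical ISA identifiers; illustrative only, not clpeak's own names.
enum class IsaVariant { SSE2, AVX2_FMA, AVX512, AMX, BFMMLA, SMMLA };

// Map a variant to the suffix shown in the test label.
// ARM matrix engines use the instruction name, not the feature name.
inline const char* isaSuffix(IsaVariant v) {
    switch (v) {
        case IsaVariant::SSE2:     return "SSE2";
        case IsaVariant::AVX2_FMA: return "AVX2+FMA";
        case IsaVariant::AVX512:   return "AVX-512";
        case IsaVariant::AMX:      return "AMX";
        case IsaVariant::BFMMLA:   return "BFMMLA";  // feature: FEAT_BF16
        case IsaVariant::SMMLA:    return "SMMLA";   // feature: FEAT_I8MM
    }
    return "unknown";
}

// Compose the label, e.g. base name plus "(SSE2)".
inline std::string testLabel(const std::string& base, IsaVariant v) {
    return base + " (" + isaSuffix(v) + ")";
}
```

Calling `testLabel("Single-precision compute", IsaVariant::AVX512)` would yield a label in the same shape as the examples quoted above.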
AOT-compiled GPU binaries
CUDA and ROCm kernels are now ahead-of-time compiled and embedded in the
binary. No runtime kernel compilation — works with driver-only installs
(no NVCC/HIP SDK needed to run). New CMake options:
CLPEAK_ENABLE_CUDA_GEMM / CLPEAK_ENABLE_ROCM_GEMM: optionally link
cuBLASLt (CUDA) or rocBLAS + hipBLASLt (ROCm) for vendor GEMM peak tests
Packaging
- Flathub — Flatpak manifest + AppStream metainfo (Vulkan + OpenCL + CPU)
- Homebrew — formula for macOS and Linuxbrew
- Snap — rebuilt with clang, ELF patching enabled
Fixes
- CPU: volatile-seed fp32/fp64 coefficients, fixing -ffast-math chain collapse on non-FMA targets (SSE2/scalar reported an impossible peak above AVX2)
- CPU: gate AMX intrinsics on 64-bit target (i686 compilation fix)
- ROCm: fix bf16 kernel build on ROCm 6.4 (hip_bfloat16 → __hip_bfloat16)
- ROCm: fix rocWMMA AOT build (int8 BlockK, gfx950 unsupported)
- ROCm: fix dlopen shim compile errors
- Snap: fix binary path
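
The volatile-seed fix in the first CPU item can be sketched roughly as follows. This is a minimal illustration of the general technique, not the actual clpeak kernel: the names `g_seedMul`, `g_seedAdd`, and `madChain`, and the coefficient values, are assumptions made for this example. The idea is that coefficients read from `volatile` storage are not compile-time constants, so the compiler has less room to fold a long multiply-add chain into a handful of operations under -ffast-math.

```cpp
#include <cstddef>

// Volatile seeds: every read must happen at run time, so the compiler
// cannot treat the coefficients as known constants.
// Values are arbitrary and chosen only for illustration.
static volatile float g_seedMul = 1.0001f;
static volatile float g_seedAdd = 0.0001f;

// Hypothetical kernel: a dependent multiply-add chain of `iters` steps.
float madChain(float x, std::size_t iters) {
    // One volatile read per coefficient, taken before the loop so the
    // loop body itself contains only arithmetic.
    const float a = g_seedMul;
    const float b = g_seedAdd;
    for (std::size_t i = 0; i < iters; ++i) {
        x = x * a + b;  // each step depends on the previous result
    }
    return x;
}
```

With literal constants in place of the volatile reads, an optimizer running under -ffast-math is free to simplify the chain algebraically, which is one plausible way a non-FMA variant could report a peak higher than AVX2.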
CI
- Linux ARM64 + Windows x64 CUDA builds
- CUDA stub symlink for GPU-less runners
- oneAPI setvars.sh sourcing
Breaking
- Removed bmma_b1 CUDA test
Full Changelog: 2.0.13...2.0.14