
2.0.14


@krrishnarraj released this 23 Jun 10:56
· 12 commits to master since this release
72c11fd

CPU backend — per-ISA variant compute tests

Every supported instruction set now runs as its own test with a labeled output
(e.g. "Single-precision compute (SSE2)", "(AVX2+FMA)", "(AVX-512)"), so users
can directly compare instruction set performance on the same hardware.

ARM matrix engine tests are labeled with the instruction name rather than the
feature name: BFMMLA (FEAT_BF16) and SMMLA (FEAT_I8MM), paralleling x86 "AMX".
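
The per-ISA labeling described above can be pictured with a small sketch. This is only an illustration: `IsaVariant` and `runnable_labels` are hypothetical names, not taken from the clpeak source, and the `supported` flag is assumed to come from some runtime CPU feature check that is not shown here.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one instruction-set variant of a test.
struct IsaVariant {
    std::string label;  // e.g. "SSE2", "AVX2+FMA", "AVX-512"
    bool supported;     // assumed result of a runtime CPU feature check
};

// Builds one labeled test name per variant the CPU supports,
// skipping the variants it does not support.
std::vector<std::string> runnable_labels(const std::vector<IsaVariant>& variants) {
    std::vector<std::string> out;
    for (const auto& v : variants) {
        if (v.supported) {
            out.push_back("Single-precision compute (" + v.label + ")");
        }
    }
    return out;
}
```

Keeping each variant as its own entry, rather than picking only the widest supported one, is what lets the results for different instruction sets be compared on the same machine.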

AOT-compiled GPU binaries

CUDA and ROCm kernels are now ahead-of-time compiled and embedded in the
binary. Kernels are no longer compiled at runtime, so it works with
driver-only installs (no NVCC/HIP SDK needed to run). New CMake options:

  • CLPEAK_ENABLE_CUDA_GEMM / CLPEAK_ENABLE_ROCM_GEMM — optionally link
    cuBLASLt / rocBLAS + hipBLASLt for vendor GEMM peak tests

Packaging

  • Flathub — Flatpak manifest + AppStream metainfo (Vulkan + OpenCL + CPU)
  • Homebrew — formula for macOS and Linuxbrew
  • Snap — rebuilt with clang, ELF patching enabled

Fixes

  • CPU: volatile-seed fp32/fp64 coefficients; fixes -ffast-math chain collapse
    on non-FMA targets (SSE2/scalar reported an impossible peak, above AVX2)
  • CPU: gate AMX intrinsics on 64-bit target (i686 compilation fix)
  • ROCm: fix bf16 kernel build on ROCm 6.4 (hip_bfloat16 → __hip_bfloat16)
  • ROCm: fix rocWMMA AOT build (int8 BlockK, gfx950 unsupported)
  • ROCm: fix dlopen shim compile errors
  • Snap: fix binary path
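
The volatile-seed fix in the first CPU item can be sketched roughly as follows. This is a minimal illustration, not the clpeak implementation: `seed_mul`, `seed_add`, and `mad_chain` are hypothetical names. The assumption is that with literal coefficients, -ffast-math may let the compiler fold a long multiply-add chain into fewer operations, which inflates the measured rate; reading the coefficients from volatile variables keeps them unknown at compile time.

```cpp
#include <cstddef>

// Hypothetical coefficient seeds; the names are illustrative, not from clpeak.
static volatile float seed_mul = 1.0001f;
static volatile float seed_add = 0.0001f;

// Runs a dependent multiply-add chain of `iters` steps starting from x.
float mad_chain(float x, std::size_t iters) {
    // Each volatile read must happen at runtime, so the optimizer cannot
    // treat the coefficients as compile-time constants.
    const float a = seed_mul;
    const float b = seed_add;
    for (std::size_t i = 0; i < iters; ++i) {
        x = x * a + b;  // each step depends on the previous result
    }
    return x;
}
```

The seeds are read once before the loop, so the volatile accesses themselves stay out of the timed inner loop.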

CI

  • Linux ARM64 + Windows x64 CUDA builds
  • CUDA stub symlink for GPU-less runners
  • oneAPI setvars.sh sourcing

Breaking

  • Removed bmma_b1 CUDA test

Full Changelog: 2.0.13...2.0.14