Release 2.0.14 · krrishnarraj/clpeak

CPU backend — per-ISA variant compute tests

Every supported instruction set now runs as its own test with a labeled output
(e.g. "Single-precision compute (SSE2)", "(AVX2+FMA)", "(AVX-512)"), so users
can directly compare instruction set performance on the same hardware.
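
As a rough illustration of per-ISA labeling, the sketch below detects which x86 instruction sets the host supports and builds one label per runnable variant. It is not clpeak's code: IsaVariant, variant_label, detect_x86_variants, and runnable_labels are hypothetical names. It also assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin is available; on other targets the list is simply empty.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record for one ISA-specific test variant (not clpeak's type).
struct IsaVariant {
    std::string isa;   // label suffix, e.g. "SSE2"
    bool supported;    // whether the host CPU reports the feature
};

// Builds a label in the form quoted in the release notes,
// e.g. "Single-precision compute (SSE2)".
std::string variant_label(const std::string& test, const std::string& isa) {
    assert(!isa.empty());
    return test + " (" + isa + ")";
}

// Queries the host CPU. Assumes the GCC/Clang x86 builtins; returns an
// empty list on any other compiler or architecture.
std::vector<IsaVariant> detect_x86_variants() {
    std::vector<IsaVariant> variants;
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    variants.push_back({"SSE2", __builtin_cpu_supports("sse2") != 0});
    variants.push_back({"AVX2+FMA", __builtin_cpu_supports("avx2") &&
                                    __builtin_cpu_supports("fma")});
    variants.push_back({"AVX-512", __builtin_cpu_supports("avx512f") != 0});
#endif
    return variants;
}

// One label per variant the host can actually run.
std::vector<std::string> runnable_labels(const std::string& test) {
    std::vector<std::string> labels;
    for (const IsaVariant& v : detect_x86_variants()) {
        if (v.supported) {
            labels.push_back(variant_label(test, v.isa));
        }
    }
    return labels;
}
```

Calling runnable_labels("Single-precision compute") on an AVX2-capable machine without AVX-512 would, under these assumptions, yield the SSE2 and AVX2+FMA labels only.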

ARM matrix engines use the instruction name rather than the feature name:
BFMMLA (FEAT_BF16) and SMMLA (FEAT_I8MM), paralleling x86 "AMX".

AOT-compiled GPU binaries

CUDA and ROCm kernels are now ahead-of-time compiled and embedded in the
binary. Nothing is compiled at run time, so clpeak works with driver-only
installs (no NVCC or HIP SDK is needed to run it). New CMake options:

CLPEAK_ENABLE_CUDA_GEMM / CLPEAK_ENABLE_ROCM_GEMM — optionally link
cuBLASLt / rocBLAS + hipBLASLt for vendor GEMM peak tests

Packaging

Flathub — Flatpak manifest + AppStream metainfo (Vulkan + OpenCL + CPU)
Homebrew — formula for macOS and Linuxbrew
Snap — rebuilt with clang, ELF patching enabled

Fixes

CPU: seed fp32/fp64 coefficients through volatile; fixes the -ffast-math chain
collapse on non-FMA targets (SSE2 and scalar variants reported an impossible
peak above AVX2)
CPU: gate AMX intrinsics on 64-bit target (i686 compilation fix)
ROCm: fix bf16 kernel build on ROCm 6.4 (hip_bfloat16 → __hip_bfloat16)
ROCm: fix rocWMMA AOT build (int8 BlockK, gfx950 unsupported)
ROCm: fix dlopen shim compile errors
Snap: fix binary path
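
The volatile-seed fix listed above can be pictured with a minimal sketch, which is not the project's actual kernel. When the coefficients of a multiply-add chain are compile-time constants, -ffast-math lets the compiler simplify the chain algebraically, so the benchmark executes fewer operations than it counts. Reading each coefficient through a volatile makes its value unknown at compile time. The names opaque_seed and mad_chain are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round-trips a value through a volatile object so the
// compiler must load it at run time and cannot fold it as a constant.
static float opaque_seed(float value) {
    volatile float seed = value;
    return seed;
}

// Hypothetical benchmark body: a dependent multiply-add chain. With literal
// coefficients, x * 1.0f + 0.5f repeated iters times could be reduced to a
// single expression under -ffast-math; with opaque coefficients every
// iteration depends on values only known at run time.
float mad_chain(float x, std::size_t iters) {
    assert(iters > 0);
    const float a = opaque_seed(1.0f);
    const float b = opaque_seed(0.5f);
    for (std::size_t i = 0; i < iters; ++i) {
        x = x * a + b;  // may compile to an FMA where the target has one
    }
    return x;
}
```

The volatile read happens once, outside the loop, so it should not add measurable overhead to the timed chain itself.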

CI

Linux ARM64 + Windows x64 CUDA builds
CUDA stub symlink for GPU-less runners
oneAPI setvars.sh sourcing

Breaking

Removed bmma_b1 CUDA test

Full Changelog: 2.0.13...2.0.14
