Release 2.0.11 · krrishnarraj/clpeak

New: Native CPU Backend

clpeak now benchmarks the host CPU as a first-class backend, with no external SDK or driver dependency. The backend uses std::thread and compiles ISA-optimised kernels for each platform at build time, selecting the widest available variant at runtime without requiring a per-feature build.

What it measures

| Category | Tests |
| --- | --- |
| Compute – floating-point | SP (fp32), DP (fp64), HP (fp16), BF16, Mixed-precision (FP16FML / FMLAL) |
| Compute – integer | INT32, INT8 dot-product (SMMLA / UDOT / VPDPBUSD) |
| Matrix engines | AMX-INT8 / AMX-BF16 (x86), SMMLA int8 / BFMMLA bf16 (AArch64) |
| Bandwidth | L1, L2, L3 (per-core and aggregate), DRAM (read, write, copy) |
| Latency | Pointer-chase memory latency |
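
For readers unfamiliar with the pointer-chase technique named in the latency row, here is a minimal C++ sketch of the general idea. It is not clpeak's kernel: the names `make_chase_ring`, `chase`, and `chase_latency_ns` are hypothetical, and the ring layout (indices rather than raw pointers, shuffled with Sattolo's algorithm) is an assumption chosen for clarity. Each load depends on the previous one, so the time per step approximates the latency of whichever memory level the ring fits in.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a ring where ring[i] is the index to visit after i.
// Sattolo's algorithm produces one cycle covering all n slots, so the
// chase cannot fall into a short loop that fits in a smaller cache level.
std::vector<std::size_t> make_chase_ring(std::size_t n, unsigned seed) {
    std::vector<std::size_t> ring(n);
    std::iota(ring.begin(), ring.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    for (std::size_t i = n; i > 1; --i) {
        // Swap slot i-1 with a slot strictly below it (never with itself).
        std::uniform_int_distribution<std::size_t> pick(0, i - 2);
        std::swap(ring[i - 1], ring[pick(rng)]);
    }
    return ring;
}

// Follow the ring for `steps` dependent loads, starting at slot 0.
std::size_t chase(const std::vector<std::size_t>& ring, std::size_t steps) {
    std::size_t idx = 0;
    for (std::size_t s = 0; s < steps; ++s) idx = ring[idx];
    return idx;
}

// Average nanoseconds per dependent load; `steps` must be greater than zero.
double chase_latency_ns(const std::vector<std::size_t>& ring, std::size_t steps) {
    const auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t sink = chase(ring, steps);  // keep the result alive
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Sizing the ring to fit within L1, L2, L3, or well beyond the last-level cache would, under these assumptions, expose the latency of each level in turn.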

ISA coverage

| Architecture | Variants |
| --- | --- |
| x86-64 | Scalar, SSE4.2, AVX2+FMA, AVX-512F, AVX-512VNNI, AVX-512BF16, AVX-512FP16, AMX |
| AArch64 | NEON, DOTPROD, I8MM, BF16, FMLAL (FP16FML) |

Runtime dispatch probes CPUID / getauxval / sysctlbyname and silently falls back to the widest tier the host supports, so no per-ISA build flags are needed.
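
The dispatch described above can be pictured with a small C++ sketch. This is an illustration under stated assumptions, not clpeak's dispatcher: the `Tier` enum, `pick_x86_tier`, `detect_tier`, and `tier_name` are hypothetical names, only a few tiers from the table are modelled, the x86 path relies on the GCC/Clang builtin `__builtin_cpu_supports` rather than raw CPUID, and the macOS `sysctlbyname` probe is omitted.

```cpp
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMDDP
#endif

// Hypothetical subset of the tiers listed in the ISA coverage table.
enum class Tier { Scalar, Sse42, Avx2Fma, Avx512F, Neon, NeonDotProd };

// Pure selection logic: return the widest x86-64 tier whose features are present.
Tier pick_x86_tier(bool sse42, bool avx2, bool fma, bool avx512f) {
    if (avx512f) return Tier::Avx512F;
    if (avx2 && fma) return Tier::Avx2Fma;
    if (sse42) return Tier::Sse42;
    return Tier::Scalar;
}

// Probe the host once and map the result to a tier.
Tier detect_tier() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return pick_x86_tier(__builtin_cpu_supports("sse4.2"),
                         __builtin_cpu_supports("avx2"),
                         __builtin_cpu_supports("fma"),
                         __builtin_cpu_supports("avx512f"));
#elif defined(__aarch64__) && defined(__linux__)
    const unsigned long caps = getauxval(AT_HWCAP);
    return (caps & HWCAP_ASIMDDP) ? Tier::NeonDotProd : Tier::Neon;
#elif defined(__aarch64__)
    return Tier::Neon;  // NEON is baseline on AArch64; finer probing omitted here
#else
    return Tier::Scalar;
#endif
}

const char* tier_name(Tier t) {
    switch (t) {
        case Tier::Scalar:      return "scalar";
        case Tier::Sse42:       return "sse4.2";
        case Tier::Avx2Fma:     return "avx2+fma";
        case Tier::Avx512F:     return "avx512f";
        case Tier::Neon:        return "neon";
        case Tier::NeonDotProd: return "neon+dotprod";
    }
    return "unknown";
}
```

Keeping the selection logic in a pure function separate from the probe makes the fallback order easy to check without access to each class of hardware.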

Build

The backend has no external dependencies and is enabled by default. Two CMake options control it:

cmake -S . -B build -DCLPEAK_CPU_NATIVE_ARCH=OFF   # portable multi-ISA build (default)
cmake -S . -B build -DCLPEAK_ENABLE_CPU=OFF         # skip the backend entirely

CLPEAK_CPU_NATIVE_ARCH=ON compiles a single translation unit with -march=native / -mcpu=native instead of the per-ISA dispatch tree. This is useful for builds that are tuned for, and profiled on, a single machine.

CLI Changes

New flags

| Flag | Effect |
| --- | --- |
| `--cpu` | Run only the native CPU backend (combines with other `--<backend>` flags) |
| `--no-cpu` | Skip the native CPU backend |
| `--amx` / `--no-amx` | Enable / disable the matrix-engine test (AMX on x86, I8MM/SMMLA/BFMMLA on AArch64) |
| `--cache-bandwidth` / `--no-cache-bandwidth` | Enable / disable the L1/L2/L3 cache bandwidth test (CPU only) |
| `--memory-latency` / `--no-memory-latency` | Enable / disable the pointer-chase memory latency test (CPU only) |
| `--max-time-cpu ms` | Per-test time budget for the CPU backend (default: 2000 ms). Separate from `--max-time` because CPU kernels need longer runs to amortise thread-pool overhead and produce stable numbers. |
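
To illustrate what a per-test time budget means in practice, the following C++ sketch repeats a kernel until a millisecond budget is used up. It is an assumption about the general pattern, not clpeak's scheduling loop; `BudgetResult` and `run_for_budget` are hypothetical names.

```cpp
#include <chrono>
#include <cstdint>

struct BudgetResult {
    std::uint64_t iterations;  // how many times the kernel ran
    double elapsed_ms;         // wall-clock time actually spent
};

// Run `kernel` at least once, then keep repeating it until `budget_ms` has elapsed.
template <typename Kernel>
BudgetResult run_for_budget(Kernel&& kernel, double budget_ms) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::uint64_t iterations = 0;
    double elapsed_ms = 0.0;
    do {
        kernel();
        ++iterations;
        elapsed_ms =
            std::chrono::duration<double, std::milli>(clock::now() - start).count();
    } while (elapsed_ms < budget_ms);
    return {iterations, elapsed_ms};
}
```

Dividing the total work done by `elapsed_ms` gives a rate. In a loop of this shape, a longer budget spreads any fixed startup cost over more iterations, which is consistent with the rationale given for the separate CPU default.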

Changed flags

--max-time — now explicitly documented as applying to all backends except CPU. Use --max-time-cpu to tune the CPU budget independently.

Removed flags

--atomic-throughput / --no-atomic-throughput — removed together with the test itself.

Vulkan: Coopmat Tile Generalisation via Specialisation Constants

Tile dimensions M / N / K are now SPIR-V specialisation constants. On startup the backend queries the driver for every advertised (dtype, M, N, K) combination and dispatches each one individually. Previously, fp8 K=32 on NVIDIA was silently skipped by the hard-coded K=16 path; it now runs correctly.

The static coopmat_int8_k32.comp shader (hard-coded K=32) is removed; all shapes are handled by the generic shaders via spec constants.
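
As a rough picture of how one generic shader can serve every advertised shape, the C++ sketch below packs M / N / K into a specialisation-constant payload per shape. Everything here is illustrative: `SpecMapEntry` is a local stand-in with the same fields as Vulkan's `VkSpecializationMapEntry` so the example builds without the Vulkan SDK, `TileShape`, `TileSpec`, `make_tile_spec`, and `specs_for` are hypothetical names, and the constant IDs 0, 1, 2 for M, N, K are an assumption rather than the IDs clpeak's shaders use.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in mirroring the fields of VkSpecializationMapEntry.
struct SpecMapEntry {
    std::uint32_t constantID;  // constant_id declared in the shader
    std::uint32_t offset;      // byte offset of the value in the data blob
    std::size_t size;          // byte size of the value
};

// Hypothetical description of one tile shape reported by the driver.
struct TileShape {
    std::uint32_t m, n, k;
};

// Payload for one pipeline: the three values plus the map describing them.
struct TileSpec {
    std::array<std::uint32_t, 3> data;
    std::array<SpecMapEntry, 3> entries;
};

TileSpec make_tile_spec(const TileShape& s) {
    TileSpec spec{};
    spec.data = {s.m, s.n, s.k};
    for (std::uint32_t id = 0; id < 3; ++id) {
        spec.entries[id] = {id,
                            static_cast<std::uint32_t>(id * sizeof(std::uint32_t)),
                            sizeof(std::uint32_t)};
    }
    return spec;
}

// One payload per advertised shape, so no shape is skipped by a fixed-K path.
std::vector<TileSpec> specs_for(const std::vector<TileShape>& advertised) {
    std::vector<TileSpec> out;
    out.reserve(advertised.size());
    for (const auto& s : advertised) out.push_back(make_tile_spec(s));
    return out;
}
```

In a real Vulkan integration, the `entries` and `data` arrays would be referenced from a `VkSpecializationInfo` when each compute pipeline is created.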

Removed: Atomic Throughput

The Atomic throughput test (int_global / int_local / float_global) has been removed from all backends — OpenCL, Vulkan, CUDA, ROCm/HIP, Metal, and oneAPI/SYCL. Results were not comparable across stacks and the test did not reflect any workload-relevant bottleneck.

oneAPI / SYCL Fixes

Queue self-recovery — the backend now re-creates the SYCL queue after a device fault instead of aborting the process.
Authoritative XMX detection — replaces the previous heuristic; correctly identifies XMX capability on Arc and other Intel GPUs.
oneMKL context isolation — each dtype now runs on a private SYCL context, stopping a fault in one GEMM variant from cascading into subsequent dtypes.
fp64 GEMM tile reduced — smaller tile avoids triggering the GPU watchdog on Arc hardware.

Build Fixes

Linux: prefer clang — the CPU backend CMakeLists now prefers clang/clang++ over GCC on Linux when both are present, for better SIMD codegen.

Full Changelog: 2.0.10...2.0.11
