2.0.11
New: Native CPU Backend
clpeak now benchmarks the host CPU as a first-class backend, with no external SDK or driver dependency. The backend uses std::thread and compiles ISA-optimised kernels for each platform at build time, selecting the widest available variant at runtime without requiring a per-feature build.
What it measures
| Category | Tests |
|---|---|
| Compute – floating-point | SP (fp32), DP (fp64), HP (fp16), BF16, Mixed-precision (FP16FML / FMLAL) |
| Compute – integer | INT32, INT8 dot-product (SMMLA / UDOT / VPDPBUSD) |
| Matrix engines | AMX-INT8 / AMX-BF16 (x86), SMMLA int8 / BFMMLA bf16 (AArch64) |
| Bandwidth | L1, L2, L3 (per-core and aggregate), DRAM (read, write, copy) |
| Latency | Pointer-chase memory latency |
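As a rough illustration of what a pointer-chase latency test does, the sketch below builds a single-cycle random ring and walks it with dependent loads. It is not clpeak's kernel: the names `make_chase_ring` and `chase_ns_per_load` are hypothetical, and the use of Sattolo's algorithm to build the ring is an assumption chosen because it guarantees one cycle covering every slot.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: build a random permutation that forms ONE cycle
// (Sattolo's algorithm), so following next[i] visits every slot exactly
// once before returning to the start.
std::vector<std::uint32_t> make_chase_ring(std::size_t n, std::uint64_t seed) {
    std::vector<std::uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);
    if (n < 2) return next;  // nothing to shuffle
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Pick j strictly below i; this is what makes the result a single cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Hypothetical helper: perform `steps` dependent loads and return the
// average nanoseconds per load. Each load's address comes from the
// previous load, so the loads cannot be overlapped.
double chase_ns_per_load(const std::vector<std::uint32_t>& next,
                         std::size_t steps, std::uint32_t* sink) {
    std::uint32_t idx = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    const auto t1 = std::chrono::steady_clock::now();
    *sink = idx;  // keep the result live so the loop is not optimised away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Sizing the ring to fit in L1, L2, L3, or well beyond the last-level cache would, under these assumptions, expose the latency of each level in turn.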
ISA coverage
| Architecture | Variants |
|---|---|
| x86-64 | Scalar, SSE4.2, AVX2+FMA, AVX-512F, AVX-512VNNI, AVX-512BF16, AVX-512FP16, AMX |
| AArch64 | NEON, DOTPROD, I8MM, BF16, FMLAL (FP16FML) |
Runtime dispatch probes CPUID / getauxval / sysctlbyname and silently falls back to the widest supported tier — no build flag gymnastics needed.
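The "widest supported tier" selection can be sketched as follows. This is a minimal illustration covering only a subset of the x86-64 tiers, not clpeak's dispatch code: `X86Tier`, `X86Features`, `select_tier`, and `probe_host` are hypothetical names, and the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` stand in for a raw CPUID query.

```cpp
// Hypothetical tier list for illustration; ordered from narrowest to widest.
enum class X86Tier { Scalar, SSE42, AVX2_FMA, AVX512F };

// Hypothetical feature record filled in by the probe.
struct X86Features {
    bool sse42;
    bool avx2;
    bool fma;
    bool avx512f;
};

// Choose the widest tier whose prerequisites are all present.
inline X86Tier select_tier(const X86Features& f) {
    if (f.avx512f) return X86Tier::AVX512F;
    if (f.avx2 && f.fma) return X86Tier::AVX2_FMA;  // this tier needs both
    if (f.sse42) return X86Tier::SSE42;
    return X86Tier::Scalar;
}

// Probe the host CPU. On non-x86 targets or other compilers this sketch
// reports no features, so select_tier() returns the scalar tier.
inline X86Features probe_host() {
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return {__builtin_cpu_supports("sse4.2") != 0,
            __builtin_cpu_supports("avx2") != 0,
            __builtin_cpu_supports("fma") != 0,
            __builtin_cpu_supports("avx512f") != 0};
#else
    return {false, false, false, false};
#endif
}
```

A caller would run `select_tier(probe_host())` once at startup and store the result, for example as a function pointer to the matching kernel. Keeping the selection separate from the probe makes the fallback order testable without the target hardware.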
Build
The backend has no external dependencies and is enabled by default. The relevant CMake options are:
```
cmake -S . -B build -DCLPEAK_CPU_NATIVE_ARCH=OFF   # portable multi-ISA build (default)
cmake -S . -B build -DCLPEAK_ENABLE_CPU=OFF        # skip the backend entirely
```
CLPEAK_CPU_NATIVE_ARCH=ON compiles a single TU with -march=native / -mcpu=native instead of the per-ISA dispatch tree, which is useful for tightly-profiled single-machine builds.
CLI Changes
New flags
| Flag | Effect |
|---|---|
| --cpu | Run only the native CPU backend (combines with other --<backend> flags) |
| --no-cpu | Skip the native CPU backend |
| --amx / --no-amx | Enable / disable the matrix-engine test (AMX on x86, I8MM/SMMLA/BFMMLA on AArch64) |
| --cache-bandwidth / --no-cache-bandwidth | Enable / disable the L1/L2/L3 cache bandwidth test (CPU only) |
| --memory-latency / --no-memory-latency | Enable / disable the pointer-chase memory latency test (CPU only) |
| --max-time-cpu ms | Per-test time budget for the CPU backend (default: 2000 ms). Separate from --max-time because CPU kernels need longer runs to amortise thread-pool overhead and produce stable numbers. |
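To make the idea of a per-test time budget concrete, here is a minimal sketch of a loop that keeps invoking a kernel until the budget has elapsed. It is an assumption about how such a budget could be enforced, not clpeak's implementation; `BudgetResult` and `run_for_budget` are hypothetical names.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical result record: how many kernel invocations fit in the
// budget and how long they actually took.
struct BudgetResult {
    std::uint64_t iterations;
    double seconds;
};

// Run `kernel` at least once, then repeat until `budget` has elapsed.
// A longer budget spreads fixed costs (such as waking worker threads)
// over more iterations, which steadies the per-iteration figure.
template <typename Kernel>
BudgetResult run_for_budget(Kernel&& kernel, std::chrono::milliseconds budget) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    const auto deadline = start + budget;
    std::uint64_t iterations = 0;
    do {
        kernel();
        ++iterations;
    } while (clock::now() < deadline);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return {iterations, elapsed.count()};
}
```

Throughput would then be derived as work-per-iteration times `iterations` divided by `seconds`. With the documented default, the budget argument would be `std::chrono::milliseconds(2000)`.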
Changed flags
--max-time: now explicitly documented as applying to all backends except CPU. Use --max-time-cpu to tune the CPU budget independently.
Removed flags
--atomic-throughput / --no-atomic-throughput: removed together with the test itself.
Vulkan: Coopmat Tile Generalisation via Specialisation Constants
Tile dimensions M / N / K are now SPIR-V specialisation constants. On startup the backend queries the driver for every advertised (dtype, M, N, K) combination and dispatches each one individually. Previously, fp8 K=32 on NVIDIA was silently skipped by the hard-coded K=16 path; it now runs correctly.
The static coopmat_int8_k32.comp shader (hard-coded K=32) has been removed; the generic shaders now handle all shapes via specialisation constants.
Removed: Atomic Throughput
The Atomic throughput test (int_global / int_local / float_global) has been removed from all backends — OpenCL, Vulkan, CUDA, ROCm/HIP, Metal, and oneAPI/SYCL. Results were not comparable across stacks and the test did not reflect any workload-relevant bottleneck.
oneAPI / SYCL Fixes
- Queue self-recovery — the backend now re-creates the SYCL queue after a device fault instead of aborting the process.
- Authoritative XMX detection — replaces the previous heuristic; correctly identifies XMX capability on Arc and other Intel GPUs.
- oneMKL context isolation — each dtype now runs on a private SYCL context, stopping a fault in one GEMM variant from cascading into subsequent dtypes.
- fp64 GEMM tile reduced — smaller tile avoids triggering the GPU watchdog on Arc hardware.
Build Fixes
- Linux: prefer clang. The CPU backend CMakeLists now prefers clang/clang++ over GCC on Linux when both are present, for better SIMD codegen.
Full Changelog: 2.0.10...2.0.11