
2.0.11

krrishnarraj released this 09 Jun 15:59
· 20 commits to master since this release
78ecbe8

New: Native CPU Backend

clpeak now benchmarks the host CPU as a first-class backend, with no external SDK or driver dependency. The backend uses std::thread, compiles ISA-optimised kernel variants for each platform at build time, and selects the widest available variant at runtime, so no per-feature build is required.

What it measures

| Category | Tests |
| --- | --- |
| Compute – floating-point | SP (fp32), DP (fp64), HP (fp16), BF16, Mixed-precision (FP16FML / FMLAL) |
| Compute – integer | INT32, INT8 dot-product (SMMLA / UDOT / VPDPBUSD) |
| Matrix engines | AMX-INT8 / AMX-BF16 (x86), SMMLA int8 / BFMMLA bf16 (AArch64) |
| Bandwidth | L1, L2, L3 (per-core and aggregate), DRAM (read, write, copy) |
| Latency | Pointer-chase memory latency |
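
For readers unfamiliar with the pointer-chase technique named in the Latency row, here is a minimal sketch of the general idea. It is not clpeak's implementation: the function names, the use of Sattolo's algorithm to build the ring, and the timing helper are all assumptions made for illustration. Each load depends on the previous one, so the average time per hop approximates load-to-use latency for whichever level of the hierarchy the ring fits in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: build a random permutation that forms ONE cycle
// (Sattolo's algorithm), so following next[i] from slot 0 visits every
// slot exactly once before returning to 0.
std::vector<std::size_t> make_chase_ring(std::size_t n, unsigned seed) {
    assert(n >= 2);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the ring for `steps` dependent loads and return the final index.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t idx = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        idx = next[idx];  // the next address is unknown until this load completes
    }
    return idx;
}

// Average nanoseconds per dependent load over `steps` hops.
double chase_latency_ns(const std::vector<std::size_t>& next, std::size_t steps) {
    assert(steps > 0);
    const auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t sink = chase(next, steps);  // keep the result observable
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Sizing the ring (number of slots times `sizeof(std::size_t)`) to sit inside L1, L2, L3, or beyond the last-level cache is what would separate the latency tiers; a real benchmark would also pad each slot to a cache line and pin the thread, which this sketch omits.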

ISA coverage

| Architecture | Variants |
| --- | --- |
| x86-64 | Scalar, SSE4.2, AVX2+FMA, AVX-512F, AVX-512VNNI, AVX-512BF16, AVX-512FP16, AMX |
| AArch64 | NEON, DOTPROD, I8MM, BF16, FMLAL (FP16FML) |

Runtime dispatch probes CPUID / getauxval / sysctlbyname and silently falls back to the widest supported tier — no build flag gymnastics needed.
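
The "widest supported tier" selection can be pictured with the sketch below. It is an illustration under stated assumptions, not clpeak's dispatch code: the feature bit values, the `Tier` struct, and `select_tier` are hypothetical names, only a subset of the x86-64 tiers is modelled, and the actual probe (CPUID, getauxval, or sysctlbyname) is replaced by a plain bitmask argument.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical feature bits; a real probe would fill these from
// CPUID / getauxval / sysctlbyname.
constexpr std::uint32_t kSse42   = 1u << 0;
constexpr std::uint32_t kAvx2Fma = 1u << 1;
constexpr std::uint32_t kAvx512F = 1u << 2;

struct Tier {
    const char* name;        // label for reporting
    std::uint32_t required;  // every bit here must be present on the host
};

// Ordered from narrowest to widest; the scalar tier requires nothing.
constexpr std::array<Tier, 4> kTiers{{
    {"scalar", 0u},
    {"sse4.2", kSse42},
    {"avx2+fma", kSse42 | kAvx2Fma},
    {"avx-512f", kSse42 | kAvx2Fma | kAvx512F},
}};

// Return the last (widest) tier whose requirements are all satisfied.
const Tier& select_tier(std::uint32_t supported) {
    const Tier* best = &kTiers[0];
    for (const Tier& tier : kTiers) {
        if ((supported & tier.required) == tier.required) {
            best = &tier;
        }
    }
    return *best;
}
```

Because the scalar tier has an empty requirement mask, the function always returns something runnable, which matches the silent-fallback behaviour described above.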

Build

The backend has no external dependencies and is enabled by default. The relevant build options are:

cmake -S . -B build -DCLPEAK_CPU_NATIVE_ARCH=OFF   # portable multi-ISA build (default)
cmake -S . -B build -DCLPEAK_ENABLE_CPU=OFF         # skip the backend entirely

CLPEAK_CPU_NATIVE_ARCH=ON compiles a single translation unit with -march=native / -mcpu=native instead of the per-ISA dispatch tree, which is useful for tightly profiled single-machine builds.


CLI Changes

New flags

| Flag | Effect |
| --- | --- |
| --cpu | Run only the native CPU backend (combines with other --<backend> flags) |
| --no-cpu | Skip the native CPU backend |
| --amx / --no-amx | Enable / disable the matrix-engine test (AMX on x86, I8MM/SMMLA/BFMMLA on AArch64) |
| --cache-bandwidth / --no-cache-bandwidth | Enable / disable the L1/L2/L3 cache bandwidth test (CPU only) |
| --memory-latency / --no-memory-latency | Enable / disable the pointer-chase memory latency test (CPU only) |
| --max-time-cpu ms | Per-test time budget for the CPU backend (default: 2000 ms). Separate from --max-time because CPU kernels need longer runs to amortise thread-pool overhead and produce stable numbers. |
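
To make the idea of a per-test time budget concrete, the sketch below repeats a kernel in batches until a millisecond budget has elapsed. It is an assumption-laden illustration, not clpeak's timing loop: `run_for_budget`, `BudgetResult`, and the batch size are hypothetical, and the thread-pool dispatch that the real backend amortises is left out so the example stays single-threaded.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

struct BudgetResult {
    std::uint64_t iterations;  // kernel invocations completed
    double elapsed_ms;         // wall time actually spent
};

// Hypothetical helper: call `kernel` in batches until `budget_ms` has passed.
// The clock is read once per batch so timing overhead stays small relative
// to the work being measured.
template <typename Kernel>
BudgetResult run_for_budget(Kernel&& kernel, double budget_ms,
                            std::uint64_t batch = 64) {
    assert(budget_ms >= 0.0 && batch > 0);
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    std::uint64_t iterations = 0;
    double elapsed_ms = 0.0;
    do {
        for (std::uint64_t i = 0; i < batch; ++i) {
            kernel();
        }
        iterations += batch;
        elapsed_ms =
            std::chrono::duration<double, std::milli>(clock::now() - t0).count();
    } while (elapsed_ms < budget_ms);
    return {iterations, elapsed_ms};
}
```

Throughput would then be derived from `iterations` and `elapsed_ms`; a longer budget gives more batches and therefore a smaller share of fixed startup cost in the result, which is the motivation given for the separate CPU default.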

Changed flags

  • --max-time — now explicitly documented as applying to all backends except CPU. Use --max-time-cpu to tune the CPU budget independently.

Removed flags

  • --atomic-throughput / --no-atomic-throughput — removed together with the test itself.

Vulkan: Coopmat Tile Generalisation via Specialisation Constants

Tile dimensions M / N / K are now SPIR-V specialisation constants. On startup the backend queries the driver for every advertised (dtype, M, N, K) combination and dispatches each one individually. Previously, fp8 K=32 on NVIDIA was silently skipped by the hard-coded K=16 path; it now runs correctly.
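
As a rough picture of what passing M / N / K as specialisation constants involves, the sketch below packs one tile shape into a data blob plus per-constant map entries. It deliberately avoids the Vulkan headers so it stays self-contained: `SpecMapEntry` is a stand-in that mirrors the fields of `VkSpecializationMapEntry` (constant ID, byte offset, size), and the constant IDs 0, 1, 2 for M, N, K, along with every function and struct name here, are assumptions rather than clpeak's actual layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in mirroring VkSpecializationMapEntry (constantID, offset, size).
struct SpecMapEntry {
    std::uint32_t constant_id;
    std::uint32_t offset;
    std::size_t size;
};

// One driver-advertised tile shape (dtype omitted for brevity).
struct TileShape {
    std::uint32_t m, n, k;
};

// Data blob plus map entries, as they would be handed to pipeline creation.
struct TileSpecialization {
    std::array<std::uint32_t, 3> data;
    std::array<SpecMapEntry, 3> entries;
};

// Assumed mapping: constant_id 0 -> M, 1 -> N, 2 -> K.
TileSpecialization make_tile_specialization(const TileShape& shape) {
    assert(shape.m > 0 && shape.n > 0 && shape.k > 0);
    TileSpecialization spec{};
    spec.data = {shape.m, shape.n, shape.k};
    for (std::uint32_t id = 0; id < 3; ++id) {
        spec.entries[id] = {
            id,
            static_cast<std::uint32_t>(id * sizeof(std::uint32_t)),
            sizeof(std::uint32_t)};
    }
    return spec;
}

// Build one specialisation per advertised shape, so the same generic shader
// can be compiled into a separate pipeline for each of them.
std::vector<TileSpecialization> specialize_all(const std::vector<TileShape>& shapes) {
    std::vector<TileSpecialization> out;
    out.reserve(shapes.size());
    for (const TileShape& shape : shapes) {
        out.push_back(make_tile_specialization(shape));
    }
    return out;
}
```

In a real Vulkan program the blob and entries would be wrapped in a `VkSpecializationInfo` and attached to the compute shader stage; the point of the sketch is only that a K=32 shape becomes one more entry in the list rather than a separate shader file.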

The static coopmat_int8_k32.comp shader, which hard-coded K=32, has been removed; all shapes are now handled by the generic shaders via specialisation constants.


Removed: Atomic Throughput

The Atomic throughput test (int_global / int_local / float_global) has been removed from all backends — OpenCL, Vulkan, CUDA, ROCm/HIP, Metal, and oneAPI/SYCL. Results were not comparable across stacks and the test did not reflect any workload-relevant bottleneck.


oneAPI / SYCL Fixes

  • Queue self-recovery — the backend now re-creates the SYCL queue after a device fault instead of aborting the process.
  • Authoritative XMX detection — replaces the previous heuristic; correctly identifies XMX capability on Arc and other Intel GPUs.
  • oneMKL context isolation — each dtype now runs on a private SYCL context, stopping a fault in one GEMM variant from cascading into subsequent dtypes.
  • fp64 GEMM tile reduced — smaller tile avoids triggering the GPU watchdog on Arc hardware.

Build Fixes

  • Linux: prefer clang — the CPU backend CMakeLists now prefers clang/clang++ over GCC on Linux when both are present, for better SIMD codegen.

Full Changelog: 2.0.10...2.0.11