roofbench

Benchmark utility for CPU FLOPS, core latency, and memory bandwidth.

Building

Dependencies:

- Linux
- GCC 11.0 or Clang 11.0 or newer (Clang is preferred)
  - libomp-dev: LLVM OpenMP runtime library (if using Clang)
  - libc++-dev: LLVM C++ standard library (if using -stdlib=libc++)
  - lld: LLVM linker (if using -fuse-ld=lld)
- Meson build system
- numactl or libnuma-dev

export AR=gcc-ar CC=gcc CXX=g++ RANLIB=gcc-ranlib
meson setup builddir -D simd_batch_size_f32=232 -D simd_batch_size_f64=116
ninja -C builddir

export AR=llvm-ar CC=clang CXX=clang++ RANLIB=llvm-ranlib
export CXXFLAGS=-stdlib=libc++ LDFLAGS='-fuse-ld=lld -stdlib=libc++'  # Optional
meson setup builddir -D simd_batch_size_f32=240 -D simd_batch_size_f64=120
ninja -C builddir

The build system uses -march=native by default, so the binary is optimized for the machine it is built on and may not run on CPUs that lack the same instruction set extensions.

Intel AVX-512

Enabling 512-bit SIMD can increase peak FLOPS on Intel CPUs. However, heavy 512-bit instructions can lower the core clock frequency on some Intel CPUs, so in a multitasking environment the performance of other processes may be reduced.

export AR=gcc-ar CC=gcc CXX=g++ RANLIB=gcc-ranlib
meson setup builddir -D cpp_args=-mprefer-vector-width=512 -D simd_batch_size_f32=464 -D simd_batch_size_f64=232 --wipe
ninja -C builddir

export AR=llvm-ar CC=clang CXX=clang++ RANLIB=llvm-ranlib
export CXXFLAGS=-stdlib=libc++ LDFLAGS='-fuse-ld=lld -stdlib=libc++'  # Optional
meson setup builddir -D cpp_args="$CXXFLAGS -mprefer-vector-width=512" -D simd_batch_size_f32=480 -D simd_batch_size_f64=240 --wipe
ninja -C builddir
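Before enabling 512-bit SIMD, it can help to confirm that the CPU advertises AVX-512 at all. The following is a minimal sketch, not part of roofbench: `cpu_flags` is a hypothetical helper that parses text in the format of `/proc/cpuinfo` on x86 Linux, where feature names appear on a `flags` line. The sample text is made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found (e.g. non-x86 systems)


# Illustrative sample only; on a real machine, pass the contents of
# /proc/cpuinfo instead, e.g. cpu_flags(open("/proc/cpuinfo").read()).
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma avx512f avx512vl\n"

flags = cpu_flags(SAMPLE)
print("avx512f" in flags)
```

If `avx512f` is absent, the `-mprefer-vector-width=512` build is unlikely to bring any benefit on that machine.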

Optimal SIMD batch size

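The optimal value is: (total SIMD register count − occupied register count) × (SIMD register width in bytes) ÷ sizeof(element), where the element is float for simd_batch_size_f32 and double for simd_batch_size_f64.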

Each cell lists simd_batch_size_f32, simd_batch_size_f64.

Compiler	AArch64 NEON (128-bit)	AVX2 (256-bit)	AVX-512 (512-bit)
GCC	120, 60	232, 116	464, 232
Clang	120, 60	240, 120	480, 240
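The formula can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `optimal_batch_size` is a hypothetical helper, the total of 32 registers assumes an AVX-512 or AArch64 NEON register file, and the occupied counts of 2 and 3 are not stated in this README but inferred by inverting the formula from the values used in the build commands.

```python
def optimal_batch_size(total_registers, occupied_registers, register_bits, element_bytes):
    """Number of elements that fit in the SIMD registers left free for data."""
    free_registers = total_registers - occupied_registers
    elements_per_register = (register_bits // 8) // element_bytes
    return free_registers * elements_per_register


# Assumption: 32 registers, 2 occupied, 512-bit registers.
print(optimal_batch_size(32, 2, 512, 4))  # f32: 480
print(optimal_batch_size(32, 2, 512, 8))  # f64: 240

# Assumption: 32 registers, 3 occupied, 512-bit registers.
print(optimal_batch_size(32, 3, 512, 4))  # f32: 464

# Assumption: 32 registers, 2 occupied, 128-bit registers.
print(optimal_batch_size(32, 2, 128, 4))  # f32: 120
```

The f64 value is always half the f32 value, because a double occupies twice the bytes of a float in the same register.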

Running

OMP_PLACES=threads OMP_PROC_BIND=true ./builddir/roofbench | tee results.json
./plot_latency.py results.json > latency.svg

The output is in JSON format.
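Because the output is JSON, it can be inspected with any JSON library before plotting. The sketch below is illustrative only: the structure of results.json is not documented here, so `summarize_top_level` (a hypothetical helper) assumes nothing about the schema and simply reports what is at the top level. The keys in the sample string are made up.

```python
import json


def summarize_top_level(json_text):
    """Map each top-level key to the type name of its value."""
    data = json.loads(json_text)
    if isinstance(data, dict):
        return {key: type(value).__name__ for key, value in data.items()}
    # The root may also be a list or scalar; report its type instead.
    return {"<root>": type(data).__name__}


# Made-up sample; for real data, pass open("results.json").read().
sample = '{"example_scalar": 1.5, "example_list": [1, 2, 3]}'
print(summarize_top_level(sample))
```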

Included benchmarks

Affinity: shows thread affinity
Float Add: floating-point add operations
Float Mul: floating-point multiply operations
Float FMA: fused floating-point multiply-add operations
Memory Read: each thread reading from its local NUMA node's memory
Inter-thread Latency: round-trip time between each pair of host and guest threads, communicating through shared memory on the host thread’s NUMA node

Units of measurement

Time duration: seconds
FLOPS: floating-point operations per second
Throughput: bytes per second
Latency: seconds
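Since every value is reported in base units, raw numbers can be large or very small. A minimal sketch of a formatter that scales them with SI prefixes follows; `format_si` is a hypothetical helper, not part of roofbench, and the input values are made up.

```python
def format_si(value, unit):
    """Format a base-unit value with an SI prefix, e.g. 2.5e-8 s as '25.00 ns'."""
    prefixes = [
        (1e12, "T"), (1e9, "G"), (1e6, "M"), (1e3, "k"),
        (1.0, ""), (1e-3, "m"), (1e-6, "µ"), (1e-9, "n"),
    ]
    for scale, prefix in prefixes:
        if abs(value) >= scale:
            return f"{value / scale:.2f} {prefix}{unit}"
    # Zero or smaller than the smallest prefix: fall back to scientific notation.
    return f"{value:.2e} {unit}"


print(format_si(1.5e12, "FLOPS"))  # 1.50 TFLOPS
print(format_si(3.2e10, "B/s"))    # 32.00 GB/s
print(format_si(2.5e-8, "s"))      # 25.00 ns
```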

License

The program is free and open-source software, licensed under the MIT license.

Refer to the LICENSE file for more information.
