Performance test for a simple loop

Loop speedup experiment

In a Reddit discussion the following piece of code was posted:

for (unsigned c = 0; c < arraySize; ++c) {
    if (data[c] >= 128)
        sum += data[c];
}

At first glance this looks like the optimal solution, but it turns out not to be. This repository is an attempt to look at different ways of making it faster on real hardware.
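To make the starting point concrete, here is a minimal sketch of the loop wrapped in a function, next to one branch-free alternative of the kind this experiment explores. It is an illustration only, not the code in this repository: the function names `sum_simple` and `sum_branchless` are made up for this example, and it assumes the elements are bytes in the range 0 to 255 stored as `std::uint8_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: the loop from the discussion, wrapped in a function.
// Assumption: elements are bytes (0..255); the original snippet does not
// state the element type.
std::uint64_t sum_simple(const std::vector<std::uint8_t>& data) {
    std::uint64_t sum = 0;
    for (std::size_t c = 0; c < data.size(); ++c) {
        if (data[c] >= 128)
            sum += data[c];
    }
    return sum;
}

// Branch-free variant (hypothetical, for illustration): derive a mask from
// the top bit of each byte instead of testing it with an if statement.
std::uint64_t sum_branchless(const std::vector<std::uint8_t>& data) {
    std::uint64_t sum = 0;
    for (std::uint8_t v : data) {
        // (v >> 7) is 1 when v >= 128 and 0 otherwise. Negating that as an
        // unsigned value yields all ones or all zeros (well-defined wraparound).
        const std::uint64_t mask = -static_cast<std::uint64_t>(v >> 7);
        sum += v & mask;  // adds v when the mask is all ones, 0 otherwise
    }
    return sum;
}
```

Both functions return the same value for any input. Which one is faster depends on the data, the compiler, and the hardware, which is what the measurements are for.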

Compiling and running

meson build
cd build

To switch between -O2 (debugoptimized) and -O3 (release) do:

mesonconf -Dbuildtype=debugoptimized [or release]

Measuring and plotting

To run the full set of measurements do:

./ output_file.json

Creating plots requires Matplotlib and is run like this:

./ output_file.json output_dir

The plotter creates one picture per measurement setup.


If you have an algorithm you want to contribute, please file a pull request.

Some simple rules:

  • no inline assembly, threads, LTO or PGO
  • your submission does not need to beat the fastest one so far, as long as it is interesting
  • no obviously stupid solutions
  • try to avoid undefined behaviour


Intel i7, GCC 6.3, -O2:

paradd     38030 μs
lut        54987 μs
bucket     54990 μs
multi      81016 μs
simple     81256 μs
bitfiddle  62836 μs
partition 379627 μs
zeroing   367643 μs

Raspberry Pi 2B+, Raspbian Jessie, GCC 4.9.2, -O2:

Bit fiddling          988622 μs
Bucket                992603 μs
Simple loop          1104662 μs
Parallel add lookup  1158125 μs
Lookup table         1223642 μs
Zeroing              1267950 μs
Multi                1222455 μs
Partitioning         1939782 μs

Please note that the measurements vary wildly with the optimization level (-O2 versus -O3), with flags such as -mfpu=neon, and with the compiler used. Run your own measurements instead of blindly following these.
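If you want a quick number of your own without the full measurement scripts, a rough timing harness might look like the sketch below. It is not the harness this repository uses. The helpers `make_data` and `time_us`, the fixed seed, and the repeat count are all assumptions made for this example.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: fill a buffer with pseudo-random bytes from a fixed
// seed, so that every run sees the same input.
std::vector<std::uint8_t> make_data(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> data(n);
    for (auto& v : data)
        v = static_cast<std::uint8_t>(dist(gen));
    return data;
}

// Hypothetical helper: call f(data) `repeats` times and return the elapsed
// wall-clock time in microseconds. The accumulated result is written to
// result_out so the compiler cannot discard the work as unused.
template <typename F>
std::int64_t time_us(F&& f, const std::vector<std::uint8_t>& data,
                     int repeats, std::uint64_t& result_out) {
    const auto start = std::chrono::steady_clock::now();
    std::uint64_t total = 0;
    for (int i = 0; i < repeats; ++i)
        total += f(data);
    const auto stop = std::chrono::steady_clock::now();
    result_out = total;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Pass any callable that takes the byte vector and returns a sum, for example a lambda containing the simple loop. Treat the output as a rough indication only: an aggressive optimizer may still merge repeated calls on unchanged data, and a single run says little about variance.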