
[clang] Poor performance because gather and scatter operations are emitted by the compiler targeting AVX512 or at level 3 of optimization #87640

Open
wolfpld opened this issue Apr 4, 2024 · 4 comments

wolfpld commented Apr 4, 2024

Using the -march=native and/or -O3 compilation flags can result in a significantly (more than 2×) slower executable.

The code I am seeing problems with is https://github.com/wolfpld/etcpak, specifically https://github.com/wolfpld/etcpak/blob/master/bcdec.h.

Reproducing the results is a bit involved and requires a specific image file I cannot share, but any other image will suffice, provided it is large enough to produce long enough computation times. It may be necessary to use an image with an alpha channel, as different encoding modes (and code paths) are used when alpha is present.

To build the program and get the results, you need to do the following:

% meson setup build --buildtype=release
% cd build
% ninja
# any reasonably large image file will do instead of 16k.png, this only needs to be done once to get 16k.dds
% ./etcpak -c bc7 -h dds ~/16k.png 16k.dds
% ./etcpak -v -b 16k.dds 
Median decode time for 9 runs: 3483.195 ms (77.066 Mpx/s)

The 77 Mpx/s figure is the performance measure compared below. The meson setup command configured the compiler to use -O3 and -march=native. I am running on an i7-1185G7, which supports AVX512.

Building the program with meson setup build --buildtype=release --optimization=2, which lowers the optimization level to -O2, results in 2274.264 ms (118.032 Mpx/s). A speedup is not what you would normally expect from lowering the optimization level.

Similarly, changing the -march=native parameter to -march=skylake (which requires modifying the meson.build file) results in 2776.365 ms (96.686 Mpx/s). The Skylake ISA doesn't support AVX512.

Interestingly, building with both -march=skylake and -O2 results in 1693.646 ms (158.496 Mpx/s). This is twice the speed of the -march=native + -O3 build.

The behavior is also reproducible on a Ryzen 7950X (Zen4, another AVX512-enabled uarch), where -march=native + -O3 results in 170 Mpx/s and -march=skylake + -O3 results in 190 Mpx/s.
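The per-instruction observations below come from a sampling profiler. Assuming Linux perf is available (the exact tool used for the screenshots may differ), an invocation along these lines is enough to reproduce the hot-spot picture:

% perf record -- ./etcpak -v -b 16k.dds
% perf report
% perf annotate   # per-instruction view of a hot symbol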


In the case of -march=native + -O3, the first problematic place in the code is at lines 1171-1173 of bcdec.h. The compiler emits a lot of gather + scatter instructions, which are microcoded and have high latency.

[screenshot]

Here are example measurements for one of the gather instructions:

[screenshot]

Another problematic place is at line 1225, where a series of gather operations is emitted (four gathers are emitted, but I only show one here).

[screenshot]
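For intuition, here is a minimal, hypothetical C sketch (not the actual bcdec.h code) of the pattern class involved: loads and stores with a non-unit stride, which the loop vectorizer may lower to gather/scatter instructions on an AVX512 target at -O3.

#include <stdint.h>
#include <stddef.h>

/* Illustrative only, not etcpak code: each iteration touches one 32-bit
   word out of every four. When this loop is vectorized for AVX512, the
   strided accesses may be emitted as vpgatherdd/vpscatterdd rather than
   contiguous loads and stores. */
void invert_every_fourth( uint32_t* data, size_t n )
{
    for( size_t i = 0; i < n; i++ )
    {
        data[i * 4] = ~data[i * 4];
    }
}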


With -march=native and -O2, the compiler still emits gathers and scatters for the first place mentioned above, and it now takes a larger percentage of the runtime.

[screenshot]

The second code fragment is now emitted as a series of scalar operations, which causes it to practically disappear from the list of hot spots.

[screenshot]


With -march=skylake + -O3, the first location is compiled to a number of AVX2 operations (since Skylake does not support AVX512) that have virtually no impact on execution speed.

[screenshot]

The second location dominates again due to a series of gather operations (AVX2 has gathers too!).

[screenshot]

This gather instruction is again heavily microcoded and has high latency.

[screenshot]


With -march=skylake and -O2, gather operations are no longer emitted, moving the hotspots to scalar computation elsewhere in the code, as expected.
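Until the cost model handles these cases better, a possible workaround is to opt the affected loops out of vectorization with clang's loop pragma, so that no gather/scatter is emitted for them regardless of -march or -O level. The function below is an illustrative sketch, not etcpak code:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: the pragma asks clang's loop vectorizer to keep this
   loop scalar, which sidesteps gather/scatter codegen for it even with
   -O3 -march=native on an AVX512 machine. */
void fix_words_scalar( char* data, size_t blocks )
{
    #pragma clang loop vectorize(disable)
    for( size_t i = 0; i < blocks; i++ )
    {
        uint32_t tmp;
        memcpy( &tmp, data + i * 8 + 4, 4 );   /* upper word of each 8-byte block */
        tmp = ~tmp;
        memcpy( data + i * 8 + 4, &tmp, 4 );
    }
}

Passing -fno-vectorize for the whole translation unit has a similar effect, at the cost of also disabling vectorization where it helps.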


Checked with clang version 17.0.6. I have also observed similar behavior with clang version 19.0.0git (https://github.com/llvm/llvm-project.git 35886dc).

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Apr 4, 2024
@EugeneZelenko EugeneZelenko added backend:X86 and removed clang Clang issues not falling into any other category labels Apr 4, 2024
@llvmbot
Collaborator

llvmbot commented Apr 4, 2024

@llvm/issue-subscribers-backend-x86

Author: Bartosz Taudul (wolfpld)


@dtcxzyw
Member

dtcxzyw commented Apr 4, 2024

cc @RKSimon

@RKSimon RKSimon self-assigned this Apr 4, 2024
RKSimon added a commit that referenced this issue Apr 4, 2024
@RKSimon
Collaborator

RKSimon commented Apr 4, 2024

Looking at this now, but I'm going to have to think about where to start tbh - gather/scatters are still a mess in both the cost tables and the scheduler models (the znver4 model has no entries at all, so they are modelled as simple loads/stores...).

@wolfpld
Author

wolfpld commented Aug 21, 2024

A bit simpler repro case:

#include <stdint.h>
#include <string.h>

void FixOrder( char* data, size_t blocks )
{
    do
    {
        uint32_t tmp;
        memcpy( &tmp, data+4, 4 );      // load only the upper 32-bit word of the 8-byte block
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555; // even bits
        uint32_t t1 = tmp & 0xAAAAAAAA; // odd bits
        tmp = ( ( t0 << 1 ) | ( t1 >> 1 ) ) ^ t1;
        memcpy( data+4, &tmp, 4 );      // store it back; a stride-8 read-modify-write overall
        data += 8;
    }
    while( --blocks );
}

The following assembly is generated with -O3 -march=znver4:

.LBB0_7:
  vmovdqu64 zmm3, zmmword ptr [r9 + 4]
  kxnorw k1, k0, k0
  add r8, -16
  vpermt2d zmm3, zmm0, zmmword ptr [r9 + 68]
  vpandnd zmm4, zmm3, zmm1
  vpternlogq zmm3, zmm3, zmm3, 15
  vpaddd zmm3, zmm3, zmm3
  vpsrld zmm5, zmm4, 1
  vpandd zmm3, zmm3, zmm1
  vpternlogd zmm5, zmm4, zmm3, 54
  vpscatterdd zmmword ptr [r9 + zmm2] {k1}, zmm5
  sub r9, -128
  cmp rax, r8
  jne .LBB0_7
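
For reference, assuming the snippet is saved as fixorder.c, the codegen and the scheduler model's view of it can be inspected directly (hypothetical invocations):

% clang -O3 -march=znver4 -S -o fixorder.s fixorder.c
% llvm-mca -mcpu=znver4 fixorder.s   # how the znver4 model currently rates the loop body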
