
[clang] Poor performance because gather and scatter operations are emitted by the compiler targeting AVX512 or at level 3 of optimization #87640

Open
wolfpld opened this issue Apr 4, 2024 · 4 comments

wolfpld commented Apr 4, 2024

Using the -march=native and/or -O3 compilation flags can result in a significantly (more than 2×) slower executable.

The code I am seeing problems with is https://github.com/wolfpld/etcpak, specifically https://github.com/wolfpld/etcpak/blob/master/bcdec.h.

Reproducing the results is a bit involved and requires a specific image file I cannot share, but any other image will suffice, provided it is large enough to produce long enough computation times. It may be necessary to use an image with an alpha channel, as different encoding modes (and code paths) are used when alpha is present.

To build the program and get the results, you need to do the following:

% meson setup build --buildtype=release
% cd build
% ninja
# any reasonably large image file will do instead of 16k.png, this only needs to be done once to get 16k.dds
% ./etcpak -c bc7 -h dds ~/16k.png 16k.dds
% ./etcpak -v -b 16k.dds 
Median decode time for 9 runs: 3483.195 ms (77.066 Mpx/s)

The 77 Mpx/s figure is the performance measure compared below. The meson setup command configured the compiler to use -O3 and -march=native. I am running on an i7-1185G7, which supports AVX512.

Building the program with meson setup build --buildtype=release --optimization=2, which lowers the optimization level to -O2, results in 2274.264 ms (118.032 Mpx/s). A speedup is not what you would normally expect from lowering the optimization level.

Similarly, changing the -march=native parameter to -march=skylake (which requires modifying the meson.build file) results in 2776.365 ms (96.686 Mpx/s). The Skylake ISA doesn't support AVX512.

Interestingly, building with both -march=skylake and -O2 results in 1693.646 ms (158.496 Mpx/s). This is twice the speed of the -march=native + -O3 build.

The behavior is also reproducible on a Ryzen 7950X (Zen4, another AVX512-enabled uarch), where -march=native + -O3 results in 170 Mpx/s and -march=skylake + -O3 results in 190 Mpx/s.
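The per-instruction observations below come from a sampling profiler. Assuming Linux perf is available (the exact tool used for the screenshots may differ), an invocation along these lines is enough to reproduce the hot-spot picture:

% perf record -- ./etcpak -v -b 16k.dds
% perf report
% perf annotate   # per-instruction view of a hot symbol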


In the case of -march=native + -O3, the first problematic place in the code is at lines 1171-1173 of bcdec.h. The compiler emits a lot of gather + scatter instructions, which are microcoded and have high latency.

[screenshot]

Here are example measurements for one of the gather instructions:

[screenshot]

Another problematic place is at line 1225, where a series of gather operations is emitted (four gathers are emitted, but I only show one here).

[screenshot]
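For intuition, here is a minimal, hypothetical C sketch (not the actual bcdec.h code) of the pattern class involved: loads and stores with a non-unit stride, which the loop vectorizer may lower to gather/scatter instructions on an AVX512 target at -O3.

#include <stdint.h>
#include <stddef.h>

/* Illustrative only, not etcpak code: each iteration touches one 32-bit
   word out of every four. When this loop is vectorized for AVX512, the
   strided accesses may be emitted as vpgatherdd/vpscatterdd rather than
   contiguous loads and stores. */
void invert_every_fourth( uint32_t* data, size_t n )
{
    for( size_t i = 0; i < n; i++ )
    {
        data[i * 4] = ~data[i * 4];
    }
}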


With -march=native and -O2, the compiler still emits gathers and scatters for the first place mentioned above, and it now takes a larger percentage of the runtime.

[screenshot]

The second code fragment is now emitted as a series of scalar operations, which causes it to practically disappear from the list of hot spots.

[screenshot]


With -march=skylake + -O3, the first location is compiled to a number of AVX2 operations (since Skylake does not support AVX512) that have virtually no impact on execution speed.

[screenshot]

The second location dominates again due to a series of gather operations (AVX2 has gathers too!).

[screenshot]

This gather instruction is again heavily microcoded and has high latency.

[screenshot]


With -march=skylake and -O2, gather operations are no longer emitted, moving the hotspots to scalar computation elsewhere in the code, as expected.
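Until the cost model handles these cases better, a possible workaround is to opt the affected loops out of vectorization with clang's loop pragma, so that no gather/scatter is emitted for them regardless of -march or -O level. The function below is an illustrative sketch, not etcpak code:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: the pragma asks clang's loop vectorizer to keep this
   loop scalar, which sidesteps gather/scatter codegen for it even with
   -O3 -march=native on an AVX512 machine. */
void fix_words_scalar( char* data, size_t blocks )
{
    #pragma clang loop vectorize(disable)
    for( size_t i = 0; i < blocks; i++ )
    {
        uint32_t tmp;
        memcpy( &tmp, data + i * 8 + 4, 4 );   /* upper word of each 8-byte block */
        tmp = ~tmp;
        memcpy( data + i * 8 + 4, &tmp, 4 );
    }
}

Passing -fno-vectorize for the whole translation unit has a similar effect, at the cost of also disabling vectorization where it helps.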


Checked with clang version 17.0.6. I have also observed similar behavior with clang version 19.0.0git (https://github.com/llvm/llvm-project.git 35886dc).

@github-actions github-actions bot added the clang Clang issues not falling into any other category label Apr 4, 2024
@EugeneZelenko EugeneZelenko added backend:X86 and removed clang Clang issues not falling into any other category labels Apr 4, 2024
@llvmbot
Collaborator

llvmbot commented Apr 4, 2024

@llvm/issue-subscribers-backend-x86

Author: Bartosz Taudul (wolfpld)


@dtcxzyw
Member

dtcxzyw commented Apr 4, 2024

cc @RKSimon

@RKSimon RKSimon self-assigned this Apr 4, 2024
RKSimon added a commit that referenced this issue Apr 4, 2024
@RKSimon
Collaborator

RKSimon commented Apr 4, 2024

Looking at this now, but I'm going to have to think about where to start tbh - gather/scatters are still a mess in both the cost tables and the scheduler models (the znver4 model has no entries at all, so they are modelled as simple loads/stores...).

@wolfpld
Author

wolfpld commented Aug 21, 2024

A bit simpler repro case:

#include <stdint.h>
#include <string.h>

void FixOrder( char* data, size_t blocks )
{
    do
    {
        uint32_t tmp;
        memcpy( &tmp, data+4, 4 );      // load only the upper 32-bit word of the 8-byte block
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555; // even bits
        uint32_t t1 = tmp & 0xAAAAAAAA; // odd bits
        tmp = ( ( t0 << 1 ) | ( t1 >> 1 ) ) ^ t1;
        memcpy( data+4, &tmp, 4 );      // store it back; a stride-8 read-modify-write overall
        data += 8;
    }
    while( --blocks );
}

The following assembly is generated with -O3 -march=znver4:

.LBB0_7:
  vmovdqu64 zmm3, zmmword ptr [r9 + 4]
  kxnorw k1, k0, k0
  add r8, -16
  vpermt2d zmm3, zmm0, zmmword ptr [r9 + 68]
  vpandnd zmm4, zmm3, zmm1
  vpternlogq zmm3, zmm3, zmm3, 15
  vpaddd zmm3, zmm3, zmm3
  vpsrld zmm5, zmm4, 1
  vpandd zmm3, zmm3, zmm1
  vpternlogd zmm5, zmm4, zmm3, 54
  vpscatterdd zmmword ptr [r9 + zmm2] {k1}, zmm5
  sub r9, -128
  cmp rax, r8
  jne .LBB0_7
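
For reference, assuming the snippet is saved as fixorder.c, the codegen and the scheduler model's view of it can be inspected directly (hypothetical invocations):

% clang -O3 -march=znver4 -S -o fixorder.s fixorder.c
% llvm-mca -mcpu=znver4 fixorder.s   # how the znver4 model currently rates the loop body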
