[clang] Poor performance because gather and scatter operations are emitted by the compiler targeting AVX512 or at level 3 of optimization #87640
Comments
@llvm/issue-subscribers-backend-x86 Author: Bartosz Taudul (wolfpld)
Using `-march=native` and/or `-O3` compilation flags can result in a significantly (more than 2x) slower executable.
The code I am seeing problems with is https://github.com/wolfpld/etcpak, specifically https://github.com/wolfpld/etcpak/blob/master/bcdec.h.
Reproducing the results is a bit involved and requires a specific image file I cannot share, but any other image will suffice, provided it is large enough to produce long enough computation times. It may be necessary to use an image with an alpha channel, as different encoding modes (and code paths) are used when alpha is present.
To build the program and get the results, you need to do the following:
% meson setup build --buildtype=release
% cd build
% ninja
# any reasonably large image file will do instead of 16k.png, this only needs to be done once to get 16k.dds
% ./etcpak -c bc7 -h dds ~/16k.png 16k.dds
% ./etcpak -v -b 16k.dds
Median decode time for 9 runs: 3483.195 ms (77.066 Mpx/s)
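The throughput figure printed by the benchmark can be cross-checked by hand. The sketch below is only an illustration: `mpx_per_second` is a hypothetical helper, and the 16384 x 16384 pixel image size is an assumption inferred from the `16k.png` file name, not something stated in the report.

```c
#include <stdint.h>

/* Throughput in megapixels per second for `pixels` pixels decoded in `ms`
   milliseconds. pixels / ms is kilopixels per second, so divide by a
   further 1000 to get Mpx/s. */
static double mpx_per_second(uint64_t pixels, double ms)
{
    return (double)pixels / (ms * 1000.0);
}
```

Under that assumed image size, `mpx_per_second(16384ull * 16384ull, 3483.195)` comes out at roughly 77.066, which agrees with the reported median.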
cc @RKSimon
…ther/scatter Noticed while starting triage for #87640
Looking at this now, but I'm going to have to think about where to start tbh - gather/scatters are still a mess in both the cost tables and the scheduler models (the znver4 model has no entries at all, so they are modelled as simple loads/stores...).
A bit simpler repro case:
#include <stdint.h>
#include <string.h>
void FixOrder( char* data, size_t blocks )
{
    do
    {
        uint32_t tmp;
        memcpy( &tmp, data+4, 4 );
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555;
        uint32_t t1 = tmp & 0xAAAAAAAA;
        tmp = ( ( t0 << 1 ) | ( t1 >> 1 ) ) ^ t1;
        memcpy( data+4, &tmp, 4 );
        data += 8;
    }
    while( --blocks );
}
The following assembly is generated with
.LBB0_7:
        vmovdqu64 zmm3, zmmword ptr [r9 + 4]
        kxnorw k1, k0, k0
        add r8, -16
        vpermt2d zmm3, zmm0, zmmword ptr [r9 + 68]
        vpandnd zmm4, zmm3, zmm1
        vpternlogq zmm3, zmm3, zmm3, 15
        vpaddd zmm3, zmm3, zmm3
        vpsrld zmm5, zmm4, 1
        vpandd zmm3, zmm3, zmm1
        vpternlogd zmm5, zmm4, zmm3, 54
        vpscatterdd zmmword ptr [r9 + zmm2] {k1}, zmm5
        sub r9, -128
        cmp rax, r8
        jne .LBB0_7
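The loop touches only bytes 4..7 of every 8-byte block, so in 32-bit elements it is a stride-2 read-modify-write. That matches the listing: each iteration handles 16 blocks (`add r8, -16`, `sub r9, -128`), picks the 16 words out of two 64-byte loads with `vpermt2d`, and writes them back with a single `vpscatterdd`. A scalar reference makes the access pattern explicit and can be used to check any vectorized build; the names `fix_word` and `fix_order_ref` are hypothetical and not part of the original repro.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The bit transform from the repro, applied to a single 32-bit word. */
static uint32_t fix_word(uint32_t w)
{
    uint32_t tmp = ~w;
    uint32_t t0 = tmp & 0x55555555u;   /* even bits */
    uint32_t t1 = tmp & 0xAAAAAAAAu;   /* odd bits */
    return ((t0 << 1) | (t1 >> 1)) ^ t1;
}

/* Scalar reference: rewrite bytes 4..7 of each 8-byte block and leave
   bytes 0..3 untouched. Unlike the repro, this accepts blocks == 0. */
static void fix_order_ref(unsigned char* data, size_t blocks)
{
    for (size_t i = 0; i < blocks; i++) {
        uint32_t tmp;
        memcpy(&tmp, data + i * 8 + 4, 4);
        tmp = fix_word(tmp);
        memcpy(data + i * 8 + 4, &tmp, 4);
    }
}
```

Running `fix_order_ref` and `FixOrder` over copies of the same buffer and comparing them with `memcmp` is a simple way to confirm that a given build still computes the same result.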
Using `-march=native` and/or `-O3` compilation flags can result in a significantly (more than 2x) slower executable.
The code I am seeing problems with is https://github.com/wolfpld/etcpak, specifically https://github.com/wolfpld/etcpak/blob/master/bcdec.h.
Reproducing the results is a bit involved and requires a specific image file I cannot share, but any other image will suffice, provided it is large enough to produce long enough computation times. It may be necessary to use an image with an alpha channel, as different encoding modes (and code paths) are used when alpha is present.
To build the program and get the results, you need to do the following:
% meson setup build --buildtype=release
% cd build
% ninja
# any reasonably large image file will do instead of 16k.png, this only needs to be done once to get 16k.dds
% ./etcpak -c bc7 -h dds ~/16k.png 16k.dds
% ./etcpak -v -b 16k.dds
Median decode time for 9 runs: 3483.195 ms (77.066 Mpx/s)
The 77 Mpx/s result is a measure of performance. The meson setup command configured the compiler to use `-O3` and `-march=native`. I am running on i7-1185G7, which supports AVX512.
Building the program with `meson setup build --buildtype=release --optimization=2`, which lowers the optimization level to `-O2`, results in 2274.264 ms (118.032 Mpx/s), which is not what you would normally expect from lowering the optimization level.
Similarly, changing the `-march=native` parameter to `-march=skylake` (which requires modifying the `meson.build` file) results in 2776.365 ms (96.686 Mpx/s). The Skylake ISA doesn't support AVX512.
Interestingly, building with both `-march=skylake` and `-O2` results in 1693.646 ms (158.496 Mpx/s). This is twice the speed of the `-march=native` + `-O3` build.
The behavior is also reproducible on Ryzen 7950X (Zen4, another AVX512-enabled uarch), where `-march=native` + `-O3` results in 170 Mpx/s and `-march=skylake` + `-O3` results in 190 Mpx/s.
In the case of `-march=native` + `-O3`, the first problematic place in the code is lines 1171-1173. The compiler emits a lot of gather + scatter instructions that are microcoded and have high latency.
Here are example measurements for one of the gather instructions:
Another problematic place is at line 1225, where a series of gather operations is emitted (four gathers are emitted, but I only show one here).
With `-march=native` and `-O2`, the first place mentioned above still emits gathers and scatters, and takes a larger percentage of the runtime.
The second code fragment is now emitted as a series of scalar operations, which causes it to practically disappear from the list of hot spots.
The `-march=skylake` + `-O3` configuration produces a number of AVX2 operations (since AVX512 is not supported by Skylake) that have virtually no impact on execution speed.
The second location dominates again due to a series of gather operations (oops, AVX2 has gathers and scatters too!).
This gather instruction is again heavily microcoded and has high latency.
With `-march=skylake` and `-O2`, gather operations are no longer emitted, moving hotspots in the code to scalar computation elsewhere in the code as expected.
Checked with clang version 17.0.6. I have also observed similar behavior with clang version 19.0.0git (https://github.com/llvm/llvm-project.git 35886dc).
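One experiment for checking whether a particular loop's vectorized gather/scatter code is behind a slowdown is to turn off loop vectorization for that loop alone with Clang's `#pragma clang loop` directive and re-measure. This is not something proposed in the report, only a sketch; it reuses the shape of the `FixOrder` repro from the comments, and `fix_order_novec` is a hypothetical name. Whether the pragma actually removes the scatter on a given compiler version would need to be confirmed by inspecting the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same loop body as the FixOrder repro, with the loop vectorizer asked to
   leave this one loop alone. Like the repro, it requires blocks >= 1.
   Compilers other than Clang ignore the pragma (possibly with a warning). */
static void fix_order_novec(char* data, size_t blocks)
{
#pragma clang loop vectorize(disable) interleave(disable)
    do {
        uint32_t tmp;
        memcpy(&tmp, data + 4, 4);
        tmp = ~tmp;
        uint32_t t0 = tmp & 0x55555555u;
        uint32_t t1 = tmp & 0xAAAAAAAAu;
        tmp = ((t0 << 1) | (t1 >> 1)) ^ t1;
        memcpy(data + 4, &tmp, 4);
        data += 8;
    } while (--blocks);
}
```

Scoping the pragma to one loop keeps `-O3` and `-march=native` in effect for the rest of the program, so any change in the measured time can be attributed to that loop rather than to a global flag change.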