[x86][codegen] Poor AVX512VBMI codegen when using intrinsics #77459

Closed
gchatelet opened this issue Jan 9, 2024 · 7 comments
@gchatelet
Contributor

The following snippet byte-reverses two zmm registers and returns the matching elements as a mask:

uint64_t big_endian_cmp_mask(__m512i max, __m512i value) {
  const auto byte_reverse = [](__m512i value) -> __m512i {
    return _mm512_permutexvar_epi8(
        _mm512_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                        8, 9, 10, 11, 12, 13, 14, 15,   //
                        16, 17, 18, 19, 20, 21, 22, 23, //
                        24, 25, 26, 27, 28, 29, 30, 31, //
                        32, 33, 34, 35, 36, 37, 38, 39, //
                        40, 41, 42, 43, 44, 45, 46, 47, //
                        48, 49, 50, 51, 52, 53, 54, 55, //
                        56, 57, 58, 59, 60, 61, 62, 63),
        value);
  };
  return _mm512_cmpeq_epi8_mask(byte_reverse(max), byte_reverse(value));
}

GCC 13.2 compiles it down to

  vmovdqa64 .LC0(%rip), %zmm2
  vpermb %zmm1, %zmm2, %zmm1
  vpermb %zmm0, %zmm2, %zmm0
  vpcmpb $0, %zmm1, %zmm0, %k0
  kmovq %k0, %rax
  ret

whereas clang 17.0.1 compiles it down to

  vpcmpeqb %zmm1, %zmm0, %k0
  kmovq %k0, %rax
  bswapq %rax
  movq %rax, %rcx
  shrq $4, %rcx
  movabsq $1085102592571150095, %rdx # imm = 0xF0F0F0F0F0F0F0F
  andq %rdx, %rcx
  andq %rdx, %rax
  shlq $4, %rax
  orq %rcx, %rax
  movabsq $3689348814741910323, %rcx # imm = 0x3333333333333333
  movq %rax, %rdx
  andq %rcx, %rdx
  shrq $2, %rax
  andq %rcx, %rax
  leaq (%rax,%rdx,4), %rax
  movabsq $6148914691236517205, %rcx # imm = 0x5555555555555555
  movq %rax, %rdx
  andq %rcx, %rdx
  shrq %rax
  andq %rcx, %rax
  leaq (%rax,%rdx,2), %rax
  vzeroupper
  retq

The issue comes from InstCombine transforming

  %0 = bitcast <8 x i64> %max to <64 x i8>
  %1 = shufflevector <64 x i8> %0, <64 x i8> poison, <64 x i32> <i32 63, i32 62, i32 61, i32 60, i32 59, i32 58, i32 57, i32 56, i32 55, i32 54, i32 53, i32 52, i32 51, i32 50, i32 49, i32 48, i32 47, i32 46, i32 45, i32 44, i32 43, i32 42, i32 41, i32 40, i32 39, i32 38, i32 37, i32 36, i32 35, i32 34, i32 33, i32 32, i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %2 = bitcast <64 x i8> %1 to <8 x i64>
  %3 = bitcast <8 x i64> %value to <64 x i8>
  %4 = shufflevector <64 x i8> %3, <64 x i8> poison, <64 x i32> <i32 63, i32 62, i32 61, i32 60, i32 59, i32 58, i32 57, i32 56, i32 55, i32 54, i32 53, i32 52, i32 51, i32 50, i32 49, i32 48, i32 47, i32 46, i32 45, i32 44, i32 43, i32 42, i32 41, i32 40, i32 39, i32 38, i32 37, i32 36, i32 35, i32 34, i32 33, i32 32, i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %5 = bitcast <64 x i8> %4 to <8 x i64>
  %6 = icmp eq <64 x i8> %1, %4
  %7 = bitcast <64 x i1> %6 to i64
  ret i64 %7

into

  %0 = bitcast <8 x i64> %max to <64 x i8>
  %1 = bitcast <8 x i64> %value to <64 x i8>
  %2 = icmp eq <64 x i8> %0, %1
  %3 = bitcast <64 x i1> %2 to i64
  %4 = call i64 @llvm.bitreverse.i64(i64 %3)
  ret i64 %4

The @llvm.bitreverse.i64 operation is then expanded using GPR instructions instead of staying in vector registers.

Godbolt link: https://godbolt.org/z/5PjzaTr51

llvm-mca latency for the vector version: https://godbolt.org/z/K36K4r16K
llvm-mca latency for the GPR version: https://godbolt.org/z/7qcceYvM8

Is there any way to prevent the transformation and stick to the vector intrinsics?
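A possible source-level workaround, shown here only as an illustrative sketch (it is not part of this report, the function name big_endian_cmp_mask_opaque is made up, and it assumes clang/GCC extended inline asm with the "v" register constraint): an empty asm statement makes the permuted vector opaque to the middle-end, so InstCombine can no longer recognise the byte reversal and fold it into @llvm.bitreverse.i64 on the mask.

#include <cstdint>
#include <immintrin.h>

uint64_t big_endian_cmp_mask_opaque(__m512i max, __m512i value) {
  const auto byte_reverse = [](__m512i v) -> __m512i {
    __m512i reversed = _mm512_permutexvar_epi8(
        _mm512_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                        8, 9, 10, 11, 12, 13, 14, 15,   //
                        16, 17, 18, 19, 20, 21, 22, 23, //
                        24, 25, 26, 27, 28, 29, 30, 31, //
                        32, 33, 34, 35, 36, 37, 38, 39, //
                        40, 41, 42, 43, 44, 45, 46, 47, //
                        48, 49, 50, 51, 52, 53, 54, 55, //
                        56, 57, 58, 59, 60, 61, 62, 63),
        v);
    // Optimization barrier: "+v" keeps the value in a vector register and hides
    // its shuffle origin from the optimizer, so the reversal stays on the zmm
    // operands instead of being rewritten as a bitreverse of the result mask.
    asm("" : "+v"(reversed));
    return reversed;
  };
  return _mm512_cmpeq_epi8_mask(byte_reverse(max), byte_reverse(value));
}

The barrier also blocks any legitimate folding through that value, so it is only a stopgap; the backend change discussed below removes the need for it.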

@llvmbot
llvmbot commented Jan 9, 2024

@llvm/issue-subscribers-clang-codegen

Author: Guillaume Chatelet (gchatelet)

@gchatelet
Contributor Author

Benchmarking on a Sapphire Rapids machine, the GPR version is about 70% slower than the vector version (1.69 ns vs 1.01 ns per call).

gchatelet@intel-sapphire-rapids:~$ cat benchmark.cpp 
#include <benchmark/benchmark.h>
#include <cstdint>
#include <immintrin.h>

uint64_t big_endian_cmp_mask(__m512i max, __m512i value) {
  const auto byte_reverse = [](__m512i value) -> __m512i {
    return _mm512_permutexvar_epi8(
        _mm512_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                        8, 9, 10, 11, 12, 13, 14, 15,   //
                        16, 17, 18, 19, 20, 21, 22, 23, //
                        24, 25, 26, 27, 28, 29, 30, 31, //
                        32, 33, 34, 35, 36, 37, 38, 39, //
                        40, 41, 42, 43, 44, 45, 46, 47, //
                        48, 49, 50, 51, 52, 53, 54, 55, //
                        56, 57, 58, 59, 60, 61, 62, 63),
        value);
  };
  return _mm512_cmpeq_epi8_mask(byte_reverse(max), byte_reverse(value));
}

static void BM_BigEndianCmpMask(benchmark::State &state) {
  __m512i a = {};
  __m512i b = {};
  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(b);
    benchmark::DoNotOptimize(big_endian_cmp_mask(a, b));
  }
}
// Register the function as a benchmark
BENCHMARK(BM_BigEndianCmpMask);

BENCHMARK_MAIN();

gchatelet@intel-sapphire-rapids:~$ clang++-17 -O3 -march=tigerlake benchmark.cpp -isystem git/benchmark/include -L git/benchmark/build/src -lbenchmark -lpthread
gchatelet@intel-sapphire-rapids:~$ llvm-nm --print-size --radix=d ./a.out | grep _Z19big_endian_cmp_maskDv8_xS_
0000000000031584 0000000000000101 T _Z19big_endian_cmp_maskDv8_xS_
gchatelet@intel-sapphire-rapids:~$ llvm-objdump --disassemble-symbols=_Z19big_endian_cmp_maskDv8_xS_ ./a.out 

./a.out:        file format elf64-x86-64


Disassembly of section .text:

0000000000007b60 <_Z19big_endian_cmp_maskDv8_xS_>:
    7b60: 62 f1 7d 48 74 c1             vpcmpeqb        %zmm1, %zmm0, %k0
    7b66: c4 e1 fb 93 c0                kmovq   %k0, %rax
    7b6b: 48 0f c8                      bswapq  %rax
    7b6e: 48 89 c1                      movq    %rax, %rcx
    7b71: 48 c1 e9 04                   shrq    $4, %rcx
    7b75: 48 ba 0f 0f 0f 0f 0f 0f 0f 0f movabsq $1085102592571150095, %rdx
    7b7f: 48 21 d1                      andq    %rdx, %rcx
    7b82: 48 21 d0                      andq    %rdx, %rax
    7b85: 48 c1 e0 04                   shlq    $4, %rax
    7b89: 48 09 c8                      orq     %rcx, %rax
    7b8c: 48 b9 33 33 33 33 33 33 33 33 movabsq $3689348814741910323, %rcx
    7b96: 48 89 c2                      movq    %rax, %rdx
    7b99: 48 21 ca                      andq    %rcx, %rdx
    7b9c: 48 c1 e8 02                   shrq    $2, %rax
    7ba0: 48 21 c8                      andq    %rcx, %rax
    7ba3: 48 8d 04 90                   leaq    (%rax,%rdx,4), %rax
    7ba7: 48 b9 55 55 55 55 55 55 55 55 movabsq $6148914691236517205, %rcx
    7bb1: 48 89 c2                      movq    %rax, %rdx
    7bb4: 48 21 ca                      andq    %rcx, %rdx
    7bb7: 48 d1 e8                      shrq    %rax
    7bba: 48 21 c8                      andq    %rcx, %rax
    7bbd: 48 8d 04 50                   leaq    (%rax,%rdx,2), %rax
    7bc1: c5 f8 77                      vzeroupper
    7bc4: c3                            retq
    7bc5: 66 66 2e 0f 1f 84 00 00 00 00 00      nopw    %cs:(%rax,%rax)
gchatelet@intel-sapphire-rapids:~$ ./a.out --benchmark_repetitions=10 --benchmark_min_warmup_time=1
2024-01-09T14:02:59+00:00
Running ./a.out
Run on (22 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x11)
  L1 Instruction 32 KiB (x11)
  L2 Unified 2048 KiB (x11)
  L3 Unified 107520 KiB (x1)
Load Average: 0.02, 0.04, 0.02
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask              1.69 ns         1.69 ns    415357292
BM_BigEndianCmpMask_mean         1.69 ns         1.69 ns           10
BM_BigEndianCmpMask_median       1.69 ns         1.69 ns           10
BM_BigEndianCmpMask_stddev      0.000 ns        0.000 ns           10
BM_BigEndianCmpMask_cv           0.02 %          0.02 %            10
gchatelet@intel-sapphire-rapids:~$ g++-10 -O3 -march=tigerlake benchmark.cpp -isystem git/benchmark/include -L git/benchmark/build/src -lbenchmark -lpthread
gchatelet@intel-sapphire-rapids:~$ llvm-nm --print-size --radix=d ./a.out | grep _Z19big_endian_cmp_maskDv8_xS_
0000000000027056 0000000000000220 t _GLOBAL__sub_I__Z19big_endian_cmp_maskDv8_xS_
0000000000026864 0000000000000048 t _GLOBAL__sub_I__Z19big_endian_cmp_maskDv8_xS_.cold
0000000000031968 0000000000000034 T _Z19big_endian_cmp_maskDv8_xS_
gchatelet@intel-sapphire-rapids:~$ llvm-objdump --disassemble-symbols=_Z19big_endian_cmp_maskDv8_xS_ ./a.out 

./a.out:        file format elf64-x86-64


Disassembly of section .text:

0000000000007ce0 <_Z19big_endian_cmp_maskDv8_xS_>:
    7ce0: 62 f1 fd 48 6f 15 d6 a3 03 00 vmovdqa64       238550(%rip), %zmm2  # 420c0 <_IO_stdin_used+0xc0>
    7cea: 62 f2 6d 48 8d c9             vpermb  %zmm1, %zmm2, %zmm1
    7cf0: 62 f2 6d 48 8d c0             vpermb  %zmm0, %zmm2, %zmm0
    7cf6: 62 f1 7d 48 74 c1             vpcmpeqb        %zmm1, %zmm0, %k0
    7cfc: c4 e1 fb 93 c0                kmovq   %k0, %rax
    7d01: c3                            retq
    7d02: 66 2e 0f 1f 84 00 00 00 00 00 nopw    %cs:(%rax,%rax)
    7d0c: 0f 1f 40 00                   nopl    (%rax)
gchatelet@intel-sapphire-rapids:~$ ./a.out --benchmark_repetitions=10 --benchmark_min_warmup_time=1
2024-01-09T14:03:35+00:00
Running ./a.out
Run on (22 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x11)
  L1 Instruction 32 KiB (x11)
  L2 Unified 2048 KiB (x11)
  L3 Unified 107520 KiB (x1)
Load Average: 0.11, 0.06, 0.02
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask              1.01 ns         1.01 ns    695774029
BM_BigEndianCmpMask_mean         1.01 ns         1.01 ns           10
BM_BigEndianCmpMask_median       1.01 ns         1.01 ns           10
BM_BigEndianCmpMask_stddev      0.000 ns        0.000 ns           10
BM_BigEndianCmpMask_cv           0.02 %          0.02 %            10
gchatelet@intel-sapphire-rapids:~$ 

@llvmbot
llvmbot commented Jan 9, 2024

@llvm/issue-subscribers-backend-x86

Author: Guillaume Chatelet (gchatelet)

@RKSimon
Collaborator

RKSimon commented Jan 9, 2024

We can get most of the way with a DAG peephole to fold the bitreverse back into a v64i1 shuffle:

define i64 @reverse_v64i1(<8 x i64> %max, <8 x i64>  %value) {
entry:
  %0 = bitcast <8 x i64> %max to <64 x i8>
  %1 = bitcast <8 x i64> %value to <64 x i8>
  %2 = icmp eq <64 x i8> %0, %1
  %3 = shufflevector <64 x i1> %2, <64 x i1> poison, <64 x i32> <i32 63, i32 62, i32 61, i32 60, i32 59, i32 58, i32 57, i32 56, i32 55, i32 54, i32 53, i32 52, i32 51, i32 50, i32 49, i32 48, i32 47, i32 46, i32 45, i32 44, i32 43, i32 42, i32 41, i32 40, i32 39, i32 38, i32 37, i32 36, i32 35, i32 34, i32 33, i32 32, i32 31, i32 30, i32 29, i32 28, i32 27, i32 26, i32 25, i32 24, i32 23, i32 22, i32 21, i32 20, i32 19, i32 18, i32 17, i32 16, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %4 = bitcast <64 x i1> %3 to i64
  ret i64 %4
}

https://simd.godbolt.org/z/1Wej3cxj9

RKSimon self-assigned this Jan 9, 2024
@gchatelet
Contributor Author

We may want to make sure that the 256-bit version is fixed as well
https://godbolt.org/z/v5EMbr8eE

uint64_t big_endian_cmp_mask(__m256i max, __m256i value) {
  const auto byte_reverse = [](__m256i value) -> __m256i {
    return _mm256_permutexvar_epi8(
        _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                        8, 9, 10, 11, 12, 13, 14, 15,   //
                        16, 17, 18, 19, 20, 21, 22, 23, //
                        24, 25, 26, 27, 28, 29, 30, 31),
        value);
  };
  return _mm256_movemask_epi8(
      _mm256_cmpeq_epi8(byte_reverse(max), byte_reverse(value)));
}

Although it can be rewritten in a form where the bitreverse optimization doesn't kick in
https://godbolt.org/z/cx6KqPGPd

uint64_t big_endian_cmp_mask(__m256i max, __m256i value) {
  const auto byte_reverse = [](__m256i value) -> __m256i {
    return _mm256_permutexvar_epi8(
        _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                        8, 9, 10, 11, 12, 13, 14, 15,   //
                        16, 17, 18, 19, 20, 21, 22, 23, //
                        24, 25, 26, 27, 28, 29, 30, 31),
        value);
  };
  return _mm256_movemask_epi8(byte_reverse(_mm256_cmpeq_epi8(max, value)));
}

AFAIK this version cannot be written with avx512 but I might be wrong.

RKSimon added a commit that referenced this issue Jan 9, 2024
RKSimon closed this as completed in 3210ce2 Jan 9, 2024
@RKSimon
Collaborator

RKSimon commented Jan 9, 2024

Closing for now - please reopen (or create a new ticket) if you think it's worth trying to pre-shuffle the comparison arguments.

@gchatelet
Contributor Author

Thx a lot for the quick fix @RKSimon 🙏

RKSimon added a commit that referenced this issue Jan 10, 2024
RKSimon added a commit that referenced this issue Jan 10, 2024

…,permute(y)) for 32/64-bit element vectors

Noticed in #77459 - for wider element types, it's usually better to pre-shuffle the comparison arguments if we can, like we already do for broadcasts
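As an illustration of what pre-shuffling the comparison arguments means (an assumed sketch only, not the code of the patch itself, and with made-up function names), the idea is to turn a reversal of the i1 compare result into a reversal of the operands, so everything stays in vector registers, e.g. for 64-bit elements:

; reversal applied to the <8 x i1> compare result
define i8 @reverse_mask_after_cmp(<8 x i64> %x, <8 x i64> %y) {
entry:
  %c = icmp eq <8 x i64> %x, %y
  %r = shufflevector <8 x i1> %c, <8 x i1> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %m = bitcast <8 x i1> %r to i8
  ret i8 %m
}

; reversal applied to the operands before the compare
define i8 @reverse_operands_then_cmp(<8 x i64> %x, <8 x i64> %y) {
entry:
  %xr = shufflevector <8 x i64> %x, <8 x i64> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %yr = shufflevector <8 x i64> %y, <8 x i64> poison, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %c = icmp eq <8 x i64> %xr, %yr
  %m = bitcast <8 x i1> %c to i8
  ret i8 %m
}

Both functions compute the same mask; the second form lets the backend emit a vector permute plus compare instead of bit-manipulating the mask in GPRs.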
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this issue Jan 28, 2024

X86 doesn't have a BITREVERSE instruction, so if we're working with a casted boolean vector, we're better off shuffling the vector instead if we have PSHUFB (SSSE3 or later)

Fixes llvm#77459

justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this issue Jan 28, 2024

…,permute(y)) for 32/64-bit element vectors

Noticed in llvm#77459 - for wider element types, it's usually better to pre-shuffle the comparison arguments if we can, like we already do for broadcasts