
[X86] Prefer trunc(reduce(x)) over reduce(trunc(x)) #81469

Closed
RKSimon opened this issue Feb 12, 2024 · 2 comments · Fixed by #81852

RKSimon (Collaborator) commented Feb 12, 2024

Reported here: https://discourse.llvm.org/t/avx2-popcount-regression/76926

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
define i32 @popcount8(ptr %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)

We can avoid the vector truncation, replacing it with a free scalar truncation, by performing the reduction on the v8i64 type:

define i32 @popcount8(ptr %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)

Godbolt: https://simd.godbolt.org/z/ooK497x7s

We might be best off attempting this fold in vector-combine.

llvmbot (Collaborator) commented Feb 12, 2024

@llvm/issue-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

RKSimon (Collaborator, Author) commented Feb 12, 2024

Alive2: https://alive2.llvm.org/ce/z/phx0Lp

AFAICT we can do this for add/mul/and/or/xor reductions.

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 15, 2024
…fective

Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.

If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.

Fixes llvm#81469
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 16, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 19, 2024
RKSimon added a commit that referenced this issue Feb 19, 2024
…fective (#81852)

Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.

If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.

Fixes #81469
3 participants