
[X86] Prefer trunc(reduce(x)) over reduce(trunc(x)) #81469

Closed
RKSimon opened this issue Feb 12, 2024 · 2 comments · Fixed by #81852

RKSimon (Collaborator) commented Feb 12, 2024

Reported here: https://discourse.llvm.org/t/avx2-popcount-regression/76926

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
define i32 @popcount8(ptr %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>)

We can avoid the vector truncation, replacing it with a free scalar truncation, by performing the reduction on the v8i64 type:

define i32 @popcount8(ptr %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>)
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)

Godbolt: https://simd.godbolt.org/z/ooK497x7s

We might be best off attempting this fold in vector-combine.

llvmbot (Collaborator) commented Feb 12, 2024

@llvm/issue-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

RKSimon (Collaborator, Author) commented Feb 12, 2024

Alive2: https://alive2.llvm.org/ce/z/phx0Lp

AFAICT we can do this for add/mul/and/or/xor reductions.

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 15, 2024
…fective

Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.

If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.

Fixes llvm#81469
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 16, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 19, 2024
RKSimon added a commit that referenced this issue Feb 19, 2024
…fective (#81852)

Vector truncations can be pretty expensive, especially on X86, whilst scalar truncations are often free.

If the cost of performing the add/mul/and/or/xor reduction is cheap enough on the pre-truncated type, then avoid the vector truncation entirely.

Fixes #81469
3 participants