
[AMDGPU] - Add constant folding for s_bitreplicate #72366

Merged: 4 commits merged into llvm:main on Nov 16, 2023

Conversation

OutOfCache (Contributor):
If the input is a constant, we can constant fold the s_bitreplicate operation.
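For context: s_bitreplicate_b64_b32 duplicates each bit of a 32-bit source into two adjacent bits of a 64-bit result, so each input nibble abcd becomes the output byte aabbccdd. A minimal standalone sketch of these semantics (the helper name bitReplicate is hypothetical, for illustration only):

  #include <cstdint>

  // Bit i of the 32-bit input appears at bits 2*i and 2*i+1 of the result.
  uint64_t bitReplicate(uint32_t In) {
    uint64_t Out = 0;
    for (unsigned I = 0; I != 32; ++I)
      if (In & (1u << I))
        Out |= 3ull << (2 * I); // set both copies of bit I
    return Out;
  }

  // e.g. bitReplicate(0x85FE3A92) == 0xC033FFFC0FCCC30C, matching the test update below.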

@llvmbot (Collaborator) commented Nov 15, 2023:

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: Jessica Del (OutOfCache)

Changes

If the input is a constant, we can constant fold the s_bitreplicate operation.


Full diff: https://github.com/llvm/llvm-project/pull/72366.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/ConstantFolding.cpp (+18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll (+2-2)
diff --git a/llvm/lib/Analysis/ConstantFolding.cpp b/llvm/lib/Analysis/ConstantFolding.cpp
index 966a65ac26b8017..e401efa6c67c2da 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -1533,6 +1533,7 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
   case Intrinsic::amdgcn_perm:
   case Intrinsic::amdgcn_wave_reduce_umin:
   case Intrinsic::amdgcn_wave_reduce_umax:
+  case Intrinsic::amdgcn_s_bitreplicate:
   case Intrinsic::arm_mve_vctp8:
   case Intrinsic::arm_mve_vctp16:
   case Intrinsic::arm_mve_vctp32:
@@ -2422,6 +2423,23 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
 
       return ConstantFP::get(Ty->getContext(), Val);
     }
+
+    case Intrinsic::amdgcn_s_bitreplicate: {
+      uint64_t Val = Op->getZExtValue();
+      uint64_t ReplicatedVal = 0;
+      uint64_t ReplicatedOnes = 0b11;
+      // Input operand is always b32
+      for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
+        uint64_t Bit = Val & 1;
+
+        if (!Bit)
+          continue;
+
+        ReplicatedVal |= ReplicatedOnes;
+      }
+      return ConstantInt::get(Ty, ReplicatedVal);
+    }
+
     default:
       return nullptr;
     }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
index 027c9ef5e7cc349..0937280e6352c80 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
@@ -8,8 +8,8 @@ define i64 @test_s_bitreplicate_constant() {
 ; GFX11-LABEL: test_s_bitreplicate_constant:
 ; GFX11:       ; %bb.0: ; %entry
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    s_bitreplicate_b64_b32 s[0:1], 0x85fe3a92
-; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0xfccc30c
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0xc033fffc
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 entry:
   %br = call i64 @llvm.amdgcn.s.bitreplicate(i32 u0x85FE3A92)
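As a sanity check on the folded constants: each input nibble expands to one output byte (8 → 0xC0, 5 → 0x33, F → 0xFF, E → 0xFC, 3 → 0x0F, A → 0xCC, 9 → 0xC3, 2 → 0x0C), so u0x85FE3A92 bitreplicates to u0xC033FFFC0FCCC30C; the low half 0x0FCCC30C lands in v0 and the high half 0xC033FFFC in v1, matching the updated checks.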

@jayfoad (Contributor) left a comment:

Thanks!

uint64_t ReplicatedOnes = 0b11;
// Input operand is always b32
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
uint64_t Bit = Val & 1;
Contributor:

Nit: could fold this into the only use of Bit.
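A sketch of what the folded form could look like (assuming the loop header stays as in the patch):

  for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1)
    if (Val & 1)
      ReplicatedVal |= ReplicatedOnes;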

@OutOfCache (Contributor, Author) left a comment:

I implemented the s_bitreplicate constant folding, as mentioned in #69209 (comment).

However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

Comment on lines 2429 to 2439 of llvm/lib/Analysis/ConstantFolding.cpp:
uint64_t ReplicatedVal = 0;
uint64_t ReplicatedOnes = 0b11;
// Input operand is always b32
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
uint64_t Bit = Val & 1;

if (!Bit)
continue;

ReplicatedVal |= ReplicatedOnes;
}
OutOfCache (Contributor, Author):

My initial version was more straightforward:

uint64_t ReplicatedVal = 0;

for (unsigned i = 0; i < 32; ++i) {
  uint64_t Bit = (Val >> i) & 1;

  if (!Bit)
    continue;

  ReplicatedVal |= (Bit << (i * 2));
  ReplicatedVal |= (Bit << ((i * 2) + 1));
}

But I realized we don't need to shift the values by i every iteration, since i increases by 1 every time. So instead, I shift the original value by 1 every iteration and shift a mask with two ones (0b11) by 2.
The current version might be harder to read or understand. Let me know what you think.

Contributor:

This is fun - there are lots of different ways to write it! For example:

  for (unsigned I = 0; I < 32; ++I)
    ReplicatedVal |= ((Val & (1ull << I)) * 3) << I;

Or:

  Val = (Val & 0x000000000000FFFF) | (Val & 0x00000000FFFF0000) << 16;
  Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8;
  Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4;
  Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2;
  Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
  ReplicatedVal = Val | Val << 1; // or "Val * 3"

But I think it is your decision how you want to write it - I don't mind too much.

@jayfoad (Contributor) commented Nov 15, 2023:

> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.

The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.

@OutOfCache (Contributor, Author):

>> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?
>
> No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.
>
> The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.

The intrinsics do not set SCC, but the instructions s_wqm and s_quadmask do, right? And the intrinsics themselves only lower to these instructions. It's true that the bit operations themselves can be constant folded, but by removing the instructions from the equation we lose the setting of SCC. At least, that's what I thought? If that is not an issue, implementing constant folding for the intrinsics is of course doable.

@jayfoad (Contributor) commented Nov 15, 2023:

>>> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?
>>
>> No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.
>> The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.
>
> The intrinsics do not set SCC, but the instructions s_wqm and s_quadmask do, right? And the intrinsics themselves only lower to these instructions. It's true that the bit operations themselves can be constant folded, but by removing the instructions from the equation we lose the setting of SCC. At least, that's what I thought? If that is not an issue, implementing constant folding for the intrinsics is of course doable.

Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.

It would be correct to implement the intrinsics in various different ways:

  • you could constant fold it (if the argument is a constant)
  • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
  • you could emit the s_wqm instruction (which would set SCC)

Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.

@OutOfCache (Contributor, Author):

> Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.
>
> It would be correct to implement the intrinsics in various different ways:
>
>   • you could constant fold it (if the argument is a constant)
>   • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
>   • you could emit the s_wqm instruction (which would set SCC)
>
> Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.

Oh, I see. So the behavior of the intrinsic is independent of the behavior of the instruction, despite their link? Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?

@jayfoad (Contributor) commented Nov 15, 2023:

>> Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.
>> It would be correct to implement the intrinsics in various different ways:
>>
>>   • you could constant fold it (if the argument is a constant)
>>   • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
>>   • you could emit the s_wqm instruction (which would set SCC)
>>
>> Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.
>
> Oh, I see. So the behavior of the intrinsic is independent of the behavior of the instruction, despite their link?

Right. The behaviour is supposed to be "the same" at a high level, i.e. it performs the same calculation. But low level details (like clobbering physical registers) simply don't make any sense in LLVM IR.

> Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?

Yes, exactly.

@OutOfCache (Contributor, Author):

> Right. The behaviour is supposed to be "the same" at a high level, i.e. it performs the same calculation. But low level details (like clobbering physical registers) simply don't make any sense in LLVM IR.
>
>> Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?
>
> Yes, exactly.

Thank you so much for the explanation! That makes a lot of sense.

@arsenm (Contributor) commented Nov 15, 2023:

> I implemented the s_bitreplicate constant folding, as mentioned in #69209 (comment).
>
> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

But the intrinsics don't expose the SCC output, so it doesn't matter. Nearly every SCC def is useless.

Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8;
Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4;
Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2;
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
Collaborator:

That last step looks wrong to me. I'd expect 0x111... and 0x222... constants.

Also, doesn't this need some sort of llu prefix in general for the constants?

Contributor:

This might work, but for consistency it should be:

Suggested change:
-  Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
+  Val = (Val & 0x1111111111111111) | (Val & 0x2222222222222222) << 1;

Sorry this was my bug.
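For reference, a sketch of the full expansion with the suggested fix folded in (the ull suffixes are added here for explicitness; strictly speaking, unsuffixed hex literals this wide already take an unsigned 64-bit type in C++):

  // Spread the 32 input bits to the even bit positions of a 64-bit value,
  // step by step: 16-bit groups, then bytes, nibbles, bit pairs, single bits.
  Val = (Val & 0x000000000000FFFFull) | (Val & 0x00000000FFFF0000ull) << 16;
  Val = (Val & 0x000000FF000000FFull) | (Val & 0x0000FF000000FF00ull) << 8;
  Val = (Val & 0x000F000F000F000Full) | (Val & 0x00F000F000F000F0ull) << 4;
  Val = (Val & 0x0303030303030303ull) | (Val & 0x0C0C0C0C0C0C0C0Cull) << 2;
  Val = (Val & 0x1111111111111111ull) | (Val & 0x2222222222222222ull) << 1;
  // Finally duplicate each even-position bit into the adjacent odd position.
  ReplicatedVal = Val | Val << 1; // or "Val * 3"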

@OutOfCache merged commit af05f9f into llvm:main on Nov 16, 2023
3 checks passed
sr-tream pushed a commit to sr-tream/llvm-project referencing this pull request on Nov 20, 2023: "If the input is a constant, we can constant fold the s_bitreplicate operation."
zahiraam pushed a commit to zahiraam/llvm-project referencing this pull request on Nov 20, 2023: "If the input is a constant, we can constant fold the s_bitreplicate operation."