
[AMDGPU] - Add constant folding for s_bitreplicate #72366

Merged: 4 commits merged into llvm:main on Nov 16, 2023

Conversation

OutOfCache (Contributor):
If the input is a constant, we can constant fold the s_bitreplicate operation.
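For context: s_bitreplicate_b64_b32 duplicates each bit of a 32-bit source into two adjacent bits of a 64-bit result, so each input nibble abcd becomes the output byte aabbccdd. A minimal standalone sketch of these semantics (the helper name bitReplicate is hypothetical, for illustration only):

  #include <cstdint>

  // Bit i of the 32-bit input appears at bits 2*i and 2*i+1 of the result.
  uint64_t bitReplicate(uint32_t In) {
    uint64_t Out = 0;
    for (unsigned I = 0; I != 32; ++I)
      if (In & (1u << I))
        Out |= 3ull << (2 * I); // set both copies of bit I
    return Out;
  }

  // e.g. bitReplicate(0x85FE3A92) == 0xC033FFFC0FCCC30C, matching the test update below.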

@llvmbot (Collaborator) commented Nov 15, 2023:

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: Jessica Del (OutOfCache)

Changes

If the input is a constant, we can constant fold the s_bitreplicate operation.


Full diff: https://github.com/llvm/llvm-project/pull/72366.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/ConstantFolding.cpp (+18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll (+2-2)
diff --git a/llvm/lib/Analysis/ConstantFolding.cpp b/llvm/lib/Analysis/ConstantFolding.cpp
index 966a65ac26b8017..e401efa6c67c2da 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -1533,6 +1533,7 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
   case Intrinsic::amdgcn_perm:
   case Intrinsic::amdgcn_wave_reduce_umin:
   case Intrinsic::amdgcn_wave_reduce_umax:
+  case Intrinsic::amdgcn_s_bitreplicate:
   case Intrinsic::arm_mve_vctp8:
   case Intrinsic::arm_mve_vctp16:
   case Intrinsic::arm_mve_vctp32:
@@ -2422,6 +2423,23 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
 
       return ConstantFP::get(Ty->getContext(), Val);
     }
+
+    case Intrinsic::amdgcn_s_bitreplicate: {
+      uint64_t Val = Op->getZExtValue();
+      uint64_t ReplicatedVal = 0;
+      uint64_t ReplicatedOnes = 0b11;
+      // Input operand is always b32
+      for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
+        uint64_t Bit = Val & 1;
+
+        if (!Bit)
+          continue;
+
+        ReplicatedVal |= ReplicatedOnes;
+      }
+      return ConstantInt::get(Ty, ReplicatedVal);
+    }
+
     default:
       return nullptr;
     }
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
index 027c9ef5e7cc349..0937280e6352c80 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
@@ -8,8 +8,8 @@ define i64 @test_s_bitreplicate_constant() {
 ; GFX11-LABEL: test_s_bitreplicate_constant:
 ; GFX11:       ; %bb.0: ; %entry
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    s_bitreplicate_b64_b32 s[0:1], 0x85fe3a92
-; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_mov_b32_e32 v0, 0xfccc30c
+; GFX11-NEXT:    v_mov_b32_e32 v1, 0xc033fffc
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 entry:
   %br = call i64 @llvm.amdgcn.s.bitreplicate(i32 u0x85FE3A92)
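As a sanity check on the folded constants: each input nibble expands to one output byte (8 → 0xC0, 5 → 0x33, F → 0xFF, E → 0xFC, 3 → 0x0F, A → 0xCC, 9 → 0xC3, 2 → 0x0C), so u0x85FE3A92 bitreplicates to u0xC033FFFC0FCCC30C; the low half 0x0FCCC30C lands in v0 and the high half 0xC033FFFC in v1, matching the updated checks.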

@jayfoad (Contributor) left a comment:

Thanks!

uint64_t ReplicatedOnes = 0b11;
// Input operand is always b32
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
uint64_t Bit = Val & 1;
Contributor:

Nit: could fold this into the only use of Bit.
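A sketch of what the folded form could look like (assuming the loop header stays as in the patch):

  for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1)
    if (Val & 1)
      ReplicatedVal |= ReplicatedOnes;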

@OutOfCache (Contributor, Author) left a comment:

I implemented the s_bitreplicate constant folding, as mentioned in #69209 (comment).

However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

Comment on lines 2429 to 2439 of llvm/lib/Analysis/ConstantFolding.cpp:
uint64_t ReplicatedVal = 0;
uint64_t ReplicatedOnes = 0b11;
// Input operand is always b32
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
uint64_t Bit = Val & 1;

if (!Bit)
continue;

ReplicatedVal |= ReplicatedOnes;
}
OutOfCache (Contributor, Author):

My initial version was more straightforward:

uint64_t ReplicatedVal = 0;

for (unsigned i = 0; i < 32; ++i) {
  uint64_t Bit = (Val >> i) & 1;

  if (!Bit)
    continue;

  ReplicatedVal |= (Bit << (i * 2));
  ReplicatedVal |= (Bit << ((i * 2) + 1));
}

But I realized we don't need to shift the values by i every iteration, since i increases by 1 every time. So instead, I shift the original value by 1 every iteration and shift a mask with two ones (0b11) by 2.
The current version might be harder to read or understand. Let me know what you think.

Contributor:

This is fun - there are lots of different ways to write it! For example:

  for (unsigned I = 0; I < 32; ++I)
    ReplicatedVal |= ((Val & (1ull << I)) * 3) << I;

Or:

  Val = (Val & 0x000000000000FFFF) | (Val & 0x00000000FFFF0000) << 16;
  Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8;
  Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4;
  Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2;
  Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
  ReplicatedVal = Val | Val << 1; // or "Val * 3"

But I think it is your decision how you want to write it - I don't mind too much.

@jayfoad (Contributor) commented Nov 15, 2023:

> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.

The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.

@OutOfCache (Contributor, Author):

>> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?
>
> No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.
>
> The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.

The intrinsics do not set SCC, but the instructions s_wqm and s_quadmask do, right? And the intrinsics themselves only lower to these instructions. It's true that the bit operations themselves can be constant folded, but by removing the instructions from the equation we lose the setting of SCC. At least, that's what I thought? If that is not an issue, implementing constant folding for the intrinsics is of course doable.

@jayfoad (Contributor) commented Nov 15, 2023:

>>> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?
>>
>> No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour.
>> The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded.
>
> The intrinsics do not set SCC, but the instructions s_wqm and s_quadmask do, right? And the intrinsics themselves only lower to these instructions. It's true that the bit operations themselves can be constant folded, but by removing the instructions from the equation we lose the setting of SCC. At least, that's what I thought? If that is not an issue, implementing constant folding for the intrinsics is of course doable.

Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.

It would be correct to implement the intrinsics in various different ways:

  • you could constant fold it (if the argument is a constant)
  • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
  • you could emit the s_wqm instruction (which would set SCC)

Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.

@OutOfCache (Contributor, Author):

> Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.
>
> It would be correct to implement the intrinsics in various different ways:
>
>   • you could constant fold it (if the argument is a constant)
>   • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
>   • you could emit the s_wqm instruction (which would set SCC)
>
> Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.

Oh, I see. So the behavior of the intrinsic is independent of the behavior of the instruction, despite their link? Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?

@jayfoad (Contributor) commented Nov 15, 2023:

>> Yes, the instructions set SCC, but that is not part of the behaviour required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with.
>> It would be correct to implement the intrinsics in various different ways:
>>
>>   • you could constant fold it (if the argument is a constant)
>>   • you could expand it to a long sequence of VALU instructions (which would not touch SCC)
>>   • you could emit the s_wqm instruction (which would set SCC)
>>
>> Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC.
>
> Oh, I see. So the behavior of the intrinsic is independent of the behavior of the instruction, despite their link?

Right. The behaviour is supposed to be "the same" at a high level, i.e. it performs the same calculation. But low level details (like clobbering physical registers) simply don't make any sense in LLVM IR.

> Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?

Yes, exactly.

@OutOfCache (Contributor, Author):

> Right. The behaviour is supposed to be "the same" at a high level, i.e. it performs the same calculation. But low level details (like clobbering physical registers) simply don't make any sense in LLVM IR.
>
>> Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process?
>
> Yes, exactly.

Thank you so much for the explanation! That makes a lot of sense.

@arsenm (Contributor) commented Nov 15, 2023:

> I implemented the s_bitreplicate constant folding, as mentioned in #69209 (comment).
>
> However, I don't think I can do the same for the s_wqm and s_quadmask intrinsics, because they implicitly set SCC?

But the intrinsics don't expose the SCC output, so it doesn't matter. Nearly every SCC def is useless.

Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8;
Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4;
Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2;
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
Collaborator:

That last step looks wrong to me. I'd expect 0x111... and 0x222... constants.

Also, doesn't this need some sort of llu prefix in general for the constants?

Contributor:

This might work, but for consistency it should be:

Suggested change:
-  Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
+  Val = (Val & 0x1111111111111111) | (Val & 0x2222222222222222) << 1;

Sorry this was my bug.
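For reference, a sketch of the full expansion with the suggested fix folded in (the ull suffixes are added here for explicitness; strictly speaking, unsuffixed hex literals this wide already take an unsigned 64-bit type in C++):

  // Spread the 32 input bits to the even bit positions of a 64-bit value,
  // step by step: 16-bit groups, then bytes, nibbles, bit pairs, single bits.
  Val = (Val & 0x000000000000FFFFull) | (Val & 0x00000000FFFF0000ull) << 16;
  Val = (Val & 0x000000FF000000FFull) | (Val & 0x0000FF000000FF00ull) << 8;
  Val = (Val & 0x000F000F000F000Full) | (Val & 0x00F000F000F000F0ull) << 4;
  Val = (Val & 0x0303030303030303ull) | (Val & 0x0C0C0C0C0C0C0C0Cull) << 2;
  Val = (Val & 0x1111111111111111ull) | (Val & 0x2222222222222222ull) << 1;
  // Finally duplicate each even-position bit into the adjacent odd position.
  ReplicatedVal = Val | Val << 1; // or "Val * 3"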

@OutOfCache merged commit af05f9f into llvm:main on Nov 16, 2023
3 checks passed
sr-tream pushed a commit to sr-tream/llvm-project referencing this pull request on Nov 20, 2023: "If the input is a constant, we can constant fold the s_bitreplicate operation."
zahiraam pushed a commit to zahiraam/llvm-project referencing this pull request on Nov 20, 2023: "If the input is a constant, we can constant fold the s_bitreplicate operation."