-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] - Add constant folding for s_bitreplicate #72366
[AMDGPU] - Add constant folding for s_bitreplicate #72366
Conversation
If the input is a constant, we can constant fold the s_bitreplicate operation.
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-amdgpu Author: Jessica Del (OutOfCache) ChangesIf the input is a constant, we can constant fold the s_bitreplicate operation. Full diff: https://github.com/llvm/llvm-project/pull/72366.diff 2 Files Affected:
diff --git a/llvm/lib/Analysis/ConstantFolding.cpp b/llvm/lib/Analysis/ConstantFolding.cpp
index 966a65ac26b8017..e401efa6c67c2da 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -1533,6 +1533,7 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
case Intrinsic::amdgcn_perm:
case Intrinsic::amdgcn_wave_reduce_umin:
case Intrinsic::amdgcn_wave_reduce_umax:
+ case Intrinsic::amdgcn_s_bitreplicate:
case Intrinsic::arm_mve_vctp8:
case Intrinsic::arm_mve_vctp16:
case Intrinsic::arm_mve_vctp32:
@@ -2422,6 +2423,23 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
return ConstantFP::get(Ty->getContext(), Val);
}
+
+ case Intrinsic::amdgcn_s_bitreplicate: {
+ uint64_t Val = Op->getZExtValue();
+ uint64_t ReplicatedVal = 0;
+ uint64_t ReplicatedOnes = 0b11;
+ // Input operand is always b32
+ for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) {
+ uint64_t Bit = Val & 1;
+
+ if (!Bit)
+ continue;
+
+ ReplicatedVal |= ReplicatedOnes;
+ }
+ return ConstantInt::get(Ty, ReplicatedVal);
+ }
+
default:
return nullptr;
}
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
index 027c9ef5e7cc349..0937280e6352c80 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.bitreplicate.ll
@@ -8,8 +8,8 @@ define i64 @test_s_bitreplicate_constant() {
; GFX11-LABEL: test_s_bitreplicate_constant:
; GFX11: ; %bb.0: ; %entry
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT: s_bitreplicate_b64_b32 s[0:1], 0x85fe3a92
-; GFX11-NEXT: v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT: v_mov_b32_e32 v0, 0xfccc30c
+; GFX11-NEXT: v_mov_b32_e32 v1, 0xc033fffc
; GFX11-NEXT: s_setpc_b64 s[30:31]
entry:
%br = call i64 @llvm.amdgcn.s.bitreplicate(i32 u0x85FE3A92)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
uint64_t ReplicatedOnes = 0b11; | ||
// Input operand is always b32 | ||
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) { | ||
uint64_t Bit = Val & 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could fold this into the only use of Bit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented the s_bitreplicate
constant folding, as mentioned in #69209 (comment).
However, I don't think I can do the same for the s_wqm
and s_quadmask
intrinsics, because they implicitly set SCC?
uint64_t ReplicatedVal = 0; | ||
uint64_t ReplicatedOnes = 0b11; | ||
// Input operand is always b32 | ||
for (unsigned i = 0; i < 32; ++i, ReplicatedOnes <<= 2, Val >>= 1) { | ||
uint64_t Bit = Val & 1; | ||
|
||
if (!Bit) | ||
continue; | ||
|
||
ReplicatedVal |= ReplicatedOnes; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial version looked more straightforward like this:
uint64_t ReplicatedVal = 0;
for (unsigned i = 0; i < 32; ++i) {
uint64_t Bit = (Val >> i) & 1;
if (!Bit)
continue;
ReplicatedVal |= (Bit << (i * 2));
ReplicatedVal |= (Bit << ((i * 2) + 1);
}
But I realized we don't need to shift the values by i
every iteration, since i
increases by 1 every time. So instead, I shift the original value by 1 every iteration and shift a mask with two ones (0b11
) by 2.
The current version might be harder to read or understand. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fun - there are lots of different ways to write it! For example:
for (I = 0; I < 32; ++I)
ReplicatedVal |= ((Val & (1 << I)) * 3) << I;
Or:
Val = (Val & 0x000000000000FFFF) | (Val & 0x00000000FFFF0000) << 16;
Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8;
Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4;
Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2;
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1;
ReplicatedVal = Val | Val << 1; // or "Val * 3"
But I think it is your decision how you want to write it - I don't mind too much.
No, that's not true. The intrinsics do not set SCC. It does not make sense for an intrinsic to be defined as changing the value of some physical register, because it would be impossible to make use of that behaviour. The definition of the intrinsics is that they just do bit twiddling on their input value to produce a result. They can be constant-folded. |
The intrinsics do not set SCC, but the instructions |
Yes the instructions set SCC but that is not part of the behaviour that is required to implement the intrinsic. It is just an inconvenience that codegen passes have to cope with. It would be correct to implement the intrinsics in various different ways:
Maybe the confusion here is that we never wrote down an exact specification for what the intrinsics do? There is an implied specification of "do whatever the corresponding instruction does", but that does not (should not) include side effects like setting SCC. |
Oh, I see. So the behavior of the intrinsic is independent from the behavior of the instruction, despite their link? Is this possible because at the IR level, we are not aware of SCC yet and the compiler handles SCC during the lowering process? |
Right. The behaviour is supposed to be "the same" at a high level, i.e. it performs the same calculation. But low level details (like clobbering physical registers) simply don't make any sense in LLVM IR.
Yes, exactly. |
Thank you so much for the explanation! That makes a lot of sense. |
But the intrinsics don't expose the SCC output so it doesn't matter. Nearly every SCC def is useless |
Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8; | ||
Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4; | ||
Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2; | ||
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That last step looks wrong to me. I'd expect 0x111... and 0x222... constants.
Also, doesn't this need some sort of llu
prefix in general for the constants?
Val = (Val & 0x000000FF000000FF) | (Val & 0x0000FF000000FF00) << 8; | ||
Val = (Val & 0x000F000F000F000F) | (Val & 0x00F000F000F000F0) << 4; | ||
Val = (Val & 0x0303030303030303) | (Val & 0x0C0C0C0C0C0C0C0C) << 2; | ||
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might work, but for consistency it should be:
Val = (Val & 0x5555555555555555) | (Val & 0xAAAAAAAAAAAAAAAA) << 1; | |
Val = (Val & 0x1111111111111111) | (Val & 0x2222222222222222) << 1; |
Sorry this was my bug.
If the input is a constant, we can constant fold the s_bitreplicate operation.
If the input is a constant, we can constant fold the s_bitreplicate operation.
If the input is a constant, we can constant fold the s_bitreplicate operation.