[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. #82406
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Valery Pykhtin (vpykhtin)

Changes: This fixes a crash when the operand sources for a V_OR instruction reside in different basic blocks.

Full diff: https://github.com/llvm/llvm-project/pull/82406.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 53fc2c0686245f..739ee17464edec 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -739,6 +739,11 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
MachineInstr *SDWAInst = OrSDWADef->getParent();
MachineInstr *OtherInst = OrOtherDef->getParent();
+ // Instruction and operand sources should reside in the same BB.
+ if (SDWAInst->getParent() != MI.getParent() ||
+ OtherInst->getParent() != MI.getParent())
+ break;
+
// Check that OtherInstr is actually bitwise compatible with SDWAInst = their
// destination patterns don't overlap. Compatible instruction can be either
// regular instruction with compatible bitness or SDWA instruction with
@@ -815,7 +820,6 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
return std::make_unique<SDWADstPreserveOperand>(
OrDst, OrSDWADef, OrOtherDef, DstSel);
-
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
index f93456ccacb806..359945ff799434 100644
--- a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
+++ b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
@@ -160,3 +160,70 @@ body: |
S_ENDPGM 0
...
+---
+name: add_f16_u32_preserve_different_bb
+tracksRegLiveness: true
+registers:
+ - { id: 0, class: vreg_64 }
+ - { id: 1, class: vreg_64 }
+ - { id: 2, class: sreg_64 }
+ - { id: 3, class: vgpr_32 }
+ - { id: 4, class: vgpr_32 }
+ - { id: 5, class: vgpr_32 }
+ - { id: 6, class: vgpr_32 }
+ - { id: 7, class: vgpr_32 }
+ - { id: 8, class: vgpr_32 }
+ - { id: 9, class: vgpr_32 }
+ - { id: 10, class: vgpr_32 }
+ - { id: 11, class: vgpr_32 }
+ - { id: 12, class: vgpr_32 }
+ - { id: 13, class: vgpr_32 }
+body: |
+ ; SDWA-LABEL: name: add_f16_u32_preserve_different_bb
+ ; SDWA: bb.0:
+ ; SDWA-NEXT: successors: %bb.1(0x80000000)
+ ; SDWA-NEXT: liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+ ; SDWA-NEXT: {{ $}}
+ ; SDWA-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr30_sgpr31
+ ; SDWA-NEXT: [[COPY1:%[0-9]+]]:vreg_64 = COPY $vgpr2_vgpr3
+ ; SDWA-NEXT: [[COPY2:%[0-9]+]]:vreg_64 = COPY $vgpr0_vgpr1
+ ; SDWA-NEXT: [[FLAT_LOAD_DWORD:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY2]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ ; SDWA-NEXT: [[FLAT_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ ; SDWA-NEXT: [[V_AND_B32_e32_:%[0-9]+]]:vgpr_32 = V_AND_B32_e32 65535, [[FLAT_LOAD_DWORD]], implicit $exec
+ ; SDWA-NEXT: [[V_LSHRREV_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e64 16, [[FLAT_LOAD_DWORD1]], implicit $exec
+ ; SDWA-NEXT: [[V_BFE_U32_e64_:%[0-9]+]]:vgpr_32 = V_BFE_U32_e64 [[FLAT_LOAD_DWORD]], 8, 8, implicit $exec
+ ; SDWA-NEXT: [[V_LSHRREV_B32_e32_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e32 24, [[FLAT_LOAD_DWORD1]], implicit $exec
+ ; SDWA-NEXT: [[V_ADD_F16_sdwa:%[0-9]+]]:vgpr_32 = V_ADD_F16_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 1, 0, 4, 5, implicit $mode, implicit $exec
+ ; SDWA-NEXT: [[V_MUL_F32_sdwa:%[0-9]+]]:vgpr_32 = V_MUL_F32_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 5, 0, 1, 3, implicit $mode, implicit $exec
+ ; SDWA-NEXT: {{ $}}
+ ; SDWA-NEXT: bb.1:
+ ; SDWA-NEXT: [[V_OR_B32_e64_:%[0-9]+]]:vgpr_32 = V_OR_B32_e64 [[V_ADD_F16_sdwa]], [[V_MUL_F32_sdwa]], implicit $exec
+ ; SDWA-NEXT: FLAT_STORE_DWORD [[COPY2]], [[V_OR_B32_e64_]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+ ; SDWA-NEXT: $sgpr30_sgpr31 = COPY [[COPY]]
+ ; SDWA-NEXT: S_SETPC_B64_return $sgpr30_sgpr31
+ bb.0:
+ liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+
+ %2 = COPY $sgpr30_sgpr31
+ %1 = COPY $vgpr2_vgpr3
+ %0 = COPY $vgpr0_vgpr1
+ %3 = FLAT_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ %4 = FLAT_LOAD_DWORD %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+
+ %5 = V_AND_B32_e32 65535, %3, implicit $exec
+ %6 = V_LSHRREV_B32_e64 16, %4, implicit $exec
+ %7 = V_BFE_U32_e64 %3, 8, 8, implicit $exec
+ %8 = V_LSHRREV_B32_e32 24, %4, implicit $exec
+
+ %9 = V_ADD_F16_e64 0, %5, 0, %6, 0, 0, implicit $mode, implicit $exec
+ %10 = V_LSHLREV_B16_e64 8, %9, implicit $exec
+ %11 = V_MUL_F32_e64 0, %7, 0, %8, 0, 0, implicit $mode, implicit $exec
+ %12 = V_LSHLREV_B32_e64 16, %11, implicit $exec
+
+ bb.1:
+ %13 = V_OR_B32_e64 %10, %12, implicit $exec
+
+ FLAT_STORE_DWORD %0, %13, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+ $sgpr30_sgpr31 = COPY %2
+ S_SETPC_B64_return $sgpr30_sgpr31
+...
✅ With the latest revision this PR passed the C/C++ code formatter.
Why not allow the optimization, and just fix the "Move MI before v_or_b32" code in …
I'm not sure it's safe to move instructions across a possible exec mask change.
I think it should be fine. This happens before control flow lowering, so exec mask changes are not explicit in the MIR anyway. |
I'm trying to understand why it would be valid. Consider:

    %v0:vgpr_32 = sdwa1 .. dst_sel: WORD0 unused_pad
    if (...) {

So this would be valid because 'if' only narrows the exec mask, and we cannot have uninitialized values in %v0, %v1 at some lanes inside the 'then' BB. I probably cannot imagine a situation where the exec mask at the 'then' BB isn't a subset of the exec masks from the BBs of the sdwa1 and sdwa2 instructions.
Right. We are in SSA form which guarantees that (for every thread) a definition of %v0 and %v1 dominates each use. After control flow lowering, this implies that the exec mask at the use will be a subset of the exec mask at the definitions. |
---
name: add_f16_u32_preserve_different_bb
tracksRegLiveness: true
registers:
Can you remove the `registers` section?
Thanks, done.
LGTM, thanks!
Force-pushed from 79c3dd3 to d23f860.
[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. (llvm#82406)

This fixes a crash when operand sources for a V_OR instruction reside in different basic blocks.

Change-Id: I1a5db649e174f76b1c749b02467768bd0d3ca440