[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. #82406
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Valery Pykhtin (vpykhtin)

Changes: This fixes a crash when the operand sources for a V_OR instruction reside in different basic blocks.

Full diff: https://github.com/llvm/llvm-project/pull/82406.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 53fc2c0686245f..739ee17464edec 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -739,6 +739,11 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
MachineInstr *SDWAInst = OrSDWADef->getParent();
MachineInstr *OtherInst = OrOtherDef->getParent();
+ // Instruction and operand sources should reside in the same BB.
+ if (SDWAInst->getParent() != MI.getParent() ||
+ OtherInst->getParent() != MI.getParent())
+ break;
+
// Check that OtherInstr is actually bitwise compatible with SDWAInst = their
// destination patterns don't overlap. Compatible instruction can be either
// regular instruction with compatible bitness or SDWA instruction with
@@ -815,7 +820,6 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
return std::make_unique<SDWADstPreserveOperand>(
OrDst, OrSDWADef, OrOtherDef, DstSel);
-
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
index f93456ccacb806..359945ff799434 100644
--- a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
+++ b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
@@ -160,3 +160,70 @@ body: |
S_ENDPGM 0
...
+---
+name: add_f16_u32_preserve_different_bb
+tracksRegLiveness: true
+registers:
+ - { id: 0, class: vreg_64 }
+ - { id: 1, class: vreg_64 }
+ - { id: 2, class: sreg_64 }
+ - { id: 3, class: vgpr_32 }
+ - { id: 4, class: vgpr_32 }
+ - { id: 5, class: vgpr_32 }
+ - { id: 6, class: vgpr_32 }
+ - { id: 7, class: vgpr_32 }
+ - { id: 8, class: vgpr_32 }
+ - { id: 9, class: vgpr_32 }
+ - { id: 10, class: vgpr_32 }
+ - { id: 11, class: vgpr_32 }
+ - { id: 12, class: vgpr_32 }
+ - { id: 13, class: vgpr_32 }
+body: |
+ ; SDWA-LABEL: name: add_f16_u32_preserve_different_bb
+ ; SDWA: bb.0:
+ ; SDWA-NEXT: successors: %bb.1(0x80000000)
+ ; SDWA-NEXT: liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+ ; SDWA-NEXT: {{ $}}
+ ; SDWA-NEXT: [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr30_sgpr31
+ ; SDWA-NEXT: [[COPY1:%[0-9]+]]:vreg_64 = COPY $vgpr2_vgpr3
+ ; SDWA-NEXT: [[COPY2:%[0-9]+]]:vreg_64 = COPY $vgpr0_vgpr1
+ ; SDWA-NEXT: [[FLAT_LOAD_DWORD:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY2]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ ; SDWA-NEXT: [[FLAT_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ ; SDWA-NEXT: [[V_AND_B32_e32_:%[0-9]+]]:vgpr_32 = V_AND_B32_e32 65535, [[FLAT_LOAD_DWORD]], implicit $exec
+ ; SDWA-NEXT: [[V_LSHRREV_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e64 16, [[FLAT_LOAD_DWORD1]], implicit $exec
+ ; SDWA-NEXT: [[V_BFE_U32_e64_:%[0-9]+]]:vgpr_32 = V_BFE_U32_e64 [[FLAT_LOAD_DWORD]], 8, 8, implicit $exec
+ ; SDWA-NEXT: [[V_LSHRREV_B32_e32_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e32 24, [[FLAT_LOAD_DWORD1]], implicit $exec
+ ; SDWA-NEXT: [[V_ADD_F16_sdwa:%[0-9]+]]:vgpr_32 = V_ADD_F16_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 1, 0, 4, 5, implicit $mode, implicit $exec
+ ; SDWA-NEXT: [[V_MUL_F32_sdwa:%[0-9]+]]:vgpr_32 = V_MUL_F32_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 5, 0, 1, 3, implicit $mode, implicit $exec
+ ; SDWA-NEXT: {{ $}}
+ ; SDWA-NEXT: bb.1:
+ ; SDWA-NEXT: [[V_OR_B32_e64_:%[0-9]+]]:vgpr_32 = V_OR_B32_e64 [[V_ADD_F16_sdwa]], [[V_MUL_F32_sdwa]], implicit $exec
+ ; SDWA-NEXT: FLAT_STORE_DWORD [[COPY2]], [[V_OR_B32_e64_]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+ ; SDWA-NEXT: $sgpr30_sgpr31 = COPY [[COPY]]
+ ; SDWA-NEXT: S_SETPC_B64_return $sgpr30_sgpr31
+ bb.0:
+ liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+
+ %2 = COPY $sgpr30_sgpr31
+ %1 = COPY $vgpr2_vgpr3
+ %0 = COPY $vgpr0_vgpr1
+ %3 = FLAT_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+ %4 = FLAT_LOAD_DWORD %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+
+ %5 = V_AND_B32_e32 65535, %3, implicit $exec
+ %6 = V_LSHRREV_B32_e64 16, %4, implicit $exec
+ %7 = V_BFE_U32_e64 %3, 8, 8, implicit $exec
+ %8 = V_LSHRREV_B32_e32 24, %4, implicit $exec
+
+ %9 = V_ADD_F16_e64 0, %5, 0, %6, 0, 0, implicit $mode, implicit $exec
+ %10 = V_LSHLREV_B16_e64 8, %9, implicit $exec
+ %11 = V_MUL_F32_e64 0, %7, 0, %8, 0, 0, implicit $mode, implicit $exec
+ %12 = V_LSHLREV_B32_e64 16, %11, implicit $exec
+
+ bb.1:
+ %13 = V_OR_B32_e64 %10, %12, implicit $exec
+
+ FLAT_STORE_DWORD %0, %13, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+ $sgpr30_sgpr31 = COPY %2
+ S_SETPC_B64_return $sgpr30_sgpr31
+...
✅ With the latest revision this PR passed the C/C++ code formatter.
Why not allow the optimization, and just fix the "Move MI before v_or_b32" code in …
I'm not sure it's safe to move instructions across a possible exec mask change.
I think it should be fine. This happens before control flow lowering, so exec mask changes are not explicit in the MIR anyway. |
I'm trying to understand why it would be valid. Consider:

    %v0:vgpr_32 = sdwa1 .. dst_sel: WORD0 unused_pad
    if (...) {

So this would be valid because 'if' only narrows the exec mask, and we cannot have uninitialized values in %v0, %v1 at some lanes inside the 'then' BB. I probably cannot imagine a situation where the exec mask at the 'then' BB isn't a subset of the exec masks from the BBs of the sdwa1 and sdwa2 instructions.
Right. We are in SSA form which guarantees that (for every thread) a definition of %v0 and %v1 dominates each use. After control flow lowering, this implies that the exec mask at the use will be a subset of the exec mask at the definitions. |
---
name: add_f16_u32_preserve_different_bb
tracksRegLiveness: true
registers:
Can you remove the `registers` section?
Thanks, done.
LGTM, thanks!
Force-pushed from 79c3dd3 to d23f860.
[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. (llvm#82406)

This fixes a crash when operand sources for a V_OR instruction reside in different basic blocks.

Change-Id: I1a5db649e174f76b1c749b02467768bd0d3ca440