
[AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. #82406

Merged: 1 commit into llvm:main on Feb 28, 2024

Conversation

vpykhtin (Contributor)

This fixes a crash when the operand sources for a V_OR instruction reside in different basic blocks.

llvmbot (Collaborator) commented Feb 20, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Valery Pykhtin (vpykhtin)

Changes

This fixes a crash when the operand sources for a V_OR instruction reside in different basic blocks.


Full diff: https://github.com/llvm/llvm-project/pull/82406.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir (+67)
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 53fc2c0686245f..739ee17464edec 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -739,6 +739,11 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
     MachineInstr *SDWAInst = OrSDWADef->getParent();
     MachineInstr *OtherInst = OrOtherDef->getParent();
 
+    // Instruction and operand sources should reside in the same BB.
+    if (SDWAInst->getParent() != MI.getParent() ||
+        OtherInst->getParent() != MI.getParent())
+      break;
+
     // Check that OtherInstr is actually bitwise compatible with SDWAInst = their
     // destination patterns don't overlap. Compatible instruction can be either
     // regular instruction with compatible bitness or SDWA instruction with
@@ -815,7 +820,6 @@ SIPeepholeSDWA::matchSDWAOperand(MachineInstr &MI) {
 
     return std::make_unique<SDWADstPreserveOperand>(
       OrDst, OrSDWADef, OrOtherDef, DstSel);
-
   }
   }
 
diff --git a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
index f93456ccacb806..359945ff799434 100644
--- a/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
+++ b/llvm/test/CodeGen/AMDGPU/sdwa-preserve.mir
@@ -160,3 +160,70 @@ body:             |
     S_ENDPGM 0
 
 ...
+---
+name:            add_f16_u32_preserve_different_bb
+tracksRegLiveness: true
+registers:
+  - { id: 0, class: vreg_64 }
+  - { id: 1, class: vreg_64 }
+  - { id: 2, class: sreg_64 }
+  - { id: 3, class: vgpr_32 }
+  - { id: 4, class: vgpr_32 }
+  - { id: 5, class: vgpr_32 }
+  - { id: 6, class: vgpr_32 }
+  - { id: 7, class: vgpr_32 }
+  - { id: 8, class: vgpr_32 }
+  - { id: 9, class: vgpr_32 }
+  - { id: 10, class: vgpr_32 }
+  - { id: 11, class: vgpr_32 }
+  - { id: 12, class: vgpr_32 }
+  - { id: 13, class: vgpr_32 }
+body:             |
+  ; SDWA-LABEL: name: add_f16_u32_preserve_different_bb
+  ; SDWA: bb.0:
+  ; SDWA-NEXT:   successors: %bb.1(0x80000000)
+  ; SDWA-NEXT:   liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+  ; SDWA-NEXT: {{  $}}
+  ; SDWA-NEXT:   [[COPY:%[0-9]+]]:sreg_64 = COPY $sgpr30_sgpr31
+  ; SDWA-NEXT:   [[COPY1:%[0-9]+]]:vreg_64 = COPY $vgpr2_vgpr3
+  ; SDWA-NEXT:   [[COPY2:%[0-9]+]]:vreg_64 = COPY $vgpr0_vgpr1
+  ; SDWA-NEXT:   [[FLAT_LOAD_DWORD:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY2]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+  ; SDWA-NEXT:   [[FLAT_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = FLAT_LOAD_DWORD [[COPY1]], 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+  ; SDWA-NEXT:   [[V_AND_B32_e32_:%[0-9]+]]:vgpr_32 = V_AND_B32_e32 65535, [[FLAT_LOAD_DWORD]], implicit $exec
+  ; SDWA-NEXT:   [[V_LSHRREV_B32_e64_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e64 16, [[FLAT_LOAD_DWORD1]], implicit $exec
+  ; SDWA-NEXT:   [[V_BFE_U32_e64_:%[0-9]+]]:vgpr_32 = V_BFE_U32_e64 [[FLAT_LOAD_DWORD]], 8, 8, implicit $exec
+  ; SDWA-NEXT:   [[V_LSHRREV_B32_e32_:%[0-9]+]]:vgpr_32 = V_LSHRREV_B32_e32 24, [[FLAT_LOAD_DWORD1]], implicit $exec
+  ; SDWA-NEXT:   [[V_ADD_F16_sdwa:%[0-9]+]]:vgpr_32 = V_ADD_F16_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 1, 0, 4, 5, implicit $mode, implicit $exec
+  ; SDWA-NEXT:   [[V_MUL_F32_sdwa:%[0-9]+]]:vgpr_32 = V_MUL_F32_sdwa 0, [[FLAT_LOAD_DWORD]], 0, [[FLAT_LOAD_DWORD1]], 0, 0, 5, 0, 1, 3, implicit $mode, implicit $exec
+  ; SDWA-NEXT: {{  $}}
+  ; SDWA-NEXT: bb.1:
+  ; SDWA-NEXT:   [[V_OR_B32_e64_:%[0-9]+]]:vgpr_32 = V_OR_B32_e64 [[V_ADD_F16_sdwa]], [[V_MUL_F32_sdwa]], implicit $exec
+  ; SDWA-NEXT:   FLAT_STORE_DWORD [[COPY2]], [[V_OR_B32_e64_]], 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+  ; SDWA-NEXT:   $sgpr30_sgpr31 = COPY [[COPY]]
+  ; SDWA-NEXT:   S_SETPC_B64_return $sgpr30_sgpr31
+  bb.0:
+    liveins: $vgpr0_vgpr1, $vgpr2_vgpr3, $sgpr30_sgpr31
+
+    %2 = COPY $sgpr30_sgpr31
+    %1 = COPY $vgpr2_vgpr3
+    %0 = COPY $vgpr0_vgpr1
+    %3 = FLAT_LOAD_DWORD %0, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+    %4 = FLAT_LOAD_DWORD %1, 0, 0, implicit $exec, implicit $flat_scr :: (load (s32))
+
+    %5 = V_AND_B32_e32 65535, %3, implicit $exec
+    %6 = V_LSHRREV_B32_e64 16, %4, implicit $exec
+    %7 = V_BFE_U32_e64 %3, 8, 8, implicit $exec
+    %8 = V_LSHRREV_B32_e32 24, %4, implicit $exec
+
+    %9 = V_ADD_F16_e64 0, %5, 0, %6, 0, 0, implicit $mode, implicit $exec
+    %10 = V_LSHLREV_B16_e64 8, %9, implicit $exec
+    %11 = V_MUL_F32_e64 0, %7, 0, %8, 0, 0, implicit $mode, implicit $exec
+    %12 = V_LSHLREV_B32_e64 16, %11, implicit $exec
+
+  bb.1:
+    %13 = V_OR_B32_e64 %10, %12, implicit $exec
+
+    FLAT_STORE_DWORD %0, %13, 0, 0, implicit $exec, implicit $flat_scr :: (store (s32))
+    $sgpr30_sgpr31 = COPY %2
+    S_SETPC_B64_return $sgpr30_sgpr31
+...
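For reference, the new function runs through the existing harness in sdwa-preserve.mir; a representative RUN line (the triple and CPU here are an assumption based on similar SDWA tests, not copied from the file) would be:

    # RUN: llc -mtriple=amdgcn -mcpu=fiji -run-pass=si-peephole-sdwa -verify-machineinstrs -o - %s | FileCheck -check-prefix=SDWA %s

The SDWA check lines above show the intended result: V_ADD_F16 and V_MUL_F32 in bb.0 are still converted to their _sdwa forms, while the cross-block V_OR_B32 in bb.1 stays a plain V_OR_B32_e64 rather than triggering the 'preserve' transformation that previously crashed.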

github-actions bot commented Feb 20, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

jayfoad (Contributor) commented Feb 21, 2024

Why not allow the optimization, and just fix the "Move MI before v_or_b32" code in convertToSDWA?
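In code terms, the move in question would be roughly the following sketch (the helper name and framing are illustrative; this is not the actual SIPeepholeSDWA::convertToSDWA code):

    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineInstr.h"

    // Splice MI out of its block and re-insert it immediately before the
    // V_OR_B32 it feeds, even when the two live in different basic blocks.
    static void moveBeforeOr(llvm::MachineInstr &MI, llvm::MachineInstr &OrMI) {
      MI.getParent()->remove(&MI);                        // detach, don't delete
      OrMI.getParent()->insert(OrMI.getIterator(), &MI);  // place before the V_OR
    }

Whether such a cross-block move is legal with respect to the exec mask is exactly what the following comments discuss.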

vpykhtin (Contributor, Author)

> Why not allow the optimization, and just fix the "Move MI before v_or_b32" code in convertToSDWA?

I'm not sure it's safe to move instructions across a possible exec-mask change.

jayfoad (Contributor) commented Feb 21, 2024

> > Why not allow the optimization, and just fix the "Move MI before v_or_b32" code in convertToSDWA?
>
> I'm not sure it's safe to move instructions across a possible exec-mask change.

I think it should be fine. This happens before control flow lowering, so exec mask changes are not explicit in the MIR anyway.

vpykhtin (Contributor, Author)

> I think it should be fine. This happens before control flow lowering, so exec mask changes are not explicit in the MIR anyway.

I'm trying to understand why it would be valid.

    %v0:vgpr_32 = sdwa1 .. dst_sel: WORD0 unused_pad
    %v1:vgpr_32 = sdwa2 .. dst_sel: WORD1 unused_pad

    if (...) {
      %v2:vgpr_32 = V_OR_B32 %v0, %v1
      use %v2
    }

So this would be valid because the 'if' only narrows the exec mask, and %v0 and %v1 cannot hold uninitialized values in any lane that is active inside the 'then' BB. I cannot imagine a situation where the exec mask in the 'then' BB isn't a subset of the exec masks in the BBs containing the sdwa1 and sdwa2 instructions.

jayfoad (Contributor) commented Feb 21, 2024

> I cannot imagine a situation where the exec mask in the 'then' BB isn't a subset of the exec masks in the BBs containing the sdwa1 and sdwa2 instructions.

Right. We are in SSA form, which guarantees that (for every thread) the definitions of %v0 and %v1 dominate each use. After control flow lowering, this implies that the exec mask at the use will be a subset of the exec mask at the definitions.
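To make the subset claim concrete, here is a rough sketch of what the 'if' from the example above becomes after control-flow lowering (the save/restore sequence shown is the usual SI lowering pattern, assumed for illustration; it is not taken from this PR):

    %saved:sreg_64 = S_AND_SAVEEXEC_B64 %cond   ; saved = exec; exec = exec & cond
    ; 'then' block: exec has only been narrowed, so every lane active here
    ; was already active when sdwa1/sdwa2 defined %v0 and %v1
    %v2:vgpr_32 = V_OR_B32 %v0, %v1
    $exec = S_OR_B64 $exec, %saved              ; join: restore the original mask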

@vpykhtin vpykhtin changed the title [AMDGPU] Prevent SDWA 'preserve' transformation for instructions in different basic blocks. [AMDGPU] Fix SDWA 'preserve' transformation for instructions in different basic blocks. Feb 22, 2024
Review comment on the new test in sdwa-preserve.mir:

    ---
    name: add_f16_u32_preserve_different_bb
    tracksRegLiveness: true
    registers:

jayfoad (Contributor): Can you remove the registers section?

vpykhtin (Contributor, Author): Thanks, done.

jayfoad (Contributor) left a review comment:

LGTM, thanks!

@vpykhtin vpykhtin force-pushed the fix_sdwa_preserve branch 5 times, most recently from 79c3dd3 to d23f860 on February 28, 2024 at 12:12
@vpykhtin vpykhtin merged commit a845ea3 into llvm:main Feb 28, 2024
3 of 4 checks passed
@vpykhtin vpykhtin deleted the fix_sdwa_preserve branch February 28, 2024 13:47
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Jul 17, 2024
…rent basic blocks. (llvm#82406)

This fixes crash when operand sources for V_OR instruction reside in
different basic blocks.

Change-Id: I1a5db649e174f76b1c749b02467768bd0d3ca440
jrbyrnes pushed a commit to jrbyrnes/llvm-project that referenced this pull request Jul 31, 2024
…rent basic blocks. (llvm#82406)

This fixes crash when operand sources for V_OR instruction reside in
different basic blocks.

Change-Id: I1a5db649e174f76b1c749b02467768bd0d3ca440