
[AMDGPU][True16] Add legalization/selection handling for G_MERGE_VALUES of 2 s16 -> s32#200082

Open
saxlungs wants to merge 1 commit into users/saxlungs/true16-merge-values-s16-testing from users/saxlungs/true16-merge-values-s16


@saxlungs
Contributor

@saxlungs saxlungs commented May 27, 2026

Stack created with GitHub Stacks CLI

Stack PRs:
#200081

@saxlungs
Contributor Author

@petar-avramovic @kosarev @Sisyph @broxigarchen Context for this PR (and the stacked test PR): a downstream change reorders the legalizer rules for G_MERGE_VALUES and G_UNMERGE_VALUES to avoid merging two s16 values into an s32. When we tried to upstream that change, we found that it causes many regressions, so the plan is to revert it downstream. However, reverting it breaks some downstream tests.

This change instead implements a method of legalizing and selecting the G_MERGE_VALUES pseudo in this case. If this goes in, then the downstream change can be safely reverted.
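For reviewers who want the arithmetic spelled out, here is a minimal standalone sketch of what the new SGPR path computes, following the `dst = (lo & 0xFFFF) | (hi << 16)` comment in the patch. The function names are hypothetical and not part of the patch; the sketch also assumes that each s16 source lives in the low half of a 32-bit register whose upper half may hold arbitrary bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the selected sequence. Each argument stands for a
// 32-bit SGPR whose low 16 bits carry the s16 value (an assumption).
constexpr uint32_t mergeS16Pair(uint32_t Lo, uint32_t Hi) {
  uint32_t MaskedLo = Lo & 0xFFFFu; // S_AND_B32: drop anything above bit 15
  uint32_t ShiftedHi = Hi << 16;    // S_LSHL_B32: bits above 15 of Hi fall off
  return MaskedLo | ShiftedHi;      // S_OR_B32
}

// Hypothetical self-check helper for the model above.
inline void checkMergeS16Pair() {
  assert(mergeS16Pair(0x1234u, 0xABCDu) == 0xABCD1234u);
  // Stale upper bits in either input must not leak into the result.
  assert(mergeS16Pair(0xFFFF1234u, 0xFFFFABCDu) == 0xABCD1234u);
}
```

Only the low operand needs an explicit mask in this model, because the left shift already discards the upper half of the high operand.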

@saxlungs saxlungs marked this pull request as ready for review May 27, 2026 23:49
@saxlungs saxlungs requested a review from vangthao95 as a code owner May 27, 2026 23:49
@llvmorg-github-actions

llvmorg-github-actions Bot commented May 27, 2026

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Domenic Nutile (saxlungs)

Changes

<sub>Stack created with <a href="https://github.com/github/gh-stack">GitHub Stacks CLI</a> • <a href="https://gh.io/stacks-feedback">Give Feedback 💬</a></sub>

Stack PRs:
#200081


Patch is 178.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/200082.diff

6 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+42-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll (+178-59)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll (+997-494)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll (+892-443)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/orn2.ll (+178-59)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 4ca2de216f487..555d274ab23fb 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -677,8 +677,49 @@ bool AMDGPUInstructionSelector::selectG_MERGE_VALUES(MachineInstr &MI) const {
   LLT SrcTy = MRI->getType(MI.getOperand(1).getReg());
 
   const unsigned SrcSize = SrcTy.getSizeInBits();
-  if (SrcSize < 32)
+  if (SrcSize < 32) {
+    // Handle sgpr32 <- G_MERGE_VALUES sgpr16, sgpr16
+    if (SrcSize == 16 && DstTy.getSizeInBits() == 32 &&
+        MI.getNumOperands() == 3) {
+      Register Lo = MI.getOperand(1).getReg();
+      Register Hi = MI.getOperand(2).getReg();
+
+      const RegisterBank *DstBank = RBI.getRegBank(DstReg, *MRI, TRI);
+      const RegisterBank *LoBank = RBI.getRegBank(Lo, *MRI, TRI);
+      const RegisterBank *HiBank = RBI.getRegBank(Hi, *MRI, TRI);
+
+      if (DstBank->getID() == AMDGPU::SGPRRegBankID &&
+          LoBank->getID() == AMDGPU::SGPRRegBankID &&
+          HiBank->getID() == AMDGPU::SGPRRegBankID) {
+        const DebugLoc &DL = MI.getDebugLoc();
+
+        // Mask and shift: dst = (lo & 0xFFFF) | (hi << 16)
+        Register MaskedLo =
+            MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
+        BuildMI(*BB, &MI, DL, TII.get(AMDGPU::S_AND_B32), MaskedLo)
+            .addReg(Lo)
+            .addImm(0xFFFF);
+
+        Register ShiftedHi =
+            MRI->createVirtualRegister(&AMDGPU::SReg_32RegClass);
+        BuildMI(*BB, &MI, DL, TII.get(AMDGPU::S_LSHL_B32), ShiftedHi)
+            .addReg(Hi)
+            .addImm(16);
+
+        BuildMI(*BB, &MI, DL, TII.get(AMDGPU::S_OR_B32), DstReg)
+            .addReg(MaskedLo)
+            .addReg(ShiftedHi);
+
+        if (!RBI.constrainGenericRegister(DstReg, AMDGPU::SReg_32RegClass,
+                                          *MRI))
+          return false;
+
+        MI.eraseFromParent();
+        return true;
+      }
+    }
     return selectImpl(MI, *CoverageInfo);
+  }
 
   const DebugLoc &DL = MI.getDebugLoc();
   const RegisterBank *DstBank = RBI.getRegBank(DstReg, *MRI, TRI);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
index fdcbdc8712f01..5eba5e2e491da 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
@@ -706,7 +706,9 @@ RegBankLegalizeRules::RegBankLegalizeRules(const GCNSubtarget &_ST,
       .Any({{DivBRC, BRC}, {{}, {}, ApplyAllVgpr}});
 
   addRulesForGOpcs({G_MERGE_VALUES, G_CONCAT_VECTORS})
+      .Any({{UniBRC, S16}, {{}, {}, VerifyAllSgpr}})
       .Any({{UniBRC, BRC}, {{}, {}, VerifyAllSgpr}})
+      .Any({{DivBRC, S16}, {{}, {}, ApplyAllVgpr}})
       .Any({{DivBRC, BRC}, {{}, {}, ApplyAllVgpr}});
 
   addRulesForGOpcs({G_PHI})
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
index 22daebe753b1c..79037582cd892 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
@@ -3,7 +3,7 @@
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=tahiti < %s | FileCheck -check-prefixes=GCN,GFX6 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < %s | FileCheck -check-prefixes=GCN,GFX9 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX10 %s
-; RUN: not llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s
+; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=+real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-TRUE16 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -mattr=-real-true16 -amdgpu-enable-delay-alu=0 < %s | FileCheck -check-prefixes=GFX10PLUS,GFX11,GFX11-FAKE16 %s
 
 ; FIXME: regbankcombiner regression, related to:
@@ -459,12 +459,26 @@ define i16 @v_andn2_i16(i16 %src0, i16 %src1) {
 ; GCN-NEXT:    v_and_b32_e32 v0, v0, v1
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX10PLUS-LABEL: v_andn2_i16:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10PLUS-NEXT:    v_xor_b32_e32 v1, -1, v1
-; GFX10PLUS-NEXT:    v_and_b32_e32 v0, v0, v1
-; GFX10PLUS-NEXT:    s_setpc_b64 s[30:31]
+; GFX10-LABEL: v_andn2_i16:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10-NEXT:    v_xor_b32_e32 v1, -1, v1
+; GFX10-NEXT:    v_and_b32_e32 v0, v0, v1
+; GFX10-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-TRUE16-LABEL: v_andn2_i16:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT:    v_xor_b16 v0.h, v1.l, -1
+; GFX11-TRUE16-NEXT:    v_and_b16 v0.l, v0.l, v0.h
+; GFX11-TRUE16-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-FAKE16-LABEL: v_andn2_i16:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-FAKE16-NEXT:    v_xor_b32_e32 v1, -1, v1
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, v0, v1
+; GFX11-FAKE16-NEXT:    s_setpc_b64 s[30:31]
   %not.src1 = xor i16 %src1, -1
   %and = and i16 %src0, %not.src1
   ret i16 %and
@@ -478,12 +492,26 @@ define amdgpu_ps float @v_andn2_i16_sv(i16 inreg %src0, i16 %src1) {
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: v_andn2_i16_sv:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    v_xor_b32_e32 v0, -1, v0
-; GFX10PLUS-NEXT:    v_and_b32_e32 v0, s2, v0
-; GFX10PLUS-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-LABEL: v_andn2_i16_sv:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    v_xor_b32_e32 v0, -1, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, s2, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX10-NEXT:    ; return to shader part epilog
+;
+; GFX11-TRUE16-LABEL: v_andn2_i16_sv:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    v_xor_b16 v0.l, v0.l, -1
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT:    v_and_b16 v0.l, s2, v0.l
+; GFX11-TRUE16-NEXT:    ; return to shader part epilog
+;
+; GFX11-FAKE16-LABEL: v_andn2_i16_sv:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    v_xor_b32_e32 v0, -1, v0
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, s2, v0
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %not.src1 = xor i16 %src1, -1
   %and = and i16 %src0, %not.src1
   %zext = zext i16 %and to i32
@@ -499,12 +527,26 @@ define amdgpu_ps float @v_andn2_i16_vs(i16 %src0, i16 inreg %src1) {
 ; GCN-NEXT:    v_and_b32_e32 v0, 0xffff, v0
 ; GCN-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: v_andn2_i16_vs:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    s_xor_b32 s0, s2, -1
-; GFX10PLUS-NEXT:    v_and_b32_e32 v0, s0, v0
-; GFX10PLUS-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-LABEL: v_andn2_i16_vs:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_xor_b32 s0, s2, -1
+; GFX10-NEXT:    v_and_b32_e32 v0, s0, v0
+; GFX10-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX10-NEXT:    ; return to shader part epilog
+;
+; GFX11-TRUE16-LABEL: v_andn2_i16_vs:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_xor_b32 s0, s2, -1
+; GFX11-TRUE16-NEXT:    v_mov_b16_e32 v0.h, 0
+; GFX11-TRUE16-NEXT:    v_and_b16 v0.l, v0.l, s0
+; GFX11-TRUE16-NEXT:    ; return to shader part epilog
+;
+; GFX11-FAKE16-LABEL: v_andn2_i16_vs:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_xor_b32 s0, s2, -1
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, s0, v0
+; GFX11-FAKE16-NEXT:    v_and_b32_e32 v0, 0xffff, v0
+; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %not.src1 = xor i16 %src1, -1
   %and = and i16 %src0, %not.src1
   %zext = zext i16 %and to i32
@@ -692,17 +734,40 @@ define amdgpu_ps i48 @s_andn2_v3i16(<3 x i16> inreg %src0, <3 x i16> inreg %src1
 ; GFX9-NEXT:    s_and_b32 s1, s1, 0xffff
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: s_andn2_v3i16:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    s_mov_b64 s[0:1], -1
-; GFX10PLUS-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
-; GFX10PLUS-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
-; GFX10PLUS-NEXT:    s_lshr_b32 s2, s0, 16
-; GFX10PLUS-NEXT:    s_and_b32 s0, s0, 0xffff
-; GFX10PLUS-NEXT:    s_lshl_b32 s2, s2, 16
-; GFX10PLUS-NEXT:    s_and_b32 s1, s1, 0xffff
-; GFX10PLUS-NEXT:    s_or_b32 s0, s0, s2
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-LABEL: s_andn2_v3i16:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_mov_b64 s[0:1], -1
+; GFX10-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX10-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
+; GFX10-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX10-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX10-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX10-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX10-NEXT:    s_or_b32 s0, s0, s2
+; GFX10-NEXT:    ; return to shader part epilog
+;
+; GFX11-TRUE16-LABEL: s_andn2_v3i16:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-TRUE16-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX11-TRUE16-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
+; GFX11-TRUE16-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX11-TRUE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-TRUE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-TRUE16-NEXT:    s_or_b32 s0, s0, s2
+; GFX11-TRUE16-NEXT:    ; return to shader part epilog
+;
+; GFX11-FAKE16-LABEL: s_andn2_v3i16:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-FAKE16-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX11-FAKE16-NEXT:    s_and_b64 s[0:1], s[2:3], s[0:1]
+; GFX11-FAKE16-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX11-FAKE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-FAKE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-FAKE16-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX11-FAKE16-NEXT:    s_or_b32 s0, s0, s2
+; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %not.src1 = xor <3 x i16> %src1, <i16 -1, i16 -1, i16 -1>
   %and = and <3 x i16> %src0, %not.src1
   %cast = bitcast <3 x i16> %and to i48
@@ -745,17 +810,40 @@ define amdgpu_ps i48 @s_andn2_v3i16_commute(<3 x i16> inreg %src0, <3 x i16> inr
 ; GFX9-NEXT:    s_and_b32 s1, s1, 0xffff
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: s_andn2_v3i16_commute:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    s_mov_b64 s[0:1], -1
-; GFX10PLUS-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
-; GFX10PLUS-NEXT:    s_and_b64 s[0:1], s[0:1], s[2:3]
-; GFX10PLUS-NEXT:    s_lshr_b32 s2, s0, 16
-; GFX10PLUS-NEXT:    s_and_b32 s0, s0, 0xffff
-; GFX10PLUS-NEXT:    s_lshl_b32 s2, s2, 16
-; GFX10PLUS-NEXT:    s_and_b32 s1, s1, 0xffff
-; GFX10PLUS-NEXT:    s_or_b32 s0, s0, s2
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-LABEL: s_andn2_v3i16_commute:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_mov_b64 s[0:1], -1
+; GFX10-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX10-NEXT:    s_and_b64 s[0:1], s[0:1], s[2:3]
+; GFX10-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX10-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX10-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX10-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX10-NEXT:    s_or_b32 s0, s0, s2
+; GFX10-NEXT:    ; return to shader part epilog
+;
+; GFX11-TRUE16-LABEL: s_andn2_v3i16_commute:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-TRUE16-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX11-TRUE16-NEXT:    s_and_b64 s[0:1], s[0:1], s[2:3]
+; GFX11-TRUE16-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX11-TRUE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-TRUE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-TRUE16-NEXT:    s_or_b32 s0, s0, s2
+; GFX11-TRUE16-NEXT:    ; return to shader part epilog
+;
+; GFX11-FAKE16-LABEL: s_andn2_v3i16_commute:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-FAKE16-NEXT:    s_xor_b64 s[0:1], s[4:5], s[0:1]
+; GFX11-FAKE16-NEXT:    s_and_b64 s[0:1], s[0:1], s[2:3]
+; GFX11-FAKE16-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX11-FAKE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-FAKE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-FAKE16-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX11-FAKE16-NEXT:    s_or_b32 s0, s0, s2
+; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %not.src1 = xor <3 x i16> %src1, <i16 -1, i16 -1, i16 -1>
   %and = and <3 x i16> %not.src1, %src0
   %cast = bitcast <3 x i16> %and to i48
@@ -808,22 +896,55 @@ define amdgpu_ps { i48, i48 } @s_andn2_v3i16_multi_use(<3 x i16> inreg %src0, <3
 ; GFX9-NEXT:    s_and_b32 s3, s5, 0xffff
 ; GFX9-NEXT:    ; return to shader part epilog
 ;
-; GFX10PLUS-LABEL: s_andn2_v3i16_multi_use:
-; GFX10PLUS:       ; %bb.0:
-; GFX10PLUS-NEXT:    s_mov_b64 s[0:1], -1
-; GFX10PLUS-NEXT:    s_xor_b64 s[4:5], s[4:5], s[0:1]
-; GFX10PLUS-NEXT:    s_and_b64 s[0:1], s[2:3], s[4:5]
-; GFX10PLUS-NEXT:    s_lshr_b32 s3, s4, 16
-; GFX10PLUS-NEXT:    s_lshr_b32 s2, s0, 16
-; GFX10PLUS-NEXT:    s_and_b32 s0, s0, 0xffff
-; GFX10PLUS-NEXT:    s_lshl_b32 s2, s2, 16
-; GFX10PLUS-NEXT:    s_lshl_b32 s3, s3, 16
-; GFX10PLUS-NEXT:    s_or_b32 s0, s0, s2
-; GFX10PLUS-NEXT:    s_and_b32 s2, s4, 0xffff
-; GFX10PLUS-NEXT:    s_and_b32 s1, s1, 0xffff
-; GFX10PLUS-NEXT:    s_or_b32 s2, s2, s3
-; GFX10PLUS-NEXT:    s_and_b32 s3, s5, 0xffff
-; GFX10PLUS-NEXT:    ; return to shader part epilog
+; GFX10-LABEL: s_andn2_v3i16_multi_use:
+; GFX10:       ; %bb.0:
+; GFX10-NEXT:    s_mov_b64 s[0:1], -1
+; GFX10-NEXT:    s_xor_b64 s[4:5], s[4:5], s[0:1]
+; GFX10-NEXT:    s_and_b64 s[0:1], s[2:3], s[4:5]
+; GFX10-NEXT:    s_lshr_b32 s3, s4, 16
+; GFX10-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX10-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX10-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX10-NEXT:    s_lshl_b32 s3, s3, 16
+; GFX10-NEXT:    s_or_b32 s0, s0, s2
+; GFX10-NEXT:    s_and_b32 s2, s4, 0xffff
+; GFX10-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX10-NEXT:    s_or_b32 s2, s2, s3
+; GFX10-NEXT:    s_and_b32 s3, s5, 0xffff
+; GFX10-NEXT:    ; return to shader part epilog
+;
+; GFX11-TRUE16-LABEL: s_andn2_v3i16_multi_use:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-TRUE16-NEXT:    s_xor_b64 s[4:5], s[4:5], s[0:1]
+; GFX11-TRUE16-NEXT:    s_and_b64 s[0:1], s[2:3], s[4:5]
+; GFX11-TRUE16-NEXT:    s_lshr_b32 s2, s4, 16
+; GFX11-TRUE16-NEXT:    s_lshr_b32 s3, s0, 16
+; GFX11-TRUE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-TRUE16-NEXT:    s_lshl_b32 s3, s3, 16
+; GFX11-TRUE16-NEXT:    s_and_b32 s4, s4, 0xffff
+; GFX11-TRUE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-TRUE16-NEXT:    s_or_b32 s0, s0, s3
+; GFX11-TRUE16-NEXT:    s_or_b32 s2, s4, s2
+; GFX11-TRUE16-NEXT:    s_mov_b32 s3, s5
+; GFX11-TRUE16-NEXT:    ; return to shader part epilog
+;
+; GFX11-FAKE16-LABEL: s_andn2_v3i16_multi_use:
+; GFX11-FAKE16:       ; %bb.0:
+; GFX11-FAKE16-NEXT:    s_mov_b64 s[0:1], -1
+; GFX11-FAKE16-NEXT:    s_xor_b64 s[4:5], s[4:5], s[0:1]
+; GFX11-FAKE16-NEXT:    s_and_b64 s[0:1], s[2:3], s[4:5]
+; GFX11-FAKE16-NEXT:    s_lshr_b32 s3, s4, 16
+; GFX11-FAKE16-NEXT:    s_lshr_b32 s2, s0, 16
+; GFX11-FAKE16-NEXT:    s_and_b32 s0, s0, 0xffff
+; GFX11-FAKE16-NEXT:    s_lshl_b32 s2, s2, 16
+; GFX11-FAKE16-NEXT:    s_lshl_b32 s3, s3, 16
+; GFX11-FAKE16-NEXT:    s_or_b32 s0, s0, s2
+; GFX11-FAKE16-NEXT:    s_and_b32 s2, s4, 0xffff
+; GFX11-FAKE16-NEXT:    s_and_b32 s1, s1, 0xffff
+; GFX11-FAKE16-NEXT:    s_or_b32 s2, s2, s3
+; GFX11-FAKE16-NEXT:    s_and_b32 s3, s5, 0xffff
+; GFX11-FAKE16-NEXT:    ; return to shader part epilog
   %not.src1 = xor <3 x i16> %src1, <i16 -1, i16 -1, i16 -1>
   %and = and <3 x i16> %src0, %not.src1
   %cast.0 = bitcast <3 x i16> %and to i48
@@ -1127,5 +1248,3 @@ define <4 x i16> @v_andn2_v4i16(<4 x i16> %src0, <4 x i16> %src1) {
   %and = and <4 x i16> %src0, %not.src1
   ret <4 x i16> %and
 }
-;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
-; GFX11-FAKE16: {{.*}}
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
index d7887507160cd..305fce504faed 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
@@ -3,7 +3,7 @@
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=fiji -o - %s | FileCheck -check-prefixes=GCN,GFX8 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx900 -o - %s | FileCheck -check-prefixes=GCN,GFX9 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -o - %s | FileCheck -check-prefixes=GCN,GFX10 %s
-; RUN: not llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr=+real-true16 -o - %s
+; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr=+real-true16 -o - %s | FileCheck -check-prefixes=GFX11,GFX11-TRUE16 %s
 ; RUN: llc -global-isel -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1100 -mattr=-real-true16 -o - %s | FileCheck -check-prefixes=GFX11,GFX11-FAKE16 %s
 
 define amdgpu_ps i7 @s_fshl_i7(i7 inreg %lhs, i7 inreg %rhs, i7 inreg %amt) {
@@ -318,45 +318,85 @@ define i7 @v_fshl_i7(i7 %lhs, i7 %rhs, i7 %amt) {
 ; GFX10-NEXT:    v_or_b32_e32 v0, v0, v1
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
-; GFX11-LABEL: v_fshl_i7:
-; GFX11:       ; %bb.0:
-; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX11-NEXT:    v_cvt_f32_ubyte0_e32 v3, 7
-; GFX11-NEXT:    v_and_b32_e32 v2, 0x7f, v2
-; GFX11-NEXT:    v_and_b32_e32 v1, 0x7f, v1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_rcp_iflag_f32_e32 v3, v3
-; GFX11-NEXT:    v_lshrrev_b16 v1, 1, v1
-; GFX11-NEXT:    s_waitcnt_depctr depctr_va_vdst(0)
-; GFX11-NEXT:    v_mul_f32_e32 v3, 0x4f7ffffe, v3
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_cvt_u32_f32_e32 v3, v3
-; GFX11-NEXT:    v_readfirstlane_b32 s0, v3
-; GFX11-NEXT:    s_mul_i32 s1, s0, -7
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
-; GFX11-NEXT:    s_mul_hi_u32 s1, s0, s1
-; GFX11-NEXT:    s_add_i32 s0, s0, s1
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mul_hi_u32 v5, v2, s0
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, -7, v[2:3]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_add_nc_u32_e32 v2, -7, v3
-; GFX11-NEXT:    v_cmp_le_u32_e32 vcc_lo, 7, v3
-; GFX11-NEXT:    v_cndmask_b32_e32 v2, v3, v2, vcc_lo
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_add_nc_u32_e32 v3, -7, v2
-; GFX11-NEXT:    v_cmp_le_u32_e32 vcc_lo, 7, v2
-; GFX11-NEXT:    v_cndmask_b32_e32 v2, v2, v3, vcc_lo
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_sub_nc_u16 v3, 6, v2
-; GFX11-NEXT:    v_and_b32_e32 v2, 0x7f, v2
-; GFX11-NEXT:    v_and_b32_e32 v3, 0x7f, v3
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
-; GFX11-NEXT:    v_lshlrev_b16 v0, v2, v0
-; GFX11-NEXT:    v_lshrrev_b16 v1, v3, v1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_or_b32_e32 v0, v0, v1
-; GFX11-NEXT:    s_setpc_b64 s[30:31]
+; GFX11-TRUE16-LABEL: v_fshl_i7:
+; GFX11-TRUE16:       ; %bb.0:
+; GFX11-TRUE16-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-TRUE16-NEXT:    v_cvt_f32_ubyte0_e32 v3, 7
+; GFX11-TRUE16-NEXT:    v_and_b32_e32 v2, 0x7f, v2
+; GFX11-TRUE16-NEXT:    v_and_b16 v0.h, 0x7f, v1.l
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-TRUE16-NEXT:    v_rcp_iflag_f32_e32 v3, v3
+; GFX11-TRUE16-NEXT:    v_lshrrev_b16 v0.h, 1, v0.h
+; GFX11-TRUE16-NEXT:    s_waitcnt_depctr depctr_va_vdst(0)
+; GFX11-TRUE16-NEXT:    v_mul_f32_e32 v3, 0x4f7ffffe, v3
+; GFX11-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) ...
[truncated]

addRulesForGOpcs({G_MERGE_VALUES, G_CONCAT_VECTORS})
.Any({{UniBRC, S16}, {{}, {}, VerifyAllSgpr}})
.Any({{UniBRC, BRC}, {{}, {}, VerifyAllSgpr}})
.Any({{DivBRC, S16}, {{}, {}, ApplyAllVgpr}})
Contributor


Try to add a VGPR variant of some of the existing tests, for example s_andn2_v3i16. Also, move G_MERGE_VALUES to the same group as G_BUILD_VECTOR; these rules do not apply to G_CONCAT_VECTORS.

if (SrcSize < 32)
if (SrcSize < 32) {
// Handle sgpr32 <- G_MERGE_VALUES sgpr16, sgpr16
if (SrcSize == 16 && DstTy.getSizeInBits() == 32 &&
Contributor


Try to share code with G_BUILD_VECTOR. I would expect to get S_PACK_LL_B32_B16 instead of mask + shift + or; maybe refactor into some helper functions.
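To illustrate why a single pack instruction could stand in for the three-instruction sequence, here is a rough standalone model. It assumes S_PACK_LL_B32_B16 produces `{S1[15:0], S0[15:0]}`, which is my reading of the ISA documentation; the names `packLL`, `maskShiftOr`, and `packMatchesMaskShiftOr` are hypothetical and do not exist in LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Assumed model of S_PACK_LL_B32_B16: D = { S1[15:0], S0[15:0] }.
constexpr uint32_t packLL(uint32_t S0, uint32_t S1) {
  return ((S1 & 0xFFFFu) << 16) | (S0 & 0xFFFFu);
}

// Model of the sequence currently emitted by the patch:
// S_AND_B32 + S_LSHL_B32 + S_OR_B32.
constexpr uint32_t maskShiftOr(uint32_t Lo, uint32_t Hi) {
  return (Lo & 0xFFFFu) | (Hi << 16);
}

// Compare both models over a few inputs, including ones with stale upper bits.
inline bool packMatchesMaskShiftOr() {
  const uint32_t Samples[] = {0u,      1u,          0x1234u,
                              0xFFFFu, 0x8000ABCDu, 0xFFFFFFFFu};
  for (uint32_t Lo : Samples)
    for (uint32_t Hi : Samples)
      if (packLL(Lo, Hi) != maskShiftOr(Lo, Hi))
        return false;
  return true;
}
```

Under that assumption the two forms agree bit for bit, so the suggestion would shrink the selected code without changing the result. Whether the pack instruction is available on every subtarget that reaches this path is not something this sketch addresses.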
