
[SeparateConstOffsetFromGEP] Reorder trivial GEP chains to separate constants #73056

Merged: 1 commit into llvm:main on Feb 13, 2024

Conversation

jrbyrnes (Contributor)

In this case, a trivial GEP chain has the form:

%ptr = getelementptr sameType, %base, constant
%val = getelementptr sameType, %ptr, %variable

That is, a one-index GEP consumes another one-index GEP with the same source and result element types, where the inner GEP uses a constant index and the outer GEP uses a variable index. Chains of this form are trivial to reorder (by simply swapping the two indices), and doing so yields better AddrMode matching for users of the pointer ultimately produced by the GEP chain.

Future patches can extend this to support non-trivial GEP chains (e.g. those with different basis types and/or multiple indices).
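
For illustration, here is a minimal before/after sketch of the swap described above (the i32 element type and the value names are hypothetical, not taken from the patch or its tests):

%ptr = getelementptr i32, ptr %base, i64 16          ; inner GEP: constant index
%val = getelementptr i32, ptr %ptr, i64 %idx         ; outer GEP: variable index

becomes

%ptr.new = getelementptr i32, ptr %base, i64 %idx    ; variable index moves to the inner GEP
%val.new = getelementptr i32, ptr %ptr.new, i64 16   ; constant index is now outermost, where
                                                     ; AddrMode matching can fold it as an offset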

@llvmbot (Collaborator) commented Nov 22, 2023

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-transforms

Author: Jeffrey Byrnes (jrbyrnes)

Changes

In this case, a trivial GEP chain has the form:

%ptr = getelementptr sameType, %base, constant
%val = getelementptr sameType, %ptr, %variable

That is, a one-index GEP consumes another one-index GEP with the same source and result element types, where the inner GEP uses a constant index and the outer GEP uses a variable index. Chains of this form are trivial to reorder (by simply swapping the two indices), and doing so yields better AddrMode matching for users of the pointer ultimately produced by the GEP chain.

Future patches can extend this to support non-trivial GEP chains (e.g. those with different basis types and/or multiple indices).


Patch is 79.08 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73056.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp (+42-3)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+18-20)
  • (modified) llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll (+116-135)
  • (modified) llvm/test/CodeGen/PowerPC/licm-remat.ll (+1-1)
  • (added) llvm/test/Transforms/SeparateConstOffsetFromGEP/AMDGPU/reorder-gep.ll (+175)
  • (added) llvm/test/Transforms/SeparateConstOffsetFromGEP/NVPTX/lower-gep-reorder.ll (+65)
  • (added) llvm/test/Transforms/SeparateConstOffsetFromGEP/reorder-gep.ll (+188)
diff --git a/llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp b/llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
index b8c9d9d100f117a..72791ac55839a3d 100644
--- a/llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
+++ b/llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
@@ -987,11 +987,50 @@ bool SeparateConstOffsetFromGEP::splitGEP(GetElementPtrInst *GEP) {
   bool NeedsExtraction;
   int64_t AccumulativeByteOffset = accumulateByteOffset(GEP, NeedsExtraction);
 
-  if (!NeedsExtraction)
-    return Changed;
-
   TargetTransformInfo &TTI = GetTTI(*GEP->getFunction());
 
+  if (!NeedsExtraction) {
+    // Attempt to reassociate GEPs for better AddrMode construction.
+    if (auto PtrGEP = dyn_cast<GetElementPtrInst>(GEP->getPointerOperand())) {
+      // TODO: support reordering for non-trivial GEP chains
+      if (PtrGEP->getNumIndices() != 1 || GEP->getNumIndices() != 1)
+        return Changed;
+
+      Type *GEPType = GEP->getResultElementType();
+      Type *PtrGEPType = PtrGEP->getResultElementType();
+      // TODO: support reordering for non-trivial GEP chains
+      if ((PtrGEPType != GEPType) ||
+          (PtrGEP->getSourceElementType() != GEP->getSourceElementType()))
+        return Changed;
+
+      bool NestedNeedsExtraction;
+      int64_t NestedByteOffset =
+          accumulateByteOffset(PtrGEP, NestedNeedsExtraction);
+      if (!NestedNeedsExtraction)
+        return Changed;
+
+      unsigned AddrSpace = PtrGEP->getPointerAddressSpace();
+      if (!TTI.isLegalAddressingMode(GEP->getResultElementType(),
+                                     /*BaseGV=*/nullptr, NestedByteOffset,
+                                     /*HasBaseReg=*/true, /*Scale=*/0,
+                                     AddrSpace)) {
+        return Changed;
+      }
+
+      IRBuilder<> Builder(GEP);
+      Builder.SetCurrentDebugLocation(GEP->getDebugLoc());
+      // For trivial GEP chains, we can swap the indicies.
+      auto NewSrc = Builder.CreateGEP(PtrGEPType, PtrGEP->getPointerOperand(),
+                                      SmallVector<Value *, 4>(GEP->indices()));
+      auto NewGEP = Builder.CreateGEP(
+          GEPType, NewSrc, SmallVector<Value *, 4>(PtrGEP->indices()));
+      GEP->replaceAllUsesWith(NewGEP);
+      RecursivelyDeleteTriviallyDeadInstructions(GEP);
+      Changed = true;
+    }
+    return Changed;
+  }
+
   // If LowerGEP is disabled, before really splitting the GEP, check whether the
   // backend supports the addressing mode we are about to produce. If no, this
   // splitting probably won't be beneficial.
diff --git a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
index 75f3b5463c3944b..91f635893aed305 100644
--- a/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
+++ b/llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll
@@ -264,11 +264,11 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CHECK-NEXT:    ds_write_b32 v0, v58
 ; CHECK-NEXT:    s_branch .LBB0_7
-; CHECK-NEXT:  .LBB0_16: ; %Flow43
+; CHECK-NEXT:  .LBB0_16: ; %Flow45
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s57
 ; CHECK-NEXT:    v_mov_b32_e32 v57, v0
-; CHECK-NEXT:  .LBB0_17: ; %Flow44
+; CHECK-NEXT:  .LBB0_17: ; %Flow46
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s56
 ; CHECK-NEXT:    s_mov_b32 s55, exec_lo
@@ -311,11 +311,11 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; CHECK-NEXT:    ds_write_b32 v0, v57
 ; CHECK-NEXT:    s_branch .LBB0_19
-; CHECK-NEXT:  .LBB0_22: ; %Flow41
+; CHECK-NEXT:  .LBB0_22: ; %Flow43
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_inst_prefetch 0x2
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s56
-; CHECK-NEXT:  .LBB0_23: ; %Flow42
+; CHECK-NEXT:  .LBB0_23: ; %Flow44
 ; CHECK-NEXT:    ; in Loop: Header=BB0_5 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s55
 ; CHECK-NEXT:  ; %bb.24: ; in Loop: Header=BB0_5 Depth=1
@@ -328,7 +328,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_or_b32 s49, s4, s49
 ; CHECK-NEXT:    s_andn2_b32 exec_lo, exec_lo, s49
 ; CHECK-NEXT:    s_cbranch_execnz .LBB0_5
-; CHECK-NEXT:  .LBB0_25: ; %Flow49
+; CHECK-NEXT:  .LBB0_25: ; %Flow51
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s48
 ; CHECK-NEXT:    v_mov_b32_e32 v31, v41
 ; CHECK-NEXT:    v_mov_b32_e32 v0, 1
@@ -347,18 +347,16 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_cmpx_gt_u32_e64 v47, v40
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_33
 ; CHECK-NEXT:  ; %bb.26:
-; CHECK-NEXT:    s_add_u32 s52, s44, 8
-; CHECK-NEXT:    s_addc_u32 s53, s45, 0
 ; CHECK-NEXT:    s_getpc_b64 s[42:43]
 ; CHECK-NEXT:    s_add_u32 s42, s42, _Z10atomic_addPU3AS1Vjj@rel32@lo+4
 ; CHECK-NEXT:    s_addc_u32 s43, s43, _Z10atomic_addPU3AS1Vjj@rel32@hi+12
 ; CHECK-NEXT:    s_mov_b32 s54, 0
-; CHECK-NEXT:    s_getpc_b64 s[44:45]
-; CHECK-NEXT:    s_add_u32 s44, s44, _Z10atomic_subPU3AS1Vjj@rel32@lo+4
-; CHECK-NEXT:    s_addc_u32 s45, s45, _Z10atomic_subPU3AS1Vjj@rel32@hi+12
 ; CHECK-NEXT:    s_getpc_b64 s[48:49]
-; CHECK-NEXT:    s_add_u32 s48, s48, _Z14get_local_sizej@rel32@lo+4
-; CHECK-NEXT:    s_addc_u32 s49, s49, _Z14get_local_sizej@rel32@hi+12
+; CHECK-NEXT:    s_add_u32 s48, s48, _Z10atomic_subPU3AS1Vjj@rel32@lo+4
+; CHECK-NEXT:    s_addc_u32 s49, s49, _Z10atomic_subPU3AS1Vjj@rel32@hi+12
+; CHECK-NEXT:    s_getpc_b64 s[52:53]
+; CHECK-NEXT:    s_add_u32 s52, s52, _Z14get_local_sizej@rel32@lo+4
+; CHECK-NEXT:    s_addc_u32 s53, s53, _Z14get_local_sizej@rel32@hi+12
 ; CHECK-NEXT:    s_branch .LBB0_28
 ; CHECK-NEXT:  .LBB0_27: ; in Loop: Header=BB0_28 Depth=1
 ; CHECK-NEXT:    s_or_b32 exec_lo, exec_lo, s55
@@ -371,7 +369,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_mov_b32 s12, s41
 ; CHECK-NEXT:    s_mov_b32 s13, s40
 ; CHECK-NEXT:    s_mov_b32 s14, s33
-; CHECK-NEXT:    s_swappc_b64 s[30:31], s[48:49]
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[52:53]
 ; CHECK-NEXT:    v_add_co_u32 v40, vcc_lo, v0, v40
 ; CHECK-NEXT:    v_cmp_le_u32_e32 vcc_lo, v47, v40
 ; CHECK-NEXT:    s_or_b32 s54, vcc_lo, s54
@@ -388,15 +386,15 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    v_mul_u32_u24_e32 v1, 0x180, v63
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 5, v62
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v4, 5, v72
-; CHECK-NEXT:    v_add_co_u32 v2, s4, s52, v1
-; CHECK-NEXT:    v_add_co_ci_u32_e64 v3, null, s53, 0, s4
+; CHECK-NEXT:    v_add_co_u32 v2, s4, s44, v1
+; CHECK-NEXT:    v_add_co_ci_u32_e64 v3, null, s45, 0, s4
 ; CHECK-NEXT:    v_add_co_u32 v0, vcc_lo, v2, v0
 ; CHECK-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, 0, v3, vcc_lo
 ; CHECK-NEXT:    v_add_co_u32 v2, vcc_lo, v2, v4
 ; CHECK-NEXT:    v_add_co_ci_u32_e32 v3, vcc_lo, 0, v3, vcc_lo
 ; CHECK-NEXT:    s_clause 0x1
-; CHECK-NEXT:    global_load_dwordx4 v[4:7], v[0:1], off
-; CHECK-NEXT:    global_load_dwordx4 v[8:11], v[2:3], off
+; CHECK-NEXT:    global_load_dwordx4 v[4:7], v[0:1], off offset:8
+; CHECK-NEXT:    global_load_dwordx4 v[8:11], v[2:3], off offset:8
 ; CHECK-NEXT:    s_waitcnt vmcnt(0)
 ; CHECK-NEXT:    v_xor_b32_e32 v46, v9, v5
 ; CHECK-NEXT:    v_xor_b32_e32 v45, v8, v4
@@ -408,8 +406,8 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_cbranch_execz .LBB0_27
 ; CHECK-NEXT:  ; %bb.29: ; in Loop: Header=BB0_28 Depth=1
 ; CHECK-NEXT:    s_clause 0x1
-; CHECK-NEXT:    global_load_dwordx2 v[58:59], v[2:3], off offset:16
-; CHECK-NEXT:    global_load_dwordx2 v[60:61], v[0:1], off offset:16
+; CHECK-NEXT:    global_load_dwordx2 v[58:59], v[2:3], off offset:24
+; CHECK-NEXT:    global_load_dwordx2 v[60:61], v[0:1], off offset:24
 ; CHECK-NEXT:    v_lshlrev_b32_e32 v0, 4, v45
 ; CHECK-NEXT:    v_alignbit_b32 v1, v46, v45, 12
 ; CHECK-NEXT:    v_and_b32_e32 v2, 0xf0000, v45
@@ -484,7 +482,7 @@ define protected amdgpu_kernel void @kernel_round1(ptr addrspace(1) nocapture no
 ; CHECK-NEXT:    s_mov_b32 s12, s41
 ; CHECK-NEXT:    s_mov_b32 s13, s40
 ; CHECK-NEXT:    s_mov_b32 s14, s33
-; CHECK-NEXT:    s_swappc_b64 s[30:31], s[44:45]
+; CHECK-NEXT:    s_swappc_b64 s[30:31], s[48:49]
 ; CHECK-NEXT:    s_branch .LBB0_27
 ; CHECK-NEXT:  .LBB0_33:
 ; CHECK-NEXT:    s_endpgm
diff --git a/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll b/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
index fbe0b156cd9baa0..3186eedbbc0f523 100644
--- a/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
+++ b/llvm/test/CodeGen/AMDGPU/splitkit-getsubrangeformask.ll
@@ -31,205 +31,188 @@ define amdgpu_gs void @_amdgpu_gs_main(i32 inreg %primShaderTableAddrLow, <31 x
   ; CHECK-NEXT:   [[COPY13:%[0-9]+]]:sgpr_32 = COPY $sgpr10
   ; CHECK-NEXT:   [[COPY14:%[0-9]+]]:sgpr_32 = COPY $sgpr8
   ; CHECK-NEXT:   undef [[S_LOAD_DWORDX2_IMM:%[0-9]+]].sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM [[COPY]], 232, 0 :: (invariant load (s64) from %ir.39, addrspace 4)
+  ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_IMM undef %125:sgpr_128, 0, 0 :: (dereferenceable invariant load (s32))
+  ; CHECK-NEXT:   KILL undef %125:sgpr_128
   ; CHECK-NEXT:   [[S_LSHL_B32_:%[0-9]+]]:sreg_32 = S_LSHL_B32 [[COPY5]], 4, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_LSHL_B32_1:%[0-9]+]]:sreg_32 = S_LSHL_B32 [[COPY4]], 4, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_LSHL_B32_2:%[0-9]+]]:sreg_32 = S_LSHL_B32 [[COPY3]], 4, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ASHR_I32_:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 [[S_LSHL_B32_]], 31, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_ASHR_I32_1:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 [[S_LSHL_B32_1]], 31, implicit-def dead $scc
-  ; CHECK-NEXT:   [[S_ASHR_I32_2:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 [[S_LSHL_B32_2]], 31, implicit-def dead $scc
   ; CHECK-NEXT:   [[S_LOAD_DWORDX2_IMM:%[0-9]+]].sub1:sgpr_128 = S_AND_B32 [[S_LOAD_DWORDX2_IMM]].sub1, 65535, implicit-def dead $scc
+  ; CHECK-NEXT:   [[S_ASHR_I32_2:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 [[S_LSHL_B32_2]], 31, implicit-def dead $scc
+  ; CHECK-NEXT:   [[S_SUB_I32_:%[0-9]+]]:sreg_32 = S_SUB_I32 [[S_BUFFER_LOAD_DWORD_IMM]], 29, implicit-def dead $scc
+  ; CHECK-NEXT:   [[S_SUB_I32_1:%[0-9]+]]:sreg_32 = S_SUB_I32 [[S_BUFFER_LOAD_DWORD_IMM]], 30, implicit-def dead $scc
   ; CHECK-NEXT:   undef [[S_ADD_U32_:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY6]], [[S_LSHL_B32_2]], implicit-def $scc
   ; CHECK-NEXT:   [[S_ADD_U32_:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %54:sreg_32, [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_]], 16, 0 :: (invariant load (s128) from %ir.81, addrspace 4)
   ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM1:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM undef %74:sreg_64, 0, 0 :: (invariant load (s128) from `ptr addrspace(4) undef`, addrspace 4)
-  ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_IMM:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_IMM undef %132:sgpr_128, 0, 0 :: (dereferenceable invariant load (s32))
   ; CHECK-NEXT:   KILL undef %74:sreg_64
-  ; CHECK-NEXT:   KILL undef %132:sgpr_128
   ; CHECK-NEXT:   KILL [[S_ADD_U32_]].sub0, [[S_ADD_U32_]].sub1
   ; CHECK-NEXT:   [[S_BUFFER_LOAD_DWORD_IMM1:%[0-9]+]]:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_IMM [[S_LOAD_DWORDX4_IMM]], 0, 0 :: (dereferenceable invariant load (s32))
   ; CHECK-NEXT:   [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
   ; CHECK-NEXT:   undef [[S_MOV_B32_:%[0-9]+]].sub1:sgpr_128 = S_MOV_B32 0
+  ; CHECK-NEXT:   [[BUFFER_LOAD_DWORD_OFFSET:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_DWORD_OFFSET undef %118:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], undef %89:sgpr_128, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   [[BUFFER_LOAD_FORMAT_X_IDXEN1:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_FORMAT_X_IDXEN [[V_MOV_B32_e32_]], [[S_LOAD_DWORDX4_IMM1]], 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 8)
   ; CHECK-NEXT:   KILL undef %89:sgpr_128
-  ; CHECK-NEXT:   [[S_SUB_I32_:%[0-9]+]]:sreg_32 = S_SUB_I32 [[S_BUFFER_LOAD_DWORD_IMM]], 29, implicit-def dead $scc
-  ; CHECK-NEXT:   [[S_SUB_I32_1:%[0-9]+]]:sreg_32 = S_SUB_I32 [[S_BUFFER_LOAD_DWORD_IMM]], 30, implicit-def dead $scc
+  ; CHECK-NEXT:   KILL undef %118:sgpr_128
   ; CHECK-NEXT:   [[S_SUB_I32_2:%[0-9]+]]:sreg_32 = S_SUB_I32 [[S_BUFFER_LOAD_DWORD_IMM1]], 31, implicit-def dead $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_1:%[0-9]+]]:sreg_32 = S_ADD_U32 [[COPY6]], 64, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADDC_U32_:%[0-9]+]]:sreg_32 = S_ADDC_U32 undef %54:sreg_32, 0, implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_2:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_1]], [[S_LSHL_B32_]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_2:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_]], [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_3:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_1]], [[S_LSHL_B32_1]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM2:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_2]], 0, 0 :: (invariant load (s128) from %ir.87, addrspace 4)
-  ; CHECK-NEXT:   [[S_ADD_U32_3:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_]], [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_4:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_1]], [[S_LSHL_B32_2]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_4:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_]], [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   [[S_ASHR_I32_3:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 undef %171:sreg_32, 31, implicit-def dead $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_5:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_1]], undef %171:sreg_32, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_5:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_]], [[S_ASHR_I32_3]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_6:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY7]].sub0, [[S_LSHL_B32_]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_6:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %51:sreg_32, [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_7:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY7]].sub0, [[S_LSHL_B32_1]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_7:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %51:sreg_32, [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_8:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY7]].sub0, undef %171:sreg_32, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_8:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %51:sreg_32, [[S_ASHR_I32_3]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_9:%[0-9]+]]:sreg_32 = S_ADD_U32 [[COPY7]].sub0, 224, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADDC_U32_1:%[0-9]+]]:sreg_32 = S_ADDC_U32 undef %51:sreg_32, 0, implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_10:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_9]], [[S_LSHL_B32_]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_10:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_1]], [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_11:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_9]], [[S_LSHL_B32_1]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_11:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_1]], [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_12:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_9]], [[S_LSHL_B32_2]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_12:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_1]], [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_13:%[0-9]+]]:sreg_32 = S_ADD_U32 [[COPY7]].sub0, 576, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADDC_U32_2:%[0-9]+]]:sreg_32 = S_ADDC_U32 undef %51:sreg_32, 0, implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_14:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_13]], [[S_LSHL_B32_]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_14:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_2]], [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_15:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_13]], [[S_LSHL_B32_2]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_15:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_2]], [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_16:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[S_ADD_U32_13]], undef %171:sreg_32, implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_16:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 [[S_ADDC_U32_2]], [[S_ASHR_I32_3]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_17:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY8]], [[S_LSHL_B32_]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_17:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %48:sreg_32, [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_18:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY9]], [[S_LSHL_B32_1]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_18:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %45:sreg_32, [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
-  ; CHECK-NEXT:   undef [[S_ADD_U32_19:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY9]], [[S_LSHL_B32_2]], implicit-def $scc
-  ; CHECK-NEXT:   [[S_ADD_U32_19:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %45:sreg_32, [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_1:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY6]], [[S_LSHL_B32_]], implicit-def $scc
+  ; CHECK-NEXT:   [[S_ADD_U32_1:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %54:sreg_32, [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_2:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY6]], [[S_LSHL_B32_1]], implicit-def $scc
+  ; CHECK-NEXT:   [[S_ADD_U32_2:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %54:sreg_32, [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_3:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY6]], [[S_LSHL_B32_2]], implicit-def $scc
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM2:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_1]], 64, 0 :: (invariant load (s128) from %ir.87, addrspace 4)
+  ; CHECK-NEXT:   [[S_LOAD_DWORDX4_IMM3:%[0-9]+]]:sgpr_128 = S_LOAD_DWORDX4_IMM [[S_ADD_U32_2]], 64, 0 :: (invariant load (s128) from %ir.93, addrspace 4)
+  ; CHECK-NEXT:   KILL [[S_ADD_U32_1]].sub0, [[S_ADD_U32_1]].sub1
+  ; CHECK-NEXT:   KILL [[S_ADD_U32_2]].sub0, [[S_ADD_U32_2]].sub1
+  ; CHECK-NEXT:   [[S_ADD_U32_3:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %54:sreg_32, [[S_ASHR_I32_2]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   [[S_ASHR_I32_3:%[0-9]+]]:sreg_32_xm0 = S_ASHR_I32 undef %169:sreg_32, 31, implicit-def dead $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_4:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY6]], undef %169:sreg_32, implicit-def $scc
+  ; CHECK-NEXT:   [[S_ADD_U32_4:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %54:sreg_32, [[S_ASHR_I32_3]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_5:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY7]].sub0, [[S_LSHL_B32_]], implicit-def $scc
+  ; CHECK-NEXT:   [[S_ADD_U32_5:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %51:sreg_32, [[S_ASHR_I32_]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:   undef [[S_ADD_U32_6:%[0-9]+]].sub0:sreg_64 = S_ADD_U32 [[COPY7]].sub0, [[S_LSHL_B32_1]], implicit-def $scc
+  ; CHECK-NEXT:   [[S_ADD_U32_6:%[0-9]+]].sub1:sreg_64 = S_ADDC_U32 undef %51:sreg_32, [[S_ASHR_I32_1]], implicit-def dead $scc, implicit $scc
+  ; CHECK-NEXT:  ...
[truncated]

@arsenm (Contributor) left a comment

I'm somewhat confused by the description, since this is what I thought the point of the pass was and what it already did. What exactly is special about this particular case?

Three review comments on llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp (outdated, resolved).
@jrbyrnes (Contributor, Author) commented Dec 7, 2023

> What exactly is special about this particular case?

When analyzing the GEP operands for constants, this pass only supports certain Value classes: {ConstantInt, BinaryOperator, TruncInst, SExtInst, ZExtInst}. In this case, the operand is a GetElementPtrInst, so the pass does not attempt to reorganize the GEP in such a way that the constant is the outermost index.

I think it is feasible to extend the set of supported Value classes to include GEPs, but that requires a more significant rework. This feature can bridge the gap until then.

@jrbyrnes force-pushed the ReorderGEP branch 2 times, most recently from 330df8e to 3d0f306 on December 12, 2023 at 22:28
@krzysz00 (Contributor) left a comment

Looks good to me!

@arsenm (Contributor) commented Jan 3, 2024

Can you add an alive2 link that shows the inbounds preservation is OK?

@arsenm (Contributor) commented Feb 6, 2024

reverse ping

@jrbyrnes (Contributor, Author) commented Feb 7, 2024

So it appears that the inbounds preservation is not correct.

Preserved:
https://alive2.llvm.org/ce/z/fdH9f2

Stripped:
https://alive2.llvm.org/ce/z/dJwv-W

Instead of stripping the inbounds, I've disabled the optimization for inbounds GEPs.
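
To see why the swap is unsound for inbounds GEPs in general, here is a small sketch (hypothetical values, not the Alive2 example itself): if %base points at a single i32 and %i is -1, the original chain never leaves the object, but the reordered chain's intermediate pointer does.

%p = getelementptr inbounds i32, ptr %base, i64 1   ; one past the end: still in bounds
%q = getelementptr inbounds i32, ptr %p, i64 %i     ; %i = -1 steps back to %base

; after the swap, the intermediate pointer for %i = -1 lies below %base,
; so the inbounds GEP produces poison where the original was well defined
%p.r = getelementptr inbounds i32, ptr %base, i64 %i
%q.r = getelementptr inbounds i32, ptr %p.r, i64 1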

github-actions bot commented Feb 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@krzysz00 (Contributor) commented Feb 7, 2024

... hold on, from looking at that Alive2, I think you can be a bit more aggressive than just not applying this swap to inbounds GEPs (or stripping the inbounds - do we know which of these is more likely to lead to good codegen?).

Specifically, I claim that, if you can prove that the constant and variable indices are non-negative, you should be able to reassociate the inbounds GEPs soundly.

See the somewhat ham-fisted https://alive2.llvm.org/ce/z/CRkgEN
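
A rough sketch of that claim (hypothetical names, not the Alive2 example): if the original inbounds chain is well defined and both indices are non-negative, every intermediate pointer of the swapped chain stays between %base and the final pointer, both of which lie inside the object, so inbounds can be kept.

%p = getelementptr inbounds i32, ptr %base, i64 4    ; original: constant index first...
%q = getelementptr inbounds i32, ptr %p, i64 %i      ; ...then variable index %i >= 0

%p.r = getelementptr inbounds i32, ptr %base, i64 %i ; swapped: %base <= %p.r <= %q,
%q.r = getelementptr inbounds i32, ptr %p.r, i64 4   ; so both GEPs remain in bounds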

@jrbyrnes (Contributor, Author) commented Feb 8, 2024

> Specifically, I claim that, if you can prove that the constant and variable indices are non-negative, you should be able to reassociate the inbounds GEPs soundly.

Thanks @krzysz00 for pointing that out -- I agree with that claim and have added it to the most recent version.

> do we know which of these is more likely to lead to good codegen?

I'm not quite sure; I think I'd have to analyze empirically.

My initial thought was to disable the reordering if we lose inbounds, to avoid introducing regressions; however, most of the optimizations that depend on inbounds typically occur earlier in the pipeline.

@krzysz00 (Contributor)

Do you need someone to land this or?

@jrbyrnes (Contributor, Author)

> Do you need someone to land this or?

I was just waiting to see if there were any objections to the newest version of the changes -- based on this, I'll assume not. Thanks for the review!

@jrbyrnes merged commit 1b65742 into llvm:main on Feb 13, 2024 (3 of 4 checks passed).
@jrbyrnes (Contributor, Author)

I guess some unaccounted-for change had come along that affected the behavior of NVPTX/lower-gep-reorder.ll.

I've resolved the check-llvm failures via ec0aa16.

@nico (Contributor) commented Feb 13, 2024

Looks like this still breaks tests: http://45.33.8.238/linux/130717/step_12.txt

This is after ec0aa16.

Should we revert for now?

@krzysz00 (Contributor)

Might be good to revert and make sure there's a local test run with all targets enabled at least?

preames added a commit that referenced this pull request Feb 13, 2024
…parate constants (#73056)" and follow ups

"ninja check-llvm" is failing on tip of tree.

This reverts commit ec0aa16.
This reverts commit 1b65742.
@AaronHLiu (Contributor) commented Feb 15, 2024

Looks like there are some LIT failures that need to be addressed:

FAIL: LLVM :: Transforms/SeparateConstOffsetFromGEP/NVPTX/lower-gep-reorder.ll (68679 of 78188)
******************** TEST 'LLVM :: Transforms/SeparateConstOffsetFromGEP/NVPTX/lower-gep-reorder.ll' FAILED ********************
Exit Code: 1
...

@krzysz00 (Contributor)

... that got handled in the re-land?

@jrbyrnes (Contributor, Author) commented Feb 15, 2024

@AaronHLiu this PR was known to cause that exact LIT failure -- this was why it was reverted upstream -- does your build have the revert?

The reland should be okay; in your deleted comment, the flagged CHECK line existed in that bad version of the LIT test but not in the reland version (reland: #81671).

@AaronHLiu (Contributor)

> @AaronHLiu this PR was known to cause that exact LIT failure -- this was why it was reverted upstream -- does your build have the revert?
>
> The reland should be okay; in your deleted comment, the flagged CHECK line existed in that bad version of the LIT test but not in the reland version (reland: #81671).

The build did not contain the new version yet. Will check. Thanks!

jrbyrnes added a commit to jrbyrnes/llvm-project that referenced this pull request Apr 26, 2024
jrbyrnes added a commit to jrbyrnes/llvm-project that referenced this pull request May 7, 2024