Conversation

@choikwa (Contributor) commented Dec 9, 2025

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.
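
As a rough illustration (value names here are made up; this mirrors the updated check lines in llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll below), a dynamic-index <4 x i16> read from a promoted <16 x i16> vector, which previously scalarized into four extractelement/insertelement pairs, now becomes:

  %bc  = bitcast <16 x i16> %promoted to <4 x i64>
  %sub = udiv i32 %dynidx, 4
  %elt = extractelement <4 x i64> %bc, i32 %sub
  %out = bitcast i64 %elt to <4 x i16>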

Credit: ruiling in #171253

@llvmbot (Member) commented Dec 9, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.


Patch is 29.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/171253.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+30)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll (+5-12)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll (+497)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll (+82)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index b79689c39ef84..b3de42ff7612a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -42,6 +42,7 @@
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
+#include "llvm/Support/MathExtras.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
@@ -556,6 +557,35 @@ static Value *promoteAllocaUserToVector(
       auto *SubVecTy = FixedVectorType::get(VecEltTy, NumLoadedElts);
       assert(DL.getTypeStoreSize(SubVecTy) == DL.getTypeStoreSize(AccessTy));
 
+      // If idx is dynamic, then sandwich load with bitcasts.
+      // ie. <64 x i8> -> <16 x i8>  instead do
+      //     <64 x i8> -> <4 x i128> -> i128 -> <16 x i8>
+      // Extracting subvector with dynamic index has very large expansion in
+      // the amdgpu backend. Limit to pow2 for UDiv.
+      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
+          llvm::isPowerOf2_32(VectorTy->getNumElements()) &&
+          llvm::isPowerOf2_32(SubVecTy->getNumElements())) {
+        IntegerType *NewElemType = Builder.getIntNTy(
+            SubVecTy->getScalarSizeInBits() * SubVecTy->getNumElements());
+        const unsigned NewNumElts =
+            VectorTy->getNumElements() * VectorTy->getScalarSizeInBits() /
+              NewElemType->getScalarSizeInBits();
+        const unsigned IndexDivisor = VectorTy->getNumElements() / NewNumElts;
+        assert(VectorTy->getScalarSizeInBits() <
+            NewElemType->getScalarSizeInBits() &&
+            "New element type should be bigger");
+        assert(IndexDivisor > 0u && "Zero index divisor");
+        FixedVectorType *BitCastType =
+            FixedVectorType::get(NewElemType, NewNumElts);
+        Value *BCVal = Builder.CreateBitCast(CurVal, BitCastType);
+        Value *NewIdx = Builder.CreateUDiv(Index,
+            ConstantInt::get(Index->getType(), IndexDivisor));
+        Value *ExtVal = Builder.CreateExtractElement(BCVal, NewIdx);
+        Value *BCOut = Builder.CreateBitCast(ExtVal, SubVecTy);
+        Inst->replaceAllUsesWith(BCOut);
+        return nullptr;
+      }
+
       Value *SubVec = PoisonValue::get(SubVecTy);
       for (unsigned K = 0; K < NumLoadedElts; ++K) {
         Value *CurIdx =
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
index 62a04f3a6f86f..5aa09ae74ec36 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
@@ -436,18 +436,11 @@ define <4 x i16> @nonconst_indexes(i1 %cond, i32 %otheridx, <4 x i16> %store) #0
 ; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[INDEX_1]], 3
 ; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <4 x i16> [[STORE]], i64 3
 ; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <16 x i16> [[TMP7]], i16 [[TMP9]], i32 [[TMP8]]
-; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[INDEX_2]]
-; CHECK-NEXT:    [[TMP12:%.*]] = insertelement <4 x i16> poison, i16 [[TMP11]], i64 0
-; CHECK-NEXT:    [[TMP13:%.*]] = add i32 [[INDEX_2]], 1
-; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP13]]
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <4 x i16> [[TMP12]], i16 [[TMP14]], i64 1
-; CHECK-NEXT:    [[TMP16:%.*]] = add i32 [[INDEX_2]], 2
-; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP16]]
-; CHECK-NEXT:    [[TMP18:%.*]] = insertelement <4 x i16> [[TMP15]], i16 [[TMP17]], i64 2
-; CHECK-NEXT:    [[TMP19:%.*]] = add i32 [[INDEX_2]], 3
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP19]]
-; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <4 x i16> [[TMP18]], i16 [[TMP20]], i64 3
-; CHECK-NEXT:    ret <4 x i16> [[TMP21]]
+; CHECK-NEXT:    [[TMP11:%.*]] = bitcast <16 x i16> [[TMP10]] to <4 x i64>
+; CHECK-NEXT:    [[TMP12:%.*]] = udiv i32 [[INDEX_2]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <4 x i64> [[TMP11]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = bitcast i64 [[TMP13]] to <4 x i16>
+; CHECK-NEXT:    ret <4 x i16> [[TMP14]]
 ;
 entry:
   %data = alloca [16 x i16], addrspace(5)
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
new file mode 100644
index 0000000000000..084b7a2d59b2f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
@@ -0,0 +1,497 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 < %s | FileCheck %s --check-prefixes=GFX9
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck %s --check-prefixes=GFX11
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 < %s | FileCheck %s --check-prefixes=GFX12
+
+define amdgpu_kernel void @test_bitcast_llc_v128i8_v16i8(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x0
+; GFX9-NEXT:    s_load_dword s33, s[4:5], 0x8
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s33, s33, s33
+; GFX9-NEXT:    s_mov_b32 s1, s0
+; GFX9-NEXT:    s_lshl_b32 s33, s33, 1
+; GFX9-NEXT:    s_mov_b32 s2, s0
+; GFX9-NEXT:    s_mov_b32 s3, s0
+; GFX9-NEXT:    s_mov_b32 s4, s0
+; GFX9-NEXT:    s_mov_b32 s5, s0
+; GFX9-NEXT:    s_mov_b32 s6, s0
+; GFX9-NEXT:    s_mov_b32 s7, s0
+; GFX9-NEXT:    s_mov_b32 s8, s0
+; GFX9-NEXT:    s_mov_b32 s9, s0
+; GFX9-NEXT:    s_mov_b32 s10, s0
+; GFX9-NEXT:    s_mov_b32 s11, s0
+; GFX9-NEXT:    s_mov_b32 s12, s0
+; GFX9-NEXT:    s_mov_b32 s13, s0
+; GFX9-NEXT:    s_mov_b32 s14, s0
+; GFX9-NEXT:    s_mov_b32 s15, s0
+; GFX9-NEXT:    s_mov_b32 s16, s0
+; GFX9-NEXT:    s_mov_b32 s17, s0
+; GFX9-NEXT:    s_mov_b32 s18, s0
+; GFX9-NEXT:    s_mov_b32 s19, s0
+; GFX9-NEXT:    s_mov_b32 s20, s0
+; GFX9-NEXT:    s_mov_b32 s21, s0
+; GFX9-NEXT:    s_mov_b32 s22, s0
+; GFX9-NEXT:    s_mov_b32 s23, s0
+; GFX9-NEXT:    s_mov_b32 s24, s0
+; GFX9-NEXT:    s_mov_b32 s25, s0
+; GFX9-NEXT:    s_mov_b32 s26, s0
+; GFX9-NEXT:    s_mov_b32 s27, s0
+; GFX9-NEXT:    s_mov_b32 s28, s0
+; GFX9-NEXT:    s_mov_b32 s29, s0
+; GFX9-NEXT:    s_mov_b32 s30, s0
+; GFX9-NEXT:    s_mov_b32 s31, s0
+; GFX9-NEXT:    s_add_i32 s36, s33, 3
+; GFX9-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
+; GFX9-NEXT:    v_mov_b64_e32 v[2:3], s[2:3]
+; GFX9-NEXT:    v_mov_b64_e32 v[4:5], s[4:5]
+; GFX9-NEXT:    v_mov_b64_e32 v[6:7], s[6:7]
+; GFX9-NEXT:    v_mov_b64_e32 v[8:9], s[8:9]
+; GFX9-NEXT:    v_mov_b64_e32 v[10:11], s[10:11]
+; GFX9-NEXT:    v_mov_b64_e32 v[12:13], s[12:13]
+; GFX9-NEXT:    v_mov_b64_e32 v[14:15], s[14:15]
+; GFX9-NEXT:    v_mov_b64_e32 v[16:17], s[16:17]
+; GFX9-NEXT:    v_mov_b64_e32 v[18:19], s[18:19]
+; GFX9-NEXT:    v_mov_b64_e32 v[20:21], s[20:21]
+; GFX9-NEXT:    v_mov_b64_e32 v[22:23], s[22:23]
+; GFX9-NEXT:    v_mov_b64_e32 v[24:25], s[24:25]
+; GFX9-NEXT:    v_mov_b64_e32 v[26:27], s[26:27]
+; GFX9-NEXT:    v_mov_b64_e32 v[28:29], s[28:29]
+; GFX9-NEXT:    v_mov_b64_e32 v[30:31], s[30:31]
+; GFX9-NEXT:    s_set_gpr_idx_on s36, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v35, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s0, s33, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s0, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v34, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    v_mov_b32_e32 v36, 0
+; GFX9-NEXT:    s_set_gpr_idx_on s33, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v33, v1
+; GFX9-NEXT:    v_mov_b32_e32 v32, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v36, v[32:35], s[34:35]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b64 s[34:35], s[4:5], 0x0
+; GFX11-NEXT:    s_load_b32 s33, s[4:5], 0x8
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX11-NEXT:    v_mov_b32_e32 v35, 0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_mov_b32 s1, s0
+; GFX11-NEXT:    s_mov_b32 s2, s0
+; GFX11-NEXT:    s_mov_b32 s3, s0
+; GFX11-NEXT:    s_mov_b32 s4, s0
+; GFX11-NEXT:    s_mov_b32 s5, s0
+; GFX11-NEXT:    s_mov_b32 s6, s0
+; GFX11-NEXT:    s_mov_b32 s7, s0
+; GFX11-NEXT:    s_mov_b32 s8, s0
+; GFX11-NEXT:    s_mov_b32 s9, s0
+; GFX11-NEXT:    s_mov_b32 s10, s0
+; GFX11-NEXT:    s_mov_b32 s11, s0
+; GFX11-NEXT:    s_mov_b32 s12, s0
+; GFX11-NEXT:    s_mov_b32 s13, s0
+; GFX11-NEXT:    s_mov_b32 s14, s0
+; GFX11-NEXT:    s_mov_b32 s15, s0
+; GFX11-NEXT:    s_mov_b32 s16, s0
+; GFX11-NEXT:    s_mov_b32 s17, s0
+; GFX11-NEXT:    s_mov_b32 s18, s0
+; GFX11-NEXT:    s_mov_b32 s19, s0
+; GFX11-NEXT:    s_mov_b32 s20, s0
+; GFX11-NEXT:    s_mov_b32 s21, s0
+; GFX11-NEXT:    s_mov_b32 s22, s0
+; GFX11-NEXT:    s_mov_b32 s23, s0
+; GFX11-NEXT:    s_mov_b32 s24, s0
+; GFX11-NEXT:    s_mov_b32 s25, s0
+; GFX11-NEXT:    s_mov_b32 s26, s0
+; GFX11-NEXT:    s_mov_b32 s27, s0
+; GFX11-NEXT:    s_mov_b32 s28, s0
+; GFX11-NEXT:    s_mov_b32 s29, s0
+; GFX11-NEXT:    s_mov_b32 s30, s0
+; GFX11-NEXT:    s_mov_b32 s31, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s33, s33, s33
+; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX11-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX11-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX11-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX11-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX11-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX11-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX11-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX11-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX11-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX11-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX11-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX11-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX11-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX11-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_add_i32 m0, s0, 3
+; GFX11-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX11-NEXT:    s_add_i32 m0, s0, 2
+; GFX11-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX11-NEXT:    s_mov_b32 m0, s0
+; GFX11-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX11-NEXT:    global_store_b128 v35, v[31:34], s[34:35]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[36:38], s[4:5], 0x0
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX12-NEXT:    v_mov_b32_e32 v35, 0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_mov_b32 s1, s0
+; GFX12-NEXT:    s_mov_b32 s2, s0
+; GFX12-NEXT:    s_mov_b32 s3, s0
+; GFX12-NEXT:    s_mov_b32 s4, s0
+; GFX12-NEXT:    s_mov_b32 s5, s0
+; GFX12-NEXT:    s_mov_b32 s6, s0
+; GFX12-NEXT:    s_mov_b32 s7, s0
+; GFX12-NEXT:    s_mov_b32 s8, s0
+; GFX12-NEXT:    s_mov_b32 s9, s0
+; GFX12-NEXT:    s_mov_b32 s10, s0
+; GFX12-NEXT:    s_mov_b32 s11, s0
+; GFX12-NEXT:    s_mov_b32 s12, s0
+; GFX12-NEXT:    s_mov_b32 s13, s0
+; GFX12-NEXT:    s_mov_b32 s14, s0
+; GFX12-NEXT:    s_mov_b32 s15, s0
+; GFX12-NEXT:    s_mov_b32 s16, s0
+; GFX12-NEXT:    s_mov_b32 s17, s0
+; GFX12-NEXT:    s_mov_b32 s18, s0
+; GFX12-NEXT:    s_mov_b32 s19, s0
+; GFX12-NEXT:    s_mov_b32 s20, s0
+; GFX12-NEXT:    s_mov_b32 s21, s0
+; GFX12-NEXT:    s_mov_b32 s22, s0
+; GFX12-NEXT:    s_mov_b32 s23, s0
+; GFX12-NEXT:    s_mov_b32 s24, s0
+; GFX12-NEXT:    s_mov_b32 s25, s0
+; GFX12-NEXT:    s_mov_b32 s26, s0
+; GFX12-NEXT:    s_mov_b32 s27, s0
+; GFX12-NEXT:    s_mov_b32 s28, s0
+; GFX12-NEXT:    s_mov_b32 s29, s0
+; GFX12-NEXT:    s_mov_b32 s30, s0
+; GFX12-NEXT:    s_mov_b32 s31, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s33, s38, s38
+; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX12-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX12-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX12-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX12-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX12-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX12-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX12-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX12-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX12-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX12-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX12-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX12-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX12-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX12-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX12-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX12-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 3
+; GFX12-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 2
+; GFX12-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX12-NEXT:    s_mov_b32 m0, s0
+; GFX12-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX12-NEXT:    global_store_b128 v35, v[31:34], s[36:37]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <128 x i8> poison
+  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <16 x i8>
+  store <16 x i8> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v64i16_v8i16(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 3
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 2
+; GFX12-NEXT:    v_mov_b32_e32 v4, 0
+; GFX12-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX12-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <64 x i16> poison
+  %allocabc = bitcast <64 x i16> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <8 x i16>
+  store <8 x i16> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v32i32_v4i32(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2...
[truncated]

Copilot AI left a comment

Pull request overview

This PR optimizes the AMDGPU promote-alloca pass by reducing the number of extract/insert element operations when accessing vectors with dynamic indices. When a subvector is loaded with a dynamic index, the code now inserts bitcasts to transform the operation into a single extractelement on a larger integer type, rather than generating multiple extract/insert operations that would expand inefficiently in the AMDGPU backend.

Key changes:

  • Added optimization in promoteAllocaUserToVector to detect dynamic index loads of power-of-2 integer vectors
  • Transforms scalarized copy operations using intermediate bitcasts to reduce extractelement chains
  • Updated test expectations to verify the new bitcast-based transformation

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp: Implements the bitcast optimization for dynamic-index vector loads
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll: Tests the IR transformation with various vector sizes
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll: Tests the backend code generation for different GPU architectures
  • llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll: Updates an existing test to reflect the new transformation behavior


github-actions bot commented Dec 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

@ruiling (Contributor) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.
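
For contrast, a minimal sketch (not from the patch or this thread) of an aligned variant where the single-extract lowering is valid, because the load's own alignment guarantees a 128-bit boundary:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 16
  ; align 16 on the load implies %idx is a multiple of 8, so the promoted
  ; <32 x i16> value can be bitcast to <4 x i128>, one element extracted at
  ; %idx / 8, and the result bitcast back to <8 x i16>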

@choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.

I believe GEPToVectorIndex only allows GEPs that map to the vector index. I will add an unaligned test case to make sure.

@ruiling (Contributor) commented Dec 9, 2025

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.

I believe GEPToVectorIndex only allows GEPs that map to the vector index. I will add an unaligned test case to make sure.

I think GEPToVectorIndex is used to get the index into the vector type translated from the alloca. By unaligned, I really mean not aligned to the vector type you just temporarily bitcasted to. In the case below, the alloca was translated to <32 x i16>, but in order to translate the load <8 x i16> into the simplified form you bitcast it to <4 x i128>. Unaligned means the case where the address of %gep is not aligned to 128 bits.

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 2

Hope I understand everything correctly.

@nhaehnle (Collaborator) left a comment

Agreed that the alignment test is required.

…itcasts to reduce number of extractelements as they have large expansion in the backend.

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index.
Using identity bitcasts can reduce the number of extract/insert elements down to 1 and produce much smaller and more efficient generated code.
github-actions bot commented Dec 15, 2025

🐧 Linux x64 Test Results

  • 187505 tests passed
  • 4970 tests skipped

✅ The build succeeded and all tests passed.

github-actions bot commented Dec 15, 2025

🪟 Windows x64 Test Results

  • 128666 tests passed
  • 2821 tests skipped

✅ The build succeeded and all tests passed.

Comment on lines 663 to 667
      // Check that NumBits and GEP's type's NumBits match for alignment
      if (auto *GEPTy =
              dyn_cast<FixedVectorType>(GEP->getSourceElementType())) {
        if (NumBits == GEPTy->getScalarSizeInBits() *
                           GEPTy->getNumElements()) {
Collaborator

Couple of bad problems there. First, you used a dyn_cast above, so GEP may be null.

Second, looking at the GEP source element type is strongly discouraged since GEPs can (and will be) canonicalized to i8.

Can't you just use the load instruction alignment?

Contributor Author

Understood. Yes, I can look at the Load's align instead.

@nhaehnle (Collaborator) left a comment

Thanks.

Comment on lines 669 to 671
        assert(VectorTy->getScalarSizeInBits() <
                   NewElemType->getScalarSizeInBits() &&
               "New element type should be bigger");
Collaborator

I only just noticed that. What justifies this assertion? And is it really necessary?

What if we have a <4 x i16> load from a <32 x i32> alloca?
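
For concreteness, a hypothetical example of the case being asked about (the GEP granularity and the alignment are made up, not taken from the patch or its tests):

  %alloca = alloca [32 x i32], align 16, addrspace(5)
  %gep = getelementptr i32, ptr addrspace(5) %alloca, i32 %idx
  %load = load <4 x i16>, ptr addrspace(5) %gep, align 8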

Contributor

It seems the code tries to assert that the total access size is larger than the vector element size, that is, that at least one vector element is accessed. I think the previous assert(AccessSize.isKnownMultipleOf(DL.getTypeStoreSize(VecEltTy))); should already cover this.

@choikwa (Contributor, Author), Dec 16, 2025

This was just a sanity check to say that the bitcast's element type should always be larger than the alloca's and the subvector's element type. I had meant to compare against SubVecTy's element type, but it should be the same as VectorTy's element type. I can use SubVecTy's.

      uint64_t NumBits =
          SubVecTy->getScalarSizeInBits() * SubVecTy->getNumElements();
      uint64_t LoadAlign = cast<LoadInst>(Inst)->getAlign().value();
      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
Contributor

The SubVecTy->isIntOrIntVectorTy() check is not needed. It will always be a vector type; whether it is float or integer is not important.

      uint64_t LoadAlign = cast<LoadInst>(Inst)->getAlign().value();
      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
          llvm::isPowerOf2_32(SubVecTy->getNumElements()) &&
          VectorTy->getNumElements() % SubVecTy->getNumElements() == 0 &&
Contributor

We have so many checks here. It would be better to make this more readable, e.g. by giving some of the checks properly named variables:

unsigned TotalNumElts = VectorTy->getNumElements();
bool IsProperlyDivisible = TotalNumElts % NumLoadedElts == 0;
bool IsAlignedLoad = NumBits <= (LoadAlign * 8u);

        assert(VectorTy->getScalarSizeInBits() <
                   NewElemType->getScalarSizeInBits() &&
               "New element type should be bigger");
        assert(IndexDivisor > 0u && "Zero index divisor");
Contributor

Shouldn't VectorTy->getNumElements() % SubVecTy->getNumElements() == 0 already cover this?

          llvm::isPowerOf2_32(AccessVecTy->getNumElements()) &&
          NumBits <= (LoadAlign * 8u)) {
        IntegerType *NewElemType = Builder.getIntNTy(NumBits);
        const unsigned NewNumElts = VectorTy->getNumElements() *
Contributor

Shorter code would be easier to follow, e.g.:

const unsigned NewNumElts = DL.getTypeStoreSizeInBits(VectorTy) / NumBits;

        FixedVectorType *BitCastType =
            FixedVectorType::get(NewElemType, NewNumElts);
        Value *BCVal = Builder.CreateBitCast(CurVal, BitCastType);
        Value *NewIdx = Builder.CreateUDiv(
Contributor

Can we use a right shift explicitly, now that we have checked for power-of-2?
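
In IR terms, a minimal sketch of the strength reduction being suggested (assuming an index divisor of 4; the value names are illustrative):

  %newidx.udiv = udiv i32 %idx, 4   ; current form
  %newidx.lshr = lshr i32 %idx, 2   ; equivalent when the divisor is a power of two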
