Conversation

@choikwa (Contributor) commented Dec 9, 2025

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.
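
As a rough illustration (value names here are made up; this mirrors the updated check lines in llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll below), a dynamic-index <4 x i16> read from a promoted <16 x i16> vector, which previously scalarized into four extractelement/insertelement pairs, now becomes:

  %bc  = bitcast <16 x i16> %promoted to <4 x i64>
  %sub = udiv i32 %dynidx, 4
  %elt = extractelement <4 x i64> %bc, i32 %sub
  %out = bitcast i64 %elt to <4 x i16>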

Credit: ruiling in #171253

@llvmbot (Member) commented Dec 9, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Kevin Choi (choikwa)

Changes

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index. Using identity bitcasts reduces the number of extract/insert elements down to 1 and produces much smaller, more efficient generated code.


Patch is 29.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/171253.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+30)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll (+5-12)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll (+497)
  • (added) llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll (+82)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index b79689c39ef84..b3de42ff7612a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -42,6 +42,7 @@
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Pass.h"
+#include "llvm/Support/MathExtras.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Transforms/Utils/SSAUpdater.h"
 
@@ -556,6 +557,35 @@ static Value *promoteAllocaUserToVector(
       auto *SubVecTy = FixedVectorType::get(VecEltTy, NumLoadedElts);
       assert(DL.getTypeStoreSize(SubVecTy) == DL.getTypeStoreSize(AccessTy));
 
+      // If idx is dynamic, then sandwich load with bitcasts.
+      // ie. <64 x i8> -> <16 x i8>  instead do
+      //     <64 x i8> -> <4 x i128> -> i128 -> <16 x i8>
+      // Extracting subvector with dynamic index has very large expansion in
+      // the amdgpu backend. Limit to pow2 for UDiv.
+      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
+          llvm::isPowerOf2_32(VectorTy->getNumElements()) &&
+          llvm::isPowerOf2_32(SubVecTy->getNumElements())) {
+        IntegerType *NewElemType = Builder.getIntNTy(
+            SubVecTy->getScalarSizeInBits() * SubVecTy->getNumElements());
+        const unsigned NewNumElts =
+            VectorTy->getNumElements() * VectorTy->getScalarSizeInBits() /
+              NewElemType->getScalarSizeInBits();
+        const unsigned IndexDivisor = VectorTy->getNumElements() / NewNumElts;
+        assert(VectorTy->getScalarSizeInBits() <
+            NewElemType->getScalarSizeInBits() &&
+            "New element type should be bigger");
+        assert(IndexDivisor > 0u && "Zero index divisor");
+        FixedVectorType *BitCastType =
+            FixedVectorType::get(NewElemType, NewNumElts);
+        Value *BCVal = Builder.CreateBitCast(CurVal, BitCastType);
+        Value *NewIdx = Builder.CreateUDiv(Index,
+            ConstantInt::get(Index->getType(), IndexDivisor));
+        Value *ExtVal = Builder.CreateExtractElement(BCVal, NewIdx);
+        Value *BCOut = Builder.CreateBitCast(ExtVal, SubVecTy);
+        Inst->replaceAllUsesWith(BCOut);
+        return nullptr;
+      }
+
       Value *SubVec = PoisonValue::get(SubVecTy);
       for (unsigned K = 0; K < NumLoadedElts; ++K) {
         Value *CurIdx =
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
index 62a04f3a6f86f..5aa09ae74ec36 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll
@@ -436,18 +436,11 @@ define <4 x i16> @nonconst_indexes(i1 %cond, i32 %otheridx, <4 x i16> %store) #0
 ; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[INDEX_1]], 3
 ; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <4 x i16> [[STORE]], i64 3
 ; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <16 x i16> [[TMP7]], i16 [[TMP9]], i32 [[TMP8]]
-; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[INDEX_2]]
-; CHECK-NEXT:    [[TMP12:%.*]] = insertelement <4 x i16> poison, i16 [[TMP11]], i64 0
-; CHECK-NEXT:    [[TMP13:%.*]] = add i32 [[INDEX_2]], 1
-; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP13]]
-; CHECK-NEXT:    [[TMP15:%.*]] = insertelement <4 x i16> [[TMP12]], i16 [[TMP14]], i64 1
-; CHECK-NEXT:    [[TMP16:%.*]] = add i32 [[INDEX_2]], 2
-; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP16]]
-; CHECK-NEXT:    [[TMP18:%.*]] = insertelement <4 x i16> [[TMP15]], i16 [[TMP17]], i64 2
-; CHECK-NEXT:    [[TMP19:%.*]] = add i32 [[INDEX_2]], 3
-; CHECK-NEXT:    [[TMP20:%.*]] = extractelement <16 x i16> [[TMP10]], i32 [[TMP19]]
-; CHECK-NEXT:    [[TMP21:%.*]] = insertelement <4 x i16> [[TMP18]], i16 [[TMP20]], i64 3
-; CHECK-NEXT:    ret <4 x i16> [[TMP21]]
+; CHECK-NEXT:    [[TMP11:%.*]] = bitcast <16 x i16> [[TMP10]] to <4 x i64>
+; CHECK-NEXT:    [[TMP12:%.*]] = udiv i32 [[INDEX_2]], 4
+; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <4 x i64> [[TMP11]], i32 [[TMP12]]
+; CHECK-NEXT:    [[TMP14:%.*]] = bitcast i64 [[TMP13]] to <4 x i16>
+; CHECK-NEXT:    ret <4 x i16> [[TMP14]]
 ;
 entry:
   %data = alloca [16 x i16], addrspace(5)
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
new file mode 100644
index 0000000000000..084b7a2d59b2f
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll
@@ -0,0 +1,497 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 < %s | FileCheck %s --check-prefixes=GFX9
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck %s --check-prefixes=GFX11
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 < %s | FileCheck %s --check-prefixes=GFX12
+
+define amdgpu_kernel void @test_bitcast_llc_v128i8_v16i8(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x0
+; GFX9-NEXT:    s_load_dword s33, s[4:5], 0x8
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX9-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX9-NEXT:    s_or_b32 s0, s1, s0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s33, s33, s33
+; GFX9-NEXT:    s_mov_b32 s1, s0
+; GFX9-NEXT:    s_lshl_b32 s33, s33, 1
+; GFX9-NEXT:    s_mov_b32 s2, s0
+; GFX9-NEXT:    s_mov_b32 s3, s0
+; GFX9-NEXT:    s_mov_b32 s4, s0
+; GFX9-NEXT:    s_mov_b32 s5, s0
+; GFX9-NEXT:    s_mov_b32 s6, s0
+; GFX9-NEXT:    s_mov_b32 s7, s0
+; GFX9-NEXT:    s_mov_b32 s8, s0
+; GFX9-NEXT:    s_mov_b32 s9, s0
+; GFX9-NEXT:    s_mov_b32 s10, s0
+; GFX9-NEXT:    s_mov_b32 s11, s0
+; GFX9-NEXT:    s_mov_b32 s12, s0
+; GFX9-NEXT:    s_mov_b32 s13, s0
+; GFX9-NEXT:    s_mov_b32 s14, s0
+; GFX9-NEXT:    s_mov_b32 s15, s0
+; GFX9-NEXT:    s_mov_b32 s16, s0
+; GFX9-NEXT:    s_mov_b32 s17, s0
+; GFX9-NEXT:    s_mov_b32 s18, s0
+; GFX9-NEXT:    s_mov_b32 s19, s0
+; GFX9-NEXT:    s_mov_b32 s20, s0
+; GFX9-NEXT:    s_mov_b32 s21, s0
+; GFX9-NEXT:    s_mov_b32 s22, s0
+; GFX9-NEXT:    s_mov_b32 s23, s0
+; GFX9-NEXT:    s_mov_b32 s24, s0
+; GFX9-NEXT:    s_mov_b32 s25, s0
+; GFX9-NEXT:    s_mov_b32 s26, s0
+; GFX9-NEXT:    s_mov_b32 s27, s0
+; GFX9-NEXT:    s_mov_b32 s28, s0
+; GFX9-NEXT:    s_mov_b32 s29, s0
+; GFX9-NEXT:    s_mov_b32 s30, s0
+; GFX9-NEXT:    s_mov_b32 s31, s0
+; GFX9-NEXT:    s_add_i32 s36, s33, 3
+; GFX9-NEXT:    v_mov_b64_e32 v[0:1], s[0:1]
+; GFX9-NEXT:    v_mov_b64_e32 v[2:3], s[2:3]
+; GFX9-NEXT:    v_mov_b64_e32 v[4:5], s[4:5]
+; GFX9-NEXT:    v_mov_b64_e32 v[6:7], s[6:7]
+; GFX9-NEXT:    v_mov_b64_e32 v[8:9], s[8:9]
+; GFX9-NEXT:    v_mov_b64_e32 v[10:11], s[10:11]
+; GFX9-NEXT:    v_mov_b64_e32 v[12:13], s[12:13]
+; GFX9-NEXT:    v_mov_b64_e32 v[14:15], s[14:15]
+; GFX9-NEXT:    v_mov_b64_e32 v[16:17], s[16:17]
+; GFX9-NEXT:    v_mov_b64_e32 v[18:19], s[18:19]
+; GFX9-NEXT:    v_mov_b64_e32 v[20:21], s[20:21]
+; GFX9-NEXT:    v_mov_b64_e32 v[22:23], s[22:23]
+; GFX9-NEXT:    v_mov_b64_e32 v[24:25], s[24:25]
+; GFX9-NEXT:    v_mov_b64_e32 v[26:27], s[26:27]
+; GFX9-NEXT:    v_mov_b64_e32 v[28:29], s[28:29]
+; GFX9-NEXT:    v_mov_b64_e32 v[30:31], s[30:31]
+; GFX9-NEXT:    s_set_gpr_idx_on s36, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v35, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s0, s33, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s0, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v34, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    v_mov_b32_e32 v36, 0
+; GFX9-NEXT:    s_set_gpr_idx_on s33, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v33, v1
+; GFX9-NEXT:    v_mov_b32_e32 v32, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v36, v[32:35], s[34:35]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b64 s[34:35], s[4:5], 0x0
+; GFX11-NEXT:    s_load_b32 s33, s[4:5], 0x8
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX11-NEXT:    v_mov_b32_e32 v35, 0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX11-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s1, s0
+; GFX11-NEXT:    s_mov_b32 s1, s0
+; GFX11-NEXT:    s_mov_b32 s2, s0
+; GFX11-NEXT:    s_mov_b32 s3, s0
+; GFX11-NEXT:    s_mov_b32 s4, s0
+; GFX11-NEXT:    s_mov_b32 s5, s0
+; GFX11-NEXT:    s_mov_b32 s6, s0
+; GFX11-NEXT:    s_mov_b32 s7, s0
+; GFX11-NEXT:    s_mov_b32 s8, s0
+; GFX11-NEXT:    s_mov_b32 s9, s0
+; GFX11-NEXT:    s_mov_b32 s10, s0
+; GFX11-NEXT:    s_mov_b32 s11, s0
+; GFX11-NEXT:    s_mov_b32 s12, s0
+; GFX11-NEXT:    s_mov_b32 s13, s0
+; GFX11-NEXT:    s_mov_b32 s14, s0
+; GFX11-NEXT:    s_mov_b32 s15, s0
+; GFX11-NEXT:    s_mov_b32 s16, s0
+; GFX11-NEXT:    s_mov_b32 s17, s0
+; GFX11-NEXT:    s_mov_b32 s18, s0
+; GFX11-NEXT:    s_mov_b32 s19, s0
+; GFX11-NEXT:    s_mov_b32 s20, s0
+; GFX11-NEXT:    s_mov_b32 s21, s0
+; GFX11-NEXT:    s_mov_b32 s22, s0
+; GFX11-NEXT:    s_mov_b32 s23, s0
+; GFX11-NEXT:    s_mov_b32 s24, s0
+; GFX11-NEXT:    s_mov_b32 s25, s0
+; GFX11-NEXT:    s_mov_b32 s26, s0
+; GFX11-NEXT:    s_mov_b32 s27, s0
+; GFX11-NEXT:    s_mov_b32 s28, s0
+; GFX11-NEXT:    s_mov_b32 s29, s0
+; GFX11-NEXT:    s_mov_b32 s30, s0
+; GFX11-NEXT:    s_mov_b32 s31, s0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s33, s33, s33
+; GFX11-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX11-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX11-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX11-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX11-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX11-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX11-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX11-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX11-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX11-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX11-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX11-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX11-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX11-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX11-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX11-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX11-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_add_i32 m0, s0, 3
+; GFX11-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX11-NEXT:    s_add_i32 m0, s0, 2
+; GFX11-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX11-NEXT:    s_mov_b32 m0, s0
+; GFX11-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX11-NEXT:    global_store_b128 v35, v[31:34], s[34:35]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v128i8_v16i8:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[36:38], s[4:5], 0x0
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX12-NEXT:    v_mov_b32_e32 v35, 0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xff
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_and_b32 s1, s0, 0xffff
+; GFX12-NEXT:    s_lshl_b32 s0, s0, 16
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_or_b32 s0, s1, s0
+; GFX12-NEXT:    s_mov_b32 s1, s0
+; GFX12-NEXT:    s_mov_b32 s2, s0
+; GFX12-NEXT:    s_mov_b32 s3, s0
+; GFX12-NEXT:    s_mov_b32 s4, s0
+; GFX12-NEXT:    s_mov_b32 s5, s0
+; GFX12-NEXT:    s_mov_b32 s6, s0
+; GFX12-NEXT:    s_mov_b32 s7, s0
+; GFX12-NEXT:    s_mov_b32 s8, s0
+; GFX12-NEXT:    s_mov_b32 s9, s0
+; GFX12-NEXT:    s_mov_b32 s10, s0
+; GFX12-NEXT:    s_mov_b32 s11, s0
+; GFX12-NEXT:    s_mov_b32 s12, s0
+; GFX12-NEXT:    s_mov_b32 s13, s0
+; GFX12-NEXT:    s_mov_b32 s14, s0
+; GFX12-NEXT:    s_mov_b32 s15, s0
+; GFX12-NEXT:    s_mov_b32 s16, s0
+; GFX12-NEXT:    s_mov_b32 s17, s0
+; GFX12-NEXT:    s_mov_b32 s18, s0
+; GFX12-NEXT:    s_mov_b32 s19, s0
+; GFX12-NEXT:    s_mov_b32 s20, s0
+; GFX12-NEXT:    s_mov_b32 s21, s0
+; GFX12-NEXT:    s_mov_b32 s22, s0
+; GFX12-NEXT:    s_mov_b32 s23, s0
+; GFX12-NEXT:    s_mov_b32 s24, s0
+; GFX12-NEXT:    s_mov_b32 s25, s0
+; GFX12-NEXT:    s_mov_b32 s26, s0
+; GFX12-NEXT:    s_mov_b32 s27, s0
+; GFX12-NEXT:    s_mov_b32 s28, s0
+; GFX12-NEXT:    s_mov_b32 s29, s0
+; GFX12-NEXT:    s_mov_b32 s30, s0
+; GFX12-NEXT:    s_mov_b32 s31, s0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s33, s38, s38
+; GFX12-NEXT:    v_dual_mov_b32 v0, s0 :: v_dual_mov_b32 v1, s1
+; GFX12-NEXT:    v_dual_mov_b32 v2, s2 :: v_dual_mov_b32 v3, s3
+; GFX12-NEXT:    v_dual_mov_b32 v4, s4 :: v_dual_mov_b32 v5, s5
+; GFX12-NEXT:    v_dual_mov_b32 v6, s6 :: v_dual_mov_b32 v7, s7
+; GFX12-NEXT:    v_dual_mov_b32 v8, s8 :: v_dual_mov_b32 v9, s9
+; GFX12-NEXT:    v_dual_mov_b32 v10, s10 :: v_dual_mov_b32 v11, s11
+; GFX12-NEXT:    v_dual_mov_b32 v12, s12 :: v_dual_mov_b32 v13, s13
+; GFX12-NEXT:    v_dual_mov_b32 v14, s14 :: v_dual_mov_b32 v15, s15
+; GFX12-NEXT:    v_dual_mov_b32 v16, s16 :: v_dual_mov_b32 v17, s17
+; GFX12-NEXT:    v_dual_mov_b32 v18, s18 :: v_dual_mov_b32 v19, s19
+; GFX12-NEXT:    v_dual_mov_b32 v20, s20 :: v_dual_mov_b32 v21, s21
+; GFX12-NEXT:    v_dual_mov_b32 v22, s22 :: v_dual_mov_b32 v23, s23
+; GFX12-NEXT:    v_dual_mov_b32 v24, s24 :: v_dual_mov_b32 v25, s25
+; GFX12-NEXT:    v_dual_mov_b32 v26, s26 :: v_dual_mov_b32 v27, s27
+; GFX12-NEXT:    v_dual_mov_b32 v28, s28 :: v_dual_mov_b32 v29, s29
+; GFX12-NEXT:    v_dual_mov_b32 v30, s30 :: v_dual_mov_b32 v31, s31
+; GFX12-NEXT:    s_lshl_b32 s0, s33, 1
+; GFX12-NEXT:    s_wait_alu depctr_sa_sdst(0)
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 3
+; GFX12-NEXT:    v_movrels_b32_e32 v34, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s0, 2
+; GFX12-NEXT:    v_movrels_b32_e32 v33, v0
+; GFX12-NEXT:    s_mov_b32 m0, s0
+; GFX12-NEXT:    v_movrels_b32_e32 v32, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v31, v0
+; GFX12-NEXT:    global_store_b128 v35, v[31:34], s[36:37]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <128 x i8> poison
+  %allocabc = bitcast <128 x i8> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <16 x i8>
+  store <16 x i8> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v64i16_v8i16(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v64i16_v8i16:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 3
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX12-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2, 2
+; GFX12-NEXT:    v_mov_b32_e32 v4, 0
+; GFX12-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX12-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX12-NEXT:    s_endpgm
+entry:
+  %alloca = freeze <64 x i16> poison
+  %allocabc = bitcast <64 x i16> %alloca to <8 x i128>
+  %vec = extractelement <8 x i128> %allocabc, i32 %idx
+  %vecbc = bitcast i128 %vec to <8 x i16>
+  store <8 x i16> %vecbc, ptr addrspace(1) %out, align 16
+  ret void
+}
+
+define amdgpu_kernel void @test_bitcast_llc_v32i32_v4i32(ptr addrspace(1) %out, i32 %idx) {
+; GFX9-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX9:       ; %bb.0: ; %entry
+; GFX9-NEXT:    s_load_dword s2, s[4:5], 0x8
+; GFX9-NEXT:    s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT:    v_mov_b32_e32 v4, 0
+; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX9-NEXT:    s_add_i32 s2, s2, s2
+; GFX9-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v1, v1
+; GFX9-NEXT:    s_add_i32 s3, s2, 3
+; GFX9-NEXT:    v_mov_b32_e32 v0, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_set_gpr_idx_on s3, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v3, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    s_add_i32 s2, s2, 2
+; GFX9-NEXT:    s_set_gpr_idx_on s2, gpr_idx(SRC0)
+; GFX9-NEXT:    v_mov_b32_e32 v2, v0
+; GFX9-NEXT:    s_set_gpr_idx_off
+; GFX9-NEXT:    global_store_dwordx4 v4, v[0:3], s[0:1]
+; GFX9-NEXT:    s_endpgm
+;
+; GFX11-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX11:       ; %bb.0: ; %entry
+; GFX11-NEXT:    s_clause 0x1
+; GFX11-NEXT:    s_load_b32 s2, s[4:5], 0x8
+; GFX11-NEXT:    s_load_b64 s[0:1], s[4:5], 0x0
+; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-NEXT:    s_add_i32 s2, s2, s2
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX11-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX11-NEXT:    s_mov_b32 m0, s2
+; GFX11-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX11-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 3
+; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-NEXT:    v_movrels_b32_e32 v3, v0
+; GFX11-NEXT:    s_add_i32 m0, s2, 2
+; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_movrels_b32_e32 v2, v0
+; GFX11-NEXT:    global_store_b128 v4, v[0:3], s[0:1]
+; GFX11-NEXT:    s_endpgm
+;
+; GFX12-LABEL: test_bitcast_llc_v32i32_v4i32:
+; GFX12:       ; %bb.0: ; %entry
+; GFX12-NEXT:    s_load_b96 s[0:2], s[4:5], 0x0
+; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_add_co_i32 s2, s2, s2
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
+; GFX12-NEXT:    s_lshl_b32 s2, s2, 1
+; GFX12-NEXT:    s_mov_b32 m0, s2
+; GFX12-NEXT:    v_movrels_b32_e32 v1, v1
+; GFX12-NEXT:    v_movrels_b32_e32 v0, v0
+; GFX12-NEXT:    s_add_co_i32 m0, s2...
[truncated]

Copilot AI left a comment

Pull request overview

This PR optimizes the AMDGPU promote-alloca pass by reducing the number of extract/insert element operations when accessing vectors with dynamic indices. When a subvector is loaded with a dynamic index, the code now inserts bitcasts to transform the operation into a single extractelement on a larger integer type, rather than generating multiple extract/insert operations that would expand inefficiently in the AMDGPU backend.

Key changes:

  • Added optimization in promoteAllocaUserToVector to detect dynamic index loads of power-of-2 integer vectors
  • Transforms scalarized copy operations using intermediate bitcasts to reduce extractelement chains
  • Updated test expectations to verify the new bitcast-based transformation

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp: Implements the bitcast optimization for dynamic-index vector loads
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts.ll: Tests the IR transformation with various vector sizes
  • llvm/test/CodeGen/AMDGPU/promote-alloca-vector-dynamic-idx-bitcasts-llc.ll: Tests the backend code generation for different GPU architectures
  • llvm/test/CodeGen/AMDGPU/promote-alloca-subvecs.ll: Updates an existing test to reflect the new transformation behavior


github-actions bot commented Dec 9, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

@ruiling (Contributor) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.
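
For contrast, a minimal sketch (not from the patch or this thread) of an aligned variant where the single-extract lowering is valid, because the load's own alignment guarantees a 128-bit boundary:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 16
  ; align 16 on the load implies %idx is a multiple of 8, so the promoted
  ; <32 x i16> value can be bitcast to <4 x i128>, one element extracted at
  ; %idx / 8, and the result bitcast back to <8 x i16>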

@choikwa (Contributor, Author) commented Dec 9, 2025

@ruiling do you know if the alignment check is strictly required here? As I understand it, once the alloca is promoted to a vector, there is no alignment requirement for extracting a subvector.

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.

I believe GEPToVectorIndex only allows GEPs that map to the vector index. I will add an unaligned test case to make sure.

@ruiling (Contributor) commented Dec 9, 2025

Vector element extract/insert needs to operate on a per-element basis, as in this example:

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i8, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 1

As we don't know whether %gep will be aligned to 16, you cannot bitcast %alloca to <4 x i128> and translate the <8 x i16> load into one extractelement. So you need an alignment check to lower it this way. For the unaligned case, I feel @arsenm's point is that we should optimize the register allocator part or find other ways to simplify the IR. I think you can only handle the aligned case properly in this change. It would be better if we had some tests for the unaligned case.

I believe GEPToVectorIndex only allows GEPs that map to the vector index. I will add an unaligned test case to make sure.

I think GEPToVectorIndex is used to get the index into the vector type translated from the alloca. By unaligned, I really mean not aligned to the vector type you just temporarily bitcasted to. In the case below, the alloca was translated to <32 x i16>, but in order to translate the load <8 x i16> into the simplified form you bitcast it to <4 x i128>. Unaligned means the case where the address of %gep is not aligned to 128 bits.

  %alloca = alloca [32 x i16], align 16, addrspace(5)
  %gep = getelementptr i16, ptr addrspace(5) %alloca, i32 %idx
  %load = load <8 x i16>, ptr addrspace(5) %gep, align 2

Hope I understand everything correctly.

@nhaehnle (Collaborator) left a comment

Agreed that the alignment test is required.

…itcasts to reduce number of extractelements as they have large expansion in the backend.

Investigation revealed that the scalarized copy results in a long chain of extractelement/insertelement instructions, which can cause an explosion of generated temporaries in the AMDGPU backend, since there is no efficient representation for extracting a subvector at a dynamic index.
Using identity bitcasts can reduce the number of extract/insert elements down to 1 and produce much smaller and more efficient generated code.
github-actions bot commented Dec 15, 2025

🐧 Linux x64 Test Results

  • 187505 tests passed
  • 4970 tests skipped

✅ The build succeeded and all tests passed.

github-actions bot commented Dec 15, 2025

🪟 Windows x64 Test Results

  • 128666 tests passed
  • 2821 tests skipped

✅ The build succeeded and all tests passed.

Comment on lines 663 to 667
      // Check that NumBits and GEP's type's NumBits match for alignment
      if (auto *GEPTy =
              dyn_cast<FixedVectorType>(GEP->getSourceElementType())) {
        if (NumBits == GEPTy->getScalarSizeInBits() *
                           GEPTy->getNumElements()) {
Collaborator

Couple of bad problems there. First, you used a dyn_cast above, so GEP may be null.

Second, looking at the GEP source element type is strongly discouraged since GEPs can (and will be) canonicalized to i8.

Can't you just use the load instruction alignment?

Contributor Author

Understood. Yes, I can look at the Load's align instead.

@nhaehnle (Collaborator) left a comment

Thanks.

Comment on lines 669 to 671
        assert(VectorTy->getScalarSizeInBits() <
                   NewElemType->getScalarSizeInBits() &&
               "New element type should be bigger");
Collaborator

I only just noticed that. What justifies this assertion? And is it really necessary?

What if we have a <4 x i16> load from a <32 x i32> alloca?
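
For concreteness, a hypothetical example of the case being asked about (the GEP granularity and the alignment are made up, not taken from the patch or its tests):

  %alloca = alloca [32 x i32], align 16, addrspace(5)
  %gep = getelementptr i32, ptr addrspace(5) %alloca, i32 %idx
  %load = load <4 x i16>, ptr addrspace(5) %gep, align 8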

Contributor

It seems the code tries to assert that the total access size is larger than the vector element size, that is, that at least one vector element is accessed. I think the previous assert(AccessSize.isKnownMultipleOf(DL.getTypeStoreSize(VecEltTy))); should already cover this.

@choikwa (Contributor, Author), Dec 16, 2025

This was just a sanity check to say that the bitcast's element type should always be larger than the alloca's and the subvector's element type. I had meant to compare against SubVecTy's element type, but it should be the same as VectorTy's element type. I can use SubVecTy's.

      uint64_t NumBits =
          SubVecTy->getScalarSizeInBits() * SubVecTy->getNumElements();
      uint64_t LoadAlign = cast<LoadInst>(Inst)->getAlign().value();
      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
Contributor

The SubVecTy->isIntOrIntVectorTy() check is not needed. It will always be a vector type; whether it is float or integer is not important.

      uint64_t LoadAlign = cast<LoadInst>(Inst)->getAlign().value();
      if (!isa<ConstantInt>(Index) && SubVecTy->isIntOrIntVectorTy() &&
          llvm::isPowerOf2_32(SubVecTy->getNumElements()) &&
          VectorTy->getNumElements() % SubVecTy->getNumElements() == 0 &&
Contributor

We have so many checks here. It would be better to make this more readable, e.g. by giving some of the checks properly named variables:

unsigned TotalNumElts = VectorTy->getNumElements();
bool IsProperlyDivisible = TotalNumElts % NumLoadedElts == 0;
bool IsAlignedLoad = NumBits <= (LoadAlign * 8u);

        assert(VectorTy->getScalarSizeInBits() <
                   NewElemType->getScalarSizeInBits() &&
               "New element type should be bigger");
        assert(IndexDivisor > 0u && "Zero index divisor");
Contributor

Shouldn't VectorTy->getNumElements() % SubVecTy->getNumElements() == 0 already cover this?

          llvm::isPowerOf2_32(AccessVecTy->getNumElements()) &&
          NumBits <= (LoadAlign * 8u)) {
        IntegerType *NewElemType = Builder.getIntNTy(NumBits);
        const unsigned NewNumElts = VectorTy->getNumElements() *
Contributor

Shorter code would be easier to follow, e.g.:

const unsigned NewNumElts = DL.getTypeStoreSizeInBits(VectorTy) / NumBits;

        FixedVectorType *BitCastType =
            FixedVectorType::get(NewElemType, NewNumElts);
        Value *BCVal = Builder.CreateBitCast(CurVal, BitCastType);
        Value *NewIdx = Builder.CreateUDiv(
Contributor

Can we use a right shift explicitly, now that we have checked for power-of-2?
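
In IR terms, a minimal sketch of the strength reduction being suggested (assuming an index divisor of 4; the value names are illustrative):

  %newidx.udiv = udiv i32 %idx, 4   ; current form
  %newidx.lshr = lshr i32 %idx, 2   ; equivalent when the divisor is a power of two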
