
[AMDGPU] Add another SIFoldOperands instance after shrink #67878


Closed
wants to merge 4 commits

Conversation

rampitec
Collaborator

There is no fold operands pass past the shrink, and at the same time there are only limited attempts to do shrinking right inside folding. We seem to need to run shrinking before folding, hence this patch. I can see some clear benefits in the tests we have. I also need this for a future patch. We could extend our efforts to do shrinking inside folding, but in the end that would just result in recreating the shrinking pass there.

As an alternative I have tried to move the previous instance of the folding past the shrink, but the result is not as good as here and there were a few regressions. We may have some light compile-time regressions, hence it is disabled at -O1.

@llvmbot
Member

llvmbot commented Sep 30, 2023

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Changes


Patch is 206.21 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/67878.diff

18 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll (+69-92)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i32.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i32.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/ssubsat.ll (+69-92)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i32.ll (+7-7)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/ds-combine-large-stride.ll (+23-23)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+2)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+40-66)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+3-6)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll (+645-653)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+3-6)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/urem64.ll (+1-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index bcbc03eb2559c4f..a674c52667c684b 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -1199,6 +1199,8 @@ void GCNPassConfig::addMachineSSAOptimization() {
   }
   addPass(&DeadMachineInstructionElimID);
   addPass(createSIShrinkInstructionsPass());
+  if (TM->getOptLevel() > CodeGenOptLevel::Less)
+    addPass(&SIFoldOperandsID);
 }
 
 bool GCNPassConfig::addILPOpts() {
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll
index 26d1fbb09210c64..e9f30e8503b310e 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/add.v2i16.ll
@@ -165,9 +165,8 @@ define <2 x i16> @v_add_v2i16_neg_inline_imm_splat(<2 x i16> %a) {
 ; GFX7-LABEL: v_add_v2i16_neg_inline_imm_splat:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    s_movk_i32 s4, 0xffc0
-; GFX7-NEXT:    v_add_i32_e32 v0, vcc, s4, v0
-; GFX7-NEXT:    v_add_i32_e32 v1, vcc, s4, v1
+; GFX7-NEXT:    v_add_i32_e32 v0, vcc, 0xffffffc0, v0
+; GFX7-NEXT:    v_add_i32_e32 v1, vcc, 0xffffffc0, v1
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX9-LABEL: v_add_v2i16_neg_inline_imm_splat:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
index cded5c94edf8cc3..c78d4533f4ddd3f 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/saddsat.ll
@@ -231,14 +231,12 @@ define i16 @v_saddsat_v2i8(i16 %lhs.arg, i16 %rhs.arg) {
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v2, 8, v0
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 24, v0
-; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v5, 0, v0
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v3, 8, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s5, v5
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s4, v4
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX6-NEXT:    v_max_i32_e32 v1, v5, v1
 ; GFX6-NEXT:    v_min_i32_e32 v1, v1, v4
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
@@ -246,8 +244,8 @@ define i16 @v_saddsat_v2i8(i16 %lhs.arg, i16 %rhs.arg) {
 ; GFX6-NEXT:    v_min_i32_e32 v4, 0, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 24, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s5, v4
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s4, v3
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x80000000, v4
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x7fffffff, v3
 ; GFX6-NEXT:    v_max_i32_e32 v2, v4, v2
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v3
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
@@ -512,15 +510,15 @@ define i32 @v_saddsat_v4i8(i32 %lhs.arg, i32 %rhs.arg) {
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
 ; GFX6-NEXT:    v_lshrrev_b32_e32 v7, 24, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v1
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v8, 0, v0
 ; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, s5, v10
-; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, s4, v8
+; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, 0x7fffffff, v8
 ; GFX6-NEXT:    v_max_i32_e32 v1, v10, v1
 ; GFX6-NEXT:    v_min_i32_e32 v1, v1, v8
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 24, v2
 ; GFX6-NEXT:    v_min_i32_e32 v8, 0, v1
+; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 24, v5
 ; GFX6-NEXT:    v_max_i32_e32 v5, 0, v1
 ; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, s5, v8
@@ -1265,19 +1263,17 @@ define <2 x i32> @v_saddsat_v2i32(<2 x i32> %lhs, <2 x i32> %rhs) {
 ; GFX6-LABEL: v_saddsat_v2i32:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v5, 0, v0
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s5, v5
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s4, v4
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX6-NEXT:    v_max_i32_e32 v2, v5, v2
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v4
 ; GFX6-NEXT:    v_min_i32_e32 v4, 0, v1
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
 ; GFX6-NEXT:    v_max_i32_e32 v2, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s5, v4
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, s4, v2
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x80000000, v4
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0x7fffffff, v2
 ; GFX6-NEXT:    v_max_i32_e32 v3, v4, v3
 ; GFX6-NEXT:    v_min_i32_e32 v2, v3, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
@@ -1286,19 +1282,17 @@ define <2 x i32> @v_saddsat_v2i32(<2 x i32> %lhs, <2 x i32> %rhs) {
 ; GFX8-LABEL: v_saddsat_v2i32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    s_brev_b32 s5, 1
 ; GFX8-NEXT:    v_min_i32_e32 v5, 0, v0
-; GFX8-NEXT:    s_brev_b32 s4, -2
 ; GFX8-NEXT:    v_max_i32_e32 v4, 0, v0
-; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, s5, v5
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, s4, v4
+; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, 0x80000000, v5
+; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX8-NEXT:    v_max_i32_e32 v2, v5, v2
 ; GFX8-NEXT:    v_min_i32_e32 v2, v2, v4
 ; GFX8-NEXT:    v_min_i32_e32 v4, 0, v1
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v0, v2
 ; GFX8-NEXT:    v_max_i32_e32 v2, 0, v1
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, s5, v4
-; GFX8-NEXT:    v_sub_u32_e32 v2, vcc, s4, v2
+; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0x80000000, v4
+; GFX8-NEXT:    v_sub_u32_e32 v2, vcc, 0x7fffffff, v2
 ; GFX8-NEXT:    v_max_i32_e32 v3, v4, v3
 ; GFX8-NEXT:    v_min_i32_e32 v2, v3, v2
 ; GFX8-NEXT:    v_add_u32_e32 v1, vcc, v1, v2
@@ -1383,26 +1377,25 @@ define <3 x i32> @v_saddsat_v3i32(<3 x i32> %lhs, <3 x i32> %rhs) {
 ; GFX6-LABEL: v_saddsat_v3i32:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v7, 0, v0
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v6, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v7, vcc, s5, v7
-; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, s4, v6
+; GFX6-NEXT:    v_sub_i32_e32 v7, vcc, 0x80000000, v7
+; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 0x7fffffff, v6
 ; GFX6-NEXT:    v_max_i32_e32 v3, v7, v3
 ; GFX6-NEXT:    v_min_i32_e32 v3, v3, v6
 ; GFX6-NEXT:    v_min_i32_e32 v6, 0, v1
+; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, s5, v6
+; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 0x80000000, v6
 ; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s4, v3
 ; GFX6-NEXT:    v_max_i32_e32 v4, v6, v4
 ; GFX6-NEXT:    v_min_i32_e32 v3, v4, v3
 ; GFX6-NEXT:    v_min_i32_e32 v4, 0, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, 0, v2
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s5, v4
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s4, v3
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x80000000, v4
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x7fffffff, v3
 ; GFX6-NEXT:    v_max_i32_e32 v4, v4, v5
 ; GFX6-NEXT:    v_min_i32_e32 v3, v4, v3
 ; GFX6-NEXT:    v_add_i32_e32 v2, vcc, v2, v3
@@ -1411,26 +1404,25 @@ define <3 x i32> @v_saddsat_v3i32(<3 x i32> %lhs, <3 x i32> %rhs) {
 ; GFX8-LABEL: v_saddsat_v3i32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    s_brev_b32 s5, 1
 ; GFX8-NEXT:    v_min_i32_e32 v7, 0, v0
-; GFX8-NEXT:    s_brev_b32 s4, -2
 ; GFX8-NEXT:    v_max_i32_e32 v6, 0, v0
-; GFX8-NEXT:    v_sub_u32_e32 v7, vcc, s5, v7
-; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, s4, v6
+; GFX8-NEXT:    v_sub_u32_e32 v7, vcc, 0x80000000, v7
+; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, 0x7fffffff, v6
 ; GFX8-NEXT:    v_max_i32_e32 v3, v7, v3
 ; GFX8-NEXT:    v_min_i32_e32 v3, v3, v6
 ; GFX8-NEXT:    v_min_i32_e32 v6, 0, v1
+; GFX8-NEXT:    s_brev_b32 s4, -2
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v0, v3
 ; GFX8-NEXT:    v_max_i32_e32 v3, 0, v1
-; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, s5, v6
+; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, 0x80000000, v6
 ; GFX8-NEXT:    v_sub_u32_e32 v3, vcc, s4, v3
 ; GFX8-NEXT:    v_max_i32_e32 v4, v6, v4
 ; GFX8-NEXT:    v_min_i32_e32 v3, v4, v3
 ; GFX8-NEXT:    v_min_i32_e32 v4, 0, v2
 ; GFX8-NEXT:    v_add_u32_e32 v1, vcc, v1, v3
 ; GFX8-NEXT:    v_max_i32_e32 v3, 0, v2
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, s5, v4
-; GFX8-NEXT:    v_sub_u32_e32 v3, vcc, s4, v3
+; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0x80000000, v4
+; GFX8-NEXT:    v_sub_u32_e32 v3, vcc, 0x7fffffff, v3
 ; GFX8-NEXT:    v_max_i32_e32 v4, v4, v5
 ; GFX8-NEXT:    v_min_i32_e32 v3, v4, v3
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v3
@@ -1536,26 +1528,24 @@ define <4 x i32> @v_saddsat_v4i32(<4 x i32> %lhs, <4 x i32> %rhs) {
 ; GFX6-LABEL: v_saddsat_v4i32:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v9, 0, v0
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v8, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v9, vcc, s5, v9
-; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, s4, v8
+; GFX6-NEXT:    v_sub_i32_e32 v9, vcc, 0x80000000, v9
+; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, 0x7fffffff, v8
 ; GFX6-NEXT:    v_max_i32_e32 v4, v9, v4
 ; GFX6-NEXT:    v_min_i32_e32 v4, v4, v8
 ; GFX6-NEXT:    v_min_i32_e32 v8, 0, v1
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v4
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, s5, v8
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s4, v4
+; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, 0x80000000, v8
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX6-NEXT:    v_max_i32_e32 v5, v8, v5
 ; GFX6-NEXT:    v_min_i32_e32 v4, v5, v4
 ; GFX6-NEXT:    v_min_i32_e32 v5, 0, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v4
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v2
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s5, v5
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s4, v4
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX6-NEXT:    v_max_i32_e32 v5, v5, v6
 ; GFX6-NEXT:    v_min_i32_e32 v4, v5, v4
 ; GFX6-NEXT:    v_min_i32_e32 v5, 0, v3
@@ -1571,26 +1561,24 @@ define <4 x i32> @v_saddsat_v4i32(<4 x i32> %lhs, <4 x i32> %rhs) {
 ; GFX8-LABEL: v_saddsat_v4i32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    s_brev_b32 s5, 1
 ; GFX8-NEXT:    v_min_i32_e32 v9, 0, v0
-; GFX8-NEXT:    s_brev_b32 s4, -2
 ; GFX8-NEXT:    v_max_i32_e32 v8, 0, v0
-; GFX8-NEXT:    v_sub_u32_e32 v9, vcc, s5, v9
-; GFX8-NEXT:    v_sub_u32_e32 v8, vcc, s4, v8
+; GFX8-NEXT:    v_sub_u32_e32 v9, vcc, 0x80000000, v9
+; GFX8-NEXT:    v_sub_u32_e32 v8, vcc, 0x7fffffff, v8
 ; GFX8-NEXT:    v_max_i32_e32 v4, v9, v4
 ; GFX8-NEXT:    v_min_i32_e32 v4, v4, v8
 ; GFX8-NEXT:    v_min_i32_e32 v8, 0, v1
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v0, v4
 ; GFX8-NEXT:    v_max_i32_e32 v4, 0, v1
-; GFX8-NEXT:    v_sub_u32_e32 v8, vcc, s5, v8
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, s4, v4
+; GFX8-NEXT:    v_sub_u32_e32 v8, vcc, 0x80000000, v8
+; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX8-NEXT:    v_max_i32_e32 v5, v8, v5
 ; GFX8-NEXT:    v_min_i32_e32 v4, v5, v4
 ; GFX8-NEXT:    v_min_i32_e32 v5, 0, v2
 ; GFX8-NEXT:    v_add_u32_e32 v1, vcc, v1, v4
 ; GFX8-NEXT:    v_max_i32_e32 v4, 0, v2
-; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, s5, v5
-; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, s4, v4
+; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, 0x80000000, v5
+; GFX8-NEXT:    v_sub_u32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX8-NEXT:    v_max_i32_e32 v5, v5, v6
 ; GFX8-NEXT:    v_min_i32_e32 v4, v5, v4
 ; GFX8-NEXT:    v_min_i32_e32 v5, 0, v3
@@ -1724,34 +1712,32 @@ define <5 x i32> @v_saddsat_v5i32(<5 x i32> %lhs, <5 x i32> %rhs) {
 ; GFX6-LABEL: v_saddsat_v5i32:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX6-NEXT:    s_brev_b32 s5, 1
-; GFX6-NEXT:    v_min_i32_e32 v12, 0, v0
-; GFX6-NEXT:    s_brev_b32 s4, -2
+; GFX6-NEXT:    v_min_i32_e32 v11, 0, v0
 ; GFX6-NEXT:    v_max_i32_e32 v10, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v12, vcc, s5, v12
-; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, s4, v10
-; GFX6-NEXT:    v_max_i32_e32 v5, v12, v5
+; GFX6-NEXT:    v_sub_i32_e32 v11, vcc, 0x80000000, v11
+; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, 0x7fffffff, v10
+; GFX6-NEXT:    v_max_i32_e32 v5, v11, v5
 ; GFX6-NEXT:    v_min_i32_e32 v5, v5, v10
 ; GFX6-NEXT:    v_min_i32_e32 v10, 0, v1
+; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v5
 ; GFX6-NEXT:    v_max_i32_e32 v5, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, s5, v10
+; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, 0x80000000, v10
 ; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s4, v5
 ; GFX6-NEXT:    v_max_i32_e32 v6, v10, v6
 ; GFX6-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX6-NEXT:    v_min_i32_e32 v6, 0, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v5
 ; GFX6-NEXT:    v_max_i32_e32 v5, 0, v2
-; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, s5, v6
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s4, v5
+; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 0x80000000, v6
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x7fffffff, v5
 ; GFX6-NEXT:    v_max_i32_e32 v6, v6, v7
 ; GFX6-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX6-NEXT:    v_min_i32_e32 v6, 0, v3
-; GFX6-NEXT:    v_bfrev_b32_e32 v11, -2
 ; GFX6-NEXT:    v_add_i32_e32 v2, vcc, v2, v5
 ; GFX6-NEXT:    v_max_i32_e32 v5, 0, v3
 ; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 0x80000000, v6
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, v11, v5
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x7fffffff, v5
 ; GFX6-NEXT:    v_max_i32_e32 v6, v6, v8
 ; GFX6-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX6-NEXT:    v_min_i32_e32 v6, 0, v4
@@ -1767,34 +1753,32 @@ define <5 x i32> @v_saddsat_v5i32(<5 x i32> %lhs, <5 x i32> %rhs) {
 ; GFX8-LABEL: v_saddsat_v5i32:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    s_brev_b32 s5, 1
-; GFX8-NEXT:    v_min_i32_e32 v12, 0, v0
-; GFX8-NEXT:    s_brev_b32 s4, -2
+; GFX8-NEXT:    v_min_i32_e32 v11, 0, v0
 ; GFX8-NEXT:    v_max_i32_e32 v10, 0, v0
-; GFX8-NEXT:    v_sub_u32_e32 v12, vcc, s5, v12
-; GFX8-NEXT:    v_sub_u32_e32 v10, vcc, s4, v10
-; GFX8-NEXT:    v_max_i32_e32 v5, v12, v5
+; GFX8-NEXT:    v_sub_u32_e32 v11, vcc, 0x80000000, v11
+; GFX8-NEXT:    v_sub_u32_e32 v10, vcc, 0x7fffffff, v10
+; GFX8-NEXT:    v_max_i32_e32 v5, v11, v5
 ; GFX8-NEXT:    v_min_i32_e32 v5, v5, v10
 ; GFX8-NEXT:    v_min_i32_e32 v10, 0, v1
+; GFX8-NEXT:    s_brev_b32 s4, -2
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, v0, v5
 ; GFX8-NEXT:    v_max_i32_e32 v5, 0, v1
-; GFX8-NEXT:    v_sub_u32_e32 v10, vcc, s5, v10
+; GFX8-NEXT:    v_sub_u32_e32 v10, vcc, 0x80000000, v10
 ; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, s4, v5
 ; GFX8-NEXT:    v_max_i32_e32 v6, v10, v6
 ; GFX8-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX8-NEXT:    v_min_i32_e32 v6, 0, v2
 ; GFX8-NEXT:    v_add_u32_e32 v1, vcc, v1, v5
 ; GFX8-NEXT:    v_max_i32_e32 v5, 0, v2
-; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, s5, v6
-; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, s4, v5
+; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, 0x80000000, v6
+; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, 0x7fffffff, v5
 ; GFX8-NEXT:    v_max_i32_e32 v6, v6, v7
 ; GFX8-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX8-NEXT:    v_min_i32_e32 v6, 0, v3
-; GFX8-NEXT:    v_bfrev_b32_e32 v11, -2
 ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v5
 ; GFX8-NEXT:    v_max_i32_e32 v5, 0, v3
 ; GFX8-NEXT:    v_sub_u32_e32 v6, vcc, 0x80000000, v6
-; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, v11, v5
+; GFX8-NEXT:    v_sub_u32_e32 v5, vcc, 0x7fffffff, v5
 ; GFX8-NEXT:    v_max_i32_e32 v6, v6, v8
 ; GFX8-NEXT:    v_min_i32_e32 v5, v6, v5
 ; GFX8-NEXT:    v_min_i32_e32 v6, 0, v4
@@ -2766,13 +2750,11 @@ define <2 x i16> @v_saddsat_v2i16(<2 x i16> %lhs, <2 x i16> %rhs) {
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v5, 0, v0
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v4, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, s5, v5
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s4, v4
+; GFX6-NEXT:    v_sub_i32_e32 v5, vcc, 0x80000000, v5
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x7fffffff, v4
 ; GFX6-NEXT:    v_max_i32_e32 v2, v5, v2
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v4
@@ -2780,8 +2762,8 @@ define <2 x i16> @v_saddsat_v2i16(<2 x i16> %lhs, <2 x i16> %rhs) {
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 16, v3
 ; GFX6-NEXT:    v_max_i32_e32 v3, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s5, v4
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s4, v3
+; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, 0x80000000, v4
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x7fffffff, v3
 ; GFX6-NEXT:    v_max_i32_e32 v2, v4, v2
 ; GFX6-NEXT:    v_min_i32_e32 v2, v2, v3
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
@@ -2978,13 +2960,11 @@ define amdgpu_ps float @saddsat_v2i16_vs(<2 x i16> %lhs, <2 x i16> inreg %rhs) {
 ; GFX6-LABEL: saddsat_v2i16_vs:
 ; GFX6:       ; %bb.0:
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v0, 16, v0
-; GFX6-NEXT:    s_brev_b32 s3, 1
 ; GFX6-NEXT:    v_min_i32_e32 v3, 0, v0
 ; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
-; GFX6-NEXT:    s_brev_b32 s2, -2
 ; GFX6-NEXT:    v_max_i32_e32 v2, 0, v0
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s3, v3
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, s2, v2
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x80000000, v3
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0x7fffffff, v2
 ; GFX6-NEXT:    v_max_i32_e32 v3, s0, v3
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX6-NEXT:    v_min_i32_e32 v2, v3, v2
@@ -2992,8 +2972,8 @@ define amdgpu_ps float @saddsat_v2i16_vs(<2 x i16> %lhs, <2 x i16> inreg %rhs) {
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
 ; GFX6-NEXT:    s_lshl_b32 s0, s1, 16
 ; GFX6-NEXT:    v_max_i32_e32 v2, 0, v1
-; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, s3, v3
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, s2, v2
+; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 0x80000000, v3
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, 0x7fffffff, v2
 ; GFX6-NEXT:    v_max_i32_e32 v3, s0, v3
 ; GFX6-NEXT:    v_min_i32_e32 v2, v3, v2
 ; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v1, v2
@@ -3059,14 +3039,14 @@ define <2 x float> @v_saddsat_v4i16(<4 x i16> %lhs, <4 x i16> %rhs) {
 ; GFX6-NEXT:    s_brev_b32 s5, 1
 ; GFX6-NEXT:    v_min_i32_e32 v10, 0, v0
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
-; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_max_i32_e32 v8, 0, v0
 ; GFX6-NEXT:    v_sub_i32_e32 v10, vcc, s5, v10
-; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, s4, v8
+; GFX6-NEXT:    v_sub_i32_e32 v8, vcc, 0x7fffffff, v8
 ; GFX6-NEXT:    v_max_i32_e32 v4, v10, v4
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
 ; GFX6-NEXT:    v_min_i32_e32 v4, v4, v8
 ; GFX6-NEXT:    v_min_i32_e32 v8, 0, v1
+; GFX6-NEXT:    s_brev_b32 s4, -2
 ; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v4
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v4, 16, v5
 ; GFX6-NEXT:    v_max_i32_e32 v5, 0, v1
@@ -4847,8 +4827,7 @@ define <2 x i64> @v_saddsat_v2i64(<2 x i64> %lhs, <2 x i64> %rhs) {
 ; GFX6-NEXT:    v_cmp_lt_i64_e64 s[4:5], v[8:9], v[0:1]
 ; GFX6-NEXT:    v_cmp_gt_i64_e64 s[6:7], 0, v[4:5]
 ; GFX6-NEXT:    v_ashrrev_i32_e32 v0, 31, v9
-; GFX6-NEXT:    v_bfrev_b32_e32 v1, 1
-; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v0, v1
+; GFX6-NEXT:    v_add_i32_e32 v1, vcc, 0x80000000, v0
 ; GFX6-NEXT:    s_xor_b64 vcc, s[6:7], s[4:5]
 ; GFX6-NEXT:    v_cndmask_b32_e32 v0, v8, v0, vcc
 ; GFX6-NEXT:    v_cndmask_b32_e32 v1, v9, v1, vcc
@@ -4871,8 +4850,7 @@ define <2 x i64> @v_saddsat_v2i64(<2 x i64> %lhs, <2 x i64> %rhs) {
 ; GFX8-NEXT:    v_cmp_lt_i64_e64 s[4:5], v[8:9], v[0:1]
 ; GFX8-NEXT:    v_cmp_gt_i64_e64 s[6:7], 0, v[4:5]
 ; GFX8-NEXT:    v_ashrrev_i32_e32 v0, 31, v9
-; GFX8-NEXT:    v_bfrev_b32_e32 v1, 1
-; GFX8-NEXT:    v_add_u32_e32 v1, vcc, v0, v1
+; GFX8-NEXT:    v_add_u32_e32 v1, vcc, 0x80000000, v0
 ; GFX8-NEXT:    s_xor_b64 vcc, s[6:7], s[4:5]
 ; GFX8-NEXT:    v_cndm...
[truncated]

@@ -312,13 +312,13 @@ define <2 x i32> @v_srem_v2i32_pow2k_denom(<2 x i32> %num) {
 ; GISEL-NEXT:    v_sub_i32_e32 v0, vcc, v0, v4
 ; GISEL-NEXT:    v_sub_i32_e32 v1, vcc, v1, v3
 ; GISEL-NEXT:    v_subrev_i32_e32 v3, vcc, s4, v0
-; GISEL-NEXT:    v_subrev_i32_e32 v4, vcc, s4, v1
+; GISEL-NEXT:    v_subrev_i32_e32 v4, vcc, 0x1000, v1
Contributor

Why is the constant only folded into the second v_subrev, not both?

Contributor

I think the root cause is that some shrinking only happens post-RA (e.g. shrinking V_SUBREV_CO_U32_e64 only if VCC was allocated for the carry-out operand), which is too late to do any more folding.

And the reason it only happens for some SUBREV instructions is even more convoluted. It's because SIFoldOperands will sometimes shrink V_SUB_CO_U32_e64 to V_SUBREV_CO_U32_e32 even if it does not manage to fold anything into it. This does seem wrong and is probably worth a closer look.
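
For illustration, here is a hedged sketch of that effect; the virtual register numbers, the 4096 constant and the exact sequence are invented for this example and are not actual pass output. Starting from

  %1:sreg_32 = S_MOV_B32 4096
  %3:vgpr_32, dead %4:sreg_64 = V_SUB_CO_U32_e64 %2:vgpr_32, %1:sreg_32, 0, implicit $exec

a single SIFoldOperands run can leave behind the shrunk but unfolded VOP2 form

  %3:vgpr_32 = V_SUBREV_CO_U32_e32 %1:sreg_32, %2:vgpr_32, implicit-def $vcc, implicit $exec

and only a second run of the pass folds the constant into it:

  %3:vgpr_32 = V_SUBREV_CO_U32_e32 4096, %2:vgpr_32, implicit-def $vcc, implicit $exec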

Collaborator Author

This seems to be a deficiency in the shrink pass itself. There were 4 subtractions using this register before the shrink pass; two of them were already V_SUBREV_CO_U32_e32, and two were V_SUB_CO_U32_e64:

  %47:sreg_32 = S_MOV_B32 4096
  %62:vgpr_32, dead %95:sreg_64 = V_SUB_CO_U32_e64 %59:vgpr_32, %47:sreg_32, 0, implicit $exec
  %65:vgpr_32, dead %94:sreg_64 = V_SUB_CO_U32_e64 %63:vgpr_32, %47:sreg_32, 0, implicit $exec
  %37:vgpr_32 = V_SUBREV_CO_U32_e32 %47:sreg_32, %34:vgpr_32, implicit-def $vcc, implicit $exec
  %40:vgpr_32 = V_SUBREV_CO_U32_e32 %47:sreg_32, %38:vgpr_32, implicit-def $vcc, implicit $exec

The shrink pass did not touch the last two, which are already VOP2, but commuted the first two:

  %62:vgpr_32, dead %95:sreg_64 = V_SUBREV_CO_U32_e64 %47:sreg_32, %59:vgpr_32, 0, implicit $exec
  %65:vgpr_32, dead %94:sreg_64 = V_SUBREV_CO_U32_e64 %47:sreg_32, %63:vgpr_32, 0, implicit $exec

However, after the commute it did not continue with shrinking the commuted instructions. They were only shrunk much later, when the shrink pass ran again after RA.

I believe this is because of this code in the shrink pass:

      if (!TII->canShrink(MI, *MRI)) {
        // Try commuting the instruction and see if that enables us to shrink
        // it.
        if (!MI.isCommutable() || !TII->commuteInstruction(MI) ||
            !TII->canShrink(MI, *MRI)) {
          tryReplaceDeadSDST(MI);
          continue;
        }
      }

The continue here moves on to the next instruction and does not try to shrink the newly commuted one.

Collaborator Author

OK, the shrink pass has commuted it, but cannot shrink it because of the sdst. It has marked the sdst with an allocation hint for vcc and bailed until after RA. It is possible to shrink it anyway, though, because the sdst is dead. Like this:

diff --git a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
index 4159dc694c1e..b20bcc987a7c 100644
--- a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
+++ b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
@@ -960,7 +960,7 @@ bool SIShrinkInstructions::runOnMachineFunction(MachineFunction &MF) {
       const MachineOperand *SDst = TII->getNamedOperand(MI,
                                                         AMDGPU::OpName::sdst);

-      if (SDst) {
+      if (SDst && !SDst->isDead()) {
         bool Next = false;

         if (SDst->getReg() != VCCReg) {

That patch has folded all 4 occurrences.

@jayfoad
Contributor

jayfoad commented Oct 2, 2023

If this goes in, could we try removing SIShrinkInstructions::foldImmediates as a follow-up? It ought to be redundant.

@rampitec
Collaborator Author

rampitec commented Oct 2, 2023

> If this goes in, could we try removing SIShrinkInstructions::foldImmediates as a follow-up? It ought to be redundant.

Sure, I will see if it can be removed now.

@rampitec
Collaborator Author

rampitec commented Oct 2, 2023

> SIShrinkInstructions::foldImmediates

Yes, it is absolutely removable now: #68045

@rampitec
Collaborator Author

rampitec commented Oct 2, 2023

> If this goes in, could we try removing SIShrinkInstructions::foldImmediates as a follow-up? It ought to be redundant.

Then I am also not sure whether the very limited shrinking capabilities of the folding pass are still needed with this. I will check that too later.

rampitec and others added 3 commits October 3, 2023 02:02
Contributor

@Sisyph Sisyph left a comment

LGTM, especially since it enables #68045

@jayfoad
Contributor

jayfoad commented Oct 3, 2023

I've just tested this on 10000 graphics shaders and it seems to make no difference at all. I tried gfx900 and gfx1100. Can anyone else from the graphics team confirm this?

@Sisyph
Contributor

Sisyph commented Oct 3, 2023

> I've just tested this on 10000 graphics shaders and it seems to make no difference at all. I tried gfx900 and gfx1100. Can anyone else from the graphics team confirm this?

I can confirm no difference on gfx1102

@rampitec
Collaborator Author

rampitec commented Oct 3, 2023

> I've just tested this on 10000 graphics shaders and it seems to make no difference at all. I tried gfx900 and gfx1100. Can anyone else from the graphics team confirm this?

> I can confirm no difference on gfx1102

gfx11 is the same as gfx10: it just bails because of the VOP3 literal support. This is strange for gfx9, though. Do these shaders use -O2 or -O3?

@rampitec
Collaborator Author

rampitec commented Oct 3, 2023

> I've just tested this on 10000 graphics shaders and it seems to make no difference at all. I tried gfx900 and gfx1100. Can anyone else from the graphics team confirm this?

It seems most of the impact is on pre-gfx9 targets, very similar to #68028 and for the same reason: those targets lack no-carry add/sub instructions. The rest of the impact is when an add/sub is created late in the pipeline.

@rampitec
Collaborator Author

rampitec commented Oct 3, 2023

I have measured the compile-time impact by timing check-llvm-codegen-amdgpu on a release build:

before the patch:           11.06s
add folding:                11.09s +0.2%
remove folding from shrink: 11.02s -0.4%

In general the impact is smaller than the run-to-run variance; the numbers are the median time of 5 runs.

@jayfoad
Contributor

jayfoad commented Oct 5, 2023

> I've just tested this on 10000 graphics shaders and it seems to make no difference at all. I tried gfx900 and gfx1100. Can anyone else from the graphics team confirm this?

> I can confirm no difference on gfx1102

> gfx11 is the same as gfx10: it just bails because of the VOP3 literal support. This is strange for gfx9, though. Do these shaders use -O2 or -O3?

I just use the default, which is supposed to be -O2.

@jayfoad
Contributor

jayfoad commented Oct 5, 2023

I've taken another look at this. The patch does not show any benefit from running another SIFoldOperands pass after SIShrinkInstructions per se; you get exactly the same results (modulo a couple of add instructions that have their operands commuted differently) if you put the second SIFoldOperands run before SIShrinkInstructions instead.

In other words SIFoldOperands is not idempotent, and the reason for that seems to be:

> And the reason it only happens for some SUBREV instructions is even more convoluted. It's because SIFoldOperands will sometimes shrink V_SUB_CO_U32_e64 to V_SUBREV_CO_U32_e32 even if it does not manage to fold anything into it. This does seem wrong and is probably worth a closer look.

This goes back to https://reviews.llvm.org/D51345. Notice how the code that was added to updateOperand does the shrinking but does not actually do any folding; it returns before we get to Old.ChangeToImmediate/Old.substVirtReg. A second run of SIFoldOperands will see the shrunk instruction and fold into it.

@rampitec
Collaborator Author

rampitec commented Oct 5, 2023

> I've taken another look at this. The patch does not show any benefit from running another SIFoldOperands pass after SIShrinkInstructions per se; you get exactly the same results (modulo a couple of add instructions that have their operands commuted differently) if you put the second SIFoldOperands run before SIShrinkInstructions instead.

> In other words SIFoldOperands is not idempotent, and the reason for that seems to be:

> And the reason it only happens for some SUBREV instructions is even more convoluted. It's because SIFoldOperands will sometimes shrink V_SUB_CO_U32_e64 to V_SUBREV_CO_U32_e32 even if it does not manage to fold anything into it. This does seem wrong and is probably worth a closer look.

> This goes back to https://reviews.llvm.org/D51345. Notice how the code that was added to updateOperand does the shrinking but does not actually do any folding; it returns before we get to Old.ChangeToImmediate/Old.substVirtReg. A second run of SIFoldOperands will see the shrunk instruction and fold into it.

Yes, this is mostly about old targets without no-carry add/sub, and the impact is on these 2 instructions, which need to be shrunk before folding. The fold operands pass's shrinking capabilities are really limited compared to the shrink pass, though.

5 participants