[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582
Conversation
@llvm/pr-subscribers-backend-x86 Author: Quentin Dian (DianQK) Changes: Fixes #78578. We should add a count check on the predecessors to avoid the code-size explosion. I found a strange argument during my investigation. llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp Lines 76 to 77 in 4b2381a
We didn't use it. Also, it may be that an issue with AsmPrinter is causing this use case to print two line breaks, which makes the test case fail. I haven't checked, but I don't think it affects the review. Patch is 29.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78582.diff 3 Files Affected:
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121ed..e76d63d3c0d66f 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -76,6 +76,11 @@ static cl::opt<bool>
static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
cl::Hidden);
+static cl::opt<unsigned> TailDupPredSizeLimit(
+ "tail-dup-pred-size-limit",
+ cl::desc("Maximum predecessors to consider tail duplicating."), cl::init(8),
+ cl::Hidden);
+
void TailDuplicator::initMF(MachineFunction &MFin, bool PreRegAlloc,
const MachineBranchProbabilityInfo *MBPIin,
MBFIWrapper *MBFIin,
@@ -565,6 +570,8 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
if (TailBB.isSuccessor(&TailBB))
return false;
+ if (TailDupPredSizeLimit < TailBB.pred_size())
+ return false;
// Set the limit on the cost to duplicate. When optimizing for size,
// duplicate only one, because one branch instruction can be eliminated to
// compensate for the duplication.
diff --git a/llvm/test/CodeGen/X86/mul-constant-result.ll b/llvm/test/CodeGen/X86/mul-constant-result.ll
index 1f9e7a93ad0b90..73c764a3f53da1 100644
--- a/llvm/test/CodeGen/X86/mul-constant-result.ll
+++ b/llvm/test/CodeGen/X86/mul-constant-result.ll
@@ -28,162 +28,132 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
; X86-NEXT: .LBB0_4:
; X86-NEXT: decl %ecx
; X86-NEXT: cmpl $31, %ecx
-; X86-NEXT: ja .LBB0_35
+; X86-NEXT: ja .LBB0_31
; X86-NEXT: # %bb.5:
; X86-NEXT: jmpl *.LJTI0_0(,%ecx,4)
; X86-NEXT: .LBB0_6:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
+; X86-NEXT: jmp .LBB0_40
; X86-NEXT: .LBB0_7:
-; X86-NEXT: .cfi_def_cfa_offset 8
; X86-NEXT: leal (%eax,%eax,8), %ecx
; X86-NEXT: leal (%ecx,%ecx,2), %ecx
-; X86-NEXT: jmp .LBB0_9
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
; X86-NEXT: .LBB0_8:
; X86-NEXT: movl %eax, %ecx
; X86-NEXT: shll $4, %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_10:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_9:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_11:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_10:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_13:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_11:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_12:
; X86-NEXT: leal (%eax,%eax,2), %ecx
-; X86-NEXT: jmp .LBB0_14
-; X86-NEXT: .LBB0_15:
+; X86-NEXT: leal (%eax,%ecx,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_13:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: jmp .LBB0_12
-; X86-NEXT: .LBB0_16:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_14:
; X86-NEXT: leal (%eax,%eax,4), %ecx
; X86-NEXT: leal (%ecx,%ecx,4), %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_17:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_15:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: jmp .LBB0_12
-; X86-NEXT: .LBB0_19:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_17:
; X86-NEXT: shll $4, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_20:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_18:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_21:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_19:
; X86-NEXT: shll $3, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_22:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_20:
; X86-NEXT: shll $5, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_23:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_21:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: .LBB0_33:
; X86-NEXT: leal (%eax,%eax,8), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_24:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_22:
; X86-NEXT: leal (%eax,%eax,4), %ecx
-; X86-NEXT: .LBB0_14:
; X86-NEXT: leal (%eax,%ecx,4), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_25:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_23:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_26:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_24:
; X86-NEXT: leal (%eax,%eax,4), %ecx
; X86-NEXT: leal (%eax,%ecx,4), %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_27:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_25:
; X86-NEXT: leal (%eax,%eax), %ecx
; X86-NEXT: shll $4, %eax
-; X86-NEXT: jmp .LBB0_28
-; X86-NEXT: .LBB0_29:
+; X86-NEXT: subl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_26:
; X86-NEXT: leal (,%eax,8), %ecx
-; X86-NEXT: jmp .LBB0_38
-; X86-NEXT: .LBB0_30:
+; X86-NEXT: jmp .LBB0_33
+; X86-NEXT: .LBB0_27:
; X86-NEXT: leal (%eax,%eax,8), %ecx
-; X86-NEXT: jmp .LBB0_32
-; X86-NEXT: .LBB0_31:
+; X86-NEXT: leal (%eax,%ecx,2), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_28:
; X86-NEXT: leal (%eax,%eax,4), %ecx
-; X86-NEXT: .LBB0_32:
; X86-NEXT: leal (%eax,%ecx,2), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_34:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_29:
+; X86-NEXT: leal (%eax,%eax,8), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_30:
; X86-NEXT: movl %eax, %ecx
; X86-NEXT: shll $5, %ecx
-; X86-NEXT: jmp .LBB0_38
-; X86-NEXT: .LBB0_35:
+; X86-NEXT: jmp .LBB0_33
+; X86-NEXT: .LBB0_31:
; X86-NEXT: xorl %eax, %eax
-; X86-NEXT: .LBB0_36:
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_37:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_32:
; X86-NEXT: leal (%eax,%eax,2), %ecx
; X86-NEXT: shll $3, %ecx
-; X86-NEXT: .LBB0_38:
+; X86-NEXT: .LBB0_33:
; X86-NEXT: subl %eax, %ecx
; X86-NEXT: movl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_39:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_34:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: .LBB0_12:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_40:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_35:
; X86-NEXT: shll $3, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_41:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_36:
; X86-NEXT: leal (%eax,%eax,8), %ecx
; X86-NEXT: leal (%ecx,%ecx,2), %ecx
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: .LBB0_9:
; X86-NEXT: addl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_42:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_37:
; X86-NEXT: leal (%eax,%eax), %ecx
; X86-NEXT: shll $5, %eax
-; X86-NEXT: .LBB0_28:
; X86-NEXT: subl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_43:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_38:
; X86-NEXT: leal (%eax,%eax,8), %eax
-; X86-NEXT: .LBB0_18:
+; X86-NEXT: .LBB0_39:
; X86-NEXT: leal (%eax,%eax,2), %eax
+; X86-NEXT: .LBB0_40:
; X86-NEXT: popl %esi
; X86-NEXT: .cfi_def_cfa_offset 4
; X86-NEXT: retl
@@ -199,154 +169,131 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
; X64-HSW-NEXT: cmovel %ecx, %eax
; X64-HSW-NEXT: decl %edi
; X64-HSW-NEXT: cmpl $31, %edi
-; X64-HSW-NEXT: ja .LBB0_31
+; X64-HSW-NEXT: ja .LBB0_28
; X64-HSW-NEXT: # %bb.1:
; X64-HSW-NEXT: jmpq *.LJTI0_0(,%rdi,8)
; X64-HSW-NEXT: .LBB0_2:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_37
; X64-HSW-NEXT: .LBB0_3:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,2), %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
+; X64-HSW-NEXT: jmp .LBB0_21
; X64-HSW-NEXT: .LBB0_4:
; X64-HSW-NEXT: movl %eax, %ecx
; X64-HSW-NEXT: shll $4, %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
+; X64-HSW-NEXT: jmp .LBB0_21
; X64-HSW-NEXT: .LBB0_5:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: .LBB0_13:
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_36
; X64-HSW-NEXT: .LBB0_6:
; X64-HSW-NEXT: shll $2, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_7:
+; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
+; X64-HSW-NEXT: jmp .LBB0_37
; X64-HSW-NEXT: .LBB0_8:
; X64-HSW-NEXT: leal (%rax,%rax,2), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_10:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_9:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: .LBB0_7:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_11:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_10:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,4), %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
-; X64-HSW-NEXT: .LBB0_12:
+; X64-HSW-NEXT: jmp .LBB0_21
+; X64-HSW-NEXT: .LBB0_11:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_14:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_13:
; X64-HSW-NEXT: shll $4, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_15:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_14:
; X64-HSW-NEXT: shll $2, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_16:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_15:
; X64-HSW-NEXT: shll $3, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_17:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_16:
; X64-HSW-NEXT: shll $5, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_18:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_17:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: .LBB0_29:
; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_19:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_18:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_20:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_19:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_21:
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_20:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %ecx
-; X64-HSW-NEXT: .LBB0_22:
+; X64-HSW-NEXT: .LBB0_21:
; X64-HSW-NEXT: addl %eax, %ecx
; X64-HSW-NEXT: movl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_23:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_22:
; X64-HSW-NEXT: leal (%rax,%rax), %ecx
; X64-HSW-NEXT: shll $4, %eax
; X64-HSW-NEXT: subl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_25:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_23:
; X64-HSW-NEXT: leal (,%rax,8), %ecx
-; X64-HSW-NEXT: jmp .LBB0_34
-; X64-HSW-NEXT: .LBB0_26:
+; X64-HSW-NEXT: jmp .LBB0_30
+; X64-HSW-NEXT: .LBB0_24:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_27:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_25:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_30:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_26:
+; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_27:
; X64-HSW-NEXT: movl %eax, %ecx
; X64-HSW-NEXT: shll $5, %ecx
-; X64-HSW-NEXT: jmp .LBB0_34
-; X64-HSW-NEXT: .LBB0_31:
+; X64-HSW-NEXT: jmp .LBB0_30
+; X64-HSW-NEXT: .LBB0_28:
; X64-HSW-NEXT: xorl %eax, %eax
-; X64-HSW-NEXT: .LBB0_32:
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_33:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_29:
; X64-HSW-NEXT: leal (%rax,%rax,2), %ecx
; X64-HSW-NEXT: shll $3, %ecx
-; X64-HSW-NEXT: .LBB0_34:
+; X64-HSW-NEXT: .LBB0_30:
; X64-HSW-NEXT: subl %eax, %ecx
; X64-HSW-NEXT: movl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_36:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_31:
; X64-HSW-NEXT: shll $2, %eax
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_37:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_32:
; X64-HSW-NEXT: shll $3, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_38:
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_33:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,2), %ecx
; X64-HSW-NEXT: addl %eax, %eax
; X64-HSW-NEXT: addl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_39:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_34:
; X64-HSW-NEXT: leal (%rax,%rax), %ecx
; X64-HSW-NEXT: shll $5, %eax
; X64-HSW-NEXT: subl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_40:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_35:
; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT: .LBB0_36:
; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
+; X64-HSW-NEXT: .LBB0_37:
; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
; X64-HSW-NEXT: retq
%3 = icmp eq i32 %1, 0
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
new file mode 100644
index 00000000000000..47b9fcaa7d6c85
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=3 < %s | FileCheck %s -check-prefix=LIMIT
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=4 < %s | FileCheck %s -check-prefix=NOLIMIT
+
+define i32 @foo(ptr %0, i32 %1) {
+ ; LIMIT-LABEL: name: foo
+ ; LIMIT: bb.0 (%ir-block.2):
+ ; LIMIT-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; LIMIT-NEXT: liveins: $rdi, $esi
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[COPY:%[0-9]+]]:gr32 = COPY $esi
+ ; LIMIT-NEXT: [[COPY1:%[0-9]+]]:gr64 = COPY $rdi
+ ; LIMIT-NEXT: [[SHR32ri:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri:%[0-9]+]]:gr32 = AND32ri [[SHR32ri]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.1 (%ir-block.5):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.2 (%ir-block.7):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.3 (%ir-block.10):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.4 (%ir-block.13):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.5.default.unreachable2:
+ ; LIMIT-NEXT: successors:
+ ; LIMIT: bb.6 (%ir-block.16):
+ ; LIMIT-NEXT: successors: %bb.7(0x20000000), %bb.8(0x20000000), %bb.9(0x20000000), %bb.10(0x20000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[PHI:%[0-9]+]]:gr32 = PHI [[SHR32ri3]], %bb.4, [[SHR32ri2]], %bb.3, [[SHR32ri1]], %bb.2, [[MOV32rm]], %bb.1
+ ; LIMIT-NEXT: [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg :: (load (s64) from jump-table)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.7 (%ir-block.20):
+ ; LIMIT-NEXT: successors: %bb.11(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: JMP_1 %bb.11
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.8 (%ir-block.22):
+ ; LIMIT-NEXT: successors: %bb.11(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32r...
[truncated]
|
I can verify that the initial OOM case is back to normal. Also, when the default branch is removed, the generated instructions show some improvements. With the default branch:
Without the default branch:
Text diff?
diff --git a/output.s b/output.s
index 322d0d0..6ca97d0 100644
--- a/output.s
+++ b/output.s
@@ -1,5 +1,5 @@
.text
- .file "oom_manual.c"
+ .file "oom_manual2.c"
.globl f1 # -- Begin function f1
.p2align 4, 0x90
.type f1,@function
@@ -33,12805 +33,12788 @@ f1: # @f1
movl %eax, %ecx
shrl %ecx
andl $127, %ecx
- cmpl $126, %ecx
- ja .LBB0_15
-# %bb.1:
	jmpq	*.LJTI0_0(,%rcx,8)
I can see many |
@aeubanks mind taking a look at this small patch? Is this limit reasonable/consistent with other similar limits, etc? Do we need more data to back up why this particular bound was chosen? |
the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR |
do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened? was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated? or is there an underlying algorithmic problem that we can fix? |
Yes, but wouldn't that make this test case harder to maintain? I'm not sure, but I can change it.
It looks like the default branch of a switch creates an if statement (a compare instruction), which results in two successors.
I think so.
I'm not sure, but I can try if I need to. But I don't know much about MIR and performance improvements. |
What do you mean "poorly maintained"? Given that the option works on number of MIR instructions, we should keep the input MIR instruction count consistent.
if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment |
The cause is that early-tailduplication transforms normal |
If you want to limit the amount of tail duplication, I would like to see partial tail duplication based on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock. |
Right. Disabling tail duplication for blocks with more than 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the number of duplications. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP). |
But the given example does not contain profile information, so I don't think this is a solution to the problem at hand. |
I think the argument is: The currently proposed solution may harm performance in some cases - and that that loss can be mitigated at least in the presence of profile information. (& so with that mitigation, maybe the overall cost is low enough to be worth shipping) - also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then. |
We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified. |
I'm not sure if I want to use PGO, since most applications won't be built with PGO. But I think the limiting conditions here can be adjusted. Perhaps we should consider the number of instructions duplicated. |
I've written down the compilation time in the issue. I simply tried the runtime benchmark. The
int src(void) {
return -1;
}
extern int f1(unsigned int *b);
int main(int argc, char **argv) {
int r = argc;
unsigned int b[] = { -1, -2, -3 };
for (int i = 0; i < 1000000; i++) {
r += f1(b);
}
return r;
}
clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual2.c main.c -o oom_manual2
ls -lh oom_manual oom_manual2 output:
hyperfine -i -N --runs 200 --warmup 50 ./oom_manual ./oom_manual2 output:
function run_perf() {
echo "perf stat $1"
perf stat -x \; \
-e instructions \
-e instructions:u \
-e cycles \
-e task-clock \
-e branches \
-e branch-misses \
$1
}
run_perf ./oom_manual
run_perf ./oom_manual2 output:
I am trying to change the code to see the results of different scenarios. |
I made some progress. There are usually only two instructions that can be duplicated. Indirect branches increase this limit to 20. llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp Lines 591 to 602 in 4b2381a
I understand from the comments that this is meant to improve the accuracy of branch prediction. I want to know whether it is still appropriate with numerous indirect branches. So I did some experimenting at https://github.com/DianQK/llvm-tail-dup-indirect-succ-size. I'm using the result of
One of my results is as follows:
If this is the right route, I'll continue to figure out the other two problems.
|
If I understand what you are describing correctly, this duplication is an improvement regardless of the number of predecessors and successors. But I do want to add some limits because of the compile-time and code-size overhead. As it stands, I think past transformation scenarios that reach 128 predecessors are almost non-existent; otherwise we would have caught the problem earlier. Currently, it's because we removed the default branch, exposing this transformation opportunity. Can we at least add a condition to limit this transformation? I think it makes sense to limit the number of predecessors when we encounter indirect branches. |
There's sort of two separate questions here. One is whether we're emitting too much code here, i.e. the codesize tradeoff is worthwhile. This might depend on what exactly the user wrote... we might want to be more aggressive if the user explicitly uses indirect gotos. This explains the codesize growth you're seeing, but not really the compile-time; it's a constant number of instructions per switch case. The other is whether the representation we're using is causing non-linear compile-time. The codesize growth should be linear. But if we're representing the edges explicitly, the size of the IR in memory might grow non-linearly, causing compile-time issues. Even if we're tail-duplicating, we don't necessarily need to explicitly represent all the possible source->destination edges separately. It's just the most convenient thing to do given the way codegen works. But we could do something different: instead of making every indirect jump have an edge to every indirect jump destination, we could make all the jumps jump to a synthetic basic block, and then have edges from that basic block to all the destinations. Same semantics, without the compile-time. (Not suggesting you try to fix this here, just noting it's possible.) Some threshold probably makes sense, but we should make clear whether we're primarily trying to target the codesize, or the compile-time, so we have a good starting point the next time this is revisited. |
I just remembered a performance issue I found here. In the example of successive switch statements, we add a number of PHI instructions, which seem to end up as stack-operation instructions. I'm not sure whether it can be optimized in any other way.
As I mentioned above, the code size problem should be caused by instruction copying and the addition of extra PHI instructions.
Actually, the code size of #79993 isn't terribly bad, but the compile-time increase is horrible.
I may not be able to fix the issue, but I will follow up to see whether I actually can. (Although it probably won't work out.)
I'll continue to work on this issue. Even if I submit a workaround, I'll try to explain the various issues I've found. :) |
The PHI node issue is interesting... I would have thought that if you have a bunch of identical PHIs, the register allocator would do the right thing, but maybe not.
Okay, sounds good. |
@efriedma-quic I've put most of the analysis into this comment. In a nutshell, I speculate that duplicating critical BBs makes the CFG exceptionally complex, especially within loops, which may be the primary reason for the increased time spent in other passes. A critical BB is one with multiple predecessors and multiple successors. |
Ping. If the new changes are suitable, I hope to make the final 18.1.0 release. |
Ping. |
Ping. I assume this is an uncommon scenario, since there has never been feedback about similar compile-time issues before. |
LGTM. The new check looks sufficiently narrow (requires many predecessors and many successors) to just do this.
We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected? |
@DianQK ^^ It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here. |
I apologize for my late reply. (*^^*) |
I'll verify at the end of the week that increasing to 128 has no noticeable impact on compilation time. |
Could you try with |
Sure, I'll run the comparison before and after this commit with |
Benchmark compiled with clang after this commit with |
Actually, just |
Thanks, I will continue to investigate this. |
Hmm, I tried LLVM 18 and the main (c8864bc) branch on Raspberry Pi 4 (arm64), but I didn't find any performance issues:
|
Fixes #78578.
Duplicating a BB which has both multiple predecessors and multiple successors results in a complex CFG and may also cause a huge number of PHI nodes. See #78578 (comment) for a detailed description of the limit.