
[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

Merged
merged 4 commits into llvm:main from DianQK:tail-dup-pred-size-limit on Apr 17, 2024

Conversation

DianQK
Member

DianQK commented Jan 18, 2024

Fixes #78578.

Duplicating a BB that has both multiple predecessors and multiple successors results in a complex CFG and may also create a huge number of PHI nodes. See #78578 (comment) for a detailed description of the limit.
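For intuition, here is a minimal C-style sketch of the problematic shape (a hypothetical function f, loosely modeled on the reduced case in the issue; not code from the patch). With the default branches unreachable, the dispatch block of the second switch has one predecessor per case of the first switch and one successor per case of the second, so duplicating it into every predecessor multiplies CFG edges and PHI operands:

// Hypothetical reduction: the block computing the second switch's index has
// ~N predecessors (one per case of the first switch) and ~N successors (the
// second jump table). Tail duplicating it copies the indirect dispatch into
// every predecessor.
unsigned f(const unsigned *b, unsigned x) {
  unsigned out;
  switch ((x >> 1) & 7) {
  case 0: out = b[0]; break;
  case 1: out = b[0] >> 1; break;
  // ... more cases ...
  default: __builtin_unreachable();
  }
  switch ((x >> 2) & 7) { // many predecessors, many successors
  case 0: out += b[1]; break;
  case 1: out += b[1] >> 1; break;
  // ... more cases ...
  default: __builtin_unreachable();
  }
  return out;
}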

@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-backend-x86

Author: Quentin Dian (DianQK)

Changes

Fixes #78578.

We should add a check on the predecessor count to avoid code size explosion.

I found a strange option during my investigation.

static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                      cl::Hidden);

We never use -tail-dup-limit, yet it is initialized to a meaningless value.

Also, an AsmPrinter issue may be causing this test case to print two line breaks, which makes the test fail. I haven't checked, but I don't think it affects the review.


Patch is 29.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78582.diff

3 Files Affected:

  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+7)
  • (modified) llvm/test/CodeGen/X86/mul-constant-result.ll (+135-188)
  • (added) llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll (+242)
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121ed..e76d63d3c0d66f 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -76,6 +76,11 @@ static cl::opt<bool>
 static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                       cl::Hidden);
 
+static cl::opt<unsigned> TailDupPredSizeLimit(
+    "tail-dup-pred-size-limit",
+    cl::desc("Maximum predecessors to consider tail duplicating."), cl::init(8),
+    cl::Hidden);
+
 void TailDuplicator::initMF(MachineFunction &MFin, bool PreRegAlloc,
                             const MachineBranchProbabilityInfo *MBPIin,
                             MBFIWrapper *MBFIin,
@@ -565,6 +570,8 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   if (TailBB.isSuccessor(&TailBB))
     return false;
 
+  if (TailDupPredSizeLimit < TailBB.pred_size())
+    return false;
   // Set the limit on the cost to duplicate. When optimizing for size,
   // duplicate only one, because one branch instruction can be eliminated to
   // compensate for the duplication.
diff --git a/llvm/test/CodeGen/X86/mul-constant-result.ll b/llvm/test/CodeGen/X86/mul-constant-result.ll
index 1f9e7a93ad0b90..73c764a3f53da1 100644
--- a/llvm/test/CodeGen/X86/mul-constant-result.ll
+++ b/llvm/test/CodeGen/X86/mul-constant-result.ll
@@ -28,162 +28,132 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X86-NEXT:  .LBB0_4:
 ; X86-NEXT:    decl %ecx
 ; X86-NEXT:    cmpl $31, %ecx
-; X86-NEXT:    ja .LBB0_35
+; X86-NEXT:    ja .LBB0_31
 ; X86-NEXT:  # %bb.5:
 ; X86-NEXT:    jmpl *.LJTI0_0(,%ecx,4)
 ; X86-NEXT:  .LBB0_6:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_7:
-; X86-NEXT:    .cfi_def_cfa_offset 8
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
-; X86-NEXT:    jmp .LBB0_9
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_8:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $4, %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_10:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_10:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_13:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
-; X86-NEXT:    jmp .LBB0_14
-; X86-NEXT:  .LBB0_15:
+; X86-NEXT:    leal (%eax,%ecx,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_13:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_16:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_17:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_15:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_19:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_17:
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_20:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_18:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_21:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_19:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_22:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_20:
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_23:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_21:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_24:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_22:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%ecx,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_25:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_23:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_26:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_24:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%eax,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_27:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_25:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    jmp .LBB0_28
-; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    subl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_26:
 ; X86-NEXT:    leal (,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_30:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_27:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_32
-; X86-NEXT:  .LBB0_31:
+; X86-NEXT:    leal (%eax,%ecx,2), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%ecx,2), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_34:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    leal (%eax,%eax,8), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_30:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $5, %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_35:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_31:
 ; X86-NEXT:    xorl %eax, %eax
-; X86-NEXT:  .LBB0_36:
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_37:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
 ; X86-NEXT:    shll $3, %ecx
-; X86-NEXT:  .LBB0_38:
+; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    subl %eax, %ecx
 ; X86-NEXT:    movl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_39:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_34:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_40:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_35:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_41:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_36:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    addl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_42:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_37:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    subl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_43:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_38:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:  .LBB0_18:
+; X86-NEXT:  .LBB0_39:
 ; X86-NEXT:    leal (%eax,%eax,2), %eax
+; X86-NEXT:  .LBB0_40:
 ; X86-NEXT:    popl %esi
 ; X86-NEXT:    .cfi_def_cfa_offset 4
 ; X86-NEXT:    retl
@@ -199,154 +169,131 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X64-HSW-NEXT:    cmovel %ecx, %eax
 ; X64-HSW-NEXT:    decl %edi
 ; X64-HSW-NEXT:    cmpl $31, %edi
-; X64-HSW-NEXT:    ja .LBB0_31
+; X64-HSW-NEXT:    ja .LBB0_28
 ; X64-HSW-NEXT:  # %bb.1:
 ; X64-HSW-NEXT:    jmpq *.LJTI0_0(,%rdi,8)
 ; X64-HSW-NEXT:  .LBB0_2:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_3:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_4:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $4, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_5:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:  .LBB0_13:
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
 ; X64-HSW-NEXT:  .LBB0_6:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_7:
+; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_8:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_10:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_9:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_7:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_11:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_10:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,4), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
-; X64-HSW-NEXT:  .LBB0_12:
+; X64-HSW-NEXT:    jmp .LBB0_21
+; X64-HSW-NEXT:  .LBB0_11:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_14:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_13:
 ; X64-HSW-NEXT:    shll $4, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_15:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_14:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_16:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_15:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_17:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_16:
 ; X64-HSW-NEXT:    shll $5, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_18:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_17:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_19:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_18:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_20:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_19:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_21:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_20:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %ecx
-; X64-HSW-NEXT:  .LBB0_22:
+; X64-HSW-NEXT:  .LBB0_21:
 ; X64-HSW-NEXT:    addl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_23:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_22:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $4, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_25:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_23:
 ; X64-HSW-NEXT:    leal (,%rax,8), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_24:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_27:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_25:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_30:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_27:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $5, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_31:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_28:
 ; X64-HSW-NEXT:    xorl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_32:
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_33:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    shll $3, %ecx
-; X64-HSW-NEXT:  .LBB0_34:
+; X64-HSW-NEXT:  .LBB0_30:
 ; X64-HSW-NEXT:    subl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_36:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_31:
 ; X64-HSW-NEXT:    shll $2, %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_37:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_32:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_38:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_33:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
 ; X64-HSW-NEXT:    addl %eax, %eax
 ; X64-HSW-NEXT:    addl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_39:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_34:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $5, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_40:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_35:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:  .LBB0_36:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
+; X64-HSW-NEXT:  .LBB0_37:
 ; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
 ; X64-HSW-NEXT:    retq
   %3 = icmp eq i32 %1, 0
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
new file mode 100644
index 00000000000000..47b9fcaa7d6c85
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=3 < %s | FileCheck %s -check-prefix=LIMIT
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=4 < %s | FileCheck %s -check-prefix=NOLIMIT
+
+define i32 @foo(ptr %0, i32 %1) {
+  ; LIMIT-LABEL: name: foo
+  ; LIMIT: bb.0 (%ir-block.2):
+  ; LIMIT-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; LIMIT-NEXT:   liveins: $rdi, $esi
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[COPY:%[0-9]+]]:gr32 = COPY $esi
+  ; LIMIT-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY $rdi
+  ; LIMIT-NEXT:   [[SHR32ri:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri:%[0-9]+]]:gr32 = AND32ri [[SHR32ri]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.1 (%ir-block.5):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.2 (%ir-block.7):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.3 (%ir-block.10):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.4 (%ir-block.13):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.5.default.unreachable2:
+  ; LIMIT-NEXT:   successors:
+  ; LIMIT: bb.6 (%ir-block.16):
+  ; LIMIT-NEXT:   successors: %bb.7(0x20000000), %bb.8(0x20000000), %bb.9(0x20000000), %bb.10(0x20000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[PHI:%[0-9]+]]:gr32 = PHI [[SHR32ri3]], %bb.4, [[SHR32ri2]], %bb.3, [[SHR32ri1]], %bb.2, [[MOV32rm]], %bb.1
+  ; LIMIT-NEXT:   [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.7 (%ir-block.20):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.11
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.8 (%ir-block.22):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32r...
[truncated]

@DianQK
Member Author

DianQK commented Jan 18, 2024

I can verify that the initial OOM case is back to normal.

Also, when the default branch is removed, the generated instructions show some improvement.

With the default branch:

Iterations:        100
Instructions:      710800
Total Cycles:      166608
Total uOps:        710800

Dispatch Width:    6
uOps Per Cycle:    4.27
IPC:               4.27
Block RThroughput: 1216.5

Without the default branch:

Iterations:        100
Instructions:      709000
Total Cycles:      158311
Total uOps:        709000

Dispatch Width:    6
uOps Per Cycle:    4.48
IPC:               4.48
Block RThroughput: 1205.0

Text diff:

diff --git a/output.s b/output.s
index 322d0d0..6ca97d0 100644
--- a/output.s
+++ b/output.s
@@ -1,5 +1,5 @@
 	.text
-	.file	"oom_manual.c"
+	.file	"oom_manual2.c"
 	.globl	f1                              # -- Begin function f1
 	.p2align	4, 0x90
 	.type	f1,@function
@@ -33,12805 +33,12788 @@ f1:                                     # @f1
 	movl	%eax, %ecx
 	shrl	%ecx
 	andl	$127, %ecx
-	cmpl	$126, %ecx
-	ja	.LBB0_15
-# %bb.1:
 	jmpq	*.LJTI0_0(,%rcx,8)

I can see many cmpl instructions disappearing.

@dwblaikie
Collaborator

@aeubanks mind taking a look at this small patch? Is this limit reasonable/consistent with other similar limits, etc? Do we need more data to back up why this particular bound was chosen?

@aeubanks
Contributor

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

@aeubanks
Contributor

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened? was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated? or is there an underlying algorithmic problem that we can fix?

@DianQK
Member Author

DianQK commented Jan 19, 2024

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case poorly maintained? I'm not sure, but I can change it.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of a switch creates an if statement (a compare instruction), which results in two successors.
The previous patch did one thing: it replaced the default branch with an unreachable instruction. TailDuplicator could then duplicate blocks into their predecessors.
Since we reverted the previous patch, this can't be reproduced directly on the main branch, but it can be reproduced by making the default branch unreachable in the C source code, e.g.

    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \

was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so.
I checked the time consumption of llc with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication, and another 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but maybe my machine's environment makes it unavailable.)
I had to use TimeProfiler instead; tailDuplicateAndUpdate accounts for about half of the time in each.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if needed. However, I don't know much about MIR and performance work.
I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

@aeubanks
Contributor

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case poorly maintained? I'm not sure, but I can change it.

What do you mean "poorly maintained"?

Given that the option works on the number of MIR instructions, we should keep the input MIR instruction count consistent.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of a switch creates an if statement (a compare instruction), which results in two successors. The previous patch did one thing: it replaced the default branch with an unreachable instruction. TailDuplicator could then duplicate blocks into their predecessors. Since we reverted the previous patch, this can't be reproduced directly on the main branch, but it can be reproduced by making the default branch unreachable in the C source code, e.g.

    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \

was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so. I checked the time consumption of llc with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication, and another 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but maybe my machine's environment makes it unavailable.) I had to use TimeProfiler instead; tailDuplicateAndUpdate accounts for about half of the time in each.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if needed. However, I don't know much about MIR and performance work. I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

@bzEq
Collaborator

bzEq commented Jan 19, 2024

The cause is that early-tailduplication transforms a normal switch into computed gotos, which means a lot of indirect branches are generated, adding a quadratic number of edges among the cases; that is, at the end of each case, control flow can jump directly to any other case. As a consequence, passes that run after early-tailduplication lag due to the more complex CFG.
You may be interested in https://reviews.llvm.org/D110613.
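To make that shape concrete, here is a minimal sketch (using the GNU labels-as-values extension; illustrative only, not code from the patch) of what a dense switch effectively becomes after tail duplication: every case ends in its own indirect branch, so each of the N dispatch points has an edge to each of the N targets, O(N^2) edges in total.

// Sketch: tail duplication turns one shared dispatch into a per-case
// indirect branch, as if the source were written with computed gotos.
void interp(const unsigned char *code) {
  static void *const targets[] = {&&op_add, &&op_sub, &&op_halt};
  goto *targets[*code++]; // initial dispatch
op_add:
  /* ... handler work ... */
  goto *targets[*code++]; // duplicated dispatch: edges to all targets
op_sub:
  /* ... handler work ... */
  goto *targets[*code++]; // duplicated dispatch: edges to all targets
op_halt:
  return;
}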

@weiguozhi
Contributor

If you want to limit the amount of tail duplication, I would like to see partial tail duplication based on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

@DianQK
Member Author

DianQK commented Jan 21, 2024

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case poorly maintained? I'm not sure, but I can change it.

What do you mean "poorly maintained"?

Given that the option works on the number of MIR instructions, we should keep the input MIR instruction count consistent.

@aeubanks
I have changed the test case. That was my mistake; I mistakenly assumed the MIR file would be very large.

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

@aeubanks
I've updated the issue. I added all the details I could.

The cause is that early-tailduplication transforms a normal switch into computed gotos, which means a lot of indirect branches are generated, adding a quadratic number of edges among the cases; that is, at the end of each case, control flow can jump directly to any other case. As a consequence, passes that run after early-tailduplication lag due to the more complex CFG. You may be interested in https://reviews.llvm.org/D110613.

@bzEq
This looks like a very similar patch, but it seems to have been reverted for unknown reasons. What I've found so far is that most of the time is spent in early-tailduplication.

If you want to limit the amount of tail duplication, I would like to see partial tail duplication based on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

@weiguozhi
Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by PGO?

@weiguozhi
Contributor

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).
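A standalone sketch of that idea (hypothetical types, names, and threshold; a toy model, not LLVM's actual implementation): keep only the predecessors whose incoming edge is hot enough, and duplicate the tail into those alone.

#include <utility>
#include <vector>

// Toy model of profile-guided partial tail duplication: select only the
// predecessors whose edge probability clears a hotness threshold.
struct Block {
  std::vector<std::pair<Block *, double>> Preds; // (predecessor, edge prob)
};

std::vector<Block *> hotDupCandidates(const Block &Tail, double HotThreshold) {
  std::vector<Block *> Out;
  for (const auto &[Pred, Prob] : Tail.Preds)
    if (Prob >= HotThreshold) // duplicate only along hot edges
      Out.push_back(Pred);
  return Out;
}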

@DianQK
Member Author

DianQK commented Jan 23, 2024

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).

But the given example does not contain profile information, so I don't think this is the solution to consider for the current problem.

@dwblaikie
Collaborator

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).

But the given example does not contain profile information, so I don't think this is the solution to consider for the current problem.

I think the argument is: the currently proposed solution may harm performance in some cases, and that loss can be mitigated at least in the presence of profile information (& so with that mitigation, maybe the overall cost is low enough to be worth shipping). Also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

@nikic
Contributor

nikic commented Jan 23, 2024

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

@DianQK
Member Author

DianQK commented Jan 25, 2024

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).

But the given example does not contain profile information, so I don't think this is the solution to consider for the current problem.

I think the argument is: the currently proposed solution may harm performance in some cases, and that loss can be mitigated at least in the presence of profile information (& so with that mitigation, maybe the overall cost is low enough to be worth shipping). Also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

I'm not sure I want to rely on PGO, since most applications won't be built with it. But I think the limiting conditions here can be adjusted; perhaps we could consider the number of duplicated instructions.

@DianQK
Member Author

DianQK commented Jan 25, 2024

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

I've written down the compilation time in the issue. I simply tried the runtime benchmark.

The oom_manual2.c file replaces the default branch with __builtin_unreachable().

clang -v:

clang version 17.0.6
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/j7mhyvnbbm4pk7vriilrc5wvh2kims5p-clang-17.0.6/bin

main.c:

int src(void) {
  return -1;
}

extern int f1(unsigned int *b);

int main(int argc, char **argv) {
  int r = argc;
  unsigned int b[] = { -1, -2, -3 };
  for (int i = 0; i < 1000000; i++) {
    r += f1(b);
  }
  return r;
}

build.sh:

clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual2.c main.c -o oom_manual2
ls -lh oom_manual oom_manual2

output:

-rwxr-xr-x 1 dianqk users  56K Jan 25 22:15 oom_manual
-rwxr-xr-x 1 dianqk users 192K Jan 25 22:15 oom_manual2

hyperfine.sh:

hyperfine -i -N --runs 200 --warmup 50 ./oom_manual ./oom_manual2

output:

Benchmark 1: ./oom_manual
  Time (mean ± σ):      12.4 ms ±   0.0 ms    [User: 12.2 ms, System: 0.2 ms]
  Range (min … max):    12.4 ms …  12.6 ms    200 runs

Benchmark 2: ./oom_manual2
  Time (mean ± σ):      14.9 ms ±   0.3 ms    [User: 14.7 ms, System: 0.2 ms]
  Range (min … max):    14.8 ms …  18.8 ms    200 runs

Summary
  ./oom_manual ran
    1.20 ± 0.02 times faster than ./oom_manual2

perf.sh:

function run_perf() {
  echo "perf stat $1"
  perf stat -x \; \
    -e instructions \
    -e instructions:u \
    -e cycles \
    -e task-clock \
    -e branches \
    -e branch-misses \
    $1
}

run_perf ./oom_manual
run_perf ./oom_manual2

output:

perf stat ./oom_manual
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
69202217;;cycles:u;12839070;100.00;5.390;GHz
12.84;msec;task-clock:u;12839070;100.00;0.981;CPUs utilized
56047969;;branches:u;12839070;100.00;4.365;G/sec
2876;;branch-misses:u;12839070;100.00;0.01;of all branches
perf stat ./oom_manual2
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
82892122;;cycles:u;15665683;100.00;5.291;GHz
15.67;msec;task-clock:u;15665683;100.00;0.981;CPUs utilized
32047975;;branches:u;15665683;100.00;2.046;G/sec
2827;;branch-misses:u;15665683;100.00;0.01;of all branches

I am trying to change the code to see the results of different scenarios.

@DianQK
Member Author

DianQK commented Feb 5, 2024

I made some progress.

There are usually only two instructions that can be duplicated. Indirect branches increase this limit to 20.

// If the target has hardware branch prediction that can handle indirect
// branches, duplicating them can often make them predictable when there
// are common paths through the code. The limit needs to be high enough
// to allow undoing the effects of tail merging and other optimizations
// that rearrange the predecessors of the indirect branch.
bool HasIndirectbr = false;
if (!TailBB.empty())
  HasIndirectbr = TailBB.back().isIndirectBranch();
if (HasIndirectbr && PreRegAlloc)
  MaxDuplicateCount = TailDupIndirectBranchSize;

I understand from the comments that this is intended to improve branch prediction accuracy. I wanted to know whether it is appropriate when there are numerous indirect branches, so I did some experimenting in https://github.com/DianQK/llvm-tail-dup-indirect-succ-size.

I'm using the instructions:u and branch-misses:u results of a run that doesn't call f1 as the baseline. dup is built with -tail-dup-indirect-size=20, and nodup with -tail-dup-indirect-size=2. I think duplication may not make sense in this scenario.

One of my results is as follows:

==================
Iterations: 1
dup: instructions: 418, branches: 32, branch-misses: 64
nodup: instructions: 280, branches: 53, branch-misses: 1
==================
Iterations: 2
dup: instructions: 840, branches: 63, branch-misses: 60
nodup: instructions: 565, branches: 106, branch-misses: 30
==================
Iterations: 3
dup: instructions: 1260, branches: 93, branch-misses: 46
nodup: instructions: 857, branches: 159, branch-misses: 62
==================
Iterations: 4
dup: instructions: 1681, branches: 123, branch-misses: 48
nodup: instructions: 1135, branches: 212, branch-misses: 74
==================
Iterations: 5
dup: instructions: 2102, branches: 152, branch-misses: 46
nodup: instructions: 1434, branches: 265, branch-misses: 67
==================
Iterations: 6
dup: instructions: 2536, branches: 183, branch-misses: 55
nodup: instructions: 1705, branches: 318, branch-misses: 82
==================
Iterations: 7
dup: instructions: 2944, branches: 212, branch-misses: 60
nodup: instructions: 1990, branches: 371, branch-misses: 77
==================
Iterations: 8
dup: instructions: 3371, branches: 243, branch-misses: 45
nodup: instructions: 2279, branches: 424, branch-misses: 57
==================
Iterations: 9
dup: instructions: 3793, branches: 273, branch-misses: 48
nodup: instructions: 2565, branches: 477, branch-misses: 56
==================
Iterations: 10
dup: instructions: 4232, branches: 309, branch-misses: 61
nodup: instructions: 2870, branches: 536, branch-misses: 49
==================
Iterations: 100
dup: instructions: 42138, branches: 3015, branch-misses: 54
nodup: instructions: 28540, branches: 5312, branch-misses: 55
==================
Iterations: 1000
dup: instructions: 421057, branches: 30020, branch-misses: 42
nodup: instructions: 285055, branches: 53018, branch-misses: 84
==================
Iterations: 10000
dup: instructions: 4210078, branches: 300027, branch-misses: 51
nodup: instructions: 2850075, branches: 530024, branch-misses: 15
==================
Iterations: 100000
dup: instructions: 42100100, branches: 3000035, branch-misses: 60
nodup: instructions: 28500098, branches: 5300032, branch-misses: 52
==================
Iterations: 1000000
dup: instructions: 421000146, branches: 30000068, branch-misses: 157
nodup: instructions: 285000149, branches: 53000061, branch-misses: 117
==================
Iterations: 10000000
dup: instructions: 4210000439, branches: 300000341, branch-misses: 749
nodup: instructions: 2850000384, branches: 530000291, branch-misses: 621
==================
Iterations: 100000000
dup: instructions: 42100003182, branches: 3000003075, branch-misses: 6277
nodup: instructions: 28500002671, branches: 5300002564, branch-misses: 5186

If this is the right route, I'll continue to figure out the other two problems.

  • Why does dup execute more instructions? (Is it related to the increased number of PHIs?)
  • What is a suitable successor size?

@DianQK
Member Author

DianQK commented Feb 13, 2024

The general pattern that's supposed to be targeted by TailDupIndirectBranchSize is something like an interpreter loop. The program executes one interpreter opcode, then jumps to the next opcode. Interpreters tend to have sequences of opcodes where, given an opcode, the next opcode is somewhat predictable. If we merge all the indirect branches of every opcode together, the indirect branch predictor loses the information about the previous opcode. So we want to be a bit more generous in this case. At least, that was the goal when this heuristic was originally added back in 2009. Indirect branch predictors today are probably smarter.

If I understand what you are describing correctly, this duplication is an improvement regardless of the number of predecessors and successors, but I do want to add some limits to bound the compile-time and code-size overhead. As it stands, I think transformation scenarios that reach 128 predecessors have been almost non-existent in the past; otherwise we would have caught the problem earlier. It shows up now because removing the default branch exposed this transformation opportunity. Can we at least add a condition to limit this transformation? I think it makes sense to limit the number of predecessors when we encounter indirect branches.

@efriedma-quic
Collaborator

There's sort of two separate questions here.

One is whether we're emitting too much code here, i.e. the codesize tradeoff is worthwhile. This might depend on what exactly the user wrote... we might want to be more aggressive if the user explicitly uses indirect gotos. This explains the codesize growth you're seeing, but not really the compile-time; it's a constant number of instructions per switch case.

The other is whether the representation we're using is causing non-linear compile-time. The codesize growth should be linear. But if we're representing the edges explicitly, the size of the IR in memory might grow non-linearly, causing compile-time issues.

Even if we're tail-duplicating, we don't necessarily need to explicitly represent all the possible source->destination edges separately. It's just the most convenient thing to do given the way codegen works. But we could do something different: instead of making every indirect jump have an edge to every indirect jump destination, we could make all the jumps jump to a synthetic basic block, and then have edges from that basic block to all the destinations. Same semantics, without the compile-time. (Not suggesting you try to fix this here, just noting it's possible.)
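As a rough model of that edge-count argument (illustrative numbers only, not from the patch): N indirect jumps that can each reach N destinations need N*N explicit edges, while routing every jump through one synthetic hub block needs only 2*N.

#include <cstddef>

// Illustrative edge-count model: explicit all-to-all CFG edges vs. a single
// synthetic hub block that every indirect jump targets.
constexpr std::size_t explicitEdges(std::size_t n) { return n * n; }
constexpr std::size_t hubEdges(std::size_t n) { return 2 * n; }

static_assert(explicitEdges(128) == 16384, "quadratic growth");
static_assert(hubEdges(128) == 256, "linear growth");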


Some threshold probably makes sense, but we should make clear whether we're primarily trying to target the codesize, or the compile-time, so we have a good starting point the next time this is revisited.

@DianQK
Member Author

DianQK commented Feb 14, 2024

I just remembered that I found a performance issue here: in the example with successive switch statements, we add a number of PHI instructions, which seem to end up as stack operations. I'm not sure whether this can be optimized in some other way.


One is whether we're emitting too much code here, i.e. the codesize tradeoff is worthwhile. This might depend on what exactly the user wrote... we might want to be more aggressive if the user explicitly uses indirect gotos. This explains the codesize growth you're seeing, but not really the compile-time; it's a constant number of instructions per switch case.

As I mentioned above, the code size problem should be caused by instruction copying and the addition of extra PHI instructions.

The other is whether the representation we're using is causing non-linear compile-time. The codesize growth should be linear. But if we're representing the edges explicitly, the size of the IR in memory might grow non-linearly, causing compile-time issues.

Actually, the code size in #79993 isn't terribly bad, but the compile-time increase is horrible.

Even if we're tail-duplicating, we don't necessarily need to explicitly represent all the possible source->destination edges separately. It's just the most convenient thing to do given the way codegen works. But we could do something different: instead of making every indirect jump have an edge to every indirect jump destination, we could make all the jumps jump to a synthetic basic block, and then have edges from that basic block to all the destinations. Same semantics, without the compile-time. (Not suggesting you try to fix this here, just noting it's possible.)

I may not be able to fix the issue myself, but I will follow up to see whether it can actually be fixed. (Though it probably won't be.)

Some threshold probably makes sense, but we should make clear whether we're primarily trying to target the codesize, or the compile-time, so we have a good starting point the next time this is revisited.

I'll continue to work on this issue. Even if I submit a workaround, I'll try to explain the various issues I've found. :)

@efriedma-quic
Collaborator

The PHI node issue is interesting... I would have thought that if you have a bunch of identical PHIs, the register allocator would do the right thing, but maybe not.

I'll continue to work on this issue. Even if I submit a workaround, I'll try to explain the various issues I've found. :)

Okay, sounds good.

@DianQK
Member Author

DianQK commented Feb 24, 2024

@efriedma-quic I've put most of the analysis into this comment.

In a nutshell, I speculate that duplicating critical BBs makes the CFG exceptionally complex, especially within loops, which may be the primary reason other passes take more time. A critical BB is one with multiple predecessors and multiple successors.
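In code, the property being screened for is simply the following (a sketch against LLVM's MachineBasicBlock API, with a hypothetical helper name):

#include "llvm/CodeGen/MachineBasicBlock.h"

// Hypothetical helper: a "critical" block in the sense used above has more
// than one predecessor and more than one successor, so duplicating it
// multiplies CFG edges (and PHI operands) rather than just moving code.
static bool isCriticalBlock(const llvm::MachineBasicBlock &MBB) {
  return MBB.pred_size() > 1 && MBB.succ_size() > 1;
}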

DianQK changed the title from "[TailDuplicator] Add a limit on the number of indirect branch successors" to "[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks" on Feb 24, 2024
@DianQK
Member Author

DianQK commented Feb 27, 2024

Ping. If the new changes are suitable, I hope to make it into the final 18.1.0 release.

@dtcxzyw
Member

dtcxzyw commented Mar 22, 2024

Ping.

@DianQK
Member Author

DianQK commented Apr 15, 2024

Ping. I assume this is an uncommon scenario, since there has never been feedback on similar compile-time issues before.

Contributor

nikic left a comment


LGTM. The new check looks sufficiently narrow (requires many predecessors and many successors) to just do this.

DianQK merged commit 86a7828 into llvm:main Apr 17, 2024
3 of 4 checks passed
DianQK deleted the tail-dup-pred-size-limit branch April 17, 2024 12:27
@alexfh
Contributor

alexfh commented Apr 22, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@alexfh
Contributor

alexfh commented Apr 25, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

@DianQK
Member Author

DianQK commented Apr 25, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

I apologize for my late reply. (*^^*)
I think we can adjust the values of tail-dup-succ-size and tail-dup-pred-size; based on my experiments, compilation time is acceptable up to 100.

@DianQK
Member Author

DianQK commented Apr 25, 2024

I'll verify at the end of the week that increasing to 128 has no noticeable impact on compilation time.

@DianQK
Member Author

DianQK commented Apr 27, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

Could you try with -tail-dup-pred-size=64 -tail-dup-succ-size=64? It seems I can't reproduce it on x86-64.

@alexfh
Contributor

alexfh commented May 2, 2024

Sure, I'll run the comparison before and after this commit with -tail-dup-pred-size=64 -tail-dup-succ-size=64. And yes, this only reproduces on AArch64, but not on any flavor of x86-64 I have access to.

@alexfh
Contributor

alexfh commented May 2, 2024

The benchmark compiled with clang after this commit, using -mllvm=-tail-dup-pred-size=64 -mllvm=-tail-dup-succ-size=64, shows no regression compared to before this commit. I'm running with 32 just in case. Also note that this benchmark is in no way representative of any real workloads we care about.

@alexfh
Contributor

alexfh commented May 2, 2024

Actually, just -mllvm=-tail-dup-succ-size=32 (without touching -tail-dup-pred-size) is enough to fix the regression, while -mllvm=-tail-dup-succ-size=24 is not. Thus, the optimal value of -tail-dup-succ-size for this particular benchmark on aarch64 is somewhere between 24 and 32 (with -tail-dup-pred-size at its default value of 16).

@DianQK
Member Author

DianQK commented May 6, 2024

Thanks, I will continue to investigate this.

@DianQK
Member Author

DianQK commented May 12, 2024

Hmm, I tried LLVM 18 and the main (c8864bc) branch on Raspberry Pi 4 (arm64), but I didn't find any performance issues:


 Performance counter stats for './llvm18':

          1,805.64 msec task-clock:u                     #    0.994 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.476 /sec
     2,653,746,964      cycles:u                         #    1.470 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           247,193      branch-misses:u

       1.815950443 seconds time elapsed

       1.800769000 seconds user
       0.004009000 seconds sys


Sum: 3273600000

 Performance counter stats for './main':

          1,784.68 msec task-clock:u                     #    0.998 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.775 /sec
     2,653,681,171      cycles:u                         #    1.487 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           241,177      branch-misses:u

       1.788022549 seconds time elapsed

       1.780709000 seconds user
       0.003991000 seconds sys


Successfully merging this pull request may close these issues.

[CodeGen][OOM] Switch statements without default branching result in long compile times or OOM
10 participants