Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TailDuplicator] Do not restrict the computed gotos #114990

Merged
merged 2 commits into from
Mar 10, 2025

Conversation

DianQK
Copy link
Member

@DianQK DianQK commented Nov 5, 2024

Fixes #106846.

This is what I learned from GCC. I found that GCC does not duplicate the BB that has indirect jumps with the jump table. I believe GCC has provided a clear explanation here:

Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters.

@llvmbot
Copy link
Member

llvmbot commented Nov 5, 2024

@llvm/pr-subscribers-backend-x86

Author: DianQK (DianQK)

Changes

Fixes #106846.

This is what I learned from GCC. I found that GCC does not duplicate the BB that has indirect jumps with the jump table. I believe GCC has provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically unfactors computed gotos that were factored early on in the compilation process to speed up edge based data flow. We used to not unfactor them again, which can seriously pessimize code with many computed jumps in the source code, such as interpreters.

https://github.com/gcc-mirror/gcc/blob/7f67acf60c5429895d7c9e5df81796753e2913e0/gcc/bb-reorder.cc#L2757-L2761


Full diff: https://github.com/llvm/llvm-project/pull/114990.diff

3 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineInstr.h (+6-2)
  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+12-10)
  • (added) llvm/test/CodeGen/X86/tail-dup-computed-goto.mir (+310)
diff --git a/llvm/include/llvm/CodeGen/MachineInstr.h b/llvm/include/llvm/CodeGen/MachineInstr.h
index ead6bbe1d5f641..6d268628ec14b4 100644
--- a/llvm/include/llvm/CodeGen/MachineInstr.h
+++ b/llvm/include/llvm/CodeGen/MachineInstr.h
@@ -986,8 +986,12 @@ class MachineInstr
 
   /// Return true if this is an indirect branch, such as a
   /// branch through a register.
-  bool isIndirectBranch(QueryType Type = AnyInBundle) const {
-    return hasProperty(MCID::IndirectBranch, Type);
+  bool isIndirectBranch(QueryType Type = AnyInBundle,
+                        bool IncludeJumpTable = true) const {
+    return hasProperty(MCID::IndirectBranch, Type) &&
+           (IncludeJumpTable || !llvm::any_of(operands(), [](const auto &Op) {
+              return Op.isJTI();
+            }));
   }
 
   /// Return true if this is a branch which may fall
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 3f2e1511d403a0..988c58beac20f6 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -603,17 +603,19 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
       TailBB.canFallThrough())
     return false;
 
-  // If the target has hardware branch prediction that can handle indirect
-  // branches, duplicating them can often make them predictable when there
-  // are common paths through the code.  The limit needs to be high enough
-  // to allow undoing the effects of tail merging and other optimizations
-  // that rearrange the predecessors of the indirect branch.
-
-  bool HasIndirectbr = false;
+  // Only duplicate the blocks containing computed gotos. This basically
+  // unfactors computed gotos that were factored early on in the compilation
+  // process to speed up edge based data flow. If we do not unfactor them again,
+  // it can seriously pessimize code with many computed jumps in the source
+  // code, such as interpreters.
+  bool HasComputedGoto = false;
   if (!TailBB.empty())
-    HasIndirectbr = TailBB.back().isIndirectBranch();
+    HasComputedGoto = TailBB.back().isIndirectBranch(
+        /*Type=*/MachineInstr::AnyInBundle,
+        // Jump tables are not considered computed gotos.
+        /*IncludeJumpTable=*/false);
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     MaxDuplicateCount = TailDupIndirectBranchSize;
 
   // Check the instructions in the block to determine whether tail-duplication
@@ -685,7 +687,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
     }
   }
 
-  if (HasIndirectbr && PreRegAlloc)
+  if (HasComputedGoto && PreRegAlloc)
     return true;
 
   if (IsSimple)
diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
new file mode 100644
index 00000000000000..b1c699c11f4619
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir
@@ -0,0 +1,310 @@
+# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5
+# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-size=0 %s -o - | FileCheck %s
+# Check that only the computed goto is duplicated.
+--- |
+  declare i64 @f0()
+  declare i64 @f1()
+  declare i64 @f2()
+  declare i64 @f3()
+  declare i64 @f4()
+  declare i64 @f5()
+  @computed_goto.dispatch = external global [5 x ptr]
+  define void @computed_goto() { ret void }
+  define void @jump_table() { ret void }
+...
+---
+name:            computed_goto
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64_nosp }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64 }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+body:             |
+  ; CHECK-LABEL: name: computed_goto
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]]
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY1]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64_nosp = COPY [[COPY3]]
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]]
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64_nosp = COPY [[COPY7]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY7]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64_nosp = COPY [[COPY9]]
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64_nosp = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY10]], @computed_goto.dispatch, $noreg
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY12:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY13:%[0-9]+]]:gr64_nosp = COPY [[COPY12]]
+  ; CHECK-NEXT:   [[COPY14:%[0-9]+]]:gr64_nosp = COPY [[COPY13]]
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[COPY13]], @computed_goto.dispatch, $noreg
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %6:gr64 = COPY $rax
+    %0:gr64 = COPY %6
+    JMP_1 %bb.5
+
+  bb.1:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %1:gr64 = COPY %10
+    JMP_1 %bb.5
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %2:gr64 = COPY %9
+    JMP_1 %bb.5
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %8:gr64 = COPY $rax
+    %3:gr64 = COPY %8
+    JMP_1 %bb.5
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %4:gr64 = COPY %7
+
+  bb.5:
+    successors: %bb.1, %bb.2, %bb.3, %bb.4
+
+    %5:gr64_nosp = PHI %0, %bb.0, %4, %bb.4, %3, %bb.3, %2, %bb.2, %1, %bb.1
+    JMP64m $noreg, 8, %5, @computed_goto.dispatch, $noreg
+
+...
+---
+name:            jump_table
+alignment:       16
+tracksRegLiveness: true
+noPhis:          false
+isSSA:           true
+noVRegs:         false
+hasFakeUses:     false
+debugInstrRef:   true
+registers:
+  - { id: 0, class: gr64 }
+  - { id: 1, class: gr64 }
+  - { id: 2, class: gr64 }
+  - { id: 3, class: gr64 }
+  - { id: 4, class: gr64 }
+  - { id: 5, class: gr64 }
+  - { id: 6, class: gr64 }
+  - { id: 7, class: gr64 }
+  - { id: 8, class: gr64_nosp }
+  - { id: 9, class: gr64 }
+  - { id: 10, class: gr64 }
+  - { id: 11, class: gr64 }
+  - { id: 12, class: gr64 }
+  - { id: 13, class: gr64 }
+frameInfo:
+  maxAlignment:    1
+  adjustsStack:    true
+  hasCalls:        true
+machineFunctionInfo:
+  amxProgModel:    None
+jumpTable:
+  kind:            block-address
+  entries:
+    - id:              0
+      blocks:          [ '%bb.2', '%bb.3', '%bb.4', '%bb.5', '%bb.6' ]
+body:             |
+  ; CHECK-LABEL: name: jump_table
+  ; CHECK: bb.0:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY [[COPY]]
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.1:
+  ; CHECK-NEXT:   successors: %bb.2(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   [[PHI:%[0-9]+]]:gr64 = PHI [[COPY1]], %bb.0, %6, %bb.7, %5, %bb.6, %4, %bb.5, %3, %bb.4, %2, %bb.3
+  ; CHECK-NEXT:   [[DEC64r:%[0-9]+]]:gr64_nosp = DEC64r [[PHI]], implicit-def dead $eflags
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.2:
+  ; CHECK-NEXT:   successors: %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a), %bb.7(0x1999999a)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   JMP64m $noreg, 8, [[DEC64r]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.3:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:gr64 = COPY [[COPY2]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.4:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:gr64 = COPY [[COPY4]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.5:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:gr64 = COPY [[COPY6]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.6:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY9:%[0-9]+]]:gr64 = COPY [[COPY8]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.7:
+  ; CHECK-NEXT:   successors: %bb.1(0x80000000)
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT:   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+  ; CHECK-NEXT:   ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+  ; CHECK-NEXT:   [[COPY10:%[0-9]+]]:gr64 = COPY $rax
+  ; CHECK-NEXT:   [[COPY11:%[0-9]+]]:gr64 = COPY [[COPY10]]
+  ; CHECK-NEXT:   JMP_1 %bb.1
+  ; CHECK-NEXT: {{  $}}
+  ; CHECK-NEXT: bb.8:
+  bb.0:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %7:gr64 = COPY $rax
+    %0:gr64 = COPY %7
+
+  bb.1:
+    %1:gr64 = PHI %0, %bb.0, %6, %bb.6, %5, %bb.5, %4, %bb.4, %3, %bb.3, %2, %bb.2
+    %8:gr64_nosp = DEC64r %1, implicit-def dead $eflags
+
+  bb.8:
+    successors: %bb.2(0x1999999a), %bb.3(0x1999999a), %bb.4(0x1999999a), %bb.5(0x1999999a), %bb.6(0x1999999a)
+
+    JMP64m $noreg, 8, %8, %jump-table.0, $noreg :: (load (s64) from jump-table)
+
+  bb.2:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %13:gr64 = COPY $rax
+    %2:gr64 = COPY %13
+    JMP_1 %bb.1
+
+  bb.3:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %12:gr64 = COPY $rax
+    %3:gr64 = COPY %12
+    JMP_1 %bb.1
+
+  bb.4:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %11:gr64 = COPY $rax
+    %4:gr64 = COPY %11
+    JMP_1 %bb.1
+
+  bb.5:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %10:gr64 = COPY $rax
+    %5:gr64 = COPY %10
+    JMP_1 %bb.1
+
+  bb.6:
+    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    CALL64pcrel32 target-flags(x86-plt) @f5, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax
+    ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
+    %9:gr64 = COPY $rax
+    %6:gr64 = COPY %9
+    JMP_1 %bb.1
+
+  bb.7:
+
+...

@DianQK
Copy link
Member Author

DianQK commented Nov 5, 2024

I will revert #78582 after this PR is merged.

@efriedma-quic
Copy link
Collaborator

Please don't link to gcc source code in commit messages.

I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

@DianQK
Copy link
Member Author

DianQK commented Nov 6, 2024

Please don't link to gcc source code in commit messages.

Removed.

I think it might make sense to try to pursue a more general solution here... can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something. But special-casing indirectbr seems okay short-term.

I think the current IR is already as you want: https://llvm.godbolt.org/z/o5YT191P9.
There’s only a single merged indirectbr here.

@efriedma-quic
Copy link
Collaborator

I think the current IR is already as you want: https://llvm.godbolt.org/z/o5YT191P9.
There’s only a single merged indirectbr here.

Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?

@DianQK
Copy link
Member Author

DianQK commented Nov 8, 2024

I think the current IR is already as you want: https://llvm.godbolt.org/z/o5YT191P9.
There’s only a single merged indirectbr here.

Currently, taildup means we end up with O(N^2) edges, even in the indirectbr case. This patch is just deciding that we're willing to pay that cost in the indirectbr case, because it's a strong hint that that the code is important for performance. The question is, can we rework the backend representation to allow taildup without the explosion in the number of edges?

Sound makes sense to me. I can try to address it, although the process can be slow.

https://www.ajla-lang.cz/tutorial.html pointed out several issues:

It is recommended to use gcc &mdash the compilation will take several minutes and it will consume about 4GB memory when compiling the files ipret.c and ipretc.c. Compilation with clang works, but it is very slow, it may take an hour or more to compile the files ipret.c and ipretc.c.

@DianQK DianQK requested a review from fhahn November 19, 2024 01:29
@DianQK DianQK marked this pull request as draft November 19, 2024 01:30
@DianQK DianQK marked this pull request as ready for review November 21, 2024 04:00
@DianQK
Copy link
Member Author

DianQK commented Dec 3, 2024

ping :p

@bzEq
Copy link
Collaborator

bzEq commented Dec 3, 2024

can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.

The idea sounds good. Maybe outlined as

bb.0:
  succ: bb.0, bb.1, bb.2
  indirectbr %0
bb.1:
  succ: bb.0, bb.1, bb.2
  indirectbr %1
bb.2:
  succ: bb.0, bb.1, bb.2
  indirectbr %2

=>

bb.0:
  succ: bb.3
  indirectbr %0
bb.1:
  succ: bb.3
  indirectbr %1
bb.2:
  succ: bb.3
  indirectbr %2
bb.3:
  succ: bb.0, bb.1, bb.2
  %3 = phi(%0, %1, %2)
  PSEUDO_INDIRECT_BR %3

@DianQK
Copy link
Member Author

DianQK commented Dec 17, 2024

can we simplify the CFG while still keeping the benefit of taildup? Maybe introduce a fake BB into the CFG or something.

The idea sounds good. Maybe outlined as

bb.0:
  succ: bb.0, bb.1, bb.2
  indirectbr %0
bb.1:
  succ: bb.0, bb.1, bb.2
  indirectbr %1
bb.2:
  succ: bb.0, bb.1, bb.2
  indirectbr %2

=>

bb.0:
  succ: bb.3
  indirectbr %0
bb.1:
  succ: bb.3
  indirectbr %1
bb.2:
  succ: bb.3
  indirectbr %2
bb.3:
  succ: bb.0, bb.1, bb.2
  %3 = phi(%0, %1, %2)
  PSEUDO_INDIRECT_BR %3

I don't fully understand the details of this improvement yet, but it shouldn't block this PR.

So.. ping

@fhahn
Copy link
Contributor

fhahn commented Jan 22, 2025

Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?

I tried the patch when building Python on macOS and it improves performance by ~0.5% while with #116072 increases performance by 2-3%

@DianQK
Copy link
Member Author

DianQK commented Jan 23, 2025

Just getting back to this, sorry! Is this the latest version of the patch that should fix the regressions?

I tried the patch when building Python on macOS and it improves performance by ~0.5% while with #116072 increases performance by 2-3%

I don't have a CPU like the i7-2640M, but I can observe a significant decrease in instructions:u from the perf stat output.

perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    98,475,825,975      instructions:u                   #    2.95  insn per cycle
perf stat -r 3 ./ajla --nosave loop.ajla 1000000000
    77,379,026,755      instructions:u                   #    2.09  insn per cycle

This should help CPUs with weaker branch prediction capabilities.

I will rebase this PR after #116072 land.

@DianQK DianQK changed the title [TailDuplicator] Only duplicate the blocks containing computed gotos [TailDuplicator] Do not restrict the computed gotos Jan 24, 2025
@DianQK
Copy link
Member Author

DianQK commented Jan 24, 2025

Rebased.
cc @fhahn

@DianQK
Copy link
Member Author

DianQK commented Mar 6, 2025

Ping ~ (From #106846 (comment))

indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 8, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 8, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 9, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 9, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
indygreg added a commit to indygreg/toolchain-tools that referenced this pull request Mar 9, 2025
The vendored patches were produced from the latest versions of the
following PRs:

* llvm/llvm-project#114990
* llvm/llvm-project#120267

The first improves codegen for computed gotos. There was a
regression in LLVM 19 causing a ~10% performance drop in CPython.

The second enables BOLT to work with computed gotos. This enables
BOLT to accomplish more on CPython.
@DianQK
Copy link
Member Author

DianQK commented Mar 10, 2025

@dtcxzyw @nikic @fhahn

Ping, I believe this issue is quietly and broadly affecting interpreters implemented using computed gotos. Compared to Clang 18, this shouldn't introduce new regressions, and I think we can move on this a bit more quickly.

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DianQK DianQK merged commit dd21aac into llvm:main Mar 10, 2025
8 checks passed
@DianQK DianQK deleted the computed-goto branch March 10, 2025 11:34
swift-ci pushed a commit to swiftlang/llvm-project that referenced this pull request Mar 11, 2025
Fixes llvm#106846.

This is what I learned from GCC. I found that GCC does not duplicate the
BB that has indirect jumps with the jump table. I believe GCC has
provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically
unfactors computed gotos that were factored early on in the compilation
process to speed up edge based data flow. We used to not unfactor them
again, which can seriously pessimize code with many computed jumps in
the source code, such as interpreters.

(cherry picked from commit dd21aac)
Bertik23 pushed a commit to Bertik23/llvm-project that referenced this pull request Mar 12, 2025
Fixes llvm#106846.

This is what I learned from GCC. I found that GCC does not duplicate the
BB that has indirect jumps with the jump table. I believe GCC has
provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically
unfactors computed gotos that were factored early on in the compilation
process to speed up edge based data flow. We used to not unfactor them
again, which can seriously pessimize code with many computed jumps in
the source code, such as interpreters.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

performance regression in clang-19 when using computed goto
6 participants