Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.

The core idea is to (re-)introduce some redundancies where their cost is hidden by the cost of materializing immediates for constant operands of PHI nodes. When the cost of the redundancies is covered by this, avoiding materializing the immediate has numerous benefits:
1) Less register pressure
2) Potential for further folding / combining
3) Potential for more efficient instructions due to immediate operand

As a motivating example, consider the remarkably different cost on x86 of a SHL instruction with an immediate operand versus a register operand.

This pattern turns up surprisingly frequently, but is somewhat rarely obvious as a significant performance problem.

The pass is entirely target independent, but it does rely on the target cost model in TTI to decide when to speculate things around the PHI node. I've included x86-focused tests, but any target that sets up its immediate cost model should benefit from this pass.

There is probably more that can be done in this space, but the pass as-is is enough to get some important performance wins on our internal benchmarks. It should be generally performance neutral otherwise, and help with more extensive benchmarking is always welcome.

One awkward part is that this pass has to be scheduled after *everything* that can eliminate these kinds of redundancies. This includes SimplifyCFG, GVN, etc. I'm open to suggestions about better places to put this. We could in theory make it part of the codegen pass pipeline, but there doesn't really seem to be a good reason for that -- it isn't "lowering" in any sense and only relies on pretty standard cost model based TTI queries, so it seems to fit well with the "optimization" pipeline model. Still, further thoughts on the pipeline position are welcome.

I've also only implemented this in the new pass manager. If folks are very interested, I can try to add it to the old PM as well, but I didn't really see much point (my use case is already switched over to the new PM).

I've tested this pretty heavily without issue. A wide range of benchmarks internally show no change outside the noise, and I don't see any significant changes in SPEC either. However, the size class computation in tcmalloc is substantially improved by this, which turns into a 2% to 4% win on the hottest path through tcmalloc for us, so there are definitely important cases where this is going to make a substantial difference.

Differential revision: https://reviews.llvm.org/D37467
llvm-svn: 319164
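
The trade-off described in the message (redundant copies paid for by constants that no longer need materializing) can be sketched as a toy cost comparison. This is only an illustration, not the pass's algorithm: the real pass queries the target cost model in TTI, while the `PhiCandidate` struct, its fields, `is_profitable`, and all cost numbers below are hypothetical.

```cpp
#include <cassert>

// Hypothetical, simplified cost summary for one PHI of constants and the
// single instruction that uses it. None of these names come from the patch.
struct PhiCandidate {
  int num_constant_edges;    // incoming edges whose value is a constant
  int materialize_cost;      // cost of putting one constant into a register
  int folded_immediate_cost; // cost of the constant as an immediate operand
  int user_instruction_cost; // cost of one extra copy of the user instruction
};

// In this toy model, speculation pays off when the materialization cost it
// removes is larger than the cost it adds.
bool is_profitable(const PhiCandidate &c) {
  int saved = c.num_constant_edges * c.materialize_cost;
  int extra_copies = c.num_constant_edges - 1; // one copy existed already
  int added = c.num_constant_edges * c.folded_immediate_cost +
              extra_copies * c.user_instruction_cost;
  return added < saved;
}

// Two made-up data points: an immediate that folds for free, and one that
// costs as much as materializing it.
void check_toy_model() {
  PhiCandidate foldable{2, 1, 0, 1};   // saved = 2, added = 1
  PhiCandidate unfoldable{2, 1, 1, 1}; // saved = 2, added = 3
  assert(is_profitable(foldable));
  assert(!is_profitable(unfoldable));
}
```

The second case shows why the decision is left to the target: if the constant cannot be folded cheaply into the user, duplicating the user only adds code.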
llvm · Nov 28, 2017 · c34f789
1 parent b789ab3
Showing 8 changed files with 1,527 additions and 0 deletions.
diff --git a/llvm/include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h b/llvm/include/llvm/Transforms/Scalar/SpeculateAroundPHIs.h
@@ -0,0 +1,111 @@
+//===- SpeculateAroundPHIs.h - Speculate around PHIs ------------*- C++ -*-===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_SCALAR_SPECULATEAROUNDPHIS_H
+#define LLVM_TRANSFORMS_SCALAR_SPECULATEAROUNDPHIS_H
+
+#include "llvm/ADT/SetVector.h"
+#include "llvm/Analysis/AssumptionCache.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/Support/Compiler.h"
+#include <vector>
+
+namespace llvm {
+
+/// This pass handles simple speculation of instructions around PHIs when
+/// doing so is profitable for a particular target despite duplicated
+/// instructions.
+///
+/// The motivating examples are PHIs of constants which will require
+/// materializing the constants along each edge. If the PHI is used by an
+/// instruction where the target can materialize the constant as part of the
+/// instruction, it is profitable to speculate those instructions around the
+/// PHI node. This can reduce dynamic instruction count as well as decrease
+/// register pressure.
+///
+/// Consider this IR for example:
+///   ```
+///   entry:
+///     br i1 %flag, label %a, label %b
+///
+///   a:
+///     br label %exit
+///
+///   b:
+///     br label %exit
+///
+///   exit:
+///     %p = phi i32 [ 7, %a ], [ 11, %b ]
+///     %sum = add i32 %arg, %p
+///     ret i32 %sum
+///   ```
+/// Materializing the inputs to this PHI node may require an explicit
+/// instruction. For example, on x86 this would turn into something like
+///   ```
+///     testl %eax, %eax
+///     movl $7, %rNN
+///     jne .L
+///     movl $11, %rNN
+///   .L:
+///     addl %edi, %rNN
+///     movl %rNN, %eax
+///     retq
+///   ```
+/// When these constants can be folded directly into another instruction, it
+/// would be preferable to avoid the potential for register pressure (above we
+/// can easily avoid it, but that isn't always true) and simply duplicate the
+/// instruction using the PHI:
+///   ```
+///   entry:
+///     br i1 %flag, label %a, label %b
+///
+///   a:
+///     %sum.1 = add i32 %arg, 7
+///     br label %exit
+///
+///   b:
+///     %sum.2 = add i32 %arg, 11
+///     br label %exit
+///
+///   exit:
+///     %p = phi i32 [ %sum.1, %a ], [ %sum.2, %b ]
+///     ret i32 %p
+///   ```
+/// Which will generate something like the following on x86:
+///   ```
+///     testl %eax, %eax
+///     addl $7, %edi
+///     jne .L
+///     addl $11, %edi
+///   .L:
+///     movl %edi, %eax
+///     retq
+///   ```
+///
+/// It is important to note that this pass is never intended to handle more
+/// complex cases where speculating around PHIs allows simplifications of the
+/// IR itself or other subsequent optimizations. Those can and should already
+/// be handled before this pass is ever run by a more powerful analysis that
+/// can reason about equivalences and common subexpressions. Classically, those
+/// cases would be handled by a GVN-powered PRE or similar transform. This
+/// pass, in contrast, is *only* interested in cases where despite no
+/// simplifications to the IR itself, speculation is *faster* to execute. The
+/// result of this is that the cost models which are appropriate to consider
+/// here are relatively simple ones around execution and codesize cost, without
+/// any need to consider simplifications or other transformations.
+struct SpeculateAroundPHIsPass : PassInfoMixin<SpeculateAroundPHIsPass> {
+  /// \brief Run the pass over the function.
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
+};
+
+} // end namespace llvm
+
+#endif // LLVM_TRANSFORMS_SCALAR_SPECULATEAROUNDPHIS_H
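
For readers who think in source code rather than IR, the two IR shapes in the header comment above correspond roughly to the following pair of C++ functions. This is a sketch under assumptions: the function names are hypothetical, unsigned arithmetic stands in for the wrapping `add i32`, and whether a given compiler actually produces either IR shape from this source depends on the rest of the pipeline.

```cpp
#include <cassert>
#include <cstdint>

// Original shape: the addend is chosen first (the PHI of 7 and 11), then a
// single add consumes it, so the constant has to live in a register.
std::uint32_t add_via_phi(bool flag, std::uint32_t arg) {
  std::uint32_t p = flag ? 7u : 11u; // plays the role of %p
  return arg + p;                    // plays the role of %sum
}

// Speculated shape: the add is duplicated onto each incoming edge, so each
// copy can use its constant as an immediate operand.
std::uint32_t add_speculated(bool flag, std::uint32_t arg) {
  if (flag)
    return arg + 7u; // plays the role of %sum.1
  return arg + 11u;  // plays the role of %sum.2
}

// The transform must not change results: both shapes agree on every input
// tried here, including one that wraps around.
void check_add_equivalence() {
  const std::uint32_t samples[] = {0u, 1u, 42u, 0xFFFFFFFFu};
  for (std::uint32_t arg : samples) {
    assert(add_via_phi(true, arg) == add_speculated(true, arg));
    assert(add_via_phi(false, arg) == add_speculated(false, arg));
  }
}
```

Nothing is simplified at the source or IR level by the second form. As the comment says, the only gain is in how the target encodes the constants.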
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
@@ -132,6 +132,7 @@
 #include "llvm/Transforms/Scalar/SimpleLoopUnswitch.h"
 #include "llvm/Transforms/Scalar/SimplifyCFG.h"
 #include "llvm/Transforms/Scalar/Sink.h"
+#include "llvm/Transforms/Scalar/SpeculateAroundPHIs.h"
 #include "llvm/Transforms/Scalar/SpeculativeExecution.h"
 #include "llvm/Transforms/Scalar/TailRecursionElimination.h"
 #include "llvm/Transforms/Utils/AddDiscriminators.h"
@@ -799,6 +800,11 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
   // resulted in single-entry-single-exit or empty blocks. Clean up the CFG.
   OptimizePM.addPass(SimplifyCFGPass());
 
+  // Optimize PHIs by speculating around them when profitable. Note that this
+  // pass needs to be run after any PRE or similar pass as it is essentially
+  // inserting redundancies into the program. This even includes SimplifyCFG.
+  OptimizePM.addPass(SpeculateAroundPHIsPass());
+
   // Add the core optimizing pipeline.
   MPM.addPass(createModuleToFunctionPassAdaptor(std::move(OptimizePM)));
 

diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
@@ -199,6 +199,7 @@ FUNCTION_PASS("simplify-cfg", SimplifyCFGPass())
 FUNCTION_PASS("sink", SinkingPass())
 FUNCTION_PASS("slp-vectorizer", SLPVectorizerPass())
 FUNCTION_PASS("speculative-execution", SpeculativeExecutionPass())
+FUNCTION_PASS("spec-phis", SpeculateAroundPHIsPass())
 FUNCTION_PASS("sroa", SROA())
 FUNCTION_PASS("tailcallelim", TailCallElimPass())
 FUNCTION_PASS("unreachableblockelim", UnreachableBlockElimPass())

diff --git a/llvm/lib/Transforms/Scalar/CMakeLists.txt b/llvm/lib/Transforms/Scalar/CMakeLists.txt
@@ -62,6 +62,7 @@ add_llvm_library(LLVMScalarOpts
   SimplifyCFGPass.cpp
   Sink.cpp
   SpeculativeExecution.cpp
+  SpeculateAroundPHIs.cpp
   StraightLineStrengthReduce.cpp
   StructurizeCFG.cpp
   TailRecursionElimination.cpp