ext-tsp basic block layout
A new basic block ordering that improves on the existing MachineBlockPlacement.

The algorithm tries to find a layout of nodes (basic blocks) of a given CFG
optimizing jump locality and thus processor I-cache utilization. This is
achieved via increasing the number of fall-through jumps and co-locating
frequently executed nodes together. The name follows the underlying
optimization problem, Extended-TSP, which is a generalization of the classical
(maximum) Traveling Salesman Problem.

The algorithm is a greedy heuristic that works with chains (ordered lists)
of basic blocks. Initially all chains are isolated basic blocks. On every
iteration, we pick a pair of chains whose merging yields the biggest increase
in the ExtTSP value, which models how I-cache "friendly" a specific chain is.
A pair of chains giving the maximum gain is merged into a new chain. The
procedure stops when there is only one chain left, or when merging does not
increase ExtTSP. In the latter case, the remaining chains are sorted by
density in decreasing order.
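
The greedy loop described above can be sketched in a few lines. This is a simplified illustration, not the code from this commit: the gain of a merge is approximated by the count of the single jump that becomes a fall-through when one chain is appended to another, chains are only concatenated (never split), and node 0 is assumed to be the entry block and kept first. The names `greedyLayoutSketch`, `concatGain`, `density`, `Chain`, and `EdgeMap` are hypothetical, and `std::map` stands in for `DenseMap` so that the sketch builds without LLVM.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Chain = std::vector<uint64_t>;
using EdgeMap = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Simplified stand-in for the ExtTSP gain (an assumption of this sketch):
// the count of the jump that becomes a fall-through when Y is appended to X.
static uint64_t concatGain(const Chain &X, const Chain &Y,
                           const EdgeMap &Edges) {
  auto It = Edges.find({X.back(), Y.front()});
  return It == Edges.end() ? 0 : It->second;
}

// Execution density of a chain: total execution count per byte.
static double density(const Chain &C, const std::vector<uint64_t> &Sizes,
                      const std::vector<uint64_t> &Counts) {
  double Count = 0, Size = 0;
  for (uint64_t B : C) {
    Count += Counts[B];
    Size += Sizes[B];
  }
  return Size > 0 ? Count / Size : 0.0;
}

std::vector<uint64_t> greedyLayoutSketch(const std::vector<uint64_t> &Sizes,
                                         const std::vector<uint64_t> &Counts,
                                         const EdgeMap &Edges) {
  // Initially every basic block is its own chain.
  std::vector<Chain> Chains;
  for (uint64_t I = 0; I < Sizes.size(); ++I)
    Chains.push_back(Chain{I});

  while (Chains.size() > 1) {
    uint64_t BestGain = 0;
    size_t BestX = 0, BestY = 0;
    for (size_t X = 0; X < Chains.size(); ++X) {
      for (size_t Y = 0; Y < Chains.size(); ++Y) {
        // Never append the entry chain to another chain (sketch assumption).
        if (X == Y || Chains[Y].front() == 0)
          continue;
        uint64_t Gain = concatGain(Chains[X], Chains[Y], Edges);
        if (Gain > BestGain) {
          BestGain = Gain;
          BestX = X;
          BestY = Y;
        }
      }
    }
    // Stop when no merge improves the score.
    if (BestGain == 0)
      break;
    Chains[BestX].insert(Chains[BestX].end(), Chains[BestY].begin(),
                         Chains[BestY].end());
    Chains.erase(Chains.begin() + BestY);
  }

  // Order the remaining chains by decreasing density, entry chain first.
  std::stable_sort(Chains.begin(), Chains.end(),
                   [&](const Chain &A, const Chain &B) {
                     bool AEntry = A.front() == 0, BEntry = B.front() == 0;
                     if (AEntry != BEntry)
                       return AEntry;
                     return density(A, Sizes, Counts) >
                            density(B, Sizes, Counts);
                   });

  std::vector<uint64_t> Order;
  for (const Chain &C : Chains)
    Order.insert(Order.end(), C.begin(), C.end());
  return Order;
}
```

On a diamond-shaped CFG whose hot path is 0, 1, 3, the sketch first chains the hot blocks together and then places the cold block 2 last.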

An important aspect is the way two chains are merged. Unlike earlier
algorithms (e.g., based on the approach of Pettis-Hansen), two
chains, X and Y, are first split into three, X1, X2, and Y. Then we
consider all possible ways of gluing the three chains (e.g., X1YX2, X1X2Y,
X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score.
This improves the quality of the final result (the search space is larger)
while keeping the implementation sufficiently fast.
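
To make the split-and-glue step concrete, the sketch below enumerates the six candidate orders for one split point of X. It is only an illustration under stated assumptions: `mergeCandidates` and `glue` are hypothetical names, chains are plain vectors of node indices, and the caller is assumed to pass a split position with `0 < Split < X.size()`. An implementation of the approach described above would presumably score every candidate for every split point and keep the best one.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Chain = std::vector<uint64_t>;

// Concatenate three chains in the given order.
static Chain glue(const Chain &A, const Chain &B, const Chain &C) {
  Chain R;
  R.reserve(A.size() + B.size() + C.size());
  R.insert(R.end(), A.begin(), A.end());
  R.insert(R.end(), B.begin(), B.end());
  R.insert(R.end(), C.begin(), C.end());
  return R;
}

// Split X into X1 = X[0, Split) and X2 = X[Split, end), then return the six
// ways of gluing X1, X2, and Y, in the order listed in the commit message:
// X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1.
std::vector<Chain> mergeCandidates(const Chain &X, const Chain &Y,
                                   size_t Split) {
  Chain X1(X.begin(), X.begin() + Split);
  Chain X2(X.begin() + Split, X.end());
  return {glue(X1, Y, X2), glue(X1, X2, Y), glue(X2, X1, Y),
          glue(X2, Y, X1), glue(Y, X1, X2), glue(Y, X2, X1)};
}
```

Note that X1X2Y and YX1X2 are the plain concatenations that a merge without splitting would consider; the other four orders are what enlarges the search space.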

Differential Revision: https://reviews.llvm.org/D113424
spupyrev committed Dec 7, 2021
1 parent 976a74d commit f573f68
Showing 6 changed files with 1,867 additions and 1 deletion.
58 changes: 58 additions & 0 deletions llvm/include/llvm/Transforms/Utils/CodeLayout.h
@@ -0,0 +1,58 @@
//===- CodeLayout.h - Code layout/placement algorithms ---------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
/// \file
/// Declares methods and data structures for code layout algorithms.
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_TRANSFORMS_UTILS_CODELAYOUT_H
#define LLVM_TRANSFORMS_UTILS_CODELAYOUT_H

#include "llvm/ADT/DenseMap.h"

#include <vector>

namespace llvm {

class MachineBasicBlock;

/// Find a layout of nodes (basic blocks) of a given CFG optimizing jump
/// locality and thus processor I-cache utilization. This is achieved via
/// increasing the number of fall-through jumps and co-locating frequently
/// executed nodes together.
/// The nodes are assumed to be indexed by integers from [0, |V|) so that the
/// current order is the identity permutation.
/// \p NodeSizes: The sizes of the nodes (in bytes).
/// \p NodeCounts: The execution counts of the nodes in the profile.
/// \p EdgeCounts: The execution counts of every edge (jump) in the profile. The
/// map also defines the edges in CFG and should include 0-count edges.
/// \returns The best block order found.
std::vector<uint64_t> applyExtTspLayout(
const std::vector<uint64_t> &NodeSizes,
const std::vector<uint64_t> &NodeCounts,
const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

/// Estimate the "quality" of a given node order in CFG. The higher the score,
/// the better the order is. The score is designed to reflect the locality of
/// the given order, which is anti-correlated with the number of I-cache misses
/// in a typical execution of the function.
double calcExtTspScore(
const std::vector<uint64_t> &Order, const std::vector<uint64_t> &NodeSizes,
const std::vector<uint64_t> &NodeCounts,
const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

/// Estimate the "quality" of the current node order in CFG.
double calcExtTspScore(
const std::vector<uint64_t> &NodeSizes,
const std::vector<uint64_t> &NodeCounts,
const DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> &EdgeCounts);

} // end namespace llvm

#endif // LLVM_TRANSFORMS_UTILS_CODELAYOUT_H
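
As a rough illustration of the input format shared by the declarations above, and of what a locality score over a node order could look like, here is a standalone sketch. It is not the implementation added in CodeLayout.cpp: the function name `localityScoreSketch`, the three constants, and the linear distance discount are assumptions chosen for illustration, and `std::map` stands in for `DenseMap` so the example builds without LLVM. Node counts are omitted because this simplified score only needs sizes and edge counts.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using EdgeMap = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Illustrative constants (assumptions of this sketch, not taken from the
// commit): fall-throughs count fully, short jumps count a little.
constexpr double FallthroughWeight = 1.0;
constexpr double JumpWeight = 0.1;
constexpr double MaxJumpDistance = 1024.0; // bytes

// Score a node order: higher means more fall-throughs and shorter jumps.
double localityScoreSketch(const std::vector<uint64_t> &Order,
                           const std::vector<uint64_t> &NodeSizes,
                           const EdgeMap &EdgeCounts) {
  // Compute the start address of every node in the given order.
  std::vector<uint64_t> Addr(NodeSizes.size(), 0);
  uint64_t Cur = 0;
  for (uint64_t Node : Order) {
    Addr[Node] = Cur;
    Cur += NodeSizes[Node];
  }

  double Score = 0.0;
  for (const auto &E : EdgeCounts) {
    uint64_t Src = E.first.first, Dst = E.first.second;
    uint64_t Count = E.second;
    uint64_t SrcEnd = Addr[Src] + NodeSizes[Src];
    uint64_t DstBegin = Addr[Dst];
    if (SrcEnd == DstBegin) {
      // The jump is a fall-through in this order.
      Score += FallthroughWeight * Count;
      continue;
    }
    // Otherwise discount the jump linearly by its length in bytes.
    double Dist =
        SrcEnd < DstBegin ? DstBegin - SrcEnd : SrcEnd - DstBegin;
    if (Dist < MaxJumpDistance)
      Score += JumpWeight * (1.0 - Dist / MaxJumpDistance) * Count;
  }
  return Score;
}
```

With four 4-byte nodes and edges 0→1 and 1→3 executed 90 times each, the order 0, 1, 3, 2 turns both hot jumps into fall-throughs and therefore scores higher than the identity order under this sketch.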
161 changes: 160 additions & 1 deletion llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -61,6 +61,7 @@
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/Utils/CodeLayout.h"
#include <algorithm>
#include <cassert>
#include <cstdint>
@@ -193,6 +194,11 @@ static cl::opt<unsigned> TriangleChainCount(
cl::init(2),
cl::Hidden);

static cl::opt<bool> EnableExtTspBlockPlacement(
"enable-ext-tsp-block-placement", cl::Hidden, cl::init(false),
cl::desc("Enable machine block placement based on the ext-tsp model, "
"optimizing I-cache utilization."));

namespace llvm {
extern cl::opt<unsigned> StaticLikelyProb;
extern cl::opt<unsigned> ProfileLikelyProb;
@@ -557,6 +563,15 @@ class MachineBlockPlacement : public MachineFunctionPass {
/// but a local analysis would not find them.
void precomputeTriangleChains();

/// Apply a post-processing step optimizing block placement.
void applyExtTsp();

/// Modify the existing block placement in the function and adjust all jumps.
void assignBlockOrder(const std::vector<const MachineBasicBlock *> &NewOrder);

/// Create a single CFG chain from the current block order.
void createCFGChainExtTsp();

public:
static char ID; // Pass identification, replacement for typeid

@@ -3387,6 +3402,15 @@ bool MachineBlockPlacement::runOnMachineFunction(MachineFunction &MF) {
}
}

// Apply a post-processing step optimizing block placement.
if (MF.size() >= 3 && EnableExtTspBlockPlacement) {
// Find a new placement and modify the layout of the blocks in the function.
applyExtTsp();

// Re-create CFG chain so that we can optimizeBranches and alignBlocks.
createCFGChainExtTsp();
}

optimizeBranches();
alignBlocks();

@@ -3413,12 +3437,147 @@ bool MachineBlockPlacement::runOnMachineFunction(MachineFunction &MF) {
MBFI->view("MBP." + MF.getName(), false);
}


// We always return true as we have no way to track whether the final order
// differs from the original order.
return true;
}

void MachineBlockPlacement::applyExtTsp() {
// Prepare data; blocks are indexed by their index in the current ordering.
DenseMap<const MachineBasicBlock *, uint64_t> BlockIndex;
BlockIndex.reserve(F->size());
std::vector<const MachineBasicBlock *> CurrentBlockOrder;
CurrentBlockOrder.reserve(F->size());
size_t NumBlocks = 0;
for (const MachineBasicBlock &MBB : *F) {
BlockIndex[&MBB] = NumBlocks++;
CurrentBlockOrder.push_back(&MBB);
}

auto BlockSizes = std::vector<uint64_t>(F->size());
auto BlockCounts = std::vector<uint64_t>(F->size());
DenseMap<std::pair<uint64_t, uint64_t>, uint64_t> JumpCounts;
for (MachineBasicBlock &MBB : *F) {
// Getting the block frequency.
BlockFrequency BlockFreq = MBFI->getBlockFreq(&MBB);
BlockCounts[BlockIndex[&MBB]] = BlockFreq.getFrequency();
// Getting the block size:
// - approximate the size of an instruction by 4 bytes, and
// - ignore debug instructions.
// Note: getting the exact size of each block is target-dependent and can be
// done by extending the interface of MCCodeEmitter. Experimentally we do
// not see a perf improvement with the exact block sizes.
auto NonDbgInsts =
instructionsWithoutDebug(MBB.instr_begin(), MBB.instr_end());
int NumInsts = std::distance(NonDbgInsts.begin(), NonDbgInsts.end());
BlockSizes[BlockIndex[&MBB]] = 4 * NumInsts;
// Getting jump frequencies.
for (MachineBasicBlock *Succ : MBB.successors()) {
auto EP = MBPI->getEdgeProbability(&MBB, Succ);
BlockFrequency EdgeFreq = BlockFreq * EP;
auto Edge = std::make_pair(BlockIndex[&MBB], BlockIndex[Succ]);
JumpCounts[Edge] = EdgeFreq.getFrequency();
}
}

LLVM_DEBUG(dbgs() << "Applying ext-tsp layout for |V| = " << F->size()
<< " with profile = " << F->getFunction().hasProfileData()
<< " (" << F->getName().str() << ")"
<< "\n");
LLVM_DEBUG(
dbgs() << format(" original layout score: %0.2f\n",
calcExtTspScore(BlockSizes, BlockCounts, JumpCounts)));

// Run the layout algorithm.
auto NewOrder = applyExtTspLayout(BlockSizes, BlockCounts, JumpCounts);
std::vector<const MachineBasicBlock *> NewBlockOrder;
NewBlockOrder.reserve(F->size());
for (uint64_t Node : NewOrder) {
NewBlockOrder.push_back(CurrentBlockOrder[Node]);
}
LLVM_DEBUG(dbgs() << format(" optimized layout score: %0.2f\n",
calcExtTspScore(NewOrder, BlockSizes, BlockCounts,
JumpCounts)));

// Assign new block order.
assignBlockOrder(NewBlockOrder);
}

void MachineBlockPlacement::assignBlockOrder(
const std::vector<const MachineBasicBlock *> &NewBlockOrder) {
assert(F->size() == NewBlockOrder.size() && "Incorrect size of block order");
F->RenumberBlocks();

bool HasChanges = false;
for (size_t I = 0; I < NewBlockOrder.size(); I++) {
if (NewBlockOrder[I] != F->getBlockNumbered(I)) {
HasChanges = true;
break;
}
}
// Stop early if the new block order is identical to the existing one.
if (!HasChanges)
return;

SmallVector<MachineBasicBlock *, 4> PrevFallThroughs(F->getNumBlockIDs());
for (auto &MBB : *F) {
PrevFallThroughs[MBB.getNumber()] = MBB.getFallThrough();
}

// Sort basic blocks in the function according to the computed order.
DenseMap<const MachineBasicBlock *, size_t> NewIndex;
for (const MachineBasicBlock *MBB : NewBlockOrder) {
NewIndex[MBB] = NewIndex.size();
}
F->sort([&](MachineBasicBlock &L, MachineBasicBlock &R) {
return NewIndex[&L] < NewIndex[&R];
});

// Update basic block branches by inserting explicit fallthrough branches
// when required and re-optimize branches when possible.
const TargetInstrInfo *TII = F->getSubtarget().getInstrInfo();
SmallVector<MachineOperand, 4> Cond;
for (auto &MBB : *F) {
MachineFunction::iterator NextMBB = std::next(MBB.getIterator());
MachineFunction::iterator EndIt = MBB.getParent()->end();
auto *FTMBB = PrevFallThroughs[MBB.getNumber()];
// If this block had a fallthrough before we need an explicit unconditional
// branch to that block if the fallthrough block is not adjacent to the
// block in the new order.
if (FTMBB && (NextMBB == EndIt || &*NextMBB != FTMBB)) {
TII->insertUnconditionalBranch(MBB, FTMBB, MBB.findBranchDebugLoc());
}

// It might be possible to optimize branches by flipping the condition.
Cond.clear();
MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
if (TII->analyzeBranch(MBB, TBB, FBB, Cond))
continue;
MBB.updateTerminator(FTMBB);
}

#ifndef NDEBUG
// Make sure we correctly constructed all branches.
F->verify(this, "After optimized block reordering");
#endif
}
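
The fall-through repair performed by assignBlockOrder can be illustrated without any LLVM types. In the sketch below, blocks are plain integer ids and -1 stands for "no fall-through"; both conventions, as well as the name `blocksNeedingExplicitBranch`, are hypothetical choices made for this example. It only reports which blocks would need an explicit unconditional branch after reordering, mirroring the condition `FTMBB && (NextMBB == EndIt || &*NextMBB != FTMBB)` above; it does not insert or optimize branches.

```cpp
#include <cstddef>
#include <vector>

// PrevFallThrough[B] is the block that B fell through to in the old layout,
// or -1 if B had no fall-through successor.
std::vector<int>
blocksNeedingExplicitBranch(const std::vector<int> &NewOrder,
                            const std::vector<int> &PrevFallThrough) {
  std::vector<int> Result;
  for (size_t I = 0; I < NewOrder.size(); ++I) {
    int Block = NewOrder[I];
    int FT = PrevFallThrough[Block];
    if (FT < 0)
      continue; // The block never relied on a fall-through.
    bool IsLast = I + 1 == NewOrder.size();
    // The old fall-through target is no longer the next block in the layout.
    if (IsLast || NewOrder[I + 1] != FT)
      Result.push_back(Block);
  }
  return Result;
}
```

For example, if block 2 used to fall through to block 3 and the new order is 0, 1, 3, 2, then block 2 ends up last in the layout and is the only block reported.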

void MachineBlockPlacement::createCFGChainExtTsp() {
BlockToChain.clear();
ComputedEdges.clear();
ChainAllocator.DestroyAll();

MachineBasicBlock *HeadBB = &F->front();
BlockChain *FunctionChain =
new (ChainAllocator.Allocate()) BlockChain(BlockToChain, HeadBB);

for (MachineBasicBlock &MBB : *F) {
if (HeadBB == &MBB)
continue; // Ignore head of the chain
FunctionChain->merge(&MBB, nullptr);
}
}

namespace {

/// A pass to compute block placement statistics.
1 change: 1 addition & 0 deletions llvm/lib/Transforms/Utils/CMakeLists.txt
@@ -14,6 +14,7 @@ add_llvm_component_library(LLVMTransformUtils
CloneFunction.cpp
CloneModule.cpp
CodeExtractor.cpp
CodeLayout.cpp
CodeMoverUtils.cpp
CtorUtils.cpp
Debugify.cpp