Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Add AMDGPU-specific module splitting #89245

Merged
merged 10 commits into from
May 23, 2024

Conversation

Pierre-vh
Copy link
Contributor

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

@llvmbot llvmbot added backend:AMDGPU LTO Link time optimization (regular/full LTO or ThinLTO) labels Apr 18, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 18, 2024

@llvm/pr-subscribers-lto

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.


Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff

22 Files Affected:

  • (modified) llvm/include/llvm/Target/TargetMachine.h (+12)
  • (modified) llvm/lib/LTO/LTOBackend.cpp (+9-4)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46)
  • (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39)
  • (added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98)
  • (added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2)
  • (modified) llvm/tools/llvm-split/CMakeLists.txt (+7)
  • (modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25)
diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index ceb371bdc73480..48ea3cfe02775b 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -418,6 +418,18 @@ class TargetMachine {
   virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const {
     return 0;
   }
+
+  /// Entry point for module splitting. Targets can implement custom module
+  /// splitting logic, mainly used by LTO for --lto-partitions.
+  ///
+  /// \returns `true` if the module was split, `false` otherwise. When  `false`
+  /// is returned, it is assumed that \p ModuleCallback has never been called
+  /// and \p M has not been modified.
+  virtual bool splitModule(
+      Module &M, unsigned NumParts,
+      function_ref<void(std::unique_ptr<Module> MPart)> ModuleCallback) const {
+    return false;
+  }
 };
 
 /// This class describes a target machine that is implemented with the LLVM
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 71e8849dc3cc91..d4b89ede2d7134 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
   unsigned ThreadCount = 0;
   const Target *T = &TM->getTarget();
 
-  SplitModule(
-      Mod, ParallelCodeGenParallelismLevel,
+  const auto HandleModulePartition =
       [&](std::unique_ptr<Module> MPart) {
         // We want to clone the module in a new context to multi-thread the
         // codegen. We do it by serializing partition modules to bitcode
@@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
             // Pass BC using std::move to ensure that it get moved rather than
             // copied into the thread's context.
             std::move(BC), ThreadCount++);
-      },
-      false);
+      };
+
+  // Try target-specific module splitting first, then fallback to the default.
+  if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel,
+                       HandleModulePartition)) {
+    SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition,
+                false);
+  }
 
   // Because the inner lambda (which runs in a worker thread) captures our local
   // variables, we need to wait for the worker threads to terminate before we
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
new file mode 100644
index 00000000000000..fa47d494f04148
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
@@ -0,0 +1,733 @@
+//===- AMDGPUSplitModule.cpp ----------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file Implements a module splitting algorithm designed to support the
+/// FullLTO --lto-partitions option for parallel codegen. This is completely
+/// different from the common SplitModule pass, as this system is designed with
+/// AMDGPU in mind.
+///
+/// The basic idea of this module splitting implementation is the same as
+/// SplitModule: load-balance the module's functions across a set of N
+/// partitions to allow parallel codegen. However, it does it very
+/// differently than the target-agnostic variant:
+///   - Kernels are used as the module's "roots".
+///     They're known entry points on AMDGPU, and everything else is often
+///     internal only.
+///   - Each kernel has a set of dependencies, and when a kernel and its
+///     dependencies is considered "big", we try to put it in a partition where
+///     most dependencies are already imported, to avoid duplicating large
+///     amounts of code.
+///   - There's special care for indirect calls in order to ensure
+///     AMDGPUResourceUsageAnalysis can work correctly.
+///
+/// This file also includes a more elaborate logging system to enable
+/// users to easily generate logs that (if desired) do not include any value
+/// names, in order to not leak information about the source file.
+/// Such logs are very helpful to understand and fix potential issues with
+/// module splitting.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUSplitModule.h"
+#include "AMDGPUTargetMachine.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/User.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/Process.h"
+#include "llvm/Support/SHA256.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+#include <algorithm>
+#include <cassert>
+#include <iterator>
+#include <memory>
+#include <utility>
+#include <vector>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-split-module"
+
+namespace {
+
+static cl::opt<float> LargeKernelFactor(
+    "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden,
+    cl::desc(
+        "consider a kernel as large and needing special treatment when it "
+        "exceeds the average cost of a partition by this factor; e;g. 2.0 "
+        "means if the kernel and its dependencies is 2 times bigger than "
+        "an average partition; 0 disables large kernels handling entirely"));
+
+static cl::opt<float> LargeKernelOverlapForMerge(
+    "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8),
+    cl::Hidden,
+    cl::desc("defines how much overlap between two large kernel's dependencies "
+             "is needed to put them in the same partition"));
+
+static cl::opt<bool> NoExternalizeGlobals(
+    "amdgpu-module-splitting-no-externalize-globals", cl::Hidden,
+    cl::desc("disables externalization of global variable with local linkage; "
+             "may cause globals to be duplicated which increases binary size"));
+
+static cl::opt<std::string>
+    LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden,
+              cl::desc("output directory for AMDGPU module splitting logs"));
+
+static cl::opt<bool>
+    LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden,
+               cl::desc("hash value names before printing them in the AMDGPU "
+                        "module splitting logs"));
+
+using CostType = InstructionCost::CostType;
+using PartitionID = unsigned;
+
+static std::string getName(const Value &V) {
+  static std::optional<bool> HideNames;
+  if (!HideNames) {
+    if (LogPrivate.getNumOccurrences())
+      HideNames = LogPrivate;
+    else {
+      const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE");
+      HideNames = (EV.value_or("0") != "0");
+    }
+  }
+
+  if (!*HideNames)
+    return V.getName().str();
+  return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())),
+               /*LowerCase*/ true);
+}
+
+/// Main logging helper.
+///
+/// Logging can be configured by the following environment variable.
+///   AMD_SPLIT_MODULE_LOG_DIR=<filepath>
+///     If set, uses <filepath> as the directory to write logfiles to
+///     each time module splitting is used.
+///   AMD_SPLIT_MODULE_LOG_PRIVATE
+///     If set to anything other than zero, all names are hidden.
+///
+/// Both environment variables have corresponding CL options which
+/// takes priority over them.
+///
+/// Any output printed to the log files is also printed to dbgs() when -debug is
+/// used and LLVM_DEBUG is defined.
+///
+/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic
+/// cannot be removed from the code (by building without debug). This probably
+/// has a small performance cost because if some computation/formatting is
+/// needed for logging purpose, it may be done everytime only to be ignored
+/// by the logger.
+///
+/// As this pass only runs once and is not doing anything computationally
+/// expensive, this is likely a reasonable trade-off.
+///
+/// If some computation should really be avoided when unused, users of the class
+/// can check whether any logging will occur by using the bool operator.
+///
+/// \code
+///   if (SML) {
+///     // Executes only if logging to a file or if -debug is available and
+///     used.
+///   }
+/// \endcode
+class SplitModuleLogger {
+public:
+  SplitModuleLogger(const Module &M) {
+    std::string LogDir = LogDirOpt;
+    if (LogDir.empty())
+      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
+
+    // No log dir specified means we don't need to log to a file.
+    // We may still log to dbgs(), though.
+    if (LogDir.empty())
+      return;
+
+    if (!sys::fs::is_directory(LogDir)) {
+      report_fatal_error("invalid AMDGPU split module log directory: '" +
+                             Twine(LogDir) + "' is not a directory",
+                         /*CrashDiag=*/false);
+    }
+
+    // If a log directory is specified, create a new file with a unique name in
+    // that directory.
+    SmallString<0> FilePath;
+    int Fd;
+    std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt");
+    if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) {
+      dbgs() << LogFile << "\n";
+      std::string Msg =
+          "Failed to create log file at '" + LogDir + "': " + Err.message();
+      report_fatal_error(StringRef(Msg),
+                         /*CrashDiag=*/false);
+    }
+
+    FileOS = std::make_unique<raw_fd_ostream>(Fd, /*shouldClose*/ true);
+  }
+
+  bool hasLogFile() const { return FileOS != nullptr; }
+
+  raw_ostream &logfile() {
+    assert(FileOS && "no logfile!");
+    return *FileOS;
+  }
+
+  /// \returns true if this SML will log anything either to a file or dbgs().
+  /// Can be used to avoid expensive computations that are ignored when logging
+  /// is disabled.
+  operator bool() const {
+    return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE));
+  }
+
+private:
+  std::unique_ptr<raw_fd_ostream> FileOS;
+};
+
+template <typename Ty>
+static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) {
+  static_assert(
+      !std::is_same_v<Ty, Value>,
+      "do not print values to logs directly, use handleName instead!");
+  LLVM_DEBUG(dbgs() << Val);
+  if (SML.hasLogFile())
+    SML.logfile() << Val;
+  return SML;
+}
+
+/// Calculate the cost of each function in \p M
+/// \param SML Log Helper
+/// \param TM TargetMachine instance used to retrieve TargetTransformInfo.
+/// \param M Module to analyze.
+/// \param CostMap[out] Resulting Function -> Cost map.
+/// \return The module's total cost.
+static CostType
+calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM,
+                       Module &M,
+                       DenseMap<const Function *, CostType> &CostMap) {
+  CostType ModuleCost = 0;
+  CostType KernelCost = 0;
+
+  for (auto &Fn : M) {
+    if (Fn.isDeclaration())
+      continue;
+
+    CostType FnCost = 0;
+    auto TTI = TM.getTargetTransformInfo(Fn);
+
+    for (auto &BB : Fn) {
+      for (auto &I : BB) {
+        auto Cost =
+            TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
+        assert(Cost != InstructionCost::getMax());
+        // Assume expensive if we can't tell the cost of an instruction.
+        CostType CostVal =
+            Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive);
+        assert((FnCost + CostVal) >= FnCost && "Overflow!");
+        FnCost += CostVal;
+      }
+    }
+
+    assert(FnCost != 0);
+
+    CostMap[&Fn] = FnCost;
+    assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!");
+    ModuleCost += FnCost;
+
+    if (AMDGPU::isKernelCC(&Fn))
+      KernelCost += FnCost;
+  }
+
+  CostType FnCost = (ModuleCost - KernelCost);
+  SML << "=> Total Module Cost: " << ModuleCost << "\n"
+      << "  => KernelCost: " << KernelCost << " ("
+      << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n"
+      << "  => FnsCost: " << FnCost << " ("
+      << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n";
+
+  return ModuleCost;
+}
+
+/// When a kernel or any of its callees performs an indirect call, this function
+/// takes over \ref addAllDependencies and adds all potentially callable
+/// functions to \p Fns so they can be counted as dependencies of the kernel.
+///
+/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the
+/// presence of an indirect call, the function's resource usage is the same as
+/// the most expensive function in the module.
+/// \param M    The module.
+/// \param Fns[out] Resulting list of functions.
+static void addAllIndirectCallDependencies(const Module &M,
+                                           DenseSet<const Function *> &Fns) {
+  for (const auto &Fn : M) {
+    if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv()))
+      Fns.insert(&Fn);
+  }
+}
+
+/// Adds the functions that \p Fn may call to \p Fns, then recurses into each
+/// callee until all reachable functions have been gathered.
+///
+/// \param SML Log Helper
+/// \param CG Call graph for \p Fn's module.
+/// \param Fn Current function to look at.
+/// \param Fns[out] Resulting list of functions.
+/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some
+/// point, either in \p Fn or in one of the function it calls. When that
+/// happens, we fall back to adding all callable functions inside \p Fn's module
+/// to \p Fns.
+/// \param HadExternalCall[out] Set to true if a call to an external function
+/// was seen at some point, either in \p Fn or in one of the function it calls
+static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG,
+                               const Function &Fn,
+                               DenseSet<const Function *> &Fns,
+                               bool &HadIndirectCall, bool &HadExternalCall) {
+  assert(!Fn.isDeclaration());
+
+  const Module &M = *Fn.getParent();
+  SmallVector<const Function *> WorkList({&Fn});
+  while (!WorkList.empty()) {
+    const auto &CurFn = *WorkList.pop_back_val();
+
+    // Scan for an indirect call. If such a call is found, we have to
+    // conservatively assume this can call all non-entrypoint functions in the
+    // module.
+    for (const auto &BB : CurFn) {
+      for (const auto &I : BB) {
+        const auto *CB = dyn_cast<CallBase>(&I);
+        if (!CB || !CB->isIndirectCall())
+          continue;
+
+        SML << "Indirect call detected in " << getName(CurFn)
+            << " - treating all non-entrypoint functions as "
+               "potential dependencies\n";
+
+        // TODO: Print an ORE as well ?
+        addAllIndirectCallDependencies(M, Fns);
+        HadIndirectCall = true;
+        return;
+      }
+    }
+
+    for (auto &CGEntry : *CG[&CurFn]) {
+      auto *Callee = CGEntry.second->getFunction();
+      if (!Callee)
+        continue;
+
+      assert(!AMDGPU::isKernelCC(Callee));
+
+      if (Callee->isDeclaration())
+        continue;
+
+      if (Callee->hasExternalLinkage())
+        HadExternalCall = true;
+
+      auto [It, Inserted] = Fns.insert(Callee);
+      if (Inserted)
+        WorkList.push_back(Callee);
+    }
+  }
+}
+
+/// Contains information about a kernel and its dependencies.
+struct KernelWithDependencies {
+  KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG,
+                         const DenseMap<const Function *, CostType> &FnCosts,
+                         const Function *Fn)
+      : Fn(Fn) {
+    addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall,
+                       HasExternalCall);
+    TotalCost = FnCosts.at(Fn);
+    for (const auto *Dep : Dependencies)
+      TotalCost += FnCosts.at(Dep);
+  }
+
+  const Function *Fn = nullptr;
+  DenseSet<const Function *> Dependencies;
+  /// Whether \p Fn or any of its \ref Dependencies contains an indirect call.
+  bool HasIndirectCall = false;
+  /// Whether \p Fn or any of its \ref Dependencies contains a call to a
+  /// function with external linkage.
+  bool HasExternalCall = false;
+
+  CostType TotalCost = 0;
+
+  /// \returns true if this kernel and its dependencies can be considered large
+  /// according to \p Threshold.
+  bool isLarge(CostType Threshold) const {
+    return TotalCost > Threshold && !Dependencies.empty();
+  }
+};
+
+/// Calculates how much overlap there is between \p A and \p B.
+/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A
+/// and B have no shared elements. Kernels do not count in overlap calculation.
+static float calculateOverlap(const DenseSet<const Function *> &A,
+                              const DenseSet<const Function *> &B) {
+  DenseSet<const Function *> Total;
+  for (const auto *F : A) {
+    if (!AMDGPU::isKernelCC(F))
+      Total.insert(F);
+  }
+
+  if (Total.empty())
+    return 0.0f;
+
+  unsigned NumCommon = 0;
+  for (const auto *F : B) {
+    if (AMDGPU::isKernelCC(F))
+      continue;
+
+    auto [It, Inserted] = Total.insert(F);
+    if (!Inserted)
+      ++NumCommon;
+  }
+
+  return float(NumCommon) / Total.size();
+}
+
+/// Performs all of the partitioning work on \p M.
+/// \param SML Log Helper
+/// \param M Module to partition.
+/// \param NumParts Number of partitions to create.
+/// \param ModuleCost Total cost of all functions in \p M.
+/// \param FnCosts Map of Function -> Cost
+/// \param WorkList Kernels and their dependencies to process in order.
+/// \returns The created partitions (a vector of size \p NumParts )
+static std::vector<DenseSet<const Function *>>
+doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts,
+               CostType ModuleCost,
+               const DenseMap<const Function *, CostType> &FnCosts,
+               const SmallVector<KernelWithDependencies> &WorkList) {
+
+  SML << "\n--Partitioning Starts--\n";
+
+  // Calculate a "large kernel threshold". When more than one kernel's total
+  // import cost exceeds this value, we will try to merge it with other,
+  // similarly large kernels.
+  //
+  // e.g. let two kernels X and Y have a import cost of ~10% of the module, we
+  // assign X to a partition as usual, but when we get to Y, we check if it's
+  // worth also putting it in Y's partition.
+  const CostType LargeKernelThreshold =
+      LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor)
+                        : std::numeric_limits<CostType>::max();
+
+  std::vector<DenseSet<const Function *>> Partitions;
+  Partitions.resize(NumParts);
+
+  // Assign a partition to each kernel, and try to keep the partitions more or
+  // less balanced. We do that through a priority queue sorted in reverse, so we
+  // can always look at the partition with the least content.
+  //
+  // There are some cases where we will be deliberately unbalanced though.
+  //  - Large kernels: we try to merge with existing partitions to reduce code
+  //  duplication.
+  //  - Kernels with indirect or external calls always go in the first partition
+  //  (P0).
+  auto ComparePartitions = [](const std::pair<PartitionID, CostType> &a,
+                              const std::pair<PartitionID, CostType> &b) {
+    // When two partitions have the same cost, assign to the one with the
+    // biggest ID first. This allows us to put things in P0 last, because P0 may
+    // have other stuff added later.
+    if (a.second == b.second)
+      return a.first < b.first;
+    return a.second > b.second;
+  };
+
+  // We can't use priority_queue here because we need to be able to access any
+  // element. This makes this a bit inefficient as we need to sort it again
+  // everytime we change it, but it's a very small array anyway (likely under 64
+  // partitions) so it's a cheap operation.
+  std::vector<std::pair<PartitionID, CostType>> BalancingQueue;
+  for (unsigned I = 0; I < NumParts; ++I)
+    BalancingQueue.push_back(std::make_pair(I, 0));
+
+  // Helper function to...
[truncated]

@llvmbot
Copy link
Collaborator

llvmbot commented Apr 18, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.


Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff

22 Files Affected:

  • (modified) llvm/include/llvm/Target/TargetMachine.h (+12)
  • (modified) llvm/lib/LTO/LTOBackend.cpp (+9-4)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733)
  • (added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46)
  • (added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75)
  • (added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39)
  • (added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98)
  • (added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2)
  • (modified) llvm/tools/llvm-split/CMakeLists.txt (+7)
  • (modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25)
diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index ceb371bdc73480..48ea3cfe02775b 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -418,6 +418,18 @@ class TargetMachine {
   virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const {
     return 0;
   }
+
+  /// Entry point for module splitting. Targets can implement custom module
+  /// splitting logic, mainly used by LTO for --lto-partitions.
+  ///
+  /// \returns `true` if the module was split, `false` otherwise. When  `false`
+  /// is returned, it is assumed that \p ModuleCallback has never been called
+  /// and \p M has not been modified.
+  virtual bool splitModule(
+      Module &M, unsigned NumParts,
+      function_ref<void(std::unique_ptr<Module> MPart)> ModuleCallback) const {
+    return false;
+  }
 };
 
 /// This class describes a target machine that is implemented with the LLVM
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 71e8849dc3cc91..d4b89ede2d7134 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
   unsigned ThreadCount = 0;
   const Target *T = &TM->getTarget();
 
-  SplitModule(
-      Mod, ParallelCodeGenParallelismLevel,
+  const auto HandleModulePartition =
       [&](std::unique_ptr<Module> MPart) {
         // We want to clone the module in a new context to multi-thread the
         // codegen. We do it by serializing partition modules to bitcode
@@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
             // Pass BC using std::move to ensure that it get moved rather than
             // copied into the thread's context.
             std::move(BC), ThreadCount++);
-      },
-      false);
+      };
+
+  // Try target-specific module splitting first, then fallback to the default.
+  if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel,
+                       HandleModulePartition)) {
+    SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition,
+                false);
+  }
 
   // Because the inner lambda (which runs in a worker thread) captures our local
   // variables, we need to wait for the worker threads to terminate before we
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
new file mode 100644
index 00000000000000..fa47d494f04148
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
@@ -0,0 +1,733 @@
+//===- AMDGPUSplitModule.cpp ----------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file Implements a module splitting algorithm designed to support the
+/// FullLTO --lto-partitions option for parallel codegen. This is completely
+/// different from the common SplitModule pass, as this system is designed with
+/// AMDGPU in mind.
+///
+/// The basic idea of this module splitting implementation is the same as
+/// SplitModule: load-balance the module's functions across a set of N
+/// partitions to allow parallel codegen. However, it does it very
+/// differently than the target-agnostic variant:
+///   - Kernels are used as the module's "roots".
+///     They're known entry points on AMDGPU, and everything else is often
+///     internal only.
+///   - Each kernel has a set of dependencies, and when a kernel and its
+///     dependencies is considered "big", we try to put it in a partition where
+///     most dependencies are already imported, to avoid duplicating large
+///     amounts of code.
+///   - There's special care for indirect calls in order to ensure
+///     AMDGPUResourceUsageAnalysis can work correctly.
+///
+/// This file also includes a more elaborate logging system to enable
+/// users to easily generate logs that (if desired) do not include any value
+/// names, in order to not leak information about the source file.
+/// Such logs are very helpful to understand and fix potential issues with
+/// module splitting.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUSplitModule.h"
+#include "AMDGPUTargetMachine.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/User.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/Process.h"
+#include "llvm/Support/SHA256.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+#include <algorithm>
+#include <cassert>
+#include <iterator>
+#include <memory>
+#include <utility>
+#include <vector>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-split-module"
+
+namespace {
+
+static cl::opt<float> LargeKernelFactor(
+    "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden,
+    cl::desc(
+        "consider a kernel as large and needing special treatment when it "
+        "exceeds the average cost of a partition by this factor; e;g. 2.0 "
+        "means if the kernel and its dependencies is 2 times bigger than "
+        "an average partition; 0 disables large kernels handling entirely"));
+
+static cl::opt<float> LargeKernelOverlapForMerge(
+    "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8),
+    cl::Hidden,
+    cl::desc("defines how much overlap between two large kernel's dependencies "
+             "is needed to put them in the same partition"));
+
+static cl::opt<bool> NoExternalizeGlobals(
+    "amdgpu-module-splitting-no-externalize-globals", cl::Hidden,
+    cl::desc("disables externalization of global variable with local linkage; "
+             "may cause globals to be duplicated which increases binary size"));
+
+static cl::opt<std::string>
+    LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden,
+              cl::desc("output directory for AMDGPU module splitting logs"));
+
+static cl::opt<bool>
+    LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden,
+               cl::desc("hash value names before printing them in the AMDGPU "
+                        "module splitting logs"));
+
+using CostType = InstructionCost::CostType;
+using PartitionID = unsigned;
+
+static std::string getName(const Value &V) {
+  static std::optional<bool> HideNames;
+  if (!HideNames) {
+    if (LogPrivate.getNumOccurrences())
+      HideNames = LogPrivate;
+    else {
+      const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE");
+      HideNames = (EV.value_or("0") != "0");
+    }
+  }
+
+  if (!*HideNames)
+    return V.getName().str();
+  return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())),
+               /*LowerCase*/ true);
+}
+
+/// Main logging helper.
+///
+/// Logging can be configured by the following environment variable.
+///   AMD_SPLIT_MODULE_LOG_DIR=<filepath>
+///     If set, uses <filepath> as the directory to write logfiles to
+///     each time module splitting is used.
+///   AMD_SPLIT_MODULE_LOG_PRIVATE
+///     If set to anything other than zero, all names are hidden.
+///
+/// Both environment variables have corresponding CL options which
+/// takes priority over them.
+///
+/// Any output printed to the log files is also printed to dbgs() when -debug is
+/// used and LLVM_DEBUG is defined.
+///
+/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic
+/// cannot be removed from the code (by building without debug). This probably
+/// has a small performance cost because if some computation/formatting is
+/// needed for logging purpose, it may be done everytime only to be ignored
+/// by the logger.
+///
+/// As this pass only runs once and is not doing anything computationally
+/// expensive, this is likely a reasonable trade-off.
+///
+/// If some computation should really be avoided when unused, users of the class
+/// can check whether any logging will occur by using the bool operator.
+///
+/// \code
+///   if (SML) {
+///     // Executes only if logging to a file or if -debug is available and
+///     used.
+///   }
+/// \endcode
+class SplitModuleLogger {
+public:
+  SplitModuleLogger(const Module &M) {
+    std::string LogDir = LogDirOpt;
+    if (LogDir.empty())
+      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
+
+    // No log dir specified means we don't need to log to a file.
+    // We may still log to dbgs(), though.
+    if (LogDir.empty())
+      return;
+
+    if (!sys::fs::is_directory(LogDir)) {
+      report_fatal_error("invalid AMDGPU split module log directory: '" +
+                             Twine(LogDir) + "' is not a directory",
+                         /*CrashDiag=*/false);
+    }
+
+    // If a log directory is specified, create a new file with a unique name in
+    // that directory.
+    SmallString<0> FilePath;
+    int Fd;
+    std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt");
+    if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) {
+      dbgs() << LogFile << "\n";
+      std::string Msg =
+          "Failed to create log file at '" + LogDir + "': " + Err.message();
+      report_fatal_error(StringRef(Msg),
+                         /*CrashDiag=*/false);
+    }
+
+    FileOS = std::make_unique<raw_fd_ostream>(Fd, /*shouldClose*/ true);
+  }
+
+  bool hasLogFile() const { return FileOS != nullptr; }
+
+  raw_ostream &logfile() {
+    assert(FileOS && "no logfile!");
+    return *FileOS;
+  }
+
+  /// \returns true if this SML will log anything either to a file or dbgs().
+  /// Can be used to avoid expensive computations that are ignored when logging
+  /// is disabled.
+  operator bool() const {
+    return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE));
+  }
+
+private:
+  std::unique_ptr<raw_fd_ostream> FileOS;
+};
+
+template <typename Ty>
+static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) {
+  static_assert(
+      !std::is_same_v<Ty, Value>,
+      "do not print values to logs directly, use handleName instead!");
+  LLVM_DEBUG(dbgs() << Val);
+  if (SML.hasLogFile())
+    SML.logfile() << Val;
+  return SML;
+}
+
+/// Calculate the cost of each function in \p M
+/// \param SML Log Helper
+/// \param TM TargetMachine instance used to retrieve TargetTransformInfo.
+/// \param M Module to analyze.
+/// \param CostMap[out] Resulting Function -> Cost map.
+/// \return The module's total cost.
+static CostType
+calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM,
+                       Module &M,
+                       DenseMap<const Function *, CostType> &CostMap) {
+  CostType ModuleCost = 0;
+  CostType KernelCost = 0;
+
+  for (auto &Fn : M) {
+    if (Fn.isDeclaration())
+      continue;
+
+    CostType FnCost = 0;
+    auto TTI = TM.getTargetTransformInfo(Fn);
+
+    for (auto &BB : Fn) {
+      for (auto &I : BB) {
+        auto Cost =
+            TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
+        assert(Cost != InstructionCost::getMax());
+        // Assume expensive if we can't tell the cost of an instruction.
+        CostType CostVal =
+            Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive);
+        assert((FnCost + CostVal) >= FnCost && "Overflow!");
+        FnCost += CostVal;
+      }
+    }
+
+    assert(FnCost != 0);
+
+    CostMap[&Fn] = FnCost;
+    assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!");
+    ModuleCost += FnCost;
+
+    if (AMDGPU::isKernelCC(&Fn))
+      KernelCost += FnCost;
+  }
+
+  CostType FnCost = (ModuleCost - KernelCost);
+  SML << "=> Total Module Cost: " << ModuleCost << "\n"
+      << "  => KernelCost: " << KernelCost << " ("
+      << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n"
+      << "  => FnsCost: " << FnCost << " ("
+      << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n";
+
+  return ModuleCost;
+}
+
+/// When a kernel or any of its callees performs an indirect call, this function
+/// takes over \ref addAllDependencies and adds all potentially callable
+/// functions to \p Fns so they can be counted as dependencies of the kernel.
+///
+/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the
+/// presence of an indirect call, the function's resource usage is the same as
+/// the most expensive function in the module.
+/// \param M    The module.
+/// \param Fns[out] Resulting list of functions.
+static void addAllIndirectCallDependencies(const Module &M,
+                                           DenseSet<const Function *> &Fns) {
+  for (const auto &Fn : M) {
+    if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv()))
+      Fns.insert(&Fn);
+  }
+}
+
+/// Adds the functions that \p Fn may call to \p Fns, then recurses into each
+/// callee until all reachable functions have been gathered.
+///
+/// \param SML Log Helper
+/// \param CG Call graph for \p Fn's module.
+/// \param Fn Current function to look at.
+/// \param Fns[out] Resulting list of functions.
+/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some
+/// point, either in \p Fn or in one of the function it calls. When that
+/// happens, we fall back to adding all callable functions inside \p Fn's module
+/// to \p Fns.
+/// \param HadExternalCall[out] Set to true if a call to an external function
+/// was seen at some point, either in \p Fn or in one of the function it calls
+static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG,
+                               const Function &Fn,
+                               DenseSet<const Function *> &Fns,
+                               bool &HadIndirectCall, bool &HadExternalCall) {
+  assert(!Fn.isDeclaration());
+
+  const Module &M = *Fn.getParent();
+  SmallVector<const Function *> WorkList({&Fn});
+  while (!WorkList.empty()) {
+    const auto &CurFn = *WorkList.pop_back_val();
+
+    // Scan for an indirect call. If such a call is found, we have to
+    // conservatively assume this can call all non-entrypoint functions in the
+    // module.
+    for (const auto &BB : CurFn) {
+      for (const auto &I : BB) {
+        const auto *CB = dyn_cast<CallBase>(&I);
+        if (!CB || !CB->isIndirectCall())
+          continue;
+
+        SML << "Indirect call detected in " << getName(CurFn)
+            << " - treating all non-entrypoint functions as "
+               "potential dependencies\n";
+
+        // TODO: Print an ORE as well ?
+        addAllIndirectCallDependencies(M, Fns);
+        HadIndirectCall = true;
+        return;
+      }
+    }
+
+    for (auto &CGEntry : *CG[&CurFn]) {
+      auto *Callee = CGEntry.second->getFunction();
+      if (!Callee)
+        continue;
+
+      assert(!AMDGPU::isKernelCC(Callee));
+
+      if (Callee->isDeclaration())
+        continue;
+
+      if (Callee->hasExternalLinkage())
+        HadExternalCall = true;
+
+      auto [It, Inserted] = Fns.insert(Callee);
+      if (Inserted)
+        WorkList.push_back(Callee);
+    }
+  }
+}
+
+/// Contains information about a kernel and its dependencies.
+struct KernelWithDependencies {
+  KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG,
+                         const DenseMap<const Function *, CostType> &FnCosts,
+                         const Function *Fn)
+      : Fn(Fn) {
+    addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall,
+                       HasExternalCall);
+    TotalCost = FnCosts.at(Fn);
+    for (const auto *Dep : Dependencies)
+      TotalCost += FnCosts.at(Dep);
+  }
+
+  const Function *Fn = nullptr;
+  DenseSet<const Function *> Dependencies;
+  /// Whether \p Fn or any of its \ref Dependencies contains an indirect call.
+  bool HasIndirectCall = false;
+  /// Whether \p Fn or any of its \ref Dependencies contains a call to a
+  /// function with external linkage.
+  bool HasExternalCall = false;
+
+  CostType TotalCost = 0;
+
+  /// \returns true if this kernel and its dependencies can be considered large
+  /// according to \p Threshold.
+  bool isLarge(CostType Threshold) const {
+    return TotalCost > Threshold && !Dependencies.empty();
+  }
+};
+
+/// Calculates how much overlap there is between \p A and \p B.
+/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A
+/// and B have no shared elements. Kernels do not count in overlap calculation.
+static float calculateOverlap(const DenseSet<const Function *> &A,
+                              const DenseSet<const Function *> &B) {
+  DenseSet<const Function *> Total;
+  for (const auto *F : A) {
+    if (!AMDGPU::isKernelCC(F))
+      Total.insert(F);
+  }
+
+  if (Total.empty())
+    return 0.0f;
+
+  unsigned NumCommon = 0;
+  for (const auto *F : B) {
+    if (AMDGPU::isKernelCC(F))
+      continue;
+
+    auto [It, Inserted] = Total.insert(F);
+    if (!Inserted)
+      ++NumCommon;
+  }
+
+  return float(NumCommon) / Total.size();
+}
+
+/// Performs all of the partitioning work on \p M.
+/// \param SML Log Helper
+/// \param M Module to partition.
+/// \param NumParts Number of partitions to create.
+/// \param ModuleCost Total cost of all functions in \p M.
+/// \param FnCosts Map of Function -> Cost
+/// \param WorkList Kernels and their dependencies to process in order.
+/// \returns The created partitions (a vector of size \p NumParts )
+static std::vector<DenseSet<const Function *>>
+doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts,
+               CostType ModuleCost,
+               const DenseMap<const Function *, CostType> &FnCosts,
+               const SmallVector<KernelWithDependencies> &WorkList) {
+
+  SML << "\n--Partitioning Starts--\n";
+
+  // Calculate a "large kernel threshold". When more than one kernel's total
+  // import cost exceeds this value, we will try to merge it with other,
+  // similarly large kernels.
+  //
+  // e.g. let two kernels X and Y have a import cost of ~10% of the module, we
+  // assign X to a partition as usual, but when we get to Y, we check if it's
+  // worth also putting it in Y's partition.
+  const CostType LargeKernelThreshold =
+      LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor)
+                        : std::numeric_limits<CostType>::max();
+
+  std::vector<DenseSet<const Function *>> Partitions;
+  Partitions.resize(NumParts);
+
+  // Assign a partition to each kernel, and try to keep the partitions more or
+  // less balanced. We do that through a priority queue sorted in reverse, so we
+  // can always look at the partition with the least content.
+  //
+  // There are some cases where we will be deliberately unbalanced though.
+  //  - Large kernels: we try to merge with existing partitions to reduce code
+  //  duplication.
+  //  - Kernels with indirect or external calls always go in the first partition
+  //  (P0).
+  auto ComparePartitions = [](const std::pair<PartitionID, CostType> &a,
+                              const std::pair<PartitionID, CostType> &b) {
+    // When two partitions have the same cost, assign to the one with the
+    // biggest ID first. This allows us to put things in P0 last, because P0 may
+    // have other stuff added later.
+    if (a.second == b.second)
+      return a.first < b.first;
+    return a.second > b.second;
+  };
+
+  // We can't use priority_queue here because we need to be able to access any
+  // element. This makes this a bit inefficient as we need to sort it again
+  // everytime we change it, but it's a very small array anyway (likely under 64
+  // partitions) so it's a cheap operation.
+  std::vector<std::pair<PartitionID, CostType>> BalancingQueue;
+  for (unsigned I = 0; I < NumParts; ++I)
+    BalancingQueue.push_back(std::make_pair(I, 0));
+
+  // Helper function to...
[truncated]

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
SplitModuleLogger(const Module &M) {
std::string LogDir = LogDirOpt;
if (LogDir.empty())
LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backend passes should probably not depend on environment variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know but it's very convenient, and only used for debugging.

Passing arguments with values to the offload-linker's backend can be a bit annoying. I always have trouble figuring out the right syntax. I think it's something like -Xoffload-linker -mllvm=-amdgpu-module-splitting-log-dir=<dir> but I'm not sure)

If you know what the right syntax is for sure then I'm good to remove the env vars and instead create a section in AMDGPUUsage explaining how to get logs from the pass.

@arsenm
Copy link
Contributor

arsenm commented Apr 20, 2024

Can you add some tests with aliases? I'm not sure they work correctly in the call graph

Pierre-vh added a commit that referenced this pull request Apr 22, 2024
@Pierre-vh Pierre-vh requested a review from arsenm April 22, 2024 07:44
@gandhi56
Copy link
Contributor

gandhi56 commented May 3, 2024

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt
clone-lds-function-ancestor-kernels.ll.txt
clone-lds-function-successor.ll.txt

@jrbyrnes
Copy link
Contributor

jrbyrnes commented May 3, 2024

These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning

I ran into https://llvm.org/docs/CodingStandards.html#beware-of-non-determinism-due-to-ordering-of-pointers a while ago, and thought I'd point it out since I see this work uses sets/maps with pointer keys. Not sure if the iteration on these effects test output, but something to consider..

llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
return I == 0;
}));
if (I != 0)
MPart->setModuleInlineAsm("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this drop whatever asm is there? Does module asm even work today?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started this off by copying SplitModule so it's just a leftover

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.
@Pierre-vh
Copy link
Contributor Author

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

I'm not able to reproduce it, can you create a testcase that only depends on this pass?
I had to modify the tests a lot to make them work so it possibly lost what makes them non-deterministic. Also I see you're using %u but I don't know what that's supposed to do, I replaced it with %t to make it work.

@Pierre-vh
Copy link
Contributor Author

Note: I added the isAddressTaken check for the indirect calls dependency collection. I think that's a trivial addition to narrow down the set of indirectly callable functions.

@gandhi56
Copy link
Contributor

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

Never mind, I had some errors in the RUN lines in my tests. This issue does not exist.

@Pierre-vh Pierre-vh requested a review from arsenm May 14, 2024 13:36
Copy link

github-actions bot commented May 16, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

// We cannot duplicate functions with external linkage, or functions that
// may be overriden at runtime.
HasNonDuplicatableDependecy |=
(Dep->hasExternalLinkage() || !Dep->isDefinitionExact());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be mayBeDerefined? I'm not a linkage expert, but it seems there's some nobuiltin interaction here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mayBeDerefined is private I think, isDefinitionExact is just !mayBeDerefined so it's already using it

@arsenm arsenm requested a review from jhuber6 May 16, 2024 09:20
@Pierre-vh Pierre-vh requested a review from arsenm May 16, 2024 12:32
Copy link
Contributor

@jhuber6 jhuber6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that --lto-partitions doesn't work because the AMDGPU backend needs to consume the SCC in order. I'm assuming this patch maintains that logic but splits the independent kernel calls up to they can be done in parallel?

I've thought that the true solution to resolving this would just be emitting the kernel resource usage inside of ld.lld. Presumably that would require emitting resource usage per-function in parallel and then the callgraph information (There's some small support for this already). Then ld.lld would need to traverse the callgraph to get the diameter of said graph. However I'm assuming that'd take more effort to define an actual linking ABI.

@Pierre-vh
Copy link
Contributor Author

My understanding is that --lto-partitions doesn't work because the AMDGPU backend needs to consume the SCC in order. I'm assuming this patch maintains that logic but splits the independent kernel calls up to they can be done in parallel?

Exactly. If a function is called directly or indirectly by a kernel, it stays in that kernel's module.

I've thought that the true solution to resolving this would just be emitting the kernel resource usage inside of ld.lld. Presumably that would require emitting resource usage per-function in parallel and then the callgraph information (There's some small support for this already). Then ld.lld would need to traverse the callgraph to get the diameter of said graph. However I'm assuming that'd take more effort to define an actual linking ABI.

Indeed, and this is where I started looking as well (so we could just enable thinLTO), but it's hard to get right and this solution was by far the easiest and the most realistic one to do in the short term.

continue;

CostType FnCost = 0;
TargetTransformInfo TTI = TM.getTargetTransformInfo(Fn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be const ref

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signature is

virtual TargetTransformInfo getTargetTransformInfo(const Function &F) const

So can't be const ref

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You aren't supposed to query this from the TargetMachine, you need to get this from the PassManager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly a pass though, it's not ran by the pass manager but by the LTO driver after running the passes. I don't see another way to get TTI from here.
We also call this on the one module so the entry point can't just take a const &

We could eventually pass a functor to hide away the call to getTargetTransformInfo into TM so this doesn't need to know about it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't exactly a pass though, it's not ran by the pass manager but by the LTO driver after running the passes.

This sounds like a defect in how this is implemented. I would expect this to be a module pass, and then you can query the function analysis. Can you just ignore the cost stuff for now? Split the TTI usage into a separate patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to be a module pass, and then you can query the function analysis

SplitModule is also not a ModulePass. Fitting this in a module pass would be awkward I think. I guess we could make it an analysis, but then you still need to query the analysis when calling the CloneModule callback and IMO getting an analysis' result outside the PM feels more awkward than getting a TTI instance. It would also restrict what we can do/involve a lot more bookkeeping.

Can you just ignore the cost stuff for now?

The costs are an integral part of how this system works, it balances kernels depending on their size. Without that we'd likely affect the performance of this splitting negatively. I could estimate them based on some heuristics like number of BB/functions but that's really not accurate because some instructions are noops while some a ton of instructions.

There are also some precedents on passes/code using TM.getTargetTransformInfo directly:

  • InterleavedLoadCombineImpl
  • SelectionDAGBuilder::shouldKeepJumpConditionsTogether
  • computeHeuristicUnrollFactor in OMPIRBuilder

Copy link
Contributor

@arsenm arsenm May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These precedents are all bugs and constructing new TTI instances

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#93103 fixes the InterleavedLoadCombine one. shouldKeepJumpConditionsTogether requires a bit more boilerplate. For computeHeuristicUnrollFactor, just what is going on in there?

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved
@Pierre-vh Pierre-vh requested a review from arsenm May 21, 2024 07:19
@Pierre-vh Pierre-vh merged commit d7c3713 into llvm:main May 23, 2024
4 checks passed
@Pierre-vh Pierre-vh deleted the amdgpu-module-splitting branch May 23, 2024 10:26
@vitalybuka
Copy link
Collaborator

Can this be fixed or reverted https://lab.llvm.org/buildbot/#/builders/85/builds/24181 ?

vitalybuka added a commit that referenced this pull request May 24, 2024
@vitalybuka vitalybuka mentioned this pull request May 24, 2024
Pierre-vh added a commit that referenced this pull request May 27, 2024
(with fix for ubsan)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their
specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as
resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just
LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value
names hidden so they can be safely shared without leaking informatiton
about the source. Logs can also be enabled through an environment
variable, which avoids the sometimes complicated process of passing a
-mllvm option all the way from clang driver to the offload linker that
handles full LTO codegen.
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 28, 2024
Allow targets to implement custom module splitting logic for
--lto-partitions, see llvm#89245

https: //discourse.llvm.org/t/rfc-lto-target-specific-module-splittting/77252
Change-Id: I22abc7f37988e839eb7b06fbd1a0c69dfef1af28
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request May 28, 2024
This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their
specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as
resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just
LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value
names hidden so they can be safely shared without leaking informatiton
about the source. Logs can also be enabled through an environment
variable, which avoids the sometimes complicated process of passing a
-mllvm option all the way from clang driver to the offload linker that
handles full LTO codegen.

Change-Id: Ifec430cc5b439e004822a4e5659cf47639458bfe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU LTO Link time optimization (regular/full LTO or ThinLTO)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants