[AMDGPU] Add AMDGPU-specific module splitting #89245

Pierre-vh · 2024-04-18T14:51:26Z

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

llvmbot · 2024-04-18T14:51:58Z

@llvm/pr-subscribers-lto

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff

22 Files Affected:

(modified) llvm/include/llvm/Target/TargetMachine.h (+12)
(modified) llvm/lib/LTO/LTOBackend.cpp (+9-4)
(added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733)
(added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46)
(added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39)
(added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98)
(added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2)
(modified) llvm/tools/llvm-split/CMakeLists.txt (+7)
(modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25)

diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index ceb371bdc73480..48ea3cfe02775b 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -418,6 +418,18 @@ class TargetMachine {
   virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const {
     return 0;
   }
+
+  /// Entry point for module splitting. Targets can implement custom module
+  /// splitting logic, mainly used by LTO for --lto-partitions.
+  ///
+  /// \returns `true` if the module was split, `false` otherwise. When  `false`
+  /// is returned, it is assumed that \p ModuleCallback has never been called
+  /// and \p M has not been modified.
+  virtual bool splitModule(
+      Module &M, unsigned NumParts,
+      function_ref<void(std::unique_ptr<Module> MPart)> ModuleCallback) const {
+    return false;
+  }
 };
 
 /// This class describes a target machine that is implemented with the LLVM
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 71e8849dc3cc91..d4b89ede2d7134 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
   unsigned ThreadCount = 0;
   const Target *T = &TM->getTarget();
 
-  SplitModule(
-      Mod, ParallelCodeGenParallelismLevel,
+  const auto HandleModulePartition =
       [&](std::unique_ptr<Module> MPart) {
         // We want to clone the module in a new context to multi-thread the
         // codegen. We do it by serializing partition modules to bitcode
@@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
             // Pass BC using std::move to ensure that it get moved rather than
             // copied into the thread's context.
             std::move(BC), ThreadCount++);
-      },
-      false);
+      };
+
+  // Try target-specific module splitting first, then fallback to the default.
+  if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel,
+                       HandleModulePartition)) {
+    SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition,
+                false);
+  }
 
   // Because the inner lambda (which runs in a worker thread) captures our local
   // variables, we need to wait for the worker threads to terminate before we
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
new file mode 100644
index 00000000000000..fa47d494f04148
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
@@ -0,0 +1,733 @@
+//===- AMDGPUSplitModule.cpp ----------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file Implements a module splitting algorithm designed to support the
+/// FullLTO --lto-partitions option for parallel codegen. This is completely
+/// different from the common SplitModule pass, as this system is designed with
+/// AMDGPU in mind.
+///
+/// The basic idea of this module splitting implementation is the same as
+/// SplitModule: load-balance the module's functions across a set of N
+/// partitions to allow parallel codegen. However, it does it very
+/// differently than the target-agnostic variant:
+///   - Kernels are used as the module's "roots".
+///     They're known entry points on AMDGPU, and everything else is often
+///     internal only.
+///   - Each kernel has a set of dependencies, and when a kernel and its
+///     dependencies is considered "big", we try to put it in a partition where
+///     most dependencies are already imported, to avoid duplicating large
+///     amounts of code.
+///   - There's special care for indirect calls in order to ensure
+///     AMDGPUResourceUsageAnalysis can work correctly.
+///
+/// This file also includes a more elaborate logging system to enable
+/// users to easily generate logs that (if desired) do not include any value
+/// names, in order to not leak information about the source file.
+/// Such logs are very helpful to understand and fix potential issues with
+/// module splitting.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUSplitModule.h"
+#include "AMDGPUTargetMachine.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/User.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/Process.h"
+#include "llvm/Support/SHA256.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+#include <algorithm>
+#include <cassert>
+#include <iterator>
+#include <memory>
+#include <utility>
+#include <vector>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-split-module"
+
+namespace {
+
+static cl::opt<float> LargeKernelFactor(
+    "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden,
+    cl::desc(
+        "consider a kernel as large and needing special treatment when it "
+        "exceeds the average cost of a partition by this factor; e;g. 2.0 "
+        "means if the kernel and its dependencies is 2 times bigger than "
+        "an average partition; 0 disables large kernels handling entirely"));
+
+static cl::opt<float> LargeKernelOverlapForMerge(
+    "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8),
+    cl::Hidden,
+    cl::desc("defines how much overlap between two large kernel's dependencies "
+             "is needed to put them in the same partition"));
+
+static cl::opt<bool> NoExternalizeGlobals(
+    "amdgpu-module-splitting-no-externalize-globals", cl::Hidden,
+    cl::desc("disables externalization of global variable with local linkage; "
+             "may cause globals to be duplicated which increases binary size"));
+
+static cl::opt<std::string>
+    LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden,
+              cl::desc("output directory for AMDGPU module splitting logs"));
+
+static cl::opt<bool>
+    LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden,
+               cl::desc("hash value names before printing them in the AMDGPU "
+                        "module splitting logs"));
+
+using CostType = InstructionCost::CostType;
+using PartitionID = unsigned;
+
+static std::string getName(const Value &V) {
+  static std::optional<bool> HideNames;
+  if (!HideNames) {
+    if (LogPrivate.getNumOccurrences())
+      HideNames = LogPrivate;
+    else {
+      const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE");
+      HideNames = (EV.value_or("0") != "0");
+    }
+  }
+
+  if (!*HideNames)
+    return V.getName().str();
+  return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())),
+               /*LowerCase*/ true);
+}
+
+/// Main logging helper.
+///
+/// Logging can be configured by the following environment variable.
+///   AMD_SPLIT_MODULE_LOG_DIR=<filepath>
+///     If set, uses <filepath> as the directory to write logfiles to
+///     each time module splitting is used.
+///   AMD_SPLIT_MODULE_LOG_PRIVATE
+///     If set to anything other than zero, all names are hidden.
+///
+/// Both environment variables have corresponding CL options which
+/// takes priority over them.
+///
+/// Any output printed to the log files is also printed to dbgs() when -debug is
+/// used and LLVM_DEBUG is defined.
+///
+/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic
+/// cannot be removed from the code (by building without debug). This probably
+/// has a small performance cost because if some computation/formatting is
+/// needed for logging purpose, it may be done everytime only to be ignored
+/// by the logger.
+///
+/// As this pass only runs once and is not doing anything computationally
+/// expensive, this is likely a reasonable trade-off.
+///
+/// If some computation should really be avoided when unused, users of the class
+/// can check whether any logging will occur by using the bool operator.
+///
+/// \code
+///   if (SML) {
+///     // Executes only if logging to a file or if -debug is available and
+///     used.
+///   }
+/// \endcode
+class SplitModuleLogger {
+public:
+  SplitModuleLogger(const Module &M) {
+    std::string LogDir = LogDirOpt;
+    if (LogDir.empty())
+      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
+
+    // No log dir specified means we don't need to log to a file.
+    // We may still log to dbgs(), though.
+    if (LogDir.empty())
+      return;
+
+    if (!sys::fs::is_directory(LogDir)) {
+      report_fatal_error("invalid AMDGPU split module log directory: '" +
+                             Twine(LogDir) + "' is not a directory",
+                         /*CrashDiag=*/false);
+    }
+
+    // If a log directory is specified, create a new file with a unique name in
+    // that directory.
+    SmallString<0> FilePath;
+    int Fd;
+    std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt");
+    if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) {
+      dbgs() << LogFile << "\n";
+      std::string Msg =
+          "Failed to create log file at '" + LogDir + "': " + Err.message();
+      report_fatal_error(StringRef(Msg),
+                         /*CrashDiag=*/false);
+    }
+
+    FileOS = std::make_unique<raw_fd_ostream>(Fd, /*shouldClose*/ true);
+  }
+
+  bool hasLogFile() const { return FileOS != nullptr; }
+
+  raw_ostream &logfile() {
+    assert(FileOS && "no logfile!");
+    return *FileOS;
+  }
+
+  /// \returns true if this SML will log anything either to a file or dbgs().
+  /// Can be used to avoid expensive computations that are ignored when logging
+  /// is disabled.
+  operator bool() const {
+    return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE));
+  }
+
+private:
+  std::unique_ptr<raw_fd_ostream> FileOS;
+};
+
+template <typename Ty>
+static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) {
+  static_assert(
+      !std::is_same_v<Ty, Value>,
+      "do not print values to logs directly, use handleName instead!");
+  LLVM_DEBUG(dbgs() << Val);
+  if (SML.hasLogFile())
+    SML.logfile() << Val;
+  return SML;
+}
+
+/// Calculate the cost of each function in \p M
+/// \param SML Log Helper
+/// \param TM TargetMachine instance used to retrieve TargetTransformInfo.
+/// \param M Module to analyze.
+/// \param CostMap[out] Resulting Function -> Cost map.
+/// \return The module's total cost.
+static CostType
+calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM,
+                       Module &M,
+                       DenseMap<const Function *, CostType> &CostMap) {
+  CostType ModuleCost = 0;
+  CostType KernelCost = 0;
+
+  for (auto &Fn : M) {
+    if (Fn.isDeclaration())
+      continue;
+
+    CostType FnCost = 0;
+    auto TTI = TM.getTargetTransformInfo(Fn);
+
+    for (auto &BB : Fn) {
+      for (auto &I : BB) {
+        auto Cost =
+            TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
+        assert(Cost != InstructionCost::getMax());
+        // Assume expensive if we can't tell the cost of an instruction.
+        CostType CostVal =
+            Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive);
+        assert((FnCost + CostVal) >= FnCost && "Overflow!");
+        FnCost += CostVal;
+      }
+    }
+
+    assert(FnCost != 0);
+
+    CostMap[&Fn] = FnCost;
+    assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!");
+    ModuleCost += FnCost;
+
+    if (AMDGPU::isKernelCC(&Fn))
+      KernelCost += FnCost;
+  }
+
+  CostType FnCost = (ModuleCost - KernelCost);
+  SML << "=> Total Module Cost: " << ModuleCost << "\n"
+      << "  => KernelCost: " << KernelCost << " ("
+      << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n"
+      << "  => FnsCost: " << FnCost << " ("
+      << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n";
+
+  return ModuleCost;
+}
+
+/// When a kernel or any of its callees performs an indirect call, this function
+/// takes over \ref addAllDependencies and adds all potentially callable
+/// functions to \p Fns so they can be counted as dependencies of the kernel.
+///
+/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the
+/// presence of an indirect call, the function's resource usage is the same as
+/// the most expensive function in the module.
+/// \param M    The module.
+/// \param Fns[out] Resulting list of functions.
+static void addAllIndirectCallDependencies(const Module &M,
+                                           DenseSet<const Function *> &Fns) {
+  for (const auto &Fn : M) {
+    if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv()))
+      Fns.insert(&Fn);
+  }
+}
+
+/// Adds the functions that \p Fn may call to \p Fns, then recurses into each
+/// callee until all reachable functions have been gathered.
+///
+/// \param SML Log Helper
+/// \param CG Call graph for \p Fn's module.
+/// \param Fn Current function to look at.
+/// \param Fns[out] Resulting list of functions.
+/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some
+/// point, either in \p Fn or in one of the function it calls. When that
+/// happens, we fall back to adding all callable functions inside \p Fn's module
+/// to \p Fns.
+/// \param HadExternalCall[out] Set to true if a call to an external function
+/// was seen at some point, either in \p Fn or in one of the function it calls
+static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG,
+                               const Function &Fn,
+                               DenseSet<const Function *> &Fns,
+                               bool &HadIndirectCall, bool &HadExternalCall) {
+  assert(!Fn.isDeclaration());
+
+  const Module &M = *Fn.getParent();
+  SmallVector<const Function *> WorkList({&Fn});
+  while (!WorkList.empty()) {
+    const auto &CurFn = *WorkList.pop_back_val();
+
+    // Scan for an indirect call. If such a call is found, we have to
+    // conservatively assume this can call all non-entrypoint functions in the
+    // module.
+    for (const auto &BB : CurFn) {
+      for (const auto &I : BB) {
+        const auto *CB = dyn_cast<CallBase>(&I);
+        if (!CB || !CB->isIndirectCall())
+          continue;
+
+        SML << "Indirect call detected in " << getName(CurFn)
+            << " - treating all non-entrypoint functions as "
+               "potential dependencies\n";
+
+        // TODO: Print an ORE as well ?
+        addAllIndirectCallDependencies(M, Fns);
+        HadIndirectCall = true;
+        return;
+      }
+    }
+
+    for (auto &CGEntry : *CG[&CurFn]) {
+      auto *Callee = CGEntry.second->getFunction();
+      if (!Callee)
+        continue;
+
+      assert(!AMDGPU::isKernelCC(Callee));
+
+      if (Callee->isDeclaration())
+        continue;
+
+      if (Callee->hasExternalLinkage())
+        HadExternalCall = true;
+
+      auto [It, Inserted] = Fns.insert(Callee);
+      if (Inserted)
+        WorkList.push_back(Callee);
+    }
+  }
+}
+
+/// Contains information about a kernel and its dependencies.
+struct KernelWithDependencies {
+  KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG,
+                         const DenseMap<const Function *, CostType> &FnCosts,
+                         const Function *Fn)
+      : Fn(Fn) {
+    addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall,
+                       HasExternalCall);
+    TotalCost = FnCosts.at(Fn);
+    for (const auto *Dep : Dependencies)
+      TotalCost += FnCosts.at(Dep);
+  }
+
+  const Function *Fn = nullptr;
+  DenseSet<const Function *> Dependencies;
+  /// Whether \p Fn or any of its \ref Dependencies contains an indirect call.
+  bool HasIndirectCall = false;
+  /// Whether \p Fn or any of its \ref Dependencies contains a call to a
+  /// function with external linkage.
+  bool HasExternalCall = false;
+
+  CostType TotalCost = 0;
+
+  /// \returns true if this kernel and its dependencies can be considered large
+  /// according to \p Threshold.
+  bool isLarge(CostType Threshold) const {
+    return TotalCost > Threshold && !Dependencies.empty();
+  }
+};
+
+/// Calculates how much overlap there is between \p A and \p B.
+/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A
+/// and B have no shared elements. Kernels do not count in overlap calculation.
+static float calculateOverlap(const DenseSet<const Function *> &A,
+                              const DenseSet<const Function *> &B) {
+  DenseSet<const Function *> Total;
+  for (const auto *F : A) {
+    if (!AMDGPU::isKernelCC(F))
+      Total.insert(F);
+  }
+
+  if (Total.empty())
+    return 0.0f;
+
+  unsigned NumCommon = 0;
+  for (const auto *F : B) {
+    if (AMDGPU::isKernelCC(F))
+      continue;
+
+    auto [It, Inserted] = Total.insert(F);
+    if (!Inserted)
+      ++NumCommon;
+  }
+
+  return float(NumCommon) / Total.size();
+}
+
+/// Performs all of the partitioning work on \p M.
+/// \param SML Log Helper
+/// \param M Module to partition.
+/// \param NumParts Number of partitions to create.
+/// \param ModuleCost Total cost of all functions in \p M.
+/// \param FnCosts Map of Function -> Cost
+/// \param WorkList Kernels and their dependencies to process in order.
+/// \returns The created partitions (a vector of size \p NumParts )
+static std::vector<DenseSet<const Function *>>
+doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts,
+               CostType ModuleCost,
+               const DenseMap<const Function *, CostType> &FnCosts,
+               const SmallVector<KernelWithDependencies> &WorkList) {
+
+  SML << "\n--Partitioning Starts--\n";
+
+  // Calculate a "large kernel threshold". When more than one kernel's total
+  // import cost exceeds this value, we will try to merge it with other,
+  // similarly large kernels.
+  //
+  // e.g. let two kernels X and Y have a import cost of ~10% of the module, we
+  // assign X to a partition as usual, but when we get to Y, we check if it's
+  // worth also putting it in Y's partition.
+  const CostType LargeKernelThreshold =
+      LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor)
+                        : std::numeric_limits<CostType>::max();
+
+  std::vector<DenseSet<const Function *>> Partitions;
+  Partitions.resize(NumParts);
+
+  // Assign a partition to each kernel, and try to keep the partitions more or
+  // less balanced. We do that through a priority queue sorted in reverse, so we
+  // can always look at the partition with the least content.
+  //
+  // There are some cases where we will be deliberately unbalanced though.
+  //  - Large kernels: we try to merge with existing partitions to reduce code
+  //  duplication.
+  //  - Kernels with indirect or external calls always go in the first partition
+  //  (P0).
+  auto ComparePartitions = [](const std::pair<PartitionID, CostType> &a,
+                              const std::pair<PartitionID, CostType> &b) {
+    // When two partitions have the same cost, assign to the one with the
+    // biggest ID first. This allows us to put things in P0 last, because P0 may
+    // have other stuff added later.
+    if (a.second == b.second)
+      return a.first < b.first;
+    return a.second > b.second;
+  };
+
+  // We can't use priority_queue here because we need to be able to access any
+  // element. This makes this a bit inefficient as we need to sort it again
+  // everytime we change it, but it's a very small array anyway (likely under 64
+  // partitions) so it's a cheap operation.
+  std::vector<std::pair<PartitionID, CostType>> BalancingQueue;
+  for (unsigned I = 0; I < NumParts; ++I)
+    BalancingQueue.push_back(std::make_pair(I, 0));
+
+  // Helper function to...
[truncated]

llvmbot · 2024-04-18T14:51:58Z

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

(See #83128 to review first commit)

This enables the --lto-partitions option to work more consistently.

This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of
them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented).

This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows
printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

Patch is 59.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/89245.diff

22 Files Affected:

(modified) llvm/include/llvm/Target/TargetMachine.h (+12)
(modified) llvm/lib/LTO/LTOBackend.cpp (+9-4)
(added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp (+733)
(added) llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h (+30)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+8)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h (+4)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll (+46)
(added) llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize.ll (+37)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll (+54)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependencies.ll (+50)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-duplication.ll (+41)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-external.ll (+43)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-dependency-indirect.ll (+69)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll (+42)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll (+44)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-load-balancing.ll (+75)
(added) llvm/test/tools/llvm-split/AMDGPU/kernels-no-dependencies.ll (+39)
(added) llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll (+98)
(added) llvm/test/tools/llvm-split/AMDGPU/lit.local.cfg (+2)
(modified) llvm/tools/llvm-split/CMakeLists.txt (+7)
(modified) llvm/tools/llvm-split/llvm-split.cpp (+76-25)

diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index ceb371bdc73480..48ea3cfe02775b 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -418,6 +418,18 @@ class TargetMachine {
   virtual unsigned getAddressSpaceForPseudoSourceKind(unsigned Kind) const {
     return 0;
   }
+
+  /// Entry point for module splitting. Targets can implement custom module
+  /// splitting logic, mainly used by LTO for --lto-partitions.
+  ///
+  /// \returns `true` if the module was split, `false` otherwise. When  `false`
+  /// is returned, it is assumed that \p ModuleCallback has never been called
+  /// and \p M has not been modified.
+  virtual bool splitModule(
+      Module &M, unsigned NumParts,
+      function_ref<void(std::unique_ptr<Module> MPart)> ModuleCallback) const {
+    return false;
+  }
 };
 
 /// This class describes a target machine that is implemented with the LLVM
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 71e8849dc3cc91..d4b89ede2d7134 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -436,8 +436,7 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
   unsigned ThreadCount = 0;
   const Target *T = &TM->getTarget();
 
-  SplitModule(
-      Mod, ParallelCodeGenParallelismLevel,
+  const auto HandleModulePartition =
       [&](std::unique_ptr<Module> MPart) {
         // We want to clone the module in a new context to multi-thread the
         // codegen. We do it by serializing partition modules to bitcode
@@ -469,8 +468,14 @@ static void splitCodeGen(const Config &C, TargetMachine *TM,
             // Pass BC using std::move to ensure that it get moved rather than
             // copied into the thread's context.
             std::move(BC), ThreadCount++);
-      },
-      false);
+      };
+
+  // Try target-specific module splitting first, then fallback to the default.
+  if (!TM->splitModule(Mod, ParallelCodeGenParallelismLevel,
+                       HandleModulePartition)) {
+    SplitModule(Mod, ParallelCodeGenParallelismLevel, HandleModulePartition,
+                false);
+  }
 
   // Because the inner lambda (which runs in a worker thread) captures our local
   // variables, we need to wait for the worker threads to terminate before we
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
new file mode 100644
index 00000000000000..fa47d494f04148
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
@@ -0,0 +1,733 @@
+//===- AMDGPUSplitModule.cpp ----------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file Implements a module splitting algorithm designed to support the
+/// FullLTO --lto-partitions option for parallel codegen. This is completely
+/// different from the common SplitModule pass, as this system is designed with
+/// AMDGPU in mind.
+///
+/// The basic idea of this module splitting implementation is the same as
+/// SplitModule: load-balance the module's functions across a set of N
+/// partitions to allow parallel codegen. However, it does it very
+/// differently than the target-agnostic variant:
+///   - Kernels are used as the module's "roots".
+///     They're known entry points on AMDGPU, and everything else is often
+///     internal only.
+///   - Each kernel has a set of dependencies, and when a kernel and its
+///     dependencies is considered "big", we try to put it in a partition where
+///     most dependencies are already imported, to avoid duplicating large
+///     amounts of code.
+///   - There's special care for indirect calls in order to ensure
+///     AMDGPUResourceUsageAnalysis can work correctly.
+///
+/// This file also includes a more elaborate logging system to enable
+/// users to easily generate logs that (if desired) do not include any value
+/// names, in order to not leak information about the source file.
+/// Such logs are very helpful to understand and fix potential issues with
+/// module splitting.
+//
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPUSplitModule.h"
+#include "AMDGPUTargetMachine.h"
+#include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/User.h"
+#include "llvm/IR/Value.h"
+#include "llvm/Support/Casting.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/Process.h"
+#include "llvm/Support/SHA256.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+#include <algorithm>
+#include <cassert>
+#include <iterator>
+#include <memory>
+#include <utility>
+#include <vector>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-split-module"
+
+namespace {
+
+static cl::opt<float> LargeKernelFactor(
+    "amdgpu-module-splitting-large-kernel-threshold", cl::init(2.0), cl::Hidden,
+    cl::desc(
+        "consider a kernel as large and needing special treatment when it "
+        "exceeds the average cost of a partition by this factor; e;g. 2.0 "
+        "means if the kernel and its dependencies is 2 times bigger than "
+        "an average partition; 0 disables large kernels handling entirely"));
+
+static cl::opt<float> LargeKernelOverlapForMerge(
+    "amdgpu-module-splitting-large-kernel-merge-overlap", cl::init(0.8),
+    cl::Hidden,
+    cl::desc("defines how much overlap between two large kernel's dependencies "
+             "is needed to put them in the same partition"));
+
+static cl::opt<bool> NoExternalizeGlobals(
+    "amdgpu-module-splitting-no-externalize-globals", cl::Hidden,
+    cl::desc("disables externalization of global variable with local linkage; "
+             "may cause globals to be duplicated which increases binary size"));
+
+static cl::opt<std::string>
+    LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden,
+              cl::desc("output directory for AMDGPU module splitting logs"));
+
+static cl::opt<bool>
+    LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden,
+               cl::desc("hash value names before printing them in the AMDGPU "
+                        "module splitting logs"));
+
+using CostType = InstructionCost::CostType;
+using PartitionID = unsigned;
+
+static std::string getName(const Value &V) {
+  static std::optional<bool> HideNames;
+  if (!HideNames) {
+    if (LogPrivate.getNumOccurrences())
+      HideNames = LogPrivate;
+    else {
+      const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE");
+      HideNames = (EV.value_or("0") != "0");
+    }
+  }
+
+  if (!*HideNames)
+    return V.getName().str();
+  return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())),
+               /*LowerCase*/ true);
+}
+
+/// Main logging helper.
+///
+/// Logging can be configured by the following environment variable.
+///   AMD_SPLIT_MODULE_LOG_DIR=<filepath>
+///     If set, uses <filepath> as the directory to write logfiles to
+///     each time module splitting is used.
+///   AMD_SPLIT_MODULE_LOG_PRIVATE
+///     If set to anything other than zero, all names are hidden.
+///
+/// Both environment variables have corresponding CL options which
+/// takes priority over them.
+///
+/// Any output printed to the log files is also printed to dbgs() when -debug is
+/// used and LLVM_DEBUG is defined.
+///
+/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic
+/// cannot be removed from the code (by building without debug). This probably
+/// has a small performance cost because if some computation/formatting is
+/// needed for logging purpose, it may be done everytime only to be ignored
+/// by the logger.
+///
+/// As this pass only runs once and is not doing anything computationally
+/// expensive, this is likely a reasonable trade-off.
+///
+/// If some computation should really be avoided when unused, users of the class
+/// can check whether any logging will occur by using the bool operator.
+///
+/// \code
+///   if (SML) {
+///     // Executes only if logging to a file or if -debug is available and
+///     used.
+///   }
+/// \endcode
+class SplitModuleLogger {
+public:
+  SplitModuleLogger(const Module &M) {
+    std::string LogDir = LogDirOpt;
+    if (LogDir.empty())
+      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
+
+    // No log dir specified means we don't need to log to a file.
+    // We may still log to dbgs(), though.
+    if (LogDir.empty())
+      return;
+
+    if (!sys::fs::is_directory(LogDir)) {
+      report_fatal_error("invalid AMDGPU split module log directory: '" +
+                             Twine(LogDir) + "' is not a directory",
+                         /*CrashDiag=*/false);
+    }
+
+    // If a log directory is specified, create a new file with a unique name in
+    // that directory.
+    SmallString<0> FilePath;
+    int Fd;
+    std::string LogFile = (LogDir + "/" + "Module-%%-%%-%%-%%-%%-%%-%%.txt");
+    if (auto Err = sys::fs::createUniqueFile(LogFile, Fd, FilePath)) {
+      dbgs() << LogFile << "\n";
+      std::string Msg =
+          "Failed to create log file at '" + LogDir + "': " + Err.message();
+      report_fatal_error(StringRef(Msg),
+                         /*CrashDiag=*/false);
+    }
+
+    FileOS = std::make_unique<raw_fd_ostream>(Fd, /*shouldClose*/ true);
+  }
+
+  bool hasLogFile() const { return FileOS != nullptr; }
+
+  raw_ostream &logfile() {
+    assert(FileOS && "no logfile!");
+    return *FileOS;
+  }
+
+  /// \returns true if this SML will log anything either to a file or dbgs().
+  /// Can be used to avoid expensive computations that are ignored when logging
+  /// is disabled.
+  operator bool() const {
+    return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE));
+  }
+
+private:
+  std::unique_ptr<raw_fd_ostream> FileOS;
+};
+
+template <typename Ty>
+static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) {
+  static_assert(
+      !std::is_same_v<Ty, Value>,
+      "do not print values to logs directly, use handleName instead!");
+  LLVM_DEBUG(dbgs() << Val);
+  if (SML.hasLogFile())
+    SML.logfile() << Val;
+  return SML;
+}
+
+/// Calculate the cost of each function in \p M
+/// \param SML Log Helper
+/// \param TM TargetMachine instance used to retrieve TargetTransformInfo.
+/// \param M Module to analyze.
+/// \param CostMap[out] Resulting Function -> Cost map.
+/// \return The module's total cost.
+static CostType
+calculateFunctionCosts(SplitModuleLogger &SML, const AMDGPUTargetMachine &TM,
+                       Module &M,
+                       DenseMap<const Function *, CostType> &CostMap) {
+  CostType ModuleCost = 0;
+  CostType KernelCost = 0;
+
+  for (auto &Fn : M) {
+    if (Fn.isDeclaration())
+      continue;
+
+    CostType FnCost = 0;
+    auto TTI = TM.getTargetTransformInfo(Fn);
+
+    for (auto &BB : Fn) {
+      for (auto &I : BB) {
+        auto Cost =
+            TTI.getInstructionCost(&I, TargetTransformInfo::TCK_CodeSize);
+        assert(Cost != InstructionCost::getMax());
+        // Assume expensive if we can't tell the cost of an instruction.
+        CostType CostVal =
+            Cost.getValue().value_or(TargetTransformInfo::TCC_Expensive);
+        assert((FnCost + CostVal) >= FnCost && "Overflow!");
+        FnCost += CostVal;
+      }
+    }
+
+    assert(FnCost != 0);
+
+    CostMap[&Fn] = FnCost;
+    assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!");
+    ModuleCost += FnCost;
+
+    if (AMDGPU::isKernelCC(&Fn))
+      KernelCost += FnCost;
+  }
+
+  CostType FnCost = (ModuleCost - KernelCost);
+  SML << "=> Total Module Cost: " << ModuleCost << "\n"
+      << "  => KernelCost: " << KernelCost << " ("
+      << format("%0.2f", (float(KernelCost) / ModuleCost) * 100) << "%)\n"
+      << "  => FnsCost: " << FnCost << " ("
+      << format("%0.2f", (float(FnCost) / ModuleCost) * 100) << "%)\n";
+
+  return ModuleCost;
+}
+
+/// When a kernel or any of its callees performs an indirect call, this function
+/// takes over \ref addAllDependencies and adds all potentially callable
+/// functions to \p Fns so they can be counted as dependencies of the kernel.
+///
+/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the
+/// presence of an indirect call, the function's resource usage is the same as
+/// the most expensive function in the module.
+/// \param M    The module.
+/// \param Fns[out] Resulting list of functions.
+static void addAllIndirectCallDependencies(const Module &M,
+                                           DenseSet<const Function *> &Fns) {
+  for (const auto &Fn : M) {
+    if (!Fn.isDeclaration() && !AMDGPU::isEntryFunctionCC(Fn.getCallingConv()))
+      Fns.insert(&Fn);
+  }
+}
+
+/// Adds the functions that \p Fn may call to \p Fns, then recurses into each
+/// callee until all reachable functions have been gathered.
+///
+/// \param SML Log Helper
+/// \param CG Call graph for \p Fn's module.
+/// \param Fn Current function to look at.
+/// \param Fns[out] Resulting list of functions.
+/// \param HadIndirectCall[out] Set to true if an indirect call was seen at some
+/// point, either in \p Fn or in one of the function it calls. When that
+/// happens, we fall back to adding all callable functions inside \p Fn's module
+/// to \p Fns.
+/// \param HadExternalCall[out] Set to true if a call to an external function
+/// was seen at some point, either in \p Fn or in one of the function it calls
+static void addAllDependencies(SplitModuleLogger &SML, const CallGraph &CG,
+                               const Function &Fn,
+                               DenseSet<const Function *> &Fns,
+                               bool &HadIndirectCall, bool &HadExternalCall) {
+  assert(!Fn.isDeclaration());
+
+  const Module &M = *Fn.getParent();
+  SmallVector<const Function *> WorkList({&Fn});
+  while (!WorkList.empty()) {
+    const auto &CurFn = *WorkList.pop_back_val();
+
+    // Scan for an indirect call. If such a call is found, we have to
+    // conservatively assume this can call all non-entrypoint functions in the
+    // module.
+    for (const auto &BB : CurFn) {
+      for (const auto &I : BB) {
+        const auto *CB = dyn_cast<CallBase>(&I);
+        if (!CB || !CB->isIndirectCall())
+          continue;
+
+        SML << "Indirect call detected in " << getName(CurFn)
+            << " - treating all non-entrypoint functions as "
+               "potential dependencies\n";
+
+        // TODO: Print an ORE as well ?
+        addAllIndirectCallDependencies(M, Fns);
+        HadIndirectCall = true;
+        return;
+      }
+    }
+
+    for (auto &CGEntry : *CG[&CurFn]) {
+      auto *Callee = CGEntry.second->getFunction();
+      if (!Callee)
+        continue;
+
+      assert(!AMDGPU::isKernelCC(Callee));
+
+      if (Callee->isDeclaration())
+        continue;
+
+      if (Callee->hasExternalLinkage())
+        HadExternalCall = true;
+
+      auto [It, Inserted] = Fns.insert(Callee);
+      if (Inserted)
+        WorkList.push_back(Callee);
+    }
+  }
+}
+
+/// Contains information about a kernel and its dependencies.
+struct KernelWithDependencies {
+  KernelWithDependencies(SplitModuleLogger &SML, CallGraph &CG,
+                         const DenseMap<const Function *, CostType> &FnCosts,
+                         const Function *Fn)
+      : Fn(Fn) {
+    addAllDependencies(SML, CG, *Fn, Dependencies, HasIndirectCall,
+                       HasExternalCall);
+    TotalCost = FnCosts.at(Fn);
+    for (const auto *Dep : Dependencies)
+      TotalCost += FnCosts.at(Dep);
+  }
+
+  const Function *Fn = nullptr;
+  DenseSet<const Function *> Dependencies;
+  /// Whether \p Fn or any of its \ref Dependencies contains an indirect call.
+  bool HasIndirectCall = false;
+  /// Whether \p Fn or any of its \ref Dependencies contains a call to a
+  /// function with external linkage.
+  bool HasExternalCall = false;
+
+  CostType TotalCost = 0;
+
+  /// \returns true if this kernel and its dependencies can be considered large
+  /// according to \p Threshold.
+  bool isLarge(CostType Threshold) const {
+    return TotalCost > Threshold && !Dependencies.empty();
+  }
+};
+
+/// Calculates how much overlap there is between \p A and \p B.
+/// \return A number between 0.0 and 1.0, where 1.0 means A == B and 0.0 means A
+/// and B have no shared elements. Kernels do not count in overlap calculation.
+static float calculateOverlap(const DenseSet<const Function *> &A,
+                              const DenseSet<const Function *> &B) {
+  DenseSet<const Function *> Total;
+  for (const auto *F : A) {
+    if (!AMDGPU::isKernelCC(F))
+      Total.insert(F);
+  }
+
+  if (Total.empty())
+    return 0.0f;
+
+  unsigned NumCommon = 0;
+  for (const auto *F : B) {
+    if (AMDGPU::isKernelCC(F))
+      continue;
+
+    auto [It, Inserted] = Total.insert(F);
+    if (!Inserted)
+      ++NumCommon;
+  }
+
+  return float(NumCommon) / Total.size();
+}
+
+/// Performs all of the partitioning work on \p M.
+/// \param SML Log Helper
+/// \param M Module to partition.
+/// \param NumParts Number of partitions to create.
+/// \param ModuleCost Total cost of all functions in \p M.
+/// \param FnCosts Map of Function -> Cost
+/// \param WorkList Kernels and their dependencies to process in order.
+/// \returns The created partitions (a vector of size \p NumParts )
+static std::vector<DenseSet<const Function *>>
+doPartitioning(SplitModuleLogger &SML, Module &M, unsigned NumParts,
+               CostType ModuleCost,
+               const DenseMap<const Function *, CostType> &FnCosts,
+               const SmallVector<KernelWithDependencies> &WorkList) {
+
+  SML << "\n--Partitioning Starts--\n";
+
+  // Calculate a "large kernel threshold". When more than one kernel's total
+  // import cost exceeds this value, we will try to merge it with other,
+  // similarly large kernels.
+  //
+  // e.g. let two kernels X and Y have a import cost of ~10% of the module, we
+  // assign X to a partition as usual, but when we get to Y, we check if it's
+  // worth also putting it in Y's partition.
+  const CostType LargeKernelThreshold =
+      LargeKernelFactor ? ((ModuleCost / NumParts) * LargeKernelFactor)
+                        : std::numeric_limits<CostType>::max();
+
+  std::vector<DenseSet<const Function *>> Partitions;
+  Partitions.resize(NumParts);
+
+  // Assign a partition to each kernel, and try to keep the partitions more or
+  // less balanced. We do that through a priority queue sorted in reverse, so we
+  // can always look at the partition with the least content.
+  //
+  // There are some cases where we will be deliberately unbalanced though.
+  //  - Large kernels: we try to merge with existing partitions to reduce code
+  //  duplication.
+  //  - Kernels with indirect or external calls always go in the first partition
+  //  (P0).
+  auto ComparePartitions = [](const std::pair<PartitionID, CostType> &a,
+                              const std::pair<PartitionID, CostType> &b) {
+    // When two partitions have the same cost, assign to the one with the
+    // biggest ID first. This allows us to put things in P0 last, because P0 may
+    // have other stuff added later.
+    if (a.second == b.second)
+      return a.first < b.first;
+    return a.second > b.second;
+  };
+
+  // We can't use priority_queue here because we need to be able to access any
+  // element. This makes this a bit inefficient as we need to sort it again
+  // everytime we change it, but it's a very small array anyway (likely under 64
+  // partitions) so it's a cheap operation.
+  std::vector<std::pair<PartitionID, CostType>> BalancingQueue;
+  for (unsigned I = 0; I < NumParts; ++I)
+    BalancingQueue.push_back(std::make_pair(I, 0));
+
+  // Helper function to...
[truncated]

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

arsenm · 2024-04-20T09:24:12Z

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

+  SplitModuleLogger(const Module &M) {
+    std::string LogDir = LogDirOpt;
+    if (LogDir.empty())
+      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");


Backend passes should probably not depend on environment variables

I know but it's very convenient, and only used for debugging.

Passing arguments with values to the offload-linker's backend can be a bit annoying. I always have trouble figuring out the right syntax. I think it's something like -Xoffload-linker -mllvm=-amdgpu-module-splitting-log-dir=<dir> but I'm not sure)

If you know what the right syntax is for sure then I'm good to remove the env vars and instead create a section in AMDGPUUsage explaining how to get logs from the pass.

arsenm · 2024-04-20T09:24:54Z

Can you add some tests with aliases? I'm not sure they work correctly in the call graph

Allow targets to implement custom module splitting logic for --lto-partitions, see #89245 https://discourse.llvm.org/t/rfc-lto-target-specific-module-splittting/77252

llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll

llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll

gandhi56 · 2024-05-03T16:47:34Z

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt
clone-lds-function-ancestor-kernels.ll.txt
clone-lds-function-successor.ll.txt

jrbyrnes · 2024-05-03T17:23:57Z

These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning

I ran into https://llvm.org/docs/CodingStandards.html#beware-of-non-determinism-due-to-ordering-of-pointers a while ago, and thought I'd point it out since I see this work uses sets/maps with pointer keys. Not sure if the iteration on these effects test output, but something to consider..

llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables-noexternal.ll

llvm/test/tools/llvm-split/AMDGPU/kernels-global-variables.ll

llvm/test/tools/llvm-split/AMDGPU/large-kernels-merging.ll

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

arsenm · 2024-05-09T18:02:17Z

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

+          return I == 0;
+        }));
+    if (I != 0)
+      MPart->setModuleInlineAsm("");


Why would this drop whatever asm is there? Does module asm even work today?

I started this off by copying SplitModule so it's just a leftover

This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

Pierre-vh · 2024-05-13T07:14:56Z

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

I'm not able to reproduce it, can you create a testcase that only depends on this pass?
I had to modify the tests a lot to make them work so it possibly lost what makes them non-deterministic. Also I see you're using %u but I don't know what that's supposed to do, I replaced it with %t to make it work.

Pierre-vh · 2024-05-13T15:17:05Z

Note: I added the isAddressTaken check for the indirect calls dependency collection. I think that's a trivial addition to narrow down the set of indirectly callable functions.

gandhi56 · 2024-05-13T15:50:33Z

I found a bug during the review of #89683. I will attach the test files here. These tests should pass when they are run individually and fail when they are tested as a group. These nondeterministic results show up when AMDGPUSplitModule or SplitModule are used for partitioning. I suspect it may have something to do with tie breaks during load balancing.

clone-lds-function.ll.txt clone-lds-function-ancestor-kernels.ll.txt clone-lds-function-successor.ll.txt

Never mind, I had some errors in the RUN lines in my tests. This issue does not exist.

github-actions · 2024-05-16T07:55:04Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm · 2024-05-16T09:20:41Z

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

+      // We cannot duplicate functions with external linkage, or functions that
+      // may be overriden at runtime.
+      HasNonDuplicatableDependecy |=
+          (Dep->hasExternalLinkage() || !Dep->isDefinitionExact());


Maybe this should be mayBeDerefined? I'm not a linkage expert, but it seems there's some nobuiltin interaction here too

mayBeDerefined is private I think, isDefinitionExact is just !mayBeDerefined so it's already using it

jhuber6

My understanding is that --lto-partitions doesn't work because the AMDGPU backend needs to consume the SCC in order. I'm assuming this patch maintains that logic but splits the independent kernel calls up to they can be done in parallel?

I've thought that the true solution to resolving this would just be emitting the kernel resource usage inside of ld.lld. Presumably that would require emitting resource usage per-function in parallel and then the callgraph information (There's some small support for this already). Then ld.lld would need to traverse the callgraph to get the diameter of said graph. However I'm assuming that'd take more effort to define an actual linking ABI.

Pierre-vh · 2024-05-17T07:38:02Z

My understanding is that --lto-partitions doesn't work because the AMDGPU backend needs to consume the SCC in order. I'm assuming this patch maintains that logic but splits the independent kernel calls up to they can be done in parallel?

Exactly. If a function is called directly or indirectly by a kernel, it stays in that kernel's module.

I've thought that the true solution to resolving this would just be emitting the kernel resource usage inside of ld.lld. Presumably that would require emitting resource usage per-function in parallel and then the callgraph information (There's some small support for this already). Then ld.lld would need to traverse the callgraph to get the diameter of said graph. However I'm assuming that'd take more effort to define an actual linking ABI.

Indeed, and this is where I started looking as well (so we could just enable thinLTO), but it's hard to get right and this solution was by far the easiest and the most realistic one to do in the short term.

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

arsenm · 2024-05-20T21:25:45Z

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

+      continue;
+
+    CostType FnCost = 0;
+    TargetTransformInfo TTI = TM.getTargetTransformInfo(Fn);


This should be const ref

Signature is

virtual TargetTransformInfo getTargetTransformInfo(const Function &F) const

So can't be const ref

You aren't supposed to query this from the TargetMachine, you need to get this from the PassManager

This isn't exactly a pass though, it's not ran by the pass manager but by the LTO driver after running the passes. I don't see another way to get TTI from here.
We also call this on the one module so the entry point can't just take a const &

We could eventually pass a functor to hide away the call to getTargetTransformInfo into TM so this doesn't need to know about it?

This isn't exactly a pass though, it's not ran by the pass manager but by the LTO driver after running the passes.

This sounds like a defect in how this is implemented. I would expect this to be a module pass, and then you can query the function analysis. Can you just ignore the cost stuff for now? Split the TTI usage into a separate patch?

I would expect this to be a module pass, and then you can query the function analysis

SplitModule is also not a ModulePass. Fitting this in a module pass would be awkward I think. I guess we could make it an analysis, but then you still need to query the analysis when calling the CloneModule callback and IMO getting an analysis' result outside the PM feels more awkward than getting a TTI instance. It would also restrict what we can do/involve a lot more bookkeeping.

Can you just ignore the cost stuff for now?

The costs are an integral part of how this system works, it balances kernels depending on their size. Without that we'd likely affect the performance of this splitting negatively. I could estimate them based on some heuristics like number of BB/functions but that's really not accurate because some instructions are noops while some a ton of instructions.

There are also some precedents on passes/code using TM.getTargetTransformInfo directly:

InterleavedLoadCombineImpl

SelectionDAGBuilder::shouldKeepJumpConditionsTogether

computeHeuristicUnrollFactor in OMPIRBuilder

These precedents are all bugs and constructing new TTI instances

#93103 fixes the InterleavedLoadCombine one. shouldKeepJumpConditionsTogether requires a bit more boilerplate. For computeHeuristicUnrollFactor, just what is going on in there?

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp

vitalybuka · 2024-05-23T20:55:04Z

Can this be fixed or reverted https://lab.llvm.org/buildbot/#/builders/85/builds/24181 ?

This reverts commit d7c3713.

Fails on https://lab.llvm.org/buildbot/#/builders/85/builds/24181 and https://lab.llvm.org/buildbot/#/builders/5/builds/43589 Reverts #89245

(with fix for ubsan) This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen.

Allow targets to implement custom module splitting logic for --lto-partitions, see llvm#89245 https: //discourse.llvm.org/t/rfc-lto-target-specific-module-splittting/77252 Change-Id: I22abc7f37988e839eb7b06fbd1a0c69dfef1af28

This enables the --lto-partitions option to work more consistently. This module splitting logic is fully aware of AMDGPU modules and their specificities and takes advantage of them to split modules in a way that avoids compilation issue (such as resource usage being incorrectly represented). This also includes a logging system that's more elaborate than just LLVM_DEBUG which allows printing logs to uniquely named files, and optionally with all value names hidden so they can be safely shared without leaking informatiton about the source. Logs can also be enabled through an environment variable, which avoids the sometimes complicated process of passing a -mllvm option all the way from clang driver to the offload linker that handles full LTO codegen. Change-Id: Ifec430cc5b439e004822a4e5659cf47639458bfe

Pierre-vh requested review from jayfoad, arsenm, scchan and gandhi56 April 18, 2024 14:51

llvmbot added backend:AMDGPU LTO Link time optimization (regular/full LTO or ThinLTO) labels Apr 18, 2024

Pierre-vh mentioned this pull request Apr 18, 2024

[LTO] Allow target-specific module splittting #83128

Merged

Pierre-vh force-pushed the amdgpu-module-splitting branch from 6c25662 to 66e3954 Compare April 19, 2024 08:46

arsenm reviewed Apr 20, 2024

View reviewed changes

Pierre-vh force-pushed the amdgpu-module-splitting branch from 66e3954 to 9c7a0e9 Compare April 22, 2024 07:44

Pierre-vh requested a review from arsenm April 22, 2024 07:44

Pierre-vh mentioned this pull request Apr 24, 2024

[AMDGPU][LTO] Introduce AMDGPUCloneModuleLDS #89683

Open

arsenm reviewed Apr 30, 2024

View reviewed changes

llvm/test/tools/llvm-split/AMDGPU/address-taken-externalize-with-call.ll Outdated Show resolved Hide resolved

llvm/test/tools/llvm-split/AMDGPU/kernels-cost-ranking.ll Outdated Show resolved Hide resolved

Pierre-vh force-pushed the amdgpu-module-splitting branch from 9c7a0e9 to 156247c Compare May 2, 2024 08:13

Pierre-vh requested a review from arsenm May 2, 2024 08:13

arsenm reviewed May 9, 2024

View reviewed changes

Pierre-vh added 4 commits May 13, 2024 08:57

mtarget -> mtriple

a75129b

Comments

095984a

fix tests

cec4259

comments

a092a3c

Pierre-vh force-pushed the amdgpu-module-splitting branch from 156247c to a092a3c Compare May 13, 2024 07:46

Comments

0b0662a

Pierre-vh requested a review from arsenm May 14, 2024 13:36

Handle overridable functions

dc05393

Appease clang-format

e05fbc3

arsenm reviewed May 16, 2024

View reviewed changes

arsenm requested a review from jhuber6 May 16, 2024 09:20

Pierre-vh requested a review from arsenm May 16, 2024 12:32

jhuber6 reviewed May 16, 2024

View reviewed changes

arsenm reviewed May 20, 2024

View reviewed changes

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.h Show resolved Hide resolved

llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp Outdated Show resolved Hide resolved

arsenm reviewed May 20, 2024

View reviewed changes

Commnets + private name test

570d7e5

Pierre-vh requested a review from arsenm May 21, 2024 07:19

Pierre-vh merged commit d7c3713 into llvm:main May 23, 2024
4 checks passed

Pierre-vh deleted the amdgpu-module-splitting branch May 23, 2024 10:26

vitalybuka added a commit that referenced this pull request May 24, 2024

Revert "[AMDGPU] Add AMDGPU-specific module splitting (#89245)"

92f6036

This reverts commit d7c3713.

vitalybuka mentioned this pull request May 24, 2024

Revert "[AMDGPU] Add AMDGPU-specific module splitting" #93275

Merged

vitalybuka added a commit that referenced this pull request May 24, 2024

Revert "[AMDGPU] Add AMDGPU-specific module splitting" (#93275)

5a48223

Fails on https://lab.llvm.org/buildbot/#/builders/85/builds/24181 and https://lab.llvm.org/buildbot/#/builders/5/builds/43589 Reverts #89245

vitalybuka mentioned this pull request May 24, 2024

#89245 was reverted #93277

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Add AMDGPU-specific module splitting #89245

[AMDGPU] Add AMDGPU-specific module splitting #89245

Pierre-vh commented Apr 18, 2024

llvmbot commented Apr 18, 2024

llvmbot commented Apr 18, 2024

arsenm Apr 20, 2024

Pierre-vh Apr 22, 2024

arsenm commented Apr 20, 2024

gandhi56 commented May 3, 2024

jrbyrnes commented May 3, 2024

arsenm May 9, 2024

Pierre-vh May 13, 2024

Pierre-vh commented May 13, 2024

Pierre-vh commented May 13, 2024

gandhi56 commented May 13, 2024

github-actions bot commented May 16, 2024 •

edited

arsenm May 16, 2024

Pierre-vh May 16, 2024

jhuber6 left a comment

Pierre-vh commented May 17, 2024

arsenm May 20, 2024

Pierre-vh May 21, 2024

arsenm May 21, 2024

Pierre-vh May 22, 2024

arsenm May 22, 2024

Pierre-vh May 22, 2024

arsenm May 22, 2024 •

edited

arsenm May 22, 2024

vitalybuka commented May 23, 2024

[AMDGPU] Add AMDGPU-specific module splitting #89245

[AMDGPU] Add AMDGPU-specific module splitting #89245

Conversation

Pierre-vh commented Apr 18, 2024

llvmbot commented Apr 18, 2024

llvmbot commented Apr 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arsenm commented Apr 20, 2024

gandhi56 commented May 3, 2024

jrbyrnes commented May 3, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Pierre-vh commented May 13, 2024

Pierre-vh commented May 13, 2024

gandhi56 commented May 13, 2024

github-actions bot commented May 16, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhuber6 left a comment

Choose a reason for hiding this comment

Pierre-vh commented May 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arsenm May 22, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vitalybuka commented May 23, 2024

github-actions bot commented May 16, 2024 •

edited

arsenm May 22, 2024 •

edited