[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. #87265

skc7 · 2024-04-01T17:07:18Z

This PR introduces a new pass, "amdgpu-sw-lower-lds".

This pass lowers uses of the local data store (LDS) in kernel and non-kernel functions in a module and packs them together into a single allocation. The packed LDS layout is emulated in dynamically allocated device global memory.

For a kernel, an LDS access can be static or dynamic, and either kind can be direct (accessed within the kernel) or indirect (accessed through non-kernel functions).

Replacement of Kernel LDS accesses:

All the LDS accesses corresponding to a kernel are packed together: static LDS accesses are allocated first, followed by dynamic LDS. The total size, including alignment, is calculated. A new LDS global called "SW LDS" is created for the kernel, and it will have the attribute "amdgpu-lds-size" attached, with the calculated size as its value. All the LDS accesses in the module are replaced by a GEP with an offset into the "SW LDS".
A new "llvm.amdgcn..dynlds" is created per kernel accessing dynamic LDS. It is marked as used by the kernel and has its MD_absolute_symbol metadata set to the total static LDS size, since the dynamic LDS allocation starts after all static LDS allocations.
Device global memory equal to the total LDS size is allocated. In the prologue of the kernel, a single work-item from the work-group does a "malloc" and stores the pointer to the allocation in "SW LDS". To store the offsets corresponding to all LDS accesses, another global variable is created, called "SW LDS metadata" in this pass.
SW LDS:
It is an LDS global of ptr type with the name "llvm.amdgcn.sw.lds.".
SW LDS Metadata:
It is of struct type, with n members, where n equals the number of LDS globals accessed by the kernel (directly and indirectly). Each member of the struct is another struct of type {i32, i32, i32}. The first member is the offset, the second is the size of the LDS global being replaced, and the third is the total aligned size. It will have the name "llvm.amdgcn.sw.lds..md". This global has an initializer in which the offsets and sizes for static LDS are filled in. For dynamic LDS entries, the offsets are initialized to the end offset of the preceding static LDS allocation, and their sizes are initially zero. These dynamic LDS offset and size values are updated within the kernel, since the kernel can read the dynamic LDS size allocated at runtime by querying the "hidden_dynamic_lds_size" hidden kernel argument.
In the epilogue of the kernel, the allocated memory is freed by the same single work-item.
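
For illustration only, the packing described above can be sketched as standalone C++ (no LLVM types). This is a minimal model, not the code in the patch: the names `LdsVar`, `MdEntry`, `SwLdsLayout`, `packLds` and `applyDynamicSize` are hypothetical, and it assumes every entry is padded to one caller-supplied alignment and that all dynamic LDS entries share the same start offset and size.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one LDS global (not a type from the patch).
struct LdsVar {
  std::string Name;
  uint32_t Size;  // size in bytes; ignored for dynamic LDS
  bool IsDynamic; // true for dynamic LDS, whose size is only known at runtime
};

// Models one {i32, i32, i32} metadata entry: offset, size, aligned size.
struct MdEntry {
  uint32_t Offset;
  uint32_t Size;
  uint32_t AlignedSize;
};

struct SwLdsLayout {
  std::vector<MdEntry> Entries; // static entries first, then dynamic ones
  std::size_t NumStatic = 0;    // number of static entries
  uint32_t StaticSize = 0;      // end offset of the static part
};

inline uint32_t alignTo(uint32_t Value, uint32_t Align) {
  return (Value + Align - 1) / Align * Align;
}

// Build the initial layout: static LDS is packed first; dynamic entries get
// the static end offset and a size of zero (assumed padding policy).
SwLdsLayout packLds(const std::vector<LdsVar> &Vars, uint32_t MaxAlign) {
  SwLdsLayout L;
  for (const LdsVar &V : Vars) {
    if (V.IsDynamic)
      continue;
    uint32_t Aligned = alignTo(V.Size, MaxAlign);
    L.Entries.push_back({L.StaticSize, V.Size, Aligned});
    L.StaticSize += Aligned;
  }
  L.NumStatic = L.Entries.size();
  for (const LdsVar &V : Vars)
    if (V.IsDynamic)
      L.Entries.push_back({L.StaticSize, 0, 0});
  return L;
}

// Models the in-kernel update once the dynamic LDS size is known at runtime.
// Returns the total number of bytes to allocate.
uint32_t applyDynamicSize(SwLdsLayout &L, uint32_t DynSize, uint32_t MaxAlign) {
  uint32_t Aligned = alignTo(DynSize, MaxAlign);
  for (std::size_t I = L.NumStatic; I < L.Entries.size(); ++I) {
    L.Entries[I].Size = DynSize;
    L.Entries[I].AlignedSize = Aligned;
  }
  return L.StaticSize + Aligned;
}
```

In this model `StaticSize` plays the role of the value stored in the absolute-symbol metadata of the dynamic LDS global, and the return value of `applyDynamicSize` is the size passed to "malloc".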

Replacement of non-kernel LDS accesses:

Multiple kernels can access the same non-kernel function. All the kernels accessing LDS through non-kernels are sorted and assigned a kernel-id. All the LDS globals accessed by non-kernels are sorted as well.
This information is used to build two tables:
Base table:
The base table has a single row, with the elements of the row placed according to kernel ID. Each element in the row corresponds to the ptr of the "SW LDS" variable created for that kernel.
Offset table:
The offset table has multiple rows and columns. Rows are numbered from 0 to (n-1), where n is the total number of kernels accessing LDS through non-kernels. Each row has m elements, where m is the total number of unique LDS globals accessed by all non-kernels. Each element in a row corresponds to the ptr of the replacement of the LDS global done by that particular kernel.
An LDS variable in a non-kernel is replaced based on the information from the base and offset tables. Based on a kernel-id query, the ptr of "SW LDS" for the corresponding kernel is obtained from the base table. The offset into the base "SW LDS" is obtained from the corresponding element in the offset table. With this information, the replacement value is obtained.
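
As a rough illustration of the lookup described above, here is a standalone C++ model of the two tables. It is a sketch under simplifying assumptions, not the IR the pass emits: `NonKernelTables` and `resolveLds` are hypothetical names, and the pointer indirections of the real tables are collapsed so that the base table holds the allocation address directly and the offset table holds plain byte offsets.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the base and offset tables.
struct NonKernelTables {
  // Base[k]: start address of the memory allocated for kernel-id k.
  std::vector<uint64_t> Base;
  // Offset[k][v]: offset of LDS global v (in sorted order) within the
  // allocation of kernel-id k. n rows, m columns.
  std::vector<std::vector<uint32_t>> Offset;
};

// Compute the replacement address for LDS global VarIdx when the non-kernel
// function runs on behalf of the kernel with id KernelId.
uint64_t resolveLds(const NonKernelTables &T, uint32_t KernelId,
                    uint32_t VarIdx) {
  // .at() bounds-checks both indices and throws std::out_of_range on misuse.
  return T.Base.at(KernelId) + T.Offset.at(KernelId).at(VarIdx);
}
```

Because both the kernels and the LDS globals are sorted before the tables are built, the same (kernel-id, variable index) pair always selects the same element.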

llvmbot · 2024-04-01T17:07:50Z

@llvm/pr-subscribers-backend-amdgpu

Author: Chaitanya (skc7)

Changes

This PR introduces a new pass, "amdgpu-sw-lower-lds". It lowers uses of the local data store (LDS) in kernel and non-kernel functions in a module, replacing them with dynamically allocated device global memory.

Replacement of Kernel LDS accesses:

For a kernel, an LDS access can be static or dynamic, and either kind can be direct (accessed within the kernel) or indirect (accessed through non-kernels). Device global memory equal to the size of all these LDS globals is allocated. In the prologue of the kernel, a single work-item from the work-group does a "malloc" and stores the pointer to the allocation in a new LDS global created for the kernel. This is called the "malloc LDS global" in this pass. Each LDS access corresponds to an offset in the allocated memory. All static LDS accesses are allocated first, followed by dynamic LDS. To store the offsets corresponding to all LDS accesses, another global variable is created, called the "metadata global" in this pass.
Malloc LDS Global:
It is an LDS global of ptr type with the name "llvm.amdgcn.sw.lds.{kernel-name}".
Metadata Global:
It is of struct type, with n members, where n equals the number of LDS globals accessed by the kernel (directly and indirectly). Each member of the struct is another struct of type {i32, i32}. The first member is the offset and the second is the size of the LDS global being replaced.
It will have the name "llvm.amdgcn.sw.lds.{kernel-name}.md".
This global has an initializer in which the offsets and sizes for static LDS are filled in. For dynamic LDS entries, the offsets are initialized to the end offset of the preceding static LDS allocation, and their sizes are initially zero. These dynamic LDS offset and size values are updated within the kernel, since the kernel can read the dynamic LDS size allocated at runtime by querying the "hidden_dynamic_lds_size" hidden kernel argument.
LDS accesses within the kernel are replaced by a "gep" ptr to the corresponding offset into the device global memory allocated for the kernel.
In the epilogue of the kernel, the allocated memory is freed by the same single work-item.

Replacement of non-kernel LDS accesses:

Multiple kernels can access the same non-kernel function. All the kernels accessing LDS through non-kernels are sorted and assigned a kernel-id. All the LDS globals accessed by non-kernels are sorted. This information is used to build two globals which are used as tables for query of LDS replacement:
Base table:
The base table has a single row, with the elements of the row placed according to kernel ID.
Each element in the row corresponds to the address of the "malloc LDS global" variable created for that kernel.
Offset table:
The offset table has multiple rows and columns.
Rows are numbered from 0 to (n-1), where n is the total number of kernels accessing LDS through non-kernels. Each row has m elements, where m is the total number of unique LDS globals accessed by all non-kernels. Each element in a row corresponds to the address of the replacement of the LDS global done by that particular kernel. An LDS variable in a non-kernel is replaced based on the information from the base and offset tables. Based on a kernel-id query, the address of the "malloc LDS global" for the corresponding kernel is obtained from the base table. The offset into the base "malloc LDS global" is obtained from the corresponding element in the offset table. With this information, the replacement value is obtained.
Other changes:
The "amdgpu-sw-lower-lds" pass is needed for address sanitizer instrumentation of LDS, so it is enabled only if asan is enabled.
Certain utility functions from the lower-module-lds pass can be reused. These functions are moved to the AMDGPUMemoryUtils module for reuse in this new pass. The lower-module-lds pass is disabled if the asan feature is enabled.

Patch is 122.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87265.diff

18 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(modified) llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp (+1-185)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
(added) llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp (+865)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+6)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp (+176)
(modified) llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.h (+24)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-dynamic-indirect-access.ll (+99)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-dynamic-lds-test.ll (+57)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-multi-static-dynamic-indirect-access.ll (+192)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-multiple-blocks-return.ll (+79)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-dynamic-indirect-access.ll (+101)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-dynamic-lds-test.ll (+88)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-indirect-access-function-param.ll (+61)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-indirect-access-nested.ll (+212)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-indirect-access.ll (+84)
(added) llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-lds-test.ll (+58)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 6016bd5187d887..15ff74f7c53af3 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -263,6 +263,15 @@ struct AMDGPUAlwaysInlinePass : PassInfoMixin<AMDGPUAlwaysInlinePass> {
   bool GlobalOpt;
 };
 
+void initializeAMDGPUSwLowerLDSLegacyPass(PassRegistry &);
+extern char &AMDGPUSwLowerLDSLegacyPassID;
+ModulePass *createAMDGPUSwLowerLDSLegacyPass();
+
+struct AMDGPUSwLowerLDSPass : PassInfoMixin<AMDGPUSwLowerLDSPass> {
+  AMDGPUSwLowerLDSPass() {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 class AMDGPUCodeGenPreparePass
     : public PassInfoMixin<AMDGPUCodeGenPreparePass> {
 private:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index 595f09664c55e4..f0456d3f62a816 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -212,6 +212,7 @@
 #define DEBUG_TYPE "amdgpu-lower-module-lds"
 
 using namespace llvm;
+using namespace AMDGPU;
 
 namespace {
 
@@ -234,17 +235,6 @@ cl::opt<LoweringKind> LoweringKindLoc(
         clEnumValN(LoweringKind::hybrid, "hybrid",
                    "Lower via mixture of above strategies")));
 
-bool isKernelLDS(const Function *F) {
-  // Some weirdness here. AMDGPU::isKernelCC does not call into
-  // AMDGPU::isKernel with the calling conv, it instead calls into
-  // isModuleEntryFunction which returns true for more calling conventions
-  // than AMDGPU::isKernel does. There's a FIXME on AMDGPU::isKernel.
-  // There's also a test that checks that the LDS lowering does not hit on
-  // a graphics shader, denoted amdgpu_ps, so stay with the limited case.
-  // Putting LDS in the name of the function to draw attention to this.
-  return AMDGPU::isKernel(F->getCallingConv());
-}
-
 template <typename T> std::vector<T> sortByName(std::vector<T> &&V) {
   llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) {
     return L->getName() < R->getName();
@@ -305,183 +295,9 @@ class AMDGPULowerModuleLDS {
         Decl, {}, {OperandBundleDefT<Value *>("ExplicitUse", UseInstance)});
   }
 
-  static bool eliminateConstantExprUsesOfLDSFromAllInstructions(Module &M) {
-    // Constants are uniqued within LLVM. A ConstantExpr referring to a LDS
-    // global may have uses from multiple different functions as a result.
-    // This pass specialises LDS variables with respect to the kernel that
-    // allocates them.
-
-    // This is semantically equivalent to (the unimplemented as slow):
-    // for (auto &F : M.functions())
-    //   for (auto &BB : F)
-    //     for (auto &I : BB)
-    //       for (Use &Op : I.operands())
-    //         if (constantExprUsesLDS(Op))
-    //           replaceConstantExprInFunction(I, Op);
-
-    SmallVector<Constant *> LDSGlobals;
-    for (auto &GV : M.globals())
-      if (AMDGPU::isLDSVariableToLower(GV))
-        LDSGlobals.push_back(&GV);
-
-    return convertUsersOfConstantsToInstructions(LDSGlobals);
-  }
-
 public:
   AMDGPULowerModuleLDS(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
 
-  using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>;
-
-  using VariableFunctionMap = DenseMap<GlobalVariable *, DenseSet<Function *>>;
-
-  static void getUsesOfLDSByFunction(CallGraph const &CG, Module &M,
-                                     FunctionVariableMap &kernels,
-                                     FunctionVariableMap &functions) {
-
-    // Get uses from the current function, excluding uses by called functions
-    // Two output variables to avoid walking the globals list twice
-    for (auto &GV : M.globals()) {
-      if (!AMDGPU::isLDSVariableToLower(GV)) {
-        continue;
-      }
-
-      for (User *V : GV.users()) {
-        if (auto *I = dyn_cast<Instruction>(V)) {
-          Function *F = I->getFunction();
-          if (isKernelLDS(F)) {
-            kernels[F].insert(&GV);
-          } else {
-            functions[F].insert(&GV);
-          }
-        }
-      }
-    }
-  }
-
-  struct LDSUsesInfoTy {
-    FunctionVariableMap direct_access;
-    FunctionVariableMap indirect_access;
-  };
-
-  static LDSUsesInfoTy getTransitiveUsesOfLDS(CallGraph const &CG, Module &M) {
-
-    FunctionVariableMap direct_map_kernel;
-    FunctionVariableMap direct_map_function;
-    getUsesOfLDSByFunction(CG, M, direct_map_kernel, direct_map_function);
-
-    // Collect variables that are used by functions whose address has escaped
-    DenseSet<GlobalVariable *> VariablesReachableThroughFunctionPointer;
-    for (Function &F : M.functions()) {
-      if (!isKernelLDS(&F))
-        if (F.hasAddressTaken(nullptr,
-                              /* IgnoreCallbackUses */ false,
-                              /* IgnoreAssumeLikeCalls */ false,
-                              /* IgnoreLLVMUsed */ true,
-                              /* IgnoreArcAttachedCall */ false)) {
-          set_union(VariablesReachableThroughFunctionPointer,
-                    direct_map_function[&F]);
-        }
-    }
-
-    auto functionMakesUnknownCall = [&](const Function *F) -> bool {
-      assert(!F->isDeclaration());
-      for (const CallGraphNode::CallRecord &R : *CG[F]) {
-        if (!R.second->getFunction()) {
-          return true;
-        }
-      }
-      return false;
-    };
-
-    // Work out which variables are reachable through function calls
-    FunctionVariableMap transitive_map_function = direct_map_function;
-
-    // If the function makes any unknown call, assume the worst case that it can
-    // access all variables accessed by functions whose address escaped
-    for (Function &F : M.functions()) {
-      if (!F.isDeclaration() && functionMakesUnknownCall(&F)) {
-        if (!isKernelLDS(&F)) {
-          set_union(transitive_map_function[&F],
-                    VariablesReachableThroughFunctionPointer);
-        }
-      }
-    }
-
-    // Direct implementation of collecting all variables reachable from each
-    // function
-    for (Function &Func : M.functions()) {
-      if (Func.isDeclaration() || isKernelLDS(&Func))
-        continue;
-
-      DenseSet<Function *> seen; // catches cycles
-      SmallVector<Function *, 4> wip{&Func};
-
-      while (!wip.empty()) {
-        Function *F = wip.pop_back_val();
-
-        // Can accelerate this by referring to transitive map for functions that
-        // have already been computed, with more care than this
-        set_union(transitive_map_function[&Func], direct_map_function[F]);
-
-        for (const CallGraphNode::CallRecord &R : *CG[F]) {
-          Function *ith = R.second->getFunction();
-          if (ith) {
-            if (!seen.contains(ith)) {
-              seen.insert(ith);
-              wip.push_back(ith);
-            }
-          }
-        }
-      }
-    }
-
-    // direct_map_kernel lists which variables are used by the kernel
-    // find the variables which are used through a function call
-    FunctionVariableMap indirect_map_kernel;
-
-    for (Function &Func : M.functions()) {
-      if (Func.isDeclaration() || !isKernelLDS(&Func))
-        continue;
-
-      for (const CallGraphNode::CallRecord &R : *CG[&Func]) {
-        Function *ith = R.second->getFunction();
-        if (ith) {
-          set_union(indirect_map_kernel[&Func], transitive_map_function[ith]);
-        } else {
-          set_union(indirect_map_kernel[&Func],
-                    VariablesReachableThroughFunctionPointer);
-        }
-      }
-    }
-
-    // Verify that we fall into one of 2 cases:
-    //    - All variables are absolute: this is a re-run of the pass
-    //      so we don't have anything to do.
-    //    - No variables are absolute.
-    std::optional<bool> HasAbsoluteGVs;
-    for (auto &Map : {direct_map_kernel, indirect_map_kernel}) {
-      for (auto &[Fn, GVs] : Map) {
-        for (auto *GV : GVs) {
-          bool IsAbsolute = GV->isAbsoluteSymbolRef();
-          if (HasAbsoluteGVs.has_value()) {
-            if (*HasAbsoluteGVs != IsAbsolute) {
-              report_fatal_error(
-                  "Module cannot mix absolute and non-absolute LDS GVs");
-            }
-          } else
-            HasAbsoluteGVs = IsAbsolute;
-        }
-      }
-    }
-
-    // If we only had absolute GVs, we have nothing to do, return an empty
-    // result.
-    if (HasAbsoluteGVs && *HasAbsoluteGVs)
-      return {FunctionVariableMap(), FunctionVariableMap()};
-
-    return {std::move(direct_map_kernel), std::move(indirect_map_kernel)};
-  }
-
   struct LDSVariableReplacement {
     GlobalVariable *SGV = nullptr;
     DenseMap<GlobalVariable *, Constant *> LDSVarsToConstantGEP;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 90f36fadf35903..eda4949d0296d5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -22,6 +22,7 @@ MODULE_PASS("amdgpu-lower-buffer-fat-pointers",
             AMDGPULowerBufferFatPointersPass(*this))
 MODULE_PASS("amdgpu-lower-ctor-dtor", AMDGPUCtorDtorLoweringPass())
 MODULE_PASS("amdgpu-lower-module-lds", AMDGPULowerModuleLDSPass(*this))
+MODULE_PASS("amdgpu-sw-lower-lds", AMDGPUSwLowerLDSPass())
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
 #undef MODULE_PASS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp
new file mode 100644
index 00000000000000..ed3670fa1386d6
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp
@@ -0,0 +1,865 @@
+//===-- AMDGPUSwLowerLDS.cpp -----------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass lowers the local data store, LDS, uses in kernel and non-kernel
+// functions in module with dynamically allocated device global memory.
+//
+// Replacement of Kernel LDS accesses:
+//    For a kernel, LDS access can be static or dynamic which are direct
+//    (accessed within kernel) and indirect (accessed through non-kernels).
+//    A device global memory equal to size of all these LDS globals will be
+//    allocated. At the prologue of the kernel, a single work-item from the
+//    work-group, does a "malloc" and stores the pointer of the allocation in
+//    new LDS global that will be created for the kernel. This will be called
+//    "malloc LDS global" in this pass.
+//    Each LDS access corresponds to an offset in the allocated memory.
+//    All static LDS accesses will be allocated first and then dynamic LDS
+//    will occupy the device global memoery.
+//    To store the offsets corresponding to all LDS accesses, another global
+//    variable is created which will be called "metadata global" in this pass.
+//    - Malloc LDS Global:
+//        It is LDS global of ptr type with name
+//        "llvm.amdgcn.sw.lds.<kernel-name>".
+//    - Metadata Global:
+//        It is of struct type, with n members. n equals the number of LDS
+//        globals accessed by the kernel(direct and indirect). Each member of
+//        struct is another struct of type {i32, i32}. First member corresponds
+//        to offset, second member corresponds to size of LDS global being
+//        replaced. It will have name "llvm.amdgcn.sw.lds.<kernel-name>.md".
+//        This global will have an intializer with static LDS related offsets
+//        and sizes initialized. But for dynamic LDS related entries, offsets
+//        will be intialized to previous static LDS allocation end offset. Sizes
+//        for them will be zero initially. These dynamic LDS offset and size
+//        values will be updated with in the kernel, since kernel can read the
+//        dynamic LDS size allocation done at runtime with query to
+//        "hidden_dynamic_lds_size" hidden kernel argument.
+//
+//    LDS accesses within the kernel will be replaced by "gep" ptr to
+//    corresponding offset into allocated device global memory for the kernel.
+//    At the epilogue of kernel, allocated memory would be made free by the same
+//    single work-item.
+//
+// Replacement of non-kernel LDS accesses:
+//    Multiple kernels can access the same non-kernel function.
+//    All the kernels accessing LDS through non-kernels are sorted and
+//    assigned a kernel-id. All the LDS globals accessed by non-kernels
+//    are sorted. This information is used to build two tables:
+//    - Base table:
+//        Base table will have single row, with elements of the row
+//        placed as per kernel ID. Each element in the row corresponds
+//        to addresss of "malloc LDS global" variable created for
+//        that kernel.
+//    - Offset table:
+//        Offset table will have multiple rows and columns.
+//        Rows are assumed to be from 0 to (n-1). n is total number
+//        of kernels accessing the LDS through non-kernels.
+//        Each row will have m elements. m is the total number of
+//        unique LDS globals accessed by all non-kernels.
+//        Each element in the row correspond to the address of
+//        the replacement of LDS global done by that particular kernel.
+//    A LDS variable in non-kernel will be replaced based on the information
+//    from base and offset tables. Based on kernel-id query, address of "malloc
+//    LDS global" for that corresponding kernel is obtained from base table.
+//    The Offset into the base "malloc LDS global" is obtained from
+//    corresponding element in offset table. With this information, replacement
+//    value is obtained.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "Utils/AMDGPUMemoryUtils.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseSet.h"
+#include "llvm/ADT/SetOperations.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/Analysis/CallGraph.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/ReplaceConstant.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
+
+#include <algorithm>
+
+#define DEBUG_TYPE "amdgpu-sw-lower-lds"
+
+using namespace llvm;
+using namespace AMDGPU;
+
+namespace {
+
+using DomTreeCallback = function_ref<DominatorTree *(Function &F)>;
+
+struct LDSAccessTypeInfo {
+  SetVector<GlobalVariable *> StaticLDSGlobals;
+  SetVector<GlobalVariable *> DynamicLDSGlobals;
+};
+
+// Struct to hold all the Metadata required for a kernel
+// to replace a LDS global uses with corresponding offset
+// in to device global memory.
+struct KernelLDSParameters {
+  GlobalVariable *MallocLDSGlobal{nullptr};
+  GlobalVariable *MallocMetadataGlobal{nullptr};
+  LDSAccessTypeInfo DirectAccess;
+  LDSAccessTypeInfo IndirectAccess;
+  DenseMap<GlobalVariable *, SmallVector<uint32_t, 3>>
+      LDSToReplacementIndicesMap;
+  int32_t KernelId{-1};
+  uint32_t MallocSize{0};
+};
+
+// Struct to store infor for creation of offset table
+// for all the non-kernel LDS accesses.
+struct NonKernelLDSParameters {
+  GlobalVariable *LDSBaseTable{nullptr};
+  GlobalVariable *LDSOffsetTable{nullptr};
+  SetVector<Function *> OrderedKernels;
+  SetVector<GlobalVariable *> OrdereLDSGlobals;
+};
+
+class AMDGPUSwLowerLDS {
+public:
+  AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)
+      : M(mod), IRB(M.getContext()), DTCallback(Callback) {}
+  bool Run();
+  void GetUsesOfLDSByNonKernels(CallGraph const &CG,
+                                FunctionVariableMap &functions);
+  SetVector<Function *>
+  GetOrderedIndirectLDSAccessingKernels(SetVector<Function *> &&Kernels);
+  SetVector<GlobalVariable *>
+  GetOrderedNonKernelAllLDSGlobals(SetVector<GlobalVariable *> &&Variables);
+  void PopulateMallocLDSGlobal(Function *Func);
+  void PopulateMallocMetadataGlobal(Function *Func);
+  void PopulateLDSToReplacementIndicesMap(Function *Func);
+  void ReplaceKernelLDSAccesses(Function *Func);
+  void LowerKernelLDSAccesses(Function *Func, DomTreeUpdater &DTU);
+  void BuildNonKernelLDSOffsetTable(
+      std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+  void BuildNonKernelLDSBaseTable(
+      std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+  Constant *
+  GetAddressesOfVariablesInKernel(Function *Func,
+                                  SetVector<GlobalVariable *> &Variables);
+  void LowerNonKernelLDSAccesses(
+      Function *Func, SetVector<GlobalVariable *> &LDSGlobals,
+      std::shared_ptr<NonKernelLDSParameters> &NKLDSParams);
+
+private:
+  Module &M;
+  IRBuilder<> IRB;
+  DomTreeCallback DTCallback;
+  DenseMap<Function *, std::shared_ptr<KernelLDSParameters>>
+      KernelToLDSParametersMap;
+};
+
+template <typename T> SetVector<T> SortByName(std::vector<T> &&V) {
+  // Sort the vector of globals or Functions based on their name.
+  // Returns a SetVector of globals/Functions.
+  llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) {
+    return L->getName() < R->getName();
+  });
+  return {std::move(SetVector<T>(V.begin(), V.end()))};
+}
+
+SetVector<GlobalVariable *> AMDGPUSwLowerLDS::GetOrderedNonKernelAllLDSGlobals(
+    SetVector<GlobalVariable *> &&Variables) {
+  // Sort all the non-kernel LDS accesses based on theor name.
+  SetVector<GlobalVariable *> Ordered = SortByName(
+      std::vector<GlobalVariable *>(Variables.begin(), Variables.end()));
+  return std::move(Ordered);
+}
+
+SetVector<Function *> AMDGPUSwLowerLDS::GetOrderedIndirectLDSAccessingKernels(
+    SetVector<Function *> &&Kernels) {
+  // Sort the non-kernels accessing LDS based on theor name.
+  // Also assign a kernel ID metadata based on the sorted order.
+  LLVMContext &Ctx = M.getContext();
+  if (Kernels.size() > UINT32_MAX) {
+    // 32 bit keeps it in one SGPR. > 2**32 kernels won't fit on the GPU
+    report_fatal_error("Unimplemented SW LDS lowering for > 2**32 kernels");
+  }
+  SetVector<Function *> OrderedKernels =
+      SortByName(std::vector<Function *>(Kernels.begin(), Kernels.end()));
+  for (size_t i = 0; i < Kernels.size(); i++) {
+    Metadata *AttrMDArgs[1] = {
+        ConstantAsMetadata::get(IRB.getInt32(i)),
+    };
+    Function *Func = OrderedKernels[i];
+    Func->setMetadata("llvm.amdgcn.lds.kernel.id",
+                      MDNode::get(Ctx, AttrMDArgs));
+    auto &LDSParams = KernelToLDSParametersMap[Func];
+    assert(LDSParams);
+    LDSParams->KernelId = i;
+  }
+  return std::move(OrderedKernels);
+}
+
+void AMDGPUSwLowerLDS::GetUsesOfLDSByNonKernels(
+    CallGraph const &CG, FunctionVariableMap &functions) {
+  // Get uses from the current function, excluding uses by called functions
+  // Two output variables to avoid walking the globals list twice
+  for (auto &GV : M.globals()) {
+    if (!AMDGPU::isLDSVariableToLower(GV)) {
+      continue;
+    }
+
+    if (GV.isAbsoluteSymbolRef()) {
+      report_fatal_error(
+          "LDS variables with absolute addresses are unimplemented.");
+    }
+
+    for (User *V : GV.users()) {
+      User *FUU = V;
+      bool isCast = isa<BitCastOperator, AddrSpaceCastOperator>(FUU);
+      if (isCast && FUU->hasOneUse() && !FUU->user_begin()->user_empty())
+        FUU = *FUU->user_begin();
+      if (auto *I = dyn_cast<Instruction>(FUU)) {
+        Function *F = I->getFunction();
+        if (!isKernelLDS(F)) {
+          functions[F].insert(&GV);
+        }
+      }
+    }
+  }
+}
+
+void AMDGPUSwLowerLDS::PopulateMallocLDSGlobal(Function *Func) {
+  // Create new LDS global required for each kernel to store
+  // device global memory pointer.
+  auto &LDSParams = KernelToLDSParametersMap[Func];
+  assert(LDSParams);
+  // create new global pointer variable
+  LDSParams->MallocLDSGlobal = new GlobalVariable(
+      M, IRB.getPtrTy(), false, GlobalValue::InternalLinkage,
+      PoisonValue::get(IRB.getPtrTy()),
+      Twine("llvm.amdgcn.sw.lds." + F...
[truncated]

arsenm · 2024-04-02T14:42:12Z

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

@@ -679,6 +680,11 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(

        if (EarlyInlineAll && !EnableFunctionCalls)
          PM.addPass(AMDGPUAlwaysInlinePass());
+
+#if __has_feature(address_sanitizer)


The host compiler support for address sanitizer is not relevant

I want to enable this new pass only when the address sanitizer feature is enabled. I found uses of "__has_feature(asan)" in the codebase for conditional compilation. Is there any other way to figure out whether asan is enabled?

There is no conditional compilation. The pass must be unconditionally run. You can skip functions inside the pass itself based on the sanitize_address function attribute on individual functions
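
To make the suggestion concrete, here is a small standalone C++ model of filtering inside the pass instead of using conditional compilation. It is only a sketch: `FunctionModel` and `functionsToLower` are hypothetical stand-ins, and in real LLVM code the per-function test would be `F.hasFnAttribute(Attribute::SanitizeAddress)` on an `llvm::Function`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an IR function. In LLVM the flag below
// corresponds to F.hasFnAttribute(Attribute::SanitizeAddress).
struct FunctionModel {
  std::string Name;
  bool HasSanitizeAddress;
};

// The pass always runs; it just skips functions that are not marked
// sanitize_address and returns the names of the ones it would lower.
std::vector<std::string>
functionsToLower(const std::vector<FunctionModel> &Functions) {
  std::vector<std::string> Selected;
  for (const FunctionModel &F : Functions) {
    if (!F.HasSanitizeAddress)
      continue; // not instrumented by asan: leave its LDS uses alone
    Selected.push_back(F.Name);
  }
  return Selected;
}
```

The decision then depends on the module being compiled, not on how the host compiler that built LLVM was configured.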

arsenm · 2024-04-02T14:50:45Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  // Sort the vector of globals or Functions based on their name.
+  // Returns a SetVector of globals/Functions.


Name should be a tie-breaker only. Sort by alignment/size?

The amdgpu-lower-module-lds pass also sorts globals by name. A consistent order of globals is required both in the offset table and when replacing the LDS globals with offsets into the new LDS global.
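
For reference, the ordering the reviewer asks about (alignment and size first, name only as a tie-breaker) could look like the standalone sketch below. This is not what the patch does, which sorts by name alone; `LdsGlobalInfo` and `sortForLayout` are hypothetical names, and "larger alignment first, then larger size" is an assumed policy.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical summary of an LDS global (not a type from the patch).
struct LdsGlobalInfo {
  std::string Name;
  uint32_t Align;
  uint32_t Size;
};

// Deterministic order: larger alignment first, then larger size, and the
// name only breaks ties so the result is stable across runs.
void sortForLayout(std::vector<LdsGlobalInfo> &Vars) {
  std::sort(Vars.begin(), Vars.end(),
            [](const LdsGlobalInfo &L, const LdsGlobalInfo &R) {
              // Align and Size are swapped between the two sides to get
              // descending order; Name is compared in ascending order.
              return std::tie(R.Align, R.Size, L.Name) <
                     std::tie(L.Align, L.Size, R.Name);
            });
}
```

Either ordering gives the consistency the offset table needs, as long as the same comparison is used when building the table and when replacing the globals.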

arsenm · 2024-04-02T14:58:45Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  for (size_t i = 0; i < NumberKernels; i++) {
+    Function *Func = Kernels[i];
+    auto &LDSParams = KernelToLDSParametersMap[Func];
+    assert(LDSParams);


There seem to be an excessive number of null asserts littered throughout the patch

Removed the extra asserts. Thanks.

arsenm · 2024-04-02T15:01:24Z

llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-indirect-access-function-param.ll

+; CHECK-NEXT:    [[TMP0:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
+; CHECK-NEXT:    [[TMP1:%.*]] = call i32 @llvm.amdgcn.workitem.id.y()
+; CHECK-NEXT:    [[TMP2:%.*]] = call i32 @llvm.amdgcn.workitem.id.z()
+; CHECK-NEXT:    [[TMP3:%.*]] = or i32 [[TMP0]], [[TMP1]]


Should try to strip the corresponding amdgpu-no-* attributes for introduced intrinsic calls

Added utility method from amdgpu-lower-module-lds pass to AMDGPUMemoryUtils and removed amdgpu-no-workitem-id-* attributes from kernels which access LDS.

github-actions · 2024-04-08T15:28:08Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm

Title is misleading. I think the implementation of the pass, and adding it to the pass pipeline should be done in separate changes

Pierre-vh

Mostly coding style nits. The coding style here differs a bit from what we usually see, so I pointed out the things that stood out to me as someone who's not in the loop with this change.

Pierre-vh · 2024-04-18T10:41:14Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+
+class AMDGPUSwLowerLDS {
+public:
+  AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)


Suggested change

AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)

AMDGPUSwLowerLDS(Module &Mod, DomTreeCallback Callback)

CamelCase

Pierre-vh · 2024-04-18T10:41:36Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  AMDGPUSwLowerLDS(Module &mod, DomTreeCallback Callback)
+      : M(mod), IRB(M.getContext()), DTCallback(Callback) {}
+  bool run();
+  void getUsesOfLDSByNonKernels(CallGraph const &CG,


Suggested change

void getUsesOfLDSByNonKernels(CallGraph const &CG,

void getUsesOfLDSByNonKernels(const CallGraph &CG,

To be consistent with the codebase.

Pierre-vh · 2024-04-18T10:42:32Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  void getUsesOfLDSByNonKernels(CallGraph const &CG,
+                                FunctionVariableMap &functions);
+  SetVector<Function *>
+  getOrderedIndirectLDSAccessingKernels(SetVector<Function *> &&Kernels);


Please document those functions, even if it's just a short comment.
It helps maintainability.

Pierre-vh · 2024-04-18T10:44:27Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+template <typename T> SetVector<T> sortByName(std::vector<T> &&V) {
+  // Sort the vector of globals or Functions based on their name.
+  // Returns a SetVector of globals/Functions.
+  llvm::sort(V.begin(), V.end(), [](const auto *L, const auto *R) {


llvm:: is not needed, I think.
I also think you can just do llvm::sort(V, ..) ?

Pierre-vh · 2024-04-18T10:56:14Z

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

+      set_union(transitive_map_function[&Func], direct_map_function[F]);
+
+      for (const CallGraphNode::CallRecord &R : *CG[F]) {
+        Function *ith = R.second->getFunction();


Updated prerequisite PR #88002, which has the AMDGPUMemoryUtils changes. The changes here will be removed once #88002 is merged.

Pierre-vh · 2024-04-18T10:56:20Z

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

+
+  // direct_map_kernel lists which variables are used by the kernel
+  // find the variables which are used through a function call
+  FunctionVariableMap indirect_map_kernel;


Updated prerequisite PR #88002, which has the AMDGPUMemoryUtils changes. The changes here will be removed once #88002 is merged.

Pierre-vh · 2024-04-18T10:56:31Z

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

+      continue;
+
+    for (const CallGraphNode::CallRecord &R : *CG[&Func]) {
+      Function *ith = R.second->getFunction();


Updated prerequisite PR #88002, which has the AMDGPUMemoryUtils changes. The changes here will be removed once #88002 is merged.

Pierre-vh · 2024-04-18T10:56:50Z

llvm/lib/Target/AMDGPU/Utils/AMDGPUMemoryUtils.cpp

+                               StringRef FnAttr) {
+  KernelRoot->removeFnAttr(FnAttr);
+
+  SmallVector<Function *> WorkList({CG[KernelRoot]->getFunction()});


= to assign

Pierre-vh · 2024-04-18T10:57:49Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+
+#include <algorithm>
+
+#define DEBUG_TYPE "amdgpu-sw-lower-lds"


nit: is it possible to add some LLVM_DEBUG output to this pass?
It greatly helps debug eventual issues

Added a few debug outputs while replacing the LDS accesses. Thanks for the suggestion.

arsenm

Needs a rebase, hard to see over the moved code patch

arsenm · 2024-05-08T16:57:56Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  //{StartOffset, AlignedSizeInBytes}
+  SmallString<128> MDItemStr;
+  raw_svector_ostream MDItemOS(MDItemStr);
+  MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.item";


Suggested change

MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.item";

MDItemOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md.item";

arsenm · 2024-05-08T16:58:49Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+    auto MallocSizeCalcLambda =
+        [&](SetVector<GlobalVariable *> &DynamicLDSGlobals) {


Make this a regular helper function?

arsenm · 2024-05-08T16:59:16Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+    Value *ImplicitArg =
+        IRB.CreateIntrinsic(Intrinsic::amdgcn_implicitarg_ptr, {}, {});
+    Value *HiddenDynLDSSize = IRB.CreateInBoundsGEP(
+        ImplicitArg->getType(), ImplicitArg, {IRB.getInt32(15)});


Don't understand where the hardcoded 15 came from. There are various ConstInBoundsGEPs for this case too

These should also use 64-bit indexes, this is canonically a 64-bit address space. Can we use an enum or something more structured to access the ABI location? I'm assuming this is assuming COV5?
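
A rough sketch of what "an enum or something more structured" could look like, in standard C++. All names are hypothetical. The byte offset 120 is an assumption inferred only from the snippet above (index 15 over 8-byte, pointer-sized elements); it has not been checked against the code object V5 ABI.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the offset is an assumption, not copied from LLVM headers.
namespace implicit_arg_sketch {
enum ByteOffset : std::uint64_t {
  HiddenDynamicLDSSize = 120, // Assumed: 15 * 8 bytes, as implied by the snippet.
};
} // namespace implicit_arg_sketch

static_assert(implicit_arg_sketch::HiddenDynamicLDSSize == 15 * 8,
              "named offset should match the magic index it replaces");

// Address arithmetic is done in 64 bits, matching a 64-bit address space.
constexpr std::uint64_t implicitArgAddress(std::uint64_t Base,
                                           implicit_arg_sketch::ByteOffset Off) {
  return Base + static_cast<std::uint64_t>(Off);
}
```

A named byte offset makes the ABI assumption visible at the use site and gives one place to update if the layout differs between code object versions.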

arsenm · 2024-05-08T16:59:35Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+
+    auto *GEPForEndStaticLDSSize = IRB.CreateInBoundsGEP(
+        MetadataStructType, SwLDSMetadata,
+        {IRB.getInt32(0), IRB.getInt32(NumStaticLDS - 1), IRB.getInt32(2)});


Use the Const* variants to hide all the getInt32s away

arsenm · 2024-05-08T17:17:37Z

llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-static-lds-test.ll

@@ -0,0 +1,58 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals all --version 4
+; RUN: opt < %s -passes=amdgpu-sw-lower-lds -S -mtriple=amdgcn-- | FileCheck %s


Should specifically use amdhsa triples for these tests

arsenm · 2024-05-09T15:34:55Z

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

-  /// Strip "amdgpu-no-lds-kernel-id" from any functions where we may have
-  /// introduced its use. If AMDGPUAttributor ran prior to the pass, we inferred
-  /// the lack of llvm.amdgcn.lds.kernel.id calls.
-  void removeNoLdsKernelIdFromReachable(CallGraph &CG, Function *KernelRoot) {


Is this rebased on main? This deletion should have already been merged when the code was moved to AMDGPUMemoryUtils?

Rebased and updated in latest commits.

PR #92686 raised to remove this change.

arsenm · 2024-05-09T15:36:45Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+
+  SmallString<128> MDTypeStr;
+  raw_svector_ostream MDTypeOS(MDTypeStr);
+  MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.type";


Suggested change

MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md.type";

MDTypeOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md.type";

another one

arsenm · 2024-05-09T15:36:55Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+      StructType::create(Ctx, Items, MDTypeOS.str());
+  SmallString<128> MDStr;
+  raw_svector_ostream MDOS(MDStr);
+  MDOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md";


Suggested change

MDOS << "llvm.amdgcn.sw.lds." << Func->getName().str() << ".md";

MDOS << "llvm.amdgcn.sw.lds." << Func->getName() << ".md";

arsenm · 2024-05-09T15:37:36Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+      Value *BasePlusOffset =
+          IRB.CreateInBoundsGEP(IRB.getInt8Ty(), SwLDS, {Load});
+      LLVM_DEBUG(dbgs() << "Sw LDS Lowering, Replacing LDS "
+                        << GV->getName().str());


Suggested change

<< GV->getName().str());

<< GV->getName());

arsenm · 2024-05-09T15:38:30Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+
+  ReplaceKernelLDSAccesses(Func);
+
+  auto *CondFreeBlock = BasicBlock::Create(Ctx, "CondFree", Func);


Presumably the runtime has to manage cleanup of anything that happened in the kernel?

arsenm · 2024-05-09T15:39:08Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  // Replace LDS access in non-kernel with replacement queried from
+  // Base table and offset from offset table.
+  LLVM_DEBUG(dbgs() << "Sw LDS lowering, lower non-kernel access for : "
+                    << Func->getName().str());


Suggested change

<< Func->getName().str());

<< Func->getName());

You should almost never need to convert to std::string
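
The same idea in standard C++, as an analogy only: `std::string_view` plays the role of a non-owning name reference here, and `makeMetadataName` is a hypothetical helper. The view is streamed directly, without first copying it into a temporary `std::string`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper; builds the ".md" name used in the snippets above.
std::string makeMetadataName(std::string_view FuncName) {
  std::ostringstream OS;
  // The view is written as-is; no intermediate std::string is created.
  OS << "llvm.amdgcn.sw.lds." << FuncName << ".md";
  return OS.str();
}
```

For example, `makeMetadataName("k0")` yields `llvm.amdgcn.sw.lds.k0.md`.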

arsenm · 2024-05-09T15:39:19Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+    Value *BasePlusOffset =
+        IRB.CreateInBoundsGEP(IRB.getInt8Ty(), BasePtr, {OffsetLoad});
+    LLVM_DEBUG(dbgs() << "Sw LDS Lowering, Replace non-kernel LDS for "
+                      << GV->getName().str());


Suggested change

<< GV->getName().str());

<< GV->getName());

arsenm · 2024-05-09T15:40:16Z

llvm/test/CodeGen/AMDGPU/amdgpu-sw-lower-lds-dynamic-indirect-access.ll

@@ -0,0 +1,100 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 4


These should use --check-globals since that's most of the point of the pass

The --check-globals command-line option does update the tests with globals. However, some of the resulting tests fail when run, reporting a missing closing bracket ']', as in the example below. So I have updated only the tests whose globals checks do not trigger this error.

@llvm.amdgcn.sw.lds.offset.table = internal addrspace(4) constant [2 x [4 x i32]] [[4 x i32] [i32 ptrtoint (ptr addrspace(1) @llvm.amdgcn.sw.lds.k0.md to i32), i32 poison, ..

arsenm · 2024-05-23T18:45:45Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+      removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-x");
+      removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-y");
+      removeFnAttrFromReachable(CG, Func, "amdgpu-no-workitem-id-z");


These could all be removed in one CallGraph walk instead of 3 separate ones

Currently removeFnAttrFromReachable accepts a single StringRef argument. It needs to be changed to accept an array of StringRefs.

Raised #94188.
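
A sketch of the single-walk idea on a toy call graph, in standard C++. `ToyModule` and `removeAttrsFromReachable` are hypothetical; this is not LLVM's CallGraph API, and not necessarily the signature adopted in #94188.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy module: function name -> callees, function name -> attributes.
struct ToyModule {
  std::map<std::string, std::vector<std::string>> Callees;
  std::map<std::string, std::set<std::string>> Attrs;
};

// Visit each function reachable from Root once and drop every listed attribute.
void removeAttrsFromReachable(ToyModule &M, const std::string &Root,
                              const std::vector<std::string> &ToRemove) {
  std::set<std::string> Visited;
  std::vector<std::string> WorkList;
  Visited.insert(Root);
  WorkList.push_back(Root);
  while (!WorkList.empty()) {
    std::string F = WorkList.back();
    WorkList.pop_back();
    for (const std::string &A : ToRemove)
      M.Attrs[F].erase(A);
    for (const std::string &Callee : M.Callees[F])
      if (Visited.insert(Callee).second) // Only queue functions not seen yet.
        WorkList.push_back(Callee);
  }
}
```

Passing all three attribute names in one call means the reachable set is computed once instead of three times.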

arsenm · 2024-05-23T18:46:07Z

llvm/lib/Target/AMDGPU/AMDGPUSwLowerLDS.cpp

+  };
+  bool IsChanged = false;
+  AMDGPUSwLowerLDS SwLowerLDSImpl(M, DTCallback);
+  IsChanged |= SwLowerLDSImpl.run();


Can just define isChanged here

skc7 requested a review from JonChesterfield April 1, 2024 17:07

llvmbot added the backend:AMDGPU label Apr 1, 2024

skc7 requested review from arsenm, b-sumner and ampandey-1995 April 1, 2024 17:07

arsenm reviewed Apr 2, 2024

JanekvO reviewed Apr 3, 2024

skc7 mentioned this pull request Apr 8, 2024

[AMDGPU] Move LDS utilities from amdgpu-lower-module-lds pass to AMDGPUMemoryUtils #88002

Merged

arsenm reviewed Apr 17, 2024

skc7 force-pushed the skc7/sw_lower_lds branch from 98d5f94 to e60eb97 Compare April 18, 2024 10:30

skc7 changed the title ~~[AMDGPU] Enable amdgpu-sw-lower-lds pass to lower LDS accesses to use device global memory~~ [AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses to use device global memory. Apr 18, 2024

skc7 mentioned this pull request Apr 18, 2024

[AMDGPU] Enable "amdgpu-sw-lower-lds" pass in pipeline. #89206

Open

Pierre-vh reviewed Apr 18, 2024

arsenm requested changes May 8, 2024

arsenm reviewed May 8, 2024

skc7 added a commit to skc7/llvm-project that referenced this pull request May 9, 2024

[AMDGPU] Enable amdgpu-sw-lower-lds pass to lower LDS accesses to use…

9d9e023

… device global memory. (llvm#87265)

skc7 force-pushed the skc7/sw_lower_lds branch from f01647b to 9d9e023 Compare May 9, 2024 11:25

arsenm reviewed May 9, 2024

skc7 added a commit to skc7/llvm-project that referenced this pull request May 10, 2024

[AMDGPU] Enable amdgpu-sw-lower-lds pass to lower LDS accesses to use…

8c0acc9

… device global memory. (llvm#87265)

skc7 force-pushed the skc7/sw_lower_lds branch from 9d9e023 to f2f4138 Compare May 10, 2024 17:28

skc7 force-pushed the skc7/sw_lower_lds branch from 2dc9064 to 6ad1fc2 Compare May 19, 2024 12:16

skc7 added a commit to skc7/llvm-project that referenced this pull request May 20, 2024

[AMDGPU] Introduce amdgpu-sw-lower-lds pass to lower LDS accesses. (l…

14cada9

…lvm#87265)

skc7 force-pushed the skc7/sw_lower_lds branch from 6ad1fc2 to 14cada9 Compare May 20, 2024 05:49

skc7 changed the title ~~[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses to use device global memory.~~ [AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. May 20, 2024

skc7 mentioned this pull request May 23, 2024

[AMDGPU] Introduce address sanitizer instrumentation for LDS lowered by amdgpu-sw-lower-lds pass #89208

Open

arsenm reviewed May 23, 2024

[AMDGPU] Introduce amdgpu-sw-lower-lds pass to lower LDS accesses. (l…

0acc2dc

…lvm#87265)

skc7 added 2 commits June 7, 2024 09:54

[AMDGPU] Changes as per review comments:1

f70c053

[AMDGPU] Update removeFnAttrFromReachable

f794745

skc7 force-pushed the skc7/sw_lower_lds branch from bac7c1a to f794745 Compare June 7, 2024 05:19

