
[AMDGPU] Add IR-level pass to rewrite away address space 7 #77952

Merged
merged 1 commit into llvm:main, Mar 6, 2024

Conversation

krzysz00
Contributor

This commit adds the -amdgpu-lower-buffer-fat-pointers pass, which is applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type ptr addrspace(7) from incoming IR. This must be done at the LLVM IR level because ptr addrspace(7), as a 160-bit primitive type, cannot be correctly handled by SelectionDAG.

The detailed operation of the pass is described in comments, but, in summary, the removal proceeds by:

  1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of i160 (including vectors and aggregates). This is needed because the in-register representation of these pointers will stop matching their in-memory representation in step 2, and so ptrtoint/inttoptr operations are used to preserve the expected memory layout.

  2. Mutating the IR to replace all occurrences of ptr addrspace(7) with the type {ptr addrspace(8), i32}, which makes the two parts of a buffer fat pointer (the 128-bit address space 8 resource and the 32-bit offset) visible in the IR. This also impacts the argument and return types of functions.

  3. Splitting the resource and offset parts. All instructions that produce or consume buffer fat pointers (like GEP or load) are rewritten to produce or consume the resource and offset parts separately. For example, GEP updates the offset part of the result and a load uses the resource and offset parts to populate the relevant llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are replaced by their new split counterparts, ensuring no invalidly-typed IR escapes this pass. (For operations like call, where the struct form is needed, insertvalue operations are inserted.)
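As a rough end-to-end sketch (the value names here are illustrative, not taken from the patch), a load through a buffer fat pointer such as

  %y = getelementptr i32, ptr addrspace(7) %x, i32 1
  %v = load i32, ptr addrspace(7) %y

would, after the steps above, become approximately

  %y.off = add i32 %x.off, 4
  %v = call i32 @llvm.amdgcn.raw.ptr.buffer.load.i32(ptr addrspace(8) %x.rsrc, i32 %y.off, i32 0, i32 0)

with the resource part %x.rsrc carried through unchanged and only the 32-bit offset updated.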

Compared to LGC's PatchBufferOp (https://github.com/GPUOpen-Drivers/llpc/blob/32cda89776980202597d5bf4ed4447a1bae64047/lgc/patch/PatchBufferOp.cpp), this pass

  • Also handles vectors of ptr addrspace(7)s
  • Also handles function boundaries
  • Includes the same uniform buffer optimization for loops and conditionals
  • Does not handle memcpy() and friends (this is future work)
  • Does not break up large loads and stores into smaller parts. This should be handled by extending the legalization
    of *.buffer.{load,store} to handle larger types by producing multiple instructions (the same way ordinary LOAD and STORE are legalized). That work is planned for a followup commit.
  • Does not have special logic for handling divergent buffer descriptors. The logic in LGC is, as far as I can tell, incorrect in general, and, per discussions with @nhaehnle, isn't widely used. Therefore, divergent descriptors are handled with waterfall loops later in legalization.

As a final matter, this commit updates atomic expansion to treat buffer operations analogously to global ones.
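For instance (a sketch, not taken from the patch), an atomic read-modify-write such as

  %old = atomicrmw add ptr addrspace(7) %p, i32 %v seq_cst

is lowered by this pass to the corresponding buffer intrinsic, roughly

  %old = call i32 @llvm.amdgcn.raw.ptr.buffer.atomic.add.i32(i32 %v, ptr addrspace(8) %p.rsrc, i32 %p.off, i32 0, i32 0)

while the atomic-expansion change decides, just as for global pointers, which RMW operations must instead be expanded (for example, into compare-and-swap loops) before this lowering runs.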

(One question for reviewers: is the new pass in the right place? Should it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463

@llvmbot
Collaborator

llvmbot commented Jan 12, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Krzysztof Drewniak (krzysz00)

Changes

(Same as the pull request description above.)


Patch is 204.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/77952.diff

16 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+13)
  • (added) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+1981)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+16)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+9-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-non-integral-address-spaces-vectors.ll (+69-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-non-integral-address-spaces.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+211-191)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-calls.ll (+114)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-constants.ll (+220)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-control-flow.ll (+354)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-memops.ll (+171)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-p7-in-memory.ll (+154)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-pointer-ops.ll (+407)
  • (added) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-unoptimized-debug-data.ll (+134)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/BUILD.gn (+1)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 35d33cb60bc47c..266bb4fb3a643a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -59,6 +59,7 @@ FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
 ModulePass *
 createAMDGPULowerModuleLDSLegacyPass(const AMDGPUTargetMachine *TM = nullptr);
+ModulePass *createAMDGPULowerBufferFatPointersPass();
 FunctionPass *createSIModeRegisterPass();
 FunctionPass *createGCNPreRAOptimizationsPass();
 
@@ -136,6 +137,18 @@ struct AMDGPULowerModuleLDSPass : PassInfoMixin<AMDGPULowerModuleLDSPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };
 
+void initializeAMDGPULowerBufferFatPointersPass(PassRegistry &);
+extern char &AMDGPULowerBufferFatPointersID;
+
+struct AMDGPULowerBufferFatPointersPass
+    : PassInfoMixin<AMDGPULowerBufferFatPointersPass> {
+  AMDGPULowerBufferFatPointersPass(const TargetMachine &TM) : TM(TM) {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+
+private:
+  const TargetMachine &TM;
+};
+
 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
 extern char &AMDGPURewriteOutArgumentsID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
new file mode 100644
index 00000000000000..ada6c0c894a893
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
@@ -0,0 +1,1981 @@
+//===-- AMDGPULowerBufferFatPointers.cpp ----------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This pass lowers operations on buffer fat pointers (addrspace 7) to
+// operations on buffer resources (addrspace 8) and is needed for correct
+// codegen.
+//
+// # Background
+//
+// Address space 7 (the buffer fat pointer) is a 160-bit pointer that consists
+// of a 128-bit buffer descriptor and a 32-bit offset into that descriptor.
+// The buffer resource part needs to be a "raw" buffer resource
+// (it must have a stride of 0 and bounds checks must be in raw buffer mode
+// or disabled).
+//
+// When these requirements are met, a buffer resource can be treated as a
+// typical (though quite wide) pointer that follows typical LLVM pointer
+// semantics. This allows the frontend to reason about such buffers (which are
+// often encountered in the context of SPIR-V kernels).
+//
+// However, because of their non-power-of-2 size, these fat pointers cannot be
+// present during translation to MIR (though this restriction may be lifted
+// during the transition to GlobalISel). Therefore, this pass is needed in order
+// to correctly implement these fat pointers.
+//
+// The resource intrinsics take the resource part (the address space 8 pointer)
+// and the offset part (the 32-bit integer) as separate arguments. In addition,
+// many users of these buffers manipulate the offset while leaving the resource
+// part alone. For these reasons, we want to typically separate the resource
+// and offset parts into separate variables, but combine them together when
+// encountering cases where this is required, such as by inserting these values
+// into aggregates or moving them to memory.
+//
+// Therefore, at a high level, `ptr addrspace(7) %x` becomes `ptr addrspace(8)
+// %x.rsrc` and `i32 %x.off`, which will be combined into `{ptr addrspace(8),
+// i32} %x = {%x.rsrc, %x.off}` if needed. Similarly, `vector<Nxp7>` becomes
+// `{vector<Nxp8>, vector<Nxi32>}` and its component parts.
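+//
+// As an illustrative sketch (value names here are hypothetical), this means
+//   %y = getelementptr i32, ptr addrspace(7) %x, i32 4
+// is conceptually rewritten to
+//   %y.rsrc = %x.rsrc               ; resource part passes through unchanged
+//   %y.off = add i32 %x.off, 16     ; offset advances by 4 * sizeof(i32)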
+//
+// # Implementation
+//
+// This pass proceeds in three main phases:
+//
+// ## Rewriting loads and stores of p7
+//
+// The first phase is to rewrite away all loads and stores of `ptr addrspace(7)`,
+// including aggregates containing such pointers, to ones that use `i160`. This
+// is handled by `StoreFatPtrsAsIntsVisitor`, which visits loads, stores, and
+// allocas and, if the loaded or stored type contains `ptr addrspace(7)`,
+// rewrites that type to one where the p7s are replaced by i160s, copying other
+// parts of aggregates as needed. In the case of a store, each pointer is
+// `ptrtoint`d to i160 before storing, and loaded integers are `inttoptr`d back.
+// This same transformation is applied to vectors of pointers.
+//
+// Such a transformation allows the later phases of the pass to not need
+// to handle buffer fat pointers moving to and from memory, where we would
+// have to handle the incompatibility between a `{Nxp8, Nxi32}` representation
+// and `Nxi160` directly. Instead, that transposing action (where the vectors
+// of resources and vectors of offsets are concatenated before being stored to
+// memory) is handled through implementing `inttoptr` and `ptrtoint` only.
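+//
+// For example (an illustrative sketch), a store of a vector of fat pointers
+//   store <2 x ptr addrspace(7)> %vs, ptr %slot
+// becomes
+//   %vs.int = ptrtoint <2 x ptr addrspace(7)> %vs to <2 x i160>
+//   store <2 x i160> %vs.int, ptr %slot
+// in this first phase, before any type remapping takes place.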
+//
+// Atomic operations on `ptr addrspace(7)` values are not supported, as the
+// hardware does not include a 160-bit atomic.
+//
+// ## Type remapping
+//
+// We use a `ValueMapper` to mangle uses of [vectors of] buffer fat pointers
+// to the corresponding struct type, which has a resource part and an offset
+// part.
+//
+// This uses a `BufferFatPtrToStructTypeMap` and a `FatPtrConstMaterializer`
+// to perform the remapping, usually by way of `setType`ing values. Constants
+// are handled here because there isn't a good way to fix them up later.
+//
+// This has the downside of leaving the IR in an invalid state (for example,
+// the instruction `getelementptr {ptr addrspace(8), i32} %p, ...` will exist),
+// but all such invalid states will be resolved by the third phase.
+//
+// Functions that don't take buffer fat pointers are modified in place. Those
+// that do take such pointers have their basic blocks moved to a new function
+// with {ptr addrspace(8), i32} arguments and return values.
+// This phase also records intrinsics so that they can be remangled or deleted
+// later.
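+//
+// For example (an illustrative sketch), a function such as
+//   define ptr addrspace(7) @f(ptr addrspace(7) %p)
+// is recreated as
+//   define {ptr addrspace(8), i32} @f({ptr addrspace(8), i32} %p)
+// with the original basic blocks moved into the new function.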
+//
+//
+// ## Splitting pointer structs
+//
+// The meat of this pass consists of defining semantics for operations that
+// produce or consume [vectors of] buffer fat pointers in terms of their
+// resource and offset parts. This is accomplished through the `SplitPtrStructs`
+// visitor.
+//
+// In the first pass through each function that is being lowered, the splitter
+// inserts new instructions to implement the split-structures behavior, which is
+// needed for correctness and performance. It records a list of "split users",
+// instructions that are being replaced by operations on the resource and offset
+// parts.
+//
+// Split users do not necessarily need to produce parts themselves (
+// a `load float, ptr addrspace(7)` does not, for example), but, if they do not
+// generate fat buffer pointers, they must RAUW in their replacement
+// instructions during the initial visit.
+//
+// When these new instructions are created, they use the split parts recorded
+// for their initial arguments in order to generate their replacements, creating
+// a parallel set of instructions that does not refer to the original fat
+// pointer values but instead to their resource and offset components.
+//
+// Instructions, such as `extractvalue`, that produce buffer fat pointers from
+// sources that do not have split parts, have such parts generated using
+// `extractvalue`. This is also the initial handling of PHI nodes, which
+// are then cleaned up.
+//
+// ### Conditionals
+//
+// PHI nodes are initially given resource parts via `extractvalue`. However,
+// this is not an efficient rewrite of such nodes, as, in most cases, the
+// resource part in a conditional or loop remains constant throughout the loop
+// and only the offset varies. Failing to optimize away these constant resources
+// would cause additional registers to be sent around loops and might lead to
+// waterfall loops being generated for buffer operations due to the
+// "non-uniform" resource argument.
+//
+// Therefore, after all instructions have been visited, the pointer splitter
+// post-processes all encountered conditionals. Given a PHI node or select,
+// getPossibleRsrcRoots() collects all values that the resource parts of that
+// conditional's input could come from as well as collecting all conditional
+// instructions encountered during the search. If, after filtering out the
+// initial node itself, the set of encountered conditionals is a subset of the
+// potential roots and there is a single potential resource that isn't in the
+// conditional set, that value is the only possible value the resource argument
+// could have throughout the control flow.
+//
+// If that condition is met, then a PHI node can have its resource part changed
+// to the singleton value and then be replaced by a PHI on the offsets.
+// Otherwise, each PHI node is split into two, one for the resource part and one
+// for the offset part, which replace the temporary `extractvalue` instructions
+// that were added during the first pass.
+//
+// Similar logic applies to `select`, where
+// `%z = select i1 %cond, ptr addrspace(7) %x, ptr addrspace(7) %y`
+// can be split into `%z.rsrc = %x.rsrc` and
+// `%z.off = select i1 %cond, i32 %x.off, i32 %y.off`
+// if both `%x` and `%y` have the same resource part, but two `select`
+// operations will be needed if they do not.
+//
+// ### Final processing
+//
+// After conditionals have been cleaned up, the IR for each function is
+// rewritten to remove all the old instructions that have been split up.
+//
+// Any instruction that used to produce a buffer fat pointer (and therefore now
+// produces a resource-and-offset struct after type remapping) is
+// replaced as follows:
+// 1. All debug value annotations are cloned to reflect that the resource part
+//    and offset parts are computed separately and constitute different
+//    fragments of the underlying source language variable.
+// 2. All uses that were themselves split are replaced by a `poison` of the
+//    struct type, as they will themselves be erased soon. This rule, combined
+//    with debug handling, should leave the use lists of split instructions
+//    empty in almost all cases.
+// 3. If a user of the original struct-valued result remains, the structure
+//    needed for the new types to work is constructed out of the newly-defined
+//    parts, and the original instruction is replaced by this structure
+//    before being erased. Instructions requiring this construction include
+//    `ret` and `insertvalue`.
+//
+// # Consequences
+//
+// This pass does not alter the CFG.
+//
+// Alias analysis information will become coarser, as the LLVM alias analyzer
+// cannot handle the buffer intrinsics. Specifically, while we can determine
+// that the following two loads do not alias:
+// ```
+//   %y = getelementptr i32, ptr addrspace(7) %x, i32 1
+//   %a = load i32, ptr addrspace(7) %x
+//   %b = load i32, ptr addrspace(7) %y
+// ```
+// we cannot (except through some code that runs during scheduling) determine
+// that the rewritten loads below do not alias.
+// ```
+//   %y.off = add i32 %x.off, 1
+//   %a = call @llvm.amdgcn.raw.ptr.buffer.load(ptr addrspace(8) %x.rsrc, i32
+//     %x.off, ...)
+//   %b = call @llvm.amdgcn.raw.ptr.buffer.load(ptr addrspace(8)
+//     %x.rsrc, i32 %y.off, ...)
+// ```
+// However, existing alias information is preserved.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIDefines.h"
+#include "llvm/ADT/SetOperations.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/ConstantFolding.h"
+#include "llvm/CodeGen/TargetPassConfig.h"
+#include "llvm/IR/AttributeMask.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/DebugInfo.h"
+#include "llvm/IR/DerivedTypes.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/InstIterator.h"
+#include "llvm/IR/InstVisitor.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
+#include "llvm/IR/Metadata.h"
+#include "llvm/IR/Operator.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Pass.h"
+#include "llvm/Support/AtomicOrdering.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/ErrorHandling.h"
+#include "llvm/Transforms/Utils/Cloning.h"
+#include "llvm/Transforms/Utils/Local.h"
+#include "llvm/Transforms/Utils/ValueMapper.h"
+
+#define DEBUG_TYPE "amdgpu-lower-buffer-fat-pointers"
+
+using namespace llvm;
+
+static constexpr unsigned BufferOffsetWidth = 32;
+
+namespace {
+/// Recursively replace instances of ptr addrspace(7) and vector<Nxptr
+/// addrspace(7)> with some other type as defined by the relevant subclass.
+class BufferFatPtrTypeLoweringBase : public ValueMapTypeRemapper {
+  DenseMap<Type *, Type *> Map;
+
+  Type *remapTypeImpl(Type *Ty, SmallPtrSetImpl<StructType *> &Seen);
+
+protected:
+  virtual Type *remapScalar(PointerType *PT) = 0;
+  virtual Type *remapVector(VectorType *VT) = 0;
+
+  const DataLayout &DL;
+
+public:
+  BufferFatPtrTypeLoweringBase(const DataLayout &DL) : DL(DL) {}
+  Type *remapType(Type *SrcTy) override;
+  void clear() { Map.clear(); }
+};
+
+/// Remap ptr addrspace(7) to i160 and vector<Nxptr addrspace(7)> to
+/// vector<Nxi160> in order to correctly handle loading/storing these values
+/// from memory.
+class BufferFatPtrToIntTypeMap : public BufferFatPtrTypeLoweringBase {
+  using BufferFatPtrTypeLoweringBase::BufferFatPtrTypeLoweringBase;
+
+protected:
+  Type *remapScalar(PointerType *PT) override { return DL.getIntPtrType(PT); }
+  Type *remapVector(VectorType *VT) override { return DL.getIntPtrType(VT); }
+};
+
+/// Remap ptr addrspace(7) to {ptr addrspace(8), i32} (the resource and offset
+/// parts of the pointer) so that we can easily rewrite operations on these
+/// values that aren't loading them from or storing them to memory.
+class BufferFatPtrToStructTypeMap : public BufferFatPtrTypeLoweringBase {
+  using BufferFatPtrTypeLoweringBase::BufferFatPtrTypeLoweringBase;
+
+protected:
+  Type *remapScalar(PointerType *PT) override;
+  Type *remapVector(VectorType *VT) override;
+};
+} // namespace
+
+// This code is adapted from the type remapper in lib/Linker/IRMover.cpp
+Type *BufferFatPtrTypeLoweringBase::remapTypeImpl(
+    Type *Ty, SmallPtrSetImpl<StructType *> &Seen) {
+  Type **Entry = &Map[Ty];
+  if (*Entry)
+    return *Entry;
+  if (auto *PT = dyn_cast<PointerType>(Ty)) {
+    if (PT->getAddressSpace() == AMDGPUAS::BUFFER_FAT_POINTER) {
+      return *Entry = remapScalar(PT);
+    }
+  }
+  if (auto *VT = dyn_cast<VectorType>(Ty)) {
+    auto *PT = dyn_cast<PointerType>(VT->getElementType());
+    if (PT && PT->getAddressSpace() == AMDGPUAS::BUFFER_FAT_POINTER) {
+      return *Entry = remapVector(VT);
+    }
+    return *Entry = Ty;
+  }
+  // Whether the type is one that is structurally uniqued - that is, any type
+  // other than a named struct (the only kind of type where multiple
+  // structurally identical types can have distinct `Type*`s)
+  StructType *TyAsStruct = dyn_cast<StructType>(Ty);
+  bool IsUniqued = !TyAsStruct || TyAsStruct->isLiteral();
+  // Base case for ints, floats, opaque pointers, and so on, which don't
+  // require recursion.
+  if (Ty->getNumContainedTypes() == 0 && IsUniqued)
+    return *Entry = Ty;
+  if (!IsUniqued) {
+    // Create a dummy type for recursion purposes.
+    if (!Seen.insert(TyAsStruct).second) {
+      StructType *Placeholder = StructType::create(Ty->getContext());
+      return *Entry = Placeholder;
+    }
+  }
+  bool Changed = false;
+  SmallVector<Type *> ElementTypes;
+  ElementTypes.reserve(Ty->getNumContainedTypes());
+  for (unsigned int I = 0, E = Ty->getNumContainedTypes(); I < E; ++I) {
+    Type *OldElem = Ty->getContainedType(I);
+    Type *NewElem = remapTypeImpl(OldElem, Seen);
+    ElementTypes.push_back(NewElem);
+    Changed |= (OldElem != NewElem);
+  }
+  if (!Changed) {
+    return *Entry = Ty;
+  }
+  if (auto *ArrTy = dyn_cast<ArrayType>(Ty))
+    return *Entry = ArrayType::get(ElementTypes[0], ArrTy->getNumElements());
+  if (auto *FnTy = dyn_cast<FunctionType>(Ty))
+    return *Entry = FunctionType::get(ElementTypes[0],
+                                      ArrayRef(ElementTypes).slice(1),
+                                      FnTy->isVarArg());
+  if (auto *STy = dyn_cast<StructType>(Ty)) {
+    // Genuine opaque types don't have a remapping.
+    if (STy->isOpaque())
+      return *Entry = Ty;
+    bool IsPacked = STy->isPacked();
+    if (IsUniqued)
+      return *Entry = StructType::get(Ty->getContext(), ElementTypes, IsPacked);
+    SmallString<16> Name(STy->getName());
+    STy->setName("");
+    Type **RecursionEntry = &Map[Ty];
+    if (*RecursionEntry) {
+      auto *Placeholder = cast<StructType>(*RecursionEntry);
+      Placeholder->setBody(ElementTypes, IsPacked);
+      Placeholder->setName(Name);
+      return *Entry = Placeholder;
+    }
+    return *Entry = StructType::create(Ty->getContext(), ElementTypes, Name,
+                                       IsPacked);
+  }
+  llvm_unreachable("Unknown type of type that contains elements");
+}
+
+Type *BufferFatPtrTypeLoweringBase::remapType(Type *SrcTy) {
+  SmallPtrSet<StructType *, 2> Visited;
+  return remapTypeImpl(SrcTy, Visited);
+}
+
+Type *BufferFatPtrToStructTypeMap::remapScalar(PointerType *PT) {
+  LLVMContext &Ctx = PT->getContext();
+  return StructType::get(PointerType::get(Ctx, AMDGPUAS::BUFFER_RESOURCE),
+                         IntegerType::get(Ctx, BufferOffsetWidth));
+}
+
+Type *BufferFatPtrToStructTypeMap::remapVector(VectorType *VT) {
+  ElementCount EC = VT->getElementCount();
+  LLVMContext &Ctx = VT->getContext();
+  Type *RsrcVec =
+      VectorType::get(PointerType::get(Ctx, AMDGPUAS::BUFFER_RESOURCE), EC);
+  Type *OffVec = VectorType::get(IntegerType::get(Ctx, BufferOffsetWidth), EC);
+  return StructType::get(RsrcVec, OffVec);
+}
+
+static bool isBufferFatPtrOrVector(Type *Ty) {
+  if (auto *PT = dyn_cast<PointerType>(Ty->getScalarType()))
+    return PT->getAddressSpace() == AMDGPUAS::BUFFER_FAT_POINTER;
+  return false;
+}
+
+// True if the type is {ptr addrspace(8), i32} or a struct containing vectors of
+// those types. Used to quickly skip instructions we don't need to process.
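+// For example, {ptr addrspace(8), i32} and {<2 x ptr addrspace(8)>, <2 x i32>}
+// are split fat pointers by this definition, while {ptr addrspace(8), i64} is
+// not.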
+static bool isSplitFatPtr(Type *Ty) {
+  auto *ST = dyn_cast<StructType>(Ty);
+  if (!ST)
+    return false;
+  if (!ST->isLiteral() || ST->getNumElements() != 2)
+    return false;
+  auto *MaybeRsrc =
+      dyn_cast<PointerType>(ST->getElementType(0)->getScalarType());
+  auto *MaybeOff =
+      dyn_cast<IntegerType>(ST->getElementType(1)->getScalarType());
+  return MaybeRsrc && MaybeOff &&
+         MaybeRsrc->getAddressSpace() == AMDGPUAS::BUFFER_RESOURCE &&
+         MaybeOff->getBitWidth() == BufferOffsetWidth;
+}
+
+// True if the constant's type or any of its operands' types are buffer fat pointers.
+static bool isBufferFatPtrConst(Constant *C) {
+  Type *T = C->getType();
+  return isBufferFatPtrOrVector(T) ||
+         llvm::any_of(C->operands(), [](const Use &U) {
+           return isBufferFatPtrOrVector(U.get()->getType());
+         });
+}
+
+namespace {
+/// Convert [vectors of] buffer fat pointers to integers when they are read from
+/// or stored to memory. This ensures that these pointers will have the same
+/// memory layout as before they are lowered, even though they will no longer
+/// have their previous layout in registers/in the program (they'll be broken
+/// down into resource and offset parts). This has the downside of imposing
+/// marshalling costs when reading or storing these values, but since placing
+/// such pointers into memory is an uncommon operation at best, we feel that
+/// this cost is acceptable for better performance in the common case.
+class StoreFatPtrsAsIntsVisitor
+    : public InstVisitor<StoreFatPtrsAsIntsVisitor, bool> {
+  BufferFatPtrToIntTypeMap *TypeMap;
+
+  ValueToValueMapTy ConvertedForStore;
+
+  IRBuilder<> IRB;
+
+  // Convert all the buffer fat pointers within the input value to integers
+  // so that it can be stored in memory.
+  Value *fatPtrsToInts(Value *V, Type *From, Type *To, const Twine &Name);
+  // Convert all the i160s that need to be buffer fat pointers (as ...
[truncated]

@krzysz00
Contributor Author

This is the address space 7 work ported out from Phabricator, with changes made to reflect #77847.

Landing this is annoyingly blocked on the resource-usage-dead-function test.


github-actions bot commented Jan 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@krzysz00
Contributor Author

@arsenm @nhaehnle Since y'all were (IIRC) the main reviewers on this over on Phab, can you take a look to see if there're any remaining comments?

@piotrAMD
Collaborator

piotrAMD commented Feb 2, 2024

Came across an assertion; see the attachment for the reproducer (repro.txt):
opt -S -mcpu=gfx1100 -amdgpu-lower-buffer-fat-pointers repro.txt

#10 0x0000000003aed749 llvm::StructLayout::StructLayout(llvm::StructType*, llvm::DataLayout const&) (/work/sw/drivers/xgl/rel/compiler/llpc/llvm/bin/opt+0x3aed749)
#11 0x0000000003af069d llvm::DataLayout::getStructLayout(llvm::StructType*) const (/work/sw/drivers/xgl/rel/compiler/llpc/llvm/bin/opt+0x3af069d)
#12 0x0000000003c03494 llvm::GEPOperator::collectOffset(llvm::DataLayout const&, unsigned int, llvm::MapVector<llvm::Value*, llvm::APInt, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int>>, llvm::SmallVector<std::pair<llvm::Value*, llvm::APInt>, 0u>>&, llvm::APInt&) const (/work/sw/drivers/xgl/rel/compiler/llpc/llvm/bin/opt+0x3c03494)
#13 0x0000000002d999d0 (anonymous namespace)::SplitPtrStructs::visitGetElementPtrInst(llvm::GetElementPtrInst&) AMDGPULowerBufferFatPointers.cpp:0:0
#14 0x0000000002d97ba6 llvm::InstVisitor<(anonymous namespace)::SplitPtrStructs, std::pair<llvm::Value*, llvm::Value*>>::visit(llvm::Instruction&) AMDGPULowerBufferFatPointers.cpp:0:0
#15 0x0000000002d8f4b3 (anonymous namespace)::AMDGPULowerBufferFatPointers::run(llvm::Module&, llvm::TargetMachine const&) AMDGPULowerBufferFatPointers.cpp:0:0

@krzysz00
Contributor Author

krzysz00 commented Feb 2, 2024

@piotrAMD Thanks for the thorough testing! I found the issue (a stale pointer), and your code also gave me an unrelated crash, namely that I wasn't correctly handling unreachable intrinsics.

@piotrAMD
Collaborator

Thanks - the recent update has fixed the issue for me.

There is some follow-up work to be done in the lowering (instruction selection) as noted earlier in https://reviews.llvm.org/D158463#4645037, but those issues are not blockers for this patch.

@@ -352,3 +352,107 @@ loop.exit:
exit:
ret float %sum
}

;; Test that the optimization works correctly for phi nodes that repeat the same
Contributor

Not an optimization

Contributor Author

I'd characterize switching

%phi.rsrc = phi [ %v.rsrc, %l1 ], [ %v.rsrc, %l2 ], ...
%phi.off = phi [ %v1.off, %l1 ], [ %v2.off, %l2 ], ...

(the natural lowering of phi)

to

%phi.rsrc = %v.rsrc
%phi.off = phi [ %v1.off, %l1 ], [ %v2.off, %l2 ]

as an optimization.

(The squashed commit message repeats the pull request description above.)
@krzysz00 krzysz00 merged commit 6540f16 into llvm:main Mar 6, 2024
3 of 4 checks passed
krzysz00 added a commit to krzysz00/llvm-project that referenced this pull request Mar 9, 2024
Someone pointed out a typo (Value* RsrcRes = RsrcRes = ...) in the address space 7 lowering PR; this commit fixes it.
krzysz00 added a commit that referenced this pull request Mar 11, 2024
Someone pointed out a typo (Value* RsrcRes = RsrcRes = ...) in the address space 7 lowering PR; this commit fixes it.