
[OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget #78818

Merged 4 commits on Feb 23, 2024

Conversation

agozillon
Contributor

This patch seeks to add a mechanism for raising constant-sized (not ConstantExpr or runtime/dynamic-sized) allocations into the entry block of selected functions that have been inserted into a list for processing. The processing occurs during the finalize call, after the OutlinedInfo regions have completed. It is currently only utilised by createOutlinedFunction, which is triggered for TargetOp generation in the OpenMP MLIR dialect lowering to LLVM-IR.

This is currently required for target kernels generated by createOutlinedFunction, to prevent subsequent optimisation passes from performing unintended, malformed optimisations on AMD kernels (I am unsure whether it occurs for other vendors). If allocas are generated inside the kernel, outside of the entry block, and are subsequently passed to a function, required instructions can be erased or manipulated in a way that causes the kernel to run into an HSA access error.

This fix is related to a series of problems found in: #74603

This problem currently presents itself primarily for Flang's HLFIR AssignOp, when it is used with a scalar temporary constant on the RHS and a descriptor type on the LHS. The op will generate a call to a runtime function, wrap the RHS temporary in a newly allocated descriptor (an LLVM struct), and pass both the LHS and RHS descriptors into the runtime call. This code is currently embedded into the middle of the target region in the user entry block, which means the allocas are also embedded in the middle, and that seems to pose issues when later passes are executed. The issue may also appear for other HLFIR operations, or for unrelated operations that generate allocas as a by-product, but for the moment this one test case is the only scenario in which I've found the problem.

Perhaps this is not the appropriate fix; I am very open to other suggestions. I've tried a few others (at varying levels of the flang/mlir compiler flow), but this one is the smallest and least intrusive change set. Two alternatives come to mind (neither fully explored; I experimented a little with blocks for the former, but it had a few issues I'd need to think through):

  • Having a proper alloca-only block (or region) generated for TargetOps that we could merge into the entry block generated by convertTarget's createOutlinedFunction.
  • Diverging a little from Clang's current target generation and using the CodeExtractor to generate the user code as an outlined function invoked from the kernel we make, with our kernel arguments passed into it, similar to the current parallel generation. I am not sure how well this would intermingle with the existing parallel generation that's layered in, though.

Both of these methods seem like quite a divergence from the current status quo, and I am not entirely sure that is merited for the small test this change aims to fix.

@llvmbot
Collaborator

llvmbot commented Jan 20, 2024

@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-llvm

Author: None (agozillon)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/78818.diff

3 Files Affected:

  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+12)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+52)
  • (added) mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir (+43)
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 669104307fa0e23..e7be143695be1ac 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1495,6 +1495,11 @@ class OpenMPIRBuilder {
   /// Collection of regions that need to be outlined during finalization.
   SmallVector<OutlineInfo, 16> OutlineInfos;
 
+  /// A collection of candidate target functions that's constant allocas will
+  /// attempt to be raised on a call of finalize after all currently enqueued
+  /// outline info's have been processed.
+  SmallVector<llvm::Function *, 16> ConstantAllocaRaiseCandidates;
+
   /// Collection of owned canonical loop objects that eventually need to be
   /// free'd.
   std::forward_list<CanonicalLoopInfo> LoopInfos;
@@ -1502,6 +1507,13 @@ class OpenMPIRBuilder {
   /// Add a new region that will be outlined later.
   void addOutlineInfo(OutlineInfo &&OI) { OutlineInfos.emplace_back(OI); }
 
+  /// Add a function that's constant allocas will attempt to be raised on a
+  /// call of finalize after all currently enqueued outline info's have been
+  /// processed.
+  void addConstantAllocaRaiseCandidates(Function *F) {
+    ConstantAllocaRaiseCandidates.emplace_back(F);
+  }
+
   /// An ordered map of auto-generated variables to their unique names.
   /// It stores variables with the following names: 1) ".gomp_critical_user_" +
   /// <critical_section_name> + ".var" for "omp critical" directives; 2)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index f6cf358119fb715..15ba5118e7be07e 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -633,6 +633,30 @@ Function *OpenMPIRBuilder::getOrCreateRuntimeFunctionPtr(RuntimeFunction FnID) {
 
 void OpenMPIRBuilder::initialize() { initializeTypes(M); }
 
+static void raiseUserConstantDataAllocasToEntryBlock(IRBuilderBase &Builder,
+                                                     Function *Function) {
+  BasicBlock &EntryBlock = Function->getEntryBlock();
+  Instruction *MoveLocInst = EntryBlock.getFirstNonPHI();
+
+  // Loop over blocks looking for allocas, skip the entry block allocas here are
+  // in the appropriate place.
+  for (auto Block = std::next(Function->begin(), 1); Block != Function->end();
+       Block++) {
+    for (auto Inst = Block->getReverseIterator()->begin();
+         Inst != Block->getReverseIterator()->end();) {
+      if (auto *AllocaInst =
+              llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
+        Inst++;
+        if (!isa<ConstantData>(AllocaInst->getArraySize()))
+          continue;
+        AllocaInst->moveBeforePreserving(MoveLocInst);
+      } else {
+        Inst++;
+      }
+    }
+  }
+}
+
 void OpenMPIRBuilder::finalize(Function *Fn) {
   SmallPtrSet<BasicBlock *, 32> ParallelRegionBlockSet;
   SmallVector<BasicBlock *, 32> Blocks;
@@ -728,6 +752,28 @@ void OpenMPIRBuilder::finalize(Function *Fn) {
   // Remove work items that have been completed.
   OutlineInfos = std::move(DeferredOutlines);
 
+  // The createTarget functions embeds user written code into
+  // the target region which may inject allocas which need to
+  // be moved to the entry block of our target or risk malformed
+  // optimisations by later passes, this is only relevant for
+  // the device pass which appears to be a little more delicate
+  // when it comes to optimisations (however, we do not block on
+  // that here, it's up to the inserter to the list to do so).
+  // This notably has to occur after the OutlinedInfo candidates
+  // have been extracted so we have an end product that will not
+  // be implicitly adversely affected by any raises unless
+  // intentionally appended to the list.
+  // NOTE: This only does so for ConstantData, it could be extended
+  // to ConstantExpr's with further effort, however, they should
+  // largely be folded when they get here. Extending it to runtime
+  // defined/read+writeable allocation sizes would be non-trivial
+  // (need to factor in movement of any stores to variables the
+  // allocation size depends on, as well as the usual loads,
+  // otherwise it'll yield the wrong result after movement) and
+  // likely be more suitable as an LLVM optimisation pass.
+  for (Function *F : ConstantAllocaRaiseCandidates)
+    raiseUserConstantDataAllocasToEntryBlock(Builder, F);
+
   EmitMetadataErrorReportFunctionTy &&ErrorReportFn =
       [](EmitMetadataErrorKind Kind,
          const TargetRegionEntryInfo &EntryInfo) -> void {
@@ -5033,6 +5079,12 @@ static Function *createOutlinedFunction(
 
   BasicBlock *UserCodeEntryBB = Builder.GetInsertBlock();
 
+  // As we embed the user code in the middle of our target region after we
+  // generate entry code, we must move what allocas we can into the entry
+  // block to avoid possible breaking optimisations for device
+  if (OMPBuilder.Config.isTargetDevice())
+    OMPBuilder.addConstantAllocaRaiseCandidates(Func);
+
   // Insert target deinit call in the device compilation pass.
   Builder.restoreIP(CBFunc(Builder.saveIP(), Builder.saveIP()));
   if (OMPBuilder.Config.isTargetDevice())
diff --git a/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
new file mode 100644
index 000000000000000..be521fbe1f01a6d
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
@@ -0,0 +1,43 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+// A small condensed version of a problem requiring constant alloca raising in
+// Target Region Entries for user injected code, found in an issue in the Flang
+// compiler. Certain LLVM IR optimisation passes will perform runtime breaking 
+// transformations on allocations not found to be in the entry block, current
+// OpenMP dialect lowering of TargetOp's will inject user allocations after
+// compiler generated entry code, in a separate block, this test checks that
+// a small function which attempts to raise some of these (specifically 
+// constant sized) allocations performs its task reasonably in these 
+// scenarios. 
+
+module attributes {omp.is_target_device = true} {
+  llvm.func @_QQmain() attributes {omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
+    %1 = llvm.mlir.constant(1 : i64) : i64
+    %2 = llvm.alloca %1 x !llvm.struct<(ptr)> : (i64) -> !llvm.ptr
+    %3 = omp.map_info var_ptr(%2 : !llvm.ptr, !llvm.struct<(ptr)>) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
+    omp.target map_entries(%3 -> %arg0 : !llvm.ptr) {
+    ^bb0(%arg0: !llvm.ptr):
+      %4 = llvm.mlir.constant(1 : i32) : i32
+      %5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
+      %6 = llvm.mlir.constant(50 : i32) : i32
+      %7 = llvm.mlir.constant(1 : i64) : i64
+      %8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
+      llvm.store %6, %8 : i32, !llvm.ptr
+      %9 = llvm.mlir.undef : !llvm.struct<(ptr)>
+      %10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)> 
+      llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
+      %88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+      omp.terminator
+    }
+    llvm.return
+  }
+  llvm.func @_ExternalCall(!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+}
+
+// CHECK:      define weak_odr protected void @{{.*}}QQmain_l{{.*}}({{.*}}, {{.*}}) {
+// CHECK-NEXT: entry:
+// CHECK-NEXT:  %[[MOVED_ALLOCA1:.*]] = alloca { ptr }, align 8
+// CHECK-NEXT:  %[[MOVED_ALLOCA2:.*]] = alloca i32, i64 1, align 4
+// CHECK-NEXT:  %[[MAP_ARG_ALLOCA:.*]] = alloca ptr, align 8
+
+// CHECK: user_code.entry:                                  ; preds = %entry

@kiranchandramohan
Contributor

Could you share the HLFIR IR that needs this change?

@agozillon
Contributor Author

Could you share the HLFIR IR that needs this change?

Sure, here you go:

module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<i1, dense<8> : vector<2xi64>>, #dlti.dl_entry<!llvm.ptr, dense<64> : vector<4xi64>>, #dlti.dl_entry<f16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i32, dense<32> : vector<2xi64>>, #dlti.dl_entry<i16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i8, dense<8> : vector<2xi64>>, #dlti.dl_entry<!llvm.ptr<272>, dense<64> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<271>, dense<32> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<270>, dense<32> : vector<4xi64>>, #dlti.dl_entry<f128, dense<128> : vector<2xi64>>, #dlti.dl_entry<f64, dense<64> : vector<2xi64>>, #dlti.dl_entry<f80, dense<128> : vector<2xi64>>, #dlti.dl_entry<i128, dense<128> : vector<2xi64>>, #dlti.dl_entry<i64, dense<64> : vector<2xi64>>, #dlti.dl_entry<"dlti.stack_alignment", 128 : i64>, #dlti.dl_entry<"dlti.endianness", "little">>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu", omp.is_gpu = false, omp.is_target_device = false, omp.requires = #omp<clause_requires none>, omp.version = #omp.version<version = 11>} {
  func.func @_QQmain() attributes {fir.bindc_name = "main"} {
    %0 = fir.alloca !fir.box<!fir.heap<i32>> {bindc_name = "test", uniq_name = "_QFEtest"}
    %1 = fir.zero_bits !fir.heap<i32>
    %2 = fir.embox %1 : (!fir.heap<i32>) -> !fir.box<!fir.heap<i32>>
    fir.store %2 to %0 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %3:2 = hlfir.declare %0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFEtest"} : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> (!fir.ref<!fir.box<!fir.heap<i32>>>, !fir.ref<!fir.box<!fir.heap<i32>>>)
    %4 = fir.allocmem i32 {fir.must_be_heap = true, uniq_name = "_QFEtest.alloc"}
    %5 = fir.embox %4 : (!fir.heap<i32>) -> !fir.box<!fir.heap<i32>>
    fir.store %5 to %3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %6 = fir.load %3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %7 = fir.box_offset %3#1 base_addr : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> !fir.llvm_ptr<!fir.ref<i32>>
    %8 = fir.box_addr %6 : (!fir.box<!fir.heap<i32>>) -> !fir.heap<i32>
    %9 = omp.map_info var_ptr(%8 : !fir.heap<i32>, i32) var_ptr_ptr(%7 : !fir.llvm_ptr<!fir.ref<i32>>) map_clauses(tofrom) capture(ByRef) -> !fir.heap<i32> {name = "test"}
    %10 = omp.map_info var_ptr(%3#1 : !fir.ref<!fir.box<!fir.heap<i32>>>, !fir.box<!fir.heap<i32>>) map_clauses(tofrom) capture(ByRef) members(%9 : !fir.heap<i32>) -> !fir.ref<!fir.box<!fir.heap<i32>>> {name = "test"}
    omp.target map_entries(%10 -> %arg0 : !fir.ref<!fir.box<!fir.heap<i32>>>) {
    ^bb0(%arg0: !fir.ref<!fir.box<!fir.heap<i32>>>):
      %20:2 = hlfir.declare %arg0 {fortran_attrs = #fir.var_attrs<allocatable>, uniq_name = "_QFEtest"} : (!fir.ref<!fir.box<!fir.heap<i32>>>) -> (!fir.ref<!fir.box<!fir.heap<i32>>>, !fir.ref<!fir.box<!fir.heap<i32>>>)
      %c50_i32 = arith.constant 50 : i32
      hlfir.assign %c50_i32 to %20#0 realloc : i32, !fir.ref<!fir.box<!fir.heap<i32>>>
      omp.terminator
    }
    %c6_i32 = arith.constant 6 : i32
    %11 = fir.address_of(@_QQclXf0a9a187466a9dd5699bdae5c0b697a3) : !fir.ref<!fir.char<1,81>>
    %12 = fir.convert %11 : (!fir.ref<!fir.char<1,81>>) -> !fir.ref<i8>
    %c9_i32 = arith.constant 9 : i32
    %13 = fir.call @_FortranAioBeginExternalListOutput(%c6_i32, %12, %c9_i32) fastmath<contract> : (i32, !fir.ref<i8>, i32) -> !fir.ref<i8>
    %14 = fir.load %3#0 : !fir.ref<!fir.box<!fir.heap<i32>>>
    %15 = fir.box_addr %14 : (!fir.box<!fir.heap<i32>>) -> !fir.heap<i32>
    %16 = fir.load %15 : !fir.heap<i32>
    %c2_i32 = arith.constant 2 : i32
    %17 = arith.addi %16, %c2_i32 : i32
    %18 = fir.call @_FortranAioOutputInteger32(%13, %17) fastmath<contract> : (!fir.ref<i8>, i32) -> i1
    %19 = fir.call @_FortranAioEndIoStatement(%13) fastmath<contract> : (!fir.ref<i8>) -> i32
    return
  }
  fir.global linkonce @_QQclXf0a9a187466a9dd5699bdae5c0b697a3 constant : !fir.char<1,81> {
    %0 = fir.string_lit "/home/agozillo/git/flang-dev/work-dir/declare-target-map/single-value-alloca.f90\00"(81) : !fir.char<1,81>
    fir.has_value %0 : !fir.char<1,81>
  }
  func.func private @_FortranAioBeginExternalListOutput(i32, !fir.ref<i8>, i32) -> !fir.ref<i8> attributes {fir.io, fir.runtime}
  func.func private @_FortranAioOutputInteger32(!fir.ref<i8>, i32) -> i1 attributes {fir.io, fir.runtime}
  func.func private @_FortranAioEndIoStatement(!fir.ref<i8>) -> i32 attributes {fir.io, fir.runtime}
  fir.global @_QQEnvironmentDefaults constant : !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>> {
    %0 = fir.zero_bits !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>>
    fir.has_value %0 : !fir.ref<tuple<i32, !fir.ref<!fir.array<0xtuple<!fir.ref<i8>, !fir.ref<i8>>>>>>
  }
}

The relevant location is inside the TargetOp; the rest is largely irrelevant for the device but included for completeness. From my understanding, the HLFIR AssignOp will generate an allocation as part of its lowering (in this particular use case at least; there are different lowerings for it). That said, I wouldn't say the problem lies with the HLFIR operation necessarily; it is more a side effect of how we lower a target region to LLVM-IR (maybe there's a way to make it more AMDGPU-runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOps to the top (this is done currently when we lower to the LLVM dialect, but TargetOp doesn't generate new blocks, so there is no isolated entry block as such), the problem would still persist: we inevitably embed the user code into a separate block, in between some kernel entry code for the arguments, which branches off either to a fail condition or to the user code block containing the allocations, and then a later pass (not sure which yet, unfortunately) will try to do some magic and end up breaking the generated executable.

@agozillon
Contributor Author

Probably worth noting: this issue doesn't exist for the current variations of target parallel, from what I've found, likely because the CodeExtractor magics the user code into a separate function with an appropriate entry block (my knowledge of the lowering of parallel is still a WIP, so please take it with a grain of salt, and anyone more familiar should feel free to override my statement).

@agozillon
Contributor Author

Small ping for some reviewer attention next week if any time can be spared

@agozillon
Contributor Author

Another small ping for reviewer attention if at all possible please.

Member

@TIFitis TIFitis left a comment


Overall this seems fine to me. However, I'm not familiar with when and why this is required, so please wait for other reviewers as well.

Also, should we have a test for this in the OMPIRBuilderTest?

for (auto Inst = Block->getReverseIterator()->begin();
     Inst != Block->getReverseIterator()->end();) {
  if (auto *AllocaInst =
          llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
Member


Nit: Remove llvm::

Comment on lines 641 to 642
// Loop over blocks looking for allocas, skip the entry block allocas here are
// in the appropriate place.
Member


The wording of this comment is a little off. Can you please rephrase it.

/// Add a function that's constant allocas will attempt to be raised on a
/// call of finalize after all currently enqueued outline info's have been
/// processed.
void addConstantAllocaRaiseCandidates(Function *F) {
Member


Does this really need to be a separate function?

Contributor Author


I don't think so, if we don't wish it to be; it was primarily to mimic the current style we use for addOutlineInfo, if I recall correctly. Happy to change it!

@agozillon
Contributor Author

Thank you very much for the review, @TIFitis. I'll update this at the beginning of next week, and I'll hold off until we get a second reviewer who can sign off.

@agozillon agozillon force-pushed the alloca-raise-fix-attempt-main branch from 5d93304 to f568ba2 Compare February 5, 2024 14:50
@agozillon
Contributor Author

Updated the PR based on the recent comments (commit) and rebased it on recent upstream changes to make sure it remains functional.

Changes made were:

  • Fixed the badly written comment
  • Removed the unnecessary function
  • Removed the unrequired llvm:: prefix

Still looking for another reviewer for this PR if possible or further discussion.

@agozillon
Contributor Author

It slipped my mind that @TIFitis asked for an OpenMPIRBuilderTest, so in the latest commit I have added one testing the alloca raising that this PR introduces!


github-actions bot commented Feb 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@agozillon agozillon force-pushed the alloca-raise-fix-attempt-main branch from b2c79ef to f076f69 Compare February 6, 2024 15:17
@kiranchandramohan
Copy link
Contributor

The relevant location is inside of the TargetOp, the rest is largely irrelevant for device but included for completeness. The HLFIR AssignOp will generate an allocation as part of it's lowering from my understanding (in this particular use case at least, there's different lowering's for it). Although, I wouldn't say the problem lies with the HLFIR operation necessarily, more just a side affect of how we lower to a target region for LLVM-IR being shown I think (maybe there's a way to make it more AMDGPU runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOp's (when we lower to the LLVM Dialect this is done currently, but TargetOp doesn't generate a new Blocks, so there is no isolated Entry Block as such) to the top this would still persist, as we inevitably embed the user code into a seperate block in-between some kernel entry code for the arguments that will branch off to a fail condition or the user code block containing the allocations, and then a later pass (not sure which yet unfortunately) will try to do some magic and end up breaking the generated executable.

Is your observation that the OpenMPIRBuilder creates a separate alloca block, and that this alloca block becomes the entry block in LLVM IR? So even if we move all the allocas to the entry block in FIR/HLFIR/LLVM dialect MLIR, it still won't be the entry block in LLVM IR?

Is it possible to merge the alloca block into the succeeding block (which is hopefully the entry block at the MLIR dialect level)? And if we have lifted everything to the entry block in MLIR, will this all work out fine?

@agozillon
Contributor Author

agozillon commented Feb 6, 2024

The relevant location is inside of the TargetOp, the rest is largely irrelevant for device but included for completeness. The HLFIR AssignOp will generate an allocation as part of it's lowering from my understanding (in this particular use case at least, there's different lowering's for it). Although, I wouldn't say the problem lies with the HLFIR operation necessarily, more just a side affect of how we lower to a target region for LLVM-IR being shown I think (maybe there's a way to make it more AMDGPU runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOp's (when we lower to the LLVM Dialect this is done currently, but TargetOp doesn't generate a new Blocks, so there is no isolated Entry Block as such) to the top this would still persist, as we inevitably embed the user code into a seperate block in-between some kernel entry code for the arguments that will branch off to a fail condition or the user code block containing the allocations, and then a later pass (not sure which yet unfortunately) will try to do some magic and end up breaking the generated executable.

Is your observation that the OpenMPIRBuilder creates a separate Alloca block and this Alloca block will become the entry block in LLVM IR. So even if we move all the allocas to the entry block in FIR/HLFIR/LLVM dialect MLIR, it still won't be the entry block in LLVM IR?

That's correct, unfortunately. With our current lowering method (which I believe is just how we traditionally generate the kernels, from what I can gather from code comments), we generate an initial entry point that does some load-in/setup of kernel input, and then performs a libomptarget runtime invocation for initialisation (__kmpc_target_init), before branching into either a fail condition or the user entry code, which is effectively the lowered MLIR code in the region. So any allocas in the region will currently still be emitted outside of the alloca block.
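To make the shape concrete, the emitted kernel looks roughly like this. This is a hand-written sketch, not actual compiler output; the kernel name, block labels, and the elided __kmpc_target_init arguments are all illustrative:

```llvm
define weak_odr protected void @__omp_offloading_QQmain_kernel(ptr %arg) {
entry:                              ; compiler-generated argument load-in/setup
  %arg.addr = alloca ptr, align 8
  store ptr %arg, ptr %arg.addr, align 8
  %exec = call i32 @__kmpc_target_init(...)
  %exec_user_code = icmp eq i32 %exec, -1
  br i1 %exec_user_code, label %user_code.entry, label %worker.exit

user_code.entry:                    ; the lowered MLIR region begins here, so
  %desc = alloca { ptr }, align 8   ; user allocas land in this block, not in
  ...                               ; the function's entry block, until raised
  ret void

worker.exit:                        ; fail/early-exit path
  ret void
}
```

This is why hoisting allocas to the top of the TargetOp region at the MLIR level is not enough: they would still end up in user_code.entry rather than in the LLVM-IR entry block.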

Is it possible to merge the Alloca block to the succeeding block (which is hopefully the entry block at the MLIR dialects)? And if we have lifted in MLIR all to the entry block this will all work out fine?

I thought about doing this, but there are a few issues (some may be simpler to overcome than I think, or maybe my thinking is just off) that might get in the way:

  1. I don't think we can just merge the MLIR-generated block into the front of the newly generated OpenMPIRBuilder target block, as we could end up with out-of-sync operations if anything in the MLIR-generated block depends on any of the input arguments.
  2. The inverse, merging at the bottom of the block, becomes a bit of an issue due to the kmpc_target_init call feeding into a branch, as we can't meld the block directly behind the branch.
  3. We could maybe splice it into the OpenMPIRBuilder-generated entry block, above the call to kmpc_target_init, with the branch below the argument entry code. However, we do not really have the concept of a strict alloca block for TargetOp (or any of the OpenMP region constructs, from what I can tell); we simply prepend allocas to the top of the target region, and this primarily seems to be enforced in the FIR -> LLVM dialect lowering, while the HLFIR/FIR CodeGen seems less strict about it. Enforcing a specific block at the FIR -> LLVM dialect lowering level didn't seem particularly hard when I tried it (but I didn't test it extensively). However, as I recall, after doing this I found that it's not so easy to reliably find this alloca block when we're trying to merge, as we can't guarantee that the first generated LLVM-IR block is the alloca block (at least from my limited testing). So we'd need some way to identify the block for merging.

We also need to be a tad careful when doing this, as we have the nested parallel operations to consider, which also do their own alloca movements and extraction; that's why I opted to defer the raising to the finalize method and restrict it solely to the target kernel entry function.

It's been a little bit since I looked into this in detail now, so hopefully I am recollecting the above well enough! I am happy to look into it again to recollect more details if that's something we'd be interested in doing.

@agozillon
Contributor Author

Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

@kiranchandramohan
Contributor

> Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

No need to pursue this.

I have a couple of questions inline. Also, would this be better as an LLVM pass?

Comment on lines +20 to +29
```mlir
%4 = llvm.mlir.constant(1 : i32) : i32
%5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
%6 = llvm.mlir.constant(50 : i32) : i32
%7 = llvm.mlir.constant(1 : i64) : i64
%8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
llvm.store %6, %8 : i32, !llvm.ptr
%9 = llvm.mlir.undef : !llvm.struct<(ptr)>
%10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)>
llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
%88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
```
Contributor

If there is a loop around this block, would this still move the alloca to the entry block? If so, is that still semantically equivalent?

Contributor Author

Thank you for the great question @kiranchandramohan, and I encourage anyone else who has thoughts on possible scenarios this might break to bring them up as well. My LLVM knowledge is still very much a WIP.

In this case it would move the alloca to the entry block whether or not it's in a loop-related series of blocks. From an MLIR perspective, I don't currently think it's possible to have a series of loop blocks around the initial basic block of TargetOp (but I may be wrong on this and would love to know a scenario where this is possible, so I can test it), although it is possible to have blocks chained from it relating to loops as normal. I believe even in the case of omp.parallel/omp.wsloop it's internal to the initial basic block, and this code actively avoids manipulating that kind of nested region: the entirety of the parallel do should be outlined into a separate function by the time we try to raise allocations. The raising aims to happen only inside the kernel entry function (TargetOp) currently; the parallel do has its own allocation raising/sinking that occurs via the CodeExtractor, which we need to avoid tampering with (as that would break it).

From a Fortran perspective (as someone who is still very bad at writing Fortran), I am not sure a user can inject an alloca directly into the TargetOp; from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++, where, for example, you can declare an integer directly in the OpenMP target region, which will become an alloca.

However, as we do still need to consider this being used in C/C++, where it is possible, let's look at the C/C++ examples below, where I've tried to come up with some breaking scenarios (perhaps my logic isn't quite sound and there are better examples that would cause breakage, in which case please do bring them up!):

Version 1

```c
int test = 0;
#pragma omp target map(tofrom : test)
{
  for (int i = 0; i < 10; i++) {
    int j = 10;
    j += 10;
    test += j;
  }
}
```

Version 2

```c
int test = 0;
#pragma omp target map(tofrom : test)
{
  int j = 10;
  for (int i = 0; i < 10; i++) {
    j += 10;
    test += j;
  }
}
```

In both cases, when we emit the device IR, the alloca for j is moved to the entry block alongside i, and the correct results are achieved, so I believe it is still semantically equivalent. Without the initialization of j to 10 we get the same UB in both cases as well.

The main case where I can see it being an issue (and there may be others, which I would love to hear about if someone can think of them, so I can investigate) is if we have an allocation whose size depends on some loop-dependent variable, making it a dynamic allocation. However, this PR doesn't tackle that case (and doesn't handle constant expressions either; it needs them folded into a constant value currently); it'd be a lot more complex to deal with, and Flang hasn't presented a case where we need to handle that scenario yet (hopefully never). If we get to the point of requiring it, I think it'd require a significantly more complex solution.

However, disclaimer again: I am not the most LLVM-savvy person yet, so anyone reading, please do feel free to mention cases where this may be an issue, or point out if any of the above is incorrect.

Contributor

> From a Fortran perspective (as someone who is still very bad at writing Fortran) I am not sure a user can inject an alloca directly into the TargetOp, from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++ where as an example you declare an integer directly in the OpenMP target region which will become an alloca.

There is a block construct in Fortran 2008 that can be used for this.

Also, we have passes that can convert Allocs to Allocas.

```cpp
class AllocMemConversion : public mlir::OpRewritePattern<fir::AllocMemOp> {
```

Contributor

Just pinging @jdoerfert if he has any opinion here.

Contributor Author

> From a Fortran perspective (as someone who is still very bad at writing Fortran) I am not sure a user can inject an alloca directly into the TargetOp, from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++ where as an example you declare an integer directly in the OpenMP target region which will become an alloca.

> There is a block construct in Fortran 2008 that can be used for this.

Thank you Kiran! I can give this a shot then (provided it currently works in a target region; I haven't tested it myself directly, so I am a little interested in that). Sorry that my Fortran knowledge is rather lacking in areas, and thank you for introducing me to new stuff!

> Also, we have passes that can convert Allocs to Allocas.

```cpp
class AllocMemConversion : public mlir::OpRewritePattern<fir::AllocMemOp> {
```

I'll have a look into the pass; it could be quite useful in the future. I did notice it when I was looking around for other possible solutions and it caught my eye, but I wasn't quite sure we needed to look into an extensive solution just yet. While I don't think the pass would solve the issue (perhaps I am incorrect in my thinking, however), it would be a good basis for creating a pass that might.

However, I think if we wanted to do this as a pass at the MLIR level, we'd have to enforce an entry block for target regions, and at that point a pass may be unnecessary: during the lowering to the LLVM dialect I think we already try to move all allocas into the target (or any) entry block, and if we make sure we have, or generate, an explicit entry block at that point, all allocas should hopefully end up naturally in the entry block, at least where possible.
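To illustrate what an enforced entry block might look like (hypothetical MLIR, not the actual lowering output; the operation names and structure are assumptions for illustration):

```mlir
// Illustrative only: an omp.target region whose first block acts as a
// dedicated alloca/entry block, with the user code split into a successor.
omp.target {
^entry:
  %c1 = llvm.mlir.constant(1 : i64) : i64
  %tmp = llvm.alloca %c1 x i32 : (i64) -> !llvm.ptr  // constant-sized allocas
  llvm.br ^body                                      // all raised here
^body:
  // lowered user code, free of constant-sized allocas
  omp.terminator
}
```

With such a block guaranteed to exist, the LLVM-IR translation could splice it directly into the OpenMPIRBuilder-generated entry block instead of needing a post-hoc raise.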

Contributor Author

@agozillon agozillon Feb 20, 2024

I made the following small test case using blocks; it makes some attempt to generate the same behavior as the C++ test case above, with some extras on top:

```fortran
PROGRAM main
    integer, allocatable :: a
    allocate(a)
    a = 5

!$omp target map(tofrom:a)
   sub_call : BLOCK
      INTEGER :: b
      b = a + 2
      CALL Sub(b)

      do i = 1, 10
        BLOCK
           integer :: j
           j = 10
           j = j + 10
           a = a + j
        END BLOCK
        a = a + b
     end do
    END BLOCK sub_call
!$omp end target

print *, a
END PROGRAM main

SUBROUTINE Sub(B)
  INTEGER :: b
  b = b * b
END SUBROUTINE Sub
```

It yields the same answer for both target and regular Fortran host execution (as does moving j outside and into the first block). What is interesting is that when forcing undefined behavior by removing the initial assignment to j, the regular Fortran host code will print a garbage value, whereas the target will return a consistent value, effectively not emitting the addition of j to a in the loop. This happens with or without the alloca raise pass here, so it seems to be a separate inconsistency (and in any case it is undefined behavior, so we're free to do as we wish, I'd imagine).

I did, however, have to use the deprecated FIR flow via -flang-deprecated-no-hlfir (which still uses the alloca raise, so it would still reproduce the relevant inconsistencies), as this little test has unfortunately revealed another compiler crash for allocatables, very similar to the one this PR fixes (it crashes with or without this PR, sadly). So that will be fun to look into; I am unsure whether it's a related or entirely new issue at this point (I will look into it more in the near future).

Comment on lines +5091 to +5095
```cpp
// As we embed the user code in the middle of our target region after we
// generate entry code, we must move what allocas we can into the entry
// block to avoid possible breaking optimisations for device
if (OMPBuilder.Config.isTargetDevice())
  OMPBuilder.ConstantAllocaRaiseCandidates.emplace_back(Func);
```
Contributor

I assume this code is going to run for all device targets. I think you said this is something specific to AMD GPU backends. Do you know whether such transformations apply to other targets as well?

Contributor Author

I can unfortunately only speak to AMD GPU backends, as that's all I've tested the Fortran offload examples I work with on; I don't know enough about the other backends (e.g. NVIDIA) to really comment on them.

I can see about restricting it to the AMD target triple if that's something you'd like me to do to be cautious. I'm not sure how complicated it'd be to do so here, but I imagine it wouldn't be very.

I believe someone did manage to get some of the offloading working for NVIDIA GPUs a while ago, if I recall correctly, but I am not sure how easy that was for them, so I could in theory test that if we'd like.

Contributor

We do not have target specific code here at the moment. So no need to check this here.

@agozillon
Contributor Author

> Just a small ping to check if you're happy with the fix as is @kiranchandramohan or if you'd prefer me to dig into the block merge/splice method to see if it's feasible?

> No need to pursue this.

Sounds good, if it ever becomes necessary to pursue the other option I am more than happy to do so.

> I have a couple of questions inline. Also, would this be better as an LLVM pass?

I would like to avoid it being an LLVM pass for now if at all possible. It's a rather small/trivial piece of functionality that only gets performed on the target/kernel entry function at the moment, and I'd rather other people not start running it (or at least not be able to run it) outside of the intended use case for now. The easier it is to revert, the better, for the time being, in case we have to look at other options. However, if it expands to more complex use cases (e.g. dynamic allocations or non-foldable constants, although if we have to cover those cases I'd lean towards the entry block splicing or an alternative), then it should certainly be moved to an LLVM pass.

Contributor

@kiranchandramohan kiranchandramohan left a comment

I think constant allocas are fine to be hoisted. But as mentioned elsewhere there could be other cases where allocas might cause problems and might require a generic solution.

If there are no concerns expressed in a day or two you can consider submitting this patch and generalize this later on.

@agozillon
Contributor Author

Just a little forewarning: I will land this PR tomorrow, so if anyone has any concerns, now would be the time to bring them forward! We can always revert the PR if there are any afterwards (or any unexpected issues), but bringing them up now would be ideal :-)

@agozillon agozillon merged commit dcf4ca5 into llvm:main Feb 23, 2024
3 of 4 checks passed
@agozillon
Contributor Author

Landed this PR now. If anyone encounters any unexpected behaviour relating to allocas being raised by this PR, please drop a message here or send me a message, and we can revert the PR or make the necessary changes.
