Skip to content

Commit

Permalink
[OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca rai…
Browse files Browse the repository at this point in the history
…se function pass to finalize, utilised in convertTarget (#78818)

This patch seeks to add a mechanism to raise constant (not ConstantExpr
or runtime/dynamic) sized allocations into the entry block for select
functions that have been inserted into a list for processing. This
processing occurs during the finalize call, after OutlinedInfo regions
have completed. This currently has only been utilised for
createOutlinedFunction, which is triggered for TargetOp generation in
the OpenMP MLIR dialect lowering to LLVM-IR.

This currently is required for Target kernels generated by
createOutlinedFunction to avoid subsequent optimization passes doing
some unintentional malformed optimizations for AMD kernels (unsure if it
occurs for other vendors). If the allocas are generated inside of the
kernel and are not in the entry block and are subsequently passed to a
function this can lead to required instructions being erased or
manipulated in a way that causes the kernel to run into a HSA access
error.

This fix is related to a series of problems found in:
#74603

This problem primarily presents itself for Flang's HLFIR AssignOp
currently, when utilised with a scalar temporary constant on the RHS and
a descriptor type on the LHS. It will generate a call to a runtime
function, wrap the RHS temporary in a newly allocated descriptor (an
llvm struct), and pass both the LHS and RHS descriptor into the runtime
function call. This will currently be
embedded into the middle of the target region in the user entry block,
which means the allocas are also embedded in the middle, which seems to
pose
issues when later passes are executed. This issue may present itself in
other HLFIR operations or unrelated operations that generate allocas as
a by product, but for the moment, this one test case is the only
scenario I've found this problem.

Perhaps this is not the appropriate fix, I am very open to other
suggestions, I've tried a few others (at varying levels of the
flang/mlir compiler flow), but this one is the smallest and least
intrusive change set. The other two, that come to mind (but I've not
fully looked into, the former I tried a little with blocks but it had a
few issues I'd need to think through):

- Having a proper alloca only block (or region) generated for TargetOps
that we could merge into the entry block that's generated by
convertTarget's createOutlinedFunction.
- Or diverging a little from Clang's current target generation and using
the CodeExtractor to generate the user code as an outlined function
region invoked from the kernel we make, with our kernel arguments passed
into it. Similar to the current parallel generation. I am not sure how
well this would intermingle with the existing parallel generation though
that's layered in.

Both of these methods seem like quite a divergence from the current
status quo, which I am not entirely sure is merited for the small test
this change aims to fix.
  • Loading branch information
agozillon committed Feb 23, 2024
1 parent 47aee8b commit dcf4ca5
Show file tree
Hide file tree
Showing 4 changed files with 249 additions and 0 deletions.
5 changes: 5 additions & 0 deletions llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
Original file line number Diff line number Diff line change
Expand Up @@ -1506,6 +1506,11 @@ class OpenMPIRBuilder {
/// Collection of regions that need to be outlined during finalization.
SmallVector<OutlineInfo, 16> OutlineInfos;

/// A collection of candidate target functions that's constant allocas will
/// attempt to be raised on a call of finalize after all currently enqueued
/// outline info's have been processed.
SmallVector<llvm::Function *, 16> ConstantAllocaRaiseCandidates;

/// Collection of owned canonical loop objects that eventually need to be
/// free'd.
std::forward_list<CanonicalLoopInfo> LoopInfos;
Expand Down
51 changes: 51 additions & 0 deletions llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -633,6 +633,29 @@ Function *OpenMPIRBuilder::getOrCreateRuntimeFunctionPtr(RuntimeFunction FnID) {

void OpenMPIRBuilder::initialize() { initializeTypes(M); }

static void raiseUserConstantDataAllocasToEntryBlock(IRBuilderBase &Builder,
Function *Function) {
BasicBlock &EntryBlock = Function->getEntryBlock();
Instruction *MoveLocInst = EntryBlock.getFirstNonPHI();

// Loop over blocks looking for constant allocas, skipping the entry block
// as any allocas there are already in the desired location.
for (auto Block = std::next(Function->begin(), 1); Block != Function->end();
Block++) {
for (auto Inst = Block->getReverseIterator()->begin();
Inst != Block->getReverseIterator()->end();) {
if (auto *AllocaInst = dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
Inst++;
if (!isa<ConstantData>(AllocaInst->getArraySize()))
continue;
AllocaInst->moveBeforePreserving(MoveLocInst);
} else {
Inst++;
}
}
}
}

void OpenMPIRBuilder::finalize(Function *Fn) {
SmallPtrSet<BasicBlock *, 32> ParallelRegionBlockSet;
SmallVector<BasicBlock *, 32> Blocks;
Expand Down Expand Up @@ -737,6 +760,28 @@ void OpenMPIRBuilder::finalize(Function *Fn) {
// Remove work items that have been completed.
OutlineInfos = std::move(DeferredOutlines);

// The createTarget functions embeds user written code into
// the target region which may inject allocas which need to
// be moved to the entry block of our target or risk malformed
// optimisations by later passes, this is only relevant for
// the device pass which appears to be a little more delicate
// when it comes to optimisations (however, we do not block on
// that here, it's up to the inserter to the list to do so).
// This notbaly has to occur after the OutlinedInfo candidates
// have been extracted so we have an end product that will not
// be implicitly adversely affected by any raises unless
// intentionally appended to the list.
// NOTE: This only does so for ConstantData, it could be extended
// to ConstantExpr's with further effort, however, they should
// largely be folded when they get here. Extending it to runtime
// defined/read+writeable allocation sizes would be non-trivial
// (need to factor in movement of any stores to variables the
// allocation size depends on, as well as the usual loads,
// otherwise it'll yield the wrong result after movement) and
// likely be more suitable as an LLVM optimisation pass.
for (Function *F : ConstantAllocaRaiseCandidates)
raiseUserConstantDataAllocasToEntryBlock(Builder, F);

EmitMetadataErrorReportFunctionTy &&ErrorReportFn =
[](EmitMetadataErrorKind Kind,
const TargetRegionEntryInfo &EntryInfo) -> void {
Expand Down Expand Up @@ -5043,6 +5088,12 @@ static Function *createOutlinedFunction(

BasicBlock *UserCodeEntryBB = Builder.GetInsertBlock();

// As we embed the user code in the middle of our target region after we
// generate entry code, we must move what allocas we can into the entry
// block to avoid possible breaking optimisations for device
if (OMPBuilder.Config.isTargetDevice())
OMPBuilder.ConstantAllocaRaiseCandidates.emplace_back(Func);

// Insert target deinit call in the device compilation pass.
Builder.restoreIP(CBFunc(Builder.saveIP(), Builder.saveIP()));
if (OMPBuilder.Config.isTargetDevice())
Expand Down
150 changes: 150 additions & 0 deletions llvm/unittests/Frontend/OpenMPIRBuilderTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5989,6 +5989,156 @@ TEST_F(OpenMPIRBuilderTest, TargetRegionDevice) {
EXPECT_TRUE(isa<ReturnInst>(ExitBlock->getFirstNonPHI()));
}

TEST_F(OpenMPIRBuilderTest, ConstantAllocaRaise) {
OpenMPIRBuilder OMPBuilder(*M);
OMPBuilder.setConfig(
OpenMPIRBuilderConfig(true, false, false, false, false, false, false));
OMPBuilder.initialize();

F->setName("func");
IRBuilder<> Builder(BB);
OpenMPIRBuilder::LocationDescription Loc({Builder.saveIP(), DL});

LoadInst *Value = nullptr;
StoreInst *TargetStore = nullptr;
llvm::SmallVector<llvm::Value *, 1> CapturedArgs = {
Constant::getNullValue(PointerType::get(Ctx, 0))};

auto SimpleArgAccessorCB =
[&](llvm::Argument &Arg, llvm::Value *Input, llvm::Value *&RetVal,
llvm::OpenMPIRBuilder::InsertPointTy AllocaIP,
llvm::OpenMPIRBuilder::InsertPointTy CodeGenIP) {
if (!OMPBuilder.Config.isTargetDevice()) {
RetVal = cast<llvm::Value>(&Arg);
return CodeGenIP;
}

Builder.restoreIP(AllocaIP);

llvm::Value *Addr = Builder.CreateAlloca(
Arg.getType()->isPointerTy()
? Arg.getType()
: Type::getInt64Ty(Builder.getContext()),
OMPBuilder.M.getDataLayout().getAllocaAddrSpace());
llvm::Value *AddrAscast =
Builder.CreatePointerBitCastOrAddrSpaceCast(Addr, Input->getType());
Builder.CreateStore(&Arg, AddrAscast);

Builder.restoreIP(CodeGenIP);

RetVal = Builder.CreateLoad(Arg.getType(), AddrAscast);

return Builder.saveIP();
};

llvm::OpenMPIRBuilder::MapInfosTy CombinedInfos;
auto GenMapInfoCB = [&](llvm::OpenMPIRBuilder::InsertPointTy codeGenIP)
-> llvm::OpenMPIRBuilder::MapInfosTy & {
CreateDefaultMapInfos(OMPBuilder, CapturedArgs, CombinedInfos);
return CombinedInfos;
};

llvm::Value *RaiseAlloca = nullptr;

auto BodyGenCB = [&](OpenMPIRBuilder::InsertPointTy AllocaIP,
OpenMPIRBuilder::InsertPointTy CodeGenIP)
-> OpenMPIRBuilder::InsertPointTy {
Builder.restoreIP(CodeGenIP);
RaiseAlloca = Builder.CreateAlloca(Builder.getInt32Ty());
Value = Builder.CreateLoad(Type::getInt32Ty(Ctx), CapturedArgs[0]);
TargetStore = Builder.CreateStore(Value, RaiseAlloca);
return Builder.saveIP();
};

IRBuilder<>::InsertPoint EntryIP(&F->getEntryBlock(),
F->getEntryBlock().getFirstInsertionPt());
TargetRegionEntryInfo EntryInfo("parent", /*DeviceID=*/1, /*FileID=*/2,
/*Line=*/3, /*Count=*/0);

Builder.restoreIP(
OMPBuilder.createTarget(Loc, EntryIP, EntryIP, EntryInfo, /*NumTeams=*/-1,
/*NumThreads=*/0, CapturedArgs, GenMapInfoCB,
BodyGenCB, SimpleArgAccessorCB));

Builder.CreateRetVoid();
OMPBuilder.finalize();

// Check outlined function
EXPECT_FALSE(verifyModule(*M, &errs()));
EXPECT_NE(TargetStore, nullptr);
Function *OutlinedFn = TargetStore->getFunction();
EXPECT_NE(F, OutlinedFn);

EXPECT_TRUE(OutlinedFn->hasWeakODRLinkage());
// Account for the "implicit" first argument.
EXPECT_EQ(OutlinedFn->getName(), "__omp_offloading_1_2_parent_l3");
EXPECT_EQ(OutlinedFn->arg_size(), 2U);
EXPECT_TRUE(OutlinedFn->getArg(1)->getType()->isPointerTy());

// Check entry block, to see if we have raised our alloca
// from the body to the entry block.
auto &EntryBlock = OutlinedFn->getEntryBlock();

// Check that we have moved our alloca created in the
// BodyGenCB function, to the top of the function.
Instruction *Alloca1 = EntryBlock.getFirstNonPHI();
EXPECT_NE(Alloca1, nullptr);
EXPECT_TRUE(isa<AllocaInst>(Alloca1));
EXPECT_EQ(Alloca1, RaiseAlloca);

// Verify we have not altered the rest of the function
// inappropriately with our alloca movement.
auto *Alloca2 = Alloca1->getNextNode();
EXPECT_TRUE(isa<AllocaInst>(Alloca2));
auto *Store2 = Alloca2->getNextNode();
EXPECT_TRUE(isa<StoreInst>(Store2));

auto *InitCall = dyn_cast<CallInst>(Store2->getNextNode());
EXPECT_NE(InitCall, nullptr);
EXPECT_EQ(InitCall->getCalledFunction()->getName(), "__kmpc_target_init");
EXPECT_EQ(InitCall->arg_size(), 2U);
EXPECT_TRUE(isa<GlobalVariable>(InitCall->getArgOperand(0)));
auto *KernelEnvGV = cast<GlobalVariable>(InitCall->getArgOperand(0));
EXPECT_TRUE(isa<ConstantStruct>(KernelEnvGV->getInitializer()));
auto *KernelEnvC = cast<ConstantStruct>(KernelEnvGV->getInitializer());
EXPECT_TRUE(isa<ConstantStruct>(KernelEnvC->getAggregateElement(0U)));
auto *ConfigC = cast<ConstantStruct>(KernelEnvC->getAggregateElement(0U));
EXPECT_EQ(ConfigC->getAggregateElement(0U),
ConstantInt::get(Type::getInt8Ty(Ctx), true));
EXPECT_EQ(ConfigC->getAggregateElement(1U),
ConstantInt::get(Type::getInt8Ty(Ctx), true));
EXPECT_EQ(ConfigC->getAggregateElement(2U),
ConstantInt::get(Type::getInt8Ty(Ctx), OMP_TGT_EXEC_MODE_GENERIC));

auto *EntryBlockBranch = EntryBlock.getTerminator();
EXPECT_NE(EntryBlockBranch, nullptr);
EXPECT_EQ(EntryBlockBranch->getNumSuccessors(), 2U);

// Check user code block
auto *UserCodeBlock = EntryBlockBranch->getSuccessor(0);
EXPECT_EQ(UserCodeBlock->getName(), "user_code.entry");
auto *Load1 = UserCodeBlock->getFirstNonPHI();
EXPECT_TRUE(isa<LoadInst>(Load1));
auto *Load2 = Load1->getNextNode();
EXPECT_TRUE(isa<LoadInst>(Load2));
EXPECT_EQ(Load2, Value);
EXPECT_EQ(Load2->getNextNode(), TargetStore);
auto *Deinit = TargetStore->getNextNode();
EXPECT_NE(Deinit, nullptr);

auto *DeinitCall = dyn_cast<CallInst>(Deinit);
EXPECT_NE(DeinitCall, nullptr);
EXPECT_EQ(DeinitCall->getCalledFunction()->getName(), "__kmpc_target_deinit");
EXPECT_EQ(DeinitCall->arg_size(), 0U);

EXPECT_TRUE(isa<ReturnInst>(DeinitCall->getNextNode()));

// Check exit block
auto *ExitBlock = EntryBlockBranch->getSuccessor(1);
EXPECT_EQ(ExitBlock->getName(), "worker.exit");
EXPECT_TRUE(isa<ReturnInst>(ExitBlock->getFirstNonPHI()));
}

TEST_F(OpenMPIRBuilderTest, CreateTask) {
using InsertPointTy = OpenMPIRBuilder::InsertPointTy;
OpenMPIRBuilder OMPBuilder(*M);
Expand Down
43 changes: 43 additions & 0 deletions mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s

// A small condensed version of a problem requiring constant alloca raising in
// Target Region Entries for user injected code, found in an issue in the Flang
// compiler. Certain LLVM IR optimisation passes will perform runtime breaking
// transformations on allocations not found to be in the entry block, current
// OpenMP dialect lowering of TargetOp's will inject user allocations after
// compiler generated entry code, in a seperate block, this test checks that
// a small function which attempts to raise some of these (specifically
// constant sized) allocations performs its task reasonably in these
// scenarios.

module attributes {omp.is_target_device = true} {
llvm.func @_QQmain() attributes {omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
%1 = llvm.mlir.constant(1 : i64) : i64
%2 = llvm.alloca %1 x !llvm.struct<(ptr)> : (i64) -> !llvm.ptr
%3 = omp.map_info var_ptr(%2 : !llvm.ptr, !llvm.struct<(ptr)>) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
omp.target map_entries(%3 -> %arg0 : !llvm.ptr) {
^bb0(%arg0: !llvm.ptr):
%4 = llvm.mlir.constant(1 : i32) : i32
%5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
%6 = llvm.mlir.constant(50 : i32) : i32
%7 = llvm.mlir.constant(1 : i64) : i64
%8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
llvm.store %6, %8 : i32, !llvm.ptr
%9 = llvm.mlir.undef : !llvm.struct<(ptr)>
%10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)>
llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
%88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
omp.terminator
}
llvm.return
}
llvm.func @_ExternalCall(!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
}

// CHECK: define weak_odr protected void @{{.*}}QQmain_l{{.*}}({{.*}}, {{.*}}) {
// CHECK-NEXT: entry:
// CHECK-NEXT: %[[MOVED_ALLOCA1:.*]] = alloca { ptr }, align 8
// CHECK-NEXT: %[[MOVED_ALLOCA2:.*]] = alloca i32, i64 1, align 4
// CHECK-NEXT: %[[MAP_ARG_ALLOCA:.*]] = alloca ptr, align 8

// CHECK: user_code.entry: ; preds = %entry

0 comments on commit dcf4ca5

Please sign in to comment.