
[SYCL] Generalize local accessor to shared mem pass #5149

Merged: 8 commits, Jan 20, 2022
5 changes: 5 additions & 0 deletions clang/lib/Driver/ToolChains/Clang.cpp
@@ -5787,6 +5787,11 @@ void Clang::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back("-treat-scalable-fixed-error-as-warning");
}

// Enable local accessor to shared memory pass for SYCL.
if (isa<BackendJobAction>(JA) && IsSYCL) {
Contributor

@mdtoguchi, shouldn't the condition check IsSYCLOffloadDevice instead of IsSYCL?
abi/user_mangling.cpp and regression/fsycl-save-temps.cpp from the check-sycl suite fail on my system.

Contributor

Ah, yes. If this is only to be set for device compilation, then it should use IsSYCLOffloadDevice; IsSYCL is true for all compilations (host and device) when SYCL device offloading is enabled.
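A minimal sketch of the distinction being discussed. The helper name and boolean parameters are illustrative, not the actual Clang driver API; the point is simply that the flag should be gated on the device-only predicate, not the one that is true for both halves of a -fsycl compilation:

```cpp
#include <cassert>

// Illustrative model: in a -fsycl build, "IsSYCL" is true for both the host
// and the device compilation, while "IsSYCLOffloadDevice" is true only for
// the device side. The pass flag must be emitted only for device backend jobs.
static bool shouldEmitLocalAccessorFlag(bool IsBackendJob,
                                        bool IsSYCLOffloadDevice) {
  // Buggy version was: IsBackendJob && IsSYCL  (also fires for host jobs).
  return IsBackendJob && IsSYCLOffloadDevice;
}
```

With this gating, a host-side backend job in a -fsycl build never sees the flag, which is what fixes the check-sycl failures mentioned above.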

Contributor

I get the following error:

clang (LLVM option parsing): Unknown command line argument '-sycl-enable-local-accessor'. Try: 'clang (LLVM option parsing) --help'
clang (LLVM option parsing): Did you mean '--enable-local-reassign'?

Honestly, I don't really understand why the option is not visible to the front end in host mode, but using IsSYCLOffloadDevice fixed the problem, and I don't think the pass enabled by -sycl-enable-local-accessor needs to run in host mode.
Another mystery is why this issue is not exposed by the CI system. I suppose it is somehow related to a difference in CMake configuration: I don't build the NVPTX and AMDGPU targets, which I suppose are what link in the library that registers the "unknown" option.

We definitely need to do more investigation on this issue.

@jchlanda, FYI.

Contributor

Additional problem: -sycl-enable-local-accessor is only set when we perform some kind of code-generation step. As the device compilation does not do this, -sycl-enable-local-accessor is never set for device; it is only emitted for host compilations, since those go through the assembling step.

Contributor

At a high level, -sycl-enable-local-accessor is only emitted for the code-generation step with the nvptx64 target. The steps are not combined for nvptx64, which allows the option to be emitted only for device there. It is a roundabout way to restrict the option, and it can leak out to host if -S is used.

Contributor

IIRC, @jchlanda is away this week.
@AerialMantis, could someone else take a look into this?

Contributor

That's right, I'll have someone else take a look.

Contributor

There is a PR up to fix this: #5408.
I believe the CI did not catch this because it builds for both CUDA and HIP and then runs for each backend (correct me if I am wrong). If you build for CUDA or HIP, -sycl-enable-local-accessor is usable.
I think -sycl-enable-local-accessor is not available when not building for CUDA/HIP because the pass is initialised within the LLVM NVPTX and AMDGPU backends.
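The mechanism behind this can be sketched with a small standard-library model. This is not LLVM's actual cl::opt implementation; it only illustrates why an option registered by a global constructor exists solely when the object file containing that constructor (here, the NVPTX/AMDGPU backend library) is linked into the binary:

```cpp
#include <map>
#include <string>

// Illustrative model of LLVM-style static option registration. The global
// RegisterOption object below plays the role of the cl::opt defined inside
// the pass: its constructor runs at program startup, but only if this
// translation unit is linked in at all.
static std::map<std::string, bool> &optionRegistry() {
  static std::map<std::string, bool> Registry;
  return Registry;
}

struct RegisterOption {
  explicit RegisterOption(const std::string &Name) {
    optionRegistry()[Name] = true;
  }
};

// If the "backend library" containing this global is not linked, the option
// is never registered, and the driver reports it as unknown.
static RegisterOption LocalAccessorOpt("sycl-enable-local-accessor");

static bool isKnownOption(const std::string &Name) {
  return optionRegistry().count(Name) != 0;
}
```

In a build without the NVPTX/AMDGPU targets, the analogue of `LocalAccessorOpt` simply never runs, which matches the "Unknown command line argument" error reported above.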

Contributor Author

I'm almost glad that this bug surfaced, @bader, @mdtoguchi. TBH, -sycl-enable-local-accessor is a workaround that I never liked. The problem I had was that there was no way to tell that a kernel was compiled from SYCL. Simply relying on the calling convention (CallingConv::AMDGPU_KERNEL) is not enough, as there are multiple paths that use it (OpenCL, OpenMP, SYCL). I was wondering if it would be better to follow NVIDIA here and use metadata nodes to denote kernels (https://github.com/intel/llvm/blob/HEAD/clang/lib/CodeGen/TargetInfo.cpp#L7242). This would work for all the passes that we only want to run on SYCL kernels (for instance, https://github.com/intel/llvm/blob/sycl/llvm/lib/Target/NVPTX/SYCL/GlobalOffset.cpp, which we would like to generalize, would benefit from it).
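The identification problem described here can be illustrated with a toy model (plain C++, not LLVM IR; the struct and field names are invented for the sketch). The calling convention is shared by OpenCL, OpenMP, and SYCL kernels, so it cannot mark SYCL origin by itself; an explicit annotation, like NVPTX's `!nvvm.annotations` "kernel" nodes, can:

```cpp
#include <string>
#include <vector>

// Toy stand-in for an IR function: the AMDGPU kernel calling convention is
// ambiguous (OpenCL/OpenMP/SYCL all use it), so identifying SYCL kernels
// requires an extra annotation, analogous to NVPTX metadata.
struct FunctionModel {
  std::string Name;
  bool HasAmdgpuKernelCC;
  std::vector<std::string> Annotations; // e.g. {"kernel"} for annotated entry points
};

static bool isSyclKernel(const FunctionModel &F) {
  // Calling convention alone is not enough; require the annotation too.
  if (!F.HasAmdgpuKernelCC)
    return false;
  for (const auto &A : F.Annotations)
    if (A == "kernel")
      return true;
  return false;
}
```

Under this scheme, a pass like GlobalOffset could run only on annotated kernels instead of being gated behind a driver-level flag such as -sycl-enable-local-accessor.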

CmdArgs.push_back("-mllvm");
CmdArgs.push_back("-sycl-enable-local-accessor");
}
mdtoguchi marked this conversation as resolved.
// These two are potentially updated by AddClangCLArgs.
codegenoptions::DebugInfoKind DebugInfoKind = codegenoptions::NoDebugInfo;
bool EmitCodeView = false;
8 changes: 6 additions & 2 deletions clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -78,8 +78,12 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
const llvm::opt::ArgList &Args) const {
// Construct lld command.
// The output from ld.lld is an HSA code object file.
ArgStringList LldArgs{"-flavor", "gnu", "--no-undefined", "-shared",
"-plugin-opt=-amdgpu-internalize-symbols"};
ArgStringList LldArgs{"-flavor",
"gnu",
"--no-undefined",
"-shared",
"-plugin-opt=-amdgpu-internalize-symbols",
"-plugin-opt=-sycl-enable-local-accessor"};
mdtoguchi marked this conversation as resolved.

auto &TC = getToolChain();
auto &D = TC.getDriver();
11 changes: 11 additions & 0 deletions clang/test/Driver/sycl-local-accessor-opt.cpp
@@ -0,0 +1,11 @@
/// Check the correct handling of the sycl-enable-local-accessor option.

// REQUIRES: clang-driver

// RUN: %clang -fsycl -### %s 2>&1 \
// RUN: | FileCheck -check-prefix=CHECK-NO-OPT %s
// CHECK-NO-OPT-NOT: "-sycl-enable-local-accessor"

// RUN: %clang -fsycl -fsycl-targets=nvptx64-nvidia-cuda -### %s 2>&1 \
// RUN: | FileCheck %s
// CHECK: "-sycl-enable-local-accessor"
@@ -8,9 +8,9 @@
//
// This pass operates on SYCL kernels being compiled to CUDA. It modifies
// kernel entry points which take pointers to shared memory and modifies them
// to take offsets into shared memory (represented by a symbol in the shared address
// space). The SYCL runtime is expected to provide offsets rather than pointers
// to these functions.
// to take offsets into shared memory (represented by a symbol in the shared
// address space). The SYCL runtime is expected to provide offsets rather than
// pointers to these functions.
//
//===----------------------------------------------------------------------===//

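The rewrite described in the header comment above can be sketched with a small host-side model. The names and the fixed-size buffer are illustrative only; the real pass works on LLVM IR, with a zero-sized `addrspace(3)` global and a `getelementptr` + `bitcast` inserted at the kernel entry:

```cpp
#include <cstdint>

// Before the pass:  kernel(int *shared_ptr)         // pointer into shared memory
// After the pass:   kernel(uint32_t shared_offset)  // byte offset into one symbol
// and the pointer is rebuilt inside the kernel as base + offset.
static uint8_t SharedMemBase[1024]; // stands in for the shared-memory global

static int writeThenReadThroughOffset(uint32_t OffsetBytes, int Value) {
  // Equivalent of the inserted getelementptr + bitcast sequence: recreate
  // the typed pointer from the module-level base and the passed-in offset.
  int *Ptr = reinterpret_cast<int *>(SharedMemBase + OffsetBytes);
  *Ptr = Value;
  return *Ptr;
}
```

This is why the runtime only needs to hand each rewritten kernel an i32 offset: all local-accessor arguments resolve into one dynamically sized shared allocation.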
2 changes: 2 additions & 0 deletions llvm/lib/SYCLLowerIR/CMakeLists.txt
@@ -56,6 +56,8 @@ add_llvm_component_library(LLVMSYCLLowerIR
ESIMDVerifier.cpp
MutatePrintfAddrspace.cpp

LocalAccessorToSharedMemory.cpp

ADDITIONAL_HEADER_DIRS
${LLVM_MAIN_INCLUDE_DIR}/llvm/SYCLLowerIR
${LLVM_MAIN_SRC_DIR}/projects/vc-intrinsics/GenXIntrinsics/include
@@ -14,92 +14,115 @@
//
//===----------------------------------------------------------------------===//

#include "LocalAccessorToSharedMemory.h"
#include "../MCTargetDesc/NVPTXBaseInfo.h"
#include "llvm/SYCLLowerIR/LocalAccessorToSharedMemory.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/IPO.h"

using namespace llvm;

#define DEBUG_TYPE "localaccessortosharedmemory"

static bool EnableLocalAccessor;

static cl::opt<bool, true> EnableLocalAccessorFlag(
"sycl-enable-local-accessor", cl::Hidden,
cl::desc("Enable local accessor to shared memory optimisation."),
cl::location(EnableLocalAccessor), cl::init(false));

namespace llvm {
void initializeLocalAccessorToSharedMemoryPass(PassRegistry &);
}
} // namespace llvm

namespace {

class LocalAccessorToSharedMemory : public ModulePass {
mlychkov marked this conversation as resolved.
private:
enum class ArchType { Cuda, AMDHSA, Unsupported };

struct KernelPayload {
KernelPayload(Function *Kernel, MDNode *MD = nullptr)
: Kernel(Kernel), MD(MD){};
Function *Kernel;
MDNode *MD;
};

unsigned SharedASValue = 0;

public:
static char ID;
LocalAccessorToSharedMemory() : ModulePass(ID) {}

bool runOnModule(Module &M) override {
if (!EnableLocalAccessor)
return false;

auto AT = StringSwitch<ArchType>(M.getTargetTriple().c_str())
.Case("nvptx64-nvidia-cuda", ArchType::Cuda)
.Case("nvptx-nvidia-cuda", ArchType::Cuda)
.Case("amdgcn-amd-amdhsa", ArchType::AMDHSA)
.Default(ArchType::Unsupported);

// Invariant: This pass is only intended to operate on SYCL kernels being
// compiled to the `nvptx{,64}-nvidia-cuda` triple.
// TODO: make sure that non-SYCL kernels are not impacted.
// compiled to either `nvptx{,64}-nvidia-cuda`, or `amdgcn-amd-amdhsa`
// triples.
if (ArchType::Unsupported == AT)
return false;

if (skipModule(M))
return false;

// Keep track of whether the module was changed.
auto Changed = false;
switch (AT) {
case ArchType::Cuda:
// ADDRESS_SPACE_SHARED = 3,
SharedASValue = 3;
break;
case ArchType::AMDHSA:
// LOCAL_ADDRESS = 3,
SharedASValue = 3;
break;
default:
SharedASValue = 0;
break;
}

// Access `nvvm.annotations` to determine which functions are kernel entry
// points.
auto NvvmMetadata = M.getNamedMetadata("nvvm.annotations");
if (!NvvmMetadata)
SmallVector<KernelPayload> Kernels;
SmallVector<std::pair<Function *, KernelPayload>> NewToOldKernels;
populateKernels(M, Kernels, AT);
if (Kernels.empty())
return false;

for (auto MetadataNode : NvvmMetadata->operands()) {
if (MetadataNode->getNumOperands() != 3)
continue;
// Process the function and if changed, update the metadata.
for (auto K : Kernels) {
auto *NewKernel = processKernel(M, K.Kernel);
if (NewKernel)
NewToOldKernels.push_back(std::make_pair(NewKernel, K));
}

// NVPTX identifies kernel entry points using metadata nodes of the form:
// !X = !{<function>, !"kernel", i32 1}
const MDOperand &TypeOperand = MetadataNode->getOperand(1);
auto Type = dyn_cast<MDString>(TypeOperand);
if (!Type)
continue;
// Only process kernel entry points.
if (Type->getString() != "kernel")
continue;
if (NewToOldKernels.empty())
return false;

// Get a pointer to the entry point function from the metadata.
const MDOperand &FuncOperand = MetadataNode->getOperand(0);
if (!FuncOperand)
continue;
auto FuncConstant = dyn_cast<ConstantAsMetadata>(FuncOperand);
if (!FuncConstant)
continue;
auto Func = dyn_cast<Function>(FuncConstant->getValue());
if (!Func)
continue;
postProcessKernels(NewToOldKernels, AT);

// Process the function and if changed, update the metadata.
auto NewFunc = this->ProcessFunction(M, Func);
if (NewFunc) {
Changed = true;
MetadataNode->replaceOperandWith(
0, llvm::ConstantAsMetadata::get(NewFunc));
}
}
return true;
}

return Changed;
virtual llvm::StringRef getPassName() const override {
return "SYCL Local Accessor to Shared Memory";
}

Function *ProcessFunction(Module &M, Function *F) {
private:
Function *processKernel(Module &M, Function *F) {
// Check if this function is eligible by having an argument that uses shared
// memory.
auto UsesLocalMemory = false;
for (Function::arg_iterator FA = F->arg_begin(), FE = F->arg_end();
FA != FE; ++FA) {
if (FA->getType()->isPointerTy()) {
UsesLocalMemory =
FA->getType()->getPointerAddressSpace() == ADDRESS_SPACE_SHARED;
}
if (UsesLocalMemory) {
if (FA->getType()->isPointerTy() &&
FA->getType()->getPointerAddressSpace() == SharedASValue) {
UsesLocalMemory = true;
break;
}
}
@@ -111,9 +134,9 @@ class LocalAccessorToSharedMemory : public ModulePass {
// Create a global symbol to CUDA shared memory.
auto SharedMemGlobalName = F->getName().str();
SharedMemGlobalName.append("_shared_mem");
auto SharedMemGlobalType =
auto *SharedMemGlobalType =
ArrayType::get(Type::getInt8Ty(M.getContext()), 0);
auto SharedMemGlobal = new GlobalVariable(
auto *SharedMemGlobal = new GlobalVariable(
/* Module= */ M,
/* Type= */ &*SharedMemGlobalType,
/* IsConstant= */ false,
@@ -122,7 +145,7 @@
/* Name= */ Twine{SharedMemGlobalName},
/* InsertBefore= */ nullptr,
/* ThreadLocalMode= */ GlobalValue::NotThreadLocal,
/* AddressSpace= */ ADDRESS_SPACE_SHARED,
/* AddressSpace= */ SharedASValue,
/* IsExternallyInitialized= */ false);
SharedMemGlobal->setAlignment(Align(4));

@@ -139,7 +162,7 @@
for (Function::arg_iterator FA = F->arg_begin(), FE = F->arg_end();
FA != FE; ++FA, ++i) {
if (FA->getType()->isPointerTy() &&
FA->getType()->getPointerAddressSpace() == ADDRESS_SPACE_SHARED) {
FA->getType()->getPointerAddressSpace() == SharedASValue) {
// Replace pointers to shared memory with i32 offsets.
Arguments.push_back(Type::getInt32Ty(M.getContext()));
ArgumentAttributes.push_back(
@@ -178,8 +201,8 @@ class LocalAccessorToSharedMemory : public ModulePass {
if (ArgumentReplaced[i]) {
// If this argument was replaced, then create a `getelementptr`
// instruction that uses it to recreate the pointer that was replaced.
auto InsertBefore = &NF->getEntryBlock().front();
auto PtrInst = GetElementPtrInst::CreateInBounds(
auto *InsertBefore = &NF->getEntryBlock().front();
auto *PtrInst = GetElementPtrInst::CreateInBounds(
/* PointeeType= */ SharedMemGlobalType,
/* Ptr= */ SharedMemGlobal,
/* IdxList= */
@@ -191,7 +214,7 @@
// Then create a bitcast to make sure the new pointer is the same type
// as the old one. This will only ever be a `i8 addrspace(3)*` to `i32
// addrspace(3)*` type of cast.
auto CastInst = new BitCastInst(PtrInst, FA->getType());
auto *CastInst = new BitCastInst(PtrInst, FA->getType());
CastInst->insertAfter(PtrInst);
NewValueForUse = CastInst;
}
@@ -217,11 +240,85 @@
return NF;
}

virtual llvm::StringRef getPassName() const {
return "localaccessortosharedmemory";
void populateCudaKernels(Module &M, SmallVector<KernelPayload> &Kernels) {
// Access `nvvm.annotations` to determine which functions are kernel entry
// points.
auto *NvvmMetadata = M.getNamedMetadata("nvvm.annotations");
if (!NvvmMetadata)
return;

for (auto *MetadataNode : NvvmMetadata->operands()) {
if (MetadataNode->getNumOperands() != 3)
continue;

// NVPTX identifies kernel entry points using metadata nodes of the form:
// !X = !{<function>, !"kernel", i32 1}
const MDOperand &TypeOperand = MetadataNode->getOperand(1);
auto *Type = dyn_cast<MDString>(TypeOperand);
if (!Type)
continue;
// Only process kernel entry points.
if (Type->getString() != "kernel")
continue;

// Get a pointer to the entry point function from the metadata.
const MDOperand &FuncOperand = MetadataNode->getOperand(0);
if (!FuncOperand)
continue;
auto *FuncConstant = dyn_cast<ConstantAsMetadata>(FuncOperand);
if (!FuncConstant)
continue;
auto *Func = dyn_cast<Function>(FuncConstant->getValue());
if (!Func)
continue;

Kernels.push_back(KernelPayload(Func, MetadataNode));
}
}

void populateAMDKernels(Module &M, SmallVector<KernelPayload> &Kernels) {
for (auto &F : M) {
if (F.getCallingConv() == CallingConv::AMDGPU_KERNEL)
Kernels.push_back(KernelPayload(&F));
}
}
};

void populateKernels(Module &M, SmallVector<KernelPayload> &Kernels,
ArchType AT) {
switch (AT) {
case ArchType::Cuda:
return populateCudaKernels(M, Kernels);
case ArchType::AMDHSA:
return populateAMDKernels(M, Kernels);
default:
llvm_unreachable("Unsupported arch type.");
}
}

void postProcessCudaKernels(
SmallVector<std::pair<Function *, KernelPayload>> &NewToOldKernels) {
for (auto &Pair : NewToOldKernels) {
std::get<1>(Pair).MD->replaceOperandWith(
0, llvm::ConstantAsMetadata::get(std::get<0>(Pair)));
}
}

void postProcessAMDKernels(
SmallVector<std::pair<Function *, KernelPayload>> &NewToOldKernels) {}

void postProcessKernels(
SmallVector<std::pair<Function *, KernelPayload>> &NewToOldKernels,
ArchType AT) {
switch (AT) {
case ArchType::Cuda:
return postProcessCudaKernels(NewToOldKernels);
case ArchType::AMDHSA:
return postProcessAMDKernels(NewToOldKernels);
default:
llvm_unreachable("Unsupported arch type.");
}
}
};
} // end anonymous namespace

char LocalAccessorToSharedMemory::ID = 0;
2 changes: 2 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -25,6 +25,8 @@ FunctionPass *createAMDGPUPostLegalizeCombiner(bool IsOptNone);
FunctionPass *createAMDGPURegBankCombiner(bool IsOptNone);
void initializeAMDGPURegBankCombinerPass(PassRegistry &);

void initializeLocalAccessorToSharedMemoryPass(PassRegistry &);

// SI Passes
FunctionPass *createGCNDPPCombinePass();
FunctionPass *createSIAnnotateControlFlowPass();