[LinkerWrapper] Support relocatable linking for offloading #80066

jhuber6 · 2024-01-30T21:47:03Z

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a special section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entries that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behavior of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guaranteed to be correct.

llvmbot · 2024-01-30T21:47:32Z

@llvm/pr-subscribers-llvm-binary-utilities

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a sepcial section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behaviour of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.

Full diff: https://github.com/llvm/llvm-project/pull/80066.diff

5 Files Affected:

(modified) clang/test/Driver/linker-wrapper.c (+2-1)
(modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7)
(modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1)
(modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4)
(modified) llvm/lib/Object/OffloadBinary.cpp (+1-1)

diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
 
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
 // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
   return *Path;
 }
 
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+  llvm::MD5 Hasher;
+  llvm::MD5::MD5Result Hash;
+  Hasher.update(Str);
+  Hasher.final(Hash);
+  return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+  Expected<std::string> ObjcopyPath =
+      findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+  if (!ObjcopyPath)
+    return ObjcopyPath.takeError();
+
+  // Use the linker output file to get a unique hash. This creates a unique
+  // identifier to rename the sections to that is deterministic to the contents.
+  auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+                            : MemoryBuffer::getFileOrSTDIN(Output);
+  if (!BufferOrErr)
+    return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+                             Output.str().c_str());
+  std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+  SmallVector<StringRef> ObjcopyArgs = {
+      *ObjcopyPath,
+      Output,
+  };
+
+  // Remove the old .llvm.offloading section to prevent further linking.
+  ObjcopyArgs.emplace_back("--remove-section");
+  ObjcopyArgs.emplace_back(".llvm.offloading");
+  for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+    auto Section = (Prefix + "_offloading_entries").str();
+    // Rename the offloading entires to make them private to this link unit.
+    ObjcopyArgs.emplace_back("--rename-section");
+    ObjcopyArgs.emplace_back(
+        Args.MakeArgString(Section + "=" + Section + Suffix));
+
+    // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+    // newly renamed section containing the offloading entries.
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+                                                "__start_" + Section + Suffix));
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+                                                "__stop_" + Section + Suffix));
+  }
+
+  if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+    return Err;
+
+  return Error::success();
+}
+
 /// Runs the wrapped linker job with the newly created input.
 Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
     LinkerArgs.push_back(Arg);
   if (Error Err = executeCommands(LinkerPath, LinkerArgs))
     return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+    if (Error Err = relocateOffloadSection(Args, ExecutableName))
+      return Err;
+
   return Error::success();
 }
 
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
   case OFK_OpenMP:
     if (Error Err = offloading::wrapOpenMPBinaries(
             M, BuffersToWrap,
-            offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+            offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+            /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
       return std::move(Err);
     break;
   case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
 getDeviceInput(const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
 
-  // If the user is requesting a reloctable link we ignore the device code. The
-  // actual linker will merge the embedded device code sections so they can be
-  // linked when the executable is finally created.
-  if (Args.hasArg(OPT_relocatable))
-    return SmallVector<SmallVector<OffloadFile>>{};
-
   StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
   SmallVector<StringRef> LibraryPaths;
   for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
 /// \param EntryArray Optional pair pointing to the `__start` and `__stop`
 /// symbols holding the `__tgt_offload_entry` array.
 /// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
 llvm::Error wrapOpenMPBinaries(llvm::Module &M,
                                llvm::ArrayRef<llvm::ArrayRef<char>> Images,
                                EntryArrayTy EntryArray,
-                               llvm::StringRef Suffix = "");
+                               llvm::StringRef Suffix = "",
+                               bool Relocatable = false);
 
 /// Wraps the input fatbinary image into the module \p M as global symbols and
 /// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
 ///
 /// Global variable that represents BinDesc is returned.
 GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
-                              EntryArrayTy EntryArray, StringRef Suffix) {
+                              EntryArrayTy EntryArray, StringRef Suffix,
+                              bool Relocatable) {
   LLVMContext &C = M.getContext();
   auto [EntriesB, EntriesE] = EntryArray;
 
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
                                      GlobalVariable::InternalLinkage, Data,
                                      ".omp_offloading.device_image" + Suffix);
     Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
-    Image->setSection(".llvm.offloading");
+    Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+                                  : ".llvm.offloading");
     Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
 
     StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
 
 Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
                                      EntryArrayTy EntryArray,
-                                     llvm::StringRef Suffix) {
-  GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+                                     llvm::StringRef Suffix, bool Relocatable) {
+  GlobalVariable *Desc =
+      createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
   if (!Desc)
     return createStringError(inconvertibleErrorCode(),
                              "No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
       if (!NameOrErr)
         return NameOrErr.takeError();
 
-      if (!NameOrErr->equals(".llvm.offloading"))
+      if (!NameOrErr->starts_with(".llvm.offloading"))
         continue;
     }

llvmbot · 2024-01-30T21:47:32Z

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a sepcial section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behaviour of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.

Full diff: https://github.com/llvm/llvm-project/pull/80066.diff

5 Files Affected:

(modified) clang/test/Driver/linker-wrapper.c (+2-1)
(modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7)
(modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1)
(modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4)
(modified) llvm/lib/Object/OffloadBinary.cpp (+1-1)

diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
 
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
 // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
   return *Path;
 }
 
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+  llvm::MD5 Hasher;
+  llvm::MD5::MD5Result Hash;
+  Hasher.update(Str);
+  Hasher.final(Hash);
+  return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+  Expected<std::string> ObjcopyPath =
+      findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+  if (!ObjcopyPath)
+    return ObjcopyPath.takeError();
+
+  // Use the linker output file to get a unique hash. This creates a unique
+  // identifier to rename the sections to that is deterministic to the contents.
+  auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+                            : MemoryBuffer::getFileOrSTDIN(Output);
+  if (!BufferOrErr)
+    return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+                             Output.str().c_str());
+  std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+  SmallVector<StringRef> ObjcopyArgs = {
+      *ObjcopyPath,
+      Output,
+  };
+
+  // Remove the old .llvm.offloading section to prevent further linking.
+  ObjcopyArgs.emplace_back("--remove-section");
+  ObjcopyArgs.emplace_back(".llvm.offloading");
+  for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+    auto Section = (Prefix + "_offloading_entries").str();
+    // Rename the offloading entires to make them private to this link unit.
+    ObjcopyArgs.emplace_back("--rename-section");
+    ObjcopyArgs.emplace_back(
+        Args.MakeArgString(Section + "=" + Section + Suffix));
+
+    // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+    // newly renamed section containing the offloading entries.
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+                                                "__start_" + Section + Suffix));
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+                                                "__stop_" + Section + Suffix));
+  }
+
+  if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+    return Err;
+
+  return Error::success();
+}
+
 /// Runs the wrapped linker job with the newly created input.
 Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
     LinkerArgs.push_back(Arg);
   if (Error Err = executeCommands(LinkerPath, LinkerArgs))
     return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+    if (Error Err = relocateOffloadSection(Args, ExecutableName))
+      return Err;
+
   return Error::success();
 }
 
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
   case OFK_OpenMP:
     if (Error Err = offloading::wrapOpenMPBinaries(
             M, BuffersToWrap,
-            offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+            offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+            /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
       return std::move(Err);
     break;
   case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
 getDeviceInput(const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
 
-  // If the user is requesting a reloctable link we ignore the device code. The
-  // actual linker will merge the embedded device code sections so they can be
-  // linked when the executable is finally created.
-  if (Args.hasArg(OPT_relocatable))
-    return SmallVector<SmallVector<OffloadFile>>{};
-
   StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
   SmallVector<StringRef> LibraryPaths;
   for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
 /// \param EntryArray Optional pair pointing to the `__start` and `__stop`
 /// symbols holding the `__tgt_offload_entry` array.
 /// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
 llvm::Error wrapOpenMPBinaries(llvm::Module &M,
                                llvm::ArrayRef<llvm::ArrayRef<char>> Images,
                                EntryArrayTy EntryArray,
-                               llvm::StringRef Suffix = "");
+                               llvm::StringRef Suffix = "",
+                               bool Relocatable = false);
 
 /// Wraps the input fatbinary image into the module \p M as global symbols and
 /// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
 ///
 /// Global variable that represents BinDesc is returned.
 GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
-                              EntryArrayTy EntryArray, StringRef Suffix) {
+                              EntryArrayTy EntryArray, StringRef Suffix,
+                              bool Relocatable) {
   LLVMContext &C = M.getContext();
   auto [EntriesB, EntriesE] = EntryArray;
 
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
                                      GlobalVariable::InternalLinkage, Data,
                                      ".omp_offloading.device_image" + Suffix);
     Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
-    Image->setSection(".llvm.offloading");
+    Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+                                  : ".llvm.offloading");
     Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
 
     StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
 
 Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
                                      EntryArrayTy EntryArray,
-                                     llvm::StringRef Suffix) {
-  GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+                                     llvm::StringRef Suffix, bool Relocatable) {
+  GlobalVariable *Desc =
+      createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
   if (!Desc)
     return createStringError(inconvertibleErrorCode(),
                              "No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
       if (!NameOrErr)
         return NameOrErr.takeError();
 
-      if (!NameOrErr->equals(".llvm.offloading"))
+      if (!NameOrErr->starts_with(".llvm.offloading"))
         continue;
     }

jhuber6 · 2024-01-30T21:48:31Z

This is related to the discussions at the #77018 issue.

yxsamliu · 2024-01-31T16:45:38Z

clang/test/Driver/linker-wrapper.c

@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK



need to check device fatbin bundling is done and device wrapper fatbin variable using internal linkage and in postfixed section.

Added a test for the HIP and CUDA cases as well as an additional check on passing -r to the wrapper image generation.

Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct.

Artem-B · 2024-01-31T17:55:53Z

So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage.

We also use a sepcial section called something like omp_offloading_entries

Typo in 'special' in the description.

jhuber6 · 2024-01-31T18:15:48Z

So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage.

It's more or less doing -fno-gpu-rdc on a subset of files. So you can do GPU linking via -fgpu-rdc for a handful and then merge them together so nothing else will link with it as if you did -fno-gpu-rdc.

We also use a sepcial section called something like omp_offloading_entries

Typo in 'special' in the description.

Thanks, will fix.

Artem-B

LGTM.

Artem-B · 2024-01-31T18:09:06Z

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp

@@ -265,6 +329,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
    LinkerArgs.push_back(Arg);
  if (Error Err = executeCommands(LinkerPath, LinkerArgs))
    return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+    if (Error Err = relocateOffloadSection(Args, ExecutableName))


We could just return relocateOffloadSection(Args, ExecutableName)

Artem-B · 2024-01-31T18:14:49Z

llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h

@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
 /// \param EntryArray Optional pair pointing to the `__start` and `__stop`
 /// symbols holding the `__tgt_offload_entry` array.
 /// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.


Nit: "Indicate whether the binary is a relocatable object" may work a bit better for describing intent. Current description seems to describe an implementation detail.

github-actions · 2024-01-31T18:22:41Z

✅ With the latest revision this PR passed the C/C++ code formatter.

Artem-B · 2024-01-31T18:28:48Z

Supporting such mixed mode opens an interesting set of issues we may need to consider going forward:

who/where/how runs initializers in the fully linked parts?
Are public functions in the fully linked parts visible to the functions in partially linked parts? In the full-rdc mode they would, as if it's a plain C++ compilation. In partial they would not as the main GPU executable and the partial parts will be in separate executables.

This would be OK for something like CUDA where cross-TU references are usually limited to host, but would be surprising for someone who would expect C++-like behavior, which sort of the ultimate goal for offloading use case. This will eventually become a problem if/when we grow large enough subset of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs, if/when they fully link GPU side code.

jhuber6 · 2024-01-31T18:35:56Z

Supporting such mixed mode opens an interesting set of issues we may need to consider going forward:

who/where/how runs initializers in the fully linked parts?

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Are public functions in the fully linked parts visible to the functions in partially linked parts? In the full-rdc mode they would, as if it's a plain C++ compilation. In partial they would not as the main GPU executable and the partial parts will be in separate executables.

This has the same semantics as a -fno-gpu-rdc compilation. Any public __device__ function will not be available to be linked if someone did this across a boundary.

This would be OK for something like CUDA where cross-TU references are usually limited to host, but would be surprising for someone who would expect C++-like behavior, which sort of the ultimate goal for offloading use case. This will eventually become a problem if/when we grow large enough subset of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs, if/when they fully link GPU side code.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally. But in some cases they may wish to keep GPU code "private" to a subset of the project for some other purposes. Doing a relocatable link with the offloading toolchain shows enough intent in my mind that we don't need to worry about people being confused so long as we document what it does.

Artem-B · 2024-01-31T19:18:26Z

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally

Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library).

Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable.

we don't need to worry about people being confused so long as we document what it does.

We do need to document how it works. Documenting what does not work, or works differently is also important, IMO.
We do need to worry about users and their expectations.

jhuber6 · 2024-01-31T19:28:57Z

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc.

It'll definitely have an effect different from full linking, but the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally

Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library).

This was one of the things I was wondering about, since we could alternatively make a new flag for this outside of -r so it's explicit. Right now I just kind of assumed that passing -r through the offloading toolchain (via CUDA or whatever) was somewhat explicit enough, as if regular -r behaviour is desired they could just use clang or ld normally.

Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable.

Yeah, coming up with a name is difficult. You could just call it device linking, since it's more or less just doing the device link step ahead of time instead of passing it to when we make the final executable.

We do need to document how it works. Documenting what does not work, or works differently is also important, IMO. We do need to worry about users and their expectations.

Yes, I should probably update this with some documentation. I'm not sure where it would go however, maybe just in the clang-linker-wrapper's page.

Artem-B · 2024-01-31T20:02:15Z

the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users.

Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-)

Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do.

jhuber6 · 2024-01-31T20:09:26Z

the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users.

To me, if the person owning the library chooses to do this in their build system it would speak to an intentional decision not to leak the GPU parts of their code. with shipped via a static library. More or less, this option would give people a way to ship static libraries with the same GPU / Offloading semantics as a shared library.

Another advantage of this option is that it allows someone to build an application with GPU offloading using clang and then ship it as a static library that can be linked with GCC or some other compiler so long as the user has the appropriate runtime.

Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-)

Very true, I need to remind myself that I'm probably the only one who knows how this thing works in detail.

Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do.

Will do.

jhuber6 · 2024-02-07T13:14:41Z

ping

yxsamliu

LGTM. Thanks

jhuber6 requested review from arsenm, Artem-B, jdoerfert, pozulp, shiltian, yxsamliu, ronlieb and saiislam January 30, 2024 21:47

llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' llvm:binary-utilities labels Jan 30, 2024

jhuber6 mentioned this pull request Jan 31, 2024

[AMDGPU] Creating relocatable object (-r) from rdc objects (-fgpu-rdc) fails with lld error attempted static link of dynamic object in /opt/rocm-6.0.0/lib #77018

Closed

yxsamliu reviewed Jan 31, 2024

View reviewed changes

jhuber6 force-pushed the Relocatable branch from 2172b4e to af382e0 Compare January 31, 2024 17:15

Artem-B approved these changes Jan 31, 2024

View reviewed changes

Address comments

1a91a8a

Format

7db493a

jhuber6 added 2 commits January 31, 2024 14:41

Add Docs

4c480e8

Format

8c90e6d

yxsamliu approved these changes Feb 7, 2024

View reviewed changes

jhuber6 merged commit 5c84054 into llvm:main Feb 7, 2024
4 of 5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LinkerWrapper] Support relocatable linking for offloading #80066

[LinkerWrapper] Support relocatable linking for offloading #80066

jhuber6 commented Jan 30, 2024 •

edited

llvmbot commented Jan 30, 2024 •

edited

llvmbot commented Jan 30, 2024

jhuber6 commented Jan 30, 2024

yxsamliu Jan 31, 2024

jhuber6 Jan 31, 2024

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B left a comment

Artem-B Jan 31, 2024

Artem-B Jan 31, 2024

github-actions bot commented Jan 31, 2024 •

edited

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

jhuber6 commented Feb 7, 2024

yxsamliu left a comment

		@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
		// RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
		// RUN: %t.o -o a.out 2>&1 \| FileCheck %s --check-prefix=RELOCATABLE-LINK

[LinkerWrapper] Support relocatable linking for offloading #80066

[LinkerWrapper] Support relocatable linking for offloading #80066

Conversation

jhuber6 commented Jan 30, 2024 • edited

llvmbot commented Jan 30, 2024 • edited

llvmbot commented Jan 30, 2024

jhuber6 commented Jan 30, 2024

yxsamliu Jan 31, 2024

Choose a reason for hiding this comment

jhuber6 Jan 31, 2024

Choose a reason for hiding this comment

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B left a comment

Choose a reason for hiding this comment

Artem-B Jan 31, 2024

Choose a reason for hiding this comment

Artem-B Jan 31, 2024

Choose a reason for hiding this comment

github-actions bot commented Jan 31, 2024 • edited

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

Artem-B commented Jan 31, 2024

jhuber6 commented Jan 31, 2024

jhuber6 commented Feb 7, 2024

yxsamliu left a comment

Choose a reason for hiding this comment

jhuber6 commented Jan 30, 2024 •

edited

llvmbot commented Jan 30, 2024 •

edited

github-actions bot commented Jan 31, 2024 •

edited