Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LinkerWrapper] Support relocatable linking for offloading #80066

Merged
merged 5 commits into from
Feb 7, 2024

Conversation

jhuber6
Copy link
Contributor

@jhuber6 jhuber6 commented Jan 30, 2024

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a special section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entries that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behavior of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guaranteed to be correct.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' llvm:binary-utilities labels Jan 30, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Jan 30, 2024

@llvm/pr-subscribers-llvm-binary-utilities

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)

Changes

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a sepcial section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behaviour of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.


Full diff: https://github.com/llvm/llvm-project/pull/80066.diff

5 Files Affected:

  • (modified) clang/test/Driver/linker-wrapper.c (+2-1)
  • (modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7)
  • (modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1)
  • (modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4)
  • (modified) llvm/lib/Object/OffloadBinary.cpp (+1-1)
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
 
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
 // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
   return *Path;
 }
 
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+  llvm::MD5 Hasher;
+  llvm::MD5::MD5Result Hash;
+  Hasher.update(Str);
+  Hasher.final(Hash);
+  return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+  Expected<std::string> ObjcopyPath =
+      findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+  if (!ObjcopyPath)
+    return ObjcopyPath.takeError();
+
+  // Use the linker output file to get a unique hash. This creates a unique
+  // identifier to rename the sections to that is deterministic to the contents.
+  auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+                            : MemoryBuffer::getFileOrSTDIN(Output);
+  if (!BufferOrErr)
+    return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+                             Output.str().c_str());
+  std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+  SmallVector<StringRef> ObjcopyArgs = {
+      *ObjcopyPath,
+      Output,
+  };
+
+  // Remove the old .llvm.offloading section to prevent further linking.
+  ObjcopyArgs.emplace_back("--remove-section");
+  ObjcopyArgs.emplace_back(".llvm.offloading");
+  for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+    auto Section = (Prefix + "_offloading_entries").str();
+    // Rename the offloading entires to make them private to this link unit.
+    ObjcopyArgs.emplace_back("--rename-section");
+    ObjcopyArgs.emplace_back(
+        Args.MakeArgString(Section + "=" + Section + Suffix));
+
+    // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+    // newly renamed section containing the offloading entries.
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+                                                "__start_" + Section + Suffix));
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+                                                "__stop_" + Section + Suffix));
+  }
+
+  if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+    return Err;
+
+  return Error::success();
+}
+
 /// Runs the wrapped linker job with the newly created input.
 Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
     LinkerArgs.push_back(Arg);
   if (Error Err = executeCommands(LinkerPath, LinkerArgs))
     return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+    if (Error Err = relocateOffloadSection(Args, ExecutableName))
+      return Err;
+
   return Error::success();
 }
 
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
   case OFK_OpenMP:
     if (Error Err = offloading::wrapOpenMPBinaries(
             M, BuffersToWrap,
-            offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+            offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+            /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
       return std::move(Err);
     break;
   case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
 getDeviceInput(const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
 
-  // If the user is requesting a reloctable link we ignore the device code. The
-  // actual linker will merge the embedded device code sections so they can be
-  // linked when the executable is finally created.
-  if (Args.hasArg(OPT_relocatable))
-    return SmallVector<SmallVector<OffloadFile>>{};
-
   StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
   SmallVector<StringRef> LibraryPaths;
   for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
 /// \param EntryArray Optional pair pointing to the `__start` and `__stop`
 /// symbols holding the `__tgt_offload_entry` array.
 /// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
 llvm::Error wrapOpenMPBinaries(llvm::Module &M,
                                llvm::ArrayRef<llvm::ArrayRef<char>> Images,
                                EntryArrayTy EntryArray,
-                               llvm::StringRef Suffix = "");
+                               llvm::StringRef Suffix = "",
+                               bool Relocatable = false);
 
 /// Wraps the input fatbinary image into the module \p M as global symbols and
 /// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
 ///
 /// Global variable that represents BinDesc is returned.
 GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
-                              EntryArrayTy EntryArray, StringRef Suffix) {
+                              EntryArrayTy EntryArray, StringRef Suffix,
+                              bool Relocatable) {
   LLVMContext &C = M.getContext();
   auto [EntriesB, EntriesE] = EntryArray;
 
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
                                      GlobalVariable::InternalLinkage, Data,
                                      ".omp_offloading.device_image" + Suffix);
     Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
-    Image->setSection(".llvm.offloading");
+    Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+                                  : ".llvm.offloading");
     Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
 
     StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
 
 Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
                                      EntryArrayTy EntryArray,
-                                     llvm::StringRef Suffix) {
-  GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+                                     llvm::StringRef Suffix, bool Relocatable) {
+  GlobalVariable *Desc =
+      createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
   if (!Desc)
     return createStringError(inconvertibleErrorCode(),
                              "No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
       if (!NameOrErr)
         return NameOrErr.takeError();
 
-      if (!NameOrErr->equals(".llvm.offloading"))
+      if (!NameOrErr->starts_with(".llvm.offloading"))
         continue;
     }
 

@llvmbot
Copy link
Collaborator

llvmbot commented Jan 30, 2024

@llvm/pr-subscribers-clang

Author: Joseph Huber (jhuber6)

Changes

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the .llvm.offloading section so it can be
linked later. We also use a sepcial section called something like
omp_offloading_entries to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via -r to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.

$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out

One question is whether or not this should be the default behaviour of
-r when run through the linker-wrapper or a special option. Standard
-r behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.


Full diff: https://github.com/llvm/llvm-project/pull/80066.diff

5 Files Affected:

  • (modified) clang/test/Driver/linker-wrapper.c (+2-1)
  • (modified) clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp (+64-7)
  • (modified) llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h (+3-1)
  • (modified) llvm/lib/Frontend/Offloading/OffloadWrapper.cpp (+7-4)
  • (modified) llvm/lib/Object/OffloadBinary.cpp (+1-1)
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
 // RUN:   --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
 // RUN:   %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
 
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
 // RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
   return *Path;
 }
 
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+  llvm::MD5 Hasher;
+  llvm::MD5::MD5Result Hash;
+  Hasher.update(Str);
+  Hasher.final(Hash);
+  return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+  Expected<std::string> ObjcopyPath =
+      findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+  if (!ObjcopyPath)
+    return ObjcopyPath.takeError();
+
+  // Use the linker output file to get a unique hash. This creates a unique
+  // identifier to rename the sections to that is deterministic to the contents.
+  auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+                            : MemoryBuffer::getFileOrSTDIN(Output);
+  if (!BufferOrErr)
+    return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+                             Output.str().c_str());
+  std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+  SmallVector<StringRef> ObjcopyArgs = {
+      *ObjcopyPath,
+      Output,
+  };
+
+  // Remove the old .llvm.offloading section to prevent further linking.
+  ObjcopyArgs.emplace_back("--remove-section");
+  ObjcopyArgs.emplace_back(".llvm.offloading");
+  for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+    auto Section = (Prefix + "_offloading_entries").str();
+    // Rename the offloading entires to make them private to this link unit.
+    ObjcopyArgs.emplace_back("--rename-section");
+    ObjcopyArgs.emplace_back(
+        Args.MakeArgString(Section + "=" + Section + Suffix));
+
+    // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+    // newly renamed section containing the offloading entries.
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+                                                "__start_" + Section + Suffix));
+    ObjcopyArgs.emplace_back("--redefine-sym");
+    ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+                                                "__stop_" + Section + Suffix));
+  }
+
+  if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+    return Err;
+
+  return Error::success();
+}
+
 /// Runs the wrapped linker job with the newly created input.
 Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
     LinkerArgs.push_back(Arg);
   if (Error Err = executeCommands(LinkerPath, LinkerArgs))
     return Err;
+
+  if (Args.hasArg(OPT_relocatable))
+    if (Error Err = relocateOffloadSection(Args, ExecutableName))
+      return Err;
+
   return Error::success();
 }
 
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
   case OFK_OpenMP:
     if (Error Err = offloading::wrapOpenMPBinaries(
             M, BuffersToWrap,
-            offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+            offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+            /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
       return std::move(Err);
     break;
   case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
 getDeviceInput(const ArgList &Args) {
   llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
 
-  // If the user is requesting a reloctable link we ignore the device code. The
-  // actual linker will merge the embedded device code sections so they can be
-  // linked when the executable is finally created.
-  if (Args.hasArg(OPT_relocatable))
-    return SmallVector<SmallVector<OffloadFile>>{};
-
   StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
   SmallVector<StringRef> LibraryPaths;
   for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
 /// \param EntryArray Optional pair pointing to the `__start` and `__stop`
 /// symbols holding the `__tgt_offload_entry` array.
 /// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
 llvm::Error wrapOpenMPBinaries(llvm::Module &M,
                                llvm::ArrayRef<llvm::ArrayRef<char>> Images,
                                EntryArrayTy EntryArray,
-                               llvm::StringRef Suffix = "");
+                               llvm::StringRef Suffix = "",
+                               bool Relocatable = false);
 
 /// Wraps the input fatbinary image into the module \p M as global symbols and
 /// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
 ///
 /// Global variable that represents BinDesc is returned.
 GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
-                              EntryArrayTy EntryArray, StringRef Suffix) {
+                              EntryArrayTy EntryArray, StringRef Suffix,
+                              bool Relocatable) {
   LLVMContext &C = M.getContext();
   auto [EntriesB, EntriesE] = EntryArray;
 
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
                                      GlobalVariable::InternalLinkage, Data,
                                      ".omp_offloading.device_image" + Suffix);
     Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
-    Image->setSection(".llvm.offloading");
+    Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+                                  : ".llvm.offloading");
     Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
 
     StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
 
 Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
                                      EntryArrayTy EntryArray,
-                                     llvm::StringRef Suffix) {
-  GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+                                     llvm::StringRef Suffix, bool Relocatable) {
+  GlobalVariable *Desc =
+      createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
   if (!Desc)
     return createStringError(inconvertibleErrorCode(),
                              "No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
       if (!NameOrErr)
         return NameOrErr.takeError();
 
-      if (!NameOrErr->equals(".llvm.offloading"))
+      if (!NameOrErr->starts_with(".llvm.offloading"))
         continue;
     }
 

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 30, 2024

This is related to the discussions at the #77018 issue.

@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check device fatbin bundling is done and device wrapper fatbin variable using internal linkage and in postfixed section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for the HIP and CUDA cases as well as an additional check on passing -r to the wrapper image generation.

Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a sepcial section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.

One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.

This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entires that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.

Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```

Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.

```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out
```

One question is whether or not this should be the default behaviour of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guranteed to be correct.
@Artem-B
Copy link
Member

Artem-B commented Jan 31, 2024

So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage.

We also use a sepcial section called something like omp_offloading_entries

Typo in 'special' in the description.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 31, 2024

So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage.

It's more or less doing -fno-gpu-rdc on a subset of files. So you can do GPU linking via -fgpu-rdc for a handful and then merge them together so nothing else will link with it as if you did -fno-gpu-rdc.

We also use a sepcial section called something like omp_offloading_entries

Typo in 'special' in the description.

Thanks, will fix.

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@@ -265,6 +329,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
LinkerArgs.push_back(Arg);
if (Error Err = executeCommands(LinkerPath, LinkerArgs))
return Err;

if (Args.hasArg(OPT_relocatable))
if (Error Err = relocateOffloadSection(Args, ExecutableName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just return relocateOffloadSection(Args, ExecutableName)

@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
/// \param EntryArray Optional pair pointing to the `__start` and `__stop`
/// symbols holding the `__tgt_offload_entry` array.
/// \param Suffix An optional suffix appended to the emitted symbols.
/// \param Relocatable Indicate if we need to change the offloading section.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "Indicate whether the binary is a relocatable object" may work a bit better for describing intent. Current description seems to describe an implementation detail.

Copy link

github-actions bot commented Jan 31, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@Artem-B
Copy link
Member

Artem-B commented Jan 31, 2024

Supporting such mixed mode opens an interesting set of issues we may need to consider going forward:

  • who/where/how runs initializers in the fully linked parts?
  • Are public functions in the fully linked parts visible to the functions in partially linked parts? In the full-rdc mode they would, as if it's a plain C++ compilation. In partial they would not as the main GPU executable and the partial parts will be in separate executables.

This would be OK for something like CUDA where cross-TU references are usually limited to host, but would be surprising for someone who would expect C++-like behavior, which sort of the ultimate goal for offloading use case. This will eventually become a problem if/when we grow large enough subset of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs, if/when they fully link GPU side code.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 31, 2024

Supporting such mixed mode opens an interesting set of issues we may need to consider going forward:

who/where/how runs initializers in the fully linked parts?

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Are public functions in the fully linked parts visible to the functions in partially linked parts? In the full-rdc mode they would, as if it's a plain C++ compilation. In partial they would not as the main GPU executable and the partial parts will be in separate executables.

This has the same semantics as a -fno-gpu-rdc compilation. Any public __device__ function will not be available to be linked if someone did this across a boundary.

This would be OK for something like CUDA where cross-TU references are usually limited to host, but would be surprising for someone who would expect C++-like behavior, which sort of the ultimate goal for offloading use case. This will eventually become a problem if/when we grow large enough subset of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs, if/when they fully link GPU side code.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally. But in some cases they may wish to keep GPU code "private" to a subset of the project for some other purposes. Doing a relocatable link with the offloading toolchain shows enough intent in my mind that we don't need to worry about people being confused so long as we document what it does.

@Artem-B
Copy link
Member

Artem-B commented Jan 31, 2024

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally

Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library).

Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable.

we don't need to worry about people being confused so long as we document what it does.

We do need to document how it works. Documenting what does not work, or works differently is also important, IMO.
We do need to worry about users and their expectations.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 31, 2024

I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.

Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc.

It'll definitely have an effect different from full linking, but the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

The idea is that users already get C++-like behavior with the new driver and -fgpu-rdc generally

Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library).

This was one of the things I was wondering about, since we could alternatively make a new flag for this outside of -r so it's explicit. Right now I just kind of assumed that passing -r through the offloading toolchain (via CUDA or whatever) was somewhat explicit enough, as if regular -r behaviour is desired they could just use clang or ld normally.

Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable.

Yeah, coming up with a name is difficult. You could just call it device linking, since it's more or less just doing the device link step ahead of time instead of passing it to when we make the final executable.

We do need to document how it works. Documenting what does not work, or works differently is also important, IMO. We do need to worry about users and their expectations.

Yes, I should probably update this with some documentation. I'm not sure where it would go however, maybe just in the clang-linker-wrapper's page.

@Artem-B
Copy link
Member

Artem-B commented Jan 31, 2024

the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users.

Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-)

Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Jan 31, 2024

the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.

That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users.

To me, if the person owning the library chooses to do this in their build system it would speak to an intentional decision not to leak the GPU parts of their code. with shipped via a static library. More or less, this option would give people a way to ship static libraries with the same GPU / Offloading semantics as a shared library.

Another advantage of this option is that it allows someone to build an application with GPU offloading using clang and then ship it as a static library that can be linked with GCC or some other compiler so long as the user has the appropriate runtime.

Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-)

Very true, I need to remind myself that I'm probably the only one who knows how this thing works in detail.

Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do.

Will do.

@jhuber6
Copy link
Contributor Author

jhuber6 commented Feb 7, 2024

ping

Copy link
Collaborator

@yxsamliu yxsamliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks

@jhuber6 jhuber6 merged commit 5c84054 into llvm:main Feb 7, 2024
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang Clang issues not falling into any other category llvm:binary-utilities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants