
[clang][CodeGen] Omit pre-opt link when post-opt is link requested #85672

Merged
merged 9 commits into llvm:main on May 8, 2024

Conversation

lamb-j
Contributor

@lamb-j lamb-j commented Mar 18, 2024

Currently, when the -relink-builtin-bitcode-postop option is used, we link builtin bitcodes twice: once before optimization, and again after optimization.

With this change, we omit the pre-opt linking when the option is set, and we rename the option to the following:

-Xclang -mlink-builtin-bitcodes-postopt
(-Xclang -mno-link-builtin-bitcodes-postopt)

The goal of this change is to reduce compile time. We do lose the theoretical benefits of pre-opt linking, but in practice these are smaller than the overhead of linking twice. However, we may be able to address this in a future patch by adjusting the position of the builtin-bitcode linking pass.

Compilations not setting the option are unaffected.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:codegen labels Mar 18, 2024
@llvmbot
Collaborator

llvmbot commented Mar 18, 2024

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Author: Jacob Lambert (lamb-j)

Changes

Currently, when the -relink-builtin-bitcode-postop option is used, we link builtin bitcodes twice: once before optimization, and again after optimization.

With this change, we omit the pre-opt linking when the option is set, and we rename the option to the following:

-link-builtin-bitcodes-postopt

The goal of this change is to reduce compile time. We do lose the benefits of pre-opt linking, but we may be able to address this in a future patch by adjusting the position of the builtin-bitcode linking pass.

Compilations not setting the relink (link) option are unaffected.


Full diff: https://github.com/llvm/llvm-project/pull/85672.diff

3 Files Affected:

  • (modified) clang/lib/CodeGen/BackendUtil.cpp (+5-8)
  • (modified) clang/lib/CodeGen/CodeGenAction.cpp (+3-3)
  • (modified) clang/lib/CodeGen/LinkInModulesPass.cpp (-4)
diff --git a/clang/lib/CodeGen/BackendUtil.cpp b/clang/lib/CodeGen/BackendUtil.cpp
index 82b30b8d815629..c5571eb15ce3a9 100644
--- a/clang/lib/CodeGen/BackendUtil.cpp
+++ b/clang/lib/CodeGen/BackendUtil.cpp
@@ -112,9 +112,9 @@ static cl::opt<bool> ClSanitizeOnOptimizerEarlyEP(
 extern cl::opt<InstrProfCorrelator::ProfCorrelatorKind> ProfileCorrelate;
 
 // Re-link builtin bitcodes after optimization
-cl::opt<bool> ClRelinkBuiltinBitcodePostop(
-    "relink-builtin-bitcode-postop", cl::Optional,
-    cl::desc("Re-link builtin bitcodes after optimization."), cl::init(false));
+cl::opt<bool> ClLinkBuiltinBitcodePostopt(
+    "link-builtin-bitcode-postopt", cl::Optional,
+    cl::desc("Link builtin bitcodes after optimization."), cl::init(false));
 } // namespace llvm
 
 namespace {
@@ -1051,11 +1051,8 @@ void EmitAssemblyHelper::RunOptimizationPipeline(
     }
   }
 
-  // Re-link against any bitcodes supplied via the -mlink-builtin-bitcode option
-  // Some optimizations may generate new function calls that would not have
-  // been linked pre-optimization (i.e. fused sincos calls generated by
-  // AMDGPULibCalls::fold_sincos.)
-  if (ClRelinkBuiltinBitcodePostop)
+  // Link against bitcodes supplied via the -mlink-builtin-bitcode option
+  if (ClLinkBuiltinBitcodePostopt)
     MPM.addPass(LinkInModulesPass(BC, false));
 
   // Add a verifier pass if requested. We don't have to do this if the action
diff --git a/clang/lib/CodeGen/CodeGenAction.cpp b/clang/lib/CodeGen/CodeGenAction.cpp
index bb9aaba025fa59..51f54d178672d1 100644
--- a/clang/lib/CodeGen/CodeGenAction.cpp
+++ b/clang/lib/CodeGen/CodeGenAction.cpp
@@ -58,7 +58,7 @@ using namespace llvm;
 #define DEBUG_TYPE "codegenaction"
 
 namespace llvm {
-extern cl::opt<bool> ClRelinkBuiltinBitcodePostop;
+extern cl::opt<bool> ClLinkBuiltinBitcodePostopt;
 }
 
 namespace clang {
@@ -359,7 +359,7 @@ void BackendConsumer::HandleTranslationUnit(ASTContext &C) {
   }
 
   // Link each LinkModule into our module.
-  if (LinkInModules(getModule()))
+  if (!LinkBuiltinBitcodesPostopt && LinkInModules(getModule()))
     return;
 
   for (auto &F : getModule()->functions()) {
@@ -1213,7 +1213,7 @@ void CodeGenAction::ExecuteAction() {
                          std::move(LinkModules), *VMContext, nullptr);
 
   // Link in each pending link module.
-  if (Result.LinkInModules(&*TheModule))
+  if (!LinkBuiltinBitcodesPostopt && Result.LinkInModules(&*TheModule))
     return;
 
   // PR44896: Force DiscardValueNames as false. DiscardValueNames cannot be
diff --git a/clang/lib/CodeGen/LinkInModulesPass.cpp b/clang/lib/CodeGen/LinkInModulesPass.cpp
index 929539cc8f3346..ce1c0ebe046a61 100644
--- a/clang/lib/CodeGen/LinkInModulesPass.cpp
+++ b/clang/lib/CodeGen/LinkInModulesPass.cpp
@@ -28,10 +28,6 @@ PreservedAnalyses LinkInModulesPass::run(Module &M, ModuleAnalysisManager &AM) {
   if (!BC)
     return PreservedAnalyses::all();
 
-  // Re-load bitcode modules from files
-  if (BC->ReloadModules(&M))
-    report_fatal_error("Bitcode module re-loading failed, aborted!");
-
   if (BC->LinkInModules(&M, ShouldLinkFiles))
     report_fatal_error("Bitcode module re-linking failed, aborted!");
 

@b-sumner

This seems like it could have a significant hit on performance. Have the runtime performance effects been measured?

@jhuber6
Contributor

jhuber6 commented Mar 19, 2024

This means that there won't be any optimizations on these definitions, correct? Likely not ideal to have no inlining even if it saves compilation time.

This "post-op" linking is only required because we emit calls to functions that don't exist in the module. The way we solved this in OpenMP is by always providing the library at the link step and making the optimization passes not emit new calls if we are in "post-link LTO" as determined by module flags.

We could theoretically just force all AMDGPU compilation to go through the LTO pass, something like ld.lld --start-lib ockl.bc ocml.bc --end-lib. That would have the effect of fixing-up any missing definitions as it will only extract if needed. Problem is that the device libs have protected visibility until the next release cycle so this won't actually internalize.

@lamb-j
Contributor Author

lamb-j commented Apr 2, 2024

This seems like it could have a significant hit on performance. Have the runtime performance effects been measured?

@b-sumner I just got some initial results back from CQE related to this change. There are some small performance hits (1-4%) for some applications, but as a whole this improves performance across the OpenCL apps tested. (See comment below for possible explanation)

This means that there won't be any optimizations on these definitions, correct? Likely not ideal to have no inlining even if it saves compilation time.

This "post-op" linking is only required because we emit calls to functions that don't exist in the module. The way we solved this in OpenMP is by always providing the library at the link step and making the optimization passes not emit new calls if we are in "post-link LTO" as determined by module flags.

We could theoretically just force all AMDGPU compilation to go through the LTO pass, something like ld.lld --start-lib ockl.bc ocml.bc --end-lib. That would have the effect of fixing-up any missing definitions as it will only extract if needed. Problem is that the device libs have protected visibility until the next release cycle so this won't actually internalize.

@jhuber6 In the context in which this option will be used (CLR compilation), we do perform a distinct codegen call after generating the output bitcodes where this option is used and the device libraries are linked in. So in a sense, we are deferring some optimization of the device libraries to that second call.

From our offline discussions, the consensus seems to be that a better long-term strategy is needed for linking in device libraries, preferably something that doesn't rely on -mlink-builtin-bitcode, and that there is progress being made to update the libraries to support an alternative linking strategy. I am totally on board with this effort, and would be happy to help get us there.

However there are still several barriers in the libraries (example ockl) that need to be ironed out before we can fully switch to a new approach: switching from globals to libcall recognition, needing separate entry points for optimized versions, issues with wavefront, etc.

In the meantime, I'd like to push this change forward, as I think it's an incremental improvement over what we currently have:

  • It removes the hacky "link twice" logic that reloaded bitcodes from the filesystem
  • It simplifies the "link-postop" option description and use case
  • It improves performance for OpenCL benchmarks, particularly those with a very short runtime where a significant portion of that time results from compilation time.

@b-sumner

b-sumner commented Apr 2, 2024

I doubt it's a good idea to give any weight to very short running kernels where the compilation time is mistakenly being factored into the runtime. I want to be sure that a kernel which is calling math functions or whatever billions of times is as optimized as we can make it.

@jhuber6
Contributor

jhuber6 commented Apr 2, 2024

The motivation for this change is a 20% drop in performance on an OpenCL kernel due to increased build times, right? Is this some core software that has been deemed unacceptable to regress performance on even though it's so small the compilation time is the majority of time spent? We do need to further improve our infrastructure to make linking this stuff directly more viable however.

@lamb-j
Contributor Author

lamb-j commented May 1, 2024

I've been working on testing this patch with an array of OpenCL benchmarks over the past month. We did some high-level regression testing with the following benchmarks:

BlackMagic
Linpack_Dgemm
babelstream-Double
babelstream-Float
chimex
clfft
clmem
clpeak-Double-precision compute
clpeak-Global memory bandwidth
clpeak-Integer compute
clpeak-Single-precision compute
clpeak-Transfer bandwidth
computeApps
dgemm_linux
fahbench
flopscl
ge-workspace
ge_rdppenality
indigo-benchmark
lattice
luxmark
luxmark4
mixbench-ocl-ro
ocltst
shoc
silentarmy
viennacl

Across these apps, we didn't see any significant regressions.

I also did some in-depth testing with FAHbench and Chimex:

FAHBench

Current:

Final score:  216.8422, 218.2792, 218.3647
Scaled score: 216.8422 (23558 atoms)

App Runtime: 1m42.181s, 1m42.185s,  1m42.167s

Compilation time: 3226 ms

With this PR:

Final score:  222.3547,  219.8134, 223.3722
Scaled score: 222.3547 (23558 atoms)

App Runtime: 1m40.852s, 1m40.850s,  1m40.849s

Compilation time: 1822 ms

Between the two builds, the total runtime difference is ~1.3 seconds, and the difference in compilation time is also ~1.3 seconds. So the data supports that this PR only removes compilation overhead rather than introducing runtime regressions. I also looked into the intermediate files. If we dump the two final .so files, they're nearly identical, with only a few lines differing.

Chimex

Current:

Correlation matrices computation time: 2.3876s on GPU 
[Theoretical max: @13.9 TFLOPS, 1659.3 kHz; 83% efficiency]
[Algorithm max:   @13.9 TFLOPS, 1634.6 kHz; 84% efficiency]

Compilation Time: 742 ms

With this PR:

Correlation matrices computation time: 1.9782s on GPU
[Theoretical max: @13.9 TFLOPS, 1659.3 kHz; 100% efficiency]
[Algorithm max:   @13.9 TFLOPS, 1634.6 kHz; 101% efficiency]

Compilation Time: 551 ms

@b-sumner

b-sumner commented May 1, 2024

@lamb-j I appreciate your carrying out these performance tests!

Would you be open to providing an option to enable post-link optimization that could be off by default for now?

@lamb-j
Contributor Author

lamb-j commented May 1, 2024

@lamb-j I appreciate your carrying out these performance tests!

Would you be open to providing an option to enable post-link optimization that could be off by default for now?

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

@b-sumner

b-sumner commented May 1, 2024

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

@jhuber6
Contributor

jhuber6 commented May 1, 2024

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

Presumably you'd add --no-link-builtin-bitcodes-postopt and then use Args.hasFlag(PosFlag, NegFlag, Default).

@b-sumner

b-sumner commented May 1, 2024

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

Presumably you'd add --no-link-builtin-bitcodes-postopt and then use Args.hasFlag(PosFlag, NegFlag, Default).

I assume I'd need an -Xclang or -Wl, in front? And it would need some test coverage.

@llvmbot llvmbot added clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 2, 2024
@lamb-j
Contributor Author

lamb-j commented May 2, 2024

Switched the option from a cl::opt to a Clang/CodeGen option.

Also added a LIT test which tests both -mlink-builtin-bitcode-postopt and -mno-link-builtin-bitcode-postopt.

@b-sumner

b-sumner commented May 7, 2024

Looks good to me.

Contributor

@jhuber6 jhuber6 left a comment


Hopefully in the future we can handle this in the linker better.

@lamb-j lamb-j requested a review from yxsamliu May 7, 2024 18:49
@lamb-j lamb-j merged commit 11a6799 into llvm:main May 8, 2024
4 checks passed
@lamb-j lamb-j deleted the omit-first-link branch May 8, 2024 15:11
bogner added a commit to bogner/llvm-project that referenced this pull request May 8, 2024
After 11a6799 "[clang][CodeGen] Omit pre-opt link when post-opt
is link requested (llvm#85672)" I'm seeing a new warning:

> BackendConsumer.h:37:22: error: private field 'FileMgr' is not used
> [-Werror,-Wunused-private-field]

Remove the field since it's no longer used.
bogner added a commit that referenced this pull request May 8, 2024
After 11a6799 "[clang][CodeGen] Omit pre-opt link when post-opt is
link requested (#85672)" I'm seeing a new warning:

> BackendConsumer.h:37:22: error: private field 'FileMgr' is not used
[-Werror,-Wunused-private-field]

Remove the field since it's no longer used.
@lamb-j
Contributor Author

lamb-j commented May 8, 2024

#91495

Follow-up PR to fix the warning from the now-unused private variable (FileMgr) in the BackendConsumer class
