
[clang][CodeGen] Omit pre-opt link when post-opt is link requested #85672

Merged
merged 9 commits into llvm:main on May 8, 2024

Conversation

lamb-j
Contributor

@lamb-j lamb-j commented Mar 18, 2024

Currently, when the -relink-builtin-bitcode-postop option is used, we link builtin bitcodes twice: once before optimization, and again after optimization.

With this change, we omit the pre-opt linking when the option is set, and we rename the option to the following:

-Xclang -mlink-builtin-bitcodes-postopt
(-Xclang -mno-link-builtin-bitcodes-postopt)

The goal of this change is to reduce compile time. We do lose the theoretical benefits of pre-opt linking, but in practice these are smaller than the overhead of linking twice. However, we may be able to address this in a future patch by adjusting the position of the builtin-bitcode linking pass.

Compilations not setting the option are unaffected.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:codegen labels Mar 18, 2024
@llvmbot
Collaborator

llvmbot commented Mar 18, 2024

@llvm/pr-subscribers-clang-driver
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Author: Jacob Lambert (lamb-j)

Changes

Currently, when the -relink-builtin-bitcode-postop option is used, we link builtin bitcodes twice: once before optimization, and again after optimization.

With this change, we omit the pre-opt linking when the option is set, and we rename the option to the following:

-link-builtin-bitcodes-postopt

The goal of this change is to reduce compile time. We do lose the benefits of pre-opt linking, but we may be able to address this in a future patch by adjusting the position of the builtin-bitcode linking pass.

Compilations not setting the relink (link) option are unaffected.


Full diff: https://github.com/llvm/llvm-project/pull/85672.diff

3 Files Affected:

  • (modified) clang/lib/CodeGen/BackendUtil.cpp (+5-8)
  • (modified) clang/lib/CodeGen/CodeGenAction.cpp (+3-3)
  • (modified) clang/lib/CodeGen/LinkInModulesPass.cpp (-4)
diff --git a/clang/lib/CodeGen/BackendUtil.cpp b/clang/lib/CodeGen/BackendUtil.cpp
index 82b30b8d815629..c5571eb15ce3a9 100644
--- a/clang/lib/CodeGen/BackendUtil.cpp
+++ b/clang/lib/CodeGen/BackendUtil.cpp
@@ -112,9 +112,9 @@ static cl::opt<bool> ClSanitizeOnOptimizerEarlyEP(
 extern cl::opt<InstrProfCorrelator::ProfCorrelatorKind> ProfileCorrelate;
 
 // Re-link builtin bitcodes after optimization
-cl::opt<bool> ClRelinkBuiltinBitcodePostop(
-    "relink-builtin-bitcode-postop", cl::Optional,
-    cl::desc("Re-link builtin bitcodes after optimization."), cl::init(false));
+cl::opt<bool> ClLinkBuiltinBitcodePostopt(
+    "link-builtin-bitcode-postopt", cl::Optional,
+    cl::desc("Link builtin bitcodes after optimization."), cl::init(false));
 } // namespace llvm
 
 namespace {
@@ -1051,11 +1051,8 @@ void EmitAssemblyHelper::RunOptimizationPipeline(
     }
   }
 
-  // Re-link against any bitcodes supplied via the -mlink-builtin-bitcode option
-  // Some optimizations may generate new function calls that would not have
-  // been linked pre-optimization (i.e. fused sincos calls generated by
-  // AMDGPULibCalls::fold_sincos.)
-  if (ClRelinkBuiltinBitcodePostop)
+  // Link against bitcodes supplied via the -mlink-builtin-bitcode option
+  if (ClLinkBuiltinBitcodePostopt)
     MPM.addPass(LinkInModulesPass(BC, false));
 
   // Add a verifier pass if requested. We don't have to do this if the action
diff --git a/clang/lib/CodeGen/CodeGenAction.cpp b/clang/lib/CodeGen/CodeGenAction.cpp
index bb9aaba025fa59..51f54d178672d1 100644
--- a/clang/lib/CodeGen/CodeGenAction.cpp
+++ b/clang/lib/CodeGen/CodeGenAction.cpp
@@ -58,7 +58,7 @@ using namespace llvm;
 #define DEBUG_TYPE "codegenaction"
 
 namespace llvm {
-extern cl::opt<bool> ClRelinkBuiltinBitcodePostop;
+extern cl::opt<bool> ClLinkBuiltinBitcodePostopt;
 }
 
 namespace clang {
@@ -359,7 +359,7 @@ void BackendConsumer::HandleTranslationUnit(ASTContext &C) {
   }
 
   // Link each LinkModule into our module.
-  if (LinkInModules(getModule()))
+  if (!LinkBuiltinBitcodesPostopt && LinkInModules(getModule()))
     return;
 
   for (auto &F : getModule()->functions()) {
@@ -1213,7 +1213,7 @@ void CodeGenAction::ExecuteAction() {
                          std::move(LinkModules), *VMContext, nullptr);
 
   // Link in each pending link module.
-  if (Result.LinkInModules(&*TheModule))
+  if (!LinkBuiltinBitcodesPostopt && Result.LinkInModules(&*TheModule))
     return;
 
   // PR44896: Force DiscardValueNames as false. DiscardValueNames cannot be
diff --git a/clang/lib/CodeGen/LinkInModulesPass.cpp b/clang/lib/CodeGen/LinkInModulesPass.cpp
index 929539cc8f3346..ce1c0ebe046a61 100644
--- a/clang/lib/CodeGen/LinkInModulesPass.cpp
+++ b/clang/lib/CodeGen/LinkInModulesPass.cpp
@@ -28,10 +28,6 @@ PreservedAnalyses LinkInModulesPass::run(Module &M, ModuleAnalysisManager &AM) {
   if (!BC)
     return PreservedAnalyses::all();
 
-  // Re-load bitcode modules from files
-  if (BC->ReloadModules(&M))
-    report_fatal_error("Bitcode module re-loading failed, aborted!");
-
   if (BC->LinkInModules(&M, ShouldLinkFiles))
     report_fatal_error("Bitcode module re-linking failed, aborted!");
 

@b-sumner

This seems like it could have a significant hit on performance. Have the runtime performance effects been measured?

@jhuber6
Contributor

jhuber6 commented Mar 19, 2024

This means that there won't be any optimizations on these definitions, correct? Likely not ideal to have no inlining even if it saves compilation time.

This "post-op" linking is only required because we emit calls to functions that don't exist in the module. The way we solved this in OpenMP is by always providing the library at the link step and making the optimization passes not emit new calls if we are in "post-link LTO" as determined by module flags.

We could theoretically just force all AMDGPU compilation to go through the LTO pass, something like ld.lld --start-lib ockl.bc ocml.bc --end-lib. That would have the effect of fixing-up any missing definitions as it will only extract if needed. Problem is that the device libs have protected visibility until the next release cycle so this won't actually internalize.

@lamb-j
Contributor Author

lamb-j commented Apr 2, 2024

This seems like it could have a significant hit on performance. Have the runtime performance effects been measured?

@b-sumner I just got some initial results back from CQE related to this change. There are some small performance hits (1-4%) for some applications, but as a whole this improves performance across the OpenCL apps tested. (See comment below for possible explanation)

This means that there won't be any optimizations on these definitions, correct? Likely not ideal to have no inlining even if it saves compilation time.

This "post-op" linking is only required because we emit calls to functions that don't exist in the module. The way we solved this in OpenMP is by always providing the library at the link step and making the optimization passes not emit new calls if we are in "post-link LTO" as determined by module flags.

We could theoretically just force all AMDGPU compilation to go through the LTO pass, something like ld.lld --start-lib ockl.bc ocml.bc --end-lib. That would have the effect of fixing-up any missing definitions as it will only extract if needed. Problem is that the device libs have protected visibility until the next release cycle so this won't actually internalize.

@jhuber6 In the context in which this option will be used (CLR compilation), we do perform a distinct codegen call after generating the output bitcodes where this option is used and the device libraries are linked in. So in a sense, we are deferring some optimization of the device libraries to that second call.

From our offline discussions, the consensus seems to be that a better long-term strategy is needed for linking in device libraries, preferably something that doesn't rely on -mlink-builtin-bitcode, and that there is progress being made to update the libraries to support an alternative linking strategy. I am totally on board with this effort, and would be happy to help get us there.

However there are still several barriers in the libraries (example ockl) that need to be ironed out before we can fully switch to a new approach: switching from globals to libcall recognition, needing separate entry points for optimized versions, issues with wavefront, etc.

In the meantime, I'd like to push this change forward, as I think it's an incremental improvement over what we currently have:

  • It removes the hacky "link twice" logic that reloaded bitcodes from the filesystem
  • It simplifies the "link-postop" option description and use case
  • It improves performance for OpenCL benchmarks, particularly those with a very short runtime where a significant portion of that time results from compilation time.

@b-sumner

b-sumner commented Apr 2, 2024

I doubt it's a good idea to give any weight to very short running kernels where the compilation time is mistakenly being factored into the runtime. I want to be sure that a kernel which is calling math functions or whatever billions of times is as optimized as we can make it.

@jhuber6
Contributor

jhuber6 commented Apr 2, 2024

The motivation for this change is a 20% drop in performance on an OpenCL kernel due to increased build times, right? Is this some core software that has been deemed unacceptable to regress performance on even though it's so small the compilation time is the majority of time spent? We do need to further improve our infrastructure to make linking this stuff directly more viable however.

@lamb-j
Contributor Author

lamb-j commented May 1, 2024

I've been working on testing this patch with an array of OpenCL benchmarks over the past month. We did some high-level regression testing with the following benchmarks:

BlackMagic
Linpack_Dgemm
babelstream-Double
babelstream-Float
chimex
clfft
clmem
clpeak-Double-precision compute
clpeak-Global memory bandwidth
clpeak-Integer compute
clpeak-Single-precision compute
clpeak-Transfer bandwidth
computeApps
dgemm_linux
fahbench
flopscl
ge-workspace
ge_rdppenality
indigo-benchmark
lattice
luxmark
luxmark4
mixbench-ocl-ro
ocltst
shoc
silentarmy
viennacl

Across these apps, we didn't see any significant regressions.

I also did some in-depth testing with FAHbench and Chimex:

FAHBench

Current:

Final score:  216.8422, 218.2792, 218.3647
Scaled score: 216.8422 (23558 atoms)

App Runtime: 1m42.181s, 1m42.185s,  1m42.167s

Compilation time: 3226 ms

With this PR:

Final score:  222.3547,  219.8134, 223.3722
Scaled score: 222.3547 (23558 atoms)

App Runtime: 1m40.852s, 1m40.850s,  1m40.849s

Compilation time: 1822 ms

Between the two builds, the total runtime difference is ~1.3 seconds, and the difference in compilation time is also ~1.3 seconds. So the data supports that this PR only removes compilation overhead rather than introducing runtime regressions. I also looked into the intermediate files. If we dump the two final .so files, they're nearly identical, with only a few lines differing.

Chimex

Current:

Correlation matrices computation time: 2.3876s on GPU 
[Theoretical max: @13.9 TFLOPS, 1659.3 kHz; 83% efficiency]
[Algorithm max:   @13.9 TFLOPS, 1634.6 kHz; 84% efficiency]

Compilation Time: 742 ms

With this PR:

Correlation matrices computation time: 1.9782s on GPU
[Theoretical max: @13.9 TFLOPS, 1659.3 kHz; 100% efficiency]
[Algorithm max:   @13.9 TFLOPS, 1634.6 kHz; 101% efficiency]

Compilation Time: 551 ms

@b-sumner

b-sumner commented May 1, 2024

@lamb-j I appreciate your carrying out these performance tests!

Would you be open to providing an option to enable post-link optimization that could be off by default for now?

@lamb-j
Contributor Author

lamb-j commented May 1, 2024

@lamb-j I appreciate your carrying out these performance tests!

Would you be open to providing an option to enable post-link optimization that could be off by default for now?

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

@b-sumner

b-sumner commented May 1, 2024

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

@jhuber6
Contributor

jhuber6 commented May 1, 2024

Currently if a user doesn't supply the new "-link-builtin-bitcodes-postopt" option, linking builtin bitcodes happens first, then the optimization pipeline follows. Does that cover the case you're talking about?

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

Presumably you'd add --no-link-builtin-bitcodes-postopt and then use Args.hasFlag(PosFlag, NegFlag, Default).

@b-sumner

b-sumner commented May 1, 2024

I'm thinking of an option that developers can use. If -link-builtin-bitcodes-postopt becomes the default, how can developers disable it?

Presumably you'd add --no-link-builtin-bitcodes-postopt and then use Args.hasFlag(PosFlag, NegFlag, Default).

I assume I'd need an -Xclang or -Wl, in front? And it would need some test coverage.

@llvmbot llvmbot added clang:driver 'clang' and 'clang++' user-facing binaries. Not 'clang-cl' clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 2, 2024
@lamb-j
Contributor Author

lamb-j commented May 2, 2024

Switched the option from a cl::opt to a Clang/CodeGen option.

Also added a LIT test which tests both -mlink-builtin-bitcode-postopt and -mno-link-builtin-bitcode-postopt.

@b-sumner

b-sumner commented May 7, 2024

Looks good to me.

Contributor

@jhuber6 jhuber6 left a comment


Hopefully in the future we can handle this in the linker better.

@lamb-j lamb-j requested a review from yxsamliu May 7, 2024 18:49
@lamb-j lamb-j merged commit 11a6799 into llvm:main May 8, 2024
4 checks passed
@lamb-j lamb-j deleted the omit-first-link branch May 8, 2024 15:11
bogner added a commit to bogner/llvm-project that referenced this pull request May 8, 2024
After 11a6799 "[clang][CodeGen] Omit pre-opt link when post-opt
is link requested (llvm#85672)" I'm seeing a new warning:

> BackendConsumer.h:37:22: error: private field 'FileMgr' is not used
> [-Werror,-Wunused-private-field]

Remove the field since it's no longer used.
bogner added a commit that referenced this pull request May 8, 2024
After 11a6799 "[clang][CodeGen] Omit pre-opt link when post-opt is
link requested (#85672)" I'm seeing a new warning:

> BackendConsumer.h:37:22: error: private field 'FileMgr' is not used
[-Werror,-Wunused-private-field]

Remove the field since it's no longer used.
@lamb-j
Contributor Author

lamb-j commented May 8, 2024

#91495

Follow-up PR to fix the warning from the now-unused private variable (FileMgr) in the BackendConsumer class
