
Conversation

aganea
Member

@aganea aganea commented Sep 2, 2025

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on an lld-link.exe invocation. I was trying to understand what the long delay (not multithreaded) was before the actual ThinLTO multithreaded opt/codegen -- it actually was the "Thin Link" (the analysis phase on the summary).

After PR, linking clang.exe:
[Screenshot: Capture d’écran 2025-09-02 082021]

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of the use of Unity files and the sheer amount of input files (we're providing over 60 GB of .OBJs/.LIBs). Exploring all this a bit, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo):
[Screenshot: Capture d’écran 2025-09-02 102048]

[Screenshot: Capture d’écran 2025-09-02 102227]

Disabling the debug info verifier by adding /mllvm:-disable-auto-upgrade-debug-info on the command-line brings down ThinLTO link times from 10 min 7 sec to 7 min 13 sec, which is quite significant:
[Screenshot: Capture d’écran 2025-09-02 103758]

However, what now becomes dominant is parsing the metadata from the .OBJ files (that is, MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulative time across all threads for metadata parsing is ~2 h 6 sec, whereas the cumulative "opt" time for all units is 56 min and "codegen" is 1 h 41 min.


As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread, which, if my understanding is correct, parses each of them more than once. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?
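For illustration only, here is a minimal sketch of the once-per-module caching idea. All names (SharedModuleCache, ParsedModule, getOrParse) are hypothetical and not part of this patch; a real implementation would also have to deal with llvm::Module objects being tied to an LLVMContext (the in-process backends each create their own), so parsed IR cannot simply be shared across threads as-is.

// Sketch of a process-wide cache so each source module is parsed once and
// reused by later ThinLTO backend threads, instead of being re-parsed by
// every thread that imports from it. Illustrative only.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct ParsedModule; // stand-in for whatever per-module data would be retained

class SharedModuleCache {
  std::mutex Mu;
  std::map<std::string, std::shared_ptr<ParsedModule>> Cache;

public:
  // The first requester parses the module; later requesters reuse the result.
  template <typename ParseFn>
  std::shared_ptr<ParsedModule> getOrParse(const std::string &Path,
                                           ParseFn Parse) {
    std::lock_guard<std::mutex> Lock(Mu);
    std::shared_ptr<ParsedModule> &Slot = Cache[Path];
    if (!Slot)
      Slot = Parse(Path); // only the first thread pays the parsing cost
    // Note: holding the lock while parsing serializes the parses; a real
    // version would want per-entry synchronization instead.
    return Slot;
  }
};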

Collaborator

@zmodem zmodem left a comment

Seems okay to me. I'm assuming that the overhead of these is basically zero when not running with --time-trace?

bool hasBrokenDebugInfo() const { return BrokenDebugInfo; }

bool verify(const Function &F) {
llvm::TimeTraceScope timeScope("Verifier");
Collaborator

Maybe this is enough; do you really need the "Dominator Tree Builder" and "Verifier visit" timers as well?

Member Author

Removed!

// FIXME: We strip const here because the inst visitor strips const.
visit(const_cast<Function &>(F));
{
llvm::TimeTraceScope domScope("Verifier visit");
Collaborator

scope name copy-paste-o?

Member Author

Changed, thanks for noting. There was another one below.

@aganea
Member Author

aganea commented Sep 3, 2025

Seems okay to me. I'm assuming that the overhead of these is basically zero when not running with --time-trace?

Yes, if --time-trace is disabled, it just adds an access to a thread-local (TLS) global:

if (TimeTraceProfilerInstance != nullptr)
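For reference, here is a minimal usage sketch of the llvm::TimeTraceScope pattern this patch relies on (doWork and instrumentedPhase are hypothetical placeholders). When the profiler was never initialized by --time-trace, constructing and destroying the scope reduces to that null check on the thread-local instance:

// Illustrative example of the instrumentation pattern used in this patch.
#include "llvm/Support/TimeProfiler.h"

void doWork(); // placeholder for the code being timed

void instrumentedPhase() {
  // Records a "My phase" event when the time-trace profiler is active;
  // otherwise the constructor/destructor only test the thread-local
  // profiler instance pointer and return.
  llvm::TimeTraceScope timeScope("My phase");
  doWork();
}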


github-actions bot commented Sep 3, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Collaborator

@zmodem zmodem left a comment

I think there are some importScopes which may not actually be about importing, but overall lgtm.

@llvmbot llvmbot added lld debuginfo lld:COFF platform:windows LTO Link time optimization (regular/full LTO or ThinLTO) llvm:ir llvm:transforms labels Sep 3, 2025
@llvmbot
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-lto
@llvm/pr-subscribers-platform-windows
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-lld

Author: Alexandre Ganea (aganea)

Changes

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on an lld-link.exe invocation. I was trying to understand what the long delay (not multithreaded) was before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

After PR, linking clang.exe:
<img width="3839" height="2026" alt="Capture d’écran 2025-09-02 082021" src="https://github.com/user-attachments/assets/bf0c85ba-2f85-4bbf-a5c1-800039b56910" />

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of the use of Unity files and the sheer amount of input files (we're providing over 60 GB of .OBJs/.LIBs). Exploring all this a bit, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo):
<img width="1940" height="1008" alt="Capture d’écran 2025-09-02 102048" src="https://github.com/user-attachments/assets/60b28630-7995-45ce-9e8c-13f3cb5312e0" />

<img width="1919" height="811" alt="Capture d’écran 2025-09-02 102227" src="https://github.com/user-attachments/assets/6db88dfe-a708-4f3a-b708-c3a16e26c2ef" />

Disabling the debug info verifier by adding /mllvm:-disable-auto-upgrade-debug-info on the command-line brings down ThinLTO link times from 10 min 7 sec to 7 min 13 sec, which is quite significant:
<img width="1930" height="1007" alt="Capture d’écran 2025-09-02 103758" src="https://github.com/user-attachments/assets/c0091f24-460d-49ae-944b-78c478f7d284" />

However, what now becomes dominant is parsing the metadata from the .OBJ files (that is, MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulative time across all threads for metadata parsing is ~2 h 6 sec, whereas the cumulative "opt" time for all units is 56 min and "codegen" is 1 h 41 min.


As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread, which, if my understanding is correct, parses each of them more than once. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?


Patch is 30.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156471.diff

11 Files Affected:

  • (modified) lld/COFF/SymbolTable.cpp (+6-2)
  • (modified) llvm/lib/Bitcode/Reader/MetadataLoader.cpp (+2)
  • (modified) llvm/lib/IR/AutoUpgrade.cpp (+2)
  • (modified) llvm/lib/IR/DebugInfo.cpp (+2)
  • (modified) llvm/lib/IR/Module.cpp (+2)
  • (modified) llvm/lib/IR/Verifier.cpp (+3)
  • (modified) llvm/lib/LTO/LTO.cpp (+24-21)
  • (modified) llvm/lib/LTO/LTOBackend.cpp (+13-4)
  • (modified) llvm/lib/Transforms/IPO/FunctionImport.cpp (+106-89)
  • (modified) llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp (+2)
  • (modified) llvm/lib/Transforms/Utils/FunctionImportUtils.cpp (+3)
diff --git a/lld/COFF/SymbolTable.cpp b/lld/COFF/SymbolTable.cpp
index 0a88807c00dd5..335f3d65a078f 100644
--- a/lld/COFF/SymbolTable.cpp
+++ b/lld/COFF/SymbolTable.cpp
@@ -1440,8 +1440,12 @@ void SymbolTable::compileBitcodeFiles() {
   llvm::TimeTraceScope timeScope("Compile bitcode");
   ScopedTimer t(ctx.ltoTimer);
   lto.reset(new BitcodeCompiler(ctx));
-  for (BitcodeFile *f : bitcodeFileInstances)
-    lto->add(*f);
+  {
+    llvm::TimeTraceScope addScope("Add bitcode file instances");
+    for (BitcodeFile *f : bitcodeFileInstances)
+      lto->add(*f);
+  }
+  llvm::TimeTraceScope compileScope("LTO compile");
   for (InputFile *newObj : lto->compile()) {
     ObjFile *obj = cast<ObjFile>(newObj);
     obj->parse();
diff --git a/llvm/lib/Bitcode/Reader/MetadataLoader.cpp b/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
index 738e47b8b16c4..a5cedadd30981 100644
--- a/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
+++ b/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
@@ -43,6 +43,7 @@
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/TimeProfiler.h"
 
 #include <algorithm>
 #include <cassert>
@@ -1052,6 +1053,7 @@ void MetadataLoader::MetadataLoaderImpl::callMDTypeCallback(Metadata **Val,
 /// Parse a METADATA_BLOCK. If ModuleLevel is true then we are parsing
 /// module level metadata.
 Error MetadataLoader::MetadataLoaderImpl::parseMetadata(bool ModuleLevel) {
+  llvm::TimeTraceScope timeScope("Parse metadata");
   if (!ModuleLevel && MetadataList.hasFwdRefs())
     return error("Invalid metadata: fwd refs into function blocks");
 
diff --git a/llvm/lib/IR/AutoUpgrade.cpp b/llvm/lib/IR/AutoUpgrade.cpp
index 7ea9c6dff13b8..8034b3ffe273e 100644
--- a/llvm/lib/IR/AutoUpgrade.cpp
+++ b/llvm/lib/IR/AutoUpgrade.cpp
@@ -48,6 +48,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/NVPTXAddrSpace.h"
 #include "llvm/Support/Regex.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/TargetParser/Triple.h"
 #include <cstdint>
 #include <cstring>
@@ -5256,6 +5257,7 @@ bool llvm::UpgradeDebugInfo(Module &M) {
   if (DisableAutoUpgradeDebugInfo)
     return false;
 
+  llvm::TimeTraceScope timeScope("Upgrade debug info");
   // We need to get metadata before the module is verified (i.e., getModuleFlag
   // makes assumptions that we haven't verified yet). Carefully extract the flag
   // from the metadata.
diff --git a/llvm/lib/IR/DebugInfo.cpp b/llvm/lib/IR/DebugInfo.cpp
index b468d929b0280..166521a276643 100644
--- a/llvm/lib/IR/DebugInfo.cpp
+++ b/llvm/lib/IR/DebugInfo.cpp
@@ -36,6 +36,7 @@
 #include "llvm/IR/Module.h"
 #include "llvm/IR/PassManager.h"
 #include "llvm/Support/Casting.h"
+#include "llvm/Support/TimeProfiler.h"
 #include <algorithm>
 #include <cassert>
 #include <optional>
@@ -563,6 +564,7 @@ bool llvm::stripDebugInfo(Function &F) {
 }
 
 bool llvm::StripDebugInfo(Module &M) {
+  llvm::TimeTraceScope timeScope("Strip debug info");
   bool Changed = false;
 
   for (NamedMDNode &NMD : llvm::make_early_inc_range(M.named_metadata())) {
diff --git a/llvm/lib/IR/Module.cpp b/llvm/lib/IR/Module.cpp
index 70d364176062f..30b5e48652b28 100644
--- a/llvm/lib/IR/Module.cpp
+++ b/llvm/lib/IR/Module.cpp
@@ -44,6 +44,7 @@
 #include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/Path.h"
 #include "llvm/Support/RandomNumberGenerator.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/VersionTuple.h"
 #include <cassert>
 #include <cstdint>
@@ -478,6 +479,7 @@ Error Module::materializeAll() {
 }
 
 Error Module::materializeMetadata() {
+  llvm::TimeTraceScope timeScope("Materialize metadata");
   if (!Materializer)
     return Error::success();
   return Materializer->materializeMetadata();
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index da05ff166122f..06ddb4574c860 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -119,6 +119,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ModRef.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/raw_ostream.h"
 #include <algorithm>
 #include <cassert>
@@ -399,6 +400,7 @@ class Verifier : public InstVisitor<Verifier>, VerifierSupport {
   bool hasBrokenDebugInfo() const { return BrokenDebugInfo; }
 
   bool verify(const Function &F) {
+    llvm::TimeTraceScope timeScope("Verifier");
     assert(F.getParent() == &M &&
            "An instance of this class only works with a specific module!");
 
@@ -2832,6 +2834,7 @@ static Instruction *getSuccPad(Instruction *Terminator) {
 }
 
 void Verifier::verifySiblingFuncletUnwinds() {
+  llvm::TimeTraceScope timeScope("Verifier verify sibling funclet unwinds");
   SmallPtrSet<Instruction *, 8> Visited;
   SmallPtrSet<Instruction *, 8> Active;
   for (const auto &Pair : SiblingFuncletInfo) {
diff --git a/llvm/lib/LTO/LTO.cpp b/llvm/lib/LTO/LTO.cpp
index 35d24c17bbd93..89192b39e811f 100644
--- a/llvm/lib/LTO/LTO.cpp
+++ b/llvm/lib/LTO/LTO.cpp
@@ -631,6 +631,7 @@ LTO::~LTO() = default;
 void LTO::addModuleToGlobalRes(ArrayRef<InputFile::Symbol> Syms,
                                ArrayRef<SymbolResolution> Res,
                                unsigned Partition, bool InSummary) {
+  llvm::TimeTraceScope timeScope("LTO add module to global resolution");
   auto *ResI = Res.begin();
   auto *ResE = Res.end();
   (void)ResE;
@@ -731,6 +732,7 @@ static void writeToResolutionFile(raw_ostream &OS, InputFile *Input,
 
 Error LTO::add(std::unique_ptr<InputFile> Input,
                ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add input", Input->getName());
   assert(!CalledGetMaxTasks);
 
   if (Conf.ResolutionFile)
@@ -756,6 +758,7 @@ Error LTO::add(std::unique_ptr<InputFile> Input,
 Expected<ArrayRef<SymbolResolution>>
 LTO::addModule(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
                unsigned ModI, ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add module", Input.getName());
   Expected<BitcodeLTOInfo> LTOInfo = Input.Mods[ModI].getLTOInfo();
   if (!LTOInfo)
     return LTOInfo.takeError();
@@ -850,6 +853,7 @@ Expected<
 LTO::addRegularLTO(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
                    BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
                    ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add regular LTO");
   RegularLTOState::AddedModule Mod;
   Expected<std::unique_ptr<Module>> MOrErr =
       BM.getLazyModule(RegularLTO.Ctx, /*ShouldLazyLoadMetadata*/ true,
@@ -1024,6 +1028,7 @@ LTO::addRegularLTO(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
 
 Error LTO::linkRegularLTO(RegularLTOState::AddedModule Mod,
                           bool LivenessFromIndex) {
+  llvm::TimeTraceScope timeScope("LTO link regular LTO");
   std::vector<GlobalValue *> Keep;
   for (GlobalValue *GV : Mod.Keep) {
     if (LivenessFromIndex && !ThinLTO.CombinedIndex.isGUIDLive(GV->getGUID())) {
@@ -1063,6 +1068,7 @@ Error LTO::linkRegularLTO(RegularLTOState::AddedModule Mod,
 Expected<ArrayRef<SymbolResolution>>
 LTO::addThinLTO(BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
                 ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add thin LTO");
   ArrayRef<SymbolResolution> ResTmp = Res;
   for (const InputFile::Symbol &Sym : Syms) {
     assert(!ResTmp.empty());
@@ -1252,6 +1258,7 @@ Error LTO::run(AddStreamFn AddStream, FileCache Cache) {
 
 void lto::updateMemProfAttributes(Module &Mod,
                                   const ModuleSummaryIndex &Index) {
+  llvm::TimeTraceScope timeScope("LTO update memprof attributes");
   if (Index.withSupportsHotColdNew())
     return;
 
@@ -1282,6 +1289,7 @@ void lto::updateMemProfAttributes(Module &Mod,
 }
 
 Error LTO::runRegularLTO(AddStreamFn AddStream) {
+  llvm::TimeTraceScope timeScope("Run regular LTO");
   // Setup optimization remarks.
   auto DiagFileOrErr = lto::setupLLVMOptimizationRemarks(
       RegularLTO.CombinedModule->getContext(), Conf.RemarksFilename,
@@ -1294,10 +1302,12 @@ Error LTO::runRegularLTO(AddStreamFn AddStream) {
 
   // Finalize linking of regular LTO modules containing summaries now that
   // we have computed liveness information.
-  for (auto &M : RegularLTO.ModsWithSummaries)
-    if (Error Err = linkRegularLTO(std::move(M),
-                                   /*LivenessFromIndex=*/true))
-      return Err;
+  {
+    llvm::TimeTraceScope timeScope("Link regular LTO");
+    for (auto &M : RegularLTO.ModsWithSummaries)
+      if (Error Err = linkRegularLTO(std::move(M), /*LivenessFromIndex=*/true))
+        return Err;
+  }
 
   // Ensure we don't have inconsistently split LTO units with type tests.
   // FIXME: this checks both LTO and ThinLTO. It happens to work as we take
@@ -1526,6 +1536,9 @@ class InProcessThinBackend : public CGThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (in-process)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn AddStream) {
       LTOLLVMContext BackendContext(Conf);
       Expected<std::unique_ptr<Module>> MOrErr = BM.parseModule(BackendContext);
@@ -1536,9 +1549,6 @@ class InProcessThinBackend : public CGThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          Conf.CodeGenOnly);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
-
     if (ShouldEmitIndexFiles) {
       if (auto E = emitFiles(ImportList, ModuleID, ModuleID.str()))
         return E;
@@ -1639,6 +1649,9 @@ class FirstRoundThinBackend : public InProcessThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) override {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (first round)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn CGAddStream,
                               AddStreamFn IRAddStream) {
       LTOLLVMContext BackendContext(Conf);
@@ -1650,8 +1663,6 @@ class FirstRoundThinBackend : public InProcessThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          Conf.CodeGenOnly, IRAddStream);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
     // Like InProcessThinBackend, we produce index files as needed for
     // FirstRoundThinBackend. However, these files are not generated for
     // SecondRoundThinBackend.
@@ -1735,6 +1746,9 @@ class SecondRoundThinBackend : public InProcessThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) override {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (second round)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn AddStream) {
       LTOLLVMContext BackendContext(Conf);
       std::unique_ptr<Module> LoadedModule =
@@ -1744,8 +1758,6 @@ class SecondRoundThinBackend : public InProcessThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          /*CodeGenOnly=*/true);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
     if (!Cache.isValid() || !CombinedIndex.modulePaths().count(ModuleID) ||
         all_of(CombinedIndex.getModuleHash(ModuleID),
                [](uint32_t V) { return V == 0; }))
@@ -1915,13 +1927,9 @@ ThinBackend lto::createWriteIndexesThinBackend(
 
 Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
                       const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols) {
+  llvm::TimeTraceScope timeScope("Run ThinLTO");
   LLVM_DEBUG(dbgs() << "Running ThinLTO\n");
   ThinLTO.CombinedIndex.releaseTemporaryMemory();
-  timeTraceProfilerBegin("ThinLink", StringRef(""));
-  auto TimeTraceScopeExit = llvm::make_scope_exit([]() {
-    if (llvm::timeTraceProfilerEnabled())
-      llvm::timeTraceProfilerEnd();
-  });
   if (ThinLTO.ModuleMap.empty())
     return Error::success();
 
@@ -2069,11 +2077,6 @@ Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
 
   generateParamAccessSummary(ThinLTO.CombinedIndex);
 
-  if (llvm::timeTraceProfilerEnabled())
-    llvm::timeTraceProfilerEnd();
-
-  TimeTraceScopeExit.release();
-
   auto &ModuleMap =
       ThinLTO.ModulesToCompile ? *ThinLTO.ModulesToCompile : ThinLTO.ModuleMap;
 
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 5e8cd12fe040b..ce42fc526beac 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -366,6 +366,7 @@ bool lto::opt(const Config &Conf, TargetMachine *TM, unsigned Task, Module &Mod,
               bool IsThinLTO, ModuleSummaryIndex *ExportSummary,
               const ModuleSummaryIndex *ImportSummary,
               const std::vector<uint8_t> &CmdArgs) {
+  llvm::TimeTraceScope timeScope("opt");
   if (EmbedBitcode == LTOBitcodeEmbedding::EmbedPostMergePreOptimized) {
     // FIXME: the motivation for capturing post-merge bitcode and command line
     // is replicating the compilation environment from bitcode, without needing
@@ -399,6 +400,7 @@ bool lto::opt(const Config &Conf, TargetMachine *TM, unsigned Task, Module &Mod,
 static void codegen(const Config &Conf, TargetMachine *TM,
                     AddStreamFn AddStream, unsigned Task, Module &Mod,
                     const ModuleSummaryIndex &CombinedIndex) {
+  llvm::TimeTraceScope timeScope("codegen");
   if (Conf.PreCodeGenModuleHook && !Conf.PreCodeGenModuleHook(Task, Mod))
     return;
 
@@ -552,6 +554,7 @@ Error lto::finalizeOptimizationRemarks(
 Error lto::backend(const Config &C, AddStreamFn AddStream,
                    unsigned ParallelCodeGenParallelismLevel, Module &Mod,
                    ModuleSummaryIndex &CombinedIndex) {
+  llvm::TimeTraceScope timeScope("LTO backend");
   Expected<const Target *> TOrErr = initAndLookupTarget(C, Mod);
   if (!TOrErr)
     return TOrErr.takeError();
@@ -577,6 +580,7 @@ Error lto::backend(const Config &C, AddStreamFn AddStream,
 
 static void dropDeadSymbols(Module &Mod, const GVSummaryMapTy &DefinedGlobals,
                             const ModuleSummaryIndex &Index) {
+  llvm::TimeTraceScope timeScope("Drop dead symbols");
   std::vector<GlobalValue*> DeadGVs;
   for (auto &GV : Mod.global_values())
     if (GlobalValueSummary *GVS = DefinedGlobals.lookup(GV.getGUID()))
@@ -603,6 +607,7 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
                        MapVector<StringRef, BitcodeModule> *ModuleMap,
                        bool CodeGenOnly, AddStreamFn IRAddStream,
                        const std::vector<uint8_t> &CmdArgs) {
+  llvm::TimeTraceScope timeScope("Thin backend", Mod.getModuleIdentifier());
   Expected<const Target *> TOrErr = initAndLookupTarget(Conf, Mod);
   if (!TOrErr)
     return TOrErr.takeError();
@@ -679,6 +684,7 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
     return finalizeOptimizationRemarks(std::move(DiagnosticOutputFile));
 
   auto ModuleLoader = [&](StringRef Identifier) {
+    llvm::TimeTraceScope moduleLoaderScope("Module loader", Identifier);
     assert(Mod.getContext().isODRUniquingDebugTypes() &&
            "ODR Type uniquing should be enabled on the context");
     if (ModuleMap) {
@@ -712,10 +718,13 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
     return MOrErr;
   };
 
-  FunctionImporter Importer(CombinedIndex, ModuleLoader,
-                            ClearDSOLocalOnDeclarations);
-  if (Error Err = Importer.importFunctions(Mod, ImportList).takeError())
-    return Err;
+  {
+    llvm::TimeTraceScope importScope("Import functions");
+    FunctionImporter Importer(CombinedIndex, ModuleLoader,
+                              ClearDSOLocalOnDeclarations);
+    if (Error Err = Importer.importFunctions(Mod, ImportList).takeError())
+      return Err;
+  }
 
   // Do this after any importing so that imported code is updated.
   updateMemProfAttributes(Mod, CombinedIndex);
diff --git a/llvm/lib/Transforms/IPO/FunctionImport.cpp b/llvm/lib/Transforms/IPO/FunctionImport.cpp
index 7bcb20de46ff6..96b274e2f45a9 100644
--- a/llvm/lib/Transforms/IPO/FunctionImport.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionImport.cpp
@@ -40,6 +40,7 @@
 #include "llvm/Support/JSON.h"
 #include "llvm/Support/Path.h"
 #include "llvm/Support/SourceMgr.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/IPO/Internalize.h"
 #include "llvm/Transforms/Utils/Cloning.h"
@@ -1550,6 +1551,7 @@ void llvm::computeDeadSymbolsWithConstProp(
     const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols,
     function_ref<PrevailingType(GlobalValue::GUID)> isPrevailing,
     bool ImportEnabled) {
+  llvm::TimeTraceScope timeScope("Dead symbols");
   computeDeadSymbolsAndUpdateIndirectCalls(Index, GUIDPreservedSymbols,
                                            isPrevailing);
   if (ImportEnabled)
@@ -1664,6 +1666,7 @@ bool llvm::convertToDeclaration(GlobalValue &GV) {
 void llvm::thinLTOFinalizeInModule(Module &TheModule,
                                    const GVSummaryMapTy &DefinedGlobals,
                                    bool PropagateAttrs) {
+  llvm::TimeTraceScope timeScope("ThinLTO finalize in module");
   DenseSet<Comdat *> NonPrevailingComdats;
   auto FinalizeInModule = [&](GlobalValue &GV, bool Propagate = false) {
     // See if the global summary analysis computed a new resolved linkage.
@@ -1791,6 +1794,7 @@ void llvm::thinLTOFinalizeInModule(Module &TheModule,
 /// Run internalization on \p TheModule based on symmary analysis.
 void llvm::thinLTOInternalizeModule(Module &TheModule,
                                     const GVSummaryMapTy &DefinedGlobals) {
+  llvm::TimeTraceScope timeScope("ThinLTO internalize module");
   // Declare a callback for the internalize pass that will ask for every
   // candidate GlobalValue if it can be internalized or not.
   auto MustPreserveGV = [&](const GlobalValue &GV) -> bool {
@@ -1885,6 +1889,7 @@ Expected<bool> FunctionImporter::importFunctions(
 
   // Do the actual import of functions now, one Module at a time
   for (const auto &ModName : ImportList.getSourceModules()) {
+    llvm::TimeTraceScope timeScope("Import", ModName);
     // Get the module for the import
     Expected<std::unique_ptr<Module>> SrcModuleOrErr = ModuleLoader(ModName);
     if (!SrcModuleOrErr)
@@ -1900,102 +1905,114 @@ Expected<bool> FunctionImporter::importFunctions(
 
     // Find the globals to import
     SetVector<GlobalValue *> GlobalsToImport;
-    for (Function &F : *SrcModule) {
-      if (!F.hasName())
-        continue;
-      auto GUID = F.getGUID();
-      auto MaybeImportType = ImportList.getImportType(ModName, GUID);
-      bool ImportDefinition = MaybeImportType == GlobalValueSummary::Definition;
-
-      LLVM_DEBUG(dbgs() << (MaybeImportType ? "Is" : "Not")
-                        << " importing function"
-                        << (ImportDefinition
-                                ? " definition "
-                                : (MaybeImportType ? " declaration " : " "))
-                        << GUID << " " << F.getName() << " from "
-                        << SrcModule->getSourceFileName() << "\n");
-      if (ImportDefinition) {
-        if (Error Err = F.materialize())
-          return std::move(Err);
-        // MemProf should match function's definition and summary,
-        // 'thinlto_src_module' is needed.
-        if (EnableImportMetadata || EnableMemProfContextDisambiguation) {
-          // Add 'thinlto_src_module' and 'thinlto_src_file' metadata for
-          // statistics and debugging.
-          F.setMetadata(
-              "thinlto_src_module",
-              MDNode::get(DestModule.getContext(),
-                          {MDString::get(Dest...
[truncated]

@llvmbot
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-debuginfo


@teresajohnson
Contributor

Thanks for the PR, will take a look shortly. But some comments/questions on the description below.

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on a lld-link.exe invocation. I was trying to understand what was the long delay (not multithreaded) before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

Can you clarify what you mean by "full LTO" on the index? I assume you mean the thin link? "full LTO" typically refers to IR based LTO. But yes, the thin link which operates on the index is indeed a serial phase, and can be non-trivial for large applications.

After PR, linking clang.exe: Capture d’écran 2025-09-02 082021

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of using Unity files, and the sheer amount of input files (we're providing over 60GB of .OBJs/.LIBs). Exploring a bit all this, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo): Capture d’écran 2025-09-02 102048

Capture d’écran 2025-09-02 102227 Disabling the debug info verifier by adding `/mllvm:-disable-auto-upgrade-debug-info` on the command-line brings down ThinLTO link times from **10 min 7 sec** to **7 min 13 sec**, which is quite significant: Capture d’écran 2025-09-02 103758

We in fact have long used -disable-auto-upgrade-debug-info for our distributed ThinLTO backend compiles to avoid this overhead, but our build system ensures that the IR is all built from a consistent version of clang.

However now what becomes dominant is parsing the metadata from the .OBJ files (that is MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulated time on all threads for this (metadata parsing) is ~2 h 6 sec, in contrast to the cumulated "opt" for all units is 56 min, and "codegen" is 1 h 41 min.

As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread. Which parses each of them more than once, if my understanding is correct. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?

"regular LTO" typically means IR based LTO. Here again I assume you mean the thin link on the index?

Avoiding repeated metadata loading would save time, but presumably at the cost of (much?) higher peak memory. This isn't something we have looked at as we used distributed ThinLTO where the backends are completely separate processes.

@aganea
Member Author

aganea commented Sep 3, 2025

Thanks for the PR, will take a look shortly.

Thank you in advance!

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on a lld-link.exe invocation. I was trying to understand what was the long delay (not multithreaded) before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

Can you clarify what you mean by "full LTO" on the index? I assume you mean the thin link? "full LTO" typically refers to IR based LTO. But yes, the thin link which operates on the index is indeed a serial phase, and can be non-trivial for large applications.

Yes, the ThinLink phase.

We in fact have long used -disable-auto-upgrade-debug-info for our distributed ThinLTO backend compiles to avoid this overhead, but our build system ensures that the IR is all built from a consistent version of clang.

Great to know that you're using it!

"regular LTO" typically means IR based LTO. Here again I assume you mean the thin link on the index?

Yes.

Avoiding repeated metadata loading would save time, but presumably at the cost of (much?) higher peak memory. This isn't something we have looked at as we used distributed ThinLTO where the backends are completely separate processes.

We mostly do the ThinLTO link locally, since our users are scattered across the globe and not all in one physical location. Sending the artifacts (bitcode .OBJs) to a cloud can take too long over varying ISP connections. If we had shells in the cloud for all our users, distributed ThinLTO would work, but that's not how we work in the game industry.

I'll take a look at how much extra memory the above metadata loading would take, if done upfront, and perhaps we could even gate it behind a command-line flag. However, if that saves 1-2 min on the whole link, that is quite significant for our iteration times.
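As a rough illustration of such an opt-in gate, a flag could be declared with LLVM's cl::opt machinery along these lines; the flag name and description are hypothetical and not an existing option:

// Hypothetical flag sketch; not part of this patch or of LLVM today.
#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<bool> PreloadImportedMetadata(
    "preload-imported-metadata", llvm::cl::init(false), llvm::cl::Hidden,
    llvm::cl::desc("Parse and retain the metadata of imported modules up "
                   "front, trading peak memory for ThinLTO backend time"));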

{MDString::get(DestModule.getContext(),
SrcModule->getSourceFileName())}));
{
llvm::TimeTraceScope findGlobalsScope("Find globals");
Contributor

This is finding functions to import. So, to be consistent with the other two, which are named Globals and Aliases, this should be Functions?

Member Author

Done.

for (BitcodeFile *f : bitcodeFileInstances)
lto->add(*f);
}
llvm::TimeTraceScope compileScope("LTO compile");
Contributor

There's already a "Compile bitcode" TimeTraceScope just above in the same scope. Is this needed too?

Member Author

I moved this scope to the beginning of BitcodeCompiler::compile() and removed the one at the top of this function.

llvm::TimeTraceScope timeScope("Run ThinLTO");
LLVM_DEBUG(dbgs() << "Running ThinLTO\n");
ThinLTO.CombinedIndex.releaseTemporaryMemory();
timeTraceProfilerBegin("ThinLink", StringRef(""));
Contributor

With this removed, we've lost the timer of just the index-based thin link. Why remove it?

Member Author

@aganea aganea Sep 3, 2025

Reverted this change, the ThinLink tag is now back.

const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols,
function_ref<PrevailingType(GlobalValue::GUID)> isPrevailing,
bool ImportEnabled) {
llvm::TimeTraceScope timeScope("Dead symbols");
Contributor

Maybe "Drop dead symbols and propagate attributes"? It might be good to have timers on each of the below functions as well, to distinguish their times.

Member Author

Done.

SrcModule->getSourceFileName())}));
}
GlobalsToImport.insert(Fn);
}
Contributor

Are there timers on the actual IR moving below?

Member Author

All the functions below have tags in them.

@aganea
Member Author

aganea commented Sep 3, 2025

Updated as suggested. It looks like this now:

[Screenshot]

Contributor

@teresajohnson teresajohnson left a comment

lgtm but I think it would be good to change occurrences of "full LTO" and "regular LTO" in the description to something like the "LTO thin link" to avoid confusion, as those terms typically refer specifically to IR based LTO.

@aganea
Member Author

aganea commented Sep 5, 2025

lgtm but I think it would be good to change occurrences of "full LTO" and "regular LTO" in the description to something like the "LTO thin link" to avoid confusion, as those terms typically refer specifically to IR based LTO.

Are we talking about this tag?
[Screenshot: Capture d’écran 2025-09-05 121607]

This codepath (LTO::runRegularLTO()) is also taken when Full LTO is used, right? (as in, when -flto is used) I can conditionally change it to "LTO thin link (summary)" or "LTO thin link (analysis)" perhaps, when ThinLTO is in effect? I find it a bit confusing, since we already have two other tags that say "ThinLink", one in LTO::runThinLTO() and one in ThinLTOCodeGenerator::run(). What would be a good naming for these three tags?

@teresajohnson
Contributor

I wasn't talking about the tags in the profile, I just meant in the PR summary. I.e. you have "it actually was the full LTO on the index" and "while the "regular LTO" index phase is being executed". From my understanding both of these are referring to the index-based thin link. I was suggesting changing those to specify that and not use the terms "full LTO" and "regular LTO" which are terms that are used to refer to IR based LTO links (merging all the IR, not using an index).

@aganea
Member Author

aganea commented Sep 5, 2025

I wasn't talking about the tags in the profile, I just meant in the PR summary. I.e. you have "it actually was the full LTO on the index" and "while the "regular LTO" index phase is being executed". From my understanding both of these are referring to the index-based thin link. I was suggesting changing those to specify that and not use the terms "full LTO" and "regular LTO" which are terms that are used to refer to IR based LTO links (merging all the IR, not using an index).

Sounds good, will do, thank you!

@aganea aganea merged commit 5cda242 into llvm:main Sep 5, 2025
9 checks passed