
Conversation

aganea
Member

@aganea aganea commented Sep 2, 2025

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on an lld-link.exe invocation. I was trying to understand what the long delay (not multithreaded) was before the actual ThinLTO multithreaded opt/codegen -- it actually was the "Thin Link" (the analysis phase on the summary).

After PR, linking clang.exe:
[Screenshot: Capture d’écran 2025-09-02 082021]

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of the use of Unity files and the sheer amount of input files (we're providing over 60 GB of .OBJs/.LIBs). Exploring all this a bit, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo):
[Screenshot: Capture d’écran 2025-09-02 102048]

[Screenshot: Capture d’écran 2025-09-02 102227]

Disabling the debug info verifier by adding /mllvm:-disable-auto-upgrade-debug-info on the command-line brings down ThinLTO link times from 10 min 7 sec to 7 min 13 sec, which is quite significant:
[Screenshot: Capture d’écran 2025-09-02 103758]

However, what now becomes dominant is parsing the metadata from the .OBJ files (that is, MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulative time across all threads for metadata parsing is ~2 h 6 sec, whereas the cumulative "opt" time for all units is 56 min and "codegen" is 1 h 41 min.


As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread, which, if my understanding is correct, parses each of them more than once. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?
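For illustration only, here is a minimal sketch of the once-per-module caching idea. All names (SharedModuleCache, ParsedModule, getOrParse) are hypothetical and not part of this patch; a real implementation would also have to deal with llvm::Module objects being tied to an LLVMContext (the in-process backends each create their own), so parsed IR cannot simply be shared across threads as-is.

// Sketch of a process-wide cache so each source module is parsed once and
// reused by later ThinLTO backend threads, instead of being re-parsed by
// every thread that imports from it. Illustrative only.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct ParsedModule; // stand-in for whatever per-module data would be retained

class SharedModuleCache {
  std::mutex Mu;
  std::map<std::string, std::shared_ptr<ParsedModule>> Cache;

public:
  // The first requester parses the module; later requesters reuse the result.
  template <typename ParseFn>
  std::shared_ptr<ParsedModule> getOrParse(const std::string &Path,
                                           ParseFn Parse) {
    std::lock_guard<std::mutex> Lock(Mu);
    std::shared_ptr<ParsedModule> &Slot = Cache[Path];
    if (!Slot)
      Slot = Parse(Path); // only the first thread pays the parsing cost
    // Note: holding the lock while parsing serializes the parses; a real
    // version would want per-entry synchronization instead.
    return Slot;
  }
};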

Collaborator

@zmodem zmodem left a comment

Seems okay to me. I'm assuming that the overhead of these is basically zero when not running with --time-trace?

bool hasBrokenDebugInfo() const { return BrokenDebugInfo; }

bool verify(const Function &F) {
llvm::TimeTraceScope timeScope("Verifier");
Collaborator

Maybe this is enough; do you really need the "Dominator Tree Builder" and "Verifier visit" timers as well?

Member Author

Removed!

// FIXME: We strip const here because the inst visitor strips const.
visit(const_cast<Function &>(F));
{
llvm::TimeTraceScope domScope("Verifier visit");
Collaborator

scope name copy-paste-o?

Member Author

Changed, thanks for noting. There was another one below.

@aganea
Member Author

aganea commented Sep 3, 2025

Seems okay to me. I'm assuming that the overhead of these is basically zero when not running with --time-trace?

Yes, if --time-trace is disabled, it just adds an access to a thread-local (TLS) global:

if (TimeTraceProfilerInstance != nullptr)
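For reference, here is a minimal usage sketch of the llvm::TimeTraceScope pattern this patch relies on (doWork and instrumentedPhase are hypothetical placeholders). When the profiler was never initialized by --time-trace, constructing and destroying the scope reduces to that null check on the thread-local instance:

// Illustrative example of the instrumentation pattern used in this patch.
#include "llvm/Support/TimeProfiler.h"

void doWork(); // placeholder for the code being timed

void instrumentedPhase() {
  // Records a "My phase" event when the time-trace profiler is active;
  // otherwise the constructor/destructor only test the thread-local
  // profiler instance pointer and return.
  llvm::TimeTraceScope timeScope("My phase");
  doWork();
}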


github-actions bot commented Sep 3, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Collaborator

@zmodem zmodem left a comment

I think there are some importScopes which may not actually be about importing, but overall lgtm.

@llvmbot llvmbot added lld debuginfo lld:COFF platform:windows LTO Link time optimization (regular/full LTO or ThinLTO) llvm:ir llvm:transforms labels Sep 3, 2025
@llvmbot
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-lto
@llvm/pr-subscribers-platform-windows
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-lld

Author: Alexandre Ganea (aganea)

Changes

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on an lld-link.exe invocation. I was trying to understand what the long delay (not multithreaded) was before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

After PR, linking clang.exe:
<img width="3839" height="2026" alt="Capture d’écran 2025-09-02 082021" src="https://github.com/user-attachments/assets/bf0c85ba-2f85-4bbf-a5c1-800039b56910" />

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of the use of Unity files and the sheer amount of input files (we're providing over 60 GB of .OBJs/.LIBs). Exploring all this a bit, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo):
<img width="1940" height="1008" alt="Capture d’écran 2025-09-02 102048" src="https://github.com/user-attachments/assets/60b28630-7995-45ce-9e8c-13f3cb5312e0" />

<img width="1919" height="811" alt="Capture d’écran 2025-09-02 102227" src="https://github.com/user-attachments/assets/6db88dfe-a708-4f3a-b708-c3a16e26c2ef" />

Disabling the debug info verifier by adding /mllvm:-disable-auto-upgrade-debug-info on the command-line brings down ThinLTO link times from 10 min 7 sec to 7 min 13 sec, which is quite significant:
<img width="1930" height="1007" alt="Capture d’écran 2025-09-02 103758" src="https://github.com/user-attachments/assets/c0091f24-460d-49ae-944b-78c478f7d284" />

However, what now becomes dominant is parsing the metadata from the .OBJ files (that is, MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulative time across all threads for metadata parsing is ~2 h 6 sec, whereas the cumulative "opt" time for all units is 56 min and "codegen" is 1 h 41 min.


As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread, which, if my understanding is correct, parses each of them more than once. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?


Patch is 30.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/156471.diff

11 Files Affected:

  • (modified) lld/COFF/SymbolTable.cpp (+6-2)
  • (modified) llvm/lib/Bitcode/Reader/MetadataLoader.cpp (+2)
  • (modified) llvm/lib/IR/AutoUpgrade.cpp (+2)
  • (modified) llvm/lib/IR/DebugInfo.cpp (+2)
  • (modified) llvm/lib/IR/Module.cpp (+2)
  • (modified) llvm/lib/IR/Verifier.cpp (+3)
  • (modified) llvm/lib/LTO/LTO.cpp (+24-21)
  • (modified) llvm/lib/LTO/LTOBackend.cpp (+13-4)
  • (modified) llvm/lib/Transforms/IPO/FunctionImport.cpp (+106-89)
  • (modified) llvm/lib/Transforms/IPO/WholeProgramDevirt.cpp (+2)
  • (modified) llvm/lib/Transforms/Utils/FunctionImportUtils.cpp (+3)
diff --git a/lld/COFF/SymbolTable.cpp b/lld/COFF/SymbolTable.cpp
index 0a88807c00dd5..335f3d65a078f 100644
--- a/lld/COFF/SymbolTable.cpp
+++ b/lld/COFF/SymbolTable.cpp
@@ -1440,8 +1440,12 @@ void SymbolTable::compileBitcodeFiles() {
   llvm::TimeTraceScope timeScope("Compile bitcode");
   ScopedTimer t(ctx.ltoTimer);
   lto.reset(new BitcodeCompiler(ctx));
-  for (BitcodeFile *f : bitcodeFileInstances)
-    lto->add(*f);
+  {
+    llvm::TimeTraceScope addScope("Add bitcode file instances");
+    for (BitcodeFile *f : bitcodeFileInstances)
+      lto->add(*f);
+  }
+  llvm::TimeTraceScope compileScope("LTO compile");
   for (InputFile *newObj : lto->compile()) {
     ObjFile *obj = cast<ObjFile>(newObj);
     obj->parse();
diff --git a/llvm/lib/Bitcode/Reader/MetadataLoader.cpp b/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
index 738e47b8b16c4..a5cedadd30981 100644
--- a/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
+++ b/llvm/lib/Bitcode/Reader/MetadataLoader.cpp
@@ -43,6 +43,7 @@
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/TimeProfiler.h"
 
 #include <algorithm>
 #include <cassert>
@@ -1052,6 +1053,7 @@ void MetadataLoader::MetadataLoaderImpl::callMDTypeCallback(Metadata **Val,
 /// Parse a METADATA_BLOCK. If ModuleLevel is true then we are parsing
 /// module level metadata.
 Error MetadataLoader::MetadataLoaderImpl::parseMetadata(bool ModuleLevel) {
+  llvm::TimeTraceScope timeScope("Parse metadata");
   if (!ModuleLevel && MetadataList.hasFwdRefs())
     return error("Invalid metadata: fwd refs into function blocks");
 
diff --git a/llvm/lib/IR/AutoUpgrade.cpp b/llvm/lib/IR/AutoUpgrade.cpp
index 7ea9c6dff13b8..8034b3ffe273e 100644
--- a/llvm/lib/IR/AutoUpgrade.cpp
+++ b/llvm/lib/IR/AutoUpgrade.cpp
@@ -48,6 +48,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/NVPTXAddrSpace.h"
 #include "llvm/Support/Regex.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/TargetParser/Triple.h"
 #include <cstdint>
 #include <cstring>
@@ -5256,6 +5257,7 @@ bool llvm::UpgradeDebugInfo(Module &M) {
   if (DisableAutoUpgradeDebugInfo)
     return false;
 
+  llvm::TimeTraceScope timeScope("Upgrade debug info");
   // We need to get metadata before the module is verified (i.e., getModuleFlag
   // makes assumptions that we haven't verified yet). Carefully extract the flag
   // from the metadata.
diff --git a/llvm/lib/IR/DebugInfo.cpp b/llvm/lib/IR/DebugInfo.cpp
index b468d929b0280..166521a276643 100644
--- a/llvm/lib/IR/DebugInfo.cpp
+++ b/llvm/lib/IR/DebugInfo.cpp
@@ -36,6 +36,7 @@
 #include "llvm/IR/Module.h"
 #include "llvm/IR/PassManager.h"
 #include "llvm/Support/Casting.h"
+#include "llvm/Support/TimeProfiler.h"
 #include <algorithm>
 #include <cassert>
 #include <optional>
@@ -563,6 +564,7 @@ bool llvm::stripDebugInfo(Function &F) {
 }
 
 bool llvm::StripDebugInfo(Module &M) {
+  llvm::TimeTraceScope timeScope("Strip debug info");
   bool Changed = false;
 
   for (NamedMDNode &NMD : llvm::make_early_inc_range(M.named_metadata())) {
diff --git a/llvm/lib/IR/Module.cpp b/llvm/lib/IR/Module.cpp
index 70d364176062f..30b5e48652b28 100644
--- a/llvm/lib/IR/Module.cpp
+++ b/llvm/lib/IR/Module.cpp
@@ -44,6 +44,7 @@
 #include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/Path.h"
 #include "llvm/Support/RandomNumberGenerator.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/VersionTuple.h"
 #include <cassert>
 #include <cstdint>
@@ -478,6 +479,7 @@ Error Module::materializeAll() {
 }
 
 Error Module::materializeMetadata() {
+  llvm::TimeTraceScope timeScope("Materialize metadata");
   if (!Materializer)
     return Error::success();
   return Materializer->materializeMetadata();
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index da05ff166122f..06ddb4574c860 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -119,6 +119,7 @@
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/ModRef.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/raw_ostream.h"
 #include <algorithm>
 #include <cassert>
@@ -399,6 +400,7 @@ class Verifier : public InstVisitor<Verifier>, VerifierSupport {
   bool hasBrokenDebugInfo() const { return BrokenDebugInfo; }
 
   bool verify(const Function &F) {
+    llvm::TimeTraceScope timeScope("Verifier");
     assert(F.getParent() == &M &&
            "An instance of this class only works with a specific module!");
 
@@ -2832,6 +2834,7 @@ static Instruction *getSuccPad(Instruction *Terminator) {
 }
 
 void Verifier::verifySiblingFuncletUnwinds() {
+  llvm::TimeTraceScope timeScope("Verifier verify sibling funclet unwinds");
   SmallPtrSet<Instruction *, 8> Visited;
   SmallPtrSet<Instruction *, 8> Active;
   for (const auto &Pair : SiblingFuncletInfo) {
diff --git a/llvm/lib/LTO/LTO.cpp b/llvm/lib/LTO/LTO.cpp
index 35d24c17bbd93..89192b39e811f 100644
--- a/llvm/lib/LTO/LTO.cpp
+++ b/llvm/lib/LTO/LTO.cpp
@@ -631,6 +631,7 @@ LTO::~LTO() = default;
 void LTO::addModuleToGlobalRes(ArrayRef<InputFile::Symbol> Syms,
                                ArrayRef<SymbolResolution> Res,
                                unsigned Partition, bool InSummary) {
+  llvm::TimeTraceScope timeScope("LTO add module to global resolution");
   auto *ResI = Res.begin();
   auto *ResE = Res.end();
   (void)ResE;
@@ -731,6 +732,7 @@ static void writeToResolutionFile(raw_ostream &OS, InputFile *Input,
 
 Error LTO::add(std::unique_ptr<InputFile> Input,
                ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add input", Input->getName());
   assert(!CalledGetMaxTasks);
 
   if (Conf.ResolutionFile)
@@ -756,6 +758,7 @@ Error LTO::add(std::unique_ptr<InputFile> Input,
 Expected<ArrayRef<SymbolResolution>>
 LTO::addModule(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
                unsigned ModI, ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add module", Input.getName());
   Expected<BitcodeLTOInfo> LTOInfo = Input.Mods[ModI].getLTOInfo();
   if (!LTOInfo)
     return LTOInfo.takeError();
@@ -850,6 +853,7 @@ Expected<
 LTO::addRegularLTO(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
                    BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
                    ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add regular LTO");
   RegularLTOState::AddedModule Mod;
   Expected<std::unique_ptr<Module>> MOrErr =
       BM.getLazyModule(RegularLTO.Ctx, /*ShouldLazyLoadMetadata*/ true,
@@ -1024,6 +1028,7 @@ LTO::addRegularLTO(InputFile &Input, ArrayRef<SymbolResolution> InputRes,
 
 Error LTO::linkRegularLTO(RegularLTOState::AddedModule Mod,
                           bool LivenessFromIndex) {
+  llvm::TimeTraceScope timeScope("LTO link regular LTO");
   std::vector<GlobalValue *> Keep;
   for (GlobalValue *GV : Mod.Keep) {
     if (LivenessFromIndex && !ThinLTO.CombinedIndex.isGUIDLive(GV->getGUID())) {
@@ -1063,6 +1068,7 @@ Error LTO::linkRegularLTO(RegularLTOState::AddedModule Mod,
 Expected<ArrayRef<SymbolResolution>>
 LTO::addThinLTO(BitcodeModule BM, ArrayRef<InputFile::Symbol> Syms,
                 ArrayRef<SymbolResolution> Res) {
+  llvm::TimeTraceScope timeScope("LTO add thin LTO");
   ArrayRef<SymbolResolution> ResTmp = Res;
   for (const InputFile::Symbol &Sym : Syms) {
     assert(!ResTmp.empty());
@@ -1252,6 +1258,7 @@ Error LTO::run(AddStreamFn AddStream, FileCache Cache) {
 
 void lto::updateMemProfAttributes(Module &Mod,
                                   const ModuleSummaryIndex &Index) {
+  llvm::TimeTraceScope timeScope("LTO update memprof attributes");
   if (Index.withSupportsHotColdNew())
     return;
 
@@ -1282,6 +1289,7 @@ void lto::updateMemProfAttributes(Module &Mod,
 }
 
 Error LTO::runRegularLTO(AddStreamFn AddStream) {
+  llvm::TimeTraceScope timeScope("Run regular LTO");
   // Setup optimization remarks.
   auto DiagFileOrErr = lto::setupLLVMOptimizationRemarks(
       RegularLTO.CombinedModule->getContext(), Conf.RemarksFilename,
@@ -1294,10 +1302,12 @@ Error LTO::runRegularLTO(AddStreamFn AddStream) {
 
   // Finalize linking of regular LTO modules containing summaries now that
   // we have computed liveness information.
-  for (auto &M : RegularLTO.ModsWithSummaries)
-    if (Error Err = linkRegularLTO(std::move(M),
-                                   /*LivenessFromIndex=*/true))
-      return Err;
+  {
+    llvm::TimeTraceScope timeScope("Link regular LTO");
+    for (auto &M : RegularLTO.ModsWithSummaries)
+      if (Error Err = linkRegularLTO(std::move(M), /*LivenessFromIndex=*/true))
+        return Err;
+  }
 
   // Ensure we don't have inconsistently split LTO units with type tests.
   // FIXME: this checks both LTO and ThinLTO. It happens to work as we take
@@ -1526,6 +1536,9 @@ class InProcessThinBackend : public CGThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (in-process)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn AddStream) {
       LTOLLVMContext BackendContext(Conf);
       Expected<std::unique_ptr<Module>> MOrErr = BM.parseModule(BackendContext);
@@ -1536,9 +1549,6 @@ class InProcessThinBackend : public CGThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          Conf.CodeGenOnly);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
-
     if (ShouldEmitIndexFiles) {
       if (auto E = emitFiles(ImportList, ModuleID, ModuleID.str()))
         return E;
@@ -1639,6 +1649,9 @@ class FirstRoundThinBackend : public InProcessThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) override {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (first round)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn CGAddStream,
                               AddStreamFn IRAddStream) {
       LTOLLVMContext BackendContext(Conf);
@@ -1650,8 +1663,6 @@ class FirstRoundThinBackend : public InProcessThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          Conf.CodeGenOnly, IRAddStream);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
     // Like InProcessThinBackend, we produce index files as needed for
     // FirstRoundThinBackend. However, these files are not generated for
     // SecondRoundThinBackend.
@@ -1735,6 +1746,9 @@ class SecondRoundThinBackend : public InProcessThinBackend {
       const std::map<GlobalValue::GUID, GlobalValue::LinkageTypes> &ResolvedODR,
       const GVSummaryMapTy &DefinedGlobals,
       MapVector<StringRef, BitcodeModule> &ModuleMap) override {
+    auto ModuleID = BM.getModuleIdentifier();
+    llvm::TimeTraceScope timeScope("Run ThinLTO backend thread (second round)",
+                                   ModuleID);
     auto RunThinBackend = [&](AddStreamFn AddStream) {
       LTOLLVMContext BackendContext(Conf);
       std::unique_ptr<Module> LoadedModule =
@@ -1744,8 +1758,6 @@ class SecondRoundThinBackend : public InProcessThinBackend {
                          ImportList, DefinedGlobals, &ModuleMap,
                          /*CodeGenOnly=*/true);
     };
-
-    auto ModuleID = BM.getModuleIdentifier();
     if (!Cache.isValid() || !CombinedIndex.modulePaths().count(ModuleID) ||
         all_of(CombinedIndex.getModuleHash(ModuleID),
                [](uint32_t V) { return V == 0; }))
@@ -1915,13 +1927,9 @@ ThinBackend lto::createWriteIndexesThinBackend(
 
 Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
                       const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols) {
+  llvm::TimeTraceScope timeScope("Run ThinLTO");
   LLVM_DEBUG(dbgs() << "Running ThinLTO\n");
   ThinLTO.CombinedIndex.releaseTemporaryMemory();
-  timeTraceProfilerBegin("ThinLink", StringRef(""));
-  auto TimeTraceScopeExit = llvm::make_scope_exit([]() {
-    if (llvm::timeTraceProfilerEnabled())
-      llvm::timeTraceProfilerEnd();
-  });
   if (ThinLTO.ModuleMap.empty())
     return Error::success();
 
@@ -2069,11 +2077,6 @@ Error LTO::runThinLTO(AddStreamFn AddStream, FileCache Cache,
 
   generateParamAccessSummary(ThinLTO.CombinedIndex);
 
-  if (llvm::timeTraceProfilerEnabled())
-    llvm::timeTraceProfilerEnd();
-
-  TimeTraceScopeExit.release();
-
   auto &ModuleMap =
       ThinLTO.ModulesToCompile ? *ThinLTO.ModulesToCompile : ThinLTO.ModuleMap;
 
diff --git a/llvm/lib/LTO/LTOBackend.cpp b/llvm/lib/LTO/LTOBackend.cpp
index 5e8cd12fe040b..ce42fc526beac 100644
--- a/llvm/lib/LTO/LTOBackend.cpp
+++ b/llvm/lib/LTO/LTOBackend.cpp
@@ -366,6 +366,7 @@ bool lto::opt(const Config &Conf, TargetMachine *TM, unsigned Task, Module &Mod,
               bool IsThinLTO, ModuleSummaryIndex *ExportSummary,
               const ModuleSummaryIndex *ImportSummary,
               const std::vector<uint8_t> &CmdArgs) {
+  llvm::TimeTraceScope timeScope("opt");
   if (EmbedBitcode == LTOBitcodeEmbedding::EmbedPostMergePreOptimized) {
     // FIXME: the motivation for capturing post-merge bitcode and command line
     // is replicating the compilation environment from bitcode, without needing
@@ -399,6 +400,7 @@ bool lto::opt(const Config &Conf, TargetMachine *TM, unsigned Task, Module &Mod,
 static void codegen(const Config &Conf, TargetMachine *TM,
                     AddStreamFn AddStream, unsigned Task, Module &Mod,
                     const ModuleSummaryIndex &CombinedIndex) {
+  llvm::TimeTraceScope timeScope("codegen");
   if (Conf.PreCodeGenModuleHook && !Conf.PreCodeGenModuleHook(Task, Mod))
     return;
 
@@ -552,6 +554,7 @@ Error lto::finalizeOptimizationRemarks(
 Error lto::backend(const Config &C, AddStreamFn AddStream,
                    unsigned ParallelCodeGenParallelismLevel, Module &Mod,
                    ModuleSummaryIndex &CombinedIndex) {
+  llvm::TimeTraceScope timeScope("LTO backend");
   Expected<const Target *> TOrErr = initAndLookupTarget(C, Mod);
   if (!TOrErr)
     return TOrErr.takeError();
@@ -577,6 +580,7 @@ Error lto::backend(const Config &C, AddStreamFn AddStream,
 
 static void dropDeadSymbols(Module &Mod, const GVSummaryMapTy &DefinedGlobals,
                             const ModuleSummaryIndex &Index) {
+  llvm::TimeTraceScope timeScope("Drop dead symbols");
   std::vector<GlobalValue*> DeadGVs;
   for (auto &GV : Mod.global_values())
     if (GlobalValueSummary *GVS = DefinedGlobals.lookup(GV.getGUID()))
@@ -603,6 +607,7 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
                        MapVector<StringRef, BitcodeModule> *ModuleMap,
                        bool CodeGenOnly, AddStreamFn IRAddStream,
                        const std::vector<uint8_t> &CmdArgs) {
+  llvm::TimeTraceScope timeScope("Thin backend", Mod.getModuleIdentifier());
   Expected<const Target *> TOrErr = initAndLookupTarget(Conf, Mod);
   if (!TOrErr)
     return TOrErr.takeError();
@@ -679,6 +684,7 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
     return finalizeOptimizationRemarks(std::move(DiagnosticOutputFile));
 
   auto ModuleLoader = [&](StringRef Identifier) {
+    llvm::TimeTraceScope moduleLoaderScope("Module loader", Identifier);
     assert(Mod.getContext().isODRUniquingDebugTypes() &&
            "ODR Type uniquing should be enabled on the context");
     if (ModuleMap) {
@@ -712,10 +718,13 @@ Error lto::thinBackend(const Config &Conf, unsigned Task, AddStreamFn AddStream,
     return MOrErr;
   };
 
-  FunctionImporter Importer(CombinedIndex, ModuleLoader,
-                            ClearDSOLocalOnDeclarations);
-  if (Error Err = Importer.importFunctions(Mod, ImportList).takeError())
-    return Err;
+  {
+    llvm::TimeTraceScope importScope("Import functions");
+    FunctionImporter Importer(CombinedIndex, ModuleLoader,
+                              ClearDSOLocalOnDeclarations);
+    if (Error Err = Importer.importFunctions(Mod, ImportList).takeError())
+      return Err;
+  }
 
   // Do this after any importing so that imported code is updated.
   updateMemProfAttributes(Mod, CombinedIndex);
diff --git a/llvm/lib/Transforms/IPO/FunctionImport.cpp b/llvm/lib/Transforms/IPO/FunctionImport.cpp
index 7bcb20de46ff6..96b274e2f45a9 100644
--- a/llvm/lib/Transforms/IPO/FunctionImport.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionImport.cpp
@@ -40,6 +40,7 @@
 #include "llvm/Support/JSON.h"
 #include "llvm/Support/Path.h"
 #include "llvm/Support/SourceMgr.h"
+#include "llvm/Support/TimeProfiler.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/IPO/Internalize.h"
 #include "llvm/Transforms/Utils/Cloning.h"
@@ -1550,6 +1551,7 @@ void llvm::computeDeadSymbolsWithConstProp(
     const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols,
     function_ref<PrevailingType(GlobalValue::GUID)> isPrevailing,
     bool ImportEnabled) {
+  llvm::TimeTraceScope timeScope("Dead symbols");
   computeDeadSymbolsAndUpdateIndirectCalls(Index, GUIDPreservedSymbols,
                                            isPrevailing);
   if (ImportEnabled)
@@ -1664,6 +1666,7 @@ bool llvm::convertToDeclaration(GlobalValue &GV) {
 void llvm::thinLTOFinalizeInModule(Module &TheModule,
                                    const GVSummaryMapTy &DefinedGlobals,
                                    bool PropagateAttrs) {
+  llvm::TimeTraceScope timeScope("ThinLTO finalize in module");
   DenseSet<Comdat *> NonPrevailingComdats;
   auto FinalizeInModule = [&](GlobalValue &GV, bool Propagate = false) {
     // See if the global summary analysis computed a new resolved linkage.
@@ -1791,6 +1794,7 @@ void llvm::thinLTOFinalizeInModule(Module &TheModule,
 /// Run internalization on \p TheModule based on symmary analysis.
 void llvm::thinLTOInternalizeModule(Module &TheModule,
                                     const GVSummaryMapTy &DefinedGlobals) {
+  llvm::TimeTraceScope timeScope("ThinLTO internalize module");
   // Declare a callback for the internalize pass that will ask for every
   // candidate GlobalValue if it can be internalized or not.
   auto MustPreserveGV = [&](const GlobalValue &GV) -> bool {
@@ -1885,6 +1889,7 @@ Expected<bool> FunctionImporter::importFunctions(
 
   // Do the actual import of functions now, one Module at a time
   for (const auto &ModName : ImportList.getSourceModules()) {
+    llvm::TimeTraceScope timeScope("Import", ModName);
     // Get the module for the import
     Expected<std::unique_ptr<Module>> SrcModuleOrErr = ModuleLoader(ModName);
     if (!SrcModuleOrErr)
@@ -1900,102 +1905,114 @@ Expected<bool> FunctionImporter::importFunctions(
 
     // Find the globals to import
     SetVector<GlobalValue *> GlobalsToImport;
-    for (Function &F : *SrcModule) {
-      if (!F.hasName())
-        continue;
-      auto GUID = F.getGUID();
-      auto MaybeImportType = ImportList.getImportType(ModName, GUID);
-      bool ImportDefinition = MaybeImportType == GlobalValueSummary::Definition;
-
-      LLVM_DEBUG(dbgs() << (MaybeImportType ? "Is" : "Not")
-                        << " importing function"
-                        << (ImportDefinition
-                                ? " definition "
-                                : (MaybeImportType ? " declaration " : " "))
-                        << GUID << " " << F.getName() << " from "
-                        << SrcModule->getSourceFileName() << "\n");
-      if (ImportDefinition) {
-        if (Error Err = F.materialize())
-          return std::move(Err);
-        // MemProf should match function's definition and summary,
-        // 'thinlto_src_module' is needed.
-        if (EnableImportMetadata || EnableMemProfContextDisambiguation) {
-          // Add 'thinlto_src_module' and 'thinlto_src_file' metadata for
-          // statistics and debugging.
-          F.setMetadata(
-              "thinlto_src_module",
-              MDNode::get(DestModule.getContext(),
-                          {MDString::get(Dest...
[truncated]

@llvmbot
Member

llvmbot commented Sep 3, 2025

@llvm/pr-subscribers-debuginfo


@teresajohnson
Contributor

Thanks for the PR, will take a look shortly. But some comments/questions on the description below.

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on a lld-link.exe invocation. I was trying to understand what was the long delay (not multithreaded) before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

Can you clarify what you mean by "full LTO" on the index? I assume you mean the thin link? "full LTO" typically refers to IR based LTO. But yes, the thin link which operates on the index is indeed a serial phase, and can be non-trivial for large applications.

After PR, linking clang.exe: Capture d’écran 2025-09-02 082021

Linking our custom (Unreal Engine game) binary gives a completely different picture, probably because of using Unity files, and the sheer amount of input files (we're providing over 60GB of .OBJs/.LIBs). Exploring a bit all this, it turns out "Import functions" is dominant because of the debug info verifier (called from llvm::UpgradeDebugInfo): Capture d’écran 2025-09-02 102048

Capture d’écran 2025-09-02 102227 Disabling the debug info verifier by adding `/mllvm:-disable-auto-upgrade-debug-info` on the command-line brings down ThinLTO link times from **10 min 7 sec** to **7 min 13 sec**, which is quite significant: Capture d’écran 2025-09-02 103758

We in fact have long used -disable-auto-upgrade-debug-info for our distributed ThinLTO backend compiles to avoid this overhead, but our build system ensures that the IR is all built from a consistent version of clang.

However now what becomes dominant is parsing the metadata from the .OBJ files (that is MetadataLoader::MetadataLoaderImpl::parseMetadata). The total cumulated time on all threads for this (metadata parsing) is ~2 h 6 sec, in contrast to the cumulated "opt" for all units is 56 min, and "codegen" is 1 h 41 min.

As a separate discussion, when running ThinLTO in-process, I wonder if we couldn't parse the metadata only once for each module, instead of separately parsing all imported modules on each ThinLTO thread. Which parses each of them more than once, if my understanding is correct. This would probably require some thread synchronization gymnastics, but the impact could be quite significant. Another avenue would be to parse & retain the metadata in advance, while the "regular LTO" index phase is being executed (where not much happens on the other threads). @teresajohnson any opinion on all this?

"regular LTO" typically means IR based LTO. Here again I assume you mean the thin link on the index?

Avoiding repeated metadata loading would save time, but presumably at the cost of (much?) higher peak memory. This isn't something we have looked at as we used distributed ThinLTO where the backends are completely separate processes.

@aganea
Member Author

aganea commented Sep 3, 2025

Thanks for the PR, will take a look shortly.

Thank you in advance!

In order to better see what's going on during ThinLTO linking, this PR adds more profile tags when using --time-trace on a lld-link.exe invocation. I was trying to understand what was the long delay (not multithreaded) before the actual ThinLTO multithreaded opt/codegen -- it actually was the full LTO on the index.

Can you clarify what you mean by "full LTO" on the index? I assume you mean the thin link? "full LTO" typically refers to IR based LTO. But yes, the thin link which operates on the index is indeed a serial phase, and can be non-trivial for large applications.

Yes, the ThinLink phase.

We in fact have long used -disable-auto-upgrade-debug-info for our distributed ThinLTO backend compiles to avoid this overhead, but our build system ensures that the IR is all built from a consistent version of clang.

Great to know that you're using it!

"regular LTO" typically means IR based LTO. Here again I assume you mean the thin link on the index?

Yes.

Avoiding repeated metadata loading would save time, but presumably at the cost of (much?) higher peak memory. This isn't something we have looked at as we used distributed ThinLTO where the backends are completely separate processes.

We mostly do the ThinLTO link locally, since our users are scattered across the globe and not all in one physical location. Sending the artifacts (bitcode .OBJs) to a cloud can take too long over varying ISP connections. If we had shells in the cloud for all our users, distributed ThinLTO would work, but that's not how we work in the game industry.

I'll take a look at how much extra memory the above metadata loading would take, if done upfront, and perhaps we could even gate it behind a command-line flag. However, if that saves 1-2 min on the whole link, that is quite significant for our iteration times.
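As a rough illustration of such an opt-in gate, a flag could be declared with LLVM's cl::opt machinery along these lines; the flag name and description are hypothetical and not an existing option:

// Hypothetical flag sketch; not part of this patch or of LLVM today.
#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<bool> PreloadImportedMetadata(
    "preload-imported-metadata", llvm::cl::init(false), llvm::cl::Hidden,
    llvm::cl::desc("Parse and retain the metadata of imported modules up "
                   "front, trading peak memory for ThinLTO backend time"));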

{MDString::get(DestModule.getContext(),
SrcModule->getSourceFileName())}));
{
llvm::TimeTraceScope findGlobalsScope("Find globals");
Contributor

This is finding functions to import. So, to be consistent with the other two, which are named Globals and Aliases, this should be Functions?

Member Author

Done.

for (BitcodeFile *f : bitcodeFileInstances)
lto->add(*f);
}
llvm::TimeTraceScope compileScope("LTO compile");
Contributor

There's already a "Compile bitcode" TimeTraceScope just above in the same scope. Is this needed too?

Member Author

I moved this scope to the beginning of BitcodeCompiler::compile() and removed the one at the top of this function.

llvm::TimeTraceScope timeScope("Run ThinLTO");
LLVM_DEBUG(dbgs() << "Running ThinLTO\n");
ThinLTO.CombinedIndex.releaseTemporaryMemory();
timeTraceProfilerBegin("ThinLink", StringRef(""));
Contributor

With this removed, we've lost the timer of just the index-based thin link. Why remove it?

Member Author

@aganea aganea Sep 3, 2025

Reverted this change, the ThinLink tag is now back.

const DenseSet<GlobalValue::GUID> &GUIDPreservedSymbols,
function_ref<PrevailingType(GlobalValue::GUID)> isPrevailing,
bool ImportEnabled) {
llvm::TimeTraceScope timeScope("Dead symbols");
Contributor

Maybe "Drop dead symbols and propagate attributes"? It might be good to have timers on each of the below functions as well, to distinguish their times.

Member Author

Done.

SrcModule->getSourceFileName())}));
}
GlobalsToImport.insert(Fn);
}
Contributor

Are there timers on the actual IR moving below?

Member Author

All the functions below have tags in them.

@aganea
Member Author

aganea commented Sep 3, 2025

Updated as suggested. It looks like this now:

[Screenshot]

Contributor

@teresajohnson teresajohnson left a comment

lgtm but I think it would be good to change occurrences of "full LTO" and "regular LTO" in the description to something like the "LTO thin link" to avoid confusion, as those terms typically refer specifically to IR based LTO.

@aganea
Member Author

aganea commented Sep 5, 2025

lgtm but I think it would be good to change occurrences of "full LTO" and "regular LTO" in the description to something like the "LTO thin link" to avoid confusion, as those terms typically refer specifically to IR based LTO.

Are we talking about this tag?
[Screenshot: Capture d’écran 2025-09-05 121607]

This codepath (LTO::runRegularLTO()) is also taken when Full LTO is used, right? (as in, when -flto is used) I can conditionally change it to "LTO thin link (summary)" or "LTO thin link (analysis)" perhaps, when ThinLTO is in effect? I find it a bit confusing, since we already have two other tags that say "ThinLink", one in LTO::runThinLTO() and one in ThinLTOCodeGenerator::run(). What would be a good naming for these three tags?

@teresajohnson
Contributor

I wasn't talking about the tags in the profile, I just meant in the PR summary. I.e. you have "it actually was the full LTO on the index" and "while the "regular LTO" index phase is being executed". From my understanding both of these are referring to the index-based thin link. I was suggesting changing those to specify that and not use the terms "full LTO" and "regular LTO" which are terms that are used to refer to IR based LTO links (merging all the IR, not using an index).

@aganea
Member Author

aganea commented Sep 5, 2025

I wasn't talking about the tags in the profile, I just meant in the PR summary. I.e. you have "it actually was the full LTO on the index" and "while the "regular LTO" index phase is being executed". From my understanding both of these are referring to the index-based thin link. I was suggesting changing those to specify that and not use the terms "full LTO" and "regular LTO" which are terms that are used to refer to IR based LTO links (merging all the IR, not using an index).

Sounds good, will do, thank you!

@aganea aganea merged commit 5cda242 into llvm:main Sep 5, 2025
9 checks passed