Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Modules] Add -cc1 -flate-module-map-file to load module maps after PCMs #88893

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ilya-biryukov
Copy link
Contributor

@ilya-biryukov ilya-biryukov commented Apr 16, 2024

If the same module map is passed to multiple compilation actions that build PCMs and later load them, we currently create a new FileID for it every time a PCM gets built.

This is not very problematic in terms of performance, but it causes us to waste source location space when those PCMs get loaded together: the module map will take source location space in each of the PCMs, effectively multiplying the space it takes by the number of PCMs that we load.

E.g. PR #83660 has inadvertenly caused more PCMs to be loaded in our internal builds and some actions tipped over the source location limit, which led us to discovering this issue.

To provide a workaround, this commit introduces a new command-line flag to load module maps after PCMs and ensure we reuse existing FileID if it's available. This allows to reduce the source location space taken by module maps because we will use an existing FileID from some loaded PCM instead of creating a new one each time.

There are probably some holes in the approach for more complicated uses. In particular:

  • It is unclear how the module map shadowing rules are affected.
  • Code replaying AST files is not aware of the new option and will keep using fmodule-map-file, it is unclear whether this actually causes any issues
  • ModuleDepCollector does not distinguish between late and non-late module maps.

The new flag is -cc1 for now to allow providing a quick workaround to relieve pressure from our builds. In the long term, we should ideally look for ways to remove the flag and change the default behavior to avoid duplicated allocations of the source location space for the same module map file.

If the same module map is passed to multiple compilation actions that
build PCMs and later load them, we currently create a new FileID for it
every time a PCM gets built.

This is not very problematic in terms of performance, but it causes us
to waste source location space when those PCMs get loaded together: the
module map will take source location space in each of the PCMs,
effectively multiplying the space it takes by the number of PCMs that we
load.

PR llvm#83660 has inadvertenly caused more PCMs to be loaded in our internal
builds and some actions tipped over the source location limit.

To provide a workaround, introduce an option to load module maps after
PCMs and ensure we reuse existing FileID if it's available. This allows
to reduce the source location space taken by module maps because we will
use an existing FileID from some loaded PCM instead of creating a new
one each time.

There are probably some holes in the approach for more complicated uses.
In particular:
- It is unclear how the module map shadowing rules are affected.
- Code replaying AST files is not aware of the new option and will keep
  using `fmodule-map-file`, it is unclear whether this acutally causes
  any issues
- ModuleDepCollector does not distinguish betwee late and non-late
  module maps.

The new flag is `-cc1` for now to allow providing a quick workaround
for our builds in the wake of llvm#83660 without figuring out all the
details. I plan to start a follow up discussion on Discourse to see if
this flag should be promoted to Clang flag or if there are alternative
solutions that avoid wasting the source location space in duplicate
module maps.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules labels Apr 16, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 16, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-modules

Author: Ilya Biryukov (ilya-biryukov)

Changes

If the same module map is passed to multiple compilation actions that build PCMs and later load them, we currently create a new FileID for it every time a PCM gets built.

This is not very problematic in terms of performance, but it causes us to waste source location space when those PCMs get loaded together: the module map will take source location space in each of the PCMs, effectively multiplying the space it takes by the number of PCMs that we load.

PR #83660 has inadvertenly caused more PCMs to be loaded in our internal builds and some actions tipped over the source location limit.

To provide a workaround, introduce an option to load module maps after PCMs and ensure we reuse existing FileID if it's available. This allows to reduce the source location space taken by module maps because we will use an existing FileID from some loaded PCM instead of creating a new one each time.

There are probably some holes in the approach for more complicated uses. In particular:

  • It is unclear how the module map shadowing rules are affected.
  • Code replaying AST files is not aware of the new option and will keep using fmodule-map-file, it is unclear whether this acutally causes any issues
  • ModuleDepCollector does not distinguish betwee late and non-late module maps.

The new flag is -cc1 for now to allow providing a quick workaround for our builds in the wake of #83660 without figuring out all the details for a more sophisticated solution. I plan to start a follow up discussion on Discourse for a broader problem
of source location space exhaustion in modular builds.


Full diff: https://github.com/llvm/llvm-project/pull/88893.diff

7 Files Affected:

  • (modified) clang/include/clang/Driver/Options.td (+6)
  • (modified) clang/include/clang/Frontend/FrontendOptions.h (+3)
  • (modified) clang/lib/Frontend/FrontendAction.cpp (+11-3)
  • (modified) clang/lib/Frontend/Rewrite/FrontendActions.cpp (+1)
  • (modified) clang/lib/Lex/ModuleMap.cpp (+4-1)
  • (modified) clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp (+2)
  • (added) clang/test/Modules/late-module-maps.cpp (+93)
diff --git a/clang/include/clang/Driver/Options.td b/clang/include/clang/Driver/Options.td
index e24626913add76..d3fdc054076751 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -3179,6 +3179,12 @@ def fmodule_map_file : Joined<["-"], "fmodule-map-file=">,
   MetaVarName<"<file>">,
   HelpText<"Load this module map file">,
   MarshallingInfoStringVector<FrontendOpts<"ModuleMapFiles">>;
+def flate_module_map_file : Joined<["-"], "flate-module-map-file=">,
+  Group<f_Group>,
+  Visibility<[CC1Option]>,
+  MetaVarName<"<file>">,
+  HelpText<"Load this module map file after all modules">,
+  MarshallingInfoStringVector<FrontendOpts<"LateModuleMapFiles">>;
 def fmodule_file : Joined<["-"], "fmodule-file=">,
   Group<i_Group>,
   Visibility<[ClangOption, CC1Option, CLOption]>,
diff --git a/clang/include/clang/Frontend/FrontendOptions.h b/clang/include/clang/Frontend/FrontendOptions.h
index a738c1f3757682..cc45e68899c7e8 100644
--- a/clang/include/clang/Frontend/FrontendOptions.h
+++ b/clang/include/clang/Frontend/FrontendOptions.h
@@ -536,6 +536,9 @@ class FrontendOptions {
   /// The list of module map files to load before processing the input.
   std::vector<std::string> ModuleMapFiles;
 
+  /// \brief The list of module map files to load after ModuleFile.
+  std::vector<std::string> LateModuleMapFiles;
+
   /// The list of additional prebuilt module files to load before
   /// processing the input.
   std::vector<std::string> ModuleFiles;
diff --git a/clang/lib/Frontend/FrontendAction.cpp b/clang/lib/Frontend/FrontendAction.cpp
index b7c9967316f0b8..14745584c6ccaa 100644
--- a/clang/lib/Frontend/FrontendAction.cpp
+++ b/clang/lib/Frontend/FrontendAction.cpp
@@ -919,14 +919,17 @@ bool FrontendAction::BeginSourceFile(CompilerInstance &CI,
   if (!BeginSourceFileAction(CI))
     return false;
 
-  // If we were asked to load any module map files, do so now.
-  for (const auto &Filename : CI.getFrontendOpts().ModuleMapFiles) {
+  auto LoadModuleMap = [&](StringRef Filename) {
     if (auto File = CI.getFileManager().getOptionalFileRef(Filename))
       CI.getPreprocessor().getHeaderSearchInfo().loadModuleMapFile(
           *File, /*IsSystem*/false);
     else
       CI.getDiagnostics().Report(diag::err_module_map_not_found) << Filename;
-  }
+  };
+  // If we were asked to load any module map files, do so now.
+  for (StringRef Filename : CI.getFrontendOpts().ModuleMapFiles)
+    LoadModuleMap(Filename);
+  // Note that late module maps will be loaded after modules.
 
   // If compiling implementation of a module, load its module map file now.
   (void)CI.getPreprocessor().getCurrentModuleImplementation();
@@ -1038,6 +1041,11 @@ bool FrontendAction::BeginSourceFile(CompilerInstance &CI,
           diag::warn_eagerly_load_for_standard_cplusplus_modules);
   }
 
+  // Some module maps are loaded after modules to allow avoiding redundant
+  // processing if they were processed by modules.
+  for (StringRef Filename : CI.getFrontendOpts().LateModuleMapFiles)
+    LoadModuleMap(Filename);
+
   // If there is a layout overrides file, attach an external AST source that
   // provides the layouts from that file.
   if (!CI.getFrontendOpts().OverrideRecordLayoutsFile.empty() &&
diff --git a/clang/lib/Frontend/Rewrite/FrontendActions.cpp b/clang/lib/Frontend/Rewrite/FrontendActions.cpp
index cf5a9437e89e6c..f6e7aea06535de 100644
--- a/clang/lib/Frontend/Rewrite/FrontendActions.cpp
+++ b/clang/lib/Frontend/Rewrite/FrontendActions.cpp
@@ -255,6 +255,7 @@ class RewriteIncludesAction::RewriteImportsListener : public ASTReaderListener {
         Filename, InputKind(Language::Unknown, InputKind::Precompiled));
     Instance.getFrontendOpts().ModuleFiles.clear();
     Instance.getFrontendOpts().ModuleMapFiles.clear();
+    Instance.getFrontendOpts().LateModuleMapFiles.clear();
     // Don't recursively rewrite imports. We handle them all at the top level.
     Instance.getPreprocessorOutputOpts().RewriteImports = false;
 
diff --git a/clang/lib/Lex/ModuleMap.cpp b/clang/lib/Lex/ModuleMap.cpp
index eed7eca2e73562..232420b58f6004 100644
--- a/clang/lib/Lex/ModuleMap.cpp
+++ b/clang/lib/Lex/ModuleMap.cpp
@@ -3130,7 +3130,10 @@ bool ModuleMap::parseModuleMapFile(FileEntryRef File, bool IsSystem,
   if (ID.isInvalid()) {
     auto FileCharacter =
         IsSystem ? SrcMgr::C_System_ModuleMap : SrcMgr::C_User_ModuleMap;
-    ID = SourceMgr.createFileID(File, ExternModuleLoc, FileCharacter);
+    if (ExternModuleLoc.isValid())
+      ID = SourceMgr.createFileID(File, ExternModuleLoc, FileCharacter);
+    else // avoid creating redundant FileIDs, they take sloc space in PCMs.
+      ID = SourceMgr.getOrCreateFileID(File, FileCharacter);
   }
 
   assert(Target && "Missing target information");
diff --git a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp b/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
index e19f19b2528c15..a45a8e5267389c 100644
--- a/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
+++ b/clang/lib/Tooling/DependencyScanning/ModuleDepCollector.cpp
@@ -242,6 +242,8 @@ ModuleDepCollector::getInvocationAdjustedForModuleBuildWithoutOutputs(
   // Remove directly passed modulemap files. They will get added back if they
   // were actually used.
   CI.getMutFrontendOpts().ModuleMapFiles.clear();
+  // Late module maps can only be readded into ModuleMapFiles.
+  CI.getMutFrontendOpts().LateModuleMapFiles.clear();
 
   auto DepModuleMapFiles = collectModuleMapFiles(Deps.ClangModuleDeps);
   for (StringRef ModuleMapFile : Deps.ModuleMapFileDeps) {
diff --git a/clang/test/Modules/late-module-maps.cpp b/clang/test/Modules/late-module-maps.cpp
new file mode 100644
index 00000000000000..2ca4af3ae203a0
--- /dev/null
+++ b/clang/test/Modules/late-module-maps.cpp
@@ -0,0 +1,93 @@
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+
+// We need to check we don't waste source location space for the same file
+// (i.e. base.modulemap) when it's passed to multiple PCM file.
+//
+// *** First, try to use normal module map files.
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.modulemap -fmodule-map-file=%t/a.modulemap \
+// RUN:   -fmodule-name=a -xc++ -emit-module -o %t/a.pcm %t/a.modulemap
+
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.modulemap -fmodule-map-file=%t/a.modulemap -fmodule-map-file=%t/b.modulemap \
+// RUN:   -fmodule-file=%t/a.pcm \
+// RUN:   -fmodule-name=b -xc++ -emit-module -o %t/b.pcm %t/b.modulemap
+
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.modulemap -fmodule-map-file=%t/a.modulemap -fmodule-map-file=%t/b.modulemap \
+// RUN:   -fmodule-file=%t/a.pcm -fmodule-file=%t/b.pcm \
+// RUN:   -fsyntax-only -print-stats %t/use.cpp 2>&1 \
+// RUN:      | FileCheck %t/use.cpp
+
+// *** Switch to -flate-module-map-file and check it produces less loaded SLO entries.
+// RUN: rm %t/*.pcm
+
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -flate-module-map-file=%t/base.modulemap -flate-module-map-file=%t/a.modulemap \
+// RUN:   -fmodule-name=a -xc++ -emit-module -o %t/a.pcm %t/a.modulemap
+
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -flate-module-map-file=%t/base.modulemap -flate-module-map-file=%t/a.modulemap -flate-module-map-file=%t/b.modulemap \
+// RUN:   -fmodule-file=%t/a.pcm \
+// RUN:   -fmodule-name=b -xc++ -emit-module -o %t/b.pcm %t/b.modulemap
+
+// RUN: %clang_cc1 -fmodules -fno-implicit-modules \
+// RUN:   -flate-module-map-file=%t/base.modulemap -flate-module-map-file=%t/a.modulemap -flate-module-map-file=%t/b.modulemap \
+// RUN:   -fmodule-file=%t/a.pcm -fmodule-file=%t/b.pcm \
+// RUN:   -fsyntax-only -print-stats %t/use.cpp 2>&1 \
+// RUN:      | FileCheck --check-prefix=CHECK-LATE %t/use.cpp
+
+//--- use.cpp
+// This is a very shaky check, it would be nice if it was more directly looking at which files were loaded.
+// We load 2 SLocEntries less with flate-module-map-file, they correspond to savings from base.modulemap and a.modulemap
+// reusing the FileID in a.pcm and b.pcm.
+// We also have 3 less local SLocEntries, because we get to reuse FileIDs all module maps from PCMs.
+//
+// CHECK: Source Manager Stats:
+// CHECK: 7 local SLocEntries
+// CHECK: 13 loaded SLocEntries
+
+// CHECK-LATE: Source Manager Stats:
+// CHECK-LATE: 4 local SLocEntries
+// CHECK-LATE: 11 loaded SLocEntries
+#include "a.h"
+#include "b.h"
+#include "assert.h"
+
+int main() {
+    return a() + b();
+}
+
+//--- base.modulemap
+module "base" {
+    textual header "assert.h"
+}
+
+//--- a.modulemap
+module "a" {
+    header "a.h"
+    use "base"
+}
+
+//--- b.modulemap
+module "b" {
+    header "b.h"
+    use "a"
+    use "base"
+}
+
+
+//--- assert.h
+#define ASSERT
+
+//--- a.h
+#include "assert.h"
+inline int a() { return 1; }
+
+//--- b.h
+#include "a.h"
+#include "assert.h"
+
+inline int b() { return a() + 1; }
+

@zygoloid
Copy link
Collaborator

Have you tried changing the behavior of the existing -fmodule-map-file= to load the file after we load module files? If so, what kinds of things do you see breaking or changing? If we can avoid it, I think it would be better to only have a single mode here, and loading the module files first seems to make a lot of sense given that they can make it unnecessary to load the module map files at all.

@ian-twilightcoder
Copy link
Contributor

@ilya-biryukov
Copy link
Contributor Author

Have you tried changing the behavior of the existing -fmodule-map-file= to load the file after we load module files? If so, what kinds of things do you see breaking or changing? If we can avoid it, I think it would be better to only have a single mode here, and loading the module files first seems to make a lot of sense given that they can make it unnecessary to load the module map files at all.

I did not do it because I was sure this will break the "module shadowing" rules after reading the code.
I'll try it now and get back with my findings.

@ilya-biryukov
Copy link
Contributor Author

There's quite a few failures:
https://gist.github.com/ilya-biryukov/380d84dfe53f839f449231eb9a2dd46c

The logic that sets up module structure from module maps before starting to load PCMs definitely needs to be reworked (and I'm not sure if it can be in the first place). I believe we get away with it internally because we always pass either a top-level PCM or a top-level module map for every top-level dependency of our compilation. Therefore, our loading of PCMs ends up being self-contained, which is not always the case. This probably can't be fixed.

Btw, #89005 might make this change unnecessary for us as it should also unbreak our existing builds.

@ilya-biryukov
Copy link
Contributor Author

I wanted to bring this up again. We were hitting more source location exhaustion bugs internally recently and landing this would really help us cope with them.
This could be a temporary solution for us that we can rip off after 64 bit source locations are the default. The flag is under -cc1, so this should be doable.

@AaronBallman I know we were planning to do the move to 64bit source locations in not-very-distant future, so I wanted to CC you on this patch in case you have opinions.

I would appreciate more feedback from the reviewers already in the list, but also adding my colleague @hokein to get some more eyes on it.

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I just missed it initially.)

I took a quick look and the change looks at least not bad to me.

But I didn't have experience with clang modules and I don't have an idea for the concerns you list.

I don't mind accepting this if you want to land this. I feel it won't affect users. But let's give some time for discussing.

@ilya-biryukov
Copy link
Contributor Author

@ChuanqiXu9 thanks for taking a look! Makes sense to give others a chance to chime in, I will wait for more comments.

@jansvoboda11
Copy link
Contributor

To better understand the motivation, I have two questions:

  • Why do you give Clang module maps describing modules that are already described in PCM files? Unless the module map describes some other top-level module that's yet to be built, the information it provides is redundant. Is this impossible to detect in your build system?
  • Why do you need to wait for 64-bit source locations to be the default upstream? Is the concern that it's not a well-tested configuration right now and you don't want to shoulder the burden of qualifying that mode?

This patch looks good to me, but I think it'd be worth pursuing making this the default behavior, i.e. investigating the test failures and at least understanding what exactly is going on.

@ilya-biryukov
Copy link
Contributor Author

To better understand the motivation, I have two questions:

  • Why do you give Clang module maps describing modules that are already described in PCM files? Unless the module map describes some other top-level module that's yet to be built, the information it provides is redundant. Is this impossible to detect in your build system?

It's possible, but much more complicated to do at build system level and it would be much easier to solve this in the compiler. I want to be very clear that we also commit to support this flag while we are using it, and to remove it when we stop doing so. So if it causes any issues, we're there to help and take on the maintenance burden.

The build system is Bazel, the code is available upstream too, btw. Although the specific code that runs modules is not enabled by default, and, AFAIK, Google internally is the only user of that. Definitely the only user that cares about the scale that causes these issues.

  • Why do you need to wait for 64-bit source locations to be the default upstream? Is the concern that it's not a well-tested configuration right now and you don't want to shoulder the burden of qualifying that mode?

The biggest concern that we have for 64bit source location is performance costs. We expect some of our compile actions to time out or run out of memory after the change (because the added overhead is significant), so we would like to budget for it and make sure we get a chance to test it in advance of the upstream defaults actually changing.

All of that takes time, so this change is intended to give us a way out the corner that we're in until 64 bit source locations are in.

This patch looks good to me, but I think it'd be worth pursuing making this the default behavior, i.e. investigating the test failures and at least understanding what exactly is going on.

I can take another look, but my vague memory from the investigation I made last time I checked, is that there were some module map shadowing semantics that we could not support unless we read the module map before loading the PCM. We do not use that internally and rely on the fact that each module name is uniquely identified by its module map, hence we can use the flag without giving it too much thought. I will need to refresh my memory, but I believe the conclusion I had at the time was that this shadowing is fundamentally incompatible with loading the module maps later than PCMs, because it affects which PCMs need to loaded. (The conclusions are a little rushed, I may misremember, will take another look next week).

That was another reason why I felt that having this as a -cc1 flag is the only reasonable option, otherwise the interface just gets way too complicated.

Copy link
Contributor

@jansvoboda11 jansvoboda11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! This LGTM, but I agree with the sentiment that I'd be nice to have this as the default behavior. Any effort towards that would be appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants