Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shouldn't -fmodules-embed-all-files be the default? #72383

Open
boris-kolpackov opened this issue Nov 15, 2023 · 31 comments · May be fixed by #74419
Open

Shouldn't -fmodules-embed-all-files be the default? #72383

boris-kolpackov opened this issue Nov 15, 2023 · 31 comments · May be fixed by #74419
Labels
clang:modules C++20 modules and Clang Header Modules

Comments

@boris-kolpackov
Copy link
Contributor

Consider these two translation units:

// hello.mxx

module;

#include <string>
#include <string_view>

export module hello;

export namespace hello
{
  std::string
  say_hello (const std::string_view& name)
  {
    return "Hello, " + std::string (name) + '!';
  }
}
// main.cxx

#include <string_view>

import hello;

void f (const std::string_view&) {}

int
main ()
{
  hello::say_hello ("World");
}

If I compile hello.mxx with Clang 16, then remove hello.mxx, and attempt to compile main.cxx, I get an error:

clang++-16 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-16 -std=c++2b -fmodule-file=hello=hello.pcm -o main.o -c -x c++ main.cxx
main.cxx:5:1: fatal error: malformed or corrupted AST file: 'could not find file '/tmp/hello-simple/hello.mxx' referenced by AST file 'hello.pcm''

This can be fixed by adding -Xclang -fmodules-embed-all-files when compiling hello.mxx.

However, -fmodules-embed-all-files appears to no longer be necessary for this example if using Clang 17 or later:

clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
# no error

But a minor change in main.cxx can bring its requirement back if compiling via -frewrite-includes:

// main.cxx

#include <string_view>

import hello;

void f (const std::string_view&) {}

int
main ()
{
  f (hello::say_hello ("World"));  // <-- NOW CALLING f().
}
clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
# no error

But:

clang++-17 -std=c++2b -x c++-module -E -frewrite-includes -o hello.ii hello.mxx
clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.ii
rm hello.ii
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
fatal error: cannot open file '/tmp/hello-simple1/hello.ii': No such file or directory

And if I add -Xclang -fmodules-embed-all-files to the second command, then everything again compiles fine. I get exactly the same behavior with Clang 18.

I have two questions about this:

  1. Is -fmodules-embed-all-files the default starting from Clang 17? If the answer is yes, then there seems to be a bug in the -fdirectives-only interaction.

  2. If -fmodules-embed-all-files is not the default, then should it not be made so? A BMI that still has references to source files is quite brittle since it can be moved around. AFAIK, neither GCC nor MSVC have this restriction for their BMIs.

@iains @ChuanqiXu9 @dwblaikie

@EugeneZelenko EugeneZelenko added clang:modules C++20 modules and Clang Header Modules and removed new issue labels Nov 15, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 15, 2023

@llvm/issue-subscribers-clang-modules

Author: Boris Kolpackov (boris-kolpackov)

Consider these two translation units:
// hello.mxx

module;

#include &lt;string&gt;
#include &lt;string_view&gt;

export module hello;

export namespace hello
{
  std::string
  say_hello (const std::string_view&amp; name)
  {
    return "Hello, " + std::string (name) + '!';
  }
}
// main.cxx

#include &lt;string_view&gt;

import hello;

void f (const std::string_view&amp;) {}

int
main ()
{
  hello::say_hello ("World");
}

If I compile hello.mxx with Clang 16, then remove hello.mxx, and attempt to compile main.cxx, I get an error:

clang++-16 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-16 -std=c++2b -fmodule-file=hello=hello.pcm -o main.o -c -x c++ main.cxx
main.cxx:5:1: fatal error: malformed or corrupted AST file: 'could not find file '/tmp/hello-simple/hello.mxx' referenced by AST file 'hello.pcm''

This can be fixed by adding -Xclang -fmodules-embed-all-files when compiling hello.mxx.

However, -fmodules-embed-all-files appears to no longer be necessary for this example if using Clang 17 or later:

clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
# no error

But a minor change in main.cxx can bring its requirement back if compiling via -frewrite-includes:

// main.cxx

#include &lt;string_view&gt;

import hello;

void f (const std::string_view&amp;) {}

int
main ()
{
  f (hello::say_hello ("World"));  // &lt;-- NOW CALLING f().
}
clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.mxx
mv hello.mxx hello.mxx.bak
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
# no error

But:

clang++-17 -std=c++2b -x c++-module -E -frewrite-includes -o hello.ii hello.mxx
clang++-17 -std=c++2b -fmodule-output=hello.pcm -o hello.pcm.o -c -x c++-module hello.ii
rm hello.ii
clang++-17 -std=c++2b -fmodule-file=hello=hello.pcm -o hello.o -c -x c++ main.cxx
fatal error: cannot open file '/tmp/hello-simple1/hello.ii': No such file or directory

And if I add -Xclang -fmodules-embed-all-files to the second command, then everything again compiles fine. I get exactly the same behavior with Clang 18.

I have two questions about this:

  1. Is -fmodules-embed-all-files the default starting from Clang 17? If the answer is yes, then there seems to be a bug in the -fdirectives-only interaction.

  2. If -fmodules-embed-all-files is not the default, then should it not be made so? A BMI that still has references to source files is quite brittle since it can be moved around. AFAIK, neither GCC nor MSVC have this restriction for their BMIs.

@iains @ChuanqiXu9 @dwblaikie

@dwblaikie
Copy link
Collaborator

For myself, I think not embedding the source is the right default. (the build depends on scattered source files on disk without modules, I'm OK with it depending on them (even indirectly through module files) with modules)
For Google - we embed source because of constraints of our distributed build system, and we're more than happy for our internal needs/uses to happen to match external users, so we'd benefit from the change in default to something that looked more like our situation.
Probably some performance numbers on the cost of doing this to demonstrate it's not a major cost (probably compile time/module generation time, and disk usage numbers would be useful) would be helpful in making this determination.

@ChuanqiXu9
Copy link
Member

ChuanqiXu9 commented Nov 16, 2023

Is -fmodules-embed-all-files the default starting from Clang 17? If the answer is yes, then there seems to be a bug in the -fdirectives-only interaction.

No.

If -fmodules-embed-all-files is not the default, then should it not be made so? A BMI that still has references to source files is quite brittle since it can be moved around. AFAIK, neither GCC nor MSVC have this restriction for their BMIs.

I am happy to make it the default at least for C++20 modules. It will be pretty helpful for distributed build and sandbox-based build.

@boris-kolpackov
Copy link
Contributor Author

While I don't disagree with any of the points made (it's a tradeoff), I want to highlight one aspect that I think is overlooked: not embedding the sources prevents you from moving the BMI. The fact that Clang 17 now seems to be needing the sources in fewer cases might actually make the matter worse since you usually don't get an error even if you moved the BMI. It took me probably half a day yesterday to try to understand what on earth is going on.

@boris-kolpackov
Copy link
Contributor Author

Probably some performance numbers on the cost of doing this to demonstrate it's not a major cost (probably compile time/module generation time, and disk usage numbers would be useful) would be helpful in making this determination.

Here is one date point towards that: the libc++'s std.cppm BMI is 31MB. Not small but I think not a deal breaker either, seeing that it's probably an outlier, size-wise.

@ChuanqiXu9
Copy link
Member

The size of embedded files are much smaller than the size of the BMI.

Let's make it in 2 weeks if no objection comes in.

@dwblaikie
Copy link
Collaborator

Here is one date point towards that: the libc++'s std.cppm BMI is 31MB. Not small but I think not a deal breaker either, seeing that it's probably an outlier, size-wise.

Not sure I follow - is it 31MB without embedded files, or with them? How much does it change with/without?

AFAIK, neither GCC nor MSVC have this restriction for their BMIs.

Could we get more details on that? Do they embed the source files? Do they do something else?

@boris-kolpackov
Copy link
Contributor Author

Not sure I follow - is it 31MB without embedded files, or with them? How much does it change with/without?

It is 31662KB with and 30574KB without.

Could we get more details on that? Do they embed the source files? Do they do something else?

I cannot say for sure whether they embed source files (my guess is that they do not, they just serialize the AST). But I can say for sure they don't have shallow referenced to source files since in build2 we always go through -frewrite-includes (or equivalent) and then remove the partially preprocessed TU (because we compress them).

@dwblaikie
Copy link
Collaborator

Not sure I follow - is it 31MB without embedded files, or with them? How much does it change with/without?

It is 31662KB with and 30574KB without.

Fair enough, that sounds alright.

Could we get more details on that? Do they embed the source files? Do they do something else?

I cannot say for sure whether they embed source files (my guess is that they do not, they just serialize the AST). But I can say for sure they don't have shallow referenced to source files since in build2 we always go through -frewrite-includes (or equivalent) and then remove the partially preprocessed TU (because we compress them).

I wouldn't necessarily conclude that - preprocessed files might still have .file directives that could result in the output file referencing real files on disk. Though how the downstream diagnostics work in the absence of those files - fair, if that works it works. But I'd like to know what GCC and MSVC are doing before we go forward with this work.

@boris-kolpackov
Copy link
Contributor Author

boris-kolpackov commented Nov 27, 2023

I wouldn't necessarily conclude that - preprocessed files might still have .file directives that could result in the output file referencing real files on disk. Though how the downstream diagnostics work in the absence of those files - fair, if that works it works.

I am not sure I follow the reasoning here: yes, the preprocessed .ii files have #line directives that point to the original source files. And in the diagnostics we see file/line/column of the original source file. This all works as expected, no issues there AFAIK. The issue is that, without -fmodules-embed-all-files, Clang actually tries to access the .ii file from which the .pcm file was built when it compiles consumers of that .pcm.

But I'd like to know what GCC and MSVC are doing before we go forward with this work.

I am probably missing something here, but I don't think that GCC and MSVC not having such a requirement are doing something special. Rather, Clang needing access to the original source code in addition to .pcm is doing an unusual thing that needs explaining, so to speak. I believe both GCC and MSVC just serialize the AST (probably with all the #line directive information for diagnostics) and it's not clear why they would also need access to the source code. But, again, maybe I am missing some inherent reason for why a BMI would normally also have to be accompanied with the source file (other than diagnostics).

@dwblaikie
Copy link
Collaborator

I am probably missing something here, but I don't think that GCC and MSVC not having such a requirement are doing something special. Rather, Clang needing access to the original source code in addition to .pcm is doing an unusual thing that needs explaining, so to speak. I believe both GCC and MSVC just serialize the AST (probably with all the #line directive information for diagnostics) and it's not clear why they would also need access to the source code. But, again, maybe I am missing some inherent reason for why a BMI would normally also have to be accompanied with the source file (other than diagnostics).

GCC at least also quotes source code in diagnostics, as I understand it - so I'd be curious to know whether they have the same dependence, or they embed the source code? If they do embed source, that seems helpful to know/double-check our direction here.

@ChuanqiXu9
Copy link
Member

CC @iains for the details in GCC.

@boris-kolpackov
Copy link
Contributor Author

GCC at least also quotes source code in diagnostics, as I understand it

But shouldn't the quoted source code in diagnostics come from the original (unpreprocessed) source file rather than from the preprocessed? Though I suppose if it's only partially preprocessed (with -frewrite-includes) it shouldn't matter.

Just to make doubly sure, the setup is:

test.cppm -- original/unpreprocessed source (not removed)
test.ii   -- partially preprocessed source (removed after test.pcm is produced)
test.pcm  -- produced by compiling test.ii
user.cpp  -- consumer of test.pcm (the error say that test.ii is missing occurs when compiling this file)

@dwblaikie
Copy link
Collaborator

GCC at least also quotes source code in diagnostics, as I understand it

But shouldn't the quoted source code in diagnostics come from the original (unpreprocessed) source file rather than from the preprocessed? Though I suppose if it's only partially preprocessed (with -frewrite-includes) it shouldn't matter.

Just to make doubly sure, the setup is:

test.cppm -- original/unpreprocessed source (not removed)
test.ii   -- partially preprocessed source (removed after test.pcm is produced)
test.pcm  -- produced by compiling test.ii
user.cpp  -- consumer of test.pcm (the error say that test.ii is missing occurs when compiling this file)

ah, that seems like a different concern than the one outlined at the start of the bug - perhaps other compilers only depend on the files referenced by #file directives, whereas clang depends on the actual input file?

It'd be good to understand more about how other compilers deal with these cases - it doesn't seem obvious to me that an implementation would refer back to the original source files (how would that work when you are compiling the original code anyway? Maybe you only have access to the actual input file, not the files referenced in #file directives... - so depending on the real files seems like a problem too)

@ChuanqiXu9
Copy link
Member

Sent #74419

@ChuanqiXu9
Copy link
Member

CC @iains and Gaby (I don't know why I can't CC Gaby directly. I'll try to send a private mail.)

@boris-kolpackov
Copy link
Contributor Author

I don't know why I can't CC Gaby directly

@GabrielDosReis

@GabrielDosReis
Copy link

Thanks for pinging via email.

No, MSVC does not embedded the input source file in its IFC.

There is a pending ask from EDG for the IFC spec to have some form of "resolved token streams" partition embedded in an IFC, but I think that is different.

@GabrielDosReis
Copy link

But shouldn't the quoted source code in diagnostics come from the original (unpreprocessed) source file rather than from the preprocessed?

That is what I would expect.

@iains
Copy link
Contributor

iains commented Dec 9, 2023

GCC has references to the source file paths (one in a section which is essentially human-readable information about how the BMI was built; but I do not believe that participates in the validation of whether a BMI is applicable). The other instance is related to .file directives, I think (you might be better to confirm the intent with @urnathan).

As for embedding source text GCC does not do this AFAIK.

If I remember previous conversations correctly, the reason for this in clang was to do with improving debug experience with clang modules (edit: e.g. when builds are distributed), @Bigcheese ?

@iains
Copy link
Contributor

iains commented Dec 9, 2023

But shouldn't the quoted source code in diagnostics come from the original (unpreprocessed) source file rather than from the preprocessed?

That is what I would expect.

+1

@jyknight
Copy link
Member

jyknight commented Dec 9, 2023

It makes no sense that Clang requires the original file in order to compile, but the solution is to fix Clang to stop requiring that.

I think it does not make sense to modify Clang to embed all the source code into the module by default. If Clang cannot read a source file, it is reasonable for diagnostics to skip printing the source-code context for that error, but why should there should be any other issues?

@Bigcheese
Copy link
Contributor

If I remember previous conversations correctly, the reason for this in clang was to do with improving debug experience with clang modules (edit: e.g. when builds are distributed), @Bigcheese ?

Source locations are pretty fundamental in Clang and are used for lots of things, but the actual content itself has two uses that I'm aware of. The first is for diagnostics, and it would be sad if using modules meant you got worse diagnostics. The other is that we store pointers into file buffers, so Lexer::getSpelling can return data that points at memory mapped files.

@ChuanqiXu9
Copy link
Member

It makes no sense that Clang requires the original file in order to compile, but the solution is to fix Clang to stop requiring that.

I think it does not make sense to modify Clang to embed all the source code into the module by default. If Clang cannot read a source file, it is reasonable for diagnostics to skip printing the source-code context for that error, but why should there should be any other issues?

The good side of embedding source files is that it is easy to implement. But the bad side is that it takes more space in the BMI files. Is there any other bad points?

And if we don't want to embedding source files, we may need to touch seperate parts of lexer which may read the source files from my experience, it may require a larger refactoration.

@jyknight
Copy link
Member

Embedding all source files seems to move even farther from being able to generate a BMI which doesn't require a full rebuild of all transitive dependencies -- even for source changes which shouldn't affect the interface.

I know we're already not able to do that today, but if we decide to embed entire sources, I'm worried that effectively closes the door on such an idea for good.

@ChuanqiXu9
Copy link
Member

Embedding all source files seems to move even farther from being able to generate a BMI which doesn't require a full rebuild of all transitive dependencies -- even for source changes which shouldn't affect the interface.

But that is already the case today, isn't it? For example, the recorded source locations of declarations will change after we made almost any change to the current source file. Also I feel embedding the source files may not prevent us to reduce the transitive dependencies.

That said:

// mod2.cppm
export module mod2;
...

// mod1.cppm
export module mod1;
import mod2;
...

What we want to do is to remain the BMI of mod1.cppm unchanged if we only touched mod2.cppm. And it looks like embedding source files may not be an enemy to this.

@boris-kolpackov
Copy link
Contributor Author

I know we're already not able to do that today, but if we decide to embed entire sources, I'm worried that effectively closes the door on such an idea for good.

If we now embed the source files by default (so that users don't have to specify -fmodules-embed-all-files explicitly), then it effectively becomes an implementation detail. If/when all the issues that currently require this are fixed, we can stop embedding without affecting any users.

@GabrielDosReis
Copy link

The good side of embedding source files is that it is easy to implement. But the bad side is that it takes more space in the BMI files. Is there any other bad points?

I know next to nothing about how PCM file format works, so what I am going to say here (from IFC perspective may not apply).

If you embed the input source file in the PCM file, and a subsequence change to the source file does not result in any semantics change (e.g. some form of comments or other edits) in the sense that the BMI is the same as before then all users of that PCM file would have to be recompiled just because of the change to the embedded source file. Whether that is a scenario that you deem important is up to you; but from the IFC perspective, it isn't something we would do by default.

@ChuanqiXu9
Copy link
Member

The good side of embedding source files is that it is easy to implement. But the bad side is that it takes more space in the BMI files. Is there any other bad points?

I know next to nothing about how PCM file format works, so what I am going to say here (from IFC perspective may not apply).

If you embed the input source file in the PCM file, and a subsequence change to the source file does not result in any semantics change (e.g. some form of comments or other edits) in the sense that the BMI is the same as before then all users of that PCM file would have to be recompiled just because of the change to the embedded source file. Whether that is a scenario that you deem important is up to you; but from the IFC perspective, it isn't something we would do by default.

Yeah, it sounds like https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755. And I am curious how MSVC handles the changed diagnostic problem (the reported locations in diagnostic may not be accurate) and the debug information problem (the locations in the debug information may not be accurate) if we changed the source locations in a module unit but don't recompile the users?

@boris-kolpackov
Copy link
Contributor Author

If you embed the input source file in the PCM file, and a subsequence change to the source file does not result in any semantics change (e.g. some form of comments or other edits) in the sense that the BMI is the same as before then all users of that PCM file would have to be recompiled just because of the change to the embedded source file.

While this sounds plausible, I think it's somewhat theoretical: if the source file changes then the build system will recompile the BMI anyway and any consumers will also need to be recompiled unless there is some more elaborate, hash-based change tracking involved (in which case this more involved approach can also ignore/take into account the extent of changes to the embedded source code).

@GabrielDosReis
Copy link

While this sounds plausible, I think it's somewhat theoretical

One man's theory is another man's practice. What I described is a concrete scenario observed in practice in dev inner loops.

If the source file changes then the build system will recompile the BMI anyway and any consumers will also need to be recompiled unless there is some more elaborate, hash-based change tracking involved (in which case this more involved approach can also ignore/take into account the extent of changes to the embedded source code).

The compile is invoked to generate a new BMI, correct, BUT the on-disk file holding the actual BMI does not need to change if there is NO binary-diff (say as reported by tool like move-if-change Autotools lore).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:modules C++20 modules and Clang Header Modules
Projects
None yet
9 participants