Reproducible env summaries #9345
I don't think having persistent entries in the environment can be described as just an optimisation. They allow persistent modules to shadow non-persistent modules in the environment -- which IIRC was their original motivation. This information still needs to be reflected in the summary. I think a slightly subtler change is needed: persistent modules should only be added to the summary if they are shadowing a non-persistent module. |
The initially opened modules are also persistent modules, so we probably also want to record implicit persistent modules that shadow those explicit persistent modules. This implies that we would record the cmis present on the file system that shadow the initially opened modules. But since those cmis may affect the compilation, this might be a reasonable compromise?
I'm not sure I quite follow your previous comment, but it sounds like you are suggesting the same thing I said. I think my criterion -- "persistent modules should only be added to the summary if they are shadowing a non-persistent module" -- is sufficient to capture all cases that need to be recorded. My reasoning is as follows:
I had a slightly more complex criterion in mind (with the idea of always adding persistent modules that were explicitly requested), but your criterion is probably better.
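The criterion discussed above can be sketched on a toy model. This is only an illustration: module names are plain strings, and `persistent_to_record`, `initial_non_persistent` and `persistent_in_path` are hypothetical names, not part of the compiler's `Env` module.

```ocaml
(* Toy model, not the compiler's Env API: module names are strings. *)
module StringSet = Set.Make (String)

(* [initial_non_persistent]: names bound to non-persistent modules in the
   initial environment (for instance aliases coming from the initially
   opened library).
   [persistent_in_path]: names of the cmi files visible in the load path.
   A persistent module is kept for the summary only if it shadows one of
   the non-persistent names. *)
let persistent_to_record ~initial_non_persistent ~persistent_in_path =
  List.filter
    (fun name -> StringSet.mem name initial_non_persistent)
    persistent_in_path

let () =
  let initial_non_persistent =
    StringSet.of_list [ "List"; "Option"; "Array" ] in
  let persistent_in_path = [ "Foo"; "Option"; "Bar" ] in
  (* Foo and Bar shadow nothing, so they are not recorded. *)
  persistent_to_record ~initial_non_persistent ~persistent_in_path
  |> List.iter print_endline
```

Note that the result depends only on the initial environment and the load path, not on the contents of the module being compiled.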
A property I like about my criterion is that it is not dependent on the contents of the module, only on the initial environment. That means that you can take the environments from a
Where are we with this PR? I think the problem it addresses (recording all visible .cmi files) is an embarrassment and should be fixed by 4.11.
The current version implements the proposal of @lpw25, and records in the summary only persistent modules that took precedence over a non-persistent module. There is still a problematic corner case, however: if we have two modules A and B, incomparable with respect to dependencies, that both shadow a non-persistent module in the initial environment, then A can be correctly compiled with either the B-shadowed or the non-shadowed initial environment, and this uncertainty will be reflected in the cmt file. However, this reproducibility issue reflects an existing deeper issue. Moreover, this only affects libraries redefining stdlib modules, and there is a work-around for those: explicitly define a total order between all redefined modules. If we are fine for now with the limitation above, and @lpw25 agrees with the implementation, I think the PR can be merged for 4.11.
I don't understand what "and this uncertainty will be reflected in the cmt file" means, can you give an example and explain what happens there and why it is an issue?
Consider a project that contains two empty persistent modules List and is compiled without Then depending on the presence of option.cmi on the file system, the module List can be compiled with an initial environment in which the Option name refers to either the persistent module Option, or the standard library alias Option. This difference of initial environments will be reflected in the environment summary of the cmt files. Which means that building the list and option compilation units in parallel may result in three outputs: the compilation of List has seen option.cmi, the compilation of Option has seen list.cmi, or the two compilations did not observe the other compilation unit's cmi. In other words, libraries that shadow more than one standard library module will need to take some care to achieve reproducible builds. (by using
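The three outcomes described above can be enumerated on a toy model. This sketch assumes a simplified notion of summary (for each shadowable name, whether it resolved to a persistent module or to the standard library alias); the `binding` type and the `summary` function are hypothetical and only stand in for what ends up in the cmt file.

```ocaml
(* Toy model of the race: what a unit's summary records depends on which
   other cmi already exists when its compilation starts. *)
type binding = Stdlib_alias | Persistent

(* Summary recorded while compiling [unit_name]: for each shadowable name
   other than the unit itself, whether a cmi was visible on disk. *)
let summary ~shadowable ~cmis_on_disk unit_name =
  shadowable
  |> List.filter (fun n -> n <> unit_name)
  |> List.map (fun n ->
         (n, if List.mem n cmis_on_disk then Persistent else Stdlib_alias))

let shadowable = [ "List"; "Option" ]

(* Each schedule gives the pair (summary of List, summary of Option). *)
let list_first =
  ( summary ~shadowable ~cmis_on_disk:[] "List",
    summary ~shadowable ~cmis_on_disk:[ "List" ] "Option" )

let option_first =
  ( summary ~shadowable ~cmis_on_disk:[ "Option" ] "List",
    summary ~shadowable ~cmis_on_disk:[] "Option" )

let neither_seen =
  ( summary ~shadowable ~cmis_on_disk:[] "List",
    summary ~shadowable ~cmis_on_disk:[] "Option" )

let () =
  (* The three schedules produce three pairwise different outputs. *)
  assert (list_first <> option_first);
  assert (list_first <> neither_seen);
  assert (option_first <> neither_seen)
```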
Sorry for going on with the naive question, but is this really stdlib specific? If I depend on a findlib package that exports some modules Foo and Bar (which I don't use; plus others I use), and I also have two modules Foo and Bar in my project, won't the same issue happen as well? (Assuming that the library's Foo and Bar are included in the library archive but are not -linkall and not effectful, they would not be required at link-time so the resulting program would work fine.) |
This is really specific to the initially opened library (which can only be the
Before this PR, we were recording all persistent modules from steps 2 and 4 in the environment summary. After this PR, we are recording only the persistent modules introduced in step 4 that shadow the submodules of the main module of the initially opened library, which was added to the environment in step 3.
LGTM
@Octachron thanks for the explanation (this is really tricky!). I have another beginner question: why are we doing step (4) at all, which introduces this non-reproducibility? Why not record only the persistent modules (of the include path) that are required to type-check the program (starting with the module lookup of |
We don't want to record the state of the file system at the start of the compilation in the compiled files. Consequently, we only add persistent modules to the env summary if they have an observable action on the initial environment. This is only the case if they shadow a non-persistent module of the initially opened library (which can only be Stdlib currently).
Step 4 is used to shadow modules from the initially opened library. This is at least partially a backward compatibility feature with the non-prefixed standard library (where persistent modules in the path could shadow the persistent modules from the standard library). It is also necessary to make the
I will need more time to understand @Octachron's new explanation and I suspect that more newbie questions will follow. Taking a step back: I don't want my idle curiosity to slow down the integration of this necessary fix in the compiler. If you and Leo agree it's good, please merge once the CI agrees, we can keep discussing this. But I am frightened by the amount of complexity in this corner of our language. It shouldn't take a week of conversation to understand what is the initial environment of a source file. From a blurred distance it sounds like we are doing this wrong, and we should change to a simpler semantics -- which could very well take work to be found, or possibly have some downsides compared to the current too-complex-to-see-the-problems one.
In #2041, Jérémie gave the following pseudocode for initial environment construction when invoked with
If I understand correctly, before the current PR the The problem with not recording anything is that then the lookup semantics changes, as a name looked-up in the summary may resolve to a shadowed version coming from I have two questions:
For your first point, indeed we are not recording the shadowing of stdlib's persistent modules. For the second point, yes indeed the |
In other words, the compiler assumes that persistent unit names are unique (in the include directories). I have another question: how does the content of the environment influence |
I believe that the API we currently use to include a directory is wrong. When we are asked to open an include directory, we look at all the persistent modules in it and we add them one by one -- in the current environment and in the summary. This PR is trying to help reproducibility (in an incomplete way) by forgetting strategically to add some of those persistent modules in the summary. Instead we should have a higher-level environment action to add an include directory, which would be stored in the summary. From this summary we can rebuild an environment (assuming, as mentioned before, that persistent module names are unique) by redoing the persistent-modules lookup for a directory at rebuild time. (The non-summary part would eagerly list the persistent modules and add them to the environment tables, to correctly implement shadowing without using a stacked/layered representation.) |
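A minimal sketch of this idea, assuming a simplified summary type: the constructor `Include_dir`, the `entry` type and the `list_dir` callback are all hypothetical and do not correspond to the compiler's actual `Env.summary`. The summary stores only the directory name, and the persistent modules are listed again when the environment is rebuilt.

```ocaml
(* Hypothetical summary type; names are illustrative only. *)
type summary =
  | Empty
  | Include_dir of summary * string  (* directory added to the load path *)
  | Value of summary * string        (* stand-in for ordinary entries *)

type entry = Persistent of string | Val of string

(* [list_dir] abstracts the file system: it returns the persistent module
   names found in a directory at rebuild time. The most recent entries
   come first in the result, so that they shadow older ones. *)
let rec rebuild ~list_dir = function
  | Empty -> []
  | Include_dir (rest, dir) ->
      List.map (fun m -> Persistent m) (list_dir dir)
      @ rebuild ~list_dir rest
  | Value (rest, v) -> Val v :: rebuild ~list_dir rest

let () =
  (* A fake file system with a single include directory. *)
  let list_dir = function "lib" -> [ "Foo"; "Bar" ] | _ -> [] in
  let s = Value (Include_dir (Empty, "lib"), "x") in
  assert (rebuild ~list_dir s = [ Val "x"; Persistent "Foo"; Persistent "Bar" ])
```

In this model the summary `s` never mentions Foo or Bar, so it is the same whatever cmi files happen to be present at build time; the cost is that rebuilding needs the directory to be available and unchanged.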
I think that makes the assumption that include directories will be in the same location when reading |
I believe the issue is that the environment ends up in the debug info of the cmo/cma
@hhugo Thanks! It is in the field @lpw25 that's a good remark, and it certainly makes the idea more difficult to use, but I'm not convinced it is dead yet. A few orthogonal and slightly rambly remarks:
I disagree with this approach for two reasons:
That makes sense, thanks. I had trouble understanding point (2) at first, given that we can always look up a module that is not recorded in the environment. The problem comes, again, from persistent units that shadow submodules. So you could in theory build an example where a persistent unit shadows a submodule, but it is not "used" enough by the type-checker to be required, and so you lose track of it with Alain's approach. Is that right?
Yeah, that's the case I was thinking of |
Note that there is a tension between a summary being future-proof (being able to type other things than the module it was given) and reproducible. To be able to type-check as many other expressions as possible, you want to detect all persistent modules on the system (at least in the include directories); to be reproducible you want to depend as little as possible on the filesystem state. One possible long-term approach would be to expect to be given a precise set of dependencies, and in the include directories only record those dependencies -- and refuse to load the components of non-dependencies. This helps reproducibility, as non-dependencies are not stored at all, but it hurts future-proofness, as the environment is not valid anymore if the dependencies change. This is what we can already have if we require local-project persistent modules to be passed as Note that this approach is close to Alain's strategy, except that we ask the build system to compute "required modules" instead of the type-checker. (It can be combined with the "trick" of only remembering shadowing persistent modules in the environment, but it does not need to anymore.) If we wanted to be able to compute dependencies on the fly during type-checking (as Alain has suggested using something like
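The "precise set of dependencies" approach can be sketched as a lookup that refuses anything outside the declared set. The names `lookup`, `declared_deps` and `find_cmi` are assumptions made for this illustration; they are not existing compiler functions or flags.

```ocaml
module StringSet = Set.Make (String)

type lookup_result =
  | Loaded of string     (* path of the cmi that was loaded *)
  | Not_a_dependency     (* refused without consulting the file system *)
  | Missing              (* declared dependency with no cmi found *)

(* [declared_deps]: the set of dependencies given by the build system.
   [find_cmi]: abstracts the search in the include directories. *)
let lookup ~declared_deps ~find_cmi name =
  if not (StringSet.mem name declared_deps) then Not_a_dependency
  else
    match find_cmi name with
    | Some path -> Loaded path
    | None -> Missing

let () =
  let declared_deps = StringSet.of_list [ "Foo" ] in
  (* Bar has a cmi on disk but is not a declared dependency. *)
  let find_cmi = function
    | "Foo" -> Some "lib/foo.cmi"
    | "Bar" -> Some "lib/bar.cmi"
    | _ -> None
  in
  assert (lookup ~declared_deps ~find_cmi "Foo" = Loaded "lib/foo.cmi");
  assert (lookup ~declared_deps ~find_cmi "Bar" = Not_a_dependency)
```

Because non-dependencies are rejected before the file system is consulted, the presence or absence of bar.cmi cannot influence the result, which is the reproducibility gain; the trade-off is that the environment stops being valid when the dependency set changes.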
W.r.t
Note that propositions in that direction have already been suggested on caml-devel on the 9th and 10th of last October by Alain and myself. EDIT: Actually, that is where this whole discussion about the reproducibility started. |
I went back to have a look; if I understand correctly your suggestion was basically to fulfill feature wish #7080, and this later turned into the PR #9056 allowing to pass |
Closes #9307
Since the refactoring of the initial environment in #2041, the environment summary contains the list of the available cmis in the load paths. This creates an issue for reproducible builds, since those summaries end up snapshotting the state of the file system at the time of the build, which is unstable for parallel builds.
However, after #2235, the presence of those cmis in the environment is only an optimisation once the standard library has been opened: if a module is missing in the environment, the type checker will look for a corresponding cmi on the file system even if no cmi has been recorded with this name.
Since this is only an optimisation, the present PR proposes to solve the reproducibility issue by not storing at all the present cmi files in the environment summary. The second commit in this PR goes a little further and removes the Env_persistent entries from the summary type because they were essentially unused.
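The fallback lookup mentioned in this description can be illustrated with a toy model. The environment is an association list and `find_cmi` stands in for the search on the file system; both, like the `origin` type, are assumptions made for the sketch rather than the type checker's real data structures.

```ocaml
type origin =
  | In_env              (* the name was already bound in the environment *)
  | From_disk of string (* found by searching for a cmi file *)

(* Look in the environment first; fall back to the file system otherwise. *)
let find_module ~env ~find_cmi name =
  if List.mem_assoc name env then Some In_env
  else
    match find_cmi name with
    | Some path -> Some (From_disk path)
    | None -> None

let () =
  let find_cmi = function "Foo" -> Some "lib/foo.cmi" | _ -> None in
  (* Foo resolves whether or not it was recorded in the environment. *)
  assert (find_module ~env:[ ("Foo", ()) ] ~find_cmi "Foo" = Some In_env);
  assert (find_module ~env:[] ~find_cmi "Foo" = Some (From_disk "lib/foo.cmi"))
```

In this model, dropping an entry from the environment changes how the module is found but not whether it is found, as long as nothing else bound to the same name sits in the environment. The shadowing cases raised in the conversation are exactly the ones where that condition fails.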