Reproducible env summaries #9345

Octachron · 2020-03-04T16:10:58Z

Since the refactoring of the initial environment, in #2041, the environment summary contains the list of the available cmis in the load paths. This creates an issue for reproducible builds since those summaries end up snapshoting the state of the file system at the time of the build which is unstable for parallel builds.

However, after #2235, the presence of those cmis in the environment is only an optimization once the standard library has been opened: if a module is missing in the environment, the type checker will look for a corresponding cmi on the file system even if no cmi have been recorded with this name.

Since this is only an optimisation, the present PR proposes to solve the reproducibility issue by not storing at all the present cmi files in the environment summary. The second commit in this PR goes a little further and removes the Env_persistent entries from the summary type because they were essentially unused.

lpw25 · 2020-03-04T17:56:30Z

I don't think having persistent entries in the environment can be described as just an optimisation. They allow persistent modules to shadow non-persistent modules in the environment -- which IIRC was their original motivation. This information still needs to be reflected in the summary.

I think a slightly subtler change is needed: persistent modules should only be added to the summary if they are shadowing a non-persistent module.

Octachron · 2020-03-06T12:57:28Z

The initially opened module are also persistent modules, so we probably also want to record implicit persistent modules that shadow those explicit persistent modules. This implies that we would record the cmis present of the same file systems that shadows the initially opened modules. But since those cmis may affect the compilation, this might be a reasonable compromise?

lpw25 · 2020-03-06T13:26:16Z

I'm not sure I quite follow your previous comment, but it sounds like you are suggesting the same thing I said.

I think my criteria -- "persistent modules should only be added to the summary if they are shadowing a non-persistent module" -- is sufficient to capture all cases that need to be recorded. My reasoning is as follows:
i) If a persistent module does not shadow anything then it can be removed without affecting lookup, since modules that are not present in the environment are assumed to be persistent
ii) It is not possible for a persistent module to shadow a persistent module because they are entirely defined by their name.
iii) If a persistent module shadows a non-persistent module then it will be present in the summary by my criteria

Octachron · 2020-03-06T13:58:36Z

I had a slightly more complex criteria in mind (with the idea of always adding persistent modules that were explicitly requested), but your criteria is probably better.

lpw25 · 2020-03-06T14:02:01Z

A property I like about my criteria is that it is not dependent on the contents of the module, only on the initial environment. That means that you can take the environments from a .cmt file and correctly query them about any module -- not just the ones that appeared in the original ml file.

xavierleroy · 2020-04-09T08:41:30Z

Where are we with this PR? I think the problem it addresses (recording all visibile .cmi files) is an embarrassment and should be fixed by 4.11.

Octachron · 2020-04-09T13:07:42Z

The current version implements the proposition of @lpw25 , and records in the summary only persistent modules that took precedence over a non-persistent module.

There is still a problematic corner case however, if we have two incomparable (for dependencies) modules A and B that both shadow a non-persistent module in the initial environment. Then A can be correctly compiled with both the B-shadowed or non-shadowed initial environment, and this uncertainty will be reflected in the cmt file. However, this reproducibility issue reflects an existing deeper issue. Moreover, this only affects libraries redefining stdlib modules; and there is a work-around for those: explicitly define a total order between all redefined modules.

If we are fine for now with the limitation above, and @lpw25 agrees with the implementation, I think the PR can be merged for 4.11 .

gasche · 2020-04-09T15:56:05Z

I don't understand what "and this uncertainty will be reflected in the cmt file" means, can you give an example and explains what happens there and why it is an issue?

Octachron · 2020-04-10T12:46:25Z

Consider a project that contains two empty persistent modules

List
Option

and is compiled without -nopervasives nor -nostdlib . Thus the two persistent modules shadow the List and Option modules of the standard library.

Then depending on the presence of option.cmi on the file system, the module List can be compiled with an initial environment in which the Option name refers to either the persistent module Option, or the standard library alias Option. This difference of initial environments will be reflected in the environment summary of the cmt files.

Which means that building in parallel the list and option compilation units may result in three outputs: the compilation of Lists has seen option.cmi, the compilation of Option has seen list.cmi, or the two compilations did not observe the other compilation unit cmis. In other words, libraries that shadows more than one standard library module will need to take some care to achieve reproducible builds. (by using -nopervasives for instance)

gasche · 2020-04-10T15:30:05Z

Sorry for going on with the naive question, but is this really stdlib specific? If I depend on a findlib package that exports some modules Foo and Bar (which I don't use; plus others I use), and I also have two modules Foo and Bar in my project, won't the same issue happen as well?

(Assuming that the library's Foo and Bar are included in the library archive but are not -linkall and not effectful, they would not be required at link-time so the resulting program would work fine.)

Octachron · 2020-04-14T14:05:48Z

This is really specific to the initially opened library (which can only be the Stdlib currently ). Currently, the initial environment is build in five stages:

start with the predefined environment
add the persistent modules from the initially opened library
open the main module of the initially opened library
add the persistent modules from all other directories in the path
open the initially opened modules provided by the -open flag

Before this PR, we were recording all persistent modules from stage 2 and 4 in the environment summary.

After this PR, we are recording only the persistent modules introduced in stage 4, that are shadowing the submodules of the main module of the initial opened library added to the environment in phase 3.

lpw25

LGTM

gasche · 2020-04-18T12:55:36Z

@Octachron thanks for the explanation (this is really tricky!). I have another beginner question: why are we doing step (4) at all, which introduces this non-reproducibility? Why not record only the persistent modules (of the include path) that are required to type-check the program (starting with the module lookup of -open arguments at step 5)?

We don't want to record the state of the file system at the start of the compilation in the compiled files. Consequently, we only add persistent modules to the env summary if they have an observable action on the initial environment. This is only the case if they shadow a non-persistent module of the initially opened library (which can only be Stdlib currently).

Octachron · 2020-04-20T08:35:26Z

The step 4 is used to shadow modules from the initially opened library.
Without this step, since there is already a resolution (e.g. Stdlib.Option) in scope for those modules, we would stop at this choice and never look up to the persistent modules in the path (e.g. Option).

This is at least partially a backward compatibility feature with the non-prefixed library standard library (where persistent modules in the path could shadow the persistent modules from the standard library).

It is also necessary to make the Bigarray library usable since the only effect of the library with recent version of the compiler is to shadow a stdlib module. However, the library module is nowadays the same as the one from the stdlib.

gasche · 2020-04-20T09:47:56Z

I will need more time to understand @Octachron's new explanation and I suspect that more newbie questions will follow. Taking a step back: I don't want my idle curiosity to slow down the integration of this necessary fix in the compiler. If you and Leo agree it's good, please merge once the CI agrees, we can keep discussing this.

But I am a frightened by the amount of complexity in this corner of our language. It shouldn't take a week of conversation to understand what is the initial environment of a source file. From a blurred distance it sounds like we are doing this wrong, and we should change to a simpler semantics -- which could very well take work to be found, or possibly have some downsides compared to the current too-complex-to-see-the-problems one.

gasche · 2020-04-20T15:05:48Z

In #2041, Jérémie gave the following pseudocode for initial environment construction when invoked with -I d1..dn -open M1..Mm (no -no-stdlib).

env <- empty

env <- add_dir(env, stdlib_dir)
env <- open(env, Stdlib)

env <- add_dir(env, d1)
...
env <- add_dir(env, dn)

env <- open(env, M1)
...
env <- open(env, Mm)

If I understand correctly, before the current PR the add_dir(env, d1..dn) would result in recording the list of all .cmi visible in these directories (including the current one, implicitly added last), bad for reproducibility.

The problem with not recording anything is that then the lookup semantics changes, as a name looked-up in the summary may resolve to a shadowed version coming from add_dir(env, stdlib_dir) or open(env, Stdlib), when a module in d1..dn should have been visible. So in this PR you propose to record exactly the d1..dn units that would shadow one in dir(stdlib_dir) or open(Stdlib).

I have two questions:

Reading the implementation now, you are not recording those that shadow a persistent unit, so you are not recording those that shadow dir(stdlib_dir), is that correct? This corresponds to assuming that all the modules there are also submodules of Stdlib, so looking for shadowing in open(StdliB) is enough. But then what about the camlInternal* modules, that are not aliased in Stdlib?
How is the add_dir(stdlib_dir) part processed, does it run with the same shadowing-checking logic? If I understand correctly, it does. That is correct because at this point in time there are no modules at all in the current environment; so the stdlib's persistent modules are not recorded in the environment anymore.

Octachron · 2020-04-20T15:58:05Z

For your first point, indeed we are not recording the shadowing of stdlib's persistent modules.
However, for persistent modules, the compiler relies on the assumption that doing a lookup in the include path d_n, ..., d_1 and in the environment after the List.iter (add_dir env [d_1,...,d_n]) is equivalent. Thus, if we have the environment summary and the include path, we can reproduce any environment lookups. In other words, persistent modules are uniquely identified by their name and the include path. Consequently, only the shadowing of submodules of the stdlib needs to be recorded.

For the second point, yes indeed the add_dir stdlib_dir is processed with the same logic. Thus, we would record the shadowing of predefined non-persistent modules by the standard library if there were such things.

gasche · 2020-04-20T16:08:01Z

The compiler relies on the assumption that doing a lookup in the include path d_n, ..., d_1 and in the environment after the List.iter (add_dir env [d_1,...,d_n]) is equivalent.

In other words, the compiler assumes that persistent unit names are unique (in the include directories).
That makes sense, thanks.

I have another question: how does the content of the environment influence .cma files as reported by the original bug #9307 ? Looking at the code, I found that environment summariess were saved in .cmt files (this is what #9039 adresses), but I don't see any sign of them being stored in .cma or .cmi files. My best guess is that they affect the crc computation, but do you know how?

gasche · 2020-04-20T16:24:16Z

I believe that the API we currently use to include a directory is wrong.

When we are asked to open an include directory, we look at all the persistent modules in it and we add them one by one -- in the current environment and in the summary. This PR is trying to help reproducibility (in an incomplete way) by forgetting strategically to add some of those persistent modules in the summary.

Instead we should have a higher-level environment action to add an include directory, which would be stored in the summary. From this summary we can rebuild an environment (assuming, as mentioned before, that persistent module names are unique) by redoing the persistent-modules lookup for a directory at rebuild time. (The non-summary part would eagerly list the persistent modules and add them to the environment tables, to correctly implement shadowing without using a stacked/layered representation.)

lpw25 · 2020-04-20T18:18:03Z

I think that makes the assumption that include directories will be in the same location when reading .cmt files. This is not likely to be the case for directories that are within the same library/package. They will be within the build directories when creating the .cmt file but within the installation directories when reading it.

hhugo · 2020-04-20T19:46:06Z

but I don't see any sign of them being stored in .cma or .cmi files

I believe the issue is that the environment end up in to debug info of the cmo/cma

gasche · 2020-04-20T20:11:40Z

@hhugo Thanks! It is in the field ev_typenv: Env.summary in Instruct.debug_event.

@lpw25 that's a good remark, and it certainly makes the idea more difficult to use, but I'm not convinced it is dead yet. A few orthogonal and slightly rambly remarks:

Now that I understand the problem domain, I like @alainfrisch's approach (of trimming the environment to keep only the imported compilation units) better. It is more deterministic than the one in this PR (the small source of incompleteness identified by @Octachron is gone), and it results in an environment that is more meaningful -- it contains exactly all the persistent units required to build the environment used to type-check the .cmt file. (It has the limitation you point out that you may not be able to check other related expressions without providing extra -I flags -- note that a hybrid approach where we keep include directories and also compilation units may work best -- but at least it is simple to define and build-order-independent. It doesn't take a week to understand what it does and why this is the right thing.)
The libraries within the same library/package are precisely those for which we experience reproducibility issues. Now that the compiler supports it, we should get build systems to pass .cmi files directly within the current project, instead of include directories, for those, based on the dependency information they have already computed. Then we wouldn't need to mess with the initial environment as in the present PR.
If we know where the current library is going to be installed, we could just add it as an extra -I flag, and then the problem you point out goes away. This is even easier in an OCAMLPATH world (Towards an OCAMLPATH #8898) where the -I flag is likely to be -I +<library-name>.
In fact, if the current library/package/project is built with relative paths, then having those in the environment summary may be enough to express the environment of an installed .cmt or .cmi file, provided the relative paths from that artefact for its local neighbors is unchanged in the installed location (... which I think is broken by the Windows-performance trick of pushing all the .cmi together in a single repository).

lpw25 · 2020-04-21T07:05:15Z

I like @alainfrisch's approach (of trimming the environment to keep only the imported compilation units) better. It is more deterministic than the one in this PR (the small source of incompleteness identified by @Octachron is gone)

I disagree with this approach for two reasons:

It is actually less "deterministic" than the approach in this PR -- it depends on the whims of type inference. Precisely which cmi files must be read to type check a file is not easy to derive from just looking at the file.
It means that the environments in cmt files and debug events are only partially correct. You cannot correctly perform a lookup with them that was not performed when type checking the current unit. Thanks to reason 1 it is difficult to know which lookups are allowed, and again you are mostly at the whims of the type inference for which lookups are performed. Something simple like reading a cmt file and trying to print the type of a node in the ast could fairly easily produce in an incorrect result.

gasche · 2020-04-21T07:14:31Z

That makes sense, thanks. I had trouble understanding point (2) at first, given that we can always lookup a module that is not recorded in the environment. The problem comes, again, from persistent units that shadow submodules. So you could in theory build an example where a persistent unit shadows a submodule, but it is not "used" enough by the type-checker to be required, and so you lose track of it with Alain's approach. Is that right?

lpw25 · 2020-04-21T07:42:20Z

Is that right?

Yeah, that's the case I was thinking of

gasche · 2020-04-21T08:11:15Z

Note that there is a tension between a summary being future-proof (being able to type other things than the module it was given) and reproducible. To be able to type-check as much other expressions as possible, you want to detect all persistent modules on the system (at least in the include directories); to be reproducible you want to depend as little as possible on the filesystem state.

One possible long-term approach would be to expect to be given a precise set of dependencies, and in the include directories only record those dependencies -- and refuse to load the components on non-dependencies. This helps reproducibility, as non-dependencies are not stored at all, but it hurts future-proofiness, as the environment is not valid anymore if the dependencies change. This is what we can already have if we require local-project persistent modules to be passed as .cmi files directly, instead of using include directories. (Well maybe we need an explicit option to remove the current working directory from the include directories for this to work well.)

Note that this approach is close to Alain's strategy, instead that we ask the build system to compute "required modules" instead of the type-checker. (It can be combined with the "trick" of only remembering shadowing persistent modules in the environment, but it does not need to anymore.)

If we wanted to be able to compute dependencies on the fly during type-checking (as Alain has suggested using something like strace to detect .cmi lookups), then we need another long-term approach, which is again to rely on the type-checker to compute the dependencies of a compilation unit, and trim those after the fact, as Alain's approach is doing. Then the "long-term" aspect is to ensure that this set of "type-checking dependencies" is clearly defined and not too hard to reason about, which you say is not currently the case (this sounds like a not-good state for a type system to be in).

trefis · 2020-04-21T08:18:47Z

W.r.t

One possible long-term approach would be to expect to be given a precise set of dependencies, and in the include directories only record those dependencies

Note that propositions in that direction have already been suggested on caml-devel on the 9th and 10th of last October by Alain and myself.

EDIT: Actually, that is where this whole discussion about the reproducibility started.

gasche · 2020-04-21T08:42:44Z

I went back to have a look; if I understand correctly your suggestion was basically to fulfill feature wish #7080, and this later turned into the PR #9056 allowing to pass -I foo.cmi directly (or maybe just foo.cmi, depending on who wins an exciting user-interface debate). I mentioned this feature above (I believe that we should use only this feature in the local project if we want reproducibility), and I actually assumed that the corresponding PR had been merged. I am just now realizing that it was not, do you konw what is the status there?

Octachron force-pushed the reproducible_env branch from 78d5c5d to 3ccc623 Compare March 6, 2020 14:00

lpw25 approved these changes Apr 18, 2020

View reviewed changes

Octachron force-pushed the reproducible_env branch from 3ccc623 to c041b03 Compare April 20, 2020 08:15

Octachron merged commit c5a1f91 into ocaml:trunk Apr 20, 2020

gasche mentioned this pull request May 15, 2020

[RFC] Add a single file to load path #9056

Open

aalekseyev mentioned this pull request Oct 15, 2020

Dune builds are not always reproducible with multiple cores ocaml/dune#3863

Closed

lpw25 mentioned this pull request Oct 26, 2020

Fix reproducibility for -no-alias-deps #9991

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducible env summaries #9345

Reproducible env summaries #9345

Octachron commented Mar 4, 2020 •

edited by gasche

lpw25 commented Mar 4, 2020

Octachron commented Mar 6, 2020

lpw25 commented Mar 6, 2020

Octachron commented Mar 6, 2020

lpw25 commented Mar 6, 2020

xavierleroy commented Apr 9, 2020

Octachron commented Apr 9, 2020

gasche commented Apr 9, 2020

Octachron commented Apr 10, 2020

gasche commented Apr 10, 2020

Octachron commented Apr 14, 2020

lpw25 left a comment

gasche commented Apr 18, 2020

Octachron commented Apr 20, 2020

gasche commented Apr 20, 2020 •

edited

gasche commented Apr 20, 2020

Octachron commented Apr 20, 2020

gasche commented Apr 20, 2020 •

edited

gasche commented Apr 20, 2020 •

edited

lpw25 commented Apr 20, 2020

hhugo commented Apr 20, 2020

gasche commented Apr 20, 2020 •

edited

lpw25 commented Apr 21, 2020 •

edited

gasche commented Apr 21, 2020

lpw25 commented Apr 21, 2020

gasche commented Apr 21, 2020

trefis commented Apr 21, 2020 •

edited

gasche commented Apr 21, 2020

Reproducible env summaries #9345

Reproducible env summaries #9345

Conversation

Octachron commented Mar 4, 2020 • edited by gasche

lpw25 commented Mar 4, 2020

Octachron commented Mar 6, 2020

lpw25 commented Mar 6, 2020

Octachron commented Mar 6, 2020

lpw25 commented Mar 6, 2020

xavierleroy commented Apr 9, 2020

Octachron commented Apr 9, 2020

gasche commented Apr 9, 2020

Octachron commented Apr 10, 2020

gasche commented Apr 10, 2020

Octachron commented Apr 14, 2020

lpw25 left a comment

Choose a reason for hiding this comment

gasche commented Apr 18, 2020

Octachron commented Apr 20, 2020

gasche commented Apr 20, 2020 • edited

gasche commented Apr 20, 2020

Octachron commented Apr 20, 2020

gasche commented Apr 20, 2020 • edited

gasche commented Apr 20, 2020 • edited

lpw25 commented Apr 20, 2020

hhugo commented Apr 20, 2020

gasche commented Apr 20, 2020 • edited

lpw25 commented Apr 21, 2020 • edited

gasche commented Apr 21, 2020

lpw25 commented Apr 21, 2020

gasche commented Apr 21, 2020

trefis commented Apr 21, 2020 • edited

gasche commented Apr 21, 2020

Octachron commented Mar 4, 2020 •

edited by gasche

gasche commented Apr 20, 2020 •

edited

gasche commented Apr 20, 2020 •

edited

gasche commented Apr 20, 2020 •

edited

gasche commented Apr 20, 2020 •

edited

lpw25 commented Apr 21, 2020 •

edited

trefis commented Apr 21, 2020 •

edited