Parallel build with dune is not reproducible #9152

Closed
xavierleroy opened this issue Nov 12, 2023 · 17 comments

Comments

@xavierleroy

xavierleroy commented Nov 12, 2023

See the report at ocaml/camlp-streams#9.

Help with understanding the issue would be appreciated. This is a trivial project with 2 OCaml source files, so it's hard to see what could be wrong in the Dune files. Thanks!

@ejgallego
Collaborator

ejgallego commented Nov 12, 2023

See also coq/coq#17207, and the discussion about Marshal.

@ejgallego
Collaborator

ejgallego commented Nov 12, 2023

Most likely is that ocamlc is somehow not fully deterministic (for example w.r.t. the state of the file system, what dirs/files does it scan in -modules mode?).

If that's the case, dune could make the rules more deterministic, but depending on what the exact problem is, that could lead to large losses of parallelism.

@xavierleroy
Author

Most likely is that ocamlc is somehow not fully deterministic

Evidence needed. We've worked pretty hard to make builds reproducible. Is there any way to understand what's going on here, besides us replacing Dune with a shell script?

@ejgallego
Collaborator

More discussion about Coq problems in coq/coq#11229

@ejgallego
Collaborator

Most likely is that ocamlc is somehow not fully deterministic

Evidence needed.

Dune does little that is special here; it will just call ocamlopt twice. Do you have an idea of how Dune could alter the output of ocamlopt?

We have similar problems in Coq, and they were due to Marshal, though we are still confused as to what is really going on. I understand that OCaml still uses Marshal to write .cm* files, right?

So this is where I would look next.

We've worked pretty hard to make builds reproducible. Is there any way to understand what's going on here, besides us replacing Dune with a shell script?

For your simple project you can indeed just take _build/log and reproduce these commands in different parallel setups.

Dune tends to surface this kind of problem more often than make does, because it sets -j N automatically. That rarely happens in the make world, where users and devs often forget to pass -j or to add the corresponding logic for detecting CPU cores in CI.

@xavierleroy
Author

From a distance, the Coq problems seem to come from its hash-consing mechanism. The OCaml compilers don't do any hash consing. It is true that they use marshaling, and that marshaling is sensitive to sharing, but it looks like our sharing is reproducible, as we haven't got any report of non-reproducibility recently, except for this specific library.
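
As an illustration of what "sensitive to sharing" means for a serializer, here is a minimal sketch in Python. It uses Python's pickle module purely as an analogy and says nothing about how OCaml's Marshal lays out its bytes; the variable names are made up for the example. Two values that compare equal serialize to different bytes when one of them contains the same object twice and the other contains an equal copy.

```python
import pickle

inner = [1, 2]
shared = [inner, inner]          # both slots point at the same list object
unshared = [inner, list(inner)]  # second slot is an equal but distinct copy

# The two values are structurally equal...
assert shared == unshared

# ...but the serialized bytes differ: the repeated object is written once
# and then referenced, while the distinct copy is written out in full.
shared_bytes = pickle.dumps(shared, protocol=4)
unshared_bytes = pickle.dumps(unshared, protocol=4)
print(shared_bytes == unshared_bytes)  # prints False
```

Both byte strings load back to equal values, so a byte-level comparison of serialized files can report a difference even when the underlying data is the same.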

The library in question is trivial but uses a somewhat unusual (to me at least) Dune file, with the :standard modifier. Could someone who knows Dune inside out check that this is not a problem?

@ejgallego
Collaborator

One problem in Coq was due to hash-consing, but we still have the bug and we couldn't trace it back to hash-consing. Trying with No_sharing is unfortunately not an option for Coq.

The library in question is trivial but uses a somewhat unusual (to me at least) Dune file, with the :standard modifier. Could someone who knows Dune inside out check that this is not a problem?

That's not a problem as far as I can see, :standard just means "the actual set of flags", so you can write stuff like

  (flags :standard \ -O2)

for example.

@ejgallego
Collaborator

ejgallego commented Nov 12, 2023

IMHO the best next step is to get the two _build/log files for the differing runs (cc @bmwiedemann), then reproduce without Dune.
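
When reproducing outside Dune, it helps to know exactly which output files differ between two runs. The following is a minimal sketch, assuming the two build trees have been copied to separate directories; the function names `file_digests` and `differing_files` are hypothetical helpers, not part of Dune or OCaml.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def differing_files(root_a, root_b):
    """List files whose contents differ, or that exist in only one tree."""
    a, b = file_digests(root_a), file_digests(root_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Calling `differing_files("run1/_build", "run2/_build")` (both paths are placeholders) would return the relative paths to inspect further.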

@bmwiedemann

I extracted the two _build/logs and they differ thusly:

--- dune.logs/log.1     2039-12-15 17:21:00.276666666 +0100
+++ dune.logs/log.2     2023-11-13 04:04:12.463333332 +0100
@@ -2,8 +2,8 @@
 # OCAMLPARAM: unset
 # Shared cache: disabled
 # Workspace root: /home/abuild/rpmbuild/BUILD/ocaml-camlp-streams-5.0.1
-# Auto-detected concurrency: 1
-$ /usr/bin/ocamlc.opt -config > /tmp/dune_f31f1b_output
+# Auto-detected concurrency: 4
+$ /usr/bin/ocamlc.opt -config > /tmp/dune_d2fa9c_output
 # Dune context:
 #  { name = "default"
 #  ; kind = "default"
@@ -126,15 +126,15 @@
 $ /usr/bin/ocaml -I +compiler-libs /home/abuild/rpmbuild/BUILD/ocaml-camlp-streams-5.0.1/_build/.dune/default/dune.ml
 $ (cd _build/default && /usr/bin/ocamlc.opt -w -40 -w -3 -g -bin-annot -I test/.stream_stdlib.objs/byte -no-alias-deps -o test/.stream_stdlib.objs/byte/stream_stdlib.cmo -c -impl test/stream_stdlib.ml)
 $ (cd _build/default && /usr/bin/ocamlc.opt -w -40 -g -bin-annot -I test/.stream_camlp_streams.objs/byte -I .camlp_streams.objs/byte -no-alias-deps -o test/.stream_camlp_streams.objs/byte/stream_camlp_streams.cmo -c -impl test/stream_camlp_streams.ml)
-$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -w -3 -g -I test/.stream_stdlib.objs/byte -I test/.stream_stdlib.objs/native -intf-suffix .ml -no-alias-deps -o test/.stream_stdlib.objs/native/stream_stdlib.cmx -c -impl test/stream_stdlib.ml)
 $ (cd _build/default && /usr/bin/ocamlc.opt -w -40 -w -3 -g -bin-annot -I test/.linking.eobjs/byte -I .camlp_streams.objs/byte -I test/.stream_stdlib.objs/byte -no-alias-deps -o test/.linking.eobjs/byte/dune__exe__Linking.cmo -c -impl test/linking.ml)
-$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -I test/.stream_camlp_streams.objs/byte -I test/.stream_camlp_streams.objs/native -I .camlp_streams.objs/byte -I .camlp_streams.objs/native -intf-suffix .ml -no-alias-deps -o test/.stream_camlp_streams.objs/native/stream_camlp_streams.cmx -c -impl test/stream_camlp_streams.ml)
+$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -w -3 -g -I test/.stream_stdlib.objs/byte -I test/.stream_stdlib.objs/native -intf-suffix .ml -no-alias-deps -o test/.stream_stdlib.objs/native/stream_stdlib.cmx -c -impl test/stream_stdlib.ml)
 $ (cd _build/default && /usr/bin/ocamlc.opt -w -40 -g -bin-annot -I test/.equality.eobjs/byte -I .camlp_streams.objs/byte -I test/.stream_camlp_streams.objs/byte -I test/.stream_stdlib.objs/byte -no-alias-deps -o test/.equality.eobjs/byte/dune__exe__Equality.cmo -c -impl test/equality.ml)
+$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -I test/.stream_camlp_streams.objs/byte -I test/.stream_camlp_streams.objs/native -I .camlp_streams.objs/byte -I .camlp_streams.objs/native -intf-suffix .ml -no-alias-deps -o test/.stream_camlp_streams.objs/native/stream_camlp_streams.cmx -c -impl test/stream_camlp_streams.ml)
 $ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -w -3 -g -a -o test/stream_stdlib.cmxa test/.stream_stdlib.objs/native/stream_stdlib.cmx)
 $ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -w -3 -g -I test/.linking.eobjs/byte -I test/.linking.eobjs/native -I .camlp_streams.objs/byte -I .camlp_streams.objs/native -I test/.stream_stdlib.objs/byte -I test/.stream_stdlib.objs/native -intf-suffix .ml -no-alias-deps -o test/.linking.eobjs/native/dune__exe__Linking.cmx -c -impl test/linking.ml)
 $ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -a -o test/stream_camlp_streams.cmxa test/.stream_camlp_streams.objs/native/stream_camlp_streams.cmx)
 $ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -I test/.equality.eobjs/byte -I test/.equality.eobjs/native -I .camlp_streams.objs/byte -I .camlp_streams.objs/native -I test/.stream_camlp_streams.objs/byte -I test/.stream_camlp_streams.objs/native -I test/.stream_stdlib.objs/byte -I test/.stream_stdlib.objs/native -intf-suffix .ml -no-alias-deps -o test/.equality.eobjs/native/dune__exe__Equality.cmx -c -impl test/equality.ml)
 $ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -w -3 -g -o test/linking.exe camlp_streams.cmxa test/stream_stdlib.cmxa test/.linking.eobjs/native/dune__exe__Linking.cmx)
-$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -o test/equality.exe test/stream_stdlib.cmxa camlp_streams.cmxa test/stream_camlp_streams.cmxa test/.equality.eobjs/native/dune__exe__Equality.cmx)
 $ (cd _build/default/test && ./linking.exe) > _build/default/test/issue4.output
+$ (cd _build/default && /usr/bin/ocamlopt.opt -w -40 -g -o test/equality.exe test/stream_stdlib.cmxa camlp_streams.cmxa test/stream_camlp_streams.cmxa test/.equality.eobjs/native/dune__exe__Equality.cmx)
 $ (cd _build/default/test && ./equality.exe) > _build/default/test/equality.output

Which of these produce stream.cmti? And how to best re-run from the log? I tried tail -15 _build/log | cut -c 2- | bash -x without much success.
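
One possible way to re-run from the log is sketched below. It assumes the format visible in the excerpt above: in the log file itself, metadata lines start with `# ` and each executed command sits on one line starting with `$ `. The helpers `extract_commands` and `replay` are hypothetical names introduced for this sketch.

```python
import subprocess


def extract_commands(log_text):
    """Return the shell commands recorded in a Dune-style build log."""
    commands = []
    for line in log_text.splitlines():
        # Assumption: command lines start with "$ "; everything else
        # (comments, context dumps, command output) is skipped.
        if line.startswith("$ "):
            commands.append(line[2:])
    return commands


def replay(commands, cwd):
    """Run the commands one at a time, in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, shell=True, check=True, cwd=cwd)
```

The commands use paths relative to the workspace root, so `cwd` should be that directory. Since `replay` is strictly sequential, reordering the list before calling it is one way to imitate the different schedules that a parallel build may pick.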

@nojb
Collaborator

nojb commented Nov 13, 2023

Which of these produce stream.cmti?

I don't think any of the lines that are shown in your diff are actually related to stream.cmti: they only involve files in the test directory...

@ejgallego
Collaborator

Indeed, the full logs are needed to reverse-engineer the computation graph and see what ran around the emission of stream.cmti, which might make it racy.

@hhugo
Collaborator

hhugo commented Nov 13, 2023

The following fields of the cmt differ:

  • cmt_annots
  • cmt_initial_env

@ejgallego
Collaborator

@hhugo would it be possible to obtain the diff?

@xavierleroy
Author

That's not a problem as far as I can see, :standard just means "the actual set of flags",

You're right, I was confused, sorry about that. I had the impression that the build differs whether OCaml is < 4.14, = 4.14, or >= 5.0, and wanted to understand this better, but it's probably a dead end.

following field of the cmt differ

OK, so it's really two different data structures that are being marshaled, not just two equal structures that share differently. One possibility is that stream.cmti is generated twice by two independent commands (e.g. an ocamlc invocation and an ocamlopt invocation), and for some reason the two files differ, and one or the other wins the race. This should show up in the build logs, though.
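
A rough way to look for such a race in a build log is to group the logged commands by the file they write. The sketch below is a heuristic under stated assumptions: it only sees targets named explicitly with `-o`, it ignores the `cd` directory at the start of each command, and it does not capture implicit outputs such as the .cmt/.cmti files produced by `-bin-annot`. The function names are hypothetical.

```python
import shlex
from collections import defaultdict


def output_targets(command):
    """Return the paths that follow each -o flag in one logged command."""
    # Logged commands look like "(cd DIR && PROG ARGS)"; drop the outer
    # parentheses so that shlex sees plain words.
    tokens = shlex.split(command.strip().strip("()"))
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == "-o"]


def racing_targets(commands):
    """Map each -o target written by more than one command to those commands."""
    writers = defaultdict(list)
    for cmd in commands:
        for target in output_targets(cmd):
            writers[target].append(cmd)
    return {target: cmds for target, cmds in writers.items() if len(cmds) > 1}
```

An empty result does not rule out a race on an implicit output, so the commands that touch the same source file still need to be read by hand.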

@nojb
Collaborator

nojb commented Nov 15, 2023

I had the impression that the build differs whether OCaml is < 4.14, = 4.14, or >= 5.0, and wanted to understand this better

Indeed, the build differs. For 4.14 (the version used in ocaml/camlp-streams#9), the source file stream.ml contains include Stream, stream.mli contains include module type of struct include Stream end and similarly for genlex.ml and genlex.mli.

For >= 5.0, the actual sources in the project repository are used instead of the includes.

Incidentally, I was able to reproduce the issue with Make, see details at ocaml/camlp-streams#9 (comment)

@xavierleroy removed the bug label Nov 15, 2023
@xavierleroy
Author

It looks like Dune is innocent this time :-) Let me close this report while we look at @nojb's make-based repro.

@ejgallego closed this as not planned Nov 15, 2023
@hhugo
Collaborator

hhugo commented Nov 15, 2023

@hhugo would it be possible to obtain the diff?

Initial_env has an additional "Persistent" constructor on top of the summary in one of the files.
I didn't look at cmt_annots.
