Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] custom runtimes & reproducible builds #1845

Merged
merged 11 commits into from Jun 27, 2018

Conversation

Projects
None yet
7 participants
@xclerc
Copy link
Contributor

commented Jun 19, 2018

When using the -custom option of ocamlc, a C file containing
the information about primitives is generated. The file name
comes from a call to Filename.temp_file, meaning that successive
invocations will generate different file names.

The problem, with respect to reproducible builds, is that the file
name ultimately appears in the produced binary, making the build
non-reproducible.

This PR introduces -dcamlprimc, a command-line switch similar
to ocamlopt's -dstartup, to keep the generated file (whose name
is then derived from the specified compiler output).

@pmetzger

This comment has been minimized.

Copy link
Member

commented Jun 19, 2018

I like the idea. However, why "camlprimc" as a name? I find it kind of confusing...

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 19, 2018

Basically, lack of imagination... the name of the temporary file being "camlprimXYZ.c".

Suggestions are welcome!

@shindere

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2018

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2018

I can see some use for keeping the temporary file, for debugging purpose, but if you are only interested in reproducible builds, why do you conflate the idea of using a stable name with that of keeping the temporary file around?

I suspect that, similarly, you might want a way to use a stable name for the "startup" object file, without necessarily keeping the file itself after compilation.

@pmetzger

This comment has been minimized.

Copy link
Member

commented Jun 19, 2018

-dstabletmps ?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2018

Would it be risky to always use a name derived from the target?

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 19, 2018

@alainfrisch I understand (and partially share) your concerns.

When you suggest to always derive the file name from the
target, would the file still be created in a temporary directory?
(if so, the risk of name clash / race condition seems to be non
negligible)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2018

I was thinking indeed about creating the file in the target folder. Alternatively, one could create a temporary directory, and create files with stable names in it -- or does the full path appear in the object file?

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 19, 2018

As far as gcc is concerned, only the "basename"
appears in the produced binary. Interestingly enough,
my first implementation was actually an environment
variable (whose name was inspired by this PR) used
to control the path of the file:

diff --git a/bytecomp/bytelink.ml b/bytecomp/bytelink.ml
index 05ff3d674..59f98ef80 100644
--- a/bytecomp/bytelink.ml
+++ b/bytecomp/bytelink.ml
@@ -36,7 +36,12 @@ type link_action =
     Link_object of string * compilation_unit
       (* Name of .cmo file and descriptor of the unit *)
   | Link_archive of string * compilation_unit list
-      (* Name of .cma file and descriptors of the units to be linked. *)
+  (* Name of .cma file and descriptors of the units to be linked. *)
+
+let temp_file prefix suffix =
+  match Sys.getenv_opt "BUILD_PATH_TEMP" with
+  | None -> Filename.temp_file prefix suffix
+  | Some path -> Filename.concat path (prefix ^ suffix)
 
 (* Add C objects and options from a library descriptor *)
 (* Ignore them if -noautolink or -use-runtime or -use-prim was given *)
@@ -581,8 +586,8 @@ let link ppf objfiles output_name =
   if not !Clflags.custom_runtime then
     link_bytecode ppf tolink output_name true
   else if not !Clflags.output_c_object then begin
-    let bytecode_name = Filename.temp_file "camlcode" "" in
-    let prim_name = Filename.temp_file "camlprim" ".c" in
+    let bytecode_name = temp_file "camlcode" "" in
+    let prim_name = temp_file "camlprim" ".c" in
     try
       link_bytecode ppf tolink bytecode_name false;
       let poc = open_out prim_name in
@@ -621,7 +626,7 @@ let link ppf objfiles output_name =
     let c_file =
       if !Clflags.output_complete_object
          && not (Filename.check_suffix output_name ".c")
-      then Filename.temp_file "camlobj" ".c"
+      then temp_file "camlobj" ".c"
       else begin
         let f = basename ^ ".c" in
         if Sys.file_exists f then raise(Error(File_exists f));

However, I think it will cause other problems; e.g. if your
build system is parallel, you would need to have different
directories for the various workers.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 20, 2018

An alternative would be to use the -fdebug-prefix-map option
of gcc/clang (I completely missed this option yesterday).
It is probably slightly less portable, as other C compilers might
not support it, but it depends whether it is deemed important
to have reproducible builds everywhere.

The patch would be something like the following (plus configuration
logic to use the option only when the ccfamily is gcc/clang):

diff --git a/bytecomp/bytelink.ml b/bytecomp/bytelink.ml
index 1500cdd1c..0f8a4dba6 100644
--- a/bytecomp/bytelink.ml
+++ b/bytecomp/bytelink.ml
@@ -536,8 +536,10 @@ let link_bytecode_as_c ppf tolink outfile =
 
 let build_custom_runtime prim_name exec_name =
   let runtime_lib = "-lcamlrun" ^ !Clflags.runtime_variant in
+  let debug_prefix =
+    Printf.sprintf "-fdebug-prefix-map=%s=camlprim.c" prim_name in
   Ccomp.call_linker Ccomp.Exe exec_name
-    ([prim_name] @ List.rev !Clflags.ccobjs @ [runtime_lib])
+    ([debug_prefix; prim_name] @ List.rev !Clflags.ccobjs @ [runtime_lib])
     (Clflags.std_include_flag "-I" ^ " " ^ Config.bytecomp_c_libraries)
 
 let append_bytecode_and_cleanup bytecode_name exec_name prim_name =
@gasche

This comment has been minimized.

Copy link
Member

commented Jun 20, 2018

I think we may want both:

  • personally I think that exposing the generated files to the user is a good general UI principle and I am in favor of an option to do it (for example, in teaching situations, we can easily show to students what this file looks like)
  • there are situations were using debug-prefix-map could be useful

The question is: in the cases where we were not asked from the user to preserve the file, should we generate it in a stable position and then remove it, or should we generate a random temporary file and stabilize post-facto? I think both option have merits (eg. in case the compiler fails abruptly at this point, maybe having the file around is actually nice?).

@diml

This comment has been minimized.

Copy link
Member

commented Jun 20, 2018

The question is: in the cases where we were not asked from the user to preserve the file, should we generate it in a stable position and then remove it, or should we generate a random temporary file and stabilize post-facto?

For most build systems the latter is better. For instance, a command that evaluate the glob *.c cannot be run concurrently with a command that produce a transient .c file in the same directory, so some special support is required.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2018

From both @alainfrisch and @gasche feedback, I understand that
the current diff of this PR addresses the need to be able to examine
the C file (which is interesting per se), but while it produces reproducible
builds as a side effect it is better to use another mechanism to guarantee
that this property holds.

As a consequence, I would propose to extend this PR with the -fdebug-prefix-map
patch above (and the related configure logic). I wonder whether this flag
should always be passed when available, or should be explicitly requested
by the user.

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 21, 2018

How portable is the flag? Does it work with slightly older GCC versions, does it really work with Clang, and what about the Windows world? Wouldn't it make sense to check its availability at configure--time?

I think that we could enable it by default on all systems that support it.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2018

Does it work with slightly older GCC versions?

From what I see, the flag was introduced in gcc 4.3.0.
It looks like recent versions of e.g. Unbuntu and Debian
ship gcc versions that are more recent than that.

does it really work with Clang

It does (tested on macOS by @diml).

Wouldn't it make sense to check its availability at configure--time?

Indeed, that's what I proposed -- just wanted to gather feedback
before implementing it.

I think that we could enable it by default on all systems that support it.

I concur; in theory, some people might rely on the fact that each
build is different, but it sounds slightly unlikely (and there are
numerous possible workarounds).

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2018

(I just noticed that the configure script rejects gcc versions
older than 4.2.)

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 21, 2018

I have updated this PR with the detection and use of -fdebug-prefix-map.

@@ -536,8 +536,13 @@ let link_bytecode_as_c ppf tolink outfile =

let build_custom_runtime prim_name exec_name =
let runtime_lib = "-lcamlrun" ^ !Clflags.runtime_variant in
let debug_prefix_map =
if Config.c_has_debug_prefix_map then

This comment has been minimized.

Copy link
@diml

diml Jun 21, 2018

Member

We should probably not pass this option if -dcamlprimc is passed as well.

This comment has been minimized.

Copy link
@xclerc

xclerc Jun 21, 2018

Author Contributor

I am not sure; it makes the build reproducible even if you change
the output name.

This comment has been minimized.

Copy link
@xclerc

xclerc Jun 21, 2018

Author Contributor

I realized that my definition of "reproducible" was slightly
broken: I understood it as "produces the same contents"
while it is "produces the contents in the same file".

configure Outdated
@@ -409,6 +409,7 @@ export cc verbose
# Determine the C compiler family (GCC, Clang, etc)

ccfamily=`$cc -E cckind.c | grep '^[a-z]' | tr -s ' ' '-'`
cc_has_debug_prefix_map=`$cc -E cc_has_debug_prefix_map.c | grep '^[a-z]'`

This comment has been minimized.

Copy link
@diml

diml Jun 21, 2018

Member

What about just passing -fdebug-prefix-map and seeing if the compiler fails or not?

This comment has been minimized.

Copy link
@xclerc

xclerc Jun 21, 2018

Author Contributor

Well, I reused the logic from config/auto-aux/cckind.c.
Actually, a third possibly would be to do all the work
in the aforementioned file.
I don't have a strong opinion.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 21, 2018

For instance, a command that evaluate the glob *.c cannot be run concurrently with a command that produce a transient .c file in the same directory, so some special support is required.

If this is a concern, one could use a different extension (*.tmp) -- or no extension at all -- for the temporary file and pass -x c to gcc (resp /Tc for msvc) to force it to interpret the file as C code.

Also, since only the basename seems to matter, one could create a temporary directory and put the file in it.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

@alainfrisch Another possibility we briefly considered was to
just pipe the source to the compiler (also using -xc to set the
language). Would you know whether it is possible with the msvc
toolchain?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

Apprently not possible with msvc (https://social.msdn.microsoft.com/Forums/vstudio/en-US/c6ee0a6c-2354-4fe7-9723-366f886c3c90/about-compilercl-can-input-a-source-code-to-clexe-from-stdin?forum=vclanguage).

Thanks for the pointer; however, a decade later, the feature
might have been granted.

But how would you do that without the Unix library, anyway?

Sorry I was not clear: we would generate a file, and then basically
execute cat this_file | gcc -xc - .... When invoked like that, gcc
uses an empty string as the parameter of the .file directive.
So, technically we would generate a file, but its name would not
matter any more.

(By the way, doesn't your directory-creation-based solution suffer
from a similar problem? I mean you would need Unix to create
the directory, right?)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

cat this_file | gcc -xc - ...

Ok, understood. Is this much better than creating the temporary file in the target folder, with a name derived from the target, and without the .c extension?

By the way, doesn't your directory-creation-based solution suffer
from a similar problem?

Indeed, I did not realize that the stdlib does not allow creating a directory, ...

@gasche

This comment has been minimized.

Copy link
Member

commented Jun 22, 2018

I'm a bit confused because I am not sure what problems of the proposed solution(s) @xclerc and @alainfrisch are discussing. My understanding of the current decisions are as follows:

  1. The -dcamlprimc feature to observe the C code as a file produced in the target directory is nice to have, but not the best way to get reproducible custom builds (the initial need). It is kept in the PR but not used in the final solution.

  2. The solution for reproducible builds using C files that we do not want to keep around is to produce a temporary file as we currently do, but using -fdebug-prefix-map to ask the C compiler to give it a stable name.

As far as I can tell this describes the current state of the PR which sounds fine to me, modulo the discussion on how to best implement debug-prefix-map support detection in configure.

Is the rest of the discussion trying to find other approaches that are more portable to compilers which do not support debug-prefix-map? Do we actually care about supporting reproducible builds on these systems? I personally would be content with keeping the current behavior there.

(It may be useful to review the other temporary files produced by the compiler and see whether the same approach can be used. In particular, are there other C files that could benefit from the debug-prefix-map treatment?)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

Yes, I think this is about finding an alternative, perhaps simpler and more portable solution. Also, if we add the possibility to tell the compiler to keep the file around, it would just amount to not deleting it. This seems simpler and more robust than having two different ways to generate the file name and to compile it depending on whether the file should be kept around for debugging purposes.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

Is this much better than creating the temporary file in the target folder, with a name derived from the target, and without the .c extension?

No, it is not; it only allows you to get reproducible builds with
neither keeping the file, nor relying on -fdebug-prefix-map.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

Btw https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html says:

-fdebug-prefix-map=old=new
When compiling files residing in directory old, record debugging information describing them as if the files resided in directory new instead. This can be used to replace a build-time path with an install-time path in the debug info. It can also be used to change an absolute path to a relative path by using . for new. This can give more reproducible builds, which are location independent, but may require an extra command to tell GDB where to find the source files. See also -ffile-prefix-map.

This option seems to be used here to map the full filename, while the text above mentions only mapping directory names. Is that a documented/stable-enough behavior?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

Note also the hint about the more recent -ffile-prefix-map, which would be needed if the temporary C code used __FILE__ and co.

@lpw25

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2018

without the .c extension

Note that this is also not completely portable. See #568 for a discussion.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

Note also the hint about the more recent -ffile-prefix-map, which would be needed if the temporary C code used FILE and co.

I will have a look - I will also (in another PR), use BUILD_PATH_PREFIX_MAP
when translating OCaml's __FILE__ and __POS__ primitives that are another
source of non-reproducible builds I identified.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

I did not use -ffile-prefix-map because it was not available locally,
and it indeed seems to be extremely recent: introduced in January
(see gcc's change log). It will be a long time before Linux distributions
use gcc 8.x by default.

-fdebug-prefix-map is available since version 4.3 (circa 2011), and
is as a consequence largely available. I reckon -ffile-prefix-map would
be better, but I think we should just add a comment to the use of
-fdebug-prefix-map (and maybe also to the detection logic) so that
someone will revisit the decision when the command-line switch is
largely available.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 22, 2018

It may be useful to review the other temporary files produced by the compiler and see whether the same approach can be used. In particular, are there other C files that could benefit from the debug-prefix-map treatment?

I did indeed check for temporary files. The native part of the
compiler seems to be OK as assemly files contain an empty
.file directive (see this commit following this discussion).
The bytecode part must be modified to add the -fdebug-prefix-map
command-line switch for the compilation of the C file used
for -output-complete-obj when the output is not a C file, e.g.:

ocamlc -output-complete-obj source.ml -o tmp.o

I have updated this PR accordingly.

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 25, 2018

I have changed the detection logic to follow @diml advice.

raise x
end else begin
let basename = Filename.chop_extension output_name in
let temps = ref [] in
let c_file =
let c_file, stable_name =

This comment has been minimized.

Copy link
@diml

diml Jun 25, 2018

Member

It seems to me that stable_name should be an option

This comment has been minimized.

Copy link
@xclerc

xclerc Jun 25, 2018

Author Contributor

Possibly, however it is not an option in build_custom_runtime
(one can only avoid passing -fdebug-prefix-map by keeping
the generated file).

This comment has been minimized.

Copy link
@xclerc

xclerc Jun 25, 2018

Author Contributor

Sorry, I misinterpreted your comment as command-line option
rather than string option parameter -- I have pushed a new
version.

@diml

This comment has been minimized.

Copy link
Member

commented Jun 25, 2018

@alainfrisch regarding portability, I admit that I don't know what other C compilers are doing and whether they are affected at all by this issue. For instance does the msvc compiler stores absolute paths in object files? It seems to me that the proposed solution is non-invasive: it doesn't change the way things are compiled and simply pass the right option to the C compiler to make builds more reproducible, so I'm in favor of merging this PR as it.

This option seems to be used here to map the full filename, while the text above mentions only mapping directory names. Is that a documented/stable-enough behavior?

We tested it with both gcc and clang, so I tend to think we can rely on the current behavior.

@diml

This comment has been minimized.

Copy link
Member

commented Jun 25, 2018

The current patch looks good to me. @xclerc can you add a changelog entry?

@xclerc

This comment has been minimized.

Copy link
Contributor Author

commented Jun 27, 2018

Entry added.

@diml

diml approved these changes Jun 27, 2018

@diml diml merged commit 9c182f7 into ocaml:trunk Jun 27, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.