Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] custom runtimes & reproducible builds #1845

Merged
11 commits merged into from
Jun 27, 2018
Merged

Conversation

xclerc
Copy link
Contributor

@xclerc xclerc commented Jun 19, 2018

When using the -custom option of ocamlc, a C file containing
the information about primitives is generated. The file name
comes from a call to Filename.temp_file, meaning that successive
invocations will generate different file names.

The problem, with respect to reproducible builds, is that the file
name ultimately appears in the produced binary, making the build
non-reproducible.

This PR introduces -dcamlprimc, a command-line switch similar
to ocamlopt's -dstartup, to keep the generated file (whose name
is then derived from the specified compiler output).

@pmetzger
Copy link
Member

I like the idea. However, why "camlprimc" as a name? I find it kind of confusing...

@xclerc
Copy link
Contributor Author

xclerc commented Jun 19, 2018

Basically, lack of imagination... the name of the temporary file being "camlprimXYZ.c".

Suggestions are welcome!

@shindere
Copy link
Contributor

shindere commented Jun 19, 2018 via email

@alainfrisch
Copy link
Contributor

I can see some use for keeping the temporary file, for debugging purpose, but if you are only interested in reproducible builds, why do you conflate the idea of using a stable name with that of keeping the temporary file around?

I suspect that, similarly, you might want a way to use a stable name for the "startup" object file, without necessarily keeping the file itself after compilation.

@pmetzger
Copy link
Member

-dstabletmps ?

@alainfrisch
Copy link
Contributor

Would it be risky to always use a name derived from the target?

@xclerc
Copy link
Contributor Author

xclerc commented Jun 19, 2018

@alainfrisch I understand (and partially share) your concerns.

When you suggest to always derive the file name from the
target, would the file still be created in a temporary directory?
(if so, the risk of name clash / race condition seems to be non
negligible)

@alainfrisch
Copy link
Contributor

I was thinking indeed about creating the file in the target folder. Alternatively, one could create a temporary directory, and create files with stable names in it -- or does the full path appear in the object file?

@xclerc
Copy link
Contributor Author

xclerc commented Jun 19, 2018

As far as gcc is concerned, only the "basename"
appears in the produced binary. Interestingly enough,
my first implementation was actually an environment
variable (whose name was inspired by this PR) used
to control the path of the file:

diff --git a/bytecomp/bytelink.ml b/bytecomp/bytelink.ml
index 05ff3d674..59f98ef80 100644
--- a/bytecomp/bytelink.ml
+++ b/bytecomp/bytelink.ml
@@ -36,7 +36,12 @@ type link_action =
     Link_object of string * compilation_unit
       (* Name of .cmo file and descriptor of the unit *)
   | Link_archive of string * compilation_unit list
-      (* Name of .cma file and descriptors of the units to be linked. *)
+  (* Name of .cma file and descriptors of the units to be linked. *)
+
+let temp_file prefix suffix =
+  match Sys.getenv_opt "BUILD_PATH_TEMP" with
+  | None -> Filename.temp_file prefix suffix
+  | Some path -> Filename.concat path (prefix ^ suffix)
 
 (* Add C objects and options from a library descriptor *)
 (* Ignore them if -noautolink or -use-runtime or -use-prim was given *)
@@ -581,8 +586,8 @@ let link ppf objfiles output_name =
   if not !Clflags.custom_runtime then
     link_bytecode ppf tolink output_name true
   else if not !Clflags.output_c_object then begin
-    let bytecode_name = Filename.temp_file "camlcode" "" in
-    let prim_name = Filename.temp_file "camlprim" ".c" in
+    let bytecode_name = temp_file "camlcode" "" in
+    let prim_name = temp_file "camlprim" ".c" in
     try
       link_bytecode ppf tolink bytecode_name false;
       let poc = open_out prim_name in
@@ -621,7 +626,7 @@ let link ppf objfiles output_name =
     let c_file =
       if !Clflags.output_complete_object
          && not (Filename.check_suffix output_name ".c")
-      then Filename.temp_file "camlobj" ".c"
+      then temp_file "camlobj" ".c"
       else begin
         let f = basename ^ ".c" in
         if Sys.file_exists f then raise(Error(File_exists f));

However, I think it will cause other problems; e.g. if your
build system is parallel, you would need to have different
directories for the various workers.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 20, 2018

An alternative would be to use the -fdebug-prefix-map option
of gcc/clang (I completely missed this option yesterday).
It is probably slightly less portable, as other C compilers might
not support it, but it depends whether it is deemed important
to have reproducible builds everywhere.

The patch would be something like the following (plus configuration
logic to use the option only when the ccfamily is gcc/clang):

diff --git a/bytecomp/bytelink.ml b/bytecomp/bytelink.ml
index 1500cdd1c..0f8a4dba6 100644
--- a/bytecomp/bytelink.ml
+++ b/bytecomp/bytelink.ml
@@ -536,8 +536,10 @@ let link_bytecode_as_c ppf tolink outfile =
 
 let build_custom_runtime prim_name exec_name =
   let runtime_lib = "-lcamlrun" ^ !Clflags.runtime_variant in
+  let debug_prefix =
+    Printf.sprintf "-fdebug-prefix-map=%s=camlprim.c" prim_name in
   Ccomp.call_linker Ccomp.Exe exec_name
-    ([prim_name] @ List.rev !Clflags.ccobjs @ [runtime_lib])
+    ([debug_prefix; prim_name] @ List.rev !Clflags.ccobjs @ [runtime_lib])
     (Clflags.std_include_flag "-I" ^ " " ^ Config.bytecomp_c_libraries)
 
 let append_bytecode_and_cleanup bytecode_name exec_name prim_name =

@gasche
Copy link
Member

gasche commented Jun 20, 2018

I think we may want both:

  • personally I think that exposing the generated files to the user is a good general UI principle and I am in favor of an option to do it (for example, in teaching situations, we can easily show to students what this file looks like)
  • there are situations were using debug-prefix-map could be useful

The question is: in the cases where we were not asked from the user to preserve the file, should we generate it in a stable position and then remove it, or should we generate a random temporary file and stabilize post-facto? I think both option have merits (eg. in case the compiler fails abruptly at this point, maybe having the file around is actually nice?).

@ghost
Copy link

ghost commented Jun 20, 2018

The question is: in the cases where we were not asked from the user to preserve the file, should we generate it in a stable position and then remove it, or should we generate a random temporary file and stabilize post-facto?

For most build systems the latter is better. For instance, a command that evaluate the glob *.c cannot be run concurrently with a command that produce a transient .c file in the same directory, so some special support is required.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 21, 2018

From both @alainfrisch and @gasche feedback, I understand that
the current diff of this PR addresses the need to be able to examine
the C file (which is interesting per se), but while it produces reproducible
builds as a side effect it is better to use another mechanism to guarantee
that this property holds.

As a consequence, I would propose to extend this PR with the -fdebug-prefix-map
patch above (and the related configure logic). I wonder whether this flag
should always be passed when available, or should be explicitly requested
by the user.

@gasche
Copy link
Member

gasche commented Jun 21, 2018

How portable is the flag? Does it work with slightly older GCC versions, does it really work with Clang, and what about the Windows world? Wouldn't it make sense to check its availability at configure--time?

I think that we could enable it by default on all systems that support it.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 21, 2018

Does it work with slightly older GCC versions?

From what I see, the flag was introduced in gcc 4.3.0.
It looks like recent versions of e.g. Unbuntu and Debian
ship gcc versions that are more recent than that.

does it really work with Clang

It does (tested on macOS by @diml).

Wouldn't it make sense to check its availability at configure--time?

Indeed, that's what I proposed -- just wanted to gather feedback
before implementing it.

I think that we could enable it by default on all systems that support it.

I concur; in theory, some people might rely on the fact that each
build is different, but it sounds slightly unlikely (and there are
numerous possible workarounds).

@xclerc
Copy link
Contributor Author

xclerc commented Jun 21, 2018

(I just noticed that the configure script rejects gcc versions
older than 4.2.)

@xclerc
Copy link
Contributor Author

xclerc commented Jun 21, 2018

I have updated this PR with the detection and use of -fdebug-prefix-map.

@@ -536,8 +536,13 @@ let link_bytecode_as_c ppf tolink outfile =

let build_custom_runtime prim_name exec_name =
let runtime_lib = "-lcamlrun" ^ !Clflags.runtime_variant in
let debug_prefix_map =
if Config.c_has_debug_prefix_map then
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably not pass this option if -dcamlprimc is passed as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure; it makes the build reproducible even if you change
the output name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that my definition of "reproducible" was slightly
broken: I understood it as "produces the same contents"
while it is "produces the contents in the same file".

configure Outdated
@@ -409,6 +409,7 @@ export cc verbose
# Determine the C compiler family (GCC, Clang, etc)

ccfamily=`$cc -E cckind.c | grep '^[a-z]' | tr -s ' ' '-'`
cc_has_debug_prefix_map=`$cc -E cc_has_debug_prefix_map.c | grep '^[a-z]'`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just passing -fdebug-prefix-map and seeing if the compiler fails or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I reused the logic from config/auto-aux/cckind.c.
Actually, a third possibly would be to do all the work
in the aforementioned file.
I don't have a strong opinion.

@alainfrisch
Copy link
Contributor

For instance, a command that evaluate the glob *.c cannot be run concurrently with a command that produce a transient .c file in the same directory, so some special support is required.

If this is a concern, one could use a different extension (*.tmp) -- or no extension at all -- for the temporary file and pass -x c to gcc (resp /Tc for msvc) to force it to interpret the file as C code.

Also, since only the basename seems to matter, one could create a temporary directory and put the file in it.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

@alainfrisch Another possibility we briefly considered was to
just pipe the source to the compiler (also using -xc to set the
language). Would you know whether it is possible with the msvc
toolchain?

@alainfrisch
Copy link
Contributor

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

Apprently not possible with msvc (https://social.msdn.microsoft.com/Forums/vstudio/en-US/c6ee0a6c-2354-4fe7-9723-366f886c3c90/about-compilercl-can-input-a-source-code-to-clexe-from-stdin?forum=vclanguage).

Thanks for the pointer; however, a decade later, the feature
might have been granted.

But how would you do that without the Unix library, anyway?

Sorry I was not clear: we would generate a file, and then basically
execute cat this_file | gcc -xc - .... When invoked like that, gcc
uses an empty string as the parameter of the .file directive.
So, technically we would generate a file, but its name would not
matter any more.

(By the way, doesn't your directory-creation-based solution suffer
from a similar problem? I mean you would need Unix to create
the directory, right?)

@alainfrisch
Copy link
Contributor

cat this_file | gcc -xc - ...

Ok, understood. Is this much better than creating the temporary file in the target folder, with a name derived from the target, and without the .c extension?

By the way, doesn't your directory-creation-based solution suffer
from a similar problem?

Indeed, I did not realize that the stdlib does not allow creating a directory, ...

@gasche
Copy link
Member

gasche commented Jun 22, 2018

I'm a bit confused because I am not sure what problems of the proposed solution(s) @xclerc and @alainfrisch are discussing. My understanding of the current decisions are as follows:

  1. The -dcamlprimc feature to observe the C code as a file produced in the target directory is nice to have, but not the best way to get reproducible custom builds (the initial need). It is kept in the PR but not used in the final solution.

  2. The solution for reproducible builds using C files that we do not want to keep around is to produce a temporary file as we currently do, but using -fdebug-prefix-map to ask the C compiler to give it a stable name.

As far as I can tell this describes the current state of the PR which sounds fine to me, modulo the discussion on how to best implement debug-prefix-map support detection in configure.

Is the rest of the discussion trying to find other approaches that are more portable to compilers which do not support debug-prefix-map? Do we actually care about supporting reproducible builds on these systems? I personally would be content with keeping the current behavior there.

(It may be useful to review the other temporary files produced by the compiler and see whether the same approach can be used. In particular, are there other C files that could benefit from the debug-prefix-map treatment?)

@alainfrisch
Copy link
Contributor

Yes, I think this is about finding an alternative, perhaps simpler and more portable solution. Also, if we add the possibility to tell the compiler to keep the file around, it would just amount to not deleting it. This seems simpler and more robust than having two different ways to generate the file name and to compile it depending on whether the file should be kept around for debugging purposes.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

Is this much better than creating the temporary file in the target folder, with a name derived from the target, and without the .c extension?

No, it is not; it only allows you to get reproducible builds with
neither keeping the file, nor relying on -fdebug-prefix-map.

@alainfrisch
Copy link
Contributor

Btw https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html says:

-fdebug-prefix-map=old=new
When compiling files residing in directory old, record debugging information describing them as if the files resided in directory new instead. This can be used to replace a build-time path with an install-time path in the debug info. It can also be used to change an absolute path to a relative path by using . for new. This can give more reproducible builds, which are location independent, but may require an extra command to tell GDB where to find the source files. See also -ffile-prefix-map.

This option seems to be used here to map the full filename, while the text above mentions only mapping directory names. Is that a documented/stable-enough behavior?

@alainfrisch
Copy link
Contributor

alainfrisch commented Jun 22, 2018

Note also the hint about the more recent -ffile-prefix-map, which would be needed if the temporary C code used __FILE__ and co.

@lpw25
Copy link
Contributor

lpw25 commented Jun 22, 2018

without the .c extension

Note that this is also not completely portable. See #568 for a discussion.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

Note also the hint about the more recent -ffile-prefix-map, which would be needed if the temporary C code used FILE and co.

I will have a look - I will also (in another PR), use BUILD_PATH_PREFIX_MAP
when translating OCaml's __FILE__ and __POS__ primitives that are another
source of non-reproducible builds I identified.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

I did not use -ffile-prefix-map because it was not available locally,
and it indeed seems to be extremely recent: introduced in January
(see gcc's change log). It will be a long time before Linux distributions
use gcc 8.x by default.

-fdebug-prefix-map is available since version 4.3 (circa 2011), and
is as a consequence largely available. I reckon -ffile-prefix-map would
be better, but I think we should just add a comment to the use of
-fdebug-prefix-map (and maybe also to the detection logic) so that
someone will revisit the decision when the command-line switch is
largely available.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 22, 2018

It may be useful to review the other temporary files produced by the compiler and see whether the same approach can be used. In particular, are there other C files that could benefit from the debug-prefix-map treatment?

I did indeed check for temporary files. The native part of the
compiler seems to be OK as assemly files contain an empty
.file directive (see this commit following this discussion).
The bytecode part must be modified to add the -fdebug-prefix-map
command-line switch for the compilation of the C file used
for -output-complete-obj when the output is not a C file, e.g.:

ocamlc -output-complete-obj source.ml -o tmp.o

I have updated this PR accordingly.

@xclerc
Copy link
Contributor Author

xclerc commented Jun 25, 2018

I have changed the detection logic to follow @diml advice.

raise x
end else begin
let basename = Filename.chop_extension output_name in
let temps = ref [] in
let c_file =
let c_file, stable_name =
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that stable_name should be an option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, however it is not an option in build_custom_runtime
(one can only avoid passing -fdebug-prefix-map by keeping
the generated file).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misinterpreted your comment as command-line option
rather than string option parameter -- I have pushed a new
version.

@ghost
Copy link

ghost commented Jun 25, 2018

@alainfrisch regarding portability, I admit that I don't know what other C compilers are doing and whether they are affected at all by this issue. For instance does the msvc compiler stores absolute paths in object files? It seems to me that the proposed solution is non-invasive: it doesn't change the way things are compiled and simply pass the right option to the C compiler to make builds more reproducible, so I'm in favor of merging this PR as it.

This option seems to be used here to map the full filename, while the text above mentions only mapping directory names. Is that a documented/stable-enough behavior?

We tested it with both gcc and clang, so I tend to think we can rely on the current behavior.

@ghost
Copy link

ghost commented Jun 25, 2018

The current patch looks good to me. @xclerc can you add a changelog entry?

@xclerc
Copy link
Contributor Author

xclerc commented Jun 27, 2018

Entry added.

@ghost ghost merged commit 9c182f7 into ocaml:trunk Jun 27, 2018
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants