Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Honor BUILD_PATH_PREFIX_MAP #1515

Merged
merged 3 commits into from Mar 5, 2018
Merged

Honor BUILD_PATH_PREFIX_MAP #1515

merged 3 commits into from Mar 5, 2018

Conversation

@gasche
Copy link
Member

@gasche gasche commented Dec 4, 2017

BUILD_PATH_PREFIX_MAP is an environment variable designed for "reproducible builds" (by @infinity0). It allows to specify a series of rewriting rules for absolute paths, to make the output of build tools resilient to build-directory changes. See the BUILD_PATH_PREFIX_MAP specification -- the details are a bit tricky due to the need to portably escape strings within the environment variable, but this can be ignored for the purpose of this PR.

This PR adds a new module in utils/, build_path_prefix_map.{ml,mli}, that implements encoding and decoding of build path prefix maps, according to the specification, and rewriting of absolute paths under a given map. Then a new function Location.rewrite_absolute_path is added to do the lookup the environment variable and do the rewriting (the environment lookup, and map decoding, are cached).

For example, when using this PR, the build of the findlib library becomes reproducible, while it currently is not, because it uses -g by default, which includes absolute file names (since #14 and #157).

Note that this PR does not change the behaviour of the compiler unless the BUILD_PATH_PREFIX_MAP variable is set (I expect build scripts for distributions interested in reproducible builds, such as Debian, to set it). In particular, it does not disable the use of absolute names in any way.

Final remarks:

@gasche gasche force-pushed the gasche:build-path-prefix-map branch 2 times, most recently from 05f5c92 to 67f79f7 Dec 5, 2017
@mshinwell
Copy link
Contributor

@mshinwell mshinwell commented Dec 6, 2017

Echoing what was said on the GCC patches list: setting build parameters via environment variables is fragile (a pretty good way of ensuring you don't get reproducible builds, in fact). I would rather see this as an option to the compiler used in the build systems of the relevant packages, whose maintainers are interested in this reproducibility property, rather than another OCAMLPARAM-like hack.

It might be worth investigating how Debian ensures this property in the C world. (Is it relying on most things using autoconf, for example, so a standardised option can be passed?)

@gasche
Copy link
Member Author

@gasche gasche commented Dec 6, 2017

If I understood correctly, the reproducible-builds.org infrastructure uses Ximin's gcc patches for its Debian package testing, so that's the approach they use for C.

One counter-argument in support of an environment variable, I think, is that reproducibility is a cross-cutting concern. We want all the packages on a user system to be reproducible, not just some of them for which the authors made specific build-system choices. Also, following shared conventions across different languages and build systems makes it much easier to have reproducible build in practice.

(Of course, from the perspective of someone who does not care about reproducibility, I think it is fair to ask how much complexity the change adds to the normal workings of the system.)

I'm not trying to use a slippery slope, but just as a remark: 4.06 already respects reproducible-builds specifications/conventions for another environment variable, SOURCE_DATE_EPOCH, that can be used to make programs reproducible when they include timestamps in their output.

@mshinwell
Copy link
Contributor

@mshinwell mshinwell commented Dec 6, 2017

One other thing that comes to mind: the notion of "reproducible" might be worth investigating a little more. Is it acceptable for certain sections of the executable to differ?

@lpw25 and I have discussed, actually in the context of Spacetime, moving the debug info out of the frametable (which is in the data section) into a separate section. Doing this might help our reproducibility story if only a subset (i.e. all those that matter) of sections are compared without any rewriting of paths.

Are there any cases which don't relate to debug info? I think there might be a problem with paths being embedded in the names of Flambda anonymous function symbols, although we should be able to make that go away once the gdb patches land, as the user-visible names can be specified separately in DWARF. I'm not sure I know of any others, but there probably are some.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Dec 6, 2017

One other thing that comes to mind: the notion of "reproducible" might be worth investigating a little more. Is it acceptable for certain sections of the executable to differ?

Not really I think, see the definition (the bit-by-bit bit). As an end user I really just want to simply hash my binary and compare it to the hash of a binary someone else produced on another machine.

@gasche
Copy link
Member Author

@gasche gasche commented Dec 7, 2017

Are there any cases which don't relate to debug info?

I include below an exhaustive list of all paths rewritten by the current patch. A word on terminology, I call "completed path" a path that was passed as a relative path, and is turned into an absolute path by the compiler.

  1. the completed paths to file names debug information,
    and the absolute path to the current build directory,
    which the debugger may lookup to find source files
    (they use absolute/completed paths since PRs #14 and #157)
  2. the completed path to a custom bytecode interpreter stored in the RNTM section of cmo files,
    passed using the -use-runtime option
  3. the path to the "build directory" stored in .cmt files
  4. the completed paths shown in error locations in the compiler output

Whenever I mentioned that completed paths are rewritten, note that this only applies if the path was passed as a relative path to that part of the compiler. In particular, if the user explicitly specifies an absolute path, then this path is not rewritten. Another way to think of the behavior is that the only paths that are rewritten are those resulting from a Sys.getcwd () call by the compiler itself.

@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Dec 7, 2017

I take from it than filenames in assertion or pattern matching failure messages, or string literals generated by FILE, are not currently rewritten in case when -absname is used. I'm not sure we want to prevent doing so in the future, though.

@gasche
Copy link
Member Author

@gasche gasche commented Dec 7, 2017

I'm not sure about warning messages, this may be a case of (4) that I overlooked. I'm sure that FILE currently reuses the pos_fname value of the lexing location of its AST node. I believe that whether those are absolute or relative depends on how the file path was given to the OCaml compiler.

With the proposed patchset, Location.absolute_path always applies the prefix_map rewrite. This means that if we later decide to complete FILE paths using this function, no change would be required from the point of view of this PR.

(Outside Location, rewrite_absolute_path is only used on one direct getcwd () call and two independent reimplementations of absolute_path.)

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Dec 7, 2017

Note that in the particular case of cmt[i] files this will not be sufficient to make them reproducible because of cmt_args.

@infinity0
Copy link
Contributor

@infinity0 infinity0 commented Dec 7, 2017

setting build parameters via environment variables is fragile (a pretty good way of ensuring you don't get reproducible builds, in fact).

This is normally the case for variables unrelated to reproducible builds, but this variable was specifically designed for achieving reproducible builds, and there are some differences:

For things like OCAMLPARAM, one is supplying extra information to a build process via a mechanism that is not very transparent (envvars) and more transparent mechanisms (command-line args) are preferred. For BUILD_PATH_PREFIX_MAP however, it is more accurate to think of this as removing information that the build process is already taking via a non-transparent mechanism - i.e. absolute paths on the filesystem.

From another angle: if this were implemented via a command-line flag, one would have to duplicate absolute paths in the calls to ocamlc et.al, polluting what would otherwise be a constant invocation ocamlc $args, where $args is not dependent on a specific build machine, with information that is specific to a build machine. (Then sometimes, higher-level buildsystems also like to save $args into the build output, making it unreproducible again.)

And yes, the specific GCC reviewer for my patch ignored this argument, but nevertheless I maintain that this is the preferred approach.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Dec 7, 2017

Whenever I mentioned that completed paths are rewritten, note that this only applies if the path was passed as a relative path to that part of the compiler. In particular, if the user explicitly specifies an absolute path, then this path is not rewritten.

I'm not sure I fully understand that point. You don't do the rewrite on absolute paths given as arguments (e.g. the -o, the cmo args, etc.) ?

From a build system perspective it's much better from a usability point if your build system passes absolute paths to all the tools it schedules (it makes the context in which the invocation occurs obvious in the logs and allows to repeat the a build tool invocation by simply c&p it regardless of your current environment).

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Dec 7, 2017

(Then sometimes, higher-level buildsystems also like to save $args into the build output, making it unreproducible again.)

Note that this is the case for ocamlc (see my comment about cmt files).

@gasche
Copy link
Member Author

@gasche gasche commented Dec 7, 2017

There is nothing in this patch that rewrites absolute paths given by users, but of course it would be possible to decide to do it -- and probably fairly easy. (I'd rather keep the scope of the current PR fairly small, though.)

I am not sure whether systematically passing absolute paths is the best choice for build systems. For ocamlbuild specifically, if I remember correctly we rely on the fact that the paths are relative to the project root to be able to transfer paths from _build to . easily -- typically for error messages sent back to users by the compiler.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Dec 7, 2017

I am not sure whether systematically passing absolute paths is the best choice for build systems.

I personally really think it is for the usability reasons I mention (build understanding and easy manual replay; it becomes even more obvious when you deal with build system that support simultaenous side-by-side build variants). In any case there are quite a few things that will come to you under the form of absolute paths, like paths to installed libraries.

For ocamlbuild specifically, if I remember correctly we rely on the fact that the paths are relative to the project root to be able to transfer paths from _build to . easily -- typically for error messages sent back to users by the compiler.

A trick by which I lost hours of work on documentation numerous time (as soon as last week again) because ocamldoc was making its paths absolute in certain circumstances until recently.

Making your builds in a separate directory structure that mirrors the source tree and play relative tricks is a brittle idea that has to die in my opinion. I don't think it is unreasonable to ask compiler tools to be able to read sources and writes build artefacts to/from arbitrary locations.

@gasche
Copy link
Member Author

@gasche gasche commented Dec 7, 2017

In any case there are quite a few things that will come to you under the form of absolute paths, like paths to installed libraries.

For now my work on reproducible-builds has been focused on the experimental setup proposed by the reproducible-builds.org setup, which tries to eliminate build variation coming from various factors, which do not include the installation paths of dependencies, but does include variation of the build directory (from where the package is installed).

The current PR (and its possible evolution) is more flexible than that, as arbitrary prefix rewrites may be specified through the prefix map (typically capturing and rewriting your opam root). But in the experimental testing that actually gets tested in practice, only the current directory is part of the build path, and it is its variations that I want to make sure are prevented by the PR. For Debian packages (they get tested in practice, but this would also apply to most other distributions), the installed libraries will be in a place that does not vary from one user system to another.

I am of course open to making further PRs to improve the reproducibility situation in different scenarios (for example, displacement of the opam build root; but that may quickly run into relocatability issues), provided that an adequate testing and reproduction procedure is available.

(This is the joke about the person who search their keys at night under the street lamp; I have enough keys to find under the very helpful lamp they provide to keep me busy before I look at the rest of the street.)

Copy link
Member

@damiendoligez damiendoligez left a comment

A few remarks:

  • why do you need the encode_* functions?
  • couldn't you do the parsing with ocamllex rather than by hand?
utils/build_path_prefix_map.ml Show resolved Hide resolved
parsing/location.ml Show resolved Hide resolved
@gasche
Copy link
Member Author

@gasche gasche commented Dec 13, 2017

why do you need the encode_* functions?

The way I went at this problem is to write a self-contained external library (that other projects could also use), https://gitlab.com/gasche/build_path_prefix_map/, and then include it wholesale in my PR. The encode_* functions are not used in the compiler today (although they could be, for example in the testsuite), but they were very useful to test the decode_* function code -- I used Crowbar/afl-fuzz to check the roundtrip properties in all directions.

I would rather keep the included-in-the-compiler source identical to the outside source if that's not to costly, as it makes it easier to see what version it is and whether it should be updated. (I don't expect to update this library much, given that its scope is well-defined and its feature-set satisfying, unless the specification was to change.)

couldn't you do the parsing with ocamllex rather than by hand?

Possibly, but then I'm making the whole thing sensibly more complex to build for little gain. Are you suggesting that using ocamllex would be better?

@damiendoligez
Copy link
Member

@damiendoligez damiendoligez commented Feb 7, 2018

Possibly, but then I'm making the whole thing sensibly more complex to build for little gain. Are you suggesting that using ocamllex would be better?

No, I was just wondering whether it might be better. I accept your judgement that it isn't.

@gasche
Copy link
Member Author

@gasche gasche commented Feb 20, 2018

@damiendoligez would you, by any chance, be interested in "Approving" this PR?

Copy link
Member

@damiendoligez damiendoligez left a comment

This looks good except for the fatal error message.

parsing/location.ml Outdated Show resolved Hide resolved
@gasche gasche force-pushed the gasche:build-path-prefix-map branch from 67f79f7 to 6c55877 Mar 1, 2018
@gasche
Copy link
Member Author

@gasche gasche commented Mar 1, 2018

IIUC, this failwith means if I launch the compiler with an ill-formed value for BUILD_PATH_PREFIX_MAP I might get a compiler fatal error with a message like:

Fatal error: Failure "invalid character '=' in key or value"

This is not nice. The user-visible error message must at least mention the environment variable name.

Indeed, that's an excellent point.

I pushed a new version using Misc.fatal_error:

$ BUILD_PATH_PREFIX_MAP="\"" ocamlopt -absname test2.cmx
>> Fatal error: Invalid value for the environment variable BUILD_PATH_PREFIX_MAP: invalid key/value pair "\"", no '=' separator
Fatal error: exception Misc.Fatal_error
@gasche gasche removed the high-priority label Mar 1, 2018
@gasche gasche merged commit d5eff72 into ocaml:trunk Mar 5, 2018
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dra27 dra27 mentioned this pull request Jun 8, 2018
4 of 4 tasks complete
@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jun 19, 2018

@gasche is there any reason why this new variable and its effects is not mentioned in the manual ?

@damiendoligez
Copy link
Member

@damiendoligez damiendoligez commented Jun 20, 2018

@gasche is there any reason why this new variable and its effects is not mentioned in the manual ?

That's a very good point.

@gasche
Copy link
Member Author

@gasche gasche commented Jun 20, 2018

Indeed, I should have thought of it. I am not sure where to document this, and would appreciate a recommendation here. It seems that there is no place in which all our environment variables are listed. comp.html has a "Contextual control of command-line options" section, but this is rather specific and not meant to cover all environment variables.

I'm not sure if I should start a "environment variables" section, accepting that it will be very non-exhaustive at the start, or whether I should place the BUILD_PATH_PREFIX_MAP documentation in a specific place (which one?). I would like to propose the BUILD_PATH_PREFIX_MAP documentation for cherry-pick in 4.07, so a not-too-invasive change would be welcome -- although we could do a small change in 4.07 and a larger reorganization in trunk.

@Octachron
Copy link
Member

@Octachron Octachron commented Jun 20, 2018

An option might be to add a Reproducible builds section to comp and document BUILD_PATH_PREFIX_MAP here. If an environment variable index is added at a later date, the BUILD_PATH_PREFIX_MAP entry could simply link to this section.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jun 20, 2018

I'm not sure if I should start a "environment variables" section

That's the section I would be looking for. I would move the contents of "Contextual control of command-line options" there and also since this seems to be shared between ocamlc and ocamlopt simply have all the contents in ocamlc and a link from the ocamlopt section to the ocamlc section.

P.S. I realized this because it has been mentioned to me that the reproducible build people seem to be struggling with ocaml, I guess that one could solve the problem for these packages. /cc @lamby

@lamby
Copy link

@lamby lamby commented Jun 20, 2018

(I often grep for an "environment variables" section in manpages so +1 for that part.)

I realized this because it has been mentioned to me that the reproducible build people seem to be struggling with ocaml, I guess that one could solve the problem

Well, not so much "struggling" per-se, more that we have been rather busy and overworked!

for these packages.

So I haven't confirmed that this is exactly the same issue or even that these are caused by the same underlying path; there have been other, multiple, ones in the past… but hopefully so. :) Thanks!

@gasche
Copy link
Member Author

@gasche gasche commented Nov 27, 2019

A recent question from @hannesm made me realize that I failed to provide a simple user-friendly example of usage of BUILD_PATH_PREFIX_MAP. The syntax is a list of colon-separated rewrite rules of the form <new>=<old>: if OCaml would print an absolute path prefixed by <old>, then it removes this prefix and prefixes the path with <new> instead. (See the full specification for details, in particular how to encode the characters %, :, = if they occur in paths, and the behavior if rewrite rules overlap -- the rewrites are chained from right to left).

So for example, if in my /tmp folder I create an invalid file foo.ml with content let foo =, I observe the
following results:

$ ocamlc -absname -i foo.ml 
File "/tmp/foo.ml", line 2, characters 0-0:
Error: Syntax error

# rewrite /tmp/ into nothing: "=/tmp/"
$ BUILD_PATH_PREFIX_MAP="=/tmp/" ocamlc -absname -i foo.ml 
File "./foo.ml", line 2, characters 0-0:
Error: Syntax error

# more generaly, rewrite the current directory into nothing
$ BUILD_PATH_PREFIX_MAP="=$(pwd)/" ocamlc -absname -i foo.ml 
File "./foo.ml", line 2, characters 0-0:
Error: Syntax error

# rewrite $(pwd) into /PWD
$ BUILD_PATH_PREFIX_MAP="/PWD/=$(pwd)/" ocamlc -absname -i foo.ml 
File "/PWD/foo.ml", line 2, characters 0-0:
Error: Syntax error

Note that this particular part of the compiler logic will helpfully(?) prefix ./ to the path if it doesn't look absolute enough (it doesn't start with a /), so rewriting $(pwd)/ with PWD/ instead of /PWD/ would actually result in ./PWD/foo.ml, which is not what I wanted.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Nov 27, 2019

Note that your example doesn't work on macOS because on that platform /tmp is a symlink:

> cd /tmp
> echo "lat" > foo.ml
> BUILD_PATH_PREFIX_MAP="=/tmp/" ocamlc -i -absname foo.ml 
File "/private/tmp/foo.ml", line 1, characters 0-3:
1 | lat
    ^^^
Error: Unbound value lat

you likely want to look for PWD in the environment here rather than use Sys.getcwd.

But that's a bit besides the point, -absname and relative trick in build systems are hacks that always eventually bite you.

I still think that the right strategy for reproducible builds is for your build system to always work with absolute paths (bonus point: it's easy to you understand what's happening when things go wrong and you don't need to be in a particular cwd to reproduce a command) and allow to chop prefixes up to your project root (and if applicable system roots) in all of the tool's path output. See #8665.

@gasche
Copy link
Member Author

@gasche gasche commented Nov 27, 2019

BUILD_PATH_PREFIX_MAP is a particular mechanism to allow you to chop prefixes, and then it's a matter of integrating it with whatever the compiler does internally; if it handles path in a simple, clean, consistent way, then you get good results, otherwise (as today) you get a mixed bag, but this is independent of this specific mechanism.

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Nov 27, 2019

if it handles path in a simple, clean, consistent way, then you get good results, otherwise (as today) you get a mixed bag

That's not really the point. In that particular case it's not about handling things in a consistent way. It's a mismatch between the perception of what the cwd is between the entity that invokes the compiler and the compiler itself. Just saying there's an easy solution to that problem: stop relying on relative paths.

tasmo pushed a commit to prototypefund/reproducible-website that referenced this pull request Apr 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

8 participants
You can’t perform that action at this time.