Reproducible builds #21

Closed
siddharthab opened this issue Mar 25, 2018 · 8 comments

@siddharthab
Collaborator

Built packages are stamped with the build timestamp, the source directory and the library directory used for the installation. This is especially bothersome with Docker images, because every build produces different layers.

The build timestamp can be fixed to an empty string with the --built-timestamp flag to R CMD INSTALL. For the rest, we need to build and install in a constant directory, which means fixing a /tmp path for each package and acquiring a lock on that path so that builds in other workspaces do not interfere with this build.
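
A minimal sketch of the fixed-directory-plus-lock idea described above, not the implementation used in this project. The root path `/tmp/r_reproducible_build`, the helper names `package_lock` and `install_command`, and the directory layout are all hypothetical; the sketch assumes a Unix host (it uses `fcntl`) and that an empty value after `--built-timestamp=` gives the empty stamp mentioned above.

```python
import contextlib
import fcntl
import os

# Hypothetical fixed root; the issue only says "a /tmp path for a package".
FIXED_ROOT = "/tmp/r_reproducible_build"


@contextlib.contextmanager
def package_lock(pkg_name, root=FIXED_ROOT):
    """Hold an exclusive lock on the fixed build directory of one package."""
    pkg_dir = os.path.join(root, pkg_name)
    os.makedirs(pkg_dir, exist_ok=True)
    with open(pkg_dir + ".lock", "w") as lock_file:
        # Blocks until builds from other workspaces release the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield pkg_dir
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def install_command(src_dir, lib_dir):
    """Build the R CMD INSTALL argument list with an empty build timestamp."""
    return [
        "R", "CMD", "INSTALL",
        "--built-timestamp=",      # assumed spelling for an empty stamp
        "--library=" + lib_dir,    # constant library directory
        src_dir,
    ]
```

A caller would copy the sources into the directory yielded by `package_lock`, run the command from `install_command` with `subprocess.check_call` while the lock is held, and then copy the installed package back into its own workspace.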

@hchauvin
Contributor

hchauvin commented Apr 5, 2018

Hi, this has become a nuisance for me as well, so I dug in a bit.

Concerning the build timestamp (which also contains the R version and operating system), it only adds an entry to the DESCRIPTION file and does not seem to be present anywhere else. I could not find a reference to the operating system anywhere else either.

Concerning the source directory/library directory references, for the packages I surveyed at least, they could be found in the .so, .a, ... binary files. They are not stamps added by R, but come with the debug symbols that R adds by default. It does not seem to be possible to remove those debug symbols with a flag (see https://stackoverflow.com/questions/9607155/make-gcc-put-relative-filenames-in-debug-information), so the best option IMO might be to invoke strip on all the binary files after they are generated by R CMD INSTALL. If debug symbols are needed, then reproducibility can probably be put aside anyway, and we can have a '--define' Bazel option to disable stripping. strip is present with Xcode on Mac OSX and is part of the binutils package on Ubuntu/Debian, and installed by default. It should be invoked with '-S' instead of '-d' for Mac OSX/Linux compatibility.
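
The stripping step proposed above could be sketched roughly as follows. This is an illustration only: the helper names are hypothetical, the suffix list covers just the `.so` and `.a` cases named in the comment, and the `run` parameter exists only so the command can be swapped out for testing.

```python
import os
import subprocess

# Only the suffixes named in the comment; extend as needed for other platforms.
BINARY_SUFFIXES = (".so", ".a")


def find_binaries(pkg_dir, suffixes=BINARY_SUFFIXES):
    """Return the sorted paths of binary files under an installed package."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(pkg_dir):
        for name in filenames:
            if name.endswith(suffixes):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def strip_debug_symbols(pkg_dir, run=subprocess.check_call):
    """Invoke `strip -S` on every binary file and return the paths handled."""
    paths = find_binaries(pkg_dir)
    for path in paths:
        run(["strip", "-S", path])
    return paths
```

This would run after R CMD INSTALL has finished, on the installed package directory.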

I do not guarantee this will make the builds reproducible, but it should address the two issues you pointed out, without having to acquire a lock.

If this sounds good to you, I'll try to do a proof-of-concept with an additional reproducibility test (I hope it will pass!) sometime during the week.

@siddharthab
Collaborator Author

Hi Hadrien,

It's not just the compiler adding the debug symbols. When you take a checksum of every file in the installed package, you will see that the checksums of some .rdb/.rdx files vary as well. I was able to load one of these files in R and see that it had references to the library directory. These checksums become identical when you keep the --library flag constant. The --built-timestamp flag is available to make the package completely reproducible, but it assumes that the destination library is constant.
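
The checksum comparison described above can be sketched like this. SHA-256 and the helper names are arbitrary choices for illustration; the comment does not say which checksum tool was used.

```python
import hashlib
import os


def file_checksums(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            sums[os.path.relpath(path, root)] = digest
    return sums


def differing_files(root_a, root_b):
    """List relative paths whose contents differ, or that exist on one side only."""
    a = file_checksums(root_a)
    b = file_checksums(root_b)
    return sorted(rel for rel in a.keys() | b.keys() if a.get(rel) != b.get(rel))
```

Running `differing_files` on two installs of the same package made into different library directories would list the files to inspect, such as the .rdb/.rdx files mentioned above.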

If after this, we still want to strip the debug symbols, we can add a default Makevars file with the appropriate flags.

@hchauvin
Contributor

hchauvin commented Apr 6, 2018

Ok got it, I was wrong. Do you have this issue resolved internally?

I just looked at whether I could find any path in the output files (like, grep -R ...), and I could only find them in the debug symbols of the .so files, so I thought "problem solved!". I don't know how this info ends up in the .rdb/.rdx files, but even if I remove the debug symbols from the .so files, a few bits still differ in the ELF header, for whatever reason.
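
For reference, a raw byte search like the grep described above might look as follows in Python (the helper name is hypothetical). A search of this kind only sees paths stored as plain bytes; a possible explanation for the miss, not verified here, is that the .rdb/.rdx contents are stored compressed or otherwise encoded, in which case such a search would not find them.

```python
import os


def files_containing(root, needle):
    """Return relative paths of files under root whose raw bytes contain needle."""
    needle_bytes = needle.encode("utf-8")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                if needle_bytes in fh.read():
                    hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Comparing checksums across two builds is therefore a more reliable check than searching for the path itself.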

So, to have a reproducible build for things that ultimately go into a container layer, --built-timestamp, R_MAKEVARS_USER and the package path (e.g., R CMD INSTALL ) must all be constant.
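
One way to keep R_MAKEVARS_USER constant is to generate the Makevars file with fixed contents at a fixed location and point the build environment at it. The sketch below is an assumption-laden illustration: the path, the helper names and the flag values are hypothetical, and the thread does not say which flags were actually used (`-g0` is the GCC option that omits debug information).

```python
import os

# Hypothetical fixed location and flags, chosen for illustration only.
MAKEVARS_PATH = "/tmp/r_reproducible_build/Makevars"
DEFAULT_FLAGS = {"CFLAGS": "-O2 -g0", "CXXFLAGS": "-O2 -g0"}


def makevars_text(flags=DEFAULT_FLAGS):
    """Render the flags as Makevars assignments in a stable (sorted) order."""
    return "".join(f"{name} = {value}\n" for name, value in sorted(flags.items()))


def write_makevars(path=MAKEVARS_PATH, flags=DEFAULT_FLAGS):
    """Write the Makevars file and return its path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as fh:
        fh.write(makevars_text(flags))
    return path


def build_env(makevars_path, base_env=None):
    """Return a copy of the environment with R_MAKEVARS_USER set."""
    env = dict(os.environ if base_env is None else base_env)
    env["R_MAKEVARS_USER"] = makevars_path
    return env
```

The resulting environment would be passed as `env=` to whatever subprocess call runs R CMD INSTALL.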

@siddharthab
Collaborator Author

Resolved as much as possible in 5bb812b.

See full commit message for details and caveats.

@jayvdb

jayvdb commented Aug 12, 2019

I've noticed that in openSUSE RPMs, and apparently also in Fedora RPMs, the builds are not reproducible, so the tricks used here haven't made their way into R or other build systems. I haven't checked Debian yet. I did notice that https://salsa.debian.org/reproducible-builds/diffoscope/commit/4d31312 is adding analysis of R packages, esp. the files which embed timestamps and paths.

Is there any ongoing effort to have R support reproducible builds?

@siddharthab
Collaborator Author

siddharthab commented Aug 13, 2019

It is not clear from your message whether you are building with Bazel. This project is an extension to the Bazel build system.

These rules should have reproducible builds, at least from R 3.4 onwards.

If you are building outside of Bazel, use at least R 3.6 and pass the --built-timestamp flag when building. I have not tested it, but it should get you further. For packages with native code, you will also need to set some C flags.

@jayvdb

jayvdb commented Aug 13, 2019

Hiya @siddharthab, I am referring to the general problem of R reproducible builds, where these Bazel rules appear to be trail-blazing.

--built-timestamp helps, but I couldn't find any built-in R install mechanism to avoid the varying paths in the .rdb/.rdx files. Ideally we find a way to get your solution here merged into R core.

openSUSE/build-compare#34 takes the opposite approach to what you have done here: it ignores the specific items that change in every build, so that they don't replace the existing 'identical' build artifacts.
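
To illustrate the "ignore what always changes" approach in general terms, here is a sketch that compares two DESCRIPTION files after dropping volatile fields. It is not build-compare's actual implementation; the helper names and the choice of `Built` and `Packaged` as the volatile fields are assumptions.

```python
# Fields assumed to vary between otherwise identical builds.
VOLATILE_FIELDS = ("Built", "Packaged")


def normalize_description(text, volatile=VOLATILE_FIELDS):
    """Drop volatile fields, including their continuation lines."""
    kept = []
    skipping = False
    for line in text.splitlines():
        if line[:1] in (" ", "\t"):
            # Continuation line: belongs to the previous field.
            if not skipping:
                kept.append(line)
            continue
        field = line.split(":", 1)[0]
        skipping = field in volatile
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


def same_ignoring_volatile(text_a, text_b):
    """True if two DESCRIPTION texts match once volatile fields are removed."""
    return normalize_description(text_a) == normalize_description(text_b)
```

Unlike fixing the inputs, this only hides the differences at comparison time; the artifacts themselves still differ byte for byte.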

@siddharthab
Collaborator Author

I thought staged installs in R 3.6 solved the problem of hard-coded paths. But I suppose the stage directory itself is not constant. R will simply need to accept a user setting for the stage directory prefix to get complete reproducibility. I suppose it can be brought up on the r-devel mailing list.
