[substrate-vm] Feature request: reproducible builds #291

Open
ianopolous opened this issue Jan 28, 2018 · 8 comments

@ianopolous ianopolous commented Jan 28, 2018

I couldn't find any information on this, but would it be hard to allow reproducible builds with substrate?

If that could be an option from early on that would be amazing.

Contributor

@vjovanov vjovanov commented Jan 30, 2018

By reproducible, do you mean that binaries built from the same JARs with the same flags should always be identical? If so, here is the list of reasons they currently are not:

  1. HotSpot gives random identifiers to generated classes, e.g., lambdas.
  2. The analysis and compilation are parallel so all counters and static initializers are non-deterministically mutated.
  3. We build our images in an image build server in order to improve performance. This could also cause some non-determinism.

One thing I tried is to build with -H:NumberOfThreads=1 without a build server. This did not help even for the hello world image.
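
To make the first point concrete, here is a minimal sketch that prints the runtime class name the JVM generates for a lambda. The class name `LambdaNames` is hypothetical and not part of substrate-vm, and the exact name format is a JDK implementation detail that differs between versions, so the example prints the name instead of assuming it.

```java
import java.util.function.Supplier;

class LambdaNames {
    // Returns the name of the class the JVM generates at run time for a lambda.
    // The suffix after the host class name is chosen by the JVM, not by the source code.
    static String lambdaClassName() {
        Supplier<String> s = () -> "hello";
        return s.getClass().getName();
    }

    public static void main(String[] args) {
        // Compare the output across JDK versions or runs to see how stable the name is.
        System.out.println(lambdaClassName());
    }
}
```

Any such generated name that ends up in the image would have to be normalized or derived deterministically for two builds to match.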

What is the use-case for reproducible builds?

@vjovanov vjovanov self-assigned this Jan 30, 2018

Author

@ianopolous ianopolous commented Jan 30, 2018

The use case is safe distribution of binaries. Having a verifiable build (the same compiler version and config => bit-identical output) means you can prove that a given binary was built from a given version of the source. This is the reason the Debian project has mostly moved to reproducible builds. https://wiki.debian.org/ReproducibleBuilds

More on the general motivation can be read here: https://reproducible-builds.org/
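
As a rough sketch of what "verifiable" means in practice: anyone can rebuild from source and compare a cryptographic digest of their output with the published binary. The class and method names below (`BuildVerifier`, `sha256Hex`, `sameBuild`) are hypothetical; only the standard `java.security.MessageDigest` API is used.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class BuildVerifier {
    // Hex-encoded SHA-256 digest of the given bytes.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Two builds count as the same only if their digests match exactly.
    static boolean sameBuild(byte[] published, byte[] rebuilt) {
        return sha256Hex(published).equals(sha256Hex(rebuilt));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: BuildVerifier <published-binary> <rebuilt-binary>");
            return;
        }
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(sameBuild(a, b) ? "identical" : "different");
    }
}
```

The check is only meaningful if the build itself is deterministic; a single differing byte changes the digest.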

Random identifiers aren't inherently a problem if you use a fixed seed (I presume these IDs in HotSpot don't need to be cryptographically random). So fixing the random seeds and using a single thread might get most of the way there? One also needs to be careful with things like hash maps that are used to write out mappings, as their iteration order is not stable.
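
A minimal sketch of both ideas, using only `java.util`; the class `DeterministicOutput` and its methods are hypothetical illustrations, not code from HotSpot or substrate-vm. A seeded `java.util.Random` yields the same sequence on every run, and copying a map into a `TreeMap` before writing it out gives a stable order regardless of how the original map iterates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class DeterministicOutput {
    // With a fixed seed, java.util.Random produces the same sequence on every run.
    static List<Integer> ids(long seed, int count) {
        Random random = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(random.nextInt(1_000_000));
        }
        return out;
    }

    // HashMap iteration order depends on hash codes and table size; for keys that
    // rely on identity hash codes it can differ between runs. Sorting by key via
    // TreeMap makes the written order independent of that.
    static String serialize(Map<String, Integer> mapping) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : new TreeMap<>(mapping).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ids(42L, 3).equals(ids(42L, 3)));
        System.out.print(serialize(Map.of("b", 2, "a", 1)));
    }
}
```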

Contributor

@vjovanov vjovanov commented Jan 30, 2018

Due to the dependency on HotSpot, implementing this feature will not be trivial. Luckily, we already have an ongoing project related to profile-guided optimizations that removes some non-determinism in images.

This is a good feature request. We will have to prioritize it and schedule it together with other requests. If you need reproducible images for a concrete project, and this is a blocker, we can raise the priority of this issue.

Contributor

@neomatrix369 neomatrix369 commented Jun 23, 2018

Am I reading the message right here: that building JVMCI, Graal, or the GraalVM suite in parallel mode with the mx build tool might not be trivial? If not, please let me know so I can discuss this separately.

@concavelenz concavelenz commented Jan 11, 2019

Just want to note this is also important for us for Closure Compiler.

Author

@ianopolous ianopolous commented Sep 12, 2019

@vjovanov @thomaswue I just wanted to check in and see if there has been any progress on this. We would love to start using this for our releases in peergos and this is the last remaining issue.

Member

@thomaswue thomaswue commented Sep 14, 2019

One main obstacle to supporting this feature is that the values ending up in the image heap might be based on unpredictable input. For example, there could be a hash map whose element placement relies on System#identityHashCode, which is based on random values, or a static initializer that stores a value derived from System#currentTimeMillis.

One way to solve this would be to output a human-readable intermediate file describing the results of the closed-world analysis, including the reachable methods as well as the image heap objects. This file could then be used to generate the exact same binary multiple times.

Would this address your use cases?

Author

@ianopolous ianopolous commented Sep 16, 2019

Hi @thomaswue, thank you very much for the update. I think the fundamental success criterion is a reproducible way to get the binary from something a human can easily derive from the source. I'm not sure that would be true of such an intermediate format, especially the heap dump?

If the sources of non-determinism are within the user's code, then that's their responsibility, so we are only concerned about sources within GraalVM itself.

The identity hash code is based on the address in memory, so making all execution single-threaded, and maybe disabling the GC, should give deterministic addresses within a process. For example, running a hello world that prints new Object().hashCode() gives the same value each run.

If something in the JVM is being initialized with the time, then I'd be happy to supply the time as a parameter. Literally making the time constant may break some things, so we could make each subsequent call advance the time by a fixed amount if necessary.
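
A minimal sketch of that idea as a `java.time.Clock` subclass: the start time is passed in as a parameter and every read advances it by a fixed step. `SteppingClock` is a hypothetical name, and this only affects code that reads time through the clock object; it does not intercept `System.currentTimeMillis()`, which would need support inside the image builder itself.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.concurrent.atomic.AtomicLong;

class SteppingClock extends Clock {
    private final AtomicLong nextMillis; // value returned by the next read
    private final long stepMillis;       // fixed amount added after each read
    private final ZoneId zone;

    SteppingClock(long startMillis, long stepMillis) {
        this(new AtomicLong(startMillis), stepMillis, ZoneOffset.UTC);
    }

    private SteppingClock(AtomicLong nextMillis, long stepMillis, ZoneId zone) {
        this.nextMillis = nextMillis;
        this.stepMillis = stepMillis;
        this.zone = zone;
    }

    @Override
    public ZoneId getZone() {
        return zone;
    }

    @Override
    public Clock withZone(ZoneId newZone) {
        // Shares the counter so both views advance the same timeline.
        return new SteppingClock(nextMillis, stepMillis, newZone);
    }

    @Override
    public long millis() {
        // Returns the current value, then advances by the fixed step.
        return nextMillis.getAndAdd(stepMillis);
    }

    @Override
    public Instant instant() {
        return Instant.ofEpochMilli(millis());
    }

    public static void main(String[] args) {
        Clock clock = new SteppingClock(1_000L, 10L);
        System.out.println(clock.millis());
        System.out.println(clock.millis());
    }
}
```

With the same start value and step, the same sequence of reads yields the same timestamps on every build, while time still moves forward for code that expects it to.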
