Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8276400: openjdk image Jars, Zips and Jmods built from the same source are not reproducible #6268

Closed

Conversation

andrew-m-leonard
Copy link

@andrew-m-leonard andrew-m-leonard commented Nov 4, 2021

This PR enables reproducible Jars, Jmods and openjdk image zip files (eg.src.zip).
It provides support for SOURCE_DATE_EPOCH for Jar, Jmod and underlying ZipOutputStream's.
It fixes the following keys issues relating to reproducibility:

  • Jar and ZipOutputStream are not SOURCE_DATE_EPOCH aware
    • Jar and ZipOutputStream now detect SOURCE_DATE_EPOCH environment setting
  • Jar and Jmod file content ordering was non-determinsitic
    • Fixes to Jar and Jmod main's to ensure sorted classes content ordering
  • openjdk image zip file generation used "zip" which is non-determinsitic
    • New openjdk build tool "GenerateZip" which produces the final determinsitic zip files as part of the build and also detects SOURCE_DATE_EPOCH

Signed-off-by: Andrew Leonard anleonar@redhat.com


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Integration blocker

 ⚠️ The change requires a CSR request to be approved.

Issue

  • JDK-8276400: openjdk image Jars, Zips and Jmods built from the same source are not reproducible

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6268/head:pull/6268
$ git checkout pull/6268

Update a local copy of the PR:
$ git checkout pull/6268
$ git pull https://git.openjdk.java.net/jdk pull/6268/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6268

View PR using the GUI difftool:
$ git pr show -t 6268

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6268.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Nov 4, 2021

👋 Welcome back aleonard! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@andrew-m-leonard
Copy link
Author

@magicus fyi, please review if you have a moment, thanks

@openjdk
Copy link

openjdk bot commented Nov 4, 2021

@andrew-m-leonard The following labels will be automatically applied to this pull request:

  • build
  • compiler
  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

…e are not reproducible

Signed-off-by: Andrew Leonard <anleonar@redhat.com>
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 4, 2021
@mlbridge
Copy link

mlbridge bot commented Nov 4, 2021

Webrevs

@jddarcy
Copy link
Member

jddarcy commented Nov 4, 2021

/csr needed

@jddarcy
Copy link
Member

jddarcy commented Nov 4, 2021

Please file a CSR for the behavioral change of recognizing SOURCE_DATE_EPOCH, etc. Thanks.

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Nov 4, 2021
@openjdk
Copy link

openjdk bot commented Nov 4, 2021

@jddarcy has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@andrew-m-leonard please create a CSR request for issue JDK-8276400. This pull request cannot be integrated until the CSR request is approved.

@andrew-m-leonard
Copy link
Author

@jddarcy thanks, I did wonder that, i'll start a CSR, thanks

@andrew-m-leonard
Copy link
Author

Copy link
Contributor

@mbien mbien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few minor comments

Comment on lines +91 to +96
FileOutputStream out = new FileOutputStream(fname);
boolean createOk = create(new BufferedOutputStream(out, 4096));
if (ok) {
ok = createOk;
}
out.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be a try-with-resource block

Comment on lines +189 to +191
Iterator<Map.Entry<String, Path>> itr = filesToProcess.entrySet().iterator();
while(itr.hasNext()) {
Map.Entry<String, Path> entry = itr.next();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be for (Map.Entry<String, Path> entry : filesToProcess.entrySet())

Comment on lines +258 to +262
InputStream is = new BufferedInputStream(new FileInputStream(file));
while ((len = is.read(buf, 0, buf.length)) != -1) {
zos.write(buf, 0, len);
}
is.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resource candidate + in.transferTo(zos)

Comment on lines +286 to +291
ArrayList<String> newArgs = new ArrayList<String>(args.length);
for (int i = 0; i < args.length; i++) {
String arg = args[i];
newArgs.add(arg);
}
return newArgs.toArray(new String[newArgs.size()]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something might be missing here. It just adds args to a list and makes it an array again + If the language level allows it toArray(String[]::new) could be used.

Comment on lines +797 to +799
Iterator<Map.Entry<String, Path>> itr = filesToProcess.entrySet().iterator();
while(itr.hasNext()) {
Map.Entry<String, Path> entry = itr.next();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another for-each candidate

Comment on lines +93 to +106
long value;
String env = System.getenv("SOURCE_DATE_EPOCH");
if (env != null) {
try {
value = Long.parseLong(env);
// SOURCE_DATE_EPOCH is in seconds
value *= 1000;
} catch(NumberFormatException e) {
value = -1;
}
} else {
value = -1;
}
return new Long(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value can be returned here directly which will use auto-boxing. (new Long(long) is deprecated).

Comment on lines +56 to +71
OutputStream os = new FileOutputStream(f);
ZipOutputStream zos = new ZipOutputStream(os);
try {
zos.putNextEntry(new ZipEntry("Entry1.txt"));
zos.write(TEST_DATA.getBytes());
zos.closeEntry();
zos.putNextEntry(new ZipEntry("Entry2.txt"));
zos.write(TEST_DATA.getBytes());
zos.closeEntry();
zos.putNextEntry(new ZipEntry("Entry3.txt"));
zos.write(TEST_DATA.getBytes());
zos.closeEntry();
} finally {
zos.close();
os.close();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resource

Comment on lines +76 to +97
FileInputStream fis = new FileInputStream(f);
ZipInputStream zis = new ZipInputStream(fis);
try {
long now = System.currentTimeMillis();
ZipEntry entry;
while ((entry = zis.getNextEntry()) != null) {
if (sourceDateEpochMillis != -1) {
// SOURCE_DATE_EPOCH set, check time correct
if (entry.getTime() != sourceDateEpochMillis) {
throw new AssertionError("ZipEntry getTime() "+entry.getTime()+" not equal to SOURCE_DATE_EPOCH "+sourceDateEpochMillis);
}
} else {
// SOURCE_DATE_EPOCH not set, check time is current created within the last 60 seconds
if (entry.getTime() < (now-60000) || entry.getTime() > now) {
throw new AssertionError("ZipEntry getTime() "+entry.getTime()+" is not the current time "+now);
}
}
}
} finally {
zis.close();
fis.close();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resource

@@ -0,0 +1,299 @@
/*
* Copyright (c) 2021, 2021, Oracle and/or its affiliates. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @andrew-m-leonard, for a new file, the year should just be stated once. So just 2021,

@mlbridge
Copy link

mlbridge bot commented Nov 5, 2021

Mailing list message from Alan Bateman on build-dev:

On 04/11/2021 21:16, Andrew Leonard wrote:

This PR enables reproducible Jars, Jmods and openjdk image zip files (eg.src.zip).
It provides support for SOURCE_DATE_EPOCH for Jar, Jmod and underlying ZipOutputStream's.
It fixes the following keys issues relating to reproducibility:
- Jar and ZipOutputStream are not SOURCE_DATE_EPOCH aware
- Jar and ZipOutputStream now detect SOURCE_DATE_EPOCH environment setting
- Jar and Jmod file content ordering was non-determinsitic
- Fixes to Jar and Jmod main's to ensure sorted classes content ordering
- openjdk image zip file generation used "zip" which is non-determinsitic
- New openjdk build tool "GenerateZip" which produces the final determinsitic zip files as part of the build and also detects SOURCE_DATE_EPOCH

I think this is going to require discussion, a PR may be too premature.
Is your goal to get the JDK build and run-time images creates with jlink
to use the SOURCE_DATE_EPOCH? Are you looking for project builds that
use the jar/jmod/etc. tools to use it? Or are you looking to have every
library and application that uses the java.util.zip APIs to read it? If
it includes the latter, and the patch in the PR suggests you are, then
it will require analysis to work through the API spec changes.

One suggestion is to look at JDK-8231640 and PR 5372 [1]. The original
proposal was to use SOURCE_DATE_EPOCH. After a lengthy discussion we
converged on the system property java.properties.date rather than the
env variable. I suspect that much of that discussion applies here.

-Alan

[1] https://github.com//pull/5372

Copy link
Contributor

@LanceAndersen LanceAndersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your efforts here Andrew.

A few comments in addition to what have already been mentioned by others on a very quick scan.

For new tests, please consider using TestNG.

@@ -775,6 +790,9 @@ private void expand(File dir, String[] files, Set<String> cpaths, int version)
if (files == null)
return;

// Ensure files list is sorted for reproducible jar content
Arrays.sort(files);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had a chance to measure the performance with a large number of Zip entries with this change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I haven't, but my thoughts on this were assuming you had a zip with many 1000s of ZipEntries the file I/O would be far more significant. Also, you will note this is not sorting the whole set, just within each directory, so the sort won't be complex, unless you had 1000s of files in a single directory. The "non-determinism" comes from the File.list() implementation which uses OS file listing, whose order is non-deterministic.

@@ -0,0 +1,89 @@
#!/bin/sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a specific reason to use a shell script, it would be better to convert this to java. We have been trying to reduce the usage of shell scripts

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had copied an existing example, I obviously chose a bad example! I can re-write with ProcessBuilder... cheers

@andrew-m-leonard
Copy link
Author

I think this is going to require discussion, a PR may be too premature. Is your goal to get the JDK build and run-time images creates with jlink to use the SOURCE_DATE_EPOCH? Are you looking for project builds that use the jar/jmod/etc. tools to use it? Or are you looking to have every library and application that uses the java.util.zip APIs to read it? If it includes the latter, and the patch in the PR suggests you are, then it will require analysis to work through the API spec changes.

One suggestion is to look at JDK-8231640 and PR 5372 [1]. The original proposal was to use SOURCE_DATE_EPOCH. After a lengthy discussion we converged on the system property java.properties.date rather than the env variable. I suspect that much of that discussion applies here.

-Alan

[1] https://github.com/[/pull/5372](https://github.com/openjdk/jdk/pull/5372)

@AlanBateman Hi Alan, thank you for the comments. So yes, totally agree this has a way to go, I thought creating a PR would kick things off...! So just to clarify the initial objectives which you raise above. My current aim is actually purely an OpenJDK JDK image perspective, building on the work @magicus has been doing with SOURCE_DATE_EPOCH in the build, hence my natural approach to this PR. So from that perspective, it's purely a "build" change. However, the openjdk build obviously uses openjdk Jar, Jmod tooling to build itself, hence the direction I took in this PR.

Magnus's PR #5372 is useful thank you, I had not discovered that change, as you say the direction there is similar to mine, and I understand the potential pitfalls of adding doPrivileged's etc. So that resolved to a new system property java.properties.date, which makes sense.

So with that in mind, and given the objective of JDK image reproducibility, this would imply presumably an approach of a new cmd line option to Jar and Jmod tools (eg.--source-date ), since these tools are cmds?

One of the key changes in this PR is to ZipOutputStream where it sets the ZipEntry.xdostime if not set, to the current time. Now I am thinking given the "objective", if we fix all upstream tooling to ensure they set ZipEntry times, using "--source-date" (or whatever mechanism), then we can then avoid changing ZipOutputStream I believe. JDK tools like jar, jmod generate/update extra files like Manifests, and in fact I think the current jmod implementation writes all its content using the current time.

So something along these lines:
jar -cf myjar.jar --source-date "" *
jmod create --source-date "" ... my.jmod
The jar, jmod,.. tooling then ensure all ZipEntry times are set to "source-date".

I chose "source-date" name based on being synonymous with SOURCE_DATE_EPOCH, implying a similar "use case".

So taking this approach for JDK tooling should achieve the reproducible JDK image objective, without a java API behavior change I am hoping.
Thoughts?

Regards
Andrew

@magicus
Copy link
Member

magicus commented Nov 5, 2021

I agree with Alan's comment above. First of all, to be absolutely clear, I think this is a very worthy goal, and much overdue. I'll do my best to help get this implemented.

But.

This PR conflates two different issues. One of them is "changing the zip implementation in the JDK". This will affect all users of the JDK, and as Alan says, this will require a thorough discussion to get it right, including a proper CSR. Otoh, the benefit of doing this will affect all Java projects out there in the wild wanting to do reproducible builds, so this has actually a much higher leverage for the reproducible build community as a whole.

And, also, it is a prerequisite for issue number two, which is "use the new zip implementation in the build system". Once the former is in place, this is more or less a no-brainer. Most will come automatically, but there are some things to fix. But these require just the approval of developers in the build team; no long discussions or CSRs. (Also, when we get to this point, I believe we should get rid of the "zip" utility and write our own in Java, rather than unzipping and then re-zipping. But we can take that discussion when we get there.)

In a way, Alan is right when saying a PR is premature for issue #1. OTOH, my personal opinion is that "working code" is a good start for any discussion, as long as one is prepared to throw it all out and start over, or rewrite it over and over again, like @jaikiran did in the Property storage PR. So my suggestion is that your break out these changes to the zip and jar implementation, open a new PR, and start a discussion around them.

@magicus
Copy link
Member

magicus commented Nov 5, 2021

@andrew-m-leonard Just to be clear: changing arguments to the command line utilities is still an API change and will need a CSR.

My gut feeling is that using a system property that affect all users of ZipStream is preferable. That way a user could run like java -Djava.properties.date=$(SOURCE_DATE_EPOCH) -Djava.zipfile.date=$(SOURCE_DATE_EPOCH) ... (or whatever) and get reproducible behavior in all their code, for places that could otherwise be hard to fix.

@magicus
Copy link
Member

magicus commented Nov 5, 2021

(And we can easily make sure that we do that from the JDK build)

@mlbridge
Copy link

mlbridge bot commented Nov 5, 2021

Mailing list message from Alan Bateman on build-dev:

On 05/11/2021 11:39, Magnus Ihse Bursie wrote:

:
I agree with Alan's comment above. First of all, to be absolutely clear, I think this is a very worthy goal, and much overdue. I'll do my best to help get this implemented.

One suggestion is to separate out the issue of ordering of entries (in
zip/JAR and JMOD) and whether it would be the default. There may be
trade-offs to discuss, also whether it's limited to just new zip/JAR
files or updates to existing zip/JARs files.

-Alan

@andrew-m-leonard
Copy link
Author

@andrew-m-leonard Just to be clear: changing arguments to the command line utilities is still an API change and will need a CSR.

My gut feeling is that using a system property that affect all users of ZipStream is preferable. That way a user could run like java -Djava.properties.date=$(SOURCE_DATE_EPOCH) -Djava.zipfile.date=$(SOURCE_DATE_EPOCH) ... (or whatever) and get reproducible behavior in all their code, for places that could otherwise be hard to fix.

Thanks @magicus
Yes, I understand a CSR is required, I was trying to state there would not be a java.util.zip API behavior change necessarily.

Just to state again, and although I need to check this, I think if all JDK tooling ensures ZipEntry's had date set to whatever prior to calling ZipOutputStream.putNextEntry(e), then I think no change to ZipOutputStream would be needed (I think).
However, you do elude to the benefit to "others" of changing ZipOutputStream to use (say -Djava.zipfile.date), in that it makes it easier for consumers to make reproducible builds rather than tracking down all the code that calls it in their project.

GenerateZip as you say is slightly separate, and bespoke to the openjdk build. I did start by trying to create my own "zip", but that is actually quite involved due to how openjdk build makes zip files, and how it leverages zip functionality for things like --includes & --excludes. It is far easier actually to just construct the final deterministic zip file using this class at the end. But I agree, if someone could write a determinisitic "zip", that would be great please!
I will move GenerateZip to a separate PR, and make it take "timestamp" just like CreateSymbols already does.

Thanks again
Andrew

@andrew-m-leonard
Copy link
Author

Mailing list message from Alan Bateman on build-dev:

On 05/11/2021 11:39, Magnus Ihse Bursie wrote:

:
I agree with Alan's comment above. First of all, to be absolutely clear, I think this is a very worthy goal, and much overdue. I'll do my best to help get this implemented.

One suggestion is to separate out the issue of ordering of entries (in zip/JAR and JMOD) and whether it would be the default. There may be trade-offs to discuss, also whether it's limited to just new zip/JAR files or updates to existing zip/JARs files.

-Alan

Yes, makes sense, @LanceAndersen already picked up on that for file sorting. We can discuss in a separate PR.

So picking up on what @magicus suggested as well, what I shall do is split this into 3:

  1. GenerateZip openjdk build tool
  2. Changes to Jar and Jmod main's to ensure file ordering
  3. Jar, Jmod and/or ZipOutputStream changes to support a "source-date" specification

@magicus
Copy link
Member

magicus commented Nov 5, 2021

Sounds like a good plan. As for the build part ("GenerateZip"), this is not as important in the build as it used to be. Back in the days, before Jigsaw, we had a huge rt.jar that needed updating each time an incremental build was done. Hence the elaborate system for updating existing zip files. (Because recompressing the entire rt.jar was prohibitively slow.)

Now we instead have jmod/jlink which do not support incremental updates and are almost as slow as before (that's a bit underprioritized, as well...) but due to the modularization, all modules except java.base are fairly quick to link anyway.

So the old zip generation is only used for src.zip, afaik. (We might be using it in closed Oracle code as well for some side artifacts, but that is not relevant to the OpenJDK story, and to be honest, I'm not sure anymore.) This has no need to be quick to update incrementally. If it turns out to be too slow, we can move it out of the normal "jdk-image" target and add a "image-srczip" or whatever.

So, another way of stating this is that GenerateZip has as its main goal to facilitate creation of src.zip. And most of the logic in CreateZipArchive.gmk can basically just be thrown out.

@andrew-m-leonard
Copy link
Author

andrew-m-leonard commented Nov 5, 2021 via email

@andrew-m-leonard
Copy link
Author

@AlanBateman @magicus thank you both for your guidance. I have now split this bug into the 3 mentioned:

@andrew-m-leonard
Copy link
Author

Closing this PR to be replaced with 3 new PRs for the above bugs.

@mlbridge
Copy link

mlbridge bot commented Nov 8, 2021

Mailing list message from Alan Bateman on build-dev:

On 05/11/2021 19:15, Andrew Leonard wrote:

:
@AlanBateman @magicus thank you both for your guidance. I have now split this bug into the 3 mentioned:
- GenerateZip: https://bugs.openjdk.java.net/browse/JDK-8276743
- Jar/Jmod content ordering: https://bugs.openjdk.java.net/browse/JDK-8276764
- Jar/Jmod/ZipOutputStream timestamp option: https://bugs.openjdk.java.net/browse/JDK-8276766 **(requires CSR)**

Closing this PR to be replaced with 3 new PRs for the above bugs.

Thanks this seems a reasonable plan. When you get the 3rd item then
we'll need to discuss whether ZipOutputStream.putEntry overrides (or
not) the time stamps of entries that have a modification time.

-Alan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build build-dev@openjdk.org compiler compiler-dev@openjdk.org core-libs core-libs-dev@openjdk.org csr Pull request needs approved CSR before integration rfr Pull request is ready for review
6 participants