Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size #520

Closed

Conversation

simonis
Copy link
Member

@simonis simonis commented Oct 6, 2020

Summary

Work around wrong usage of ZipOutputStream.putNextEntry() in user code which can lead to the ZipException "invalid entry compressed size".

Motivation

In general it is not safe to directly write a ZipEntry obtained from ZipInputStream.getNextEntry(), ZipFile.entries(), ZipFile.getEntry() or ZipFile.stream() with ZipOutputStream.putNextEntry() to a ZipOutputStream and then read the entries data from the ZipInputStream and write it to the ZipOutputStream as follows:

 ZipEntry entry;
 ZipInputStream zis = new ZipInputStream(...);
 ZipOutputStream zos = new ZipOutputStream(...);
 while((entry = zis.getNextEntry()) != null) {
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }

The problem with this code is that the zip file format does not record the compression level used for deflation in its entries. In general, it doesn't even mandate a predefined compression ratio per compression level. Therefore the compressed size recorded in a ZipEntry read from a zip file might differ from the new compressed size produced by the receiving ZipOutputStream. Such a difference will result in a ZipException with the following message:

 java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)

The correct way of copying all entries from one zip file into another requires the creation of a new ZipEntry or at least resetting of the compressed size field. E.g.:

 while((entry = zis.getNextEntry()) != null) {
     ZipEntry newEntry = new ZipEntry(entry.getName());
     zos.putNextEntry(newEntry);
     zis.transferTo(zos);
 }

or:

 while((entry = zis.getNextEntry()) != null) {
     entry.setCompressedSize(-1);
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }

Unfortunately, there's a lot of user code out there which gets this wrong and uses the bad coding pattern described before. Searching for "java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)" gives ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of instances of this anti-pattern on GitHub when doing a code search for ZipEntry and putNextEntry(). E.g. Gradle 4.x wrapper task is affected as well as the latest version of the mockableAndroidJar task. I've recently fixed two occurrences of this pattern in OpenJDK (see JDK-8240333 and JDK-8240235) but there still exist more of them (e.g. test/jdk/java/util/zip/ZipFile/CopyJar.java which is there since 1999 :).

Description

So while this has clearly been a problem before, it apparently wasn't painful enough to trigger any action from the side of the JDK. However, recently quite some zlib forks with superior deflate/inflate performance have evolved. Using them with OpenJDK is quite straight-forward: one just has to configure the alternative implementations by setting LD_LIBRARY_PATH or LD_PRELOAD correspondingly. We've seen big saving by using these new zlib implementations for selected services in production and the only reason why we haven't enabled them by default until now is the problem I've just described. The reason why these new libraries uncover the described anti-pattern much more often is because their compression ratio is slightly different from that of the default zlib library. This can easily trigger a ZipException even if an application is not using a different compression levels but just a zip file created with another zlib version.

I'd therefore like to propose the following workaround for the wrong ZipOutputStream.putNextEntry() usage in user code:

  • ignore the compressed size if it was implicitly determined from the zip file and not explicitly set by calling ZipEntry.setCompressedSize().

  • Change the API-documentation of ZipOutputStream.putNextEntry() and JarOutputStream.putNextEntry() to explain the problem and why putNextEntry() will ignore the compressed size of a ZipEntry if that was set implicitely when reading that entry from a ZipFile or ZipInputStream.

Technical Details

A zip file consists of a stream of File Entries followed by a Central Directory (see here for a more detailed specification). Each File Entry is composed of a Local File Header (LFH) followed by the compressed Data and an optional Data Descriptor. The LFH contains the File Name and among other attributes the Compressed and Uncompressed size and CRC of the Data. In the case where the latter three attributes are not available at the time when the LFH is created, this fact will be recorded in a flag of the LFH and will trigger the creation of a Data Descriptor with the corresponding information right after the Data section. Finally, the Central Directory contains one Central Directory File Header (CDFH) for each entry of the zip archive. The CDFH is an extended version of the LFH and the ultimate reference for the contents of the zip archive. The redundancy between LFH and CDFH is a tribute to zip's long history when it was used to store archives on multiple floppy discs and the CDFH allowed to update the archive by only writing to the last disc which contained the Central Directory.

ZipEntries read with ZipInputStream.getNextEntry() will initially only contain the information from the LFH. Only after the next entry was read (or after ZipInputStream.closeEntry() was called explicitly), will the previously read entry be updated with the data from the Data Descriptor. ZipInputStream doesn't inspect the Central Directory at all.

On the other hand, ZipFile only queries the Central Directory for ZipEntry information so all ZipEntries returned by ZipFile.entries(), ZipFile.getEntry() and ZipFile.stream() will always instantly contain the full Compressed and Uncompressed Size and CRC information for each entry independently of the LFH contents.

Risks and Assumptions

If we choose to ignore the implicitly recorded compressed size in a ZipEntry read from a zip file when writing it to a ZipOutputStream, this will lead to zip files with incomplete information in the LFH and an additional Data Descriptor as described before. However, the result is still fully compatible to the zip file specification. It's also not unusual, because by default all new zip files created with ZipOutputStream will contain LFHs without Compressed and Uncompressed Size and CRC information and an additional Data Descriptor. Theoretically it is possible to create new zip files with ZipOutputStream class and Compressed and Uncompressed Size and CRC information in the LFH but that's complex and inefficient because it requires two steps. A first step to determine the crc and compressed size of the data and a second step to actually write the data to the ZipOutputStream (which will compress it a second time). This is because the current API offers no possibility to write already compressed data to a ZipOutputStream.

Consequently, the only straight-forward way of creating zip files from Java which have all the data in the LFH and no Data Descriptor is by copying ZipEntries from an existing zip file with the buggy method described before. This incidentally worked more or less reliable for a long time but breaks miserably when using different zlib implementations. Ignoring the implicitly set compressed size of ZipEntries can easily fix this problem.

I'm not aware of any tool which can not handle such files and if it exists it would have problems with the majority of Java created zip files anyway (e.g. all jar-files created with the jar tool have zip entries with incomplete LFH data and additional Data Descriptor).

Ignoring the implicitly set compressed size of ZipEntries has no measurable performance impact and will increase the size of zip archives which used to have the complete file information in the LFH before by 16 bytes per entry. On the other hand it will give us the freedom to use whatever zip implementation we like :)


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

Linux x64 Windows x64 macOS x64
Build ✔️ (5/5 passed) ✔️ (2/2 passed) ✔️ (2/2 passed)
Test (tier1) ✔️ (9/9 passed) ✔️ (9/9 passed) ✔️ (9/9 passed)

Issue

  • JDK-8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/520/head:pull/520
$ git checkout pull/520

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Oct 6, 2020

👋 Welcome back simonis! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Oct 6, 2020

@simonis The following labels will be automatically applied to this pull request:

  • core-libs
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added security core-libs labels Oct 6, 2020
@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch from 08f08ba to 149b705 Compare Oct 6, 2020
@simonis simonis marked this pull request as ready for review Oct 6, 2020
@openjdk openjdk bot added the rfr label Oct 6, 2020
@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 6, 2020

Webrevs

*
* Unless explictely set by calling {@link ZipEntry#setCompressedSize(long)}
* this output stream will ignore the compressed size of a {@code ZipEntry}
* and re-compute its value automatically after the associted data has been
Copy link
Contributor

@LanceAndersen LanceAndersen Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "associated" -> "associated"

@@ -182,6 +182,21 @@ public void setLevel(int level) {
* The default compression method will be used if no compression method
* was specified for the entry, and the current time will be used if
* the entry has no set modification time.
*
* The zip file format does not record the compression level used for the
Copy link
Contributor

@LanceAndersen LanceAndersen Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"zip" ->"ZIP" for consistency'

Should this refer to the Zip File Format specification? Perhaps something like:

The PKWARE ZIP file format specification specifies that the local file header and central directory header for each ZIP entry includes the compression method used to store the entry. It does not capture the compression level or ratio that is used.

* this output stream will ignore the compressed size of a {@code ZipEntry}
* and re-compute its value automatically after the associted data has been
* completely deflated.
*
Copy link
Contributor

@AlanBateman AlanBateman Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to do a bit wordsmithing here.

I think I'd drop the first two sentences and instead start a new paragraph here (

tag) with "Unless explicitly set, the output stream will ignore ..." and see how that looks.

Probably should be "the ZipEntry" rather than "a ZipEntry" to be consistent with the parameter description.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alan makes a good point. Perhaps we keep things simple and focus solely on tweaking the description of the change in behavior which is described in your last paragraph

Copy link
Member Author

@simonis simonis Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree. What about using just the last sentence (as you've proposed) in the spec section and add the other to as @implNote? O you think the last sentence will be enough?

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just go with the last sentence/paragraph. Perhaps we can further simplify the paragraph/sentence with something like:

The compressed entry size will be recalculated for compressed (DEFLATED) entries when ZipEntry::setCompressedSize has not been explicitly called on the ZipEntry.

or

The compressed (DEFLATED) entry size will be recalculated when ZipEntry::setCompressedSize has not been explicitly called on the ZipEntry.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording looks much better now

@AlanBateman
Copy link
Contributor

@AlanBateman AlanBateman commented Oct 7, 2020

/csr

@openjdk openjdk bot added the csr label Oct 7, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Oct 7, 2020

@AlanBateman has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@simonis please create a CSR request and add link to it in JDK-8253952. This pull request cannot be integrated until the CSR request is approved.

@LanceAndersen
Copy link
Contributor

@LanceAndersen LanceAndersen commented Oct 7, 2020

I think as we start to move forward with the review and CSR, we should update the bug and PR description as the change is not a workaround but a change in behavior of the implementation

@simonis simonis changed the title 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code 8253952: Refine ZipOutputStream.putNextEntry() to ignore ZipEntry's compressed size Oct 7, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Oct 7, 2020

I already changed the bug description on Alan's request and just updated the PR description to reflect that. Please feel free to propose a better description.

@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch from 149b705 to 2a9427e Compare Oct 9, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Oct 9, 2020

Thanks for your feedback. I've updated the API doc as requested and created a CSR at:

https://bugs.openjdk.java.net/browse/JDK-8254284

Can somebody please also review the CSR?

Thank you and best regards,
Volker

@simonis simonis changed the title 8253952: Refine ZipOutputStream.putNextEntry() to ignore ZipEntry's compressed size 8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size Oct 9, 2020
@@ -81,6 +81,10 @@ public JarOutputStream(OutputStream out) throws IOException {
* The current time will be used if the entry has no set modification
* time.
*
* The compressed size field will be recalculated for compressed (i.e.
Copy link
Contributor

@LanceAndersen LanceAndersen Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording is better, but not sure we should state "compressed size field" as ZipEntry.setCompressedSize refers to "Sets the size of the compressed entry data." and "the compressed size to set"

I think I would omit "field" from the above so it reads similar to "The compressed size will be..." or "The compressed size value will be..."

* The compressed size field will be recalculated for compressed (i.e.
* {@code ZipEntry.DEFLATED}) entries when {@link ZipEntry#setCompressedSize(long)}
* has not been explicitly called on the {@code ZipEntry}.
*
Copy link
Contributor

@AlanBateman AlanBateman Oct 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an alternative that might be a bit clearer to readers.

"When writing a compressed (deflated) entry, and the compressed size has not been explicitly set with the setCompressedSize method, then the compressed size written to the entry's data descriptor will be its actual compressed size."

I think we should put it after the "The default compression method .." sentence and move the sentence on the time stamp to the end.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we discuss/reference the data descriptor for a Zip entry (outside of the PKWare Zip specification) so I am not sure we should reference it in the javadoc.

Placing the sentence after "The default compression method will be used if no compression method was specified for the entry" makes sense

Copy link
Member Author

@simonis simonis Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your input. I've tried to somehow merge both suggestions :)

Did you had a chance to look at the CSR? I think somebody has to review it before I can move it to "Finalized".

Thank you and best regards,
Volker

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added myself as a reviewer based on the updates made by you and Alan

@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch from 2a9427e to 988f3ef Compare Oct 12, 2020
* the {@link ZipEntry#setCompressedSize(long)} method, then the
* compressed size will be set to the actual compressed size after
* deflation. The current time will be used if the entry has no set
* modification time.
Copy link
Contributor

@AlanBateman AlanBateman Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for combining the wording, I think this looks good. I'd probably drop "i.e." so that it's just "DEFLATED" parentheses.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might consider making "The current time..." its own paragraph separate from the compression discussion as I think it would be clearer that way.

@@ -53,6 +53,7 @@
long crc = -1; // crc-32 of entry data
long size = -1; // uncompressed size of entry data
long csize = -1; // compressed size of entry data
boolean manual_csize = false; // Only true if csize was explicitely set by a call to setCompressedSize()
Copy link
Contributor

@AlanBateman AlanBateman Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but this should probably be "csizeSet" as it's not too common to have underscore in field names.

e.flag = 8;

}
else if (!e.manual_csize) {
Copy link
Contributor

@AlanBateman AlanBateman Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the existing expression that set if csize has been set so that we don't set the flag to 8 in two branches.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that myself

private static void createZip(String zipFile) throws Exception {
File f = new File(zipFile);
f.deleteOnExit();
OutputStream os = new FileOutputStream(f);
Copy link
Contributor

@AlanBateman AlanBateman Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use try-with-resources here so that a test failure will close the file.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 12, 2020

Mailing list message from Alan Bateman on security-dev:

On 12/10/2020 12:49, Volker Simonis wrote:

:
Did you had a chance to look at the CSR? I think somebody has to review it before I can move it to "Finalized".

The updated javadoc looks good. Once Lance has seen it then I think the
CSR would be updated so that it's in sync with the proposal and then we
add ourselves as Reviewer and finalize.

-Alan

@@ -78,8 +78,12 @@ public JarOutputStream(OutputStream out) throws IOException {
* to the start of the entry data. This method will also close
* any previous entry. The default compression method will be
* used if no compression method was specified for the entry.
* The current time will be used if the entry has no set modification
* time.
* When writing a compressed (i.e. {@code ZipEntry.DEFLATED}))
Copy link
Contributor

@LanceAndersen LanceAndersen Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave as DEFLATED given it is used this way in other ZipOutputStream methods and I would also remove the "i.e."

Copy link
Contributor

@AlanBateman AlanBateman Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep it as "When writing as compressed (DEFLATED) ..." to keep the terminology consistent when linking to the setCompressedMethod.

Copy link
Contributor

@LanceAndersen LanceAndersen Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, no need for ZipEntry.DEFLATED which is what I was trying to say (sorry if that was not clear). Only suggesting ZipEntry.DEFLATED -> DEFLATED and remove the ".i.e."

Copy link
Contributor

@LanceAndersen LanceAndersen left a comment

I think we are getting close. Once the javadoc is cleaned up along with the minor changes recommended we should be in good shape and can then review the CSR

@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch from 988f3ef to 3edcf72 Compare Oct 13, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Oct 13, 2020

I hope I've addressed all your comments and suggestion with the latest PR. I've also updated the CSR to reflect the latest version of the API changes from the PR. Please have a look.

Thank you and best regards,
Volker

@@ -447,6 +450,7 @@ public long getCompressedSize() {
*/
public void setCompressedSize(long csize) {
this.csize = csize;
this.csizeSet = true;
Copy link
Contributor

@AlanBateman AlanBateman Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version of implementation changes look good.

@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch 2 times, most recently from 0e2e130 to 7f032e0 Compare Oct 13, 2020
@simonis simonis force-pushed the JDK-8253952-zip-entry-fix-simple branch from 7f032e0 to 6b570c6 Compare Oct 14, 2020
Copy link
Contributor

@LanceAndersen LanceAndersen left a comment

The changes look good. Thank you for the updates. I have kicked off mach5 jdk-tier1,jdk-tier2,jdk-tier3 runs across all of the platforms.

@LanceAndersen
Copy link
Contributor

@LanceAndersen LanceAndersen commented Oct 14, 2020

Mach5 run is clean :-)

@simonis
Copy link
Member Author

@simonis simonis commented Oct 14, 2020

Mach5 run is clean :-)

Thanks a lot Lance.
Just waiting for the CSR to get approved now.

@openjdk openjdk bot removed the csr label Oct 14, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Oct 14, 2020

@simonis This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size

Reviewed-by: lancea, alanb

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 34 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Oct 14, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Oct 15, 2020

/integrate

@openjdk openjdk bot closed this Oct 15, 2020
@openjdk openjdk bot added integrated and removed ready rfr labels Oct 15, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Oct 15, 2020

@simonis Since your change was applied there have been 43 commits pushed to the master branch:

  • 9359ff0: 8252221: Use multiple workers for Parallel GC pre-touching
  • f44fc6d: 8254734: "dead loop detected" assert failure with patch from 8223051
  • 7f73474: 8254773: Remove unimplemented ciReplay::is_loaded(Klass* klass)
  • 167c192: 8254771: Remove unimplemented ciSignature::get_all_klasses
  • 81a8ff1: 8254769: Remove unimplemented BCEscapeAnalyzer::{add_dependence, propagate_dependencies}
  • 7e5eb49: 8253402: Convert vmSymbols::SID to enum class
  • 038f58d: 8226236: [TESTBUG] win32: gc/metaspace/TestCapacityUntilGCWrapAround.java fails
  • 5194f11: 8254792: Disable intrinsic StringLatin1.indexOf until 8254790 is fixed
  • 55d760d: 8254263: Remove special_runtime_exit_condition() check from ~ThreadInVMForHandshake()
  • 03fa733: 8254777: Remove unimplemented Management::get_loaded_classes
  • ... and 33 more: https://git.openjdk.java.net/jdk/compare/9d230ea87dac793a10b256aa773d7fa2262057f0...master

Your commit was automatically rebased without conflicts.

Pushed as commit 60159cf.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs integrated security
3 participants