-
Notifications
You must be signed in to change notification settings - Fork 5.8k
8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Welcome back simonis! A progress list of the required criteria for merging this PR into |
08f08ba
to
149b705
Compare
Webrevs
|
* | ||
* Unless explictely set by calling {@link ZipEntry#setCompressedSize(long)} | ||
* this output stream will ignore the compressed size of a {@code ZipEntry} | ||
* and re-compute its value automatically after the associted data has been |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "associated" -> "associated"
@@ -182,6 +182,21 @@ public void setLevel(int level) { | |||
* The default compression method will be used if no compression method | |||
* was specified for the entry, and the current time will be used if | |||
* the entry has no set modification time. | |||
* | |||
* The zip file format does not record the compression level used for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"zip" ->"ZIP" for consistency'
Should this refer to the Zip File Format specification? Perhaps something like:
The PKWARE ZIP file format specification specifies that the local file header and central directory header for each ZIP entry includes the compression method used to store the entry. It does not capture the compression level or ratio that is used.
* this output stream will ignore the compressed size of a {@code ZipEntry} | ||
* and re-compute its value automatically after the associted data has been | ||
* completely deflated. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to do a bit wordsmithing here.
I think I'd drop the first two sentences and instead start a new paragraph here (
tag) with "Unless explicitly set, the output stream will ignore ..." and see how that looks.
Probably should be "the ZipEntry" rather than "a ZipEntry" to be consistent with the parameter description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alan makes a good point. Perhaps we keep things simple and focus solely on tweaking the description of the change in behavior which is described in your last paragraph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree. What about using just the last sentence (as you've proposed) in the spec section and add the other to as @implNote? O you think the last sentence will be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just go with the last sentence/paragraph. Perhaps we can further simplify the paragraph/sentence with something like:
The compressed entry size will be recalculated for compressed (DEFLATED) entries when ZipEntry::setCompressedSize has not been explicitly called on the ZipEntry.
or
The compressed (DEFLATED) entry size will be recalculated when ZipEntry::setCompressedSize has not been explicitly called on the ZipEntry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the wording looks much better now
/csr |
@AlanBateman has indicated that a compatibility and specification (CSR) request is needed for this pull request. |
I think as we start to move forward with the review and CSR, we should update the bug and PR description as the change is not a workaround but a change in behavior of the implementation |
I already changed the bug description on Alan's request and just updated the PR description to reflect that. Please feel free to propose a better description. |
149b705
to
2a9427e
Compare
Thanks for your feedback. I've updated the API doc as requested and created a CSR at: https://bugs.openjdk.java.net/browse/JDK-8254284 Can somebody please also review the CSR? Thank you and best regards, |
@@ -81,6 +81,10 @@ public JarOutputStream(OutputStream out) throws IOException { | |||
* The current time will be used if the entry has no set modification | |||
* time. | |||
* | |||
* The compressed size field will be recalculated for compressed (i.e. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the wording is better, but not sure we should state "compressed size field" as ZipEntry.setCompressedSize refers to "Sets the size of the compressed entry data." and "the compressed size to set"
I think I would omit "field" from the above so it reads similar to "The compressed size will be..." or "The compressed size value will be..."
* The compressed size field will be recalculated for compressed (i.e. | ||
* {@code ZipEntry.DEFLATED}) entries when {@link ZipEntry#setCompressedSize(long)} | ||
* has not been explicitly called on the {@code ZipEntry}. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an alternative that might be a bit clearer to readers.
"When writing a compressed (deflated) entry, and the compressed size has not been explicitly set with the setCompressedSize method, then the compressed size written to the entry's data descriptor will be its actual compressed size."
I think we should put it after the "The default compression method .." sentence and move the sentence on the time stamp to the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe we discuss/reference the data descriptor for a Zip entry (outside of the PKWare Zip specification) so I am not sure we should reference it in the javadoc.
Placing the sentence after "The default compression method will be used if no compression method was specified for the entry" makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your input. I've tried to somehow merge both suggestions :)
Did you had a chance to look at the CSR? I think somebody has to review it before I can move it to "Finalized".
Thank you and best regards,
Volker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added myself as a reviewer based on the updates made by you and Alan
2a9427e
to
988f3ef
Compare
* the {@link ZipEntry#setCompressedSize(long)} method, then the | ||
* compressed size will be set to the actual compressed size after | ||
* deflation. The current time will be used if the entry has no set | ||
* modification time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for combining the wording, I think this looks good. I'd probably drop "i.e." so that it's just "DEFLATED" parentheses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might consider making "The current time..." its own paragraph separate from the compression discussion as I think it would be clearer that way.
@@ -53,6 +53,7 @@ | |||
long crc = -1; // crc-32 of entry data | |||
long size = -1; // uncompressed size of entry data | |||
long csize = -1; // compressed size of entry data | |||
boolean manual_csize = false; // Only true if csize was explicitely set by a call to setCompressedSize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit but this should probably be "csizeSet" as it's not too common to have underscore in field names.
e.flag = 8; | ||
|
||
} | ||
else if (!e.manual_csize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the existing expression that set if csize has been set so that we don't set the flag to 8 in two branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about that myself
private static void createZip(String zipFile) throws Exception { | ||
File f = new File(zipFile); | ||
f.deleteOnExit(); | ||
OutputStream os = new FileOutputStream(f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use try-with-resources here so that a test failure will close the file.
Mailing list message from Alan Bateman on security-dev: On 12/10/2020 12:49, Volker Simonis wrote:
The updated javadoc looks good. Once Lance has seen it then I think the -Alan |
@@ -78,8 +78,12 @@ public JarOutputStream(OutputStream out) throws IOException { | |||
* to the start of the entry data. This method will also close | |||
* any previous entry. The default compression method will be | |||
* used if no compression method was specified for the entry. | |||
* The current time will be used if the entry has no set modification | |||
* time. | |||
* When writing a compressed (i.e. {@code ZipEntry.DEFLATED})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave as DEFLATED given it is used this way in other ZipOutputStream methods and I would also remove the "i.e."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to keep it as "When writing as compressed (DEFLATED) ..." to keep the terminology consistent when linking to the setCompressedMethod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, no need for ZipEntry.DEFLATED which is what I was trying to say (sorry if that was not clear). Only suggesting ZipEntry.DEFLATED -> DEFLATED and remove the ".i.e."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are getting close. Once the javadoc is cleaned up along with the minor changes recommended we should be in good shape and can then review the CSR
988f3ef
to
3edcf72
Compare
I hope I've addressed all your comments and suggestion with the latest PR. I've also updated the CSR to reflect the latest version of the API changes from the PR. Please have a look. Thank you and best regards, |
@@ -447,6 +450,7 @@ public long getCompressedSize() { | |||
*/ | |||
public void setCompressedSize(long csize) { | |||
this.csize = csize; | |||
this.csizeSet = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest version of implementation changes look good.
0e2e130
to
7f032e0
Compare
…y's compressed size
7f032e0
to
6b570c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good. Thank you for the updates. I have kicked off mach5 jdk-tier1,jdk-tier2,jdk-tier3 runs across all of the platforms.
Mach5 run is clean :-) |
Thanks a lot Lance. |
@simonis This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 34 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
/integrate |
@simonis Since your change was applied there have been 43 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit 60159cf. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Summary
Work around wrong usage of
ZipOutputStream.putNextEntry()
in user code which can lead to theZipException "invalid entry compressed size"
.Motivation
In general it is not safe to directly write a ZipEntry obtained from
ZipInputStream.getNextEntry()
,ZipFile.entries()
,ZipFile.getEntry()
orZipFile.stream()
withZipOutputStream.putNextEntry()
to aZipOutputStream
and then read the entries data from theZipInputStream
and write it to theZipOutputStream
as follows:The problem with this code is that the zip file format does not record the compression level used for deflation in its entries. In general, it doesn't even mandate a predefined compression ratio per compression level. Therefore the compressed size recorded in a
ZipEntry
read from a zip file might differ from the new compressed size produced by the receivingZipOutputStream
. Such a difference will result in aZipException
with the following message:The correct way of copying all entries from one zip file into another requires the creation of a new
ZipEntry
or at least resetting of the compressed size field. E.g.:or:
Unfortunately, there's a lot of user code out there which gets this wrong and uses the bad coding pattern described before. Searching for
"java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)"
gives ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of instances of this anti-pattern on GitHub when doing a code search forZipEntry
andputNextEntry()
. E.g. Gradle 4.x wrapper task is affected as well as the latest version of the mockableAndroidJar task. I've recently fixed two occurrences of this pattern in OpenJDK (see JDK-8240333 and JDK-8240235) but there still exist more of them (e.g.test/jdk/java/util/zip/ZipFile/CopyJar.java
which is there since 1999 :).Description
So while this has clearly been a problem before, it apparently wasn't painful enough to trigger any action from the side of the JDK. However, recently quite some zlib forks with superior deflate/inflate performance have evolved. Using them with OpenJDK is quite straight-forward: one just has to configure the alternative implementations by setting
LD_LIBRARY_PATH
orLD_PRELOAD
correspondingly. We've seen big saving by using these new zlib implementations for selected services in production and the only reason why we haven't enabled them by default until now is the problem I've just described. The reason why these new libraries uncover the described anti-pattern much more often is because their compression ratio is slightly different from that of the default zlib library. This can easily trigger aZipException
even if an application is not using a different compression levels but just a zip file created with another zlib version.I'd therefore like to propose the following workaround for the wrong
ZipOutputStream.putNextEntry()
usage in user code:ignore the compressed size if it was implicitly determined from the zip file and not explicitly set by calling
ZipEntry.setCompressedSize()
.Change the API-documentation of
ZipOutputStream.putNextEntry()
andJarOutputStream.putNextEntry()
to explain the problem and whyputNextEntry()
will ignore the compressed size of aZipEntry
if that was set implicitely when reading that entry from aZipFile
orZipInputStream
.Technical Details
A zip file consists of a stream of File Entries followed by a Central Directory (see here for a more detailed specification). Each File Entry is composed of a Local File Header (LFH) followed by the compressed Data and an optional Data Descriptor. The LFH contains the File Name and among other attributes the Compressed and Uncompressed size and CRC of the Data. In the case where the latter three attributes are not available at the time when the LFH is created, this fact will be recorded in a flag of the LFH and will trigger the creation of a Data Descriptor with the corresponding information right after the Data section. Finally, the Central Directory contains one Central Directory File Header (CDFH) for each entry of the zip archive. The CDFH is an extended version of the LFH and the ultimate reference for the contents of the zip archive. The redundancy between LFH and CDFH is a tribute to zip's long history when it was used to store archives on multiple floppy discs and the CDFH allowed to update the archive by only writing to the last disc which contained the Central Directory.
ZipEntries
read withZipInputStream.getNextEntry()
will initially only contain the information from the LFH. Only after the next entry was read (or afterZipInputStream.closeEntry()
was called explicitly), will the previously read entry be updated with the data from the Data Descriptor.ZipInputStream
doesn't inspect the Central Directory at all.On the other hand,
ZipFile
only queries the Central Directory forZipEntry
information so allZipEntries
returned byZipFile.entries()
,ZipFile.getEntry()
andZipFile.stream()
will always instantly contain the full Compressed and Uncompressed Size and CRC information for each entry independently of the LFH contents.Risks and Assumptions
If we choose to ignore the implicitly recorded compressed size in a
ZipEntry
read from a zip file when writing it to aZipOutputStream
, this will lead to zip files with incomplete information in the LFH and an additional Data Descriptor as described before. However, the result is still fully compatible to the zip file specification. It's also not unusual, because by default all new zip files created withZipOutputStream
will contain LFHs without Compressed and Uncompressed Size and CRC information and an additional Data Descriptor. Theoretically it is possible to create new zip files withZipOutputStream
class and Compressed and Uncompressed Size and CRC information in the LFH but that's complex and inefficient because it requires two steps. A first step to determine the crc and compressed size of the data and a second step to actually write the data to theZipOutputStream
(which will compress it a second time). This is because the current API offers no possibility to write already compressed data to aZipOutputStream
.Consequently, the only straight-forward way of creating zip files from Java which have all the data in the LFH and no Data Descriptor is by copying
ZipEntries
from an existing zip file with the buggy method described before. This incidentally worked more or less reliable for a long time but breaks miserably when using different zlib implementations. Ignoring the implicitly set compressed size ofZipEntries
can easily fix this problem.I'm not aware of any tool which can not handle such files and if it exists it would have problems with the majority of Java created zip files anyway (e.g. all jar-files created with the jar tool have zip entries with incomplete LFH data and additional Data Descriptor).
Ignoring the implicitly set compressed size of
ZipEntries
has no measurable performance impact and will increase the size of zip archives which used to have the complete file information in the LFH before by 16 bytes per entry. On the other hand it will give us the freedom to use whatever zip implementation we like :)Progress
Testing
Issue
Reviewers
Download
$ git fetch https://git.openjdk.java.net/jdk pull/520/head:pull/520
$ git checkout pull/520