8303923: ZipOutStream::putEntry should include an apiNote to indicate that the STORED compression method should be used when writing directory entries #12899

Closed

Conversation

eirbjo
Contributor

@eirbjo eirbjo commented Mar 7, 2023

ZipOutputStream currently writes directory entries using the DEFLATED compression method. This does not strictly comply with the APPNOTE.TXT specification and is also about 10x slower than using the STORED compression method.

Because of these concerns, ZipOutputStream.putNextEntry should be updated with an @apiNote recommending the use of the STORED compression method for directory entries.

Suggested CSR in the first comment.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change requires CSR request JDK-8303925 to be approved
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8303923: ZipOutStream::putEntry should include an apiNote to indicate that the STORED compression method should be used when writing directory entries
  • JDK-8303925: ZipOutStream::putEntry should include an apiNote to indicate that the STORED compression method should be used when writing directory entries (CSR)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/12899/head:pull/12899
$ git checkout pull/12899

Update a local copy of the PR:
$ git checkout pull/12899
$ git pull https://git.openjdk.org/jdk.git pull/12899/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12899

View PR using the GUI difftool:
$ git pr show -t 12899

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12899.diff

Webrev

Link to Webrev Comment

…compression method and have size and crc set to 0.
@bridgekeeper

bridgekeeper bot commented Mar 7, 2023

👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 7, 2023

@eirbjo The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Mar 7, 2023
@eirbjo
Contributor Author

eirbjo commented Mar 7, 2023

Suggested CSR:

Compatibility kind

none

Compatibility risk

minimal

Compatibility description

This is a documentation-only change

Summary

Add an @apiNote to ZipOutputStream.putNextEntry recommending that directory entries be added using the STORED compression method.

Problem

ZipOutputStream currently writes directory entries using the default DEFLATE method. This causes file data for a two-byte 'final empty' DEFLATE block to be written, followed by a 16-byte data descriptor.

This is in violation of the APPNOTE.txt specification, which mandates that directory entries MUST NOT have file data:

 4.3.8  File data

      Immediately following the local header for a file
      SHOULD be placed the compressed or stored data for the file.
      If the file is encrypted, the encryption header for the file 
      SHOULD be placed after the local header and before the file 
      data. The series of [local file header][encryption header]
      [file data][data descriptor] repeats for each file in the 
      .ZIP archive. 

      Zero-byte files, directories, and other file types that 
      contain no content MUST NOT include file data.

Additionally, benchmarks show that writing these empty DEFLATED directory entries is ~10x slower than writing an empty STORED entry.
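
For readers who want to observe the extra bytes themselves, here is a minimal sketch (not part of the CSR) that writes the same single directory entry twice, once with the stream's default method and once as STORED, and compares the archive sizes. The class and method names are hypothetical. If the description above holds, the difference is the 2-byte DEFLATE block plus the 16-byte data descriptor.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, for illustration only.
final class DirectoryEntrySizeDemo {

    /** Writes an archive holding a single directory entry and returns its bytes. */
    static byte[] zipWithDirectory(String name, boolean stored) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            ZipEntry e = new ZipEntry(name);
            if (stored) {
                // STORED entries need size and CRC-32 set before putNextEntry
                e.setMethod(ZipEntry.STORED);
                e.setSize(0);
                e.setCrc(0);
            }
            // Otherwise the stream's default method (DEFLATED) is used
            zos.putNextEntry(e);
            zos.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        int deflated = zipWithDirectory("dir/", false).length;
        int stored = zipWithDirectory("dir/", true).length;
        System.out.println("DEFLATED:   " + deflated + " bytes");
        System.out.println("STORED:     " + stored + " bytes");
        System.out.println("Difference: " + (deflated - stored) + " bytes");
    }
}
```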

While the jar command uses the STORED method for directory entries, the DEFLATE method still seems to be prevalent: an analysis of the 109 dependency jars of the Spring Petclinic project shows that 65 files had DEFLATE directories while 34 files had STORED directories.
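
The tooling behind that analysis is not shown in this thread. As a rough sketch of how such a survey could be done, the hypothetical class below counts directory entries per compression method in one archive, reading the method from each local header via ZipInputStream. Run without arguments it inspects a small built-in sample; otherwise it treats each argument as a path to a jar or ZIP file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical survey helper; not the script used for the analysis above.
final class DirectoryMethodCount {

    /** Counts directory entries per method; keys are ZipEntry.STORED or ZipEntry.DEFLATED. */
    static Map<Integer, Integer> countDirectoryMethods(InputStream in) {
        Map<Integer, Integer> counts = new TreeMap<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                if (e.isDirectory()) {
                    counts.merge(e.getMethod(), 1, Integer::sum);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return counts;
    }

    /** Builds a tiny sample archive with one DEFLATED and one STORED directory entry. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("deflated-dir/")); // default method
            zos.closeEntry();
            ZipEntry stored = new ZipEntry("stored-dir/");
            stored.setMethod(ZipEntry.STORED);
            stored.setSize(0);
            stored.setCrc(0);
            zos.putNextEntry(stored);
            zos.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("sample: "
                + countDirectoryMethods(new ByteArrayInputStream(sampleArchive())));
        }
        for (String jar : args) {
            try (InputStream in = Files.newInputStream(Paths.get(jar))) {
                System.out.println(jar + ": " + countDirectoryMethods(in));
            }
        }
    }
}
```

In the printed map, key 0 is ZipEntry.STORED and key 8 is ZipEntry.DEFLATED.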

Solution

Add an @apiNote to ZipOutputStream.putNextEntry recommending the use of the STORED compression method for directory entries. The note should include a snippet which shows the recommended configuration of a directory ZipEntry.

(As an alternative solution, putNextEntry could be updated to change the default compression method to STORED for directory entries. This was deemed too risky, since users may be depending on the ability to attach arbitrary data to directory entries.)

Specification

  • Add the following @apiNote to ZipOutputStream.putNextEntry
Index: src/java.base/share/classes/java/util/zip/ZipOutputStream.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/java.base/share/classes/java/util/zip/ZipOutputStream.java b/src/java.base/share/classes/java/util/zip/ZipOutputStream.java
--- a/src/java.base/share/classes/java/util/zip/ZipOutputStream.java	(revision f2b03f9a2c0fca853211e41a1ddf46195dd56698)
+++ b/src/java.base/share/classes/java/util/zip/ZipOutputStream.java	(revision f57735cf134469b49cd19472680aee778c245771)
@@ -191,6 +191,22 @@
      * <p>
      * The current time will be used if the entry has no set modification time.
      *
+     * @apiNote When writing a directory entry, the STORED compression method
+     * should be used and the size and CRC-32 values should be set to 0:
+     *
+     * {@snippet lang="java" :
+     *     ZipEntry e = ...;
+     *     if (e.isDirectory()) {
+     *         e.setMethod(ZipEntry.STORED);
+     *         e.setSize(0);
+     *         e.setCrc(0);
+     *     }
+     *     stream.putNextEntry(e);
+     * }
+     *
+     * This ensures strict compliance with the ZIP specification and
+     * allows optimal performance when processing directory entries.
+     *
      * @param     e the ZIP entry to be written
      * @throws    ZipException if a ZIP format error has occurred
      * @throws    IOException if an I/O error has occurred
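
The snippet in the proposed note is a fragment. A self-contained version might look like the sketch below, which wraps the recommended configuration in a helper, writes one directory and one file to an in-memory archive, and reads the entries back to show which method each one ended up with. The class, the helper names, and the entry names `docs/` and `docs/readme.txt` are made up for illustration and are not part of the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical example class built around the snippet from the apiNote.
final class StoredDirectoryExample {

    /** Creates an entry for entryName, configured as the note recommends, and starts it. */
    static void putEntry(ZipOutputStream stream, String entryName) throws IOException {
        ZipEntry e = new ZipEntry(entryName);
        if (e.isDirectory()) {              // true when the name ends with '/'
            e.setMethod(ZipEntry.STORED);
            e.setSize(0);
            e.setCrc(0);
        }
        stream.putNextEntry(e);
    }

    /** Builds a small in-memory archive: one directory and one file inside it. */
    static byte[] buildArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            putEntry(zos, "docs/");
            zos.closeEntry();
            putEntry(zos, "docs/readme.txt"); // regular file: default method applies
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads the archive back and maps each entry name to its compression method. */
    static Map<String, Integer> methodsByName(byte[] archive) {
        Map<String, Integer> methods = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                methods.put(e.getName(), e.getMethod());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return methods;
    }

    public static void main(String[] args) {
        methodsByName(buildArchive()).forEach((name, method) ->
            System.out.println(name + " -> "
                + (method == ZipEntry.STORED ? "STORED" : "DEFLATED")));
    }
}
```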

@eirbjo
Contributor Author

eirbjo commented Mar 7, 2023

Here's what the generated API note looks like:

[image: rendered API note]

@eirbjo
Contributor Author

eirbjo commented Mar 9, 2023

/issue 8303923

@eirbjo
Contributor Author

eirbjo commented Mar 9, 2023

/csr

@openjdk openjdk bot changed the title ZipOutputStream.putNextEntry should recommend STORED directory entries 8303923: ZipOutStream::putEntry should include an apiNote to indicate that the STORED compression method should be used when writing directory entries Mar 9, 2023
@openjdk

openjdk bot commented Mar 9, 2023

@eirbjo The primary solved issue for a PR is set through the PR title. Since the current title does not contain an issue reference, it will now be updated.

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Mar 9, 2023
@openjdk

openjdk bot commented Mar 9, 2023

@eirbjo this pull request will not be integrated until the CSR request JDK-8303925 for issue JDK-8303923 has been approved.

@eirbjo
Contributor Author

eirbjo commented Mar 18, 2023

Looking for reviewers for this CSR which adds an @apiNote recommending the use of the STORED compression method when writing directory entries in ZipOutputStream:

https://bugs.openjdk.org/browse/JDK-8303925

The CSR was initially written by me, then edited by Lance.

@eirbjo eirbjo marked this pull request as ready for review March 19, 2023 09:51
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 19, 2023
@mlbridge

mlbridge bot commented Mar 19, 2023

Webrevs

Contributor

@LanceAndersen LanceAndersen left a comment

Hi Eirik,

Thank you for starting the review of your proposed clarification. A couple of minor comments. Also the copyright should be changed to 2023 with your next update.

* should be used and the size and CRC-32 values should be set to 0:
*
* {@snippet lang="java" :
* ZipEntry e = ...;
Contributor

Please make this an actual value as the snippet should be valid code

Contributor Author

I initially used a directory as an example: ZipEntry e = new ZipEntry("dir/"), but then @AlanBateman suggested to make it more generic:

https://mail.openjdk.org/pipermail/core-libs-dev/2023-March/101686.html

For the note then you might want to change it to "ZipEntry e = ..." because the reader see the trailing slash after dir so it is obviously a directory.

I agree valid code would be preferable. I'm not sure how to make it valid though, since the isDirectory part suggests the path is unknown. Could an undefined local variable work?

ZipEntry e = new ZipEntry(name);

Or perhaps with a comment:

ZipEntry e = new ZipEntry(name); // name could be a file or directory

Contributor Author

I opted for ZipEntry e = new ZipEntry(entryName) for now.

* }
*
* This ensures strict compliance with the ZIP specification and
* allows optimal performance when processing directory entries.
Contributor

I think we can remove at least the first part of the sentence regarding "strict compliance", as "file data" can be interpreted as the contents of the file, and 4.1.3 of the App.Note allows for the default compression method to be DEFLATE. The intent of the apiNote is to remind developers that the use of the STORED compression method is preferred and may be more optimal.

Contributor Author

4.3.8  File data

      Immediately following the local header for a file
      SHOULD be placed the compressed or stored data for the file.
      If the file is encrypted, the encryption header for the file 
      SHOULD be placed after the local header and before the file 
      data. The series of [local file header][encryption header]
      [file data][data descriptor] repeats for each file in the 
      .ZIP archive. 

      Zero-byte files, directories, and other file types that 
      contain no content MUST NOT include file data.

My interpretation of section 4.3.8 is that 'file data' in the last sentence of this section refers to what is defined in the first sentence: 'Immediately following the local header for a file SHOULD be placed the compressed or stored data for the file.'

While I'm not sure I fully understand your interpretation, it seems you are saying that a DEFLATED entry with the two-byte 'empty final' DEFLATE block does not have file data? (Because it is just an encoding of 'no content'?)

In any case, I'm happy to remove this since, as it stands, it is a bit vague and, as we've seen, open to interpretation.
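
For context on the two-byte 'empty final' block mentioned above, the following sketch raw-deflates zero bytes of input with java.util.zip.Deflater in nowrap mode (raw DEFLATE without zlib framing, which is the form ZIP entries use) and prints the resulting bytes. The class name is hypothetical, and this only illustrates the encoding; it does not settle which reading of section 4.3.8 is correct.

```java
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical demo class, for illustration only.
final class EmptyDeflateBlock {

    /** Raw-deflates zero bytes of input and returns the compressed output. */
    static byte[] deflateNothing() {
        // nowrap = true: raw DEFLATE stream without zlib header or trailer
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            deflater.finish();                 // signal that no input will follow
            byte[] buffer = new byte[32];
            int n = deflater.deflate(buffer);  // emits only the final, empty block
            return Arrays.copyOf(buffer, n);
        } finally {
            deflater.end();                    // release native resources
        }
    }

    public static void main(String[] args) {
        byte[] out = deflateNothing();
        StringBuilder hex = new StringBuilder();
        for (byte b : out) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(out.length + " bytes: " + hex.toString().trim());
    }
}
```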

Contributor Author

I think we can remove at least the first part of the sentence regarding "strict compliance", as "file data" can be interpreted as the contents of the file, and 4.1.3 of the App.Note allows for the default compression method to be DEFLATE.

4.1.3 Data compression MAY be used to reduce the size of files
   placed into a ZIP file, but is not required.  This format supports the 
   use of multiple data compression algorithms.  When compression is used, 
   one of the documented compression algorithms MUST be used.  Implementors 
   are advised to experiment with their data to determine which of the 
   available algorithms provides the best compression for their needs.
   Compression method 8 (Deflate) is the method used by default by most 
   ZIP compatible application programs.  

I think we read 4.1.3 slightly differently. I read it as saying that WHEN data compression is used, 8 (Deflate) is the method used by default by most ZIP compatible application programs. STORED entries are not using data compression, and as such do not have any 'default compression method'. 4.1.3 does not apply to them.

@eirbjo
Contributor Author

eirbjo commented Mar 19, 2023

Also the copyright should be changed to 2023 with your next update.

Imagine if jcheck could remind people of this, reviewers would be out of a job :-)

Contributor

@LanceAndersen LanceAndersen left a comment

Thank you for making the changes. I think this looks better.

I go back and forth as to whether to include the sentence regarding performance but I think it is OK. Let's see if anyone else has thoughts prior to finalizing the CSR with this change

@eirbjo
Contributor Author

eirbjo commented Mar 21, 2023

Thank you for making the changes. I think this looks better.

I go back and forth as to whether to include the sentence regarding performance but I think it is OK. Let's see if anyone else has thoughts prior to finalizing the CSR with this change

I understand one can look at this differently.

For me, winning friends and influencing people has always seemed easier when I provide an answer for 'What's in it for me?'. Here, I wanted to flip the reader's state of mind from 'hmm!?' to 'hah!'.

@jmehrens

jmehrens commented Mar 22, 2023

The example code works without setting the compressed size on the entry?

Looks like there is a check in ZipOutputStream::putNextEntry at

"STORED entry where compressed != uncompressed size");

This must work because of the check at

@eirbjo
Contributor Author

eirbjo commented Mar 23, 2023

The example code works without setting the compressed size on the entry?

Yes, this is the minimal code required and is also how the jar tool does it.

The current behaviour does feel a bit underspecified though. In the ZipEntry and ZipOutputStream documentation, getCompressedSize documents that: 'In the case of a stored entry, the compressed size will be the same as the uncompressed size of the entry.'

Perhaps a similar note should be added to ZipEntry.setCompressedSize, documenting that this method need not be called for STORED entries.

@LanceAndersen what do you think?
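
To make the behaviour under discussion concrete, here is a small sketch that probes ZipOutputStream.putNextEntry with three STORED directory entries: one with only size and CRC-32 set, one with a compressed size that disagrees with the size, and one with no CRC-32. The class and helper names are hypothetical, and the outcomes described reflect the OpenJDK implementation as understood from this thread rather than anything the API documentation promises.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

// Hypothetical probe class, for illustration only.
final class StoredSizeCheck {

    /** Returns null if putNextEntry accepts the entry, else the ZipException message. */
    static String tryPut(ZipEntry e) {
        try (ZipOutputStream zos = new ZipOutputStream(new ByteArrayOutputStream())) {
            zos.putNextEntry(e);
            zos.closeEntry();
            return null;
        } catch (ZipException ex) {
            return ex.getMessage();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // Size and CRC-32 only: the compressed size is left unset
        ZipEntry minimal = new ZipEntry("dir/");
        minimal.setMethod(ZipEntry.STORED);
        minimal.setSize(0);
        minimal.setCrc(0);
        System.out.println("size + crc only:  " + tryPut(minimal));

        // Compressed size that disagrees with the uncompressed size
        ZipEntry mismatch = new ZipEntry("dir/");
        mismatch.setMethod(ZipEntry.STORED);
        mismatch.setSize(0);
        mismatch.setCompressedSize(5);
        mismatch.setCrc(0);
        System.out.println("csize != size:    " + tryPut(mismatch));

        // No CRC-32 at all
        ZipEntry noCrc = new ZipEntry("dir/");
        noCrc.setMethod(ZipEntry.STORED);
        noCrc.setSize(0);
        System.out.println("missing crc:      " + tryPut(noCrc));
    }
}
```

A null result means the entry was accepted; the other two cases are expected to be rejected with a ZipException.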

@LanceAndersen
Contributor

The example code works without setting the compressed size on the entry?

Yes, this is the minimal code required and is also how the jar tool does it.

The current behaviour does feel a bit underspecified though. In the ZipEntry and ZipOutputStream documentation, getCompressedSize documents that: 'In the case of a stored entry, the compressed size will be the same as the uncompressed size of the entry.'

Perhaps a similar note should be added to ZipEntry.setCompressedSize, documenting that this method need not be called for STORED entries.

@LanceAndersen what do you think?

More thought needs to be given to a clarification, as any validation of the ZipEntry values once set is done in ZipOutputStream. There is some validation done in ZipEntry::setSize and ZipEntry::setCrc, but nothing to validate the note you point out in ZipEntry::getCompressedSize.

So yes, we should probably add further clarification, but let's address that as a separate issue.

@eirbjo
Contributor Author

eirbjo commented Apr 6, 2023

I moved the CSR for this PR into the Proposed state. The CSR is now looking for reviewers.

@openjdk

openjdk bot commented Apr 12, 2023

@eirbjo This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8303923: ZipOutStream::putEntry should include an apiNote to indicate that the STORED compression method should be used when writing directory entries

Reviewed-by: lancea, alanb

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 573 new commits pushed to the master branch:

  • 2bbbff2: 8305858: Resolve multiple definition of 'handleSocketError' when statically linking with JDK native libraries
  • bc15163: 8304834: Fix wrapper insertion in TestScaffold.parseArgs(String args[])
  • 19380d7: 8305324: C2: Wrong execution of vectorizing Interger.reverseBytes
  • 87017b5: 8295859: Update Manual Test Groups
  • 99a9dbc: 8305783: x86_64: Optimize AbsI and AbsL
  • d8af7a6: 8304725: AsyncGetCallTrace can cause SIGBUS on M1
  • b9bdbe9: 8305524: AArch64: Fix arraycopy issue on SVE caused by matching rule vmask_gen_sub
  • 82e8b03: 8305203: Simplify trimming operation in Region::Ideal
  • 27cf638: 8300912: Update java/nio/MappedByteBuffer/PmemTest.java to run on x86_64 only
  • 42fa000: 8305484: Compiler::init_c1_runtime unnecessarily uses an Arena that lives for the lifetime of the process
  • ... and 563 more: https://git.openjdk.org/jdk/compare/f2b03f9a2c0fca853211e41a1ddf46195dd56698...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@LanceAndersen, @AlanBateman) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed csr Pull request needs approved CSR before integration labels Apr 12, 2023
@eirbjo
Contributor Author

eirbjo commented Apr 12, 2023

/integrate

This PR is now approved and integrated and is looking for kind sponsors.

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Apr 12, 2023
@openjdk

openjdk bot commented Apr 12, 2023

@eirbjo
Your change (at version 24d2afc) is now ready to be sponsored by a Committer.

@LanceAndersen
Contributor

/sponsor

@openjdk

openjdk bot commented Apr 12, 2023

Going to push as commit 425ef06.
Since your change was applied there have been 573 commits pushed to the master branch (the same commits listed in the pre-integration comment above).

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 12, 2023
@openjdk openjdk bot closed this Apr 12, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Apr 12, 2023
@openjdk

openjdk bot commented Apr 12, 2023

@LanceAndersen @eirbjo Pushed as commit 425ef06.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels
core-libs core-libs-dev@openjdk.org integrated Pull request has been integrated
4 participants