-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8339874: Avoid duplicate checking of trailing slash in ZipFile.getZipEntry #20939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back eirbjo! A progress list of the required criteria for merging this PR into |
|
@eirbjo This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 43 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Webrevs
|
| } | ||
|
|
||
| boolean hasTrailingSlash(byte[] a, int end) { | ||
| protected boolean hasTrailingSlash(byte[] a, int end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you making these protected? ZipCoder is package-private so any inheritors must be in the same package, which means they already have access to package-private methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since hasTrailingSlash is now only used from UTFCoder and subclasses, I thought we could stricten the access to protected. But as I learned, protected is still accessible from the same package.
On closer inspection though, hasTrailingSlash is now used only from UTF8ZipEncoder.compare. So we can actually make that implementation private and remove the now-unused implementation from the base class, along with the slashBytes logic. I have pushed these changes.
What do you think?
| e = new ZipEntry(name); | ||
| } | ||
| e.flag = CENFLG(cen, pos); | ||
| e.method = CENHOW(cen, pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not reading nlen when it's not needed is a good change, and moving clen and elen down to be grouped with the others is fine, but some of the other shuffling around here doesn't seem motivated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reordering of field reads was motivated by some previous experiences where ordering the reads sequentially with their appearance in the CEN a had small, but positive performance gain.
After closer investigation, it turns out that the internal ordering of field reads only has small benefits, maybe in the noise. However, reading the length fields before the allocation of the ZipEntry has a significant negative impact. In fact, it seems to explain most of the performance gains in this PR.
It seems that having ZipEntry allocation interspersed within the CEN field reads incurs a significant cost. I can't explain why, but perhaps @cl4es can? (To reproduce, simply move the length reads to the beginning of the method)
I have kept the reordering of nlen, elen, clen reads, but reverted some other reorderings to make the PR cleaner.
This PR:
Benchmark (size) Mode Cnt Score Error Units
ZipFileGetEntry.getEntryHit 512 avgt 15 52.057 ? 2.021 ns/op
ZipFileGetEntry.getEntryHit 1024 avgt 15 54.753 ? 1.093 ns/op
Reads length fields before ZipEntry allocation:
Benchmark (size) Mode Cnt Score Error Units
ZipFileGetEntry.getEntryHit 512 avgt 15 65.199 ? 0.823 ns/op
ZipFileGetEntry.getEntryHit 1024 avgt 15 69.407 ? 0.807 ns/op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a guess but perhaps this is down to a cache effect where the ZipEntry allocation has a chance to evict the cen data array from some cache. Batching all the reads from CEN together could counter-act some such effects and better streamline memory accesses.
LanceAndersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial thoughts but overall it seems to be a beneficial change.
| // each "entry" has 3 ints in table entries | ||
| return (T)getZipEntry(null, res.zsrc.getEntryPos(i++ * 3)); | ||
| int pos = res.zsrc.getEntryPos(i++ * 3); | ||
| return (T)getZipEntry(new EntryPos(getEntryName(pos), pos)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have benchmarks covering the various types of iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized it's better to keep the signature of getZipEntry so these iteration callers won't need to construct the EntryPos. I assume calling getEntryName should have similar performance characteristic to the existing code which calls getEntryPos with a null name, since we're just seem to be moving the String decoding from one method to another.
But no, I don't think we have benchmarks for ZipEntry enumeration.
…d to wrap arguments in as EntryPos instances
…oder.hasTrailingSlash with the associated slashBytes() method and slashBytes field
…ew. Move nlen read to the common path.
LanceAndersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the clean up Eirik, I think the changes look good overall.
|
For the benefit of future maintainers, I added a code comment where This reminds me that |
cl4es
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
/integrate |
|
Going to push as commit 7f1dae1.
Your commit was automatically rebased without conflicts. |
Please review this PR which speeds up
ZipFile.getZipEntryby removing slash-checking logic which is already taking place during lookup inZipFile.Source.getEntryPos.ZipFile.Source.getEntryPosincludes logic to match a lookup for "name" against a directory entry "name/" (with a trailing slash). However, only the CEN position is currently returned, soZipFile.getZipEntryneeds to re-read the name from the CEN and determine if a trailing slash needs to be appended to the name of the returnedZipEntry.By letting
ZipFile.Source.getEntryPosreturn the resolved name along with the CEN position (in a new recordEntryPos),ZipFile.getZipEntrycan now instead use the already resolved name.Since
ZipFile.getZipEntrynow has the name, CEN header fields can now be read in bulk, separate from the allocation of theZipEntry. This reordering is unlocked by the other changes in this PR and can alone explain a lot of the performance gains, probably because of better cache use.This results in a nice ~18% speedup in the
ZipFileGetEntry.getEntryHitmicro:Baseline:
PR:
The changes in this PR makes
UTF8ZipCoder.comparethe only caller ofZipCoder.hasTrailingSlash, so this method is made private and the implementation in the base class retired.This purely a cleanup and optimization PR, no functional tests are changed or added.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20939/head:pull/20939$ git checkout pull/20939Update a local copy of the PR:
$ git checkout pull/20939$ git pull https://git.openjdk.org/jdk.git pull/20939/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20939View PR using the GUI difftool:
$ git pr show -t 20939Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20939.diff
Webrev
Link to Webrev Comment