6723 unzip optimizations #9503
Merged
Conversation
to run a quick check of the number of unpackable files and of the individual file sizes against the configured limits, so that we can avoid any unnecessary unzipping as well as creating temp files. (#6723)
landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation on Apr 7, 2023
rtreacy approved these changes on Apr 20, 2023
Code looks good. Tested with a couple of large Consilience zip files I had at hand.
automation moved this from Ready for Review ⏩ to Ready for QA ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) on Apr 20, 2023
kcondon moved this from Ready for QA ⏩ to QA ✅ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) on Apr 24, 2023
@landreev Ready to merge except for merge conflicts.
resolved conflicts: src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
automation moved this from QA ✅ to Done 🚀 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) on Apr 24, 2023
What this PR does / why we need it:
I was/am confused about what was expected to be done here, as a "size 3 spike" for this issue. If it was just a quick test to confirm that the issue was still present, then it appears that @kcondon has already done that (see his last comment in the issue).
I put a day+ of work into this (so, more than "3") and added a couple of optimizations.
Our file type detection tests (these are run individually on every uncompressed file) were indeed leaving N extra unclosed file descriptors behind, i.e., one for each unpacked file. That has been fixed. But please note that not all the temp files that lsof shows as open right after the upload are necessarily wasted resources. Generally, the JVM will hold onto some files that have been properly closed in the application, just in case; but it will close them as the resources are garbage-collected, and/or as more fds are needed.
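The fd leak fix follows the standard try-with-resources pattern: the stream opened for each per-file type check must be closed on every code path. A minimal sketch of the idea — the method name and the detection logic here are hypothetical stand-ins, not the actual FileUtil code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TypeCheckSketch {

    // Hypothetical stand-in for a per-file type check that, before the fix,
    // could leave its InputStream (and thus a file descriptor) open.
    static String detectType(Path file) throws IOException {
        // try-with-resources guarantees the stream -- and the underlying
        // file descriptor -- is closed even if detection throws.
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = in.readNBytes(4);
            if (header.length == 4 && header[0] == 'P' && header[1] == 'K') {
                return "application/zip";
            }
            return "application/octet-stream";
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("probe", ".zip");
        Files.write(tmp, new byte[] {'P', 'K', 3, 4});
        System.out.println(detectType(tmp));
        Files.delete(tmp);
    }
}
```

Without the try-with-resources block, each unpacked file would leave one fd behind until garbage collection, which matches the "N extra descriptors, one per file" behavior described above.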
I don't think the number of file descriptors was necessarily the worst underlying issue here. The other part was more serious: even when a zip archive contained more than the unpacking limit's number of files, we were uncompressing and saving individual temp files until that number was reached. More of a waste, IMO. It is indeed very easy to run a quick check through the zip file directory beforehand and decide whether to unpack or save as is. (Note that it's not just on account of the number of files; we also save zip files un-unpacked if one or more uncompressed files are above the individual size limit. It's just as easy to quickly check for that ahead of time too.)
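The pre-check works because a zip's central directory already lists every entry with its declared uncompressed size, so both limits can be tested without extracting anything. A sketch of that idea using `java.util.zip.ZipFile` — the limit constants and method name are hypothetical (Dataverse reads the real limits from configuration, and the actual code lives in FileUtil.java):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipPrecheckSketch {

    // Hypothetical limits, standing in for the configured Dataverse values.
    static final int MAX_FILES = 1000;
    static final long MAX_UNCOMPRESSED_SIZE = 2L * 1024 * 1024 * 1024;

    /**
     * Walks the zip directory without extracting anything and decides
     * whether the archive should be unpacked, or saved as-is.
     */
    static boolean shouldUnpack(Path zipPath) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            int fileCount = 0;
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // getSize() is the declared uncompressed size (-1 if unknown);
                // one oversized entry means the whole archive is saved as-is.
                if (entry.getSize() > MAX_UNCOMPRESSED_SIZE) {
                    return false;
                }
                // Likewise, too many files means no unpacking at all --
                // instead of extracting up to the limit and then giving up.
                if (++fileCount > MAX_FILES) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

Reading the directory this way touches only the archive's metadata, which is why the "too many files inside" decision can now be made right after upload rather than after creating hundreds of temp files.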
Which issue(s) this PR closes:
Closes #6723
Special notes for your reviewer:
Suggestions on how to test this:
lsof should show fewer open files in .../files/temp. There should never be more of these open fds than the number of files unpacked, and they should not linger long before being garbage-collected.
But, most importantly, you should see the "too many files inside, saving as is" warning much sooner, pretty much right after the upload completes.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: