Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

6723 unzip optimizations #9503

Merged
merged 5 commits into from
Apr 24, 2023
Merged

6723 unzip optimizations #9503

merged 5 commits into from
Apr 24, 2023

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Apr 7, 2023

What this PR does / why we need it:

I was/am confused about what was expected to be done here, as a "size 3 spike" for this issue. If it was just a quick test to confirm that the issue was still present, then it appears that @kcondon has already done that (see his last comment in the issue).

I put in a day+ of work into this (so, more than "3") and added a couple of optimizations.

Our file type detection tests (those are run individually on every uncompressed file) were indeed leaving N extra un-closed file descriptors behind - i.e., 1 for each unpacked file. That has been fixed. But please note that not all the temp files that lsof is showing as open right after the upload are wasted resources necessarily. Generally jvm will hold onto some files that have been properly closed in the application, just in case; but it will close them as the resources are garbage-collected, and/or as more fds are needed.
I don't think the number of file descriptors was necessarily the worst underlying issue here. The other part, that we were uncompressing and saving individual temp files, even if a zip archive contained more than the unpacking limit num. of files until that number was reached, was more serious. More of a waste, IMO. It is indeed very easy to run a quick check through the zip file directory and decide whether to unpack or save as is, beforehand. (Note that it's not just on account of the number of files, we also save zip files un-unpacked if one or more uncompressed files are above the individual size limit. It's just as easy to quickly check for that ahead of time too.

Which issue(s) this PR closes:

Closes #6723

Special notes for your reviewer:

Suggestions on how to test this:

lsof should be showing fewer open files in .../files/temp. There should never be more of these open fds than the number of files unpacked; and they should not stay around for long un-garbage collected.

But, most importantly, you should be getting the warning "too many files inside, saving as is" much sooner, right after the upload is complete pretty much.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

to run a quick check on the number of unpackable files, and the individual
file sizes against the configured limits, so that we can avoid any
unnecessary unzipping as well as creating temp files. (#6723)
@landreev landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Apr 7, 2023
@coveralls
Copy link

coveralls commented Apr 7, 2023

Coverage Status

Coverage: 20.184% (-0.0008%) from 20.184% when pulling 8fcf4ca on 6723-pre-unzip-limit-checks into e4961fe on develop.

@mreekie mreekie added the Size: 3 A percentage of a sprint. 2.1 hours. label Apr 12, 2023
@rtreacy rtreacy self-assigned this Apr 20, 2023
Copy link
Contributor

@rtreacy rtreacy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Tested with a couple of large Consilience zip files I had at hand

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Ready for Review ⏩ to Ready for QA ⏩ Apr 20, 2023
@rtreacy rtreacy removed their assignment Apr 20, 2023
@kcondon kcondon self-assigned this Apr 24, 2023
@kcondon
Copy link
Contributor

kcondon commented Apr 24, 2023

@landreev Ready to merge except for merge conflicts

resolved conflicts:
	src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
@kcondon kcondon merged commit 8125baa into develop Apr 24, 2023
8 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Apr 24, 2023
@pdurbin pdurbin added this to the 5.14 milestone May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 3 A percentage of a sprint. 2.1 hours.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

High levels of open file descriptors while uploading a zip with lots of files in it
6 participants