6723 unzip optimizations #9503

landreev · 2023-04-07T00:30:23Z

What this PR does / why we need it:

I was/am confused about what was expected to be done here, as a "size 3 spike" for this issue. If it was just a quick test to confirm that the issue was still present, then it appears that @kcondon has already done that (see his last comment in the issue).

I put in a day+ of work into this (so, more than "3") and added a couple of optimizations.

Our file type detection tests (those are run individually on every uncompressed file) were indeed leaving N extra un-closed file descriptors behind - i.e., 1 for each unpacked file. That has been fixed. But please note that not all the temp files that lsof is showing as open right after the upload are wasted resources necessarily. Generally jvm will hold onto some files that have been properly closed in the application, just in case; but it will close them as the resources are garbage-collected, and/or as more fds are needed.
I don't think the number of file descriptors was necessarily the worst underlying issue here. The other part, that we were uncompressing and saving individual temp files, even if a zip archive contained more than the unpacking limit num. of files until that number was reached, was more serious. More of a waste, IMO. It is indeed very easy to run a quick check through the zip file directory and decide whether to unpack or save as is, beforehand. (Note that it's not just on account of the number of files, we also save zip files un-unpacked if one or more uncompressed files are above the individual size limit. It's just as easy to quickly check for that ahead of time too.

Which issue(s) this PR closes:

Closes #6723

Special notes for your reviewer:

Suggestions on how to test this:

lsof should be showing fewer open files in .../files/temp. There should never be more of these open fds than the number of files unpacked; and they should not stay around for long un-garbage collected.

But, most importantly, you should be getting the warning "too many files inside, saving as is" much sooner, right after the upload is complete pretty much.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

to run a quick check on the number of unpackable files, and the individual file sizes against the configured limits, so that we can avoid any unnecessary unzipping as well as creating temp files. (#6723)

#6723)

coveralls · 2023-04-07T00:32:41Z

Coverage: 20.184% (-0.0008%) from 20.184% when pulling 8fcf4ca on 6723-pre-unzip-limit-checks into e4961fe on develop.

rtreacy

Code looks good. Tested with a couple of large Consilience zip files I had at hand

kcondon · 2023-04-24T19:25:57Z

@landreev Ready to merge except for merge conflicts

resolved conflicts: src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java

landreev added 2 commits April 5, 2023 19:53

Added a long overdue optimization to the code that handles zip uploads,

e0b2310

to run a quick check on the number of unpackable files, and the individual file sizes against the configured limits, so that we can avoid any unnecessary unzipping as well as creating temp files. (#6723)

skipped one tempfile creation step in the unpack workflow/more cleanup (

ef53e99

#6723)

landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Apr 7, 2023

landreev added 2 commits April 6, 2023 20:33

More cleanup; meant to commit this before making the pr. (#6723)

eaec098

typo. #6723

6558f62

mreekie added the Size: 3 A percentage of a sprint. 2.1 hours. label Apr 12, 2023

rtreacy self-assigned this Apr 20, 2023

rtreacy approved these changes Apr 20, 2023

View reviewed changes

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Ready for Review ⏩ to Ready for QA ⏩ Apr 20, 2023

rtreacy removed their assignment Apr 20, 2023

kcondon moved this from Ready for QA ⏩ to QA ✅ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Apr 24, 2023

kcondon self-assigned this Apr 24, 2023

Merge branch 'develop' into 6723-pre-unzip-limit-checks

8fcf4ca

resolved conflicts: src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java

kcondon merged commit 8125baa into develop Apr 24, 2023
8 checks passed

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Apr 24, 2023

pdurbin added this to the 5.14 milestone May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6723 unzip optimizations #9503

6723 unzip optimizations #9503

landreev commented Apr 7, 2023

coveralls commented Apr 7, 2023 •

edited

rtreacy left a comment

kcondon commented Apr 24, 2023

6723 unzip optimizations #9503

6723 unzip optimizations #9503

Conversation

landreev commented Apr 7, 2023

coveralls commented Apr 7, 2023 • edited

rtreacy left a comment

Choose a reason for hiding this comment

kcondon commented Apr 24, 2023

coveralls commented Apr 7, 2023 •

edited