
IQSS/6763-multi-part upload API calls #6995

Merged
merged 59 commits into IQSS:develop from GlobalDataverseCommunityConsortium:IQSS/6763 on Aug 31, 2020

Conversation

@qqmyers (Member) commented Jun 17, 2020

What this PR does / why we need it: This builds on #6994 to add the three API calls needed to initiate, and then complete or abort, a direct multipart upload to S3. It will also include analogous calls in EditDatafilePage to support multipart upload through the UI (which could potentially be pulled into another PR).

Which issue(s) this PR closes:

Closes #6763

Special notes for your reviewer:

Suggestions on how to test this: The easiest way to test the API calls will be to try large uploads using the DVUploader (v1.1.0+ has working code; no release yet). The UI support can be tested with large files via the normal direct upload UI (once that support is completed).

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No, although it will probably change how the progress bars are computed when using multiple parts.

Is there a release notes update needed for this change?: Just to announce - no config changes needed.

Additional documentation:

@coveralls commented Jun 17, 2020

Coverage Status: coverage decreased (-0.06%) to 19.482% when pulling c6d06e5 on GlobalDataverseCommunityConsortium:IQSS/6763 into cdbb7f2 on IQSS:develop.

@qqmyers (Member, Author) commented Jun 17, 2020

Putting this up as a draft so people can comment on the basic design. I've tested using the DVUploader, so I know the API works.

The design here avoids changing the basic upload process (in general, and definitely w.r.t. single-part direct upload). Part of doing that was not tracking in-progress multipart uploads in Dataverse. As is, the API call to initiate returns URLs with the information needed to upload the parts to S3 and to complete or cancel the request at the end (the complete/cancel go via Dataverse). That avoids having to keep state in Dataverse (the URLs carry the state), but relies on the client to request a cancel/complete; otherwise the partial upload stays in AWS. Nominally this is another possible form of temp file that could/should be periodically cleaned up. Alternate designs could track open multipart uploads directly, but there's no clear way to know when a client has stopped uploading, short of waiting for the upload URLs to expire plus some time for transfers to complete.

From the UI, it might make sense to keep track, so that Dataverse would know which multipart upload in AWS to cancel if the user cancels the overall upload/dataset-creation action, although this could probably be handled on the browser side in the JavaScript that manages the overall direct/multipart uploads.

As you can tell, my guess is that this design is reasonable, but others may be able to see scenarios where an alternative would be better. If so, it would be good to fix that now, since the DVUploader and other API clients would have to keep up with/support multiple versions if the API changes later.
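To make the flow described above concrete, here is a minimal client-side sketch in TypeScript. It assumes the initiation call returns a part size, presigned part-upload URLs, and abort/complete URLs; the endpoint path, JSON field names, HTTP methods, and completion body shown here are illustrative guesses, not the exact contract defined in this PR.

```typescript
// Hypothetical client for the stateless multipart flow described above.
// Endpoint paths, field names, and the completion body are illustrative.

interface MultipartUploadInfo {
  partSize: number;                 // server-chosen part size in bytes
  urls: Record<string, string>;     // part number -> presigned S3 PUT URL
  abort: string;                    // Dataverse URL for aborting this upload
  complete: string;                 // Dataverse URL for completing this upload
}

async function multipartUpload(apiBase: string, apiToken: string, file: Blob): Promise<void> {
  // 1. Initiate: the response carries all the state the client needs.
  const initResp = await fetch(`${apiBase}/uploadurls?size=${file.size}`, {
    headers: { "X-Dataverse-key": apiToken },
  });
  const info: MultipartUploadInfo = await initResp.json();

  // 2. PUT each part directly to S3, remembering the returned ETags.
  const eTags: Record<string, string> = {};
  for (const [partNumber, url] of Object.entries(info.urls)) {
    const start = (Number(partNumber) - 1) * info.partSize;
    const resp = await fetch(url, {
      method: "PUT",
      body: file.slice(start, start + info.partSize),
    });
    if (!resp.ok) {
      // 3a. On failure, ask Dataverse to abort so no partial upload lingers in S3.
      await fetch(info.abort, { method: "DELETE", headers: { "X-Dataverse-key": apiToken } });
      throw new Error(`Part ${partNumber} failed with status ${resp.status}`);
    }
    eTags[partNumber] = resp.headers.get("ETag") ?? "";
  }

  // 3b. On success, complete via Dataverse, reporting the ETag of each part.
  await fetch(info.complete, {
    method: "PUT",
    headers: { "X-Dataverse-key": apiToken, "Content-Type": "application/json" },
    body: JSON.stringify(eTags),
  });
}
```

Because the abort and complete URLs travel with the initiation response, the server never has to track the upload; the trade-off, as noted above, is that a client that disappears without calling either one leaves the partial upload in S3.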

@qqmyers qqmyers marked this pull request as ready for review July 9, 2020 21:14
@qqmyers (Member, Author) commented Jul 9, 2020

Testing with the DVUploader (the as-yet-unreleased 1.1.0 version at https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader), the new API calls appear to be working - tested up to ~40 GB and with part sizes from 5 MB (5*1024*1024 bytes) to 100 MB.

I'm going to move on to looking at how this might work in the UI as well, but I don't expect API changes will be needed for that (changes to EditDataFilesPage, probably), so it would be good to get any design/API naming issues settled now, since I have to match those in Dataverse and the DVUploader. I can put the DVUploader jar somewhere if someone wants to try it out without having to build it.

It's the max that a single-part direct upload can be (as coded), and the
minimum part size for multipart uploads.
These could be separate settings, but it's not clear that would be useful
to support. E.g. for AWS S3, the min part size is 5 MB but single uploads
can go up to 5 GB, so that is the max single upload.
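As a rough sketch of the dual role this commit message describes, the same configured limit can decide between single-part and multipart upload and size the parts. All names and the decision logic here are illustrative, not the code in this PR:

```typescript
// Illustrative only: one configured limit acts both as the single-part
// maximum and as the multipart minimum part size.
function planUpload(fileSize: number, minPartSize: number) {
  if (fileSize <= minPartSize) {
    // Small enough to send with one presigned PUT.
    return { multipart: false, parts: 1, partSize: fileSize };
  }
  // Otherwise split into parts of (at least) minPartSize bytes each.
  return {
    multipart: true,
    parts: Math.ceil(fileSize / minPartSize),
    partSize: minPartSize,
  };
}

// With a 5 MB limit: a 4 MB file goes up as a single direct upload,
// while a 7 MB file becomes a 2-part multipart upload.
```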
@djbrooke (Contributor) commented Jul 9, 2020

Thanks @qqmyers, I'd be very excited to get this in as API-only ASAP, since we have a few large upload requests in the Harvard Dataverse queue and we've been doing the unsupported db-switcheroo method. :)

qqmyers and others added 9 commits July 9, 2020 17:50
At one point I was sending parts that were below the min part size on
AWS, and everything worked until a 400 EntityTooSmall response came back
when trying to complete. This commit tries to clean up after any
similar problems (the code now stops this particular one from
happening).
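As an aside, stranded multipart uploads like the one described here can also be found and aborted directly in S3. A minimal sketch using the AWS SDK for JavaScript v3 follows; the bucket name and one-day age threshold are placeholders, and this is separate from the cleanup code added in this commit:

```typescript
import {
  S3Client,
  ListMultipartUploadsCommand,
  AbortMultipartUploadCommand,
} from "@aws-sdk/client-s3";

// Abort multipart uploads that were initiated more than a day ago and never
// completed, so their already-uploaded parts stop consuming S3 storage.
async function abortStaleUploads(bucket: string): Promise<void> {
  const s3 = new S3Client({});
  const cutoff = Date.now() - 24 * 60 * 60 * 1000;

  const { Uploads = [] } = await s3.send(
    new ListMultipartUploadsCommand({ Bucket: bucket })
  );
  for (const upload of Uploads) {
    if (upload.Initiated && upload.Initiated.getTime() < cutoff) {
      await s3.send(
        new AbortMultipartUploadCommand({
          Bucket: bucket,
          Key: upload.Key,
          UploadId: upload.UploadId,
        })
      );
    }
  }
}
```

Alternatively, an S3 bucket lifecycle rule with an AbortIncompleteMultipartUpload action can expire incomplete uploads automatically, without any client-side code.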
@qqmyers (Member, Author) commented Jul 24, 2020

Correction: since the API changes were just to the abort and complete calls, whose URLs are supplied in the JSON response to the request that starts a multipart upload, the DVUploader didn't have to change :-) I did find and fix an auth issue in the complete API call in the PR, though.

@landreev (Contributor) left a comment


Similarly, could you please sync up this branch with develop? There appears to be an actual conflict in this one.
Otherwise this looks really cool. (And it worked for me when built and deployed as 4.20!)

Conflicts:
	src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java
@qqmyers (Member, Author) commented Aug 20, 2020

@landreev - thanks - both PRs are now updated.

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Code Review 🦁 to QA 🔎✅ Aug 20, 2020
@landreev landreev removed their assignment Aug 20, 2020
@kcondon kcondon self-assigned this Aug 24, 2020
Conflicts:
	src/main/java/edu/harvard/iq/dataverse/EditDatafilesPage.java
	src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
	src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java
@djbrooke djbrooke added this to the 5.1 milestone Aug 25, 2020
@kcondon (Contributor) commented Aug 26, 2020

Have completed baseline testing. It works, with a couple of notable points to investigate:
- investigate the progress bar resetting upon a multipart error
- investigate throttling large numbers of multipart part initializations so they don't overwhelm the browser client
- investigate what happens when you cancel a large file upload by clicking "x" next to the upload: does it clean up completely?

For cancel: catch the per-datafile cancel button and handle the
abort, after all currently uploading parts are done, in the multipart calls, so
the file is removed.

Note that for single-part uploads, cancelling the file does not yet remove
the temp file, but in that case we do add the temp label so files can be
cleaned up later. (The TODO note includes more info.)
@qqmyers (Member, Author) commented Aug 27, 2020

@kcondon - I think the above are fixed:

  • The progress bar was going blank whenever a new part started: the new part reported an 'undefined' number of bytes uploaded when another part reported progress, and since the total progress is the sum of progress across parts, adding an undefined for a new part gives a NaN total that blanks the bar.
  • I was able to limit it to creating 10 part uploads at a time, starting uploads of new ones as those complete. The browser is probably only making progress on fewer than that anyway (2? 4? there's a limit on how many calls it makes to a given server at once), so this avoids having the browser queue thousands of waiting calls for large files without limiting how many are uploaded.
  • The per-file cancel wasn't handled. I added that, plus code to correctly cancel multipart uploads, where you have to wait for the currently uploading parts to complete before you can abort the overall transfer and release the space in S3. This fixes a bug and also improves the overall Cancel of the upload session. Limiting the queue of parts to 10 (the point above) also speeds up cancelling, since you now only have to wait for those 10 parts to complete (versus thousands; I'm not sure how one stops a call after it's been handed to the browser). A sketch of this queue/cancel logic is below.

So I think everything that can be fixed in this PR is fixed and it's ready for retesting. Beyond regression testing that things still work, I think you should try cancelling a single upload (small or multipart-sized) and make sure you can upload a second file in the same upload session, and also try cancelling the overall upload, checking via the AWS S3 API that no multipart uploads are left open.
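Below is a minimal sketch of the browser-side behavior described in this comment: at most 10 parts in flight, a progress total that treats not-yet-started parts as zero (avoiding the NaN), and a cancel that waits for in-flight parts before aborting. The function and parameter names here are placeholders, not the actual JavaScript in EditDatafilesPage.

```typescript
// Illustrative sketch only; not the actual EditDatafilesPage upload code.
const MAX_IN_FLIGHT = 10;

async function uploadParts(
  partUrls: string[],
  file: Blob,
  partSize: number,
  onProgress: (fraction: number) => void,
  isCancelled: () => boolean,
  abortUpload: () => Promise<void>   // calls the Dataverse abort URL
): Promise<boolean> {
  // Treating unstarted parts as 0 avoids the NaN total that blanked the bar.
  const uploaded = new Array<number>(partUrls.length).fill(0);
  const reportProgress = () =>
    onProgress(uploaded.reduce((sum, bytes) => sum + bytes, 0) / file.size);

  let next = 0;
  const inFlight = new Set<Promise<void>>();

  const startPart = (index: number) => {
    const body = file.slice(index * partSize, (index + 1) * partSize);
    const p: Promise<void> = fetch(partUrls[index], { method: "PUT", body }).then(() => {
      uploaded[index] = body.size; // count the part when its PUT completes
      reportProgress();
      inFlight.delete(p);
    });
    inFlight.add(p);
  };

  // Keep at most MAX_IN_FLIGHT part uploads queued with the browser at once,
  // starting new parts only as earlier ones finish.
  while (next < partUrls.length || inFlight.size > 0) {
    while (!isCancelled() && next < partUrls.length && inFlight.size < MAX_IN_FLIGHT) {
      startPart(next++);
    }
    if (inFlight.size > 0) {
      await Promise.race(inFlight);
    }
    if (isCancelled()) {
      // Cancel: stop queuing new parts, wait for the (at most 10) in-flight
      // parts to finish, then abort so S3 releases the already-stored parts.
      await Promise.all(inFlight);
      await abortUpload();
      return false;
    }
  }
  return true;
}
```

Since fetch does not expose per-byte upload progress, this sketch only updates the bar when a part finishes; a per-part progress event (for example via XMLHttpRequest) would make the bar smoother.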

@kcondon (Contributor) commented Aug 28, 2020

Regression test:

  1. Default config:
    No multipart for 7 MB; DVUploader 6 GB works; checked the md5 value.

  2. Set min part size to 5 MB + 1:
    Checked no multipart below 5 MB, multipart at 7 MB, using the UI.
    Works as expected.

Test fixes (using a 1 GB file upload test):

  3. Watch the progress bar; it should grow and not periodically reset to zero when new parts begin.
    Works as expected.
  4. Watch the browser network tab in dev tools to observe how many part uploads are created and complete/grow.
    Appears to work as expected, but I don't see new ones being added. Will retest with a very large file so it's more apparent.
  5. Cancel the upload in the UI by clicking the x next to the filename. Watch the parts on the network tab: do they complete with no new ones added? Does the UI become responsive after the cancel?
    This is sort of 6 & 7 combined. Will update those.
  6. Jim's test 1: upload a file, click x, then upload another small file in the same upload session; does it work?
    This does not work: the small file never starts uploading, and when the cancelled file cleans up it removes all files from the upload session list in the UI.
  7. Jim's test 2: upload a file, cancel the upload session (how? click Done?), and check that open parts are cleared using s3api.
    This kind of works: if I click x on the file, any outstanding parts are removed according to s3api.

@kcondon kcondon merged commit cf1ac95 into IQSS:develop Aug 31, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Aug 31, 2020
Successfully merging this pull request may close these issues:

Direct S3 Upload: Files >5GB cannot be uploaded to AWS, need to be segmented (#6763)