
Download Package File from S3 #4949

Closed
matthew-a-dunlap opened this issue Aug 14, 2018 · 27 comments
@matthew-a-dunlap
Contributor

With #4703 we are supporting storage of data with DCM (rsync) on S3. There is more work needed to allow downloading of the data stored in this manner. Whether this is extending RSAL or providing some other download method (e.g. direct S3 link) needs discussion.

@djbrooke
Contributor

Had some discussion about this in the backlog grooming 8/15 and we want to discuss a few technical approaches in more detail before estimating.

@djbrooke
Contributor

At the request of @pameyer, this will initially be unauthenticated.

@pdurbin
Member

pdurbin commented Sep 18, 2018

One technical approach I think we should at least consider is the "sync" command from AWS CLI.

Unfortunately, Dataverse users wanting to download files would need to install AWS CLI so it would be trickier to support than rsync, which comes standard on Mac and Linux and I presume can be installed without too much trouble on Windows (but I don't know). I have no idea how much config it requires for unauthenticated downloads, which is what we said above is all we want to support. For rsync there is no config to do for unauthenticated downloads, which is why our docs at http://guides.dataverse.org/en/4.9.2/user/find-use-data.html#downloading-a-dataverse-package-via-rsync are relatively straightforward.

The docs for "sync" can be found at https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html and it sounds like a hierarchical directory structure is supported: "Recursively copies new and updated files from the source directory to the destination."

An example specific to downloading is provided:

The following sync command syncs files in a local directory to objects under a specified prefix and bucket by downloading s3 objects....

aws s3 sync s3://mybucket .

I found out about "sync" from https://serverfault.com/questions/73959/using-rsync-with-amazon-s3

@pameyer
Contributor

pameyer commented Sep 18, 2018

When I've tested it, AWS S3 sync does support directory hierarchy. I haven't investigated if or how it supports un-authenticated access to public S3 objects; and I'm not fully up to speed on if a package file stored in S3 corresponds to an S3 bucket, S3 object, or something else.

@matthew-a-dunlap
Contributor Author

s3 sync thoughts:
I looked into the AWS S3 throttling options and there seems to be no way to throttle S3 access on the server side. The user can put parameters into their ~/.aws/config, but that has to be voluntary: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#configuration-values

With our current implementation we'd also have to separate the unpublished and restricted files from the s3 bucket.

My vibe is that we shouldn't expose the S3 bucket directly in this way (though we do already expose short-term direct downloads). I wonder if there are simple existing tools (web applications?) that would let us add a layer in between for access control and other needs.

@pameyer a package stored in S3 becomes a folder with subfolders.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 18, 2018

We may be able to create a simple "API Gateway" as a layer between S3 and the world (https://aws.amazon.com/api-gateway/), i.e. an S3 gateway. We may want to steer away from more AWS services, though.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 18, 2018

Discussed solutions for S3 rsync-uploaded files

Direct access via aws commands:

  • No throttling options
  • Will need to ensure bucket only has accessible files
  • No download counts

Something leveraging time-sensitive download URLs:

  • We could provide the users a script that talks to dataverse to generate and point to downloads of the files in the package.
  • Throttling can be done by dataverse
  • Could zip the file before and just do a single download

Rsync pointing to s3 mounted (fuse) on a box (separate server)

Dataverse API Gateway (layer in front of S3)

  • Would be easy and allow controls for s3
  • Unlikely that LTS will be OK with it

Extra notes from conversation after deciding on zip file download as the best solution

  • Can we support bagit?
  • If the package is too big, provide the user a command, not a link
  • Can we support bucket-to-bucket copy?
  • How are we going to ensure the zip file is good as it's being passed around?
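One standard answer to the last question is to publish a checksum alongside the zip, so anyone who receives it can verify integrity. A minimal sketch (the file names are hypothetical):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large packages never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The server records sha256_of_file("package.zip") at packaging time;
# the client recomputes it after download and compares the two values.
```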

@scolapasta
Contributor

scolapasta commented Sep 18, 2018

As @matthew-a-dunlap mentioned in the last comment, after meeting during tech hours, we decided on zip file download as the best solution:

  • DCM will have an option to upload as a zip* or as unzipped; default would be zipped
  • Download URLs would be generated the way they are now for files, where the user gets authorized via Glassfish and then gets a direct link to S3

(*) need to see here if we can / should support bagit

@pameyer
Contributor

pameyer commented Sep 18, 2018

One thing that may (or may not) be worth checking: I've been using "zip" in this discussion as a stand-in for "generic archive format" ("tar", compressed or uncompressed, might be another alternative). Is there a format- or implementation-imposed limit on the size of a zip file (or tar file)?
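For what it's worth: the classic ZIP format caps archive and entry sizes at 2**32 - 1 bytes (about 4 GiB) and 65,535 entries; the ZIP64 extension removes those limits, and most modern tooling supports it. As one concrete example, Python's `zipfile` module writes ZIP64 records when `allowZip64` is set (the default in Python 3). A small in-memory sketch, with illustrative file names:

```python
import io
import zipfile

buf = io.BytesIO()
# allowZip64=True lets the writer emit ZIP64 records for entries
# or archives that exceed the classic 4 GiB / 65,535-entry limits.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.writestr("package/file1.dat", b"example payload")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    assert zf.testzip() is None  # verifies the stored CRCs
    print(zf.namelist())         # → ['package/file1.dat']
```

POSIX tar has analogous limits (8 GiB per file in the original format) that the pax/GNU extensions lift, so either container can hold very large packages with the right variant.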

@djbrooke djbrooke changed the title RSAL (or equivalent) for DCM S3 Download Package File from S3 Sep 19, 2018
@djbrooke djbrooke reopened this Sep 19, 2018
@djbrooke
Contributor

@scolapasta I moved this back to the design column because in the sprint planning meeting you mentioned some UI/UX impact. Can you elaborate on what you see as potentially impacting the UI?

@scolapasta
Contributor

Sure. Basically, when packages are large (as will often be the case), it may be best not to have a download button that automatically redirects the browser to the time-limited URL, but rather to display that time-limited URL (via a popup?) and ask the user to paste it into their preferred download manager, or to click it and let the browser handle the download.

We may want to consider making this the logic for any download, so as to be consistent?

@matthew-a-dunlap
Contributor Author

I've fixed and committed changes for the issues in the above list. Let me know if there is anything else that's needed, thanks!
