
Download Package File from S3 #4949

Closed
matthew-a-dunlap opened this issue Aug 14, 2018 · 27 comments
@matthew-a-dunlap
Contributor

With #4703 we are supporting storage of data with DCM (rsync) on S3. There is more work needed to allow downloading of the data stored in this manner. Whether this is extending RSAL or providing some other download method (e.g. direct S3 link) needs discussion.

@djbrooke
Contributor

Had some discussion about this in the backlog grooming 8/15 and we want to discuss a few technical approaches in more detail before estimating.

@djbrooke
Contributor

At the request of @pameyer, this will initially be unauthenticated.

@pdurbin
Member

pdurbin commented Sep 18, 2018

One technical approach I think we should at least consider is the "sync" command from AWS CLI.

Unfortunately, Dataverse users wanting to download files would need to install AWS CLI so it would be trickier to support than rsync, which comes standard on Mac and Linux and I presume can be installed without too much trouble on Windows (but I don't know). I have no idea how much config it requires for unauthenticated downloads, which is what we said above is all we want to support. For rsync there is no config to do for unauthenticated downloads, which is why our docs at http://guides.dataverse.org/en/4.9.2/user/find-use-data.html#downloading-a-dataverse-package-via-rsync are relatively straightforward.

The docs for "sync" can be found at https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html and it sounds like a hierarchical directory structure is supported: "Recursively copies new and updated files from the source directory to the destination."

An example specific to downloading is provided:

The following sync command syncs files in a local directory to objects under a specified prefix and bucket by downloading s3 objects....

aws s3 sync s3://mybucket .

I found out about "sync" from https://serverfault.com/questions/73959/using-rsync-with-amazon-s3

@pameyer
Contributor

pameyer commented Sep 18, 2018

When I've tested it, AWS S3 sync does support directory hierarchy. I haven't investigated if or how it supports un-authenticated access to public S3 objects; and I'm not fully up to speed on if a package file stored in S3 corresponds to an S3 bucket, S3 object, or something else.

@matthew-a-dunlap
Contributor Author

s3 sync thoughts:
I looked into the AWS S3 throttling options and there seems to be no way to throttle S3 access on the server side. The user can put parameters into their ~/.aws/config, but that has to be voluntary: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#configuration-values

With our current implementation we'd also have to separate the unpublished and restricted files from the s3 bucket.

My vibe is that we shouldn't expose the S3 bucket directly in this way (though we do already expose short-term direct downloads). I wonder if there are simple existing tools (web applications?) that would let us add a layer in between for access control and other needs.

@pameyer a package stored in S3 becomes a folder with subfolders.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 18, 2018

We may be able to create a simple "API Gateway" as a layer between S3 and the world (https://aws.amazon.com/api-gateway/), i.e. an S3 gateway. We may want to steer away from more AWS services, though.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Sep 18, 2018

Discussed solutions for S3 rsync-uploaded files

Direct access via aws commands:

  • No throttling options
  • Will need to ensure bucket only has accessible files
  • No download counts

Something leveraging time-sensitive download URLs:

  • We could provide the users a script that talks to dataverse to generate and point to downloads of the files in the package.
  • Throttling can be done by dataverse
  • Could zip the file before and just do a single download

Rsync pointing to s3 mounted (fuse) on a box (separate server)

Dataverse API Gateway (layer in front of S3)

  • Would be easy and allow controls for s3
  • Unlikely that LTS will be OK with it

Extra notes from conversation after deciding on zip file download as the best solution

  • Can we support bagit?
  • If the package is too big, provide the user a command, not a link
  • Can we support bucket-to-bucket copy?
  • How are we going to ensure the zip file is good as it's being passed around?
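One standard answer to the last question is to publish a checksum alongside the zip, so anyone who receives it can verify integrity. A minimal sketch (the file names are hypothetical):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large packages never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The server records sha256_of_file("package.zip") at packaging time;
# the client recomputes it after download and compares the two values.
```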

@scolapasta
Contributor

scolapasta commented Sep 18, 2018

As @matthew-a-dunlap mentioned in the last comment, after meeting during tech hours, we decided on zip file download as the best solution:

  • DCM will have an option to upload as a zip* or as unzipped; default would be zipped
  • Download URLs would be generated the way they are now for files, where the user gets authorized via Glassfish and then gets a direct link to S3

(*) need to see here if we can / should support bagit

@pameyer
Contributor

pameyer commented Sep 18, 2018

One thing that may (or may not) be worth checking: I've been using "zip" in this discussion as a stand-in for "generic archive format" ("tar", compressed or uncompressed, might be another alternative). Is there a format- or implementation-imposed limit on the size of a zip file (or tar file)?
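For what it's worth: the classic ZIP format caps archive and entry sizes at 2**32 - 1 bytes (about 4 GiB) and 65,535 entries; the ZIP64 extension removes those limits, and most modern tooling supports it. As one concrete example, Python's `zipfile` module writes ZIP64 records when `allowZip64` is set (the default in Python 3). A small in-memory sketch, with illustrative file names:

```python
import io
import zipfile

buf = io.BytesIO()
# allowZip64=True lets the writer emit ZIP64 records for entries
# or archives that exceed the classic 4 GiB / 65,535-entry limits.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.writestr("package/file1.dat", b"example payload")

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    assert zf.testzip() is None  # verifies the stored CRCs
    print(zf.namelist())         # → ['package/file1.dat']
```

POSIX tar has analogous limits (8 GiB per file in the original format) that the pax/GNU extensions lift, so either container can hold very large packages with the right variant.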

@djbrooke djbrooke changed the title RSAL (or equivalent) for DCM S3 Download Package File from S3 Sep 19, 2018
@djbrooke djbrooke reopened this Sep 19, 2018
@djbrooke
Contributor

@scolapasta I moved this back to the design column because in the sprint planning meeting you mentioned some UI/UX impact. Can you elaborate on what you see as potentially impacting the UI?

@scolapasta
Contributor

Sure. Basically, when packages are large (as will often be the case), it may be best not to have a download button that automatically redirects the browser to the time-limited URL, but rather to display that time-limited URL (via a popup?) and ask the user to paste it into their preferred download manager, or to click it and let the browser handle the download.

We may want to consider making this the logic for any download, so as to be consistent?

@matthew-a-dunlap
Contributor Author

I've fixed and committed changes for the issues in the above list. Let me know if there is anything else that's needed, thanks!
