Download Package File from S3 #4949
We discussed this in backlog grooming on 8/15 and want to go over a few technical approaches in more detail before estimating. |
At the request of @pameyer, this will initially be unauthenticated. |
One technical approach I think we should at least consider is the "sync" command from the AWS CLI. Unfortunately, Dataverse users wanting to download files would need to install the AWS CLI, so it would be trickier to support than rsync, which comes standard on Mac and Linux and which I presume can be installed without too much trouble on Windows (but I don't know). I have no idea how much config "sync" requires for unauthenticated downloads, which is all we said above that we want to support. For rsync there is no config to do for unauthenticated downloads, which is why our docs at http://guides.dataverse.org/en/4.9.2/user/find-use-data.html#downloading-a-dataverse-package-via-rsync are relatively straightforward. The docs for "sync" can be found at https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html and it sounds like a hierarchical directory structure is supported: "Recursively copies new and updated files from the source directory to the destination." An example specific to downloading is provided:
I found out about "sync" from https://serverfault.com/questions/73959/using-rsync-with-amazon-s3 |
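On the question of how much config unauthenticated downloads need: the AWS CLI has a global `--no-sign-request` option that skips credential loading, so an anonymous `sync` may need no config at all, provided the bucket policy allows public list and read access. A minimal sketch of what the command could look like, built in Python so it can be checked; the bucket name, prefix, and destination below are hypothetical, not taken from any Dataverse installation:

```python
import shlex


def build_sync_command(bucket, prefix, dest_dir):
    """Build the argv for an anonymous `aws s3 sync` download.

    bucket, prefix and dest_dir are placeholders supplied by the caller.
    --no-sign-request tells the AWS CLI not to sign the request, which only
    works if the bucket permits anonymous list/get access.
    """
    source = f"s3://{bucket}/{prefix.strip('/')}/"
    return ["aws", "s3", "sync", source, dest_dir, "--no-sign-request"]


# Hypothetical bucket and package prefix, for illustration only.
cmd = build_sync_command("example-dataverse-bucket", "10.5072/FK2/EXAMPLE", "./package")
print(shlex.join(cmd))
# To actually download (requires the AWS CLI to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Users would still have to install the AWS CLI, so this does not remove the support concern raised above.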
When I've tested it, AWS S3 |
s3 sync thoughts: With our current implementation we'd also have to separate the unpublished and restricted files from the S3 bucket. My vibe is that we shouldn't expose the S3 bucket directly in this way (though we do already expose short-term direct downloads). I wonder if there are simple existing tools (web applications?) that would allow us to add a layer in between for access and other needs.
@pameyer a package stored in S3 becomes a folder with subfolders. |
We may be able to create a simple "API Gateway" as a layer between S3 and the world: https://aws.amazon.com/api-gateway/ & s3 gateway. We may want to steer away from more AWS services, though. |
Discussed solutions for S3 rsync-uploaded files:
- Direct access via AWS commands:
- Something leveraging time-sensitive download URLs
- Rsync pointing to S3 mounted (FUSE) on a box (separate server)
- Dataverse API Gateway (layer in front of S3)
Extra notes from conversation after deciding on zip file download as the best solution
|
As @matthew-a-dunlap mentioned in the last comment, after meeting during tech hours, we decided on zip file download as the best solution:
(*) We need to see here whether we can or should support BagIt. |
One thing that may (or may not) be worth checking: I've been using "zip" in this discussion as a stand-in for "generic archive format" ("tar", compressed or uncompressed, might be another alternative). Is there a limit imposed by the format (or by an implementation) on the size of a zip file (or tar file)? |
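On the size question: the classic zip format uses 32-bit size fields, so without the ZIP64 extension an archive and each member are limited to 2^32 - 1 bytes (about 4 GiB), with at most 65,535 entries. With ZIP64 those limits are no longer a practical concern, but both the tool writing the archive and the tool unpacking it have to support it. A minimal sketch, assuming a package is a local folder with subfolders; the function name and layout are illustrative and not Dataverse code:

```python
import os
import zipfile

# Largest size (in bytes) the classic zip format can record without ZIP64.
ZIP32_LIMIT = 2**32 - 1


def zip_directory(src_dir, zip_path):
    """Write every file under src_dir into zip_path, preserving subfolders.

    allowZip64=True (the default in current Python versions) lets the archive
    and its members exceed ZIP32_LIMIT. zip_path should live outside src_dir
    so the archive does not try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to the package root.
                zf.write(full_path, arcname=os.path.relpath(full_path, src_dir))
    return zip_path
```

Tar has a comparable caveat: the older ustar header caps a member at 8 GiB, while the pax and GNU tar formats do not have that cap.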
@scolapasta I moved this back to the design column because in the sprint planning meeting you mentioned some UI/UX impact. Can you elaborate on what you see as potentially impacting the UI? |
Sure. Basically, when packages are large (as will often be the case), it may be best not to have a download button that automatically redirects your browser to the time-limited URL, but rather to display that time-limited URL to the user (via a popup?) and ask them to either use their preferred download manager or click and let the browser download it. Should we consider making this the logic for any download, so as to be consistent? |
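To make "time-limited URL" concrete, here is a generic sketch of the idea: the link carries an expiry timestamp plus a signature, and stops validating once the expiry has passed. This is a stand-in using a plain HMAC, not S3's own presigning scheme (which the AWS SDKs generate for you); the secret, parameter names, and example URL are all hypothetical:

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def _sign(base_url, expires):
    """HMAC-SHA256 over the URL and its expiry timestamp."""
    message = f"{base_url}|{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def make_time_limited_url(base_url, ttl_seconds, now=None):
    """Return base_url with an expiry time and signature appended."""
    issued_at = int(now if now is not None else time.time())
    expires = issued_at + ttl_seconds
    query = urlencode({"expires": expires, "sig": _sign(base_url, expires)})
    return f"{base_url}?{query}"


def is_url_valid(url, now=None):
    """Check that the signature matches and the expiry has not passed."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    current = now if now is not None else time.time()
    return hmac.compare_digest(sig, _sign(base_url, expires)) and current < expires
```

Because the expiry is baked into the link, a URL shown in a popup can be pasted into a download manager, but only as long as the download starts before the link expires. That suggests the lifetime would need to be chosen with large packages in mind.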
I've fixed and committed changes for the issues in the above list. Let me know if there is anything else that's needed, thanks! |
With #4703 we are supporting storage of data with DCM (rsync) on S3. There is more work needed to allow downloading of the data stored in this manner. Whether this is extending RSAL or providing some other download method (e.g. direct S3 link) needs discussion.