
Data Access API: Download by Dataset #4529

Closed
SamiSousa opened this issue Mar 21, 2018 · 12 comments · Fixed by #7086
Labels
Feature: Performance & Stability User Role: Guest Anyone using the system, even without an account
Comments

@SamiSousa

Is there any method of downloading all the files of a dataset using the Data Access API? Something like using the global_id of the dataset to download all files in a zip, similar to the bundle download. Thanks!

@pdurbin (Member) commented Mar 21, 2018

@SamiSousa thanks for the suggestion.

For a little more context, see our conversation in IRC: http://irclog.iq.harvard.edu/dataverse/2018-03-20#i_64759

@pdurbin (Member) commented May 1, 2018

@SamiSousa it was nice meeting you this afternoon at BU! I couldn't remember if you had opened any issues but found this one. Like I was saying, with 800+ issues I sometimes reach out to the person who originally opened the issue to see if they're still interested. I got the impression that you may or may not be interested in this issue long term, after your class is over, which is totally fine.

We did discuss this issue during backlog grooming in late March but we're worried about performance implications primarily, even though the code to address this issue is probably straightforward.

Also, is this issue on topic enough to link your team's video from here? I ask because I just checked my email and didn't get your message yet. We have pretty aggressive spam filtering enabled and I get a summary email at the end of the day that may give me the opportunity to see your email and have it delivered to my inbox. Thanks!

@pdurbin (Member) commented May 1, 2018

@SamiSousa nevermind! I clicked "release and allow sender" in the antispam tool...

[screenshot from 2018-05-01 17-46-55]

... and now I have a link to your video and code:

@SamiSousa (Author)

Great meeting you too Phil! In the project, we ended up using the Search and Data Access APIs to list files and download individual files, so this specific feature isn't a high priority request from me. Hope this helps!
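The list-then-download workflow Sami describes can be sketched roughly as follows. This is a sketch, not the project's actual code: the base URL and DOI are placeholders, and the Search API `fq` filter used to scope results to one dataset is an assumption.

```python
# Rough sketch of "use the Search API to list files, then the Data Access
# API to download them one by one". BASE is a placeholder installation and
# the fq filter field is an assumption about how to scope to one dataset.
from urllib.parse import urlencode

BASE = "https://demo.dataverse.org"  # placeholder

def file_search_url(doi, per_page=100):
    """Search API URL listing files, scoped to one dataset (assumed filter)."""
    params = urlencode({
        "q": "*",
        "type": "file",
        "per_page": per_page,
        "fq": f'parentIdentifier:"{doi}"',  # hypothetical filter field
    })
    return f"{BASE}/api/search?{params}"

def datafile_url(file_id):
    """Data Access API URL for downloading a single file by numeric id."""
    return f"{BASE}/api/access/datafile/{file_id}"
```

Each file returned by the search would then be fetched individually with `datafile_url`; at the time of this thread there is no single call that takes the dataset's global_id.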

@pdurbin (Member) commented May 2, 2018

@SamiSousa no problem. Thanks for clarifying. I just sent a message about your project to the Dataverse community at https://groups.google.com/d/msg/dataverse-community/P4llZSssZ2Q/zvhGltLpAQAJ and you are welcome to make sure I didn't misrepresent your project at all. The video is really interesting! Thanks for sharing!

@pdurbin (Member) commented May 2, 2018

@SamiSousa questions are coming in already! Please see dataverse-broker/dataverse-broker#46 . Thanks!

@pdurbin (Member) commented Jul 13, 2018

we're worried about performance implications primarily, even though the code to address this issue is probably straightforward

I guess I'll vote to close this issue if we have no intention of supporting this.

@pdurbin (Member) commented Jun 6, 2019

I just spoke with @jggautier about this in the context of https://help.hmdc.harvard.edu/Ticket/Display.html?id=276556 and was telling him that I do think we should implement the ability to download all the files in a dataset based on the DOI or Handle of that dataset via API using a script (with a config option to turn off this feature for installations that don't want it).

The workaround for figuring out from a browser the file IDs to pass to the Data Access API is to use dev tools (inspect element) and copy the curl command. For example:

Firefox: [Screen Shot 2019-06-06 at 11 56 21 AM]

Chrome: [Screen Shot 2019-06-06 at 12 00 31 PM]

Then, once you have the crazy long URL (lots of extra junk in there), you can use it like this:

curl 'https://demo.dataverse.org/api/access/datafiles/307909,307910,307908?gbrecs=true' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer: https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/KG5TTZ' -H 'Connection: keep-alive' -H 'Cookie: _ga=GA1.2.2122576405.1458229148; __utma=226172687.2122576405.1458229148.1559755065.1559827431.784; _saml_idp=aHR0cHM6Ly9mZWQuaHVpdC5oYXJ2YXJkLmVkdS9pZHAvc2hpYmJvbGV0aA%3D%3D+aHR0cHM6Ly9pZHAudGVzdHNoaWIub3JnL2lkcC9zaGliYm9sZXRo; __utmz=226172687.1559243403.772.43.utmcsr=iq.harvard.edu|utmccn=(referral)|utmcmd=referral|utmcct=/product-development; _gid=GA1.2.418902204.1559572454; __utmc=226172687; JSESSIONID=d801bf3d60278565c5847c1f3dd0' -H 'Upgrade-Insecure-Requests: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' > out.zip

dhcp-10-250-190-90:tmp pdurbin$ file out.zip 
out.zip: Zip archive data, at least v2.0 to extract
dhcp-10-250-190-90:tmp pdurbin$ unzip out.zip 
Archive:  out.zip
  inflating: hdv-did-extract-metadata.fits  
  inflating: hdv-did-extract-metadata.uvfits  
  inflating: hdv-did-not-extract-metadata.fits  
  inflating: MANIFEST.TXT            
dhcp-10-250-190-90:tmp pdurbin$ 

The part that matters is https://demo.dataverse.org/api/access/datafiles/307909,307910,307908?gbrecs=true and this is documented at http://guides.dataverse.org/en/4.14/api/dataaccess.html#multiple-file-bundle-download
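Once you have the file IDs, the essential URL (minus all the browser junk) can be built programmatically. A minimal sketch, using the demo host and file ids from the example above:

```python
def bundle_download_url(base, file_ids, gbrecs=True):
    """Build the multi-file bundle download URL from the Data Access API:
    /api/access/datafiles/{comma-separated ids}[?gbrecs=true]."""
    ids = ",".join(str(i) for i in file_ids)
    suffix = "?gbrecs=true" if gbrecs else ""
    return f"{base}/api/access/datafiles/{ids}{suffix}"

url = bundle_download_url("https://demo.dataverse.org", [307909, 307910, 307908])
# -> https://demo.dataverse.org/api/access/datafiles/307909,307910,307908?gbrecs=true
```

A plain `curl -o out.zip "$url"` is then enough; none of the cookies or browser headers in the copied command are required for public files.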

@mankoff (Contributor) commented Jun 4, 2020

I would like to vote for this feature. It is critical to be able to do bulk downloads via wget or some other computer-to-computer solution. This is the beauty of the classic FTP folder full of files. It would be nice to be able to point any script at any dataverse DOI with /download appended to the end and know that it will fetch everything within.

This could be turned off for dataverses, on for datasets, and configurable by the site admin and each dataverse and dataset admin.

Alternatively, the dataverse site could auto-generate a script (bash? Python?) for each dataset to download all the data contained in that dataset. The National Snow and Ice Data Center (NSIDC) takes this approach.
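The NSIDC-style approach could look something like this sketch: given the dataset's list of (filename, file id) pairs, the server would render a small bash script for the user to run locally. All names and values here are illustrative, not an existing Dataverse feature.

```python
def make_download_script(base, files):
    """Render a bash script that downloads each file in a dataset.

    `files` is a list of (filename, file_id) pairs; `base` is the
    installation URL. Both are illustrative placeholders."""
    lines = ["#!/bin/bash", "set -e"]
    for name, fid in files:
        lines.append(f"curl -L -o '{name}' '{base}/api/access/datafile/{fid}'")
    return "\n".join(lines) + "\n"

script = make_download_script(
    "https://demo.dataverse.org",
    [("data.fits", 307909), ("MANIFEST.TXT", 307908)],
)
```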

@mankoff (Contributor) commented Jun 4, 2020

I recognize that the reason this issue is often closed is "server load issues". The advantage of generating a script the user can run is that the script could throttle the download. You'd need to trust users not to remove that part of the script, though. A script also lets the user download multiple files rather than a single ZIP, so zipping is not required on the server backend (although the web server itself, e.g. Apache rather than Dataverse, could do on-the-fly compression for transfers).
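Throttling in such a generated script could be as simple as pausing between requests. A sketch of that idea, with the actual fetch step stubbed out as a callable (the delay value and the `fetch` parameter are placeholders, not part of any existing API):

```python
import time

def download_all(file_ids, fetch, delay_seconds=1.0):
    """Fetch each file in turn, sleeping between requests to limit server load.

    `fetch` is any callable that downloads one file given its id;
    it is a stand-in here for a real HTTP download."""
    results = []
    for i, fid in enumerate(file_ids):
        if i > 0:
            time.sleep(delay_seconds)  # throttle: one request per delay window
        results.append(fetch(fid))
    return results
```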

@djbrooke (Contributor) commented Jun 4, 2020

Hey @mankoff, we'll be implementing this after we make some optimizations to the zipping service in #6505.

We have a full API suite documented at http://guides.dataverse.org/en/latest/api/index.html, so it would be possible to script things now.

@pdurbin (Member) commented Jul 15, 2020

I made pull request #7086 for this issue. Feedback is welcome, of course.

@pdurbin pdurbin removed their assignment Jul 17, 2020
pdurbin added a commit that referenced this issue Jul 17, 2020
Added enum so we don't have two methods both with 3 String args.
pdurbin added a commit that referenced this issue Jul 28, 2020
Return UNAUTHORIZED instead of BAD_REQUEST and detailed error messages.
kcondon added a commit that referenced this issue Jul 29, 2020
add API to download all files by dataset #4529