Multiple formats under "Download All" dropdown #4000

Closed
acoppock opened this issue Jul 13, 2017 · 44 comments · Fixed by #4979
Labels: Feature: File Upload & Handling; Type: Suggestion (an idea); User Role: Guest (anyone using the system, even without an account)

Comments

@acoppock

Since many repositories include code that expects data files to be in a particular format, it's frustrating that dataverse defaults to downloading data files as .tab.

IMHO, the default should be the original file format, with options for all the others.

@landreev
Contributor

I do see your point. The rationale behind this behavior has been to take the user's data, which may have been in some proprietary format - like SPSS or Stata - and change it to tab-delimited, for archival purposes, since it's a format that's guaranteed to be readable without any special software... And then it makes sense that this format becomes the default. Admittedly, it probably makes less sense with files that were originally CSV (now that we support converting CSV files into tabular data...). If nothing else, CSV is just as good an "archival format" as tab-delimited...

So, once again, I do see your point. Still, this is very, very ancient legacy and it would honestly be difficult for us to just change this behavior, without upsetting or at least confusing many existing users. But we should still be able to make it less frustrating for users like you.

First of all, we already have an issue that's very close to the head of the dev. queue that will add an option for the user to opt out of converting a file to tabular data in the first place. Kind of a nuclear option, really - because then you would not be able to do things that require tabular metadata. So, not sure if that will help with your use case.

And then we can make it configurable for individual files. As in, keep the default behavior as it is now - tab. is the default download format; but make it possible to specify, per file, which format should be the default.

Also, it sounds like you were talking about downloading files programmatically, via API calls. I'm assuming you were able to work around this, using our API methods. Should be relatively easy, to first look up a file and determine if it's tabular or not, and if it is, ask for the original, instead of the default file. Still some extra stuff to do, of course - but doable.
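A minimal sketch of that workaround, under stated assumptions: the file metadata is already fetched as a dict, an ingested file is recognized by a `contentType` of `text/tab-separated-values`, and the download goes through `/api/access/datafile/{id}` with a `format=original` query parameter. The helper name `download_url` and the dict keys are illustrative only; check the installation's API guide for the exact fields.

```python
from urllib.parse import urlencode

# Assumed marker for a file that was ingested as tabular data.
TABULAR_TYPE = "text/tab-separated-values"


def download_url(base_url: str, file_meta: dict) -> str:
    """Build a download URL, asking for the original when the file is tabular."""
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_meta['id']}"
    if file_meta.get("contentType") == TABULAR_TYPE:
        # Tabular file: request the originally uploaded format instead of .tab.
        url += "?" + urlencode({"format": "original"})
    return url
```

Non-tabular files get the plain URL, so the same helper can be mapped over every file in a dataset.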

To summarize, we are open to suggestions, and we should be able to make the download API better suited to your needs - via extra options/features, etc. But just changing the default behavior for every existing tabular file may not be an option, for legacy reasons.

Cheers.

@acoppock
Author

acoppock commented Jul 17, 2017 via email

@setgree

setgree commented Nov 16, 2017

Hello! Is there any chance of letting users have a choice within the "download all" menu between download all in tabular format and downloading all in original formats? That would seem to address your concern about legacy users. You could then possibly run some analytics, and if it turns out that 98% of users prefer to download in .tab vs original format, we will all know definitively that this was a fringe issue; if it turns out the other way, we will know that there was demand for original file formats.

If this won't be possible, could there be some documentation on repositories for which data files have been converted? I ask because recently some colleagues were trying to download files from dataverse to replicate a study and were unable to do so, and had no idea even what the source of the problem was or how to address it based on the repository they had navigated to. Perhaps a pop-up menu when you select "download all" explaining the issue?

@landreev
Contributor

landreev commented Nov 16, 2017

Regarding the last request:

This would be very easy to achieve on the API side (i.e. to make that API method that zips up multiple file bundles accept an extra "format=original" option; that would make it use the originals for the files that were converted to tabular data...)
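From a client's point of view, the proposed option might look roughly like the sketch below. The `/api/access/datafiles/` path with comma-separated ids and the helper name `bundle_url` are assumptions for illustration; only the `format=original` option comes from the proposal above.

```python
from typing import Iterable


def bundle_url(base_url: str, file_ids: Iterable[int], use_originals: bool = False) -> str:
    """Build a URL for a zipped multi-file download (hypothetical path layout)."""
    ids = ",".join(str(i) for i in file_ids)
    url = f"{base_url.rstrip('/')}/api/access/datafiles/{ids}"
    if use_originals:
        # Proposed option: use originals for files converted to tabular data.
        url += "?format=original"
    return url
```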

On the dataset page, can we add an extra checkbox ("use originals"?) next to that download-multiple-files-button? (I mean, of course we can add a checkbox - but can we do it without making the whole thing more, rather than less confusing?)

@setgree

setgree commented Nov 16, 2017

This checkbox would make my life a lot easier, thank you!

@landreev
Contributor

landreev commented Nov 16, 2017

Also, we already have a GitHub issue open for giving the dataset owner an easy way to "un-ingest" a tabular data file; i.e. to convert it back to the original. Let's implement it finally. It should be easy. And for a researcher whose needs are primarily archival (like providing replication data to the research community), and who doesn't need or care about running online data exploration/analysis on the site, this by itself would solve an issue like this one.

@pdurbin
Member

pdurbin commented Nov 16, 2017

Also, we have a github issue already opened for giving the dataset owner an easy way to "un-ingest" a tabular data file; i.e. to convert it back to the original. Let's implement it finally. It should be easy

Yep. Good old #3766.

@oscardssmith
Contributor

What's the actual action item here?

@pdurbin
Member

pdurbin commented Jul 9, 2018

@oscardssmith good question. You could bring this up during backlog grooming to get a "definition of done".

@djbrooke
Contributor

djbrooke commented Jul 9, 2018

No worries, I'll bring it to backlog grooming once it's a priority and there is some consensus on an approach.

@dlmurphy
Contributor

dlmurphy commented Jul 11, 2018

We just discussed this issue in our weekly design meeting.

For this issue, our goal is to allow users to easily “download all” files in a dataset in their original format using our UI.

On the dataset page's "Download all" button, we want to add two dropdown options:

  • Archival (open) format

  • Original format


But we're open to suggestions; leave a comment if you have any thoughts on this solution.

@dlmurphy dlmurphy removed their assignment Jul 11, 2018
@scolapasta
Contributor

Do we want to have logic to only give these options in the case where you have ingested files? Also to consider: currently we only ingest tabular files, but we have discussed the idea of other types of ingest, e.g. ingesting zip files as a dataverse "package". Not sure if this affects the design at this stage or if it's a bridge we should cross later.

@dlmurphy
Contributor

Yes, we only want to offer these options in cases where the distinction matters, i.e. when the dataset has at least one ingested file.
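That rule can be sketched as a small function. The `ingested` flag on each file record and the function name are hypothetical; the two option labels are the ones proposed earlier in this thread.

```python
from typing import Dict, List


def download_all_options(files: List[Dict]) -> List[str]:
    """Return dropdown options, or an empty list when a plain button is enough."""
    # The distinction only matters if at least one file was ingested.
    if any(f.get("ingested", False) for f in files):
        return ["Archival (open) format", "Original format"]
    return []
```

An empty list here stands for "show the plain Download all button with no dropdown".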

@scolapasta
Contributor

scolapasta commented Jul 11, 2018

That's what I assumed. Just wanted to make sure it got tracked in the issue. Thanks!

@pdurbin
Member

pdurbin commented Jul 11, 2018

One quick thought is that @mheppler and I seem to agree that we shouldn't hack on the code until after we've refactored it for #4656.

@pdurbin
Member

pdurbin commented Jul 13, 2018

Is #4464 a duplicate of this ticket?

@djbrooke
Contributor

djbrooke commented Aug 6, 2018

assigning to @dlmurphy to talk about this at next estimation session

@djbrooke djbrooke changed the title from 'Default Download Type should be "original"' to 'Add Original File formats to "Download All" dropdown' Aug 8, 2018
@landreev
Contributor

landreev commented Sep 3, 2018

@matthew-a-dunlap
Copy-and-pasting from/expanding on the slack discussion on making the full permission check pass before generating any zipped output, in order to produce the 207 return code, for posterity:

For the purposes of #4000, I feel like we probably should revert back to checking the permissions as we generate the zipped stream. We may still end up using the current implementation when working on #4576; but let's think about it then. The 207 code may be useful to have for the API users (even though it looks like it was never specifically requested in #4576 - it was something we offered along the way); but the UI users
a) will have no benefit from it; and
b) will be penalized by having to wait before the zipped output starts streaming. In the past, it was a serious enough problem that it was causing timeouts for some users (datasets with tons of small files?) - so that we ended up rewriting that API method, specifically to use the streaming approach instead. Also, we have switched to a more expensive way of checking permissions - we are no longer cutting corners there, so it'll be even worse.

Also, c) for the UI users, we are already checking the permissions on the UI side (and are warning the users there, via a popup, that we are dropping some files that they cannot download); and, per #4576, we'll be doing the same for the files that have to be dropped because of the size limit. Meaning, when this API receives a call that's a redirect from the UI, it will only contain the file ids that the user is in fact allowed to download. We do of course want to double check that it is indeed the case; but no need doing it in a separate first pass, before generating any output.

So we may want the API to do both things. I.e., by default, handle it the way you have it implemented now: run the full check; if any files have to be dropped, generate 207; only then generate output. But also support some kind of a "start streaming asap" flag, to be used when we redirect the user to that API from the UI. But, again, we should probably address that when we work on #4576.
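The two modes could be sketched as follows. This is not the Dataverse implementation: `plan_zip`, the `stream_asap` flag name, and the `can_download` callback are hypothetical stand-ins, and the zip writing itself is left out so that only the ordering of the permission checks is shown.

```python
from typing import Callable, Iterator, List, Tuple


def plan_zip(
    file_ids: List[int],
    can_download: Callable[[int], bool],
    stream_asap: bool = False,
) -> Tuple[int, Iterator[int]]:
    """Return (HTTP status, iterator over the file ids to put in the zip)."""
    if stream_asap:
        # Streaming mode: no first pass. Each permission check runs lazily,
        # right before that file would be written to the output stream.
        return 200, (f for f in file_ids if can_download(f))
    # Default mode: check everything up front so the status is known
    # before any output is produced.
    allowed = [f for f in file_ids if can_download(f)]
    status = 207 if len(allowed) < len(file_ids) else 200
    return status, iter(allowed)
```

In streaming mode the status has to be decided before any check has run, which is why 207 is only available in the default mode; files the user cannot download are still skipped in both.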

@matthew-a-dunlap
Contributor

This may not need another round of review, but I am dragging it back in, in case someone wants to put eyes on it.

@makmanalp

I just want to express my excitement and enthusiasm for this ticket. Thanks for all the hard work! We're just moving our datasets to Dataverse and it probably would have been a dealbreaker if this wasn't in progress!
