Multiple formats under "Download All" dropdown #4000

acoppock · 2017-07-13T10:00:50Z

Since many repositories include code that expects data files to be in a particular format, it's frustrating that dataverse defaults to downloading data files as .tab.

IMHO, the default should be the original file format, with options for all the others.

landreev · 2017-07-17T04:10:44Z

I do see your point. The rationale behind this behavior has been to take the user's data, that may have been in some proprietary format - like SPSS or Stata - and change it to tab-delimited, for archival purposes, since it's a format that's guaranteed to be readable without any special software... And then it makes sense that this format becomes the default. Admittedly, it probably makes less sense with files that were originally CSV (now that we support converting CSV files into tabular data...). If nothing else, CSV is just as good an "archival format" as tab-delimited...

So, once again, I do see your point. Still, this is very, very ancient legacy and it would honestly be difficult for us to just change this behavior, without upsetting or at least confusing many existing users. But we should still be able to make it less frustrating for users like you.

First of all, we already have an issue that's very close to the head of the dev. queue that will add an option for the user to opt out of converting a file to tabular data in the first place. Kind of a nuclear option, really - because then you would not be able to do things that require tabular metadata. So, not sure if that will help with your use case.

And then we can make it configurable for individual files. As in, keep the default behavior as it is now - tab. is the default download format; but make it possible to specify, per file, which format should be the default.

Also, it sounds like you were talking about downloading files programmatically, via API calls. I'm assuming you were able to work around this, using our API methods. Should be relatively easy, to first look up a file and determine if it's tabular or not, and if it is, ask for the original, instead of the default file. Still some extra stuff to do, of course - but doable.
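The workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `/api/access/datafile/{id}` path and the `format=original` parameter are assumed here rather than confirmed in this thread, and `file_meta` is a hypothetical dict standing in for whatever the file lookup returns (with an assumed `contentType` of `text/tab-separated-values` marking an ingested tabular file).

```python
from urllib.parse import urlencode

# Assumption: ingested tabular files are reported with this content type.
TABULAR_TYPE = "text/tab-separated-values"


def is_tabular(file_meta):
    """Return True if the (hypothetical) metadata dict describes an ingested tabular file."""
    return file_meta.get("contentType") == TABULAR_TYPE


def download_url(base_url, file_meta):
    """Build a download URL, asking for the original only when the file is tabular."""
    url = f"{base_url}/api/access/datafile/{file_meta['id']}"
    if is_tabular(file_meta):
        # Assumed query parameter for requesting the saved original.
        url += "?" + urlencode({"format": "original"})
    return url
```

Non-tabular files keep the plain URL, so the extra lookup only changes the request for files that were actually converted.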

To summarize, we are open to suggestions, and we should be able to make the download API better suited to your needs - via extra options/features, etc. But just changing the default behavior for every existing tabular file may not be an option, for legacy reasons.

Cheers.

acoppock · 2017-07-17T08:33:08Z

Ah I see!

Purpose 1: archiving. .rdata is likely not a great archive format, so I see the need to convert here. I think the *right* approach is to convert all data.frames in an .rdata file to tabular, not just the first one. Leaves open the question of what to do with R objects that are not data.frames.

Purpose 2: downloading. I’ve been using the website, not downloading programmatically. The main point of friction for me is when I use the check box to select all files in an archive. That’s when it would be especially useful to have the default be the original file format. And I think this point is not R-specific: it’s been frustrating when the replication .do files call for the .dta versions of things but I only have the .tab!

But since changing these defaults appears to be hard for legacy reasons, perhaps the way to fix things is to have a drop-down on the “download” button associated with the multi-file download that says, “download all files as original.”

I also like users being able to *set* the default download type per-file. I would use such a feature.

Thanks very much for your response,
Alex


setgree · 2017-11-16T20:57:52Z

Hello! Is there any chance of letting users choose, within the "download all" menu, between downloading all files in tabular format and downloading all files in their original formats? That would seem to address your concern about legacy users. You could then possibly run some analytics, and if it turns out that 98% of users prefer to download in .tab vs original format, we will all know definitively that this was a fringe issue; if it turns out the other way, we will know that there was demand for original file formats.

If this won't be possible, could there be some documentation on repositories for which data files have been converted? I ask because recently some colleagues were trying to download files from dataverse to replicate a study and were unable to do so, and had no idea even what the source of the problem was or how to address it based on the repository they had navigated to. Perhaps a pop-up menu when you select "download all" explaining the issue?

landreev · 2017-11-16T22:38:17Z

Regarding the last request:

This would be very easy to achieve on the API side (i.e. to make that API method that zips up multiple file bundles accept an extra "format=original" option; that would make it use the originals for the files that were converted to tabular data...)
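A minimal sketch of what such an option would do when the bundle is assembled, not the actual Dataverse code: `build_bundle` and the `default`/`original` keys are hypothetical names, and each file is modelled as a plain dict holding `(name, bytes)` pairs.

```python
import io
import zipfile


def build_bundle(files, fmt=None):
    """Zip the given files; with fmt == "original", use saved originals where they exist."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Only ingested files carry a saved original; everything else is unchanged.
            use_original = fmt == "original" and f.get("original") is not None
            name, data = f["original"] if use_original else f["default"]
            zf.writestr(name, data)
    return buf.getvalue()


# Hypothetical example data: one ingested Stata file and one ordinary text file.
example_files = [
    {"default": ("survey.tab", b"a\tb\n1\t2\n"), "original": ("survey.dta", b"<stata bytes>")},
    {"default": ("readme.txt", b"hello"), "original": None},
]
```

With `fmt="original"` the archive would contain `survey.dta` and `readme.txt`; without it, `survey.tab` and `readme.txt`.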

On the dataset page, can we add an extra checkbox ("use originals"?) next to that download-multiple-files-button? (I mean, of course we can add a checkbox - but can we do it without making the whole thing more, rather than less confusing?)

setgree · 2017-11-16T22:51:22Z

This checkbox would make my life a lot easier, thank you!

landreev · 2017-11-16T22:51:33Z

Also, we have a github issue already opened for giving the dataset owner an easy way to "un-ingest" a tabular data file; i.e. to convert it back to the original. Let's implement it finally. It should be easy. And for a researcher whose needs are primarily archival (like providing replication data to the research community), who doesn't need/care about running online data exploration/analysis on the site, this by itself would solve an issue like this one.

pdurbin · 2017-11-16T23:02:56Z

> Also, we have a github issue already opened for giving the dataset owner an easy way to "un-ingest" a tabular data file; i.e. to convert it back to the original. Let's implement it finally. It should be easy

Yep. Good old #3766.

oscardssmith · 2018-07-06T19:04:23Z

What's the actual action item here?

pdurbin · 2018-07-09T14:33:58Z

@oscardssmith good question. You could bring this up during backlog grooming to get a "definition of done".

djbrooke · 2018-07-09T14:38:22Z

No worries, I'll bring it to backlog grooming once it's a priority and there is some consensus on an approach.

dlmurphy · 2018-07-11T15:06:25Z

We just discussed this issue in our weekly design meeting.

For this issue, our goal is to allow users to easily “download all” files in a dataset in their original format using our UI.

On the dataset page's "Download all" button, we want to add two dropdown options:

Archival (open) format
Original format

But we're open to suggestions; please leave a comment if you have any thoughts on this solution.

scolapasta · 2018-07-11T15:10:45Z

Do we want to have logic to only give these options in the case where you have ingested files? Also to consider: currently we only ingest tabular files, but we have discussed the idea of other types of ingest, e.g. ingest zip files as a dataverse "package". Not sure if this affects the design for this at this stage or if it's a bridge we should cross later.

dlmurphy · 2018-07-11T15:11:38Z

Yes, we only want to offer these options in cases where the distinction matters, i.e. when the dataset has at least one ingested file.
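As a rough sketch of that rule (the function name and the `ingested` flag are hypothetical; the two labels are the ones proposed earlier in this thread):

```python
def download_all_options(files):
    """Return the dropdown labels to show, or an empty list for a plain button."""
    # The distinction only matters when at least one file was ingested.
    if any(f.get("ingested") for f in files):
        return ["Archival (open) format", "Original format"]
    return []
```

An empty list here stands for "no dropdown, just the regular Download All button".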

scolapasta · 2018-07-11T15:12:51Z

That's what I assumed. Just wanted to make sure it got tracked in the issue. Thanks!

pdurbin · 2018-07-11T16:03:58Z

One quick thought is that @mheppler and I seem to agree that we shouldn't hack on the code until after we've refactored it for #4656.

pdurbin · 2018-07-13T03:20:35Z

Is #4464 a duplicate of this ticket?

djbrooke · 2018-08-06T16:26:33Z

assigning to @dlmurphy to talk about this at next estimation session

landreev · 2018-09-03T23:22:34Z

@matthew-a-dunlap
Copy-and-pasting from/expanding on the slack discussion on making the full permission check pass before generating any zipped output, in order to produce the 207 return code, for posterity:

For the purposes of #4000, I feel like we probably should revert to checking the permissions as we generate the zipped stream. We may still end up using the current implementation when working on #4576; but let's think about it then. The 207 code may be useful to have, for the API users (even though, it looks like it was never specifically requested in #4576 - it was something we offered along the way); but the UI users
a) will have no benefit from it; and
b) will be penalized by having to wait before the zipped output starts streaming. In the past, it was a serious enough problem that it was causing timeouts for some users (datasets with tons of small files?) - so that we ended up rewriting that API method, specifically to use the streaming approach instead. Also, we have switched to a more expensive way of checking permissions - we are no longer cutting corners there, so it'll be even worse.

Also, c) for the UI users, we are already checking the permissions on the UI side (and are warning the users there, via a popup, that we are dropping some files that they cannot download); and, per #4576, we'll be doing the same for the files that have to be dropped because of the size limit. Meaning, when this API receives a call that's a redirect from the UI, it will only contain the file ids that the user is in fact allowed to download. We do of course want to double check that it is indeed the case; but there is no need to do it in a separate first pass, before generating any output.

So we may want the API to do both things. I.e., handle it the way you have it implemented now, by default: run the full check; if any files have to be dropped, generate 207; only then generate output. But also support some kind of a "start streaming asap" flag, to be used when we redirect the user to that API from the UI. But, again, we should probably address that when we work on #4576.
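The two modes could be sketched like this. It is an illustration only, not the implementation under review: `plan_download`, the `stream_asap` flag, and the `can_download` callback are all hypothetical names.

```python
def plan_download(file_ids, can_download, stream_asap=False):
    """Return (HTTP status, iterator over the file ids that will go into the zip)."""
    if stream_asap:
        # Streaming mode: check each file lazily, right before it would be written,
        # so output can start immediately. The status is fixed up front.
        return 200, (fid for fid in file_ids if can_download(fid))
    # Default mode: run the full permission check before producing any output.
    allowed = [fid for fid in file_ids if can_download(fid)]
    # 207 signals that some requested files had to be dropped.
    status = 207 if len(allowed) < len(file_ids) else 200
    return status, iter(allowed)
```

The trade-off is the one described above: the default mode can report dropped files in the status code but delays the first byte, while the streaming mode starts right away and can no longer change the status once output has begun.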


matthew-a-dunlap · 2018-09-06T23:06:01Z

This may not need another round of review, but I am dragging it back in in case someone wants to put eyes on it.

makmanalp · 2018-09-10T18:13:19Z

I just want to express my excitement and enthusiasm for this ticket. Thanks for all the hard work! We're just moving our datasets to Dataverse and it probably would have been a dealbreaker if this wasn't in progress!

pdurbin added the Feature: File Upload & Handling, User Role: Guest, and Type: Suggestion labels Aug 18, 2017

pdurbin mentioned this issue Oct 17, 2017

File download and handling (download all): resulting zip file should download files in "original format" #4215

Closed

djbrooke added the Status: Backlog label Jan 25, 2018

pdurbin mentioned this issue Feb 9, 2018

Bulk data download original file format #4464

Closed

pdurbin mentioned this issue Mar 7, 2018

Ingest: Provide more robust ingest for Excel and CSV #585

Closed

TaniaSchlatter mentioned this issue May 29, 2018

Dataset Files DataTable - Paginator, Selection, Download, Counter #4656

Closed

djbrooke assigned dlmurphy Jul 11, 2018

dlmurphy removed their assignment Jul 11, 2018

djbrooke added the ready for estimation label Aug 1, 2018

djbrooke assigned dlmurphy Aug 6, 2018

djbrooke changed the title ~~Default Download Type should be "original"~~ Add Original File formats to "Download All" dropdown Aug 8, 2018

scolapasta assigned landreev Aug 31, 2018

landreev added a commit that referenced this issue Aug 31, 2018

Added an extra check, for tabular data files/saved originals. (#4000)

ac38d3b

landreev mentioned this issue Sep 3, 2018

Download pre-check alerting user when zipped files are too large #4576

Closed

matthew-a-dunlap assigned matthew-a-dunlap and unassigned landreev and kcondon Sep 4, 2018

matthew-a-dunlap added Status: Development and removed Status: QA labels Sep 4, 2018

matthew-a-dunlap added a commit that referenced this issue Sep 4, 2018

Fix size calc bug b/f revert sync zip download #4000

9023b1f

matthew-a-dunlap added a commit that referenced this issue Sep 4, 2018

Revert sync datafiles download #4000

53ca5d1

Tests still need fixing and expansion

matthew-a-dunlap added a commit that referenced this issue Sep 6, 2018

More verbose IT tests for access #4000

14c1931

matthew-a-dunlap added Status: Code Review and removed Status: Development labels Sep 6, 2018

djbrooke assigned landreev and unassigned matthew-a-dunlap Sep 7, 2018

djbrooke added Status: QA and removed Status: Code Review labels Sep 10, 2018

djbrooke unassigned landreev Sep 10, 2018

kcondon self-assigned this Sep 10, 2018

kcondon closed this as completed in cc6d502 Sep 14, 2018

kcondon removed the Status: QA label Sep 14, 2018

djbrooke added this to the 4.9.3 - Optional File PIDs, Initial Internationalization Work milestone Sep 18, 2018

pdurbin mentioned this issue Oct 4, 2018

No Option to download all file formats. #3513

Closed

djbrooke mentioned this issue Jul 18, 2019

Download All Error Reporting - Hidden in Manifest if not all files are downloaded (4.9.4) #5588

Closed

pdurbin mentioned this issue Nov 25, 2019

Files being converted automatically from .csv to .tab #6385

Closed

pdurbin mentioned this issue Jun 21, 2021

Display deposited (rather than ingested) copy of tabular files #7956

Open

pdurbin mentioned this issue Oct 10, 2022

rdata ingest defaults #3999

Open
