
Refine the organization of get_file()? #48

Closed
wibeasley opened this issue Jan 11, 2020 · 8 comments · Fixed by #66
Assignees
Labels
data-download Functions that are about downloading, not uploading, data

Comments

@wibeasley
Contributor

wibeasley commented Jan 11, 2020

@adam3smith, @kuriwaki, @pdurbin, and anyone else,

Should get_file() be refactored into multiple child functions? It seems like we're asking it to do a lot of things, including

I like all these capabilities, and want to discuss organizational ideas with people so the package structure is (a) easy for us to develop, test, & maintain, and (b) useful and natural for users to learn and incorporate.

One possible approach:
A foundational function retrieves the file(s) by ID; it is the workhorse that actually retrieves the file. A second function accepts the file name (not ID); it essentially wraps the first function after calling get_fileid(). Both of these functions deal with a single file at a time.

Another pair of functions deals with multiple files (one by name, one by id). But these return lists, not a single object. They're essentially lapply()s/loops around their respective siblings described above.

To avoid breaking the package interface, maybe the existing get_file() keeps its same interface (which ambiguously accepts either file names or ids, and returns either a single file or a list of files), but we soft-deprecate it and encourage new code to use these more explicit functions? The guts of the function are moved out into the four new functions.

Maybe the function names are

  1. the existing get_file() with an unchanged interface
  2. get_file_by_id() (the workhorse)
  3. get_file_by_name()
  4. get_files_by_id()
  5. get_files_by_name() (see @adam3smith's comment below)
  6. get_tibble_by_id()
  7. get_tibble_by_name()
  8. get_zip_by_id()
  9. get_zip_by_name() (see @adam3smith's comment below)
  10. get_file_by_doi() (see @adam3smith's comment below)

I'm pretty sure it would be easier to write tests that isolate problems. The documentation becomes more verbose, but probably more straightforward.
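To make the division of labor concrete, here is a minimal sketch of how the proposed functions could delegate to one workhorse. This is purely illustrative of the organizational idea, not an actual implementation; the function names and signatures are the hypothetical ones from the list above.

```r
# Hypothetical sketch: get_file_by_id() is the workhorse; the other
# proposed functions are thin wrappers around it.

get_file_by_name <- function(filename, dataset, ...) {
  fileid <- get_fileid(dataset, filename)  # resolve the name to an id first
  get_file_by_id(fileid, ...)              # then delegate to the workhorse
}

get_files_by_id <- function(fileids, ...) {
  # Always returns a list of raw vectors, one element per requested file.
  lapply(fileids, get_file_by_id, ...)
}
```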

You guys have more experience with Dataverse than I do, and a better sense of the use cases. Would this reorganization help users? If not, maybe we still split it into multiple functions, but just keep functions 2-5 private.

Maybe I'm making this unnecessarily tedious, but I'm thinking that these download functions are the most called by R users, and they're certainly the ones that are called by new users. So if they leave a bad impression, the package is less likely to be used.

@pdurbin
Member

pdurbin commented Jan 11, 2020

To be honest, I don't have a strong opinion. I'd suggest maintaining backward compatibility if you can.

@adam3smith
Contributor

Yes, keeping get_file() for backward compatibility makes sense.
I like adding get_zip_by_id() and I like adding a separate function for by_name -- the logic in get_file() is indeed rather tortured for that.

On the other hand, I think keeping get_file* and get_files* together keeps things simple, especially if you agree with my re-write of the multi-file download to use sequential downloads of individual files.

get_zip_by_name() and get_files_by_name() don't make sense -- you can only request multiple files by id, I believe.

I don't have a view about the tibble/data frame question.

So my suggestion then would be to reduce your list to

the existing get_file() with an unchanged interface
get_file_by_id() (the workhorse)
get_file_by_name()
get_tibble_by_id()
get_tibble_by_name()
get_zip_by_id()

Given my re-write of the multi-file download to use sequential downloads of individual files, I don't think a separate get_files_* function is needed.

@adam3smith
Contributor

More thoughts after sleeping on it:

  1. I've come around to a separate get_files* function so that the type of output is predictable: get_file* always produces a raw vector, get_files* always a list of raw vectors. Easy enough to implement.
  2. As opposed to what I wrote above: while the API doesn't implement it, we could of course implement get_files and get_zip by name if we wanted to. I'm not sure there's demand for that, though, and I don't think we should encourage it, so I'm still inclined not to do those.
  3. What we should implement, however, is get_file_by_doi -- one of the principal reasons for implementing file-level DOIs was reproducibility workflows.
  4. I wonder if we could provide a write_files() function. It's a bit tricky to write the file with its right filename (especially when requesting the original), and it's especially tricky to get the folder structure written when you're requesting individual files, so this seems like it'd be a huge help.
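The folder-structure point in item 4 can be sketched with base R alone. This is a hypothetical write_files() (the name and the shape of the `files` argument are assumptions for illustration): it takes a named list of raw vectors, where each name is the file's intended relative path, and recreates the directory structure on disk.

```r
# Hypothetical sketch of the proposed write_files() idea, base R only.
# Assumes `files` is a named list of raw vectors, with names like
# "data/raw/survey.csv" giving each file's relative path in the dataset.
write_files <- function(files, dir = ".") {
  for (path in names(files)) {
    dest <- file.path(dir, path)
    # Recreate the dataset's folder structure before writing.
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    writeBin(files[[path]], dest)  # raw vector straight to disk
  }
  invisible(file.path(dir, names(files)))
}
```

The tricky parts the comment mentions (recovering the *original* filename, and nesting within directoryLabel-style folders) would live upstream of a helper like this.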

@kuriwaki
Member

kuriwaki commented Jan 11, 2020

  1. We can start with fewer functions rather than more.
  2. Personally, I had expected the existing get_dataset to do what is performed by get_file. Instead, get_dataset gets metadata. Probably hard to change for backwards compatibility, though.
  3. I think Jenny Bryan has a tweet pointing out that a common package abbreviation at the start of a class of functions is good for auto-complete. For example, stringr, purrr, and googlesheets function names are nice because if a user starts typing str_ or map_ or gs_, RStudio auto-complete provides all the relevant functions. Very useful when looking for a function whose exact name you don't remember.

@wibeasley
Contributor Author

@pdurbin & @adam3smith

maintaining backward compatibility if you can.
...
keeping get_file() for backward compatibility makes sense.

Cool. Then let's keep this as a priority.

@adam3smith

I like adding get_zip_by_id() and I like adding a separate function for by_name -- the logic in get_file() is indeed rather tortured for that.

Thanks for the extra perspective.

...so that the type of output is predictable...

Cool.

...not sure if there's demand for that, I think we shouldn't encourage it so I'm still inclined not to do those...

I think you're in agreement with @kuriwaki's later statement "We can start by fewer functions rather than more". Sounds good to me.

...get_file_by_doi()...

You're right. Good point. That would encourage/emphasize the reproducibility & transparency.

...provide a write_files() function...

One that writes straight to disk, and isn't immediately available as a data.frame ...until you read from the disk? I wouldn't have thought that comes up a lot, but I'll believe you guys since you have a better feel for how people use this package than I do.

Would you be o.k. with a function name like download_files() instead of write_files()? Or maybe some third option is better? When I hear "write" in the context of database servers, I initially think of persisting something on the server's disk, rather than the client's. That's what I used in REDCapR's redcap_write() and what DBI / odbc uses in dbWriteTable().


@kuriwaki

...I like the general idea and would be happy to contribute.

Sweet. Thanks for offering. I'd like us to start once we get some more test coverage. It sounds like the three of us roughly agree on (a) which are the important functions that should be exposed initially, and (b) how the function guts could be reorganized.

...common package abbreviations at the start of a class of functions is good for auto-complete...

I'm on the fence about this. I like Jenny's insight and advice almost all the time. But I don't like putting the package name in the function, because I think almost all function calls should be qualified with the package -- just like in almost every other real programming language. That's what I urge my team to do in our practices.

However, I broke this rule ~7 years ago with the REDCapR package because I didn't believe many of the users could keep a function called write() straight without it.

Given that this package's functions aren't already named like that, I'm inclined not to add a prefix like 'dv_'. But I could be convinced otherwise if people felt strongly.

I had expected the existing get_dataset to do what is performed by get_file. Instead, get_dataset gets metadata. Probably hard to change for backwards compatibility though.

I remember the same reaction myself. After a conversation with @pdurbin two weeks ago, I added a paragraph in our OU team's documentation that "dataset" isn't always a rectangle, like I've grown accustomed to in the R world.

Let's make sure the documentation for the two families of functions has a good "See Also" section that points to each other. Any other ideas on how to make this clearer to users?

@adam3smith
Contributor

adam3smith commented Jan 13, 2020

I think we're almost at a consensus; nice.

One that writes straight to disk, and isn't immediately available as a data.frame ...until you read from the disk? I wouldn't have thought that comes up a lot, but I'll believe you guys since you have a better feel for how people use this package than I do.

I don't actually know how commonly this would be used, but here are my two use cases:

  1. We use this in my team to download the whole dataset for curation. (I've just published a curation package https://github.com/QualitativeDataRepository/dvcurator if you're curious. The Readme describes the workflow)
  2. In a reproducibility package where the R script loads different files from a folder structure, I don't think there's an alternative to writing them to disk, is there?

Would you be o.k. with a function name like download_files() instead of write_files(). Or maybe some third option is better? When I hear "write" in the context of database servers, I initially think of persisting something on the server's disk, rather than the clients. That's what I used in REDCapR's redcap_write() and what DBI / odbc uses in dbWriteTable().

I have no strong feelings about the function names, but your write-vs.-download distinction sounds plausible to me. utils::download.file() saves files from the internet and writes them to disk, so that, too, suggests "download" is the better term. download_dv_files() or download_dataset_files() would be longer but more distinctive. But again, no strong feelings (and the rest of the package functions don't have particularly distinctive names either).

@pdurbin
Member

pdurbin commented Jan 13, 2020

  1. I've just published a curation package https://github.com/QualitativeDataRepository/dvcurator if you're curious. The Readme describes the workflow

@adam3smith I'd be interested in a talk or a screencast about this. We could put it on DataverseTV. 😄

@wibeasley you should come to the Dataverse Community Meeting in June: https://projects.iq.harvard.edu/dcm2020 (and @adam3smith should come again). 😄

@kuriwaki kuriwaki added the data-download Functions that are about downloading, not uploading, data label Dec 3, 2020
kuriwaki added a commit that referenced this issue Dec 27, 2020
@kuriwaki kuriwaki pinned this issue Dec 27, 2020
kuriwaki added a commit that referenced this issue Dec 29, 2020
@kuriwaki
Member

kuriwaki commented Dec 30, 2020

@wibeasley @adam3smith I've created PR #66 from my branch to implement the things discussed here. This is a substantial addition to the package.

I have implemented the following features, in addition to get_file (kept mainly for backwards compatibility):

  1. get_file_by_id
  2. get_file_by_doi
  3. get_file_by_name
  4. get_dataframe_by_id (sometimes referred to as get_tibble_* above)
  5. get_dataframe_by_doi
  6. get_dataframe_by_name

See the README data download section and help pages in this branch for some examples.

I did not implement get_zip because of my doubts in #46 (comment). And I did not implement get_files_* (plural; multiple files) because I felt it was redundant with get_file.

The main design question I have now is whether the function naming here is appropriate?

  • get_file_* is fine, but more technically it should be called something like "get_raw_" because it returns a raw binary.
  • get_dataframe_* is, as written, actually much more powerful than "dataframe" suggests -- by supplying a function, the user can read in any R object, even non-rectangular output. Therefore, it may be more accurate to call it "get_object_".

I do like the naming implications of get_dataframe_* (other than that it's a bit long) because it is intuitive for users, but just wanted to put these two points out.
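To illustrate the "supplying a function" point: usage might look like the sketch below. The DOI, filename, and server are illustrative placeholders, and the exact argument names should be checked against PR #66's help pages rather than taken from here.

```r
library(dataverse)

# Hedged usage sketch of the reader-function idea in get_dataframe_by_name().
# The dataset DOI, filename, and server below are placeholders for
# illustration; confirm the real signature in the PR #66 branch.
nlsw <- get_dataframe_by_name(
  filename = "nlsw88.tab",
  dataset  = "doi:10.70122/FK2/PPIAXE",
  .f       = readr::read_tsv,       # any reader function; its output is returned
  server   = "demo.dataverse.org"
)
```

Because `.f` can be any function from a file path (or connection) to an R object, the same machinery could read RDS files, shapefiles, or anything else -- which is exactly why "get_object_" is arguably the more accurate name.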

@wibeasley wibeasley unpinned this issue Jan 20, 2021