Refine the organization of get_file()? #48
Comments
To be honest, I don't have a strong opinion. I'd suggest maintaining backward compatibility if you can.
Yes, keeping get_file() for backward compatibility makes sense. On the other hand, I think keeping get_file* and get_files* together keeps things simple, especially if you agree with my re-write of the multi-file download to use sequential downloads of individual files.
I don't have a view about the tibble/data frame So my suggestion then would be to reduce your list to the existing Given my re-write o
More thoughts after sleeping on it:
Cool. Then let's keep this as a priority
Thanks for the extra perspective.
Cool.
I think you're in agreement with @kuriwaki's later statement "We can start by fewer functions rather than more". Sounds good to me.
You're right. Good point. That would encourage/emphasize the reproducibility & transparency.
One that writes straight to disk, and isn't immediately available as a data.frame ...until you read from the disk? I wouldn't have thought that comes up a lot, but I'll believe you guys since you have a better feel for how people use this package than I do. Would you be o.k. with a function name like
Sweet. Thanks for offering. I'd like us to start once we get some more test coverage. It sounds like the three of us roughly agree on (a) which are the important functions that should be exposed initially, and (b) how the function guts could be reorganized.
I'm on the fence about this. I like Jenny's insight and advice almost all the time. But I don't like putting the package name in the function because I think almost all function calls should be qualified with the package, just like in almost every other real programming language. That's what I urge my team to do in our practices. However, I broke this rule ~7 years ago with the REDCapR package because I didn't believe many of the users could keep a function called Given that this package's functions aren't already named like that, I'm inclined not to add a prefix like 'dv_'. But I could be convinced otherwise if people felt strongly.
I remember the same reaction myself. After a conversation with @pdurbin two weeks ago, I added a paragraph in our OU team's documentation that a "dataset" isn't always a rectangle, like I've grown accustomed to in the R world. Let's make sure the documentation for the two families of functions has a good "See Also" section so that each points to the other. Any other ideas for how to make it clearer to users?
I think we're almost at a consensus; nice.
I don't actually know how commonly this would be used, but here are my two use cases:
I have no strong feelings about the function names, but your write vs. download distinction sounds plausible to me.
@adam3smith I'd be interested in a talk or a screencast about this. We could put it on DataverseTV. 😄 @wibeasley you should come to the Dataverse Community Meeting in June: https://projects.iq.harvard.edu/dcm2020 (and @adam3smith should come again). 😄
@wibeasley @adam3smith I've created PR #66 from my branch to implement the things discussed here. This is a substantial addition to the package. I have implemented the following features, in addition to
See the README data download section and help pages in this branch for some examples. I did not implement The main design question I have now is whether the function naming here is appropriate?
I do like the naming implications of
@adam3smith, @kuriwaki, @pdurbin, and anyone else,
Should get_file() be refactored into multiple child functions? It seems like we're asking it to do a lot of things, including data.frame or tibble.
I like all these capabilities, and want to discuss organizational ideas with people so the package structure is (a) easy for us to develop, test, & maintain, and (b) useful and natural for users to learn and incorporate.
One possible approach:
A foundational function retrieves a file by ID; it is the workhorse that actually performs the download. A second function accepts the file name (not ID); it essentially wraps the first function after calling get_fileid(). Both of these functions deal with a single file at a time.
Another pair of functions deals with multiple files (one by name, one by ID). But these return lists, not a single object. They're essentially lapply() calls/loops around their respective siblings described above.
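The four-function layout described above might be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_server`, `fake_index`, and `lookup_file_id()` are hypothetical in-memory stand-ins (the last one for the name-to-ID lookup that get_fileid() would do), so nothing here touches a real Dataverse server or reflects the package's actual internals.

```r
# Hypothetical in-memory stand-in for a Dataverse server: file ID -> raw bytes.
fake_server <- list(
  "101" = charToRaw("a,b\n1,2\n"),
  "102" = charToRaw("x,y\n3,4\n")
)

# Hypothetical stand-in for the name -> ID lookup (the role get_fileid() plays).
fake_index <- c("first.csv" = 101L, "second.csv" = 102L)

lookup_file_id <- function(file_name) {
  if (!file_name %in% names(fake_index)) stop("Unknown file name: ", file_name)
  fake_index[[file_name]]
}

# Workhorse: retrieves exactly one file by ID and returns its raw bytes.
get_file_by_id <- function(id) {
  if (length(id) != 1L) stop("`id` must have length 1.")
  key <- as.character(id)
  if (!key %in% names(fake_server)) stop("Unknown file ID: ", key)
  fake_server[[key]]
}

# Thin wrapper: resolve the name to an ID, then delegate to the workhorse.
get_file_by_name <- function(file_name) {
  get_file_by_id(lookup_file_id(file_name))
}

# Multi-file variants: sequential loops over the single-file siblings.
# They always return a list, even for a single input.
get_files_by_id <- function(ids) {
  lapply(ids, get_file_by_id)
}

get_files_by_name <- function(file_names) {
  lapply(file_names, get_file_by_name)
}
```

With this split, only `get_file_by_id()` would ever need to talk to the server; the other three are composition, which keeps each one small enough to test on its own.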
To avoid breaking the package interface, maybe the existing get_file() keeps its same interface (which ambiguously accepts either file names or IDs and returns either a single file or a list of files), but we soft-deprecate it and encourage new code to use these more explicit functions? The guts of the function are moved out into the four new functions.
Maybe the function names are:
1. get_file() with an unchanged interface
2. get_file_by_id() (the workhorse)
3. get_file_by_name()
4. get_files_by_id() (see @adam3smith's comment below)
5. get_files_by_name()
6. get_tibble_by_id()
7. get_tibble_by_name()
8. get_zip_by_id() (see @adam3smith's comment below)
9. get_zip_by_name()
10. get_file_by_doi() (see @adam3smith's comment)
I'm pretty sure it would be easier to write tests that isolate problems. The documentation becomes more verbose, but probably more straightforward.
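A backward-compatible get_file() that only dispatches to the explicit functions could look something like the sketch below. The assumptions are mine, not the package's: the one-argument signature is a simplification, the four helper bodies are placeholder stubs, the rule "numeric means ID, character means name" is a guess at how the ambiguity would be resolved, and a plain `message()` stands in for whatever soft-deprecation mechanism is eventually chosen.

```r
# Placeholder stubs for the four explicit functions (hypothetical bodies).
get_file_by_id    <- function(id) paste0("bytes-of-id-", id)
get_file_by_name  <- function(file_name) paste0("bytes-of-name-", file_name)
get_files_by_id   <- function(ids) lapply(ids, get_file_by_id)
get_files_by_name <- function(file_names) lapply(file_names, get_file_by_name)

# Legacy entry point: keeps the ambiguous interface, but contains no download
# logic of its own. It nudges callers toward the explicit functions.
get_file <- function(file) {
  message(
    "get_file() is soft-deprecated; ",
    "consider the explicit get_file_by_*() / get_files_by_*() functions."
  )
  by_id <- is.numeric(file)  # assumption: numeric = ID, character = name
  if (length(file) == 1L) {
    if (by_id) get_file_by_id(file) else get_file_by_name(file)
  } else {
    if (by_id) get_files_by_id(file) else get_files_by_name(file)
  }
}
```

Because the wrapper only dispatches, a test can swap in stubs like the ones above and check each branch without any network access, which is the kind of isolation mentioned above.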
You guys have more experience with Dataverse than I do, and better sense of the use cases. Would this reorganization help users? If not, maybe we still split it into multiple functions, but just keep the visibility of functions 2-5 private.
Maybe I'm making this unnecessarily tedious, but I'm thinking that these download functions are the most called by R users, and they're certainly the ones that are called by new users. So if they leave a bad impression, the package is less likely to be used.