- 
                Notifications
    You must be signed in to change notification settings 
- Fork 26
feat: add functions to convert content lists to data frames #470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| #' @export | ||
| as_tibble.connect_content_list <- function(x, ...) { | ||
| content_data <- purrr::map(x, "content") | ||
| parse_connectapi_typed(content_data, connectapi_ptypes$content) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is now the time to 🚮 the typing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. 😣 I know you want me to remove — I would keep it, personally!
I'll try parsing it un-typed and see how that feels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing without providing a prototype (using parse_connectapi) still attempts to parse columns based on their contents, so we still get columns with e.g. logical, numeric, character. But those could technically be inconsistent if the lists are short.
Date times parse as character, which would shift the burden onto our customers to figure out the correct way to parse. I'm not sure if the logic we have about parsing date times is necessary complexity or not, but if it is, then it seems like it adds value and utility to handle that for the user.
Perhaps I'm thinking about this wrong though — perhaps we should just parse everything as a character vector — is that what you're thinking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should not scope creep the type conversion business into this PR. It's worth a discussion but it can be separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON responses from the server are generally already correctly decoded (floats, ints, logicals, characters, etc all just work). There are a handful of fields that don't automatically parse, but the vast majority do just fine.
Thinking about how we actually want our core search_content function to work: the object it returns is useful, so if there are fields where the types are off, we should fix them before we even get to converting them to a dataframe. Somewhere like 
Lines 15 to 24 in b55a81b
| #' @description Initialize this content. | |
| #' @param connect The `Connect` instance. | |
| #' @param content The content data. | |
| initialize = function(connect, content) { | |
| validate_R6_class(connect, "Connect") | |
| self$connect <- connect | |
| # TODO: need to check that content has | |
| # at least guid, url, title to be functional | |
| self$content <- content | |
| }, | 
parser argument, I haven't dug into detail if it's intended use is something like this, but that might also be the right place for this.
For the purposes of this PR: maybe it makes sense to keep the parsing here where it is and we'll rip it out when we rip all of it out soon (I hope?). Or maybe it's not that big of a deal that the two(and if I'm looking at this correctly it is only two?) columns that are strings-of-datetimes aren't correctly translated in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this! I had realized I needed to ask about… what would replace it or perform that sort of function.
I definitely agree that the better place for it to happen is either during JSON parsing or during object creation. Doing it during conversion to a data frame is the wrong place for a server-object-centric package!
Happy to rip it out from here soon — I would prefer to do so after doing a bit of digging to see how much is actually needed. Like, if the only thing that gets parsed wrong at JSON-parse time are datetimes, it's not really a big deal. (At a cursory look, in the data frame conversion, an untyped conversion has a fair number of differences, but mostly in numeric types.)
I'll create an issue to track this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| class(content_list) <- c("connect_content_list", class(content_list)) | ||
| content_list | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to preface this class with connect_? That seems like unnecessary typing for us, and if we want to communicate this is from Connect, we can do that in the print method(s) on this object. Our R6 classes are not for example all prefaced with Connect_
I hate content, as always, but I don't think that's solvable here, but so much do I wish we had a better label for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The S3 classes we've been playing with so far (just integrations and @karawoo's work on users right now) all use a connect_ prefix.
My thinking is that S3 classes are more common than R6 classes and are a lot less strict in their form, so having more verbosely & explicitly-named classes here is better.
There wasn't much discussion of this approach at the time, so I didn't think it was controversial! (integrations: #431, #437, #440, user: #462)
If we conclude that it should just be named content_list, I feel like I should also change the name of the integrations classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen S3 classes commonly be prefixed by the package name since they don't show up namespaced when you check them a la inherits(x, class). I think this is fine and the verbosity doesn't really matter here, across the API at worst you'll have three words right? connect_<entity>(_list)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nods ok, I'm fine with leaving it in if it's systematic (and not unique to connectapi)
…com/posit-dev/connectapi into toph/as.data.frame-for-content-list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the methods are generally reasonable, but I have some concerns about the performance that I think means we should take a small step back in thinking about this container class.
| optional = FALSE, | ||
| ... | ||
| ) { | ||
| content_tbl <- as_tibble(x) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically this seems backwards since a tibble just has annotations on top of a data.frame, but I'm guessing it would be a bigger change to unwrap everything to sort that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it does seem backwards. This is because parse_connectapi_typed produces tibbles by default. I believe there was discussion of this in an integrations PR, lemme find it.
Yep: #426 (comment) :)
| #' | ||
| #' @return A `data.frame` with one row per content item. | ||
| #' @export | ||
| as.data.frame.connect_content_list <- function( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this method differ for any other list-of-records from the API? This makes me think you want to have some inheritance, set the S3 class as something like c("connect_content_list", "connect_list"), and then you define the method on connect_list so any list-type container would be supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably correct, but I haven't gone that route yet because I don't want to over-generalize before we have enough instances of the pattern to know what the caveats are. Stuff like… the difference between S3 and R6 classes (see #470 (comment)), whether we want to rely on typed parsing, etc.
I don't have a good sense for when the right moment to generalize is, but my gut is that it's a little in the future.
| #' @return A tibble with one row per content item. | ||
| #' @export | ||
| as_tibble.connect_content_list <- function(x, ...) { | ||
| content_data <- purrr::map(x, "content") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand the data structure here. What does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Content R6 class stores the API response under a content property here: 
Line 23 in b55a81b
| self$content <- content | 
The parse_connectapi* functions expect a list of objects (i.e. shaped like the original JSON server response), so this step is reconstructing that. Probably worth a comment, eh?
With newer S3 class objects e.g. connect_integration_list and @karawoo's work in progress on connect_user, the objects are lists with the properties at the top level, so you don't need this extra step.
(For example, to get the guid of an S3 connect_integration you do object$guid, whereas to get the GUID of an R6 Content object you have to do object$content$guid.)
| ) | ||
|  | ||
| purrr::map(res, function(x) { | ||
| content_list <- purrr::map(res, function(x) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this answers my question below about what is in the data structure. You might want to reconsider this eager conversion to R6 objects. It's not free, and this list of records might be huge. Suppose it is only 1ms to create an R6 object (I've seen it be a few ms), but if you've got 1000 content items, that's a second of latency you've added. And most cases you probably won't use the vast majority of those R6 objects (and certainly not when you call as.data.frame on it).
Instead, you could keep this data structure as a bare list of content records (+ the client object in an attribute), and define a [[ method that extracts and wraps in the R6 object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Experimentally, this does add some latency — with about 7000 items I saw about a second — it seems manageable, if not ideal.
Would defining a [[ method just shift that latency to other calls?
Returning a list of R6 objects allows all of the existing R6 content code to just work on that list's content — would defining [[ make that contingent on using that function to access the items? Like, would purrr::map(content_list, lock_content) work?
The change to returning R6 objects isn't a part of this PR, so I'd like to create an issue to track this separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IME the lapply functions work by [[ing elements so yes, it would work--though that's an empirical question here.
Would defining a [[ method just shift that latency to other calls?
It defers it, yes, but the point is that in many cases, you don't ever need to make the R6 object, so you never pay that penalty. To make a data.frame as you are doing here, for example, you are making 7000 R6 objects and then immediately iterating over them to extract the data back out of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #' @export | ||
| as_tibble.connect_content_list <- function(x, ...) { | ||
| content_data <- purrr::map(x, "content") | ||
| parse_connectapi_typed(content_data, connectapi_ptypes$content) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should not scope creep the type conversion business into this PR. It's worth a discussion but it can be separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I do agree that #474 is worth looking into further, but fine to happen separately from this PR.
Intent
Allow the list of content returned by
search_content()to be converted to a data frame.Fixes #451
Approach
search_content()as_tibble()andas.data.frame()methods for that function, exactly following the pattern used inintegrations.R.Checklist
NEWS.md(referencing the connected issue if necessary)? [not needed]devtools::document()?