Skip to content

Conversation

@toph-allen
Copy link
Collaborator

Intent

Allow the list of content returned by search_content() to be converted to a data frame.

Fixes #451

Approach

  • Add a class name to the list returned by search_content()
  • Provide as_tibble() and as.data.frame() methods for that function, exactly following the pattern used in integrations.R.

Checklist

  • Does this change update NEWS.md (referencing the connected issue if necessary)? [not needed]
  • Does this change need documentation? Have you run devtools::document()?

#' @export
as_tibble.connect_content_list <- function(x, ...) {
content_data <- purrr::map(x, "content")
parse_connectapi_typed(content_data, connectapi_ptypes$content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is now the time to 🚮 the typing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. 😣 I know you want me to remove — I would keep it, personally!

I'll try parsing it un-typed and see how that feels.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing without providing a prototype (using parse_connectapi) still attempts to parse columns based on their contents, so we still get columns with e.g. logical, numeric, character. But those could technically be inconsistent if the lists are short.

Date times parse as character, which would shift the burden onto our customers to figure out the correct way to parse. I'm not sure if the logic we have about parsing date times is necessary complexity or not, but if it is, then it seems like it adds value and utility to handle that for the user.

Perhaps I'm thinking about this wrong though — perhaps we should just parse everything as a character vector — is that what you're thinking?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should not scope creep the type conversion business into this PR. It's worth a discussion but it can be separate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON responses from the server are generally already correctly decoded (floats, ints, logicals, characters, etc all just work). There are a handful of fields that don't automatically parse, but the vast majority do just fine.

Thinking about how we actually want our core search_content function to work: the object it returns is useful, so if there are fields where the types are off, we should fix them before we even get to converting them to a dataframe. Somewhere like

connectapi/R/content.R

Lines 15 to 24 in b55a81b

#' @description Initialize this content.
#' @param connect The `Connect` instance.
#' @param content The content data.
initialize = function(connect, content) {
validate_R6_class(connect, "Connect")
self$connect <- connect
# TODO: need to check that content has
# at least guid, url, title to be functional
self$content <- content
},
would be a good place to put that type munging (but again, only for the fields that aren't already converted for us by httr / jsonlite parsing). Alternatively, our HTTP interface also has a parser argument, I haven't dug into detail if it's intended use is something like this, but that might also be the right place for this.

For the purposes of this PR: maybe it makes sense to keep the parsing here where it is and we'll rip it out when we rip all of it out soon (I hope?). Or maybe it's not that big of a deal that the two(and if I'm looking at this correctly it is only two?) columns that are strings-of-datetimes aren't correctly translated in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this! I had realized I needed to ask about… what would replace it or perform that sort of function.

I definitely agree that the better place for it to happen is either during JSON parsing or during object creation. Doing it during conversion to a data frame is the wrong place for a server-object-centric package!

Happy to rip it out from here soon — I would prefer to do so after doing a bit of digging to see how much is actually needed. Like, if the only thing that gets parsed wrong at JSON-parse time are datetimes, it's not really a big deal. (At a cursory look, in the data frame conversion, an untyped conversion has a fair number of differences, but mostly in numeric types.)

I'll create an issue to track this though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1555 to +1556
class(content_list) <- c("connect_content_list", class(content_list))
content_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to preface this class with connect_? That seems like unnecessary typing for us, and if we want to communicate this is from Connect, we can do that in the print method(s) on this object. Our R6 classes are not for example all prefaced with Connect_

I hate content, as always, but I don't think that's solvable here, but so much do I wish we had a better label for this.

Copy link
Collaborator Author

@toph-allen toph-allen Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The S3 classes we've been playing with so far (just integrations and @karawoo's work on users right now) all use a connect_ prefix.

My thinking is that S3 classes are more common than R6 classes and are a lot less strict in their form, so having more verbosely & explicitly-named classes here is better.

There wasn't much discussion of this approach at the time, so I didn't think it was controversial! (integrations: #431, #437, #440, user: #462)

If we conclude that it should just be named content_list, I feel like I should also change the name of the integrations classes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen S3 classes commonly be prefixed by the package name since they don't show up namespaced when you check them a la inherits(x, class). I think this is fine and the verbosity doesn't really matter here, across the API at worst you'll have three words right? connect_<entity>(_list)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nods ok, I'm fine with leaving it in if it's systematic (and not unique to connectapi)

Copy link
Collaborator

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the methods are generally reasonable, but I have some concerns about the performance that I think means we should take a small step back in thinking about this container class.

optional = FALSE,
...
) {
content_tbl <- as_tibble(x)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically this seems backwards since a tibble just has annotations on top of a data.frame, but I'm guessing it would be a bigger change to unwrap everything to sort that out.

Copy link
Collaborator Author

@toph-allen toph-allen Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it does seem backwards. This is because parse_connectapi_typed produces tibbles by default. I believe there was discussion of this in an integrations PR, lemme find it.

Yep: #426 (comment) :)

#'
#' @return A `data.frame` with one row per content item.
#' @export
as.data.frame.connect_content_list <- function(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this method differ for any other list-of-records from the API? This makes me think you want to have some inheritance, set the S3 class as something like c("connect_content_list", "connect_list"), and then you define the method on connect_list so any list-type container would be supported.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably correct, but I haven't gone that route yet because I don't want to over-generalize before we have enough instances of the pattern to know what the caveats are. Stuff like… the difference between S3 and R6 classes (see #470 (comment)), whether we want to rely on typed parsing, etc.

I don't have a good sense for when the right moment to generalize is, but my gut is that it's a little in the future.

#' @return A tibble with one row per content item.
#' @export
as_tibble.connect_content_list <- function(x, ...) {
content_data <- purrr::map(x, "content")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the data structure here. What does this do?

Copy link
Collaborator Author

@toph-allen toph-allen Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Content R6 class stores the API response under a content property here:

self$content <- content

The parse_connectapi* functions expect a list of objects (i.e. shaped like the original JSON server response), so this step is reconstructing that. Probably worth a comment, eh?

With newer S3 class objects e.g. connect_integration_list and @karawoo's work in progress on connect_user, the objects are lists with the properties at the top level, so you don't need this extra step.

(For example, to get the guid of an S3 connect_integration you do object$guid, whereas to get the GUID of an R6 Content object you have to do object$content$guid.)

)

purrr::map(res, function(x) {
content_list <- purrr::map(res, function(x) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this answers my question below about what is in the data structure. You might want to reconsider this eager conversion to R6 objects. It's not free, and this list of records might be huge. Suppose it is only 1ms to create an R6 object (I've seen it be a few ms), but if you've got 1000 content items, that's a second of latency you've added. And most cases you probably won't use the vast majority of those R6 objects (and certainly not when you call as.data.frame on it).

Instead, you could keep this data structure as a bare list of content records (+ the client object in an attribute), and define a [[ method that extracts and wraps in the R6 object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experimentally, this does add some latency — with about 7000 items I saw about a second — it seems manageable, if not ideal.

Would defining a [[ method just shift that latency to other calls?

Returning a list of R6 objects allows all of the existing R6 content code to just work on that list's content — would defining [[ make that contingent on using that function to access the items? Like, would purrr::map(content_list, lock_content) work?

The change to returning R6 objects isn't a part of this PR, so I'd like to create an issue to track this separately.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IME the lapply functions work by [[ing elements so yes, it would work--though that's an empirical question here.

Would defining a [[ method just shift that latency to other calls?

It defers it, yes, but the point is that in many cases, you don't ever need to make the R6 object, so you never pay that penalty. To make a data.frame as you are doing here, for example, you are making 7000 R6 objects and then immediately iterating over them to extract the data back out of it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#' @export
as_tibble.connect_content_list <- function(x, ...) {
content_data <- purrr::map(x, "content")
parse_connectapi_typed(content_data, connectapi_ptypes$content)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should not scope creep the type conversion business into this PR. It's worth a discussion but it can be separate.

Copy link
Contributor

@karawoo karawoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I do agree that #474 is worth looking into further, but fine to happen separately from this PR.

@toph-allen
Copy link
Collaborator Author

Sounds good! Thanks @karawoo — we'll use #474 to consider performance around this.

@toph-allen toph-allen merged commit 787eeed into main Oct 28, 2025
@toph-allen toph-allen deleted the toph/as.data.frame-for-content-list branch October 28, 2025 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add way to convert list of content objects from content_search() into a data frame

5 participants