
read_json and write_json #161

Closed
hadley opened this issue Dec 19, 2016 · 16 comments

hadley commented Dec 19, 2016

Would you consider adding:

write_json <- function(x, path, ...) {
  json <- jsonlite::toJSON(x, ...)
  writeLines(json, path)
}

read_json <- function(path, ...) {
  jsonlite::fromJSON(file(path), simplifyDataFrame = FALSE, ...)
}

That would make it slightly more symmetrical with readr, readxl and haven.

(If you don't want to add this to jsonlite, I'll probably make a tiny wrapper package, perhaps called readjson.)


jeroen commented Dec 19, 2016

Sure. So you want this function to use fromJSON(x, simplifyVector = TRUE, simplifyDataFrame = FALSE), or is that a typo?


hadley commented Dec 19, 2016

Hmmmm, might be more robust to not simplify vectors either.


jeroen commented Dec 19, 2016

It's your call. I think the default behavior of simplifying data frames is great for working with tidy data pipelines:

library(magrittr)
curl::curl("https://api.github.com/repos/hadley/ggplot2/issues") %>%
  jsonlite::fromJSON(flatten = TRUE) %>%
  dplyr::mutate(date = as.Date(created_at)) %>%
  dplyr::filter(user.login == "hadley") %>%
  dplyr::select(title, state, date)

It will seamlessly roundtrip between tidy data and json:

lm(mpg ~ wt, mtcars) %>%
  broom::tidy() %>%
  jsonlite::toJSON()  %>%
  jsonlite::fromJSON()

This has always been the motivation behind these defaults, and it fits nicely into the tidyverse.


hadley commented Dec 19, 2016

My main worry is that it's a bit too magical - I'd prefer it if it worked more like col_types in readr, so you had some way to make it explicit.

@jennybc do you have any comments?


jeroen commented Dec 19, 2016

I don't understand... col_types is needed because CSV is not typed (everything is a string), but fields in JSON are already typed (numeric, boolean, string, ...). Why would you need col_types?

It's not that magical... it's quite well defined. Everyone stringifies dataframe-like structures (e.g. MySQL tables) as a list of records:

> toJSON(iris, pretty=TRUE)
[
  {
    "Sepal.Length": 5.1,
    "Sepal.Width": 3.5,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa"
  },
  {
    "Sepal.Length": 4.9,
    "Sepal.Width": 3,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa"
  }
 ...

Then fromJSON simply inverts this mapping.
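The inverse mapping can be sketched like this (using the built-in iris data; a minimal illustration, not taken from the thread):

```r
library(jsonlite)

# toJSON writes a data frame as an array of records; fromJSON,
# with its defaults, converts such an array back into a data frame.
json <- toJSON(head(iris, 2))
df <- fromJSON(json)

is.data.frame(df)  # TRUE: the records round-trip to a 2-row data frame
```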


hadley commented Dec 19, 2016

I guess I'm ok with simplifyVector — it's simplifyDataFrame that's more dangerous because you'll convert to a data frame if the elements have the same length, which can easily happen by coincidence.
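For illustration, a minimal sketch of the difference (the JSON literal here is made up):

```r
library(jsonlite)

json <- '[{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]'

# Default: an array of same-shaped records silently becomes a data frame
fromJSON(json)                             # a 2-row data frame

# simplifyDataFrame = FALSE keeps a plain list of records, whether or
# not their shapes happen to line up
fromJSON(json, simplifyDataFrame = FALSE)  # a list of two named lists
```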


jennybc commented Dec 20, 2016

Searching my code ... I always use simplifyDataFrame = FALSE and frequently simplifyVector = FALSE as well.

I feel like I got here by getting surprised a few times: auto-simplifying code would "work" on a few records or on one day, but produce something quite different on the whole dataset or another day. But I don't have an example right now. Is this believable @jeroenooms?
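One way such surprises can arise (a made-up sketch, not necessarily the original case): the same call returns a different structure depending on the shape of the data, e.g. via simplifyMatrix.

```r
library(jsonlite)

# Equal-length rows simplify all the way to a matrix...
fromJSON('[[1, 2], [3, 4]]')     # a 2 x 2 matrix

# ...but add one element and you get a list of vectors instead
fromJSON('[[1, 2], [3, 4, 5]]')  # a list of two vectors
```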

If simplification is part of read_json(), I do like the readr system:

  • You can specify more types than just numeric, string, boolean. Namely integer vs double, date, time, datetime.
  • It feels like a good idea to state the types you're expecting. They get checked/enforced AND you're documenting the data for your future self.

This would be nice:

curl::curl("https://api.github.com/repos/hadley/ggplot2/issues") %>%
  read_json(col_types = cols_only(
    title = col_character(),
    state = col_character(),
    updated_at = col_datetime(),
    user.login = col_character()
  )) %>% 
  dplyr::filter(user.login == "hadley") %>%
  dplyr::select(-user.login)


jeroen commented Dec 20, 2016

It depends on the input data. If the json is tidy, then simplifyDataFrame = TRUE always gives a tidy data frame for any [{..}, {..}, ...] structure within the json. However, to read messy nested structures, simplification is not going to help (though it should not harm either), and you still get lists.

Internally, jsonlite already uses something like col_types. The simplifyDataFrame function has an argument columns that specifies the fields to extract from each record. The default for this argument is simply all names that appear in any of the records:

# find columns if not specified
if (missing(columns)) {
  columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE, use.names = FALSE))
}

# Convert row lists to column lists.
columnlist <- lapply(columns, function(x) lapply(recordlist, "[[", x))

Currently this is not exported, but we could add something to support col_types.

I recommend either disabling simplification altogether (simplifyVector = FALSE), as is the default in e.g. httr::content, or sticking with the defaults from jsonlite::fromJSON. The combination simplifyVector = TRUE with simplifyDataFrame = FALSE would introduce a third set of default json parsing behaviors, which would perhaps mostly create confusion.
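A sketch of those two recommended modes side by side (the example JSON is made up):

```r
library(jsonlite)

json <- '{"count": 2, "items": [1, 2]}'

# With the fromJSON defaults, "items" collapses to a length-2 vector:
fromJSON(json)$items

# With simplifyVector = FALSE (the httr::content-style default),
# every JSON array stays a list:
fromJSON(json, simplifyVector = FALSE)$items
```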


hadley commented Dec 20, 2016

Ok, in that case I would prefer no simplification for read_json()


jennybc commented Dec 20, 2016

Just unearthed a real example of a typical GitHub API JSON --> data frame task for me. Parking it here in case readjson comes to pass and includes a readr-ish function for this. Recurring themes: limiting to specific fields, indexing >1 level down in the hierarchy with a character vector, giving the associated variable a different name in the tibble, type specification, simplification.

library(purrr)
library(tibble)

issue_df <- issue_list %>%
  {
    tibble(number     = map_int(., "number"),
           id         = map_int(., "id"),
           title      = map_chr(., "title"),
           state      = map_chr(., "state"),
           n_comments = map_int(., "comments"),
           opener     = map_chr(., c("user", "login")),
           created_at = map_chr(., "created_at") %>% as.Date())
  }


jeroen commented Dec 21, 2016

OK, so I guess jsonlite should only do the parsing, and then you can do the simplification, coercion, tidying, and transformations in another package.


jeroen commented Dec 21, 2016

@hadley would you like path in read_json to support only file paths, or also urls and literal json strings?


hadley commented Dec 21, 2016

I think urls and literal json strings are fine. In the longer term, I'd like to extract a small helper package that defines a consistent interface across paths, connections, urls, and literal input (along with some way to manually override incorrect guesses).

jeroen closed this as completed in ef112b6 on Dec 29, 2016

jeroen commented Dec 29, 2016

Added these wrappers for version 1.2: ef112b6. Please let me know if this is what you had in mind, or if it needs additional changes.


hadley commented Dec 29, 2016

Looks good - thanks!


jeroen commented Dec 31, 2016

On CRAN now.
