Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

io::json enhancements #1178

Closed
AnIrishDuck opened this issue Jul 22, 2022 · 5 comments
Closed

io::json enhancements #1178

AnIrishDuck opened this issue Jul 22, 2022 · 5 comments
Labels
feature A new feature no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@AnIrishDuck
Copy link
Contributor

Hey!

We'd like to add some functionality to io::json, and I wanted to check if it would make sense as a contribution to the project.

In particular, we'd like to add support for the records-oriented output of the to_json and from_json Pandas methods. We probably also want to add support for the split orientation in the near future, but we'd like to start with records.

This is the rough path I've worked out to do so:

  • Add a new json::write::RecordSerializer alongside json::write::Serializer. Instead of outputting a batch of arrays, this takes a Chunk and outputs record blobs in the transposed format.
  • Create an infer_records_schema method in json::read::infer_schema. This would infer a Schema from the first row of inputted json in Pandas records format.
  • Add a json::read::deserialize_from_records that takes the above Schema and outputs the correct Chunk. This is the hardest part to untangle so far. My current rough plan of attack is to modify the deserialize_xxx methods to deserialize_into_xxx, and take a &mut mutable array as a parameter instead of outputting finalized arrays. We can then "extend" the right data into each array per record, avoiding copies as we go. This is starting to feel like a pretty invasive change, in contrast to the previous two.

I'm still figuring the last part out. I don't feel fully stuck, but I'm worried about the scope of the changes. Here's my questions:

  • Does the overall goal here fit with the direction of this project? Is it functionality you'd like to see in this crate?
  • Does the above plan make sense? Are there tactical errors I'm making due to my lack of experience with the code?
  • Should this even be in the io::json module? I see other modules for other types of JSON in there (ndjson, json_integration), but I'm not sure what the cutoff is for a new io module. This feels closer to the original io::json than either of those two formats?

Finally: thanks for working so hard on this crate. It's been hugely useful for us at Wallaroo.

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Jul 22, 2022
@jorgecarleitao
Copy link
Owner

Thank you for this issue and for the through explanation. 🙇 I think this makes a lot of sense indeed!

Answering the questions:

  1. Yes - interoperability with standard formats fits exactly into io. Record-based is quite a standard way of representing data in json, so this fits quite well. 👍
  2. Yes -
    2.1. serialize: the io::csv and io::avro has similar ideas that we can piggy back, since they are also row-based formats (and thus required a transposition). The relevant aspect there is that a serializer of a column is a StreamingIterator that attaches itself to an array (or its .iter and serializes each of its items to the corresponding format).
    2.2. deserialize: the idea of &mut mutable arrays on the deserialize sounds interesting. We do something similar in io::csv and io::avro also. I can't anticipate if it works in json (I would basically try it out for e.g. PrimitiveArray and ListArray, to see if it works (i.e. non-nested and nested), and evaluate from such a "POC")
  3. I agree that io::json seems best, as you mentioned. (json_integration is only for integration tests with the rest of the Arrow ecosystem)

@AnIrishDuck
Copy link
Contributor Author

Hey @jorgecarleitao - we've integrated and tested the work on our side, and I'd like to start contributing back.

Here are the changes: main...WallarooLabs:arrow2:main

Would you prefer this as one large (cleaned up) PR? Or would you prefer me to rebase our changes and submit them in smaller chunks? If we go the smaller route, there are ten commits for super small changes and three PRs for medium-sized reviews...

@jorgecarleitao
Copy link
Owner

Wow, awesome!!!

Either is fine. The easiest for documentation purposes is to split it according to the main github labels (which correspond to entries in the change log):

  • backward incompatible changes
  • new feature

so it is easier to document what happened to the crate when people update.

Looking through it, I think I only have minor comments on naming - trait Preallocate I would have name it as Container (traits are substantives in Rust; something with a capacity), expand I would name it extend_offsets, and unsafe_expand as extend_offsets_unchecked (to align with the rest of the crate and Rust: extend_ to extend stuff, _unchecked for unsafe functions).

About the comment

    // though this will always be safe, we cannot use unsafe_expand here due to
    // `#![forbid(unsafe_code)]` on the io module

one way to address it is to create an auxiliary pub struct Offsets<O>(Vec<O>) that contains try_push(usize), with_capacity(usize) that fulfills the invariants required by expand (it is monotically increasing). This allows to write expand(Offsets) without unsafe_expand, since Offsets guarantee the invariant.

@AnIrishDuck
Copy link
Contributor Author

Work has been crazy, but I've finally gotten around to repackaging this. I believe it's a backwards-compatible new feature, not sure if I should slap the feature tag on it though.

@jorgecarleitao
Copy link
Owner

Closed by #1275

@jorgecarleitao jorgecarleitao added feature A new feature no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog and removed enhancement An improvement to an existing feature labels Oct 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature A new feature no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants