feat(rust): Rust functions for typed JsonPath implementation #5140

Merged (7 commits) on Oct 17, 2022


@cjermain (Contributor) commented Oct 7, 2022

This PR introduces 4 new methods on Utf8Chunked, allowing JSON values within the string arrays to be parsed to appropriate types. The current json_path_match only allows for str return types, and does not handle nested types well. This PR replaces #3413, and introduces only the core Rust functions needed. I'll follow up with another PR that includes the Expr implementations to allow these to be used in regular and lazy DataFrames, including Python support for these features.

New methods:

  • Utf8Chunked.json_infer - returns DataType for the JSON fields in the array
  • Utf8Chunked.json_extract - returns a Series with the appropriate types for the JSON values in the array
  • Utf8Chunked.json_path_select - returns a Utf8Chunked array, selecting based on the provided JsonPath
  • Utf8Chunked.json_path_extract - returns a Series with the appropriate type after selecting based on a JsonPath
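To make the idea behind json_infer concrete, here is a minimal std-only sketch of inferring one dtype for a column of JSON strings. It is not the code in this PR: `ToyDType`, `infer_scalar`, `unify`, and `infer_column` are hypothetical names, only scalar JSON values are handled, the numeric check is lenient rather than a full JSON grammar, and falling back to a string type on mismatch is an assumption.

```rust
/// Toy stand-in for a column dtype; NOT the Polars `DataType` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyDType {
    Null,
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Infer a dtype for one scalar JSON value (hypothetical helper).
/// Objects and arrays are out of scope for this sketch and yield `None`.
fn infer_scalar(json: &str) -> Option<ToyDType> {
    let s = json.trim();
    // Lenient numeric pre-check: rules out "inf", "NaN", "+1", but is not
    // a complete JSON number grammar (e.g. leading zeros are accepted).
    let first_ok = s.starts_with(|c: char| c == '-' || c.is_ascii_digit());
    let chars_ok = s.chars().all(|c| c.is_ascii_digit() || "+-.eE".contains(c));

    if s == "null" {
        Some(ToyDType::Null)
    } else if s == "true" || s == "false" {
        Some(ToyDType::Boolean)
    } else if s.len() >= 2 && s.starts_with('"') && s.ends_with('"') {
        Some(ToyDType::Utf8)
    } else if first_ok && chars_ok && s.parse::<i64>().is_ok() {
        Some(ToyDType::Int64)
    } else if first_ok && chars_ok && s.parse::<f64>().is_ok() {
        Some(ToyDType::Float64)
    } else {
        None
    }
}

/// Combine two dtypes into one that can hold both.
/// Assumption: unrelated types fall back to Utf8.
fn unify(a: ToyDType, b: ToyDType) -> ToyDType {
    use ToyDType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Null, x) | (x, Null) => x,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}

/// Infer a single dtype for a whole column of scalar JSON strings.
/// Returns `None` as soon as a row is not a supported scalar.
fn infer_column(rows: &[&str]) -> Option<ToyDType> {
    let mut acc = ToyDType::Null;
    for row in rows {
        acc = unify(acc, infer_scalar(row)?);
    }
    Some(acc)
}
```

Calling `infer_column(&["1", "2.5", "null"])` on this toy model widens the integer row to a float and treats the null row as compatible with any type.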

Notes:

  • I did not change the existing json_path_match. I would suggest deprecating that method if these are adopted.
  • I added select_json to more elegantly handle the differences between the number of elements returned. Otherwise it is a replica of extract_json.

@github-actions bot added the enhancement and rust labels Oct 7, 2022
@ritchie46 (Member) commented

Thanks for moving this to polars-ops. What are your thoughts on exposing this API? In polars-lazy we always need to know the output dtype of a transformation. Any idea how to determine that statically?

@cjermain (Contributor, Author) commented

Great question! I wanted to raise that for the next PR, and have been thinking about it after looking through the Expr APIs. Here are my thoughts per method:

  • Series.str.json_infer - I was planning to only implement this for the non-lazy API first, although the return type is known to always be a scalar dtype
  • Series.str.json_extract - For the lazy API, the expected dtype would be required and passed forward so that it is known at construction. In the non-lazy API, the dtype can optionally be inferred (which is probably the usual case when exploring data). The Rust method allows both to be implemented. We can come back to add more complexity later, but I would prefer to do this incrementally so that we can start taking advantage of these methods sooner.
  • Series.str.json_path_select - This always returns a Utf8-typed array, so no issues there.
  • Series.str.json_path_extract - Same strategy as Series.str.json_extract -- dtype required in lazy API only
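The "dtype required in lazy API only" strategy from the list above could be sketched roughly as follows. This is a std-only illustration, not the Polars API: `SketchDType`, `Mode`, and `resolve_dtype` are hypothetical names, and the `infer` closure stands in for whatever data-dependent inference the eager path would run.

```rust
/// Toy dtype for this sketch; NOT the Polars `DataType` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchDType {
    Int64,
    Utf8,
}

/// Which API surface the call comes from (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Eager,
    Lazy,
}

/// Decide the output dtype of an extract-style operation.
/// - An explicitly declared dtype always wins.
/// - Eager mode may fall back to inference from the data.
/// - Lazy mode must know the dtype up front, so a missing dtype is an error.
fn resolve_dtype(
    mode: Mode,
    declared: Option<SketchDType>,
    infer: impl FnOnce() -> SketchDType,
) -> Result<SketchDType, String> {
    match (declared, mode) {
        (Some(dtype), _) => Ok(dtype),
        (None, Mode::Eager) => Ok(infer()),
        (None, Mode::Lazy) => {
            Err("an explicit dtype is required in the lazy API".to_string())
        }
    }
}
```

Taking the inference step as a closure means it only runs when it is actually needed, so a lazy plan with a declared dtype never has to touch the data.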

What do you think?

@ritchie46 (Member) commented

It sounds good for most! I am only not entirely sure about Series.str.json_infer, as I want to make all of the API lazy.

But this is something we can explore on the way. I think we should find some design with which we can make the Unknown dtype more valuable.
