Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python,rust): Typed JsonPath implementation #3413

Closed
wants to merge 7 commits into from

Conversation

cjermain
Copy link
Contributor

This PR resolves #3373. The PR is currently a work-in-progress, based on upstream changes in arrow2 (jorgecarleitao/arrow2#989). The Cargo.toml files have been modified for now to use the latest version from a fork, which will be reverted once the upstream PR merges. This needs some additional work to expose the feature to Python. The Rust pieces are working correctly, although there is certainly room for better testing and documentation. A new example in examples/json_path shows off what can be done at present.

@github-actions github-actions bot added the rust Related to Rust Polars label May 17, 2022
@codecov-commenter
Copy link

codecov-commenter commented May 17, 2022

Codecov Report

Merging #3413 (e1876e9) into master (2243ecf) will increase coverage by 0.11%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master    #3413      +/-   ##
==========================================
+ Coverage   77.41%   77.52%   +0.11%     
==========================================
  Files         402      408       +6     
  Lines       69508    69960     +452     
==========================================
+ Hits        53808    54237     +429     
- Misses      15700    15723      +23     
Impacted Files Coverage Δ
...polars-core/src/chunked_array/strings/json_path.rs 21.66% <0.00%> (-54.81%) ⬇️
py-polars/polars/__init__.py 100.00% <ø> (ø)
py-polars/src/series.rs 75.00% <0.00%> (-1.69%) ⬇️
py-polars/src/lib.rs 93.51% <16.66%> (-1.27%) ⬇️
py-polars/polars/internals/series.py 95.27% <50.00%> (-0.39%) ⬇️
py-polars/src/conversion.rs 78.27% <64.28%> (-2.72%) ⬇️
py-polars/polars/datatypes.py 92.00% <72.22%> (-2.65%) ⬇️
polars/polars-core/src/datatypes.rs 69.96% <100.00%> (+0.19%) ⬆️
polars/polars-lazy/src/logical_plan/mod.rs 97.59% <0.00%> (-0.61%) ⬇️
polars/polars-lazy/src/logical_plan/aexpr.rs 80.72% <0.00%> (-0.61%) ⬇️
... and 42 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2243ecf...e1876e9. Read the comment docs.

@cjermain
Copy link
Contributor Author

I'm interested in feedback on the function names. Here is a summary (omitting Option/Result as they are handled slightly differently in Py/Rs):

  • json_infer(number_of_rows) -> DataType - infers the Arrow schema for JSON values, 1 schema for all rows, up to number_of_rows
  • json_extract(dtype) -> Series - deserializes JSON values per row using a common schema
  • json_path_select(path) -> Uft8Chunked - selects JSON sub-sections using a JsonPath and returns a full string representation
  • json_path_extract(path, dtype) -> Series - deserializes JSON values after selection using a JsonPath

My thought is to deprecate json_path_match and keep it's behavior consistent to avoid introducing issues. I tried to get the above naming to be self-consistent, and preferred extract over deserialize for brevity.

@ritchie46, what do you think about the function names?

I'm still working on the Lazy/Expr API for this feature, and adding tests. The PR currently works end-to-end for these operations directly on a Series.

@cjermain
Copy link
Contributor Author

One alternative to consider for the above function names is using load instead of extract => json_load and json_path_load. This is more consistent with the json.loads function in Python.

@universalmind303
Copy link
Collaborator

One alternative to consider for the above function names is using load instead of extract => json_load and json_path_load. This is more consistent with the json.loads function in Python.

as someone coming from a JS background, i've always found the python loads & dumps methods very unintuitive. I think extract or parse are much more self-explanitory

@universalmind303
Copy link
Collaborator

universalmind303 commented May 25, 2022

I'm wondering, what is the benefit of exposing the json_infer method? could we not do something similar to the read_* implementations. & take a schema or an infer_count
perhaps a signature of

json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])

@cjermain
Copy link
Contributor Author

as someone coming from a JS background, i've always found the python loads & dumps methods very unintuitive. I think extract or parse are much more self-explanitory

Thanks. As someone who mostly develops in Python, I also find the naming unintuitive. I'll leave it as extract unless there is further feedback.

@cjermain
Copy link
Contributor Author

I'm wondering, what is the benefit of exposing the json_infer method? could we not do something similar to the read_* implementations. & take a schema or an infer_count
perhaps a signature of

json_extract(schema: Optional[DataType], infer_schema_length: Optional[int])

Exposing thejson_infer method allows for exploration and performance use cases. The json_path_extract function uses all N rows to infer the dtype by default, which is most likely to work and is generally useful. However json_infer allows you to get the dtype for a limited number of rows -- calling that manually first allows you to pass in the optional dtype to json_path_extract or json_extract. I think the current API gives broad convenience, while allowing more specific use cases.

@cjermain
Copy link
Contributor Author

I'm getting back to this PR after some time. It needs updates in arrow2 and json_deserialize to fix jorgecarleitao/json-deserializer#13. There are also conflicts from the dispatch changes in #4423 that need to be addressed.

@stinodego stinodego changed the title WIP: Typed JsonPath implementation feat(rust): Typed JsonPath implementation Oct 1, 2022
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label Oct 1, 2022
@stinodego stinodego changed the title feat(rust): Typed JsonPath implementation feat(python,rust): Typed JsonPath implementation Oct 1, 2022
@cjermain
Copy link
Contributor Author

cjermain commented Oct 7, 2022

I've broken out the initial Rust functions into a different PR (#5140) to make this easier to review.

@alexander-beedie
Copy link
Collaborator

Superseded by #5140 and #6885 (both of which have been merged).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Typed Series for json_path_match
4 participants