Use parquet statistics when collecting column statistics from scanned parquet #16051

Open
borchero opened this issue May 5, 2024 · 2 comments
Labels
A-io-parquet Area: reading/writing Parquet files enhancement New feature or an improvement of an existing feature P-medium Priority: medium

Comments

@borchero
Contributor

borchero commented May 5, 2024

Description

When saving a dataframe via write_parquet("...", statistics=True), I would expect

lf = pl.scan_parquet("...")
lf.select(pl.all().null_count()).collect()

to read only the column statistics from the parquet file. However, judging from execution time and memory consumption, all of the data is read.
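To illustrate what "reading only the column statistics" could look like, here is a minimal sketch that sums the per-row-group null counts stored in the Parquet footer. It uses pyarrow rather than Polars internals, and the helper names (`null_counts_from_metadata`, `null_counts_from_footer`) are hypothetical; this is not how Polars implements or would implement the pushdown.

```python
def null_counts_from_metadata(metadata):
    """Sum per-row-group null counts for each column of a pyarrow FileMetaData.

    Returns {column_path: int or None}. None means at least one row group
    carries no null count, so the column data would have to be scanned.
    """
    counts = {}
    for rg_idx in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_idx)
        for col_idx in range(row_group.num_columns):
            chunk = row_group.column(col_idx)
            name = chunk.path_in_schema
            stats = chunk.statistics  # None if the writer stored no statistics
            if stats is None or not stats.has_null_count:
                counts[name] = None
            elif counts.get(name, 0) is not None:
                counts[name] = counts.get(name, 0) + stats.null_count
    return counts


def null_counts_from_footer(path):
    """Hypothetical convenience wrapper: reads only the file footer."""
    # Imported lazily so the aggregation helper above has no hard dependency.
    import pyarrow.parquet as pq

    return null_counts_from_metadata(pq.read_metadata(path))
```

Only the footer is parsed here, so cost should scale with the number of row groups and columns rather than with the number of rows.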

Interestingly, this issue even applies to simpler properties that are available in the parquet metadata, e.g.

lf.select(pl.all().len()).collect()

Would it be possible to push down relevant operations when statistics are available?
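For the len() case, the footer already stores the total row count, so no column data needs to be decoded. The sketch below mirrors the shape of `pl.all().len()` using pyarrow; the function names are hypothetical and it assumes a flat (non-nested) schema, since `schema.names` lists leaf columns.

```python
def lengths_from_metadata(metadata):
    """Every column's length equals the file's row count (nulls included)."""
    return {name: metadata.num_rows for name in metadata.schema.names}


def lengths_from_footer(path):
    """Hypothetical convenience wrapper: reads only the file footer."""
    # Imported lazily so the helper above has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    return lengths_from_metadata(pq.read_metadata(path))
```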

@borchero borchero added the enhancement New feature or an improvement of an existing feature label May 5, 2024
@deanm0000 deanm0000 added P-medium Priority: medium A-io-parquet Area: reading/writing Parquet files labels May 7, 2024
@deanm0000
Collaborator

Similar to #14936.

The TL;DR of that one: if min == max and null_count == 0, don't read any data and just propagate the one known value.
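A minimal sketch of that rule, written against a hypothetical `ChunkStats` stand-in rather than any real Polars or pyarrow type. It assumes the stored min/max are exact (not truncated) and treats any missing statistic as "unknown", in which case the data would still have to be read.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ChunkStats:
    """Hypothetical stand-in for one row group's statistics for a column."""
    min: Any
    max: Any
    null_count: Optional[int]


def constant_value(stats: Sequence[Optional[ChunkStats]]) -> Tuple[bool, Any]:
    """Return (True, value) if the statistics prove the column holds a single
    non-null value across all row groups, else (False, None)."""
    if not stats:
        return (False, None)
    first = stats[0]
    for s in stats:
        if (
            s is None                # no statistics for this row group
            or s.null_count != 0     # nulls present, or null count unknown
            or s.min is None         # min/max not recorded
            or s.min != s.max        # more than one distinct value
            or s.min != first.min    # value differs between row groups
        ):
            return (False, None)
    return (True, first.min)
```

The check has to hold for every row group, since each one carries its own min, max, and null count.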

@borchero
Contributor Author

borchero commented May 7, 2024

@stinodego happy to try contributing this if you can point me to some documentation on where to touch code when augmenting the projection pushdown logic 🫣
