Use parquet statistics when collecting column statistics from scanned parquet #16051

borchero · 2024-05-05T00:38:00Z

Description

When saving a dataframe via write_parquet("...", statistics=True), I would expect

lf = pl.scan_parquet("...")
lf.select(pl.all().null_count()).collect()

to read only the column statistics from the parquet file. However, judging from execution time and memory consumption, all of the data is read.

Interestingly, this issue even applies to simpler properties that are available in the parquet metadata, e.g.

lf.select(pl.all().len()).collect()

Would it be possible to push down relevant operations when statistics are available?
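To illustrate what such a pushdown could rely on, here is a minimal sketch (not Polars' implementation) that answers the len() and null_count() queries from footer metadata alone. It assumes pyarrow is installed and that the file was written with statistics; the function names are hypothetical. A column whose statistics are missing in any row group is reported as None, which is where a real engine would have to fall back to reading the data.

```python
def null_counts_from_metadata(meta):
    """Aggregate row count and per-column null counts from Parquet footer metadata.

    `meta` is expected to behave like pyarrow.parquet.FileMetaData.
    Returns (num_rows, {column_path: null_count or None}); None means at least
    one row group had no usable statistics for that column.
    """
    counts = {}
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            col = row_group.column(ci)
            name = col.path_in_schema
            stats = col.statistics  # None if the writer stored no statistics
            if stats is None or not stats.has_null_count:
                counts[name] = None
            elif counts.get(name, 0) is not None:
                # Only keep summing while every earlier row group had statistics.
                counts[name] = counts.get(name, 0) + stats.null_count
    return meta.num_rows, counts


def null_counts_from_file(path):
    """Read only the footer of `path` and aggregate its statistics."""
    # Imported lazily so the aggregation logic above has no hard dependency.
    import pyarrow.parquet as pq

    return null_counts_from_metadata(pq.read_metadata(path))
```

Calling null_counts_from_file("...") touches only the file footer, so its cost should not depend on the number of rows.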

deanm0000 · 2024-05-07T04:23:58Z

Similar to #14936.

The TL;DR of that one: if min == max and null_count == 0 for a column, don't read any data and just propagate the one known value.
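A rough sketch of that rule, under stated assumptions: `meta` behaves like pyarrow.parquet.FileMetaData (for example the result of pyarrow.parquet.read_metadata("...")), the stored min/max values are exact rather than truncated, and the function name is hypothetical. This is an illustration of the check, not how #14936 or Polars implements it.

```python
def constant_columns(meta):
    """Return {column_path: value} for columns the footer proves to be constant.

    A column qualifies only if every row group has statistics with
    null_count == 0 and min == max, and that value agrees across row groups.
    """
    candidates = {}
    rejected = set()
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for ci in range(row_group.num_columns):
            col = row_group.column(ci)
            name = col.path_in_schema
            if name in rejected:
                continue
            stats = col.statistics
            ok = (
                stats is not None
                and stats.has_min_max
                and stats.has_null_count
                and stats.null_count == 0
                and stats.min == stats.max
                # The value must match what earlier row groups reported.
                and candidates.get(name, stats.min) == stats.min
            )
            if ok:
                candidates[name] = stats.min
            else:
                # Be conservative: one failing row group disqualifies the column.
                rejected.add(name)
                candidates.pop(name, None)
    return candidates
```

For any column returned here, a reader could emit the known value meta.num_rows times without decoding a single data page.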

borchero · 2024-05-07T14:15:27Z

@stinodego happy to try contributing this if you can point me to some documentation on where to touch code when augmenting the projection pushdown logic 🫣

borchero added the "enhancement" label (New feature or an improvement of an existing feature) on May 5, 2024

deanm0000 added the "P-medium" (Priority: medium) and "A-io-parquet" (Area: reading/writing Parquet files) labels on May 7, 2024
