Use parquet statistics when collecting column statistics from scanned parquet #16051
Labels
A-io-parquet
Area: reading/writing Parquet files
enhancement
New feature or an improvement of an existing feature
P-medium
Priority: medium
Description
When saving a dataframe via write_parquet("...", use_statistics=True), I would expect to read only the column statistics from the parquet file. However, judging from execution time and memory consumption, all of the data is read.
Interestingly, this issue even applies to simpler properties that are available in the parquet metadata, e.g.
Would it be possible to push down relevant operations when statistics are available?