Handle parquet files with incorrect statistics in scan_parquet
#16683
Comments
Same as #15323. The statistics are written as a signed int, and since the value is bigger than the INT64 limit, the statistics overflow.
Note how the min is bigger than the max. Essentially what's happening is that when you use
@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.
Will take a look
Will re-open this issue. The fix in #16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file containing incorrect statistics. I've changed this from a bug to an enhancement request, as there isn't really a bug in the polars parquet reader; rather, the issue is in the parquet file itself. Thanks @isvoboda for reporting this as well. I've edited your post to better highlight the underlying issue and use a smaller file.
Description
Some parquet files may contain incorrectly calculated statistics (e.g. some of the ones written by older versions of polars containing UInt64 statistics had incorrect min/max). Because we assume the statistics are correct, using some functions (e.g. is_in) with scan_parquet would return incorrect results if the statistics were not correct.
Reproducible example
For the below example we have a parquet file with incorrect min/max statistics (observe how min (9223372036854775808) > max (0)).
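The original example code is not reproduced here. As a rough illustration only, the sketch below shows how a min of 9223372036854775808 and a max of 0 can arise when UInt64 values are compared in signed order, and why a simple range check against such statistics skips data that is actually present. The column values and the helper names (`signed_order_stats`, `may_contain`) are hypothetical, and the range check is a simplified stand-in, not the polars reader's actual logic.

```python
import struct


def as_signed_i64(value: int) -> int:
    """Reinterpret the 8 bytes of an unsigned 64-bit integer as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def as_unsigned_u64(value: int) -> int:
    """Reinterpret the 8 bytes of a signed 64-bit integer as unsigned."""
    return struct.unpack("<Q", struct.pack("<q", value))[0]


def signed_order_stats(values):
    """Compute min/max in signed order, then read them back as unsigned.

    This mimics a writer that treats UInt64 data as Int64 when
    computing statistics (an assumption made for illustration).
    """
    signed = [as_signed_i64(v) for v in values]
    return as_unsigned_u64(min(signed)), as_unsigned_u64(max(signed))


def may_contain(stat_min: int, stat_max: int, needle: int) -> bool:
    """Simplified statistics check: keep the data only if needle is in [min, max]."""
    return stat_min <= needle <= stat_max


# Hypothetical UInt64 column values; 2**63 does not fit in a signed Int64.
values = [0, 2**63]

stat_min, stat_max = signed_order_stats(values)
print(stat_min, stat_max)

# The value is present in the data, yet the range check rejects it.
print(2**63 in values, may_contain(stat_min, stat_max, 2**63))
```

Because the stored min exceeds the stored max, no value can satisfy the range check, so any reader that trusts these statistics would discard the data without looking at it.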
Log output
Original post [Scanning parquet for a UInt64 value does not work for equality]
Checks
Reproducible example
Log output
Issue description
Scanning a parquet file based on eq on a UInt64 column won't find some values.
Sample parquet file: https://1drv.ms/u/s!AiNNar540QGDhKI7u7VsGoHG0WV3CQ?e=7r9XbH
Expected behavior
Filtering a parquet file based on eq on a UInt64 column gives the same result as the equivalent filter based on is_in.
Installed versions