Handle parquet files with incorrect statistics in scan_parquet #16683

Open
2 tasks done
isvoboda opened this issue Jun 3, 2024 · 4 comments · Fixed by #16766 · May be fixed by #16776
Labels
A-io-parquet (Area: reading/writing Parquet files) · A-optimizer (Area: plan optimization) · enhancement (New feature or an improvement of an existing feature) · needs decision (Awaiting decision by a maintainer) · python (Related to Python Polars)

Comments

isvoboda commented Jun 3, 2024

Description

Some parquet files contain incorrectly calculated statistics (e.g. some files written by older versions of polars have incorrect UInt64 min/max statistics). Because scan_parquet assumes the statistics are correct, some operations (e.g. filtering with is_in) can return incorrect results on such files.

Reproducible example

For the example below, we have a parquet file with incorrect min/max statistics (observe that min (9223372036854775808) > max (0)).

import os

os.environ["POLARS_VERBOSE"] = "1"
import pyarrow.parquet as pq
import polars as pl
import tempfile
import base64
from polars.testing import assert_frame_equal

with tempfile.NamedTemporaryFile("wb+") as f:
    f.write(
        base64.b64decode(
            "".join(
                [
                    "UEFSMRUGFSQVNlwVBBUAFQQVABUEFQARHDYAKAgAAAAAAAAAABgIAAAAAAAAAIAAAAADAy",
                    "i1L/0gEIEAAAAAAAAAAACAAAAAAAAAAAAVBBklAAYZGAFhFQwWBBaAARaSASYIPDYAKAgA",
                    "AAAAAAAAABgIAAAAAAAAAIAAABkRAhkYCAAAAAAAAACAGRgIAAAAAAAAAAAVABkWAAAZHB",
                    "YIFZIBFgAAABUEGSxIBHJvb3QVAgAVBCUCGAFhJRxMrBNAEgAAABYEGRwZHCaaARwVBBkl",
                    "AAYZGAFhFQwWBBaAARaSASYIPDYAKAgAAAAAAAAAABgIAAAAAAAAAIAAABa2AhUWFvgBFT",
                    "4AFoABFgQmCBaSARQAABkcGAxBUlJPVzpzY2hlbWEYpAEvLy8vLzNJQUFBQUVBQUFBOHYv",
                    "Ly94UUFBQUFFQUFFQUFBQUtBQXNBQ0FBS0FBUUErUC8vL3d3QUFBQUlBQWdBQUFBRUFBRU",
                    "FBQUFFQUFBQTdQLy8velFBQUFBZ0FBQUFHQUFBQUFFQ0FBQVFBQklBQkFBUUFCRUFDQUFB",
                    "QUF3QUFBQUFBUGIvLy85QUFBQUFBQUFHQUFnQUJBQUJBQUFBWVFBPQAYBlBvbGFycxkcHA",
                    "AAADUBAABQQVIx",
                ]
            )
        )
    )
    f.flush()

    print(pq.read_metadata(f.name).row_group(0).column(0))
    # <pyarrow._parquet.ColumnChunkMetaData object at 0x1096200e0>
    #     ...
    #     statistics:
    #         <pyarrow._parquet.Statistics object at 0x109620220>
    #         has_min_max: True
    #         min: 9223372036854775808
    #         max: 0
    #         null_count: 0
    #         ...
    #     ...

    expect = pl.read_parquet(f.name).filter(pl.col.a.is_in([0]))
    out = pl.scan_parquet(f.name).filter(pl.col.a.is_in([0])).collect()

    # `out` is empty due to the statistics being invalid
    assert_frame_equal(out, expect)

Log output

run FilterExec
dataframe filtered
parquet file can be skipped, the statistics were sufficient to apply the predicate.
AssertionError: DataFrames are different (number of rows does not match)
[left]:  0
[right]: 1
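
As a stopgap until the reader can handle such files, one possible workaround (a sketch only, assuming the extra I/O of reading every row group is acceptable) is to keep the filter away from the scan's predicate pushdown so the incorrect statistics are never consulted. Here f.name refers to the temporary file from the example above, and the predicate_pushdown keyword of collect is assumed to be available in your Polars version:

import polars as pl

# Sketch: avoid statistics-based row-group skipping by disabling predicate
# pushdown for this query, so the filter runs on the data that was read.
out = (
    pl.scan_parquet(f.name)
    .filter(pl.col("a").is_in([0]))
    .collect(predicate_pushdown=False)
)

The read_parquet call used for expect in the example above is the equivalent eager workaround.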
Original post: "Scanning parquet for a UInt64 value does not work for equality"

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Does not find
(
    pl.scan_parquet("data.parquet")
    .filter(
        pl.col("value").eq(pl.lit(16287934034053521795, dtype=pl.UInt64()))
    ).collect()
)

shape: (0, 1)
┌───────┐
│ value │
│ ---   │
│ u64   │
╞═══════╡
└───────┘

# Does find
(
    pl.scan_parquet("data.parquet")
    .filter(
        pl.col("value").is_in(pl.lit(16287934034053521795, dtype=pl.UInt64()), )
    )
    .collect()
)

shape: (18, 1)
┌──────────────────────┐
│ value                │
│ ---                  │
│ u64                  │
╞══════════════════════╡
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ …                    │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
└──────────────────────┘

Log output

Multiple lines with:

parquet file can be skipped, the statistics were sufficient to apply the predicate.

Issue description

Scanning a parquet file and filtering with eq on a UInt64 column does not find some values.

Sample parquet file: https://1drv.ms/u/s!AiNNar540QGDhKI7u7VsGoHG0WV3CQ?e=7r9XbH

Expected behavior

Filtering a parquet file with eq on a UInt64 column should return the same result as the equivalent is_in filter.

Installed versions

pl.show_versions()
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.0.0-0.deb11.6-amd64-x86_64-with-glibc2.31
Python:               3.10.12 (main, Sep 15 2023, 21:10:26) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.1.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
isvoboda added the bug, needs triage, and python labels on Jun 3, 2024
deanm0000 (Collaborator)

Same as #15323.

The statistics are written as a signed int, and since the value is bigger than the INT64 limit, the statistics overflow:

    <pyarrow._parquet.Statistics object at 0x7f2abffd0bd0>
      has_min_max: True
      min: 9223539763183779054
      max: 9222809258525037712
      null_count: 0
      distinct_count: None
      num_values: 262615
      physical_type: INT64
      logical_type: Int(bitWidth=64, isSigned=false)
      converted_type (legacy): UINT_64

Note how the min is bigger than the max.

Essentially, what's happening is that with eq the optimizer skips all the row groups based on the statistics, while is_in doesn't do this pruning, which is why it returns results. You could also turn off optimizations in collect for the eq case, as sketched below.
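
A minimal sketch of that last suggestion, using the file from the original post (assuming losing the optimizations for this one query is acceptable, and that the no_optimization keyword of collect is available in your version):

import polars as pl

# Turn the optimizer off so the eq predicate is not pushed into the scan and
# the overflowed row-group statistics are never used for pruning.
(
    pl.scan_parquet("data.parquet")
    .filter(pl.col("value").eq(pl.lit(16287934034053521795, dtype=pl.UInt64())))
    .collect(no_optimization=True)
)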

deanm0000 added the P-high label and removed the needs triage label on Jun 4, 2024
isvoboda (Author) commented Jun 4, 2024

@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.

nameexhaustion (Collaborator)

Will take a look

nameexhaustion self-assigned this on Jun 6, 2024
deanm0000 linked a pull request on Jun 6, 2024 that will close this issue
nameexhaustion added the enhancement, needs decision, A-io-parquet, and A-optimizer labels and removed the bug and P-high labels on Jun 7, 2024
nameexhaustion removed their assignment on Jun 7, 2024
nameexhaustion (Collaborator) commented Jun 7, 2024

Will re-open this issue: the fix in #16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file that contains incorrect statistics. I've changed this from a bug to an enhancement request, since there isn't really a bug in the polars parquet reader; rather, the issue is in the parquet file itself.

Thanks @isvoboda for reporting this as well, I've edited your post to better highlight the underlying issue and use a smaller file.
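
For anyone who wants to check whether an existing file is affected before relying on scan_parquet pruning, here is a small inspection sketch built on the same pyarrow metadata API used in the example above (the helper name is mine, not an API of either library):

import pyarrow.parquet as pq

def has_suspect_stats(path: str) -> bool:
    # Flag any column chunk whose recorded min exceeds its recorded max,
    # which is what the overflowed UInt64 statistics in this issue look like.
    metadata = pq.read_metadata(path)
    for rg in range(metadata.num_row_groups):
        for col in range(metadata.num_columns):
            stats = metadata.row_group(rg).column(col).statistics
            if stats is not None and stats.has_min_max and stats.min > stats.max:
                return True
    return False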

nameexhaustion reopened this on Jun 7, 2024
nameexhaustion changed the title from "Scanning parquet for a UInt64 value does not work for equality" to "Handle parquet files with incorrect statistics in scan_parquet" on Jun 7, 2024
c-peters added the accepted label on Jun 9, 2024
nameexhaustion removed the accepted label on Jun 10, 2024
nameexhaustion removed their assignment on Jun 10, 2024