Handle parquet files with incorrect statistics in scan_parquet #16683

Open
2 tasks done
isvoboda opened this issue Jun 3, 2024 · 4 comments · Fixed by #16766 · May be fixed by #16776
Labels
A-io-parquet (Area: reading/writing Parquet files) · A-optimizer (Area: plan optimization) · enhancement (New feature or an improvement of an existing feature) · needs decision (Awaiting decision by a maintainer) · python (Related to Python Polars)

Comments

isvoboda commented Jun 3, 2024

Description

Some parquet files contain incorrectly calculated statistics (e.g. some files written by older versions of polars have incorrect UInt64 min/max statistics). Because scan_parquet assumes the statistics are correct, some operations (e.g. filtering with is_in) can return incorrect results on such files.

Reproducible example

For the example below, we have a parquet file with incorrect min/max statistics (observe that min (9223372036854775808) > max (0)).

import os

os.environ["POLARS_VERBOSE"] = "1"
import pyarrow.parquet as pq
import polars as pl
import tempfile
import base64
from polars.testing import assert_frame_equal

with tempfile.NamedTemporaryFile("wb+") as f:
    f.write(
        base64.b64decode(
            "".join(
                [
                    "UEFSMRUGFSQVNlwVBBUAFQQVABUEFQARHDYAKAgAAAAAAAAAABgIAAAAAAAAAIAAAAADAy",
                    "i1L/0gEIEAAAAAAAAAAACAAAAAAAAAAAAVBBklAAYZGAFhFQwWBBaAARaSASYIPDYAKAgA",
                    "AAAAAAAAABgIAAAAAAAAAIAAABkRAhkYCAAAAAAAAACAGRgIAAAAAAAAAAAVABkWAAAZHB",
                    "YIFZIBFgAAABUEGSxIBHJvb3QVAgAVBCUCGAFhJRxMrBNAEgAAABYEGRwZHCaaARwVBBkl",
                    "AAYZGAFhFQwWBBaAARaSASYIPDYAKAgAAAAAAAAAABgIAAAAAAAAAIAAABa2AhUWFvgBFT",
                    "4AFoABFgQmCBaSARQAABkcGAxBUlJPVzpzY2hlbWEYpAEvLy8vLzNJQUFBQUVBQUFBOHYv",
                    "Ly94UUFBQUFFQUFFQUFBQUtBQXNBQ0FBS0FBUUErUC8vL3d3QUFBQUlBQWdBQUFBRUFBRU",
                    "FBQUFFQUFBQTdQLy8velFBQUFBZ0FBQUFHQUFBQUFFQ0FBQVFBQklBQkFBUUFCRUFDQUFB",
                    "QUF3QUFBQUFBUGIvLy85QUFBQUFBQUFHQUFnQUJBQUJBQUFBWVFBPQAYBlBvbGFycxkcHA",
                    "AAADUBAABQQVIx",
                ]
            )
        )
    )
    f.flush()

    print(pq.read_metadata(f.name).row_group(0).column(0))
    # <pyarrow._parquet.ColumnChunkMetaData object at 0x1096200e0>
    #     ...
    #     statistics:
    #         <pyarrow._parquet.Statistics object at 0x109620220>
    #         has_min_max: True
    #         min: 9223372036854775808
    #         max: 0
    #         null_count: 0
    #         ...
    #     ...

    expect = pl.read_parquet(f.name).filter(pl.col.a.is_in([0]))
    out = pl.scan_parquet(f.name).filter(pl.col.a.is_in([0])).collect()

    # `out` is empty due to the statistics being invalid
    assert_frame_equal(out, expect)

Log output

run FilterExec
dataframe filtered
parquet file can be skipped, the statistics were sufficient to apply the predicate.
AssertionError: DataFrames are different (number of rows does not match)
[left]:  0
[right]: 1
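
As a stopgap until the reader can handle such files, one possible workaround (a sketch only, assuming the extra I/O of reading every row group is acceptable) is to keep the filter away from the scan's predicate pushdown so the incorrect statistics are never consulted. Here f.name refers to the temporary file from the example above, and the predicate_pushdown keyword of collect is assumed to be available in your Polars version:

import polars as pl

# Sketch: avoid statistics-based row-group skipping by disabling predicate
# pushdown for this query, so the filter runs on the data that was read.
out = (
    pl.scan_parquet(f.name)
    .filter(pl.col("a").is_in([0]))
    .collect(predicate_pushdown=False)
)

The read_parquet call used for expect in the example above is the equivalent eager workaround.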
Original post: "Scanning parquet for a UInt64 value does not work for equality"

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Does not find
(
    pl.scan_parquet("data.parquet")
    .filter(
        pl.col("value").eq(pl.lit(16287934034053521795, dtype=pl.UInt64()))
    ).collect()
)

shape: (0, 1)
┌───────┐
│ value │
│ ---   │
│ u64   │
╞═══════╡
└───────┘

# Does find
(
    pl.scan_parquet("data.parquet")
    .filter(
        pl.col("value").is_in(pl.lit(16287934034053521795, dtype=pl.UInt64()), )
    )
    .collect()
)

shape: (18, 1)
┌──────────────────────┐
│ value                │
│ ---                  │
│ u64                  │
╞══════════════════════╡
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ …                    │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
│ 16287934034053521795 │
└──────────────────────┘

Log output

Multiple lines with:

parquet file can be skipped, the statistics were sufficient to apply the predicate.

Issue description

Scanning a parquet file and filtering with eq on a UInt64 column does not find some values.

Sample parquet file: https://1drv.ms/u/s!AiNNar540QGDhKI7u7VsGoHG0WV3CQ?e=7r9XbH

Expected behavior

Filtering a parquet file with eq on a UInt64 column should return the same result as the equivalent is_in filter.

Installed versions

pl.show_versions()
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.0.0-0.deb11.6-amd64-x86_64-with-glibc2.31
Python:               3.10.12 (main, Sep 15 2023, 21:10:26) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               0.10.0
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              16.1.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
isvoboda added the bug, needs triage, and python labels on Jun 3, 2024
deanm0000 (Collaborator)

Same as #15323.

The statistics are written as a signed int, and since the value is bigger than the INT64 limit, the statistics overflow:

    <pyarrow._parquet.Statistics object at 0x7f2abffd0bd0>
      has_min_max: True
      min: 9223539763183779054
      max: 9222809258525037712
      null_count: 0
      distinct_count: None
      num_values: 262615
      physical_type: INT64
      logical_type: Int(bitWidth=64, isSigned=false)
      converted_type (legacy): UINT_64

Note how the min is bigger than the max.

Essentially, what's happening is that with eq the optimizer skips all the row groups based on the statistics, while is_in doesn't do this pruning, which is why it returns results. You could also turn off optimizations in collect for the eq case, as sketched below.
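
A minimal sketch of that last suggestion, using the file from the original post (assuming losing the optimizations for this one query is acceptable, and that the no_optimization keyword of collect is available in your version):

import polars as pl

# Turn the optimizer off so the eq predicate is not pushed into the scan and
# the overflowed row-group statistics are never used for pruning.
(
    pl.scan_parquet("data.parquet")
    .filter(pl.col("value").eq(pl.lit(16287934034053521795, dtype=pl.UInt64())))
    .collect(no_optimization=True)
)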

deanm0000 added the P-high label and removed the needs triage label on Jun 4, 2024
isvoboda (Author) commented Jun 4, 2024

@deanm0000, I apologize for the duplicate issue, and I appreciate you identifying the cause.

nameexhaustion (Collaborator)

Will take a look

nameexhaustion self-assigned this on Jun 6, 2024
deanm0000 linked a pull request on Jun 6, 2024 that will close this issue
nameexhaustion added the enhancement, needs decision, A-io-parquet, and A-optimizer labels and removed the bug and P-high labels on Jun 7, 2024
nameexhaustion removed their assignment on Jun 7, 2024
nameexhaustion (Collaborator) commented Jun 7, 2024

Will re-open this issue: the fix in #16766 ensures we no longer write out parquet files with incorrect UInt64 min/max statistics, but the OP here gives an example that has more to do with reading an existing parquet file that contains incorrect statistics. I've changed this from a bug to an enhancement request, since there isn't really a bug in the polars parquet reader; rather, the issue is in the parquet file itself.

Thanks @isvoboda for reporting this as well, I've edited your post to better highlight the underlying issue and use a smaller file.
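
For anyone who wants to check whether an existing file is affected before relying on scan_parquet pruning, here is a small inspection sketch built on the same pyarrow metadata API used in the example above (the helper name is mine, not an API of either library):

import pyarrow.parquet as pq

def has_suspect_stats(path: str) -> bool:
    # Flag any column chunk whose recorded min exceeds its recorded max,
    # which is what the overflowed UInt64 statistics in this issue look like.
    metadata = pq.read_metadata(path)
    for rg in range(metadata.num_row_groups):
        for col in range(metadata.num_columns):
            stats = metadata.row_group(rg).column(col).statistics
            if stats is not None and stats.has_min_max and stats.min > stats.max:
                return True
    return False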

nameexhaustion reopened this on Jun 7, 2024
nameexhaustion changed the title from "Scanning parquet for a UInt64 value does not work for equality" to "Handle parquet files with incorrect statistics in scan_parquet" on Jun 7, 2024
c-peters added the accepted label on Jun 9, 2024
nameexhaustion removed the accepted label on Jun 10, 2024
nameexhaustion removed their assignment on Jun 10, 2024