
parquet file saved with pyarrow 14.0.1 reports different column lengths #14902

Closed · 2 tasks done · deanm0000 opened this issue Mar 7, 2024 · 3 comments · Fixed by #14931

Labels: A-interop-pyarrow (interoperability with pyarrow) · A-io-parquet (reading/writing Parquet files) · accepted (ready for implementation) · bug · needs triage (awaiting prioritization) · P-medium · python (related to Python Polars)

Comments

deanm0000 (Collaborator) commented Mar 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I'm including the actual file here since that's the only way to troubleshoot the issue.

import requests
from tempfile import NamedTemporaryFile
import pyarrow.parquet as pq
import os

os.environ["RUST_BACKTRACE"] = "full"

import polars as pl

resp = requests.get("https://stsynussp.blob.core.windows.net/stdatalake/woodmac/hourly/Reference_Year=2023/Case=2023%20Base%20Case%20-%20Update/Region=SERC/0000.parquet?sv=2023-11-03&st=2024-03-07T12%3A50%3A09Z&se=2024-04-08T11%3A50%3A00Z&sr=b&sp=r&sig=r6rtdUN0A2KjqmAQYcQfs7kz0OK4c93gltZ2ssCMbBk%3D")
ff = NamedTemporaryFile(delete=False)
with open(ff.name, "wb") as f:
    f.write(resp.content)

df = pl.scan_parquet(ff.name)
df.collect()
## PanicException: The column lengths in the DataFrame are not equal.

See the full backtrace below.

pyarrow can open the file without issue:

pqfile = pq.ParquetFile(ff.name)
for i, col in enumerate(df.columns):
    pl_cols = df.select(pl.col(col).len()).collect().item()
    pa_cols = sum(
        pqfile.metadata.row_group(x).column(i).statistics.num_values
        for x in range(pqfile.metadata.num_row_groups)
    )
    if pl_cols != pa_cols:
        print(f"column {col}, polars sees {pl_cols} and pyarrow sees {pa_cols}")
## column isPeak, polars sees 24 and pyarrow sees 1656816

Additionally, this works

df=pl.from_arrow(pqfile.read())

Log output

This is the RUST_BACKTRACE=full output

thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:513:13:
The column lengths in the DataFrame are not equal.
stack backtrace:
   0:     0x7efc97f4fccf - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h39795cad8f90e005
   1:     0x7efc95a3991c - core::fmt::write::h2b281d9025b7c47b
   2:     0x7efc97f1da1e - std::io::Write::write_fmt::h18044c54acec8470
   3:     0x7efc97f5517f - std::sys_common::backtrace::print::h154e885ac7142937
   4:     0x7efc97f54a5b - std::panicking::default_hook::{{closure}}::hcfb38eaaff34d735
   5:     0x7efc97f55a5e - std::panicking::rust_panic_with_hook::hb2c8227f8d32aff4
   6:     0x7efc97f554e8 - std::panicking::begin_panic_handler::{{closure}}::h987389c1664b681c
   7:     0x7efc97f55476 - std::sys_common::backtrace::__rust_end_short_backtrace::hfacc8ef11205d4dc
   8:     0x7efc97f55463 - rust_begin_unwind
   9:     0x7efc94baca94 - core::panicking::panic_fmt::h515a008904190f25
  10:     0x7efc965685f0 - polars_core::fmt::<impl core::fmt::Display for polars_core::frame::DataFrame>::fmt::hda9aa7eff421c4bd
  11:     0x7efc95a3991c - core::fmt::write::h2b281d9025b7c47b
  12:     0x7efc95936aab - alloc::fmt::format::format_inner::h58ed0db95ef255a0
  13:     0x7efc9576854b - polars::dataframe::_::<impl polars::dataframe::PyDataFrame>::__pymethod_as_str__::h36ad570380e829a4
  14:     0x7efc95256b07 - pyo3::impl_::trampoline::trampoline::h55b226f6e1fecbf0
  15:           0x561aa3 - _PyEval_EvalFrameDefault
  16:           0x58d18c - PyObject_CallOneArg
  17:           0x692ca8 - <unknown>
  18:           0x5eddbe - <unknown>
  19:           0x55941c - _PyEval_EvalFrameDefault
  20:           0x54ff58 - _PyObject_FastCallDictTstate
  21:           0x590129 - _PyObject_Call_Prepend
  22:           0x68531d - <unknown>
  23:           0x54bf9b - _PyObject_MakeTpCall
  24:           0x558e6d - _PyEval_EvalFrameDefault
  25:           0x54ff58 - _PyObject_FastCallDictTstate
  26:           0x590129 - _PyObject_Call_Prepend
  27:           0x68531d - <unknown>
  28:           0x54bf9b - _PyObject_MakeTpCall
  29:           0x58d1c8 - PyObject_CallOneArg
  30:           0x55ee7d - _PyEval_EvalFrameDefault
  31:           0x63eca9 - PyEval_EvalCode
  32:           0x65a7ae - <unknown>
  33:           0x55c0f7 - _PyEval_EvalFrameDefault
  34:           0x655c65 - <unknown>
  35:           0x656d3c - <unknown>
  36:           0x55cd72 - _PyEval_EvalFrameDefault
  37:           0x5afd4d - <unknown>
  38:           0x5af86e - <unknown>
  39:           0x592a0d - _PyObject_Call
  40:           0x55d7a3 - _PyEval_EvalFrameDefault
  41:           0x655c65 - <unknown>
  42:     0x7efcf1db21b8 - <unknown>
  43:     0x7efcf1db2e45 - <unknown>
  44:           0x57df49 - <unknown>
  45:           0x666fb5 - <unknown>
  46:           0x50017e - <unknown>
  47:           0x572c5c - <unknown>
  48:           0x55d7a3 - _PyEval_EvalFrameDefault
  49:           0x63eca9 - PyEval_EvalCode
  50:           0x65a7ae - <unknown>
  51:           0x572c5c - <unknown>
  52:           0x5729d6 - PyObject_Vectorcall
  53:           0x558e6d - _PyEval_EvalFrameDefault
  54:           0x66e370 - <unknown>
  55:           0x66def8 - Py_RunMain
  56:           0x62ad7d - Py_BytesMain
  57:     0x7efcf307ed90 - <unknown>
  58:     0x7efcf307ee40 - __libc_start_main
  59:           0x62abf5 - _start
  60:                0x0 - <unknown>

Issue description

Polars views the column as only having 24 entries, which seems related to there being 24 hours in a day (maybe).

Here's the metadata on the column:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7efc90f55440>
  file_offset: 67398
  file_path: 
  physical_type: BOOLEAN
  num_values: 1048576
  path_in_schema: isPeak
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7efc90f55350>
      has_min_max: True
      min: False
      max: True
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BOOLEAN
      logical_type: None
      converted_type (legacy): NONE
  compression: ZSTD
  encodings: ('RLE',)
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 67276
  total_compressed_size: 122
  total_uncompressed_size: 124900

When I resave the file with pyarrow 15 (which Polars doesn't have a problem with), this is the metadata:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7efc90f55940>
  file_offset: 67377
  file_path: 
  physical_type: BOOLEAN
  num_values: 1048576
  path_in_schema: isPeak
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7efc90f55bc0>
      has_min_max: True
      min: False
      max: True
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BOOLEAN
      logical_type: None
      converted_type (legacy): NONE
  compression: ZSTD
  encodings: ('RLE', 'PLAIN')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 67261
  total_compressed_size: 116
  total_uncompressed_size: 131120

So it seems the problem occurs when the column is saved with only the RLE encoding (without PLAIN).

Expected behavior

Polars should read the column at its full length.

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.12.2 (main, Feb 25 2024, 16:35:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            0.9.1
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
ritchie46 (Member) commented

Somehow, the boolean column is read with too few values:

df = pl.read_parquet("0000.parquet", rechunk=False, use_pyarrow=False)
for c in df.columns:
    s = df[c]
    print(f"name: {c}; dtype: {s.dtype}; len: {s.len()}")

name: Area; dtype: String; len: 1656816
name: Date; dtype: Date; len: 1656816
name: Hourend; dtype: Int32; len: 1656816
name: isPeak; dtype: Boolean; len: 24
name: Nominal_Energy; dtype: Float64; len: 1656816
name: Real_Energy; dtype: Float64; len: 1656816
name: Real_SRMC; dtype: Float64; len: 1656816
name: Nominal_SRMC; dtype: Float64; len: 1656816

@deanm0000 do you know how the file was produced? With which pyarrow settings?

ritchie46 (Member) commented

Ok, this is the new RLE decoding which I recently added. Taking a look.

deanm0000 (Collaborator, Author) commented

pq.write_table(
    DF.to_arrow(),
    dest_file,
    version="2.6",
    data_page_version="2.0",
    compression="zstd",
    filesystem=abfs,
)
