
parquet file saved with pyarrow 14.0.1 reports different column lengths #14902

Closed · 2 tasks done · deanm0000 opened this issue Mar 7, 2024 · 3 comments · Fixed by #14931

Labels: A-interop-pyarrow (interoperability with pyarrow) · A-io-parquet (reading/writing Parquet files) · accepted (ready for implementation) · bug · needs triage (awaiting prioritization) · P-medium · python (related to Python Polars)

Comments

deanm0000 (Collaborator) commented Mar 7, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I'm including the actual file here since that's the only way to troubleshoot the issue.

import requests
from tempfile import NamedTemporaryFile
import pyarrow.parquet as pq
import os

os.environ["RUST_BACKTRACE"] = "full"

import polars as pl

resp = requests.get("https://stsynussp.blob.core.windows.net/stdatalake/woodmac/hourly/Reference_Year=2023/Case=2023%20Base%20Case%20-%20Update/Region=SERC/0000.parquet?sv=2023-11-03&st=2024-03-07T12%3A50%3A09Z&se=2024-04-08T11%3A50%3A00Z&sr=b&sp=r&sig=r6rtdUN0A2KjqmAQYcQfs7kz0OK4c93gltZ2ssCMbBk%3D")
ff = NamedTemporaryFile(delete=False)
with open(ff.name, "wb") as f:
    f.write(resp.content)

df = pl.scan_parquet(ff.name)
df.collect()
## PanicException: The column lengths in the DataFrame are not equal.

See the full backtrace below.

pyarrow can open the file without issue:

pqfile = pq.ParquetFile(ff.name)
for i, col in enumerate(df.columns):
    pl_cols = df.select(pl.col(col).len()).collect().item()
    pa_cols = sum(
        pqfile.metadata.row_group(x).column(i).statistics.num_values
        for x in range(pqfile.metadata.num_row_groups)
    )
    if pl_cols != pa_cols:
        print(f"column {col}, polars sees {pl_cols} and pyarrow sees {pa_cols}")
## column isPeak, polars sees 24 and pyarrow sees 1656816

Additionally, this works

df=pl.from_arrow(pqfile.read())

Log output

This is the RUST_BACKTRACE=full output

thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:513:13:
The column lengths in the DataFrame are not equal.
stack backtrace:
   0:     0x7efc97f4fccf - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h39795cad8f90e005
   1:     0x7efc95a3991c - core::fmt::write::h2b281d9025b7c47b
   2:     0x7efc97f1da1e - std::io::Write::write_fmt::h18044c54acec8470
   3:     0x7efc97f5517f - std::sys_common::backtrace::print::h154e885ac7142937
   4:     0x7efc97f54a5b - std::panicking::default_hook::{{closure}}::hcfb38eaaff34d735
   5:     0x7efc97f55a5e - std::panicking::rust_panic_with_hook::hb2c8227f8d32aff4
   6:     0x7efc97f554e8 - std::panicking::begin_panic_handler::{{closure}}::h987389c1664b681c
   7:     0x7efc97f55476 - std::sys_common::backtrace::__rust_end_short_backtrace::hfacc8ef11205d4dc
   8:     0x7efc97f55463 - rust_begin_unwind
   9:     0x7efc94baca94 - core::panicking::panic_fmt::h515a008904190f25
  10:     0x7efc965685f0 - polars_core::fmt::<impl core::fmt::Display for polars_core::frame::DataFrame>::fmt::hda9aa7eff421c4bd
  11:     0x7efc95a3991c - core::fmt::write::h2b281d9025b7c47b
  12:     0x7efc95936aab - alloc::fmt::format::format_inner::h58ed0db95ef255a0
  13:     0x7efc9576854b - polars::dataframe::_::<impl polars::dataframe::PyDataFrame>::__pymethod_as_str__::h36ad570380e829a4
  14:     0x7efc95256b07 - pyo3::impl_::trampoline::trampoline::h55b226f6e1fecbf0
  15:           0x561aa3 - _PyEval_EvalFrameDefault
  16:           0x58d18c - PyObject_CallOneArg
  17:           0x692ca8 - <unknown>
  18:           0x5eddbe - <unknown>
  19:           0x55941c - _PyEval_EvalFrameDefault
  20:           0x54ff58 - _PyObject_FastCallDictTstate
  21:           0x590129 - _PyObject_Call_Prepend
  22:           0x68531d - <unknown>
  23:           0x54bf9b - _PyObject_MakeTpCall
  24:           0x558e6d - _PyEval_EvalFrameDefault
  25:           0x54ff58 - _PyObject_FastCallDictTstate
  26:           0x590129 - _PyObject_Call_Prepend
  27:           0x68531d - <unknown>
  28:           0x54bf9b - _PyObject_MakeTpCall
  29:           0x58d1c8 - PyObject_CallOneArg
  30:           0x55ee7d - _PyEval_EvalFrameDefault
  31:           0x63eca9 - PyEval_EvalCode
  32:           0x65a7ae - <unknown>
  33:           0x55c0f7 - _PyEval_EvalFrameDefault
  34:           0x655c65 - <unknown>
  35:           0x656d3c - <unknown>
  36:           0x55cd72 - _PyEval_EvalFrameDefault
  37:           0x5afd4d - <unknown>
  38:           0x5af86e - <unknown>
  39:           0x592a0d - _PyObject_Call
  40:           0x55d7a3 - _PyEval_EvalFrameDefault
  41:           0x655c65 - <unknown>
  42:     0x7efcf1db21b8 - <unknown>
  43:     0x7efcf1db2e45 - <unknown>
  44:           0x57df49 - <unknown>
  45:           0x666fb5 - <unknown>
  46:           0x50017e - <unknown>
  47:           0x572c5c - <unknown>
  48:           0x55d7a3 - _PyEval_EvalFrameDefault
  49:           0x63eca9 - PyEval_EvalCode
  50:           0x65a7ae - <unknown>
  51:           0x572c5c - <unknown>
  52:           0x5729d6 - PyObject_Vectorcall
  53:           0x558e6d - _PyEval_EvalFrameDefault
  54:           0x66e370 - <unknown>
  55:           0x66def8 - Py_RunMain
  56:           0x62ad7d - Py_BytesMain
  57:     0x7efcf307ed90 - <unknown>
  58:     0x7efcf307ee40 - __libc_start_main
  59:           0x62abf5 - _start
  60:                0x0 - <unknown>

Issue description

Polars views the column as only having 24 entries, which seems related to there being 24 hours in a day (maybe).

Here's the metadata on the column:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7efc90f55440>
  file_offset: 67398
  file_path: 
  physical_type: BOOLEAN
  num_values: 1048576
  path_in_schema: isPeak
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7efc90f55350>
      has_min_max: True
      min: False
      max: True
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BOOLEAN
      logical_type: None
      converted_type (legacy): NONE
  compression: ZSTD
  encodings: ('RLE',)
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 67276
  total_compressed_size: 122
  total_uncompressed_size: 124900

When I resave the file with pyarrow 15 (which Polars doesn't have a problem with), this is the metadata:

<pyarrow._parquet.ColumnChunkMetaData object at 0x7efc90f55940>
  file_offset: 67377
  file_path: 
  physical_type: BOOLEAN
  num_values: 1048576
  path_in_schema: isPeak
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7efc90f55bc0>
      has_min_max: True
      min: False
      max: True
      null_count: 0
      distinct_count: None
      num_values: 1048576
      physical_type: BOOLEAN
      logical_type: None
      converted_type (legacy): NONE
  compression: ZSTD
  encodings: ('RLE', 'PLAIN')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 67261
  total_compressed_size: 116
  total_uncompressed_size: 131120

So it seems the problem occurs when the column is saved with only the RLE encoding (without PLAIN).

Expected behavior

Polars should read the column at its full length.

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.12.2 (main, Feb 25 2024, 16:35:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            0.9.1
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
ritchie46 (Member) commented

Somehow, the boolean column is read with too few values:

df = pl.read_parquet("0000.parquet", rechunk=False, use_pyarrow=False)
for c in df.columns:
    s = df[c]
    print(f"name: {c}; dtype: {s.dtype}; len: {s.len()}")

name: Area; dtype: String; len: 1656816
name: Date; dtype: Date; len: 1656816
name: Hourend; dtype: Int32; len: 1656816
name: isPeak; dtype: Boolean; len: 24
name: Nominal_Energy; dtype: Float64; len: 1656816
name: Real_Energy; dtype: Float64; len: 1656816
name: Real_SRMC; dtype: Float64; len: 1656816
name: Nominal_SRMC; dtype: Float64; len: 1656816

@deanm0000 do you know how the file was produced? With which pyarrow settings?

ritchie46 (Member) commented

Ok, this is the new RLE decoding which I recently added. Taking a look.

deanm0000 (Collaborator, Author) commented

pq.write_table(
    DF.to_arrow(),
    dest_file,
    version="2.6",
    data_page_version="2.0",
    compression="zstd",
    filesystem=abfs,
)
