write_parquet(use_pyarrow=False) with Struct panics: "The children must have an equal number of values." #16586

Closed
2 tasks done
Luffbee opened this issue May 29, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Luffbee commented May 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.read_ndjson("test.json")
print(df)
df.write_parquet("test.pya.parquet", use_pyarrow=True)
df2 = pl.read_parquet("test.pya.parquet", use_pyarrow=True)
print(df2)
df.write_parquet("test.py.parquet") # panic here

With "test.json":

{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,627],"lost_byte":49132,"strict_lost_pkt":[0,400],"spur_lost_pkt":[0,229],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":27,"max":0,"avg":0.0,"hist":[27],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":96,"max":0,"avg":0.0,"hist":[96],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":96,"max":0,"avg":0.0,"hist":[96],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519908454691,"inter":5000408}}
{"v":{"lost_kbps":1.74,"pkt_lr":0.05,"byte_lr":0.06,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[2,4157],"lost_byte":1636233,"strict_lost_pkt":[0,154],"spur_lost_pkt":[2,4003],"retrans_ratio":100.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":30,"max":0,"avg":0.0,"hist":[30],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":207,"max":2,"avg":0.01,"hist":[206,0,1],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":207,"max":0,"avg":0.0,"hist":[207],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519908460804,"inter":5000325}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,0],"lost_byte":0,"strict_lost_pkt":[0,0],"spur_lost_pkt":[0,0],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":19,"max":0,"avg":0.0,"hist":[19],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":138,"max":0,"avg":0.0,"hist":[138],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":138,"max":0,"avg":0.0,"hist":[138],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519908495176,"inter":4999627}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,0],"lost_byte":0,"strict_lost_pkt":[0,0],"spur_lost_pkt":[0,0],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":27,"max":0,"avg":0.0,"hist":[27],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":138,"max":0,"avg":0.0,"hist":[138],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":138,"max":0,"avg":0.0,"hist":[138],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519908497848,"inter":5000437}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,0],"lost_byte":0,"strict_lost_pkt":[0,0],"spur_lost_pkt":[0,0],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":53,"max":0,"avg":0.0,"hist":[53],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":86,"max":0,"avg":0.0,"hist":[86],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":86,"max":0,"avg":0.0,"hist":[86],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519908608570,"inter":5000430}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,2901],"lost_byte":109811,"strict_lost_pkt":[0,2752],"spur_lost_pkt":[0,149],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":59,"max":0,"avg":0.0,"hist":[59],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":95,"max":0,"avg":0.0,"hist":[95],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":95,"max":0,"avg":0.0,"hist":[95],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519910241354,"inter":5000061}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,0],"lost_byte":0,"strict_lost_pkt":[0,0],"spur_lost_pkt":[0,0],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":21,"max":0,"avg":0.0,"hist":[21],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":141,"max":0,"avg":0.0,"hist":[141],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":141,"max":0,"avg":0.0,"hist":[141],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519910525566,"inter":4999637}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,0],"lost_byte":0,"strict_lost_pkt":[0,0],"spur_lost_pkt":[0,0],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":32,"max":0,"avg":0.0,"hist":[32],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":141,"max":0,"avg":0.0,"hist":[141],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":141,"max":0,"avg":0.0,"hist":[141],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519910525677,"inter":4999578}}
{"v":{"lost_kbps":2.04,"pkt_lr":0.07,"byte_lr":0.07,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[3,7579],"lost_byte":1468051,"strict_lost_pkt":[0,5123],"spur_lost_pkt":[3,2456],"retrans_ratio":100.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":26,"max":3,"avg":0.12,"hist":[25,0,0,1],"p90":0.0,"p95":0.0,"p99":3.0},"win_cum":{"cnt":209,"max":2,"avg":0.01,"hist":[207,1,1],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":209,"max":0,"avg":0.0,"hist":[209],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519910568405,"inter":4999312}}
{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,"strict_byte_lr":0.0,"lost_pkt":[0,2302],"lost_byte":84274,"strict_lost_pkt":[0,2204],"spur_lost_pkt":[0,98],"retrans_ratio":0.0,"rto_cnt":0,"rto_avg":0.0,"win_size":16,"win_cur":{"cnt":40,"max":0,"avg":0.0,"hist":[40],"p90":0.0,"p95":0.0,"p99":0.0},"win_cum":{"cnt":103,"max":0,"avg":0.0,"hist":[103],"p90":0.0,"p95":0.0,"p99":0.0},"win_strict":{"cnt":103,"max":0,"avg":0.0,"hist":[103],"p90":0.0,"p95":0.0,"p99":0.0},"now":2519910714284,"inter":5000037}}

I also tried the Rust API, but could not reproduce the panic there. The code:

use polars::prelude::*;

fn main() {
    {
        let mut df = JsonLineReader::new(&mut std::fs::File::open("test.json").unwrap())
            .finish()
            .unwrap();
        println!("{}", df);
        let mut file = std::fs::File::create("test.parquet").unwrap();
        ParquetWriter::new(&mut file).finish(&mut df).unwrap();
    }
    {
        let mut file = std::fs::File::open("test.parquet").unwrap();
        let df = ParquetReader::new(&mut file).finish().unwrap();
        println!("{}", df);
    }
}

Log output

$ POLARS_VERBOSE=1 RUST_BACKTRACE=1 python3 test.py
shape: (10, 1)
┌─────────────────────────────────┐
│ v                               │
│ ---                             │
│ struct[18]                      │
╞═════════════════════════════════╡
│ {0.0,0.0,0.0,0.0,0.0,[0, 627],… │
│ {1.74,0.05,0.06,0.0,0.0,[2, 41… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 2901]… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {2.04,0.07,0.07,0.0,0.0,[3, 75… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 2302]… │
└─────────────────────────────────┘
shape: (10, 1)
┌─────────────────────────────────┐
│ v                               │
│ ---                             │
│ struct[18]                      │
╞═════════════════════════════════╡
│ {0.0,0.0,0.0,0.0,0.0,[0, 627],… │
│ {1.74,0.05,0.06,0.0,0.0,[2, 41… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 2901]… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 0],0,… │
│ {2.04,0.07,0.07,0.0,0.0,[3, 75… │
│ {0.0,0.0,0.0,0.0,0.0,[0, 2302]… │
└─────────────────────────────────┘
thread 'python3' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 13 have a length of 0, which is different from values at index 0, 10."))
stack backtrace:
   0: rust_begin_unwind
             at /rustc/2d24fe591f30386d6d5fc2bb941c78d7266bf10f/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/2d24fe591f30386d6d5fc2bb941c78d7266bf10f/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/2d24fe591f30386d6d5fc2bb941c78d7266bf10f/library/core/src/result.rs:1654:5
   3: polars_core::series::into::<impl polars_core::series::Series>::to_arrow
   4: <polars_core::frame::RecordBatchIter as core::iter::traits::iterator::Iterator>::next
   5: core::iter::traits::iterator::Iterator::find_map
   6: polars_io::parquet::write::writer::ParquetWriter<W>::finish
   7: polars::dataframe::io::<impl polars::dataframe::PyDataFrame>::write_parquet
   8: polars::dataframe::io::<impl polars::dataframe::PyDataFrame>::__pymethod_write_parquet__
   9: pyo3::impl_::trampoline::trampoline
  10: polars::dataframe::io::_::__INVENTORY::trampoline
  11: method_vectorcall_VARARGS_KEYWORDS
             at /usr/local/src/conda/python-3.12.3/Objects/descrobject.c:365:14
  12: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.3/Include/internal/pycore_call.h:92:11
  13: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.3/Objects/call.c:325:12
  14: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1713204800955/work/build-static/Python/bytecodes.c:2706:19
  15: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.3/Python/ceval.c:578:21
  16: run_eval_code_obj
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1722
  17: run_mod
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1743
  18: pyrun_file
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:1643
  19: _PyRun_SimpleFileObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:433
  20: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.12.3/Python/pythonrun.c:78
  21: pymain_run_file_obj
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:360
  22: pymain_run_file
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:379
  23: pymain_run_python
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:629
  24: Py_RunMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:709
  25: Py_BytesMain
             at /usr/local/src/conda/python-3.12.3/Modules/main.c:763:12
  26: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  27: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  28: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/test.py", line 8, in <module>
    df.write_parquet("test.py.parquet")
  File "/root/micromamba/envs/test/lib/python3.12/site-packages/polars/dataframe/frame.py", line 3292, in write_parquet
    self._df.write_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 13 have a length of 0, which is different from values at index 0, 10."))

Issue description

write_parquet(use_pyarrow=False) panics when the DataFrame contains a Struct column like the one above.
The equivalent Rust code does not panic.
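
The panic message says the struct child at index 13 has a length of 0. As a rough aid for reading that message, the sketch below uses only the standard library to list the keys of the "v" object from the first record of "test.json" and show which one sits at position 13. It assumes Polars keeps struct fields in JSON key order, which this thread does not confirm; `struct_field_names` is a hypothetical helper, not a Polars function.

```python
import json

# First record of "test.json" from the reproduction above, copied verbatim.
FIRST_RECORD = (
    '{"v":{"lost_kbps":0.0,"pkt_lr":0.0,"byte_lr":0.0,"strict_pkt_lr":0.0,'
    '"strict_byte_lr":0.0,"lost_pkt":[0,627],"lost_byte":49132,'
    '"strict_lost_pkt":[0,400],"spur_lost_pkt":[0,229],"retrans_ratio":0.0,'
    '"rto_cnt":0,"rto_avg":0.0,"win_size":16,'
    '"win_cur":{"cnt":27,"max":0,"avg":0.0,"hist":[27],"p90":0.0,"p95":0.0,"p99":0.0},'
    '"win_cum":{"cnt":96,"max":0,"avg":0.0,"hist":[96],"p90":0.0,"p95":0.0,"p99":0.0},'
    '"win_strict":{"cnt":96,"max":0,"avg":0.0,"hist":[96],"p90":0.0,"p95":0.0,"p99":0.0},'
    '"now":2519908454691,"inter":5000408}}'
)


def struct_field_names(ndjson_line: str, column: str = "v") -> list[str]:
    """Hypothetical helper: keys of one struct column, in JSON order."""
    record = json.loads(ndjson_line)  # dicts preserve insertion order
    return list(record[column].keys())


fields = struct_field_names(FIRST_RECORD)
# Fields whose values are themselves JSON objects, i.e. nested structs.
nested = [k for k, v in json.loads(FIRST_RECORD)["v"].items() if isinstance(v, dict)]

print(len(fields))  # 18, matching struct[18] in the printed frame
print(fields[13])   # win_cur
print(nested)       # ['win_cur', 'win_cum', 'win_strict']
```

Under that ordering assumption, index 13 is "win_cur", the first of the three nested struct fields, so the nested structs are a plausible place to look when narrowing the reproduction.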

Expected behavior

It should not panic.

Installed versions

>>> import polars
>>> polars.show_versions()
--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-glibc2.35
Python:               3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
Luffbee added the bug, needs triage, and python labels on May 29, 2024
@cmdlineluser
Contributor

I think this may already be fixed.

I can reproduce the exception using the 0.20.30 wheel from pypi.

Testing on main it runs without error.

Hopefully some others can confirm.

@ritchie46
Member

Yes, this is fixed on main. Will be released this weekend.
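
Until the fixed release is installed, the pyarrow writer from the reproduction can serve as a workaround. Here is a minimal sketch of a version gate, assuming the fix ships in the first release after 0.20.30 (the thread suggests this but gives no version number) and conservatively treating every version up to 0.20.30 as affected. `version_tuple` and `should_use_pyarrow` are hypothetical helpers, not part of Polars.

```python
import re

# Version reported as affected in this issue.
AFFECTED_VERSION = (0, 20, 30)


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '0.20.30' or '1.0.0rc1' into a tuple of ints for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # ignore suffixes such as 'rc1'
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def should_use_pyarrow(version: str) -> bool:
    """Assumption: versions up to and including 0.20.30 need the fallback."""
    return version_tuple(version) <= AFFECTED_VERSION


print(should_use_pyarrow("0.20.30"))  # True
print(should_use_pyarrow("0.20.31"))  # False
```

It could be used as `df.write_parquet("test.py.parquet", use_pyarrow=should_use_pyarrow(pl.__version__))`, which needs pyarrow installed on affected versions.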
