PyArrow cannot read nested-column parquet file written from polars #5762

Closed
2 tasks done
edavisau opened this issue Dec 10, 2022 · 3 comments · Fixed by #5940
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

@edavisau
Contributor

edavisau commented Dec 10, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I receive the following error when running the code below.

Traceback (most recent call last):
  File "/home/ed/dev/nested_test/nested_test.py", line 22, in <module>
    df = pq.read_table("out.parquet")
  File "/home/ed/.virtualenvs/venv/lib64/python3.10/site-packages/pyarrow/parquet/core.py", line 2871, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/ed/.virtualenvs/venv/lib64/python3.10/site-packages/pyarrow/parquet/core.py", line 2517, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Malformed levels. min: 0 max: 3 out of range.  Max Level: 2

Writing with use_pyarrow=True works; however, the Parquet files I am creating are written from Rust. Interestingly, pl.read_parquet(...).to_arrow() does work on the same file.

Reproducible example

import polars as pl
import pyarrow.parquet as pq

pl.from_records(
    [
        dict(
            id=1,
            list_of_structs_col=[
                dict(a=10, b=[10, 11, 12]),
                dict(a=11, b=[13, 14, 15]),
            ],
        ),
        dict(
            id=2,
            list_of_structs_col=[
                dict(a=44, b=[12]),
            ],
        ),
    ]
).write_parquet("out.parquet")

df = pq.read_table("out.parquet")

Expected behavior

The last line should return a pyarrow.Table containing the data.

Installed versions

---Version info---
Polars: 0.15.2
Index type: UInt32
Platform: Linux-6.0.10-200.fc36.x86_64-x86_64-with-glibc2.35
Python: 3.10.8 (main, Nov 14 2022, 00:00:00) [GCC 12.2.1 20220819 (Red Hat 12.2.1-2)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.4.3
numpy: 1.23.1
fsspec: 2022.5.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>

edavisau added the bug and python labels on Dec 10, 2022

@ritchie46
Member

Thanks for the issue. This is a known issue and has been reported upstream: jorgecarleitao/arrow2#1323

I shall default to writing with pyarrow in the next release for the time being.

@edavisau
Contributor Author

edavisau commented Dec 10, 2022

Thank you. This might be a silly question: is there a simple way, in the interim, for me to write the Parquet file from Rust by calling C++ Arrow? Or perhaps by using the arrow/parquet crates instead of arrow2/parquet2?

Otherwise, my current workaround would be to write the Parquet files in Rust and then re-encode them in Python: pl.read_parquet(...).write_parquet(..., use_pyarrow=True)

@ritchie46
Member

Fixed by #5940
