
read_parquet() returns empty list[i64] or explicitly crashes if use_pyarrow=True #6428

oscar6echo opened this issue Jan 25, 2023 · 8 comments
Labels
A-io (Area: reading and writing data), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@oscar6echo

oscar6echo commented Jan 25, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I am trying to read a sample parquet file produced by another language (golang).
It works fine except for the list[int64] column, which is returned empty even though it is not empty on disk.
If I force use_pyarrow=True, the read crashes with an explicit error.

NOTE: This seems somewhat related to issue #6289, though this one is about reading parquet files while the other is about writing them.

1/ version 1:

  • code:
from pathlib import Path
import polars as pl
path = Path("sample.pqt")
df = pl.read_parquet(path)
print(df.head())
print(df.select("array"))
print(df.select("array").to_series().to_list())
  • result:
shape: (5, 7)
┌─────┬────────────┬─────┬───────┬────────┬─────────────────────────────────────┬───────────┐
│ idx ┆ name       ┆ age ┆ sex   ┆ weight ┆ time                                ┆ array     │
│ --- ┆ ---        ┆ --- ┆ ---   ┆ ---    ┆ ---                                 ┆ ---       │
│ i64 ┆ str        ┆ i64 ┆ bool  ┆ f64    ┆ datetime[ns, +00:00]                ┆ list[i64] │
╞═════╪════════════╪═════╪═══════╪════════╪═════════════════════════════════════╪═══════════╡
│ 0   ┆ Warlockfir ┆ 22  ┆ true  ┆ 50.0   ┆ 2023-01-25 08:27:47.077523962 +0... ┆ []        │
│ 1   ┆ Maskwood   ┆ 23  ┆ false ┆ 50.6   ┆ 2023-01-25 08:27:47.077562939 +0... ┆ []        │
│ 2   ┆ Pixiecomet ┆ 24  ┆ false ┆ 51.2   ┆ 2023-01-25 08:27:47.077569845 +0... ┆ []        │
│ 3   ┆ Biterflame ┆ 25  ┆ true  ┆ 51.8   ┆ 2023-01-25 08:27:47.077575769 +0... ┆ []        │
│ 4   ┆ Graspsalt  ┆ 26  ┆ false ┆ 52.4   ┆ 2023-01-25 08:27:47.077579569 +0... ┆ []        │
└─────┴────────────┴─────┴───────┴────────┴─────────────────────────────────────┴───────────┘
shape: (10, 1)
┌───────────┐
│ array     │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ []        │
│ []        │
│ []        │
│ ...       │
│ []        │
│ []        │
│ []        │
│ []        │
└───────────┘
[[], [], [], [], [], [], [], [], [], []]

2/ version 2:

  • code:
from pathlib import Path
import polars as pl
path = Path("sample.pqt")
df = pl.read_parquet(path, use_pyarrow=True)
print(df.head())
print(df.select("array"))
print(df.select("array").to_series().to_list())
  • result:
Traceback (most recent call last):
  File "/home/olivier/GDrive/dev/golang/parquet-go-explo/test.py", line 8, in <module>
    df = pl.read_parquet(path, use_pyarrow=True)
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/polars/utils.py", line 394, in wrapper
    return fn(*args, **kwargs)
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/polars/io.py", line 964, in read_parquet
    pa.parquet.read_table(
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2871, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2517, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.

Reproducible example

The sample parquet file [sample.pqt](https://github.com/oscar6echo/parquet-go-explo/blob/main/sample.pqt)

Expected behavior

The last column in the parquet file (list[int64]) should be returned by pl.read_parquet():

{0 Warlockfir 22 true 50 2023-01-25 09:27:47.077523962 +0100 CET m=+0.001425235 [0 3]}
{1 Maskwood 23 false 50.6 2023-01-25 09:27:47.077562939 +0100 CET m=+0.001464212 [1 4]}
{2 Pixiecomet 24 false 51.2 2023-01-25 09:27:47.077569845 +0100 CET m=+0.001471119 [2 5]}
{3 Biterflame 25 true 51.8 2023-01-25 09:27:47.077575769 +0100 CET m=+0.001477042 [3 6]}
{4 Graspsalt 26 false 52.4 2023-01-25 09:27:47.077579569 +0100 CET m=+0.001480841 [4 7]}
{5 Scalewave 22 false 53 2023-01-25 09:27:47.077584464 +0100 CET m=+0.001485735 [5 8]}
{6 Singerorange 23 true 53.6 2023-01-25 09:27:47.077589038 +0100 CET m=+0.001490311 [6 9]}
{7 Takerfringe 24 false 54.2 2023-01-25 09:27:47.077593866 +0100 CET m=+0.001495142 [7 10]}
{8 Arrowcopper 25 false 54.8 2023-01-25 09:27:47.07759902 +0100 CET m=+0.001500296 [8 11]}
{9 Terrierrowan 26 true 55.4 2023-01-25 09:27:47.077604323 +0100 CET m=+0.001505596 [9 12]}

Repo oscar6echo/parquet-go-explo contains the code to produce this parquet file.

Installed versions

---Version info---
Polars: 0.15.16
Index type: UInt32
Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
Python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.24.0
fsspec: 2023.1.0
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.2
@oscar6echo oscar6echo added bug Something isn't working python Related to Python Polars labels Jan 25, 2023
@ritchie46
Member

If I force use_pyarrow=True then it explicitly crashes.

What is the crash? Can you read the file with pandas?

@oscar6echo
Author

oscar6echo commented Jan 25, 2023

The crash when using use_pyarrow=True is described in the issue above. ☝️

If I use pandas it crashes too:

  • Unsurprisingly, the error with engine='pyarrow' is the same.
  • The error is less specific with engine='fastparquet'.

See below.

1/ with pyarrow

  • code:
from pathlib import Path
import pandas as pd
path = Path("sample.pqt")
df = pd.read_parquet(path, engine="pyarrow")
print(df.head())
  • result:
Traceback (most recent call last):
  File "/home/olivier/GDrive/dev/golang/parquet-go-explo/test.py", line 20, in <module>
    df = pd.read_parquet(path, engine="pyarrow")
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2871, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2517, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.

2/ with fastparquet

  • code:
from pathlib import Path
import pandas as pd
path = Path("sample.pqt")
df = pd.read_parquet(path, engine="fastparquet")
print(df.head())
  • result:
Traceback (most recent call last):
  File "/home/olivier/GDrive/dev/golang/parquet-go-explo/test.py", line 21, in <module>
    df = pd.read_parquet(path, engine="fastparquet")
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 358, in read
    return parquet_file.to_pandas(columns=columns, **kwargs)
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/api.py", line 778, in to_pandas
    self.read_row_group_file(rg, columns, categories, index,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/api.py", line 380, in read_row_group_file
    core.read_row_group(
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 621, in read_row_group
    read_row_group_arrays(file, rg, columns, categories, schema_helper,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 591, in read_row_group_arrays
    read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 487, in read_col
    num += read_data_page_v2(infile, schema_helper, se, ph.data_page_header_v2, cmd,
  File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 225, in read_data_page_v2
    raise NotImplementedError
NotImplementedError

@ritchie46
Member

Then your parquet file is likely incorrect.

@oscar6echo
Author

Then your parquet file is likely incorrect.

Well, that is what I thought too: the parquet file is corrupt or invalid in some way.
But it is not that clear-cut. See below:

1/
If I read the file with pyarrow I do see the data, including the list[i64] data:

  • code:
from pathlib import Path
import pyarrow.parquet as pq
path = Path("sample2.pqt")
h = pq.ParquetFile(path)
print("----schema:")
print(h.schema)
print("----read:")
print(h.read())
  • result:
----read:
pyarrow.Table
name: large_string
age: int64
sex: bool
weight: double
time: timestamp[us]
array: large_list<item: int64>
  child 0, item: int64
----
name: [["Masterfog","Armspice"]]
age: [[22,23]]
sex: [[true,false]]
weight: [[51.2,65.3]]
time: [[2023-01-25 12:03:14.208962,2023-01-25 12:03:14.208962]]
array: [[[10,20],[11,22]]]

2/
There appears to be a subtle difference between the golang library I use, segmentio/parquet-go, and polars in the way they write nested fields to parquet.

In this example each lib can read its own nested list field but not the other's; the mismatched field comes back as empty lists instead.

3/
So contrary to what I thought, parquet compatibility can be partial 🤔

So I think there must be subtle differences in the writing to parquet.
Where exactly in the polars source code do you write nested fields?
I'll try and compare with the golang lib.

@oscar6echo
Author

I'll add that one (of the many) benefits of polars over pandas is the capability to hold lists and structs in cells. So this parquet issue is not negligible if you use python/polars in hybrid (multi-language) data pipelines.

@oscar6echo
Author

In fact, and contrary to what I wrote above, polars and parquet-go produce incompatible parquet formats, for nested fields only.
To be more precise, both produce valid parquet files, but with inconsistent schemas.

However the gap does not seem completely insurmountable.

See parquet-go/issues/468 for the discussion.

@ritchie46, any opinion on the subject ?

@tustvold

tustvold commented Jan 25, 2023

This likely relates to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists: readers such as parquet2 do not handle the backwards-compatibility rules for list layouts correctly.

For reference the logic to handle this in parquet can be found here and here
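For background, the standard three-level list layout encodes each element with repetition and definition levels, and a reader reconstructs the lists from those levels. The following is a pure-Python illustration of that reconstruction (an educational sketch, not the actual parquet2 or arrow code), assuming an optional list of optional int64 elements, where max definition level 3 means a present element, 2 a null element, 1 an empty list, and 0 a null list:

```python
def decode_list_column(levels, max_def=3):
    """Reconstruct list<int64> rows from (rep_level, def_level, value)
    triples, following the standard three-level parquet list encoding:
      def 0 -> null list, def 1 -> empty list,
      def 2 -> null element, def 3 (max_def) -> present element.
    rep 0 starts a new row; rep 1 appends to the current row's list.
    """
    rows = []
    for rep, dfl, value in levels:
        if rep == 0:  # repetition level 0 starts a new row
            if dfl == 0:
                rows.append(None)  # null list
                continue
            rows.append([])
            if dfl == 1:
                continue           # empty list: no elements follow
        rows[-1].append(value if dfl == max_def else None)
    return rows

# Rows encoded below: [0, 3], [], null, [1, None]
triples = [
    (0, 3, 0), (1, 3, 3),    # [0, 3]
    (0, 1, None),            # []
    (0, 0, None),            # null list
    (0, 3, 1), (1, 2, None), # [1, None]
]
print(decode_list_column(triples))
# [[0, 3], [], None, [1, None]]
```

A reader that assumes a different (e.g. legacy two-level) layout maps the same levels to different list boundaries, which is consistent with the empty-list symptom reported above.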

@oscar6echo
Author

After some trial and error, I found a way to read a polars-produced parquet file in go, save it as parquet again, and load the latter file into a polars dataframe.

See https://github.com/oscar6echo/parquet-polars-go

I'll copy my conclusion:

Conclusion:

  • fraugster/parquet-go is the only lib that produces a polars-compatible format for complex types, but it is the slowest and most verbose to achieve that
  • xitongsys/parquet-go and segmentio/parquet-go are significantly faster but produce nested types that are not compatible with polars

It seems the parquet format is quite permissive, so different libs generally have little chance of being compatible beyond the most basic types.
So it would be good if polars offered some flexibility in the parquet formatting of nested types to help compatibility with other ecosystems.

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data labels Jan 13, 2024