This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

parquet_read fails when a column has too many rows with string values #366

Closed
vincev opened this issue Sep 2, 2021 · 0 comments · Fixed by #367
Labels
bug Something isn't working

Comments

vincev commented Sep 2, 2021

parquet_read and parquet_read_record fail to read a Parquet file containing a column of string values once the number of rows exceeds some threshold (20000 rows work, 30000 rows do not).

To reproduce the problem I generated the parquet file with the following script:

import sys
import pandas as pd
import numpy as np
import pyarrow as pa

print(f"Pandas version:  {pd.__version__}")
print(f"Numpy version:   {np.__version__}")
print(f"Pyarrow version: {pa.__version__}")

values = [f'{x:040}' for x in range(int(sys.argv[1]))]
df = pd.DataFrame({'values': values})
df.to_parquet('test.parquet', index=False, version='2.0')
print(f"Wrote {len(df)} rows")

With 20000 rows everything works:

> python gen.py 20000
Pandas version:  1.3.2
Numpy version:   1.21.2
Pyarrow version: 5.0.0
Wrote 20000 rows
> cargo run --release --example parquet_read test.parquet 0 0
    Finished release [optimized] target(s) in 0.12s
     Running `target/release/examples/parquet_read test.parquet 0 0`
Utf8[0000000000000000000000000000000000000000, 0000000000000000000000000000000000000001, 0000000000000000000000000000000000000002, 0000000000000000000000000000000000000003, 0000000000000000000000000000000000000004, ...(19990)..., 0000000000000000000000000000000000019996, 0000000000000000000000000000000000019997, 0000000000000000000000000000000000019998, 0000000000000000000000000000000000019999]

but with 30000 rows the example fails with NotYetImplemented:

> python gen.py 30000
Pandas version:  1.3.2
Numpy version:   1.21.2
Pyarrow version: 5.0.0
Wrote 30000 rows
> cargo run --release --example parquet_read test.parquet 0 0
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/parquet_read test.parquet 0 0`
Error: NotYetImplemented("Decoding \"Plain\"-encoded, dictionary-encoded optional V1 pages is not yet implemented for Binary")
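A possible explanation for the threshold, which I have not verified against the writer's source: every value in the column is distinct, so the dictionary grows with the row count, and the writer may fall back from dictionary encoding to Plain pages once the dictionary passes a size limit. The sketch below estimates the dictionary size under two assumptions that do not come from this report: a 1 MiB dictionary page limit and a 4-byte length prefix per string.

```python
# Back-of-the-envelope estimate. The 1 MiB limit and the 4-byte length
# prefix are assumptions about the writer, not facts taken from this report.
DICT_PAGE_LIMIT = 1024 * 1024   # assumed dictionary page size limit, in bytes
LENGTH_PREFIX = 4               # assumed per-value length prefix for byte arrays
VALUE_LEN = 40                  # f'{x:040}' yields 40-character ASCII strings


def dictionary_bytes(n_rows: int) -> int:
    """Approximate dictionary size when all n_rows values are distinct."""
    return n_rows * (VALUE_LEN + LENGTH_PREFIX)


def fits_in_dictionary(n_rows: int) -> bool:
    """True if the estimated dictionary stays under the assumed limit."""
    return dictionary_bytes(n_rows) < DICT_PAGE_LIMIT


for n in (20000, 30000):
    print(n, dictionary_bytes(n), fits_in_dictionary(n))
```

Under these assumptions the 20000-row dictionary (880000 bytes) stays under the limit and the 30000-row one (1320000 bytes) does not, which would be consistent with the second file containing a mix of dictionary-encoded and Plain pages.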

I am running the latest master version:

> git rev-parse HEAD
cef5f08cf86334772d5ac72291f563e63c298e46