
parquet reader in Kùzu 0.0.9 only supports snappy compression (Polars uses zstd) #2190

Closed
prrao87 opened this issue Oct 11, 2023 · 3 comments

prrao87 (Member) commented Oct 11, 2023

I think this is a bug/regression in version 0.0.9 of the Parquet reader that depends on pyarrow.

Python version: 3.11.2
OS: macOS Ventura 13.5.2
Platform: M2 (arm64)
Polars version: 0.19.8
Pyarrow version: 13.0.0
Kùzu versions: 0.0.8 and 0.0.9

Issue

The code snippet below works with Kùzu 0.0.8 but fails with 0.0.9. The issue seems to be related to how the reader decompresses and parses columns in the new version.

$ python test.py
shape: (3, 3)
┌─────┬───────┬─────┐
│ id  ┆ name  ┆ age │
│ --- ┆ ---   ┆ --- │
│ i64 ┆ str   ┆ i64 │
╞═════╪═══════╪═════╡
│ 1   ┆ Jack  ┆ 33  │
│ 2   ┆ Jill  ┆ 22  │
│ 3   ┆ Wendy ┆ 25  │
└─────┴───────┴─────┘
Traceback (most recent call last):
  File "/code/test.py", line 23, in <module>
    conn.execute(f"COPY Person FROM 'test.parquet';")
  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 90, in execute
    self._connection.execute(
RuntimeError: ColumnReader::decompressInternal

MRE

In the following code, I export a Polars DataFrame to Parquet. Note that I specify compression="zstd" (best compression performance) and use_pyarrow=True to ensure that the C++ implementation of pyarrow is used under the hood; this is what I'll use downstream to ingest the Parquet file into Kùzu.

import os
import shutil

import polars as pl
import kuzu

df = pl.DataFrame({"id": [1, 2, 3], "name": ["Jack", "Jill", "Wendy"], "age": [33, 22, 25]})
print(df)

df.write_parquet("test.parquet", compression="zstd", use_pyarrow=True)

DB_NAME = "test_db"
# Delete the directory each time until we have MERGE FROM available in Kùzu
if os.path.exists(DB_NAME):
    shutil.rmtree(DB_NAME)

# Create database
db = kuzu.Database(f"./{DB_NAME}")
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Person(id INT64, name STRING, age INT64, PRIMARY KEY (id))")

# Fails on Kùzu 0.0.9 with "RuntimeError: ColumnReader::decompressInternal"
conn.execute(f"COPY Person FROM 'test.parquet';")

Workaround

I can use compression="snappy" to get the code above to work with Kùzu 0.0.9.
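For reference, the only change needed is the compression argument in the write call from the MRE above:

df.write_parquet("test.parquet", compression="snappy", use_pyarrow=True)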

However, write_parquet in Polars, both by default and with use_pyarrow=True, prefers zstd compression over snappy, which the Polars docs describe as older and less efficient at compression.

Why does code that used to work no longer work with newer versions of Kùzu and pyarrow?

acquamarin (Collaborator) commented Oct 11, 2023

Hi Prashanth,
We recently switched from the Arrow Parquet reader to our own Parquet reader to achieve zero-copy while copying Parquet files. We are also planning to remove Arrow from our code base to reduce third-party dependencies. The new Parquet reader currently only supports SNAPPY compression; we plan to support more compression codecs and encodings in the next release.
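One way to confirm which codec a Parquet file was actually written with is to inspect its metadata with pyarrow. A minimal sketch, assuming the test.parquet produced by the MRE above:

import pyarrow.parquet as pq

# Print the compression codec of each column chunk in the first row group
meta = pq.ParquetFile("test.parquet").metadata
for i in range(meta.num_columns):
    print(meta.row_group(0).column(i).compression)  # e.g. "ZSTD" or "SNAPPY"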

prrao87 (Member, Author) commented Oct 11, 2023

Okay, I'll plan on documenting that I should only use snappy compression upstream!

Having fewer third-party dependencies is great! Polars doesn't depend on pyarrow either, so I can safely remove pyarrow from my requirements in cases where I don't deal with intermediate Arrow tables.
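A minimal sketch of that pyarrow-free path: Polars' native Rust writer can produce snappy-compressed Parquet directly (reusing df from the MRE above; the filename is illustrative):

# Write snappy-compressed Parquet without pyarrow (use_pyarrow defaults to False)
df.write_parquet("test_snappy.parquet", compression="snappy")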

prrao87 changed the title from "parquet reader in Kùzu 0.0.9 shows different behaviour than earlier versions" to "parquet reader in Kùzu 0.0.9 only supports snappy compression (Polars uses zstd)" on Oct 13, 2023
acquamarin (Collaborator) commented

Hi @prrao87,
We now support reading zstd- and gzip-compressed Parquet files as of #2329.
