
parquet reader in Kùzu 0.0.9 only supports snappy compression (Polars uses zstd) #2190

Closed
prrao87 opened this issue Oct 11, 2023 · 3 comments

prrao87 (Member) commented Oct 11, 2023

I think this is a bug/regression in version 0.0.9 of the Parquet reader that depends on pyarrow.

Python version: 3.11.2
OS: macOS Ventura 13.5.2
Platform: M2 (arm64)
Polars version: 0.19.8
Pyarrow version: 13.0.0
Kùzu versions: 0.0.8 and 0.0.9

Issue

The code snippet below works with Kùzu 0.0.8 but fails with 0.0.9. The issue seems to be related to how the reader decompresses and parses columns in the new version.

$ python test.py
shape: (3, 3)
┌─────┬───────┬─────┐
│ id  ┆ name  ┆ age │
│ --- ┆ ---   ┆ --- │
│ i64 ┆ str   ┆ i64 │
╞═════╪═══════╪═════╡
│ 1   ┆ Jack  ┆ 33  │
│ 2   ┆ Jill  ┆ 22  │
│ 3   ┆ Wendy ┆ 25  │
└─────┴───────┴─────┘
Traceback (most recent call last):
  File "/code/test.py", line 23, in <module>
    conn.execute(f"COPY Person FROM 'test.parquet';")
  File "/code/.venv/lib/python3.11/site-packages/kuzu/connection.py", line 90, in execute
    self._connection.execute(
RuntimeError: ColumnReader::decompressInternal

MRE

In the following code, I export a Polars DataFrame to Parquet. Note that I specify compression="zstd" (best compression performance) and use_pyarrow=True to ensure that the C++ implementation of pyarrow is used under the hood; this is what I'll use downstream to ingest the Parquet file into Kùzu.

import os
import shutil

import polars as pl
import kuzu

df = pl.DataFrame({"id": [1, 2, 3], "name": ["Jack", "Jill", "Wendy"], "age": [33, 22, 25]})
print(df)

df.write_parquet("test.parquet", compression="zstd", use_pyarrow=True)

DB_NAME = "test_db"
# Delete the directory each time until we have MERGE FROM available in Kùzu
if os.path.exists(DB_NAME):
    shutil.rmtree(DB_NAME)

# Create database
db = kuzu.Database(f"./{DB_NAME}")
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Person(id INT64, name STRING, age INT64, PRIMARY KEY (id))")

# Fails on Kùzu 0.0.9 with "RuntimeError: ColumnReader::decompressInternal"
conn.execute(f"COPY Person FROM 'test.parquet';")

Workaround

I can use compression="snappy" to get the code above to work with Kùzu 0.0.9.
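For reference, the only change needed is the compression argument in the write call from the MRE above:

df.write_parquet("test.parquet", compression="snappy", use_pyarrow=True)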

However, write_parquet in Polars, both by default and with use_pyarrow=True, prefers zstd compression over snappy, which the Polars docs describe as older and less efficient at compression.

Why does code that used to work no longer work with newer versions of Kùzu and pyarrow?

acquamarin (Collaborator) commented Oct 11, 2023

Hi Prashanth,
We recently switched from the Arrow Parquet reader to our own Parquet reader to achieve zero-copy while copying Parquet files. We are also planning to remove Arrow from our code base to reduce third-party dependencies. The new Parquet reader currently only supports SNAPPY compression; we plan to support more compression codecs and encodings in the next release.
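One way to confirm which codec a Parquet file was actually written with is to inspect its metadata with pyarrow. A minimal sketch, assuming the test.parquet produced by the MRE above:

import pyarrow.parquet as pq

# Print the compression codec of each column chunk in the first row group
meta = pq.ParquetFile("test.parquet").metadata
for i in range(meta.num_columns):
    print(meta.row_group(0).column(i).compression)  # e.g. "ZSTD" or "SNAPPY"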

prrao87 (Member, Author) commented Oct 11, 2023

Okay, I'll plan on documenting that I should only use snappy compression upstream!

Having fewer third-party dependencies is great! Polars doesn't depend on pyarrow either, so I can safely remove pyarrow from my requirements in cases where I don't deal with intermediate Arrow tables.
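A minimal sketch of that pyarrow-free path: Polars' native Rust writer can produce snappy-compressed Parquet directly (reusing df from the MRE above; the filename is illustrative):

# Write snappy-compressed Parquet without pyarrow (use_pyarrow defaults to False)
df.write_parquet("test_snappy.parquet", compression="snappy")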

prrao87 changed the title from "parquet reader in Kùzu 0.0.9 shows different behaviour than earlier versions" to "parquet reader in Kùzu 0.0.9 only supports snappy compression (Polars uses zstd)" on Oct 13, 2023
acquamarin (Collaborator) commented

Hi @prrao87,
We now support reading zstd- and gzip-compressed Parquet files as of #2329.
