
Polars parquet writer much slower than pyarrow parquet writer #15455

Open
2 tasks done
ion-elgreco opened this issue Apr 3, 2024 · 2 comments
Labels
A-io (Area: reading and writing data), A-io-parquet (Area: reading/writing Parquet files), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

Comments

@ion-elgreco
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

N = 100_000_000
df = pl.DataFrame({
    "foo": np.random.randn(N),
    "foo1": np.random.randn(N),
    "foo2": np.random.randn(N),
    "foo3": np.random.randn(N),
})
df = df.with_columns(
    pl.col("foo").cast(pl.Utf8).alias("foo_str"),
    pl.col("foo").cast(pl.Utf8).alias("foo_str2"),
)

df.write_parquet("test.parquet", compression="snappy")                      # takes 92 seconds
df.write_parquet("test2.parquet", compression="snappy", use_pyarrow=True)   # takes 55 seconds

Log output

No response

Issue description

At work we saw one of our pipelines take around 50 minutes to write a parquet file with Polars. The difference was huge compared to pyarrow, which wrote the same file in about a minute and a half; see the logs below:

With Polars (50 minutes):
[log screenshot]

With pyarrow (1.5 minutes):
[log screenshot]

Expected behavior

Write fast, like pyarrow does.

Installed versions

0.20.10
@ion-elgreco added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Apr 3, 2024
@deanm0000
Collaborator

I tried to reproduce with the 100M rows, but after 2 minutes of generating the DataFrame I tapped out and ran it again with just 10M. With 10M rows, I got 2.9 s to save with Polars and 3.0 s with pyarrow.

@Chuck321123

Chuck321123 commented Apr 4, 2024

Using "zstd" as the compression method, I got this (with 10M rows):
4.85 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) with use_pyarrow=True
8.6 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each) with the native Polars writer

@deanm0000 added the P-low (Priority: low), A-io (Area: reading and writing data), and A-io-parquet (Area: reading/writing Parquet files) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Apr 10, 2024
Projects
Status: Ready
Development

No branches or pull requests

3 participants