Can't sink_parquet on a sorted LazyFrame containing decimal columns
#17289
Comments
I just tested with a smaller dataset, i.e. instead of scanning all ~17k files, I only scanned the first 50, and it works 🤔 Does that mean the problem comes from the data itself (e.g. a null value or something similar)? In that case, it's still odd that the unsorted version works as expected.
I managed to get it working by processing the source files in batches:
from datetime import datetime, timedelta
import polars as pl
# `batched` and `s3_urls` are defined elsewhere and not shown here.
for i, batch in enumerate(batched(s3_urls, batch_size=500)):
    pl.scan_parquet(
        batch,
    ).filter(
        pl.col("date") < datetime.now() - timedelta(days=120)
    ).sort(
        pl.col("value")
    ).sink_parquet(
        f'/tmp/data_{i}.parquet',
    )
Hence the hypothesis I gave above can be ruled out: it's not a data value or data type problem. One minor problem remains: I now have 34 parquet files at the end of the process (from 17k source files in total), instead of a single large one.
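The `batched` helper used in the loop above is not shown in the thread. A minimal sketch of what such a helper could look like, assuming it simply yields consecutive lists of at most `batch_size` items (the bucket name and URL pattern below are hypothetical placeholders, not the real data):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(iterable: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `batch_size` items (assumed behavior)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    # islice pulls the next `batch_size` items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical stand-in for the real list of about 17k S3 URLs.
s3_urls = [f"s3://example-bucket/part_{n}.parquet" for n in range(17_000)]
batches = list(batched(s3_urls, batch_size=500))
print(len(batches))  # 17_000 / 500 = 34 batches, hence 34 output files
```

With 17k inputs and a batch size of 500 this gives 34 batches, which matches the 34 output files mentioned above. Each file is sorted only within its own batch, so the outputs are not globally sorted across files.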
I'm hitting something similar after upgrading to Polars v1.0.0 (note: I am using from:
df = pl.scan_parquet(data)
(
    df.sort(pl.col("l_orderkey"), pl.col("l_partkey"), pl.col("l_suppkey"))
    .head(3)
    .collect(streaming=True)
)
Interestingly, before upgrading I was hitting #17281 on this operation.
I confirm that I still get the issue after upgrading to v1.0.0.
Checks
Reproducible example
Given a very large data set (1b rows) stored on S3:
This works well:
But this doesn't:
I get the following error:
Log output
Issue description
I stumbled upon #16603 and tried the POLARS_ACTIVATE_DECIMAL=1 hack. It was necessary for the first (unsorted) sample code to work, but it is apparently not sufficient for the sorted code sample to work.
I tested with both versions 0.20.31 and 1.0.0rc2: same results.
EDIT: also tested on 1.0.0, with the same results.
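For readers unfamiliar with the POLARS_ACTIVATE_DECIMAL=1 hack mentioned above, here is a minimal sketch of setting the variable from Python rather than from the shell. The thread does not say when Polars reads the variable, so setting it before the import is only a cautious assumption:

```python
import os

# Assumption: set the flag before polars is imported, so it is visible
# no matter when the library reads it.
os.environ["POLARS_ACTIVATE_DECIMAL"] = "1"

# import polars as pl  # import polars only after the variable is set

print(os.environ["POLARS_ACTIVATE_DECIMAL"])
```

Exporting the variable in the shell before launching the script should have the same effect for that process.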
Expected behavior
I expected the lazy scan/filter/sort/sink to work as well as scan/filter/sink.
Installed versions