sink_csv hangs when scanning multiple larger files #12918
Comments
I'm experiencing the same hang. It is reproducible by just reading from a large enough csv file:

```python
import polars as pl
import os

# Write a two-column, 1,000,000-row dummy csv.
dummy_data = list(range(100000 * 10))
dummy_df = pl.DataFrame({'a': dummy_data, 'b': dummy_data})
dummy_df.write_csv('dummy.csv')
print("written")
print("pid", os.getpid())

# Hangs here after writing some lines to sink.csv.
pl.scan_csv('dummy.csv').sink_csv('sink.csv')
```

Some lines are written to sink.csv before the hang. I think both the CSV reader and writer using the thread pool leads to some deadlock? A gdb stack trace shows all threads are waiting:
Thread 1 is the main thread.
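If the thread-pool deadlock hypothesis is right, one workaround (my own suggestion, not from the thread) would be to avoid the streaming sink entirely and materialize the frame before writing; `collect()` and `write_csv()` are the standard eager Polars APIs:

```python
import polars as pl

# Workaround sketch: bypass the streaming sink_csv path by collecting the
# LazyFrame into memory first, then writing with the eager csv writer.
# This trades peak memory for avoiding the streaming pipeline that hangs.
df = pl.scan_csv('dummy.csv').collect()
df.write_csv('sink.csv')
```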
Was this fixed in …?
The example still reproduces for me on … It works correctly with … Perhaps @ritchie46 can confirm?
@naterichman #13239 has fixed this.
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
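The original snippet was not captured in this copy of the issue. A minimal sketch consistent with the issue description below (10 csv files of roughly 1 000 000 records each, scanned together and sunk to one csv; the file names and column layout are my assumptions) might look like:

```python
import polars as pl

# Hypothetical reconstruction: generate 10 csv files of ~1,000,000 rows each.
n = 1_000_000
for i in range(10):
    data = list(range(n))
    pl.DataFrame({'a': data, 'b': data}).write_csv(f'data_{i}.csv')

# Scan all files into a single LazyFrame and sink it to one csv.
# On the affected versions this call hangs; on 0.19.3 it finishes within a minute.
pl.scan_csv('data_*.csv').sink_csv('out.csv')
```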
Log output
```
RUN STREAMING PIPELINE
union -> parquet_sink
RefCell { value: [] }
STREAMING CHUNK SIZE: 25000 rows
```
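For reference, streaming-pipeline log lines like the ones above are printed when verbose logging is enabled. A sketch, assuming the `POLARS_VERBOSE` environment variable (which Polars honors for verbose output):

```python
import os

# Assumption: POLARS_VERBOSE=1 enables verbose logging, including the
# streaming engine's "RUN STREAMING PIPELINE" messages. Set it before
# polars runs any query.
os.environ["POLARS_VERBOSE"] = "1"

import polars as pl

pl.scan_csv('dummy.csv').sink_csv('sink.csv')
```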
Issue description

When scanning multiple (10 in this test) csv files with about 1 000 000 records each into a `LazyFrame` and sinking that `LazyFrame` to a single csv, `polars` hangs. With `polars` version 0.19.3 I do not encounter this issue; the same test script finishes within a minute.

Expected behavior

The `sink_csv` method should not hang, but finish within a reasonable amount of time (for this small test case).

Installed versions
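The version report was not captured in this copy of the issue. For reference, it is the output of `pl.show_versions()`:

```python
import polars as pl

# Prints the polars version plus the versions of optional dependencies,
# which is what the "Installed versions" section of a bug report contains.
pl.show_versions()
```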