Very long read time or hang when collecting a lazy slice in a worker process #14358
Open
Labels: bug, needs triage, python
Hey @ritchie46, @stinodego, @alexander-beedie
Checks
Reproducible example
I am trying to distribute the reading of Parquet files across workers, and Polars' loading time either keeps increasing or the read hangs.
This seems to be a common issue: img2dataset works around lazy loading by actually regenerating the shards: https://github.com/rom1504/img2dataset/blob/main/img2dataset/reader.py#L189
Here is a reproducible script:
Log output
Here are the logs. As you can see, the read time keeps growing dramatically from one slice to the next.
For comparison, here is the same run with PyArrow. It is still not great, but much better.
Issue description
Lazily loading a slice of a Parquet file is a key component of distributed data processing across workers and machines.
Additionally, if I get the length using Polars instead of PyArrow, it seems to hang. This might be a second bug.
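To make the access pattern concrete, here is a minimal sketch of how a file's rows might be split into per-worker slices and then read lazily. It is not the script from this report: the helper names `slice_bounds`, `count_rows`, and `read_slice` are hypothetical, and the even split across workers is an assumption made for illustration.

```python
from typing import List, Tuple


def slice_bounds(num_rows: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split num_rows into contiguous (offset, length) pairs, one per worker."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(num_rows, num_workers)
    bounds = []
    offset = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional row each.
        length = base + (1 if worker < extra else 0)
        bounds.append((offset, length))
        offset += length
    return bounds


def count_rows(path: str) -> int:
    """Row count taken from the Parquet footer via PyArrow (no data is read)."""
    import pyarrow.parquet as pq  # imported lazily; only needed for this helper

    return pq.ParquetFile(path).metadata.num_rows


def read_slice(path: str, offset: int, length: int):
    """Collect a single slice of the file through the Polars lazy API."""
    import polars as pl  # imported lazily; only needed for this helper

    return pl.scan_parquet(path).slice(offset, length).collect()
```

Each worker would call `read_slice(path, offset, length)` with its own pair from `slice_bounds(count_rows(path), num_workers)`. The expectation described in this issue is that the cost of that call does not depend on `offset`.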
Expected behavior
Reading should be fast, and the read time should stay constant.
Installed versions