Log output
Version 1:
join parallel: true
read files in parallel
read files in parallel
avg line length: 24.135742
std. dev. line length: 0.3425134
initial row estimate: 6185333
avg line length: 40.760742
std. dev. line length: 7.7675257
initial row estimate: 2949451008
no. of chunks: 128 processed by: 128 threads.
no. of chunks: 128 processed by: 128 threads.
INNER join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
Version 2:
RUN STREAMING PIPELINE
[csv -> callback -> parquet_sink, csv -> generic_join_build]
STREAMING CHUNK SIZE: 16666 rows
STREAMING CHUNK SIZE: 12500 rows
avg line length: 46.3291
std. dev. line length: 6.630605
initial row estimate: 403562720
no. of chunks: 128 processed by: 128 threads.
keys/aggregates are not partitionable: running default HASH AGGREGATION
Issue description
Iterating over a GroupBy object constructed from a join computed on the fly, as in Version 1, is ~100x faster than iterating over a GroupBy object built from the precomputed join result loaded back in through pl.read_csv, as in Version 2 (see the sketches under Reproducible example below).
Expected behavior
Speed of iteration should be consistent between the two methods.
Reproducible example
Version 1:
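The original snippet did not survive the page scrape, so the following is only a minimal sketch of the on-the-fly pattern the issue describes. The file names, join key, and group key (left.csv, right.csv, key) are hypothetical stand-ins, and group_by is spelled groupby in older Polars releases.

```python
import polars as pl

# Hypothetical inputs; the real example's files and keys are unknown.
left = pl.read_csv("left.csv")
right = pl.read_csv("right.csv")

# Join computed on the fly, then grouped and iterated.
joined = left.join(right, on="key", how="inner")
for name, group in joined.group_by("key"):
    _ = group.height  # stand-in for the per-group work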
Version 2:
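Per the issue text, Version 2 loads the same join, precomputed and saved to CSV, before grouping. Again a sketch with a hypothetical file name (joined.csv), not the author's code:

```python
import polars as pl

# Hypothetical precomputed join, written to CSV earlier and reloaded here.
joined = pl.read_csv("joined.csv")

# Identical grouping and iteration; this path is reported ~100x slower.
for name, group in joined.group_by("key"):
    _ = group.height  # same per-group work as Version 1
```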
Installed versions