
Big difference in iteration speed over GroupBy object depending on DataFrame construction #17288

Closed
2 tasks done
beazerj opened this issue Jun 29, 2024 · 1 comment · Fixed by #17302
Assignees
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars

Comments

@beazerj

beazerj commented Jun 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Version 1:

import polars as pl
from pathlib import Path
from tqdm import tqdm

def do_something():
    return None

if __name__ == "__main__":
    p1 = Path("file_1.csv")
    p2 = Path("file_2.csv")

    d1 = pl.scan_csv(p2)
    d2 = pl.scan_csv(p1)

    d3 = d1.join(
        d2, left_on="join_col_left", right_on="join_col_right"
    ).collect()

    for iter_col_1, data in tqdm(
        d3.group_by("iter_col_1"),
        total=len(d3["iter_col_1"].unique()),
    ):
        do_something()

Version 2:

import polars as pl
from pathlib import Path
from tqdm import tqdm

def do_something():
    return None

if __name__ == "__main__":
    p1 = Path("file_1.csv")
    p2 = Path("file_2.csv")

    d1 = pl.scan_csv(p2)
    d2 = pl.scan_csv(p1)

    d1.join(
        d2, left_on="join_col_left", right_on="join_col_right"
    ).sink_csv("file_3.csv")

    d3 = pl.read_csv("file_3.csv")

    for iter_col_1, data in tqdm(
        d3.group_by("iter_col_1"),
        total=len(d3["iter_col_1"].unique()),
    ):
        do_something()

Log output

Version 1: 

join parallel: true                                                                                                
read files in parallel                                                                                             
read files in parallel                                                                                             
avg line length: 24.135742                                                                                         
std. dev. line length: 0.3425134                                                                                   
initial row estimate: 6185333                                                                                      
avg line length: 40.760742                                                                                         
std. dev. line length: 7.7675257                                                                                   
initial row estimate: 2949451008                                                                                   
no. of chunks: 128 processed by: 128 threads.                                                                      
no. of chunks: 128 processed by: 128 threads.                                                                      
INNER join dataframes finished      
keys/aggregates are not partitionable: running default HASH AGGREGATION 

Version 2:

RUN STREAMING PIPELINE
[csv -> callback -> parquet_sink, csv -> generic_join_build]
STREAMING CHUNK SIZE: 16666 rows
STREAMING CHUNK SIZE: 12500 rows
avg line length: 46.3291
std. dev. line length: 6.630605
initial row estimate: 403562720
no. of chunks: 128 processed by: 128 threads.
keys/aggregates are not partitionable: running default HASH AGGREGATION

Issue description

Iterating over a GroupBy object constructed from a join on the fly, as in Version 1, is ~100x faster than iterating over a GroupBy object created from the precomputed join CSV loaded back in through pl.read_csv, as in Version 2.

Expected behavior

Speed of iteration should be consistent between the two methods.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.1.0-21-cloud-amd64-x86_64-with-glibc2.36
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
torch:                2.3.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@beazerj beazerj added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 29, 2024
@ritchie46
Member

Can you provide reproducible examples? Now we must recreate the missing files.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jun 30, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 1, 2024
This issue was closed.