
Big difference in iteration speed over GroupBy object depending on DataFrame construction #17288

Closed
2 tasks done
beazerj opened this issue Jun 29, 2024 · 1 comment · Fixed by #17302
Assignees
Labels
accepted Ready for implementation bug Something isn't working python Related to Python Polars

Comments

@beazerj

beazerj commented Jun 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Version 1:

import polars as pl
from pathlib import Path
from tqdm import tqdm

def do_something():
    return None

if __name__ == "__main__":
    p1 = Path("file_1.csv")
    p2 = Path("file_2.csv")

    d1 = pl.scan_csv(p2)
    d2 = pl.scan_csv(p1)

    d3 = d1.join(
        d2, left_on="join_col_left", right_on="join_col_right"
    ).collect()

    for iter_col_1, data in tqdm(
        d3.group_by("iter_col_1"),
        total=len(d3["iter_col_1"].unique()),
    ):
        do_something()

Version 2:

import polars as pl
from pathlib import Path
from tqdm import tqdm

def do_something():
    return None

if __name__ == "__main__":
    p1 = Path("file_1.csv")
    p2 = Path("file_2.csv")

    d1 = pl.scan_csv(p2)
    d2 = pl.scan_csv(p1)

    d1.join(
        d2, left_on="join_col_left", right_on="join_col_right"
    ).sink_csv("file_3.csv")

    d3 = pl.read_csv("file_3.csv")

    for iter_col_1, data in tqdm(
        d3.group_by("iter_col_1"),
        total=len(d3["iter_col_1"].unique()),
    ):
        do_something()

Log output

Version 1: 

join parallel: true                                                                                                
read files in parallel                                                                                             
read files in parallel                                                                                             
avg line length: 24.135742                                                                                         
std. dev. line length: 0.3425134                                                                                   
initial row estimate: 6185333                                                                                      
avg line length: 40.760742                                                                                         
std. dev. line length: 7.7675257                                                                                   
initial row estimate: 2949451008                                                                                   
no. of chunks: 128 processed by: 128 threads.                                                                      
no. of chunks: 128 processed by: 128 threads.                                                                      
INNER join dataframes finished      
keys/aggregates are not partitionable: running default HASH AGGREGATION 

Version 2:

RUN STREAMING PIPELINE
[csv -> callback -> parquet_sink, csv -> generic_join_build]
STREAMING CHUNK SIZE: 16666 rows
STREAMING CHUNK SIZE: 12500 rows
avg line length: 46.3291
std. dev. line length: 6.630605
initial row estimate: 403562720
no. of chunks: 128 processed by: 128 threads.
keys/aggregates are not partitionable: running default HASH AGGREGATION

Issue description

Iterating over a GroupBy object constructed from a join on the fly, as in Version 1, is ~100x faster than iterating over a GroupBy object created from the precomputed join CSV loaded back in through pl.read_csv, as in Version 2.

Expected behavior

Speed of iteration should be consistent between the two methods.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.1.0-21-cloud-amd64-x86_64-with-glibc2.36
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.28
torch:                2.3.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@beazerj beazerj added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 29, 2024
@ritchie46
Member

Can you provide reproducible examples? Now we must recreate the missing files.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jun 30, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 1, 2024
This issue was closed.