perf(rust,python): avoid quadratic exclude behaviour when selecting against dtypes and/or wildcards #8953

Merged
merged 1 commit into from
May 21, 2023

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented May 20, 2023

Closes #8947.

Huge speedup when excluding columns on wide frames: multiple orders of magnitude as the number of columns gets large. (It's not noticeable on smaller frames, but the quadratic behaviour really catches up with you as you head into the tens of thousands of columns 😅).

Changed prepare_excluded to return a HashSet: the number of items it holds may be large, so we want O(1) lookup. The dtypes param doesn't need the same treatment, as even in the most extreme case there can only be a handful of them. Also moved the exclusion-match (inside expand_dtypes) up into the iter_fields filter.
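
To illustrate why the container type matters, here is a minimal Python sketch. It is not the Rust code from this PR: the helper names `exclude_with_list` and `exclude_with_set` are hypothetical and only mimic filtering a schema's column names against a collection of excluded names.

```python
from timeit import timeit


def exclude_with_list(columns: list[str], excluded: list[str]) -> list[str]:
    # Each `in` check scans the list: O(len(excluded)) per column,
    # so O(n * m) overall, i.e. quadratic when both are wide.
    return [c for c in columns if c not in excluded]


def exclude_with_set(columns: list[str], excluded: list[str]) -> list[str]:
    # Build the hash set once; each membership check is then O(1) on average.
    excluded_set = set(excluded)
    return [c for c in columns if c not in excluded_set]


columns = [f"c{i}" for i in range(5_000)]
excluded = columns[::2]  # exclude every other column

# Both approaches must agree on the result; only the cost differs.
assert exclude_with_list(columns, excluded) == exclude_with_set(columns, excluded)

t_list = timeit(lambda: exclude_with_list(columns, excluded), number=1)
t_set = timeit(lambda: exclude_with_set(columns, excluded), number=1)
print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

The gap between the two timings should widen as the column count grows, which matches the reasoning above for using a hash set only where the collection can become large.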

Example

from codetiming import Timer
import polars as pl

for n in ( 
    # ↓ fast enough not to notice
    100,
    500,
    1_000,
    2_500,
    5_000,
    7_500,
    10_000,
    # ↓ "highway to the danger zone"
    50_000,
    75_000,
    100_000,
    175_000,
    250_000,
    500_000,
    1_000_000,
):
    df = pl.DataFrame( schema={f"c{i}":pl.Int32 for i in range(n)} )
    with Timer():
        _ = df.select( pl.exclude(pl.NUMERIC_DTYPES) )

Results

Timings:

Hardware: Apple Silicon M2 Max Pro
Compiled with: maturin develop --release -- -C target-cpu=native

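| n cols    | this pr (s) | main (s)  | speedup |
|-----------|-------------|-----------|---------|
| 100       | 0.0001      | 0.0001    | 1x      |
| 500       | 0.0002      | 0.0006    | 3x      |
| 1,000     | 0.0004      | 0.0022    | 5x      |
| 2,500     | 0.0008      | 0.0089    | 11x     |
| 5,000     | 0.0015      | 0.0423    | 28x     |
| 7,500     | 0.0018      | 0.0845    | 46x     |
| 10,000    | 0.0022      | 0.1555    | 70x     |
| 50,000    | 0.0113      | 3.3557    | 296x    |
| 75,000    | 0.0187      | 8.2587    | 441x    |
| 100,000   | 0.0248      | 14.9413   | 602x    |
| 175,000   | 0.0418      | 30.139    | 721x    |
| 250,000   | 0.0688      | 65.9686   | 958x    |
| 500,000   | 0.1478      | 333.4362  | 2255x   |
| 1,000,000 | 0.3226      | 1527.7112 | 4735x   |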
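
As a rough check on the "quadratic" description, the growth exponent can be estimated from two rows of the timings: if time scales as n^k, then k ≈ log(t2/t1) / log(n2/n1). The sketch below uses the 10,000 and 100,000 column rows; the helper name `growth_exponent` is illustrative, and a single pair of measurements only gives an approximate figure.

```python
import math


def growth_exponent(n1: int, t1: float, n2: int, t2: float) -> float:
    # Assuming t ~ n**k, solve for k from two (n, t) measurements.
    return math.log(t2 / t1) / math.log(n2 / n1)


# (n cols, seconds) taken from the 10,000 and 100,000 rows of the timings.
k_main = growth_exponent(10_000, 0.1555, 100_000, 14.9413)
k_pr = growth_exponent(10_000, 0.0022, 100_000, 0.0248)

print(f"main: ~n^{k_main:.2f}, this pr: ~n^{k_pr:.2f}")
```

On these two rows the estimate comes out close to 2 for main and close to 1 for this PR, consistent with quadratic versus roughly linear scaling.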

Plot:

Orange: current main.
Blue: this pr.

@github-actions github-actions bot added the performance, python, and rust labels May 20, 2023
@alexander-beedie alexander-beedie changed the title from "perf(rust,python): avoid quadratic exclude behaviour when selecting against dtypes" to "perf(rust,python): avoid quadratic exclude behaviour when selecting against dtypes and/or wildcards" May 20, 2023
@mcrumiller
Contributor

What library did you use for the figure?

@alexander-beedie
Collaborator Author

alexander-beedie commented May 20, 2023

What library did you use for the figure?

An elderly copy of Excel 🤣

@ritchie46
Member

Good catch!

@ritchie46 ritchie46 merged commit 0e51555 into pola-rs:main May 21, 2023
33 checks passed
@alexander-beedie alexander-beedie deleted the quadratic-exclude-cols branch May 21, 2023 05:42
c-peters pushed a commit to c-peters/polars that referenced this pull request Jul 14, 2023
Development

Successfully merging this pull request may close these issues.

Column datatype selector slowdown on wide dataframe
3 participants