PERF: df.groupby(categorical) #49596

Merged
2 commits merged into pandas-dev:main from cat-reorder-categories on Nov 9, 2022

Conversation


@lukemanley lukemanley commented Nov 9, 2022

Performance improvement in Categorical.reorder_categories. The most likely user-facing speedup is in DataFrame.groupby(categorical).

import pandas as pd
import pandas._testing as tm

vals = pd.Series(tm.rands_array(10, 10**6), dtype="string")
df = pd.DataFrame({"cat": vals.astype("category")})

%timeit df.groupby("cat").size()

1.21 s ± 4.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
15.1 ms ± 274 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
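The snippet above relies on IPython's %timeit and on the private pandas._testing helper. A rough equivalent for a plain Python session might look like the sketch below. It is an assumption-laden stand-in, not the author's benchmark: the function name bench_groupby_categorical, the NumPy-based string generator, the smaller default size, and the explicit observed=False (which was the groupby default when this PR was written) are all choices made here for illustration.

```python
import string
import timeit

import numpy as np
import pandas as pd


def bench_groupby_categorical(n=100_000, width=10, repeat=3, seed=0):
    """Time df.groupby("cat").size() on n random strings stored as a categorical.

    Hypothetical helper: returns (best time in seconds, the group sizes).
    """
    rng = np.random.default_rng(seed)
    alphabet = np.array(list(string.ascii_lowercase))
    # Draw an (n, width) grid of random letters and join each row into one string.
    letters = alphabet[rng.integers(0, len(alphabet), size=(n, width))]
    vals = pd.Series(["".join(row) for row in letters], dtype="string")
    df = pd.DataFrame({"cat": vals.astype("category")})

    def run():
        # observed=False is passed explicitly so every category is kept in the result.
        return df.groupby("cat", observed=False).size()

    best = min(timeit.repeat(run, number=1, repeat=repeat))
    return best, run()


if __name__ == "__main__":
    seconds, sizes = bench_groupby_categorical()
    print(f"best of 3: {seconds * 1e3:.1f} ms for {int(sizes.sum())} rows")
```

Absolute timings will differ from the numbers quoted above, since they depend on the machine, the pandas version, and the input size.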

existing ASVs:

       before           after         ratio
     [d69e63d7]       [382245ad]
     <main>           <cat-reorder-categories>
-      3.91±0.1ms       3.08±0.1ms     0.79  groupby.Categories.time_groupby_nosort
-     3.54±0.04ms      2.31±0.02ms     0.65  groupby.Categories.time_groupby_extra_cat_nosort
-     3.85±0.05ms       2.39±0.1ms     0.62  groupby.Categories.time_groupby_ordered_nosort
-     2.05±0.01ms          514±6μs     0.25  groupby.Categories.time_groupby_extra_cat_sort
-      2.21±0.1ms          536±4μs     0.24  groupby.Categories.time_groupby_sort
-      2.23±0.1ms         523±30μs     0.23  groupby.Categories.time_groupby_ordered_sort
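The ratio column is the after time divided by the before time, so values below 1 mean the PR branch is faster. As a quick sanity check, the sketch below recomputes the ratios from the rounded timings quoted above, plus the speedup implied by the two %timeit results. The dictionary and function names are illustrative only and are not part of asv.

```python
# (before, after) timings in milliseconds, copied from the ASV output above.
asv_timings_ms = {
    "time_groupby_nosort": (3.91, 3.08),
    "time_groupby_extra_cat_nosort": (3.54, 2.31),
    "time_groupby_ordered_nosort": (3.85, 2.39),
    "time_groupby_extra_cat_sort": (2.05, 0.514),
    "time_groupby_sort": (2.21, 0.536),
    "time_groupby_ordered_sort": (2.23, 0.523),
}


def ratio(before_ms, after_ms):
    """Return after / before, rounded to two decimals like the ASV report."""
    return round(after_ms / before_ms, 2)


ratios = {name: ratio(*times) for name, times in asv_timings_ms.items()}

# Speedup implied by the %timeit results: 1.21 s on main vs 15.1 ms on the PR.
timeit_speedup = 1210 / 15.1

for name, value in ratios.items():
    print(f"{value:.2f}  {name}")
print(f"1M-row example: roughly {timeit_speedup:.0f}x faster")
```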

@lukemanley lukemanley added the Groupby, Performance (Memory or execution speed performance), and Categorical (Categorical Data Type) labels Nov 9, 2022
@lukemanley lukemanley changed the title from "PERF: groupby(categorical)" to "PERF: df.groupby(categorical)" Nov 9, 2022
@topper-123

topper-123 commented Nov 9, 2022

Wow, very nice!

One question that's only semi-related to this: why do we even use reorder_categories in this groupby? It does seem unrelated, especially as we're only using one dtype (if we were combining two dtypes, I would understand it).

Could we get an additional perf. improvement by avoiding reorder_categories in groupby in cases like this?

@topper-123 topper-123 added this to the 2.0 milestone Nov 9, 2022
@lukemanley

My guess is that the rationale for matching the original order of the categories is to keep the result's categorical dtype equivalent to the input's categorical dtype. I think we could add a short-circuit that checks whether the categories are already equal and, if so, skips reorder_categories. However, I just tried that locally and could not measure any noticeable improvement in performance.
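For illustration, the short-circuit idea could be sketched roughly as follows. This is not the code that was tried locally or anything from the pandas internals: reorder_if_needed is a hypothetical helper written for this example, and it uses only the public Categorical.reorder_categories and Index.equals methods.

```python
import pandas as pd


def reorder_if_needed(cat, target_categories):
    """Hypothetical helper: reorder categories only when the order differs."""
    target = pd.Index(target_categories)
    # Index.equals is order-sensitive, so this only short-circuits when the
    # categories already appear in exactly the target order.
    if cat.categories.equals(target):
        return cat
    # Same set of categories, new order: the values stay the same while the
    # integer codes are remapped to the new positions.
    return cat.reorder_categories(target)


cat = pd.Categorical(["b", "a", "c", "a"], categories=["c", "b", "a"])
out = reorder_if_needed(cat, ["a", "b", "c"])

print(list(out))              # values are unchanged
print(list(out.categories))   # categories now follow the target order
print(out.codes.tolist())     # codes point at the new positions
print(reorder_if_needed(out, ["a", "b", "c"]) is out)  # second call is a no-op
```

One design note: the check compares the category indexes rather than the dtypes, because two unordered CategoricalDtype instances compare equal regardless of category order, so a dtype comparison would not detect a differing order.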


@topper-123 topper-123 left a comment

LGTM

@topper-123

Ok. Thanks for trying it out. This performance improvement is already really great as is :-)

@topper-123

Thanks, @lukemanley. Great stuff.

@topper-123 topper-123 merged commit a88d6f2 into pandas-dev:main Nov 9, 2022
phofl pushed a commit to phofl/pandas that referenced this pull request Nov 9, 2022
* Categorical.reorder_categories perf

* whatsnew
@lukemanley lukemanley deleted the cat-reorder-categories branch November 10, 2022 03:18
codamuse pushed a commit to codamuse/pandas that referenced this pull request Nov 12, 2022
* Categorical.reorder_categories perf

* whatsnew
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
* Categorical.reorder_categories perf

* whatsnew