PERF: df.groupby(categorical) #49596

lukemanley · 2022-11-09T02:14:34Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Performance improvement in Categorical.reorder_categories. The user-facing improvement is most likely to show up in DataFrame.groupby(categorical).
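For readers unfamiliar with the method, here is a minimal illustration of what Categorical.reorder_categories does, using a small made-up categorical (the values and category names are only for illustration and are not taken from this PR). The stored values stay the same; only the category order, and therefore the integer codes, change.

```python
import pandas as pd

# Illustrative data: three values drawn from three declared categories.
cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])

# Same set of categories, different order.
reordered = cat.reorder_categories(["c", "b", "a"])

print(cat.codes.tolist())        # codes index into the original category order
print(reordered.codes.tolist())  # codes are remapped to the new category order
print(list(reordered))           # the values themselves are unchanged
```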

import pandas as pd
import pandas._testing as tm

vals = pd.Series(tm.rands_array(10, 10**6), dtype="string")
df = pd.DataFrame({"cat": vals.astype("category")})

%timeit df.groupby("cat").size()

1.21 s ± 4.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     <- main
15.1 ms ± 274 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

existing ASVs:

       before           after         ratio
     [d69e63d7]       [382245ad]
     <main>           <cat-reorder-categories>
-      3.91±0.1ms       3.08±0.1ms     0.79  groupby.Categories.time_groupby_nosort
-     3.54±0.04ms      2.31±0.02ms     0.65  groupby.Categories.time_groupby_extra_cat_nosort
-     3.85±0.05ms       2.39±0.1ms     0.62  groupby.Categories.time_groupby_ordered_nosort
-     2.05±0.01ms          514±6μs     0.25  groupby.Categories.time_groupby_extra_cat_sort
-      2.21±0.1ms          536±4μs     0.24  groupby.Categories.time_groupby_sort
-      2.23±0.1ms         523±30μs     0.23  groupby.Categories.time_groupby_ordered_sort

topper-123 · 2022-11-09T05:11:43Z

Wow, very nice!

One question that's only semi-related to this: why do we even use reorder_categories in this groupby? It seems unrelated, especially as we're only using one dtype (if we were combining two dtypes, I would understand it).

Could we get an additional perf. improvement by avoiding reorder_categories in groupby in cases like this?

lukemanley · 2022-11-09T11:47:26Z

I'm going to guess the rationale for matching the original order of the categories is to keep the result categories dtype equivalent to the input categories dtype. I think we could add a short-circuit that checks to see if they happen to be equal already and if so skip reorder_categories. However, I just tried that locally and could not measure any noticeable improvement in performance.
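The short-circuit idea described above could look roughly like the sketch below. This is not the code from the PR or from pandas internals: `reorder_if_needed` is a hypothetical helper written only to illustrate the check, and it uses the public `Index.equals` and `Categorical.reorder_categories` methods.

```python
import pandas as pd


def reorder_if_needed(cat: pd.Categorical, new_categories) -> pd.Categorical:
    """Hypothetical helper: skip the reorder when the order already matches."""
    target = pd.Index(new_categories)
    if cat.categories.equals(target):
        # Same categories in the same order: return the input untouched.
        return cat
    # Otherwise fall back to the regular reorder.
    return cat.reorder_categories(target)


cat = pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"])
same = reorder_if_needed(cat, ["a", "b", "c"])     # short-circuits
flipped = reorder_if_needed(cat, ["c", "b", "a"])  # actually reorders
```

As noted in the comment above, trying this locally gave no measurable gain, so the sketch only shows the shape of the check, not a recommended change.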

topper-123

LGTM

topper-123 · 2022-11-09T14:15:26Z

Ok. Thanks for trying it out. This performance improvement is already really great as is :-)

topper-123 · 2022-11-09T14:53:06Z

Thanks, @lukemanley. Great stuff.

* Categorical.reorder_categories perf * whatsnew

Categorical.reorder_categories perf

382245a

lukemanley added the Groupby, Performance (Memory or execution speed performance), and Categorical (Categorical Data Type) labels Nov 9, 2022

whatsnew

097b151

lukemanley changed the title ~~PERF: groupby(categorical)~~ PERF: df.groupby(categorical) Nov 9, 2022

topper-123 added this to the 2.0 milestone Nov 9, 2022

topper-123 approved these changes Nov 9, 2022

topper-123 merged commit a88d6f2 into pandas-dev:main Nov 9, 2022

phofl pushed a commit to phofl/pandas that referenced this pull request Nov 9, 2022

PERF: df.groupby(categorical) (pandas-dev#49596)

371ae2e

* Categorical.reorder_categories perf * whatsnew

lukemanley deleted the cat-reorder-categories branch November 10, 2022 03:18

codamuse pushed a commit to codamuse/pandas that referenced this pull request Nov 12, 2022

PERF: df.groupby(categorical) (pandas-dev#49596)

6c0d3fb

* Categorical.reorder_categories perf * whatsnew

mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022

PERF: df.groupby(categorical) (pandas-dev#49596)

52f7d1f

* Categorical.reorder_categories perf * whatsnew
