PERF: avoid creating numpy array in groupby.first|last #34178

topper-123 · 2020-05-14T18:36:35Z

An unneeded numpy array is created for each group when calling groupby.first and groupby.last on ExtensionArrays. This PR avoids that.

>>> cat = pd.Categorical(["a"] * 1_000_000 + ["b"] * 1_000_000)
>>> ser = pd.Series(cat)
>>> %timeit ser.groupby(cat).first()
210 ms ± 3.03 ms per loop  # master
78.4 ms ± 766 µs per loop  # this PR

The same speedup is achieved for groupby.last. The above is about 2.7x faster than on master (210 ms vs. 78.4 ms): there are two groups, so we save creating two arrays. With more groups or larger arrays, the improvement would be even greater.

Also adds some type hints to help understand what parameters these functions accept.
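
To check how the cost scales with the number of groups, something like the following could be used. This is only a rough sketch, not part of the PR: the helper name `time_first`, the row count, and the `g0`, `g1`, ... labels are made up for illustration, and `observed=True` is passed only to keep the result limited to groups that actually occur. Timings will vary by machine and pandas version, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd


def time_first(n_groups, n_rows=200_000, repeats=3):
    """Time ser.groupby(cat).first() on a categorical Series.

    Hypothetical helper for illustration; returns (best_seconds, result).
    """
    # Assign rows to groups round-robin so every category is observed.
    codes = np.arange(n_rows) % n_groups
    labels = [f"g{i}" for i in range(n_groups)]
    cat = pd.Categorical.from_codes(codes, categories=labels)
    ser = pd.Series(cat)

    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = ser.groupby(cat, observed=True).first()
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    for n_groups in (2, 20, 200):
        seconds, _ = time_first(n_groups)
        print(f"{n_groups:>4} groups: {seconds * 1000:.1f} ms")
```

Running the script on master and on this branch and comparing the printed times per group count would show whether the gap widens as groups are added.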

jreback · 2020-05-15T12:54:47Z

thanks @topper-123 very nice

PERF: avoid creating numpy array in groupby.first|last

0389edc

topper-123 force-pushed the groupby_first_last branch from a457506 to 0389edc May 14, 2020 18:37

jreback added the Groupby and Performance (Memory or execution speed performance) labels May 15, 2020

jreback added this to the 1.1 milestone May 15, 2020

jreback merged commit 1f1735e into pandas-dev:master May 15, 2020

topper-123 mentioned this pull request May 15, 2020

CLN/TYP: Groupby agg methods #34200

Merged

topper-123 deleted the groupby_first_last branch May 24, 2020 16:56
