PERF: avoid creating numpy array in groupby.first|last #34178

Merged: 1 commit, May 15, 2020

topper-123 commented May 14, 2020

An unneeded numpy array is created for each group when calling groupby.first and groupby.last on ExtensionArrays. This PR avoids that.

>>> cat = pd.Categorical(["a"] * 1_000_000 + ["b"] * 1_000_000)
>>> ser = pd.Series(cat)
>>> %timeit ser.groupby(cat).first()
210 ms ± 3.03 ms per loop  # master
78.4 ms ± 766 µs per loop  # this PR

The same speedup is achieved for groupby.last. The above is nearly 3x faster than master (210 ms vs. 78.4 ms) because there are two groups, so we avoid creating two arrays. With more groups or larger arrays, the improvement would be even greater.
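For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard-library timeit module. The helper names make_series and time_first_last are hypothetical (they are not part of pandas or this PR), and observed=True is passed only so the call behaves the same across pandas versions. Absolute timings will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd


def make_series(n_per_group, labels):
    """Build a categorical Series holding n_per_group repeats of each label.

    Hypothetical helper for this sketch; not part of pandas.
    """
    values = [label for label in labels for _ in range(n_per_group)]
    cat = pd.Categorical(values, categories=list(labels))
    return pd.Series(cat), cat


def time_first_last(n_per_group, labels, repeats=3):
    """Return the best-of-`repeats` wall time (seconds) for first() and last()."""
    ser, cat = make_series(n_per_group, labels)
    # Group the categorical Series by its own categories, as in the PR description.
    t_first = min(timeit.repeat(
        lambda: ser.groupby(cat, observed=True).first(), number=1, repeat=repeats))
    t_last = min(timeit.repeat(
        lambda: ser.groupby(cat, observed=True).last(), number=1, repeat=repeats))
    return t_first, t_last


if __name__ == "__main__":
    # Same shape as the benchmark above: two groups of one million rows each.
    t_first, t_last = time_first_last(1_000_000, ["a", "b"])
    print(f"first: {t_first * 1e3:.1f} ms, last: {t_last * 1e3:.1f} ms")
```

Running this on a build from before and after the change should give the two numbers to compare; passing a longer list of labels is one way to check the claim about more groups.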

Also adds some type hints to help clarify which parameters these functions accept.

jreback added the Groupby and Performance (memory or execution speed performance) labels May 15, 2020
jreback added this to the 1.1 milestone May 15, 2020
jreback merged commit 1f1735e into pandas-dev:master May 15, 2020
jreback commented May 15, 2020

thanks @topper-123 very nice

topper-123 deleted the groupby_first_last branch May 24, 2020 16:56