jbrockmendel

Per discussions about removing libreduction code, this is part of an effort to make the non-libreduction path more performant.

Performance was compared by disabling fast_apply entirely and running the two most-affected asv benchmarks:

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%prun -s cumtime df.groupby(["key", "key2"]).apply(lambda x: 1)
PR -> 0.263 s
No optimization -> 0.308 s
master -> 0.039 s

%prun -s cumtime df.groupby("key").apply(lambda x: 1)
PR -> 0.083 s
No optimization -> 0.127 s
master -> 0.012 s
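
Outside IPython, where `%prun` is not available, a rough version of the same measurement can be taken with `time.perf_counter`. This is only a sketch: the helper `time_apply` and its `repeats` parameter are hypothetical names, and wall-clock time is not the same quantity as the cumulative time reported by `%prun`, so the numbers will not match the ones above.

```python
import time

import numpy as np
from pandas import DataFrame

N = 10 ** 4
df = DataFrame(
    {
        "key": np.random.randint(0, 2000, size=N),
        "value1": np.random.randn(N),
    }
)


def time_apply(frame, keys, repeats=3):
    # Hypothetical helper: best wall-clock time (seconds) over a few repeats.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = frame.groupby(keys).apply(lambda x: 1)
        best = min(best, time.perf_counter() - start)
    return best, result


elapsed, result = time_apply(df, "key")
print(f"groupby('key').apply: {elapsed:.3f} s over {len(result)} groups")
```

Running this on each branch gives a comparable set of timings, although the absolute values depend on the machine and the pandas version.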

@jbrockmendel added the Groupby and Performance (memory or execution speed performance) labels on May 17, 2020
@jreback added this to the 1.1 milestone on May 17, 2020
@jreback merged commit 6f065b6 into pandas-dev:master on May 17, 2020
@jbrockmendel deleted the slow-apply branch on May 17, 2020 21:35
Japanuspus added a commit to Japanuspus/pandas that referenced this pull request on Aug 12, 2020:
This bug is a regression in v1.1.0 and was introduced by the fix for GH-34214 in commit [6f065b].

The underlying cause is that the `*Splitter` classes do not use the `._constructor` property and do not call `__finalize__`.

Please note that the method name passed to the `__finalize__` calls was my best guess, since documentation for this value has been hard to find.
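
To illustrate the mechanism described above: a `DataFrame` subclass survives an internal rebuild only if the new object is created through `_constructor` and then passed through `__finalize__`. The following is a minimal sketch, not the code from the referenced commit; `TaggedFrame`, its `tag` attribute, and the `method="example"` label are hypothetical names chosen for illustration.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Hypothetical subclass: "tag" is custom metadata that should propagate.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Tells pandas which class to use when it builds a new object.
        return TaggedFrame


df = TaggedFrame({"key": [0, 0, 1], "value": [1.0, 2.0, 3.0]})
df.tag = "demo"

# Rebuilding with the base class drops the subclass.
lossy = pd.DataFrame(df)

# Rebuilding through _constructor keeps the subclass, and __finalize__
# copies the names listed in _metadata from the source object.
kept = df._constructor(lossy).__finalize__(df, method="example")

print(type(lossy).__name__)
print(type(kept).__name__, kept.tag)
```

Code that splits a frame into per-group chunks and builds each chunk with the plain `DataFrame` constructor behaves like `lossy` here, which is consistent with the regression described above.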

[6f065b]: pandas-dev@6f065b6