BUG: groupby.transform calls the user function ~1.5 times more than necessary #44977
Comments
Thanks @axil for the report and investigation. PR to fix welcome. cc @mroeschke
Here is the current logic block that non-numba engine groupby transform goes through. I haven't been following the fast/slow path saga too closely, but maybe that refactor was done for groupby apply and not groupby transform?
pandas/pandas/core/groupby/generic.py Lines 1138 to 1179 in 14287e3
I believe this behavior is mostly expected. Internally, groupby.transform chooses one of two code paths, a "slow path" or a "fast path". The slow path evaluates the transformation function column by column, whereas the fast path evaluates all columns at once if the function supports it. To decide which path to take, pandas evaluates both paths on the first group in the groupby, and if the results are equal, the fast path is used for subsequent groups. Relabeling the example with the internal paths, I think this makes sense:
Your example does highlight one potential performance improvement: when the groupby contains only a single group, there is no benefit in evaluating both paths, because there are no subsequent groups to process.
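The call counts implied by this description can be sketched with a small model. This is only an illustration of the behavior described in the comment above, not pandas code: the function name `transform_call_count` and its parameters are hypothetical, and the model assumes every group has the same number of columns and ignores exceptions or any other internal calls.

```python
def transform_call_count(n_groups, n_columns, fast_path_chosen):
    """Estimate how often the user function runs (hypothetical model).

    First group: both paths are evaluated, i.e. one call per column
    (slow path) plus one call on the whole group (fast path).
    Later groups: only the chosen path is evaluated.
    """
    if n_groups == 0:
        return 0
    first_group_calls = n_columns + 1
    calls_per_later_group = 1 if fast_path_chosen else n_columns
    return first_group_calls + (n_groups - 1) * calls_per_later_group


# One group with two columns, slow path kept: 3 calls where 2 would suffice.
single_group = transform_call_count(1, 2, fast_path_chosen=False)
print(single_group, single_group / 2)
```

Under this model the overhead is one extra call in total, so the ratio of 1.5 from the issue title applies to the single-group, two-column case and shrinks as the number of groups grows.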
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
For every group the user function is first called with every series of this group (which is correct), but then with the group as a whole (which is not right, as the result is not used anywhere at all).
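One way to observe this calling pattern is to record the type of every object the user function receives. The sketch below is an assumed reconstruction rather than the snippet from the report: the frame contents are inferred from the output under Expected Behavior, while the function name `total`, the `calls` list, and the grouping key `[0, 0]` (a single group) are hypothetical choices.

```python
import pandas as pd

calls = []  # type name of each object passed to the user function


def total(x):
    # Record whether pandas handed us a Series (one column) or a DataFrame (whole group).
    calls.append(type(x).__name__)
    return x.sum()


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
key = [0, 0]  # assumption: both rows belong to one group

result = df.groupby(key).transform(total)

print(result)
print(calls)
```

Each row of `result` holds the column sums (3 for `a`, 7 for `b`). The contents of `calls` depend on the installed pandas version, so no particular sequence is assumed here.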
After digging through the source code I've found remnants of the 'fast path' and 'slow path' – an optimization that has long been gone, but those extra calls to user function are still there.
The commit that removed this optimization is b8b6471, ENH: Add numba engine to groupby.transform (#32854).
After it was merged, the path output variable of _choose_path is no longer used anywhere. So the extra call to the user function, which was needed only to set up the path, is unnecessary as well. Or maybe the removal was a mistake, and the deletion of the res=path(group) line should be reverted.
After this commit this line from the docs is no longer valid:
Expected Behavior
0 1 # <-- this is correct
1 2
Name: a, dtype: int64
0 3 # <-- this is correct
1 4
Name: b, dtype: int64
a b # <-- this is wrong (result is ignored)
0 1 3
1 2 4
a b # <-- the result is correct
0 3 7
1 3 7
Installed Versions
INSTALLED VERSIONS
commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2