
BUG: groupby.transform calls the user function ~1.5 times more than necessary #44977

Closed
3 tasks done
axil opened this issue Dec 19, 2021 · 3 comments · Fixed by #45820

Labels: Apply (Apply, Aggregate, Transform), Bug, Groupby, Performance (Memory or execution speed)

Comments

@axil commented Dec 19, 2021

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({'key': ['z', 'z'], 'a': [1,2], 'b': [3,4]})
def f(x):
    print(x)
    return x.sum()
df.groupby('key').transform(f)

Issue Description

For every group, the user function is first called with each series (column) of the group (which is correct), but then also with the group as a whole (which is not right, as that result is not used anywhere at all).
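
To make the extra call easy to see, here is a variant of the example above that simply records the invocations (a minimal sketch; on the pandas version reported here it records two per-column calls plus the unused whole-frame call):

import pandas as pd

df = pd.DataFrame({'key': ['z', 'z'], 'a': [1, 2], 'b': [3, 4]})

calls = []
def f(x):
    # x is a Series for the per-column calls and a DataFrame for the extra call
    calls.append(type(x).__name__)
    return x.sum()

df.groupby('key').transform(f)
print(calls)  # ['Series', 'Series', 'DataFrame'] -- the DataFrame result is discarded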

After digging through the source code, I found remnants of the 'fast path' and 'slow path' – an optimization that is long gone, but the extra calls to the user function are still there.

The commit that obliterated this optimization is b8b6471 (ENH: Add numba engine to groupby.transform, #32854).

After it was merged, the path output variable of _choose_path is no longer used anywhere, so the extra call to the user function that was only needed to set up the path is no longer necessary either:

-        path = None
         for name, group in gen:
             object.__setattr__(group, "name", name)

-            if path is None:
+            if engine == "numba":
+                values, index = split_for_numba(group)
+                res = numba_func(values, index, *args)
+                if func not in self._numba_func_cache:
+                    self._numba_func_cache[func] = numba_func
+                # Return the result as a DataFrame for concatenation later
+                res = DataFrame(res, index=group.index, columns=group.columns)
+            else:
                 # Try slow path and fast path.
                 try:
                     path, res = self._choose_path(fast_path, slow_path, group)
@@ -1376,8 +1422,6 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 except ValueError as err:
                     msg = "transform must return a scalar value for each group"
                     raise ValueError(msg) from err
-            else:
-                res = path(group)

Or maybe this was done by mistake, and the deletion of the res = path(group) line should be reverted.

After this commit, this line from the docs is no longer valid:

If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.

Expected Behavior

0 1 # <-- this is correct
1 2
Name: a, dtype: int64

0 3 # <-- this is correct
1 4
Name: b, dtype: int64

a b # <-- this is wrong (result is ignored)
0 1 3
1 2 4

a b # <-- the result is correct
0 3 7
1 3 7

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2

@axil added the Bug and Needs Triage labels on Dec 19, 2021
@simonjayhawkins (Member)

Thanks @axil for the report and investigation. PR to fix welcome.

cc @mroeschke

@simonjayhawkins added the Apply, Groupby, and Performance labels and removed the Needs Triage label on Dec 21, 2021
@simonjayhawkins added this to the Contributions Welcome milestone on Dec 21, 2021
@mroeschke (Member)

Here is the current logic block that non-numba engine groupby transform goes through.

I haven't been following the fast/slow path saga too closely, but maybe that refactor was done for groupby apply and not groupby transform?

def _transform_general(self, func, *args, **kwargs):
    from pandas.core.reshape.concat import concat

    applied = []
    obj = self._obj_with_exclusions
    gen = self.grouper.get_iterator(obj, axis=self.axis)
    fast_path, slow_path = self._define_paths(func, *args, **kwargs)

    # Determine whether to use slow or fast path by evaluating on the first group.
    # Need to handle the case of an empty generator and process the result so that
    # it does not need to be computed again.
    try:
        name, group = next(gen)
    except StopIteration:
        pass
    else:
        object.__setattr__(group, "name", name)
        try:
            path, res = self._choose_path(fast_path, slow_path, group)
        except TypeError:
            return self._transform_item_by_item(obj, fast_path)
        except ValueError as err:
            msg = "transform must return a scalar value for each group"
            raise ValueError(msg) from err
        if group.size > 0:
            res = _wrap_transform_general_frame(self.obj, group, res)
            applied.append(res)

    # Compute and process with the remaining groups
    for name, group in gen:
        if group.size == 0:
            continue
        object.__setattr__(group, "name", name)
        res = path(group)
        res = _wrap_transform_general_frame(self.obj, group, res)
        applied.append(res)

    concat_index = obj.columns if self.axis == 0 else obj.index
    other_axis = 1 if self.axis == 0 else 0  # switches between 0 & 1
    concatenated = concat(applied, axis=self.axis, verify_integrity=False)
    concatenated = concatenated.reindex(concat_index, axis=other_axis, copy=False)
    return self._set_result_index_ordered(concatenated)

@lukemanley (Member)

I believe this behavior is mostly expected. Internally, groupby.transform chooses one of two potential code paths, a "slow path" or a "fast path". The slow path evaluates the transformation function column by column whereas the fast path evaluates all columns at once if the function can support it. To determine which path to take, pandas evaluates both paths on the first group in the groupby and if the results are equal, the fast path is used for subsequent groups.

Relabeling the example with the internal paths, I think this makes sense:

0 1 # <-- SLOW PATH (column a)
1 2
Name: a, dtype: int64

0 3 # <-- SLOW PATH (column b)
1 4
Name: b, dtype: int64

a b # <-- FAST PATH (all columns)
0 1 3
1 2 4

a b # <-- the result is correct
0 3 7
1 3 7

Your example does highlight one potential performance improvement: when only a single group exists in the groupby, there is no benefit in evaluating both paths, because there are no subsequent groups to process.
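
As an illustration of that selection step, here is a minimal, self-contained sketch of the "evaluate both paths on the first group" probe (choose_path and the lambdas below are hypothetical stand-ins, not the actual pandas internals):

import pandas as pd

def choose_path(fast_path, slow_path, first_group):
    # Probe both paths on the first group only (simplified sketch).
    slow_res = slow_path(first_group)        # column-by-column application
    try:
        fast_res = fast_path(first_group)    # whole sub-frame at once
    except Exception:
        return slow_path, slow_res           # fast path failed: keep the slow path
    if isinstance(fast_res, type(slow_res)) and fast_res.equals(slow_res):
        return fast_path, slow_res           # identical results: use the fast path for later groups
    return slow_path, slow_res

func = lambda x: x - x.mean()
first_group = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
path, res = choose_path(lambda g: func(g),                # fast: one call per group
                        lambda g: g.apply(func, axis=0),  # slow: one call per column
                        first_group)

With a single group, as in the example in this issue, the probe itself is the whole computation, so evaluating both paths only adds the extra call.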

@jreback modified the milestones: Contributions Welcome, 1.5 on Feb 4, 2022