
BUG: groupby.transform calls the user function ~1.5 times more than necessary #44977

Closed
3 tasks done
axil opened this issue Dec 19, 2021 · 3 comments · Fixed by #45820

Labels: Apply (Apply, Aggregate, Transform), Bug, Groupby, Performance (Memory or execution speed)

Comments

@axil commented Dec 19, 2021

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame({'key': ['z', 'z'], 'a': [1,2], 'b': [3,4]})
def f(x):
    print(x)
    return x.sum()
df.groupby('key').transform(f)

Issue Description

For every group, the user function is first called with each series (column) of the group (which is correct), but then also with the group as a whole (which is not right, as that result is not used anywhere at all).
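
To make the extra call easy to see, here is a variant of the example above that simply records the invocations (a minimal sketch; on the pandas version reported here it records two per-column calls plus the unused whole-frame call):

import pandas as pd

df = pd.DataFrame({'key': ['z', 'z'], 'a': [1, 2], 'b': [3, 4]})

calls = []
def f(x):
    # x is a Series for the per-column calls and a DataFrame for the extra call
    calls.append(type(x).__name__)
    return x.sum()

df.groupby('key').transform(f)
print(calls)  # ['Series', 'Series', 'DataFrame'] -- the DataFrame result is discarded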

After digging through the source code, I found remnants of the 'fast path' and 'slow path' – an optimization that is long gone, but the extra calls to the user function are still there.

The commit that obliterated this optimization is b8b6471 (ENH: Add numba engine to groupby.transform, #32854).

After it was merged, the path output variable of _choose_path is no longer used anywhere, so the extra call to the user function that was only needed to set up the path is no longer necessary either:

-        path = None
         for name, group in gen:
             object.__setattr__(group, "name", name)

-            if path is None:
+            if engine == "numba":
+                values, index = split_for_numba(group)
+                res = numba_func(values, index, *args)
+                if func not in self._numba_func_cache:
+                    self._numba_func_cache[func] = numba_func
+                # Return the result as a DataFrame for concatenation later
+                res = DataFrame(res, index=group.index, columns=group.columns)
+            else:
                 # Try slow path and fast path.
                 try:
                     path, res = self._choose_path(fast_path, slow_path, group)
@@ -1376,8 +1422,6 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 except ValueError as err:
                     msg = "transform must return a scalar value for each group"
                     raise ValueError(msg) from err
-            else:
-                res = path(group)

Or maybe this was done by mistake, and the deletion of the res = path(group) line should be reverted.

After this commit, this line from the docs is no longer valid:

If f also supports application to the entire subframe, then a fast path is used starting from the second chunk.

Expected Behavior

0 1 # <-- this is correct
1 2
Name: a, dtype: int64

0 3 # <-- this is correct
1 4
Name: b, dtype: int64

a b # <-- this is wrong (result is ignored)
0 1 3
1 2 4

a b # <-- the result is correct
0 3 7
1 3 7

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.5
numpy : 1.21.2
pytz : 2019.2
dateutil : 2.8.0
pip : 21.0.1
setuptools : 41.1.0
Cython : 0.29.14
pytest : 5.1.3
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.16.2
xlrd : 1.2.0
xlwt : None
numba : 0.51.2

@axil added the Bug and Needs Triage labels on Dec 19, 2021
@simonjayhawkins (Member)

Thanks @axil for the report and investigation. PR to fix welcome.

cc @mroeschke

@simonjayhawkins added the Apply, Groupby, and Performance labels and removed the Needs Triage label on Dec 21, 2021
@simonjayhawkins added this to the Contributions Welcome milestone on Dec 21, 2021
@mroeschke (Member)

Here is the current logic block that non-numba engine groupby transform goes through.

I haven't been following the fast/slow path saga too closely, but maybe that refactor was done for groupby apply and not groupby transform?

def _transform_general(self, func, *args, **kwargs):
    from pandas.core.reshape.concat import concat

    applied = []
    obj = self._obj_with_exclusions
    gen = self.grouper.get_iterator(obj, axis=self.axis)
    fast_path, slow_path = self._define_paths(func, *args, **kwargs)

    # Determine whether to use slow or fast path by evaluating on the first group.
    # Need to handle the case of an empty generator and process the result so that
    # it does not need to be computed again.
    try:
        name, group = next(gen)
    except StopIteration:
        pass
    else:
        object.__setattr__(group, "name", name)
        try:
            path, res = self._choose_path(fast_path, slow_path, group)
        except TypeError:
            return self._transform_item_by_item(obj, fast_path)
        except ValueError as err:
            msg = "transform must return a scalar value for each group"
            raise ValueError(msg) from err
        if group.size > 0:
            res = _wrap_transform_general_frame(self.obj, group, res)
            applied.append(res)

    # Compute and process with the remaining groups
    for name, group in gen:
        if group.size == 0:
            continue
        object.__setattr__(group, "name", name)
        res = path(group)
        res = _wrap_transform_general_frame(self.obj, group, res)
        applied.append(res)

    concat_index = obj.columns if self.axis == 0 else obj.index
    other_axis = 1 if self.axis == 0 else 0  # switches between 0 & 1
    concatenated = concat(applied, axis=self.axis, verify_integrity=False)
    concatenated = concatenated.reindex(concat_index, axis=other_axis, copy=False)
    return self._set_result_index_ordered(concatenated)

@lukemanley (Member)

I believe this behavior is mostly expected. Internally, groupby.transform chooses one of two potential code paths, a "slow path" or a "fast path". The slow path evaluates the transformation function column by column whereas the fast path evaluates all columns at once if the function can support it. To determine which path to take, pandas evaluates both paths on the first group in the groupby and if the results are equal, the fast path is used for subsequent groups.

Relabeling the example with the internal paths, I think this makes sense:

0 1 # <-- SLOW PATH (column a)
1 2
Name: a, dtype: int64

0 3 # <-- SLOW PATH (column b)
1 4
Name: b, dtype: int64

a b # <-- FAST PATH (all columns)
0 1 3
1 2 4

a b # <-- the result is correct
0 3 7
1 3 7

Your example does highlight one potential performance improvement: when only a single group exists in the groupby, there is no benefit in evaluating both paths, because there are no subsequent groups to process.
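
As an illustration of that selection step, here is a minimal, self-contained sketch of the "evaluate both paths on the first group" probe (choose_path and the lambdas below are hypothetical stand-ins, not the actual pandas internals):

import pandas as pd

def choose_path(fast_path, slow_path, first_group):
    # Probe both paths on the first group only (simplified sketch).
    slow_res = slow_path(first_group)        # column-by-column application
    try:
        fast_res = fast_path(first_group)    # whole sub-frame at once
    except Exception:
        return slow_path, slow_res           # fast path failed: keep the slow path
    if isinstance(fast_res, type(slow_res)) and fast_res.equals(slow_res):
        return fast_path, slow_res           # identical results: use the fast path for later groups
    return slow_path, slow_res

func = lambda x: x - x.mean()
first_group = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
path, res = choose_path(lambda g: func(g),                # fast: one call per group
                        lambda g: g.apply(func, axis=0),  # slow: one call per column
                        first_group)

With a single group, as in the example in this issue, the probe itself is the whole computation, so evaluating both paths only adds the extra call.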

@jreback modified the milestones: Contributions Welcome, 1.5 on Feb 4, 2022