
Wrong output of GroupBy transform with string input (e.g., transform('rank')) #22509

Closed
marioga opened this issue Aug 26, 2018 · 12 comments
Labels
good first issue · Groupby · Needs Tests (unit test(s) needed to prevent regressions)

Comments

@marioga

marioga commented Aug 26, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])

print("%s\n" % df)

print("Output of df.groupby('a')['b'].transform('rank'):\n%s\n" %
        df.groupby('a')['b'].transform('rank'))

print("Output of df.groupby('a')['b'].rank():\n%s\n" %
        df.groupby('a')['b'].rank())

print("Output of df.groupby('a')['b'].transform(lambda x: x.rank()):\n%s\n" %
        df.groupby('a')['b'].transform(lambda x: x.rank()))

print("Output of df.groupby('a')['b'].transform('cumcount'):\n%s\n" %
        df.groupby('a')['b'].transform('cumcount'))

print("Output of df.groupby('a')['b'].cumcount():\n%s\n" %
        df.groupby('a')['b'].cumcount())

Problem description

For simplicity, I will explain the issue for SeriesGroupBy, though the bug is present in DataFrameGroupBy as well.

When calling transform on a SeriesGroupBy object with a string input 'string_input' (e.g., .transform('mean')), the relevant code in pandas/core/groupby/generic.py ends up calling SeriesGroupBy._transform_fast on the function

func = getattr(SeriesGroupBy, 'string_input')

(unless 'string_input' is in base.cython_transforms, which currently consists of ['cumprod', 'cumsum', 'shift', 'cummin', 'cummax']). Inside _transform_fast, the result of applying func to the SeriesGroupBy object is broadcast over the entire index of the original object. This works as expected if func returns a single value per group (e.g., for functions like 'mean', 'std', etc.). However, for functions like rank, cumcount, etc., which return several values per group, the result of broadcasting is, to the best of my knowledge, nonsensical.

Note: Index broadcasting works correctly if transform is called with a (non-cython) function, for example lambda x: x.rank(). In that case, _transform_fast is never called and the result is a simple concatenation of the per-group results.
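To make the broadcast failure concrete, here is a minimal sketch that reconstructs the fast-path behavior using only public pandas APIs (map and ngroup stand in for the internal broadcasting machinery, so this is an illustration of the mechanism, not the actual code path):

```python
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])
grouped = df.groupby('a')['b']

# Aggregating case: one value per group, so broadcasting is correct.
per_group_mean = grouped.mean()               # index = group keys: {1: 15.0, 2: 35.0}
broadcast_mean = df['a'].map(per_group_mean)  # [15.0, 15.0, 35.0, 35.0]

# Non-aggregating case: rank already yields one value per row, so
# re-indexing it by group id (what the fast path effectively does)
# scrambles the result.
ids = df.groupby('a').ngroup()                # group id per row: [0, 0, 1, 1]
buggy = grouped.rank().loc[ids].reset_index(drop=True)
# buggy is [1.0, 1.0, 2.0, 2.0]; the correct result is [1.0, 2.0, 2.0, 1.0]
```

The buggy series matches the wrong transform('rank') output reported above.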

Expected Output

The results of df.groupby('a')['b'].transform('rank') and df.groupby('a')['b'].rank() should be identical. Same for 'cumcount' and maybe other GroupBy functions.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Groupby Compat pandas objects compatability with Numpy or Python functions labels Aug 27, 2018
@gfyoung
Member

gfyoung commented Aug 27, 2018

@marioga : Thanks for reporting this! For reference, do you mind posting what output you got?

cc @jreback

@marioga
Author

marioga commented Aug 27, 2018

Here is the output I get after running the code I submitted:

   a   b
0  1  10
1  1  20
2  2  40
3  2  30

Output of df.groupby('a')['b'].transform('rank'):
0    1.0
1    1.0
2    2.0
3    2.0
Name: b, dtype: float64

Output of df.groupby('a')['b'].rank():
0    1.0
1    2.0
2    2.0
3    1.0
Name: b, dtype: float64

Output of df.groupby('a')['b'].transform(lambda x: x.rank()):
0    1
1    2
2    2
3    1
Name: b, dtype: int64

Output of df.groupby('a')['b'].transform('cumcount'):
0    0
1    0
2    1
3    1
Name: b, dtype: int64

Output of df.groupby('a')['b'].cumcount():
0    0
1    1
2    0
3    1
dtype: int64

@ghost

ghost commented Jun 8, 2019

xref #19354

Adding 'rank' to cython_transforms fixes the issue, as suggested by @chris-b1 in #19354.

cython_transforms is misnamed: the conditionals of the form

if func in base.cython_transforms:

should be used to discriminate between aggregating functions (which _transform_fast assumes) and non-aggregating functions (like rank); whether they are cythonized is not the point.

The result of transform('rank') is definitely wrong, and (surprisingly) different from groupby_obj.rank() and groupby_obj.transform(lambda x: x.rank()) (the latter two produce the same result as each other).

What the fast path does when passed a non-aggregating function is generate a result series R, then for each row belonging to group i assign the value R[i]. That is correct for aggregations, which produce one value per group, but wrong for functions which produce one value per row.

The fast path was introduced in 2014 with fe55b89, back when SeriesGroupBy didn't actually have any non-aggregating functions implemented, so at the time it was good enough.

@ghost

ghost commented Jun 11, 2019

To illustrate:

In [34]: df = pd.DataFrame(dict(price=[10, 10, 20, 20, 30, 30], cost=(100, 200, 300, 400, 500, 600)))
    ...: df
Out[34]: 
   price  cost
0     10   100
1     10   200
2     20   300
3     20   400
4     30   500
5     30   600

# correct
In [36]: df.groupby('price').cost.rank()
Out[36]: 
0    1.0
1    2.0
2    1.0
3    2.0
4    1.0
5    2.0
Name: cost, dtype: float64

# wrong
In [35]: df.groupby('price').cost.transform('rank')
Out[35]: 
0    1.0
1    1.0
2    2.0
3    2.0
4    1.0
5    1.0
Name: cost, dtype: float64

# the above generated via
In [47]: ids, _, _ = df.groupby('price').grouper.group_info
    ...: ids
Out[47]: array([0, 0, 1, 1, 2, 2])
In [48]: df.groupby('price').rank().loc[ids]
Out[48]: 
   cost
0   1.0
0   1.0
1   2.0
1   2.0
2   1.0
2   1.0

@WillAyd WillAyd added this to the Contributions Welcome milestone Jun 12, 2019
@ghost

ghost commented Jun 14, 2019

The same issue occurs with 'ffill' and 'bfill' (xref #14274),
and with 'pct_change', 'diff', and 'shift' (at least).

@jreback
Contributor

jreback commented Jun 14, 2019

Transforming with a non-aggregating function doesn't make any sense: how would you broadcast the results?

@ghost

ghost commented Jun 14, 2019

If it's non-aggregating AND shape-preserving, as many are, you don't need to broadcast.
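That distinction can be sketched with a hypothetical dispatch helper (the function name, and the length-based check, are illustrative only, not pandas internals; a real implementation would consult a list of known aggregations, since the length test is ambiguous when the number of groups equals the number of rows):

```python
import pandas as pd

def transform_sketch(grouped, func_name):
    # Hypothetical dispatch: broadcast the results of aggregating
    # functions, pass shape-preserving results through unchanged.
    result = getattr(grouped, func_name)()
    n_groups = len(grouped.size())
    if len(result) == n_groups:
        # One value per group: an aggregation, so broadcast per row.
        ids = grouped.ngroup().to_numpy()
        return result.iloc[ids].reset_index(drop=True)
    # One value per row already: shape-preserving, nothing to broadcast.
    return result.reset_index(drop=True)

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [10, 20, 40, 30]})
g = df.groupby('a')['b']
# transform_sketch(g, 'mean') broadcasts; transform_sketch(g, 'rank') passes through.
```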

@ghost

ghost commented Jun 14, 2019

but if you're suddenly a purist, see #26743

@jreback
Contributor

jreback commented Jun 14, 2019

@pilkibun you are missing the point

transform works by using groupby with an aggregator and then broadcasting; that's how I wrote it a while back

not sure why you have 2 issues here - it just needs to raise on non-aggregated outputs

@ghost

ghost commented Jun 14, 2019

@pilkibun you are missing the point

I'm sure one of us is.

@mroeschke mroeschke added Bug and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 10, 2020
@mzeitlin11
Member

This is fixed on master but could use tests
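A minimal regression test could look something like the sketch below (the test name and the chosen ops are my own; an actual contribution would go in pandas' tests/groupby suite and follow its conventions):

```python
import pandas as pd
import pandas.testing as tm

def test_string_transform_matches_direct_call():
    # Regression check for GH#22509: transform('<op>') should agree with
    # calling the GroupBy method directly for non-aggregating ops.
    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10.0, 20.0, 40.0, 30.0]})
    gb = df.groupby("a")["b"]
    for op in ["rank", "shift", "cumsum"]:
        result = gb.transform(op)
        expected = getattr(gb, op)()
        tm.assert_series_equal(result, expected)
```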

@mzeitlin11 mzeitlin11 added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Mar 29, 2021
@mroeschke
Member

This seems to be the same issue as #19354, closing
