
Wrong output of GroupBy transform with string input (e.g., transform('rank')) #22509

Closed
marioga opened this issue Aug 26, 2018 · 12 comments
Labels
good first issue · Groupby · Needs Tests (unit test(s) needed to prevent regressions)

Comments

@marioga

marioga commented Aug 26, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])

print("%s\n" % df)

print("Output of df.groupby('a')['b'].transform('rank'):\n%s\n" %
        df.groupby('a')['b'].transform('rank'))

print("Output of df.groupby('a')['b'].rank():\n%s\n" %
        df.groupby('a')['b'].rank())

print("Output of df.groupby('a')['b'].transform(lambda x: x.rank()):\n%s\n" %
        df.groupby('a')['b'].transform(lambda x: x.rank()))

print("Output of df.groupby('a')['b'].transform('cumcount'):\n%s\n" %
        df.groupby('a')['b'].transform('cumcount'))

print("Output of df.groupby('a')['b'].cumcount():\n%s\n" %
        df.groupby('a')['b'].cumcount())

Problem description

For simplicity, I will explain the issue for SeriesGroupBy, though the bug is present in DataFrameGroupBy as well.

When calling transform on a SeriesGroupBy object with a string input 'string_input' (e.g., .transform('mean')), the relevant code in pandas/core/groupby/generic.py ends up calling SeriesGroupBy._transform_fast on the function

func = getattr(SeriesGroupBy, 'string_input')

(unless 'string_input' is in base.cython_transforms, which currently consists of ['cumprod', 'cumsum', 'shift', 'cummin', 'cummax']). Inside _transform_fast, the result of applying func to the SeriesGroupBy object is broadcast over the entire index of the original object. This works as expected if func returns a single value per group (e.g., for functions like 'mean', 'std', etc.). However, for functions like rank, cumcount, etc., which return several values per group, the result of broadcasting is, to the best of my knowledge, nonsensical.

Note: Index broadcasting works correctly if transform is called with a (non-cython) function, for example lambda x: x.rank(). In that case, _transform_fast is never called and the result is a simple concatenation of the per-group results.
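To make the broadcast failure concrete, here is a minimal sketch that reconstructs the fast-path behavior using only public pandas APIs (map and ngroup stand in for the internal broadcasting machinery, so this is an illustration of the mechanism, not the actual code path):

```python
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])
grouped = df.groupby('a')['b']

# Aggregating case: one value per group, so broadcasting is correct.
per_group_mean = grouped.mean()               # index = group keys: {1: 15.0, 2: 35.0}
broadcast_mean = df['a'].map(per_group_mean)  # [15.0, 15.0, 35.0, 35.0]

# Non-aggregating case: rank already yields one value per row, so
# re-indexing it by group id (what the fast path effectively does)
# scrambles the result.
ids = df.groupby('a').ngroup()                # group id per row: [0, 0, 1, 1]
buggy = grouped.rank().loc[ids].reset_index(drop=True)
# buggy is [1.0, 1.0, 2.0, 2.0]; the correct result is [1.0, 2.0, 2.0, 1.0]
```

The buggy series matches the wrong transform('rank') output reported above.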

Expected Output

The results of df.groupby('a')['b'].transform('rank') and df.groupby('a')['b'].rank() should be identical. Same for 'cumcount' and maybe other GroupBy functions.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Groupby Compat pandas objects compatability with Numpy or Python functions labels Aug 27, 2018
@gfyoung
Member

gfyoung commented Aug 27, 2018

@marioga : Thanks for reporting this! For reference, do you mind posting what output you got?

cc @jreback

@marioga
Author

marioga commented Aug 27, 2018

Here is the output I get after running the code I submitted:

   a   b
0  1  10
1  1  20
2  2  40
3  2  30

Output of df.groupby('a')['b'].transform('rank'):
0    1.0
1    1.0
2    2.0
3    2.0
Name: b, dtype: float64

Output of df.groupby('a')['b'].rank():
0    1.0
1    2.0
2    2.0
3    1.0
Name: b, dtype: float64

Output of df.groupby('a')['b'].transform(lambda x: x.rank()):
0    1
1    2
2    2
3    1
Name: b, dtype: int64

Output of df.groupby('a')['b'].transform('cumcount'):
0    0
1    0
2    1
3    1
Name: b, dtype: int64

Output of df.groupby('a')['b'].cumcount():
0    0
1    1
2    0
3    1
dtype: int64

@ghost

ghost commented Jun 8, 2019

xref #19354

Adding 'rank' to cython_transforms fixes the issue, as suggested by @chris-b1 in #19354.

cython_transforms is misnamed: the conditionals of the form

if func in base.cython_transforms:

should be used to discriminate between aggregating functions (which _transform_fast assumes) and non-aggregating functions (like rank); whether they are cythonized is not the point.

The result of transform('rank') is definitely wrong, and (surprisingly) different from groupby_obj.rank() and groupby_obj.transform(lambda x: x.rank()) (the latter two produce the same result as each other).

What the fast path does when passed a non-aggregating function is generate a result series R, then for each row belonging to group i assign the value R[i]. That is correct for aggregations, which produce one value per group, but wrong for functions which produce one value per row.

The fast path was introduced in 2014 with fe55b89, back when SeriesGroupBy didn't actually have any non-aggregating functions implemented, so at the time it was good enough.

@ghost

ghost commented Jun 11, 2019

To illustrate:

In [34]: df = pd.DataFrame(dict(price=[10, 10, 20, 20, 30, 30], cost=(100, 200, 300, 400, 500, 600)))
    ...: df
Out[34]: 
   price  cost
0     10   100
1     10   200
2     20   300
3     20   400
4     30   500
5     30   600

# correct
In [36]: df.groupby('price').cost.rank()
Out[36]: 
0    1.0
1    2.0
2    1.0
3    2.0
4    1.0
5    2.0
Name: cost, dtype: float64

# wrong
In [35]: df.groupby('price').cost.transform('rank')
Out[35]: 
0    1.0
1    1.0
2    2.0
3    2.0
4    1.0
5    1.0
Name: cost, dtype: float64

# the above generated via
In [47]: ids, _, _ = df.groupby('price').grouper.group_info
    ...: ids
Out[47]: array([0, 0, 1, 1, 2, 2])
In [48]: df.groupby('price').rank().loc[ids]
Out[48]: 
   cost
0   1.0
0   1.0
1   2.0
1   2.0
2   1.0
2   1.0

@WillAyd WillAyd added this to the Contributions Welcome milestone Jun 12, 2019
@ghost

ghost commented Jun 14, 2019

The same issue occurs with 'ffill' and 'bfill' (xref #14274),
and with 'pct_change', 'diff', and 'shift' (at least).

@jreback
Contributor

jreback commented Jun 14, 2019

Transforming with a non-aggregating function doesn't make any sense: how would you broadcast the results?

@ghost

ghost commented Jun 14, 2019

If it's non-aggregating AND shape-preserving, as many are, you don't need to broadcast.
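That distinction can be sketched with a hypothetical dispatch helper (the function name, and the length-based check, are illustrative only, not pandas internals; a real implementation would consult a list of known aggregations, since the length test is ambiguous when the number of groups equals the number of rows):

```python
import pandas as pd

def transform_sketch(grouped, func_name):
    # Hypothetical dispatch: broadcast the results of aggregating
    # functions, pass shape-preserving results through unchanged.
    result = getattr(grouped, func_name)()
    n_groups = len(grouped.size())
    if len(result) == n_groups:
        # One value per group: an aggregation, so broadcast per row.
        ids = grouped.ngroup().to_numpy()
        return result.iloc[ids].reset_index(drop=True)
    # One value per row already: shape-preserving, nothing to broadcast.
    return result.reset_index(drop=True)

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [10, 20, 40, 30]})
g = df.groupby('a')['b']
# transform_sketch(g, 'mean') broadcasts; transform_sketch(g, 'rank') passes through.
```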

@ghost

ghost commented Jun 14, 2019

but if you're suddenly a purist, see #26743

@jreback
Contributor

jreback commented Jun 14, 2019

@pilkibun you are missing the point

transform works by using groupby with an aggregator and then broadcasting; that's how I wrote it a while back

not sure why you have 2 issues here - it just needs to raise on non-aggregated outputs

@ghost

ghost commented Jun 14, 2019

@pilkibun you are missing the point

I'm sure one of us is.

@mroeschke mroeschke added Bug and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 10, 2020
@mzeitlin11
Member

This is fixed on master but could use tests
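A minimal regression test could look something like the sketch below (the test name and the chosen ops are my own; an actual contribution would go in pandas' tests/groupby suite and follow its conventions):

```python
import pandas as pd
import pandas.testing as tm

def test_string_transform_matches_direct_call():
    # Regression check for GH#22509: transform('<op>') should agree with
    # calling the GroupBy method directly for non-aggregating ops.
    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10.0, 20.0, 40.0, 30.0]})
    gb = df.groupby("a")["b"]
    for op in ["rank", "shift", "cumsum"]:
        result = gb.transform(op)
        expected = getattr(gb, op)()
        tm.assert_series_equal(result, expected)
```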

@mzeitlin11 mzeitlin11 added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Mar 29, 2021
@mroeschke
Member

This seems to be the same issue as #19354, closing
