Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: #34380

Closed
jesrael opened this issue May 26, 2020 · 15 comments
Closed

BUG: #34380

jesrael opened this issue May 26, 2020 · 15 comments
Labels
Milestone

Comments

@jesrael
Copy link

jesrael commented May 26, 2020

From SO:

After specified column after groupby output is incorrect.

df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0], 'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0], 'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})
print (df)
      token name   ltp    change
0   12345.0  abc   2.0       NaN
1   12345.0  abc   5.0  1.500000
2   12345.0  abc   3.0 -0.400000
3   12345.0  abc   9.0  2.000000
4   12345.0  abc   5.0 -0.444444
5   12345.0  abc  16.0  2.200000
6    6789.0  xyz   1.0       NaN
7    6789.0  xyz   5.0  4.000000
8    6789.0  xyz   3.0 -0.400000
9    6789.0  xyz  13.0  3.333333
10   6789.0  xyz   9.0 -0.307692
11   6789.0  xyz  20.0  1.222222

Correct using named aggregation:

df0 = df.groupby('name').agg(pos=pd.NamedAgg(column='change',aggfunc=lambda x: x.gt(0).sum()),\
                            neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
print (df0)
      pos  neg
name          
abc   3.0  2.0
xyz   3.0  2.0

Wrong output:

df1 = df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
    
print (df1)
      pos  neg
name          
abc   2.0  2.0
xyz   2.0  2.0

Correct using column after groupby:

df2 = df.groupby('name')['change'].agg(pos = lambda x:x.gt(0).sum(),\
                                      neg = lambda x:x.lt(0).sum())
print (df2)
      pos  neg
name          
abc   3.0  2.0
xyz   3.0  2.0

In my opinion instead wrong output it should raise error/warning.

Version of pandas:

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 7
machine          : AMD64
processor        : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : None.None

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.48.0
None
@jesrael jesrael added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2020
@gurukiran07
Copy link
Contributor

gurukiran07 commented May 26, 2020

Yes, agreed. Instead of producing wrong results, it should raise a ValueError or TypeError. The values of the previous aggregation are overwritten by the latest aggregated values.

Minimum reproducible example.

df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0],
                   'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 
                   'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0], 
                   'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})

#with only 1 named aggregation :
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()))
#Output 
     pos
name
abc   3.0
xyz   3.0

#with 2 named aggregation on the same column
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()))
#Output the values of pos column are overwrittem by sum column values
           pos       sum
name
abc   4.855556  4.855556    #The values of `pos` column are over-written by `sum` column's values
xyz   7.847863  7.847863

# with 3 named  aggregation on the same column, the values of previous columns are replaced by the lastest aggregation.

df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='change',aggfunc='max'))

      pos  sum  max
name
abc   2.2  2.2  2.2          #The values of `pos`, `sum` are over-written by `max` column's values.
xyz   4.0  4.0  4.0

pd.__version__
# '1.0.3'

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.1.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

This should raise an error or should be specified in the documentation explicitly about this behaviour, this can be a common pitfall for beginners.

Link to relevant SO post

@gurukiran07
Copy link
Contributor

More weird result:

Using the same dataframe from the above-mentioned comment. pd.NamedAgg over Series produces some weird results.

#df is the same as above
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='ltp',aggfunc='max'))
# Output
           pos       neg  max
name
abc   4.855556  4.855556  2.2
xyz   7.847863  7.847863  4.0

2.2 is the max value of 'change' column(group key is 'abc')
4.0 is the max value of 'change' column(group key is 'xyz')
I would expect it to raise an error saying KeyError: "Column 'ltp' does not exist!". Instead it produces results but it doesn't affect the previous columns like mentioned in issue and the comment above.

Here I used a completely new column name which doesn't exist.

df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='Anything',aggfunc='max'))

#Output 
           pos       neg  max
name
abc   4.855556  4.855556  2.2
xyz   7.847863  7.847863  4.0

2.2 is the max value of 'change' column(group key is 'abc')
4.0 is the max value of 'change' column(group key is 'xyz')

Shouldn't it raise an KeyError

@gurukiran07
Copy link
Contributor

gurukiran07 commented May 26, 2020

I guess the problem is when using with pd.Series

Minimum reproducible example:

s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
#Output 
   one
1    2
2    4
3    6
4    4
5    5

I expected it to raise KeyError or ValueError here.

The values of previous column are over-written when same column same is specified in pd.NamedAgg

s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
                        second=pd.NamedAgg(column='weird',aggfunc='max'))
#Output 
  one  second  # Values of column `one` are over-written
1  1       1
2  2       2
3  3       3
4  4       4
5  5       5

When different column names are provided it doesn't over-write.

s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
                        second=pd.NamedAgg(column='something',aggfunc='max'))

#Output 
   one  second       #Values are not over written
1    2       1     
2    4       2
3    6       3
4    4       4
5    5       5

When previous aggregation values are overwritten when the same column name is fed to pd.NamedAgg.

s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
                        second=pd.NamedAgg(column='diff',aggfunc='max'),
                        third = pd.NamedAgg(column='weird',aggfunc=lambda x: ','.join(map(str,x))))

#Output 

    one  second third
1  1,1       1   1,1
2  2,2       2   2,2
3  3,3       3   3,3
4    4       4     4
5    5       5     5

The aggregation values of one are overwritten by latest aggregation values of third as they have same column name specified (pd.NamedAgg(column='weird',...), the aggregation values of second are not affected since it shares different column name in pd.NamedAgg(column='diff',...)

Link to relevant SO post

@MarcoGorelli
Copy link
Member

Many thanks for the report

The aggregation values of one are overwritten by latest aggregation values of third as they have same column name specified

I'll test this out this evening, but I'd like to think it'd be solved by this PR: #30858 - will update later when I've tried

@gurukiran07
Copy link
Contributor

@MarcoGorelli Thank you for the reply. If the test results indicate there's a bug, I'll be more than glad to work on a PR.

@MarcoGorelli
Copy link
Member

@gurukiran07 I just checkout at that PR, tried your minimal reproducible exaple above, and got

In [4]: df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0
   ...: , 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0], 
   ...:                    'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'x
   ...: yz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],  
   ...:                    'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0,
   ...:  13.0, 9.0, 20.0],  
   ...:                    'change': [np.nan, 1.5, -0.4, 2.0, -0.444443999999999
   ...: 95, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]}) 
   ...:  
   ...: #with only 1 named aggregation : 
   ...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum())) 
   ...: #Output                                                                 
Out[4]: 
      pos
name     
abc   3.0
xyz   3.0

In [5]: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum()),\ 
   ...:                                  sum = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.sum())) 
   ...:                                                                         
Out[5]: 
      pos       sum
name               
abc   3.0  4.855556
xyz   3.0  7.847863

In [6]:  
   ...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum()),\ 
   ...:                                  sum = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.sum()),\ 
   ...:                                  max = pd.NamedAgg(column='change',aggfu
   ...: nc='max')) 
   ...:                                                                         
Out[6]: 
      pos       sum  max
name                    
abc   3.0  4.855556  2.2
xyz   3.0  7.847863  4.0

so it seems it's fixed.

I'll add a test case to explicitly use pd.NamedAgg

@gurukiran07
Copy link
Contributor

@MarcoGorelli Good to know it's fixed. One more thing I would like to suggest.

Minimum reproducible example:

s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
                        second=pd.NamedAgg(column='something',aggfunc='max'))

#Output 
   one  second   
1    2       1     
2    4       2
3    6       3
4    4       4
5    5       5

Expected Output:

KeyError : "Column 'anything' does not exist!"

Or

TypeError : got an unknown keyword column

It works with any name given to column parameter in pd.NamedAgg I would expect it raise KeyError or TypeError.

@MarcoGorelli
Copy link
Member

@gurukiran07 TBH it's not clear to me what the user should pass as column in NamedAgg when using SeriesGroupBy, as there wouldn't be any columns there anyway (though if I've misunderstood, please do let me know). Are you suggesting to disallow using NamedAgg with SeriesGroupBy?

@gurukiran07
Copy link
Contributor

@MarcoGorelli I was trying to convey the same you understood correctly. Disallowing NamedAgg or raise an error when a user passes column parameter while using with SeriesGroupBy.

Something like this

s.groupby(s.values).agg(pos = pd.NamedAgg(aggfunc=lambda x :x.sum()))

Currently it raises TypeError: __new__() missing 1 required positional argument: 'column'

@TomAugspurger
Copy link
Contributor

We document the behavior for Series in https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

So for SeriesGroupBy.agg, if there are any tuples or NamedAgg present in kwargs then I think we should raise.

@gurukiran07
Copy link
Contributor

@TomAugspurger Thank you for the reply. Yes, raising an error would much better and consistent for SeriesGroupBy.agg. When there's no column for SeriesGroupBy but still NamedAgg accepting column as a parameter is looks glitchy and it works with whatever argument given.

if there are any tuples or NamedAgg present in kwargs then I think we should raise.
I totally agree with this.

Changing how agg of SeriesGroupBy(link SeriesGroupBy class) works can be helpful without disturbing DataFrameGroupBy.

@TomAugspurger
Copy link
Contributor

NamedAgg is just a namedtuple. There’s nothing to be done there. The problem is solely in SerisGroupBy

@MarcoGorelli
Copy link
Member

So for SeriesGroupBy.agg, if there are any tuples or NamedAgg present in kwargs then I think we should raise.

@gurukiran07 as this would be separate, if you're interested in working on it, could you open an new issue for it?

@gurukiran07
Copy link
Contributor

gurukiran07 commented May 27, 2020

@TomAugspurger and @MarcoGorelli Yes I would be more than glad to work on it, I identified where the changes can be made so that an Error can be raised when kwargs are NamedAgg or tuples.

Adding some additional constraints in here can solve the issue groupby/generic.py.

I'll open a new issue and will start working on it and hopefully open a PR. Thank you.

@TomAugspurger
Copy link
Contributor

I think this was fixed by #34435

@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 10, 2020
@bashtage bashtage removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants