Ambiguous behaviour when transform groupby with NaNs #17093

Closed
chbrandt opened this issue Jul 27, 2017 · 8 comments · Fixed by #44245 or #46367
Labels: Apply, Groupby, Missing-data, Needs Tests

@chbrandt

Similar issues: #10923, #9697, #9941

Please, consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

Now, the two lines of code below do the same thing: they output each group's average as the new row values. The first uses a string function name; the second uses a lambda function. The first works; the second doesn't.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements

The first one, using 'mean', is what I was expecting. It seems strange to me that the same operation has two different behaviours.
Note: the second one, with the lambda function, used to work in pandas version 0.19.1.
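
For anyone hitting this on an affected version, one way to get the same shape of output as the 'mean' call without going through transform is to aggregate first and map the result back onto the key column. This is only a sketch on a small made-up frame (the values are illustrative, not taken from the output above). It relies on groupby dropping NaN keys by default, so rows with a NaN key come back as NaN.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; rows 0 and 4 have a missing group key.
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [np.nan, 0.0, 0.0, 1.0, np.nan],
})

# Aggregate per group; NaN keys are dropped by default.
group_means = df.groupby("C")["B"].mean()

# Map each row's key to its group mean. Keys without a group (NaN) map to NaN,
# so the result keeps the original index and length.
aligned = df["C"].map(group_means).rename("B")
print(aligned)
```

Because the mapping is done row by row on the key column, the result always has as many rows as the input, whatever transform does with null keys.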

I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there I started to suspect that this is a bug.

Thanks

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Jul 27, 2017

@chbrandt : Thanks for doing this! This does look a little weird to me, though perhaps @jreback or @jorisvandenbossche might have more information about this than I do.

@jongmmm

jongmmm commented Jul 31, 2017

Perhaps the same problem:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})
df.groupby('A').apply(lambda x: x) # works
df.groupby('A').transform(lambda x:x) # ValueError: Length mismatch

@kernc
Contributor

kernc commented Sep 1, 2017

In my case, I hit this error on group-resample-aggregate. The offending lines seem to be:

if isinstance(result, ABCSeries) and result.empty:
    obj = self.obj
    result.index = obj.index._shallow_copy(freq=to_offset(self.freq))

If the series is empty, is there any point in setting a potentially non-empty index on it?

@jsevo
Copy link

jsevo commented Jun 26, 2018

Edit: This happened because there were None values in the grouping column, groups. Filling them with a dummy value first works around the issue.

I encounter this when trying to fill missing values per group:

def most_common_in_group(g):
    try:
        mc = g.value_counts().index[0]
        return mc
    except IndexError:
        return 'all_missing'


df.groupby('groups')['sometime_missing_values'].transform(most_common_in_group)
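
The workaround mentioned in the edit above could look roughly like this. It is a sketch under assumptions: the frame is made up, and the sentinel "__missing__" is a hypothetical placeholder that should be any value not already used as a group key.

```python
import pandas as pd

def most_common_in_group(g):
    # value_counts() ignores missing values by default, so it can be empty.
    counts = g.value_counts()
    return counts.index[0] if len(counts) else "all_missing"

# Illustrative data: one row has no group key, another has no value.
df = pd.DataFrame({
    "groups": ["a", None, "a", "b"],
    "sometime_missing_values": ["x", "y", None, "z"],
})

# Replace missing keys with a sentinel so no row is dropped from the grouping.
keys = df["groups"].fillna("__missing__")
filled = df.groupby(keys)["sometime_missing_values"].transform(most_common_in_group)
print(filled)
```

Grouping by the filled key Series leaves the original groups column untouched, so the sentinel never ends up in the data itself.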

@mroeschke
Member

This looks to work on master now. Could use a test

In [17]: df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})

In [18]: df.groupby('A').transform(lambda x:x)
Out[18]:
   B
0  1

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby labels Jun 12, 2021
@chbrandt
Author

Side note: related Stackoverflow post was updated to account for this progress:

Thank you all very much.

@GYHHAHA
Contributor

GYHHAHA commented Jul 15, 2021

This looks to work on master now. Could use a test

In [17]: df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})

In [18]: df.groupby('A').transform(lambda x:x)
Out[18]:
   B
0  1

@mroeschke Although this case works fine, it still does not work with dropna=False:

>>> df = pd.DataFrame({"A": [1, np.nan, 1], "B": [1, 1, 1]})
>>> df.groupby("A", dropna=False)["B"].transform(lambda x: x)
... ...
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

But the following case works, which seems weird:

>>> df = pd.DataFrame({"A": [1, np.nan], "B": [1, 1]})
>>> df.groupby("A", dropna=False)["B"].transform(lambda x: x.mean())
0    1.0
1    1.0
Name: B, dtype: float64

@mroeschke mroeschke mentioned this issue Oct 31, 2021
@jreback jreback added this to the 1.4 milestone Oct 31, 2021
@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 31, 2021
@rhshadrach
Member

This issue was closed based on #17093 (comment), but this output disagrees with what was expected in OP; namely that null keys lead to null values in the output rather than no value in the output. It also disagrees with the transform docs, which say in general that the result of a transform should either be the same length or have the same index (there is inconsistency in the docs here). E.g. from DataFrameGroupBy.transform:

Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values.
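
As a rough way to check that documented contract against a given pandas install, something like the sketch below could be used. The helper name transform_keeps_index is hypothetical, the dropna argument assumes a pandas version where groupby accepts it (1.1 or later, as far as I know), and what it reports for null keys depends on the installed version.

```python
import numpy as np
import pandas as pd

def transform_keeps_index(df, key, column, func, dropna=True):
    """Return True if the transformed result is indexed like the input column."""
    try:
        result = df.groupby(key, dropna=dropna)[column].transform(func)
    except ValueError:
        # Affected versions raise "Length mismatch" at this point.
        return False
    return result.index.equals(df[column].index)

# Frame from the comment above: one null key in the grouping column.
df = pd.DataFrame({"A": [1, np.nan, 1], "B": [1, 1, 1]})
for dropna in (True, False):
    print(dropna, transform_keeps_index(df, "A", "B", lambda x: x, dropna=dropna))
```

A check along these lines might also serve as a starting point for the regression test this issue is labelled as needing.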
