Ambiguous behaviour when `transform` `groupby` with `NaN`s #17093

chbrandt · 2017-07-27T08:37:39Z

Please consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

Now, the two lines of code below do the same thing: they output the group average as the new value of each row. The first one uses a string function name, the second a lambda function. The first one works; the second doesn't.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64

In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements

The first one, using 'mean', is what I was expecting. In any case, it looks strange to me that the same operation behaves in two different ways.
Note: The second one, with the lambda function, used to work in pandas 0.19.1.
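
For readers hitting the same error, here is a minimal sketch of one possible workaround, not an official fix: run the lambda only on rows whose key is not null, then reindex back to the original index so the null-key rows come out as NaN. The small frame and the names `valid`, `means` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the frame from the report): rows 0 and 4 have a null key.
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [np.nan, 0.0, 0.0, 1.0, np.nan],
})

# Transform only the rows whose grouping key is present, so the lambda
# never sees a frame where rows were silently dropped.
valid = df["C"].notna()
means = df.loc[valid].groupby("C")["B"].transform(lambda x: x.mean())

# Reindex to the full index; rows with a null key become NaN.
result = means.reindex(df.index)
print(result)
```

The reindex step is what restores the "same length as the input" property; the groupby itself only ever sees complete keys.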

I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there I started to suspect a bug.

Thanks

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2017-07-27T09:17:53Z

@chbrandt : Thanks for doing this! This does look a little weird to me, though perhaps @jreback or @jorisvandenbossche might have more information about this than I do.

jongmmm · 2017-07-31T20:32:08Z

Perhaps the same problem:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})
df.groupby('A').apply(lambda x: x) # works
df.groupby('A').transform(lambda x:x) # ValueError: Length mismatch

kernc · 2017-09-01T13:54:13Z

In my case, I hit this error on group-resample-aggregate. The lines responsible seem to be:

pandas/pandas/core/resample.py

Lines 442 to 444 in 3e9e947

    if isinstance(result, ABCSeries) and result.empty:
        obj = self.obj
        result.index = obj.index._shallow_copy(freq=to_offset(self.freq))

If the series is empty, is there any point in setting a potentially non-empty index on it?

jsevo · 2018-06-26T22:07:39Z

Edit: This happened because there were `None` values in the grouping column, `groups`. Filling them with dummy values first gets around the issue.

~~I encounter this when trying to fill missing values per group:~~

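def most_common_in_group(g):
    try:
        # value_counts() sorts by frequency, so the first index entry is the most common value
        return g.value_counts().index[0]
    except IndexError:
        # a group with only missing values has an empty value_counts()
        return 'all_missing'
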
df.groupby('groups')['sometime_missing_values'].transform(most_common_in_group)
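
The "fill with dummies" workaround from the edit above could look roughly like this. It is a sketch under assumptions: the sample data and the sentinel `"__missing__"` are invented here, and the sentinel must not collide with a real group label.

```python
import pandas as pd

def most_common_in_group(g):
    # value_counts() ignores missing values; it is empty for an all-missing group
    counts = g.value_counts()
    return counts.index[0] if len(counts) else "all_missing"

# Illustrative data: rows 2 and 4 have no group label.
df = pd.DataFrame({
    "groups": ["a", "a", None, "b", None],
    "sometime_missing_values": ["x", None, "y", None, "y"],
})

# Replace null keys with a sentinel so no row is dropped by groupby.
filled_keys = df["groups"].fillna("__missing__")
result = df.groupby(filled_keys)["sometime_missing_values"].transform(most_common_in_group)
print(result)
```

Note that this treats all null-key rows as one ordinary group, which is a different outcome from leaving those rows as NaN.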

mroeschke · 2021-06-12T03:57:52Z

This looks to work on master now. Could use a test

In [17]: df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})

In [18]: df.groupby('A').transform(lambda x:x)
Out[18]:
   B
0  1

chbrandt · 2021-07-14T12:46:17Z

Side note: the related Stack Overflow post was updated to reflect this progress:

https://stackoverflow.com/q/45333681/687896

Thank you all very much.

GYHHAHA · 2021-07-15T15:38:27Z

This looks to work on master now. Could use a test

In [17]: df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})

In [18]: df.groupby('A').transform(lambda x:x)
Out[18]:
   B
0  1

@mroeschke Although this case works fine, it still fails with dropna=False:

>>> df = pd.DataFrame({"A": [1, np.nan, 1], "B": [1, 1, 1]})
>>> df.groupby("A", dropna=False)["B"].transform(lambda x: x)
... ...
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

But the case below works, which seems weird:

>>> df = pd.DataFrame({"A": [1, np.nan], "B": [1, 1]})
>>> df.groupby("A", dropna=False)["B"].transform(lambda x: x.mean())
0    1.0
1    1.0
Name: B, dtype: float64

rhshadrach · 2022-02-04T14:06:44Z

This issue was closed based on #17093 (comment), but this output disagrees with what was expected in OP; namely that null keys lead to null values in the output rather than no value in the output. It also disagrees with the transform docs, which say in general that the result of a transform should either be the same length or have the same index (there is inconsistency in the docs here). E.g. from DataFrameGroupBy.transform:

Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values.
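
The "like-indexed" expectation quoted above can be checked mechanically. The helper `is_like_indexed` below is hypothetical (it is not part of pandas); it is applied to the `dropna=False` case that was reported as working earlier in this thread.

```python
import numpy as np
import pandas as pd

def is_like_indexed(original, transformed):
    """Hypothetical helper: True when the result keeps the input's length and index."""
    return len(transformed) == len(original) and transformed.index.equals(original.index)

df = pd.DataFrame({"A": [1, np.nan], "B": [1, 1]})

# With dropna=False the null key forms its own group, so both rows are kept.
out = df.groupby("A", dropna=False)["B"].transform(lambda x: x.mean())
print(is_like_indexed(df, out))
```

A check like this is roughly what a regression test for null keys would assert, whatever values end up in the null-key rows.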

gfyoung added Bug Groupby labels Jul 27, 2017

WillAyd mentioned this issue Apr 25, 2019

SeriesGroupBy.transform cannot handle empty series #26208

Closed

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby labels Jun 12, 2021

mroeschke mentioned this issue Oct 31, 2021

TST: Old issues #44245

Merged

9 tasks

jreback added this to the 1.4 milestone Oct 31, 2021

jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Oct 31, 2021

jreback closed this as completed in #44245 Nov 1, 2021

rhshadrach reopened this Feb 4, 2022

rhshadrach added Apply Apply, Aggregate, Transform and removed good first issue labels Feb 4, 2022

rhshadrach self-assigned this Feb 4, 2022

This was referenced Feb 5, 2022

WIP/BUG: Correct results for groupby(...).transform with null keys #45839

Closed

BUG: Fix some cases of groupby(...).transform with dropna=True #45953

Merged

rhshadrach mentioned this issue Mar 3, 2022

BUG: Fix some cases of groupby(...).transform with dropna=True #46209

Merged

3 tasks

rhshadrach mentioned this issue Mar 15, 2022

BUG: Fix remaining cases of groupby(...).transform with dropna=True #46367

Merged

4 tasks

jreback modified the milestones: 1.4, 1.5 Mar 16, 2022

jreback closed this as completed in #46367 Mar 22, 2022
