groupby.transform inconsistent behavior when grouping by columns containing NaN #10923

briangerke · 2015-08-28T19:03:58Z

This is similar to #9697, which was fixed in 0.16.1. I give a (very) slightly modified example here to show some related behavior which is at least inconsistent and should probably be handled cleanly.

It's not entirely clear to me what the desired behavior is in this case; it's possible that transform should not work here at all, since it spits out unexpected values. But at minimum it seems like it should do the same thing no matter how I invoke it below.

Example:

import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,1,2,2], 'col2':[1,2,3,np.nan]})
# Let's try grouping on 'col2', which contains a NaN.

# Works and gives arguably reasonable results, with one unpredictable value
df.groupby('col2').transform(sum)['col1']

# Throws an unhelpful error
df.groupby('col2')['col1'].transform(sum)

Error is similar to the one encountered in the previous issue:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-2d4b83df6487> in <module>()
----> 1 df.groupby('col2')['col1'].transform(sum)

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
   2442         cyfunc = _intercept_cython(func)
   2443         if cyfunc and not args and not kwargs:
-> 2444             return self._transform_fast(cyfunc)
   2445 
   2446         # reg transform

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _transform_fast(self, func)
   2488             values = self._try_cast(values, self._selected_obj)
   2489 
-> 2490         return self._set_result_index_ordered(Series(values))
   2491 
   2492     def filter(self, func, dropna=True, *args, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _set_result_index_ordered(self, result)
    503             result = result.sort_index()
    504 
--> 505         result.index = self.obj.index
    506         return result
    507 

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __setattr__(self, name, value)
   2159         try:
   2160             object.__getattribute__(self, name)
-> 2161             return object.__setattr__(self, name, value)
   2162         except AttributeError:
   2163             pass

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in pandas.lib.AxisProperty.__set__ (pandas/lib.c:42548)()

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in _set_axis(self, axis, labels, fastpath)
    273         object.__setattr__(self, '_index', labels)
    274         if not fastpath:
--> 275             self._data.set_axis(axis, labels)
    276 
    277     def _set_subtyp(self, is_all_dates):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in set_axis(self, axis, new_labels)
   2217         if new_len != old_len:
   2218             raise ValueError('Length mismatch: Expected axis has %d elements, '
-> 2219                              'new values have %d elements' % (old_len, new_len))
   2220 
   2221         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
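
The 3-versus-4 mismatch in the message appears to come from groupby dropping the row whose key is NaN by default, so only three of the four rows land in a group while the original index still has four entries. A minimal sketch that counts the grouped rows (the name `n_grouped` is only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [1, 2, 3, np.nan]})

# Rows whose grouping key is NaN are not assigned to any group by default,
# so the per-group sizes add up to fewer rows than the frame has.
grouped = df.groupby('col2')['col1']
n_grouped = int(grouped.size().sum())

print(n_grouped, len(df))
```

The transformed values cover only the grouped rows, which would explain why assigning the full original index to them fails.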


jreback · 2015-08-28T19:13:25Z

I think we already have an issue for this; can you check?

Agreed that this could have a nicer error message / better behavior.

briangerke · 2015-08-28T20:18:43Z

Whoops, you're right. Sorry, I should have searched more thoroughly. This duplicates #9941.

chbrandt · 2017-07-26T18:35:08Z

I have a similar issue here.

Let's consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

Now, the two lines of code below do the same thing: they output each group's average as the new row values. The first one uses a string function name; the second uses a lambda function. The first one works; the second doesn't.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64

In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements

The first one, using 'mean' is what I was expecting.
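
A possible workaround, as a sketch only and not something confirmed in this thread: run the lambda transform on the rows that have a non-null key, then reindex back to the full index so the NaN-key rows come out as NaN, matching the 'mean' output above. The names `rng` and `valid` and the fixed seed are assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, only so the example is repeatable
df = pd.DataFrame({'A': rng.rand(20),
                   'B': rng.rand(20) * 10,
                   'C': rng.randint(0, 5, 20).astype(float)})
df.loc[:4, 'C'] = np.nan  # first five rows get a missing group key

# Transform only the rows with a usable key ...
valid = df[df['C'].notnull()]
result = valid.groupby('C')['B'].transform(lambda x: x.mean())

# ... then restore the original index; rows with a NaN key become NaN.
result = result.reindex(df.index)
```

This keeps the result the same length as the original frame, so it can be assigned back as a new column.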

I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to think that this is a bug.

Thanks

gfyoung · 2017-07-26T19:57:19Z

I suspect it has to do with the way in which we aggregate results at the end. Please post this as a separate issue and reference this one.

briangerke closed this as completed Aug 28, 2015

jreback added the Bug, Groupby, and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Aug 28, 2015

chbrandt mentioned this issue Jul 27, 2017

Ambiguous behaviour when transform groupby with NaNs #17093

Closed
