Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groupby.transform inconsistent behavior when grouping by columns containing NaN #10923

Closed
briangerke opened this issue Aug 28, 2015 · 4 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@briangerke
Copy link

This is similar to #9697, which was fixed in 0.16.1. I give a (very) slightly modified example here to show some related behavior which is at least inconsistent and should probably be handled cleanly.

It's not entirely clear to me what the desired behavior is in this case; it's possible that transform should not work here at all, since it spits out unexpected values. But at minimum it seems like it should do the same thing no matter how I invoke it below.

Example:

import numpy as np
df = pd.DataFrame({'col1':[1,1,2,2], 'col2':[1,2,3,np.nan])
#Let's try grouping on 'col2', which contains a NaN.

# Works and gives arguably reasonable results, with one unpredictable value
df.groupby('col2').transform(sum)['col1']

# Throws an unhelpful error
df.groupby('col2')['col1'].transform(sum)

Error is similar to the one encountered in the previous issue:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-2d4b83df6487> in <module>()
----> 1 df.groupby('col2')['col1'].transform(sum)

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
   2442         cyfunc = _intercept_cython(func)
   2443         if cyfunc and not args and not kwargs:
-> 2444             return self._transform_fast(cyfunc)
   2445 
   2446         # reg transform

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _transform_fast(self, func)
   2488             values = self._try_cast(values, self._selected_obj)
   2489 
-> 2490         return self._set_result_index_ordered(Series(values))
   2491 
   2492     def filter(self, func, dropna=True, *args, **kwargs):

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _set_result_index_ordered(self, result)
    503             result = result.sort_index()
    504 
--> 505         result.index = self.obj.index
    506         return result
    507 

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __setattr__(self, name, value)
   2159         try:
   2160             object.__getattribute__(self, name)
-> 2161             return object.__setattr__(self, name, value)
   2162         except AttributeError:
   2163             pass

/usr/local/lib/python2.7/dist-packages/pandas/lib.so in pandas.lib.AxisProperty.__set__ (pandas/lib.c:42548)()

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in _set_axis(self, axis, labels, fastpath)
    273         object.__setattr__(self, '_index', labels)
    274         if not fastpath:
--> 275             self._data.set_axis(axis, labels)
    276 
    277     def _set_subtyp(self, is_all_dates):

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in set_axis(self, axis, new_labels)
   2217         if new_len != old_len:
   2218             raise ValueError('Length mismatch: Expected axis has %d elements, '
-> 2219                              'new values have %d elements' % (old_len, new_len))
   2220 
   2221         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements

@jreback
Copy link
Contributor

jreback commented Aug 28, 2015

I think we already have an issue for this - can u check?

agree that could be a nicer error message / better behavior

@briangerke
Copy link
Author

Whoops, you're right. Sorry, I should have searched more thoroughly. This duplicates #9941.

@jreback jreback added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Aug 28, 2015
@chbrandt
Copy link

chbrandt commented Jul 26, 2017

I have a similar issue here.

Let's consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

Now, there are two code lines below that do the same think: to output the average of groups as the new rows values. The first one uses a string function name, the second one, a lambda function. The first one works, the second, doesn't.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements

The first one, using 'mean' is what I was expecting.

I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there I started to think that a bug is around.

Thanks

@gfyoung
Copy link
Member

gfyoung commented Jul 26, 2017

I suspect it has to deal with the way in which we aggregate results at the end. Post this as a separate issue and reference this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

No branches or pull requests

4 participants