pandas.core.groupby.GroupBy.apply fails #20949

Closed
MBlistein opened this Issue May 4, 2018 · 7 comments

MBlistein commented May 4, 2018

Code Sample:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})
>>> g = df.groupby('A')
>>> g.apply(lambda x: x / x.sum())

Problem description

Applying a function to a grouped data frame fails. The code above is the example code from the official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html

Output to the above code:

/usr/local/lib/python2.7/dist-packages/pandas/core/computation/check.py:17: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.4.6

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 805, in apply
    return self._python_apply_general(f)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 809, in _python_apply_general
    self.axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1969, in apply
    res = f(group)
  File "<stdin>", line 1, in <lambda>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1262, in f
    return self._combine_series(other, na_op, fill_value, axis, level)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3944, in _combine_series
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3958, in _combine_series_infer
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 3981, in _combine_match_columns
    try_cast=try_cast)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3435, in eval
    return self.apply('eval', **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1377, in eval
    result = get_result(other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1346, in get_result
    result = func(values, other)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 1216, in na_op
    yrav.fill(yrav.item())
ValueError: can only convert an array of size 1 to a Python scalar

The error can be 'fixed' by applying another command to the grouped object first:

>>> g.sum()
   B   C
A       
a  3  10
b  3   5

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

Expected Output

>>> g.apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0
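
A possible workaround on affected versions, sketched under the assumption that only the numeric columns B and C need to be normalized: select those columns explicitly and use transform, so the string grouping column A never reaches the lambda. The variable name `normalized` is illustrative and not part of the original report.

```python
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})

# Assumption: only 'B' and 'C' should be divided by their per-group sums.
# Selecting them up front keeps the string column 'A' out of the function;
# transform returns a frame aligned to the original index of df.
normalized = df.groupby('A')[['B', 'C']].transform(lambda x: x / x.sum())
print(normalized)
```

This does not depend on whether another aggregation was called on the grouped object beforehand, since the column selection is stated in the call itself.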

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-122-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.utf8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.0.11
pymysql: 0.7.2.None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@MBlistein MBlistein changed the title from pandas.core.groupby.GroupBy.apply is broken to pandas.core.groupby.GroupBy.apply fails May 4, 2018

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone May 4, 2018

Contributor

TomAugspurger commented May 4, 2018

Thanks for the bug report.

Member

WillAyd commented May 4, 2018

Hmm interesting. FWIW when I remove numexpr I can't get this to run at all, regardless of whether or not I run another agg function first.

Member

WillAyd commented May 4, 2018

Numexpr may be a red herring. From what I can tell the problem occurs at the following line of code:

results, mutated = reduction.apply_frame_axis0(sdata, f, names,

When apply is run without another agg function first, sdata includes the grouping column as part of the data and throws here, causing it to go down another path. sdata comes from _selected_obj.

Agg functions like sum, mean, etc. call _set_group_selection, which takes care of setting the appropriately cached value for _selected_obj. I suppose a quick fix is to add a call to that at the beginning of apply, though I can't tell from the code alone why that isn't done across the board.
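
To illustrate the point about the grouping column being part of the data, here is a rough sketch using only public API. It is not how apply builds its groups internally; the dictionaries `group_columns` and `normalized_parts` are hypothetical names introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})

group_columns = {}
normalized_parts = {}
for key, group in df.groupby('A'):
    # Iterating over the groupby yields the full sub-frame,
    # with the grouping column 'A' still present.
    group_columns[key] = list(group.columns)
    # Dropping 'A' leaves only numeric data, so the division is well defined.
    numeric = group.drop(columns='A')
    normalized_parts[key] = numeric / numeric.sum()

print(group_columns)
print(normalized_parts['a'])
```

With 'A' left in the sub-frame, the user function has to cope with a string column, which matches the mixed-dtype situation described above.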

cc @jreback for any insight

Contributor

Dr-Irv commented May 4, 2018

Here's another example that fails with 0.23rc2 (and in 0.22.0 as well), based on code from pandas\core\indexes\datetimes.py in test_agg_timezone_round_trip:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.23.0rc2'

In [3]: dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific')
   ...:          for i in range(1, 5)]
   ...: df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
   ...: grouped = df.groupby('A')
   ...:

In [4]: df
Out[4]:
   A                         B
0  a 2016-01-01 12:00:00-08:00
1  b 2016-01-02 12:00:00-08:00
2  a 2016-01-03 12:00:00-08:00
3  b 2016-01-04 12:00:00-08:00

In [5]: grouped.apply(lambda x: x.iloc[0])[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1499         else:
-> 1500             raise KeyError(val)
   1501

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-2b16555d6e05> in <module>()
----> 1 grouped.apply(lambda x: x.iloc[0])[0]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in __getitem__(self, key)
   2685             return self._getitem_multilevel(key)
   2686         else:
-> 2687             return self._getitem_column(key)
   2688
   2689     def _getitem_column(self, key):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\frame.py in _getitem_column(self, key)
   2692         # get column
   2693         if self.columns.is_unique:
-> 2694             return self._get_item_cache(key)
   2695
   2696         # duplicate columns & possible reduce dimensionality

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\generic.py in _get_item_cache(self, item)
   2485         res = cache.get(item)
   2486         if res is None:
-> 2487             values = self._data.get(item)
   2488             res = self._box_item_values(item, values)
   2489             cache[item] = res

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\internals.py in get(self, item, fastpath)
   4113
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5720)()
    137             util.set_value_at(arr, loc, value)
    138
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))


C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5566)()
    159
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22442)()
   1490                                        sizeof(uint32_t)) # flags
   1491
-> 1492     cpdef get_item(self, object val):
   1493         cdef khiter_t k
   1494         if val != val or val is None:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:22396)()
   1498             return self.table.vals[k]
   1499         else:
-> 1500             raise KeyError(val)
   1501
   1502     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: 0

However, if you do the following, it works:

In [6]: grouped.nth(0)['B'].iloc[0]
Out[6]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

In [7]: grouped.apply(lambda x: x.iloc[0])[0]
Out[7]: Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

So doing one operation (in this case nth) prior to the apply then makes the apply work.
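
If the goal is just the first B value per group, a sketch that avoids relying on a prior call might look like the following. It sidesteps the failure rather than explaining it, and assumes that selecting column B explicitly is acceptable; `first_b` is an illustrative name.

```python
import pandas as pd

dates = [pd.Timestamp("2016-01-0%d 12:00:00" % i, tz='US/Pacific')
         for i in range(1, 5)]
df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})

# Select column 'B' explicitly, then take the first value in each group.
first_b = df.groupby('A')['B'].first()
print(first_b.iloc[0])
```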

Member

WillAyd commented May 4, 2018

@Dr-Irv seems related. Some code below illustrating what I think is going on:

>>> grouped.apply(lambda x: x.iloc[0])[0]  # KeyError as indicator
KeyError

>>> grouped._set_group_selection()
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Works now, as 'A' was not part of data
Timestamp('2016-01-01 12:00:00-0800', tz='US/Pacific')

>>> grouped._reset_group_selection()  # Clear out the group selection
>>> grouped.apply(lambda x: x.iloc[0])[0]  # Back to failing
KeyError

Unfortunately, just adding this call before _python_apply_general broke other tests where the grouping was supposed to be part of the returned object (at least according to the tests). I'm reviewing in more detail and hope to have a PR soon.

Contributor

jreback commented May 5, 2018

This didn't work even in 0.20.3. Not sure how we don't have a test for it, though.

Contributor

jreback commented May 5, 2018

@Dr-Irv your example is a separate issue. Please make a new report for that one.

jreback added a commit to jreback/pandas that referenced this issue May 5, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column

closes pandas-dev#20949

jreback added a commit to jreback/pandas that referenced this issue May 7, 2018

BUG in .groupby.apply when applying a function that has mixed data types and the user supplied function can fail on the grouping column

closes pandas-dev#20949

@WillAyd WillAyd referenced this issue May 9, 2018

Merged

Consistent Return Structure for Rolling Apply #20984

4 of 4 tasks complete