Grouped mutate raises error when a grouping column has NAs #251

machow · 2020-07-11T21:19:04Z

Running code like the example below raises an error, because pandas by default drops group keys that are NA.

from siuba.data import mtcars
from siuba import _, mutate, group_by

new_cars = mtcars.copy()
new_cars[new_cars.cyl == 4] = None

new_cars >> group_by(_.cyl) >> mutate(diff = _.mpg.min() - _.mpg.max())

As of pandas 1.1 (still in development), you can pass a dropna argument to groupby. This should fix it going forward, but I'm not sure of the best strategy with previous versions.
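A minimal sketch of what that argument changes, assuming pandas 1.1 or later. The toy frame and its values are made up for illustration (they are not taken from mtcars), and the aggregation mirrors the `min() - max()` expression from the example above rather than siuba's internals.

```python
import numpy as np
import pandas as pd

# Toy data: two rows have an NA grouping key.
df = pd.DataFrame({
    "cyl": [4.0, 6.0, np.nan, np.nan],
    "mpg": [30.0, 21.0, 18.0, 15.0],
})

# Default behaviour: rows whose group key is NA are silently left out.
default_sizes = df.groupby("cyl").size()

# With dropna=False (pandas >= 1.1), NA keys form their own group.
sizes_with_na = df.groupby("cyl", dropna=False).size()

# Per-group min - max, now including the NA group.
spread = df.groupby("cyl", dropna=False)["mpg"].agg(lambda s: s.min() - s.max())

print(default_sizes)
print(sizes_with_na)
print(spread)
```

With the default, the two NA-keyed rows never reach any group, which is why a grouped result can no longer be lined up with the original index.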

I'd lean toward:

- for pandas >= 1.1, updating the behavior to use dropna=False (this will make it consistent with dplyr)
- for pandas < 1.1, at the very least, raising an explicit error when there are NAs in the group keys
- for pandas < 1.1, at best, operating while dropping NA rows (pandas' behavior) and raising a warning
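The version-dependent strategy could be sketched roughly as follows. `groupby_keep_na` and `PANDAS_VERSION` are hypothetical names used only for illustration; this is not siuba's actual implementation, and it takes the "warn and drop" option for older pandas as an assumption.

```python
import warnings

import numpy as np
import pandas as pd

# (major, minor) of the installed pandas, e.g. (1, 1).
PANDAS_VERSION = tuple(int(part) for part in pd.__version__.split(".")[:2])


def groupby_keep_na(df, by):
    """Hypothetical helper (not part of siuba): group without dropping NA keys.

    On pandas >= 1.1 this passes dropna=False. On older versions it falls
    back to pandas' default (NA rows dropped) and warns if that loses rows.
    """
    keys = [by] if isinstance(by, str) else list(by)

    if PANDAS_VERSION >= (1, 1):
        return df.groupby(keys, dropna=False)

    if df[keys].isna().any().any():
        warnings.warn(
            "Grouping columns contain NA values; this pandas version drops "
            "those rows. Upgrade to pandas >= 1.1 to keep them.",
            UserWarning,
        )
    return df.groupby(keys)


df = pd.DataFrame({
    "cyl": [4.0, 6.0, np.nan, np.nan],
    "mpg": [30.0, 21.0, 18.0, 15.0],
})
group_sizes = groupby_keep_na(df, "cyl").size()
print(group_sizes)
```

Raising an explicit error instead of warning would only require swapping `warnings.warn` for a `raise` in the fallback branch.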

Traceback:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
 in 
      6 new_cars[new_cars.cyl == 4] = None
      7 
----> 8 new_cars >> group_by(_.cyl) >> mutate(diff = _.mpg.min() - _.mpg.max())

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in __rrshift__(self, x)
     87             return Pipeable(calls = [x] + self.calls)
     88
---> 89         return self(x)
     90
     91     def __call__(self, x):

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in __call__(self, x)
     92         res = x
     93         for f in self.calls:
---> 94             res = f(res)
     95         return res
     96

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/siu.py in __call__(self, x)
    193             return getattr(inst, *rest)
    194         elif self.func == "__call__":
--> 195             return getattr(inst, self.func)(*rest, **kwargs)
    196
    197         # in normal case, get method to call, and then call it

~/.pyenv/versions/3.6.8/lib/python3.6/functools.py in wrapper(*args, **kw)
    805                             '1 positional argument')
    806
--> 807         return dispatch(args[0].__class__)(*args, **kw)
    808
    809     funcname = getattr(func, '__name__', 'singledispatch function')

~/Dropbox/Repo/siublocks-org/purview/utils/siututor/siututor.py in wrapper(*args, **kwargs)
     30             return Blank()
     31
---> 32         return f(*args, **kwargs)
     33
     34     return wrapper

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in _mutate(__data, **kwargs)
    283     # will drop all but original index
    284     group_by_lvls = list(range(df.index.nlevels - 1))
--> 285     g_df = df.reset_index(group_by_lvls, drop = True).loc[orig_index].groupby(groupings)
    286
    287     return g_df

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1766
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769
   1770     def _is_scalar_access(self, key: Tuple):

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1952                     raise ValueError("Cannot index with multidimensional key")
   1953
-> 1954                 return self._getitem_iterable(key, axis=axis)
   1955
   1956             # nested tuple slicing

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1593         else:
   1594             # A collection of keys
-> 1595             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1596             return self.obj._reindex_with_indexers(
   1597                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1551
   1552         self._validate_read_indexer(
-> 1553             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1554         )
   1555         return keyarr, indexer

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1653             if not (ax.is_categorical() or ax.is_interval()):
   1654                 raise KeyError(
-> 1655                     "Passing list-likes to .loc or [] with any missing labels "
   1656                     "is no longer supported, see "
   1657                     "https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"  # noqa:E501

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

machow · 2020-07-29T16:48:58Z

Note, pandas 1.1 is released, so this issue can be addressed!

https://pandas.pydata.org/docs/whatsnew/v1.1.0.html#allow-na-in-groupby-key

machow · 2022-11-16T15:54:01Z

Note that for some reason the above code succeeds on pandas v1.5.1 (I believe there's an issue open, and that this is a regression?), but we should still set dropna=False in all grouped operations.

machow · 2022-11-16T18:08:56Z

Addressed in v0.4.2

machow mentioned this issue Jan 11, 2022

DRAFT(pandas): do not dropna in group_by #367 (Closed)

machow added the api:verb, dplyr:parity, and type:bug labels Jan 11, 2022

This was referenced Nov 16, 2022

Grouped summarize fails when a grouping col has NAs and < 2 other levels #458 (Closed)

fix: summarize raising error when a grouping col is all NA (or mostly NA) #459 (Merged)

machow closed this as completed Nov 16, 2022
