Grouped mutate raises error when a grouping column has NAs #251

Closed
machow opened this issue Jul 11, 2020 · 3 comments
Labels
api:verb dplyr:parity Enables a dplyr behavior type:bug Something isn't working

Comments

@machow
Owner

machow commented Jul 11, 2020

Running the code below raises a KeyError, because pandas by default drops groups whose key is NA.

from siuba.data import mtcars
from siuba import _, mutate, group_by

new_cars = mtcars.copy()
new_cars[new_cars.cyl == 4] = None

new_cars >> group_by(_.cyl) >> mutate(diff = _.mpg.min() - _.mpg.max())

As of pandas 1.1 (still in development), you can pass a dropna argument to groupby. This should fix it going forward, but I'm not sure of the best strategy with previous versions.
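To make the difference concrete, here is a minimal sketch of the dropna argument mentioned above. It assumes pandas >= 1.1 and uses a small made-up DataFrame rather than mtcars; the column names are only borrowed for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (not mtcars): two rows have a missing grouping key.
df = pd.DataFrame({
    "cyl": [4.0, 6.0, np.nan, np.nan],
    "mpg": [30.0, 20.0, 25.0, 15.0],
})

# Default behavior (dropna=True): rows whose key is NaN are left out.
default_sizes = df.groupby("cyl").size()

# pandas >= 1.1: dropna=False keeps NaN as its own group.
keep_na_sizes = df.groupby("cyl", dropna=False).size()

print(default_sizes)
print(keep_na_sizes)
```

With the default, the two NaN-keyed rows belong to no group at all, which is why a grouped mutate cannot line its result back up with the original rows.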

I'd lean toward...

  • for pandas >= 1.1, updating the behavior to use dropna=False (this will make it consistent with dplyr)
  • at the very least for pandas < 1.1, raising an explicit error when group keys contain NAs
  • at best for pandas < 1.1, having it operate while dropping NA rows (pandas' behavior), and raising a warning.

Traceback:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
 in 
      6 new_cars[new_cars.cyl == 4] = None
      7 
----> 8 new_cars >> group_by(_.cyl) >> mutate(diff = _.mpg.min() - _.mpg.max())

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in __rrshift__(self, x)
87 return Pipeable(calls = [x] + self.calls)
88
---> 89 return self(x)
90
91 def __call__(self, x):

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in __call__(self, x)
92 res = x
93 for f in self.calls:
---> 94 res = f(res)
95 return res
96

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/siu.py in __call__(self, x)
193 return getattr(inst, *rest)
194 elif self.func == "__call__":
--> 195 return getattr(inst, self.func)(*rest, **kwargs)
196
197 # in normal case, get method to call, and then call it

~/.pyenv/versions/3.6.8/lib/python3.6/functools.py in wrapper(*args, **kw)
805 '1 positional argument')
806
--> 807 return dispatch(args[0].__class__)(*args, **kw)
808
809 funcname = getattr(func, '__name__', 'singledispatch function')

~/Dropbox/Repo/siublocks-org/purview/utils/siututor/siututor.py in wrapper(*args, **kwargs)
30 return Blank()
31
---> 32 return f(*args, **kwargs)
33
34 return wrapper

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/siuba/dply/verbs.py in _mutate(__data, **kwargs)
283 # will drop all but original index
284 group_by_lvls = list(range(df.index.nlevels - 1))
--> 285 g_df = df.reset_index(group_by_lvls, drop = True).loc[orig_index].groupby(groupings)
286
287 return g_df

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1766
1767 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768 return self._getitem_axis(maybe_callable, axis=axis)
1769
1770 def _is_scalar_access(self, key: Tuple):

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1952 raise ValueError("Cannot index with multidimensional key")
1953
-> 1954 return self._getitem_iterable(key, axis=axis)
1955
1956 # nested tuple slicing

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1593 else:
1594 # A collection of keys
-> 1595 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1596 return self.obj._reindex_with_indexers(
1597 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1551
1552 self._validate_read_indexer(
-> 1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1554 )
1555 return keyarr, indexer

~/.virtualenvs/siublocks-org/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1653 if not (ax.is_categorical() or ax.is_interval()):
1654 raise KeyError(
-> 1655 "Passing list-likes to .loc or [] with any missing labels "
1656 "is no longer supported, see "
1657 "https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike" # noqa:E501

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

@machow
Owner Author

machow commented Jul 29, 2020

Note: pandas 1.1 has been released, so this issue can now be addressed!

https://pandas.pydata.org/docs/whatsnew/v1.1.0.html#allow-na-in-groupby-key

@machow machow added api:verb dplyr:parity Enables a dplyr behavior type:bug Something isn't working labels Jan 11, 2022
@machow
Owner Author

machow commented Nov 16, 2022

Note that, for some reason, the above code succeeds on pandas v1.5.1 (I believe there is an open issue for this, and that it is a regression), but we should still set dropna=False in all grouped operations.

@machow
Owner Author

machow commented Nov 16, 2022

Addressed in v0.4.2

@machow machow closed this as completed Nov 16, 2022