Grouped summarize fails when a grouping col has NAs and < 2 other levels #458

Closed
machow opened this issue Nov 16, 2022 · 1 comment

Comments

@machow
Owner

machow commented Nov 16, 2022

For a grouped summarize, when a grouping column...

  • has all NA values, it raises an error.
  • has only 1 level aside from NAs, the grouping columns are not put back on the result.

AFAICT setting groupby(..., dropna=False) resolves this (cf #251)
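To show what `dropna=False` changes, here is a minimal sketch in plain pandas. It uses a small made-up frame (not the `cars` data) and does not go through siuba, so it illustrates only the pandas behaviour, not siuba's actual fix.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "cyl" has one non-NA level, the rest are NA.
df = pd.DataFrame({
    "cyl": [1.0, np.nan, np.nan],
    "hp": [110, 110, 93],
    "mpg": [21.0, 22.8, 21.4],
})

# Default (dropna=True): rows whose grouping key contains NA are dropped.
dropped = df.groupby(["cyl", "hp"])["mpg"].mean().reset_index()

# dropna=False: NA is kept as its own group level.
kept = df.groupby(["cyl", "hp"], dropna=False)["mpg"].mean().reset_index()

# Same comparison when the grouping column is entirely NA.
all_na = df.assign(cyl=np.nan)
dropped_all_na = all_na.groupby(["cyl", "hp"])["mpg"].mean().reset_index()
kept_all_na = all_na.groupby(["cyl", "hp"], dropna=False)["mpg"].mean().reset_index()

print(len(dropped), len(kept))                # 1 3
print(len(dropped_all_na), len(kept_all_na))  # 0 2
```

With the default, the all-NA case leaves no groups at all, which is the edge case this issue is about.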

Example: a grouping column with all NA values raises an error, since the grouping columns end up on both the result and the index

import numpy as np
from siuba import _, group_by, summarize
from siuba.data import cars

cars6 = cars.copy()
cars6["cyl"] = np.nan

cars6 >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

Raises

ValueError: cannot insert cyl, already exists
Full traceback
ValueError                                Traceback (most recent call last)
Cell In [23], line 4
      1 cars6 = cars.copy()
      2 cars6["cyl"] = np.nan
----> 4 cars6 >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/siu/calls.py:214, in Call.__rrshift__(self, x)
    210 if isinstance(strip_symbolic(x), (Call)):
    211     # only allow non-calls (i.e. data) on the left.
    212     raise TypeError()
--> 214 return self(x)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/siu/calls.py:189, in Call.__call__(self, x)
    187     return operator.getitem(inst, *rest)
    188 elif self.func == "__call__":
--> 189     return getattr(inst, self.func)(*rest, **kwargs)
    191 # in normal case, get method to call, and then call it
    192 f_op = getattr(operator, self.func)

File ~/.pyenv/versions/3.8.12/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/dply/verbs.py:564, in _summarize(__data, *args, **kwargs)
    561 df = __data.apply(df_summarize, *args, **kwargs)
    563 group_by_lvls = list(range(df.index.nlevels - 1))
--> 564 out = df.reset_index(group_by_lvls)
    565 out.index = pd.RangeIndex(df.shape[0])
    567 return out

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/core/frame.py:6350, in DataFrame.reset_index(self, level, drop, inplace, col_level, col_fill, allow_duplicates, names)
   6344         if lab is not None:
   6345             # if we have the codes, extract the values with a mask
   6346             level_values = algorithms.take(
   6347                 level_values, lab, allow_fill=True, fill_value=lev._na_value
   6348             )
-> 6350         new_obj.insert(
   6351             0,
   6352             name,
   6353             level_values,
   6354             allow_duplicates=allow_duplicates,
   6355         )
   6357 new_obj.index = new_index
   6358 if not inplace:

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/core/frame.py:4806, in DataFrame.insert(self, loc, column, value, allow_duplicates)
   4800     raise ValueError(
   4801         "Cannot specify 'allow_duplicates=True' when "
   4802         "'self.flags.allows_duplicate_labels' is False."
   4803     )
   4804 if not allow_duplicates and column in self.columns:
   4805     # Should this be a different kind of error??
-> 4806     raise ValueError(f"cannot insert {column}, already exists")
   4807 if not isinstance(loc, int):
   4808     raise TypeError("loc must be int")

ValueError: cannot insert cyl, already exists
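
The final error in the traceback comes from pandas itself. As a rough sketch (a made-up frame, independent of siuba's internals), `reset_index` raises the same message whenever an index level shares its name with an existing column:

```python
import pandas as pd

# Hypothetical frame: "cyl" is both a column and the name of the index.
df = pd.DataFrame(
    {"cyl": [4, 6], "res": [26.7, 19.7]},
    index=pd.Index([4, 6], name="cyl"),
)

try:
    # Tries to insert the index as a column named "cyl", which already exists.
    df.reset_index()
    message = None
except ValueError as err:
    message = str(err)

print(message)

# Dropping the index instead of inserting it avoids the clash.
flat = df.reset_index(drop=True)
```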

Example: a grouping column with only 1 non-NA level outputs a table without the grouping columns

cars5 = cars.copy()
cars5["cyl"] = [1] + [np.nan] * (len(cars) - 1)

cars5  >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

Output

Note there's no cyl or hp column on the result

image

@machow
Owner Author

machow commented Nov 16, 2022

Addressed in v0.4.2

@machow machow closed this as completed Nov 16, 2022