BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna #35014

Closed
TomAugspurger opened this issue Jun 26, 2020 · 2 comments · Fixed by #35078

@TomAugspurger
Contributor

Code Sample, a copy-pastable example

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
In [3]: gb = df.groupby("A", dropna=False)
In [6]: gb['B'].transform(len)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-3bae7d67a46f> in <module>
----> 1 gb['B'].transform(len)

~/sandbox/pandas/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
    471         if not isinstance(func, str):
    472             return self._transform_general(
--> 473                 func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
    474             )
    475

~/sandbox/pandas/pandas/core/groupby/generic.py in _transform_general(self, func, engine, engine_kwargs, *args, **kwargs)
    537
    538         result.name = self._selected_obj.name
--> 539         result.index = self._selected_obj.index
    540         return result
    541

~/sandbox/pandas/pandas/core/generic.py in __setattr__(self, name, value)
   5141         try:
   5142             object.__getattribute__(self, name)
-> 5143             return object.__setattr__(self, name, value)
   5144         except AttributeError:
   5145             pass

~/sandbox/pandas/pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()
     64
     65     def __set__(self, obj, value):
---> 66         obj._set_axis(self.axis, value)

~/sandbox/pandas/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    422         if not fastpath:
    423             # The ensure_index call above ensures we have an Index object
--> 424             self._mgr.set_axis(axis, labels)
    425
    426     # ndarray compatibility

~/sandbox/pandas/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    213         if new_len != old_len:
    214             raise ValueError(
--> 215                 f"Length mismatch: Expected axis has {old_len} elements, new "
    216                 f"values have {new_len} elements"
    217             )

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements

Problem description

Compare that with the following, which both work:

In [4]: gb.transform(len)
Out[4]:
   B
0  2
1  2
2  1
3  1

In [5]: gb[['B']].transform(len)
Out[5]:
   B
0  2
1  2
2  1
3  1

So it's just when slicing down to a SeriesGroupBy object.
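
To make the distinction concrete, here is a small check of which groupby type each selection produces. It is only a sketch that reuses the frame from the report; it inspects types and does not call transform, so it should behave the same whether or not the bug is present.

```python
import pandas as pd

# Same frame as in the report above.
df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
gb = df.groupby("A", dropna=False)

# Selecting with a scalar label gives a SeriesGroupBy (the failing path).
scalar_selection = type(gb["B"]).__name__
# Selecting with a list of labels keeps a DataFrameGroupBy (the working path).
list_selection = type(gb[["B"]]).__name__

print(scalar_selection)  # SeriesGroupBy
print(list_selection)    # DataFrameGroupBy
```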

Expected Output

A series:

Out[5]:
0  2
1  2
2  1
3  1
@TomAugspurger TomAugspurger added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 26, 2020
@arw2019
Member

arw2019 commented Jun 27, 2020

take

@arw2019
Member

arw2019 commented Jun 30, 2020

@TomAugspurger I think that the problem in SeriesGroupBy.transform comes down to L387-388 in pandas/core/groupby/grouper.py:

values = ensure_categorical(self.grouper)

For gb['B'], if we print out values, NaN is getting dropped (and this then propagates to where the original issue came up):

self.grouper = [ 0.  0.  1. nan]

values = {'_dtype': CategoricalDtype(categories=[0.0, 1.0], ordered=False), '_codes': array([ 0,  0,  1, -1], dtype=int8)}
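
The same effect can be reproduced outside the groupby internals. The sketch below uses the public pd.Categorical constructor as a stand-in for the internal ensure_categorical call (that substitution is an assumption on my part, not the exact code path): NaN never becomes a category and is encoded with the sentinel code -1.

```python
import numpy as np
import pandas as pd

# Same grouping keys as column "A" in the report.
keys = np.array([0.0, 0.0, 1.0, np.nan])

# Stand-in for the internal ensure_categorical call (assumption).
cat = pd.Categorical(keys)

# NaN is not a category; it is represented by the sentinel code -1.
categories = list(cat.categories)
codes = cat.codes.tolist()

print(categories)  # [0.0, 1.0]
print(codes)       # [0, 0, 1, -1]
```

Anything that later builds its groups from the categories alone will therefore see two groups rather than three, which is consistent with the "Expected axis has 3 elements" message above.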

Since what we need from indices is a dict mapping each value to the positions where it occurs, a quick solution could be to build that on the fly in Grouping.indices. Would that work, or would we want to do something else?
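
To illustrate what "on the fly" could mean here, below is a rough pure-Python sketch of building such a mapping while keeping NaN as its own group. The names build_indices and NAN_KEY are hypothetical and this is not pandas code or the eventual fix; it only shows the shape of the dict being proposed.

```python
import math

# Single shared NaN object used as the dict key for all missing values
# (hypothetical helper, not part of pandas). Because NaN != NaN, lookups
# only succeed with this exact object.
NAN_KEY = float("nan")


def build_indices(keys):
    """Map each distinct key to the list of positions where it occurs."""
    out = {}
    for pos, key in enumerate(keys):
        # Collapse every float NaN onto one key so they form a single group.
        if isinstance(key, float) and math.isnan(key):
            key = NAN_KEY
        out.setdefault(key, []).append(pos)
    return out


indices = build_indices([0.0, 0.0, 1.0, float("nan")])
print(len(indices))      # 3
print(indices[0.0])      # [0, 1]
print(indices[1.0])      # [2]
print(indices[NAN_KEY])  # [3]
```

With the NaN group retained, the mapping covers all four rows, so a per-group result can be aligned back to the original four-element index.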
