BUG: groupby(..., dropna=False).filter() never includes rows with NaNs in the index #44517

Open
2 of 3 tasks
pganssle opened this issue Nov 18, 2021 · 4 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@pganssle
Contributor

pganssle commented Nov 18, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

nan = float('nan')
df = pd.DataFrame(
    [
        [1, 2, 1],
        [1, 2, 2],
        [2, nan, 3],
        [2, nan, 4],
        [3, 3, 5],
    ],
    columns=["a", "b", "c"],
)

gb = df.groupby(["a", "b"], dropna=False)
filtered = gb.filter(lambda x: len(x) > 1)
print(filtered)
#    a    b  c
# 0  1  2.0  1
# 1  1  2.0  2

# Expecting:
#    a    b  c
# 0  1  2.0  1
# 1  1  2.0  2
# 2  2  NaN  3
# 3  2  NaN  4


# Note that there is no problem with agg:
print()
agged = gb.agg(len)
print(agged)
#        c
# a b
# 1 2.0  2
# 2 NaN  2
# 3 3.0  1

Issue Description

When using groupby(..., dropna=False) and then applying a filter, rows whose group key includes a NaN are always dropped, even if their group passes the filter. If you add dropna=False to the filter call as well, the rows are kept but filled with NaNs.

Expected Behavior

With dropna=False, NaNs should be treated as a distinct group key rather than dropped automatically.
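
The expected semantics can be illustrated without pandas. The following is a minimal pure-Python sketch, not pandas' implementation: `group_key` and `filter_groups` are hypothetical helpers, and the `"<NaN>"` sentinel is an assumption made so that every NaN lands in the same group.

```python
import math
from collections import defaultdict

# Hypothetical sentinel standing in for NaN inside group keys.
NAN_KEY = "<NaN>"


def group_key(values):
    """Build a hashable key in which every NaN is replaced by one sentinel."""
    return tuple(
        NAN_KEY if isinstance(v, float) and math.isnan(v) else v for v in values
    )


def filter_groups(rows, key_columns, predicate):
    """Keep the rows whose group satisfies predicate, in original order."""
    groups = defaultdict(list)
    for position, row in enumerate(rows):
        groups[group_key(row[i] for i in key_columns)].append(position)
    kept = []
    for positions in groups.values():
        if predicate(positions):
            kept.extend(positions)
    return [rows[p] for p in sorted(kept)]


# Each NaN is a separate object, as it could be in real data.
rows = [
    (1, 2, 1),
    (1, 2, 2),
    (2, float("nan"), 3),
    (2, float("nan"), 4),
    (3, 3, 5),
]

# Same filter as in the report: keep groups with more than one row.
filtered = filter_groups(rows, key_columns=(0, 1), predicate=lambda g: len(g) > 1)
kept_c = [row[2] for row in filtered]
print(kept_c)  # [1, 2, 3, 4]
```

The rows with `c` equal to 3 and 4 survive because both NaN keys normalize to the same group, which is the behavior the report expects from `dropna=False`.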

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.9.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.16-arch1-1
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@pganssle pganssle added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 18, 2021
@mroeschke mroeschke added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 21, 2021
@abatomunkuev
Contributor

Hello! I would like to work on this issue.

If we look at the filter function definition
https://github.com/pandas-dev/pandas/blob/v1.3.4/pandas/core/groupby/generic.py#L1456-L1522

def filter(self, func, dropna: bool = True, *args, **kwargs):

We can see that the dropna argument defaults to True. The function then calls _apply_filter, passing the indices and dropna arguments:

filtered = self._apply_filter(indices, dropna)

There might be a problem in the _apply_filter function:

def _apply_filter(self, indices, dropna):
    if len(indices) == 0:
        indices = np.array([], dtype="int64")
    else:
        indices = np.sort(np.concatenate(indices))
    if dropna:
        filtered = self._selected_obj.take(indices, axis=self.axis)
    else:
        mask = np.empty(len(self._selected_obj.index), dtype=bool)
        mask.fill(False)
        mask[indices.astype(int)] = True
        # mask fails to broadcast when passed to where; broadcast manually.
        mask = np.tile(mask, list(self._selected_obj.shape[1:]) + [1]).T
        filtered = self._selected_obj.where(mask)  # Fill with NaNs.
    return filtered
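
A simplified pure-Python model of the two branches may help here. It is a sketch and not the pandas code: `apply_filter_sketch` is a hypothetical name, rows are plain tuples, and `None` stands in for the NaN fill. It suggests that whichever branch runs, only positions present in `indices` can survive, so it is also worth checking what `indices` contains for the NaN group.

```python
def apply_filter_sketch(rows, indices, dropna=True):
    """Simplified model of the take/where branches of _apply_filter."""
    # Flatten the per-group position lists and sort them,
    # like np.sort(np.concatenate(indices)).
    positions = sorted(p for group in indices for p in group)
    if dropna:
        # Like .take(): return only the selected rows.
        return [rows[p] for p in positions]
    # Like .where(mask): keep the shape and blank out unselected rows.
    selected = set(positions)
    return [
        row if i in selected else (None,) * len(row)
        for i, row in enumerate(rows)
    ]


nan = float("nan")
rows = [(1, 2.0, 1), (1, 2.0, 2), (2, nan, 3), (2, nan, 4), (3, 3.0, 5)]

# Assumed input: the positions [2, 3] of the NaN group were lost upstream.
indices_without_nan_group = [[0, 1], []]

dropped = apply_filter_sketch(rows, indices_without_nan_group, dropna=True)
masked = apply_filter_sketch(rows, indices_without_nan_group, dropna=False)
print(dropped)    # [(1, 2.0, 1), (1, 2.0, 2)]
print(masked[2])  # (None, None, None)
```

Under that assumption the sketch reproduces both symptoms from the report: the NaN rows vanish with `dropna=True` and come back blanked out with `dropna=False`.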

@mroeschke what do you think?

@mroeschke
Member

Sounds reasonable. May also want to investigate if the group (2, np.nan) is handled correctly

@keenborder786

keenborder786 commented Nov 28, 2021

@abatomunkuev I am also looking into _apply_filter to find the problem.

@sappersapper

sappersapper commented Dec 9, 2021

It seems the problem comes from having nan in a dict key:

@final
def _get_indices(self, names):
    """
    Safe get multiple indices, translate keys for
    datelike to underlying repr.
    """

    def get_converter(s):
        # possibly convert to the actual key types
        # in the indices, could be a Timestamp or a np.datetime64
        if isinstance(s, datetime.datetime):
            return lambda key: Timestamp(key)
        elif isinstance(s, np.datetime64):
            return lambda key: Timestamp(key).asm8
        else:
            return lambda key: key

    if len(names) == 0:
        return []

    if len(self.indices) > 0:
        index_sample = next(iter(self.indices))
    else:
        index_sample = None  # Dummy sample

    name_sample = names[0]
    if isinstance(index_sample, tuple):
        if not isinstance(name_sample, tuple):
            msg = "must supply a tuple to get_group with multiple grouping keys"
            raise ValueError(msg)
        if not len(name_sample) == len(index_sample):
            try:
                # If the original grouper was a tuple
                return [self.indices[name] for name in names]
            except KeyError as err:
                # turns out it wasn't a tuple
                msg = (
                    "must supply a same-length tuple to get_group "
                    "with multiple grouping keys"
                )
                raise ValueError(msg) from err

        converters = [get_converter(s) for s in index_sample]
        names = (tuple(f(n) for f, n in zip(converters, name)) for name in names)

    else:
        converter = get_converter(index_sample)
        names = (converter(name) for name in names)

    return [self.indices.get(name, []) for name in names]

In the last line,

return [self.indices.get(name, []) for name in names]

if name contains nan, dict.get may not return the expected value, because nan does not compare equal to itself.

For the example above, self.indices = {(1, 2.0): array([0, 1]), (2, nan): array([2, 3]), (3, 3.0): array([4])} and name = (2, nan), yet _get_indices() returns [[]].
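
The lookup failure can be reproduced with a plain dict, independent of pandas. In this sketch the two NaN objects are created separately, which is an assumption about how the key stored in `self.indices` and the looked-up name differ; `nan_in_dict` and `nan_lookup` are illustrative names.

```python
nan_in_dict = float("nan")
nan_lookup = float("nan")  # a different object that is also NaN

indices = {(1, 2.0): [0, 1], (2, nan_in_dict): [2, 3], (3, 3.0): [4]}

# NaN never compares equal, not even to itself.
assert nan_in_dict != nan_in_dict

# Looking up with the very same NaN object still works, because
# containers check identity before equality.
hit = indices.get((2, nan_in_dict), [])

# Looking up with another NaN object misses and falls back to the default.
miss = indices.get((2, nan_lookup), [])

print(hit)   # [2, 3]
print(miss)  # []
```

The second lookup mirrors the `[[]]` result reported for `name=(2, nan)`: the key is visibly present in the dict, but no tuple built from a different NaN object can match it.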

5 participants