mi.drop(x).get_loc_level(x) returns empty slice (rather than raising KeyError) #22221

Closed
HuntJSparra opened this issue Aug 6, 2018 · 11 comments · Fixed by #22230
Labels
Indexing (related to indexing on series/frames, not to indexes themselves); MultiIndex; Needs Discussion (requires discussion from core team before further action); Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@HuntJSparra

Code Sample

import pandas as pd

df = pd.DataFrame(dict(value=[0, 1], group=['filled','empty']))
groups = df.groupby('group')

def remove1(group):
    return group[group.value != 1]['value']
trimmed = groups.apply(remove1)


print("Trimmed.loc['empty']:\n",trimmed.loc['empty'])

Problem description

Since pandas 0.23, if a group (e.g. 'empty') is emptied by GroupBy.apply(), then accessing the result (e.g. trimmed) through trimmed.loc['empty'] raises KeyError: 'the label [empty] is not in the [index]' (full output below). This is possibly related to #21624.

Despite what the error says, printing trimmed.index shows that 'empty' still appears among the index levels:

MultiIndex(levels=[['empty', 'filled'], [0, 2]],
           labels=[[1, 1], [0, 1]],
           names=['group', None])

Full Output:

Traceback (most recent call last):
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1790, in _validate_key
    error()
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [empty] is not in the [index]'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pandasExample.py", line 14, in <module>
    print("Trimmed.loc['empty']:\n",trimmed.loc['empty'])
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1478, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1911, in _getitem_axis
    self._validate_key(key, axis)
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1798, in _validate_key
    error()
  File ".../python3.6/site-packages/pandas/core/indexing.py", line 1785, in error
    axis=self.obj._get_axis_name(axis)))
KeyError: 'the label [empty] is not in the [index]'

Expected Output

Prior to 0.23, an empty Series would be returned.

0.22.0 output:

Trimmed.loc['empty']:
 Series([], Name: value, dtype: int64)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.6.4
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.7.6
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Aug 6, 2018

Can you try on master? This seemed to be working for me.

WillAyd added the Groupby and Needs Info labels on Aug 6, 2018
@HuntJSparra
Author

Thanks for the quick response! Will I need to clone and build locally, or is master available through conda or pip?

@WillAyd
Member

WillAyd commented Aug 6, 2018

0.23.4 was just released, so you can try upgrading to that and see whether the issue is still there (I have not tried it myself). If it is still there, you would need to clone master and build it locally.

@HuntJSparra
Author

HuntJSparra commented Aug 6, 2018

I upgraded to 0.23.4 and had the same issue. However, it is fixed on master.

WillAyd added the Needs Discussion label and removed the Needs Info label on Aug 6, 2018
@WillAyd
Member

WillAyd commented Aug 6, 2018

Thanks for confirming. Good to know it's "fixed," though I'm actually not sure whether empty really should be part of the resulting index, given that nothing is returned for it by the apply operation. If it should, then we could close this out with a test case to ensure this behavior remains in place; if not, we may want to rethink what we allow here.

@jreback any thoughts?

@HuntJSparra
Author

In the original program where I encountered the error, GroupBy.apply() was being used to reduce a dataset to its outliers, which might or might not exist for a given group. Reverting to the 0.23.0-0.23.4 behavior would mean that I would have to wrap lookups in a try/except when going back through the now-trimmed outlier dataset, since its MultiIndex still shows the labels for emptied groups. If the MultiIndex returned by GroupBy.apply() were changed to reflect the behavior, I would instead have to check against the MultiIndex, which still adds code and complexity.

@WillAyd
Member

WillAyd commented Aug 6, 2018

Understood on the specific problem you are trying to solve, but I'm not aware of any other places in the code base where this type of behavior is encouraged, i.e. it may be non-idiomatic.

@toobaz may have some thoughts on this as well

toobaz added the Indexing and MultiIndex labels on Aug 6, 2018
@toobaz
Member

toobaz commented Aug 6, 2018

> I upgraded to 0.23.4 and had the same issue. However, it is fixed on master.

I'm afraid the fact that this works on master is the actual bug. This is related to #19027.

@HuntJSparra I can understand what you have in mind, but you are simply looking in an index for a value which is not there. This only worked in the past, and works again in master, because of an implementation glitch. To be honest, I don't even see the try-catch as the right way to go: since some groups just don't have outliers, you might want to loop over trimmed.index.unique(level='group') rather than over all original groups.

In any case, we must understand what happened in master that made this work again.
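
The suggestion to loop over trimmed.index.unique(level='group') can be sketched roughly as follows. This is only an illustration: the trimmed Series is built by hand as a stand-in for the groupby/apply result, and the level name 'row' is an assumption made for readability.

```python
import pandas as pd

# Hypothetical stand-in for the groupby/apply result: only the 'filled'
# group kept any rows, so 'empty' does not occur in the index at all.
index = pd.MultiIndex.from_tuples([("filled", 0)], names=["group", "row"])
trimmed = pd.Series([0], index=index, name="value")

# Iterate over the group labels that actually occur in the index,
# instead of over every group of the original frame.
present = list(trimmed.index.unique(level="group"))
for label in present:
    print(label, trimmed.loc[label].tolist())
```

Because only labels that are present get looked up, no .loc call in the loop can hit a missing group.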

toobaz added the Regression label and removed the Groupby label on Aug 7, 2018
toobaz changed the title from "0.23 GroupBy.apply().loc() Empty Group Crash" to "mi.drop(x).get_loc_level(x) returns empty slice (rather than raising KeyError)" on Aug 7, 2018
@toobaz
Member

toobaz commented Aug 7, 2018

Simpler way to reproduce:

In [2]: pd.MultiIndex.from_product([[0, 1], [2, 3]]).drop(1).get_loc_level(1)
Out[2]: (slice(2, 2, None), Int64Index([], dtype='int64'))
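
As a hedged illustration of why the label still looks "valid" after drop(): the dropped label stays in the stored levels even though no entry uses it any more. The variable names below are illustrative, and the sketch avoids calling get_loc_level directly because its result depends on the pandas version.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[0, 1], [2, 3]]).drop(1)

# drop() removes the entries, but the level still lists label 1 as a
# possible value even though nothing uses it any more.
stored_labels = list(mi.levels[0])
used_labels = list(mi.get_level_values(0).unique())
print(stored_labels, used_labels)

# Membership test against the values that are actually in use.
label_in_use = 1 in mi.get_level_values(0)

# remove_unused_levels() returns an index whose levels match what is used.
compact = mi.remove_unused_levels()
print(list(compact.levels[0]))
```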

@jreback jreback added this to the 0.24.0 milestone Aug 7, 2018
jreback pushed a commit that referenced this issue Aug 9, 2018
…#22230)

* BUG: raise KeyError if MultiIndex.get_loc_level is asked unused label

closes #22221

* TST: test groupby.apply() with user-defined function returning an empty chunk

* CLN: remove named lambda
@choma1

choma1 commented Aug 13, 2018

Why was returning a KeyError here determined to be preferable to returning an empty DataFrame?

@toobaz
Member

toobaz commented Aug 13, 2018

> Why was returning a KeyError here determined to be preferable to returning an empty DataFrame?

Because the label is not in the index/level, and .loc raises KeyError when the requested items are not found.
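
For code that still wants the older "empty result" behavior, one possible workaround is a small wrapper around .loc. This is a minimal sketch, not part of pandas: loc_or_empty is a hypothetical helper, and the hand-built index (with the assumed level name 'row') mimics a result in which 'empty' is an unused level value.

```python
import pandas as pd


def loc_or_empty(series, label):
    """Hypothetical helper: series.loc[label], or an empty slice of the
    series if the label is not found in the index."""
    try:
        return series.loc[label]
    except KeyError:
        # Note: the empty result keeps the original (Multi)Index type.
        return series.iloc[0:0]


# 'empty' is listed in the levels but is not used by any entry.
index = pd.MultiIndex(
    levels=[["empty", "filled"], [0]],
    codes=[[1], [0]],
    names=["group", "row"],
)
trimmed = pd.Series([0], index=index, name="value")

print(loc_or_empty(trimmed, "filled").tolist())
print(len(loc_or_empty(trimmed, "empty")))
```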
