Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing on series where index is output from pd.cut #27437

Open
kuchenrolle opened this issue Jul 17, 2019 · 13 comments
Open

Indexing on series where index is output from pd.cut #27437

kuchenrolle opened this issue Jul 17, 2019 · 13 comments
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions

Comments

@kuchenrolle
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd

data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
grouped = data.groupby("quintiles")["values"]
samples = grouped.agg(lambda x: list(x.sample(3)))

samples
# quintiles
# (-0.019, 3.8]       [3, 2, 1]
# (3.8, 7.6]          [7, 4, 6]
# (7.6, 11.4]       [8, 10, 11]
# (11.4, 15.2]     [15, 14, 12]
# (15.2, 19.0]     [17, 16, 19]
# Name: values, dtype: object

samples[0]
# [3, 2, 1]
samples[1]
# [3, 2, 1]
samples[2.3]
# [3, 2, 1]
samples[19]
# [17, 16, 19]

samples.keys()
# CategoricalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2],
#                   (15.2, 19.0]],
#                  categories=[(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], ordered=True, name='quintiles', dtype='category')

Problem description

Indexing a series object that has pd.cut() intervals as its index results in membership test of the provided index on the intervals, which I found interesting, but it was very unexpected. Is this intentional and documented somewhere? When I give the labels names (labels=list("abcde") in pd.cut), this doesn't happen anymore.

Expected Output

I expectived it to fall back on .iloc[].

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 4.4.1
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.4.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 23, 2019
@mroeschke
Copy link
Member

This looks to work on master now. Could use a test

In [7]: samples
Out[7]:
quintiles
(-0.019, 3.8]       [0, 2, 3]
(3.8, 7.6]          [7, 6, 5]
(7.6, 11.4]       [11, 8, 10]
(11.4, 15.2]     [13, 12, 15]
(15.2, 19.0]     [18, 16, 17]
Name: values, dtype: object

In [8]: samples[0]
Out[8]: [0, 2, 3]

In [9]: samples[1]
Out[9]: [7, 6, 5]

In [10]: samples[19]
IndexError: index 19 is out of bounds for axis 0 with size 5

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 10, 2021
@feverbrew
Copy link

Is this still open? Also want to confirm that this should go in test_cut.py?

@djgray780
Copy link
Contributor

take

@jackgoldsmith4
Copy link
Contributor

@djgray780 are you still working on this?

@jackgoldsmith4
Copy link
Contributor

take

@jackgoldsmith4 jackgoldsmith4 removed their assignment Jul 2, 2022
@ryuusama09
Copy link

Is this issue still open , if yes then i would like to contribute !

@tab1tha
Copy link
Contributor

tab1tha commented Sep 5, 2022

@ryuusama09 No one is assigned to the issue so I think you can go ahead and work on it.

@mahaoyu
Copy link

mahaoyu commented Sep 17, 2022

take

@mahaoyu mahaoyu removed their assignment Sep 17, 2022
@Tiri1992
Copy link

Tiri1992 commented Oct 9, 2022

take

@Tiri1992
Copy link

Tiri1992 commented Oct 10, 2022

Just started looking into this issue (as I see its still open). As far as I can see the issue still exists. The behaviour using .iloc[index] is as expected but fails for series[index] and series.loc[index] as demonstrated below:

>>> import pandas as pd
>>> data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
>>> grouped = data.groupby("quintiles")["values"]
>>> samples = grouped.agg(lambda x: list(x.sample(3)))
>>> samples
quintiles
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
>>> samples[0]
[3, 2, 1]
>>> samples[1]
[3, 2, 1]
>>> samples.loc[0]
[3, 2, 1]
>>> samples.loc[1]
[3, 2, 1]
>>> samples.iloc[0]
[3, 2, 1]
>>> samples.iloc[1] # expected behaviour
[6, 4, 7]

Extending this, the bug also occurs when constructing a series from an instance of IntervalIndex (which might be the root of this problem). I demonstrate recreating a series object with the profile from above but this time using the index as a type of IntervalIndex.

>>> samples.index.categories
IntervalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], dtype='interval[float64, right]')
>>> new_series = pd.Series(samples.values, index=samples.index.categories)
>>> new_series
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
dtype: object
>>> new_series[0]
[3, 2, 1]
>>> new_series[1]
[3, 2, 1]

I will attempt to write up some unittests for this (seems the logical grouping for Series indexing tests is under tests/series/indexing/test_getitem.py). Just from a quick browse a lot of the index logic for Series objects are happening in the IndexingMixin base class?

@Tiri1992
Copy link

^^ I think having worked on this issue I realised this is expected behaviour and changing this ends up breaking a lot of tests. The expected behaviour was already implemented in tests a few months back here.

The call stack was roughly going from checking the fallback from series

To then checking the categorical _should_fallback, which just redirected the call to the categories object

Which obviously fell back to the IntervalIndex _should_fallback method

In this case I initially tried modifying the list of accepted strings to include i and f dtypes (i worked for some f for others) but broke many other tests.

This matter should be closed and will delete my PR above.

@Tiri1992 Tiri1992 removed their assignment Oct 13, 2022
@liang3zy22
Copy link
Contributor

Just tried on master. This issue still exists.

In [3]: data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))

In [4]: grouped = data.groupby("quintiles", observed=False)["values"]

In [5]: samples = grouped.agg(lambda x: list(x.sample(3)))

In [6]: samples
Out[6]: 
quintiles
(-0.019, 3.8]       [2, 1, 0]
(3.8, 7.6]          [5, 7, 4]
(7.6, 11.4]       [11, 10, 8]
(11.4, 15.2]     [12, 15, 14]
(15.2, 19.0]     [17, 16, 19]
Name: values, dtype: object

In [7]: samples[0]
Out[7]: [2, 1, 0]

In [8]: samples[1]
Out[8]: [2, 1, 0]

In [9]: samples[19]
Out[9]: [17, 16, 19]

This might be expected behavior.

@matthijsdewit111
Copy link

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.