Indexing on series where index is output from pd.cut #27437

kuchenrolle · 2019-07-17T18:24:13Z

Code Sample, a copy-pastable example if possible

import pandas as pd

data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
grouped = data.groupby("quintiles")["values"]
samples = grouped.agg(lambda x: list(x.sample(3)))

samples
# quintiles
# (-0.019, 3.8]       [3, 2, 1]
# (3.8, 7.6]          [7, 4, 6]
# (7.6, 11.4]       [8, 10, 11]
# (11.4, 15.2]     [15, 14, 12]
# (15.2, 19.0]     [17, 16, 19]
# Name: values, dtype: object

samples[0]
# [3, 2, 1]
samples[1]
# [3, 2, 1]
samples[2.3]
# [3, 2, 1]
samples[19]
# [17, 16, 19]

samples.keys()
# CategoricalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2],
#                   (15.2, 19.0]],
#                  categories=[(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], ordered=True, name='quintiles', dtype='category')

Problem description

Indexing a series object that has pd.cut() intervals as its index results in membership test of the provided index on the intervals, which I found interesting, but it was very unexpected. Is this intentional and documented somewhere? When I give the labels names (labels=list("abcde") in pd.cut), this doesn't happen anymore.

Expected Output

I expectived it to fall back on .iloc[].

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.12.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 4.4.1
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.4.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

mroeschke · 2021-07-10T18:26:30Z

This looks to work on master now. Could use a test

In [7]: samples
Out[7]:
quintiles
(-0.019, 3.8]       [0, 2, 3]
(3.8, 7.6]          [7, 6, 5]
(7.6, 11.4]       [11, 8, 10]
(11.4, 15.2]     [13, 12, 15]
(15.2, 19.0]     [18, 16, 17]
Name: values, dtype: object

In [8]: samples[0]
Out[8]: [0, 2, 3]

In [9]: samples[1]
Out[9]: [7, 6, 5]

In [10]: samples[19]
IndexError: index 19 is out of bounds for axis 0 with size 5

feverbrew · 2021-09-20T18:27:19Z

Is this still open? Also want to confirm that this should go in test_cut.py?

djgray780 · 2022-05-18T15:31:35Z

take

jackgoldsmith4 · 2022-06-04T15:30:28Z

@djgray780 are you still working on this?

jackgoldsmith4 · 2022-06-15T02:19:13Z

take

ryuusama09 · 2022-08-27T03:32:59Z

Is this issue still open , if yes then i would like to contribute !

tab1tha · 2022-09-05T05:26:10Z

@ryuusama09 No one is assigned to the issue so I think you can go ahead and work on it.

mahaoyu · 2022-09-17T15:26:00Z

take

Tiri1992 · 2022-10-09T20:14:18Z

take

Tiri1992 · 2022-10-10T17:18:20Z

Just started looking into this issue (as I see its still open). As far as I can see the issue still exists. The behaviour using .iloc[index] is as expected but fails for series[index] and series.loc[index] as demonstrated below:

>>> import pandas as pd
>>> data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))
>>> grouped = data.groupby("quintiles")["values"]
>>> samples = grouped.agg(lambda x: list(x.sample(3)))
>>> samples
quintiles
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
>>> samples[0]
[3, 2, 1]
>>> samples[1]
[3, 2, 1]
>>> samples.loc[0]
[3, 2, 1]
>>> samples.loc[1]
[3, 2, 1]
>>> samples.iloc[0]
[3, 2, 1]
>>> samples.iloc[1] # expected behaviour
[6, 4, 7]

Extending this, the bug also occurs when constructing a series from an instance of IntervalIndex (which might be the root of this problem). I demonstrate recreating a series object with the profile from above but this time using the index as a type of IntervalIndex.

>>> samples.index.categories
IntervalIndex([(-0.019, 3.8], (3.8, 7.6], (7.6, 11.4], (11.4, 15.2], (15.2, 19.0]], dtype='interval[float64, right]')
>>> new_series = pd.Series(samples.values, index=samples.index.categories)
>>> new_series
(-0.019, 3.8]       [3, 2, 1]
(3.8, 7.6]          [6, 4, 7]
(7.6, 11.4]        [11, 8, 9]
(11.4, 15.2]     [13, 14, 12]
(15.2, 19.0]     [18, 19, 16]
dtype: object
>>> new_series[0]
[3, 2, 1]
>>> new_series[1]
[3, 2, 1]

I will attempt to write up some unittests for this (seems the logical grouping for Series indexing tests is under tests/series/indexing/test_getitem.py). Just from a quick browse a lot of the index logic for Series objects are happening in the IndexingMixin base class?

Tiri1992 · 2022-10-11T13:10:42Z

^^ I think having worked on this issue I realised this is expected behaviour and changing this ends up breaking a lot of tests. The expected behaviour was already implemented in tests a few months back here.

The call stack was roughly going from checking the fallback from series

To then checking the categorical _should_fallback, which just redirected the call to the categories object

Which obviously fell back to the IntervalIndex _should_fallback method

In this case I initially tried modifying the list of accepted strings to include i and f dtypes (i worked for some f for others) but broke many other tests.

This matter should be closed and will delete my PR above.

liang3zy22 · 2023-03-23T03:09:23Z

Just tried on master. This issue still exists.

In [3]: data = pd.DataFrame(dict(values=range(20), quintiles=pd.cut(range(20), 5)))

In [4]: grouped = data.groupby("quintiles", observed=False)["values"]

In [5]: samples = grouped.agg(lambda x: list(x.sample(3)))

In [6]: samples
Out[6]: 
quintiles
(-0.019, 3.8]       [2, 1, 0]
(3.8, 7.6]          [5, 7, 4]
(7.6, 11.4]       [11, 10, 8]
(11.4, 15.2]     [12, 15, 14]
(15.2, 19.0]     [17, 16, 19]
Name: values, dtype: object

In [7]: samples[0]
Out[7]: [2, 1, 0]

In [8]: samples[1]
Out[8]: [2, 1, 0]

In [9]: samples[19]
Out[9]: [17, 16, 19]

This might be expected behavior.

matthijsdewit111 · 2023-08-18T08:22:47Z

take

jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 23, 2019

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 10, 2021

github-actions bot assigned jackgoldsmith4 Jun 15, 2022

jackgoldsmith4 removed their assignment Jul 2, 2022

github-actions bot assigned mahaoyu Sep 17, 2022

mahaoyu removed their assignment Sep 17, 2022

github-actions bot assigned Tiri1992 Oct 9, 2022

Tiri1992 mentioned this issue Oct 10, 2022

Index fallbacks with Intervals #49035

Closed

7 tasks

Tiri1992 removed their assignment Oct 13, 2022

github-actions bot assigned matthijsdewit111 Aug 18, 2023

matthijsdewit111 removed their assignment Aug 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing on series where index is output from pd.cut #27437

Indexing on series where index is output from pd.cut #27437

kuchenrolle commented Jul 17, 2019

INSTALLED VERSIONS

mroeschke commented Jul 10, 2021

feverbrew commented Sep 20, 2021

djgray780 commented May 18, 2022

jackgoldsmith4 commented Jun 4, 2022

jackgoldsmith4 commented Jun 15, 2022

ryuusama09 commented Aug 27, 2022

tab1tha commented Sep 5, 2022

mahaoyu commented Sep 17, 2022

Tiri1992 commented Oct 9, 2022

Tiri1992 commented Oct 10, 2022 •

edited

Tiri1992 commented Oct 11, 2022

liang3zy22 commented Mar 23, 2023

matthijsdewit111 commented Aug 18, 2023

Indexing on series where index is output from pd.cut #27437

Indexing on series where index is output from pd.cut #27437

Comments

kuchenrolle commented Jul 17, 2019

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

mroeschke commented Jul 10, 2021

feverbrew commented Sep 20, 2021

djgray780 commented May 18, 2022

jackgoldsmith4 commented Jun 4, 2022

jackgoldsmith4 commented Jun 15, 2022

ryuusama09 commented Aug 27, 2022

tab1tha commented Sep 5, 2022

mahaoyu commented Sep 17, 2022

Tiri1992 commented Oct 9, 2022

Tiri1992 commented Oct 10, 2022 • edited

Tiri1992 commented Oct 11, 2022

liang3zy22 commented Mar 23, 2023

matthijsdewit111 commented Aug 18, 2023

Output of `pd.show_versions()`

Tiri1992 commented Oct 10, 2022 •

edited