IntervalIndex does not accept CategoricalIndex (of interval dtype) #21243

Closed
toobaz opened this Issue May 29, 2018 · 5 comments

@toobaz
Member

toobaz commented May 29, 2018

Code Sample, a copy-pastable example if possible

In [2]: pd.qcut(range(100), 10).value_counts().index.categories.dtype
Out[2]: interval[float64]

In [3]: pd.IntervalIndex(pd.qcut(range(100), 10).value_counts().index)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-32e3bb1a4f49> in <module>()
----> 1 pd.IntervalIndex(pd.qcut(range(100), 10).value_counts().index)

~/nobackup/repo/pandas/pandas/core/indexes/interval.py in __new__(cls, data, closed, dtype, copy, name, fastpath, verify_integrity)
    238 
    239             data = maybe_convert_platform_interval(data)
--> 240             left, right, infer_closed = intervals_to_interval_bounds(data)
    241 
    242             if (com._all_not_none(closed, infer_closed) and

TypeError: Argument 'intervals' has incorrect type (expected numpy.ndarray, got CategoricalIndex)

Problem description

From a practical point of view, being able to do the above would simply be handy. From a conceptual point of view, whether a Series or Index has categorical dtype should be irrelevant to any operation that does not directly concern categories, so the above should work exactly as if the input were not a CategoricalIndex.

I think the following also reflects the same problem:

In [5]: pd.IntervalIndex(pd.qcut(range(100), 10))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-3744b32bfad8> in <module>()
----> 1 pd.IntervalIndex(pd.qcut(range(100), 10))

~/nobackup/repo/pandas/pandas/core/indexes/interval.py in __new__(cls, data, closed, dtype, copy, name, fastpath, verify_integrity)
    238 
    239             data = maybe_convert_platform_interval(data)
--> 240             left, right, infer_closed = intervals_to_interval_bounds(data)
    241 
    242             if (com._all_not_none(closed, infer_closed) and

TypeError: Argument 'intervals' has incorrect type (expected numpy.ndarray, got Categorical)

... and hence passing index.values also raises an error.

Expected Output

An IntervalIndex with the same content as the CategoricalIndex.
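
Until the constructor accepts categoricals directly, the expected result can probably be obtained by stripping the categorical wrapper first. The following is only a sketch of two possible routes, not the fix proposed in this thread, and it has not been checked against the development version reported below. The names `cat_index`, `via_codes` and `via_array` are made up for illustration, and `np.arange(100)` stands in for `range(100)`.

```python
import numpy as np
import pandas as pd

# Same construction as in the report: a CategoricalIndex whose
# categories are intervals.
cat_index = pd.qcut(np.arange(100), 10).value_counts().index

# Route 1: the categories are already an IntervalIndex, so re-expand
# them through the integer codes. Assumes there are no missing values
# (a code of -1 would be misread by take() as "last element").
via_codes = cat_index.categories.take(cat_index.codes)

# Route 2: materialize a plain object ndarray of Interval objects and
# hand that to the constructor instead of the categorical container.
via_array = pd.IntervalIndex(np.asarray(cat_index))
```

Route 1 avoids creating one Python object per element, which may matter for long indexes; route 2 is closer to what the constructor call in the report was meant to do.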

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.24.0.dev0+25.gcd0447102
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.2.0
Cython: 0.25.2
numpy: 1.14.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.2.2.post1153+gff6786446
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@toobaz

Member

toobaz commented May 29, 2018

Tagging as regression because the code provided in this SO answer doesn't work any more.

@jschendel

Member

jschendel commented May 29, 2018

Yeah, this should probably work. I think it's just a couple line fix; will create a PR within the next day if the fix is as straightforward as I think.

@jschendel

Member

jschendel commented May 29, 2018

Also, the IntervalIndex.from_* methods look a bit inconsistent when categoricals are provided.

IntervalIndex.from_arrays fails:

In [2]: cat_l = pd.Categorical([0, 0, 1, 2, 2])

In [3]: cat_r = pd.Categorical([1, 3, 3, 3, 4])

In [4]: pd.IntervalIndex.from_arrays(cat_l, cat_r)
---------------------------------------------------------------------------
TypeError: category, object, and string subtypes are not supported for IntervalIndex

But from_tuples is fine:

In [5]: cat_tup = pd.Categorical([(0, 1), (0, 1), (0, 3), (1, 2)])

In [6]: pd.IntervalIndex.from_tuples(cat_tup)
Out[6]:
IntervalIndex([(0, 1], (0, 1], (0, 3], (1, 2]]
              closed='right',
              dtype='interval[int64]')

A while back we decided to disallow creating an IntervalIndex from categoricals, but there are some valid cases for them, though I don't imagine that what I've shown above is done often. Will create a separate issue for this later.
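
For the `from_arrays` case above, one way to sidestep the error is to convert the categoricals to plain NumPy arrays before the call. This is a minimal sketch under that assumption, reusing `cat_l` and `cat_r` from the example; `idx` is an illustrative name, and the sketch says nothing about how the inconsistency should be resolved in pandas itself.

```python
import numpy as np
import pandas as pd

cat_l = pd.Categorical([0, 0, 1, 2, 2])
cat_r = pd.Categorical([1, 3, 3, 3, 4])

# np.asarray drops the categorical wrapper and returns the underlying
# values as ordinary integer arrays, which from_arrays accepts.
idx = pd.IntervalIndex.from_arrays(np.asarray(cat_l), np.asarray(cat_r))
```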

@toobaz

Member

toobaz commented May 29, 2018

> A while back we decided to disallow creating an IntervalIndex from categoricals, but there are some valid cases for them, though I don't imagine that what I've shown above is done often

While I can't say I use IntervalIndex often, pd.qcut returns Intervals as categories, and more or less anything you want to do with them will involve transforming them into an IntervalIndex (I think).

> Will create a separate issue for this later.

"this" = "the IntervalIndex.from_* methods", right?

@jschendel

Member

jschendel commented May 29, 2018

> While I can't say I use IntervalIndex often, pd.qcut returns Intervals as categories, and more or less anything you want to do with them will involve transforming them into an IntervalIndex (I think).

Yes, I think I was a bit unclear, but by "what I've shown above" I meant that using the from_* methods with categorical data is likely uncommon (note that from_intervals is deprecated). Directly using the constructor with categorical data is likely not as uncommon due to pd.cut/pd.qcut, as you mention.

> "this" = "the IntervalIndex.from_* methods", right?

Correct
