Index astype('category') does not return a CategoricalIndex #18630

Closed
nmusolino opened this Issue Dec 4, 2017 · 5 comments

nmusolino commented Dec 4, 2017

Code Sample, a copy-pastable example if possible

In [1]: import pandas

In [2]: idx = pandas.Index(['a', 'b', 'c'])

In [3]: idx
Out[3]: Index(['a', 'b', 'c'], dtype='object')

In [4]: idx.astype('category')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b8d97d97d03f> in <module>()
----> 1 idx.astype('category')

C:\...\pandas\indexes\base.py in astype(self, dtype, copy)
    889     @Appender(_index_shared_docs['astype'])
    890     def astype(self, dtype, copy=True):
--> 891         return Index(self.values.astype(dtype, copy=copy), name=self.name,
    892                      dtype=dtype)
    893

TypeError: data type "category" not understood

Problem description

The documentation for this method reads:

Create an Index with values cast to dtypes. The class of a new Index is determined by dtype.

Since there is a CategoricalIndex type, it is reasonable for a user to expect that .astype('category') would return a CategoricalIndex object.

As a workaround for the issue, users can construct a CategoricalIndex directly:

In [7]: pandas.CategoricalIndex(idx)
Out[7]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
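The workaround above can be wrapped in a small function. This is only a sketch: `to_categorical_index` is a hypothetical helper name, not part of pandas, and it does nothing beyond calling the `CategoricalIndex` constructor already shown.

```python
import pandas


def to_categorical_index(index):
    """Hypothetical helper: build a CategoricalIndex from any Index.

    Not part of pandas; it only wraps the constructor call used in the
    workaround.
    """
    # Pass the name explicitly so it survives the conversion.
    return pandas.CategoricalIndex(index, name=index.name)


idx = pandas.Index(['a', 'b', 'c'], name='letters')
cat_idx = to_categorical_index(idx)
print(type(cat_idx).__name__)
```

Passing `name` explicitly is a defensive choice here, so the result does not depend on the constructor inferring the name from its input.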

Expected Output

The method should return a CategoricalIndex equal to the following:

In [5]: pandas.CategoricalIndex(['a', 'b', 'c'])
Out[5]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
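A short check of the expected behaviour, written so that it runs on both affected and unaffected versions. The `TypeError` branch corresponds to the traceback reported above for pandas 0.19.1; the variable names `result`, `expected` and `matches` are illustrative only.

```python
import pandas

idx = pandas.Index(['a', 'b', 'c'])
expected = pandas.CategoricalIndex(['a', 'b', 'c'])

matches = False
try:
    result = idx.astype('category')
except TypeError:
    # Behaviour reported in this issue: the dtype string is not understood.
    result = None

if result is not None:
    # Expected behaviour: a CategoricalIndex with the same values and categories.
    matches = isinstance(result, pandas.CategoricalIndex) and result.equals(expected)

print(matches)
```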

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

TomAugspurger commented Dec 4, 2017

That seems reasonable. We would also want to accept CategoricalDtype there.

Are you able to submit a pull request?
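To illustrate what accepting a `CategoricalDtype` could look like, here is a sketch. It assumes a pandas version that provides `pandas.api.types.CategoricalDtype`; the particular categories and ordering are made up for the example. The explicit `CategoricalIndex` construction is the reference result, and the `astype` call is guarded because versions affected by this issue may reject it.

```python
import pandas
from pandas.api.types import CategoricalDtype

idx = pandas.Index(['a', 'b', 'c'])

# Illustrative dtype: custom category order, marked as ordered.
dtype = CategoricalDtype(categories=['c', 'b', 'a'], ordered=True)

# Reference result, built explicitly from the same categories and ordering.
expected = pandas.CategoricalIndex(idx, categories=['c', 'b', 'a'], ordered=True)

try:
    result = idx.astype(dtype)
except TypeError:
    # Versions affected by this issue may not accept the dtype here.
    result = None

print(expected.dtype)
```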

TomAugspurger added this to the Next Major Release milestone on Dec 4, 2017

jreback commented Dec 5, 2017

Note that the tests for this should cover .astype('category') on every type of index.
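A rough sketch of such a check over several index types follows. The list of indexes is illustrative rather than exhaustive, and `returns_categorical_index` is a hypothetical helper, not an existing pandas test utility.

```python
import pandas


def returns_categorical_index(index):
    """Hypothetical check: does astype('category') give a CategoricalIndex
    that holds the same values as the original index?"""
    try:
        result = index.astype('category')
    except TypeError:
        # Behaviour reported in this issue for affected versions.
        return False
    return isinstance(result, pandas.CategoricalIndex) and list(result) == list(index)


# Illustrative selection of index types; a real test suite would cover more.
indexes = [
    pandas.Index(['a', 'b', 'c']),
    pandas.Index([1, 2, 3]),
    pandas.Index([1.5, 2.5, 3.5]),
    pandas.RangeIndex(3),
    pandas.date_range('2017-12-04', periods=3),
    pandas.timedelta_range('1 day', periods=3),
]

outcomes = {type(index).__name__: returns_categorical_index(index) for index in indexes}
print(outcomes)
```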

jschendel commented Dec 7, 2017

A couple of questions:

  1. It looks like IntervalIndex.astype('category') already has some logic intentionally written to return a Categorical, not a CategoricalIndex. Should this be changed for consistency with the other types of index? Or was there a specific reason it was implemented this way? I don't immediately see a reason why we shouldn't return a CategoricalIndex. (see here for code)

  2. Should MultiIndex.astype('category') return categories consisting of tuples? Or should this not be supported for MultiIndex?

TomAugspurger commented Dec 7, 2017

jreback commented Dec 7, 2017

  1. This could probably be changed. I wrote it this way because we needed to convert an IntervalIndex to a Categorical for indexing, but I don't fully remember whether I later dropped that need. It should return a CategoricalIndex instead.
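For readers unfamiliar with the distinction in point 1, the sketch below builds a plain `Categorical` from interval data and then wraps it in a `CategoricalIndex`. It only illustrates the two return types under discussion; it is not the code inside `IntervalIndex.astype`, and the interval values are made up.

```python
import pandas

# Illustrative interval data: (0, 1], (1, 2], (2, 3]
intervals = pandas.interval_range(start=0, end=3)

# A Categorical is an array-like container, not an Index.
cat = pandas.Categorical(intervals)

# Wrapping it yields the Index type that the thread proposes returning.
cat_index = pandas.CategoricalIndex(cat)

print(type(cat).__name__, type(cat_index).__name__)
```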

jreback modified the milestones: Next Major Release, 0.22.0 on Dec 7, 2017
