Index astype('category') does not return a CategoricalIndex #18630

nmusolino opened this Issue Dec 4, 2017 · 5 comments


nmusolino commented Dec 4, 2017

Code Sample, a copy-pastable example if possible

In [1]: import pandas

In [2]: idx = pandas.Index(['a', 'b', 'c'])

In [3]: idx
Out[3]: Index(['a', 'b', 'c'], dtype='object')

In [4]: idx.astype('category')
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b8d97d97d03f> in <module>()
----> 1 idx.astype('category')

C:\...\pandas\indexes\ in astype(self, dtype, copy)
    889     @Appender(_index_shared_docs['astype'])
    890     def astype(self, dtype, copy=True):
--> 891         return Index(self.values.astype(dtype, copy=copy),
    892                      dtype=dtype)

TypeError: data type "category" not understood

Problem description

The documentation for this method reads:

Create an Index with values cast to dtypes. The class of a new Index is determined by dtype.

Since there is a CategoricalIndex type, it is reasonable for a user to expect that .astype('category') would return a CategoricalIndex object.

As a workaround for the issue, users can construct a CategoricalIndex directly:

In [7]: pandas.CategoricalIndex(idx)
Out[7]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

Expected Output

The method should return a CategoricalIndex equal to the following:

In [5]: pandas.CategoricalIndex(['a', 'b', 'c'])
Out[5]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
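
Until astype('category') is supported, the workaround and the expected output above can be combined into a small check. This is only a minimal sketch: as_categorical_index is a hypothetical helper name (not part of pandas), and carrying over idx.name is an assumption about what a fixed astype should preserve.

```python
import pandas as pd


def as_categorical_index(idx):
    """Hypothetical helper: wrap an existing Index in a CategoricalIndex.

    The name is passed explicitly so it is kept on the result
    (an assumption about the desired behavior, not stated in the issue).
    """
    return pd.CategoricalIndex(idx, name=idx.name)


idx = pd.Index(['a', 'b', 'c'], name='letters')
result = as_categorical_index(idx)

# The value the issue expects astype('category') to return.
expected = pd.CategoricalIndex(['a', 'b', 'c'], name='letters')

assert isinstance(result, pd.CategoricalIndex)
assert result.equals(expected)
```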

TomAugspurger commented Dec 4, 2017

That seems reasonable. We would also want to accept CategoricalDtype there.

Are you able to submit a pull request?
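
For illustration, accepting a CategoricalDtype would let the caller control the categories and their ordering, which the plain 'category' string cannot express. The sketch below uses the CategoricalIndex constructor as a stand-in; that astype should produce the same result for the same dtype is an assumption about the proposed behavior, not something confirmed in this thread.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

idx = pd.Index(['a', 'b', 'c'])

# An explicit dtype: reversed category order, marked as ordered.
dtype = CategoricalDtype(categories=['c', 'b', 'a'], ordered=True)

# Stand-in for the proposed idx.astype(dtype).
cat_idx = pd.CategoricalIndex(idx, dtype=dtype)

assert cat_idx.dtype == dtype
assert list(cat_idx.categories) == ['c', 'b', 'a']
```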

TomAugspurger added this to the Next Major Release milestone Dec 4, 2017


jreback commented Dec 5, 2017

Note that this should test .astype('category') on all index types.
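
A rough sketch of what such a test could look like, assuming a pandas version in which the fix has landed (on the version reported above, the astype call raises TypeError instead). The list of index types and the helper name check_astype_category are illustrative choices, not taken from the pandas test suite.

```python
import pandas as pd

# One representative of each index type to exercise (illustrative selection).
indexes = [
    pd.Index(['a', 'b', 'c']),
    pd.Index([1, 2, 3]),
    pd.Index([1.5, 2.5]),
    pd.RangeIndex(3),
    pd.date_range('2017-01-01', periods=3),
    pd.period_range('2017-01', periods=3, freq='M'),
    pd.timedelta_range('1 day', periods=3),
    pd.interval_range(0, 3),
]


def check_astype_category(idx):
    """Hypothetical check: astype('category') should match the constructor."""
    result = idx.astype('category')
    expected = pd.CategoricalIndex(idx, name=idx.name)
    assert isinstance(result, pd.CategoricalIndex)
    assert result.equals(expected)


for idx in indexes:
    check_astype_category(idx)
```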


jschendel commented Dec 7, 2017

A couple of questions:

  1. It looks like IntervalIndex.astype('category') already has logic intentionally written to return a Categorical, not a CategoricalIndex. Should this be changed for consistency with the other index types, or was there a specific reason it was implemented this way? I don't immediately see a reason why we shouldn't return a CategoricalIndex.

  2. Should MultiIndex.astype('category') return categories consisting of tuples? Or should this not be supported for MultiIndex?


TomAugspurger commented Dec 7, 2017


jreback commented Dec 7, 2017

  1. This could probably be changed. I wrote it this way because we needed to convert an IntervalIndex to a categorical for indexing, but I don't fully remember whether I later discarded that need. This should return a CategoricalIndex instead.

jreback modified the milestones: Next Major Release, 0.22.0 Dec 7, 2017

