New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index(data=..., dtype=CategoricalDtype(...)) doesn't maintain dtype #19032

Closed
topper-123 opened this Issue Jan 2, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@topper-123
Contributor

topper-123 commented Jan 2, 2018

Code Sample, a copy-pastable example if possible

>>> c = pd.api.types.CategoricalDtype(categories=['German', 'English', 'French'],
...                                   ordered=True)
>>> pd.Index(['German', 'English', 'French'], dtype=c)
CategoricalIndex(['German', 'English', 'French'],
                 categories=['English', 'French', 'German'],
                 ordered=False, dtype='category')

Problem description

Notice that the index isn't ordered and the categories have changed location.

If construction goes through CategoricalIndex, everything is ok.

Expected Output

Expected output is CategoricalIndex(['German', 'English', 'French'], categories=['German', 'English', 'French'], ordered=True).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 32
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.1
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170922
Cython: 0.26.1
numpy: 1.11.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.5
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@topper-123 topper-123 closed this Jan 2, 2018

@topper-123 topper-123 reopened this Jan 2, 2018

@topper-123 topper-123 changed the title from Index(data, dtype=CategoricalDtype()) doesn't maintain dtype to Index(data=..., dtype=CategoricalDtype(...)) doesn't maintain dtype Jan 2, 2018

@jreback

This comment has been minimized.

Contributor

jreback commented Jan 2, 2018

thought we had an issue for this one, ok, IIRC cc @jschendel @jorisvandenbossche @TomAugspurger
discussed this on the last Cat PR.

@jreback jreback added this to the Next Major Release milestone Jan 2, 2018

@jschendel

This comment has been minimized.

Member

jschendel commented Jan 2, 2018

My previous PR's related to CDT were in the context of .astype(CDT), so this case wasn't something that was covered. Fix looks super simple though, dtype just isn't being passed through to the CI constructor here:

if is_categorical_dtype(data) or is_categorical_dtype(dtype):
from .category import CategoricalIndex
return CategoricalIndex(data, copy=copy, name=name, **kwargs)

Will open a PR to fix it later today.

@topper-123

This comment has been minimized.

Contributor

topper-123 commented Jan 2, 2018

Could you maybe also do the same for IntervalIndex? That is, if the dtype is a pd.api.types.IntervalDtype, the constructor should be passed to IntervalIndex. (this doesn't do anything useful ATM, but as a principle (IMO) if dtype=IntervalDtype() should delegate to IntervalIndex.

This should only be 2 lines more....

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Jan 2, 2018

Could you maybe also do the same for IntervalIndex?

FWIW, I'm thinking through this type of issue with #18767

I'd recommend avoiding more

if isinstance(foo, extension_type_1): ...
elif isinstance(foo, extension_type_2): ...

blocks for now.

@jschendel

This comment has been minimized.

Member

jschendel commented Jan 2, 2018

I'd recommend avoiding more blocks for now

The block is actually already there, so it's really just a matter of passing dtype through to the IntervalIndex constructor. Won't do anything right now, as the IntervalIndex constructor accepts a dtype kwarg, but never actually uses it.

Planning to open an issue in the near future to allow subtype conversion for IntervalIndex, e.g. converting interval[int64] to interval[float64], which currently isn't possible via the constructor or astype. I believe this same issue would come into play then, so it'd need to be addressed at some point in the near term. Can do it in this PR or later, doesn't really matter to me.

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Jan 2, 2018

The block is actually already there, so it's really just a matter of passing dtype through to the IntervalIndex constructor

👍 then.

@topper-123

This comment has been minimized.

Contributor

topper-123 commented Jan 2, 2018

Great, thanks.

FYI, my use case is combining several series of the same dtype, and there may be several such groups. If I can pre-register the dtype, I can covert to the result index simply by Index(..., dtype=my_type), while if I explicitly have to supply CategoricalIndex, IntervalIndex also, it gets more complicated to do this kind of thing.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment