New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IntervalDtype inconsistencies and bugs #18980

Closed
jschendel opened this Issue Dec 28, 2017 · 1 comment

Comments

Projects
None yet
2 participants
@jschendel
Member

jschendel commented Dec 28, 2017

1. Inconsistent comparisons versus string 'interval':

In [2]: IntervalDtype() == 'interval'
Out[2]: True

In [3]: IntervalDtype('interval') == 'interval'
Out[3]: False

In [4]: IntervalDtype('int64') == 'interval'
Out[4]: False

I'd expect all of these to return True, like how CategoricalDtype(*, *) == 'category' always returns True.


2. Inconsistent comparisons versus IntervalDtype(None):

In [5]: IntervalDtype(None) == IntervalDtype('interval')
Out[5]: False

In [6]: IntervalDtype(None) == IntervalDtype('int64')
Out[6]: False

I'd expect all of these to return True, like how CDT(None, None) == CDT(*, *) always returns True.


3. IntervalDtype.name attribute changes

In [7]: IntervalDtype().name
Out[7]: 'interval'

In [8]: IntervalDtype('interval').name
Out[8]: 'interval[]'

In [9]: IntervalDtype('int64').name
Out[9]: 'interval[int64]'

CategoricalDtype.name attribute is always the same:

In [10]: CategoricalDtype(list('abc'), True).name
Out[10]: 'category'

In [11]: CategoricalDtype(list('wxyz'), False).name
Out[11]: 'category'

I'd expect IntervalDtype.name to always return 'interval', like how CDT.name always returns 'category'. This makes the code for checking equality against strings (i.e. what I described in 1) simpler. I don't think the behavior of str(IntervalDtype) should change, which is currently the same as IntervalDtype.name, so I'd still have that return strings specifying the subtype.


4. CategoricalDtype gets cached incorrectly: (No longer an issue due to #19022)

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))

In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))

In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

This looks to be caused by the caching being done by string representation, and str(CDT(*, *)) always returns 'category':

try:
return cls._cache[str(subtype)]
except KeyError:
u = object.__new__(cls)
u.subtype = subtype
cls._cache[str(subtype)] = u
return u

Can caching be removed entirely for IntervalDtype, or is there some need/advantage that I'm not seeing? Looking at the other dtypes, CategoricalDtype appears to have had the caching code removed, but PeriodDtype and DatetimeTZDtype are using it.

@jreback

This comment has been minimized.

Contributor

jreback commented Dec 29, 2017

IIRC the reason for caching was:

a) perf (constructing a dtype each time involves some checks)
b) equality/hashing; the caching gives exactly the same object back so we can do is comparison. this is quite important as these are singular objects (though they are parameterized). IOW, IntervalDtype() should always be a singleton, and so on.

@jreback jreback added this to the Next Major Release milestone Dec 29, 2017

@jschendel jschendel referenced this issue Dec 29, 2017

Merged

Fix IntervalDtype Bugs and Inconsistencies #18997

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment