Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: IntervalIndex[IntervalArray] #20611

Merged
merged 24 commits into from Jul 13, 2018

Conversation

Projects
None yet
6 participants
@TomAugspurger
Copy link
Contributor

commented Apr 4, 2018

Closes #19453
Closes #19209

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 4, 2018

@pep8speaks

This comment has been minimized.

Copy link

commented Apr 4, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 13, 2018 at 01:01 Hours UTC
@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
ret = take_1d(self.categories.values, self._codes)
if dtype and not is_dtype_equal(dtype, self.categories.dtype):
return np.asarray(ret, dtype)
if is_extension_array_dtype(ret):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

__array__ has to return an ndarray. Without this, Categorical[ExtensionArray]would fail, astake_1d(...)` would be an ExtensionArray.

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

comment this.

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

I have an update on this section already in intna

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
msg = "a string needs to be passed, got type {typ}"
raise TypeError(msg.format(typ=type(string)))

@property
def type(self):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

This is the correct scalar type for IntervalArray

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

add doc-string

@@ -270,7 +270,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
**kwargs)

# interval
if is_interval_dtype(data) or is_interval_dtype(dtype):
if ((is_interval_dtype(data) or is_interval_dtype(dtype)) and
not is_object_dtype(dtype)):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

This is so Index(intervals, dtype=object) works correctly (Index, not IntervalIndex)

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
-------
ndarray of hashed values array, same size as len(c)
"""
hashed = hash_array(c.categories.values, encoding, hash_key,
# Convert ExtensionArrays to ndarrays
values = np.asarray(c.categories.values)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Same as above. Categorica[ExtensionArray].values will be an EA, and this has to be an ndarray.

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

hmm, shouldn't you need to check if this is an EA here?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 5, 2018

Author Contributor

I don't think so, asarray won't copy an ndarray, so calling it on an ndarray should be fine.

is_extension_array_dtype takes ~ 2 microseconds, and asarray takes ~10 ns, so better to just always call asarray instead of checking.

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

do we still need this change? what is it for?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 9, 2018

Member

I tried removing it, and test_hash_scalar failed for Interval(0, 1):

def test_hash_scalar(self):
for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
datetime.datetime(2012, 1, 1),
pd.Timestamp("2012-01-01", tz='EST').to_pydatetime(),
pd.Timedelta('1 days'), datetime.timedelta(1),
pd.Period('2012-01-01', freq='D'), pd.Interval(0, 1),
np.nan, pd.NaT, None]:
result = _hash_scalar(val)
expected = hash_array(np.array([val], dtype=object),
categorize=True)
assert result[0] == expected[0]

@@ -402,13 +403,17 @@ def encode(obj):
u'freq': u_safe(getattr(obj, 'freqstr', None)),
u'tz': tz,
u'compress': compressor}
elif isinstance(obj, IntervalIndex):
return {u'typ': u'interval_index',
elif isinstance(obj, (IntervalIndex, IntervalArray)):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Should talk about this more generally in #20612

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

this is very very tricky from a back-compat perspective.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 5, 2018

Author Contributor

Mmm, that doesn't sound fun. I'll see what I can do to avoid that.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

I wonder, should this be reverted? Do any msgpack tests fail if it is?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

The test below fails on the last copied line for Interval data if reverted:

def test_basic_index(self):
for s, i in self.d.items():
i_rec = self.encode_decode(i)

@@ -546,10 +546,8 @@ def test_basic(self):

s = Series(ii, name='A')

# dtypes

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

API change.

@@ -886,7 +886,7 @@ def test_hasnans_isnans(self):
assert not idx.hasnans

idx = index.copy()
values = idx.values
values = np.asarray(idx.values)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Made this change since IntervalArray currently isn't mutable. Worth talking about that.

@@ -51,7 +51,7 @@ def test_astype_category(self, index):
'timedelta64', 'timedelta64[ns]', 'datetime64', 'datetime64[ns]',
'datetime64[ns, US/Eastern]'])
def test_astype_cannot_cast(self, index, dtype):
msg = 'Cannot cast IntervalIndex to dtype'
msg = 'Cannot cast IntervalArray to dtype'

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Hmm shouldn't have changed this. Looking into it...

@TomAugspurger
Copy link
Contributor Author

left a comment

High-level summary:

  • Still have IntervalIndex composed with IntervalArray. This seems the most sensible for me.
  • IntervalArray currently isn't mutable. That could be changed but will require some care to ensure that .left and .right are modified inplace atomically
_typ = 'intervalindex'
_comparables = ['name']
_attributes = ['name', 'closed']
_allow_index_ops = True
_exception_rewrite = lambda: rewrite_exception('IntervalArray',

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Not sure what people's thoughts are. Without this, the exception would be

In [6]: pd.IntervalIndex(2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: IntervalArray(...) must be called with a collection of some kind, 2 was passed

This lets the message be IntervalIndex(...).

@jreback
Copy link
Contributor

left a comment

see comments -1 unless much more shared code

from .interval import IntervalArray


__all__ = [

This comment has been minimized.

Copy link
@jreback

jreback Apr 4, 2018

Contributor

you don’t need the all

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

That's so the #noqa aren't needed.

This comment has been minimized.

Copy link
@jreback

jreback Apr 4, 2018

Contributor

we don’t do this anywhere else - don’t need introduce new ways of doing things
u less u r fixing it everywhere

result = IntervalMixin.__new__(cls)

closed = closed or 'right'
left = _ensure_index(left, copy=copy)

This comment has been minimized.

Copy link
@jreback

jreback Apr 4, 2018

Contributor

this is SO duplicating II
have to do something about this - this is a technical debt nightmare

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

Did you see IntervalIndex._simple_new? There isn't any duplicated code between the two.

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

so EA takes a different take on construction that in Index. Where we use _simple_new and _shalllow_copy, but conceptually these are the same things. Would be ok with changing the spellings inside Index to conform this. This is my main objection generally. We spell things one way and a different way in EA, this is so confusing.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 9, 2018

Member

@jreback Can you elaborate on this comment (as you say this is your main objection, but I don't fully understand it).

So IntervalArray._simple_new construct from left/right, while IntervalIndex._simple_new now constructs from values (IntervalArray), while previously it constructed from left/right. So yes, this is different between array and index, but, for IntervalIndex it is now more consistent with the base Index._simple_new (which constructs from values), so I think that is a good thing.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 4, 2018

you copied code from II this is completely he wrong direction to go
instead move it to a shared mixin

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2018

instead move it to a shared mixin

Could you maybe demo a single method that you think can be moved to a mixin?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Apr 4, 2018

this should be an extremely simple wrapper around the Index code
but if u want to make Index subclass EA that’s hard

my point is evey method here should be shared
and if not you have to have a really really good reason

value itself if False, ``nan``.
..versionadded:: 0.23.0

This comment has been minimized.

Copy link
@jreback

jreback Apr 4, 2018

Contributor

what the heck is this?

return cls.from_arrays(left, right, closed, name=name, copy=False,
dtype=dtype)

def to_tuples(self, na_tuple=True):

This comment has been minimized.

Copy link
@jreback

jreback Apr 4, 2018

Contributor

@TomAugspurger i do see u ripped some shared code out of this
but the top level EA is just adding way too much

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2018

my point is evey method here should be shared
and if not you have to have a really really good reason

Again, I really don't think there's any code duplication in the current implementation. We've been discussing Indexes subclassing EAs in #19696, but the prevailing opinion there is that composition is better, specifically #19696 (comment) and #19696 (comment). If you want to weigh in there, then please do.

But assuming we do have IntervalIndex subclass IntervalArray, what does that actually gain? We can eliminate things like

def left(self):
    return self._data.left

Ok... I don't think that's worth the extra time it'll take to rewrite CategoricalIndex and IntervalIndex to use inheritance.

TomAugspurger added some commits Apr 4, 2018

Py2 compat
Writable docstrings.

Removed lambda as class attribute.
@@ -1347,7 +926,7 @@ def _format_data(self, name=None):
tail = [formatter(x) for x in self]
summary = '[{tail}]'.format(tail=', '.join(tail))

return summary + self._format_space()
return summary + ',' + self._format_space()

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 4, 2018

Author Contributor

note: This was missing a , at the end of the first line.

In [2]: pd.IntervalIndex.from_breaks([1, 2])
Out[2]:
IntervalIndex([(1, 2]]  # <-- missing comma
              closed='right',
              dtype='interval[int64]')
@jschendel

This comment has been minimized.

Copy link
Member

commented Apr 4, 2018

I think also closes #19453, as it achieves the desired end result? At the very least it looks to implement the behavior I wanted from that issue. Will try taking a closer look at the changes here after work.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2018

@codecov

This comment has been minimized.

Copy link

commented Apr 4, 2018

Codecov Report

❗️ No coverage uploaded for pull request base (master@dcbf8b5). Click here to learn what that means.
The diff coverage is 93.44%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20611   +/-   ##
=========================================
  Coverage          ?   91.96%           
=========================================
  Files             ?      166           
  Lines             ?    50274           
  Branches          ?        0           
=========================================
  Hits              ?    46232           
  Misses            ?     4042           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.35% <93.44%> (?)
#single 42.17% <37.11%> (?)
Impacted Files Coverage Δ
pandas/core/dtypes/missing.py 93.56% <ø> (ø)
pandas/core/dtypes/dtypes.py 96.04% <100%> (ø)
pandas/util/_doctools.py 12.87% <100%> (ø)
pandas/util/testing.py 85.68% <100%> (ø)
pandas/core/util/hashing.py 98.36% <100%> (ø)
pandas/core/arrays/__init__.py 100% <100%> (ø)
pandas/core/dtypes/concat.py 99.18% <100%> (ø)
pandas/core/indexes/base.py 96.52% <100%> (ø)
pandas/core/arrays/categorical.py 95.94% <100%> (ø)
pandas/io/packers.py 88.04% <85.71%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcbf8b5...385ce59. Read the comment docs.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Apr 4, 2018

Did a basic version of IntervalArray.__setitem__ in de96a61.

I think that we have to copy both left and right, to ensure that we're never in a state where only left.__setitem__ succeeds, while right.__setitem__ fails.

@jorisvandenbossche
Copy link
Member

left a comment

I won't have time this week (offline the next days) to do a thorough review, but +1 on general approach. Thanks a lot for this!

For IntervalIndex, if we want to reduce the lines of code, we could rather easily add the interval-specific methods and attributes (left, right, closed, length, mid, is_non_overlapping_monotonic, ...) in an automatic way (like we add arithmetic methods on DataFrame etc).
Basically the methods that just passthrough the self._data equivalent and could also inherit the docstring automatically.

return self.values

def get_values(self):
return self.values

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

Is this method needed?


return self._shallow_copy(new_left, new_right)

take_nd = take

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

Don't think this should be needed?


# Conversion
@property
def values(self):

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

I would not call this .values (is this property required somewhere else?)
This is basically __array__ ?

tuples = np.where(~self.isna(), tuples, np.nan)
return tuples

def tolist(self):

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

Is this needed somewhere?

# TODO: think about putting this in a parent
return self.values.tolist()

def repeat(self, repeats):

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

idem, is this needed?

# Avoid materializing self.values
return self.left.size
# Avoid materializing ndarray[Interval]
return self._data.size

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 4, 2018

Member

This could in principle be moved to the base Index class (same for shape, itemsize, dtype, ..)

@jschendel
Copy link
Member

left a comment

A few quick comments; will review more later.

@@ -299,6 +299,41 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
df['A'].dtype
df['B'].dtype

.. _whatsnew_023.enhancements.interval:

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

_whatsnew_023. --> _whatsnew_0230. and likewise on line 337.


return self._shallow_copy(left, right)

def __setitem__(self, key, value):

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

Should this support the case where value is np.nan? Doesn't look like that currently works.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 6, 2018

Author Contributor

Yes, it currently fails. I'll see what can be done.

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 9, 2018

Member

Looks like there's also something strange going on with datetime data:

In [2]: ia = pd.interval_range(pd.Timestamp('20180101'), periods=3).values

In [3]: ia
Out[3]:
IntervalArray([(2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-03, 2018-01-04]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [4]: new_iv = pd.Interval(pd.Timestamp('20180101'), pd.Timestamp('20180105'))

In [5]: new_iv
Out[5]: Interval('2018-01-01', '2018-01-05', closed='right')

In [6]: ia[0] = new_iv

In [7]: ia
Out[7]: ---------------------------------------------------------------------------
ValueError: left side of interval must be <= right side

In [8]: ia.left
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-05', '2018-01-03'], dtype='datetime64[ns]', freq='D')

In [9]: ia.right
Out[9]: DatetimeIndex(['2018-01-05', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')
return self._shallow_copy(left, right)

def __setitem__(self, key, value):
if not (is_interval_dtype(value) or isinstance(value, Interval)):

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

Should ABCInterval be used for the isinstance check?


if not isinstance(value, Interval):
msg = ("Interval.fillna only supports filling with scalar "
"'value'. Got a '{}' instead of a 'pandas.Interval'."

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

This message seems a little unclear/confusing. Maybe something like: "IntervalArray.fillna only supports filling with pandas.Interval, got a '{}' instead."

self._left = left
self._right = right

def fillna(self, value=None, method=None, limit=None):

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

There needs to be some extra checks on value here. Right now if I pass an Interval with a different closed, it silently gets changed to match the array's closed:

In [2]: ia = pd.interval_range(0, 4).insert(0, np.nan).values

In [3]: ia
Out[3]:
IntervalArray([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
              closed='right',
              dtype='interval[float64]')

In [4]: ia.fillna(pd.Interval(10, 20, closed='both'))
Out[4]:
IntervalArray([(10.0, 20.0], (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
              closed='right',
              dtype='interval[float64]')

Could probably create some type of helper function/method for this, as I'd guess it'd be somewhat of a common thing to do.

def __setitem__(self, key, value):
if not (is_interval_dtype(value) or isinstance(value, Interval)):
msg = "'value' should be an interval type, got {} instead."
raise TypeError(msg.format(type(value)))

This comment has been minimized.

Copy link
@jschendel

jschendel Apr 5, 2018

Member

Looks like generally type(value).__name__ is used in messages instead of just type(value). Not super concerned with this distinction though, just a minor inconsistency I noticed.

@jreback
Copy link
Contributor

left a comment

left some comments. But this move shows a lot of rough edges. you had to copy quite a bit of code that is already in the index by base class. I really dont' want to re-invent the wheel. Need to decide what goes to EA base class (but that can't be used in Index ATM), and what should be in a shared EA/Index Mixin. This is getting pretty messy.

@@ -451,3 +451,11 @@ def is_platform_mac():

def is_platform_32bit():
return struct.calcsize("P") * 8 < 64


class _WritableDoc(type):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

should be in pandas.util._decorators

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 5, 2018

Author Contributor

It's not a decorator. Still want it there?

This comment has been minimized.

Copy link
@jreback

jreback Apr 14, 2018

Contributor

actually we have a pandas.util._doctools, move there

@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
ret = take_1d(self.categories.values, self._codes)
if dtype and not is_dtype_equal(dtype, self.categories.dtype):
return np.asarray(ret, dtype)
if is_extension_array_dtype(ret):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

comment this.

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
msg = "a string needs to be passed, got type {typ}"
raise TypeError(msg.format(typ=type(string)))

@property
def type(self):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

add doc-string

_typ = 'intervalindex'
_comparables = ['name']
_attributes = ['name', 'closed']
_allow_index_ops = True

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

what is this?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 5, 2018

Author Contributor

I think a remnant of a failed refactor. Will remove.

# both left and right are values
pass

result = self._data._shallow_copy(left=left, right=right)

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

so you are copying then constructing again???

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Apr 5, 2018

Author Contributor

Left & right are shallow copied just like before, and _simple_new is called just like before. Unless I made a mistake (which I don't think I did), this is identical to before.

right = np.concatenate([interval.right for interval in to_concat])
return cls._simple_new(left, right, closed=closed, copy=False)

def _shallow_copy(self, left=None, right=None):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

is this a method in EA? it should be

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 9, 2018

Member

As long as this is only uses internally within the IntervalArray class, it is not necessarily needed to make this part of the official ExtensionArray interface.

What are the advantages in doing so? (also the signature will be different for each array type)

verify_integrity=False)

# TODO: doc
def copy(self, deep=False):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

is this in EA?

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 9, 2018

Member

Yes, copy is part of the extension array interface

return self.left.itemsize + self.right.itemsize

def take(self, indices, axis=0, allow_fill=True, fill_value=None,
**kwargs):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

doc. wonder how of this you can push to EA base clasess (I suspect almost all of it), this is a pure indexing operation.

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 9, 2018

Member

I might be in favor of adding some basic implementation of take to EA, but in this case, IntervalArray would have to override it anyhow, as it needs to take both from left and right, which is something very specific to intervals.


# Formatting

def _format_data(self):

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

this is really a generic formatter Index. should this be in EA?

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche Apr 9, 2018

Member

Or rather somewhere in the formatting code? (and then it could still be called from within a default EA repr)

head = []
tail = [formatter(x) for x in self]
summary = '[{tail}]'.format(tail=', '.join(tail))

This comment has been minimized.

Copy link
@jreback

jreback Apr 5, 2018

Contributor

so there is lots of copying of other code because you can't share the Index base things for this. You need prob push some of this code (e.g. formatting) to something like IndexOpsMixin (or the like). and use that here.

@jschendel

This comment has been minimized.

Copy link
Member

commented Jul 7, 2018

Pushed some pre-rebase commits:

  • Added NA support to __setitem__
  • Implemented IntervalArray.repeat to fix some failing tests that were complaining about it not existing
  • Addressed a number of review comments
    • Still a few that need to be addressed, but mostly docstring related
@TomAugspurger
Copy link
Contributor Author

left a comment

Overall looks good. I think you're good to merge in master anytime.

@jreback

jreback approved these changes Jul 7, 2018

@@ -299,7 +299,42 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
df['A'].dtype
df['B'].dtype

.. _whatsnew_023.enhancements.extension:
.. _whatsnew_0230.enhancements.interval:

This comment has been minimized.

Copy link
@jreback

jreback Jul 7, 2018

Contributor

this will need changing during the rebase

@jreback
Copy link
Contributor

left a comment

looks pretty good.

Storing Interval Data in Series and DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Interval data may now be stored in a Series or DataFrame, in addition to an

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

double backticks on Series/DataFrame.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Interval data may now be stored in a Series or DataFrame, in addition to an
:class:`IntervalIndex` like before (:issue:`19453`).

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

before -> previously.

ser
ser.dtype

Previously, these would be cast to a NumPy array of Interval objects. In general,

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

double backticks around Interval, or use a class ref. & Series

a Series.

Note that the ``.values`` of a Series containing intervals is no longer a NumPy
array. Rather, it's an ``ExtensionArray``, composed of two arrays ``left`` and

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

I don't think you need to show the internal impl. (e.g. just the first sentence is ok)


ser.values

To recover the NumPy array of Interval objects, use :func:`numpy.asarray`:

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

if you are going to have the below section I would not use this here (showing asarray) as its duplicative, just point to the ref for the below section)

return cls._simple_new(left, right, closed=closed, copy=False)

def _shallow_copy(self, left=None, right=None, closed=None):
if left is None:

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

can you add a doc-string

# Formatting

def _format_data(self):

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

can do in a followup (though prob easy here) to use
format_object_summary from pandas.core.formats.printing as this is vastly simplified

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

its used generally for all Indexes now

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
-------
ndarray of hashed values array, same size as len(c)
"""
hashed = hash_array(c.categories.values, encoding, hash_key,
# Convert ExtensionArrays to ndarrays
values = np.asarray(c.categories.values)

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

do we still need this change? what is it for?

@@ -240,6 +240,7 @@ def test_reindex_non_na_fill_value(self, data_missing):
na = data_missing[0]

array = data_missing._from_sequence([na, valid])
print(array)

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

extraneous print



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):

This comment has been minimized.

Copy link
@jreback

jreback Jul 8, 2018

Contributor

what is this for?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 9, 2018

Member

This is being used to overwrite exception messages so they display the correct class name. Looks like it's currently just used for IntervalIndex when it calls an underlying IntervalArray method that could raise some with an IntervalArray specific message, e.g. 'category, object, and string subtypes are not supported for IntervalArray'.

I suppose we could try to either make these messages slightly more generic ("IntervalArray" --> "IntervalArray-like" or "Interval classes" maybe) or try implementing an additional common interface.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self):
    # We have access to the correct `type(self).__name__`
    pass

def method(self):
    try:
        do_thing()
    except _IntervalException:
        self._raise_exception()

does that make sense?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

Yeah, I don't really have a preference either. I might end up trying a few things to see what's cleanest.

@jschendel

This comment has been minimized.

Copy link
Member

commented Jul 9, 2018

Addressed the latest set of review comments, aside from the _format_data one (will do in a followup pr or with another set of review comments).

if not isinstance(value, ABCInterval):
msg = ("'IntervalArray.fillna' only supports filling with a "
"'scalar pandas.Interval'. Got a '{}' instead."
.format(type(value).__name__))

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

Single quote is in the wrong place. Should be 'pandas.Interval'.


def __setitem__(self, key, value):
# na value: need special casing to set directly on numpy arrays
_needs_float_conversion = False

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

_needs_float_conversion -> needs_float_conversion?

raise ValueError("Intervals must all be closed on the same side.")
closed = closed.pop()

# TODO: avoid intermediate list

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

This TODO (which I think I added) seems unavoidable, right?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

Yeah, I don't think it can be avoided. Removed the TODO. IIUC the intermediate list doesn't copy data, so not a big deal (np.concatenate does of course).

"""

@Appender(_interval_shared_docs['set_closed'] % _shared_docs_kwargs)
def set_closed(self, closed):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

Should we make IntervalArray.closed a setter as well?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

I don't think this would be a problem with the current implementation, but #19371 makes me a bit hesitant since it'd make a closed setter a (limited) dtype setter.

@@ -402,13 +403,17 @@ def encode(obj):
u'freq': u_safe(getattr(obj, 'freqstr', None)),
u'tz': tz,
u'compress': compressor}
elif isinstance(obj, IntervalIndex):
return {u'typ': u'interval_index',
elif isinstance(obj, (IntervalIndex, IntervalArray)):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

I wonder, should this be reverted? Do any msgpack tests fail if it is?



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self):
    # We have access to the correct `type(self).__name__`
    pass

def method(self):
    try:
        do_thing()
    except _IntervalException:
        self._raise_exception()

does that make sense?



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 10, 2018

Author Contributor

Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* Interval

This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

do we have a ref for this? (maybe should create one if not)?

@@ -345,6 +414,7 @@ Interval
^^^^^^^^

- Bug in the :class:`IntervalIndex` constructor where the ``closed`` parameter did not always override the inferred ``closed`` (:issue:`19370`)
- Bug in the ``IntervalIndex`` repr where a trailing comma was missing after the list of intervals (:issue`20611`)

This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

need extra colon

copy=False, dtype=None, verify_integrity=True):
result = IntervalMixin.__new__(cls)

closed = closed or 'right'

This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

I believe closed should always be not-None here by-definition?

raise TypeError(msg.format(type(value)))

# Need to ensure that left and right are updated atomically, so we're
# forced to copy, update the copy, and swap in the new values.

This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

not sure I am clear why this is problematic. is there a failure possible when updating the 2nd one?

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

Without the forcing the copy I'm seeing test_setitem_sequence fail:

def test_setitem_sequence(self, data):
arr = pd.Series(data)
original = data.copy()
arr[[0, 1]] = [data[1], data[0]]
assert arr[0] == original[1]
assert arr[1] == original[0]

self._right = right

def fillna(self, value=None, method=None, limit=None):
if method is not None:

This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

ideally add a doc-string

class TestReshaping(BaseInterval, base.BaseReshapingTests):
pass


This comment has been minimized.

Copy link
@jreback

jreback Jul 10, 2018

Contributor

I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here

This comment has been minimized.

Copy link
@jschendel

jschendel Jul 10, 2018

Member

I don't follow. I didn't add any tests to the base class, so not sure what should be moved here?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jul 13, 2018

@jschendel how do you feel about this? IMO, things looked quite nice the last time I looked. Are the outstanding issues

  1. Sharing printing between Index and IntervalArray
  2. (Maybe) changing rewrite_exception to use a method to handle raising with the right name?

Anything else? Are we comfortable doing those as followups? (fine by me)

@jschendel

This comment has been minimized.

Copy link
Member

commented Jul 13, 2018

Yes, I believe those are the only outstanding issues, aside from the question I had about @jreback's last comment. I'm fine with doing those items as followups.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jul 13, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2018

@jschendel what comment? i am ok with merging and doing follow ups as well

@jschendel

This comment has been minimized.

Copy link
Member

commented Jul 13, 2018

@jreback : #20611 (comment)

"I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here"

Regardless, doesn't seem like something that would block this from being merged, and could be handled in a followup too.

@jschendel jschendel merged commit 610a19a into pandas-dev:master Jul 13, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@property
def itemsize(self):
return self.left.itemsize + self.right.itemsize

This comment has been minimized.

Copy link
@jorisvandenbossche