Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: IntervalIndex[IntervalArray] #20611

Merged
merged 24 commits into from
Jul 13, 2018
Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Apr 4, 2018

Closes #19453
Closes #19209

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 4, 2018
@TomAugspurger TomAugspurger added the Interval Interval data type label Apr 4, 2018
@pep8speaks
Copy link

pep8speaks commented Apr 4, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 13, 2018 at 01:01 Hours UTC

@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
ret = take_1d(self.categories.values, self._codes)
if dtype and not is_dtype_equal(dtype, self.categories.dtype):
return np.asarray(ret, dtype)
if is_extension_array_dtype(ret):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__array__ has to return an ndarray. Without this, Categorical[ExtensionArray]would fail, astake_1d(...)` would be an ExtensionArray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an update on this section already in intna

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
msg = "a string needs to be passed, got type {typ}"
raise TypeError(msg.format(typ=type(string)))

@property
def type(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the correct scalar type for IntervalArray

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc-string

@@ -270,7 +270,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
**kwargs)

# interval
if is_interval_dtype(data) or is_interval_dtype(dtype):
if ((is_interval_dtype(data) or is_interval_dtype(dtype)) and
not is_object_dtype(dtype)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so Index(intervals, dtype=object) works correctly (Index, not IntervalIndex)

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
-------
ndarray of hashed values array, same size as len(c)
"""
hashed = hash_array(c.categories.values, encoding, hash_key,
# Convert ExtensionArrays to ndarrays
values = np.asarray(c.categories.values)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Categorica[ExtensionArray].values will be an EA, and this has to be an ndarray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, shouldn't you need to check if this is an EA here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, asarray won't copy an ndarray, so calling it on an ndarray should be fine.

is_extension_array_dtype takes ~ 2 microseconds, and asarray takes ~10 ns, so better to just always call asarray instead of checking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this change? what is it for?

Copy link
Member

@jschendel jschendel Jul 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing it, and test_hash_scalar failed for Interval(0, 1):

def test_hash_scalar(self):
for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
datetime.datetime(2012, 1, 1),
pd.Timestamp("2012-01-01", tz='EST').to_pydatetime(),
pd.Timedelta('1 days'), datetime.timedelta(1),
pd.Period('2012-01-01', freq='D'), pd.Interval(0, 1),
np.nan, pd.NaT, None]:
result = _hash_scalar(val)
expected = hash_array(np.array([val], dtype=object),
categorize=True)
assert result[0] == expected[0]

@@ -402,13 +403,17 @@ def encode(obj):
u'freq': u_safe(getattr(obj, 'freqstr', None)),
u'tz': tz,
u'compress': compressor}
elif isinstance(obj, IntervalIndex):
return {u'typ': u'interval_index',
elif isinstance(obj, (IntervalIndex, IntervalArray)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should talk about this more generally in #20612

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very very tricky from a back-compat perspective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, that doesn't sound fun. I'll see what I can do to avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, should this be reverted? Do any msgpack tests fail if it is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test below fails on the last copied line for Interval data if reverted:

def test_basic_index(self):
for s, i in self.d.items():
i_rec = self.encode_decode(i)

@@ -546,10 +546,8 @@ def test_basic(self):

s = Series(ii, name='A')

# dtypes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API change.

@@ -886,7 +886,7 @@ def test_hasnans_isnans(self):
assert not idx.hasnans

idx = index.copy()
values = idx.values
values = np.asarray(idx.values)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change since IntervalArray currently isn't mutable. Worth talking about that.

@@ -51,7 +51,7 @@ def test_astype_category(self, index):
'timedelta64', 'timedelta64[ns]', 'datetime64', 'datetime64[ns]',
'datetime64[ns, US/Eastern]'])
def test_astype_cannot_cast(self, index, dtype):
msg = 'Cannot cast IntervalIndex to dtype'
msg = 'Cannot cast IntervalArray to dtype'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm shouldn't have changed this. Looking into it...

Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level summary:

  • Still have IntervalIndex composed with IntervalArray. This seems the most sensible for me.
  • IntervalArray currently isn't mutable. That could be changed but will require some care to ensure that .left and .right are modified inplace atomically

_typ = 'intervalindex'
_comparables = ['name']
_attributes = ['name', 'closed']
_allow_index_ops = True
_exception_rewrite = lambda: rewrite_exception('IntervalArray',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what people's thoughts are. Without this, the exception would be

In [6]: pd.IntervalIndex(2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: IntervalArray(...) must be called with a collection of some kind, 2 was passed

This lets the message be IntervalIndex(...).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments -1 unless much more shared code

from .interval import IntervalArray


__all__ = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don’t need the all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's so the #noqa aren't needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don’t do this anywhere else - don’t need introduce new ways of doing things
u less u r fixing it everywhere

result = IntervalMixin.__new__(cls)

closed = closed or 'right'
left = _ensure_index(left, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is SO duplicating II
have to do something about this - this is a technical debt nightmare

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see IntervalIndex._simple_new? There isn't any duplicated code between the two.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so EA takes a different take on construction that in Index. Where we use _simple_new and _shalllow_copy, but conceptually these are the same things. Would be ok with changing the spellings inside Index to conform this. This is my main objection generally. We spell things one way and a different way in EA, this is so confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Can you elaborate on this comment (as you say this is your main objection, but I don't fully understand it).

So IntervalArray._simple_new construct from left/right, while IntervalIndex._simple_new now constructs from values (IntervalArray), while previously it constructed from left/right. So yes, this is different between array and index, but, for IntervalIndex it is now more consistent with the base Index._simple_new (which constructs from values), so I think that is a good thing.

@jreback
Copy link
Contributor

jreback commented Apr 4, 2018

you copied code from II this is completely he wrong direction to go
instead move it to a shared mixin

@TomAugspurger
Copy link
Contributor Author

instead move it to a shared mixin

Could you maybe demo a single method that you think can be moved to a mixin?

@jreback
Copy link
Contributor

jreback commented Apr 4, 2018

this should be an extremely simple wrapper around the Index code
but if u want to make Index subclass EA that’s hard

my point is evey method here should be shared
and if not you have to have a really really good reason

value itself if False, ``nan``.

..versionadded:: 0.23.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return cls.from_arrays(left, right, closed, name=name, copy=False,
dtype=dtype)

def to_tuples(self, na_tuple=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger i do see u ripped some shared code out of this
but the top level EA is just adding way too much

@TomAugspurger
Copy link
Contributor Author

my point is evey method here should be shared
and if not you have to have a really really good reason

Again, I really don't think there's any code duplication in the current implementation. We've been discussing Indexes subclassing EAs in #19696, but the prevailing opinion there is that composition is better, specifically #19696 (comment) and #19696 (comment). If you want to weigh in there, then please do.

But assuming we do have IntervalIndex subclass IntervalArray, what does that actually gain? We can eliminate things like

def left(self):
    return self._data.left

Ok... I don't think that's worth the extra time it'll take to rewrite CategoricalIndex and IntervalIndex to use inheritance.

Writable docstrings.

Removed lambda as class attribute.
@@ -1347,7 +926,7 @@ def _format_data(self, name=None):
tail = [formatter(x) for x in self]
summary = '[{tail}]'.format(tail=', '.join(tail))

return summary + self._format_space()
return summary + ',' + self._format_space()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: This was missing a , at the end of the first line.

In [2]: pd.IntervalIndex.from_breaks([1, 2])
Out[2]:
IntervalIndex([(1, 2]]  # <-- missing comma
              closed='right',
              dtype='interval[int64]')

@jschendel
Copy link
Member

I think also closes #19453, as it achieves the desired end result? At the very least it looks to implement the behavior I wanted from that issue. Will try taking a closer look at the changes here after work.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 4, 2018 via email

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 4, 2018
@codecov
Copy link

codecov bot commented Apr 4, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@dcbf8b5). Click here to learn what that means.
The diff coverage is 93.44%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20611   +/-   ##
=========================================
  Coverage          ?   91.96%           
=========================================
  Files             ?      166           
  Lines             ?    50274           
  Branches          ?        0           
=========================================
  Hits              ?    46232           
  Misses            ?     4042           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.35% <93.44%> (?)
#single 42.17% <37.11%> (?)
Impacted Files Coverage Δ
pandas/core/dtypes/missing.py 93.56% <ø> (ø)
pandas/core/dtypes/dtypes.py 96.04% <100%> (ø)
pandas/util/_doctools.py 12.87% <100%> (ø)
pandas/util/testing.py 85.68% <100%> (ø)
pandas/core/util/hashing.py 98.36% <100%> (ø)
pandas/core/arrays/__init__.py 100% <100%> (ø)
pandas/core/dtypes/concat.py 99.18% <100%> (ø)
pandas/core/indexes/base.py 96.52% <100%> (ø)
pandas/core/arrays/categorical.py 95.94% <100%> (ø)
pandas/io/packers.py 88.04% <85.71%> (ø)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcbf8b5...385ce59. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

Did a basic version of IntervalArray.__setitem__ in de96a61.

I think that we have to copy both left and right, to ensure that we're never in a state where only left.__setitem__ succeeds, while right.__setitem__ fails.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't have time this week (offline the next days) to do a thorough review, but +1 on general approach. Thanks a lot for this!

For IntervalIndex, if we want to reduce the lines of code, we could rather easily add the interval-specific methods and attributes (left, right, closed, length, mid, is_non_overlapping_monotonic, ...) in an automatic way (like we add arithmetic methods on DataFrame etc).
Basically the methods that just passthrough the self._data equivalent and could also inherit the docstring automatically.

return self.values

def get_values(self):
return self.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method needed?


return self._shallow_copy(new_left, new_right)

take_nd = take
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think this should be needed?


# Conversion
@property
def values(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not call this .values (is this property required somewhere else?)
This is basically __array__ ?

tuples = np.where(~self.isna(), tuples, np.nan)
return tuples

def tolist(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed somewhere?

# TODO: think about putting this in a parent
return self.values.tolist()

def repeat(self, repeats):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem, is this needed?

# Avoid materializing self.values
return self.left.size
# Avoid materializing ndarray[Interval]
return self._data.size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could in principle be moved to the base Index class (same for shape, itemsize, dtype, ..)

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few quick comments; will review more later.

@@ -299,6 +299,41 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
df['A'].dtype
df['B'].dtype

.. _whatsnew_023.enhancements.interval:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_whatsnew_023. --> _whatsnew_0230. and likewise on line 337.


return self._shallow_copy(left, right)

def __setitem__(self, key, value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this support the case where value is np.nan? Doesn't look like that currently works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it currently fails. I'll see what can be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's also something strange going on with datetime data:

In [2]: ia = pd.interval_range(pd.Timestamp('20180101'), periods=3).values

In [3]: ia
Out[3]:
IntervalArray([(2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-03, 2018-01-04]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [4]: new_iv = pd.Interval(pd.Timestamp('20180101'), pd.Timestamp('20180105'))

In [5]: new_iv
Out[5]: Interval('2018-01-01', '2018-01-05', closed='right')

In [6]: ia[0] = new_iv

In [7]: ia
Out[7]: ---------------------------------------------------------------------------
ValueError: left side of interval must be <= right side

In [8]: ia.left
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-05', '2018-01-03'], dtype='datetime64[ns]', freq='D')

In [9]: ia.right
Out[9]: DatetimeIndex(['2018-01-05', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')

return self._shallow_copy(left, right)

def __setitem__(self, key, value):
if not (is_interval_dtype(value) or isinstance(value, Interval)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ABCInterval be used for the isinstance check?


if not isinstance(value, Interval):
msg = ("Interval.fillna only supports filling with scalar "
"'value'. Got a '{}' instead of a 'pandas.Interval'."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message seems a little unclear/confusing. Maybe something like: "IntervalArray.fillna only supports filling with pandas.Interval, got a '{}' instead."

self._left = left
self._right = right

def fillna(self, value=None, method=None, limit=None):
Copy link
Member

@jschendel jschendel Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be some extra checks on value here. Right now if I pass an Interval with a different closed, it silently gets changed to match the array's closed:

In [2]: ia = pd.interval_range(0, 4).insert(0, np.nan).values

In [3]: ia
Out[3]:
IntervalArray([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
              closed='right',
              dtype='interval[float64]')

In [4]: ia.fillna(pd.Interval(10, 20, closed='both'))
Out[4]:
IntervalArray([(10.0, 20.0], (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
              closed='right',
              dtype='interval[float64]')

Could probably create some type of helper function/method for this, as I'd guess it'd be somewhat of a common thing to do.

def __setitem__(self, key, value):
if not (is_interval_dtype(value) or isinstance(value, Interval)):
msg = "'value' should be an interval type, got {} instead."
raise TypeError(msg.format(type(value)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like generally type(value).__name__ is used in messages instead of just type(value). Not super concerned with this distinction though, just a minor inconsistency I noticed.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments. But this move shows a lot of rough edges. you had to copy quite a bit of code that is already in the index by base class. I really dont' want to re-invent the wheel. Need to decide what goes to EA base class (but that can't be used in Index ATM), and what should be in a shared EA/Index Mixin. This is getting pretty messy.

@@ -451,3 +451,11 @@ def is_platform_mac():

def is_platform_32bit():
return struct.calcsize("P") * 8 < 64


class _WritableDoc(type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be in pandas.util._decorators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a decorator. Still want it there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we have a pandas.util._doctools, move there

@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
ret = take_1d(self.categories.values, self._codes)
if dtype and not is_dtype_equal(dtype, self.categories.dtype):
return np.asarray(ret, dtype)
if is_extension_array_dtype(ret):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment this.

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
msg = "a string needs to be passed, got type {typ}"
raise TypeError(msg.format(typ=type(string)))

@property
def type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc-string

_typ = 'intervalindex'
_comparables = ['name']
_attributes = ['name', 'closed']
_allow_index_ops = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a remnant of a failed refactor. Will remove.

# both left and right are values
pass

result = self._data._shallow_copy(left=left, right=right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you are copying then constructing again???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left & right are shallow copied just like before, and _simple_new is called just like before. Unless I made a mistake (which I don't think I did), this is identical to before.

right = np.concatenate([interval.right for interval in to_concat])
return cls._simple_new(left, right, closed=closed, copy=False)

def _shallow_copy(self, left=None, right=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a method in EA? it should be

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as this is only uses internally within the IntervalArray class, it is not necessarily needed to make this part of the official ExtensionArray interface.

What are the advantages in doing so? (also the signature will be different for each array type)

verify_integrity=False)

# TODO: doc
def copy(self, deep=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this in EA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, copy is part of the extension array interface

return self.left.itemsize + self.right.itemsize

def take(self, indices, axis=0, allow_fill=True, fill_value=None,
**kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc. wonder how of this you can push to EA base clasess (I suspect almost all of it), this is a pure indexing operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be in favor of adding some basic implementation of take to EA, but in this case, IntervalArray would have to override it anyhow, as it needs to take both from left and right, which is something very specific to intervals.


# Formatting

def _format_data(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really a generic formatter Index. should this be in EA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather somewhere in the formatting code? (and then it could still be called from within a default EA repr)

head = []
tail = [formatter(x) for x in self]
summary = '[{tail}]'.format(tail=', '.join(tail))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so there is lots of copying of other code because you can't share the Index base things for this. You need prob push some of this code (e.g. formatting) to something like IndexOpsMixin (or the like). and use that here.

@jschendel
Copy link
Member

Pushed some pre-rebase commits:

  • Added NA support to __setitem__
  • Implemented IntervalArray.repeat to fix some failing tests that were complaining about it not existing
  • Addressed a number of review comments
    • Still a few that need to be addressed, but mostly docstring related

Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. I think you're good to merge in master anytime.

@@ -299,7 +299,42 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
df['A'].dtype
df['B'].dtype

.. _whatsnew_023.enhancements.extension:
.. _whatsnew_0230.enhancements.interval:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need changing during the rebase

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good.

Storing Interval Data in Series and DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Interval data may now be stored in a Series or DataFrame, in addition to an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks on Series/DataFrame.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Interval data may now be stored in a Series or DataFrame, in addition to an
:class:`IntervalIndex` like before (:issue:`19453`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before -> previously.

ser
ser.dtype

Previously, these would be cast to a NumPy array of Interval objects. In general,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks around Interval, or use a class ref. & Series

a Series.

Note that the ``.values`` of a Series containing intervals is no longer a NumPy
array. Rather, it's an ``ExtensionArray``, composed of two arrays ``left`` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to show the internal impl. (e.g. just the first sentence is ok)


ser.values

To recover the NumPy array of Interval objects, use :func:`numpy.asarray`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are going to have the below section I would not use this here (showing asarray) as its duplicative, just point to the ref for the below section)

return cls._simple_new(left, right, closed=closed, copy=False)

def _shallow_copy(self, left=None, right=None, closed=None):
if left is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

# Formatting

def _format_data(self):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do in a followup (though prob easy here) to use
format_object_summary from pandas.core.formats.printing as this is vastly simplified

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its used generally for all Indexes now

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
-------
ndarray of hashed values array, same size as len(c)
"""
hashed = hash_array(c.categories.values, encoding, hash_key,
# Convert ExtensionArrays to ndarrays
values = np.asarray(c.categories.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this change? what is it for?

@@ -240,6 +240,7 @@ def test_reindex_non_na_fill_value(self, data_missing):
na = data_missing[0]

array = data_missing._from_sequence([na, valid])
print(array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extraneous print



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used to overwrite exception messages so they display the correct class name. Looks like it's currently just used for IntervalIndex when it calls an underlying IntervalArray method that could raise some with an IntervalArray specific message, e.g. 'category, object, and string subtypes are not supported for IntervalArray'.

I suppose we could try to either make these messages slightly more generic ("IntervalArray" --> "IntervalArray-like" or "Interval classes" maybe) or try implementing an additional common interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self):
    # We have access to the correct `type(self).__name__`
    pass

def method(self):
    try:
        do_thing()
    except _IntervalException:
        self._raise_exception()

does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't really have a preference either. I might end up trying a few things to see what's cleanest.

@jschendel
Copy link
Member

Addressed the latest set of review comments, aside from the _format_data one (will do in a followup pr or with another set of review comments).

if not isinstance(value, ABCInterval):
msg = ("'IntervalArray.fillna' only supports filling with a "
"'scalar pandas.Interval'. Got a '{}' instead."
.format(type(value).__name__))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single quote is in the wrong place. Should be 'pandas.Interval'.


def __setitem__(self, key, value):
# na value: need special casing to set directly on numpy arrays
_needs_float_conversion = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_needs_float_conversion -> needs_float_conversion?

raise ValueError("Intervals must all be closed on the same side.")
closed = closed.pop()

# TODO: avoid intermediate list
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO (which I think I added) seems unavoidable, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it can be avoided. Removed the TODO. IIUC the intermediate list doesn't copy data, so not a big deal (np.concatenate does of course).

"""

@Appender(_interval_shared_docs['set_closed'] % _shared_docs_kwargs)
def set_closed(self, closed):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make IntervalArray.closed a setter as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this would be a problem with the current implementation, but #19371 makes me a bit hesitant since it'd make a closed setter a (limited) dtype setter.

@@ -402,13 +403,17 @@ def encode(obj):
u'freq': u_safe(getattr(obj, 'freqstr', None)),
u'tz': tz,
u'compress': compressor}
elif isinstance(obj, IntervalIndex):
return {u'typ': u'interval_index',
elif isinstance(obj, (IntervalIndex, IntervalArray)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, should this be reverted? Do any msgpack tests fail if it is?



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self):
    # We have access to the correct `type(self).__name__`
    pass

def method(self):
    try:
        do_thing()
    except _IntervalException:
        self._raise_exception()

does that make sense?



@contextlib.contextmanager
def rewrite_exception(old_name, new_name):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

* :ref:`Categorical <categorical>`
* :ref:`Datetime with Timezone <timeseries.timezone_series>`
* :ref:`Period <timeseries.periods>`
* Interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a ref for this? (maybe should create one if not)?

@@ -345,6 +414,7 @@ Interval
^^^^^^^^

- Bug in the :class:`IntervalIndex` constructor where the ``closed`` parameter did not always override the inferred ``closed`` (:issue:`19370`)
- Bug in the ``IntervalIndex`` repr where a trailing comma was missing after the list of intervals (:issue`20611`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need extra colon

copy=False, dtype=None, verify_integrity=True):
result = IntervalMixin.__new__(cls)

closed = closed or 'right'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe closed should always be not-None here by-definition?

raise TypeError(msg.format(type(value)))

# Need to ensure that left and right are updated atomically, so we're
# forced to copy, update the copy, and swap in the new values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I am clear why this is problematic. is there a failure possible when updating the 2nd one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the forcing the copy I'm seeing test_setitem_sequence fail:

def test_setitem_sequence(self, data):
arr = pd.Series(data)
original = data.copy()
arr[[0, 1]] = [data[1], data[0]]
assert arr[0] == original[1]
assert arr[1] == original[0]

self._right = right

def fillna(self, value=None, method=None, limit=None):
if method is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally add a doc-string

class TestReshaping(BaseInterval, base.BaseReshapingTests):
pass


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. I didn't add any tests to the base class, so not sure what should be moved here?

@TomAugspurger
Copy link
Contributor Author

@jschendel how do you feel about this? IMO, things looked quite nice the last time I looked. Are the outstanding issues

  1. Sharing printing between Index and IntervalArray
  2. (Maybe) changing rewrite_exception to use a method to handle raising with the right name?

Anything else? Are we comfortable doing those as followups? (fine by me)

@jschendel
Copy link
Member

Yes, I believe those are the only outstanding issues, aside from the question I had about @jreback's last comment. I'm fine with doing those items as followups.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 13, 2018 via email

@jreback
Copy link
Contributor

jreback commented Jul 13, 2018

@jschendel what comment? i am ok with merging and doing follow ups as well

@jschendel
Copy link
Member

@jreback : #20611 (comment)

"I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here"

Regardless, doesn't seem like something that would block this from being merged, and could be handled in a followup too.

@jschendel jschendel merged commit 610a19a into pandas-dev:master Jul 13, 2018

@property
def itemsize(self):
return self.left.itemsize + self.right.itemsize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jschendel a bit after the fact, but I don't think that we should add this to IntervalArray. We deprecated itemsize some time ago for Series/Index (#20721).
Just noticed this because we get the warning during the test:

pandas/tests/indexes/interval/test_interval.py::TestIntervalIndex::()::test_itemsize
  /home/travis/build/pandas-dev/pandas/pandas/core/arrays/interval.py:694: FutureWarning: Int64Index.itemsize is deprecated and will be removed in a future version
    return self.left.itemsize + self.right.itemsize

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally part of IntervalIndex and is currently what IntervalIndex calls under the hood, so if we remove it from IntervalArray we'd need to add it back IntervalIndex. Can certainly do that, unless there's some objection based on it making the two inconsistent. Removing it from IntervalArray seems like the best course of action though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, removing it from IntervalArray is the best way forward I think.
Also, on IntervalIndex it should be deprecated (now it is deprecated on the base IndexOpsMixin class, but by overriding it in IntervalIndex, it lost that deprecation. But apparently IntervalIndex is the only index that is not tested in the base tests, so this 'regression' was not catched by the tests).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #22049 so we don't loose track of this discussion

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Co-authored-by: Jeremy Schendel <jschendel@users.noreply.github.com>
@qwhelan
Copy link
Contributor

qwhelan commented Nov 8, 2018

A run of asv find has identified this PR as resulting in a ~1000x slowdown in the indexing.IntervalIndexing.time_loc_list benchmark:

asv find 7e6e7^..HEAD indexing.IntervalIndexing.time_loc_list
[...]
·· Installing 365eac4d <master~642> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                      2.27±0ms
· Testing -----------------------------------O>-----------------------
·· For pandas commit f3160bfa <master~639>:
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Building f3160bfa <master~639> for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt............................................................................................................
·· Installing f3160bfa <master~639> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                    2.90±0.03s
· Testing -----------------------------------O------------------------
·· For pandas commit 610a19a3 <master~641>:
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Building 610a19a3 <master~641> for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.................................................................................................................
·· Installing 610a19a3 <master~641> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                    2.92±0.01s.
· Greatest regression found: 610a19a3 <master~641>

Bisection done on Arch Linux and reproducible on Ubuntu 18.04.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants