REF: IntervalIndex[IntervalArray] #20611

TomAugspurger · 2018-04-04T16:49:54Z

Closes #19453
Closes #19209

Closes pandas-dev#19209

pep8speaks · 2018-04-04T16:50:06Z

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 13, 2018 at 01:01 Hours UTC

TomAugspurger · 2018-04-04T16:51:38Z

pandas/core/arrays/categorical.py

@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
        ret = take_1d(self.categories.values, self._codes)
        if dtype and not is_dtype_equal(dtype, self.categories.dtype):
            return np.asarray(ret, dtype)
+        if is_extension_array_dtype(ret):


__array__ has to return an ndarray. Without this, Categorical[ExtensionArray]would fail, astake_1d(...)` would be an ExtensionArray.

comment this.

I have an update on this section already in intna

TomAugspurger · 2018-04-04T16:52:00Z

pandas/core/dtypes/dtypes.py

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
        msg = "a string needs to be passed, got type {typ}"
        raise TypeError(msg.format(typ=type(string)))

+    @property
+    def type(self):


This is the correct scalar type for IntervalArray

add doc-string

TomAugspurger · 2018-04-04T16:52:44Z

pandas/core/indexes/base.py

@@ -270,7 +270,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
                                    **kwargs)

        # interval
-        if is_interval_dtype(data) or is_interval_dtype(dtype):
+        if ((is_interval_dtype(data) or is_interval_dtype(dtype)) and
+                not is_object_dtype(dtype)):


This is so Index(intervals, dtype=object) works correctly (Index, not IntervalIndex)

TomAugspurger · 2018-04-04T16:53:13Z

pandas/core/util/hashing.py

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
    -------
    ndarray of hashed values array, same size as len(c)
    """
-    hashed = hash_array(c.categories.values, encoding, hash_key,
+    # Convert ExtensionArrays to ndarrays
+    values = np.asarray(c.categories.values)


Same as above. Categorica[ExtensionArray].values will be an EA, and this has to be an ndarray.

hmm, shouldn't you need to check if this is an EA here?

I don't think so, asarray won't copy an ndarray, so calling it on an ndarray should be fine.

is_extension_array_dtype takes ~ 2 microseconds, and asarray takes ~10 ns, so better to just always call asarray instead of checking.

do we still need this change? what is it for?

I tried removing it, and test_hash_scalar failed for Interval(0, 1):

pandas/pandas/tests/util/test_hashing.py

Lines 91 to 102 in 1033e8b

def test_hash_scalar(self):

for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),

pd.Timestamp("2012-01-01", tz='Europe/Brussels'),

datetime.datetime(2012, 1, 1),

pd.Timestamp("2012-01-01", tz='EST').to_pydatetime(),

pd.Timedelta('1 days'), datetime.timedelta(1),

pd.Period('2012-01-01', freq='D'), pd.Interval(0, 1),

np.nan, pd.NaT, None]:

result = _hash_scalar(val)

expected = hash_array(np.array([val], dtype=object),

categorize=True)

assert result[0] == expected[0]

TomAugspurger · 2018-04-04T16:55:13Z

pandas/io/packers.py

@@ -402,13 +403,17 @@ def encode(obj):
                    u'freq': u_safe(getattr(obj, 'freqstr', None)),
                    u'tz': tz,
                    u'compress': compressor}
-        elif isinstance(obj, IntervalIndex):
-            return {u'typ': u'interval_index',
+        elif isinstance(obj, (IntervalIndex, IntervalArray)):


Should talk about this more generally in #20612

this is very very tricky from a back-compat perspective.

Mmm, that doesn't sound fun. I'll see what I can do to avoid that.

I wonder, should this be reverted? Do any msgpack tests fail if it is?

The test below fails on the last copied line for Interval data if reverted:

pandas/pandas/tests/io/test_packers.py

Lines 357 to 360 in da6e26d

def test_basic_index(self):

for s, i in self.d.items():

i_rec = self.encode_decode(i)

TomAugspurger · 2018-04-04T16:55:27Z

pandas/tests/dtypes/test_dtypes.py

@@ -546,10 +546,8 @@ def test_basic(self):

        s = Series(ii, name='A')

-        # dtypes


API change.

TomAugspurger · 2018-04-04T16:56:54Z

pandas/tests/indexes/common.py

@@ -886,7 +886,7 @@ def test_hasnans_isnans(self):
                assert not idx.hasnans

                idx = index.copy()
-                values = idx.values
+                values = np.asarray(idx.values)


Made this change since IntervalArray currently isn't mutable. Worth talking about that.

TomAugspurger · 2018-04-04T16:57:22Z

pandas/tests/indexes/interval/test_astype.py

@@ -51,7 +51,7 @@ def test_astype_category(self, index):
        'timedelta64', 'timedelta64[ns]', 'datetime64', 'datetime64[ns]',
        'datetime64[ns, US/Eastern]'])
    def test_astype_cannot_cast(self, index, dtype):
-        msg = 'Cannot cast IntervalIndex to dtype'
+        msg = 'Cannot cast IntervalArray to dtype'


Hmm shouldn't have changed this. Looking into it...

TomAugspurger

High-level summary:

Still have IntervalIndex composed with IntervalArray. This seems the most sensible for me.
IntervalArray currently isn't mutable. That could be changed but will require some care to ensure that .left and .right are modified inplace atomically

TomAugspurger · 2018-04-04T17:02:36Z

pandas/core/indexes/interval.py

    _typ = 'intervalindex'
    _comparables = ['name']
    _attributes = ['name', 'closed']
+    _allow_index_ops = True
+    _exception_rewrite = lambda: rewrite_exception('IntervalArray',


Not sure what people's thoughts are. Without this, the exception would be

In [6]: pd.IntervalIndex(2)

--------------------------------------------------------------------------- TypeError Traceback (most recent call last) TypeError: IntervalArray(...) must be called with a collection of some kind, 2 was passed

This lets the message be IntervalIndex(...).

jreback

see comments -1 unless much more shared code

jreback · 2018-04-04T17:03:56Z

pandas/core/arrays/__init__.py

+from .interval import IntervalArray
+
+
+__all__ = [


you don’t need the all

That's so the #noqa aren't needed.

we don’t do this anywhere else - don’t need introduce new ways of doing things
u less u r fixing it everywhere

jreback · 2018-04-04T17:05:35Z

pandas/core/arrays/interval.py

+        result = IntervalMixin.__new__(cls)
+
+        closed = closed or 'right'
+        left = _ensure_index(left, copy=copy)


this is SO duplicating II
have to do something about this - this is a technical debt nightmare

Did you see IntervalIndex._simple_new? There isn't any duplicated code between the two.

so EA takes a different take on construction that in Index. Where we use _simple_new and _shalllow_copy, but conceptually these are the same things. Would be ok with changing the spellings inside Index to conform this. This is my main objection generally. We spell things one way and a different way in EA, this is so confusing.

@jreback Can you elaborate on this comment (as you say this is your main objection, but I don't fully understand it).

So IntervalArray._simple_new construct from left/right, while IntervalIndex._simple_new now constructs from values (IntervalArray), while previously it constructed from left/right. So yes, this is different between array and index, but, for IntervalIndex it is now more consistent with the base Index._simple_new (which constructs from values), so I think that is a good thing.

jreback · 2018-04-04T17:11:51Z

you copied code from II this is completely he wrong direction to go
instead move it to a shared mixin

TomAugspurger · 2018-04-04T17:15:25Z

instead move it to a shared mixin

Could you maybe demo a single method that you think can be moved to a mixin?

jreback · 2018-04-04T17:19:28Z

this should be an extremely simple wrapper around the Index code
but if u want to make Index subclass EA that’s hard

my point is evey method here should be shared
and if not you have to have a really really good reason

jreback · 2018-04-04T17:21:34Z

pandas/core/indexes/interval.py

-            value itself if False, ``nan``.
-
-            ..versionadded:: 0.23.0
-


what the heck is this?

https://github.com/pandas-dev/pandas/pull/20611/files/9e5fc500d0fb7abd3e270771b138cce905923430#r179213986

jreback · 2018-04-04T17:22:10Z

pandas/core/indexes/interval.py

-        return cls.from_arrays(left, right, closed, name=name, copy=False,
-                               dtype=dtype)
-
-    def to_tuples(self, na_tuple=True):


@TomAugspurger i do see u ripped some shared code out of this
but the top level EA is just adding way too much

TomAugspurger · 2018-04-04T17:57:52Z

my point is evey method here should be shared
and if not you have to have a really really good reason

Again, I really don't think there's any code duplication in the current implementation. We've been discussing Indexes subclassing EAs in #19696, but the prevailing opinion there is that composition is better, specifically #19696 (comment) and #19696 (comment). If you want to weigh in there, then please do.

But assuming we do have IntervalIndex subclass IntervalArray, what does that actually gain? We can eliminate things like

def left(self):
    return self._data.left

Ok... I don't think that's worth the extra time it'll take to rewrite CategoricalIndex and IntervalIndex to use inheritance.

Writable docstrings. Removed lambda as class attribute.

TomAugspurger · 2018-04-04T18:53:48Z

pandas/core/indexes/interval.py

@@ -1347,7 +926,7 @@ def _format_data(self, name=None):
                tail = [formatter(x) for x in self]
                summary = '[{tail}]'.format(tail=', '.join(tail))

-        return summary + self._format_space()
+        return summary + ',' + self._format_space()


note: This was missing a , at the end of the first line.

In [2]: pd.IntervalIndex.from_breaks([1, 2]) Out[2]: IntervalIndex([(1, 2]] # <-- missing comma closed='right', dtype='interval[int64]')

jschendel · 2018-04-04T19:03:13Z

I think also closes #19453, as it achieves the desired end result? At the very least it looks to implement the behavior I wanted from that issue. Will try taking a closer look at the changes here after work.

TomAugspurger · 2018-04-04T19:05:41Z

Thanks, forgot about that issue. Adding to the closed list.

…

On Wed, Apr 4, 2018 at 2:03 PM, jschendel ***@***.***> wrote: I think also closes #19453 <#19453>, as it achieves the desired end result? At the very least it looks to implement the behavior I wanted from that issue. Will try taking a closer look at the changes here after work. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20611 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIupYF34AACMJv8OlV5cGnSZ53l8qks5tlRj2gaJpZM4THJIe> .

codecov · 2018-04-04T20:29:54Z

Codecov Report

❗ No coverage uploaded for pull request base (master@dcbf8b5). Click here to learn what that means.
The diff coverage is 93.44%.

@@            Coverage Diff            @@
##             master   #20611   +/-   ##
=========================================
  Coverage          ?   91.96%           
=========================================
  Files             ?      166           
  Lines             ?    50274           
  Branches          ?        0           
=========================================
  Hits              ?    46232           
  Misses            ?     4042           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.35% <93.44%> (?)`
#single	`42.17% <37.11%> (?)`

Impacted Files	Coverage Δ
pandas/core/dtypes/missing.py	`93.56% <ø> (ø)`
pandas/core/dtypes/dtypes.py	`96.04% <100%> (ø)`
pandas/util/_doctools.py	`12.87% <100%> (ø)`
pandas/util/testing.py	`85.68% <100%> (ø)`
pandas/core/util/hashing.py	`98.36% <100%> (ø)`
pandas/core/arrays/__init__.py	`100% <100%> (ø)`
pandas/core/dtypes/concat.py	`99.18% <100%> (ø)`
pandas/core/indexes/base.py	`96.52% <100%> (ø)`
pandas/core/arrays/categorical.py	`95.94% <100%> (ø)`
pandas/io/packers.py	`88.04% <85.71%> (ø)`
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcbf8b5...385ce59. Read the comment docs.

TomAugspurger · 2018-04-04T20:31:20Z

Did a basic version of IntervalArray.__setitem__ in de96a61.

I think that we have to copy both left and right, to ensure that we're never in a state where only left.__setitem__ succeeds, while right.__setitem__ fails.

jorisvandenbossche

I won't have time this week (offline the next days) to do a thorough review, but +1 on general approach. Thanks a lot for this!

For IntervalIndex, if we want to reduce the lines of code, we could rather easily add the interval-specific methods and attributes (left, right, closed, length, mid, is_non_overlapping_monotonic, ...) in an automatic way (like we add arithmetic methods on DataFrame etc).
Basically the methods that just passthrough the self._data equivalent and could also inherit the docstring automatically.

jorisvandenbossche · 2018-04-04T21:21:56Z

pandas/core/arrays/interval.py

+        return self.values
+
+    def get_values(self):
+        return self.values


Is this method needed?

jorisvandenbossche · 2018-04-04T21:22:10Z

pandas/core/arrays/interval.py

+
+        return self._shallow_copy(new_left, new_right)
+
+    take_nd = take


Don't think this should be needed?

jorisvandenbossche · 2018-04-04T21:24:28Z

pandas/core/arrays/interval.py

+
+    # Conversion
+    @property
+    def values(self):


I would not call this .values (is this property required somewhere else?)
This is basically __array__ ?

jorisvandenbossche · 2018-04-04T21:24:59Z

pandas/core/arrays/interval.py

+            tuples = np.where(~self.isna(), tuples, np.nan)
+        return tuples
+
+    def tolist(self):


Is this needed somewhere?

jorisvandenbossche · 2018-04-04T21:25:13Z

pandas/core/arrays/interval.py

+        # TODO: think about putting this in a parent
+        return self.values.tolist()
+
+    def repeat(self, repeats):


idem, is this needed?

jorisvandenbossche · 2018-04-04T21:31:11Z

pandas/core/indexes/interval.py

-        # Avoid materializing self.values
-        return self.left.size
+        # Avoid materializing ndarray[Interval]
+        return self._data.size


This could in principle be moved to the base Index class (same for shape, itemsize, dtype, ..)

jschendel

A few quick comments; will review more later.

jschendel · 2018-04-05T00:33:36Z

doc/source/whatsnew/v0.23.0.txt

@@ -299,6 +299,41 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
    df['A'].dtype
    df['B'].dtype

+.. _whatsnew_023.enhancements.interval:


_whatsnew_023. --> _whatsnew_0230. and likewise on line 337.

jschendel · 2018-04-05T00:35:33Z

pandas/core/arrays/interval.py

+
+        return self._shallow_copy(left, right)
+
+    def __setitem__(self, key, value):


Should this support the case where value is np.nan? Doesn't look like that currently works.

Yes, it currently fails. I'll see what can be done.

Looks like there's also something strange going on with datetime data:

In [2]: ia = pd.interval_range(pd.Timestamp('20180101'), periods=3).values In [3]: ia Out[3]: IntervalArray([(2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-03, 2018-01-04]], closed='right', dtype='interval[datetime64[ns]]') In [4]: new_iv = pd.Interval(pd.Timestamp('20180101'), pd.Timestamp('20180105')) In [5]: new_iv Out[5]: Interval('2018-01-01', '2018-01-05', closed='right') In [6]: ia[0] = new_iv In [7]: ia Out[7]: --------------------------------------------------------------------------- ValueError: left side of interval must be <= right side In [8]: ia.left Out[8]: DatetimeIndex(['2018-01-01', '2018-01-05', '2018-01-03'], dtype='datetime64[ns]', freq='D') In [9]: ia.right Out[9]: DatetimeIndex(['2018-01-05', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')

jschendel · 2018-04-05T00:38:28Z

pandas/core/arrays/interval.py

+        return self._shallow_copy(left, right)
+
+    def __setitem__(self, key, value):
+        if not (is_interval_dtype(value) or isinstance(value, Interval)):


Should ABCInterval be used for the isinstance check?

jschendel · 2018-04-05T00:47:12Z

pandas/core/arrays/interval.py

+
+        if not isinstance(value, Interval):
+            msg = ("Interval.fillna only supports filling with scalar "
+                   "'value'. Got a '{}' instead of a 'pandas.Interval'."


This message seems a little unclear/confusing. Maybe something like: "IntervalArray.fillna only supports filling with pandas.Interval, got a '{}' instead."

jschendel · 2018-04-05T00:53:09Z

pandas/core/arrays/interval.py

+        self._left = left
+        self._right = right
+
+    def fillna(self, value=None, method=None, limit=None):


There needs to be some extra checks on value here. Right now if I pass an Interval with a different closed, it silently gets changed to match the array's closed:

In [2]: ia = pd.interval_range(0, 4).insert(0, np.nan).values In [3]: ia Out[3]: IntervalArray([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]], closed='right', dtype='interval[float64]') In [4]: ia.fillna(pd.Interval(10, 20, closed='both')) Out[4]: IntervalArray([(10.0, 20.0], (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]], closed='right', dtype='interval[float64]')

Could probably create some type of helper function/method for this, as I'd guess it'd be somewhat of a common thing to do.

jschendel · 2018-04-05T00:58:14Z

pandas/core/arrays/interval.py

+    def __setitem__(self, key, value):
+        if not (is_interval_dtype(value) or isinstance(value, Interval)):
+            msg = "'value' should be an interval type, got {} instead."
+            raise TypeError(msg.format(type(value)))


Looks like generally type(value).__name__ is used in messages instead of just type(value). Not super concerned with this distinction though, just a minor inconsistency I noticed.

jreback

left some comments. But this move shows a lot of rough edges. you had to copy quite a bit of code that is already in the index by base class. I really dont' want to re-invent the wheel. Need to decide what goes to EA base class (but that can't be used in Index ATM), and what should be in a shared EA/Index Mixin. This is getting pretty messy.

jreback · 2018-04-05T15:46:05Z

pandas/compat/__init__.py

@@ -451,3 +451,11 @@ def is_platform_mac():

 def is_platform_32bit():
    return struct.calcsize("P") * 8 < 64
+
+
+class _WritableDoc(type):


should be in pandas.util._decorators

It's not a decorator. Still want it there?

actually we have a pandas.util._doctools, move there

jreback · 2018-04-05T15:47:17Z

pandas/core/arrays/categorical.py

@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None):
        ret = take_1d(self.categories.values, self._codes)
        if dtype and not is_dtype_equal(dtype, self.categories.dtype):
            return np.asarray(ret, dtype)
+        if is_extension_array_dtype(ret):


comment this.

jreback · 2018-04-05T15:47:42Z

pandas/core/dtypes/dtypes.py

@@ -683,6 +682,11 @@ def construct_from_string(cls, string):
        msg = "a string needs to be passed, got type {typ}"
        raise TypeError(msg.format(typ=type(string)))

+    @property
+    def type(self):


add doc-string

jreback · 2018-04-05T15:49:28Z

pandas/core/indexes/interval.py

    _typ = 'intervalindex'
    _comparables = ['name']
    _attributes = ['name', 'closed']
+    _allow_index_ops = True


what is this?

I think a remnant of a failed refactor. Will remove.

jreback · 2018-04-05T15:50:26Z

pandas/core/indexes/interval.py

-            # both left and right are values
-            pass
-
+        result = self._data._shallow_copy(left=left, right=right)


so you are copying then constructing again???

Left & right are shallow copied just like before, and _simple_new is called just like before. Unless I made a mistake (which I don't think I did), this is identical to before.

jreback · 2018-04-05T16:04:04Z

pandas/core/arrays/interval.py

+        right = np.concatenate([interval.right for interval in to_concat])
+        return cls._simple_new(left, right, closed=closed, copy=False)
+
+    def _shallow_copy(self, left=None, right=None):


is this a method in EA? it should be

As long as this is only uses internally within the IntervalArray class, it is not necessarily needed to make this part of the official ExtensionArray interface.

What are the advantages in doing so? (also the signature will be different for each array type)

jreback · 2018-04-05T16:04:19Z

pandas/core/arrays/interval.py

+                                verify_integrity=False)
+
+    # TODO: doc
+    def copy(self, deep=False):


is this in EA?

Yes, copy is part of the extension array interface

jreback · 2018-04-05T16:05:00Z

pandas/core/arrays/interval.py

+        return self.left.itemsize + self.right.itemsize
+
+    def take(self, indices, axis=0, allow_fill=True, fill_value=None,
+             **kwargs):


doc. wonder how of this you can push to EA base clasess (I suspect almost all of it), this is a pure indexing operation.

I might be in favor of adding some basic implementation of take to EA, but in this case, IntervalArray would have to override it anyhow, as it needs to take both from left and right, which is something very specific to intervals.

jreback · 2018-04-05T16:05:28Z

pandas/core/arrays/interval.py

+
+    # Formatting
+
+    def _format_data(self):


this is really a generic formatter Index. should this be in EA?

Or rather somewhere in the formatting code? (and then it could still be called from within a default EA repr)

jreback · 2018-04-05T16:06:55Z

pandas/core/arrays/interval.py

+                head = []
+                tail = [formatter(x) for x in self]
+                summary = '[{tail}]'.format(tail=', '.join(tail))
+


so there is lots of copying of other code because you can't share the Index base things for this. You need prob push some of this code (e.g. formatting) to something like IndexOpsMixin (or the like). and use that here.

jschendel · 2018-07-07T19:58:05Z

Pushed some pre-rebase commits:

Added NA support to __setitem__
Implemented IntervalArray.repeat to fix some failing tests that were complaining about it not existing
Addressed a number of review comments
- Still a few that need to be addressed, but mostly docstring related

TomAugspurger

Overall looks good. I think you're good to merge in master anytime.

jreback · 2018-07-07T20:08:49Z

doc/source/whatsnew/v0.23.0.txt

@@ -299,7 +299,42 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist
    df['A'].dtype
    df['B'].dtype

-.. _whatsnew_023.enhancements.extension:
+.. _whatsnew_0230.enhancements.interval:


this will need changing during the rebase

jreback

looks pretty good.

jreback · 2018-07-08T20:51:16Z

doc/source/whatsnew/v0.24.0.txt

+Storing Interval Data in Series and DataFrame
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Interval data may now be stored in a Series or DataFrame, in addition to an


double backticks on Series/DataFrame.

jreback · 2018-07-08T20:51:24Z

doc/source/whatsnew/v0.24.0.txt

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Interval data may now be stored in a Series or DataFrame, in addition to an
+:class:`IntervalIndex` like before (:issue:`19453`).


before -> previously.

jreback · 2018-07-08T20:51:42Z

doc/source/whatsnew/v0.24.0.txt

+   ser
+   ser.dtype
+
+Previously, these would be cast to a NumPy array of Interval objects. In general,


double backticks around Interval, or use a class ref. & Series

jreback · 2018-07-08T20:52:04Z

doc/source/whatsnew/v0.24.0.txt

+a Series.
+
+Note that the ``.values`` of a Series containing intervals is no longer a NumPy
+array. Rather, it's an ``ExtensionArray``, composed of two arrays ``left`` and


I don't think you need to show the internal impl. (e.g. just the first sentence is ok)

jreback · 2018-07-08T20:52:58Z

doc/source/whatsnew/v0.24.0.txt

+
+   ser.values
+
+To recover the NumPy array of Interval objects, use :func:`numpy.asarray`:


if you are going to have the below section I would not use this here (showing asarray) as its duplicative, just point to the ref for the below section)

jreback · 2018-07-08T20:56:10Z

pandas/core/arrays/interval.py

+        return cls._simple_new(left, right, closed=closed, copy=False)
+
+    def _shallow_copy(self, left=None, right=None, closed=None):
+        if left is None:


can you add a doc-string

jreback · 2018-07-08T20:57:37Z

pandas/core/arrays/interval.py

+    # Formatting
+
+    def _format_data(self):
+


can do in a followup (though prob easy here) to use
format_object_summary from pandas.core.formats.printing as this is vastly simplified

its used generally for all Indexes now

jreback · 2018-07-08T20:58:55Z

pandas/core/util/hashing.py

@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key):
    -------
    ndarray of hashed values array, same size as len(c)
    """
-    hashed = hash_array(c.categories.values, encoding, hash_key,
+    # Convert ExtensionArrays to ndarrays
+    values = np.asarray(c.categories.values)


do we still need this change? what is it for?

jreback · 2018-07-08T20:59:10Z

pandas/tests/extension/base/getitem.py

@@ -240,6 +240,7 @@ def test_reindex_non_na_fill_value(self, data_missing):
        na = data_missing[0]

        array = data_missing._from_sequence([na, valid])
+        print(array)


extraneous print

jreback · 2018-07-08T21:00:19Z

pandas/util/_exceptions.py

+
+
+@contextlib.contextmanager
+def rewrite_exception(old_name, new_name):


what is this for?

This is being used to overwrite exception messages so they display the correct class name. Looks like it's currently just used for IntervalIndex when it calls an underlying IntervalArray method that could raise some with an IntervalArray specific message, e.g. 'category, object, and string subtypes are not supported for IntervalArray'.

I suppose we could try to either make these messages slightly more generic ("IntervalArray" --> "IntervalArray-like" or "Interval classes" maybe) or try implementing an additional common interface.

What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self): # We have access to the correct `type(self).__name__` pass def method(self): try: do_thing() except _IntervalException: self._raise_exception()

does that make sense?

Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

Yeah, I don't really have a preference either. I might end up trying a few things to see what's cleanest.

jschendel · 2018-07-09T23:52:08Z

Addressed the latest set of review comments, aside from the _format_data one (will do in a followup pr or with another set of review comments).

TomAugspurger · 2018-07-10T02:50:52Z

pandas/core/arrays/interval.py

+        if not isinstance(value, ABCInterval):
+            msg = ("'IntervalArray.fillna' only supports filling with a "
+                   "'scalar pandas.Interval'. Got a '{}' instead."
+                   .format(type(value).__name__))


Single quote is in the wrong place. Should be 'pandas.Interval'.

TomAugspurger · 2018-07-10T02:51:16Z

pandas/core/arrays/interval.py

+
+    def __setitem__(self, key, value):
+        # na value: need special casing to set directly on numpy arrays
+        _needs_float_conversion = False


_needs_float_conversion -> needs_float_conversion?

TomAugspurger · 2018-07-10T02:53:54Z

pandas/core/arrays/interval.py

+            raise ValueError("Intervals must all be closed on the same side.")
+        closed = closed.pop()
+
+        # TODO: avoid intermediate list


This TODO (which I think I added) seems unavoidable, right?

Yeah, I don't think it can be avoided. Removed the TODO. IIUC the intermediate list doesn't copy data, so not a big deal (np.concatenate does of course).

TomAugspurger · 2018-07-10T02:55:32Z

pandas/core/arrays/interval.py

+        """
+
+    @Appender(_interval_shared_docs['set_closed'] % _shared_docs_kwargs)
+    def set_closed(self, closed):


Should we make IntervalArray.closed a setter as well?

I don't think this would be a problem with the current implementation, but #19371 makes me a bit hesitant since it'd make a closed setter a (limited) dtype setter.

TomAugspurger · 2018-07-10T03:00:18Z

pandas/io/packers.py

@@ -402,13 +403,17 @@ def encode(obj):
                    u'freq': u_safe(getattr(obj, 'freqstr', None)),
                    u'tz': tz,
                    u'compress': compressor}
-        elif isinstance(obj, IntervalIndex):
-            return {u'typ': u'interval_index',
+        elif isinstance(obj, (IntervalIndex, IntervalArray)):


I wonder, should this be reverted? Do any msgpack tests fail if it is?

TomAugspurger · 2018-07-10T03:04:06Z

pandas/util/_exceptions.py

+
+
+@contextlib.contextmanager
+def rewrite_exception(old_name, new_name):


What about something like defining a custom exception type in arrays/interval.py and then

def _raise_exception(self): # We have access to the correct `type(self).__name__` pass def method(self): try: do_thing() except _IntervalException: self._raise_exception()

does that make sense?

TomAugspurger · 2018-07-10T03:04:30Z

pandas/util/_exceptions.py

+
+
+@contextlib.contextmanager
+def rewrite_exception(old_name, new_name):


Or we could just use IntervalArray in the exception message. Doesn't really matter to me.

jreback · 2018-07-10T10:10:52Z

doc/source/basics.rst

+* :ref:`Categorical <categorical>`
+* :ref:`Datetime with Timezone <timeseries.timezone_series>`
+* :ref:`Period <timeseries.periods>`
+* Interval


do we have a ref for this? (maybe should create one if not)?

jreback · 2018-07-10T10:11:58Z

doc/source/whatsnew/v0.24.0.txt

@@ -345,6 +414,7 @@ Interval
 ^^^^^^^^

 - Bug in the :class:`IntervalIndex` constructor where the ``closed`` parameter did not always override the inferred ``closed`` (:issue:`19370`)
+- Bug in the ``IntervalIndex`` repr where a trailing comma was missing after the list of intervals (:issue`20611`)


need extra colon

jreback · 2018-07-10T10:13:35Z

pandas/core/arrays/interval.py

+                    copy=False, dtype=None, verify_integrity=True):
+        result = IntervalMixin.__new__(cls)
+
+        closed = closed or 'right'


I believe closed should always be not-None here by-definition?

jreback · 2018-07-10T10:16:44Z

pandas/core/arrays/interval.py

+                raise TypeError(msg.format(type(value)))
+
+        # Need to ensure that left and right are updated atomically, so we're
+        # forced to copy, update the copy, and swap in the new values.


not sure I am clear why this is problematic. is there a failure possible when updating the 2nd one?

Without the forcing the copy I'm seeing test_setitem_sequence fail:

pandas/pandas/tests/extension/base/setitem.py

Lines 17 to 23 in 1dd05cc

def test_setitem_sequence(self, data):

arr = pd.Series(data)

original = data.copy()

arr[[0, 1]] = [data[1], data[0]]

assert arr[0] == original[1]

assert arr[1] == original[0]

jreback · 2018-07-10T10:16:57Z

pandas/core/arrays/interval.py

+        self._right = right
+
+    def fillna(self, value=None, method=None, limit=None):
+        if method is not None:


ideally add a doc-string

jreback · 2018-07-10T10:24:41Z

pandas/tests/extension/interval/test_interval.py

+class TestReshaping(BaseInterval, base.BaseReshapingTests):
+    pass
+
+


I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here

I don't follow. I didn't add any tests to the base class, so not sure what should be moved here?

TomAugspurger · 2018-07-13T17:50:37Z

@jschendel how do you feel about this? IMO, things looked quite nice the last time I looked. Are the outstanding issues

Sharing printing between Index and IntervalArray
(Maybe) changing rewrite_exception to use a method to handle raising with the right name?

Anything else? Are we comfortable doing those as followups? (fine by me)

jschendel · 2018-07-13T20:10:02Z

Yes, I believe those are the only outstanding issues, aside from the question I had about @jreback's last comment. I'm fine with doing those items as followups.

TomAugspurger · 2018-07-13T20:11:31Z

Cool, +1 from me then.

…

On Fri, Jul 13, 2018 at 3:10 PM, Jeremy Schendel ***@***.***> wrote: Yes, I believe those are the only outstanding issues, aside from the question I had about @jreback <https://github.com/jreback>'s last comment. I'm fine with doing those items as followups. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#20611 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIl209JtiUWQNdHlT5hf3MW1J9F5Kks5uGP6ggaJpZM4THJIe> .

jreback · 2018-07-13T20:21:46Z

@jschendel what comment? i am ok with merging and doing follow ups as well

jschendel · 2018-07-13T20:31:40Z

@jreback : #20611 (comment)

"I suppose should move some of these tests to pandas/tests/arrays/interval, maybe leave the extension API specific ones here"

Regardless, doesn't seem like something that would block this from being merged, and could be handled in a followup too.

jorisvandenbossche · 2018-07-24T08:54:33Z

pandas/core/arrays/interval.py

+
+    @property
+    def itemsize(self):
+        return self.left.itemsize + self.right.itemsize


@jschendel a bit after the fact, but I don't think that we should add this to IntervalArray. We deprecated itemsize some time ago for Series/Index (#20721).
Just noticed this because we get the warning during the test:

pandas/tests/indexes/interval/test_interval.py::TestIntervalIndex::()::test_itemsize /home/travis/build/pandas-dev/pandas/pandas/core/arrays/interval.py:694: FutureWarning: Int64Index.itemsize is deprecated and will be removed in a future version return self.left.itemsize + self.right.itemsize

This was originally part of IntervalIndex and is currently what IntervalIndex calls under the hood, so if we remove it from IntervalArray we'd need to add it back IntervalIndex. Can certainly do that, unless there's some objection based on it making the two inconsistent. Removing it from IntervalArray seems like the best course of action though.

Yes, removing it from IntervalArray is the best way forward I think.
Also, on IntervalIndex it should be deprecated (now it is deprecated on the base IndexOpsMixin class, but by overriding it in IntervalIndex, it lost that deprecation. But apparently IntervalIndex is the only index that is not tested in the base tests, so this 'regression' was not catched by the tests).

Opened #22049 so we don't loose track of this discussion

Co-authored-by: Jeremy Schendel <jschendel@users.noreply.github.com>

qwhelan · 2018-11-08T18:05:13Z

A run of asv find has identified this PR as resulting in a ~1000x slowdown in the indexing.IntervalIndexing.time_loc_list benchmark:

asv find 7e6e7^..HEAD indexing.IntervalIndexing.time_loc_list
[...]
·· Installing 365eac4d <master~642> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                      2.27±0ms
· Testing -----------------------------------O>-----------------------
·· For pandas commit f3160bfa <master~639>:
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Building f3160bfa <master~639> for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt............................................................................................................
·· Installing f3160bfa <master~639> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                    2.90±0.03s
· Testing -----------------------------------O------------------------
·· For pandas commit 610a19a3 <master~641>:
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Building 610a19a3 <master~641> for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.................................................................................................................
·· Installing 610a19a3 <master~641> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Setting up indexing.py:220                                                                                                                                                                         ok
··· Running (indexing.IntervalIndexing.time_loc_list--).
··· indexing.IntervalIndexing.time_loc_list                                                                                                                                                    2.92±0.01s.
· Greatest regression found: 610a19a3 <master~641>

Bisection done on Arch Linux and reproducible on Ubuntu 18.04.

REF: IntervalIndex[IntervalArray]

9b8564f

Closes pandas-dev#19209

TomAugspurger added this to the 0.23.0 milestone Apr 4, 2018

TomAugspurger added the Interval Interval data type label Apr 4, 2018

TomAugspurger commented Apr 4, 2018

View reviewed changes

fixup! REF: IntervalIndex[IntervalArray]

9e5fc50

jreback requested changes Apr 4, 2018

View reviewed changes

TomAugspurger added 2 commits April 4, 2018 13:10

fixup! fixup! REF: IntervalIndex[IntervalArray]

abb8a45

Py2 compat

4e48e88

Writable docstrings. Removed lambda as class attribute.

TomAugspurger mentioned this pull request Apr 4, 2018

BUG/PERF: Fixed IntervalIndex.nbytes and itemsize #20600

Closed

TomAugspurger commented Apr 4, 2018

View reviewed changes

DOC: Added note about missing comma

11d97db

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 4, 2018

Added IntervalArray.__setitem__

de96a61

jorisvandenbossche reviewed Apr 4, 2018

View reviewed changes

Fixed linting

24db222

jschendel reviewed Apr 5, 2018

View reviewed changes

jreback requested changes Apr 5, 2018

View reviewed changes

Fix for Joris

d2bb35a

jschendel added 3 commits July 7, 2018 10:05

cleanup tests

f82df72

additional review fixups

e0fe0bc

rename exceptions.py -> _exceptions.py

7c0ffd3

TomAugspurger commented Jul 7, 2018

View reviewed changes

jreback approved these changes Jul 7, 2018

View reviewed changes

jschendel added 3 commits July 7, 2018 18:10

resolve merge conflicts

e89d89a

fix breaking tests after merge

a33940a

move test_interval

382737d

jreback requested changes Jul 8, 2018

View reviewed changes

post-rebase review fixes

95f8f15

unused import

e82eeb6

TomAugspurger commented Jul 10, 2018

View reviewed changes

more review fixes

bccb4f7

jreback requested changes Jul 10, 2018

View reviewed changes

jschendel added 2 commits July 10, 2018 11:59

more review fixes

99ab41f

undo parametrize in test_hasing.py and move to separate PR

385ce59

jschendel merged commit 610a19a into pandas-dev:master Jul 13, 2018

jorisvandenbossche reviewed Jul 24, 2018

View reviewed changes

jorisvandenbossche mentioned this pull request Jul 25, 2018

DEPR: deprecated IntervalIndex.itemsize, remove IntervalArray.itemsize #22049

Closed

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

REF: IntervalIndex[IntervalArray] (pandas-dev#20611)

93146a2

Co-authored-by: Jeremy Schendel <jschendel@users.noreply.github.com>

TomAugspurger deleted the ea-ii branch November 8, 2018 18:16

TomAugspurger mentioned this pull request Nov 8, 2018

PERF: Slowdown in IntervalIndex.get_loc #23576

Closed

	def test_hash_scalar(self):
	for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"),
	pd.Timestamp("2012-01-01", tz='Europe/Brussels'),
	datetime.datetime(2012, 1, 1),
	pd.Timestamp("2012-01-01", tz='EST').to_pydatetime(),
	pd.Timedelta('1 days'), datetime.timedelta(1),
	pd.Period('2012-01-01', freq='D'), pd.Interval(0, 1),
	np.nan, pd.NaT, None]:
	result = _hash_scalar(val)
	expected = hash_array(np.array([val], dtype=object),
	categorize=True)
	assert result[0] == expected[0]

	def test_basic_index(self):

	for s, i in self.d.items():
	i_rec = self.encode_decode(i)

		@@ -546,10 +546,8 @@ def test_basic(self):

		s = Series(ii, name='A')

		# dtypes


		return self._shallow_copy(new_left, new_right)

		take_nd = take


		return self._shallow_copy(left, right)

		def __setitem__(self, key, value):


		ser.values

		To recover the NumPy array of Interval objects, use :func:`numpy.asarray`:



		@contextlib.contextmanager
		def rewrite_exception(old_name, new_name):

	def test_setitem_sequence(self, data):
	arr = pd.Series(data)
	original = data.copy()

	arr[[0, 1]] = [data[1], data[0]]
	assert arr[0] == original[1]
	assert arr[1] == original[0]

		class TestReshaping(BaseInterval, base.BaseReshapingTests):
		pass

REF: IntervalIndex[IntervalArray] #20611

REF: IntervalIndex[IntervalArray] #20611

Conversation

TomAugspurger commented Apr 4, 2018 • edited Loading

pep8speaks commented Apr 4, 2018 • edited Loading

Comment last updated on July 13, 2018 at 01:01 Hours UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel Jul 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Apr 4, 2018

TomAugspurger commented Apr 4, 2018

jreback commented Apr 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Apr 4, 2018

Choose a reason for hiding this comment

jschendel commented Apr 4, 2018

TomAugspurger commented Apr 4, 2018 via email

codecov bot commented Apr 4, 2018 • edited Loading

Codecov Report

TomAugspurger commented Apr 4, 2018

jorisvandenbossche left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel Apr 5, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Apr 4, 2018 •

edited

Loading

pep8speaks commented Apr 4, 2018 •

edited

Loading

jschendel Jul 9, 2018 •

edited

Loading

codecov bot commented Apr 4, 2018 •

edited

Loading

jorisvandenbossche left a comment •

edited

Loading

jschendel Apr 5, 2018 •

edited

Loading