BUG/API: prohibit dtype-changing IntervalArray.__setitem__ #32782

jbrockmendel · 2020-03-17T19:21:30Z

~~closes BUG: IntervalArray.__setitem__ creates copies incorrectly #27147~~ Not quite
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Still needs a dedicated test, but this is also a non-trivial API change, so I want to get the ball rolling on discussion. cc @jschendel

WillAyd · 2020-03-19T00:02:10Z

Is this linking back to the correct issue? The linkage there isn't immediately obvious to me (may be lack of understanding)

jreback

lgtm. cc @jschendel

jbrockmendel · 2020-03-19T02:03:52Z

> Is this linking back to the correct issue? The linkage there isn't immediately obvious to me (may be lack of understanding)

#27147 is about the data backing an IntervalArray (a pair of Index objects) being swapped out under certain __setitem__ calls, which breaks existing views. The __setitem__ calls that cause this problem are exactly the ones that are dtype-changing, which this PR disallows.
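
To make the view breakage concrete, here is a minimal NumPy-only sketch. `ToyIntervals` and its methods are hypothetical stand-ins, not the actual `IntervalArray` implementation; the assumption is only that a container holds two backing arrays and that a dtype-changing set replaces them with new arrays via `astype`.

```python
import numpy as np


class ToyIntervals:
    """Hypothetical stand-in: a container backed by a pair of arrays."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def view(self):
        # A view shares the same backing arrays as its parent.
        return ToyIntervals(self.left, self.right)

    def set_in_place(self, i, lo, hi):
        # Writes into the existing arrays, so views observe the change.
        self.left[i] = lo
        self.right[i] = hi

    def set_nan_by_upcasting(self, i):
        # Dtype-changing path: astype returns NEW arrays, so the backing
        # data is swapped out and existing views are left behind.
        self.left = self.left.astype("float64")
        self.right = self.right.astype("float64")
        self.left[i] = np.nan
        self.right[i] = np.nan


orig = ToyIntervals(np.array([0, 1, 2]), np.array([1, 2, 3]))
v = orig.view()

orig.set_in_place(0, 10, 11)
in_place_visible = bool(v.left[0] == 10)

orig.set_nan_by_upcasting(1)
still_shared = bool(np.shares_memory(orig.left, v.left))
view_kind = v.left.dtype.kind  # the view keeps the old integer arrays
```

After the upcasting call, `orig` and `v` no longer share memory, which is the kind of silent divergence that raising on dtype-changing sets avoids.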

jschendel

This should probably have a whatsnew entry since it's a breaking change.

jschendel · 2020-03-31T00:40:17Z

pandas/core/arrays/interval.py

@@ -543,7 +543,11 @@ def __setitem__(self, key, value):
                msg = f"'value' should be an interval type, got {type(value)} instead."
                raise TypeError(msg) from err

+        if needs_float_conversion:
+            raise ValueError("Cannot set float values for integer-backed IntervalArray")


The needs_float_conversion name might be a bit misleading here because it's not triggered when setting arbitrary float values but rather just when np.nan is included, so we're not generically disallowing setting float values. Maybe a more accurate error message would be along the lines of "Cannot set float NaN to integer-backed IntervalArray".

FWIW the behavior when setting float values without np.nan included is to truncate the float and just use the integer component, which is consistent with numpy's behavior:

In [1]: import pandas as pd; import numpy as np

In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))

In [3]: ia[0] = pd.Interval(0.9, 1.1)

In [4]: ia
Out[4]: 
<IntervalArray>
[(0, 1], (1, 2], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64]

In [5]: a = np.arange(4)

In [6]: a[0] = 0.9

In [7]: a[1] = 1.1

In [8]: a
Out[8]: array([0, 1, 2, 3])

I'm not crazy about that behavior but I'm fine with it since it's consistent with numpy.
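
For contrast, a small NumPy-only sketch of the two cases discussed here: finite floats are truncated on assignment into an integer array, while NaN is rejected. This reflects my understanding of NumPy's scalar-assignment behavior and is not taken from the PR itself.

```python
import numpy as np

a = np.arange(4)  # integer-backed array

# Finite floats: the fractional part is dropped on assignment.
a[0] = 0.9
a[1] = 1.1
truncated = a.tolist()

# NaN has no integer representation, so NumPy refuses the assignment.
try:
    a[2] = np.nan
    nan_raised = False
except ValueError:
    nan_raised = True

after_failed_set = a.tolist()  # the array is left unchanged
```

So raising only when NaN is involved keeps the integer-backed case in line with what NumPy already does for a scalar NaN.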

jschendel · 2020-03-31T01:01:20Z

pandas/core/arrays/interval.py

@@ -543,7 +543,11 @@ def __setitem__(self, key, value):
                msg = f"'value' should be an interval type, got {type(value)} instead."
                raise TypeError(msg) from err

+        if needs_float_conversion:


Since we're raising here we can remove the future references that are no longer being used:

pandas/pandas/core/arrays/interval.py

Lines 554 to 555 in 23b6b93

if needs_float_conversion:
    left = left.astype("float")

pandas/pandas/core/arrays/interval.py

Lines 560 to 561 in 23b6b93

if needs_float_conversion:
    right = right.astype("float")

jschendel · 2020-03-31T01:25:00Z

I'm fine with prohibiting dtype changing here. Seems more consistent with existing behavior in pandas/numpy and makes the logic easier.

It looks like the existing needs_float_conversion logic is incomplete though and only handles the scalar case. Setting a slice with a list or IntervalArray containing np.nan doesn't raise or change dtype but instead takes the integer sentinel value (not sure if that's the right term?):

In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))

In [3]: ia[:2] = [np.nan, pd.Interval(0, 5)]

In [4]: ia
Out[4]: 
<IntervalArray>
[(-9223372036854775808, -9223372036854775808], (0, 5], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64]

I think we can address this by shifting the logic around a little bit. The following diff addresses the issue locally for me and at first glance doesn't appear to break anything:

diff --git a/pandas/core/arrays/interval.py b/pandas/core/arrays/interval.py
index 22ce5a6f8..2e6e4bd0c 100644
--- a/pandas/core/arrays/interval.py
+++ b/pandas/core/arrays/interval.py
@@ -513,18 +513,15 @@ class IntervalArray(IntervalMixin, ExtensionArray):
         return self._shallow_copy(left, right)
 
     def __setitem__(self, key, value):
-        # na value: need special casing to set directly on numpy arrays
-        needs_float_conversion = False
         if is_scalar(value) and isna(value):
-            if is_integer_dtype(self.dtype.subtype):
-                # can't set NaN on a numpy integer array
-                needs_float_conversion = True
-            elif is_datetime64_any_dtype(self.dtype.subtype):
+            if is_datetime64_any_dtype(self.dtype.subtype):
                 # need proper NaT to set directly on the numpy array
                 value = np.datetime64("NaT")
             elif is_timedelta64_dtype(self.dtype.subtype):
                 # need proper NaT to set directly on the numpy array
                 value = np.timedelta64("NaT")
+            else:
+                value = np.nan
             value_left, value_right = value, value
 
         # scalar interval
@@ -542,18 +539,18 @@ class IntervalArray(IntervalMixin, ExtensionArray):
                 msg = f"'value' should be an interval type, got {type(value)} instead."
                 raise TypeError(msg) from err
 
+        if is_integer_dtype(self.dtype.subtype) and np.any(isna(value_left)):
+            raise ValueError("Cannot set float NaN to integer-backed IntervalArray")
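
The idea behind moving the check after value parsing can be sketched without pandas. `check_settable` below is a hypothetical helper, not a pandas function; it assumes the incoming endpoints can be coerced to a float array, and it applies the same NaN test to scalars and list-likes alike.

```python
import numpy as np


def check_settable(backing, new_values):
    """Hypothetical helper: reject NaN for integer-backed data.

    Works for both a scalar and a list-like, which is the gap in the
    scalar-only needs_float_conversion flag.
    """
    new_values = np.asarray(new_values, dtype="float64")
    if backing.dtype.kind in "iu" and np.isnan(new_values).any():
        raise ValueError("Cannot set float NaN to integer-backed IntervalArray")


def raises_value_error(backing, new_values):
    # Small wrapper so each outcome can be recorded as a boolean.
    try:
        check_settable(backing, new_values)
    except ValueError:
        return True
    return False


int_left = np.arange(4)
float_left = int_left.astype("float64")

scalar_nan_on_int = raises_value_error(int_left, np.nan)
list_nan_on_int = raises_value_error(int_left, [np.nan, 0])
clean_list_on_int = raises_value_error(int_left, [0, 5])
scalar_nan_on_float = raises_value_error(float_left, np.nan)
```

Running the check once on the already-parsed left endpoints, rather than flagging the scalar case up front, is what lets the list and `IntervalArray` inputs hit the same error path.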

jreback · 2020-04-06T22:18:17Z

@jbrockmendel can you rebase, and do @jschendel's comments make sense?

jbrockmendel · 2020-04-07T00:49:46Z

rebased+green; addressed some comments by @jschendel, commented above on reasons for punting on others

jreback · 2020-04-10T16:17:40Z

thanks @jbrockmendel and @jschendel

jbrockmendel added 7 commits February 26, 2020 17:53

BUG: disallow changing IntervalArray backing data

47202dd

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

9f21c62

…tervalarray

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

477c441

…tervalarray

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

bdaf49b

…tervalarray

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

8735b58

…tervalarray

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

755679c

…tervalarray

update tests

23b6b93

jreback added the Indexing, Interval, and Bug labels Mar 19, 2020

jreback added this to the 1.1 milestone Mar 19, 2020

jreback requested a review from jschendel March 19, 2020 00:36

jreback approved these changes Mar 19, 2020

jschendel reviewed Mar 31, 2020

jbrockmendel added 2 commits April 5, 2020 18:24

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

3e5826c

…tervalarray

whatsnew, exception message

934dabe

Merge branch 'master' of https://github.com/pandas-dev/pandas into in…

959f674

…tervalarray

jreback merged commit 4334482 into pandas-dev:master Apr 10, 2020

jbrockmendel deleted the intervalarray branch April 10, 2020 17:25

jbrockmendel mentioned this pull request Apr 10, 2020

BUG: IntervalArray.__setitem__ creates copies incorrectly #27147

Closed
