Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: prohibit dtype-changing IntervalArray.__setitem__ #32782

Merged
merged 10 commits into from
Apr 10, 2020

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Mar 17, 2020

Still needs a dedicated test, but this is also a non-trivial API change, so I want to get the ball rolling on discussion. cc @jschendel

@WillAyd
Copy link
Member

WillAyd commented Mar 19, 2020

Is this linking back to the correct issue? The linkage there isn't immediately obvious to me (may be lack of understanding)

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type Bug labels Mar 19, 2020
@jreback jreback added this to the 1.1 milestone Mar 19, 2020
@jreback jreback requested a review from jschendel March 19, 2020 00:36
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. cc @jschendel

@jbrockmendel
Copy link
Member Author

Is this linking back to the correct issue? The linkage there isn't immediately obvious to me (may be lack of understanding)

#27147 is about the data backing an IntervalArray (a pair of Index objects) being swapped out under certain __setitem__ calls, which breaks existing views. The __setitem__ calls that cause this problem are exactly the ones that are dtype-changing, which this PR disallows.

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably have a whatsnew entry since it's a breaking change.

@@ -543,7 +543,11 @@ def __setitem__(self, key, value):
msg = f"'value' should be an interval type, got {type(value)} instead."
raise TypeError(msg) from err

if needs_float_conversion:
raise ValueError("Cannot set float values for integer-backed IntervalArray")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The needs_float_conversion name might be a bit misleading here because it's not triggered when setting arbitrary float values but rather just when np.nan is included, so we're not generically disallowing setting float values. Maybe a more accurate error message would be along the lines "Cannot set float NaN to integer-backed IntervalArray".

FWIW the behavior when setting float values without np.nan included is to truncate the float and just use the integer component, which is consistent with numpy's behavior:

In [1]: import pandas as pd; import numpy as np

In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))

In [3]: ia[0] = pd.Interval(0.9, 1.1)

In [4]: ia
Out[4]: 
<IntervalArray>
[(0, 1], (1, 2], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64]

In [5]: a = np.arange(4)

In [6]: a[0] = 0.9

In [7]: a[1] = 1.1

In [8]: a
Out[8]: array([0, 1, 2, 3])

I'm not crazy about that behavior but I'm fine with it since it's consistent with numpy.

@@ -543,7 +543,11 @@ def __setitem__(self, key, value):
msg = f"'value' should be an interval type, got {type(value)} instead."
raise TypeError(msg) from err

if needs_float_conversion:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're raising here we can remove the future references that are no longer being used:

if needs_float_conversion:
left = left.astype("float")

if needs_float_conversion:
right = right.astype("float")

@jschendel
Copy link
Member

jschendel commented Mar 31, 2020

I'm fine with prohibiting dtype changing here. Seems more consistent with existing behavior in pandas/numpy and makes the logic easier.

It looks like the existing needs_float_conversion logic is incomplete though and only handles the scalar case. Setting a slice with a list or IntervalArray containing np.nan doesn't raise or change dtype but instead takes the integer sentinel value (not sure if that's the right term?):

In [2]: ia = pd.arrays.IntervalArray.from_breaks(range(5))

In [3]: ia[:2] = [np.nan, pd.Interval(0, 5)]

In [4]: ia
Out[4]: 
<IntervalArray>
[(-9223372036854775808, -9223372036854775808], (0, 5], (2, 3], (3, 4]]
Length: 4, closed: right, dtype: interval[int64]

I think we can address this by shifting the logic around a little bit. The following diff addresses the issue locally for me and at first glance don't appear to break anything:

diff --git a/pandas/core/arrays/interval.py b/pandas/core/arrays/interval.py
index 22ce5a6f8..2e6e4bd0c 100644
--- a/pandas/core/arrays/interval.py
+++ b/pandas/core/arrays/interval.py
@@ -513,18 +513,15 @@ class IntervalArray(IntervalMixin, ExtensionArray):
         return self._shallow_copy(left, right)
 
     def __setitem__(self, key, value):
-        # na value: need special casing to set directly on numpy arrays
-        needs_float_conversion = False
         if is_scalar(value) and isna(value):
-            if is_integer_dtype(self.dtype.subtype):
-                # can't set NaN on a numpy integer array
-                needs_float_conversion = True
-            elif is_datetime64_any_dtype(self.dtype.subtype):
+            if is_datetime64_any_dtype(self.dtype.subtype):
                 # need proper NaT to set directly on the numpy array
                 value = np.datetime64("NaT")
             elif is_timedelta64_dtype(self.dtype.subtype):
                 # need proper NaT to set directly on the numpy array
                 value = np.timedelta64("NaT")
+            else:
+                value = np.nan
             value_left, value_right = value, value
 
         # scalar interval
@@ -542,18 +539,18 @@ class IntervalArray(IntervalMixin, ExtensionArray):
                 msg = f"'value' should be an interval type, got {type(value)} instead."
                 raise TypeError(msg) from err
 
+        if is_integer_dtype(self.dtype.subtype) and np.any(isna(value_left)):
+            raise ValueError("Cannot set float NaN to integer-backed IntervalArray")

@jreback
Copy link
Contributor

jreback commented Apr 6, 2020

@jbrockmendel can you rebase and does @jschendel comments make sense?

@jbrockmendel
Copy link
Member Author

rebased+green; addressed some comments by @jschendel, commented above on reasons for punting on others

@jreback jreback merged commit 4334482 into pandas-dev:master Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

thanks @jbrockmendel and @jschendel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: IntervalArray.__setitem__ creates copies incorrectly
4 participants