BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

jschendel · 2017-11-15T07:12:41Z

closes BUG: IntervalIndex.insert does not work for inserting NaN #18295
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2017-11-15T07:24:36Z

pandas/core/indexes/interval.py

+            right_insert = item.right
+        elif is_scalar(item) and isna(item):
+            # GH 18295
+            left_insert = right_insert = item


yes u need to use a nan compat with left iow this could be a NaT

i think we have a left.na_value iirc (might be spelled differently)

or maybe the underlying already handles this in the insert

Updated. Procedure I'm following is "check if any type of NA value is passed -> raise if the wrong type of NA is passed". I suppose I could just bypass this and only check if the right type of NA is passed, if that would be preferred.

jreback · 2017-11-15T07:25:46Z

pandas/tests/indexes/test_interval.py

@@ -255,6 +255,11 @@ def test_insert(self):
        pytest.raises(ValueError, self.index.insert, 0,
                      Interval(2, 3, closed='left'))

+        # GH 18295
+        expected = self.index_with_nan


can u add a fixture that hits multiple kinds of left (float,int,datetime); might be able to do this more generally in the interval tests

codecov · 2017-11-15T07:44:41Z

Codecov Report

Merging #18300 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18300      +/-   ##
==========================================
- Coverage    91.4%   91.38%   -0.02%     
==========================================
  Files         164      164              
  Lines       49880    49884       +4     
==========================================
- Hits        45592    45587       -5     
- Misses       4288     4297       +9

Flag	Coverage Δ
#multiple	`89.19% <100%> (ø)`	⬆️
#single	`39.41% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/interval.py	`92.9% <100%> (+0.05%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.8% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 148ed63...fbfe3ab. Read the comment docs.

codecov · 2017-11-15T07:44:49Z

Codecov Report

Merging #18300 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18300      +/-   ##
==========================================
- Coverage   91.34%   91.32%   -0.02%     
==========================================
  Files         163      163              
  Lines       49717    49727      +10     
==========================================
+ Hits        45413    45414       +1     
- Misses       4304     4313       +9

Flag	Coverage Δ
#multiple	`89.12% <100%> (ø)`	⬆️
#single	`40.72% <10.52%> (+0.14%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.2% <100%> (ø)`	⬆️
pandas/core/indexes/interval.py	`92.69% <100%> (+0.05%)`	⬆️
pandas/core/indexes/datetimes.py	`95.5% <100%> (ø)`	⬆️
pandas/core/indexes/base.py	`96.44% <100%> (ø)`	⬆️
pandas/core/indexes/timedeltas.py	`91.21% <100%> (+0.03%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.8% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 412988e...e401032. Read the comment docs.

jreback

can do a pre-cursor PR to fix .insert or can do here.

jreback · 2017-11-16T11:29:49Z

pandas/core/indexes/interval.py

+            right_insert = item.right
+        elif is_scalar(item) and isna(item):
+            # GH 18295
+            if item is not self.left._na_value:


so this needs a little work in Base.insert (which works, though the values should be inserted as _na_value). and doesn't work with any datetimelikes (e.g. DatetimeIndex). can you make the fix for this, rather than here?

jreback · 2017-11-16T11:30:23Z

pandas/tests/indexes/test_interval.py

@@ -255,6 +255,30 @@ def test_insert(self):
        pytest.raises(ValueError, self.index.insert, 0,
                      Interval(2, 3, closed='left'))

+    @pytest.mark.parametrize('data', [


prob need some more tests for .insert with (None, np.nan, pd.NaT as the inserted value) generally

jreback · 2017-11-16T11:31:27Z

doc/source/whatsnew/v0.21.1.txt

@@ -62,6 +62,7 @@ Bug Fixes
 - Bug in ``pd.Series.rolling.skew()`` and ``rolling.kurt()`` with all equal values has floating issue (:issue:`18044`)
 - Bug in ``pd.DataFrameGroupBy.count()`` when counting over a datetimelike column (:issue:`13393`)
 - Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
+- Bug in ``IntervalIndex.insert`` when attempting to insert ``NaN`` (:issue:`18295`)


move to 0.22, this is a bit more invasive.

jschendel · 2017-11-17T07:34:01Z

Made the changes at Base.insert, and went through and made sure each type of index was behaving consistently.

jreback

nice fixes and tests. some comments.

jreback · 2017-11-17T11:25:43Z

pandas/core/indexes/base.py

@@ -3728,6 +3728,10 @@ def insert(self, loc, item):
        -------
        new_index : Index
        """
+        if lib.checknull(item):


you can just use isna()

The reason I'm using checknull over isna is to guard against non-scalar item.

Using isna by itself in an if statement would fail for non-scalar item with an unhelpful error message, whereas checknull returns False for any non-scalar:

In [8]: checknull(['a', 'b']) Out[8]: False In [9]: bool(isna(['a', 'b'])) --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-9-358fb1312857> in <module>() ----> 1 bool(isna(['a', 'b'])) ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The way around this with isna would be to do something like: is_scalar(item) and isna(item). However, looking at the implementation of isna, this just forces it down a path to hit checknull:

pandas/pandas/core/dtypes/missing.py

Lines 51 to 53 in cfad581

def _isna_new(obj):

if is_scalar(obj):

return lib.checknull(obj)

So is_scalar(item) and isna(item) is equivalent to checknull(item) but with the additional overhead of two calls to is_scalar.

Seems more efficient to just use checknull, but can switch to isna if that's preferred stylistically. Or maybe I'm overlooking a better way to using isna?

As an aside, might be nice to do a checknull -> checkna renaming in the spirit of isnull -> isna.

yeah should rename that as well.

yeah usually I would just do: is_scalar(..) and is_na(...) to make this explicit.

lib.checknull is pretty internal and don't want to use it in this code

jreback · 2017-11-17T11:25:54Z

pandas/core/indexes/category.py

@@ -688,7 +688,7 @@ def insert(self, loc, item):

        """
        code = self.categories.get_indexer([item])
-        if (code == -1):
+        if (code == -1) and not lib.checknull(item):


use isna() (or notna()

jreback · 2017-11-17T11:26:03Z

pandas/core/indexes/datetimes.py

@@ -1751,6 +1751,9 @@ def insert(self, loc, item):
        -------
        new_index : Index
        """
+        if lib.checknull(item):


jreback · 2017-11-17T11:26:25Z

pandas/core/indexes/interval.py

+                                 'side as the index')
+            left_insert = item.left
+            right_insert = item.right
+        elif lib.checknull(item):


jreback · 2017-11-17T11:26:39Z

pandas/core/indexes/timedeltas.py


        freq = None
-        if isinstance(item, Timedelta) or item is NaT:
+        if isinstance(item, Timedelta) or (item is self._na_value):


use isna() here as well

jreback · 2017-11-17T11:27:46Z

pandas/tests/indexes/test_numeric.py

+    def test_insert(self):
+        # GH 18295 (test missing)
+        expected = UInt64Index([0, 0, 1, 2, 3, 4])
+        for na in (np.nan, pd.NaT, None):


hmm, I think this should coerce to a FloatIndex, cc @gfyoung

That was my intuition as well, and how I originally wrote the test (flowed through from Numeric class). The logic behind expected is based on the _na_value of UInt64Index being 0:

In [2]: pd.UInt64Index._na_value Out[2]: 0

So, it'd be a matter of altering that to np.nan if we want it to coerce to Float64Index.

hmm, this should mimic Int64Index, in that it cannot hold na, so yes pls change this logic. (possibly this might break other things, not sure). may need to do this as a pre-curser (or you do it post as well, lmk).

Tested out changing the default value to np.nan, and it produced the expected result of coercing to Float64Index. It broke two tests though, both involving Index.where, so I didn't include it in the most recent commit. The tests that broke are:

test_where_array_like fails due to expected raising a ValueError:

pandas/pandas/tests/indexes/common.py

Lines 545 to 551 in 3b05a60

def test_where_array_like(self):

i = self.create_index()

_nan = i._na_value

cond = [False] + [True] * (len(i) - 1)

klasses = [list, tuple, np.array, pd.Series]

expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype)

The change makes _nan -> np.nan, so expected essentially becomes pd.Index([np.nan, 1, 2, 3, 4], dtype='uint64'), which raises ValueError: cannot convert float NaN to integer. I think the ValueError is fine, and it's just the construction of expected that needs to be modified. My thoughts are that expected should be a Float64Index in this case, as the introduction of np.nan should cause UInt64Index to coerce.

test_where fails because result gets coerced to Int64Index:

pandas/pandas/tests/indexes/common.py

Lines 532 to 536 in 3b05a60

def test_where(self):

i = self.create_index()

result = i.where(notna(i))

expected = i

tm.assert_index_equal(result, expected)

This actually happens in 0.21.0 too, and is only now apparent due to testing the change. Essentially, if any non-uint64 value gets passed as other (in this case np.nan due to the change), the index gets coerced, even if the mask results in no changes. Reproducing this error on 0.21.0:

In [2]: idx = pd.UInt64Index(range(5)) ...: idx ...: Out[2]: UInt64Index([0, 1, 2, 3, 4], dtype='uint64') In [3]: idx.where(idx < 100, np.nan) Out[3]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

Seems to be caused by _try_convert_to_int_index automatically trying Int64Index first, so should be fairly straightforward to fix by short-circuiting based on dtype.

is this fixed here? or the other issue fix should be merged first?

Will open a separate PR that should be merged first since there are a couple unrelated fixes. My bad, said that in the original comment above, but then edited over it!

np; just reference this pr from new one

jreback · 2017-11-21T00:36:18Z

pandas/tests/indexes/timedeltas/test_indexing.py

@@ -57,6 +57,12 @@ def test_insert(self):
            assert result.name == expected.name
            assert result.freq == expected.freq

+        # GH 18295 (test missing)
+        expected = TimedeltaIndex(['1day', pd.NaT, '2day', '3day'])


side issue, we have lots of duplication on these test_insert/delete and such in the hierarchy and most can simply be tested via fixture / in superclass (but will require some refactoring in the tests). I think we have an issue about this.

jreback · 2017-11-25T14:44:35Z

pandas/tests/indexes/period/test_period.py

@@ -697,3 +697,11 @@ def test_join_self(self, how):
        index = period_range('1/1/2000', periods=10)
        joined = index.join(index, how=how)
        assert index is joined
+
+    def test_insert(self):


in another PR, can come back and move test_insert to datetimelike.py where this can be tested generically (instead of having code inside each datetimelike index type)

jreback · 2017-11-25T14:44:45Z

pandas/tests/indexes/test_base.py

@@ -457,6 +457,12 @@ def test_insert(self):
        null_index = Index([])
        tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a'))

+        # GH 18295 (test missing)


can also be done generically

jreback · 2017-11-25T14:45:37Z

thanks @jschendel

jreback requested changes Nov 15, 2017

View reviewed changes

jreback added Bug Categorical Categorical Data Type labels Nov 15, 2017

jschendel force-pushed the intervalindex-insert-nan branch from fbfe3ab to 237766c Compare November 16, 2017 01:32

jreback requested changes Nov 16, 2017

View reviewed changes

jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 16, 2017

jreback reviewed Nov 16, 2017

View reviewed changes

jschendel force-pushed the intervalindex-insert-nan branch 2 times, most recently from ab4fd51 to 8986439 Compare November 17, 2017 07:26

jreback requested changes Nov 17, 2017

View reviewed changes

jschendel force-pushed the intervalindex-insert-nan branch from 8986439 to 71a09eb Compare November 20, 2017 07:04

jreback reviewed Nov 21, 2017

View reviewed changes

jschendel mentioned this pull request Nov 21, 2017

Change UInt64Index._na_value from 0 to np.nan #18401

Merged

5 tasks

jschendel added 6 commits November 24, 2017 20:53

BUG: Fix IntervalIndex.insert to allow inserting NaN

e103ae0

NA Compat

2c4bf16

Generalize NA Compat

12bcfed

chucknull -> is_scalar + isna

a951f27

minor fixes

3e336e1

correct test_insert for UInt64Index

e401032

jschendel force-pushed the intervalindex-insert-nan branch from c1a0901 to e401032 Compare November 25, 2017 04:05

jreback approved these changes Nov 25, 2017

View reviewed changes

jreback added this to the 0.22.0 milestone Nov 25, 2017

jreback merged commit 9c9a09f into pandas-dev:master Nov 25, 2017

jreback mentioned this pull request Nov 25, 2017

TST: consolidate Index.test_insert / test_delete #18483

Closed

jschendel deleted the intervalindex-insert-nan branch November 25, 2017 19:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

jschendel commented Nov 15, 2017

jreback Nov 15, 2017

jreback Nov 15, 2017

jreback Nov 15, 2017

jschendel Nov 16, 2017

jreback Nov 15, 2017

jschendel Nov 16, 2017

codecov bot commented Nov 15, 2017 •

edited

codecov bot commented Nov 15, 2017 •

edited

jreback left a comment

jreback Nov 16, 2017

jreback Nov 16, 2017

jreback Nov 16, 2017

jschendel commented Nov 17, 2017

jreback left a comment

jreback Nov 17, 2017

jschendel Nov 17, 2017

jschendel Nov 17, 2017

jreback Nov 19, 2017

jreback Nov 19, 2017

jreback Nov 17, 2017

jreback Nov 17, 2017

jreback Nov 17, 2017

jreback Nov 17, 2017

jreback Nov 17, 2017

jschendel Nov 17, 2017 •

edited

jreback Nov 19, 2017

jschendel Nov 20, 2017 •

edited

jreback Nov 21, 2017

jschendel Nov 21, 2017

jreback Nov 21, 2017

jreback Nov 21, 2017

jreback Nov 25, 2017

jreback Nov 25, 2017

jreback commented Nov 25, 2017

	def _isna_new(obj):
	if is_scalar(obj):
	return lib.checknull(obj)

	def test_where_array_like(self):
	i = self.create_index()

	_nan = i._na_value
	cond = [False] + [True] * (len(i) - 1)
	klasses = [list, tuple, np.array, pd.Series]
	expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype)

	def test_where(self):
	i = self.create_index()
	result = i.where(notna(i))
	expected = i
	tm.assert_index_equal(result, expected)

BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

Conversation

jschendel commented Nov 15, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Nov 15, 2017 • edited

Codecov Report

codecov bot commented Nov 15, 2017 • edited

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel commented Nov 17, 2017

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel Nov 17, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel Nov 20, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Nov 25, 2017

codecov bot commented Nov 15, 2017 •

edited

codecov bot commented Nov 15, 2017 •

edited

jschendel Nov 17, 2017 •

edited

jschendel Nov 20, 2017 •

edited