Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix IntervalIndex.insert to allow inserting NaN #18300

Merged
merged 6 commits into from Nov 25, 2017

Conversation

jschendel
Copy link
Member

right_insert = item.right
elif is_scalar(item) and isna(item):
# GH 18295
left_insert = right_insert = item
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes u need to use a nan compat with left iow this could be a NaT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we have a left.na_value iirc (might be spelled differently)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe the underlying already handles this in the insert

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Procedure I'm following is "check if any type of NA value is passed -> raise if the wrong type of NA is passed". I suppose I could just bypass this and only check if the right type of NA is passed, if that would be preferred.

@@ -255,6 +255,11 @@ def test_insert(self):
pytest.raises(ValueError, self.index.insert, 0,
Interval(2, 3, closed='left'))

# GH 18295
expected = self.index_with_nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a fixture that hits multiple kinds of left (float,int,datetime); might be able to do this more generally in the interval tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@codecov
Copy link

codecov bot commented Nov 15, 2017

Codecov Report

Merging #18300 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18300      +/-   ##
==========================================
- Coverage    91.4%   91.38%   -0.02%     
==========================================
  Files         164      164              
  Lines       49880    49884       +4     
==========================================
- Hits        45592    45587       -5     
- Misses       4288     4297       +9
Flag Coverage Δ
#multiple 89.19% <100%> (ø) ⬆️
#single 39.41% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 92.9% <100%> (+0.05%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 148ed63...fbfe3ab. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 15, 2017

Codecov Report

Merging #18300 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18300      +/-   ##
==========================================
- Coverage   91.34%   91.32%   -0.02%     
==========================================
  Files         163      163              
  Lines       49717    49727      +10     
==========================================
+ Hits        45413    45414       +1     
- Misses       4304     4313       +9
Flag Coverage Δ
#multiple 89.12% <100%> (ø) ⬆️
#single 40.72% <10.52%> (+0.14%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.2% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 92.69% <100%> (+0.05%) ⬆️
pandas/core/indexes/datetimes.py 95.5% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.44% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.21% <100%> (+0.03%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 412988e...e401032. Read the comment docs.

@jreback jreback added Bug Categorical Categorical Data Type labels Nov 15, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do a pre-cursor PR to fix .insert or can do here.

right_insert = item.right
elif is_scalar(item) and isna(item):
# GH 18295
if item is not self.left._na_value:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this needs a little work in Base.insert (which works, though the values should be inserted as _na_value). and doesn't work with any datetimelikes (e.g. DatetimeIndex). can you make the fix for this, rather than here?

@@ -255,6 +255,30 @@ def test_insert(self):
pytest.raises(ValueError, self.index.insert, 0,
Interval(2, 3, closed='left'))

@pytest.mark.parametrize('data', [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob need some more tests for .insert with (None, np.nan, pd.NaT as the inserted value) generally

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 16, 2017
@@ -62,6 +62,7 @@ Bug Fixes
- Bug in ``pd.Series.rolling.skew()`` and ``rolling.kurt()`` with all equal values has floating issue (:issue:`18044`)
- Bug in ``pd.DataFrameGroupBy.count()`` when counting over a datetimelike column (:issue:`13393`)
- Bug in ``pd.concat`` when empty and non-empty DataFrames or Series are concatenated (:issue:`18178` :issue:`18187`)
- Bug in ``IntervalIndex.insert`` when attempting to insert ``NaN`` (:issue:`18295`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.22, this is a bit more invasive.

@jschendel jschendel force-pushed the intervalindex-insert-nan branch 2 times, most recently from ab4fd51 to 8986439 Compare November 17, 2017 07:26
@jschendel
Copy link
Member Author

Made the changes at Base.insert, and went through and made sure each type of index was behaving consistently.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice fixes and tests. some comments.

@@ -3728,6 +3728,10 @@ def insert(self, loc, item):
-------
new_index : Index
"""
if lib.checknull(item):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use isna()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I'm using checknull over isna is to guard against non-scalar item.

Using isna by itself in an if statement would fail for non-scalar item with an unhelpful error message, whereas checknull returns False for any non-scalar:

In [8]: checknull(['a', 'b'])
Out[8]: False

In [9]: bool(isna(['a', 'b']))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-358fb1312857> in <module>()
----> 1 bool(isna(['a', 'b']))

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The way around this with isna would be to do something like: is_scalar(item) and isna(item). However, looking at the implementation of isna, this just forces it down a path to hit checknull:

def _isna_new(obj):
if is_scalar(obj):
return lib.checknull(obj)

So is_scalar(item) and isna(item) is equivalent to checknull(item) but with the additional overhead of two calls to is_scalar.

Seems more efficient to just use checknull, but can switch to isna if that's preferred stylistically. Or maybe I'm overlooking a better way to using isna?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, might be nice to do a checknull -> checkna renaming in the spirit of isnull -> isna.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah should rename that as well.

yeah usually I would just do: is_scalar(..) and is_na(...) to make this explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.checknull is pretty internal and don't want to use it in this code

@@ -688,7 +688,7 @@ def insert(self, loc, item):

"""
code = self.categories.get_indexer([item])
if (code == -1):
if (code == -1) and not lib.checknull(item):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use isna() (or notna()

@@ -1751,6 +1751,9 @@ def insert(self, loc, item):
-------
new_index : Index
"""
if lib.checknull(item):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

'side as the index')
left_insert = item.left
right_insert = item.right
elif lib.checknull(item):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


freq = None
if isinstance(item, Timedelta) or item is NaT:
if isinstance(item, Timedelta) or (item is self._na_value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use isna() here as well

def test_insert(self):
# GH 18295 (test missing)
expected = UInt64Index([0, 0, 1, 2, 3, 4])
for na in (np.nan, pd.NaT, None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think this should coerce to a FloatIndex, cc @gfyoung

Copy link
Member Author

@jschendel jschendel Nov 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my intuition as well, and how I originally wrote the test (flowed through from Numeric class). The logic behind expected is based on the _na_value of UInt64Index being 0:

In [2]: pd.UInt64Index._na_value
Out[2]: 0

So, it'd be a matter of altering that to np.nan if we want it to coerce to Float64Index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this should mimic Int64Index, in that it cannot hold na, so yes pls change this logic. (possibly this might break other things, not sure). may need to do this as a pre-curser (or you do it post as well, lmk).

Copy link
Member Author

@jschendel jschendel Nov 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested out changing the default value to np.nan, and it produced the expected result of coercing to Float64Index. It broke two tests though, both involving Index.where, so I didn't include it in the most recent commit. The tests that broke are:

  1. test_where_array_like fails due to expected raising a ValueError:

    def test_where_array_like(self):
    i = self.create_index()
    _nan = i._na_value
    cond = [False] + [True] * (len(i) - 1)
    klasses = [list, tuple, np.array, pd.Series]
    expected = pd.Index([_nan] + i[1:].tolist(), dtype=i.dtype)

    The change makes _nan -> np.nan, so expected essentially becomes pd.Index([np.nan, 1, 2, 3, 4], dtype='uint64'), which raises ValueError: cannot convert float NaN to integer. I think the ValueError is fine, and it's just the construction of expected that needs to be modified. My thoughts are that expected should be a Float64Index in this case, as the introduction of np.nan should cause UInt64Index to coerce.

  2. test_where fails because result gets coerced to Int64Index:

    def test_where(self):
    i = self.create_index()
    result = i.where(notna(i))
    expected = i
    tm.assert_index_equal(result, expected)

    This actually happens in 0.21.0 too, and is only now apparent due to testing the change. Essentially, if any non-uint64 value gets passed as other (in this case np.nan due to the change), the index gets coerced, even if the mask results in no changes. Reproducing this error on 0.21.0:

In [2]: idx = pd.UInt64Index(range(5))
   ...: idx
   ...:
Out[2]: UInt64Index([0, 1, 2, 3, 4], dtype='uint64')

In [3]: idx.where(idx < 100, np.nan)
Out[3]: Int64Index([0, 1, 2, 3, 4], dtype='int64')

Seems to be caused by _try_convert_to_int_index automatically trying Int64Index first, so should be fairly straightforward to fix by short-circuiting based on dtype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this fixed here? or the other issue fix should be merged first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will open a separate PR that should be merged first since there are a couple unrelated fixes. My bad, said that in the original comment above, but then edited over it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np; just reference this pr from new one

@@ -57,6 +57,12 @@ def test_insert(self):
assert result.name == expected.name
assert result.freq == expected.freq

# GH 18295 (test missing)
expected = TimedeltaIndex(['1day', pd.NaT, '2day', '3day'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side issue, we have lots of duplication on these test_insert/delete and such in the hierarchy and most can simply be tested via fixture / in superclass (but will require some refactoring in the tests). I think we have an issue about this.

@@ -697,3 +697,11 @@ def test_join_self(self, how):
index = period_range('1/1/2000', periods=10)
joined = index.join(index, how=how)
assert index is joined

def test_insert(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in another PR, can come back and move test_insert to datetimelike.py where this can be tested generically (instead of having code inside each datetimelike index type)

@@ -457,6 +457,12 @@ def test_insert(self):
null_index = Index([])
tm.assert_index_equal(Index(['a']), null_index.insert(0, 'a'))

# GH 18295 (test missing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also be done generically

@jreback jreback added this to the 0.22.0 milestone Nov 25, 2017
@jreback jreback merged commit 9c9a09f into pandas-dev:master Nov 25, 2017
@jreback
Copy link
Contributor

jreback commented Nov 25, 2017

thanks @jschendel

@jschendel jschendel deleted the intervalindex-insert-nan branch November 25, 2017 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: IntervalIndex.insert does not work for inserting NaN
2 participants