Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Enable indexing with nullable Boolean #31591

Merged
merged 70 commits into from
Feb 22, 2020
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
70 commits
Select commit Hold shift + click to select a range
75c915f
Add test
Feb 1, 2020
9c5b9f0
Remove test that checks for error
Feb 2, 2020
2441b40
Add frame test
Feb 3, 2020
d71d1ba
Don't raise with nullable boolean
Feb 3, 2020
4d3a264
Don't modify result
Feb 3, 2020
543ef9a
Add frame test
Feb 3, 2020
d3e7a69
Update whatsnew
Feb 3, 2020
ad7ae66
Fill NA
Feb 3, 2020
6991394
Merge branch 'master' into bool-idx
Feb 3, 2020
f6e9ce5
Remove some more tests
Feb 3, 2020
1234407
Delete another test
Feb 3, 2020
9b7e879
Use to_numpy
Feb 3, 2020
efdd29a
Update whatsnew
Feb 3, 2020
7fa36b6
Don't check for NA
Feb 3, 2020
b8e3d6b
Revert "Remove test that checks for error"
Feb 4, 2020
bc3fe3f
Update NA test
Feb 4, 2020
73ad221
Revert "Remove some more tests"
Feb 4, 2020
547d7bc
Update Categorical test
Feb 4, 2020
5649445
Update getitem tests
Feb 4, 2020
bb3d143
Update indexers.py
Feb 4, 2020
f107252
tm -> self
Feb 4, 2020
7b924b7
Assert for EA not DataFrame
Feb 4, 2020
46d77df
Don't try / except
Feb 4, 2020
ac71cbf
Change check_indexer test
Feb 4, 2020
e5ed092
Modify __getitem__ for datetimelike
Feb 4, 2020
9fcdb23
Add back ValueError for non-boolean with NA
Feb 4, 2020
c2dfa93
Revert "Delete another test"
Feb 4, 2020
a9a12b1
Fixup error message
Feb 4, 2020
7c10f33
Add before and after examples
Feb 5, 2020
cf3d60d
Get rid of some tests
Feb 5, 2020
157d8b9
Cast another way
Feb 5, 2020
250f228
Import
Feb 5, 2020
647f0f6
Don't import unused
Feb 5, 2020
6ccd96d
Merge branch 'master' into bool-idx
Feb 8, 2020
a9e73de
Update whatsnew
Feb 10, 2020
adc3075
Update boolean.rst
Feb 10, 2020
29ff823
check_array_indexer docstring
Feb 10, 2020
0a58605
Edit 1.1.0 whatsnew
Feb 10, 2020
b38a209
Add to indexing.rst
Feb 10, 2020
5088cbb
Add back index parameter
Feb 10, 2020
54efdd9
Add some True values in test
Feb 10, 2020
c6b81ed
Edit boolean.rst
Feb 10, 2020
67800c6
Add list back to check_array_indexer test
Feb 10, 2020
4c334f3
Merge branch 'master' into bool-idx
Feb 10, 2020
578fd3c
Account for pd.NA in is_bool_indexer
Feb 10, 2020
a559385
Include list mask in test
Feb 10, 2020
705947e
Account for empty key
Feb 10, 2020
4974778
Revert "Account for empty key"
Feb 10, 2020
319b525
Revert "Account for pd.NA in is_bool_indexer"
Feb 10, 2020
8007ce4
Try modifying is_bool_indexer
Feb 11, 2020
a10765f
Revert "Try modifying is_bool_indexer"
Feb 11, 2020
d7fc3b7
Revert "Include list mask in test"
Feb 11, 2020
bca582e
Merge branch 'master' into bool-idx
Feb 13, 2020
6f9a298
Update release notes and docs
Feb 13, 2020
e1e39fe
Add issue number to tests
Feb 13, 2020
5a72b2f
Add some setitem tests
Feb 13, 2020
c0e8dc7
Revert "Add some setitem tests"
Feb 13, 2020
a293bc6
Merge branch 'master' into bool-idx
Feb 13, 2020
607d9ed
Update setitem tests
Feb 13, 2020
2e7f9b3
Merge branch 'master' into bool-idx
Feb 15, 2020
bfe472b
Merge branch 'master' into bool-idx
Feb 16, 2020
a6294f8
Merge branch 'master' into bool-idx
Feb 16, 2020
c6d23f6
Add setitem test
Feb 17, 2020
c8ee434
Merge branch 'master' into bool-idx
Feb 17, 2020
fbda99d
Move whatsnew note
dsaxton Feb 19, 2020
3bf9327
Add back example
dsaxton Feb 19, 2020
dd65b0d
Merge branch 'master' into bool-idx
dsaxton Feb 19, 2020
974ec5d
Merge remote-tracking branch 'upstream/master' into bool-idx
dsaxton Feb 20, 2020
8f2d7bb
Merge remote-tracking branch 'upstream/master' into bool-idx
dsaxton Feb 21, 2020
080d1d2
Update comment
dsaxton Feb 21, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,11 @@ For example:
ser["2014"]
ser.loc["May 2015"]

Indexing with Nullable Boolean Arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok on moving to 1.0.2

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously indexing with a nullable Boolean array would raise a ``ValueError``, however this is now permitted with ``NA`` being treated as ``False``. (:issue:`31503`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add an example of the before and the after (style this after similar notes in 1.0.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it raised only when there were missing values.


.. _whatsnew_110.enhancements.other:

Other enhancements
Expand Down
8 changes: 0 additions & 8 deletions pandas/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -114,24 +114,16 @@ def is_bool_indexer(key: Any) -> bool:
check_array_indexer : Check that `key` is a valid array to index,
and convert to an ndarray.
"""
na_msg = "cannot mask with array containing NA / NaN values"
if isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or (
is_array_like(key) and is_extension_array_dtype(key.dtype)
):
if key.dtype == np.object_:
key = np.asarray(values_from_object(key))

if not lib.is_bool_array(key):
if isna(key).any():
raise ValueError(na_msg)
return False
return True
elif is_bool_dtype(key.dtype):
# an ndarray with bool-dtype by definition has no missing values.
# So we only need to check for NAs in ExtensionArrays
if is_extension_array_dtype(key.dtype):
if np.any(key.isna()):
raise ValueError(na_msg)
return True
elif isinstance(key, list):
try:
Expand Down
23 changes: 14 additions & 9 deletions pandas/core/indexers.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
from pandas.core.dtypes.common import (
is_array_like,
is_bool_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_list_like,
)
Expand Down Expand Up @@ -333,14 +334,11 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:
...
IndexError: Boolean index has wrong length: 3 instead of 2.

A ValueError is raised when the mask cannot be converted to
a bool-dtype ndarray.
NA values are treated as False.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NA values in a boolean array.

In other contexts, they should still raise.


>>> mask = pd.array([True, pd.NA])
>>> pd.api.indexers.check_array_indexer(arr, mask)
Traceback (most recent call last):
...
ValueError: Cannot mask with a boolean indexer containing NA values
array([ True, False])

A numpy boolean mask will get passed through (if the length is correct):

Expand Down Expand Up @@ -392,10 +390,17 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:

dtype = indexer.dtype
if is_bool_dtype(dtype):
try:
indexer = np.asarray(indexer, dtype=bool)
except ValueError:
TomAugspurger marked this conversation as resolved.
Show resolved Hide resolved
raise ValueError("Cannot mask with a boolean indexer containing NA values")
if is_extension_array_dtype(dtype):
jreback marked this conversation as resolved.
Show resolved Hide resolved
indexer = indexer.to_numpy(dtype=bool, na_value=False)
else:
try:
dsaxton marked this conversation as resolved.
Show resolved Hide resolved
indexer = np.asarray(indexer, dtype=bool)
except ValueError:
msg = (
"Cannot mask with a non-ExtensionArray boolean indexer "
" containing missing values"
)
raise ValueError(msg)

# GH26658
if len(indexer) != len(array):
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@
from pandas.util._decorators import Appender

from pandas.core.dtypes.common import (
is_bool_dtype,
is_extension_array_dtype,
is_float,
is_integer,
is_iterator,
Expand Down Expand Up @@ -2222,6 +2224,8 @@ def check_bool_indexer(index: Index, key) -> np.ndarray:
"the indexed object do not match)."
)
result = result.astype(bool)._values
elif is_extension_array_dtype(key) and is_bool_dtype(key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this at all anymore? is check_array_indexer not sufficient ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, there's a lot of duplicate code happening. Will work on cleaning this up.

result = key.to_numpy(dtype=bool, na_value=False)
else:
# key might be sparse / object-dtype bool, check_array_indexer needs bool array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here seems to indicate that this was also done for sparse data (and not only object dtype data). But no tests are failing? This might need some checking if the comment was outdated (I think it was only recently added)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check this comment? (about the sparse from the code comment)

result = np.asarray(result, dtype=bool)
Expand Down
11 changes: 5 additions & 6 deletions pandas/tests/arrays/categorical/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -239,15 +239,14 @@ def test_mask_with_boolean(index):
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("index", [True, False])
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
def test_mask_with_boolean_raises(index):
def test_mask_with_boolean_na_treated_as_false():
s = Series(range(3))
idx = Categorical([True, False, None])
if index:
idx = CategoricalIndex(idx)

with pytest.raises(ValueError, match="NA / NaN"):
s[idx]
result = s[idx]
expected = s[idx.fillna(False)]

tm.assert_series_equal(result, expected)


@pytest.fixture
Expand Down
20 changes: 20 additions & 0 deletions pandas/tests/arrays/test_boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -929,3 +929,23 @@ def test_diff():
result = s.diff()
expected = pd.Series(expected)
tm.assert_series_equal(result, expected)


def test_nullable_boolean_mask_series():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these tests be removed in favor of the modified tests? May be a bit redundant with the others checking for the same / similar behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they are truly redudant

s = pd.Series([1, 2, 3])
mask = pd.array([True, True, None], dtype="boolean")

result = s[mask]
expected = s.iloc[:2]

tm.assert_series_equal(result, expected)


def test_nullable_boolean_mask_frame():
df = pd.DataFrame({"a": [1, 2, 3]})
mask = pd.array([True, True, None], dtype="boolean")

result = df[mask]
expected = df.iloc[:2, :]

tm.assert_frame_equal(result, expected)
18 changes: 9 additions & 9 deletions pandas/tests/extension/base/getitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,21 +158,21 @@ def test_getitem_boolean_array_mask(self, data):
result = pd.Series(data)[mask]
self.assert_series_equal(result, expected)

def test_getitem_boolean_array_mask_raises(self, data):
dsaxton marked this conversation as resolved.
Show resolved Hide resolved
def test_getitem_boolean_na_treated_as_false(self, data):
mask = pd.array(np.zeros(data.shape, dtype="bool"), dtype="boolean")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take here something with True's as well? (now it will give an empty result)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to run the test also for both a boolean array and a list as mask (to ensure the list works)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the list input may not be working properly, will work on fixing that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche So I think the issue with list masks containing bools and pd.NA was that is_bool_indexer from pandas/core/common.py didn't consider these to be valid boolean indexers because it was trying to cast to a bool numpy array: https://github.com/pandas-dev/pandas/blob/master/pandas/core/common.py#L142

Made an update there to recognize pd.NA and also updated the test; hopefully CI will still pass. The assumption that boolean indexers are ones that can be cast as numpy boolean arrays seems to happen in a lot of places (e.g., https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py#L4147) so I could see this causing problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the "old" code of still accepting object dtype makes this a bit more complex indeed. Maybe instead of casting to a numpy array, we could use pd.array, so it will handle the case with boolean values better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should theoretically work in combination with the right change to is_bool_indexer

mask[:2] = pd.NA

msg = (
"Cannot mask with a boolean indexer containing NA values|"
"cannot mask with array containing NA / NaN values"
)
with pytest.raises(ValueError, match=msg):
data[mask]
result = data[mask]
expected = data[mask.fillna(False)]

tm.assert_frame_equal(result, expected)

s = pd.Series(data)

with pytest.raises(ValueError):
s[mask]
result = s[mask]
expected = s[mask.fillna(False)]

tm.assert_series_equal(result, expected)

@pytest.mark.parametrize(
"idx",
Expand Down
26 changes: 18 additions & 8 deletions pandas/tests/indexing/test_na_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,18 +62,28 @@ def test_series_mask_boolean(values, dtype, mask, box_mask, frame):


@pytest.mark.parametrize("frame", [True, False])
def test_indexing_with_na_raises(frame):
dsaxton marked this conversation as resolved.
Show resolved Hide resolved
def test_na_treated_as_false(frame):
s = pd.Series([1, 2, 3], name="name")

if frame:
s = s.to_frame()

mask = pd.array([True, False, None], dtype="boolean")
match = "cannot mask with array containing NA / NaN values"
with pytest.raises(ValueError, match=match):
s[mask]

with pytest.raises(ValueError, match=match):
s.loc[mask]
result = s[mask]
expected = s[mask.fillna(False)]

result_loc = s.loc[mask]
expected_loc = s.loc[mask.fillna(False)]

with pytest.raises(ValueError, match=match):
s.iloc[mask]
result_iloc = s.iloc[mask]
expected_iloc = s.iloc[mask.fillna(False)]

if frame:
tm.assert_frame_equal(result, expected)
tm.assert_frame_equal(result_loc, expected_loc)
tm.assert_frame_equal(result_iloc, expected_iloc)
else:
tm.assert_series_equal(result, expected)
tm.assert_series_equal(result_loc, expected_loc)
tm.assert_series_equal(result_iloc, expected_iloc)
8 changes: 0 additions & 8 deletions pandas/tests/series/indexing/test_boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,14 +70,6 @@ def test_getitem_boolean_object(string_series):
s2[mask] = 5
tm.assert_series_equal(cop, s2)

# nans raise exception
dsaxton marked this conversation as resolved.
Show resolved Hide resolved
omask[5:10] = np.nan
msg = "cannot mask with array containing NA / NaN values"
with pytest.raises(ValueError, match=msg):
s[omask]
with pytest.raises(ValueError, match=msg):
s[omask] = 5


def test_getitem_setitem_boolean_corner(datetime_series):
ts = datetime_series
Expand Down