ENH: Enable indexing with nullable Boolean #31591

Merged
merged 70 commits into from
Feb 22, 2020
75c915f
Add test
Feb 1, 2020
9c5b9f0
Remove test that checks for error
Feb 2, 2020
2441b40
Add frame test
Feb 3, 2020
d71d1ba
Don't raise with nullable boolean
Feb 3, 2020
4d3a264
Don't modify result
Feb 3, 2020
543ef9a
Add frame test
Feb 3, 2020
d3e7a69
Update whatsnew
Feb 3, 2020
ad7ae66
Fill NA
Feb 3, 2020
6991394
Merge branch 'master' into bool-idx
Feb 3, 2020
f6e9ce5
Remove some more tests
Feb 3, 2020
1234407
Delete another test
Feb 3, 2020
9b7e879
Use to_numpy
Feb 3, 2020
efdd29a
Update whatsnew
Feb 3, 2020
7fa36b6
Don't check for NA
Feb 3, 2020
b8e3d6b
Revert "Remove test that checks for error"
Feb 4, 2020
bc3fe3f
Update NA test
Feb 4, 2020
73ad221
Revert "Remove some more tests"
Feb 4, 2020
547d7bc
Update Categorical test
Feb 4, 2020
5649445
Update getitem tests
Feb 4, 2020
bb3d143
Update indexers.py
Feb 4, 2020
f107252
tm -> self
Feb 4, 2020
7b924b7
Assert for EA not DataFrame
Feb 4, 2020
46d77df
Don't try / except
Feb 4, 2020
ac71cbf
Change check_indexer test
Feb 4, 2020
e5ed092
Modify __getitem__ for datetimelike
Feb 4, 2020
9fcdb23
Add back ValueError for non-boolean with NA
Feb 4, 2020
c2dfa93
Revert "Delete another test"
Feb 4, 2020
a9a12b1
Fixup error message
Feb 4, 2020
7c10f33
Add before and after examples
Feb 5, 2020
cf3d60d
Get rid of some tests
Feb 5, 2020
157d8b9
Cast another way
Feb 5, 2020
250f228
Import
Feb 5, 2020
647f0f6
Don't import unused
Feb 5, 2020
6ccd96d
Merge branch 'master' into bool-idx
Feb 8, 2020
a9e73de
Update whatsnew
Feb 10, 2020
adc3075
Update boolean.rst
Feb 10, 2020
29ff823
check_array_indexer docstring
Feb 10, 2020
0a58605
Edit 1.1.0 whatsnew
Feb 10, 2020
b38a209
Add to indexing.rst
Feb 10, 2020
5088cbb
Add back index parameter
Feb 10, 2020
54efdd9
Add some True values in test
Feb 10, 2020
c6b81ed
Edit boolean.rst
Feb 10, 2020
67800c6
Add list back to check_array_indexer test
Feb 10, 2020
4c334f3
Merge branch 'master' into bool-idx
Feb 10, 2020
578fd3c
Account for pd.NA in is_bool_indexer
Feb 10, 2020
a559385
Include list mask in test
Feb 10, 2020
705947e
Account for empty key
Feb 10, 2020
4974778
Revert "Account for empty key"
Feb 10, 2020
319b525
Revert "Account for pd.NA in is_bool_indexer"
Feb 10, 2020
8007ce4
Try modifying is_bool_indexer
Feb 11, 2020
a10765f
Revert "Try modifying is_bool_indexer"
Feb 11, 2020
d7fc3b7
Revert "Include list mask in test"
Feb 11, 2020
bca582e
Merge branch 'master' into bool-idx
Feb 13, 2020
6f9a298
Update release notes and docs
Feb 13, 2020
e1e39fe
Add issue number to tests
Feb 13, 2020
5a72b2f
Add some setitem tests
Feb 13, 2020
c0e8dc7
Revert "Add some setitem tests"
Feb 13, 2020
a293bc6
Merge branch 'master' into bool-idx
Feb 13, 2020
607d9ed
Update setitem tests
Feb 13, 2020
2e7f9b3
Merge branch 'master' into bool-idx
Feb 15, 2020
bfe472b
Merge branch 'master' into bool-idx
Feb 16, 2020
a6294f8
Merge branch 'master' into bool-idx
Feb 16, 2020
c6d23f6
Add setitem test
Feb 17, 2020
c8ee434
Merge branch 'master' into bool-idx
Feb 17, 2020
fbda99d
Move whatsnew note
dsaxton Feb 19, 2020
3bf9327
Add back example
dsaxton Feb 19, 2020
dd65b0d
Merge branch 'master' into bool-idx
dsaxton Feb 19, 2020
974ec5d
Merge remote-tracking branch 'upstream/master' into bool-idx
dsaxton Feb 20, 2020
8f2d7bb
Merge remote-tracking branch 'upstream/master' into bool-idx
dsaxton Feb 21, 2020
080d1d2
Update comment
dsaxton Feb 21, 2020
10 changes: 5 additions & 5 deletions doc/source/user_guide/boolean.rst
@@ -20,8 +20,9 @@ Nullable Boolean data type
Indexing with NA values
-----------------------

pandas does not allow indexing with NA values. Attempting to do so
will raise a ``ValueError``.
pandas allows indexing with ``NA`` values in a boolean array, which are treated as ``False``.

.. versionchanged:: 1.0.2

.. ipython:: python
:okexcept:
@@ -30,12 +31,11 @@ will raise a ``ValueError``.
mask = pd.array([True, False, pd.NA], dtype="boolean")
s[mask]

The missing values will need to be explicitly filled with True or False prior
to using the array as a mask.
Review comment (Member):
Can you keep this example but reword it as something like "if you want different behaviour, you can fill manually with fillna(True)"?

If you would prefer to keep the elements where the mask is ``NA``, you can manually fill the mask with ``fillna(True)``.

.. ipython:: python

s[mask.fillna(False)]
s[mask.fillna(True)]

.. _boolean.kleene:
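
To make the documented behaviour concrete, here is a minimal sketch of the two cases the `boolean.rst` change describes. It assumes pandas 1.0.2 or later and a three-element Series; the Series values and the variable names `na_as_false` and `na_as_true` are illustrative, not taken from the PR.

```python
import pandas as pd

# Illustrative data; the rst example defines ``s`` in a collapsed part of the hunk.
s = pd.Series([1, 2, 3])
mask = pd.array([True, False, pd.NA], dtype="boolean")

# NA in the mask is treated as False, so only the first element is selected.
na_as_false = s[mask]

# To select the NA positions instead, fill the mask explicitly before indexing.
na_as_true = s[mask.fillna(True)]

print(na_as_false.tolist())  # [1]
print(na_as_true.tolist())   # [1, 3]
```

Filling the mask explicitly keeps the choice visible at the call site, which is why the docs retain the `fillna` example.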

12 changes: 10 additions & 2 deletions doc/source/user_guide/indexing.rst
@@ -59,7 +59,7 @@ of multi-axis indexing.
slices, **both** the start and the stop are included, when present in the
index! See :ref:`Slicing with labels <indexing.slicing_with_labels>`
and :ref:`Endpoints are inclusive <advanced.endpoints_are_inclusive>`.)
* A boolean array
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).

@@ -75,7 +75,7 @@ of multi-axis indexing.
* An integer e.g. ``5``.
* A list or array of integers ``[4, 3, 0]``.
* A slice object with ints ``1:7``.
* A boolean array.
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).

@@ -374,6 +374,14 @@ For getting values with a boolean array:
df1.loc['a'] > 0
df1.loc[:, df1.loc['a'] > 0]

NA values in a boolean array propagate as ``False``:

.. versionchanged:: 1.0.2

mask = pd.array([True, False, True, False, pd.NA, False], dtype="boolean")
mask
df1[mask]

For getting a value explicitly:

.. ipython:: python
29 changes: 27 additions & 2 deletions doc/source/whatsnew/v1.0.2.rst
@@ -25,6 +25,33 @@ Fixed regressions

.. ---------------------------------------------------------------------------

Indexing with Nullable Boolean Arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously, indexing with a nullable Boolean array containing ``NA`` would raise a ``ValueError``; this is now permitted, with ``NA`` being treated as ``False``. (:issue:`31503`)

.. ipython:: python

s = pd.Series([1, 2, 3, 4])
mask = pd.array([True, True, False, None], dtype="boolean")
s
mask

*pandas 1.0.0-1.0.1*

.. code-block:: python

>>> s[mask]
Traceback (most recent call last):
...
ValueError: cannot mask with array containing NA / NaN values

*pandas 1.0.2*

.. ipython:: python

s[mask]

.. _whatsnew_102.bug_fixes:

Bug fixes
@@ -40,8 +67,6 @@ Bug fixes
- Using ``pd.NA`` with :meth:`DataFrame.to_json` now correctly outputs a null value instead of an empty object (:issue:`31615`)
- Fixed bug in parquet roundtrip with nullable unsigned integer dtypes (:issue:`31896`).



**Experimental dtypes**

- Fix bug in :meth:`DataFrame.convert_dtypes` for columns that were already using the ``"string"`` dtype (:issue:`31731`).
4 changes: 3 additions & 1 deletion pandas/core/arrays/datetimelike.py
@@ -520,7 +520,9 @@ def __getitem__(self, key):
if com.is_bool_indexer(key):
# first convert to boolean, because check_array_indexer doesn't
# allow object dtype
key = np.asarray(key, dtype=bool)
if is_object_dtype(key):
key = np.asarray(key, dtype=bool)

key = check_array_indexer(self, key)
if key.all():
key = slice(0, None, None)
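
The effect of this change on datetime-like arrays can be seen from the public API. The following is a minimal sketch, assuming pandas 1.0.2 or later; the dates and variable names are illustrative only.

```python
import pandas as pd

# ``.array`` on a DatetimeIndex exposes the underlying datetime-like array.
dates = pd.date_range("2020-01-01", periods=3).array

# Nullable boolean mask; the NA in the middle is treated as False.
mask = pd.array([True, pd.NA, True], dtype="boolean")

selected = dates[mask]
print(len(selected))  # 2
```

Because the nullable mask is no longer forced through `np.asarray(key, dtype=bool)`, the NA handling is left to `check_array_indexer`.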
7 changes: 1 addition & 6 deletions pandas/core/common.py
@@ -118,24 +118,19 @@ def is_bool_indexer(key: Any) -> bool:
check_array_indexer : Check that `key` is a valid array to index,
and convert to an ndarray.
"""
na_msg = "cannot mask with array containing NA / NaN values"
if isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or (
is_array_like(key) and is_extension_array_dtype(key.dtype)
):
if key.dtype == np.object_:
key = np.asarray(values_from_object(key))

if not lib.is_bool_array(key):
na_msg = "Cannot mask with non-boolean array containing NA / NaN values"
if isna(key).any():
raise ValueError(na_msg)
return False
return True
elif is_bool_dtype(key.dtype):
# an ndarray with bool-dtype by definition has no missing values.
# So we only need to check for NAs in ExtensionArrays
if is_extension_array_dtype(key.dtype):
if np.any(key.isna()):
raise ValueError(na_msg)
return True
elif isinstance(key, list):
try:
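
The distinction this hunk draws is between an object-dtype array that contains missing values (still an error) and a nullable boolean array (now accepted). A minimal sketch of both cases through the public indexing API follows; it assumes pandas 1.0.2 or later, and the `raised` flag is only there for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])

# Object-dtype mask with NaN: not a proper boolean array, so indexing raises.
object_mask = np.array([True, False, np.nan], dtype=object)
try:
    s[object_mask]
except ValueError as err:
    raised = True
    print(err)
else:
    raised = False

# Nullable boolean mask with NA: accepted, NA is treated as False.
nullable_mask = pd.array([True, False, pd.NA], dtype="boolean")
print(s[nullable_mask].tolist())  # [10]
```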
14 changes: 6 additions & 8 deletions pandas/core/indexers.py
@@ -10,6 +10,7 @@
from pandas.core.dtypes.common import (
is_array_like,
is_bool_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_list_like,
)
@@ -366,14 +367,11 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:
...
IndexError: Boolean index has wrong length: 3 instead of 2.

A ValueError is raised when the mask cannot be converted to
a bool-dtype ndarray.
NA values in a boolean array are treated as False.

>>> mask = pd.array([True, pd.NA])
>>> pd.api.indexers.check_array_indexer(arr, mask)
Traceback (most recent call last):
...
ValueError: Cannot mask with a boolean indexer containing NA values
array([ True, False])

A numpy boolean mask will get passed through (if the length is correct):

@@ -425,10 +423,10 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:

dtype = indexer.dtype
if is_bool_dtype(dtype):
try:
if is_extension_array_dtype(dtype):
indexer = indexer.to_numpy(dtype=bool, na_value=False)
else:
indexer = np.asarray(indexer, dtype=bool)
except ValueError:
raise ValueError("Cannot mask with a boolean indexer containing NA values")

# GH26658
if len(indexer) != len(array):
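
The conversion used in this hunk, `to_numpy(dtype=bool, na_value=False)`, can be tried directly alongside the public `pd.api.indexers.check_array_indexer`. This is a minimal sketch assuming pandas 1.0.2 or later; the array contents are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
mask = pd.array([True, False, pd.NA], dtype="boolean")

# The conversion the new code path applies to extension-dtype boolean masks.
as_numpy = mask.to_numpy(dtype=bool, na_value=False)

# The public validator returns a plain NumPy boolean array of the same shape.
validated = pd.api.indexers.check_array_indexer(arr, mask)

print(validated)       # [ True False False]
print(arr[validated])  # [1]
```

Since the result is an ordinary `bool` ndarray, downstream code that indexes NumPy data does not need to know about `pd.NA`.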
5 changes: 4 additions & 1 deletion pandas/core/indexing.py
@@ -13,6 +13,7 @@
is_iterator,
is_list_like,
is_numeric_dtype,
is_object_dtype,
is_scalar,
is_sequence,
)
@@ -2189,10 +2190,12 @@ def check_bool_indexer(index: Index, key) -> np.ndarray:
"the indexed object do not match)."
)
result = result.astype(bool)._values
else:
elif is_object_dtype(key):
# key might be sparse / object-dtype bool, check_array_indexer needs bool array
Review comment (Member):
The comment here seems to indicate that this was also done for sparse data (and not only object dtype data). But no tests are failing? This might need some checking if the comment was outdated (I think it was only recently added)

Review comment (Member):
Can you check this comment? (about the sparse from the code comment)

result = np.asarray(result, dtype=bool)
result = check_array_indexer(index, result)
else:
result = check_array_indexer(index, result)
Review comment (Contributor):
Might be able to simply call check_array_indexer right before returning (in other words, for all cases) here. (Try in a follow-on.)


return result
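
The two non-Series branches above (object-dtype key versus everything else) can be exercised from user code. The sketch below assumes that `.loc` with a boolean mask is routed through `check_bool_indexer`, which is an internal detail and may differ between versions; the data and names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Object-dtype mask holding only booleans: cast to a bool ndarray first.
object_mask = np.array([True, False, True], dtype=object)

# Nullable mask with NA where the object mask is False: handled by
# check_array_indexer, which treats the NA as False.
nullable_mask = pd.array([True, pd.NA, True], dtype="boolean")

from_object = s.loc[object_mask]
from_nullable = s.loc[nullable_mask]
print(from_object.equals(from_nullable))  # True
```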

9 changes: 6 additions & 3 deletions pandas/tests/arrays/categorical/test_indexing.py
@@ -240,14 +240,17 @@ def test_mask_with_boolean(index):


@pytest.mark.parametrize("index", [True, False])
def test_mask_with_boolean_raises(index):
def test_mask_with_boolean_na_treated_as_false(index):
# https://github.com/pandas-dev/pandas/issues/31503
s = Series(range(3))
idx = Categorical([True, False, None])
if index:
idx = CategoricalIndex(idx)

with pytest.raises(ValueError, match="NA / NaN"):
s[idx]
result = s[idx]
expected = s[idx.fillna(False)]

tm.assert_series_equal(result, expected)


@pytest.fixture
20 changes: 11 additions & 9 deletions pandas/tests/extension/base/getitem.py
@@ -158,21 +158,23 @@ def test_getitem_boolean_array_mask(self, data):
result = pd.Series(data)[mask]
self.assert_series_equal(result, expected)

def test_getitem_boolean_array_mask_raises(self, data):
def test_getitem_boolean_na_treated_as_false(self, data):
# https://github.com/pandas-dev/pandas/issues/31503
mask = pd.array(np.zeros(data.shape, dtype="bool"), dtype="boolean")
Review comment (Member):
Can you use a mask that contains some True values here as well? (As written, it gives an empty result.)

Review comment (Member):
it would be good to run the test also for both a boolean array and a list as mask (to ensure the list works)

Review comment (Member Author):
Looks like the list input may not be working properly, will work on fixing that

Review comment (Member Author):
@jorisvandenbossche So I think the issue with list masks containing bools and pd.NA was that is_bool_indexer from pandas/core/common.py didn't consider these to be valid boolean indexers because it was trying to cast to a bool numpy array: https://github.com/pandas-dev/pandas/blob/master/pandas/core/common.py#L142

Made an update there to recognize pd.NA and also updated the test; hopefully CI will still pass. The assumption that boolean indexers are ones that can be cast as numpy boolean arrays seems to happen in a lot of places (e.g., https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/base.py#L4147) so I could see this causing problems.

Review comment (Member):
Hmm, the "old" code of still accepting object dtype makes this a bit more complex indeed. Maybe instead of casting to a numpy array, we could use pd.array, so it will handle the case with boolean values better?

Review comment (Member Author):
I think that should theoretically work in combination with the right change to is_bool_indexer
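
The thread leaves open whether a plain list containing `pd.NA` is accepted directly as a mask. A sketch that sidesteps the question, following the `pd.array` suggestion above, is to convert the list to a nullable boolean array before indexing. The name `raw_mask` is hypothetical, and this is a user-side workaround rather than what the PR implements.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Hypothetical input: a plain Python list with a missing value.
raw_mask = [True, False, pd.NA]

# Convert explicitly so the mask has the nullable "boolean" dtype.
mask = pd.array(raw_mask, dtype="boolean")

result = s[mask]
print(result.tolist())  # [1]
```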

mask[:2] = pd.NA
mask[2:4] = True

msg = (
"Cannot mask with a boolean indexer containing NA values|"
"cannot mask with array containing NA / NaN values"
)
with pytest.raises(ValueError, match=msg):
data[mask]
result = data[mask]
expected = data[mask.fillna(False)]

self.assert_extension_array_equal(result, expected)

s = pd.Series(data)

with pytest.raises(ValueError):
s[mask]
result = s[mask]
expected = s[mask.fillna(False)]

self.assert_series_equal(result, expected)
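
Outside the test harness, the same check can be run on a concrete extension array. This sketch assumes an `Int64` array stands in for the `data` fixture; the fixture itself is not reproduced here.

```python
import numpy as np
import pandas as pd

# Stand-in for the ``data`` fixture (assumption: any extension array works).
data = pd.array([1, 2, 3, 4, 5], dtype="Int64")

# Build the mask as in the test: two NA, two True, the rest False.
mask = pd.array(np.zeros(len(data), dtype="bool"), dtype="boolean")
mask[:2] = pd.NA
mask[2:4] = True

result = data[mask]
expected = data[mask.fillna(False)]

# Compare via Series so the long-standing public assertion helper can be used.
pd.testing.assert_series_equal(pd.Series(result), pd.Series(expected))
print(result.to_numpy(dtype="int64").tolist())  # [3, 4]
```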

@pytest.mark.parametrize(
"idx",
18 changes: 8 additions & 10 deletions pandas/tests/extension/base/setitem.py
@@ -98,8 +98,9 @@ def test_setitem_iloc_scalar_multiple_homogoneous(self, data):
[
np.array([True, True, True, False, False]),
pd.array([True, True, True, False, False], dtype="boolean"),
pd.array([True, True, True, pd.NA, pd.NA], dtype="boolean"),
],
ids=["numpy-array", "boolean-array"],
ids=["numpy-array", "boolean-array", "boolean-array-na"],
)
def test_setitem_mask(self, data, mask, box_in_series):
arr = data[:5].copy()
@@ -124,20 +125,17 @@ def test_setitem_mask_raises(self, data, box_in_series):
with pytest.raises(IndexError, match="wrong length"):
data[mask] = data[0]

def test_setitem_mask_boolean_array_raises(self, data, box_in_series):
Review comment (Contributor):
Rather than removing this, can you turn it into a test (changed to reflect that we no longer raise)?

# missing values in mask
def test_setitem_mask_boolean_array_with_na(self, data, box_in_series):
Review comment (Member):
Isn't this test duplicating the test_setitem_mask above?

Review comment (Member Author):
Somewhat, yes

Review comment (Member):
You can simply remove it then, I think?

mask = pd.array(np.zeros(data.shape, dtype="bool"), dtype="boolean")
mask[:2] = pd.NA
mask[:3] = True
mask[3:5] = pd.NA

if box_in_series:
data = pd.Series(data)

msg = (
"Cannot mask with a boolean indexer containing NA values|"
"cannot mask with array containing NA / NaN values"
)
with pytest.raises(ValueError, match=msg):
data[mask] = data[0]
data[mask] = data[0]

assert (data[:3] == data[0]).all()
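
For assignment, NA positions in the mask are skipped just like False positions. A minimal sketch, again assuming an `Int64` array in place of the `data` fixture and pandas 1.0.2 or later:

```python
import pandas as pd

arr = pd.array([1, 2, 3, 4, 5], dtype="Int64")

# Same shape of mask as the new parametrized case: three True, two NA.
mask = pd.array([True, True, True, pd.NA, pd.NA], dtype="boolean")

# Only the positions where the mask is True are overwritten.
arr[mask] = 0
print(arr.to_numpy(dtype="int64").tolist())  # [0, 0, 0, 4, 5]
```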

@pytest.mark.parametrize(
"idx",
4 changes: 0 additions & 4 deletions pandas/tests/extension/test_numpy.py
@@ -415,10 +415,6 @@ def test_setitem_mask(self, data, mask, box_in_series):
def test_setitem_mask_raises(self, data, box_in_series):
super().test_setitem_mask_raises(data, box_in_series)

@skip_nested
Review comment (Contributor):
do we have a setitem test?

Review comment (Member Author):
Yes, I added a case containing NA to the test_setitem_mask test Joris had merged previously (should be line 101 of setitem.py above)

def test_setitem_mask_boolean_array_raises(self, data, box_in_series):
super().test_setitem_mask_boolean_array_raises(data, box_in_series)

@skip_nested
@pytest.mark.parametrize(
"idx",
12 changes: 7 additions & 5 deletions pandas/tests/indexing/test_check_indexer.py
@@ -34,12 +34,14 @@ def test_valid_input(indexer, expected):
@pytest.mark.parametrize(
"indexer", [[True, False, None], pd.array([True, False, None], dtype="boolean")],
)
def test_bool_raise_missing_values(indexer):
array = np.array([1, 2, 3])
def test_boolean_na_returns_indexer(indexer):
# https://github.com/pandas-dev/pandas/issues/31503
arr = np.array([1, 2, 3])

msg = "Cannot mask with a boolean indexer containing NA values"
with pytest.raises(ValueError, match=msg):
check_array_indexer(array, indexer)
result = check_array_indexer(arr, indexer)
expected = np.array([True, False, False], dtype=bool)

tm.assert_numpy_array_equal(result, expected)


@pytest.mark.parametrize(
27 changes: 19 additions & 8 deletions pandas/tests/indexing/test_na_indexing.py
@@ -62,18 +62,29 @@ def test_series_mask_boolean(values, dtype, mask, box_mask, frame):


@pytest.mark.parametrize("frame", [True, False])
def test_indexing_with_na_raises(frame):
def test_na_treated_as_false(frame):
# https://github.com/pandas-dev/pandas/issues/31503
s = pd.Series([1, 2, 3], name="name")

if frame:
s = s.to_frame()

mask = pd.array([True, False, None], dtype="boolean")
match = "cannot mask with array containing NA / NaN values"
with pytest.raises(ValueError, match=match):
s[mask]

with pytest.raises(ValueError, match=match):
s.loc[mask]
result = s[mask]
expected = s[mask.fillna(False)]

result_loc = s.loc[mask]
expected_loc = s.loc[mask.fillna(False)]

with pytest.raises(ValueError, match=match):
s.iloc[mask]
result_iloc = s.iloc[mask]
expected_iloc = s.iloc[mask.fillna(False)]

if frame:
tm.assert_frame_equal(result, expected)
tm.assert_frame_equal(result_loc, expected_loc)
tm.assert_frame_equal(result_iloc, expected_iloc)
else:
tm.assert_series_equal(result, expected)
tm.assert_series_equal(result_loc, expected_loc)
tm.assert_series_equal(result_iloc, expected_iloc)
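
The same three access paths can be checked interactively on a DataFrame. This is a minimal sketch of the frame case, assuming pandas 1.0.2 or later; it uses the public `pd.testing` helpers instead of the internal `tm` module.

```python
import pandas as pd

df = pd.DataFrame({"name": [1, 2, 3]})
mask = pd.array([True, False, None], dtype="boolean")

# Plain getitem, label-based and position-based indexing all accept the mask.
by_getitem = df[mask]
by_loc = df.loc[mask]
by_iloc = df.iloc[mask]

# Each result should match indexing with the NA explicitly filled as False.
expected = df[mask.fillna(False)]
pd.testing.assert_frame_equal(by_getitem, expected)
pd.testing.assert_frame_equal(by_loc, expected)
pd.testing.assert_frame_equal(by_iloc, expected)

print(by_getitem["name"].tolist())  # [1]
```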
2 changes: 1 addition & 1 deletion pandas/tests/series/indexing/test_boolean.py
@@ -72,7 +72,7 @@ def test_getitem_boolean_object(string_series):

# nans raise exception
omask[5:10] = np.nan
msg = "cannot mask with array containing NA / NaN values"
msg = "Cannot mask with non-boolean array containing NA / NaN values"
with pytest.raises(ValueError, match=msg):
s[omask]
with pytest.raises(ValueError, match=msg):