Backport PR #54670 on branch 2.1.x (BUG: drop_duplicates raising for boolean arrow dtype with missing values) (#54692)

Backport PR #54670: BUG: drop_duplicates raising for boolean arrow dtype with missing values

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
meeseeksmachine and phofl committed Aug 22, 2023
1 parent 8c8ddc7 commit 6c5e79b
Showing 3 changed files with 17 additions and 3 deletions.
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -627,6 +627,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.transpose` when transposing a DataFrame with a single masked dtype, e.g. :class:`Int64` (:issue:`52836`)
- Performance improvement in :meth:`Series.add` for PyArrow string and binary dtypes (:issue:`53150`)
- Performance improvement in :meth:`Series.corr` and :meth:`Series.cov` for extension dtypes (:issue:`52502`)
+- Performance improvement in :meth:`Series.drop_duplicates` for ``ArrowDtype`` (:issue:`54667`).
- Performance improvement in :meth:`Series.ffill`, :meth:`Series.bfill`, :meth:`DataFrame.ffill`, :meth:`DataFrame.bfill` with PyArrow dtypes (:issue:`53950`)
- Performance improvement in :meth:`Series.str.get_dummies` for PyArrow-backed strings (:issue:`53655`)
- Performance improvement in :meth:`Series.str.get` for PyArrow-backed strings (:issue:`53152`)
@@ -831,6 +832,7 @@ ExtensionArray
- Bug in :class:`~arrays.ArrowExtensionArray` converting pandas non-nanosecond temporal objects from non-zero values to zero values (:issue:`53171`)
- Bug in :meth:`Series.quantile` for PyArrow temporal types raising ``ArrowInvalid`` (:issue:`52678`)
- Bug in :meth:`Series.rank` returning wrong order for small values with ``Float64`` dtype (:issue:`52471`)
+- Bug in :meth:`Series.unique` for boolean ``ArrowDtype`` with ``NA`` values (:issue:`54667`)
- Bug in :meth:`~arrays.ArrowExtensionArray.__iter__` and :meth:`~arrays.ArrowExtensionArray.__getitem__` returning python datetime and timedelta objects for non-nano dtypes (:issue:`53326`)
- Bug where the :class:`DataFrame` repr would not work when a column had an :class:`ArrowDtype` with a ``pyarrow.ExtensionDtype`` (:issue:`54063`)
- Bug where the ``__from_arrow__`` method of masked ExtensionDtypes (e.g. :class:`Float64Dtype`, :class:`BooleanDtype`) would not accept PyArrow arrays of type ``pyarrow.null()`` (:issue:`52223`)
11 changes: 8 additions & 3 deletions pandas/core/algorithms.py
@@ -55,6 +55,7 @@
 )
 from pandas.core.dtypes.concat import concat_compat
 from pandas.core.dtypes.dtypes import (
+    ArrowDtype,
     BaseMaskedDtype,
     CategoricalDtype,
     ExtensionDtype,
@@ -996,9 +997,13 @@ def duplicated(
     -------
     duplicated : ndarray[bool]
     """
-    if hasattr(values, "dtype") and isinstance(values.dtype, BaseMaskedDtype):
-        values = cast("BaseMaskedArray", values)
-        return htable.duplicated(values._data, keep=keep, mask=values._mask)
+    if hasattr(values, "dtype"):
+        if isinstance(values.dtype, ArrowDtype):
+            values = values._to_masked()  # type: ignore[union-attr]
+
+        if isinstance(values.dtype, BaseMaskedDtype):
+            values = cast("BaseMaskedArray", values)
+            return htable.duplicated(values._data, keep=keep, mask=values._mask)
 
     values = _ensure_data(values)
     return htable.duplicated(values, keep=keep)
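
The change above routes Arrow-backed values through the masked path, so duplicate detection receives a data buffer plus a boolean mask instead of values that may contain NA. As a rough illustration only, the sketch below shows what mask-aware duplicate detection means in plain Python. It is not pandas' `htable.duplicated` implementation; the function name `duplicated_masked` and its signature are hypothetical, and it assumes only `keep="first"` and `keep="last"` need to be handled.

```python
from __future__ import annotations

from typing import Hashable, Sequence


def duplicated_masked(
    data: Sequence[Hashable],
    mask: Sequence[bool],
    keep: str = "first",
) -> list[bool]:
    """Flag duplicates, treating every masked (missing) slot as one shared key.

    Hypothetical helper for illustration; not part of pandas.
    """
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    n = len(data)
    # Scan backwards for keep="last" so the final occurrence is seen first.
    order = range(n) if keep == "first" else range(n - 1, -1, -1)

    seen: set[Hashable] = set()
    seen_missing = False
    result = [False] * n
    for i in order:
        if mask[i]:
            # The payload under a masked slot is arbitrary, so it is never hashed.
            result[i] = seen_missing
            seen_missing = True
        elif data[i] in seen:
            result[i] = True
        else:
            seen.add(data[i])
    return result


# Values from the new test: [True, False, None, False].
# The slot behind None holds an arbitrary filler value (False here).
data = [True, False, False, False]
mask = [False, False, True, False]
flags = duplicated_masked(data, mask)
kept = [None if m else v for v, m, dup in zip(data, mask, flags) if not dup]
print(flags)
print(kept)
```

In this sketch the missing entry is kept once and the second `False` is flagged, which mirrors the expected result `[True, False, None]` in the test added by this commit.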
7 changes: 7 additions & 0 deletions pandas/tests/series/methods/test_drop_duplicates.py
@@ -249,3 +249,10 @@ def test_drop_duplicates_ignore_index(self):
         result = ser.drop_duplicates(ignore_index=True)
         expected = Series([1, 2, 3])
         tm.assert_series_equal(result, expected)
+
+    def test_duplicated_arrow_dtype(self):
+        pytest.importorskip("pyarrow")
+        ser = Series([True, False, None, False], dtype="bool[pyarrow]")
+        result = ser.drop_duplicates()
+        expected = Series([True, False, None], dtype="bool[pyarrow]")
+        tm.assert_series_equal(result, expected)
