Backport PR #53232 on branch 2.0.x (BUG: sort_values raising for dictionary arrow dtype) (#53241)

BUG: sort_values raising for dictionary arrow dtype (#53232)

(cherry picked from commit e1b657a)
phofl committed May 15, 2023
1 parent 7a28ced commit 340346c
Showing 3 changed files with 23 additions and 2 deletions.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.2.rst
@@ -30,6 +30,7 @@ Bug fixes
 - Bug in :func:`api.interchange.from_dataframe` was unnecessarily raising on bitmasks (:issue:`49888`)
 - Bug in :func:`merge` when merging on datetime columns on different resolutions (:issue:`53200`)
 - Bug in :meth:`DataFrame.convert_dtypes` ignores ``convert_*`` keywords when set to False ``dtype_backend="pyarrow"`` (:issue:`52872`)
+- Bug in :meth:`DataFrame.sort_values` raising for PyArrow ``dictionary`` dtype (:issue:`53232`)
 - Bug in :meth:`Series.describe` treating pyarrow-backed timestamps and timedeltas as categorical data (:issue:`53001`)
 - Bug in :meth:`Series.rename` not making a lazy copy when Copy-on-Write is enabled when a scalar is passed to it (:issue:`52450`)
 - Bug in :meth:`pd.array` raising for ``NumPy`` array and ``pa.large_string`` or ``pa.large_binary`` (:issue:`52590`)
10 changes: 8 additions & 2 deletions pandas/core/arrays/arrow/array.py
@@ -269,7 +269,10 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = Fal
                 # GH50430: let pyarrow infer type, then cast
                 scalars = pa.array(scalars, from_pandas=True)
         if pa_dtype:
-            scalars = scalars.cast(pa_dtype)
+            if pa.types.is_dictionary(pa_dtype):
+                scalars = scalars.dictionary_encode()
+            else:
+                scalars = scalars.cast(pa_dtype)
         arr = cls(scalars)
         if pa.types.is_duration(scalars.type) and scalars.null_count > 0:
             # GH52843: upstream bug for duration types when originally
@@ -868,7 +871,10 @@ def factorize(
         else:
             data = self._data
 
-        encoded = data.dictionary_encode(null_encoding=null_encoding)
+        if pa.types.is_dictionary(data.type):
+            encoded = data
+        else:
+            encoded = data.dictionary_encode(null_encoding=null_encoding)
         if encoded.length() == 0:
             indices = np.array([], dtype=np.intp)
             uniques = type(self)(pa.chunked_array([], type=encoded.type.value_type))
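The guard added to `factorize` can be illustrated without pandas or pyarrow. The following is a minimal pure-Python sketch, assuming a hypothetical `Encoded` container and hypothetical `dictionary_encode` / `factorize` helpers; none of these names are the pandas or pyarrow API. It only shows the idea behind the change: data that is already dictionary-encoded is passed through unchanged instead of being encoded a second time.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Hashable, Sequence, Union


@dataclass(frozen=True)
class Encoded:
    """Hypothetical stand-in for a dictionary-encoded array (not a pyarrow type)."""

    indices: tuple[int, ...]
    dictionary: tuple[Hashable, ...]


def dictionary_encode(values: Sequence[Hashable]) -> Encoded:
    """Replace each value with the code of its first occurrence."""
    positions: dict[Hashable, int] = {}
    indices = []
    for value in values:
        # An unseen value gets the next free code; a seen value reuses its code.
        indices.append(positions.setdefault(value, len(positions)))
    return Encoded(tuple(indices), tuple(positions))


def factorize(data: Union[Encoded, Sequence[Hashable]]) -> Encoded:
    """Encode plain data; return already-encoded data as-is (the guard)."""
    if isinstance(data, Encoded):
        return data
    return dictionary_encode(data)


plain = ["x", "y", "x"]
encoded = factorize(plain)
again = factorize(encoded)  # second call hits the guard and does no work
```

Here `encoded.indices` holds the integer codes and `encoded.dictionary` the distinct values in first-seen order, which mirrors the indices/uniques split that `factorize` works with in the hunk above.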
14 changes: 14 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -1831,6 +1831,20 @@ def test_searchsorted_with_na_raises(data_for_sorting, as_series):
         arr.searchsorted(b)
 
 
+def test_sort_values_dictionary():
+    df = pd.DataFrame(
+        {
+            "a": pd.Series(
+                ["x", "y"], dtype=ArrowDtype(pa.dictionary(pa.int32(), pa.string()))
+            ),
+            "b": [1, 2],
+        },
+    )
+    expected = df.copy()
+    result = df.sort_values(by=["a", "b"])
+    tm.assert_frame_equal(result, expected)
+
+
 @pytest.mark.parametrize("pat", ["abc", "a[a-z]{2}"])
 def test_str_count(pat):
     ser = pd.Series(["abc", None], dtype=ArrowDtype(pa.string()))