Backport PR #52765 on branch 2.0.x (BUG: ArrowExtensionArray returning approximate median) (#52785)

Backport PR #52765: BUG: ArrowExtensionArray returning approximate median

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
meeseeksmachine and mroeschke committed Apr 19, 2023
1 parent 5a95983 commit dd8533e
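
The fix replaces an approximate median with an exact quantile at `q=0.5`, then unwraps the array-shaped result by taking element 0. The sketch below illustrates that idea in pure Python. `quantile_linear` and `exact_median` are hypothetical stand-ins written for this note, not pyarrow or pandas functions, and the linear interpolation is assumed here to match the default behaviour of the quantile kernel the patch calls.

```python
import math
import statistics


def quantile_linear(values, q):
    """Hypothetical stand-in for an exact quantile with linear interpolation.

    Returns a one-element list to mirror the array-shaped result that the
    patch unwraps with ``result[0]``.
    """
    if not values:
        return [None]
    ordered = sorted(values)
    # Fractional position of the requested quantile in the sorted data.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    # Interpolate between the two neighbouring order statistics.
    return [ordered[lower] + (ordered[upper] - ordered[lower]) * frac]


def exact_median(values):
    # Same two steps as the patch: request q=0.5, then take element 0.
    return quantile_linear(values, q=0.5)[0]


median_of_two = exact_median([1, 2])
median_of_odd = exact_median([5, 1, 3])
# Cross-check against the standard library's exact median.
matches_stdlib = exact_median([4, 1, 3, 2]) == statistics.median([4, 1, 3, 2])
```

For the two-element input used in the new test, `[1, 2]`, the interpolated value is exactly halfway between the two elements, which is the result the test asserts.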
Showing 3 changed files with 14 additions and 1 deletion.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.1.rst
@@ -26,6 +26,7 @@ Bug fixes
 ~~~~~~~~~
 - Bug in :attr:`Series.dt.days` that would overflow ``int32`` number of days (:issue:`52391`)
 - Bug in :class:`arrays.DatetimeArray` constructor returning an incorrect unit when passed a non-nanosecond numpy datetime array (:issue:`52555`)
+- Bug in :func:`Series.median` with :class:`ArrowDtype` returning an approximate median (:issue:`52679`)
 - Bug in :func:`pandas.testing.assert_series_equal` where ``check_dtype=False`` would still raise for datetime or timedelta types with different resolutions (:issue:`52449`)
 - Bug in :func:`read_csv` casting PyArrow datetimes to NumPy when ``dtype_backend="pyarrow"`` and ``parse_dates`` is set causing a performance bottleneck in the process (:issue:`52546`)
 - Bug in :func:`to_datetime` and :func:`to_timedelta` when trying to convert numeric data with a :class:`ArrowDtype` (:issue:`52425`)
8 changes: 7 additions & 1 deletion pandas/core/arrays/arrow/array.py
@@ -1259,7 +1259,7 @@ def pyarrow_meth(data, skip_nulls, **kwargs):

         else:
             pyarrow_name = {
-                "median": "approximate_median",
+                "median": "quantile",
                 "prod": "product",
                 "std": "stddev",
                 "var": "variance",
@@ -1275,6 +1275,9 @@ def pyarrow_meth(data, skip_nulls, **kwargs):
         # GH51624: pyarrow defaults to min_count=1, pandas behavior is min_count=0
         if name in ["any", "all"] and "min_count" not in kwargs:
             kwargs["min_count"] = 0
+        elif name == "median":
+            # GH 52679: Use quantile instead of approximate_median
+            kwargs["q"] = 0.5

         try:
             result = pyarrow_meth(data_to_reduce, skip_nulls=skipna, **kwargs)
@@ -1286,6 +1289,9 @@ def pyarrow_meth(data, skip_nulls, **kwargs):
                 f"upgrading pyarrow."
             )
             raise TypeError(msg) from err
+        if name == "median":
+            # GH 52679: Use quantile instead of approximate_median; returns array
+            result = result[0]
         if pc.is_null(result).as_py():
             return self.dtype.na_value

6 changes: 6 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -506,6 +506,12 @@ def test_reduce_series(self, data, all_numeric_reductions, skipna, request):
             request.node.add_marker(xfail_mark)
         super().test_reduce_series(data, all_numeric_reductions, skipna)

+    @pytest.mark.parametrize("typ", ["int64", "uint64", "float64"])
+    def test_median_not_approximate(self, typ):
+        # GH 52679
+        result = pd.Series([1, 2], dtype=f"{typ}[pyarrow]").median()
+        assert result == 1.5
+

 class TestBaseBooleanReduce(base.BaseBooleanReduceTests):
     @pytest.mark.parametrize("skipna", [True, False])