BUG: groupby.count maintains masked and arrow dtypes #54129

Merged: 4 commits, Jul 24, 2023
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -569,6 +569,7 @@ Groupby/resample/rolling
- Bug in :meth:`GroupBy.var` failing to raise ``TypeError`` when called with datetime64, timedelta64 or :class:`PeriodDtype` values (:issue:`52128`, :issue:`53045`)
- Bug in :meth:`DataFrameGroupby.resample` with ``kind="period"`` raising ``AttributeError`` (:issue:`24103`)
- Bug in :meth:`Resampler.ohlc` with empty object returning a :class:`Series` instead of empty :class:`DataFrame` (:issue:`42902`)
- Bug in :meth:`SeriesGroupBy.count` and :meth:`DataFrameGroupBy.count` where the dtype would be ``np.int64`` for data with :class:`ArrowDtype` or masked dtypes (e.g. ``Int64``) (:issue:`53831`)
- Bug in :meth:`SeriesGroupBy.nth` and :meth:`DataFrameGroupBy.nth` after performing column selection when using ``dropna="any"`` or ``dropna="all"`` would not subset columns (:issue:`53518`)
- Bug in :meth:`SeriesGroupBy.nth` and :meth:`DataFrameGroupBy.nth` raised after performing column selection when using ``dropna="any"`` or ``dropna="all"`` resulted in rows being dropped (:issue:`53518`)
- Bug in :meth:`SeriesGroupBy.sum` and :meth:`DataFrameGroupby.sum` summing ``np.inf + np.inf`` and ``(-np.inf) + (-np.inf)`` to ``np.nan`` (:issue:`53606`)
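The user-visible effect of the new `count` entry above can be sketched as follows. This is an illustration only: it mirrors the masked-dtype test added in this PR, and it assumes a pandas release that includes the fix (2.1.0 according to the whatsnew file being edited). On older releases the counts are the same but the dtype is expected to be NumPy `int64`.

```python
import pandas as pd

# Two rows in a single group; column "B" has one missing value.
df = pd.DataFrame(
    {
        "A": [1, 1],
        "B": pd.array([1, pd.NA], dtype="Int64"),
        "C": pd.array([1, 1], dtype="Int64"),
    }
)

result = df.groupby("A").count()

# Counts are 1 for "B" (the pd.NA is skipped) and 2 for "C".
# With the fix applied, both columns should report the masked
# "Int64" dtype instead of NumPy "int64".
print(result)
print(result.dtypes)
```

Printing `result.dtypes` is the quickest way to check which behavior a given installation has.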
7 changes: 7 additions & 0 deletions pandas/core/groupby/groupby.py
@@ -104,6 +104,7 @@ class providing the base-class of operations.
    Categorical,
    ExtensionArray,
    FloatingArray,
    IntegerArray,
)
from pandas.core.base import (
    PandasObject,
@@ -2248,6 +2249,12 @@ def hfunc(bvalues: ArrayLike) -> ArrayLike:
                masked = mask & ~isna(bvalues)

            counted = lib.count_level_2d(masked, labels=ids, max_bin=ngroups)
            if isinstance(bvalues, BaseMaskedArray):
                return IntegerArray(
                    counted[0], mask=np.zeros(counted.shape[1], dtype=np.bool_)
                )
            elif isinstance(bvalues, ArrowExtensionArray):
                return type(bvalues)._from_sequence(counted[0])
            if is_series:
                assert counted.ndim == 2
                assert counted.shape[0] == 1
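The logic in this hunk can be approximated in plain Python to make the idea easier to follow. The sketch below is not the pandas implementation: `count_per_group` and `as_masked_result` are hypothetical helpers, `None` stands in for a missing value, and it assumes that rows with a group id of `-1` (missing group key) are excluded, which is what `mask` appears to encode in the surrounding code. The key point it shows is that a count is never missing, so the result can be wrapped with an all-`False` mask.

```python
from typing import List, Optional, Sequence, Tuple


def count_per_group(
    values: Sequence[Optional[int]],
    group_ids: Sequence[int],
    ngroups: int,
) -> List[int]:
    """Count non-missing values per group (hypothetical helper)."""
    counts = [0] * ngroups
    for value, gid in zip(values, group_ids):
        # Assumption: a negative id marks a row whose group key is missing.
        if gid >= 0 and value is not None:
            counts[gid] += 1
    return counts


def as_masked_result(counts: List[int]) -> Tuple[List[int], List[bool]]:
    """Pair the counts with an all-False mask, since a count is never NA."""
    return counts, [False] * len(counts)


counts = count_per_group([1, None, 3, 4], [0, 0, 1, -1], ngroups=2)
data, mask = as_masked_result(counts)
```

In the actual change, the same pairing is done by passing `counted[0]` and a zero-filled boolean array to `IntegerArray`, while Arrow-backed columns are rebuilt through `type(bvalues)._from_sequence`.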
12 changes: 12 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -3154,6 +3154,18 @@ def test_to_numpy_temporal(pa_type):
    tm.assert_numpy_array_equal(result, expected)


def test_groupby_count_return_arrow_dtype(data_missing):
    df = pd.DataFrame({"A": [1, 1], "B": data_missing, "C": data_missing})
    result = df.groupby("A").count()
    expected = pd.DataFrame(
        [[1, 1]],
        index=pd.Index([1], name="A"),
        columns=["B", "C"],
        dtype="int64[pyarrow]",
    )
    tm.assert_frame_equal(result, expected)


def test_arrowextensiondtype_dataframe_repr():
    # GH 54062
    df = pd.DataFrame(
18 changes: 18 additions & 0 deletions pandas/tests/groupby/aggregate/test_cython.py
@@ -343,9 +343,27 @@ def test_cython_agg_nullable_int(op_name):
        # the result is not yet consistently using Int64/Float64 dtype,
        # so for now just checking the values by casting to float
        result = result.astype("float64")
Review comment (Member):

Out of scope here, but I think this might pass now without the cast? Should probably be xfailed if not. I plan to look into this.

    else:
        result = result.astype("int64")
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dtype", ["Int64", "Float64", "boolean"])
def test_count_masked_returns_masked_dtype(dtype):
    df = DataFrame(
        {
            "A": [1, 1],
            "B": pd.array([1, pd.NA], dtype=dtype),
            "C": pd.array([1, 1], dtype=dtype),
        }
    )
    result = df.groupby("A").count()
    expected = DataFrame(
        [[1, 2]], index=Index([1], name="A"), columns=["B", "C"], dtype="Int64"
    )
    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("with_na", [True, False])
@pytest.mark.parametrize(
    "op_name, action",