BUG: Fix min_count issue for groupby.sum #32914

dsaxton · 2020-03-22T20:58:06Z

closes #32861 (Calling sum with min_count on SeriesGroupBy with dtype Int64 gives large negative value rather than pd.NA)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
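
For context, a minimal sketch of the behaviour this PR targets, assuming a recent pandas where the fix is in place. The frame and column names are made up for illustration; they are not taken from #32861.

```python
import pandas as pd

# Illustrative data (not from the linked issue):
# group 1 has two values, group 2 has only one.
df = pd.DataFrame(
    {"a": [1, 1, 2], "b": pd.array([1, 2, 3], dtype="Int64")}
)

# min_count=2 means a group needs at least two non-missing values
# before a sum is reported for it.
result = df.groupby("a")["b"].sum(min_count=2)

# Group 2 falls short of min_count, so its entry should be pd.NA
# (not a large negative integer) and the dtype should stay Int64.
print(result)
```

Group 1 meets the threshold and gets an ordinary integer sum; only group 2 is expected to be missing.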

dsaxton · 2020-03-22T21:00:17Z

pandas/tests/groupby/test_function.py

+
+    result = grouped.sum(min_count=2)
+    expected = pd.DataFrame(
+        {"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx


Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?

huh? that seems odd, can you track this down

can you create an issue about this. I am not sure this is correct.

Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?

it seems your comment above is not correct, i.e. we have NA in all cases. is that not what you meant here?

I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814

ok this is fine then
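
A small sketch of the point this thread settles on, assuming a pandas version that includes the fix: both the DataFrameGroupBy and the SeriesGroupBy path should give pd.NA with the Int64 dtype. The data below is illustrative and is not the fixture used in test_function.py.

```python
import pandas as pd

# Illustrative data: every group has a single row,
# so min_count=2 is never satisfied.
df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": pd.array([1, 2, 3], dtype="Int64"),
        "c": pd.array([4, 5, 6], dtype="Int64"),
    }
)
grouped = df.groupby("a")

frame_result = grouped.sum(min_count=2)        # DataFrameGroupBy path
series_result = grouped["b"].sum(min_count=2)  # SeriesGroupBy path

# Both results are expected to be all-NA and to keep the Int64 dtype.
print(frame_result.dtypes)
print(series_result.dtype)
```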

jreback · 2020-03-22T21:05:57Z

pandas/core/groupby/ops.py

@@ -553,7 +553,8 @@ def _cython_operation(
            # Two options for avoiding this special case
            # 1. mask-aware ops and avoid casting to float with NaN above
            # 2. specify the result dtype when calling this method
-            result = result.astype("int64")
+            if not isna(result).any():


move this condition to the above elif & add an appropriate comment as 3.
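
To illustrate why the unconditional `astype("int64")` is a problem, here is a rough NumPy-only sketch of the guard in the hunk above. `cast_back_to_int64` is a hypothetical helper named for this example, not a function in pandas, and it uses `np.isnan` where the diff uses `isna`.

```python
import numpy as np


def cast_back_to_int64(result: np.ndarray) -> np.ndarray:
    """Hypothetical helper: cast a float result to int64 only when safe."""
    # NaN has no int64 representation, so casting it yields an arbitrary
    # (platform-dependent) integer. Leave the array as float in that case.
    if np.isnan(result).any():
        return result
    return result.astype("int64")


all_valid = cast_back_to_int64(np.array([3.0, 7.0]))
has_missing = cast_back_to_int64(np.array([3.0, np.nan]))

print(all_valid.dtype)    # integer dtype, no information lost
print(has_missing.dtype)  # still float, NaN preserved
```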

jreback · 2020-03-22T21:06:37Z

pandas/tests/groupby/test_function.py

+
+    result = grouped.sum(min_count=2)
+    expected = pd.DataFrame(
+        {"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx


huh? that seems odd, can you track this down

dsaxton · 2020-03-23T00:54:17Z

pandas/core/groupby/ops.py

@@ -577,6 +583,14 @@ def _cython_operation(
        elif is_datetimelike and kind == "aggregate":
            result = result.astype(orig_values.dtype)

+        if (


This isn't pretty but don't really know of a better way to make SeriesGroupBy return NA instead of NaN

see #32894; after we merge that you can just make a small mod to use the same

merged #32894 if you want to refactor

you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer

A little confused, do you mean maybe_cast_result (maybe_cast_result_dtype returns a numpy int dtype itself)? When I try using maybe_cast_result though I just get a numpy array back

look at the function which takes the name of the operation 'add' and the dtype and gives you back the casting dtype, which is exactly what we need to do here; you just need to modify maybe_cast_result.

rather than have an else at all.

This seems to be causing lots of test failures (e.g., also sending numpy arrays down this path raises when maybe_cast_result tries to access a _values attribute). Also seems we'd need to remove the check for result[0] being an instance of the original dtype because then maybe_cast_result doesn't try to convert NaN to a nullable integer, but then categorical tests end up failing. Probably worth refactoring but I'm honestly not sure how.

i would like to find a way of using maybe_cast_result, i don't mind having a very simple elif here. but this is not good as written.

Made a couple of changes (some from another PR) that simplify it a little

jreback · 2020-03-27T15:42:42Z

pandas/core/groupby/ops.py

@@ -577,6 +583,14 @@ def _cython_operation(
        elif is_datetimelike and kind == "aggregate":
            result = result.astype(orig_values.dtype)

+        if (


you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer

doc/source/whatsnew/v1.1.0.rst

jreback · 2020-04-03T22:38:18Z

pandas/core/groupby/ops.py

@@ -575,6 +565,11 @@ def _cython_operation(
        elif is_datetimelike and kind == "aggregate":
            result = result.astype(orig_values.dtype)

+        if is_integer_dtype(orig_values.dtype) and is_extension_array_dtype(


the whole entire point of maybe_cast_result is that you don't need the if here

to be honest we should remove lines 558-566, but can do that in another PR

Without the if about 550 groupby tests fail, though I think it works when only checking for extension arrays

jreback · 2020-04-03T22:38:41Z

pandas/core/dtypes/cast.py

+            and dtype.kind != "M"
+            and not is_categorical_dtype(dtype)
+        ):
+            cls = dtype.construct_array_type()


IF you need extra logic, then you can add it here.
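
As a rough illustration of what `dtype.construct_array_type()` gives you for a nullable integer dtype, and how a float result containing NaN can be turned back into that array type. This is only a sketch: calling `_from_sequence` directly is an assumption made for the example, and the actual cast in pandas goes through its own internal helper.

```python
import numpy as np
import pandas as pd

dtype = pd.Int64Dtype()

# construct_array_type() returns the array class that backs the dtype.
cls = dtype.construct_array_type()

# A float result with NaN, standing in for an aggregation output that
# was cast to float to represent a missing group.
float_result = np.array([3.0, np.nan])

# Assumption for illustration: build the nullable array directly from
# the float values. NaN becomes pd.NA, whole floats become integers.
nullable_result = cls._from_sequence(float_result, dtype=dtype)

print(cls.__name__)
print(nullable_result)
```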

jreback · 2020-04-10T17:08:17Z

pandas/core/groupby/ops.py

@@ -575,6 +565,9 @@ def _cython_operation(
        elif is_datetimelike and kind == "aggregate":
            result = result.astype(orig_values.dtype)

+        if is_extension_array_dtype(orig_values.dtype):


if you move line 568 (the is_extension_array_dtype) before line 558, can we then completely remove lines 558-566? (these are already extension dtypes).

No, it looks like we still get a lot of failures. I could add back the integer check to avoid trying to cast things twice?

simonjayhawkins · 2020-05-07T12:18:37Z

@dsaxton what's the status here?

dsaxton · 2020-05-07T16:07:58Z

> @dsaxton what's the status here?

It's working / green, but I think @jreback may want it refactored (I haven't had much success in deleting what looks like redundant code without causing tests to fail)

simonjayhawkins · 2020-05-07T16:20:53Z

The 'regression' for Int64 could be considered a bug since Int64 is considered experimental.

I've not looked in detail yet, but the 'regression' was caused by #31359, see #32861 (comment), so we would need to check that the other EAs which aren't experimental aren't affected by this issue.

> but I think @jreback may want it refactored

if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

dsaxton · 2020-05-07T16:58:18Z

> if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

It doesn't affect others that I'm aware of (there's a similar but as far as I can tell unrelated problem with boolean data type which I created a separate issue for)

dsaxton · 2020-05-07T20:53:50Z

@jreback @simonjayhawkins Is it possible we can merge this as is? I have a fix for a separate bug that depends on this.

jreback · 2020-05-07T20:55:02Z

> @jreback @simonjayhawkins Is it possible we can merge this as is? I have a fix for a separate bug that depends on this.

sure, is this rebased on master?

dsaxton · 2020-05-07T20:57:44Z

> sure, is this rebased on master?

I think Simon merged this morning

jreback · 2020-05-09T19:35:20Z

pandas/tests/groupby/test_function.py

+
+    result = grouped.sum(min_count=2)
+    expected = pd.DataFrame(
+        {"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx


it seems your comment above is not correct, i.e. we have NA in all cases. is that not what you meant here?

* Add test
* Check for null
* Release note
* Update and comment
* Update test
* Hack
* Try a different casting
* No pd
* Only for add
* Undo
* Release note
* Fix
* Space
* maybe_cast_result
* float -> Int
* Less if

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

simonjayhawkins · 2020-05-11T13:09:02Z

> if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

> It doesn't affect others that I'm aware of (there's a similar but as far as I can tell unrelated problem with boolean data type which I created a separate issue for)

there is a problem also with timedelta, see #34051 (comment), but this also occurs on 0.25.3

so won't backport for now.

dsaxton added 3 commits March 22, 2020 15:54

Add test

2a3f814

Check for null

3ce0384

Release note

e25c823

dsaxton commented Mar 22, 2020

jreback requested changes Mar 22, 2020

jreback added the Groupby and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Mar 22, 2020

dsaxton added 3 commits March 22, 2020 16:37

Update and comment

2656bc3

Update test

300b1e5

Hack

e2e08e2

dsaxton commented Mar 23, 2020

dsaxton added 7 commits March 24, 2020 15:02

Merge remote-tracking branch 'upstream/master' into mincount-sum

ea59d54

Merge remote-tracking branch 'upstream/master' into mincount-sum

ab1050f

Merge remote-tracking branch 'upstream/master' into mincount-sum

98f0922

Merge remote-tracking branch 'upstream/master' into mincount-sum

ea1ef36

Try a different casting

cdf7dfb

No pd

b4edea3

Only for add

02fc55c

jreback requested changes Mar 27, 2020

dsaxton added 6 commits March 27, 2020 10:57

Merge remote-tracking branch 'upstream/master' into mincount-sum

3044843

Undo

fe0ab93

Release note

e6f5b4d

Fix

c29dfad

Space

fb6b1d5

Merge remote-tracking branch 'upstream/master' into mincount-sum

46eb601

dsaxton mentioned this pull request Mar 28, 2020

BUG: Don't cast nullable Boolean to float in groupby #33089

Merged

dsaxton added 4 commits March 28, 2020 13:33

Merge remote-tracking branch 'upstream/master' into mincount-sum

2e65c14

Merge remote-tracking branch 'upstream/master' into mincount-sum

a36131a

maybe_cast_result

7c815b5

float -> Int

aeec4b0

jreback requested changes Apr 3, 2020

dsaxton added 4 commits April 3, 2020 21:25

Merge remote-tracking branch 'upstream/master' into mincount-sum

01c1d56

Less if

fc0f406

Merge remote-tracking branch 'upstream/master' into mincount-sum

47c19e2

Merge remote-tracking branch 'upstream/master' into mincount-sum

8da2977

jreback requested changes Apr 10, 2020

dsaxton added 2 commits April 14, 2020 17:15

Merge remote-tracking branch 'upstream/master' into mincount-sum

d74a905

Merge remote-tracking branch 'upstream/master' into mincount-sum

ca3659e

simonjayhawkins added the Still Needs Manual Backport label May 7, 2020

Merge remote-tracking branch 'upstream/master' into mincount-sum

0443d73

dsaxton mentioned this pull request May 7, 2020

BUG: Make nullable booleans numeric #34056

Merged

jreback removed the Still Needs Manual Backport label May 7, 2020

jreback added this to the 1.1 milestone May 7, 2020

jreback requested changes May 9, 2020

jreback approved these changes May 9, 2020

jreback merged commit 6f5614b into pandas-dev:master May 9, 2020

dsaxton deleted the mincount-sum branch May 9, 2020 22:14
