Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix min_count issue for groupby.sum #32914

Merged
merged 30 commits into from
May 9, 2020
Merged

BUG: Fix min_count issue for groupby.sum #32914

merged 30 commits into from
May 9, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Mar 22, 2020


result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not ideal but it seems we get NA in the DataFrame case but NaN for Series. Should I add a FIXME comment with a follow-up issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? that seems odd, can you track this down

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue about this. I am not sure this is correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which part seems incorrect? The dtype of the index being object is maybe odd, but otherwise seems okay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems your comment about is not correct, e.g. we have NA in all cases. is that not what you meant here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that comment was from before adding the conversion using maybe_cast_result (so the original test had NaN for Series but NA for DataFrame): 2a3f814

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine then

@@ -553,7 +553,8 @@ def _cython_operation(
# Two options for avoiding this special case
# 1. mask-aware ops and avoid casting to float with NaN above
# 2. specify the result dtype when calling this method
result = result.astype("int64")
if not isna(result).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this condition to the above elif & add an appropriate comment as 3.


result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? that seems odd, can you track this down

@jreback jreback added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 22, 2020
@@ -577,6 +583,14 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't pretty but don't really know of a better way to make SeriesGroupBy return NA instead of NaN

Copy link
Contributor

@jreback jreback Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #32894 after we merge that you can just make a small mod to use the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged #32894 if you want to refactor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little confused, do you mean maybe_cast_result (maybe_cast_result_dtype returns a numpy int dtype itself)? When I try using maybe_cast_result though I just get a numpy array back

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the function which takes the name of the operation 'add' and the dtype and gives you back the casting dtype. which is eactly what we need to do here; you just need to modfy maybe_cast_result.

rather than have an else at all.

Copy link
Member Author

@dsaxton dsaxton Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be causing lots of test failures (e.g., also sending numpy arrays down this path raises when maybe_cast_result tries to access a _values attribute). Also seems we'd need to remove the check for result[0] being an instance of the original dtype because then maybe_cast_result doesn't try to convert NaN to a nullable integer, but then categorical tests end up failing. Probably worth refactoring but I'm honestly not sure how.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would like to fine a way of using maybe_cast_result, i don't mind have a very simpl elif here. but this is not good as written.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a couple changes (some from another PR) that simplifies it a little

@@ -577,6 +583,14 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to use maybe_cast_result_dtype here; I don't actually think we need the else any longer

doc/source/whatsnew/v1.1.0.rst Show resolved Hide resolved
@@ -575,6 +565,11 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if is_integer_dtype(orig_values.dtype) and is_extension_array_dtype(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole entire point of maybe_cast_result is that you don't need the if here

to be honest we should remove lines 558-566, but can do that in another PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the if about 550 groupby tests fail, though I think it works when only checking for extension arrays

and dtype.kind != "M"
and not is_categorical_dtype(dtype)
):
cls = dtype.construct_array_type()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF you need extra logic, then you can add it here.

@@ -575,6 +565,9 @@ def _cython_operation(
elif is_datetimelike and kind == "aggregate":
result = result.astype(orig_values.dtype)

if is_extension_array_dtype(orig_values.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you move line 568 (the is_extension_array_dtype) before line 558, can we then completely remove lines 558-556? (these are already extension dtypes).

Copy link
Member Author

@dsaxton dsaxton Apr 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it looks like we still get a lot of failures. I could add back the integer check to avoid trying to cast things twice?

@simonjayhawkins
Copy link
Member

@dsaxton what's the status here?

@dsaxton
Copy link
Member Author

dsaxton commented May 7, 2020

@dsaxton what's the status here?

It's working / green, but I think @jreback may want it refactored (I haven't had much success in deleting what looks like redundant code without causing tests to fail)

@simonjayhawkins
Copy link
Member

simonjayhawkins commented May 7, 2020

The 'regression' for Int64 could be considered a bug since Int64 is considered experimental.

I've not looked in detail yet, but the 'regression' was caused by #31359 see #32861 (comment) so would to check that the other EAs which aren't experimental aren't affected by this issue.

but I think @jreback may want it refactored

if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

@dsaxton
Copy link
Member Author

dsaxton commented May 7, 2020

if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

It doesn't affect others that I'm aware of (there's a similar but as far as I can tell unrelated problem with boolean data type which I created a separate issue for)

@dsaxton
Copy link
Member Author

dsaxton commented May 7, 2020

@jreback @simonjayhawkins Is is possible we can merge this as is? I have a fix for a separate bug that depends on this.

@dsaxton dsaxton mentioned this pull request May 7, 2020
5 tasks
@jreback jreback added this to the 1.1 milestone May 7, 2020
@jreback
Copy link
Contributor

jreback commented May 7, 2020

@jreback @simonjayhawkins Is is possible we can merge this as is? I have a fix for a separate bug that depends on this.

sure is this rebase on master?

@dsaxton
Copy link
Member Author

dsaxton commented May 7, 2020

sure is this rebase on master?

I think Simon merged this morning


result = grouped.sum(min_count=2)
expected = pd.DataFrame(
{"b": [pd.NA] * 3, "c": [pd.NA] * 3}, dtype="Int64", index=idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems your comment about is not correct, e.g. we have NA in all cases. is that not what you meant here?

@jreback jreback merged commit 6f5614b into pandas-dev:master May 9, 2020
@dsaxton dsaxton deleted the mincount-sum branch May 9, 2020 22:14
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
* Add test

* Check for null

* Release note

* Update and comment

* Update test

* Hack

* Try a different casting

* No pd

* Only for add

* Undo

* Release note

* Fix

* Space

* maybe_cast_result

* float -> Int

* Less if

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
@simonjayhawkins
Copy link
Member

if this issue does affect other EAs, then this PR is a candidate for backport, so would rather keep the changes here minimal.

It doesn't affect others that I'm aware of (there's a similar but as far as I can tell unrelated problem with boolean data type which I created a separate issue for)

there is a problem also with timedelta, see #34051 (comment), but this also occurs on 0.25.3

so won't backport for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calling sum with min_count on SeriesGroupBy with dtype Int64 gives large negative value rather than pd.NA
3 participants