Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Sum of grouped bool has inconsistent dtype #32894

Merged
merged 7 commits into from
Mar 26, 2020

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Mar 21, 2020

Would appreciate any feedback on this attempt.

The strategy is to modify the dtype after the aggregation is computed in certain cases when casting. In order for this to work, the cast functions need to be made aware of how the data was aggregated. I've added an optional "how" argument to maybe_downcast_numeric and _try_cast. Because this dtype change is needed in two places, I've added the function groupby_result_dtype to dtypes/common.py to handle the logic.

I wasn't sure where the mapping information needed by groupby_result_dtype should be stored. Currently it is in the function itself, but maybe there is a better place for it.

If this is a good approach, it could potentially be expanded for other aggregations and datatypes. One thought is that perhaps groupby(-).mean() should always return a float for numeric types.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a stab at this - the code paths here are tricky but for general feedback on your approach I think this should all be self-contained within groupby, i.e. not leaking into the dtype / inference modules

pandas/tests/groupby/test_groupby.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Mar 21, 2020
pandas/core/dtypes/common.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Mar 24, 2020

@rhshadrach if you can update

@rhshadrach
Copy link
Member Author

@jreback Failure is unrelated, ready for another review:

##[error]./pandas/tests/indexes/timedeltas/test_constructors.py:173:String has a space at the beginning instead of the end of the previous string.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. small comments, ping on green.

cc @TomAugspurger @jorisvandenbossche @WillAyd @jbrockmendel if any comments

pandas/core/arrays/__init__.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.1 milestone Mar 25, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - certainly a more logical placement of things

 - Reverted maybe_downcast_numeric to its original state
 - Parameterized tests
 - Removed unnecessary import of try_cast_to_ea from arrays.__init__
@rhshadrach
Copy link
Member Author

@jreback ping

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small follow request.

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually need a whatsnew note for this. if you can fix the out-of-context comment and change the name of the try_cast_to_ea as suggested. ping on green.

@rhshadrach
Copy link
Member Author

@jreback ping

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good. one more item. ping on green and can get this in.

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Show resolved Hide resolved
@rhshadrach
Copy link
Member Author

@jreback ping

@jreback jreback merged commit 1a57596 into pandas-dev:master Mar 26, 2020
@jreback
Copy link
Contributor

jreback commented Mar 26, 2020

very nice @rhshadrach keep em coming!

@rhshadrach
Copy link
Member Author

Thanks for the feedback @jbrockmendel. I'll clean these type hitns up in a subsequent PR.

@simonjayhawkins
Copy link
Member

I'll clean these type hitns up in a subsequent PR.

FYI mypy error: pandas\core\dtypes\cast.py:333:18: error: "type" has no attribute "_from_sequence"

so maybe worth adding type hints to maybe_cast_to_extension_array and add a subclass check to the instance check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sum of grouped bool column has inconsistent type
5 participants