Sum of grouped bool column has inconsistent type #7001

Open
jkleint opened this Issue Apr 29, 2014 · 3 comments

jkleint commented Apr 29, 2014

Summing a bool column after a groupby gives a bool result until there are two or more True values, at which point it becomes a float64. It seems like it should always be an (unsigned?) integer. A straight sum without a groupby always gives an int64. This is with pandas 0.13.1.

pd.DataFrame([True]).groupby(lambda x: 0).sum()
      0
0  True

pd.DataFrame([True,True]).groupby(lambda x: 0).sum()
   0
0  2

pd.DataFrame([False]).groupby(lambda x: 0).sum()
       0
0  False

pd.DataFrame([False,False]).groupby(lambda x: 0).sum()
       0
0  False

pd.DataFrame([False,False,True]).groupby(lambda x: 0).sum()
      0
0  True

pd.DataFrame([False,False,True,True]).groupby(lambda x: 0).sum()
   0
0  2

pd.DataFrame([False,False]).sum()
0    0
dtype: int64
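
A possible workaround, not proposed in this thread and shown only as a sketch: casting the bool column to an integer dtype before grouping should make the dtype of the grouped sum independent of how many True values a group holds. The helper name `grouped_true_counts` and the choice of `int64` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical helper for illustration; not part of pandas.
def grouped_true_counts(frame):
    # Cast to int64 first so the grouped sum is an integer
    # no matter how many True values fall in each group.
    return frame.astype("int64").groupby(lambda x: 0).sum()

one_true = grouped_true_counts(pd.DataFrame([True]))
two_true = grouped_true_counts(pd.DataFrame([True, True]))
no_true = grouped_true_counts(pd.DataFrame([False, False]))

# Inspect the dtype of the single result column in each case.
print(one_true.dtypes.iloc[0], two_true.dtypes.iloc[0], no_true.dtypes.iloc[0])
```

The cost is an extra copy of the data, which is usually acceptable for a column of flags.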
jreback commented Apr 29, 2014

This is a dupe of #3752, but I like your examples better, so will keep this issue!

It's possible to fix, but hasn't been high on the list of priorities.

xflr6 commented Feb 14, 2016

As for getting float64 instead of int64 as the result, a possible workaround is to aggregate with numpy's count_nonzero instead of sum:

>>> pd.DataFrame([True,True]).groupby(lambda x: 0).agg(pd.np.count_nonzero)[0]
0    2
Name: 0, dtype: int64
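
The `pd.np` alias used above is no longer available in recent pandas releases, so on a current install the same workaround would need numpy imported directly. A minimal sketch, assuming numpy is installed alongside pandas (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([True, True])

# Count the True values per group instead of summing them.
# np.count_nonzero returns a plain integer for each group.
true_counts = df.groupby(lambda x: 0).agg(np.count_nonzero)[0]

print(true_counts)
```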
ediphy-dwild commented Oct 30, 2018

For some additional context: sometimes the user may not know they are dealing with a bool type. This may occur when performing a groupby on the result of pd.get_dummies, which may return columns of type uint8, but not always. If get_dummies returns a uint16, the issue above is not triggered, and dummies_result.groupby(...).sum() returns int types. If any of the counts in dummies is small enough, the groupby result will be float.
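
For the get_dummies case, one option is to request an explicit integer dtype up front so the grouped sum does not depend on whichever dtype get_dummies picks by default. This is only a sketch: the example data, the column names, and the choice of `int64` are assumptions, not something taken from the report above.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "color": ["red", "red", "blue"],
})

# Ask get_dummies for int64 indicator columns instead of its default dtype.
dummies = pd.get_dummies(df["color"], dtype="int64")

# Sum the indicators per group; the inputs are int64, so the sums are too.
counts = dummies.groupby(df["group"]).sum()

print(counts.dtypes)
print(counts)
```

Checking `dummies.dtypes` before grouping is a cheap way to confirm which dtype is actually being aggregated.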
