Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Closed
Khris777 opened this issue May 29, 2018 · 4 comments · Fixed by #40790
Closed

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Khris777 opened this issue May 29, 2018 · 4 comments · Fixed by #40790
Labels
Apply Apply, Aggregate, Transform Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Milestone

Comments

@Khris777
Copy link

Khris777 commented May 29, 2018

Problem Description

I stumbled over some very specific off behaviour when trying to do a multi-index groupby over several boolean columns using agg with a dictionary to apply a function to each column that would count the number of True-values by using the boolean series as a boolean index on itself and getting the length of that.

Example code:

NN = 1
l0 = NN*['a','a','b','b','b','b','b','b','b']
l1 = NN*['x','y','x','x','x','y','y','z','z']
l2 = NN*[True,True,False,True,False,False,False,True,True]
l3 = NN*[True,False,False,True,False,True,False,True,False]
l4 = NN*[False,False,True,True,True,True,False,True,False]

df = pd.DataFrame({"C0":l0,'C1':l1,'C2':l2,'C3':l3,'C4':l4})
print(df.groupby(["C0","C1"])[["C2","C3","C4"]]\
.agg({"C2":lambda x: len(x[x]),"C3":lambda x: len(x[x]),"C4":lambda x: len(x[x])}))

The result I'm getting is:

	   C2     C3  C4
C0 C1               
a  x    1   True   0
   y    1  False   0
b  x    1   True   2
   y    0   True   1
   z    2   True   1

As you can see, the second column isn't properly aggregated for some reason.

If I set NN = 10, I'm getting a proper result:

	   C2  C3  C4
C0 C1            
a  x   10  10   0
   y   10   0   0
b  x   10  10  30
   y    0  10  10
   z   20  10  10

The problem only happens when using a multi-index, only with the C3 column, only when NN=1 and only when using the agg function (it works fine with apply on the single column), so it can be reduced to this call:

print(df.groupby(["C0","C1"])["C3"].agg(lambda x: len(x[x])))

It seems like the specific combination of the multi-index and the boolean values causes some odd behaviour.

EDIT: I did not write the C3 data specifically to cause the error, I just played with random combinations around until the error appeared, so I have no clue, why exactly that combination causes the problem.

In my real application this happens to two of three columns in a dataframe with 570 rows, so it's not limited to very short dataframes like in the example.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

@WillAyd
Copy link
Member

WillAyd commented May 29, 2018

Do you have any way to simplify this example?

FWIW you could just sum your boolean columns to get the same result

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label May 29, 2018
@Khris777
Copy link
Author

Khris777 commented May 29, 2018

This simplified example still shows the errorneous behaviour:

df = pd.DataFrame({"C0":['a','a'],'C1':['x','y'],'C3':[True,False]})
print(df.groupby(["C0","C1"])["C3"].agg(lambda x: len(x[x])))

Whereas using apply shows the correct behaviour:

print(df.groupby(["C0","C1"])["C3"].apply(lambda x: len(x[x])))

@WillAyd
Copy link
Member

WillAyd commented May 29, 2018

OK thanks for the last example. I suppose the difference is that agg is casting the result of your anonymous function back to boolean when the result is zeros / ones and apply is not. This also explains why none of the other columns were cast back to boolean (they had more than just 0s and 1s) and also why having N != 1 would not cast back to boolean either.

There are a variety of ways to handle this from the end user perspective, whether it be through explicit typing or again using something like sum which would be more idiomatic and performant than what you are trying to do anyway.

That said, investigation and PRs to align the casting rules for apply and agg are always welcome!

@WillAyd WillAyd added Apply Apply, Aggregate, Transform Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Info Clarification about behavior needed to assess issue labels May 29, 2018
@jreback
Copy link
Contributor

jreback commented May 29, 2018

xref #14873 is the root issue here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants