Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

Open
BraisLP opened this issue May 27, 2022 · 7 comments
Open
Labels
Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@BraisLP
Copy link

BraisLP commented May 27, 2022

I discovered this in pandas 1.2.5 and I tested it in 1.4.2 and 1.3.1.
I haven't found a similar issue reported here.
I'm using python 3.9.10

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

import pandas as pd

dummy = pd.Series([0]*6, name='dummy')
rowcat = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='category', name='cat')
rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')
values = pd.Series([1, 2, 2, 1, 2, 2], name='val')

withsum = pd.crosstab([rowstr, rowcat],  # it looks the same for [rowcat, rowstr]
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)

This produces the odd-looking matrix:

>>> withsum
dummy    0
str cat   
a   a    2
    b    0
b   a    0
    b    8

Compare to the following three examples:

withmin = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=min,  # or any other np agg func -> it's OK
)
>>> withmin
dummy    0
str cat   
a   a    1
b   b    2
withlambda = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=lambda x: sum(x),  # dirty fix -> it's OK!
)
>>> withlambda
dummy    0
str cat   
a   a    2
b   b    8
withoutfunc = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
)  # no agg func-> it's OK
>>> withoutfunc
dummy    0
str cat   
a   a    2
b   b    4
@BraisLP BraisLP changed the title Crosstab inconsistent with using aggfunc=sum when MultiIndex has category BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category May 28, 2022
@MaruthiKo
Copy link
Contributor

MaruthiKo commented May 29, 2022

I'm using pandas 1.3.4 and python 3.9.13
I am getting the same output
Screenshot (130)
Screenshot (129)
Screenshot (128)

@MaruthiKo
Copy link
Contributor

Hi, Can I help in contributing to this issue?

@simonjayhawkins
Copy link
Member

Thanks @BraisLP for the report

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

from the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

This produces the odd-looking matrix:

given the above, I'm not so sure if this isn't the expected behavior and the other cases are perhaps the incorrect ones.

for instance if the cat series included an unused 'c'.

for the withsum case in the OP, you get

dummy    0
str cat   
a   a    2
    b    0
    c    0
b   a    0
    b    8
    c    0

and for the withmin case

dummy    0
str cat   
a   a    1
b   b    2

The withmin case does not include the unused category as so is IMO incorrect and the withsum case repeats the unused category for all values in the first level which makes sense and consitent with including the extra rows in the withsum case without unused categories.

@simonjayhawkins simonjayhawkins added Bug Categorical Categorical Data Type labels May 29, 2022
@simonjayhawkins
Copy link
Member

#16367 related

@simonjayhawkins simonjayhawkins added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 29, 2022
@BraisLP
Copy link
Author

BraisLP commented May 30, 2022

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

withsum_inverted = pd.crosstab([rowcat, rowstr],
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)
>>> withsum_inverted
dummy    0
cat str   
a   a    2
    b    0
b   a    0
    b    8

@simonjayhawkins
Copy link
Member

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

again, what would be the expected output if the category Series includes an used category. The first level of the MultiIndex should include the unused category (according to the docs), but what would be the second level? maybe a missing value indicator (such as np.nan) or all of the unique values in the second level? (i.e. the MultiIndex should be the product of the categories with the unique values in the non categorical series)

creating the MutiIndex from the product would be consistent with the current sum behavior for the reversed level order.

@BraisLP
Copy link
Author

BraisLP commented May 30, 2022

again, what would be the expected output if the category Series includes an used category

In this last example, the column that gets "expanded" to include unused values is not categorical (the column titled "str" was defined in my first post as rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')). Maybe internally it got converted to a categorical.

Time permitting, I'm going to have a look at this in more detail...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants