BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

BraisLP · 2022-05-27T14:39:34Z

I discovered this in pandas 1.2.5 and I tested it in 1.4.2 and 1.3.1.
I haven't found a similar issue reported here.
I'm using python 3.9.10

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

import pandas as pd

dummy = pd.Series([0]*6, name='dummy')
rowcat = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='category', name='cat')
rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')
values = pd.Series([1, 2, 2, 1, 2, 2], name='val')

withsum = pd.crosstab([rowstr, rowcat],  # it looks the same for [rowcat, rowstr]
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)

This produces the odd-looking matrix:

>>> withsum
dummy    0
str cat   
a   a    2
    b    0
b   a    0
    b    8

Compare to the following three examples:

withmin = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=min,  # or any other np agg func -> it's OK
)
>>> withmin
dummy    0
str cat   
a   a    1
b   b    2

withlambda = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
            values=values,
            aggfunc=lambda x: sum(x),  # dirty fix -> it's OK!
)
>>> withlambda
dummy    0
str cat   
a   a    2
b   b    8

withoutfunc = pd.crosstab(index=[rowstr, rowcat],
            columns=dummy,
)  # no agg func-> it's OK
>>> withoutfunc
dummy    0
str cat   
a   a    2
b   b    4

The text was updated successfully, but these errors were encountered:

MaruthiKo · 2022-05-29T11:46:26Z

I'm using pandas 1.3.4 and python 3.9.13
I am getting the same output

MaruthiKo · 2022-05-29T11:49:04Z

Hi, Can I help in contributing to this issue?

simonjayhawkins · 2022-05-29T12:26:32Z

Thanks @BraisLP for the report

Here's a "minimal" example: if I provide e.g. two rows to pd.crosstab, one of which is of categorical type, there's odd behaviour exclusively when aggfunc=sum or np.sum:

from the docs https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

This produces the odd-looking matrix:

given the above, I'm not so sure if this isn't the expected behavior and the other cases are perhaps the incorrect ones.

for instance if the cat series included an unused 'c'.

for the withsum case in the OP, you get

dummy    0
str cat   
a   a    2
    b    0
    c    0
b   a    0
    b    8
    c    0

and for the withmin case

dummy    0
str cat   
a   a    1
b   b    2

The withmin case does not include the unused category as so is IMO incorrect and the withsum case repeats the unused category for all values in the first level which makes sense and consitent with including the extra rows in the withsum case without unused categories.

simonjayhawkins · 2022-05-29T12:31:00Z

#16367 related

BraisLP · 2022-05-30T10:51:22Z

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

withsum_inverted = pd.crosstab([rowcat, rowstr],
            columns=dummy,
            values=values,
            aggfunc=sum,  # same for np.sum
)

>>> withsum_inverted
dummy    0
cat str   
a   a    2
    b    0
b   a    0
    b    8

simonjayhawkins · 2022-05-30T11:21:06Z

I thought I'd add the example with the inverted row order that I mention in a comment: it includes all of the "categories" even if the inner row is not categorical.

again, what would be the expected output if the category Series includes an used category. The first level of the MultiIndex should include the unused category (according to the docs), but what would be the second level? maybe a missing value indicator (such as np.nan) or all of the unique values in the second level? (i.e. the MultiIndex should be the product of the categories with the unique values in the non categorical series)

creating the MutiIndex from the product would be consistent with the current sum behavior for the reversed level order.

BraisLP · 2022-05-30T12:41:37Z

again, what would be the expected output if the category Series includes an used category

In this last example, the column that gets "expanded" to include unused values is not categorical (the column titled "str" was defined in my first post as rowstr = pd.Series(['a', 'b', 'b', 'a', 'b', 'b'], dtype='str', name='str')). Maybe internally it got converted to a categorical.

Time permitting, I'm going to have a look at this in more detail...

BraisLP changed the title ~~Crosstab inconsistent with using aggfunc=sum when MultiIndex has category~~ BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category May 28, 2022

simonjayhawkins added Bug Categorical Categorical Data Type labels May 29, 2022

simonjayhawkins added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

BraisLP commented May 27, 2022

MaruthiKo commented May 29, 2022 •

edited

MaruthiKo commented May 29, 2022

simonjayhawkins commented May 29, 2022

simonjayhawkins commented May 29, 2022

BraisLP commented May 30, 2022

simonjayhawkins commented May 30, 2022

BraisLP commented May 30, 2022

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

BUG: Crosstab inconsistent with using aggfunc=sum when MultiIndex has category #47147

Comments

BraisLP commented May 27, 2022

MaruthiKo commented May 29, 2022 • edited

MaruthiKo commented May 29, 2022

simonjayhawkins commented May 29, 2022

simonjayhawkins commented May 29, 2022

BraisLP commented May 30, 2022

simonjayhawkins commented May 30, 2022

BraisLP commented May 30, 2022

MaruthiKo commented May 29, 2022 •

edited