BUG: pivot_table with margins=True fails for categorical dtype, #10989 #10993

Closed
wants to merge 2 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

jakevdp commented Sep 4, 2015

This is a fix for the issue reported in #10989. I suspect this is an example of "fixing the symptom" rather than "fixing the problem", but I think it makes clear what the source of the problem is: to compute margins, the pivot table must add a row and/or column to the result. If the index or column is categorical, a new value cannot be added.

Let me know if you think there are better approaches to this.

@jreback jreback commented on the diff Sep 4, 2015

pandas/tools/pivot.py
@@ -159,6 +159,19 @@ def _add_margins(table, data, values, rows, cols, aggfunc):
grand_margin = _compute_grand_margin(data, values, aggfunc)
+ # categorical index or columns will fail below when 'All' is added
+ # here we'll convert all categorical indices to object
+ def convert_categorical(ind):
+ _convert = lambda ind: (ind.astype('object')
+ if ind.dtype.name == 'category' else ind)
@jreback

jreback Sep 4, 2015

Contributor

this is not quite right, this should end of being a cat level.

can you add the tests case and i'll take a look. thx.

@jakevdp

jakevdp Sep 4, 2015

Contributor

The test case is in the linked issue. Or would you like me to add a unit test to the package as part of this PR?

@jreback

jreback Sep 4, 2015

Contributor

yes pls

@jakevdp

jakevdp Sep 4, 2015

Contributor

Done

@jreback

jreback Oct 18, 2015

Contributor

this is too specific
need to be more general and/or pushed down into the index itself

Contributor

jakevdp commented Sep 4, 2015

I should add one thing: I decided that rather than adding a new item to the category, it would be cleaner to simply change categories to objects rather than trying to track down all the corner cases of hierarchical indices with categorical levels.

Contributor

jakevdp commented Sep 4, 2015

A much cleaner solution, IMO, would be to add a utility function that does something along the lines of "add a new entry to this index, even if it requires a new category". I imagine there are other places in the package where this sort of thing might happen, and such a utility could be used there as well.

Contributor

TomAugspurger commented Sep 4, 2015

I think we should return a regular Index with object dtypes. What happens if the user has a category called 'All'? I suspect that any fix involving categories will break/be fragile (just a hunch, haven't tried).

Contributor

jakevdp commented Sep 4, 2015

@TomAugspurger – I checked – it turns out this is another bug in the current codebase, even if you're not using categories! If one of the index entries is called All, the attempt to compute margins will overwrite it:

In [19]: data = pd.DataFrame({'x': np.arange(99),
                     'y': np.arange(99) // 50,
                     'z': np.arange(99) % 3})

In [20]: data.z = np.array(['Any', 'All', 'None'])[data.z]

In [21]: data.pivot_table('x', 'y', 'z')
Out[21]: 
z   All   Any  None
y                  
0  25.0  24.0  24.5
1  74.5  73.5  74.0

In [22]: data.pivot_table('x', 'y', 'z', margins=True)
Out[22]: 
z     All   Any  None
y                    
0    24.5  24.0  24.5
1    74.0  73.5  74.0
All  49.0  48.0  50.0

jreback changed the title from BUG: quick fix for #10989 to BUG: pivot_table with margins=True fails for categorical dtype, #10989 Sep 10, 2015

Contributor

jreback commented Oct 18, 2015

@jakevdp can you update according to comments

jreback closed this Oct 18, 2015

Contributor

jakevdp commented Oct 18, 2015

I think I've already addressed all comments above – any others you have in mind? Any reason you closed this without merging?

Contributor

jreback commented Oct 18, 2015

I thought I put the comments before

this needs a more general soln as it too much if/then on type determination

jreback reopened this Oct 18, 2015

Contributor

jakevdp commented Oct 18, 2015

I guess I'm not entirely clear about what you're wanting as a "more general" solution. Any specific ideas?

Contributor

jreback commented Oct 18, 2015

So there is something more going on here; this bug report is a sympton of a different issue. Namely,
that we allow a Categorical as a level of a MultiIndex. But in fact we should simply convert them directly to an object dtype when creating the multi-index in the first place; these are de-facto the same.

In [4]: data = pd.DataFrame({'x': np.arange(8),'y': Series(np.arange(8) // 4).astype('category'),'z': Series(np.arange(8) % 2).astype('category')})

In [5]: data
Out[5]: 
   x  y  z
0  0  0  0
1  1  0  1
2  2  0  0
3  3  0  1
4  4  1  0
5  5  1  1
6  6  1  0
7  7  1  1

In [6]: data.dtypes
Out[6]: 
x       int64
y    category
z    category
dtype: object

In [7]: data.groupby(['y','z']).agg('mean')
Out[7]: 
     x
y z   
0 0  1
  1  2
1 0  5
  1  6
In [7]: data.groupby(['y','z']).agg('mean')
Out[7]: 
     x
y z   
0 0  1
  1  2
1 0  5
  1  6

In [8]: data.groupby(['y','z']).agg('mean').index.levels[0]
Out[8]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'y', dtype='category')

In [9]: data.groupby(['y','z']).agg('mean').index.levels[1]
Out[9]: CategoricalIndex([0, 1], categories=[0, 1], ordered=False, name=u'z', dtype='category')

In [10]: data.groupby(['y','z']).agg('mean').index.values   
Out[10]: array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=object)

So this is the same as the index in [10]

In [15]: idx = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)],names=['y','z'])

In [16]: idx
Out[16]: 
MultiIndex(levels=[[0, 1], [0, 1]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'y', u'z'])

So I think that MultiIndex creation should coerce Categoricals on construction. This can be done in the MultiIndex.__init__ (keep all of these existing test), prob need a couple more.

jreback added the MultiIndex label Oct 18, 2015

jreback added this to the 0.17.1 milestone Oct 18, 2015

Contributor

jakevdp commented Oct 18, 2015

I suppose we should close this PR then, and leave the issue open. Hacking into the internals of MultiIndex is well beyond my level of comfort with the library.

Contributor

jreback commented Oct 18, 2015

haha. well, I'll take your tests in any event. So going to leave open for a bit.

Contributor

jakevdp commented Oct 19, 2015

Sounds good – thanks!

jreback closed this in #11371 Oct 20, 2015

@jreback jreback added a commit that referenced this pull request Oct 20, 2015

@jreback jreback Merge pull request #11371 from jreback/jakevdp-pivot-table-categorical
BUG: pivot table bug with Categorical indexes, #10993
db884d9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment