ENH/API: clarify groupby by to handle columns/index names #5677

TomAugspurger · 2013-12-11T01:52:05Z

Referenced briefly in the OP at #3275

In [11]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)])

In [12]: idx.names = ['outer', 'inner']

In [13]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']}, index=idx)

So the idea is to be able to call

df.groupby('B', level='inner')

instead of

In [15]: df.reset_index().groupby(['B', 'inner']).mean()
Out[15]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

[5 rows x 1 columns]

Currently this raises TypeError: 'numpy.ndarray' object is not callable. Mostly just syntactic sugar, but I've been having to do a lot of this lately and all the reset_indexes are getting annoying. Thoughts?


ghost · 2013-12-12T15:38:04Z

I take it the idea is to provide sugar for grouping on a combination of a column and a multiindex level.
In that case, the example you provided works nicely, but:

The group key spec is now smeared over multiple args (instead of just by). I think that's wrong.
That approach doesn't handle a slightly more general case, grouping by a combination
of multiple cols and levels in a certain order. That's a bad sign when it comes to API design.

ghost · 2013-12-12T15:40:51Z

How about resolving level names in the by handling code, and if there's a collision between
column names and multiindex names defaulting to the column (preserves existing code), but issuing
a warning. @jreback, does that sound right to you?

jreback · 2013-12-12T16:01:02Z

I don't think you even need these arguments, just an enhancement to figure out what the user wants. E.g., when I see a label in the grouper, follow a simple algorithm and check, in order:

1. index names (e.g. level)
2. column names

ghost · 2013-12-12T16:04:45Z

That's just what I propose, except that variation breaks backwards-compat.
Since by doesn't consider the index levels currently, column names should have precedence IMO
to keep old code working.
The warning should be there to alert the user that pandas is resolving an ambiguity by being opinionated.
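
The rule proposed above (columns win, index level names are the fallback, and a collision triggers a warning) could be sketched roughly as follows. This is only an illustration of the proposal in this thread, not pandas' actual implementation; `resolve_group_key` is a hypothetical helper name, and the returned strings are placeholders for whatever the grouper code would really do.

```python
import warnings

import pandas as pd


def resolve_group_key(df, label):
    """Hypothetical helper: decide whether `label` refers to a column or an index level.

    Columns take precedence so that existing code keeps working; a warning
    is issued when the same name is also an index level name.
    """
    in_columns = label in df.columns
    in_index = label in df.index.names
    if in_columns and in_index:
        warnings.warn(
            f"{label!r} is both a column and an index level; using the column",
            stacklevel=2,
        )
    if in_columns:
        return "column"
    if in_index:
        return "level"
    raise KeyError(label)


# A small frame with a name collision: "foo" is a column and an index level.
frame = pd.DataFrame(
    {"foo": ["x", "y"], "bar": ["p", "q"]},
    index=pd.MultiIndex.from_tuples([("r0", "s0"), ("r1", "s1")], names=["baz", "foo"]),
)

print(resolve_group_key(frame, "bar"))  # only a column
print(resolve_group_key(frame, "baz"))  # only an index level
print(resolve_group_key(frame, "foo"))  # ambiguous: warns, then picks the column
```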

jreback · 2013-12-12T16:13:29Z

that's reasonable

jreback · 2013-12-12T16:33:31Z

@TomAugspurger so maybe let's change the name of this issue to something like 'clarify the grouper'?

so it deals nicely with index/column names (and can raise/warn if there are duplicates, taking the columns in preference to the index names). If there STILL is ambiguity, let's discuss (e.g. if you specify a label and it could possibly be misinterpreted somehow).

ghost · 2013-12-12T16:54:31Z

Just as an example:

In [4]: df=mkdf(4,2,r_idx_nlevels=2)
   ...: df.columns = ["foo","bar"]
   ...: df.index.names = ["baz","foo"]
   ...: df
Out[4]: 
                  foo   bar
baz     foo                
R_l0_g0 R_l1_g0  R0C0  R0C1
R_l0_g1 R_l1_g1  R1C0  R1C1
R_l0_g2 R_l1_g2  R2C0  R2C1
R_l0_g3 R_l1_g3  R3C0  R3C1

[4 rows x 2 columns]

In [5]: df.groupby('foo') # not ambiguous now, but would be
Out[5]: <pandas.core.groupby.DataFrameGroupBy object at 0x39e9d50>

Since groupby accepts an axis argument, I guess we need to consider the case of hierarchical columns as well.
I've never used it myself, and now that I've tried I can't work out how to use it at the REPL. I expected

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.groupby(['a'],1).groups
KeyError: u'no item named a'

to be equivalent to:

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.T.groupby(['a']).groups # notice transpose
{1: [0, 1], 3: [2]}

TomAugspurger · 2013-12-12T16:58:24Z

That all sounds reasonable. I can take a shot at implementing this in a few weeks. I'll probably have questions :)

The other bit is that by accepts (a list of) mapping functions, dicts, Series. I don't think I've used either of these before.

jreback · 2013-12-12T17:01:04Z

@TomAugspurger I don't think it accepts a list of mapping functions / series, only a single mapper or series (otherwise it should be an error)

TomAugspurger · 2013-12-12T17:06:49Z

From the docstring.

by : mapping function / list of functions, dict, Series, or tuple /
    list of column names.
    Called on each element of the object index to determine the groups.
    If a dict or Series is passed, the Series or dict VALUES will be
    used to determine the groups

I'll clarify that while I'm working on it. I just have to figure out what that actually does first.
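
For anyone else unsure what the non-label forms of by do, here is a minimal sketch of the three single-key cases named in the docstring (function, dict, Series). The frame and the key names are made up for illustration, and the behaviour described in the comments is what I understand current pandas to do.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=["x1", "x2", "y1", "y2"])

# Function: called on each index label; the return value is the group key.
by_func = df.groupby(lambda label: label[0])["A"].sum()

# Dict: each index label is looked up; the dict VALUES become the group keys.
mapping = {"x1": "first", "x2": "second", "y1": "first", "y2": "second"}
by_dict = df.groupby(mapping)["A"].sum()

# Series: aligned on the index; the Series VALUES become the group keys.
keys = pd.Series(["odd", "even", "odd", "even"], index=df.index)
by_series = df.groupby(keys)["A"].sum()

print(by_func)
print(by_dict)
print(by_series)
```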

ghost · 2013-12-12T17:11:14Z

The docstring also says

>>> df.groupby.__doc__
...
Group **series** using mapper (dict or key function, apply given function\n

Which is wrong.

jreback · 2014-02-15T21:18:25Z

@TomAugspurger I think this is worthwhile.....and prob not too complex....

?

jonmmease · 2016-09-29T23:40:32Z

@jreback @TomAugspurger I'm interested in implementing this. Has anything changed since this discussion that I should be aware of? I'll likely use Tom's old PR (#7033) as a starting point.

jreback · 2016-09-29T23:44:26Z

http://pandas.pydata.org/pandas-docs/stable/groupby.html#grouping-with-a-grouper-specification implements this

though you could make sugar for non colliding names which could be a name of an index / multi index

jonmmease · 2016-09-30T00:36:54Z

Ok, thanks. So am I following that the way to accomplish the original example from this issue, without resetting the index, is the following?

In [75]: df.groupby(['B', pd.Grouper(level='inner')]).mean()
Out [75]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

Should this approach also work when the frame has a single named index? e.g.

In [76]: df2 = df.reset_index('outer')

In [77]: df2
Out [77]: 
      outer  A    B
inner              
1         a  0  one
2         a  1  one
3         a  2  two
1         b  3  two
2         b  4  one
3         b  5  one

In [79]: df2.groupby(['B', pd.Grouper(level='inner')]).mean()              
...
AttributeError: 'Int64Index' object has no attribute 'labels'

In this case I'm getting an AttributeError (I'm on pandas 0.18.1 and happy to file a bug if this is one).

I would be interested in adding sugar in order to support

df.groupby(['B', 'inner']).mean()

and

df2.groupby(['B', 'inner']).mean()

where column names take precedence as discussed above.
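
As a self-contained sketch of the two calls above, assuming a pandas version where by resolves index level names (this issue was closed under the 0.20.0 milestone), the snippet below rebuilds the frames from this thread and compares the result against the reset_index workaround. The `["A"]` selection is an addition here so that the string column `outer` in `df2` is not averaged.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
    names=["outer", "inner"],
)
df = pd.DataFrame(
    {"A": np.arange(6), "B": ["one", "one", "two", "two", "one", "one"]},
    index=idx,
)
df2 = df.reset_index("outer")  # single named index: "inner"

# Workaround from the original post: move every level into columns first.
via_reset = df.reset_index().groupby(["B", "inner"])["A"].mean()

# Explicit grouper specification for the index level.
via_grouper = df.groupby(["B", pd.Grouper(level="inner")])["A"].mean()

# Requested sugar: mix a column name and an index level name in `by`.
via_names = df.groupby(["B", "inner"])["A"].mean()
via_names_single = df2.groupby(["B", "inner"])["A"].mean()

print(via_names)
```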

jorisvandenbossche · 2016-09-30T21:00:32Z

Should this approach also work when the frame has a single named index?

Yes, I think it should work. For example, it works when only specifying the index in this way:

In [40]: df2.groupby(level='inner').mean()
Out[40]: 
         A
inner     
1      1.5
2      2.5
3      3.5

In [42]: df2.groupby(pd.Grouper(level='inner')).mean()
Out[42]: 
         A
inner     
1      1.5
2      2.5
3      3.5

So using it in a list to group by multiple columns/indexes should also work. Do you want to open a separate bug report for this?

jonmmease · 2016-09-30T21:18:51Z

Yes, I'll open a separate bug report for this issue in a few hours.

jreback mentioned this issue Mar 1, 2014

BUG/API: allow TimeGrouper with other columns in a groupby (GH3794) #6516

Merged

TomAugspurger mentioned this issue May 4, 2014

API: Allow groupby's by to take column and index names [WIP] #7033

Closed

jreback modified the milestones: 0.14.1, 0.14.0 May 10, 2014

jreback modified the milestones: 0.15.0, 0.14.1 Jun 13, 2014

TomAugspurger mentioned this issue Sep 5, 2014

Allowing the index to be referenced by name, like a column #8162

Closed

3 tasks

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback added Difficulty Intermediate labels Apr 8, 2015

jreback mentioned this issue Apr 8, 2015

PyCon 2015 sprints #9811

Closed

shoyer mentioned this issue Apr 27, 2015

Towards "pandas 1.0" #10000

Closed

hsharrison mentioned this issue Feb 21, 2016

ENH: allow index to be referenced by name #12404

Closed

5 tasks

jonmmease mentioned this issue Oct 15, 2016

ENH: Allow the groupby by param to handle columns and index levels (GH5677) #14432

Merged

8 tasks

jorisvandenbossche closed this as completed in #14432 Dec 14, 2016

jorisvandenbossche modified the milestones: 0.20.0, Next Major Release Dec 14, 2016

pksohn mentioned this issue May 23, 2017

Fix bug in server test UDST/orca#28

Merged

toobaz mentioned this issue Apr 19, 2020

ENH: favor columns over index levels when groupby-ing over ambiguous label #33657

Open

3 tasks
