ENH/API: clarify groupby by to handle columns/index names #5677

Closed
TomAugspurger opened this issue Dec 11, 2013 · 17 comments · Fixed by #14432

Comments

@TomAugspurger
Contributor

Referenced briefly in the OP at #3275

In [11]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)])

In [12]: idx.names = ['outer', 'inner']

In [13]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']}, index=idx)

So the idea is to be able to call

df.groupby('B', level='inner')

instead of

In [15]: df.reset_index().groupby(['B', 'inner']).mean()
Out[15]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

[5 rows x 1 columns]

Currently this raises TypeError: 'numpy.ndarray' object is not callable. Mostly just syntactic sugar, but I've been having to do a lot of this lately and all the reset_indexes are getting annoying. Thoughts?

@ghost

ghost commented Dec 12, 2013

I take it the idea is to provide sugar for grouping on a combination of a column and a multiindex level.
In that case, the example you provided works nicely, but:

  • The group key spec is now smeared over multiple args (instead of just by). I think that's wrong.
  • That approach doesn't handle a slightly more general case, grouping by a combination
    of multiple cols and levels in a certain order. That's a bad sign when it comes to API design.

@ghost

ghost commented Dec 12, 2013

How about resolving level names in the by handling code, and if there's a collision between
column names and multiindex names defaulting to the column (preserves existing code), but issuing
a warning. @jreback, does that sound right to you?

@jreback
Contributor

jreback commented Dec 12, 2013

I don't think you even need these arguments, just an enhancement to figure out what the user wants, e.g.

when I see a label in the grouper, follow a simple algo and check, in order:

  • index names (e.g. level)
  • column name

@ghost

ghost commented Dec 12, 2013

That's just what I propose, except that variation breaks backwards-compat.
Since by doesn't consider the index levels currently, column names should have precedence IMO
to keep old code working.
The warning should be there to alert the user that pandas is resolving an ambiguity by being opinionated.
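
For illustration, the lookup order proposed here (column first, then index level, with a warning on a collision) could be sketched as below. This is only a sketch: `resolve_group_key` is a hypothetical helper invented for this example, not a pandas function, and the small frame is made up.

```python
import warnings

import pandas as pd


def resolve_group_key(df, key):
    """Hypothetical helper (not part of pandas): classify a groupby label.

    Returns "column" or "level". Columns win on a name collision, which
    keeps existing code working, and a warning reports the ambiguity.
    """
    in_columns = key in df.columns
    in_levels = key in df.index.names
    if in_columns and in_levels:
        warnings.warn(
            f"{key!r} is both a column and an index level; using the column",
            UserWarning,
        )
        return "column"
    if in_columns:
        return "column"
    if in_levels:
        return "level"
    raise KeyError(key)


# Made-up frame where "inner" is both a column and an index level.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["outer", "inner"]
)
df = pd.DataFrame({"A": [0, 1, 2], "inner": ["x", "y", "z"]}, index=idx)

# Record warnings so the collision case can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kinds = {key: resolve_group_key(df, key) for key in ["A", "outer", "inner"]}

print(kinds)
print(len(caught), "warning(s) raised")
```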

@jreback
Contributor

jreback commented Dec 12, 2013

that's reasonable

@jreback
Contributor

jreback commented Dec 12, 2013

@TomAugspurger so maybe let's change the name of this issue to something like 'clarify the grouper'?

so it deals nicely with index/columns names (and can raise/warn if there are duplicates, taking the columns in preference to the index names). If there STILL is ambiguity let's discuss (e.g. if you specify a label and it could possibly be misinterpreted somehow).

@ghost

ghost commented Dec 12, 2013

Just as an example:

In [4]: df=mkdf(4,2,r_idx_nlevels=2)
   ...: df.columns = ["foo","bar"]
   ...: df.index.names = ["baz","foo"]
   ...: df
Out[4]: 
                  foo   bar
baz     foo                
R_l0_g0 R_l1_g0  R0C0  R0C1
R_l0_g1 R_l1_g1  R1C0  R1C1
R_l0_g2 R_l1_g2  R2C0  R2C1
R_l0_g3 R_l1_g3  R3C0  R3C1

[4 rows x 2 columns]

In [5]: df.groupby('foo') # not ambiguous now, but would be
Out[5]: <pandas.core.groupby.DataFrameGroupBy object at 0x39e9d50>

Since groupby accepts an axis argument, I guess we need to consider the case for hierarchical columns as well.
I've never used it myself, nor can I work out how to use it at the REPL now that I've tried. I expected

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.groupby(['a'],1).groups
KeyError: u'no item named a'

to be equivalent to:

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.T.groupby(['a']).groups # notice transpose
{1: [0, 1], 3: [2]}

@TomAugspurger
Contributor Author

That all sounds reasonable. I can take a shot at implementing this in a few weeks. I'll probably have questions :)

The other bit is that by accepts (a list of) mapping functions, dicts, Series. I don't think I've used any of these before.

@jreback
Contributor

jreback commented Dec 12, 2013

@TomAugspurger I don't think it accepts a list of mapping functions / series, only a single mapper or series (otherwise it should be an error)

@TomAugspurger
Contributor Author

From the docstring.

by : mapping function / list of functions, dict, Series, or tuple /
    list of column names.
    Called on each element of the object index to determine the groups.
    If a dict or Series is passed, the Series or dict VALUES will be
    used to determine the groups

I'll clarify that while I'm working on it. I just have to figure out what that actually does first.
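
As a rough illustration of the three non-label forms the docstring mentions, here is a minimal sketch on a made-up frame (the frame, labels, and variable names are assumptions for this example only). A function is called on each index label, a dict maps index labels to group keys, and a Series supplies the group keys through its values, aligned on the index.

```python
import pandas as pd

# Made-up frame; the first character of each index label is the group.
df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x1", "x2", "y1", "y2"])

# Mapping function: called on each index label.
by_func = df.groupby(lambda label: label[0])["A"].sum()

# dict: index label -> group key.
mapping = {"x1": "x", "x2": "x", "y1": "y", "y2": "y"}
by_dict = df.groupby(mapping)["A"].sum()

# Series: aligned on the index, its VALUES are the group keys.
keys = pd.Series(["x", "x", "y", "y"], index=df.index)
by_series = df.groupby(keys)["A"].sum()

print(by_func.to_dict(), by_dict.to_dict(), by_series.to_dict())
```

All three calls group the rows the same way here, so they give the same sums.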

@ghost

ghost commented Dec 12, 2013

The docstring also says

>>> df.groupby.__doc__
...
Group **series** using mapper (dict or key function, apply given function\n       

Which is wrong.

@jreback
Contributor

jreback commented Feb 15, 2014

@TomAugspurger I think this is worthwhile.....and prob not too complex....

?

@jonmmease
Contributor

@jreback @TomAugspurger I'm interested in implementing this. Has anything changed since this discussion that I should be aware of? I'll likely use Tom's old PR (#7033) as a starting point.

@jreback
Contributor

jreback commented Sep 29, 2016

http://pandas.pydata.org/pandas-docs/stable/groupby.html#grouping-with-a-grouper-specification implements this

though you could make sugar for non-colliding names, where a name could be that of an index / multi-index level

@jonmmease
Contributor

Ok, thanks. So am I following that the way to accomplish the original example from this issue, without resetting the index, is the following?

In [75]: df.groupby(['B', pd.Grouper(level='inner')]).mean()
Out [75]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

Should this approach also work when the frame has a single named index? e.g.

In [76]: df2 = df.reset_index('outer')

In [77]: df2
Out [77]: 
      outer  A    B
inner              
1         a  0  one
2         a  1  one
3         a  2  two
1         b  3  two
2         b  4  one
3         b  5  one

In [79]: df2.groupby(['B', pd.Grouper(level='inner')]).mean()              
...
AttributeError: 'Int64Index' object has no attribute 'labels'

In this case I'm getting an AttributeError (I'm on pandas 0.18.1 and happy to file a bug if this is one).

I would be interested in adding sugar in order to support

df.groupby(['B', 'inner']).mean()

and

df2.groupby(['B', 'inner']).mean()

where column names take precedence as discussed above.
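
For anyone wanting to reproduce the MultiIndex case as a script, here is a self-contained sketch that rebuilds the frame from the top of this issue and checks that the `pd.Grouper(level='inner')` spelling agrees with the `reset_index` workaround. Selecting `["A"]` before `mean()` is a choice made for this sketch so that only the numeric column is aggregated.

```python
import numpy as np
import pandas as pd

# Frame from the original report: a two-level index named outer/inner.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
    names=["outer", "inner"],
)
df = pd.DataFrame(
    {"A": np.arange(6), "B": ["one", "one", "two", "two", "one", "one"]},
    index=idx,
)

# Workaround from the original report: move the levels into columns first.
via_reset = df.reset_index().groupby(["B", "inner"])["A"].mean()

# Grouper specification: mix a column name with an index level.
via_grouper = df.groupby(["B", pd.Grouper(level="inner")])["A"].mean()

print(via_grouper)
```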

@jorisvandenbossche
Member

Should this approach also work when the frame has a single named index?

Yes, I think it should work. For example, it works when only specifying the index in this way:

In [40]: df2.groupby(level='inner').mean()
Out[40]: 
         A
inner     
1      1.5
2      2.5
3      3.5

In [42]: df2.groupby(pd.Grouper(level='inner')).mean()
Out[42]: 
         A
inner     
1      1.5
2      2.5
3      3.5

So using it in a list to group by multiple columns/indexes should also work. Do you want to open a separate bug report for this?
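
A minimal check of that expectation, as a sketch: it rebuilds `df2` from the frame at the top of this issue and compares the list-with-`Grouper` call against the `reset_index` route. On pandas 0.18.1 the `Grouper` call is the one reported above as raising `AttributeError`; the assumption here is a pandas version where that has been fixed. `["A"]` is selected so that only the numeric column is averaged.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
    names=["outer", "inner"],
)
df = pd.DataFrame(
    {"A": np.arange(6), "B": ["one", "one", "two", "two", "one", "one"]},
    index=idx,
)

# Single named index: "outer" becomes a column, "inner" stays as the index.
df2 = df.reset_index("outer")

# Reference result via the reset_index route.
expected = df2.reset_index().groupby(["B", "inner"])["A"].mean()

# Column name plus a Grouper on the single index level.
result = df2.groupby(["B", pd.Grouper(level="inner")])["A"].mean()

print(result)
```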

@jonmmease
Contributor

Yes, I'll open a separate bug report for this issue in a few hours.
