ENH/API: clarify groupby by to handle columns/index names #5677

Closed
TomAugspurger opened this Issue Dec 11, 2013 · 17 comments

Contributor

TomAugspurger commented Dec 11, 2013

Referenced briefly in the OP at #3275

In [11]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)])

In [12]: idx.names = ['outer', 'inner']

In [13]: df = pd.DataFrame({"A": np.arange(6), 'B': ['one', 'one', 'two', 'two', 'one', 'one']}, index=idx)

So the idea is to be able to call

df.groupby('B', level='inner')

instead of

In [15]: df.reset_index().groupby(['B', 'inner']).mean()
Out[15]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

[5 rows x 1 columns]

Currently this raises TypeError: 'numpy.ndarray' object is not callable. Mostly just syntactic sugar, but I've been having to do a lot of this lately and all the reset_indexes are getting annoying. Thoughts?
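For reference, here is a minimal sketch of one way to get the same result without reset_index, by passing the level's values alongside the column name. It rebuilds the frame from above; the names inner and result are only illustrative, and this is a workaround rather than the sugar being requested.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the example above.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
    names=["outer", "inner"],
)
df = pd.DataFrame(
    {"A": np.arange(6), "B": ["one", "one", "two", "two", "one", "one"]},
    index=idx,
)

# Group by column "B" together with the values of index level "inner",
# without moving the index into the columns first.
inner = df.index.get_level_values("inner")
result = df.groupby(["B", inner])["A"].mean()
print(result)
```

The level values keep their name, so the result is indexed by B and inner just like the reset_index version.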

Contributor

y-p commented Dec 12, 2013

I take it the idea is to provide sugar for grouping on a combination of a column and a multiindex level.
In that case, the example you provided works nicely, but:

  • The group key spec is now smeared over multiple args (instead of just by). I think that's wrong.
  • That approach doesn't handle a slightly more general case: grouping by a combination
    of multiple cols and levels in a certain order. That's a bad sign when it comes to API design.

Contributor

y-p commented Dec 12, 2013

How about resolving level names in the code that handles by and, if there's a collision between
column names and multiindex names, defaulting to the column (which preserves existing code) but issuing
a warning? @jreback, does that sound right to you?

Contributor

jreback commented Dec 12, 2013

I don't think you even need these arguments, just an enhancement to figure out what the user wants, e.g.

if I see a label in the grouper, then follow a simple algo, checking in order:

  • index names (e.g. level)
  • column name

Contributor

y-p commented Dec 12, 2013

That's just what I propose, except that variation breaks backwards-compat.
Since by doesn't consider the index levels currently, column names should have precedence IMO
to keep old code working.
The warning should be there to alert the user that pandas is resolving an ambiguity by being opinionated.
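To make the proposed precedence concrete, here is a rough sketch in plain Python. The helper resolve_group_key and its return values are hypothetical and are not part of pandas; it only illustrates "columns win, warn on a collision, fall back to index level names".

```python
import warnings


def resolve_group_key(key, columns, index_names):
    """Hypothetical helper: decide whether `key` names a column or an index level."""
    in_columns = key in columns
    in_index = key in index_names
    if in_columns and in_index:
        # Ambiguous name: keep the existing behaviour (the column), but say so.
        warnings.warn(
            f"{key!r} is both a column name and an index level name; "
            "grouping by the column.",
            UserWarning,
            stacklevel=2,
        )
        return ("column", key)
    if in_columns:
        return ("column", key)
    if in_index:
        return ("level", key)
    raise KeyError(key)


# Names taken from the frame in the original post.
print(resolve_group_key("B", ["A", "B"], ["outer", "inner"]))      # a column
print(resolve_group_key("inner", ["A", "B"], ["outer", "inner"]))  # an index level
```

Checking the columns first is what keeps old code working; reversing the two checks would give the index-first variant.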

Contributor

jreback commented Dec 12, 2013

that's reasonable

Contributor

jreback commented Dec 12, 2013

@TomAugspurger so maybe let's change the name of this issue to something like 'clarify the grouper'?

so it deals nicely with index/columns names (and can raise/warn if there are duplicates, taking the columns in preference to the index names). If there STILL is ambiguity let's discuss (e.g. if you specify a label and it could possibly be misinterpreted somehow).

Contributor

y-p commented Dec 12, 2013

Just as an example:

In [4]: df=mkdf(4,2,r_idx_nlevels=2)
   ...: df.columns = ["foo","bar"]
   ...: df.index.names = ["baz","foo"]
   ...: df
Out[4]: 
                  foo   bar
baz     foo                
R_l0_g0 R_l1_g0  R0C0  R0C1
R_l0_g1 R_l1_g1  R1C0  R1C1
R_l0_g2 R_l1_g2  R2C0  R2C1
R_l0_g3 R_l1_g3  R3C0  R3C1

[4 rows x 2 columns]

In [5]: df.groupby('foo') # not ambiguous now, but would be
Out[5]: <pandas.core.groupby.DataFrameGroupBy object at 0x39e9d50>

Since groupby accepts an axis argument, I guess we need to consider the case for hierarchical columns as well.
I've never used it myself, nor could I work out how to use it at the REPL when I tried just now. I expected

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.groupby(['a'],1).groups
KeyError: u'no item named a'

to be equivalent to:

df=pd.DataFrame([[1,1,3],['a','b','c']],index=['a','b'])
    ...: df.T.groupby(['a']).groups # notice transpose
{1: [0, 1], 3: [2]}

Contributor

TomAugspurger commented Dec 12, 2013

That all sounds reasonable. I can take a shot at implementing this in a few weeks. I'll probably have questions :)

The other bit is that by accepts (a list of) mapping functions, dicts, Series. I don't think I've used any of these before.

Contributor

jreback commented Dec 12, 2013

@TomAugspurger I don't think it accepts a list of mapping functions / Series, only a single mapper or Series (otherwise it should be an error)

Contributor

TomAugspurger commented Dec 12, 2013

From the docstring.

by : mapping function / list of functions, dict, Series, or tuple /
    list of column names.
    Called on each element of the object index to determine the groups.
    If a dict or Series is passed, the Series or dict VALUES will be
    used to determine the groups

I'll clarify that while I'm working on it. I just have to figure out what that actually does first.
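As a small illustration of what the non-label forms of by do, here is a sketch with a toy frame. The frame and all names in it are made up for the example; it covers only a single function, a dict, and a Series, not lists of them.

```python
import pandas as pd

# Toy frame, made up for this example.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4]},
    index=["apple", "avocado", "banana", "blueberry"],
)

# A function is called on each index label; its return value is the group key.
by_function = df.groupby(lambda label: label[0])["A"].sum()

# A dict maps index labels to group keys.
mapping = {"apple": "x", "avocado": "y", "banana": "x", "blueberry": "y"}
by_dict = df.groupby(mapping)["A"].sum()

# A Series is aligned on the index; its values become the group keys.
keys = pd.Series(["u", "u", "v", "v"], index=df.index)
by_series = df.groupby(keys)["A"].sum()

print(by_function, by_dict, by_series, sep="\n\n")
```

In all three cases the grouping is driven by the index labels of the frame, not by its columns.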

Contributor

y-p commented Dec 12, 2013

The docstring also says

>>> df.groupby.__doc__
...
Group **series** using mapper (dict or key function, apply given function\n       

Which is wrong.

Contributor

jreback commented Feb 15, 2014

@TomAugspurger I think this is worthwhile... and probably not too complex...

?

Contributor

jonmmease commented Sep 29, 2016

@jreback @TomAugspurger I'm interested in implementing this. Has anything changed since this discussion that I should be aware of? I'll likely use Tom's old PR (#7033) as a starting point.

Contributor

jreback commented Sep 29, 2016

http://pandas.pydata.org/pandas-docs/stable/groupby.html#grouping-with-a-grouper-specification implements this

though you could add sugar for non-colliding names, where a name could be that of an index / MultiIndex level

Contributor

jonmmease commented Sep 30, 2016

Ok, thanks. So am I following that the way to accomplish the original example from this issue, without resetting the index, is the following?

In [75]: df.groupby(['B', pd.Grouper(level='inner')]).mean()
Out [75]: 
             A
B   inner     
one 1      0.0
    2      2.5
    3      5.0
two 1      3.0
    3      2.0

Should this approach also work when the frame has a single named index? e.g.

In [76]: df2 = df.reset_index('outer')

In [77]: df2
Out [77]: 
      outer  A    B
inner              
1         a  0  one
2         a  1  one
3         a  2  two
1         b  3  two
2         b  4  one
3         b  5  one

In [79]: df2.groupby(['B', pd.Grouper(level='inner')]).mean()              
...
AttributeError: 'Int64Index' object has no attribute 'labels'

In this case I'm getting an AttributeError (I'm on pandas 0.18.1 and happy to file a bug if this is one).

I would be interested in adding sugar in order to support

df.groupby(['B', 'inner']).mean()

and

df2.groupby(['B', 'inner']).mean()

where column names take precedence as discussed above.
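As a sketch of the intended behaviour: on a pandas release where by also resolves index level names (an assumption here; the 0.18.1 mentioned above does not), the sugared spelling would be expected to match the explicit Grouper spelling. The variable names explicit, sugar and sugar2 are only for illustration.

```python
import numpy as np
import pandas as pd

# Same frames as in the examples above.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("b", 3)],
    names=["outer", "inner"],
)
df = pd.DataFrame(
    {"A": np.arange(6), "B": ["one", "one", "two", "two", "one", "one"]},
    index=idx,
)
df2 = df.reset_index("outer")

# Explicit spelling: a column label plus a Grouper for the index level.
explicit = df.groupby(["B", pd.Grouper(level="inner")])["A"].mean()

# Sugared spelling: "B" resolves to the column, "inner" to the index level.
# Assumes a pandas version that resolves level names passed to `by`.
sugar = df.groupby(["B", "inner"])["A"].mean()
sugar2 = df2.groupby(["B", "inner"])["A"].mean()

print(sugar)
```

Selecting ["A"] before taking the mean keeps the string column outer in df2 out of the aggregation.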

Member

jorisvandenbossche commented Sep 30, 2016

Should this approach also work when the frame has a single named index?

Yes, I think it should work. For example, it works when only specifying the index in this way:

In [40]: df2.groupby(level='inner').mean()
Out[40]: 
         A
inner     
1      1.5
2      2.5
3      3.5

In [42]: df2.groupby(pd.Grouper(level='inner')).mean()
Out[42]: 
         A
inner     
1      1.5
2      2.5
3      3.5

So using it in a list to group by multiple columns/indexes should also work. Do you want to open a separate bug report for this?

Contributor

jonmmease commented Sep 30, 2016

Yes, I'll open a separate bug report for this issue in a few hours.

@jorisvandenbossche jorisvandenbossche removed this from the Next Major Release milestone Dec 14, 2016
