Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow the groupby by param to handle columns and index levels (GH5677) #14432

Merged
merged 8 commits into from Dec 14, 2016

Conversation

Projects
None yet
6 participants
@jonmmease
Copy link
Contributor

commented Oct 15, 2016

  • closes #5677
  • tests added
  • rebase after #14342 is merged to fix #14334
  • tests pass (requires rebase above)
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry
  • updates existing documentation
  • add example to whatsnew 0.20 and groupby.rst

Change to allow strings passed as the by parameter to df.groupby to reference columns (existing behavior) or index level names if no column match is found. Columns take precedence in the case of ambiguity to maintain backward compatibility.

@@ -29,8 +29,7 @@ New features

Other enhancements
^^^^^^^^^^^^^^^^^^


- Strings passed to ``DataFrame.groupby()`` as the ``by`` parameter may now reference either column names or index level names (:issue:`5677`)

This comment has been minimized.

Copy link
@jreback

jreback Oct 19, 2016

Contributor

this will need an example here. put the same one in groupby.rst (make also need to add to the groupby doc-string)

This comment has been minimized.

Copy link
@jonmmease

jonmmease Nov 5, 2016

Author Contributor

@jreback Example added here and in groupby.rst. I didn't add anything to the groupby-docstring yet as I wasn't quite sure where it would fit in (there are only two examples there right now). Let me know what you think.

expected = df_multi_both.groupby(pd.Grouper(key='inner')).mean()
assert_frame_equal(result, expected)
not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean()
assert not result.index.equals(not_expected.index)

This comment has been minimized.

Copy link
@jreback

jreback Oct 19, 2016

Contributor

self.assertFalse

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 19, 2016

@i think that we should raise in the ambiguous case, which just works on the column now (which this PR uses to establish precedence), e.g.

df.groupby('inner') when inner is both a column and a level name.

This is almost always an error by the user, or if its actually wanted, then the user should be more specific (by using pd.Grouper).

Currently, AFIK, this just takes the column.

@jonmmease

This comment has been minimized.

Copy link
Contributor Author

commented Oct 19, 2016

@jreback I have no problem with raising an exception in the ambiguous case. As you noted, the only reason to have columns take precedence was to reproduce the behavior of previous versions if ambiguity was present.

@shoyer expressed a preference for the precedence approach in #14355 (comment) for the df.merge operation. And I do agree that it makes sense to handle ambiguity in a consistent way across #5677 (groupby), #14353 (sort_values), and #14355 (merge).

@TomAugspurger @jorisvandenbossche @shoyer Do any of you object to raising an exception in the ambiguous case for each of these 3 enhancements?

@shoyer

This comment has been minimized.

Copy link
Member

commented Oct 19, 2016

Yes, we could error in the ambiguous case, but only eventually, after a deprecation cycle.

@jonmmease

This comment has been minimized.

Copy link
Contributor Author

commented Oct 20, 2016

@shoyer
If we went the deprecation cycle route, would that mean that we'd keep the column precedence behavior but add a FutureWarning in the ambiguous case? Something like the following?

warnings.warn(("'%s' is both a column name and an index level, defaulting to column."
               "This will raise an ambiguity error in a future version") % grp,
               FutureWarning, stacklevel=2)

And then I assume we'd also add a deprecation note to the whatsnew file.

@jreback @jorisvandenbossche Thoughts?

@shoyer

This comment has been minimized.

Copy link
Member

commented Oct 20, 2016

@jmmease Yes, that's right. We should probably be a little reluctant to add deprecated behavior right now, though, because there may be a long wait between the next pandas feature release (1.0?) and pandas 2.0.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2016

there is no problem with deprecating things
w certainly are not going to just 'wait' for pandas 2.0

@jonmmease jonmmease force-pushed the jonmmease:enh_5677 branch from 224ba7d to 010e66c Oct 27, 2016

@jonmmease

This comment has been minimized.

Copy link
Contributor Author

commented Oct 27, 2016

@jreback @shoyer @jorisvandenbossche
Rebased now that #14342 has landed. Added FutureWarning for the case where a string matches both a column and an index level.

Example for whatsnew, groupby.rst, and groupby doc-string still to come

@@ -94,6 +94,9 @@ The mapping can be specified many different ways:
- For DataFrame objects, a string indicating a column to be used to group. Of
course ``df.groupby('A')`` is just syntactic sugar for
``df.groupby(df['A'])``, but it makes life simpler
- For DataFrame objects, a string indicating an index level to be used to group.

This comment has been minimized.

Copy link
@jreback

jreback Oct 27, 2016

Contributor

versionadded tag

i would also make this into a note section

This comment has been minimized.

Copy link
@jonmmease

jonmmease Oct 27, 2016

Author Contributor

@jreback I wasn't quite sure how to handle versionadded and keep the description in the list. I left the description in the list (without ambiguity explanation) and added a note section below with versionadded tag that describes the change and the ambiguity behavior.

@codecov-io

This comment has been minimized.

Copy link

commented Oct 27, 2016

Current coverage is 85.30% (diff: 100%)

Merging #14432 into master will increase coverage by 0.01%

@@             master     #14432   diff @@
==========================================
  Files           140        144     +4   
  Lines         50719      51004   +285   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43259      43510   +251   
- Misses         7460       7494    +34   
  Partials          0          0          

Powered by Codecov. Last update dca0185...2d35461

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Oct 27, 2016

@jonmmease jonmmease force-pushed the jonmmease:enh_5677 branch from dea98f1 to da7b406 Nov 5, 2016

@jonmmease jonmmease force-pushed the jonmmease:enh_5677 branch from da7b406 to 98af4d7 Nov 18, 2016

@jorisvandenbossche jorisvandenbossche merged commit a8cabb8 into pandas-dev:master Dec 14, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Dec 14, 2016

@jmmease Thanks a lot!

@jreback

This comment has been minimized.

Copy link
Contributor

commented Dec 17, 2016

@jmmease can you do a quick followup to catch this warning that's appearing (you may need to run all groupby tests)

test_groupby_multi_categorical_as_index (pandas.tests.test_groupby.TestGroupBy) ... /Users/jreback/pandas/pandas/core/groupby.py:370: FutureWarning: 'cat' is both a column name and an index level.
Defaulting to column but this will raise an ambiguity error in a future version
  mutated=self.mutated)
@jonmmease

This comment has been minimized.

Copy link
Contributor Author

commented Dec 17, 2016

Sure. See #14902

jreback added a commit that referenced this pull request Dec 17, 2016

Catch warning introduced by GH14432 in test case
Follow on to #14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes #14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index

ischurov added a commit to ischurov/pandas that referenced this pull request Dec 19, 2016

ischurov added a commit to ischurov/pandas that referenced this pull request Dec 19, 2016

Catch warning introduced by GH14432 in test case
Follow on to pandas-dev#14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes pandas-dev#14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index

ShaharBental added a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016

Catch warning introduced by GH14432 in test case
Follow on to pandas-dev#14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes pandas-dev#14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index
@goldenbull

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2017

So what's the next step of the deprecation cycle? What shall I do if I want to enforce the column to be used in grouping?

@jsexauer jsexauer referenced this pull request Jun 5, 2017

Open

DEPR: deprecations from prior versions #6581

0 of 94 tasks complete
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2017

@goldenbull this will be changed to an error in 1.0 (after 0.21)

gfyoung added a commit to forking-repos/pandas that referenced this pull request Aug 19, 2018

@jreback jreback referenced this pull request Aug 19, 2018

Open

DEPR: deprecations log for removed issues #13777

135 of 135 tasks complete

jreback added a commit that referenced this pull request Aug 22, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.