Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow the groupby by param to handle columns and index levels (GH5677) #14432

Merged
merged 8 commits into from
Dec 14, 2016

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Oct 15, 2016

Change to allow strings passed as the by parameter to df.groupby to reference columns (existing behavior) or index level names if no column match is found. Columns take precedence in the case of ambiguity to maintain backward compatibility.

@@ -29,8 +29,7 @@ New features

Other enhancements
^^^^^^^^^^^^^^^^^^


- Strings passed to ``DataFrame.groupby()`` as the ``by`` parameter may now reference either column names or index level names (:issue:`5677`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need an example here. put the same one in groupby.rst (make also need to add to the groupby doc-string)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Example added here and in groupby.rst. I didn't add anything to the groupby-docstring yet as I wasn't quite sure where it would fit in (there are only two examples there right now). Let me know what you think.

expected = df_multi_both.groupby(pd.Grouper(key='inner')).mean()
assert_frame_equal(result, expected)
not_expected = df_multi_both.groupby(pd.Grouper(level='inner')).mean()
assert not result.index.equals(not_expected.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.assertFalse

@jreback
Copy link
Contributor

jreback commented Oct 19, 2016

@i think that we should raise in the ambiguous case, which just works on the column now (which this PR uses to establish precedence), e.g.

df.groupby('inner') when inner is both a column and a level name.

This is almost always an error by the user, or if its actually wanted, then the user should be more specific (by using pd.Grouper).

Currently, AFIK, this just takes the column.

@jonmmease
Copy link
Contributor Author

@jreback I have no problem with raising an exception in the ambiguous case. As you noted, the only reason to have columns take precedence was to reproduce the behavior of previous versions if ambiguity was present.

@shoyer expressed a preference for the precedence approach in #14355 (comment) for the df.merge operation. And I do agree that it makes sense to handle ambiguity in a consistent way across #5677 (groupby), #14353 (sort_values), and #14355 (merge).

@TomAugspurger @jorisvandenbossche @shoyer Do any of you object to raising an exception in the ambiguous case for each of these 3 enhancements?

@shoyer
Copy link
Member

shoyer commented Oct 19, 2016

Yes, we could error in the ambiguous case, but only eventually, after a deprecation cycle.

@jonmmease
Copy link
Contributor Author

@shoyer
If we went the deprecation cycle route, would that mean that we'd keep the column precedence behavior but add a FutureWarning in the ambiguous case? Something like the following?

warnings.warn(("'%s' is both a column name and an index level, defaulting to column."
               "This will raise an ambiguity error in a future version") % grp,
               FutureWarning, stacklevel=2)

And then I assume we'd also add a deprecation note to the whatsnew file.

@jreback @jorisvandenbossche Thoughts?

@shoyer
Copy link
Member

shoyer commented Oct 20, 2016

@jmmease Yes, that's right. We should probably be a little reluctant to add deprecated behavior right now, though, because there may be a long wait between the next pandas feature release (1.0?) and pandas 2.0.

@jreback
Copy link
Contributor

jreback commented Oct 20, 2016

there is no problem with deprecating things
w certainly are not going to just 'wait' for pandas 2.0

@jonmmease
Copy link
Contributor Author

@jreback @shoyer @jorisvandenbossche
Rebased now that #14342 has landed. Added FutureWarning for the case where a string matches both a column and an index level.

Example for whatsnew, groupby.rst, and groupby doc-string still to come

@@ -94,6 +94,9 @@ The mapping can be specified many different ways:
- For DataFrame objects, a string indicating a column to be used to group. Of
course ``df.groupby('A')`` is just syntactic sugar for
``df.groupby(df['A'])``, but it makes life simpler
- For DataFrame objects, a string indicating an index level to be used to group.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionadded tag

i would also make this into a note section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I wasn't quite sure how to handle versionadded and keep the description in the list. I left the description in the list (without ambiguity explanation) and added a note section below with versionadded tag that describes the change and the ambiguity behavior.

@codecov-io
Copy link

codecov-io commented Oct 27, 2016

Current coverage is 85.30% (diff: 100%)

Merging #14432 into master will increase coverage by 0.01%

@@             master     #14432   diff @@
==========================================
  Files           140        144     +4   
  Lines         50719      51004   +285   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43259      43510   +251   
- Misses         7460       7494    +34   
  Partials          0          0          

Powered by Codecov. Last update dca0185...2d35461

@jorisvandenbossche jorisvandenbossche merged commit a8cabb8 into pandas-dev:master Dec 14, 2016
@jorisvandenbossche
Copy link
Member

@jmmease Thanks a lot!

@jreback
Copy link
Contributor

jreback commented Dec 17, 2016

@jmmease can you do a quick followup to catch this warning that's appearing (you may need to run all groupby tests)

test_groupby_multi_categorical_as_index (pandas.tests.test_groupby.TestGroupBy) ... /Users/jreback/pandas/pandas/core/groupby.py:370: FutureWarning: 'cat' is both a column name and an index level.
Defaulting to column but this will raise an ambiguity error in a future version
  mutated=self.mutated)

@jonmmease
Copy link
Contributor Author

Sure. See #14902

jreback pushed a commit that referenced this pull request Dec 17, 2016
Follow on to #14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes #14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
Follow on to pandas-dev#14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes pandas-dev#14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Follow on to pandas-dev#14432 to catch the newly introduced `FutureWarning` in
the `test_groupby_multi_categorical_as_index` test case.

Author: Jon M. Mease <jon.mease@jhuapl.edu>

Closes pandas-dev#14902 from jmmease/GH14432_follow_on and squashes the following commits:

c30fa2b [Jon M. Mease] Trap warning introduced by GH14432 in test_groupby_multi_categorical_as_index
@goldenbull
Copy link
Contributor

goldenbull commented Jun 5, 2017

So what's the next step of the deprecation cycle? What shall I do if I want to enforce the column to be used in grouping?

@jreback
Copy link
Contributor

jreback commented Jun 5, 2017

@goldenbull this will be changed to an error in 1.0 (after 0.21)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants