Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT: Emit warning when groupby by a tuple #18731

Merged
merged 12 commits into from
Dec 18, 2017

Conversation

TomAugspurger
Copy link
Contributor

Closes #18314

@TomAugspurger
Copy link
Contributor Author

cc @toobaz

@codecov
Copy link

codecov bot commented Dec 11, 2017

Codecov Report

Merging #18731 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18731      +/-   ##
==========================================
- Coverage   91.64%   91.62%   -0.02%     
==========================================
  Files         154      154              
  Lines       51401    51408       +7     
==========================================
- Hits        47106    47104       -2     
- Misses       4295     4304       +9
Flag Coverage Δ
#multiple 89.49% <100%> (ø) ⬆️
#single 40.83% <0%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.07% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.68% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb178fc...3057aab. Read the comment docs.

@@ -2850,7 +2850,15 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
elif isinstance(key, BaseGrouper):
return key, [], obj

# Everything which is not a list is a key (including tuples):
tuple_as_list = isinstance(key, tuple) and key not in obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only disadvantage I see with this approach is that

pd.DataFrame(1, index=range(3), columns=pd.MultiIndex.from_product([[1, 2], [3,4]])).groupby((7, 8)).mean()

will raise KeyError: 7 while KeyError: (7,8) would be more correct. Do you think

isinstance(key, tuple) and key not in obj and set(key).issubset(obj)

is too expensive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change would be ok

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Except that, as @jorisvandenbossche reminded below, set(key) could contains non-hashable objects, so this possibility should be catched)

with tm.assert_produces_warning(FutureWarning) as w:
df[['a', 'b', 'c']].groupby(('a', 'b')).c.mean()

assert "Interpreting tuple 'by' as a list" in str(w[0].message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same as above?

@jreback jreback added Deprecate Functionality to remove in pandas MultiIndex labels Dec 12, 2017
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example from the docs (see top post #18314 (comment) for the code to reproduce) is still failing with this branch.

@@ -202,6 +202,9 @@ Deprecations
- ``Series.from_array`` and ``SparseSeries.from_array`` are deprecated. Use the normal constructor ``Series(..)`` and ``SparseSeries(..)`` instead (:issue:`18213`).
- ``DataFrame.as_matrix`` is deprecated. Use ``DataFrame.values`` instead (:issue:`18458`).
- ``Series.asobject``, ``DatetimeIndex.asobject``, ``PeriodIndex.asobject`` and ``TimeDeltaIndex.asobject`` have been deprecated. Use ``.astype(object)`` instead (:issue:`18572`)
- Grouping by a tuple of keys now emits a ``FutureWarning`` and is deprecated.
In the future, a tuple passed to ``'by'`` will always refer to a single key
that is the actual tuple, instead of treating the tuple as multiple keys (:issue:`18314`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention you can simply replace the tuple with a list

msg = ("Interpreting tuple 'by' as a list of keys, rather than "
"a single key. Use 'by={!r}' instead of 'by={!r}'. In the "
"future, a tuple will always mean a single key.".format(
list(key), key))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key can contain a long array or column, so not sure it is a good idea to format it like this into the message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought NumPy's short repr kicked in sooner that it does. I'll fix this

@pep8speaks
Copy link

pep8speaks commented Dec 15, 2017

Hello @TomAugspurger! Thanks for updating the PR.

Line 2866:17: W503 line break before binary operator

Comment last updated on December 18, 2017 at 12:53 Hours UTC

all_hashable = is_tuple and all(is_hashable(x) for x in key)

if is_tuple:
if not all_hashable or key not in obj:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm lost. Why do you check that elements are not hashable? I would have done instead

if all_hashable and key not in obj and set(key).issubset(obj):

or (if we want to account for the to-be-deprecated possibility to index with missing keys):

if all_hashable and key not in obj and set(key) & (obj):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better - performance-wise:

if key not in obj and all(is_hashable(x) for x in key) and set(key).issubset(obj):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the case where you're grouping by non-hashable arrays like in #18314 (comment)

In that case, don't we know that they're certainly relying on groupby((a, b)) to be groupby([a, b]), so we want to warn and listify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do still need to handle your KeyError example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see what you're doing now. Yes, that's probably better, and will make handling the KeyError easier.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 15, 2017

Huh about the example:

df = pd.DataFrame(1, index=range(3), columns=pd.MultiIndex.from_product([[1, 2], [3,4]]))
df.groupby((7, 8)).mean()

On master that gives me

Out[4]:
   1     2
   3  4  3  4
7  1  1  1  1
8  1  1  1  1

Is that correct? That seems like it should throw a KeyError, right?

Opened #18798 for that.

@TomAugspurger
Copy link
Contributor Author

OK, updated to use your suggestion @toobaz, with a slight modification so that we warn when either

  1. the tuple isn't a valid key and elements are hashable and all of the elements are valid keys
  2. any of the elements are not hashable

@toobaz
Copy link
Member

toobaz commented Dec 15, 2017

Yeah, I think that's perfect (I had forgot case 2.). Two small comments:

  • you could replace all(is_hashable(x) for x in key) with is_hashable(key) - but without significant performance gain, so not sure it's an improvement
  • the duplicated test is still there

@TomAugspurger
Copy link
Contributor Author

Thanks, fixed. Should be good to go hopefully.

@jreback
Copy link
Contributor

jreback commented Dec 18, 2017

lgtm. needs a rebase to fix conflict. merge on green.

@TomAugspurger TomAugspurger merged commit b6a7cc9 into pandas-dev:master Dec 18, 2017
@TomAugspurger TomAugspurger deleted the warn-groupby-tuple branch December 18, 2017 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Groupby MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants