New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: grouping by tuple of values gives TypeError: unhashable type #18314

Closed
jorisvandenbossche opened this Issue Nov 15, 2017 · 7 comments

Comments

Projects
None yet
3 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Nov 15, 2017

This example in the docs started failing somewhere since the last commits:

In [1]: business_dates = date_range(start='4/1/2014', end='6/30/2014', freq='B')

In [2]: df = DataFrame(1, index=business_dates, columns=['a', 'b'])

In [3]: df.groupby((df.index.year, df.index.month)).nth([0, 3, -1])
...
TypeError: unhashable type: 'Int64Index'
@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Nov 15, 2017

Ah, this was deliberately changed by #18249 cc @toobaz @jreback

It seems it also affected dask cc @TomAugspurger

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Nov 15, 2017

If we decide to stick with what #18249 did (not special case Multi-Index, always see tuple as label), then we could consider deprecating this first?

@toobaz

This comment has been minimized.

Member

toobaz commented Nov 15, 2017

If we decide to stick with what #18249 did (not special case Multi-Index, always see tuple as label)

In general, every time a collection of keys is provided as argument to a pandas method, it should be in a list, not in a tuple, because otherwise we have ambiguity (and not just with MultiIndex). And maybe this should be documented more clearly in the docs (I see "list of labels or tuples" here, but only talking about indexing). But indeed, the groubpy docstring (before my intervention) mentioned a generic "iterable" but then specified "list of strs" (which I changed to "list of labels").

But then, in this specific case, the tuple does not contain keys, but rather full series, and in principle we might want to allow this (although at a cost, because we would need to distinguish valid - e.g. hashable - keys from ndarray-like objects). Indeed my PR had not considered the possibility to pass more than one series, which is not mentioned in the docstring, but which we certainly don't want to drop. The question is whether we want to allow passing them in a tuple.

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Nov 16, 2017

Yes, this did affect dask, though their use case was a bit strange since the user-provided ddf.groupby(['a', 'b']) when through a function that takes *args, so the list became a tuple.

How hard would it be to make df.groupby(('a', 'b')) emit a future warning for a release, and then making it always be a key in the next?

@toobaz

This comment has been minimized.

Member

toobaz commented Nov 16, 2017

How hard would it be to make df.groupby(('a', 'b')) emit a future warning for a release

Not hard... the warning would affect also grouping by a legitimate tuple label (contained in a flat index), but this was not supported until my change, so it shouldn't annoy too many people. Shall I proceed?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Nov 16, 2017

Not hard... the warning would affect also grouping by a legitimate tuple label

What are your thoughts on these cases?

df = pd.DataFrame({('a', 'b'): [1, 1, 2, 2], 'a': [1, 1, 1, 2], 'b': [1, 2, 2, 2], 'c': [1, 1, 1, 1]})

# Case 1, no ambiguity.  No warning, use the tuple?
df[[('a', 'b'), 'a', 'c']].groupby(('a', 'b')).c.mean()

# Case 2, no ambiguity. Warn and listify the tuple?
df[['a', 'b', 'c']].groupby(('a', 'b')).c.mean()

# Case 3, ambiguous. Probably use the tuple?
df.groupby(('a', 'b')).c.mean()

IIUC, 2 is the case that we "broke", and would like to warn for. I don't know what happened with case 3 before, but I think the correct thing to do is just use the tuple, probably without a warning?

@toobaz

This comment has been minimized.

Member

toobaz commented Nov 16, 2017

What are your thoughts on these cases?

My plan was actually to not check the content of the columns. So all 3 cases would listify and emit a warning (and case 1 would actually break as before my PR).

This said, while I think in general trying to find labels in indexes and then catching KeyErrors as fallback (and listifying, in this case) is a strategy we should avoid, for a temporary fix it could make sense. In this case, things would go as you suggested above, and case 3 would use the tuple (which is good, since it is the behavior we finally want to keep). Then after one release cycle, we would stop any "listification".

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 11, 2017

TomAugspurger added a commit that referenced this issue Dec 18, 2017

COMPAT: Emit warning when groupby by a tuple (#18731)
* COMPAT: Emit warning when groupby by a tuple

Closes #18314

* DOC: avoid future warning

* Cleanup, test unhashable

* PEP8

* Correct KeyError

* update

* xfail

* remove old comments

* pep8

* Fixups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment