BUG: DataFrameGroupBy.__getitem__ should warn on tuple of length 1 #36302

Open
rhshadrach opened this issue Sep 12, 2020 · 5 comments
Labels
Bug, Groupby, Needs Discussion

Comments

@rhshadrach
Member

Currently, a warning is emitted only when the tuple has more than one element. Judging from a comment in the PR where the deprecation was implemented, there was some confusion about the behavior of df.groupby('a')['b'] vs df.groupby('a')[('b',)]. The former is the SeriesGroupBy that the comment refers to, where the key is a string; the latter should still be deprecated.

@rhshadrach
Member Author

rhshadrach commented Sep 12, 2020

I didn't notice that in the whatsnew of PR #30546 there are the lines:

    # single key, returns SeriesGroupBy
    g['B']

    # tuple of single key, returns SeriesGroupBy
    g[('B',)]

This makes me think that perhaps it wasn't due to any confusion. However, I don't see any mention of tuples of length one in the original issue #23566 and allowing only tuples of length one seems odd to me.

@jorisvandenbossche Any thoughts?

@rhshadrach
Member Author

This is also inconsistent with DataFrame, e.g.

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df[('b',)]

raises KeyError: ('b', )
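For anyone who wants to check this locally, here is a minimal self-contained sketch of the same lookup. It only wraps the snippet above in a try/except; the variable name `raised` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

try:
    # The tuple is treated as a single column label, not as a list of labels
    df[("b",)]
    raised = False
except KeyError:
    # No column is labelled ("b",), so the lookup fails
    raised = True

print(raised)
```

By contrast, `df[["b"]]` (a list) selects a one-column DataFrame, which is why a list is the usual replacement for a tuple of column names.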

rhshadrach added the Needs Discussion label Sep 26, 2020
rhshadrach removed their assignment Dec 12, 2020
@tehunter
Contributor

tehunter commented Apr 16, 2024

Is there any interest in making the DataFrameGroupBy.__getitem__ selection mirror the DataFrame.__getitem__ selection? In my mind, there should be some parity like so:

df = pd.DataFrame(...)
df_gb = df.groupby("A")

# Passing a tuple returns the Series for the column matching that MultiIndex tuple representation
s: pd.Series = df[("B", "1")]
# Passing a tuple *should* return the SeriesGroupBy for the column matching that MultiIndex tuple representation
s_gb: pd.api.typing.SeriesGroupBy = df_gb[("B", "1")]
# This should be equivalent
s_gb_equiv: pd.api.typing.SeriesGroupBy = s.groupby(df["A"])

Currently, there doesn't seem to be any way to reduce a DataFrameGroupBy to a SeriesGroupBy when the DataFrame has MultiIndex columns, other than applying squeeze() to the result.
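As a rough illustration of the "should be equivalent" form above, the Series can be selected first and then grouped. This is only a sketch: the column labels and data are made up for the example, and the grouping column is assumed to live under the label ("A", "").

```python
import pandas as pd

# Hypothetical frame with two-level column labels (data invented for the example)
columns = pd.MultiIndex.from_tuples([("A", ""), ("B", "1"), ("B", "2")])
df = pd.DataFrame([[1, 10, 100], [1, 20, 200], [2, 30, 300]], columns=columns)

# Select the column as a Series first, using its full tuple label
s = df[("B", "1")]

# Group that Series by the values of the key column, passed as a Series
keys = df[("A", "")].rename("A")
s_gb = s.groupby(keys)

totals = s_gb.sum()
print(type(s_gb).__name__)
print(totals)
```

This avoids DataFrameGroupBy.__getitem__ entirely, so it does not depend on how tuple keys are interpreted there.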

@rhshadrach
Member Author

rhshadrach commented Apr 16, 2024

I think so @tehunter. E.g. this works just fine:

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]}).set_index("a")
df.columns = pd.MultiIndex.from_tuples([("b",)])
df.groupby("a")[("b",)].sum()

and will continue to work fine under your proposed behavior. We still need to deprecate the case where you are passing a tuple and the columns are not a MultiIndex nor tuples, so this issue needs to remain, but we can enable your desired behavior without waiting for that deprecation. Could you open up a separate issue?
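One way to picture the remaining deprecation is the sketch below. It is not pandas code: `warn_if_plain_tuple` is a hypothetical helper, `columns` is a plain list of labels, and "the columns are not a MultiIndex nor tuples" is approximated here as "the tuple is not itself one of the column labels".

```python
import warnings


def warn_if_plain_tuple(key, columns):
    """Hypothetical helper, not the pandas implementation.

    Warn when `key` is a tuple of any length (including 1) that is not
    itself a column label, i.e. it is being used like a list of columns.
    """
    if isinstance(key, tuple) and key not in columns:
        warnings.warn(
            "Selecting columns with a tuple is deprecated; use a list instead.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # String columns, length-1 tuple key: the case this issue wants to warn on
    plain = warn_if_plain_tuple(("b",), ["a", "b"])
    # Tuple-labelled columns, matching tuple key: a legitimate single label
    labelled = warn_if_plain_tuple(("b",), [("b",)])
    # Plain string key: never warns
    string_key = warn_if_plain_tuple("b", ["a", "b"])

print(plain, labelled, string_key, len(caught))
```

The point of the sketch is only that the length of the tuple plays no role in the check; whether the tuple matches an actual column label does.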

@tehunter
Contributor

Thanks, just opened #58282
