Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: SeriesGroupBy._aggregate_named #41090

Open
jbrockmendel opened this issue Apr 21, 2021 · 11 comments
Open

DEPR: SeriesGroupBy._aggregate_named #41090

jbrockmendel opened this issue Apr 21, 2021 · 11 comments
Labels
Deprecate Functionality to remove in pandas Groupby Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

AFAICT this exists to allow users to apply a different function to each group. I could be convinced that this is worth supporting, but it shouldn't be shoehorned into SeriesGroupBy.aggregate the way it is.

AFAICT this is not documented, supports only one test (parametrized over 4 dtypes):

@pytest.mark.parametrize("dtype", ["int64", "int32", "float64", "float32"])
def test_basic(dtype):
    
    data = pd.Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype)

    index = np.arange(9)
    np.random.shuffle(index)
    data = data.reindex(index)

    grouped = data.groupby(lambda x: x // 3)
    [...]
    
    group_constants = {0: 10, 1: 20, 2: 30}
    agged = grouped.agg(lambda x: group_constants[x.name] + x.mean())
    assert agged[1] == 21
@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Deprecate Functionality to remove in pandas labels Apr 21, 2021
@jorisvandenbossche jorisvandenbossche added Groupby and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 22, 2021
@jbrockmendel
Copy link
Member Author

Tracking down all the places where we do this pinning of "name", disabling them all breaks 16 tests

xref #15062 "I also think that the .name arg is somewhat undocumented when passed to .apply, if you want to add a note there (or small example), might be useful."

@jbrockmendel
Copy link
Member Author

Discussed in April dev meeting, decided this was probably too-breaking

@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Aug 20, 2021
@jbrockmendel
Copy link
Member Author

@rhshadrach id like to revive this and am hoping to get you on board. all the places in the groupby code where we do object.__setattr__(group, "name", name) i want to deprecate. Thoughts?

@rhshadrach
Copy link
Member

I could be convinced that this is worth supporting, but it shouldn't be shoehorned into SeriesGroupBy.aggregate the way it is.

I agree with this in its entirety. I personally haven't found a use for the group keys, but don't have a sense how utilized this is. Without this, currently I think users would be left with iterating a groupby object themselves and combining the results (essentially reimplementating apply / agg / transform).

While I'm hesitant to add arguments to apply / agg / transform, it seems to me having a pass_keys or something similar would be a better approach. Defaulting to False, when true, the UDF would be passed two arguments: keys and group. I think this would at least be straightforward to support and test. Thoughts on this alternative?

@jbrockmendel
Copy link
Member Author

would be left with iterating a groupby object themselves and combining the results

I'd be OK with telling users to do this; it's not as if our implementation does anything fancy for performance. And we do have an uncomfortable amount of guessing when it comes time to wrap/concat the results.

I think this would at least be straightforward to support and test. Thoughts on this alternative?

I'd prefer this to the status quo, but share your hesitancy to add keywords. A separate method might be even cleaner? Maybe we just deprecate this behavior entirely (how?) and wait to add an alternative until someone complains?

It's ugly as sin, but we could introspect a UDF for keywords.

@jbrockmendel
Copy link
Member Author

We have 6 places in the groupby code where we do this. Of those, disabling it in DFGB._transform_general (twice) and in filter does not break any tests. The filter docstring does mention this pinning, so that would need a deprecation despite not being tested.

The usage in _aggregate_named will be easy to deprecate bc _aggregate_named is only called from one place and that is only after the non-pinning variant raises.

The usage in SGB._transform_general is needed for test_transform_lambda_with_datetimetz, which actually looks like a pretty legit use case for something like this.

@rhshadrach
Copy link
Member

I'd prefer this to the status quo, but share your hesitancy to add keywords. A separate method might be even cleaner? Maybe we just deprecate this behavior entirely (how?) and wait to add an alternative until someone complains?

It's ugly as sin, but we could introspect a UDF for keywords.

With the exception introspecting, I'd be good with any of those options; and I even could probably be dragged into being okay with introspection. A separate method seems a little odd to me because it'd be the same implementation except what is passed to the UDF.

I'm good with deprecating and seeing if users raise any issues. I suspect this is a feature that users aren't aware of let alone use, but could very much be wrong here.

@jbrockmendel
Copy link
Member Author

OK let's start with just deprecating and go from there.

@jbrockmendel
Copy link
Member Author

For the places where we do pinning other than _aggregate_named, we could do the deprecation by adding a _pin_name keyword and doing something like:

def filter(...):

    try:
        return self._filter(..., _pin_name=False)
    except Exception:
        out = self._filter(..., _pin_name=True)
        warnings.warn(...)
        return out

and put the object.__setattr__s behind a if _pin_name: check. The two potential problems are a) unlikely case UDF has a _pin_name keyword, b) case where user is depending on this behavior for their UDF fails to raise with _pin_name=False.

@jbrockmendel
Copy link
Member Author

Dangit. Looks like GroupBy.hist relies on the name being pinned in apply_groupwise, but doesn't raise if that pinning is disabled, just gives silently incorrect results (e.g. test_groupby_hist_series_with_legend)

@jbrockmendel
Copy link
Member Author

Looks like #25457 is an example of a bug caused by this name-pinning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Groupby Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants