Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Decouple Series.apply from Series.agg #53400

Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented May 26, 2023

This PR makes Series.apply not call Series.agg, when given a list-like or dict-like, decoupling the two methods (partly) and solves #53325 (comment). This makes the code base clearer by making Series.apply and Series.agg going less back and forth between each other, which was very confusing. Merging this PR paves the way for merging #53325 afterwards.

To decouple this, I've added a new parameter by_row to Series.apply. It defaults to True, keeping current behavior for backward compatibility when given a single callable and internally calling apply with by_row=False, when given a list-like or dict-like, also for backward compatibility (needed when doing e.g. ser.apply([lambda x: x]).

This parameter is also relevant for #52140, where I proposed adding the parameter to solve a different set of problems with Series.apply. If this PR gets accepted, the solution to #52140 will be to change by_row from True to False after a deprecation process (plus removing the parameter altogether even longer term).

Also, I've renamed apply_multiple to apply_list_or_dict_like, as I think that name is clearer.

EDIT: Originally I called new parameter array_ops_only, but I've changed it to by_row

@topper-123 topper-123 changed the title REF: Decouple apply.apply from apply.agg REF: Decouple Series.apply from Series.agg May 26, 2023
@mroeschke mroeschke added the Apply Apply, Aggregate, Transform label May 26, 2023
@rhshadrach
Copy link
Member

Certainly agreed with where this is going, but I'm not sure this is the right way to get there.

After the deprecation of the default value, the plan is to then deprecate this argument in its entirety, right? So all uses of this method would need to be changed twice

Merging this PR paves the way for merging #53325 afterwards.

That can only be merged after the default value of array_ops_only is changed to True and the argument itself is removed, is that right? Otherwise agg can still be taking the apply path. So in order to enforce the deprecation in #53325, we'll have to wait for 3 major releases.

The following is dependent on me being correct above, and if I'm not then it can probably be ignored.

The change here and #53325 by itself wouldn't be concerning, but we've been talking about making many changes to the apply/agg/transform API, some of which hinges on the issues here being fixed. It seems complex, noisy for users, and slow to do all this piecemeal to me. This is why I prefer something like #41112.

I think we should not attempt both routes simultaneously (piecemeal + something like #41112), so we should decide on a route to move forward.

@topper-123
Copy link
Contributor Author

topper-123 commented May 27, 2023

That can only be merged after the default value of array_ops_only is changed to True and the argument itself is removed, is that right?

It is actually intended to be merged right after this. If you check the code path in this PR for a hypothetical ser.apply([func]) call, we end up in Apply._apply_list_like with op_name="apply", which means that we will (for each func in a list of funcs) no longer call ser.agg(func) like in main, but will instead call ser.apply(func, array_ops_only=True).

This means that calls to Series.apply with list and dicts of callables will no longer call Series.agg (like single callables already don't call Series.agg). So, after this PR, changes to Series.agg (like #53325) can no longer affect the behavior of Series.apply.

So this PR and #53325 can be implemented as-is, AFICS (though this is obviously quite complex, and could definitely use more eyes on it to verify whether I'm missing something).

Implementing #52140 will however require deprecating array_ops_only=False in v2.x and making users set array_ops_only=True in their code to be compatible with v3.0. So if users have set array_ops_only=True in their last version before upgrading to v3.0, everything will work unchanged in v3.0. The parameter array_ops_only will have to exist in v.3.0 to avoid raising when jumping version from 2.x to v3.0, but setting array_ops_only in Series.apply will have no effect, except emitting a FutureWarning that the parameter will be removed in the future. This empty parameter will have to keep dangling there until v4.0, but that will though not affect any code in v3.x at all. So if we implement #52140, all code will be in place in v3.0 and the only vestige of the old world will be a non-functioning & deprecated ghost parameter array_ops_onlyin Series.apply.

@topper-123
Copy link
Contributor Author

Above I just discuss your first half (before "The following is dependent..."). I think we can maybe discuss the later part in a follow-up, as I guess that discussion can depend on the conclusions for the first part.

@rhshadrach
Copy link
Member

rhshadrach commented May 28, 2023

Thanks for the correction; I agree with your assessment. Still, I think my concerns from my previous comment remain. In particular

I think we should not attempt both routes simultaneously (piecemeal + something like #41112), so we should decide on a route to move forward.

Do you agree with this line here?

@topper-123
Copy link
Contributor Author

I may not be experienced enough with the groupby and window methods to know, but I think it's worth looking into if the Series and Dataframe methods can have their undesired behavior deprecated using a normal process, i.e we can avoid a parallel implementation for them.

TBH, this has always been a very complex area and it was only after doing #53362 ( i.e. very recently) that I have started thinking it could be possible to do this with a normal deprecation process. I may still have missed something and be proven wrong of course, but that's part of the discussion...

The way I see it that after this PR and #53325 for example SeriesApply.agg will be (shortened a bit):

    def agg(self):
        result = super().agg()
        if result is None:
            obj = self.obj
            f = self.f

            try:
                result = obj.apply(f)
            except (ValueError, AttributeError, TypeError):
                result = f(self.obj)
            else:
                msg = (
                    f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                    f"has been deprecated. Use {type(obj).__name__}.transform to "
                    f"keep behavior unchanged."
                )
                warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
            return result

The above just deprecates calling obj.apply(f) and we above actually not guaranteed that f(obj) returns a aggregated value, for example f =lambda x: np.sqrt(x), then agg will return transformed values, not a aggregate.

But if we change the above to:

    def agg(self):
        result = super().agg()
        if result is None:
            obj = self.obj
            f = self.f

            try:
                result = obj.apply(f)
            except (ValueError, AttributeError, TypeError):
                result = f(self.obj)

        if not self._is_aggregate_value(result):  # aside: how do we know something is an aggregate value?
                msg = (
                    f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                    f"has been deprecated. Use {type(obj).__name__}.transform to "
                    f"keep behavior unchanged."
                )
                warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

        return result

then returning a non-aggregate value will emit a warning. So the question is if the above is backward compatible and, if the user fix their code to not emit warnings, their code will be compatible with v.3.0.:

In pandas v3.0 the method will become:

    def agg(self):
        result = super().agg()
        if result is None:
            result = self.f(self.obj)

        if not self._is_aggregate_value(result):
                msg = (
                    f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                    f"has been deprecated. Use {type(obj).__name__}.transform to "
                    f"keep behavior unchanged."
                )
                warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

        return result

The same can be done similarly with Series.transform where we also need to check the result is a transformed result and not an aggregation or something else. To do it with Series.apply requires a array_ops_only parameter however.

My suspicion is that if the above can be done on Series.(agg|transform|apply), then DataFrame.(agg|transform|apply) will also fall into place easily. I haven't looked enough into (Groupby|Resample|Window).(agg|transform|apply) enough to have a clear opinion, but maybe everything will fall into place like a puzzle there also, if the Series/DataFrame methods can be deprecated correctly. And if not, the new Series/DataFrame methods can be used in a new versions of (Groupby|Resample|Window) methods, easing their implementations?

@rhshadrach
Copy link
Member

TBH, this has always been a very complex area and it was only after doing #53362 ( i.e. very recently) that I have started thinking it could be possible to do this with a normal deprecation process.

Indeed, you've made progress where I didn't think it was possible.

For agg, I think pandas should not take a stance on what an aggregated value is but rather treat the return from the UDF as if it were a scalar (even when you would typically think it's not). But we can discuss this at a later point.

I haven't looked enough into (Groupby|Resample|Window).(agg|transform|apply) enough to have a clear opinion, but maybe everything will fall into place like a puzzle there also, if the Series/DataFrame methods can be deprecated correctly.

For UDFs (as opposed to string aliases), these implementations are largely independent of that in Series/DataFrame.

And if not, the new Series/DataFrame methods can be used in a new versions of (Groupby|Resample|Window) methods, easing their implementations?

Agreed! Let's move forward with these and see where we get to.

pandas/core/apply.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v2.1.0.rst Outdated Show resolved Hide resolved
pandas/core/apply.py Outdated Show resolved Hide resolved
pandas/core/series.py Outdated Show resolved Hide resolved
pandas/core/series.py Outdated Show resolved Hide resolved
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, needs tests.

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 2, 2023

I can see it needs tests for dict_likes, do you have anything else in mind?

EDIT: and also by_row = (True|False).

@rhshadrach
Copy link
Member

Yep - was really just thinking by_row

@topper-123
Copy link
Contributor Author

I've updated the tests.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look good!

@@ -101,6 +101,7 @@ Other enhancements
- :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
- :meth:`SeriesGroupby.agg` and :meth:`DataFrameGroupby.agg` now support passing in multiple functions for ``engine="numba"`` (:issue:`53486`)
- Added ``engine_kwargs`` parameter to :meth:`DataFrame.to_excel` (:issue:`53220`)
- Added a new parameter ``array_ops_only`` to :meth:`Series.apply`. When set to ``True`` the supplied callables will always operate on the whole Series (:issue:`53400`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_row now; not array_ops_only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, changed.

if is_groupby:
engine = self.kwargs.get("engine", None)
engine_kwargs = self.kwargs.get("engine_kwargs", None)
kwargs = {"engine": engine, "engine_kwargs": engine_kwargs}
kwds.update({"engine": engine, "engine_kwargs": engine_kwargs})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBD, but I wonder why the change from kwargs to kwds? In pandas.core we overwhelmingly use kwargs instead of kwds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwargs would make a line further down excedd 88 lines and be reformatted to fill 3 lines. So a stylistic preference, but not a strong opinion,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think consistency in variable names is more important here.

@@ -693,8 +717,8 @@ def values(self):
def apply(self) -> DataFrame | Series:
"""compute the results"""
# dispatch to agg
if is_list_like(self.func):
return self.apply_multiple()
if is_list_like(self.func) or is_dict_like(self.func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dicts are considered list-like; no need for the 2nd check here.

Copy link
Contributor Author

@topper-123 topper-123 Jun 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed it. I've changed the comment above instead to explain dictlike go here too.

@@ -1079,8 +1106,8 @@ def apply(self) -> DataFrame | Series:
return self.apply_empty_result()

# dispatch to agg
if is_list_like(self.func):
return self.apply_multiple()
if is_list_like(self.func) or is_dict_like(self.func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look good!

@topper-123
Copy link
Contributor Author

I've updated the PR.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach
Copy link
Member

Just out of curiosity - are you seeing each of my reviews get duplicated too?

image

@topper-123
Copy link
Contributor Author

Just out of curiosity - are you seeing each of my reviews get duplicated too?

yes, pretty weird.

@topper-123 topper-123 added the Refactor Internal refactoring of code label Jun 4, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jun 5, 2023
@rhshadrach rhshadrach merged commit d9c3777 into pandas-dev:main Jun 5, 2023
38 checks passed
@rhshadrach
Copy link
Member

Thanks @topper-123

@topper-123 topper-123 deleted the decouple_Apply.apply_from_Apply.agg branch June 5, 2023 21:27
topper-123 added a commit to topper-123/pandas that referenced this pull request Jun 5, 2023
topper-123 added a commit to topper-123/pandas that referenced this pull request Jun 6, 2023
mroeschke pushed a commit that referenced this pull request Jun 7, 2023
* BUG: make Series.agg aggregate when possible

* fix doc build

* deprecate instead of treating as a bug

* CLN: Apply.agg_list_like

* some cleanups

* REF/CLN: func in core.apply (#53437)

* REF/CLN: func in core.apply

* Remove type-hint

* REF: Decouple Series.apply from Series.agg (#53400)

* update test

* fix issues

* fix issues

* fix issues

---------

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
Comment on lines +306 to +307
if op_name == "apply":
kwargs = {**kwargs, "by_row": False}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@topper-123: shouldn't by_row here be True for backwards compatibility?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I'm thinking this should now be self.by_row when that attribute exists. If a user calls ser.apply(["sum", "mean"], by_row=True) (or with by_row=False), shouldn't we be passing the argument down to the next call to apply?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right. I'll make a new PR on that.

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
* BUG: make Series.agg aggregate when possible

* fix doc build

* deprecate instead of treating as a bug

* CLN: Apply.agg_list_like

* some cleanups

* REF/CLN: func in core.apply (pandas-dev#53437)

* REF/CLN: func in core.apply

* Remove type-hint

* REF: Decouple Series.apply from Series.agg (pandas-dev#53400)

* update test

* fix issues

* fix issues

* fix issues

---------

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants