REF: Decouple Series.apply from Series.agg #53400

topper-123 · 2023-05-26T11:55:41Z

This PR stops Series.apply from calling Series.agg when given a list-like or dict-like, which (partly) decouples the two methods and solves #53325 (comment). It makes the code base clearer because Series.apply and Series.agg no longer go back and forth between each other as much, which was very confusing. Merging this PR paves the way for merging #53325 afterwards.

To decouple this, I've added a new parameter by_row to Series.apply. It defaults to True, which keeps the current behavior for a single callable. When given a list-like or dict-like, apply internally calls itself with by_row=False for each function, also for backward compatibility (needed when doing e.g. ser.apply([lambda x: x])).

This parameter is also relevant for #52140, where I proposed adding the parameter to solve a different set of problems with Series.apply. If this PR gets accepted, the solution to #52140 will be to change by_row from True to False after a deprecation process (plus removing the parameter altogether even longer term).

Also, I've renamed apply_multiple to apply_list_or_dict_like, as I think that name is clearer.

EDIT: Originally I called the new parameter array_ops_only, but I've changed it to by_row.

rhshadrach · 2023-05-27T13:55:31Z

Certainly agreed with where this is going, but I'm not sure this is the right way to get there.

After the deprecation of the default value, the plan is to then deprecate this argument in its entirety, right? So all uses of this method would need to be changed twice.

> Merging this PR paves the way for merging #53325 afterwards.

That can only be merged after the default value of array_ops_only is changed to True and the argument itself is removed, is that right? Otherwise agg can still be taking the apply path. So in order to enforce the deprecation in #53325, we'll have to wait for 3 major releases.

The following is dependent on me being correct above, and if I'm not then it can probably be ignored.

The change here and #53325 by themselves wouldn't be concerning, but we've been talking about making many changes to the apply/agg/transform API, some of which hinge on the issues here being fixed. To me, it seems complex, noisy for users, and slow to do all this piecemeal. This is why I prefer something like #41112.

I think we should not attempt both routes simultaneously (piecemeal + something like #41112), so we should decide on a route to move forward.

topper-123 · 2023-05-27T17:48:41Z

> That can only be merged after the default value of array_ops_only is changed to True and the argument itself is removed, is that right?

It is actually intended to be merged right after this. If you check the code path in this PR for a hypothetical ser.apply([func]) call, we end up in Apply._apply_list_like with op_name="apply", which means that we will (for each func in a list of funcs) no longer call ser.agg(func) like in main, but will instead call ser.apply(func, array_ops_only=True).

This means that calls to Series.apply with lists and dicts of callables will no longer call Series.agg (just as single callables already don't call Series.agg). So, after this PR, changes to Series.agg (like #53325) can no longer affect the behavior of Series.apply.
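The dispatch described above can be sketched with a toy class. `ToySeries` and its `agg_calls` counter are hypothetical names invented for illustration and are not pandas internals; the sketch only mirrors the idea that a list of callables is routed back through apply, with each callable receiving the whole object, and never through agg.

```python
class ToySeries:
    """Hypothetical stand-in for a Series; not pandas code."""

    def __init__(self, values):
        self.values = list(values)
        self.agg_calls = 0  # counts how often agg is reached

    def agg(self, func):
        self.agg_calls += 1
        return func(self.values)

    def apply(self, func, by_row=True):
        if isinstance(func, (list, tuple)):
            # List-like: call apply (not agg) once per callable,
            # handing each callable the whole object.
            return {f.__name__: self.apply(f, by_row=False) for f in func}
        if by_row:
            # Single callable, default mode: element by element.
            return [func(v) for v in self.values]
        # Single callable, whole-object mode.
        return func(self.values)


ser = ToySeries([1, 2, 3])
result = ser.apply([sum, max])
print(result)
print(ser.agg_calls)
```

Because `agg` is never reached from `apply`, a later change to `agg` cannot alter what `apply` returns in this sketch.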

So this PR and #53325 can be implemented as-is, AFAICS (though this is obviously quite complex, and could definitely use more eyes on it to verify whether I'm missing something).

Implementing #52140 will however require deprecating array_ops_only=False in v2.x and making users set array_ops_only=True in their code to be compatible with v3.0. So if users have set array_ops_only=True in their last version before upgrading to v3.0, everything will work unchanged in v3.0. The parameter array_ops_only will have to exist in v3.0 to avoid raising when jumping from v2.x to v3.0, but setting array_ops_only in Series.apply will have no effect, except emitting a FutureWarning that the parameter will be removed in the future. This empty parameter will have to keep dangling there until v4.0, but that will not affect any code in v3.x at all. So if we implement #52140, all code will be in place in v3.0 and the only vestige of the old world will be a non-functioning & deprecated ghost parameter array_ops_only in Series.apply.
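The "ghost parameter" stage of that plan could look roughly like the following. This is a hypothetical sketch: `apply_v3` and the `_NO_DEFAULT` sentinel are made-up names, and the function is a plain-Python stand-in rather than the pandas implementation.

```python
import warnings

_NO_DEFAULT = object()  # sentinel meaning "argument was not passed"


def apply_v3(values, func, array_ops_only=_NO_DEFAULT):
    """Hypothetical v3.0-style apply: func always sees the whole input."""
    if array_ops_only is not _NO_DEFAULT:
        # The argument is accepted so that v2.x code does not raise,
        # but it no longer changes the result.
        warnings.warn(
            "array_ops_only has no effect and will be removed in a "
            "future version.",
            FutureWarning,
            stacklevel=2,
        )
    return func(values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silent = apply_v3([1, 2, 3], sum)
    n_before = len(caught)
    noisy = apply_v3([1, 2, 3], sum, array_ops_only=True)
    n_after = len(caught)

print(silent, noisy, n_before, n_after)
```

Both calls return the same value; only the call that still passes the old argument warns.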

topper-123 · 2023-05-27T18:04:25Z

Above I just discuss your first half (before "The following is dependent..."). I think we can maybe discuss the later part in a follow-up, as I guess that discussion can depend on the conclusions for the first part.

rhshadrach · 2023-05-28T12:26:11Z

Thanks for the correction; I agree with your assessment. Still, I think my concerns from my previous comment remain. In particular:

> I think we should not attempt both routes simultaneously (piecemeal + something like #41112), so we should decide on a route to move forward.

Do you agree with this line here?

topper-123 · 2023-05-28T21:58:00Z

I may not be experienced enough with the groupby and window methods to know, but I think it's worth looking into whether the Series and DataFrame methods can have their undesired behavior deprecated using a normal process, i.e. whether we can avoid a parallel implementation for them.

TBH, this has always been a very complex area and it was only after doing #53362 (i.e. very recently) that I have started thinking it could be possible to do this with a normal deprecation process. I may still have missed something and be proven wrong of course, but that's part of the discussion...

The way I see it, after this PR and #53325, SeriesApply.agg for example will be (shortened a bit):

    def agg(self):
        result = super().agg()
        if result is None:
            obj = self.obj
            f = self.f

            try:
                result = obj.apply(f)
            except (ValueError, AttributeError, TypeError):
                result = f(self.obj)
            else:
                msg = (
                    f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                    f"has been deprecated. Use {type(obj).__name__}.transform to "
                    f"keep behavior unchanged."
                )
                warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
            return result

The above just deprecates calling obj.apply(f), and we are actually not guaranteed that f(obj) returns an aggregated value. For example, with f = lambda x: np.sqrt(x), agg will return transformed values, not an aggregate.
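To make the distinction concrete, the snippet below calls such an `f` directly on a Series, which is what the `f(self.obj)` branch does. It is only an illustration of the two result shapes, not a statement about what `Series.agg` returns in any particular pandas version.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])

f = lambda x: np.sqrt(x)

transformed = f(ser)    # same length as the input: a transform
aggregated = ser.sum()  # a single value: an aggregate

print(type(transformed).__name__, len(transformed))
print(aggregated)
```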

But if we change the above to:

    def agg(self):
        # bind these up front: the warning message below needs them even
        # when super().agg() already produced a result
        obj = self.obj
        f = self.f

        result = super().agg()
        if result is None:
            try:
                result = obj.apply(f)
            except (ValueError, AttributeError, TypeError):
                result = f(obj)

        # aside: how do we know something is an aggregate value?
        if not self._is_aggregate_value(result):
            msg = (
                f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                f"has been deprecated. Use {type(obj).__name__}.transform to "
                f"keep behavior unchanged."
            )
            warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

        return result

then returning a non-aggregate value will emit a warning. So the question is whether the above is backward compatible and whether, if users fix their code to not emit warnings, their code will be compatible with v3.0.

In pandas v3.0 the method will become:

    def agg(self):
        # bind these up front: the warning message below uses them
        obj = self.obj
        f = self.f

        result = super().agg()
        if result is None:
            result = f(obj)

        if not self._is_aggregate_value(result):
            msg = (
                f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
                f"has been deprecated. Use {type(obj).__name__}.transform to "
                f"keep behavior unchanged."
            )
            warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())

        return result
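How `_is_aggregate_value` would decide is left open in the snippets above. One naive possibility, shown purely as an assumption, is to treat scalars as aggregates using `pandas.api.types.is_scalar`; the helper name `looks_like_aggregate` is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_scalar


def looks_like_aggregate(result) -> bool:
    """Hypothetical stand-in for the `_is_aggregate_value` check."""
    return is_scalar(result)


ser = pd.Series([1, 2, 3])

print(looks_like_aggregate(ser.sum()))  # reduction to one value
print(looks_like_aggregate(ser + 1))    # element-wise result
```

A check like this would misclassify a function that deliberately returns a list or tuple as its aggregated value, so it is at best a starting point for discussion.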

The same can be done with Series.transform, where we also need to check that the result is a transformed result and not an aggregation or something else. Doing it with Series.apply requires an array_ops_only parameter, however.

My suspicion is that if the above can be done on Series.(agg|transform|apply), then DataFrame.(agg|transform|apply) will also fall into place easily. I haven't looked into (Groupby|Resample|Window).(agg|transform|apply) enough to have a clear opinion, but maybe everything will fall into place like a puzzle there also, if the Series/DataFrame methods can be deprecated correctly. And if not, the new Series/DataFrame methods can be used in new versions of the (Groupby|Resample|Window) methods, easing their implementations?

rhshadrach · 2023-05-29T12:47:29Z

> TBH, this has always been a very complex area and it was only after doing #53362 (i.e. very recently) that I have started thinking it could be possible to do this with a normal deprecation process.

Indeed, you've made progress where I didn't think it was possible.

For agg, I think pandas should not take a stance on what an aggregated value is but rather treat the return from the UDF as if it were a scalar (even when you would typically think it's not). But we can discuss this at a later point.

> I haven't looked into (Groupby|Resample|Window).(agg|transform|apply) enough to have a clear opinion, but maybe everything will fall into place like a puzzle there also, if the Series/DataFrame methods can be deprecated correctly.

For UDFs (as opposed to string aliases), these implementations are largely independent of that in Series/DataFrame.

> And if not, the new Series/DataFrame methods can be used in new versions of the (Groupby|Resample|Window) methods, easing their implementations?

Agreed! Let's move forward with these and see where we get to.

rhshadrach

Looks good, needs tests.

topper-123 · 2023-06-02T05:13:29Z

I can see it needs tests for dict_likes, do you have anything else in mind?

EDIT: and also by_row = (True|False).

rhshadrach · 2023-06-02T16:09:09Z

Yep - was really just thinking by_row

topper-123 · 2023-06-04T11:17:45Z

I've updated the tests.

rhshadrach

Tests look good!

rhshadrach · 2023-06-04T14:08:35Z

doc/source/whatsnew/v2.1.0.rst

@@ -101,6 +101,7 @@ Other enhancements
 - :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
 - :meth:`SeriesGroupby.agg` and :meth:`DataFrameGroupby.agg` now support passing in multiple functions for ``engine="numba"`` (:issue:`53486`)
 - Added ``engine_kwargs`` parameter to :meth:`DataFrame.to_excel` (:issue:`53220`)
+- Added a new parameter ``array_ops_only`` to :meth:`Series.apply`. When set to ``True`` the supplied callables will always operate on the whole Series (:issue:`53400`).


by_row now; not array_ops_only.

Yeah, changed.

rhshadrach · 2023-06-04T14:15:20Z

pandas/core/apply.py

        if is_groupby:
            engine = self.kwargs.get("engine", None)
            engine_kwargs = self.kwargs.get("engine_kwargs", None)
-            kwargs = {"engine": engine, "engine_kwargs": engine_kwargs}
+            kwds.update({"engine": engine, "engine_kwargs": engine_kwargs})


NBD, but I wonder why the change from kwargs to kwds? In pandas.core we overwhelmingly use kwargs instead of kwds.

kwargs would make a line further down exceed 88 characters and be reformatted to fill 3 lines. So a stylistic preference, but not a strong opinion.

I think consistency in variable names is more important here.

rhshadrach · 2023-06-04T14:17:53Z

pandas/core/apply.py

@@ -693,8 +717,8 @@ def values(self):
    def apply(self) -> DataFrame | Series:
        """compute the results"""
        # dispatch to agg
-        if is_list_like(self.func):
-            return self.apply_multiple()
+        if is_list_like(self.func) or is_dict_like(self.func):


dicts are considered list-like; no need for the 2nd check here.

Ok, I changed it. I've instead updated the comment above to explain that dict-likes go here too.
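The reviewer's point can be checked directly with the public helpers in `pandas.api.types`. A quick sketch:

```python
from pandas.api.types import is_dict_like, is_list_like

func = {"a": "sum"}

# A dict already passes the list-like check, so testing
# is_dict_like in addition is redundant for dispatching.
dict_is_list_like = is_list_like(func)
dict_is_dict_like = is_dict_like(func)
list_is_dict_like = is_dict_like(["sum"])

print(dict_is_list_like, dict_is_dict_like, list_is_dict_like)
```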

rhshadrach · 2023-06-04T14:18:23Z

pandas/core/apply.py

@@ -1079,8 +1106,8 @@ def apply(self) -> DataFrame | Series:
            return self.apply_empty_result()

        # dispatch to agg
-        if is_list_like(self.func):
-            return self.apply_multiple()
+        if is_list_like(self.func) or is_dict_like(self.func):


Ok, changed.

rhshadrach

Tests look good!

topper-123 · 2023-06-04T14:43:41Z

I've updated the PR.

rhshadrach

lgtm

rhshadrach

lgtm

rhshadrach · 2023-06-04T16:26:36Z

Just out of curiosity - are you seeing each of my reviews get duplicated too?

topper-123 · 2023-06-04T16:33:21Z

> Just out of curiosity - are you seeing each of my reviews get duplicated too?

yes, pretty weird.

rhshadrach · 2023-06-05T20:30:52Z

Thanks @topper-123

* BUG: make Series.agg aggregate when possible * fix doc build * deprecate instead of treating as a bug * CLN: Apply.agg_list_like * some cleanups * REF/CLN: func in core.apply (#53437) * REF/CLN: func in core.apply * Remove type-hint * REF: Decouple Series.apply from Series.agg (#53400) * update test * fix issues * fix issues * fix issues --------- Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>

rhshadrach · 2023-06-08T21:51:23Z

pandas/core/apply.py

+        if op_name == "apply":
+            kwargs = {**kwargs, "by_row": False}


@topper-123: shouldn't by_row here be True for backwards compatibility?

On second thought, I'm thinking this should now be self.by_row when that attribute exists. If a user calls ser.apply(["sum", "mean"], by_row=True) (or with by_row=False), shouldn't we be passing the argument down to the next call to apply?

I think you are right. I'll make a new PR on that.
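For reference, the call shape under discussion looks like this. The sketch uses string aliases only, which resolve to the Series methods of the same name, so the result should not depend on how `by_row` is forwarded; how callables in the list are treated is exactly what the follow-up PR addresses and is not shown here.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# A list of string aliases yields one entry per alias.
result = ser.apply(["sum", "mean"])

print(result)
```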

REF: Decouple Series.apply from Series.agg

75ce829

topper-123 changed the title ~~REF: Decouple apply.apply from apply.agg~~ REF: Decouple Series.apply from Series.agg May 26, 2023

topper-123 requested a review from rhshadrach May 26, 2023 11:56

topper-123 added 2 commits May 26, 2023 13:05

add GH number

52db878

fix docstring

fc26828

mroeschke added the Apply Apply, Aggregate, Transform label May 26, 2023

rhshadrach requested changes May 29, 2023

topper-123 added 3 commits June 1, 2023 18:13

Merge branch 'master' into decouple_Apply.apply_from_Apply.agg

4d0db30

update according to comments

c521691

rename array_ops_only -> by_row

9353f06

rhshadrach requested changes Jun 1, 2023

rename _apply_dict_like -> agg_or_apply_dict_like

e7e3433

topper-123 added 4 commits June 3, 2023 07:36

update tests

755ec07

Merge branch 'master' into decouple_Apply.apply_from_Apply.agg

92e7a9f

add tests

e68c46e

add testr II

9af24b2

rhshadrach requested changes Jun 4, 2023

update according to comments

af0417d

kwds -> kwargs

8564968

rhshadrach approved these changes Jun 4, 2023

topper-123 added the Refactor Internal refactoring of code label Jun 4, 2023

rhshadrach added this to the 2.1 milestone Jun 5, 2023

rhshadrach merged commit d9c3777 into pandas-dev:main Jun 5, 2023
38 checks passed

topper-123 deleted the decouple_Apply.apply_from_Apply.agg branch June 5, 2023 21:27

topper-123 added a commit to topper-123/pandas that referenced this pull request Jun 5, 2023

REF: Decouple Series.apply from Series.agg (pandas-dev#53400)

92d1e68

topper-123 added a commit to topper-123/pandas that referenced this pull request Jun 6, 2023

REF: Decouple Series.apply from Series.agg (pandas-dev#53400)

13ba267

rhshadrach reviewed Jun 8, 2023

This was referenced Jun 10, 2023

fix Series.apply(..., by_row) #53584

Closed

BUG: fix Series.apply(..., by_row), v2. #53601

Merged

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023

REF: Decouple Series.apply from Series.agg (pandas-dev#53400)

d2535c1

rhshadrach mentioned this pull request Jan 5, 2024

DEPR: by_row="compat" in DataFrame.apply and Series.apply #56750

Closed

5 tasks
