BUG: fix Series.apply(..., by_row), v2. #53601

topper-123 · 2023-06-11T11:55:06Z

Fixes #53400 (comment) by making the by_row param accept "compat" as a value and using that internally when apply is given dicts or lists.

The compatibility path is now to deprecate by_row=(True|"compat") at some point, so by_row=False will become the default in v3.0.

Supersedes #53584.

Code example:

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})

# On 2.0.x and main
print(df.apply([lambda x: 1]))
#          a        b
#   <lambda> <lambda>
# 0        1        1
# 1        1        1
# 2        1        1

topper-123 · 2023-06-11T11:56:38Z

CC: @rhshadrach.

rhshadrach · 2023-06-11T22:06:53Z

I don't see how this addresses #53584 (comment). The issue is with DataFrame.apply:

pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])

Users have no argument to specify by_row, so they have no way to adopt the future (3.0) behavior of by_row=False. Am I missing it?

topper-123 · 2023-06-11T22:40:44Z

Puff, I was more focused on the fact that in main we currently have:

>>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
          a  b
<lambda>  1  1

While in v2.0 we had:

>>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
         a        b
  <lambda> <lambda>
0        1        1
1        1        1
2        1        1

But it looks like you're right regarding the deprecation process: if we want to deprecate the current behavior for DataFrame, we will need a by_row parameter in DataFrame.apply.

topper-123 · 2023-06-12T21:21:10Z

I've made a new version, where we can deprecate the old behavior by deprecating by_row=True|"compat" in DataFrame.apply, similarly to how we do it in Series.apply.

pandas/core/frame.py

rhshadrach

I was thinking the removal of by_row=True would also apply in the Series case, is that right? If not, a lot of my requests below are invalid.

pandas/core/apply.py

rhshadrach · 2023-06-25T18:01:03Z

pandas/core/apply.py

+        kwargs,
+    ) -> None:
+        if by_row is not False and by_row != "compat":
+            raise NotImplementedError(f"by_row={by_row} not implemented")


From the docs, I think NotImplementedError signifies the implementation is currently incomplete, and that users can expect this to be supported once we "get around to it". Can this be a ValueError instead?

I missed this comment somehow. I've changed it now.
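The distinction behind this request can be sketched as follows. This is only an illustration, not the code that was merged: `validate_by_row` is a hypothetical helper, the error message is made up, and the accepted values are taken from the diff quoted above.

```python
def validate_by_row(by_row):
    """Hypothetical helper: reject by_row values that will never be supported."""
    # ValueError signals "this argument is wrong". NotImplementedError would
    # instead suggest the value is planned but simply not available yet.
    if by_row is not False and by_row != "compat":
        raise ValueError(f"by_row={by_row!r} not allowed")
    return by_row


validate_by_row(False)     # accepted
validate_by_row("compat")  # accepted

try:
    validate_by_row(True)  # rejected in this sketch
except ValueError as err:
    print(err)
```

Callers that catch `ValueError` for bad arguments would then handle an unsupported `by_row` the same way as any other invalid input.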

rhshadrach · 2023-06-25T18:19:12Z

pandas/core/frame.py

+        by_row : False or "compat", default "compat"
+            If "compat", will if possible first translate the func into pandas
+            methods (e.g. ``Series().apply(np.sum)`` will be translated to
+            ``Series().sum()``). If that doesn't work, will try call to apply again with
+            ``by_row=True`` and if that fails, will call apply again with
+            ``by_row=False``
+            If False, the funcs will be passed the whole Series at once.
+            ``by_row`` only has effect when ``func`` is a listlike or dictlike of funcs
+            and the func isn't a string.
+            ``by_row=True`` has not been implemented, and will raise a
+            ``NotImplementedError``.


I think it'd be good to have the callout on this only applying to list/dict-likes at the beginning, and adding in that this is compatible with previous versions. What do you think about being more vague about the compat behavior instead of trying to detail it out? Something like

by_row : False or "compat", default "compat"
    Only has an effect when ``func`` is a listlike or dictlike of funcs
    that aren't NumPy functions (e.g. ``np.sum``) or string-aliases for
    operations (e.g. ``"sum"``). "compat" is backwards compatible with
    previous versions and will sometimes operate by row and sometimes
    operate on the whole Series at once. If False, the funcs will be
    passed the whole Series at once.

I'm also okay with keeping the more detailed description of compat if you prefer.

My preference is for the other version, if that's ok. I changed it a bit though.
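The fallback order described in the detailed docstring (try by row first, then the whole Series) can be illustrated with a toy model. This is a sketch under stated assumptions, not pandas code: plain Python lists stand in for a Series, all three function names are hypothetical, the first step of translating funcs such as `np.sum` into pandas methods is left out, and the set of exceptions that trigger the fallback is an assumption.

```python
def apply_elementwise(values, func):
    """Toy stand-in for by-row behavior: call func once per element."""
    return [func(v) for v in values]


def apply_whole(values, func):
    """Toy stand-in for by_row=False: call func once on the whole container."""
    return func(values)


def apply_compat(values, func):
    """Toy stand-in for the compat path: try by row, fall back to the whole."""
    try:
        return apply_elementwise(values, func)
    except (TypeError, ValueError, AttributeError):
        # Assumed fallback condition: func cannot handle a single element.
        return apply_whole(values, func)


data = [1, 2, 3]
print(apply_compat(data, lambda x: x + 1))  # elementwise succeeds: [2, 3, 4]
print(apply_compat(data, len))              # len(1) fails, so falls back: 3
print(apply_whole(data, sum))               # always the whole container: 6
```

The sketch shows why "compat" is hard to document precisely: whether a func runs per element or on the whole object depends on whether the per-element call happens to raise.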

topper-123 · 2023-06-26T11:03:17Z

I was thinking the removal of by_row=True would also apply in the Series case, is that right? If not, a lot of my requests below are invalid.

Unfortunately not. For Series.apply we have:

>>> ser = pd.Series([1, np.nan, 2])
>>> ser.apply(np.sum, by_row=True)
0    1.0
1    NaN
2    2.0
dtype: float64
>>> ser.apply(np.sum, by_row="compat")
3.0

For Series.apply, by_row="compat" ends up calling SeriesApply.apply_compat when given a dict- or listlike. For example:

>>> ser.apply(np.sum)  # by_row=True is implicit here
0    1.0
1    NaN
2    2.0
dtype: float64
>>> ser.apply([np.sum])  # for each func in the list, call ser.apply(np.sum, by_row="compat")
sum    3.0
dtype: float64

These two cases can't be combined into one parameter option, so I don't see another way forward here myself.

I'll look into your detailed questions later today.

rhshadrach · 2023-06-26T14:45:53Z

In pandas 3.0, we will just have the by_row=False behavior, right? Why does the user need to be able to specify by_row=True?

topper-123 · 2023-06-27T08:22:41Z

I've changed the by_row parameter value names to better suit your proposals:

True in Series.apply has been changed to "compat"
"compat" in Series.apply has been changed to "_compat"
False is unchanged

This means that for both Series and DataFrame by_row="compat" now means "do it the same way as in v2.0", by_row=False means pass the whole Series to func in all cases, while by_row="_compat" in Series.apply is internal and should not be called by end users.

I think this is better?

pandas/core/series.py

rhshadrach

Ahh, I see now why you need the third state "_compat". We're now calling apply instead of agg from agg_or_apply_list_like when op_name is apply. Agreed this naming is better.

topper-123 · 2023-06-28T15:16:58Z

I've updated. Yeah, it's needed for backward compatibility, unfortunately. After by_row="compat" has been removed in v3.0, this will become a whole lot simpler.

rhshadrach

lgtm, just the two open requests from earlier on.

topper-123 · 2023-06-29T03:43:18Z

In pandas 3.0, we will just have the by_row=False behavior, right?

Yes, though that parameter will be deprecated and its default value changed to lib.no_default in v3.0 in preparation for its removal in v4.0.
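The sentinel-default pattern mentioned here can be sketched roughly as below. This is an assumption-laden illustration rather than the planned pandas change: `_NoDefault`, `no_default`, and `toy_apply` are hypothetical local names (pandas has its own `lib.no_default`), the warning text is made up, and a plain list stands in for a Series.

```python
import warnings


class _NoDefault:
    """Hypothetical sentinel type meaning "the caller did not pass this argument"."""

    def __repr__(self):
        return "<no_default>"


no_default = _NoDefault()


def toy_apply(values, func, by_row=no_default):
    """Toy model: always pass the whole container, warn if by_row is given."""
    if by_row is not no_default:
        # Only callers that still pass the parameter explicitly see the warning.
        warnings.warn(
            "the by_row parameter is deprecated and will be removed",
            FutureWarning,
            stacklevel=2,
        )
    return func(values)


print(toy_apply([1, 2, 3], sum))  # no warning: by_row was not passed

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_apply([1, 2, 3], sum, by_row=False)
    print(caught[0].category.__name__)  # FutureWarning
```

A sentinel is used instead of `False` as the default so that an explicit `by_row=False` can be told apart from the argument being omitted.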

rhshadrach

lgtm

rhshadrach · 2023-06-29T21:33:21Z

Thanks @topper-123!

topper-123 · 2023-06-29T23:29:27Z

Nice. I'll scan to see if there are other changes needed for Series/DataFrame.apply/agg, but if not, this concludes the work for these methods, until the deprecation is activated. Series/DataFrame.transform still need some work to get in line with the others, though.

I have not intended to work on the groupby methods, because you're taking care of those, right?

rhshadrach · 2023-06-30T02:01:01Z

I have not intended to work on the groupby methods, because you're taking care of those, right?

Yes - my next step is to refactor groupby methods to by-and-large not share code with core.apply. Then after this is done, evaluate what we can reasonably deprecate for 3.0.

fix Series.apply(..., by_row), v2.

5b7247a

topper-123 mentioned this pull request Jun 11, 2023

fix Series.apply(..., by_row) #53584

Closed

topper-123 added 2 commits June 11, 2023 13:04

add gh number

94dbb32

fix codespell

c8aafbb

add by_row para to DataFrame.apply

84f4284

mroeschke added the Apply Apply, Aggregate, Transform label Jun 12, 2023

rhshadrach reviewed Jun 13, 2023

topper-123 added 4 commits June 24, 2023 06:25

Merge branch 'master' into fix_series_Apply_by_row_II

a3ebc43

remove compat=True option

3c6aa02

remove compat=True option, cleanup

b5ef347

remove compat=True option, cleanup II

a70637b

rhshadrach requested changes Jun 25, 2023

Merge branch 'master' into fix_series_Apply_by_row_II

748f6cd

replace by_row=True with by_row='compat'

6f7f127

rhshadrach reviewed Jun 28, 2023

rhshadrach reviewed Jun 28, 2023

remove '_compat' from public interface

417e958

rhshadrach requested changes Jun 29, 2023

Merge branch 'master' into fix_series_Apply_by_row_II

4ad9ba9

topper-123 added 2 commits June 29, 2023 04:46

update according to comments

b40665c

linting

881889c

rhshadrach added this to the 2.1 milestone Jun 29, 2023

rhshadrach changed the title ~~fix Series.apply(..., by_row), v2.~~ BUG: fix Series.apply(..., by_row), v2. Jun 29, 2023

rhshadrach added the Bug label Jun 29, 2023

rhshadrach approved these changes Jun 29, 2023

rhshadrach merged commit 9a9fcf6 into pandas-dev:main Jun 29, 2023
34 checks passed

topper-123 deleted the fix_series_Apply_by_row_II branch June 29, 2023 22:04

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023

BUG: fix Series.apply(..., by_row), v2. (pandas-dev#53601)

e6e7ef2
